OpenAI has introduced three new realtime audio models through its API platform, expanding its push into conversational AI agents and voice-based software interfaces. The release includes GPT-Realtime-2, a new voice model with GPT-5-level reasoning capabilities, GPT-Realtime-Translate for live multilingual speech translation, and GPT-Realtime-Whisper for low-latency streaming transcription.
The company said the models are designed to support a new generation of voice applications capable of reasoning through requests, using external tools during conversations, translating speech live, and handling continuous spoken interaction in real time.
GPT-Realtime-2 is positioned as OpenAI’s most advanced voice interaction model so far. The system supports live conversational workflows where the AI can process interruptions, maintain long context windows, call tools in parallel, and continue conversations naturally while tasks are being completed in the background.
OpenAI expanded the model’s context window from 32,000 to 128,000 tokens and introduced adjustable reasoning levels ranging from minimal to “xhigh,” allowing developers to balance latency against reasoning depth. The company said GPT-Realtime-2 scored 96.6% on the Big Bench Audio Intelligence benchmark, compared with 81.4% for GPT-Realtime-1.5.
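The latency-versus-depth tradeoff described above can be sketched in code. Note that the event shape, the model identifier, and the `reasoning_effort` field below are illustrative assumptions based on this article, not a documented Realtime API schema:

```python
# Hypothetical session configuration sketching how a developer might trade
# latency against reasoning depth. Field names are assumptions, not a
# documented schema.
session_update = {
    "type": "session.update",
    "session": {
        "model": "gpt-realtime-2",       # assumed model identifier
        "reasoning_effort": "minimal",   # assumed range: minimal ... "xhigh"
        "max_context_tokens": 128_000,   # expanded window cited by OpenAI
    },
}

def pick_effort(latency_budget_ms: int) -> str:
    """Toy heuristic: tighter latency budgets get shallower reasoning."""
    if latency_budget_ms < 300:
        return "minimal"
    if latency_budget_ms < 1000:
        return "medium"
    return "xhigh"
```

In practice a voice agent might select a low effort level for conversational filler and escalate for tool-heavy turns.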
The company also introduced GPT-Realtime-Translate, a live speech translation model supporting more than 70 input languages and 13 output languages. OpenAI said the model is designed for customer support, international business communication, events, education, and multilingual voice interfaces where conversations need to continue naturally across languages without noticeable delays.
GPT-Realtime-Whisper, meanwhile, focuses on streaming speech recognition. The model transcribes spoken audio as conversations happen, allowing developers to build live captioning systems, meeting assistants, support tools, and voice-driven enterprise workflows with lower latency.
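A streaming transcription client typically folds incremental delta events into a running transcript as audio arrives. The event shape used here (`{"type": ..., "delta": ...}`) is an assumption for illustration; real payloads would follow the Realtime API's event schema:

```python
def accumulate_transcript(events):
    """Fold streaming delta events into a running transcript string.

    The event format is a hypothetical stand-in for whatever the
    streaming transcription API actually emits.
    """
    parts = []
    for event in events:
        if event.get("type") == "transcript.delta":
            parts.append(event["delta"])
    return "".join(parts)

# Example: two deltas arriving over the stream
events = [
    {"type": "transcript.delta", "delta": "Hello"},
    {"type": "transcript.delta", "delta": " world"},
]
print(accumulate_transcript(events))  # -> "Hello world"
```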
OpenAI said companies including Zillow, Intercom, Priceline, Deutsche Telekom, and Vimeo have already tested the new models in production-oriented voice systems.
“What stood out about GPT-Realtime-2 was the intelligence and tool-calling reliability it brings to complex voice interactions,” said Josh Weisberg, Zillow’s SVP and Head of AI, adding that the model improved call success rates during adversarial testing.
OpenAI Pushes Beyond Text-Based Interfaces
The release reflects OpenAI’s broader strategy of moving AI interaction away from chat windows and toward continuous voice-based systems integrated directly into software products and workflows.
Rather than functioning as simple speech interfaces layered on top of chatbots, the new models are designed to operate as realtime agents capable of reasoning, retrieving information, executing actions, and maintaining conversational continuity simultaneously.
OpenAI described three emerging categories for voice AI systems:
- voice-to-action workflows where agents complete tasks directly from spoken instructions,
- systems-to-voice interfaces where software proactively communicates updates through speech,
- voice-to-voice interactions involving live multilingual translation between users.
The company highlighted examples such as AI travel assistants capable of managing itinerary changes conversationally and multilingual customer service systems that translate discussions in real time while preserving natural speech flow.
OpenAI also emphasized production safeguards around the Realtime API, including active classifiers that can interrupt sessions violating safety policies and support for additional developer-defined guardrails through the Agents SDK.
The models are available immediately through OpenAI’s Realtime API. GPT-Realtime-2 is priced at $32 per million audio input tokens and $64 per million output tokens, while GPT-Realtime-Translate costs $0.034 per minute and GPT-Realtime-Whisper costs $0.017 per minute.
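The pricing above mixes token-based and minute-based billing. A quick sketch of the arithmetic, using only the rates quoted in this article:

```python
def realtime_2_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate GPT-Realtime-2 cost from the quoted rates:
    $32 per 1M audio input tokens, $64 per 1M output tokens."""
    return input_tokens / 1_000_000 * 32 + output_tokens / 1_000_000 * 64

def per_minute_cost(minutes: float, rate_per_minute: float) -> float:
    """Minute-billed models: Translate at $0.034/min, Whisper at $0.017/min."""
    return minutes * rate_per_minute

# A session with 500k input and 100k output tokens:
print(realtime_2_cost(500_000, 100_000))   # -> 22.4
# An hour of streaming transcription:
print(per_minute_cost(60, 0.017))          # ~ $1.02
```

Actual bills would depend on how audio maps to tokens, which the article does not specify.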