Key takeaways
- GPT-Realtime-2 (May 2026) brings GPT-5-class reasoning to speech-to-speech: it scores 96.6% on Big Bench Audio and adds a 128K context window and adjustable reasoning effort
- A single speech-to-speech model, streamed over WebSocket, avoids the latency of chained ASR→LLM→TTS pipelines and preserves speech nuance
- Production features include MCP server support, image inputs, SIP telephony, async function calling, plus new companion models for live translation and streaming transcription
FAQ
What is the OpenAI Realtime API?
The OpenAI Realtime API is a speech-to-speech API that processes and generates audio directly through a single model, enabling low-latency voice agents with natural interruption handling and tool calling.
How much does the OpenAI Realtime API cost?
GPT-Realtime-2 costs $32 per 1M audio input tokens and $64 per 1M audio output tokens, with cached audio input at $0.40 per 1M tokens. That works out to roughly $0.06/min of input and $0.24/min of output, so a balanced conversation costs approximately $0.15-0.20/minute before caching.
What voices are available?
The Realtime API launched GA with 10 voices, including two exclusives (Cedar and Marin), and OpenAI has since added five more voices alongside prompt-caching price cuts.
Executive Summary
The OpenAI Realtime API is OpenAI's production-ready speech-to-speech API, enabling developers to build low-latency voice agents that can see, hear, and speak in real time. Unlike traditional pipelines that chain ASR→LLM→TTS, the Realtime API processes audio directly through a single model, reducing latency and preserving nuance in speech. In May 2026, OpenAI shipped GPT-Realtime-2, the first voice model with GPT-5-class reasoning in the audio pipeline, alongside dedicated live-translation and streaming-transcription models.
| Attribute | Value |
|---|---|
| Company | OpenAI |
| Launched | October 2024 (beta), August 2025 (GA) |
| Model | gpt-realtime-2 (plus gpt-realtime-translate, gpt-realtime-whisper) |
| Connection | WebSocket |
| Status | Generally Available |
Product Overview
The Realtime API was first introduced in public beta in October 2024 and has since been used by thousands of developers to build production voice agents. The August 2025 GA release introduced the gpt-realtime model, which showed significant improvements in instruction following, function calling, and natural speech quality.
The API is optimized for real-world tasks such as customer support, personal assistance, and education, and was trained in collaboration with customers to excel at production voice agent use cases.
May 2026: The GPT-Realtime-2 Generation
On May 7, 2026, OpenAI released three new realtime models:
| Model | What it does | Pricing |
|---|---|---|
| gpt-realtime-2 | Speech-to-speech with GPT-5-class reasoning, 128K context (up from 32K), adjustable reasoning effort (minimal → xhigh) | $32/$64 per 1M audio tokens |
| gpt-realtime-translate | Live speech translation: 70+ input languages → 13 output languages, keeps pace with the speaker | $0.034/minute |
| gpt-realtime-whisper | Streaming speech-to-text, transcribes live as the speaker talks | $0.017/minute |
GPT-Realtime-2 adds preambles (the agent says "one moment" before starting a long task) and parallel tool calls with audio feedback ("checking your calendar now"), so the conversation keeps moving while the model reasons or calls tools. The model lineage ran gpt-realtime (GA, Aug 2025) → gpt-realtime-1.5 (early 2026) → gpt-realtime-2 (May 2026). OpenAI has also added five new voices and cut effective costs via prompt caching on audio input.
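The preamble-plus-parallel-tools behaviour can be pictured with a small asyncio sketch. This is an illustration of the pattern only, not OpenAI's implementation, and it makes no API calls: `speak`, `check_calendar`, and `lookup_weather` are hypothetical stand-ins for audio output and real tools.

```python
import asyncio

spoken: list[str] = []  # stand-in for audio sent back to the caller


async def speak(text: str) -> None:
    """Hypothetical helper: record what the agent would say aloud."""
    spoken.append(text)


async def check_calendar(day: str) -> str:
    """Hypothetical tool: pretend to query a calendar service."""
    await asyncio.sleep(0.05)
    return f"{day}: 2 meetings"


async def lookup_weather(city: str) -> str:
    """Hypothetical tool: pretend to query a weather service."""
    await asyncio.sleep(0.05)
    return f"{city}: sunny"


async def handle_turn() -> list[str]:
    # Preamble: acknowledge the request before the slow work starts.
    await speak("One moment, checking your calendar and the weather now.")
    # Parallel tool calls: both run concurrently rather than back to back.
    results = await asyncio.gather(
        check_calendar("Friday"),
        lookup_weather("Austin"),
    )
    await speak("Here is what I found: " + "; ".join(results))
    return list(results)


results = asyncio.run(handle_turn())
```

Because `asyncio.gather` returns results in the order the calls were passed in, the agent can read them back in a predictable order even when the tools finish at different times.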
Key Capabilities
| Capability | Description |
|---|---|
| Speech-to-Speech | Single model processes audio input and generates audio output |
| Natural Interruptions | Handles barge-in and turn-taking natively |
| MCP Server Support | Connect remote MCP servers for tool access |
| Image Inputs | Add images/screenshots to conversations |
| SIP Telephony | Direct integration with phone networks |
| Async Functions | Continue conversations during long-running function calls |
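As a rough illustration of the async-function row above, the sketch below starts a slow tool call in the background and keeps answering short turns until the result arrives. It is a generic asyncio pattern under stated assumptions, not the Realtime API's own mechanism; `slow_refund_lookup` and the order ID are hypothetical.

```python
import asyncio


async def slow_refund_lookup(order_id: str) -> str:
    """Hypothetical long-running tool call."""
    await asyncio.sleep(0.2)
    return f"refund for {order_id} approved"


async def conversation() -> list[str]:
    transcript: list[str] = []
    # Start the tool call in the background instead of awaiting it inline.
    pending = asyncio.create_task(slow_refund_lookup("A-1001"))
    # The agent keeps answering short turns while the call is in flight.
    for question in ["Are you still there?", "What are your hours?"]:
        await asyncio.sleep(0.01)
        transcript.append(f"answered: {question}")
    # When the result arrives, it is folded back into the conversation.
    transcript.append(f"tool result: {await pending}")
    return transcript


transcript = asyncio.run(conversation())
```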
Voices
| Voice | Description | Availability |
|---|---|---|
| Marin | Optimized for natural speech | Realtime API exclusive |
| Cedar | Optimized for natural speech | Realtime API exclusive |
| 8 Existing | Updated for improved quality | All APIs |
| 5 New (2026) | Added post-GA for more variety | Realtime API |
Technical Architecture
The Realtime API uses a WebSocket connection for bidirectional streaming of audio and events. Because both sides can send at any time over one persistent connection, audio streams in and out continuously instead of waiting on a request-response round trip for each turn.
┌─────────────────────────────────────────────┐
│             Client Application              │
├─────────────────────────────────────────────┤
│            WebSocket Connection             │
│    ┌─────────────┐      ┌──────────────┐    │
│    │ Audio Input │ ←──→ │ Audio Output │    │
│    └─────────────┘      └──────────────┘    │
├─────────────────────────────────────────────┤
│            gpt-realtime-2 Model             │
│  ┌───────────────────────────────────────┐  │
│  │   Speech Understanding + Generation   │  │
│  │   + Tool Calling + MCP Integration    │  │
│  └───────────────────────────────────────┘  │
└─────────────────────────────────────────────┘
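To make the event-based design concrete, here is a minimal sketch of the JSON a client might send once the socket is open. It only builds the payloads and does not connect. The event names `session.update` and `input_audio_buffer.append` and the endpoint shape follow OpenAI's published Realtime documentation as best understood here; the exact session fields vary between API versions, and the MCP server label and URL are placeholders, so treat all of it as an assumption to check against the current reference.

```python
import base64
import json

# Assumption: endpoint shape from OpenAI's Realtime docs; the model name is
# the one this article discusses.
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"


def session_update(instructions: str, tools: list[dict]) -> str:
    """Build a `session.update` client event as a JSON string."""
    return json.dumps({
        "type": "session.update",
        "session": {"instructions": instructions, "tools": tools},
    })


def audio_append(pcm_chunk: bytes) -> str:
    """Build an `input_audio_buffer.append` event; audio is base64-encoded."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })


# Hypothetical remote MCP server entry; the label and URL are placeholders.
mcp_tool = {
    "type": "mcp",
    "server_label": "example",
    "server_url": "https://mcp.example.com",
}

first_event = session_update("You are a concise support agent.", [mcp_tool])
audio_event = audio_append(b"\x00\x01" * 160)  # 160 dummy 16-bit samples
```

In a real client these strings would be sent as text frames over the WebSocket, with server events (audio deltas, tool calls, errors) arriving on the same connection.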
Performance Benchmarks
| Benchmark | gpt-realtime-2 | gpt-realtime-1.5 |
|---|---|---|
| Big Bench Audio (reasoning, high effort) | 96.6% | 81.4% |
| Audio MultiChallenge (instruction following, xhigh effort) | 48.5% | 34.7% |
| Zillow adversarial call-success | 95% | 69% |
GPT-Realtime-2's 96.6% on Big Bench Audio is a 15.2-point jump over its predecessor and closes most of the historical reasoning gap between speech-to-speech models and text pipelines. For reference, the original gpt-realtime GA model scored 82.8% on Big Bench Audio and 66.5% on ComplexFuncBench function calling.
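The "15.2-point jump" is a difference in percentage points, which is not the same as a relative improvement. The short sketch below recomputes both from the table's figures; nothing here is new data, only arithmetic on the scores already quoted.

```python
# (gpt-realtime-2, gpt-realtime-1.5) scores from the benchmark table, in percent.
scores = {
    "Big Bench Audio": (96.6, 81.4),
    "Audio MultiChallenge": (48.5, 34.7),
    "Zillow adversarial call-success": (95.0, 69.0),
}

# Absolute gain in percentage points.
point_gain = {name: round(new - old, 1) for name, (new, old) in scores.items()}

# Relative gain over the predecessor's score, in percent.
relative_gain = {
    name: round((new - old) / old * 100, 1) for name, (new, old) in scores.items()
}
```

On these numbers the Big Bench Audio gain is 15.2 points, or about 18.7% relative to the predecessor's score.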
Strengths
- GPT-5-class reasoning in audio — GPT-Realtime-2 reasons natively in the voice pipeline with adjustable effort, scoring 96.6% on Big Bench Audio
- Single-model architecture — Eliminates ASR→LLM→TTS pipeline latency; preserves speech nuance and emotion
- 128K context window — Quadrupled from 32K, maintains context across long conversations
- Production-ready — GA status with thousands of production deployments; battle-tested at scale
- MCP support — Native integration with Model Context Protocol for tool access
- SIP telephony — Direct integration with phone networks, PBX systems, and desk phones
- Fluid tool use — Preambles and parallel tool calls with audio feedback keep conversations moving during long-running work
- Image inputs — Ground conversations in visual context (screenshots, photos)
- Companion models — gpt-realtime-translate (70+ → 13 languages live) and gpt-realtime-whisper (streaming STT) cover adjacent voice workloads
Cautions
- Premium pricing — ~$0.15-0.20/minute before caching is expensive compared to traditional STT+LLM+TTS pipelines
- Unpredictable costs — Token-based billing means long or contentious conversations burn far more than short FAQ exchanges; press coverage noted that a "lengthy, emotionally charged conversation where a user repeatedly insults the AI could rack up significant token consumption" at $64/1M output tokens
- OpenAI lock-in — No self-hosting option; dependent on OpenAI infrastructure and policies
- Limited voices — ~15 voices total; fewer options than ElevenLabs' 10,000+ library
- No video input — GPT-Realtime-2 supports image inputs only, unlike Gemini Live which also accepts video
- WebSocket complexity — Requires more sophisticated client implementation than REST APIs
- No emotion detection — Unlike Hume AI, doesn't analyze emotional cues in speech
Pricing & Licensing
| Component (gpt-realtime-2) | Price |
|---|---|
| Audio Input | $32/1M tokens (~$0.06/min) |
| Audio Output | $64/1M tokens (~$0.24/min) |
| Cached Audio Input | $0.40/1M tokens |
| Text Input / Output | $4 / $24 per 1M tokens |
| Image Input | $5/1M tokens |
| gpt-realtime-translate | $0.034/minute |
| gpt-realtime-whisper | $0.017/minute |
All gpt-realtime-2 rates per OpenAI's developer pricing docs as of June 2026.
Effective cost: A balanced conversation (50% input, 50% output) costs approximately $0.15-0.20 per minute before caching. Prompt caching bills cached audio input at $0.40/1M tokens, 1/80 of the $32 uncached rate, which materially reduces cost for agents with long, stable system prompts.
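The arithmetic behind these figures can be sketched as follows. The rates are copied from the pricing table; the article does not state how tokens map to minutes, so the per-minute rates are taken as given rather than derived, and `output_share` and `cached_fraction` are illustrative parameters, not billing terms.

```python
# Rates copied from the pricing table (USD).
INPUT_PER_MIN = 0.06
OUTPUT_PER_MIN = 0.24
INPUT_PER_M_TOKENS = 32.00
CACHED_PER_M_TOKENS = 0.40


def blended_cost_per_minute(output_share: float) -> float:
    """Per-minute cost when `output_share` of the call is model speech."""
    if not 0.0 <= output_share <= 1.0:
        raise ValueError("output_share must be between 0 and 1")
    return (1 - output_share) * INPUT_PER_MIN + output_share * OUTPUT_PER_MIN


def input_cost(tokens: int, cached_fraction: float) -> float:
    """Audio-input cost in USD when part of the input is served from cache."""
    cached = tokens * cached_fraction
    fresh = tokens - cached
    return (fresh * INPUT_PER_M_TOKENS + cached * CACHED_PER_M_TOKENS) / 1_000_000


balanced = blended_cost_per_minute(0.5)       # 50/50 split of input and output
uncached = input_cost(1_000_000, 0.0)         # no cache hits
mostly_cached = input_cost(1_000_000, 0.75)   # 75% of input tokens cached
```

Under these rates a 50/50 split lands at the $0.15 end of the range, and the $0.20 end corresponds to the model speaking roughly three quarters of the time. With 75% of input tokens cached, 1M audio input tokens cost about $8.30 instead of $32.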
What Developers Say
Hacker News reaction to the GPT-Realtime-2 generation has been positive on capability, with cost and modality caveats:
"gpt is super good, seemingly more robust. Gemini connects faster." — keizo, Hacker News, May 2026
"GPT-Realtime-2 is the frontend. It has a tool to start a (Claude Code/Codex) task, and a tool to check the progress" — zdql, Hacker News, May 2026, describing a voice-driven coding-agent orchestrator
"GPT-Realtime-2 only supports image inputs (unlike Gemini Live which also supports video inputs)." — Show HN author, May 2026
On the skeptical side, launch coverage framed the cost model bluntly: BigGo Finance headlined it "A Voice Model That Thinks, Talks, and Costs a Fortune If You Insult It," warning that token-based billing makes long, heated conversations expensive at $64/1M output tokens. On the demand side, Sam Altman claimed younger users "prefer voice interaction, especially when dumping large amounts of background information in one go."
Competitive Positioning
Direct Competitors
| Competitor | Differentiation |
|---|---|
| AWS Nova Sonic | Nova Sonic offers Bedrock integration and lower pricing; Realtime API has better instruction following and MCP support |
| ElevenLabs | ElevenLabs has 10K+ voices and turn-taking detection; Realtime API has stronger reasoning and function calling |
| Vapi | Vapi orchestrates multiple providers including Realtime API; OpenAI offers direct access without middleware |
| LiveKit Agents | LiveKit is open-source and provider-agnostic; Realtime API is single-vendor but more integrated |
| Gemini Live | Gemini Live connects faster and accepts video input; GPT-Realtime-2 is more robust with stronger reasoning and tool calling |
When to Choose OpenAI Realtime API
- Choose Realtime API when: You need the best instruction following and function calling accuracy, want MCP integration, or are already invested in the OpenAI ecosystem
- Choose AWS Nova Sonic when: You need Bedrock integration or lower pricing
- Choose ElevenLabs when: Voice quality and variety are paramount
- Choose LiveKit when: You want open-source flexibility and provider choice
Ideal Customer Profile
Best fit:
- Teams building production voice agents requiring high accuracy
- Applications needing tool calling and MCP integration
- Enterprises already using OpenAI APIs
- Use cases requiring SIP telephony integration
- Customer support, personal assistants, educational applications
Poor fit:
- Cost-sensitive applications with high call volumes
- Teams requiring voice cloning or extensive voice variety
- Organizations needing self-hosted deployment
- Use cases requiring emotion detection/analysis
Viability Assessment
| Factor | Assessment |
|---|---|
| Financial Health | Strong — OpenAI is well-capitalized with enterprise adoption |
| Market Position | Leader — First major lab to ship production speech-to-speech API |
| Innovation Pace | Rapid — Three model generations in nine months (gpt-realtime → 1.5 → 2), plus translate/whisper companions |
| Ecosystem | Extensive — Large developer community, SDKs, documentation |
| Long-term Outlook | Positive — Core product for OpenAI's agent strategy |
As of June 2026, the Realtime API powers thousands of production voice agents, with enterprise benchmarks like the Zillow adversarial call-success eval (95% for GPT-Realtime-2, up from 69%) signaling real customer-facing deployment. Developers are already shipping GPT-Realtime-2 voice frontends for everything from site navigation to coding-agent orchestration.
Bottom Line
The OpenAI Realtime API is the most capable speech-to-speech API available, and the gap widened with GPT-Realtime-2, the first speech-to-speech model with GPT-5-class reasoning: 96.6% on Big Bench Audio reasoning, 48.5% on Audio MultiChallenge instruction following, a 128K context window, and adjustable reasoning effort. Combined with MCP support, SIP telephony, image inputs, preambles, parallel tool calls, and dedicated translate/whisper companion models, it's a complete platform for production voice agents.
The trade-offs are premium pricing (~$0.15-0.20/minute before caching, with unpredictable costs on long calls) and vendor lock-in. For teams prioritizing accuracy and capabilities over cost, it's the clear leader. For cost-sensitive applications or those requiring self-hosting, alternatives like LiveKit Agents or AWS Nova Sonic may be better fits.
Recommended for: Teams building production voice agents requiring high accuracy, tool calling, and enterprise integrations.
Not recommended for: Cost-sensitive high-volume applications, teams needing voice variety or video input, or organizations requiring self-hosted deployment.
Outlook: OpenAI is iterating fastest in this category — three model generations in nine months and a clear push to make voice a first-class agent interface. Expect continued price-structure refinement (caching discounts) rather than headline rate cuts, and tighter coupling with OpenAI's broader agent stack.
Research by Ry Walker Research
Sources
- [1] OpenAI Realtime API Documentation
- [2] Introducing gpt-realtime Blog Post
- [3] OpenAI Pricing Page
- [4] OpenAI Realtime API Beta Announcement
- [5] Advancing Voice Intelligence with New Models in the API (OpenAI, May 2026)
- [6] OpenAI Developer Docs Pricing
- [7] GPT-Realtime-2: A Voice Model with GPT-5-Class Reasoning (DataCamp)
- [8] BigGo Finance: OpenAI Unleashes GPT-Realtime-2
- [9] Hacker News Comments on GPT-Realtime-2
- [10] OpenAI API Changelog