Key takeaways
- First major speech-to-speech API from a leading AI lab, now GA with the gpt-realtime model scoring 82.8% on the Big Bench Audio reasoning benchmark
- Single-model architecture over a native WebSocket connection removes ASR→LLM→TTS pipeline latency and preserves speech nuance
- Production features include MCP server support, image inputs, SIP telephony, and async function calling
FAQ
What is OpenAI Realtime API?
OpenAI Realtime API is a speech-to-speech API that processes and generates audio directly through a single model, enabling low-latency voice agents with natural interruption handling and tool calling.
How much does OpenAI Realtime API cost?
gpt-realtime costs $32/1M audio input tokens and $64/1M audio output tokens, or roughly $0.06/min of input and $0.24/min of output; a balanced conversation runs approximately $0.15-0.20/minute.
What voices are available?
Ten voices: the two new Realtime API-exclusive voices, Cedar and Marin, plus eight updated existing voices optimized for natural-sounding speech.
Executive Summary
OpenAI Realtime API is the production-ready speech-to-speech API from OpenAI, enabling developers to build low-latency voice agents that can see, hear, and speak in realtime. Unlike traditional pipelines that chain ASR→LLM→TTS, the Realtime API processes audio directly through a single model, reducing latency and preserving nuance in speech.
| Attribute | Value |
|---|---|
| Company | OpenAI |
| Launched | October 2024 (beta), 2025 (GA) |
| Model | gpt-realtime |
| Connection | WebSocket |
| Status | Generally Available |
Product Overview
The Realtime API was first introduced in public beta in October 2024 and has since been used by thousands of developers to build production voice agents. The GA release includes the new gpt-realtime model, which shows significant improvements in instruction following, function calling, and natural speech quality.
The API is optimized for real-world tasks like customer support, personal assistance, and education—trained in collaboration with customers to excel at production voice agent use cases.
Key Capabilities
| Capability | Description |
|---|---|
| Speech-to-Speech | Single model processes audio input and generates audio output |
| Natural Interruptions | Handles barge-in and turn-taking natively |
| MCP Server Support | Connect remote MCP servers for tool access |
| Image Inputs | Add images/screenshots to conversations |
| SIP Telephony | Direct integration with phone networks |
| Async Functions | Continue conversations during long-running function calls |
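The async-functions row above deserves a concrete sketch: the idea is that a long-running tool call is dispatched as a background task so the event loop keeps processing audio events, and the result is injected into the conversation once it resolves. The following is an illustrative asyncio pattern with simulated events, not the actual Realtime event protocol; the function and event names are invented for the demonstration.

```python
import asyncio

# Illustrative pattern only: a slow tool runs as a background task while
# the (simulated) audio event stream keeps flowing, mirroring how async
# function calls avoid stalling a Realtime conversation.
log: list[str] = []

async def slow_tool(order_id: str) -> str:
    await asyncio.sleep(0.05)          # stand-in for a slow API lookup
    return f"order {order_id}: shipped"

async def stream_audio_events(n: int) -> None:
    for i in range(n):
        await asyncio.sleep(0.01)      # audio deltas keep arriving
        log.append(f"audio.delta {i}")

async def main() -> None:
    # Kick off the tool call without awaiting it inline.
    tool_task = asyncio.create_task(slow_tool("A123"))
    await stream_audio_events(3)       # conversation continues meanwhile
    log.append(await tool_task)        # inject the result once it resolves

asyncio.run(main())
```

The key move is `asyncio.create_task` instead of an inline `await`: the conversation loop never blocks on the tool, which is what lets the model keep talking during a long lookup.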
Voices
| Voice | Description | Availability |
|---|---|---|
| Marin | New, optimized for natural speech | Realtime API exclusive |
| Cedar | New, optimized for natural speech | Realtime API exclusive |
| 8 Existing | Updated for improved quality | All APIs |
Technical Architecture
The Realtime API uses a WebSocket connection for bidirectional streaming of audio and events. This architecture enables true real-time interaction without the latency penalties of request-response patterns.
┌─────────────────────────────────────────────┐
│             Client Application              │
├─────────────────────────────────────────────┤
│            WebSocket Connection             │
│  ┌───────────┐         ┌────────────┐       │
│  │Audio Input│ ←────→  │Audio Output│       │
│  └───────────┘         └────────────┘       │
├─────────────────────────────────────────────┤
│             gpt-realtime Model              │
│  ┌─────────────────────────────────────┐    │
│  │  Speech Understanding + Generation  │    │
│  │  + Tool Calling + MCP Integration   │    │
│  └─────────────────────────────────────┘    │
└─────────────────────────────────────────────┘
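The handshake can be sketched as follows: the client opens the WebSocket, then sends a `session.update` event to configure voice, audio format, and turn detection before streaming audio. Event and field names here follow OpenAI's published Realtime examples but should be treated as assumptions and verified against the current API reference; the `build_session_update` helper is our own.

```python
import json

# Hypothetical sketch of configuring a Realtime session over WebSocket.
# The endpoint and event shapes follow OpenAI's published docs at the
# time of writing; verify field names against the current API reference.
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime"

def build_session_update(instructions: str, voice: str = "marin") -> dict:
    """Construct the session.update event sent right after connecting."""
    return {
        "type": "session.update",
        "session": {
            "instructions": instructions,
            "voice": voice,
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm16",
            # Server-side voice activity detection handles barge-in.
            "turn_detection": {"type": "server_vad"},
        },
    }

event = build_session_update("You are a concise support agent.")
payload = json.dumps(event)  # what the client would send over the socket
```

From there, the client streams microphone audio up and plays response audio deltas as they arrive; any WebSocket client library (e.g., `websockets` in Python) can carry the connection, with the API key passed as a bearer token.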
Performance Benchmarks
| Benchmark | gpt-realtime | Previous Model |
|---|---|---|
| Big Bench Audio (reasoning) | 82.8% | 65.6% |
| MultiChallenge (instruction following) | 30.5% | 20.6% |
| ComplexFuncBench (function calling) | 66.5% | 49.7% |
Strengths
- Single-model architecture — Eliminates ASR→LLM→TTS pipeline latency; preserves speech nuance and emotion
- Production-ready — GA status with thousands of production deployments; battle-tested at scale
- MCP support — Native integration with Model Context Protocol for tool access
- SIP telephony — Direct integration with phone networks, PBX systems, and desk phones
- Instruction following — 30.5% accuracy on MultiChallenge benchmark; follows fine-grained voice instructions
- Function calling — 66.5% accuracy on ComplexFuncBench; async function calls don't interrupt conversation flow
- Image inputs — Ground conversations in visual context (screenshots, photos)
Cautions
- Premium pricing — ~$0.15-0.20/minute is expensive compared to traditional STT+LLM+TTS pipelines
- OpenAI lock-in — No self-hosting option; dependent on OpenAI infrastructure and policies
- Limited voices — 10 voices total; fewer options than ElevenLabs' 10,000+ library
- Token-based billing — Complex pricing model based on audio tokens, not simple per-minute rates
- WebSocket complexity — Requires more sophisticated client implementation than REST APIs
- No emotion detection — Unlike Hume AI, doesn't analyze emotional cues in speech
Pricing & Licensing
| Component | Price |
|---|---|
| Audio Input | $32/1M tokens (~$0.06/min) |
| Audio Output | $64/1M tokens (~$0.24/min) |
| Cached Input | $0.40/1M tokens |
| Text Input | Standard GPT-4o rates |
| Text Output | Standard GPT-4o rates |
Effective cost: A balanced conversation (50% input, 50% output) costs approximately $0.15-0.20 per minute.
Note: gpt-realtime pricing is 20% lower than the previous gpt-4o-realtime-preview model.
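The effective-cost figure can be reproduced with simple arithmetic. This sketch uses the per-minute approximations quoted above and ignores cached-input and text-token charges; actual billing is per audio token, so treat these rates as estimates.

```python
# Back-of-envelope cost model using the per-minute approximations above
# ($0.06/min audio in, $0.24/min audio out). Billing is actually per
# audio token, so this is an estimate, not an invoice.
INPUT_PER_MIN = 0.06
OUTPUT_PER_MIN = 0.24

def blended_cost_per_min(input_share: float) -> float:
    """Cost of one wall-clock minute given the fraction spent on user speech."""
    return input_share * INPUT_PER_MIN + (1 - input_share) * OUTPUT_PER_MIN

balanced = blended_cost_per_min(0.5)   # 50/50 conversation -> $0.15/min
ten_min_call = 10 * balanced           # cost of a 10-minute call
```

A 50/50 split gives $0.15/min; conversations where the model speaks more of the time push the blend toward the $0.20 end of the range (e.g., 75% output time blends to about $0.195/min).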
Competitive Positioning
Direct Competitors
| Competitor | Differentiation |
|---|---|
| AWS Nova Sonic | Nova Sonic offers Bedrock integration and lower pricing; Realtime API has better instruction following and MCP support |
| ElevenLabs | ElevenLabs has 10K+ voices and turn-taking detection; Realtime API has stronger reasoning and function calling |
| Vapi | Vapi orchestrates multiple providers including Realtime API; OpenAI offers direct access without middleware |
| LiveKit Agents | LiveKit is open-source and provider-agnostic; Realtime API is single-vendor but more integrated |
When to Choose OpenAI Realtime API
- Choose Realtime API when: You need the best instruction following and function calling accuracy, want MCP integration, or are already invested in OpenAI ecosystem
- Choose AWS Nova Sonic when: You need Bedrock integration or lower pricing
- Choose ElevenLabs when: Voice quality and variety are paramount
- Choose LiveKit when: You want open-source flexibility and provider choice
Ideal Customer Profile
Best fit:
- Teams building production voice agents requiring high accuracy
- Applications needing tool calling and MCP integration
- Enterprises already using OpenAI APIs
- Use cases requiring SIP telephony integration
- Customer support, personal assistants, educational applications
Poor fit:
- Cost-sensitive applications with high call volumes
- Teams requiring voice cloning or extensive voice variety
- Organizations needing self-hosted deployment
- Use cases requiring emotion detection/analysis
Viability Assessment
| Factor | Assessment |
|---|---|
| Financial Health | Strong — OpenAI is well-capitalized with enterprise adoption |
| Market Position | Leader — First major lab to ship production speech-to-speech API |
| Innovation Pace | Rapid — Regular model updates, new features (MCP, SIP, images) |
| Ecosystem | Extensive — Large developer community, SDKs, documentation |
| Long-term Outlook | Positive — Core product for OpenAI's agent strategy |
Bottom Line
OpenAI Realtime API is the most capable speech-to-speech API available, with best-in-class instruction following (30.5% on MultiChallenge) and function calling (66.5% on ComplexFuncBench). The gpt-realtime model represents a significant improvement over earlier versions, and the addition of MCP support, SIP telephony, and image inputs makes it a complete platform for production voice agents.
The trade-offs are premium pricing (~$0.15-0.20/minute) and vendor lock-in. For teams prioritizing accuracy and capabilities over cost, it's the clear leader. For cost-sensitive applications or those requiring self-hosting, alternatives like LiveKit Agents or AWS Nova Sonic may be better fits.
Recommended for: Teams building production voice agents requiring high accuracy, tool calling, and enterprise integrations.
Not recommended for: Cost-sensitive high-volume applications, teams needing voice variety, or organizations requiring self-hosted deployment.
Research by Ry Walker Research