OpenAI Realtime API

OpenAI Realtime API is the leading speech-to-speech API enabling low-latency, production-ready voice agents with native interruption handling, MCP support, and SIP telephony.

Key takeaways

  • First major speech-to-speech API from a leading AI lab, now GA with the gpt-realtime model, which scores 82.8% on Big Bench Audio (reasoning)
  • Single-model architecture over a WebSocket connection avoids the latency of chained ASR→LLM→TTS pipelines and preserves speech nuance
  • Production features include MCP server support, image inputs, SIP telephony, and async function calling

FAQ

What is OpenAI Realtime API?

OpenAI Realtime API is a speech-to-speech API that processes and generates audio directly through a single model, enabling low-latency voice agents with natural interruption handling and tool calling.

How much does OpenAI Realtime API cost?

gpt-realtime costs $32/1M audio input tokens and $64/1M audio output tokens. In per-minute terms, the estimate is roughly $0.06/min for input and $0.24/min for output, which puts a balanced conversation at approximately $0.15-0.20/minute.

What voices are available?

Ten voices: two new Realtime API exclusives (Cedar and Marin) plus eight existing voices updated for more natural-sounding speech.

Executive Summary

OpenAI Realtime API is the production-ready speech-to-speech API from OpenAI, enabling developers to build low-latency voice agents that can see, hear, and speak in real time. Unlike traditional pipelines that chain ASR→LLM→TTS, the Realtime API processes audio directly through a single model, reducing latency and preserving nuance in speech.

Attribute | Value
Company | OpenAI
Launched | October 2024 (beta), 2025 (GA)
Model | gpt-realtime
Connection | WebSocket
Status | Generally Available

Product Overview

The Realtime API was first introduced in public beta in October 2024 and has since been used by thousands of developers to build production voice agents. The GA release includes the new gpt-realtime model, which shows significant improvements in instruction following, function calling, and natural speech quality.

The gpt-realtime model is optimized for real-world tasks like customer support, personal assistance, and education. It was trained in collaboration with customers to perform well in production voice agent use cases.

Key Capabilities

Capability | Description
Speech-to-Speech | Single model processes audio input and generates audio output
Natural Interruptions | Handles barge-in and turn-taking natively
MCP Server Support | Connect remote MCP servers for tool access
Image Inputs | Add images/screenshots to conversations
SIP Telephony | Direct integration with phone networks
Async Functions | Continue conversations during long-running function calls
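
The "Async Functions" row can be illustrated with a small, API-independent sketch. It only shows the general idea of keeping a conversation moving while a slow tool call is still running. The names slow_lookup and converse are hypothetical, the sleeps stand in for real work, and nothing here calls the Realtime API itself.

```python
import asyncio


async def slow_lookup(order_id: str) -> str:
    """Hypothetical long-running tool; the sleep stands in for a slow backend."""
    await asyncio.sleep(0.05)
    return f"order {order_id}: shipped"


async def converse(turns: list[str]) -> list[str]:
    """Keep answering user turns while a tool call is still in flight."""
    log: list[str] = []
    # Start the tool call in the background instead of awaiting it inline.
    pending = asyncio.create_task(slow_lookup("A123"))
    for turn in turns:
        log.append(f"agent replies to: {turn}")
        await asyncio.sleep(0.01)  # stand-in for time spent speaking
    # Fold the tool result back into the conversation once it is ready.
    log.append(f"tool result: {await pending}")
    return log


transcript = asyncio.run(converse(["hi", "any update?"]))
print("\n".join(transcript))
```

The design point is that the tool call is scheduled as a task rather than awaited where it is issued, so the turns that follow are not blocked by it.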

Voices

Voice | Description | Availability
Marin | New, optimized for natural speech | Realtime API exclusive
Cedar | New, optimized for natural speech | Realtime API exclusive
8 Existing | Updated for improved quality | All APIs

Technical Architecture

The Realtime API uses a WebSocket connection for bidirectional streaming of audio and events. This architecture enables true real-time interaction without the latency penalties of request-response patterns.

┌─────────────────────────────────────────────┐
│              Client Application             │
├─────────────────────────────────────────────┤
│         WebSocket Connection                │
│    ┌───────────┐      ┌────────────┐        │
│    │Audio Input│ ←──→ │Audio Output│        │
│    └───────────┘      └────────────┘        │
├─────────────────────────────────────────────┤
│           gpt-realtime Model                │
│  ┌─────────────────────────────────────┐    │
│  │ Speech Understanding + Generation   │    │
│  │ + Tool Calling + MCP Integration    │    │
│  └─────────────────────────────────────┘    │
└─────────────────────────────────────────────┘
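
As a rough sketch of what "bidirectional streaming of audio and events" can look like on the wire, the snippet below frames outgoing audio as JSON events and decodes incoming audio deltas. It opens no connection. The event name input_audio_buffer.append, the audio and delta fields, and the "audio.delta" suffix reflect my reading of the public API reference and should be treated as assumptions to verify. The 4800-byte chunk size (about 100 ms of 24 kHz 16-bit mono PCM) is also an assumption.

```python
import base64
import json


def audio_append_events(pcm: bytes, chunk_bytes: int = 4800) -> list[str]:
    """Split raw audio into JSON text frames, one base64 chunk per event."""
    events = []
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        events.append(json.dumps({
            "type": "input_audio_buffer.append",  # assumed event name
            "audio": base64.b64encode(chunk).decode("ascii"),
        }))
    return events


def handle_server_event(raw: str, audio_out: bytearray) -> str:
    """Decode one server frame and collect audio bytes from delta events."""
    event = json.loads(raw)
    kind = event.get("type", "")
    if kind.endswith("audio.delta"):  # assumed suffix for audio delta events
        audio_out.extend(base64.b64decode(event["delta"]))
    return kind


frames = audio_append_events(bytes(10_000))
print(len(frames), "frames ready to send")
```

In a real client, each string in frames would be sent as a text message over the WebSocket, and handle_server_event would be called on every message received.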

Performance Benchmarks

Benchmark | gpt-realtime | Previous Model
Big Bench Audio (reasoning) | 82.8% | 65.6%
MultiChallenge (instruction following) | 30.5% | 20.6%
ComplexFuncBench (function calling) | 66.5% | 49.7%
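
To make the size of these gains easier to compare, the short sketch below computes the absolute gain (percentage points) and the relative gain for each benchmark. The scores are copied from the table; the helper name improvement is only illustrative.

```python
# Scores copied from the benchmark table (percent): (gpt-realtime, previous model).
scores = {
    "Big Bench Audio": (82.8, 65.6),
    "MultiChallenge": (30.5, 20.6),
    "ComplexFuncBench": (66.5, 49.7),
}


def improvement(new: float, old: float) -> tuple[float, float]:
    """Return (gain in percentage points, relative gain in percent)."""
    return round(new - old, 1), round((new - old) / old * 100, 1)


gains = {name: improvement(new, old) for name, (new, old) in scores.items()}
for name, (points, relative) in gains.items():
    print(f"{name}: +{points} points ({relative}% relative)")
```

MultiChallenge shows the smallest gain in points but the largest relative gain, because the previous model's starting score was low.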

Strengths

  • Single-model architecture: Eliminates ASR→LLM→TTS pipeline latency; preserves speech nuance and emotion
  • Production-ready: GA status; used by thousands of developers since the October 2024 public beta
  • MCP support: Native integration with Model Context Protocol for tool access
  • SIP telephony: Direct integration with phone networks, PBX systems, and desk phones
  • Instruction following: 30.5% on the MultiChallenge benchmark, up from 20.6% for the previous model; follows fine-grained voice instructions
  • Function calling: 66.5% on ComplexFuncBench, up from 49.7% for the previous model; async function calls don't interrupt conversation flow
  • Image inputs: Ground conversations in visual context (screenshots, photos)

Cautions

  • Premium pricing: ~$0.15-0.20/minute is expensive compared to chained ASR→LLM→TTS pipelines
  • OpenAI lock-in: No self-hosting option; dependent on OpenAI infrastructure and policies
  • Limited voices: 10 voices total; fewer options than ElevenLabs' 10,000+ library
  • Token-based billing: Pricing is based on audio tokens rather than simple per-minute rates, which makes costs harder to predict
  • WebSocket complexity: Requires a more sophisticated client implementation than REST APIs
  • No emotion detection: Unlike Hume AI, doesn't analyze emotional cues in speech

Pricing & Licensing

Component | Price
Audio Input | $32/1M tokens (~$0.06/min)
Audio Output | $64/1M tokens (~$0.24/min)
Cached Input | $0.40/1M tokens
Text Input | Standard GPT-4o rates
Text Output | Standard GPT-4o rates

Effective cost: A balanced conversation (50% input, 50% output) costs approximately $0.15-0.20 per minute.

Note: gpt-realtime pricing is 20% lower than the previous gpt-4o-realtime-preview model.
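
The effective-cost estimate can be reproduced with a few lines of arithmetic. This sketch takes the per-minute figures quoted in the pricing table at face value and treats "50% input, 50% output" as the share of each conversation minute spent on user versus model audio. It ignores text tokens and cached input, and the 10,000-minute volume is a hypothetical example.

```python
# Per-minute figures as quoted in the pricing table (USD per minute of audio).
INPUT_PER_MIN = 0.06
OUTPUT_PER_MIN = 0.24


def blended_cost_per_minute(output_share: float) -> float:
    """Cost of one conversation minute when `output_share` of it is model speech."""
    if not 0.0 <= output_share <= 1.0:
        raise ValueError("output_share must be between 0 and 1")
    return (1.0 - output_share) * INPUT_PER_MIN + output_share * OUTPUT_PER_MIN


balanced = blended_cost_per_minute(0.5)
monthly = balanced * 10_000  # hypothetical volume: 10,000 conversation minutes
print(f"balanced: ${balanced:.2f}/min, 10k minutes: ${monthly:,.0f}")
```

Under these assumptions a 50/50 split lands at the low end of the quoted range, and conversations where the model speaks more move toward the high end.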


Competitive Positioning

Direct Competitors

Competitor | Differentiation
AWS Nova Sonic | Nova Sonic offers Bedrock integration and lower pricing; Realtime API has better instruction following and MCP support
ElevenLabs | ElevenLabs has 10K+ voices and turn-taking detection; Realtime API has stronger reasoning and function calling
Vapi | Vapi orchestrates multiple providers including Realtime API; OpenAI offers direct access without middleware
LiveKit Agents | LiveKit is open-source and provider-agnostic; Realtime API is single-vendor but more integrated

When to Choose OpenAI Realtime API

  • Choose Realtime API when: You need the best instruction following and function calling accuracy, want MCP integration, or are already invested in the OpenAI ecosystem
  • Choose AWS Nova Sonic when: You need Bedrock integration or lower pricing
  • Choose ElevenLabs when: Voice quality and variety are paramount
  • Choose LiveKit when: You want open-source flexibility and provider choice

Ideal Customer Profile

Best fit:

  • Teams building production voice agents requiring high accuracy
  • Applications needing tool calling and MCP integration
  • Enterprises already using OpenAI APIs
  • Use cases requiring SIP telephony integration
  • Customer support, personal assistants, educational applications

Poor fit:

  • Cost-sensitive applications with high call volumes
  • Teams requiring voice cloning or extensive voice variety
  • Organizations needing self-hosted deployment
  • Use cases requiring emotion detection/analysis

Viability Assessment

Factor | Assessment
Financial Health | Strong: OpenAI is well-capitalized with enterprise adoption
Market Position | Leader: First major lab to ship a production speech-to-speech API
Innovation Pace | Rapid: Regular model updates, new features (MCP, SIP, images)
Ecosystem | Extensive: Large developer community, SDKs, documentation
Long-term Outlook | Positive: Core product for OpenAI's agent strategy

Bottom Line

OpenAI Realtime API is the most capable speech-to-speech API available, with best-in-class instruction following (30.5% on MultiChallenge) and function calling (66.5% on ComplexFuncBench). The gpt-realtime model represents a significant improvement over earlier versions, and the addition of MCP support, SIP telephony, and image inputs makes it a complete platform for production voice agents.

The trade-offs are premium pricing (~$0.15-0.20/minute) and vendor lock-in. For teams prioritizing accuracy and capabilities over cost, it's the clear leader. For cost-sensitive applications or those requiring self-hosting, alternatives like LiveKit Agents or AWS Nova Sonic may be better fits.

Recommended for: Teams building production voice agents requiring high accuracy, tool calling, and enterprise integrations.

Not recommended for: Cost-sensitive high-volume applications, teams needing voice variety, or organizations requiring self-hosted deployment.


Research by Ry Walker Research