
OpenAI Realtime API

OpenAI Realtime API is the leading speech-to-speech API enabling low-latency, production-ready voice agents with native interruption handling, MCP support, SIP telephony, and the GPT-5-class GPT-Realtime-2 model.

Key takeaways

  • GPT-Realtime-2 (May 2026) brings GPT-5-class reasoning to speech-to-speech, scoring 96.6% on Big Bench Audio with a 128K context window and adjustable reasoning effort
  • Single-model architecture, streamed over a WebSocket connection, removes the hand-off latency of ASR→LLM→TTS pipelines and preserves speech nuance
  • Production features include MCP server support, image inputs, SIP telephony, async function calling, plus new companion models for live translation and streaming transcription

FAQ

What is OpenAI Realtime API?

OpenAI Realtime API is a speech-to-speech API that processes and generates audio directly through a single model, enabling low-latency voice agents with natural interruption handling and tool calling.

How much does OpenAI Realtime API cost?

GPT-Realtime-2 costs $32 per 1M audio input tokens and $64 per 1M audio output tokens, with cached audio input at $0.40 per 1M tokens. That works out to roughly $0.06/min of input and $0.24/min of output, so a balanced conversation costs approximately $0.15-0.20/minute before caching.

What voices are available?

The Realtime API launched GA with 10 voices, including two exclusives (Cedar and Marin), and OpenAI has since added five more voices alongside prompt-caching price cuts.

Executive Summary

OpenAI Realtime API is OpenAI's production-ready speech-to-speech API, enabling developers to build low-latency voice agents that can see, hear, and speak in real time. Unlike traditional pipelines that chain ASR→LLM→TTS, the Realtime API processes audio directly through a single model, reducing latency and preserving nuance in speech. In May 2026, OpenAI shipped GPT-Realtime-2, the first voice model with GPT-5-class reasoning in the audio pipeline, alongside dedicated live-translation and streaming-transcription models.

Attribute | Value
Company | OpenAI
Launched | October 2024 (beta), August 2025 (GA)
Model | gpt-realtime-2 (plus gpt-realtime-translate, gpt-realtime-whisper)
Connection | WebSocket
Status | Generally Available

Product Overview

The Realtime API was first introduced in public beta in October 2024 and has since been used by thousands of developers to build production voice agents. The August 2025 GA release introduced the gpt-realtime model, which showed significant improvements in instruction following, function calling, and natural speech quality.

The API is optimized for real-world tasks like customer support, personal assistance, and education—trained in collaboration with customers to excel at production voice agent use cases.

May 2026: The GPT-Realtime-2 Generation

On May 7, 2026, OpenAI released three new realtime models:

Model | What it does | Pricing
gpt-realtime-2 | Speech-to-speech with GPT-5-class reasoning, 128K context (up from 32K), adjustable reasoning effort (minimal → xhigh) | $32/$64 per 1M audio tokens
gpt-realtime-translate | Live speech translation: 70+ input languages → 13 output languages, keeps pace with the speaker | $0.034/minute
gpt-realtime-whisper | Streaming speech-to-text, transcribes live as the speaker talks | $0.017/minute

GPT-Realtime-2 adds preambles (the agent says "one moment" before starting a long task) and parallel tool calls with audio feedback ("checking your calendar now"), so the conversation keeps moving while the model reasons or calls tools. The model lineage ran gpt-realtime (GA, Aug 2025) → gpt-realtime-1.5 (early 2026) → gpt-realtime-2 (May 2026). OpenAI has also added five new voices and cut effective costs via prompt caching on audio input.
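The preamble-plus-parallel-tools pattern described above can be illustrated with a small client-side analogy. This is only a sketch of the conversational flow, not the Realtime API itself: in GPT-Realtime-2 the model produces the preamble and issues the tool calls. The tool functions, the `speak` callback, and the spoken phrases are all hypothetical stand-ins.

```python
import asyncio


async def check_calendar() -> str:
    """Hypothetical slow tool; the sleep stands in for a network call."""
    await asyncio.sleep(0.05)
    return "calendar: free at 3pm"


async def lookup_weather() -> str:
    """Second hypothetical tool, started alongside the first."""
    await asyncio.sleep(0.05)
    return "weather: sunny"


async def handle_turn(speak) -> list:
    # Preamble: say something right away so the caller is not left in silence.
    speak("One moment while I check that.")
    # Parallel tool calls: both coroutines run concurrently, and gather()
    # returns their results in the order the calls were passed in.
    results = await asyncio.gather(check_calendar(), lookup_weather())
    speak("Here is what I found.")
    return list(results)


def run_demo():
    """Run one turn, collecting the spoken lines in a list instead of playing audio."""
    transcript = []
    results = asyncio.run(handle_turn(transcript.append))
    return transcript, results
```

Calling `run_demo()` returns the two spoken lines in order, plus both tool results. Because the tools run concurrently, the turn takes about as long as the slowest tool rather than the sum of both.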

Key Capabilities

Capability | Description
Speech-to-Speech | Single model processes audio input and generates audio output
Natural Interruptions | Handles barge-in and turn-taking natively
MCP Server Support | Connect remote MCP servers for tool access
Image Inputs | Add images/screenshots to conversations
SIP Telephony | Direct integration with phone networks
Async Functions | Continue conversations during long-running function calls

Voices

Voice | Description | Availability
Marin | Optimized for natural speech | Realtime API exclusive
Cedar | Optimized for natural speech | Realtime API exclusive
8 Existing | Updated for improved quality | All APIs
5 New (2026) | Added post-GA for more variety | Realtime API

Technical Architecture

The Realtime API uses a WebSocket connection for bidirectional streaming of audio and events. This architecture enables true real-time interaction without the latency penalties of request-response patterns.

┌─────────────────────────────────────────────┐
│              Client Application             │
├─────────────────────────────────────────────┤
│         WebSocket Connection                │
│    ┌───────────┐      ┌────────────┐        │
│    │Audio Input│ ←──→ │Audio Output│        │
│    └───────────┘      └────────────┘        │
├─────────────────────────────────────────────┤
│          gpt-realtime-2 Model               │
│  ┌─────────────────────────────────────┐   │
│  │ Speech Understanding + Generation   │   │
│  │ + Tool Calling + MCP Integration    │   │
│  └─────────────────────────────────────┘   │
└─────────────────────────────────────────────┘
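To make the bidirectional event stream in the diagram more concrete, here is a minimal sketch of how a client might serialize two outgoing messages. It builds the JSON only and does not open a connection. The endpoint form and the event names (`input_audio_buffer.append`, `response.create`) reflect my understanding of the Realtime API's client events and should be treated as assumptions to verify against OpenAI's API reference. The helper function names are hypothetical.

```python
import base64
import json

# Assumed endpoint form, with the model name taken from this article.
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"


def audio_append_event(pcm_chunk: bytes) -> str:
    """Wrap a chunk of raw audio bytes in an append event (assumed schema)."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        # Binary audio is base64-encoded so it can travel inside a JSON text frame.
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })


def response_create_event() -> str:
    """Ask the model to start generating a response (assumed schema)."""
    return json.dumps({"type": "response.create"})
```

A real client would send these strings as text frames over the WebSocket and read server events from the same connection. The reconnection logic, audio capture, and playback that this sketch omits are where most of the client complexity noted under Cautions comes from.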

Performance Benchmarks

Benchmark | gpt-realtime-2 | gpt-realtime-1.5
Big Bench Audio (reasoning, high effort) | 96.6% | 81.4%
Audio MultiChallenge (instruction following, xhigh effort) | 48.5% | 34.7%
Zillow adversarial call-success | 95% | 69%

GPT-Realtime-2's 96.6% on Big Bench Audio is a 15.2-point jump over its predecessor and closes most of the historical reasoning gap between speech-to-speech models and text pipelines. For reference, the original gpt-realtime GA model scored 82.8% on Big Bench Audio and 66.5% on ComplexFuncBench function calling.
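The point gains can be checked directly from the benchmark table. The short sketch below recomputes them from the scores quoted in this article. The dictionary layout and function name are illustrative choices only.

```python
# (gpt-realtime-2, gpt-realtime-1.5) scores in percent, as quoted in this article.
SCORES = {
    "Big Bench Audio": (96.6, 81.4),
    "Audio MultiChallenge": (48.5, 34.7),
    "Zillow adversarial call-success": (95.0, 69.0),
}


def point_gains(scores: dict) -> dict:
    """Percentage-point improvement of the newer model on each benchmark."""
    # Round to one decimal to hide floating-point noise in the subtraction.
    return {name: round(new - old, 1) for name, (new, old) in scores.items()}
```

This gives gains of 15.2 points on Big Bench Audio, 13.8 on Audio MultiChallenge, and 26.0 on the Zillow eval. These are percentage points, not relative improvements.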


Strengths

  • GPT-5-class reasoning in audio — GPT-Realtime-2 reasons natively in the voice pipeline with adjustable effort, scoring 96.6% on Big Bench Audio
  • Single-model architecture — Eliminates ASR→LLM→TTS pipeline latency; preserves speech nuance and emotion
  • 128K context window — Quadrupled from 32K, maintains context across long conversations
  • Production-ready — GA status with thousands of production deployments; battle-tested at scale
  • MCP support — Native integration with Model Context Protocol for tool access
  • SIP telephony — Direct integration with phone networks, PBX systems, and desk phones
  • Fluid tool use — Preambles and parallel tool calls with audio feedback keep conversations moving during long-running work
  • Image inputs — Ground conversations in visual context (screenshots, photos)
  • Companion models — gpt-realtime-translate (70+ → 13 languages live) and gpt-realtime-whisper (streaming STT) cover adjacent voice workloads

Cautions

  • Premium pricing — ~$0.15-0.20/minute before caching is expensive compared to traditional STT+LLM+TTS pipelines
  • Unpredictable costs — Token-based billing means long or contentious conversations burn far more than short FAQ exchanges; press coverage noted that a "lengthy, emotionally charged conversation where a user repeatedly insults the AI could rack up significant token consumption" at $64/1M output tokens
  • OpenAI lock-in — No self-hosting option; dependent on OpenAI infrastructure and policies
  • Limited voices — ~15 voices total; fewer options than ElevenLabs' 10,000+ library
  • No video input — GPT-Realtime-2 supports image inputs only, unlike Gemini Live which also accepts video
  • WebSocket complexity — Requires more sophisticated client implementation than REST APIs
  • No emotion detection — Unlike Hume AI, doesn't analyze emotional cues in speech

Pricing & Licensing

Component (gpt-realtime-2) | Price
Audio Input | $32/1M tokens (~$0.06/min)
Audio Output | $64/1M tokens (~$0.24/min)
Cached Audio Input | $0.40/1M tokens
Text Input / Output | $4 / $24 per 1M tokens
Image Input | $5/1M tokens
gpt-realtime-translate | $0.034/minute
gpt-realtime-whisper | $0.017/minute

All gpt-realtime-2 rates per OpenAI's developer pricing docs as of June 2026.

Effective cost: A balanced conversation (50% input, 50% output) costs approximately $0.15-0.20 per minute before caching. Prompt caching at $0.40/1M cached audio input tokens (an 80x discount versus uncached input) materially reduces cost for agents with long, stable system prompts.
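The effective-cost arithmetic above can be reproduced with a short sketch. It uses only the per-minute and per-token figures quoted in this article. The simplifying assumption is that each minute of a call is split between user audio (billed as input) and agent audio (billed as output), ignoring text tokens, images, and any re-billing of accumulated context. The function names are illustrative.

```python
# Figures quoted in this article for gpt-realtime-2.
AUDIO_INPUT_PER_MIN = 0.06         # USD per minute of input audio
AUDIO_OUTPUT_PER_MIN = 0.24        # USD per minute of output audio
AUDIO_INPUT_PER_1M = 32.00         # USD per 1M uncached audio input tokens
CACHED_AUDIO_INPUT_PER_1M = 0.40   # USD per 1M cached audio input tokens


def blended_cost_per_minute(input_share: float) -> float:
    """Cost of one call minute where `input_share` of it is user audio."""
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    output_share = 1.0 - input_share
    return input_share * AUDIO_INPUT_PER_MIN + output_share * AUDIO_OUTPUT_PER_MIN


def cache_discount_factor() -> float:
    """How many times cheaper cached audio input is than uncached input."""
    return AUDIO_INPUT_PER_1M / CACHED_AUDIO_INPUT_PER_1M
```

Under these assumptions a 50/50 split costs about $0.15 per minute, and a call where the agent speaks 75% of the time costs about $0.195, which brackets the $0.15-0.20 range. The discount factor comes out to 80.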


What Developers Say

Hacker News reaction to the GPT-Realtime-2 generation has been positive on capability, with cost and modality caveats:

"gpt is super good, seemingly more robust. Gemini connects faster." — keizo, Hacker News, May 2026

"GPT-Realtime-2 is the frontend. It has a tool to start a (Claude Code/Codex) task, and a tool to check the progress" — zdql, Hacker News, May 2026, describing a voice-driven coding-agent orchestrator

"GPT-Realtime-2 only supports image inputs (unlike Gemini Live which also supports video inputs)." — Show HN author, May 2026

On the skeptical side, launch coverage framed the cost model bluntly — BigGo Finance headlined it "A Voice Model That Thinks, Talks, and Costs a Fortune If You Insult It," warning that token-based billing makes long, heated conversations expensive at $64/1M output tokens. Sam Altman, for his part, claimed younger users "prefer voice interaction, especially when dumping large amounts of background information in one go."


Competitive Positioning

Direct Competitors

Competitor | Differentiation
AWS Nova Sonic | Nova Sonic offers Bedrock integration and lower pricing; Realtime API has better instruction following and MCP support
ElevenLabs | ElevenLabs has 10K+ voices and turn-taking detection; Realtime API has stronger reasoning and function calling
Vapi | Vapi orchestrates multiple providers including Realtime API; OpenAI offers direct access without middleware
LiveKit Agents | LiveKit is open-source and provider-agnostic; Realtime API is single-vendor but more integrated
Gemini Live | Gemini Live connects faster and accepts video input; GPT-Realtime-2 is more robust with stronger reasoning and tool calling

When to Choose OpenAI Realtime API

  • Choose Realtime API when: You need the best instruction following and function calling accuracy, want MCP integration, or are already invested in the OpenAI ecosystem
  • Choose AWS Nova Sonic when: You need Bedrock integration or lower pricing
  • Choose ElevenLabs when: Voice quality and variety are paramount
  • Choose LiveKit when: You want open-source flexibility and provider choice

Ideal Customer Profile

Best fit:

  • Teams building production voice agents requiring high accuracy
  • Applications needing tool calling and MCP integration
  • Enterprises already using OpenAI APIs
  • Use cases requiring SIP telephony integration
  • Customer support, personal assistants, educational applications

Poor fit:

  • Cost-sensitive applications with high call volumes
  • Teams requiring voice cloning or extensive voice variety
  • Organizations needing self-hosted deployment
  • Use cases requiring emotion detection/analysis

Viability Assessment

Factor | Assessment
Financial Health | Strong: OpenAI is well-capitalized with enterprise adoption
Market Position | Leader: First major lab to ship production speech-to-speech API
Innovation Pace | Rapid: Three model generations in nine months (gpt-realtime → 1.5 → 2), plus translate/whisper companions
Ecosystem | Extensive: Large developer community, SDKs, documentation
Long-term Outlook | Positive: Core product for OpenAI's agent strategy

As of June 2026, the Realtime API powers thousands of production voice agents, with enterprise benchmarks like the Zillow adversarial call-success eval (95% for GPT-Realtime-2, up from 69%) signaling real customer-facing deployment. Developers are already shipping GPT-Realtime-2 voice frontends for everything from site navigation to coding-agent orchestration.


Bottom Line

OpenAI Realtime API is the most capable speech-to-speech API available, and the gap widened with GPT-Realtime-2: 96.6% on Big Bench Audio reasoning, 48.5% on Audio MultiChallenge instruction following, a 128K context window, and adjustable reasoning effort — the first speech-to-speech model with GPT-5-class reasoning. Combined with MCP support, SIP telephony, image inputs, preambles, parallel tool calls, and dedicated translate/whisper companion models, it's a complete platform for production voice agents.

The trade-offs are premium pricing (~$0.15-0.20/minute before caching, with unpredictable costs on long calls) and vendor lock-in. For teams prioritizing accuracy and capabilities over cost, it's the clear leader. For cost-sensitive applications or those requiring self-hosting, alternatives like LiveKit Agents or AWS Nova Sonic may be better fits.

Recommended for: Teams building production voice agents requiring high accuracy, tool calling, and enterprise integrations.

Not recommended for: Cost-sensitive high-volume applications, teams needing voice variety or video input, or organizations requiring self-hosted deployment.

Outlook: OpenAI is iterating fastest in this category — three model generations in nine months and a clear push to make voice a first-class agent interface. Expect continued price-structure refinement (caching discounts) rather than headline rate cuts, and tighter coupling with OpenAI's broader agent stack.


Research by Ry Walker Research