Key takeaways
- First major speech-to-speech API from a leading AI lab, now GA with the gpt-realtime model scoring 82.8% on the Big Bench Audio reasoning benchmark
- Single-model architecture over a native WebSocket connection removes ASR→LLM→TTS pipeline latency and preserves speech nuance
- Production features include MCP server support, image inputs, SIP telephony, and async function calling
FAQ
What is OpenAI Realtime API?
OpenAI Realtime API is a speech-to-speech API that processes and generates audio directly through a single model, enabling low-latency voice agents with natural interruption handling and tool calling.
How much does OpenAI Realtime API cost?
gpt-realtime costs $32/1M audio input tokens and $64/1M audio output tokens, or roughly $0.06/min of input and $0.24/min of output; a balanced conversation runs approximately $0.15-0.20/minute.
What voices are available?
Ten voices: the two new Realtime API-exclusive voices, Cedar and Marin, plus eight updated existing voices optimized for natural-sounding speech.
Executive Summary
OpenAI Realtime API is the production-ready speech-to-speech API from OpenAI, enabling developers to build low-latency voice agents that can see, hear, and speak in realtime. Unlike traditional pipelines that chain ASR→LLM→TTS, the Realtime API processes audio directly through a single model, reducing latency and preserving nuance in speech.
| Attribute | Value |
|---|---|
| Company | OpenAI |
| Launched | October 2024 (beta), 2025 (GA) |
| Model | gpt-realtime |
| Connection | WebSocket |
| Status | Generally Available |
Product Overview
The Realtime API was first introduced in public beta in October 2024 and has since been used by thousands of developers to build production voice agents. The GA release includes the new gpt-realtime model, which shows significant improvements in instruction following, function calling, and natural speech quality.
The API is optimized for real-world tasks like customer support, personal assistance, and education—trained in collaboration with customers to excel at production voice agent use cases.
Key Capabilities
| Capability | Description |
|---|---|
| Speech-to-Speech | Single model processes audio input and generates audio output |
| Natural Interruptions | Handles barge-in and turn-taking natively |
| MCP Server Support | Connect remote MCP servers for tool access |
| Image Inputs | Add images/screenshots to conversations |
| SIP Telephony | Direct integration with phone networks |
| Async Functions | Continue conversations during long-running function calls |
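The async-functions row above deserves a concrete sketch: the idea is that a long-running tool call is dispatched as a background task so the event loop keeps processing audio events, and the result is injected into the conversation once it resolves. The following is an illustrative asyncio pattern with simulated events, not the actual Realtime event protocol; the function and event names are invented for the demonstration.

```python
import asyncio

# Illustrative pattern only: a slow tool runs as a background task while
# the (simulated) audio event stream keeps flowing, mirroring how async
# function calls avoid stalling a Realtime conversation.
log: list[str] = []

async def slow_tool(order_id: str) -> str:
    await asyncio.sleep(0.05)          # stand-in for a slow API lookup
    return f"order {order_id}: shipped"

async def stream_audio_events(n: int) -> None:
    for i in range(n):
        await asyncio.sleep(0.01)      # audio deltas keep arriving
        log.append(f"audio.delta {i}")

async def main() -> None:
    # Kick off the tool call without awaiting it inline.
    tool_task = asyncio.create_task(slow_tool("A123"))
    await stream_audio_events(3)       # conversation continues meanwhile
    log.append(await tool_task)        # inject the result once it resolves

asyncio.run(main())
```

The key move is `asyncio.create_task` instead of an inline `await`: the conversation loop never blocks on the tool, which is what lets the model keep talking during a long lookup.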
Voices
| Voice | Description | Availability |
|---|---|---|
| Marin | New, optimized for natural speech | Realtime API exclusive |
| Cedar | New, optimized for natural speech | Realtime API exclusive |
| 8 Existing | Updated for improved quality | All APIs |
Technical Architecture
The Realtime API uses a WebSocket connection for bidirectional streaming of audio and events. This architecture enables true real-time interaction without the latency penalties of request-response patterns.
┌─────────────────────────────────────────────┐
│             Client Application              │
├─────────────────────────────────────────────┤
│            WebSocket Connection             │
│  ┌───────────┐         ┌────────────┐       │
│  │Audio Input│ ←────→  │Audio Output│       │
│  └───────────┘         └────────────┘       │
├─────────────────────────────────────────────┤
│             gpt-realtime Model              │
│  ┌─────────────────────────────────────┐    │
│  │  Speech Understanding + Generation  │    │
│  │  + Tool Calling + MCP Integration   │    │
│  └─────────────────────────────────────┘    │
└─────────────────────────────────────────────┘
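The handshake can be sketched as follows: the client opens the WebSocket, then sends a `session.update` event to configure voice, audio format, and turn detection before streaming audio. Event and field names here follow OpenAI's published Realtime examples but should be treated as assumptions and verified against the current API reference; the `build_session_update` helper is our own.

```python
import json

# Hypothetical sketch of configuring a Realtime session over WebSocket.
# The endpoint and event shapes follow OpenAI's published docs at the
# time of writing; verify field names against the current API reference.
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime"

def build_session_update(instructions: str, voice: str = "marin") -> dict:
    """Construct the session.update event sent right after connecting."""
    return {
        "type": "session.update",
        "session": {
            "instructions": instructions,
            "voice": voice,
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm16",
            # Server-side voice activity detection handles barge-in.
            "turn_detection": {"type": "server_vad"},
        },
    }

event = build_session_update("You are a concise support agent.")
payload = json.dumps(event)  # what the client would send over the socket
```

From there, the client streams microphone audio up and plays response audio deltas as they arrive; any WebSocket client library (e.g., `websockets` in Python) can carry the connection, with the API key passed as a bearer token.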
Performance Benchmarks
| Benchmark | gpt-realtime | Previous Model |
|---|---|---|
| Big Bench Audio (reasoning) | 82.8% | 65.6% |
| MultiChallenge (instruction following) | 30.5% | 20.6% |
| ComplexFuncBench (function calling) | 66.5% | 49.7% |
Strengths
- Single-model architecture — Eliminates ASR→LLM→TTS pipeline latency; preserves speech nuance and emotion
- Production-ready — GA status with thousands of production deployments; battle-tested at scale
- MCP support — Native integration with Model Context Protocol for tool access
- SIP telephony — Direct integration with phone networks, PBX systems, and desk phones
- Instruction following — 30.5% accuracy on MultiChallenge benchmark; follows fine-grained voice instructions
- Function calling — 66.5% accuracy on ComplexFuncBench; async function calls don't interrupt conversation flow
- Image inputs — Ground conversations in visual context (screenshots, photos)
Cautions
- Premium pricing — ~$0.15-0.20/minute is expensive compared to traditional STT+LLM+TTS pipelines
- OpenAI lock-in — No self-hosting option; dependent on OpenAI infrastructure and policies
- Limited voices — 10 voices total; fewer options than ElevenLabs' 10,000+ library
- Token-based billing — Complex pricing model based on audio tokens, not simple per-minute rates
- WebSocket complexity — Requires more sophisticated client implementation than REST APIs
- No emotion detection — Unlike Hume AI, doesn't analyze emotional cues in speech
Pricing & Licensing
| Component | Price |
|---|---|
| Audio Input | $32/1M tokens (~$0.06/min) |
| Audio Output | $64/1M tokens (~$0.24/min) |
| Cached Input | $0.40/1M tokens |
| Text Input | Standard GPT-4o rates |
| Text Output | Standard GPT-4o rates |
Effective cost: A balanced conversation (50% input, 50% output) costs approximately $0.15-0.20 per minute.
Note: gpt-realtime pricing is 20% lower than the previous gpt-4o-realtime-preview model.
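The effective-cost figure can be reproduced with simple arithmetic. This sketch uses the per-minute approximations quoted above and ignores cached-input and text-token charges; actual billing is per audio token, so treat these rates as estimates.

```python
# Back-of-envelope cost model using the per-minute approximations above
# ($0.06/min audio in, $0.24/min audio out). Billing is actually per
# audio token, so this is an estimate, not an invoice.
INPUT_PER_MIN = 0.06
OUTPUT_PER_MIN = 0.24

def blended_cost_per_min(input_share: float) -> float:
    """Cost of one wall-clock minute given the fraction spent on user speech."""
    return input_share * INPUT_PER_MIN + (1 - input_share) * OUTPUT_PER_MIN

balanced = blended_cost_per_min(0.5)   # 50/50 conversation -> $0.15/min
ten_min_call = 10 * balanced           # cost of a 10-minute call
```

A 50/50 split gives $0.15/min; conversations where the model speaks more of the time push the blend toward the $0.20 end of the range (e.g., 75% output time blends to about $0.195/min).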
Competitive Positioning
Direct Competitors
| Competitor | Differentiation |
|---|---|
| AWS Nova Sonic | Nova Sonic offers Bedrock integration and lower pricing; Realtime API has better instruction following and MCP support |
| ElevenLabs | ElevenLabs has 10K+ voices and turn-taking detection; Realtime API has stronger reasoning and function calling |
| Vapi | Vapi orchestrates multiple providers including Realtime API; OpenAI offers direct access without middleware |
| LiveKit Agents | LiveKit is open-source and provider-agnostic; Realtime API is single-vendor but more integrated |
When to Choose OpenAI Realtime API
- Choose Realtime API when: You need the best instruction following and function calling accuracy, want MCP integration, or are already invested in OpenAI ecosystem
- Choose AWS Nova Sonic when: You need Bedrock integration or lower pricing
- Choose ElevenLabs when: Voice quality and variety are paramount
- Choose LiveKit when: You want open-source flexibility and provider choice
Ideal Customer Profile
Best fit:
- Teams building production voice agents requiring high accuracy
- Applications needing tool calling and MCP integration
- Enterprises already using OpenAI APIs
- Use cases requiring SIP telephony integration
- Customer support, personal assistants, educational applications
Poor fit:
- Cost-sensitive applications with high call volumes
- Teams requiring voice cloning or extensive voice variety
- Organizations needing self-hosted deployment
- Use cases requiring emotion detection/analysis
Viability Assessment
| Factor | Assessment |
|---|---|
| Financial Health | Strong — OpenAI is well-capitalized with enterprise adoption |
| Market Position | Leader — First major lab to ship production speech-to-speech API |
| Innovation Pace | Rapid — Regular model updates, new features (MCP, SIP, images) |
| Ecosystem | Extensive — Large developer community, SDKs, documentation |
| Long-term Outlook | Positive — Core product for OpenAI's agent strategy |
Bottom Line
OpenAI Realtime API is the most capable speech-to-speech API available, with best-in-class instruction following (30.5% on MultiChallenge) and function calling (66.5% on ComplexFuncBench). The gpt-realtime model represents a significant improvement over earlier versions, and the addition of MCP support, SIP telephony, and image inputs makes it a complete platform for production voice agents.
The trade-offs are premium pricing (~$0.15-0.20/minute) and vendor lock-in. For teams prioritizing accuracy and capabilities over cost, it's the clear leader. For cost-sensitive applications or those requiring self-hosting, alternatives like LiveKit Agents or AWS Nova Sonic may be better fits.
Recommended for: Teams building production voice agents requiring high accuracy, tool calling, and enterprise integrations.
Not recommended for: Cost-sensitive high-volume applications, teams needing voice variety, or organizations requiring self-hosted deployment.
Research by Ry Walker Research