Key takeaways
- Every major lab except Anthropic now ships a native speech-to-speech API: Gemini Live went GA at I/O 2026, joining OpenAI Realtime (now gpt-realtime-2) and AWS Nova 2 Sonic
- The money validated the orchestration layer: Vapi raised $50M at ~$500M after winning Amazon Ring over 40 rivals, LiveKit hit a $1B valuation, and ElevenLabs closed $500M at $11B
- The category is consolidating into strata — speech-to-speech models, speech stacks (TTS/STT specialists like Cartesia, Gradium, Rime), and orchestration frameworks (Vapi, Retell, LiveKit, Pipecat)
- Advertised per-minute prices are fiction: real-world all-in costs run 2-3x the headline number on nearly every platform — Vapi $0.13-0.33, Retell $0.13-0.20+, OpenAI $0.15-0.20
FAQ
What's the best voice API for AI agents?
OpenAI Realtime (gpt-realtime-2) for accuracy and tool calling, Gemini Live for Google-stack teams, ElevenLabs Agents for voice quality, Vapi/Retell for phone automation, LiveKit Agents or Pipecat for open-source orchestration.
What does speech-to-speech mean?
Speech-to-speech (S2S) models process audio input and generate audio output directly, eliminating the traditional ASR→LLM→TTS pipeline for lower latency and better nuance preservation.
Which voice APIs support self-hosting?
NVIDIA PersonaPlex (open weights), LiveKit Agents and Pipecat (open-source frameworks), Deepgram (on-premise option), and Rime (on-prem at 120ms) support self-hosted deployment.
What's the cheapest voice API?
AWS Nova 2 Sonic (~$0.015/min) and Gemini Live (~$0.005/min in, $0.018/min out) are the cheapest S2S models. For pipelines, LiveKit or Pipecat cost $0.01/min per agent session plus providers at cost. Budget 2-3x any advertised number for real-world all-in costs.
Executive Summary
The voice AI infrastructure category grew up fast: every major lab except Anthropic now ships a native speech-to-speech API, the orchestration layer got billion-dollar validation, and a stratum of speech-model specialists raised serious capital. This report compares 13 platforms across three layers — S2S models, speech stacks, and orchestration.
Key Findings:
- The lab tier is complete (minus one) — Gemini Live went GA on Vertex AI at I/O 2026, joining OpenAI Realtime (now gpt-realtime-2, 96.6% Big Bench Audio) and AWS Nova 2 Sonic; Anthropic still has no Realtime equivalent[1][2]
- The orchestration layer got its valuations — Vapi raised $50M at ~$500M after Amazon Ring chose it over 40 rivals (1B+ calls handled); LiveKit hit $1B with Tesla and xAI as customers; Pipecat hit v1.0 as the open alternative[3][4][5]
- ElevenLabs became ElevenAgents — renamed its platform, closed $500M at $11B (Sequoia), 2M+ agents created, 70+ languages[6]
- The speech-stack stratum emerged — Cartesia (SSM-based Sonic-3.5, #1 on the Speech Arena leaderboard, ~$191M raised), Gradium (Kyutai/Moshi spinoff, $70M seed), and Rime (100M+ phone conversations/month for Domino's and Wingstop) join Deepgram as the building-block tier[7][8][9]
- Capital efficiency outlier — Retell AI reached profitability on $5.1M raised while shipping an automated QA product (Assure)[10]
- Real costs run 2-3x advertised — independent reviews put Vapi at $0.13-0.33/min and Retell at $0.13-0.20+/min all-in versus $0.05-0.07 headline prices
Strategic Planning Assumptions:
- By 2027, speech-to-speech becomes the default architecture for voice agents — with pipeline orchestration surviving where provider choice and cost control matter
- By 2028, full-duplex (simultaneous listening/speaking) will be table stakes
- The eval/observability layer (Cekura and peers) becomes its own category as deployments mature
Market Definition
Agentic voice APIs enable AI agents to communicate through speech, providing:
- Speech Understanding — Converting spoken input to meaning (beyond transcription)
- Voice Generation — Producing natural-sounding speech output
- Conversation Management — Handling turns, interruptions, and context
- Tool Integration — Connecting voice to actions (function calling, MCP)
Key distinction from traditional telephony: These APIs are designed for AI-native applications, not human-to-human call routing. They optimize for agent intelligence, natural interaction, and programmatic control.
The three strata:
- Speech-to-speech models — OpenAI Realtime, Gemini Live, Nova 2 Sonic, PersonaPlex
- Speech stacks (TTS/STT building blocks) — ElevenLabs, Cartesia, Gradium, Rime, Deepgram Aura
- Orchestration platforms & frameworks — Vapi, Retell AI, LiveKit Agents, Pipecat
Comparison Matrix
| Platform | Layer | Self-Host | Latency | Pricing | Maturity |
|---|---|---|---|---|---|
| OpenAI Realtime | S2S model | — | Native S2S, effort-tunable | $32/$64 per 1M audio tokens (~$0.15-0.20/min) | GA, 3rd model generation |
| Gemini Live | S2S model | — | Real-time WebSocket | $3/$12 per 1M audio (~$0.005 in / $0.018 out per min) | GA on Vertex (I/O 2026) |
| AWS Nova 2 Sonic | S2S model | — | Low (no published ms) | $3/$12 per 1M speech tokens (~$0.015/min) | GA, 4 regions |
| PersonaPlex | S2S model (open) | ✅ | ~70ms speaker-switch | Free weights + GPU costs | Research v1.0, frozen since Jan |
| ElevenLabs Agents | Speech stack + platform | — | Sub-second (vendor) | ~$0.10/min ($0.08 Business) + LLM tokens | GA, $11B, 2M+ agents |
| Cartesia | Speech stack | Partial (on-device Sonic) | 40-90ms TTFB | $4-239/mo credits; Line agents $0.06/min | GA, #1 Speech Arena |
| Gradium | Speech stack | Edge SDK | "Ultra-low" (no published benchmarks) | Credit-based, $0-1,615/mo tiers | ~2 quarters old, production |
| Rime | Speech stack (TTS) | ✅ (120ms on-prem) | 120ms on-prem / ~200ms cloud | $0.03-0.05/1K chars | 100M+ calls/month |
| Deepgram Aura | Speech stack | ✅ | Sub-200ms | $0.030/1K chars; Voice Agent $0.041-0.163/min | GA, $1.3B valuation |
| Vapi | Orchestration | — | 500-800ms measured | $0.05/min + providers at cost ($0.13-0.33 real) | Series B, ~$500M, 1B+ calls |
| Retell AI | Orchestration | — | ~600ms claimed | $0.07+/min advertised ($0.13-0.20+ real) | Profitable on $5.1M |
| LiveKit Agents | Framework | ✅ | WebRTC + semantic turn detection | $0.01/min session + providers; free self-host | v1.5, $1B, powers ChatGPT voice |
| Pipecat | Framework | ✅ | WebRTC via Daily | Free (BSD-2); Cloud $0.01/min | v1.3, 12.7K stars |
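The matrix quotes lab pricing both per million audio tokens and per minute. The arithmetic linking the two can be sketched as below. The function names are illustrative, and the tokens-per-minute rates are backed out of the Gemini Live row's own numbers rather than taken from vendor documentation, so treat them as assumptions.

```python
def cost_per_minute(tokens_per_minute: float, usd_per_million_tokens: float) -> float:
    """Per-minute cost implied by a token rate and a per-1M-token price."""
    return tokens_per_minute * usd_per_million_tokens / 1_000_000


def implied_tokens_per_minute(usd_per_minute: float, usd_per_million_tokens: float) -> float:
    """Invert the formula: the token rate implied by a quoted per-minute price."""
    return usd_per_minute / usd_per_million_tokens * 1_000_000


# Gemini Live row: $3 in / $12 out per 1M audio tokens,
# quoted as ~$0.005 in / ~$0.018 out per minute.
gemini_in_rate = implied_tokens_per_minute(0.005, 3.0)    # about 1,667 tokens/min
gemini_out_rate = implied_tokens_per_minute(0.018, 12.0)  # about 1,500 tokens/min

if __name__ == "__main__":
    print(f"implied input rate:  {gemini_in_rate:,.0f} tokens/min")
    print(f"implied output rate: {gemini_out_rate:,.0f} tokens/min")
```

Per-minute figures derived this way are approximations: the real bill depends on how many audio tokens a given conversation actually consumes in each direction.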
Cost Reality Check
Advertised per-minute prices are the start of the bill, not the end:
| Platform | Advertised | Real-World All-In | The Gap |
|---|---|---|---|
| Vapi | $0.05/min | $0.13-0.33/min | Providers at cost + concurrency lines + add-ons (HIPAA $2K/mo) |
| Retell AI | $0.07/min | $0.13-0.20+/min | Voice surcharges (ElevenLabs +$0.04/min), Assure QA +$0.10/min |
| OpenAI Realtime | Token rates | $0.15-0.20/min | Audio token accumulation; caching ($0.40/1M) is the main lever |
| ElevenLabs Agents | $0.10/min | $0.10 + LLM tokens | LLM metered separately |
| LiveKit / Pipecat | $0.01/min | $0.03-0.15/min | Provider stack dominates; cheapest with budget components |
Cost optimization: the cheapest credible stacks remain self-hosted frameworks (LiveKit, Pipecat) with budget providers (~$0.03-0.05/min), or Gemini Live / Nova 2 Sonic for managed S2S at ~$0.015-0.02/min.
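For budgeting, the gap in the table can be turned into a multiple of the headline price and a projected monthly range. This is a minimal sketch using only the rows that list a per-minute headline price; the `PriceGap` class and the 10,000 minutes/month volume are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceGap:
    platform: str
    advertised: float  # headline $/min from the table
    real_low: float    # low end of the reported all-in $/min
    real_high: float   # high end; "+" ranges in the table are open-ended

    def multiples(self) -> tuple[float, float]:
        """Low and high all-in cost as a multiple of the advertised price."""
        return (self.real_low / self.advertised, self.real_high / self.advertised)

    def monthly_range(self, minutes: float) -> tuple[float, float]:
        """Projected monthly spend range in USD for a given call volume."""
        return (self.real_low * minutes, self.real_high * minutes)


# Rows copied from the table above.
GAPS = [
    PriceGap("Vapi", 0.05, 0.13, 0.33),
    PriceGap("Retell AI", 0.07, 0.13, 0.20),
    PriceGap("LiveKit / Pipecat", 0.01, 0.03, 0.15),
]

if __name__ == "__main__":
    for gap in GAPS:
        lo, hi = gap.multiples()
        m_lo, m_hi = gap.monthly_range(10_000)  # assumed volume, not from the report
        print(f"{gap.platform}: {lo:.1f}x-{hi:.1f}x headline, ${m_lo:,.0f}-${m_hi:,.0f}/month")
```

Projecting from the reported range rather than a single multiplier keeps the open-ended upper bounds visible in the plan.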
Product Profiles
Speech-to-Speech Models
OpenAI Realtime API — the accuracy leader, three model generations in nine months[2]
- gpt-realtime-2 (May 2026): 96.6% Big Bench Audio, 128K context, tunable reasoning effort
- Companions: gpt-realtime-translate (70+ input languages) and gpt-realtime-whisper (streaming STT)
- MCP servers, image input, SIP calling, parallel tool calls
- ⚠️ Premium pricing; unpredictable token costs without caching
Gemini Live API — Google's entry, GA with SLAs[1]
- Gemini 2.5 Flash Native Audio: affective dialogue, proactive audio, barge-in, vision input, 70 languages
- GA on Vertex AI (I/O 2026) with multi-region failover; Shopify and UWM in production
- Cheapest lab S2S (~$0.005/min in, $0.018/min out)
- ⚠️ Developer-API models still preview-suffixed; no first-party telephony; key agent controls not configurable
AWS Nova 2 Sonic — Bedrock-native, the enterprise budget option[11]
- $3/$12 per 1M speech tokens (~$0.015/min) — ~80% below GPT-class pricing
- 7 languages with code-switching polyglot voices, async tool calling, DTMF
- 4 regions; native AWS security/compliance/billing
- ⚠️ Standard tier only; sparse community adoption
NVIDIA PersonaPlex — open-weights full-duplex[12]
- ~70ms speaker-switch (vs ~1,260ms for Gemini Live); voice + role prompting; Moshi lineage
- ~316K monthly HuggingFace downloads; community Apple Silicon port
- ⚠️ Research-grade v1.0, no updates since January; productization path is Nemotron 3 VoiceChat
Speech Stacks
ElevenLabs Agents — voice-quality leader, now a full agent platform[6]
- Renamed from Conversational AI; $500M Series D at $11B (Feb 2026); 2M+ agents, 33M+ conversations
- 70+ languages, visual workflow builder, testing suite, Salesforce/Zendesk/HubSpot integrations
- ⚠️ ~$0.10/min plus separately-metered LLM tokens; cloud-only
Cartesia — state-space models, the latency benchmark[7]
- Sonic-3.5 #1 on Artificial Analysis Speech Arena (May 2026); 40ms turbo TTFB; 42 languages
- Ink STT + Line voice-agent platform ($0.06/min); ~$191M raised (KP, Index, Lightspeed, NVIDIA)
- ⚠️ Reports of long-utterance artifacts; Line platform is young
Gradium — the Kyutai/Moshi commercial spinoff[8]
- $70M seed (FirstMark + Eurazeo, with Xavier Niel, Eric Schmidt); production STT/TTS in 5 languages, 237 voices
- Full-duplex research lineage (Moshi: 7B, ~160ms)
- ⚠️ No published latency benchmarks; credit pricing obscures per-minute costs; two quarters old
Rime — phone-scale TTS specialist[9]
- Arcana v3 (Feb 2026); trained on conversational speech, not narration; 10 languages with code-switching
- 100M+ phone conversations/month — Domino's, Wingstop; 120ms on-prem deployment
- ⚠️ $5.5M seed (thin capital); headline metrics vendor-reported; smaller voice library than hyperscalers
Deepgram Aura — enterprise TTS + Voice Agent API[13]
- Aura-2 in 7 languages; sub-200ms; domain-tuned pronunciation; Voice Agent API GA ($0.041-0.163/min)
- $1.3B valuation; 200K+ developers claimed; on-premise option
- ⚠️ Voice quality trails ElevenLabs/Cartesia in community comparisons
Orchestration
Vapi — the enterprise-validated orchestrator[3]
- $50M Series B at ~$500M (Peak XV, May 2026); 1B+ calls; Amazon Ring won over 40 rivals; Intuit, ServiceTitan
- Mix-and-match providers at cost ($0 BYOK markup); strongest telephony (SIP, BYOC); Squads multi-assistant orchestration
- ⚠️ Independent measurements put production latency at 500-800ms vs sub-500ms claims; costs stack quickly
Retell AI — capital-efficient phone automation[10]
- Profitable on $5.1M raised; Assure automated QA (GA Jan 2026); HIPAA/SOC2/GDPR; 99.99% SLA
- Transparent component pricing; no-code builder plus API
- ⚠️ Real-world costs escalate to $0.13-0.20+/min; support complaints in reviews
LiveKit Agents — the open framework with the biggest logos[4]
- $100M Series C at $1B (Jan 2026); powers ChatGPT voice; Tesla, xAI, Salesforce
- Apache 2.0, 10.9K stars; semantic turn detection; Agent Builder no-code; LiveKit Inference bundled models
- ⚠️ Self-hosting carries DIY complexity; community reports say cloud costs add up
Pipecat — the vendor-neutral open alternative[5]
- Daily's BSD-2 framework, 12.7K stars; v1.0 (April) → v1.3 multi-agent (May 2026); ~90 integrations
- Pipecat Cloud $0.01/min active; NVIDIA partnership distribution
- ⚠️ No latency SLA; deployment-at-scale is the known hard part for both open frameworks
Architecture Patterns
Speech-to-Speech vs Pipeline
Speech-to-Speech (OpenAI Realtime, Gemini Live, Nova 2 Sonic, PersonaPlex):
- Single model for understanding + generation
- Lower latency, better nuance preservation
- More integrated but less modular
Pipeline (Vapi, Retell, LiveKit, Pipecat + speech stacks):
- Separate STT → LLM → TTS components
- Provider flexibility, cost optimization
- Higher latency, more moving parts
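The structural difference between the two patterns can be sketched in a few lines. Everything here is hypothetical: the type aliases, `make_pipeline`, and the `fake_*` stubs are stand-ins for illustration and do not correspond to any vendor's SDK.

```python
from typing import Callable

# Each stage is just a callable, so providers can be swapped independently.
SpeechToText = Callable[[bytes], str]
LanguageModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]
SpeechToSpeech = Callable[[bytes], bytes]


def make_pipeline(stt: SpeechToText, llm: LanguageModel, tts: TextToSpeech) -> SpeechToSpeech:
    """Compose three stages into one audio-in, audio-out turn handler."""
    def handle_turn(audio_in: bytes) -> bytes:
        transcript = stt(audio_in)  # hop 1: audio -> text
        reply = llm(transcript)     # hop 2: text -> text
        return tts(reply)           # hop 3: text -> audio
    return handle_turn


# Hypothetical stand-ins; real stages would call a provider's API.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline_agent = make_pipeline(fake_stt, fake_llm, fake_tts)


# An S2S model has the same audio-in, audio-out shape but no internal seams
# to swap; this stub stands in for the single model call.
def fake_s2s(audio: bytes) -> bytes:
    return b"You said: " + audio
```

Swapping a TTS provider in the pipeline means passing a different `tts` callable, which is the flexibility the pipeline buys at the price of three sequential hops per turn.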
What changed since February: the price floor for S2S collapsed (Gemini Live and Nova 2 Sonic at ~$0.015-0.02/min), removing "S2S is the premium option" as a rule of thumb. Pipelines now win on control, not cost.
Full-Duplex vs Turn-Based
Full-Duplex (PersonaPlex, Moshi/Gradium lineage): simultaneous listening and speaking, natural backchanneling — still mostly research-grade. Turn-Based (most production systems): semantic turn detection (LiveKit) and proactive audio (Gemini) are closing the naturalness gap from the turn-based side.
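Turn-based conversation management is, at its core, a small state machine with an interruption (barge-in) path. The sketch below is illustrative only: the states, events, and transition table are assumptions, not any platform's actual implementation. A full-duplex model has no such machine, because listening and speaking run concurrently.

```python
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()


class Event(Enum):
    USER_STARTED = auto()    # user speech detected
    USER_STOPPED = auto()    # end of turn detected
    REPLY_READY = auto()     # agent has audio to play
    REPLY_FINISHED = auto()  # agent finished playing audio


TRANSITIONS = {
    (State.LISTENING, Event.USER_STOPPED): State.THINKING,
    (State.THINKING, Event.REPLY_READY): State.SPEAKING,
    (State.SPEAKING, Event.REPLY_FINISHED): State.LISTENING,
    # Barge-in: the user talks over the agent, so drop the reply and listen.
    (State.SPEAKING, Event.USER_STARTED): State.LISTENING,
    (State.THINKING, Event.USER_STARTED): State.LISTENING,
}


def step(state: State, event: Event) -> State:
    """Apply one event; events with no listed transition leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(events: list[Event], state: State = State.LISTENING) -> State:
    """Fold a sequence of events into a final state."""
    for event in events:
        state = step(state, event)
    return state
```

In this framing, semantic turn detection is about deciding more accurately when to emit `USER_STOPPED`.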
Notable Others / Did Not Meet Prerequisites
- Hume AI — emotion-focused voice AI; real products (EVI, Octave) but no major in-window event; watchlist
- Bland AI — phone agents; no fresh funding/launch news surfaced this cycle
- Sesame — research attention but no shipped developer API found
- Cekura (YC F24) — voice-agent testing/observability, launched March 2026 — the eval layer emerging on top of this category; watchlist for a future category
- Amazon Lex / Twilio Voice AI / Play.ht — legacy intent-slot architecture, orchestration-only, and TTS-only respectively
Strategic Recommendations
By Use Case
| Use Case | Recommended | Runner-Up |
|---|---|---|
| High-accuracy agents | OpenAI Realtime (gpt-realtime-2) | Gemini Live |
| Google Cloud stack | Gemini Live | — |
| AWS enterprise | AWS Nova 2 Sonic | — |
| Voice quality priority | ElevenLabs Agents | Cartesia |
| Lowest TTS latency | Cartesia (40ms) | Rime (on-prem 120ms) |
| Phone automation | Vapi | Retell AI |
| Phone-scale TTS | Rime | Deepgram Aura |
| Self-hosted deployment | LiveKit Agents | Pipecat |
| Open source framework | Pipecat (BSD-2) | LiveKit (Apache 2.0) |
| Full-duplex research | PersonaPlex | Gradium (Moshi lineage) |
| Cost optimization | Nova 2 Sonic / Gemini Live (S2S) | LiveKit/Pipecat + budget providers |
| European data residency | Gradium | ElevenLabs (EU residency) |
By Team Profile
Enterprise with compliance needs: → ElevenLabs (HIPAA), AWS Nova 2 Sonic (Bedrock compliance), Vapi (HIPAA add-on), or Deepgram/Rime (on-premise)
Startup iterating quickly: → Retell AI (no-code + API, transparent components) or Vapi (provider flexibility, enterprise-proven)
Developer team wanting control: → LiveKit Agents or Pipecat — both 1.0+, both open source; LiveKit for WebRTC depth, Pipecat for vendor neutrality
Phone-first call center: → Vapi (Amazon Ring-validated) or Retell AI (QA tooling built in), with Rime as the TTS workhorse
Market Outlook
Near-Term (2026)
- More providers adding speech-to-speech models: done. Gemini Live GA completes the lab tier; watch only Anthropic
- The eval/observability layer (Cekura et al.) matures as production deployments expose quality drift
- Price pressure flows downhill from Gemini/Nova S2S pricing into the orchestration layer
Medium-Term (2027)
- Speech-to-speech becomes the default architecture; pipelines hold regulated and cost-controlled niches
- Full-duplex moves from research (PersonaPlex, Moshi lineage) to production
- MCP becomes the standard tool-calling surface across voice platforms
Long-Term (2028+)
- Category consolidation around 3-4 leaders per stratum
- Integration with agent orchestration platforms (Tembo, etc.)
- On-device voice AI (Cartesia on-device Sonic, Gradium edge SDK) reduces cloud dependency
Bottom Line
13 platforms serve the agentic voice market across three strata:
| Platform | Best For | Key Differentiator |
|---|---|---|
| OpenAI Realtime | Accuracy-critical agents | gpt-realtime-2, 96.6% BBA, MCP + SIP |
| Gemini Live | Google-stack teams | GA with SLAs, cheapest lab S2S |
| AWS Nova 2 Sonic | AWS enterprises | ~$0.015/min, Bedrock compliance |
| ElevenLabs Agents | Voice quality + platform | $11B, 2M+ agents, 70+ languages |
| Cartesia | Latency + quality | SSM models, 40ms TTFB, #1 Speech Arena |
| Gradium | EU / full-duplex lineage | Kyutai spinoff, $70M, 5 languages |
| Rime | Phone-scale TTS | 100M+ calls/month, 120ms on-prem |
| Deepgram Aura | Enterprise TTS + agent API | Domain pronunciation, on-premise |
| PersonaPlex | Self-hosted full-duplex | Open weights, ~70ms switching |
| Vapi | Enterprise phone agents | Amazon Ring win, 1B+ calls |
| Retell AI | Transparent phone automation | Profitable, built-in QA (Assure) |
| LiveKit Agents | Open-source control | $1B, powers ChatGPT voice |
| Pipecat | Vendor-neutral pipelines | BSD-2, ~90 integrations |
The story of this refresh: the category stratified. Labs own the S2S model layer (and collapsed its price), specialists own speech quality and latency, and orchestrators own the enterprise deployment surface — with billion-dollar valuations now at every layer. Pick your stratum first, then your vendor; and budget 2-3x the advertised per-minute price.
Research by Ry Walker Research
Disclosure: Author is CEO of Tembo, which may integrate with voice AI platforms for agent communication.