
Agentic Voice APIs

A comparison of 13 leading voice AI platforms for building agentic voice applications — OpenAI Realtime, Gemini Live, AWS Nova 2 Sonic, ElevenLabs, Cartesia, Gradium, Rime, NVIDIA PersonaPlex, Vapi, Retell AI, LiveKit Agents, Pipecat, and Deepgram Aura.

Key takeaways

  • Every major lab except Anthropic now ships a native speech-to-speech API: Gemini Live went GA at I/O 2026, joining OpenAI Realtime (now gpt-realtime-2) and AWS Nova 2 Sonic
  • The money validated the orchestration layer: Vapi raised $50M at ~$500M after winning Amazon Ring over 40 rivals, LiveKit hit a $1B valuation, and ElevenLabs closed $500M at $11B
  • The category is consolidating into strata — speech-to-speech models, speech stacks (TTS/STT specialists like Cartesia, Gradium, Rime), and orchestration frameworks (Vapi, Retell, LiveKit, Pipecat)
  • Advertised per-minute prices are fiction: real-world all-in costs run 2-3x the headline number on nearly every platform — Vapi $0.13-0.33, Retell $0.13-0.20+, OpenAI $0.15-0.20

FAQ

What's the best voice API for AI agents?

OpenAI Realtime (gpt-realtime-2) for accuracy and tool calling, Gemini Live for Google-stack teams, ElevenLabs Agents for voice quality, Vapi/Retell for phone automation, LiveKit Agents or Pipecat for open-source orchestration.

What does speech-to-speech mean?

Speech-to-speech (S2S) models process audio input and generate audio output directly, eliminating the traditional ASR→LLM→TTS pipeline for lower latency and better nuance preservation.

Which voice APIs support self-hosting?

NVIDIA PersonaPlex (open weights), LiveKit Agents and Pipecat (open-source frameworks), Deepgram (on-premise option), and Rime (on-prem at 120ms) support self-hosted deployment.

What's the cheapest voice API?

AWS Nova 2 Sonic (~$0.015/min) and Gemini Live (~$0.005/min in, $0.018/min out) are the cheapest S2S models. For pipelines, LiveKit or Pipecat cost $0.01/min per agent session plus providers at cost. Budget 2-3x any advertised number for real-world all-in costs.

Executive Summary

The voice AI infrastructure category grew up fast: every major lab except Anthropic now ships a native speech-to-speech API, the orchestration layer got billion-dollar validation, and a stratum of speech-model specialists raised serious capital. This report compares 13 platforms across three layers — S2S models, speech stacks, and orchestration.

Key Findings:

  • The lab tier is complete (minus one): Gemini Live went GA on Vertex AI at I/O 2026, joining OpenAI Realtime (now gpt-realtime-2, 96.6% Big Bench Audio) and AWS Nova 2 Sonic; Anthropic still has no Realtime equivalent[1][2]
  • The orchestration layer got its valuations: Vapi raised $50M at ~$500M after Amazon Ring chose it over 40 rivals (1B+ calls handled); LiveKit hit $1B with Tesla and xAI as customers; Pipecat hit v1.0 as the open alternative[3][4][5]
  • ElevenLabs became ElevenAgents: renamed its platform, closed $500M at $11B (Sequoia), 2M+ agents created, 70+ languages[6]
  • The speech-stack stratum emerged: Cartesia (SSM-based Sonic-3.5, #1 on the Speech Arena leaderboard, ~$191M raised), Gradium (Kyutai/Moshi spinoff, $70M seed), and Rime (100M+ phone conversations/month for Domino's and Wingstop) join Deepgram as the building-block tier[7][8][9]
  • Capital efficiency outlier: Retell AI reached profitability on $5.1M raised while shipping an automated QA product (Assure)[10]
  • Real costs run 2-3x advertised: independent reviews put Vapi at $0.13-0.33/min and Retell at $0.13-0.20+/min all-in versus $0.05-0.07 headline prices

Strategic Planning Assumptions:

  • By 2027, speech-to-speech becomes the default architecture for voice agents — with pipeline orchestration surviving where provider choice and cost control matter
  • By 2028, full-duplex (simultaneous listening/speaking) will be table stakes
  • The eval/observability layer (Cekura and peers) becomes its own category as deployments mature

Market Definition

Agentic voice APIs enable AI agents to communicate through speech, providing:

  • Speech Understanding — Converting spoken input to meaning (beyond transcription)
  • Voice Generation — Producing natural-sounding speech output
  • Conversation Management — Handling turns, interruptions, and context
  • Tool Integration — Connecting voice to actions (function calling, MCP)

Key distinction from traditional telephony: These APIs are designed for AI-native applications, not human-to-human call routing. They optimize for agent intelligence, natural interaction, and programmatic control.

The three strata:

  1. Speech-to-speech models — OpenAI Realtime, Gemini Live, Nova 2 Sonic, PersonaPlex
  2. Speech stacks (TTS/STT building blocks) — ElevenLabs, Cartesia, Gradium, Rime, Deepgram Aura
  3. Orchestration platforms & frameworks — Vapi, Retell AI, LiveKit Agents, Pipecat

Comparison Matrix

| Platform | Layer | Self-Host | Latency | Pricing | Maturity |
|---|---|---|---|---|---|
| OpenAI Realtime | S2S model | | Native S2S, effort-tunable | $32/$64 per 1M audio tokens (~$0.15-0.20/min) | GA, 3rd model generation |
| Gemini Live | S2S model | | Real-time WebSocket | $3/$12 per 1M audio (~$0.005 in / $0.018 out per min) | GA on Vertex (I/O 2026) |
| AWS Nova 2 Sonic | S2S model | | Low (no published ms) | $3/$12 per 1M speech tokens (~$0.015/min) | GA, 4 regions |
| PersonaPlex | S2S model (open) | ✅ | ~70ms speaker-switch | Free weights + GPU costs | Research v1.0, frozen since Jan |
| ElevenLabs Agents | Speech stack + platform | | Sub-second (vendor) | ~$0.10/min ($0.08 Business) + LLM tokens | GA, $11B, 2M+ agents |
| Cartesia | Speech stack | Partial (on-device Sonic) | 40-90ms TTFB | $4-239/mo credits; Line agents $0.06/min | GA, #1 Speech Arena |
| Gradium | Speech stack | Edge SDK | "Ultra-low" (no published benchmarks) | Credit-based, $0-1,615/mo tiers | ~2 quarters old, production |
| Rime | Speech stack (TTS) | ✅ (120ms on-prem) | 120ms on-prem / ~200ms cloud | $0.03-0.05/1K chars | 100M+ calls/month |
| Deepgram Aura | Speech stack | ✅ | Sub-200ms | $0.030/1K chars; Voice Agent $0.041-0.163/min | GA, $1.3B valuation |
| Vapi | Orchestration | | 500-800ms measured | $0.05/min + providers at cost ($0.13-0.33 real) | Series B, ~$500M, 1B+ calls |
| Retell AI | Orchestration | | ~600ms claimed | $0.07+/min advertised ($0.13-0.20+ real) | Profitable on $5.1M |
| LiveKit Agents | Framework | ✅ | WebRTC + semantic turn detection | $0.01/min session + providers; free self-host | v1.5, $1B, powers ChatGPT voice |
| Pipecat | Framework | ✅ | WebRTC via Daily | Free (BSD-2); Cloud $0.01/min | v1.3, 12.7K stars |
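
The per-minute figures in the Pricing column follow from the per-token rates once a token rate is fixed. Here is a minimal sketch, assuming audio is billed at about 25 tokens per second (1,500 per minute). That rate is an assumption back-solved from the Gemini Live row, not a figure stated in this report, and the helper name is hypothetical.

```python
def per_minute_cost(usd_per_million_tokens: float, tokens_per_second: float = 25.0) -> float:
    """Convert a per-1M-token audio price into dollars per minute of audio.

    tokens_per_second is an assumed billing rate, not a vendor-confirmed value.
    """
    tokens_per_minute = tokens_per_second * 60
    return usd_per_million_tokens * tokens_per_minute / 1_000_000


# Gemini Live row: $3 in / $12 out per 1M audio tokens.
gemini_in = per_minute_cost(3.0)
gemini_out = per_minute_cost(12.0)
print(f"in: ${gemini_in:.4f}/min, out: ${gemini_out:.4f}/min")
```

Under that assumption the output side lands on the table's $0.018/min and the input side rounds to about $0.005/min. Other vendors tokenize audio at different rates, so applying the same conversion to the OpenAI or Nova rows needs their own tokens-per-second figure.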

Cost Reality Check

Advertised per-minute prices are the start of the bill, not the end:

| Platform | Advertised | Real-World All-In | The Gap |
|---|---|---|---|
| Vapi | $0.05/min | $0.13-0.33/min | Providers at cost + concurrency lines + add-ons (HIPAA $2K/mo) |
| Retell AI | $0.07/min | $0.13-0.20+/min | Voice surcharges (ElevenLabs +$0.04/min), Assure QA +$0.10/min |
| OpenAI Realtime | Token rates | $0.15-0.20/min | Audio token accumulation; caching ($0.40/1M) is the main lever |
| ElevenLabs Agents | $0.10/min | $0.10 + LLM tokens | LLM metered separately |
| LiveKit / Pipecat | $0.01/min | $0.03-0.15/min | Provider stack dominates; cheapest with budget components |

Cost optimization: the cheapest credible stacks remain self-hosted frameworks (LiveKit, Pipecat) with budget providers (~$0.03-0.05/min), or Gemini Live / Nova 2 Sonic for managed S2S at ~$0.015-0.02/min.
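
The gap between advertised and all-in pricing can be expressed as a multiple of the advertised rate. This is a short sketch using only the Vapi and Retell AI figures from the table above; the function and variable names are illustrative.

```python
# (advertised $/min, real-world low $/min, real-world high $/min), from the table above
ROWS = {
    "Vapi": (0.05, 0.13, 0.33),
    "Retell AI": (0.07, 0.13, 0.20),
}


def gap_multiple(advertised: float, low: float, high: float) -> tuple[float, float]:
    """Return the (low, high) all-in cost as multiples of the advertised price."""
    return low / advertised, high / advertised


for name, (advertised, low, high) in ROWS.items():
    lo_x, hi_x = gap_multiple(advertised, low, high)
    print(f"{name}: {lo_x:.1f}x to {hi_x:.1f}x the advertised rate")
```

Retell's upper figure is quoted as $0.20+, so its high multiple is a lower bound. On these two rows the low end sits near 2x while Vapi's high end runs well past 3x, so the 2-3x rule works better as a planning baseline than as a ceiling.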


Product Profiles

Speech-to-Speech Models

OpenAI Realtime API — the accuracy leader, three model generations in nine months[2]

  • gpt-realtime-2 (May 2026): 96.6% Big Bench Audio, 128K context, tunable reasoning effort
  • Companions: gpt-realtime-translate (70+ input languages) and gpt-realtime-whisper (streaming STT)
  • MCP servers, image input, SIP calling, parallel tool calls
  • ⚠️ Premium pricing; unpredictable token costs without caching

Gemini Live API — Google's entry, GA with SLAs[1]

  • Gemini 2.5 Flash Native Audio: affective dialogue, proactive audio, barge-in, vision input, 70 languages
  • GA on Vertex AI (I/O 2026) with multi-region failover; Shopify and UWM in production
  • Cheapest lab S2S (~$0.005/min in, $0.018/min out)
  • ⚠️ Developer-API models still preview-suffixed; no first-party telephony; key agent controls not configurable

AWS Nova 2 Sonic — Bedrock-native, the enterprise budget option[11]

  • $3/$12 per 1M speech tokens (~$0.015/min) — ~80% below GPT-class pricing
  • 7 languages with code-switching polyglot voices, async tool calling, DTMF
  • 4 regions; native AWS security/compliance/billing
  • ⚠️ Standard tier only; sparse community adoption

NVIDIA PersonaPlex — open-weights full-duplex[12]

  • ~70ms speaker-switch (vs ~1,260ms for Gemini Live); voice + role prompting; Moshi lineage
  • ~316K monthly HuggingFace downloads; community Apple Silicon port
  • ⚠️ Research-grade v1.0, no updates since January; productization path is Nemotron 3 VoiceChat

Speech Stacks

ElevenLabs Agents — voice-quality leader, now a full agent platform[6]

  • Renamed from Conversational AI; $500M Series D at $11B (Feb 2026); 2M+ agents, 33M+ conversations
  • 70+ languages, visual workflow builder, testing suite, Salesforce/Zendesk/HubSpot integrations
  • ⚠️ ~$0.10/min plus separately-metered LLM tokens; cloud-only

Cartesia — state-space models, the latency benchmark[7]

  • Sonic-3.5 #1 on Artificial Analysis Speech Arena (May 2026); 40ms turbo TTFB; 42 languages
  • Ink STT + Line voice-agent platform ($0.06/min); ~$191M raised (KP, Index, Lightspeed, NVIDIA)
  • ⚠️ Reports of long-utterance artifacts; Line platform is young

Gradium — the Kyutai/Moshi commercial spinoff[8]

  • $70M seed (FirstMark + Eurazeo, with Xavier Niel, Eric Schmidt); production STT/TTS in 5 languages, 237 voices
  • Full-duplex research lineage (Moshi: 7B, ~160ms)
  • ⚠️ No published latency benchmarks; credit pricing obscures per-minute costs; two quarters old

Rime — phone-scale TTS specialist[9]

  • Arcana v3 (Feb 2026); trained on conversational speech, not narration; 10 languages with code-switching
  • 100M+ phone conversations/month — Domino's, Wingstop; 120ms on-prem deployment
  • ⚠️ $5.5M seed (thin capital); headline metrics vendor-reported; smaller voice library than hyperscalers

Deepgram Aura — enterprise TTS + Voice Agent API[13]

  • Aura-2 in 7 languages; sub-200ms; domain-tuned pronunciation; Voice Agent API GA ($0.041-0.163/min)
  • $1.3B valuation; 200K+ developers claimed; on-premise option
  • ⚠️ Voice quality trails ElevenLabs/Cartesia in community comparisons

Orchestration

Vapi — the enterprise-validated orchestrator[3]

  • $50M Series B at ~$500M (Peak XV, May 2026); 1B+ calls; Amazon Ring won over 40 rivals; Intuit, ServiceTitan
  • Mix-and-match providers at cost ($0 BYOK markup); strongest telephony (SIP, BYOC); Squads multi-assistant orchestration
  • ⚠️ Independent measurements put production latency at 500-800ms vs sub-500ms claims; costs stack quickly

Retell AI — capital-efficient phone automation[10]

  • Profitable on $5.1M raised; Assure automated QA (GA Jan 2026); HIPAA/SOC2/GDPR; 99.99% SLA
  • Transparent component pricing; no-code builder plus API
  • ⚠️ Real-world costs escalate to $0.13-0.20+/min; support complaints in reviews

LiveKit Agents — the open framework with the biggest logos[4]

  • $100M Series C at $1B (Jan 2026); powers ChatGPT voice; Tesla, xAI, Salesforce
  • Apache 2.0, 10.9K stars; semantic turn detection; Agent Builder no-code; LiveKit Inference bundled models
  • ⚠️ Self-hosting carries DIY complexity; community reports say cloud costs add up

Pipecat — the vendor-neutral open alternative[5]

  • Daily's BSD-2 framework, 12.7K stars; v1.0 (April) → v1.3 multi-agent (May 2026); ~90 integrations
  • Pipecat Cloud $0.01/min active; NVIDIA partnership distribution
  • ⚠️ No latency SLA; deployment-at-scale is the known hard part for both open frameworks

Architecture Patterns

Speech-to-Speech vs Pipeline

Speech-to-Speech (OpenAI Realtime, Gemini Live, Nova 2 Sonic, PersonaPlex):

  • Single model for understanding + generation
  • Lower latency, better nuance preservation
  • More integrated but less modular

Pipeline (Vapi, Retell, LiveKit, Pipecat + speech stacks):

  • Separate STT → LLM → TTS components
  • Provider flexibility, cost optimization
  • Higher latency, more moving parts

What changed since February: the price floor for S2S collapsed (Gemini Live and Nova 2 Sonic at ~$0.015-0.02/min), removing "S2S is the premium option" as a rule of thumb. Pipelines now win on control, not cost.
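
The structural difference between the two patterns can be sketched with stub components. Everything in this example is hypothetical: the stage functions stand in for real STT, LLM, TTS, and S2S providers, and no vendor API is being modeled.

```python
from typing import Callable

# Each stage is a plain callable so providers can be swapped independently.
Stt = Callable[[bytes], str]
Llm = Callable[[str], str]
Tts = Callable[[str], bytes]
S2s = Callable[[bytes], bytes]


def pipeline_turn(audio_in: bytes, stt: Stt, llm: Llm, tts: Tts) -> bytes:
    """Pipeline pattern: three hops, each one a separate, replaceable provider."""
    transcript = stt(audio_in)
    reply_text = llm(transcript)
    return tts(reply_text)


def s2s_turn(audio_in: bytes, model: S2s) -> bytes:
    """Speech-to-speech pattern: one model maps input audio straight to output audio."""
    return model(audio_in)


# Stub providers for illustration only; real ones would stream over a network.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def fake_s2s(audio: bytes) -> bytes:
    return b"You said: " + audio


print(pipeline_turn(b"hello", fake_stt, fake_llm, fake_tts))
print(s2s_turn(b"hello", fake_s2s))
```

Swapping a provider in the pipeline means passing a different callable for one stage. In the S2S case the only swap available is the whole model, which is the modularity trade-off listed above.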

Full-Duplex vs Turn-Based

Full-Duplex (PersonaPlex, Moshi/Gradium lineage): simultaneous listening and speaking with natural backchanneling; still mostly research-grade.

Turn-Based (most production systems): semantic turn detection (LiveKit) and proactive audio (Gemini) are closing the naturalness gap from the turn-based side.
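
Turn-based handling can be pictured as a small state machine in which user speech during agent playback triggers a barge-in. This is a simplified sketch with hypothetical state and event names; production systems rely on voice-activity or semantic turn detection models rather than discrete events like these.

```python
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()


# (current state, event) -> next state; unlisted pairs leave the state unchanged.
TRANSITIONS = {
    (State.LISTENING, "user_stopped"): State.THINKING,
    (State.THINKING, "reply_ready"): State.SPEAKING,
    (State.SPEAKING, "playback_done"): State.LISTENING,
    (State.SPEAKING, "user_started"): State.LISTENING,  # barge-in: cut playback
}


def step(state: State, event: str) -> State:
    """Advance the turn-taking state machine by one event."""
    return TRANSITIONS.get((state, event), state)


state = State.LISTENING
for event in ["user_stopped", "reply_ready", "user_started"]:
    state = step(state, event)
print(state.name)
```

A full-duplex model has no such external table: it listens and speaks at the same time, so overlap and backchannels are handled inside the model rather than by a controller around it.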


Notable Others / Did Not Meet Prerequisites

  • Hume AI — emotion-focused voice AI; real products (EVI, Octave) but no major in-window event; watchlist
  • Bland AI — phone agents; no fresh funding/launch news surfaced this cycle
  • Sesame — research attention but no shipped developer API found
  • Cekura (YC F24) — voice-agent testing/observability, launched March 2026 — the eval layer emerging on top of this category; watchlist for a future category
  • Amazon Lex / Twilio Voice AI / Play.ht — legacy intent-slot architecture, orchestration-only, and TTS-only respectively

Strategic Recommendations

By Use Case

| Use Case | Recommended | Runner-Up |
|---|---|---|
| High-accuracy agents | OpenAI Realtime (gpt-realtime-2) | Gemini Live |
| Google Cloud stack | Gemini Live | |
| AWS enterprise | AWS Nova 2 Sonic | |
| Voice quality priority | ElevenLabs Agents | Cartesia |
| Lowest TTS latency | Cartesia (40ms) | Rime (on-prem 120ms) |
| Phone automation | Vapi | Retell AI |
| Phone-scale TTS | Rime | Deepgram Aura |
| Self-hosted deployment | LiveKit Agents | Pipecat |
| Open source framework | Pipecat (BSD-2) | LiveKit (Apache 2.0) |
| Full-duplex research | PersonaPlex | Gradium (Moshi lineage) |
| Cost optimization | Nova 2 Sonic / Gemini Live (S2S) | LiveKit/Pipecat + budget providers |
| European data residency | Gradium | ElevenLabs (EU residency) |

By Team Profile

Enterprise with compliance needs: → ElevenLabs (HIPAA), AWS Nova 2 Sonic (Bedrock compliance), Vapi (HIPAA add-on), or Deepgram/Rime (on-premise)

Startup iterating quickly: → Retell AI (no-code + API, transparent components) or Vapi (provider flexibility, enterprise-proven)

Developer team wanting control: → LiveKit Agents or Pipecat — both 1.0+, both open source; LiveKit for WebRTC depth, Pipecat for vendor neutrality

Phone-first call center: → Vapi (Amazon Ring-validated) or Retell AI (QA tooling built in), with Rime as the TTS workhorse


Market Outlook

Near-Term (2026)

  • More providers adding speech-to-speech models (done): Gemini Live GA completes the lab tier; watch only Anthropic
  • The eval/observability layer (Cekura et al.) matures as production deployments expose quality drift
  • Price pressure flows downhill from Gemini/Nova S2S pricing into the orchestration layer

Medium-Term (2027)

  • Speech-to-speech becomes the default architecture; pipelines hold regulated and cost-controlled niches
  • Full-duplex moves from research (PersonaPlex, Moshi lineage) to production
  • MCP becomes the standard tool-calling surface across voice platforms

Long-Term (2028+)

  • Category consolidation around 3-4 leaders per stratum
  • Integration with agent orchestration platforms (Tembo, etc.)
  • On-device voice AI (Cartesia on-device Sonic, Gradium edge SDK) reduces cloud dependency

Bottom Line

13 platforms serve the agentic voice market across three strata:

| Platform | Best For | Key Differentiator |
|---|---|---|
| OpenAI Realtime | Accuracy-critical agents | gpt-realtime-2, 96.6% BBA, MCP + SIP |
| Gemini Live | Google-stack teams | GA with SLAs, cheapest lab S2S |
| AWS Nova 2 Sonic | AWS enterprises | ~$0.015/min, Bedrock compliance |
| ElevenLabs Agents | Voice quality + platform | $11B, 2M+ agents, 70+ languages |
| Cartesia | Latency + quality | SSM models, 40ms TTFB, #1 Speech Arena |
| Gradium | EU / full-duplex lineage | Kyutai spinoff, $70M, 5 languages |
| Rime | Phone-scale TTS | 100M+ calls/month, 120ms on-prem |
| Deepgram Aura | Enterprise TTS + agent API | Domain pronunciation, on-premise |
| PersonaPlex | Self-hosted full-duplex | Open weights, ~70ms switching |
| Vapi | Enterprise phone agents | Amazon Ring win, 1B+ calls |
| Retell AI | Transparent phone automation | Profitable, built-in QA (Assure) |
| LiveKit Agents | Open-source control | $1B, powers ChatGPT voice |
| Pipecat | Vendor-neutral pipelines | BSD-2, ~90 integrations |

The story of this refresh: the category stratified. Labs own the S2S model layer (and collapsed its price), specialists own speech quality and latency, and orchestrators own the enterprise deployment surface — with billion-dollar valuations now at every layer. Pick your stratum first, then your vendor; and budget 2-3x the advertised per-minute price.


Research by Ry Walker Research

Disclosure: Author is CEO of Tembo, which may integrate with voice AI platforms for agent communication.