Agentic Voice APIs Compared | Ry Walker Research

Key takeaways

Every major lab except Anthropic now ships a native speech-to-speech API: Gemini Live went GA at I/O 2026, joining OpenAI Realtime (now gpt-realtime-2) and AWS Nova 2 Sonic
The money validated the orchestration layer: Vapi raised $50M at ~$500M after winning Amazon Ring over 40 rivals, LiveKit hit a $1B valuation, and ElevenLabs closed $500M at $11B
The category is consolidating into strata — speech-to-speech models, speech stacks (TTS/STT specialists like Cartesia, Gradium, Rime), and orchestration frameworks (Vapi, Retell, LiveKit, Pipecat)
Advertised per-minute prices are fiction: real-world all-in costs run 2-3x the headline number on nearly every platform (Vapi $0.13-0.33/min, Retell $0.13-0.20+/min, OpenAI $0.15-0.20/min)

FAQ

What's the best voice API for AI agents?

OpenAI Realtime (gpt-realtime-2) for accuracy and tool calling, Gemini Live for Google-stack teams, ElevenLabs Agents for voice quality, Vapi/Retell for phone automation, LiveKit Agents or Pipecat for open-source orchestration.

What does speech-to-speech mean?

Speech-to-speech (S2S) models process audio input and generate audio output directly, eliminating the traditional ASR→LLM→TTS pipeline for lower latency and better nuance preservation.

Which voice APIs support self-hosting?

NVIDIA PersonaPlex (open weights), LiveKit Agents and Pipecat (open-source frameworks), Deepgram (on-premise option), and Rime (on-prem at 120ms) support self-hosted deployment.

What's the cheapest voice API?

AWS Nova 2 Sonic (~$0.015/min) and Gemini Live (~$0.005/min in, $0.018/min out) are the cheapest S2S models. For pipelines, LiveKit or Pipecat run about $0.01/min per agent session, plus providers at cost. Budget 2-3x any advertised number for real-world all-in costs.

Executive Summary

The voice AI infrastructure category grew up fast: every major lab except Anthropic now ships a native speech-to-speech API, the orchestration layer got billion-dollar validation, and a stratum of speech-model specialists raised serious capital. This report compares 13 platforms across three layers — S2S models, speech stacks, and orchestration.

Key Findings:

The lab tier is complete (minus one) — Gemini Live went GA on Vertex AI at I/O 2026, joining OpenAI Realtime (now gpt-realtime-2, 96.6% Big Bench Audio) and AWS Nova 2 Sonic; Anthropic still has no Realtime equivalent^[1]^[2]
The orchestration layer got its valuations — Vapi raised $50M at ~$500M after Amazon Ring chose it over 40 rivals (1B+ calls handled); LiveKit hit $1B with Tesla and xAI as customers; Pipecat hit v1.0 as the open alternative^[3]^[4]^[5]
ElevenLabs became ElevenAgents — renamed its platform, closed $500M at $11B (Sequoia), 2M+ agents created, 70+ languages^[6]
The speech-stack stratum emerged — Cartesia (SSM-based Sonic-3.5, #1 on the Speech Arena leaderboard, ~$191M raised), Gradium (Kyutai/Moshi spinoff, $70M seed), and Rime (100M+ phone conversations/month for Domino's and Wingstop) join Deepgram as the building-block tier^[7]^[8]^[9]
Capital efficiency outlier — Retell AI reached profitability on $5.1M raised while shipping an automated QA product (Assure)^[10]
Real costs run 2-3x advertised — independent reviews put Vapi at $0.13-0.33/min and Retell at $0.13-0.20+/min all-in versus $0.05-0.07 headline prices

Strategic Planning Assumptions:

By 2027, speech-to-speech becomes the default architecture for voice agents — with pipeline orchestration surviving where provider choice and cost control matter
By 2028, full-duplex (simultaneous listening/speaking) will be table stakes
The eval/observability layer (Cekura and peers) becomes its own category as deployments mature

Market Definition

Agentic voice APIs enable AI agents to communicate through speech, providing:

Speech Understanding — Converting spoken input to meaning (beyond transcription)
Voice Generation — Producing natural-sounding speech output
Conversation Management — Handling turns, interruptions, and context
Tool Integration — Connecting voice to actions (function calling, MCP)

Key distinction from traditional telephony: These APIs are designed for AI-native applications, not human-to-human call routing. They optimize for agent intelligence, natural interaction, and programmatic control.

The three strata:

Speech-to-speech models — OpenAI Realtime, Gemini Live, Nova 2 Sonic, PersonaPlex
Speech stacks (TTS/STT building blocks) — ElevenLabs, Cartesia, Gradium, Rime, Deepgram Aura
Orchestration platforms & frameworks — Vapi, Retell AI, LiveKit Agents, Pipecat

Comparison Matrix

Platform	Layer	Self-Host	Latency	Pricing	Maturity
OpenAI Realtime	S2S model	—	Native S2S, effort-tunable	$32/$64 per 1M audio tokens (~$0.15-0.20/min)	GA, 3rd model generation
Gemini Live	S2S model	—	Real-time WebSocket	$3/$12 per 1M audio tokens (~$0.005 in / $0.018 out per min)	GA on Vertex (I/O 2026)
AWS Nova 2 Sonic	S2S model	—	Low (no published ms)	$3/$12 per 1M speech tokens (~$0.015/min)	GA, 4 regions
PersonaPlex	S2S model (open)	✅	~70ms speaker-switch	Free weights + GPU costs	Research v1.0, frozen since Jan
ElevenLabs Agents	Speech stack + platform	—	Sub-second (vendor)	~$0.10/min ($0.08 Business) + LLM tokens	GA, $11B, 2M+ agents
Cartesia	Speech stack	Partial (on-device Sonic)	40-90ms TTFB	$4-239/mo credits; Line agents $0.06/min	GA, #1 Speech Arena
Gradium	Speech stack	Edge SDK	"Ultra-low" (no published benchmarks)	Credit-based, $0-1,615/mo tiers	~2 quarters old, production
Rime	Speech stack (TTS)	✅ (120ms on-prem)	120ms on-prem / ~200ms cloud	$0.03-0.05/1K chars	100M+ calls/month
Deepgram Aura	Speech stack	✅	Sub-200ms	$0.030/1K chars; Voice Agent $0.041-0.163/min	GA, $1.3B valuation
Vapi	Orchestration	—	500-800ms measured	$0.05/min + providers at cost ($0.13-0.33 real)	Series B, ~$500M, 1B+ calls
Retell AI	Orchestration	—	~600ms claimed	$0.07+/min advertised ($0.13-0.20+ real)	Profitable on $5.1M
LiveKit Agents	Framework	✅	WebRTC + semantic turn detection	$0.01/min session + providers; free self-host	v1.5, $1B, powers ChatGPT voice
Pipecat	Framework	✅	WebRTC via Daily	Free (BSD-2); Cloud $0.01/min	v1.3, 12.7K stars
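
The per-minute figures in the Pricing column follow from per-token prices once a token rate is fixed. The sketch below back-solves the rate implied by the Gemini Live row using only the numbers in the matrix. The function names are hypothetical, and the derived rates are arithmetic on the table, not vendor-published figures.

```python
def per_minute_cost(price_per_million: float, tokens_per_minute: float) -> float:
    """Dollars per minute of audio at a given per-token price and token rate."""
    return price_per_million * tokens_per_minute / 1_000_000


def implied_tokens_per_minute(price_per_million: float, per_minute_price: float) -> float:
    """Back-solve the token rate that reconciles a token price with a per-minute quote."""
    return per_minute_price * 1_000_000 / price_per_million


# Gemini Live row: $3 in / $12 out per 1M audio tokens,
# quoted at ~$0.005 in / ~$0.018 out per minute.
gemini_in_rate = implied_tokens_per_minute(3.0, 0.005)
gemini_out_rate = implied_tokens_per_minute(12.0, 0.018)
print(round(gemini_in_rate), round(gemini_out_rate))

# Round trip: the implied output rate reproduces the quoted per-minute price.
print(round(per_minute_cost(12.0, gemini_out_rate), 3))
```

Under these numbers the output quote implies roughly 1,500 tokens per minute and the input quote roughly 1,670, so the rounded per-minute figures in the matrix should be treated as approximations rather than exact conversions.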

Cost Reality Check

Advertised per-minute prices are the start of the bill, not the end:

Platform	Advertised	Real-World All-In	The Gap
Vapi	$0.05/min	$0.13-0.33/min	Providers at cost + concurrency lines + add-ons (HIPAA $2K/mo)
Retell AI	$0.07/min	$0.13-0.20+/min	Voice surcharges (ElevenLabs +$0.04/min), Assure QA +$0.10/min
OpenAI Realtime	Token rates	$0.15-0.20/min	Audio token accumulation; caching ($0.40/1M) is the main lever
ElevenLabs Agents	$0.10/min	$0.10 + LLM tokens	LLM metered separately
LiveKit / Pipecat	$0.01/min	$0.03-0.15/min	Provider stack dominates; cheapest with budget components

Cost optimization: the cheapest credible stacks remain self-hosted frameworks (LiveKit, Pipecat) with budget providers (~$0.03-0.05/min), or Gemini Live / Nova 2 Sonic for managed S2S at ~$0.015-0.02/min.
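
The gap between advertised and all-in prices is simple addition: per-minute add-ons stack on the base rate, and fixed monthly fees are spread over call volume. A minimal sketch using the Retell and Vapi rows from the table follows. The `CostStack` class, the 50,000 minutes/month volume, and the $0.08/min Vapi provider figure are illustrative assumptions, not quoted prices.

```python
from dataclasses import dataclass, field


@dataclass
class CostStack:
    """Advertised platform rate plus per-minute add-ons and fixed monthly fees."""
    advertised_per_min: float
    addons_per_min: dict[str, float] = field(default_factory=dict)
    fixed_monthly: dict[str, float] = field(default_factory=dict)

    def all_in_per_min(self, minutes_per_month: float) -> float:
        """Base rate + per-minute add-ons + fixed fees amortized over volume."""
        variable = self.advertised_per_min + sum(self.addons_per_min.values())
        fixed = sum(self.fixed_monthly.values()) / minutes_per_month
        return variable + fixed

    def multiple_of_advertised(self, minutes_per_month: float) -> float:
        """How many times the headline price the all-in price works out to."""
        return self.all_in_per_min(minutes_per_month) / self.advertised_per_min


# Retell row: $0.07/min advertised, ElevenLabs voice +$0.04/min, Assure QA +$0.10/min.
retell = CostStack(0.07, {"elevenlabs_voice": 0.04, "assure_qa": 0.10})

# Vapi row: $0.05/min advertised plus the $2K/mo HIPAA add-on.
# ASSUMPTION: $0.08/min for providers at cost is a placeholder, not a quoted price.
vapi = CostStack(0.05, {"providers_at_cost": 0.08}, {"hipaa": 2000.0})

MINUTES = 50_000  # ASSUMPTION: hypothetical monthly call volume
for name, stack in [("retell", retell), ("vapi", vapi)]:
    print(name,
          round(stack.all_in_per_min(MINUTES), 3),
          round(stack.multiple_of_advertised(MINUTES), 1))
```

With both Retell add-ons enabled the stack comes to about $0.21/min, three times the advertised rate. Fixed fees such as the HIPAA add-on weigh more heavily at low volume, so the multiple should be recomputed for the expected call minutes rather than assumed.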

Product Profiles

Speech-to-Speech Models

OpenAI Realtime API — the accuracy leader, three model generations in nine months^[2]

gpt-realtime-2 (May 2026): 96.6% Big Bench Audio, 128K context, tunable reasoning effort
Companions: gpt-realtime-translate (70+ input languages) and gpt-realtime-whisper (streaming STT)
MCP servers, image input, SIP calling, parallel tool calls
⚠️ Premium pricing; unpredictable token costs without caching

Gemini Live API — Google's entry, GA with SLAs^[1]

Gemini 2.5 Flash Native Audio: affective dialogue, proactive audio, barge-in, vision input, 70 languages
GA on Vertex AI (I/O 2026) with multi-region failover; Shopify and UWM in production
Cheapest lab S2S (~$0.005/min in, $0.018/min out)
⚠️ Developer-API models still preview-suffixed; no first-party telephony; key agent controls not configurable

AWS Nova 2 Sonic — Bedrock-native, the enterprise budget option^[11]

$3/$12 per 1M speech tokens (~$0.015/min) — ~80% below GPT-class pricing
7 languages with code-switching polyglot voices, async tool calling, DTMF
4 regions; native AWS security/compliance/billing
⚠️ Standard tier only; sparse community adoption

NVIDIA PersonaPlex — open-weights full-duplex^[12]

~70ms speaker-switch (vs ~1,260ms for Gemini Live); voice + role prompting; Moshi lineage
~316K monthly HuggingFace downloads; community Apple Silicon port
⚠️ Research-grade v1.0, no updates since January; productization path is Nemotron 3 VoiceChat

Speech Stacks

ElevenLabs Agents — voice-quality leader, now a full agent platform^[6]

Renamed from Conversational AI; $500M Series D at $11B (Feb 2026); 2M+ agents, 33M+ conversations
70+ languages, visual workflow builder, testing suite, Salesforce/Zendesk/HubSpot integrations
⚠️ ~$0.10/min plus separately-metered LLM tokens; cloud-only

Cartesia — state-space models, the latency benchmark^[7]

Sonic-3.5 #1 on Artificial Analysis Speech Arena (May 2026); 40ms turbo TTFB; 42 languages
Ink STT + Line voice-agent platform ($0.06/min); ~$191M raised (KP, Index, Lightspeed, NVIDIA)
⚠️ Reports of long-utterance artifacts; Line platform is young

Gradium — the Kyutai/Moshi commercial spinoff^[8]

$70M seed (FirstMark + Eurazeo, with Xavier Niel, Eric Schmidt); production STT/TTS in 5 languages, 237 voices
Full-duplex research lineage (Moshi: 7B, ~160ms)
⚠️ No published latency benchmarks; credit pricing obscures per-minute costs; two quarters old

Rime — phone-scale TTS specialist^[9]

Arcana v3 (Feb 2026); trained on conversational speech, not narration; 10 languages with code-switching
100M+ phone conversations/month — Domino's, Wingstop; 120ms on-prem deployment
⚠️ $5.5M seed (thin capital); headline metrics vendor-reported; smaller voice library than hyperscalers

Deepgram Aura — enterprise TTS + Voice Agent API^[13]

Aura-2 in 7 languages; sub-200ms; domain-tuned pronunciation; Voice Agent API GA ($0.041-0.163/min)
$1.3B valuation; 200K+ developers claimed; on-premise option
⚠️ Voice quality trails ElevenLabs/Cartesia in community comparisons

Orchestration

Vapi — the enterprise-validated orchestrator^[3]

$50M Series B at ~$500M (Peak XV, May 2026); 1B+ calls; Amazon Ring won over 40 rivals; Intuit, ServiceTitan
Mix-and-match providers at cost ($0 BYOK markup); strongest telephony (SIP, BYOC); Squads multi-assistant orchestration
⚠️ Independent measurements put production latency at 500-800ms vs sub-500ms claims; costs stack quickly

Retell AI — capital-efficient phone automation^[10]

Profitable on $5.1M raised; Assure automated QA (GA Jan 2026); HIPAA/SOC2/GDPR; 99.99% SLA
Transparent component pricing; no-code builder plus API
⚠️ Real-world costs escalate to $0.13-0.20+/min; support complaints in reviews

LiveKit Agents — the open framework with the biggest logos^[4]

$100M Series C at $1B (Jan 2026); powers ChatGPT voice; Tesla, xAI, Salesforce
Apache 2.0, 10.9K stars; semantic turn detection; Agent Builder no-code; LiveKit Inference bundled models
⚠️ DIY complexity when self-hosted; community reports say cloud costs add up

Pipecat — the vendor-neutral open alternative^[5]

Daily's BSD-2 framework, 12.7K stars; v1.0 (April) → v1.3 multi-agent (May 2026); ~90 integrations
Pipecat Cloud $0.01/min active; NVIDIA partnership distribution
⚠️ No latency SLA; deployment-at-scale is the known hard part for both open frameworks

Architecture Patterns

Speech-to-Speech vs Pipeline

Speech-to-Speech (OpenAI Realtime, Gemini Live, Nova 2 Sonic, PersonaPlex):

Single model for understanding + generation
Lower latency, better nuance preservation
More integrated but less modular

Pipeline (Vapi, Retell, LiveKit, Pipecat + speech stacks):

Separate STT → LLM → TTS components
Provider flexibility, cost optimization
Higher latency, more moving parts

What changed since February: the price floor for S2S collapsed (Gemini Live and Nova 2 Sonic at ~$0.015-0.02/min), removing "S2S is the premium option" as a rule of thumb. Pipelines now win on control, not cost.
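
The modularity trade-off between the two architectures can be sketched as a chain of swappable stages whose latencies add up, compared with a single S2S hop. Everything here is hypothetical: the `Stage` type, the stub transforms, and all latency values are made-up placeholders rather than measurements of any platform.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Stage:
    """One swappable component with a fixed, made-up latency in milliseconds."""
    name: str
    latency_ms: float
    run: Callable[[str], str]


def run_pipeline(stages: list[Stage], audio_in: str) -> tuple[str, float]:
    """Feed each stage's output into the next; latencies add up hop by hop."""
    payload, total_ms = audio_in, 0.0
    for stage in stages:
        payload = stage.run(payload)
        total_ms += stage.latency_ms
    return payload, total_ms


# ASSUMPTION: all latency numbers are illustrative placeholders.
pipeline = [
    Stage("stt", 150.0, lambda audio: f"text({audio})"),
    Stage("llm", 350.0, lambda text: f"reply({text})"),
    Stage("tts", 90.0, lambda text: f"audio({text})"),
]
s2s = [Stage("s2s", 300.0, lambda audio: f"audio(reply({audio}))")]

print(run_pipeline(pipeline, "hello"))
print(run_pipeline(s2s, "hello"))

# Swapping one provider touches one list element: the control a pipeline keeps.
pipeline[2] = Stage("tts-budget", 200.0, pipeline[2].run)
print(run_pipeline(pipeline, "hello"))
```

The single-stage list has nothing to swap, which is the "more integrated but less modular" side of the trade. The three-stage list pays for its flexibility in summed latency.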

Full-Duplex vs Turn-Based

Full-Duplex (PersonaPlex, Moshi/Gradium lineage): simultaneous listening and speaking, natural backchanneling, still mostly research-grade.

Turn-Based (most production systems): semantic turn detection (LiveKit) and proactive audio (Gemini) are closing the naturalness gap from the turn-based side.
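
A turn-based agent can be modeled as a small state machine in which the agent is in exactly one of listening, thinking, or speaking, and barge-in is a transition back to listening. The sketch below is a generic illustration under that assumption. The state and event names are hypothetical and do not correspond to any vendor's API.

```python
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()


class Event(Enum):
    USER_SPEECH_START = auto()
    END_OF_TURN = auto()    # emitted by a (hypothetical) turn detector
    REPLY_READY = auto()
    PLAYBACK_DONE = auto()


TRANSITIONS = {
    (State.LISTENING, Event.END_OF_TURN): State.THINKING,
    (State.THINKING, Event.REPLY_READY): State.SPEAKING,
    (State.SPEAKING, Event.PLAYBACK_DONE): State.LISTENING,
    # Barge-in: user speech while the agent is busy cancels and returns to listening.
    (State.SPEAKING, Event.USER_SPEECH_START): State.LISTENING,
    (State.THINKING, Event.USER_SPEECH_START): State.LISTENING,
}


def step(state: State, event: Event) -> State:
    """Return the next state; events with no listed transition leave it unchanged."""
    return TRANSITIONS.get((state, event), state)


def replay(events: list[Event], start: State = State.LISTENING) -> list[State]:
    """Apply events in order and return every state visited, including the start."""
    states = [start]
    for event in events:
        states.append(step(states[-1], event))
    return states


# A normal turn, then a reply that the user interrupts mid-playback.
trace = replay([
    Event.END_OF_TURN, Event.REPLY_READY, Event.PLAYBACK_DONE,
    Event.END_OF_TURN, Event.REPLY_READY, Event.USER_SPEECH_START,
])
print([s.name for s in trace])
```

A full-duplex system does not fit this shape, because listening and speaking would have to be tracked as concurrent activities rather than as one exclusive state.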

Notable Others / Did Not Meet Prerequisites

Hume AI — emotion-focused voice AI; real products (EVI, Octave) but no major in-window event; watchlist
Bland AI — phone agents; no fresh funding/launch news surfaced this cycle
Sesame — research attention but no shipped developer API found
Cekura (YC F24) — voice-agent testing/observability, launched March 2026 — the eval layer emerging on top of this category; watchlist for a future category
Amazon Lex / Twilio Voice AI / Play.ht — legacy intent-slot architecture, orchestration-only, and TTS-only respectively

Strategic Recommendations

By Use Case

Use Case	Recommended	Runner-Up
High-accuracy agents	OpenAI Realtime (gpt-realtime-2)	Gemini Live
Google Cloud stack	Gemini Live	—
AWS enterprise	AWS Nova 2 Sonic	—
Voice quality priority	ElevenLabs Agents	Cartesia
Lowest TTS latency	Cartesia (40ms)	Rime (on-prem 120ms)
Phone automation	Vapi	Retell AI
Phone-scale TTS	Rime	Deepgram Aura
Self-hosted deployment	LiveKit Agents	Pipecat
Open source framework	Pipecat (BSD-2)	LiveKit (Apache 2.0)
Full-duplex research	PersonaPlex	Gradium (Moshi lineage)
Cost optimization	Nova 2 Sonic / Gemini Live (S2S)	LiveKit/Pipecat + budget providers
European data residency	Gradium	ElevenLabs (EU residency)

By Team Profile

Enterprise with compliance needs: → ElevenLabs (HIPAA), AWS Nova 2 Sonic (Bedrock compliance), Vapi (HIPAA add-on), or Deepgram/Rime (on-premise)

Startup iterating quickly: → Retell AI (no-code + API, transparent components) or Vapi (provider flexibility, enterprise-proven)

Developer team wanting control: → LiveKit Agents or Pipecat — both 1.0+, both open source; LiveKit for WebRTC depth, Pipecat for vendor neutrality

Phone-first call center: → Vapi (Amazon Ring-validated) or Retell AI (QA tooling built in), with Rime as the TTS workhorse

Market Outlook

Near-Term (2026)

~~More providers adding speech-to-speech models~~ — Done: Gemini Live GA completes the lab tier; watch only Anthropic
The eval/observability layer (Cekura et al.) matures as production deployments expose quality drift
Price pressure flows downhill from Gemini/Nova S2S pricing into the orchestration layer

Medium-Term (2027)

Speech-to-speech becomes the default architecture; pipelines hold regulated and cost-controlled niches
Full-duplex moves from research (PersonaPlex, Moshi lineage) to production
MCP becomes the standard tool-calling surface across voice platforms

Long-Term (2028+)

Category consolidation around 3-4 leaders per stratum
Integration with agent orchestration platforms (Tembo, etc.)
On-device voice AI (Cartesia on-device Sonic, Gradium edge SDK) reduces cloud dependency

Bottom Line

Thirteen platforms serve the agentic voice market across three strata:

Platform	Best For	Key Differentiator
OpenAI Realtime	Accuracy-critical agents	gpt-realtime-2, 96.6% BBA, MCP + SIP
Gemini Live	Google-stack teams	GA with SLAs, cheapest lab S2S
AWS Nova 2 Sonic	AWS enterprises	~$0.015/min, Bedrock compliance
ElevenLabs Agents	Voice quality + platform	$11B, 2M+ agents, 70+ languages
Cartesia	Latency + quality	SSM models, 40ms TTFB, #1 Speech Arena
Gradium	EU / full-duplex lineage	Kyutai spinoff, $70M, 5 languages
Rime	Phone-scale TTS	100M+ calls/month, 120ms on-prem
Deepgram Aura	Enterprise TTS + agent API	Domain pronunciation, on-premise
PersonaPlex	Self-hosted full-duplex	Open weights, ~70ms switching
Vapi	Enterprise phone agents	Amazon Ring win, 1B+ calls
Retell AI	Transparent phone automation	Profitable, built-in QA (Assure)
LiveKit Agents	Open-source control	$1B, powers ChatGPT voice
Pipecat	Vendor-neutral pipelines	BSD-2, ~90 integrations

The story of this refresh: the category stratified. Labs own the S2S model layer (and collapsed its price), specialists own speech quality and latency, and orchestrators own the enterprise deployment surface, with billion-dollar valuations now at every layer. Pick your stratum first, then your vendor, and budget 2-3x the advertised per-minute price.

Research by Ry Walker Research • methodology

Disclosure: Author is CEO of Tembo, which may integrate with voice AI platforms for agent communication.

Sources