Key takeaways
- OpenAI Realtime API leads in accuracy and function calling; ElevenLabs dominates voice quality with 10K+ voices
- The market splits into end-to-end APIs (OpenAI, ElevenLabs) on one side and managed platforms (Vapi, Retell) and frameworks (LiveKit) on the other
- NVIDIA PersonaPlex brings open-source full-duplex to the market; AWS Nova 2 Sonic offers Bedrock enterprise integration
FAQ
What's the best voice API for AI agents?
OpenAI Realtime for accuracy and function calling, ElevenLabs for voice quality, Vapi/Retell for phone automation, LiveKit for open-source flexibility.
What does speech-to-speech mean?
Speech-to-speech (S2S) models process audio input and generate audio output directly, eliminating the traditional ASR→LLM→TTS pipeline for lower latency and better nuance preservation.
Which voice APIs support self-hosting?
NVIDIA PersonaPlex (fully open source), LiveKit Agents (open-source framework), and Deepgram Aura (on-premise option) support self-hosted deployment.
What's the cheapest voice API?
Retell AI starts at $0.07/min, and LiveKit charges $0.004/min for audio (plus provider costs). OpenAI Realtime sits at the premium end, at ~$0.15-0.20/min.
Executive Summary
A new infrastructure category has emerged: voice APIs for agentic applications. These platforms answer the question "how do AI agents speak and listen?" with approaches ranging from end-to-end speech-to-speech models to modular building blocks and orchestration frameworks.
Market Leaders (8 platforms): OpenAI Realtime API, AWS Nova 2 Sonic, ElevenLabs Conversational AI, NVIDIA PersonaPlex, Vapi, Retell AI, LiveKit Agents, Deepgram Aura
Key Findings:
- OpenAI Realtime API leads in accuracy with 82.8% reasoning and 66.5% function calling scores
- ElevenLabs dominates voice quality with 10,000+ voices and state-of-the-art turn-taking
- NVIDIA PersonaPlex brings open-source full-duplex to market, outperforming Gemini Live on naturalness (3.90 vs 3.72)
- The market is splitting into speech-to-speech (end-to-end) vs orchestration (pipeline) approaches
Strategic Planning Assumptions:
- By 2027, speech-to-speech will become the default architecture for voice agents
- By 2028, full-duplex (simultaneous listening/speaking) will be table stakes
Market Definition
Agentic voice APIs enable AI agents to communicate through speech, providing:
- Speech Understanding — Converting spoken input to meaning (beyond transcription)
- Voice Generation — Producing natural-sounding speech output
- Conversation Management — Handling turns, interruptions, and context
- Tool Integration — Connecting voice to actions (function calling, MCP)
Key distinction from traditional telephony: These APIs are designed for AI-native applications, not human-to-human call routing. They optimize for agent intelligence, natural interaction, and programmatic control.
Comparison Matrix
| Platform | Architecture | Barge-In | Self-Host | Latency | Pricing |
|---|---|---|---|---|---|
| OpenAI Realtime | Speech-to-Speech | ✅ | ❌ | ~200ms | ~$0.15-0.20/min |
| AWS Nova 2 Sonic | Speech-to-Speech | ✅ | ❌ | Low | Bedrock tokens |
| ElevenLabs | S2S + Pipeline | ✅ | ❌ | ~200ms | ~$0.10/min |
| PersonaPlex | Speech-to-Speech | ✅ | ✅ | ~200ms | Free (GPU costs) |
| Vapi | Orchestration | ✅ | ❌ | ~300ms | $0.05/min + providers |
| Retell AI | Pipeline | ✅ | ❌ | ~300ms | $0.07/min + providers |
| LiveKit Agents | Framework | ✅ | ✅ | Varies | $0.004/min + providers |
| Deepgram Aura | TTS Only | N/A | ✅ | sub-200ms | $0.03/1K chars |
Barge-in = graceful handling of user interruptions.
Deepgram Aura is a TTS building block, not a complete voice agent.
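The matrix can also be queried programmatically. The sketch below transcribes three of its columns into a small Python structure and filters for self-hostable options. The `Platform` dataclass and its field names are illustrative assumptions, not part of any vendor SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    """One row of the comparison matrix (hypothetical structure)."""
    name: str
    architecture: str
    self_host: bool


# Rows transcribed from the comparison matrix.
PLATFORMS = [
    Platform("OpenAI Realtime", "Speech-to-Speech", False),
    Platform("AWS Nova 2 Sonic", "Speech-to-Speech", False),
    Platform("ElevenLabs", "S2S + Pipeline", False),
    Platform("PersonaPlex", "Speech-to-Speech", True),
    Platform("Vapi", "Orchestration", False),
    Platform("Retell AI", "Pipeline", False),
    Platform("LiveKit Agents", "Framework", True),
    Platform("Deepgram Aura", "TTS Only", True),
]


def self_hostable(platforms):
    """Return the names of platforms marked as self-hostable."""
    return [p.name for p in platforms if p.self_host]


print(self_hostable(PLATFORMS))
```

The same list can be filtered on `architecture` to separate speech-to-speech options from pipeline and framework options.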
Cost Comparison
Detailed Pricing Table
| Platform | Free Tier | Per-Minute Cost | Monthly Minimum | Concurrent Limits |
|---|---|---|---|---|
| OpenAI Realtime | $5 API credit (new accounts) | ~$0.15-0.20 | None (pay-as-you-go) | Rate limited |
| AWS Nova 2 Sonic | AWS Free Tier (limited) | ~$0.017/min | None (pay-as-you-go) | Bedrock quotas |
| ElevenLabs | 10K credits (~10 min) | ~$0.10 | $5/mo (Starter) | Plan-based |
| PersonaPlex | Unlimited (open source) | $0 (GPU costs only) | None | Your infrastructure |
| Vapi | $10 free credit | $0.05 + providers (~$0.15-0.35 total) | None | 20 free concurrent |
| Retell AI | $10 free credit | $0.07 + providers (~$0.10-0.20 total) | None | 20 free concurrent |
| LiveKit Agents | 5K minutes free | $0.004 + providers (~$0.03-0.15 total) | None | Plan-based |
| Deepgram Aura | $200 free credit | $0.03/1K chars (~$0.02/min)† | None | Rate limited |
AWS Nova 2 Sonic: $0.0034/1K input + $0.0136/1K output speech tokens ≈ $0.017/min
†Deepgram Aura is TTS only; full voice agent requires STT and LLM components
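The per-minute figure for Nova 2 Sonic follows from the token prices if one assumes roughly 1,000 input and 1,000 output speech tokens per minute of conversation. That token rate is an assumption inferred from the arithmetic, not a published figure, and the function name below is illustrative.

```python
def cost_per_minute(input_price_per_1k, output_price_per_1k,
                    input_tokens_per_min, output_tokens_per_min):
    """Convert per-1K-token prices (USD) into an approximate per-minute cost."""
    input_cost = input_price_per_1k * input_tokens_per_min / 1000
    output_cost = output_price_per_1k * output_tokens_per_min / 1000
    return input_cost + output_cost


# ASSUMPTION: ~1,000 input and ~1,000 output speech tokens per minute.
# Real conversations will differ depending on who talks and for how long.
nova_per_min = cost_per_minute(0.0034, 0.0136, 1000, 1000)
print(f"${nova_per_min:.3f}/min")
```

Changing the two token-rate arguments shows how sensitive the per-minute estimate is to how much the agent speaks relative to the user.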
Cost at Scale (1,000 and 10,000 Minutes)
| Platform | 1,000 Minutes | 10,000 Minutes | Notes |
|---|---|---|---|
| OpenAI Realtime | $150-200 | $1,500-2,000 | Premium but all-inclusive |
| AWS Nova 2 Sonic | $17-25 | $170-250 | More than 80% cheaper than OpenAI |
| ElevenLabs | $100 | $1,000 | Scale plan ($330/mo) for volume |
| PersonaPlex | $50-200* | $500-2,000* | *GPU infrastructure costs only |
| Vapi | $150-350 | $1,500-3,500 | Varies heavily by provider choice |
| Retell AI | $100-200 | $1,000-2,000 | Most transparent pricing |
| LiveKit Agents | $30-150 | $300-1,500 | Lowest if using cheap providers |
| Deepgram Aura | $20† | $200† | †TTS only, add STT/LLM costs |
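Most rows in this table are simply the per-minute totals from the detailed pricing table multiplied by the minute count. A minimal sketch of that arithmetic, with an illustrative helper name:

```python
def cost_at_scale(low_per_min: float, high_per_min: float, minutes: int) -> tuple[float, float]:
    """Return the (low, high) cost in USD for a given number of minutes."""
    return low_per_min * minutes, high_per_min * minutes


# Per-minute totals copied from the detailed pricing table (USD).
PER_MINUTE = {
    "OpenAI Realtime": (0.15, 0.20),
    "Vapi": (0.15, 0.35),
    "Retell AI": (0.10, 0.20),
    "LiveKit Agents": (0.03, 0.15),
}

for name, (low, high) in PER_MINUTE.items():
    for minutes in (1_000, 10_000):
        lo_cost, hi_cost = cost_at_scale(low, high, minutes)
        print(f"{name}: {minutes:,} min -> ${lo_cost:,.0f}-${hi_cost:,.0f}")
```

This linear model ignores volume discounts, plan minimums, and the hidden costs discussed later, so it is a first estimate rather than a quote.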
Component Cost Breakdown (for Pipeline Architectures)
Platforms like Vapi, Retell, and LiveKit charge platform fees plus provider costs. Here's what those provider costs typically add:
| Component | Budget Option | Mid-Tier | Premium |
|---|---|---|---|
| STT | Deepgram Nova ($0.006/min) | AssemblyAI ($0.015/min) | Whisper API ($0.02/min) |
| LLM | GPT-4.1 mini ($0.016/min) | GPT-4.1 ($0.045/min) | Claude 4.5 Sonnet ($0.08/min) |
| TTS | Deepgram Aura ($0.02/min) | Cartesia ($0.015/min) | ElevenLabs ($0.04/min) |
| Telephony | SIP trunk ($0.01/min) | Twilio ($0.015/min) | Premium ($0.02/min) |
| Total Stack | ~$0.05/min | ~$0.09/min | ~$0.16/min |
Add the platform fee on top of these totals ($0.05/min for Vapi, $0.07/min for Retell).
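The stack totals can be reproduced by summing the component prices in each column. The sketch below does that and optionally adds a platform fee; the dictionary layout and function name are illustrative, and the $0.05/min fee used in the loop is Vapi's listed platform price.

```python
# Per-minute component prices (USD) copied from the table.
STACKS = {
    "budget":  {"stt": 0.006, "llm": 0.016, "tts": 0.02,  "telephony": 0.01},
    "mid":     {"stt": 0.015, "llm": 0.045, "tts": 0.015, "telephony": 0.015},
    "premium": {"stt": 0.02,  "llm": 0.08,  "tts": 0.04,  "telephony": 0.02},
}


def stack_cost(tier: str, platform_fee: float = 0.0) -> float:
    """Sum the component prices for a tier and add an optional platform fee."""
    return sum(STACKS[tier].values()) + platform_fee


for tier in STACKS:
    base = stack_cost(tier)
    with_fee = stack_cost(tier, platform_fee=0.05)  # Vapi's listed platform fee
    print(f"{tier}: ${base:.3f}/min stack, ${with_fee:.3f}/min with platform fee")
```

Swapping a single component price, for example a cheaper LLM, shows which part of the stack dominates the total.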
Hidden Costs to Watch
| Platform | Hidden Cost | Impact |
|---|---|---|
| OpenAI Realtime | Audio token overhead | Cached prompts charged separately |
| ElevenLabs | Credit expiration | Unused credits may expire monthly |
| Vapi | Provider markups | Some providers marked up vs direct |
| Retell AI | Add-ons | Knowledge base, PII removal add $0.005-0.01/min |
| LiveKit | Egress/recording | Recording adds per-minute costs |
| All platforms | Failed calls | Usually still billed for partial minutes |
Cost Optimization Tips
- For lowest cost: LiveKit self-hosted + Deepgram STT + cheap LLM + Deepgram TTS (~$0.03-0.05/min)
- For best value: Retell AI with GPT-4.1 mini (~$0.09/min with good quality)
- For enterprise: AWS Nova 2 Sonic with reserved capacity (volume discounts)
- For quality: ElevenLabs Scale plan (~$0.10/min with best voices)
Product Profiles
OpenAI Realtime API
The most capable speech-to-speech API with best-in-class accuracy.[1]
- 82.8% reasoning accuracy (Big Bench Audio), 66.5% function calling
- Native MCP server support, image inputs, SIP telephony
- 10 voices including exclusive Marin and Cedar
- gpt-realtime model with 20% price reduction from preview
- ⚠️ Premium pricing (~$0.15-0.20/min), no self-hosting
Best for: Production voice agents requiring high accuracy and tool integration.
Pricing: $32/1M input tokens, $64/1M output tokens (~$0.15-0.20/min balanced).
AWS Nova 2 Sonic
Bedrock-native speech-to-speech with enterprise AWS integration.[2]
- Unified speech understanding and generation in single model
- Nova 2 Sonic adds polyglot voices, 1M token context window
- Native AWS security, compliance, and unified billing
- Integrates with Connect, Lex, and AWS services
- ⚠️ AWS lock-in, limited regional availability
Best for: AWS-native enterprises needing compliant voice AI with unified billing.
Pricing: Token-based through Bedrock: $0.0034/1K input and $0.0136/1K output speech tokens (≈ $0.017/min). Contact AWS for current rates.
ElevenLabs Conversational AI
Voice quality leader with 10,000+ voices and enterprise features.[3]
- State-of-the-art turn-taking model detects "um," "ah" cues
- 10,000+ voices with voice cloning capability
- Integrated RAG, automatic language detection (32+ languages)
- HIPAA compliance, EU data residency
- ⚠️ Premium pricing, no self-hosting option
Best for: Applications where voice quality is the key differentiator.
Pricing: ~$0.10/minute for voice agents. Plans from $5/mo to Enterprise.
NVIDIA PersonaPlex
Open-source full-duplex with customizable voices and roles.[4]
- Full-duplex: listens and speaks simultaneously
- Voice prompting (audio) + role prompting (text)
- Outperforms Gemini Live on naturalness (3.90 vs 3.72)
- 100% interruption handling success rate
- ⚠️ Requires GPU infrastructure, no managed cloud
Best for: Teams with GPU infrastructure wanting self-hosted, customizable full-duplex.
Pricing: Free (open source). Self-hosting requires NVIDIA GPUs.
Vapi
Developer platform with provider flexibility and phone focus.[5]
- Supports both OpenAI Realtime and traditional STT+LLM+TTS
- Mix providers (Deepgram, ElevenLabs, OpenAI, Anthropic)
- Strong phone integration (SIP, Twilio)
- $25M+ raised from Bessemer, Y Combinator
- ⚠️ Orchestration overhead, costs can add up
Best for: Developers wanting provider flexibility and phone-first voice agents.
Pricing: $0.05/min platform + provider costs. Total: $0.15-0.35/min.
Retell AI
Transparent pricing with no-code builder for phone automation.[6]
- Modular pricing starting at $0.07/minute
- No-code agent builder + full API
- 99.99% uptime SLA, batch calling
- OpenAI Realtime support ($0.50/min)
- ⚠️ Phone-focused, smaller funding ($5.1M)
Best for: Call centers wanting transparent pricing and no-code setup.
Pricing: $0.055/min infra + $0.015/min voice + LLM + telephony. Total: $0.07-0.20/min.
LiveKit Agents
Open-source framework with full provider flexibility.[7]
- Apache 2.0 open source, self-hostable
- Plugin architecture for any STT/LLM/TTS provider
- Multi-modal (voice + video agents)
- $122.5M+ raised, OpenAI as customer
- ⚠️ More setup complexity, DIY responsibility
Best for: Developer teams wanting open-source control and provider flexibility.
Pricing: Self-host free. Cloud: $0.004/min audio + provider costs.
Deepgram Aura
Enterprise TTS building block with domain pronunciation.[8]
- Sub-200ms latency for realtime voice AI
- Domain-tuned pronunciation (healthcare, finance, legal)
- 40+ voices, on-premise deployment option
- $1.3B valuation, $230M+ raised
- ⚠️ TTS only, not complete voice agent platform
Best for: Enterprise TTS needing domain pronunciation and on-premise deployment.
Pricing: $0.030/1K characters. $200 free credit.
Architecture Patterns
Speech-to-Speech vs Pipeline
The market is splitting into two architectural approaches:
Speech-to-Speech (OpenAI Realtime, Nova 2 Sonic, PersonaPlex):
- Single model for understanding + generation
- Lower latency, better nuance preservation
- More integrated but less modular
Pipeline (Vapi, Retell, LiveKit):
- Separate STT → LLM → TTS components
- Provider flexibility, cost optimization
- Higher latency, more moving parts
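The structural difference can be sketched in a few lines of Python. In a pipeline, three independently swappable stages are chained; a speech-to-speech model collapses them into one audio-in, audio-out call. All names below are hypothetical, and the stub stages stand in for real providers so the example runs on its own.

```python
from typing import Callable

# Hypothetical stage signatures; real providers expose richer, streaming APIs.
Stt = Callable[[bytes], str]    # audio -> transcript
Llm = Callable[[str], str]      # transcript -> reply text
Tts = Callable[[str], bytes]    # reply text -> audio


def make_pipeline(stt: Stt, llm: Llm, tts: Tts) -> Callable[[bytes], bytes]:
    """Chain three separate stages into one audio-in, audio-out agent."""
    def respond(audio_in: bytes) -> bytes:
        transcript = stt(audio_in)   # stage 1: speech to text
        reply = llm(transcript)      # stage 2: text to text
        return tts(reply)            # stage 3: text to speech
    return respond


# Stub stages that treat "audio" as UTF-8 text, purely for illustration.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


agent = make_pipeline(fake_stt, fake_llm, fake_tts)
print(agent(b"hello"))
```

Each stage is a hop where latency accumulates and where a provider can be swapped, which is the trade-off the two lists describe. A speech-to-speech model would have the same `Callable[[bytes], bytes]` shape as `agent`, but with no intermediate text exposed.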
Full-Duplex vs Turn-Based
Full-Duplex (PersonaPlex, emerging in others):
- Simultaneous listening and speaking
- Natural backchanneling ("uh-huh," "yeah")
- More human-like but harder to implement
Turn-Based (most current systems):
- Wait for user to finish before responding
- Simpler but less natural
- Adding interruption handling improves experience
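Interruption handling (barge-in) in a turn-based system can be pictured as a two-state machine: if user speech is detected while the agent is speaking, playback is cancelled and the agent returns to listening. The sketch below is a simplified illustration under that assumption; the class and method names are hypothetical and do not correspond to any platform's API.

```python
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()
    SPEAKING = auto()


class TurnBasedAgent:
    """Toy turn-based agent with barge-in support."""

    def __init__(self):
        self.state = State.LISTENING
        self.log = []

    def on_user_finished(self, text: str) -> None:
        # The user's turn has ended, so the agent starts its reply.
        if self.state is State.LISTENING:
            self.state = State.SPEAKING
            self.log.append(f"speak: reply to {text!r}")

    def on_user_speech_detected(self) -> None:
        # Barge-in: the user talks over the agent, so stop and listen.
        if self.state is State.SPEAKING:
            self.log.append("cancel playback")
            self.state = State.LISTENING

    def on_playback_finished(self) -> None:
        # The reply played to the end without interruption.
        if self.state is State.SPEAKING:
            self.state = State.LISTENING


agent = TurnBasedAgent()
agent.on_user_finished("what's my balance?")
agent.on_user_speech_detected()  # user interrupts mid-reply
print(agent.state, agent.log)
```

A full-duplex model would not need the explicit state switch, since it listens and speaks at the same time.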
Enterprise Feature Comparison
Speech-to-Speech Platforms
| Feature | OpenAI | Nova 2 Sonic | ElevenLabs | PersonaPlex |
|---|---|---|---|---|
| SOC2 | ✅ | ✅ | ✅ | N/A |
| HIPAA | ❓ | ✅ | ✅ | N/A |
| Self-hosting | ❌ | ❌ | ❌ | ✅ |
| Open source | ❌ | ❌ | ❌ | ✅ |
| SIP/Telephony | ✅ | ✅ | ✅ | ❌ |
| Voice Cloning | ❌ | ❌ | ✅ | ✅ |
| MCP Support | ✅ | ❌ | ❌ | ❌ |
Orchestration & Building Blocks
| Feature | Vapi | Retell | LiveKit | Deepgram |
|---|---|---|---|---|
| SOC2 | ✅ | ✅ | ✅ | ✅ |
| HIPAA | ❓ | ❓ | ❓ | ✅ |
| Self-hosting | ❌ | ❌ | ✅ | ✅ |
| Open source | ❌ | ❌ | ✅ | ❌ |
| SIP/Telephony | ✅ | ✅ | ✅ | ❌ |
| Voice Cloning | Via providers | Via providers | Via providers | ❌ |
| MCP Support | ❌ | ❌ | ❌ | N/A |
Notable Others / Did Not Meet Prerequisites
Several platforms are notable but didn't meet our criteria for this comparison:
Google Gemini Live
- Status: Consumer product, not API-accessible
- Why excluded: No programmatic API for developers; only available through Google apps
- Watch for: Potential API release could disrupt market given Google's scale
Amazon Lex
- Status: Legacy conversational AI service
- Why excluded: Not speech-to-speech native; traditional intent/slot architecture
- Better alternative: AWS Nova 2 Sonic for modern voice AI on AWS
Twilio Voice AI
- Status: Orchestration layer only
- Why excluded: Requires external STT/TTS; not a voice AI solution itself
- Better alternative: Vapi or Retell for integrated phone voice agents
Hume AI
- Status: Emotion-focused voice AI
- Why excluded: Specialized for emotional intelligence, not general voice agents
- Best for: Applications specifically needing emotion detection/response
Play.ht
- Status: TTS only
- Why excluded: Text-to-speech without conversation support
- Better alternative: ElevenLabs or Deepgram Aura for voice AI TTS
Strategic Recommendations
By Use Case
| Use Case | Recommended | Runner-Up |
|---|---|---|
| High-accuracy agents | OpenAI Realtime | ElevenLabs |
| Voice quality priority | ElevenLabs | Deepgram Aura |
| Phone automation | Retell AI | Vapi |
| Self-hosted deployment | LiveKit Agents | PersonaPlex |
| AWS enterprise | AWS Nova 2 Sonic | — |
| Open source | PersonaPlex | LiveKit |
| Full-duplex | PersonaPlex | OpenAI Realtime |
| Cost optimization | Retell AI | LiveKit |
| Provider flexibility | Vapi | LiveKit |
By Team Profile
Enterprise with compliance needs: ElevenLabs (HIPAA), AWS Nova 2 Sonic (Bedrock compliance), or Deepgram (on-premise)
Startup iterating quickly: Retell AI (no-code + API) or Vapi (provider flexibility)
Developer team wanting control: LiveKit Agents (open-source) or self-hosted PersonaPlex
Phone-first call center: Retell AI (transparent pricing) or Vapi (Realtime support)
Market Outlook
Near-Term (2026)
- OpenAI Realtime API gaining enterprise adoption
- ElevenLabs expanding conversational AI capabilities
- More providers adding speech-to-speech models
Medium-Term (2027)
- Speech-to-speech becoming default architecture
- Full-duplex moving from research to production
- MCP adoption expanding beyond OpenAI
Long-Term (2028+)
- Category consolidation around 3-4 leaders
- Integration with agent orchestration platforms (Tembo, etc.)
- On-device voice AI reducing cloud dependency
Bottom Line
Eight platforms serve the agentic voice API market with distinct approaches:
| Platform | Best For | Key Differentiator |
|---|---|---|
| OpenAI Realtime | Accuracy-critical agents | Best function calling, MCP support |
| AWS Nova 2 Sonic | AWS enterprises | Bedrock compliance, unified billing |
| ElevenLabs | Voice quality priority | 10K+ voices, turn-taking model |
| PersonaPlex | Self-hosted full-duplex | Open source, customizable |
| Vapi | Provider flexibility | Orchestration + phone integration |
| Retell AI | Transparent phone automation | $0.07/min, no-code builder |
| LiveKit Agents | Open-source control | Framework, multi-modal |
| Deepgram Aura | Enterprise TTS | Domain pronunciation, on-premise |
The market is early and evolving rapidly. OpenAI Realtime has the accuracy lead, ElevenLabs owns voice quality, and open-source options (PersonaPlex, LiveKit) are gaining traction. The winners will be determined by which architecture—speech-to-speech or orchestrated pipeline—proves better for production agents at scale.
Research by Ry Walker Research
Disclosure: Author is CEO of Tembo, which may integrate with voice AI platforms for agent communication.