
Agentic Voice APIs

A comparison of 8 leading voice AI platforms for building agentic voice applications — OpenAI Realtime, AWS Nova 2 Sonic, ElevenLabs, NVIDIA PersonaPlex, Vapi, Retell AI, LiveKit Agents, and Deepgram Aura.

Key takeaways

  • OpenAI Realtime API leads in accuracy and function calling; ElevenLabs dominates voice quality with 10K+ voices
  • The market splits into managed platforms (Vapi, Retell), frameworks (LiveKit), and end-to-end APIs (OpenAI, ElevenLabs)
  • NVIDIA PersonaPlex brings open-source full-duplex to the market; AWS Nova 2 Sonic offers Bedrock enterprise integration

FAQ

What's the best voice API for AI agents?

OpenAI Realtime for accuracy and function calling, ElevenLabs for voice quality, Vapi/Retell for phone automation, LiveKit for open-source flexibility.

What does speech-to-speech mean?

Speech-to-speech (S2S) models process audio input and generate audio output directly, eliminating the traditional ASR→LLM→TTS pipeline for lower latency and better nuance preservation.

Which voice APIs support self-hosting?

NVIDIA PersonaPlex (fully open source), LiveKit Agents (open-source framework), and Deepgram Aura (on-premise option) support self-hosted deployment.

What's the cheapest voice API?

Retell AI starts at $0.07/min, and LiveKit charges $0.004/min for audio (plus provider costs). OpenAI Realtime is the premium option at ~$0.15-0.20/min.

Executive Summary

A new infrastructure category has emerged: voice APIs for agentic applications. These platforms solve the problem of "how do AI agents speak and listen?" with solutions ranging from end-to-end speech-to-speech models to modular building blocks and orchestration frameworks.

Market Leaders (8 platforms): OpenAI Realtime API, AWS Nova 2 Sonic, ElevenLabs Conversational AI, NVIDIA PersonaPlex, Vapi, Retell AI, LiveKit Agents, Deepgram Aura

Key Findings:

  • OpenAI Realtime API leads in accuracy with 82.8% reasoning and 66.5% function calling scores
  • ElevenLabs dominates voice quality with 10,000+ voices and state-of-the-art turn-taking
  • NVIDIA PersonaPlex brings open-source full-duplex to market, outperforming Gemini Live on naturalness (3.90 vs 3.72)
  • The market is splitting into speech-to-speech (end-to-end) vs orchestration (pipeline) approaches

Strategic Planning Assumptions:

  • By 2027, speech-to-speech will become the default architecture for voice agents
  • By 2028, full-duplex (simultaneous listening/speaking) will be table stakes

Market Definition

Agentic voice APIs enable AI agents to communicate through speech, providing:

  • Speech Understanding — Converting spoken input to meaning (beyond transcription)
  • Voice Generation — Producing natural-sounding speech output
  • Conversation Management — Handling turns, interruptions, and context
  • Tool Integration — Connecting voice to actions (function calling, MCP)
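
To make the tool-integration capability concrete, here is a minimal sketch of how a voice agent might route a model-emitted function call to application code. The `ToolCall` shape, the `TOOLS` registry, and the `get_order_status` function are all hypothetical illustrations and are not taken from any platform's SDK.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ToolCall:
    """Hypothetical shape of a function call emitted by a voice model."""
    name: str
    arguments: str  # JSON-encoded arguments


def get_order_status(order_id: str) -> dict:
    # Stand-in for a real backend lookup.
    return {"order_id": order_id, "status": "shipped"}


# Registry mapping tool names to Python callables (hypothetical).
TOOLS: Dict[str, Callable[..., dict]] = {"get_order_status": get_order_status}


def dispatch(call: ToolCall) -> str:
    """Run the requested tool and return a JSON string for the model to voice."""
    handler = TOOLS.get(call.name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {call.name}"})
    try:
        kwargs = json.loads(call.arguments)
        return json.dumps(handler(**kwargs))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or arguments that do not match the tool signature.
        return json.dumps({"error": str(exc)})


print(dispatch(ToolCall("get_order_status", '{"order_id": "A123"}')))
```

Returning errors as JSON rather than raising lets the agent explain the failure to the caller instead of dropping the turn.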

Key distinction from traditional telephony: These APIs are designed for AI-native applications, not human-to-human call routing. They optimize for agent intelligence, natural interaction, and programmatic control.


Comparison Matrix

Platform | Architecture | Barge-In | Self-Host | Latency | Pricing
OpenAI Realtime | Speech-to-Speech |  |  | ~200ms | ~$0.15-0.20/min
AWS Nova 2 Sonic | Speech-to-Speech |  |  | Low | Bedrock tokens
ElevenLabs | S2S + Pipeline |  |  | ~200ms | ~$0.10/min
PersonaPlex | Speech-to-Speech |  |  | ~200ms | Free (GPU costs)
Vapi | Orchestration |  |  | ~300ms | $0.05/min + providers
Retell AI | Pipeline |  |  | ~300ms | $0.07/min + providers
LiveKit Agents | Framework |  |  | Varies | $0.004/min + providers
Deepgram Aura | TTS Only | N/A |  | sub-200ms | $0.03/1K chars

Barge-in = graceful handling of user interruptions
Deepgram Aura is a TTS building block, not a complete voice agent


Cost Comparison

Detailed Pricing Table

Platform | Free Tier | Per-Minute Cost | Monthly Minimum | Concurrent Limits
OpenAI Realtime | $5 API credit (new accounts) | ~$0.15-0.20 | None (pay-as-you-go) | Rate limited
AWS Nova 2 Sonic | AWS Free Tier (limited) | ~$0.017/min | None (pay-as-you-go) | Bedrock quotas
ElevenLabs | 10K credits (~10 min) | ~$0.10 | $5/mo (Starter) | Plan-based
PersonaPlex | Unlimited (open source) | $0 (GPU costs only) | None | Your infrastructure
Vapi | $10 free credit | $0.05 + providers (~$0.15-0.35 total) | None | 20 free concurrent
Retell AI | $10 free credit | $0.07 + providers (~$0.10-0.20 total) | None | 20 free concurrent
LiveKit Agents | 5K minutes free | $0.004 + providers (~$0.03-0.15 total) | None | Plan-based
Deepgram Aura | $200 free credit | $0.03/1K chars (~$0.02/min)† | None | Rate limited

AWS Nova 2 Sonic: $0.0034/1K input + $0.0136/1K output speech tokens ≈ $0.017/min
†Deepgram Aura is TTS only; a full voice agent requires STT and LLM components
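
The Nova 2 Sonic per-minute figure can be reproduced with a short calculation. The two token rates come from the footnote above; the token throughput (about 1K input and 1K output speech tokens per conversation minute) is an assumption made here because it is the rate at which the two prices sum to roughly $0.017. Actual usage will vary with how much each side talks.

```python
# Rates from the footnote above (USD per 1K speech tokens).
INPUT_RATE_PER_1K = 0.0034
OUTPUT_RATE_PER_1K = 0.0136


def cost_per_minute(input_tokens_per_min: float, output_tokens_per_min: float) -> float:
    """Estimate USD per conversation minute from token throughput."""
    input_cost = (input_tokens_per_min / 1000) * INPUT_RATE_PER_1K
    output_cost = (output_tokens_per_min / 1000) * OUTPUT_RATE_PER_1K
    return input_cost + output_cost


# ASSUMPTION: ~1K input and ~1K output speech tokens per minute.
estimate = cost_per_minute(1000, 1000)
print(f"${estimate:.3f}/min")
```

Because output tokens cost four times as much as input tokens, an agent that speaks more than it listens would land above this estimate.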

Cost at Scale (1,000 and 10,000 Minutes)

Platform | 1,000 Minutes | 10,000 Minutes | Notes
OpenAI Realtime | $150-200 | $1,500-2,000 | Premium but all-inclusive
AWS Nova 2 Sonic | $17-25 | $170-250 | 80% cheaper than OpenAI
ElevenLabs | $100 | $1,000 | Scale plan ($330/mo) for volume
PersonaPlex | $50-200* | $500-2,000* | *GPU infrastructure costs only
Vapi | $150-350 | $1,500-3,500 | Varies heavily by provider choice
Retell AI | $100-200 | $1,000-2,000 | Most transparent pricing
LiveKit Agents | $30-150 | $300-1,500 | Lowest if using cheap providers
Deepgram Aura | $20† | $200† | †TTS only, add STT/LLM costs
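
The scale figures in the table above follow from multiplying each platform's per-minute range by the monthly volume. A minimal sketch, using the all-in per-minute ranges from the Detailed Pricing Table; the `PER_MINUTE_RANGE` dictionary and the `cost_at_scale` helper are names introduced here for illustration only.

```python
# All-in per-minute ranges (USD) taken from the Detailed Pricing Table.
PER_MINUTE_RANGE = {
    "OpenAI Realtime": (0.15, 0.20),
    "Vapi": (0.15, 0.35),
    "Retell AI": (0.10, 0.20),
    "LiveKit Agents": (0.03, 0.15),
}


def cost_at_scale(platform: str, minutes: int) -> tuple:
    """Return the (low, high) cost in USD for a given number of minutes."""
    low, high = PER_MINUTE_RANGE[platform]
    # Round to cents to avoid floating-point noise in the output.
    return (round(low * minutes, 2), round(high * minutes, 2))


for name in PER_MINUTE_RANGE:
    print(name, cost_at_scale(name, 1_000), cost_at_scale(name, 10_000))
```

This linear scaling ignores volume discounts and plan minimums, so it is best read as an upper-level estimate rather than a quote.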

Component Cost Breakdown (for Pipeline Architectures)

Platforms like Vapi, Retell, and LiveKit charge platform fees plus provider costs. Here's what those provider costs typically add:

Component | Budget Option | Mid-Tier | Premium
STT | Deepgram Nova ($0.006/min) | AssemblyAI ($0.015/min) | Whisper API ($0.02/min)
LLM | GPT-4.1 mini ($0.016/min) | GPT-4.1 ($0.045/min) | Claude 4.5 Sonnet ($0.08/min)
TTS | Deepgram Aura ($0.02/min) | Cartesia ($0.015/min) | ElevenLabs ($0.04/min)
Telephony | SIP trunk ($0.01/min) | Twilio ($0.015/min) | Premium ($0.02/min)
Total Stack | ~$0.05/min | ~$0.09/min | ~$0.16/min

Add the platform fee ($0.05-0.07/min) for Vapi or Retell
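
The stack totals above are the sum of the four component rates, with the platform fee added on top for managed platforms. A small sketch of that arithmetic, using the component prices from the table; the `STACKS` dictionary and `total_per_minute` helper are illustrative names, and the $0.07 fee used in the loop is Retell's stated platform rate.

```python
# Per-minute component prices (USD) from the table above.
STACKS = {
    "budget": {"stt": 0.006, "llm": 0.016, "tts": 0.020, "telephony": 0.010},
    "mid": {"stt": 0.015, "llm": 0.045, "tts": 0.015, "telephony": 0.015},
    "premium": {"stt": 0.020, "llm": 0.080, "tts": 0.040, "telephony": 0.020},
}


def total_per_minute(tier: str, platform_fee: float = 0.0) -> float:
    """Sum the component rates for a tier and add an optional platform fee."""
    return round(sum(STACKS[tier].values()) + platform_fee, 3)


for tier in STACKS:
    provider_only = total_per_minute(tier)
    with_platform = total_per_minute(tier, platform_fee=0.07)
    print(f"{tier}: ${provider_only}/min providers, ${with_platform}/min with platform fee")
```

Keeping the components separate makes it easy to see which swap moves the total most; in every tier here the LLM or TTS line is the largest share.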

Hidden Costs to Watch

Platform | Hidden Cost | Impact
OpenAI Realtime | Audio token overhead | Cached prompts charged separately
ElevenLabs | Credit expiration | Unused credits may expire monthly
Vapi | Provider markups | Some providers marked up vs direct
Retell AI | Add-ons | Knowledge base, PII removal add $0.005-0.01/min
LiveKit | Egress/recording | Recording adds per-minute costs
All platforms | Failed calls | Usually still billed for partial minutes

Cost Optimization Tips

  1. For lowest cost: LiveKit self-hosted + Deepgram STT + cheap LLM + Deepgram TTS (~$0.03-0.05/min)
  2. For best value: Retell AI with GPT-4.1 mini (~$0.09/min with good quality)
  3. For enterprise: AWS Nova 2 Sonic with reserved capacity (volume discounts)
  4. For quality: ElevenLabs Scale plan (~$0.10/min with best voices)

Product Profiles

OpenAI Realtime API

The most capable speech-to-speech API with best-in-class accuracy.[1]

  • 82.8% reasoning accuracy (Big Bench Audio), 66.5% function calling
  • Native MCP server support, image inputs, SIP telephony
  • 10 voices including exclusive Marin and Cedar
  • gpt-realtime model with 20% price reduction from preview
  • ⚠️ Premium pricing (~$0.15-0.20/min), no self-hosting

Best for: Production voice agents requiring high accuracy and tool integration.

Pricing: $32/1M input tokens, $64/1M output tokens (~$0.15-0.20/min balanced).


AWS Nova 2 Sonic

Bedrock-native speech-to-speech with enterprise AWS integration.[2]

  • Unified speech understanding and generation in single model
  • Nova 2 Sonic adds polyglot voices, 1M token context window
  • Native AWS security, compliance, and unified billing
  • Integrates with Connect, Lex, and AWS services
  • ⚠️ AWS lock-in, limited regional availability

Best for: AWS-native enterprises needing compliant voice AI with unified billing.

Pricing: Token-based through Bedrock, at $0.0034/1K input + $0.0136/1K output speech tokens (≈ $0.017/min).


ElevenLabs Conversational AI

Voice quality leader with 10,000+ voices and enterprise features.[3]

  • State-of-the-art turn-taking model detects "um," "ah" cues
  • 10,000+ voices with voice cloning capability
  • Integrated RAG, automatic language detection (32+ languages)
  • HIPAA compliance, EU data residency
  • ⚠️ Premium pricing, no self-hosting option

Best for: Applications where voice quality is the key differentiator.

Pricing: ~$0.10/minute for voice agents. Plans from $5/mo to Enterprise.


NVIDIA PersonaPlex

Open-source full-duplex with customizable voices and roles.[4]

  • Full-duplex: listens and speaks simultaneously
  • Voice prompting (audio) + role prompting (text)
  • Outperforms Gemini Live on naturalness (3.90 vs 3.72)
  • 100% interruption handling success rate
  • ⚠️ Requires GPU infrastructure, no managed cloud

Best for: Teams with GPU infrastructure wanting self-hosted, customizable full-duplex.

Pricing: Free (open source). Self-hosting requires NVIDIA GPUs.


Vapi

Developer platform with provider flexibility and phone focus.[5]

  • Supports both OpenAI Realtime and traditional STT+LLM+TTS
  • Mix providers (Deepgram, ElevenLabs, OpenAI, Anthropic)
  • Strong phone integration (SIP, Twilio)
  • $25M+ raised from Bessemer, Y Combinator
  • ⚠️ Orchestration overhead, costs can add up

Best for: Developers wanting provider flexibility and phone-first voice agents.

Pricing: $0.05/min platform + provider costs. Total: $0.10-0.35/min.


Retell AI

Transparent pricing with no-code builder for phone automation.[6]

  • Modular pricing starting at $0.07/minute
  • No-code agent builder + full API
  • 99.99% uptime SLA, batch calling
  • OpenAI Realtime support ($0.50/min)
  • ⚠️ Phone-focused, smaller funding ($5.1M)

Best for: Call centers wanting transparent pricing and no-code setup.

Pricing: $0.055/min infra + $0.015/min voice + LLM + telephony. Total: $0.07-0.20/min.


LiveKit Agents

Open-source framework with full provider flexibility.[7]

  • Apache 2.0 open source, self-hostable
  • Plugin architecture for any STT/LLM/TTS provider
  • Multi-modal (voice + video agents)
  • $122.5M+ raised, OpenAI as customer
  • ⚠️ More setup complexity, DIY responsibility

Best for: Developer teams wanting open-source control and provider flexibility.

Pricing: Self-host free. Cloud: $0.004/min audio + provider costs.


Deepgram Aura

Enterprise TTS building block with domain pronunciation.[8]

  • Sub-200ms latency for realtime voice AI
  • Domain-tuned pronunciation (healthcare, finance, legal)
  • 40+ voices, on-premise deployment option
  • $1.3B valuation, $230M+ raised
  • ⚠️ TTS only, not complete voice agent platform

Best for: Enterprise TTS needing domain pronunciation and on-premise deployment.

Pricing: $0.030/1K characters. $200 free credit.


Architecture Patterns

Speech-to-Speech vs Pipeline

The market is splitting into two architectural approaches:

Speech-to-Speech (OpenAI Realtime, Nova 2 Sonic, PersonaPlex):

  • Single model for understanding + generation
  • Lower latency, better nuance preservation
  • More integrated but less modular

Pipeline (Vapi, Retell, LiveKit):

  • Separate STT → LLM → TTS components
  • Provider flexibility, cost optimization
  • Higher latency, more moving parts
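
The pipeline pattern can be sketched as a chain of stages in which each stage's output feeds the next and the per-stage delays accumulate. The stage functions below are stubs, and the latency values are illustrative placeholders rather than measurements of any provider.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    name: str
    run: Callable[[object], object]
    latency_ms: int  # illustrative placeholder, not a measured value


def run_pipeline(stages: List[Stage], payload: object) -> Tuple[object, int]:
    """Pass the payload through each stage in order and total the latency."""
    total_ms = 0
    for stage in stages:
        payload = stage.run(payload)
        total_ms += stage.latency_ms
    return payload, total_ms


# Stub components; a real system would call STT, LLM, and TTS providers here.
stages = [
    Stage("stt", lambda audio: "what is my balance", 100),
    Stage("llm", lambda text: f"You asked: {text}", 150),
    Stage("tts", lambda text: text.encode("utf-8"), 50),
]

audio_out, latency = run_pipeline(stages, b"\x00\x01")
print(audio_out, latency)
```

This models the stages as strictly sequential, which is the worst case. Production pipelines usually stream partial results between stages to overlap some of that delay, whereas a speech-to-speech model avoids the hand-offs altogether.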

Full-Duplex vs Turn-Based

Full-Duplex (PersonaPlex, emerging in others):

  • Simultaneous listening and speaking
  • Natural backchanneling ("uh-huh," "yeah")
  • More human-like but harder to implement

Turn-Based (most current systems):

  • Wait for user to finish before responding
  • Simpler but less natural
  • Adding interruption handling improves experience
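
Interruption handling in a turn-based system can be pictured as a two-state controller: when the user starts talking while the agent is speaking, playback is cancelled and the agent goes back to listening. The class and event names below are hypothetical and do not correspond to any platform's API.

```python
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()
    SPEAKING = auto()


class TurnManager:
    """Toy turn-taking controller with barge-in (hypothetical event names)."""

    def __init__(self) -> None:
        self.state = State.LISTENING
        self.cancelled_responses = 0

    def on_user_speech_end(self) -> None:
        # The user finished a turn, so the agent starts responding.
        if self.state is State.LISTENING:
            self.state = State.SPEAKING

    def on_user_speech_start(self) -> None:
        # Barge-in: the user talks over the agent, so stop playback and listen.
        if self.state is State.SPEAKING:
            self.cancelled_responses += 1
            self.state = State.LISTENING

    def on_agent_playback_done(self) -> None:
        # The agent finished speaking without being interrupted.
        if self.state is State.SPEAKING:
            self.state = State.LISTENING


tm = TurnManager()
tm.on_user_speech_end()    # agent begins its reply
tm.on_user_speech_start()  # user interrupts mid-reply
print(tm.state, tm.cancelled_responses)
```

A full-duplex model would not need this explicit switch, since it listens and speaks at the same time; the controller is what turn-based systems add to approximate that behavior.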

Enterprise Feature Comparison

Speech-to-Speech Platforms

Feature | OpenAI | Nova 2 Sonic | ElevenLabs | PersonaPlex
SOC2 |  |  |  | N/A
HIPAA |  |  |  | N/A
Self-hosting |  |  |  |
Open source |  |  |  |
SIP/Telephony |  |  |  |
Voice Cloning |  |  |  |
MCP Support |  |  |  |

Orchestration & Building Blocks

Feature | Vapi | Retell | LiveKit | Deepgram
SOC2 |  |  |  |
HIPAA |  |  |  |
Self-hosting |  |  |  |
Open source |  |  |  |
SIP/Telephony |  |  |  |
Voice Cloning | Via providers | Via providers | Via providers |
MCP Support |  |  |  | N/A

Notable Others / Did Not Meet Prerequisites

Several platforms are notable but didn't meet our criteria for this comparison:

Google Gemini Live

  • Status: Consumer product, not API-accessible
  • Why excluded: No programmatic API for developers; only available through Google apps
  • Watch for: Potential API release could disrupt market given Google's scale

Amazon Lex

  • Status: Legacy conversational AI service
  • Why excluded: Not speech-to-speech native; traditional intent/slot architecture
  • Better alternative: AWS Nova 2 Sonic for modern voice AI on AWS

Twilio Voice AI

  • Status: Orchestration layer only
  • Why excluded: Requires external STT/TTS; not a voice AI solution itself
  • Better alternative: Vapi or Retell for integrated phone voice agents

Hume AI

  • Status: Emotion-focused voice AI
  • Why excluded: Specialized for emotional intelligence, not general voice agents
  • Best for: Applications specifically needing emotion detection/response

Play.ht

  • Status: TTS only
  • Why excluded: Text-to-speech without conversation support
  • Better alternative: ElevenLabs or Deepgram Aura for voice AI TTS

Strategic Recommendations

By Use Case

Use Case | Recommended | Runner-Up
High-accuracy agents | OpenAI Realtime | ElevenLabs
Voice quality priority | ElevenLabs | Deepgram Aura
Phone automation | Retell AI | Vapi
Self-hosted deployment | LiveKit Agents | PersonaPlex
AWS enterprise | AWS Nova 2 Sonic |
Open source | PersonaPlex | LiveKit
Full-duplex | PersonaPlex | OpenAI Realtime
Cost optimization | Retell AI | LiveKit
Provider flexibility | Vapi | LiveKit

By Team Profile

Enterprise with compliance needs: → ElevenLabs (HIPAA), AWS Nova 2 Sonic (Bedrock compliance), or Deepgram (on-premise)

Startup iterating quickly: → Retell AI (no-code + API) or Vapi (provider flexibility)

Developer team wanting control: → LiveKit Agents (open-source) or self-hosted PersonaPlex

Phone-first call center: → Retell AI (transparent pricing) or Vapi (Realtime support)


Market Outlook

Near-Term (2026)

  • OpenAI Realtime API gaining enterprise adoption
  • ElevenLabs expanding conversational AI capabilities
  • More providers adding speech-to-speech models

Medium-Term (2027)

  • Speech-to-speech becoming default architecture
  • Full-duplex moving from research to production
  • MCP adoption expanding beyond OpenAI

Long-Term (2028+)

  • Category consolidation around 3-4 leaders
  • Integration with agent orchestration platforms (Tembo, etc.)
  • On-device voice AI reducing cloud dependency

Bottom Line

Eight platforms serve the agentic voice API market with distinct approaches:

Platform | Best For | Key Differentiator
OpenAI Realtime | Accuracy-critical agents | Best function calling, MCP support
AWS Nova 2 Sonic | AWS enterprises | Bedrock compliance, unified billing
ElevenLabs | Voice quality priority | 10K+ voices, turn-taking model
PersonaPlex | Self-hosted full-duplex | Open source, customizable
Vapi | Provider flexibility | Orchestration + phone integration
Retell AI | Transparent phone automation | $0.07/min, no-code builder
LiveKit Agents | Open-source control | Framework, multi-modal
Deepgram Aura | Enterprise TTS | Domain pronunciation, on-premise

The market is early and evolving rapidly. OpenAI Realtime has the accuracy lead, ElevenLabs owns voice quality, and open-source options (PersonaPlex, LiveKit) are gaining traction. The winners will be determined by which architecture—speech-to-speech or orchestrated pipeline—proves better for production agents at scale.


Research by Ry Walker Research

Disclosure: Author is CEO of Tembo, which may integrate with voice AI platforms for agent communication.