Cartesia | Ry Walker Research

Key takeaways

Raised a $100M round from Kleiner Perkins, Index Ventures, Lightspeed, and NVIDIA in November 2025 alongside the Sonic-3 launch, on top of a $64M Series A (March 2025) and $27M seed — roughly $191M disclosed in total
The architecture bet is state space models instead of transformers — the S4/Mamba research line from founders Albert Gu and Chris Ré at Stanford — yielding ~90ms model latency (~40ms on turbo variants) and an on-device deployment story competitors lack
More than a TTS API: Sonic (TTS), Ink (STT), and the Line voice agent platform make it a full agentic voice stack, with ServiceNow, Decagon, Zomato, and Retell AI among named customers
Sonic-3.5 took the #1 spot on the Artificial Analysis Speech Arena leaderboard in May 2026, ahead of Inworld and Google's Gemini Flash TTS

FAQ

What is Cartesia?

Cartesia is a voice AI company building real-time speech models on state space model architecture — Sonic for text-to-speech, Ink for speech-to-text, and Line, a platform for building and deploying voice agents.

How much does Cartesia cost?

Credit-based tiers from a free plan (20K credits/month) through Pro ($4/month), Startup ($39/month), and Scale ($239/month) to custom Enterprise; Line voice agents are billed at $0.06 per minute of call time plus $0.014/minute for Cartesia phone numbers.

What models does Cartesia offer?

Sonic-3.5 (text-to-speech, 42 languages, ~40ms time-to-first-byte on the turbo variant) and Ink-2 (streaming speech-to-text), both built on state space models rather than transformers, deployable in cloud, VPC/on-prem, and on-device.

How is Cartesia different from ElevenLabs?

ElevenLabs wins on voice-library breadth and long-form expressive narration; Cartesia is built for live conversational agents — lower turbo latency from its SSM architecture, an on-device story, and the Line agent platform — rather than content production.

Executive Summary

Cartesia is the architecture contrarian of the voice AI category: while nearly every competitor builds on transformers, Cartesia's models run on state space models (SSMs) — the S4/Mamba research line its founders created at Stanford's AI Lab — which the company credits for ultra-low latency, long-context efficiency, and the ability to run on-device.^[1]^[2] The product stack is a full agentic voice pipeline: Sonic (text-to-speech, currently Sonic-3.5), Ink (streaming speech-to-text, currently Ink-2), and Line, a platform for building and shipping enterprise voice agents on top of those models.^[1]

The capital story is among the strongest in the category. Founded in 2023 by Karan Goel (CEO), Albert Gu, Arjun Desai, Brandon Yang, and Stanford professor Chris Ré, Cartesia raised a $27M seed (December 2024), a $64M Series A led by Kleiner Perkins (March 2025), and a $100M round from Kleiner Perkins, Index Ventures, Lightspeed, and NVIDIA announced alongside Sonic-3 in November 2025 — roughly $191M disclosed in total.^[3]^[4]^[2] At the Series A the company reported 10,000+ customers including Quora, Cresta, and Rasa; the site now names ServiceNow, Decagon, Zomato, Sanas, Elise AI, and Retell AI.^[4]^[1] In May 2026, Sonic-3.5 took the #1 spot on the Artificial Analysis Speech Arena leaderboard.^[5]

Attribute	Value
Company	Cartesia (San Francisco)^[4]
Founders	Karan Goel (CEO), Albert Gu, Arjun Desai, Brandon Yang, Chris Ré — Stanford AI Lab^[4]
Founded	2023^[4]
Funding	~$191M disclosed: $27M seed (Dec 2024), $64M Series A (Mar 2025, Kleiner Perkins), $100M (Nov 2025, Kleiner Perkins/Index/Lightspeed/NVIDIA)^[3]^[4]^[2]
Named Customers	ServiceNow, Decagon, Zomato, Sanas, Elise AI, Blue Machines, Retell AI, Fundamento; earlier Quora, Cresta, Rasa^[1]^[4]
Open Source	No (proprietary models); the underlying SSM research (S4, Mamba) is published academic work^[1]

Product Overview

Cartesia sells the two halves of a voice agent's audio loop as separate APIs — Sonic for speech out, Ink for speech in — and then packages them, with telephony and orchestration, as the Line platform for teams that want a managed voice agent rather than raw models.^[1] Sonic-3 launched in late 2025 with 42 languages, ~90ms model latency (~190ms end-to-end), and expressiveness features like laughter and tonal variation; turbo variants have run at ~40ms time-to-first-byte since Sonic 2.0.^[2]^[6] Sonic-3.5, the current flagship, keeps the 42 languages (including 9 Indian languages), ships 500+ voices out of the box, and leads the Artificial Analysis Speech Arena as of May 2026.^[5]

Independent benchmarking broadly corroborates the latency story: a third-party 2026 comparison measured Sonic-3.5 streaming first audio in 75–90ms over WebSocket, roughly tied with ElevenLabs' fastest Flash model but far ahead of ElevenLabs' higher-fidelity models (500–800ms), while noting Cartesia's quality is concentrated on conversational delivery rather than long-form narration.^[7]

Key Capabilities

Capability	Description
Sonic-3.5 TTS	42 languages, 500+ stock voices, laughter/emotion control, #1 on Artificial Analysis Speech Arena (May 2026)^[5]^[2]
Turbo latency	~40ms time-to-first-byte on turbo variants; ~90ms full model^[6]
Ink-2 STT	Streaming speech-to-text, billed per second of audio^[1]^[8]
Voice cloning	Instant cloning from the $4 Pro tier; "pro" cloning from the $39 Startup tier^[9]
Line voice agents	Managed platform for building/shipping enterprise voice agents, with phone numbers^[1]^[9]
Voice changing / infill	Voice-change and infill-editing endpoints introduced with Sonic 2.0^[6]

Product Surfaces

Surface	Description	Availability
TTS/STT APIs	WebSocket and REST access to Sonic and Ink	GA^[1]
Line platform	Voice agent builder with per-minute billing and phone numbers	GA^[9]
On-prem / VPC	In-region processing on customer infrastructure	Enterprise^[1]
On-device	Models for mobile, PC, and robotics deployment	Available^[1]

Technical Architecture

The differentiator is the model family itself. Cartesia's models are built on state space models rather than transformers — the research line (S4, Mamba) co-created by founders Albert Gu and Chris Ré at Stanford — which process audio as a continuous stream with constant-memory state rather than attention over a growing context window.^[1]^[2] The company credits this for both the latency numbers and the efficiency that makes on-device deployment practical; a Cartesia engineer has claimed Sonic runs on a MacBook Pro with cloning intact.^[1]^[3]
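The constant-memory point can be illustrated with a toy example. The sketch below is a generic textbook diagonal linear state-space recurrence, not Cartesia's architecture or the S4/Mamba parameterization; the class name `DiagonalSSM` and the coefficients `a`, `b`, `c` are hypothetical and chosen only for illustration. It shows that each new input sample updates a fixed-size state, so memory does not grow with the length of the stream.

```python
from typing import List


class DiagonalSSM:
    """Toy linear state-space recurrence with a diagonal state matrix.

    Per step:  state[i] = a[i] * state[i] + b[i] * u
               y        = sum(c[i] * state[i])
    Illustrative only; not Cartesia's model.
    """

    def __init__(self, a: List[float], b: List[float], c: List[float]) -> None:
        if not (len(a) == len(b) == len(c)):
            raise ValueError("a, b and c must have the same length")
        self.a, self.b, self.c = a, b, c
        self.state = [0.0] * len(a)  # fixed size, set once

    def step(self, u: float) -> float:
        """Consume one input sample and return one output sample."""
        self.state = [
            a_i * s_i + b_i * u
            for a_i, s_i, b_i in zip(self.a, self.state, self.b)
        ]
        return sum(c_i * s_i for c_i, s_i in zip(self.c, self.state))


# Usage: stream three samples through a one-dimensional state.
ssm = DiagonalSSM(a=[0.5], b=[1.0], c=[1.0])
outputs = [ssm.step(u) for u in [1.0, 1.0, 1.0]]
print(outputs)         # [1.0, 1.5, 1.75]
print(len(ssm.state))  # 1, however many samples were processed
```

By contrast, a plain attention layer without windowing keeps every earlier position available to each new step, so its working context grows with the stream.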

Key Technical Details

Aspect	Detail
Deployment	Managed cloud with regional endpoints; VPC/on-prem; on-device (mobile, PC, robotics)^[1]
Models	Sonic-3.5 (TTS), Ink-2 (STT); turbo TTS variants at ~40ms TTFB^[1]^[6]
Architecture	State space models (SSMs), not transformers^[1]^[2]
Integrations	Line platform telephony; used inside agent platforms like Retell AI^[9]^[1]
Open Source	Models proprietary; SSM research lineage published^[1]

Strengths

Architectural moat with a research pedigree — the founding team includes the creators of the S4/Mamba SSM line, and the latency and on-device claims flow directly from that architecture rather than from infrastructure tuning.^[4]^[2]
Independently verified quality leadership — Sonic-3.5 holds #1 on the Artificial Analysis Speech Arena as of May 2026, and third-party benchmarks confirm 75–90ms real-world first-audio latency.^[5]^[7]
Tier-one capital depth — ~$191M from Kleiner Perkins, Index, Lightspeed, and NVIDIA gives it a longer runway than nearly any pure-play voice model competitor.^[4]^[2]
Full-stack coverage — TTS, STT, and a managed agent platform (Line) from one vendor, so teams can start with raw models and graduate to hosted agents without switching providers.^[1]^[9]
Deployment flexibility competitors lack — cloud, VPC/on-prem, and on-device options cover compliance-bound and embedded use cases most API-only rivals cannot.^[1]
Cheap to start: a free tier with 20K credits and a $4/month Pro tier with instant voice cloning are among the lowest entry points in the category.^[9]

Cautions

Long-form expressiveness trails ElevenLabs — third-party testing finds Sonic-3.5 roughly tied on short conversational replies but audibly behind on character voices and narration, so content-production workloads fit poorly.^[7]
Smaller curated voice catalog — independent comparison puts Cartesia's public catalog at roughly 100 curated conversational voices versus 1,000+ at ElevenLabs, even as Artificial Analysis counts 500+ stock voices; either way, variety is not the pitch.^[7]^[5]
Thin direct community discussion — Cartesia's December 2024 Show HN drew zero comments, and most public evaluation comes from benchmark sites and competitor blogs rather than practitioner threads.^[3]
Credit-based pricing obscures unit costs — TTS bills per character and STT per second through an abstract credit system, and concurrency caps (2 TTS requests on free, 15 on the $239 tier) force upgrades independent of volume.^[9]^[8]
Line is young relative to dedicated agent platforms — Vapi, Retell, and LiveKit have years of telephony-orchestration scar tissue; notably, Retell AI is itself a named Cartesia customer, putting Line in partial competition with Cartesia's own channel.^[1]
Rapid model churn — Sonic 2.0, Sonic-3, and Sonic-3.5 shipped within roughly a year, which is a strength for quality but a real regression-testing burden for production voice agents pinned to model behavior.^[6]^[5]

What Developers Say

Direct practitioner discussion is thinner than the funding profile would suggest — the Show HN launch drew no comments — but scattered HN threads and independent benchmarks give a consistent picture: fast and high-quality, with rough edges on long output.^[3]

"Great quality, very fast — but with a few kinks to work out (going off the rails on long utterances, random sounds, etc)." — nmfisher on Hacker News^[3]

"On long-form expressive content like character voices or audiobook narration, the texture difference becomes audible." — Future AGI's 2026 ElevenLabs-vs-Cartesia benchmark^[7]

"Our TTS model, Sonic, is probably SoTA for on-device… Sonic can be run on a MacBook Pro." — kabirgoel, Cartesia engineer, on Hacker News (vendor voice; read accordingly)^[3]

"Sonic-3.5 takes the #1 spot on the Artificial Analysis Speech Arena Leaderboard, surpassing Inworld Realtime TTS 1.5 Max and Google's Gemini 3.1 Flash TTS." — Artificial Analysis^[5]

As of June 2026 there is no substantive critical HN or Reddit thread on Cartesia specifically; the most skeptical material is competitor-authored comparisons, which should be discounted accordingly.^[3]^[7]

Pricing & Licensing

Credit-based subscription tiers plus per-minute billing for Line voice agents; TTS consumes roughly one credit per character and STT one credit per second of audio.^[9]^[8]

Tier	Price	Includes
Free	$0/mo	20K credits, $1 prepaid agent usage, 2 concurrent TTS / 8 STT requests, 1 agent slot
Pro	$4/mo	100K credits (~133 TTS min), instant voice cloning, 3 agent slots
Startup	$39/mo	1.25M credits (~1,667 TTS min), pro voice cloning, 5 agent slots
Scale	$239/mo	8M credits (~10,667 TTS min), 15 concurrent TTS / 60 STT, priority support, 10 agent slots
Enterprise	Custom	Volume pricing, SSO, DPAs, BAAs, custom concurrency
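
The table's credit and minute figures can be cross-checked with a few lines of arithmetic. This is a rough sketch using only the prices and credit counts listed above. The rate of about 750 characters per TTS minute is an assumption inferred from the table's own "100K credits (~133 TTS min)" figure and is not stated in the source; the function names are hypothetical.

```python
from typing import Optional

# (tier name, USD per month, monthly credits), copied from the table above.
TIERS = [
    ("Free", 0, 20_000),
    ("Pro", 4, 100_000),
    ("Startup", 39, 1_250_000),
    ("Scale", 239, 8_000_000),
]

# Assumption: ~750 TTS characters per minute of audio, inferred from the
# table (100K credits ~ 133 min). One credit is treated as one character.
CHARS_PER_TTS_MINUTE = 750


def tts_minutes(credits: int) -> float:
    """Approximate minutes of TTS audio a credit allowance buys."""
    return credits / CHARS_PER_TTS_MINUTE


def usd_per_million_chars(price_usd: float, credits: int) -> float:
    """Effective subscription price per one million TTS characters."""
    return price_usd * 1_000_000 / credits


def cheapest_tier_for(chars_per_month: int) -> Optional[str]:
    """Lowest listed tier whose credits cover the volume; None means Enterprise.

    Ignores concurrency caps, which can force an upgrade on their own.
    """
    for name, _price, credits in TIERS:  # TIERS is ordered cheapest first
        if credits >= chars_per_month:
            return name
    return None


for name, price, credits in TIERS:
    print(
        f"{name}: ~{tts_minutes(credits):,.0f} TTS min, "
        f"${usd_per_million_chars(price, credits):.2f} per 1M characters"
    )
```

Under these assumptions the effective rate falls from $40.00 per million characters on Pro to $31.20 on Startup and about $29.88 on Scale, so the larger tiers mostly buy volume and concurrency rather than a much lower unit price.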

Line voice agents bill $0.06 per minute of call duration plus $0.014/minute for Cartesia-provisioned phone numbers. All pricing as of June 2026.^[9]
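
As a quick forecasting aid, the per-minute figures above can be turned into a call-cost estimate. This is a minimal sketch under stated assumptions: the function name `line_call_cost_usd` is hypothetical, the two rates are the ones quoted above, and the estimate deliberately leaves out the subscription fee, any prepaid agent allowance, and LLM or carrier charges.

```python
LINE_USD_PER_MINUTE = 0.06           # Line agent call time, as quoted above
PHONE_NUMBER_USD_PER_MINUTE = 0.014  # Cartesia-provisioned phone number


def line_call_cost_usd(call_minutes: float, cartesia_number: bool = True) -> float:
    """Estimate Line usage cost for a given number of call minutes.

    Covers only the two per-minute rates; everything else is out of scope.
    """
    if call_minutes < 0:
        raise ValueError("call_minutes must be non-negative")
    rate = LINE_USD_PER_MINUTE
    if cartesia_number:
        rate += PHONE_NUMBER_USD_PER_MINUTE
    return round(call_minutes * rate, 2)


# Usage: 1,000 minutes of calls in a month, with and without a Cartesia number.
print(line_call_cost_usd(1_000))                         # 74.0
print(line_call_cost_usd(1_000, cartesia_number=False))  # 60.0
```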

Licensing model: Proprietary managed API and platform; no open-source model weights.^[1]

Hidden costs: Line's per-minute agent billing sits on top of the subscription via prepaid agent allowances; concurrency limits, not just credit volume, drive tier upgrades; LLM and telephony-carrier costs for full agents are additional.^[9]

Competitive Positioning

Direct Competitors

Competitor	Differentiation
ElevenLabs	The category's quality and ecosystem leader with 1,000+ voices and a mature agents platform; Cartesia counters with faster turbo latency, SSM efficiency, on-device deployment, and a lower entry price
Deepgram Aura	Enterprise STT incumbent with TTS added on, strong on-prem story and domain-tuned pronunciation; Cartesia leads on TTS quality benchmarks and expressiveness
Gradium	Fellow research-lab spinout building low-latency speech APIs with a European angle; Cartesia is further ahead on disclosed funding, leaderboard position, and a shipping agent platform
OpenAI Realtime / speech-to-speech models	End-to-end S2S removes the pipeline entirely; Cartesia's modular TTS/STT remains cheaper and more controllable for production phone agents

When to Choose Cartesia Over Alternatives

Choose Cartesia when: latency is the binding constraint for a live conversational agent, you want one vendor for TTS + STT + agent hosting, or you need on-device or in-VPC speech models.
Choose ElevenLabs when: voice variety, long-form expressive narration, or the largest voice-cloning ecosystem matter more than the last 50ms.
Choose Deepgram Aura when: STT accuracy on domain jargon and a battle-tested enterprise on-prem deployment drive the decision.
Choose Gradium when: EU-centric data handling and that team's research lineage fit your constraints.

Ideal Customer Profile

Best fit:

Voice agent platforms and contact-center AI products where time-to-first-audio is user-facing — the segment Cartesia's customer list (Decagon, Retell AI, Elise AI) already reflects^[1]
Enterprises needing in-region, VPC, or on-device speech models for compliance or embedded products (robotics, mobile)^[1]
Teams that want to start on raw TTS/STT APIs and later graduate to a managed agent platform without changing vendors^[9]

Poor fit:

Audiobook, character, and long-form content production, where independent testing finds ElevenLabs audibly ahead^[7]
Teams wanting open-source weights or self-auditable models
Builders who want a single speech-to-speech model rather than a composable TTS/STT pipeline

Viability Assessment

Factor	Assessment
Financial Health	Excellent — ~$191M raised through November 2025, with Kleiner Perkins leading twice and NVIDIA on the cap table^[4]^[2]
Market Position	Strong challenger — #1 on the Artificial Analysis Speech Arena (May 2026) and 10,000+ customers reported at the Series A, but ElevenLabs holds the ecosystem and mindshare lead^[5]^[4]
Innovation Pace	Very high — Sonic 2.0, Sonic-3, Sonic-3.5, Ink-2, and the Line platform all shipped between March 2025 and May 2026^[6]^[5]^[1]
Community/Ecosystem	Thin for the funding level — minimal HN footprint, evaluation dominated by benchmark sites and competitor content^[3]
Long-term Outlook	Strong, contingent on the SSM latency/efficiency edge surviving as transformer rivals optimize and speech-to-speech models mature^[2]^[7]

The capital position and research pedigree make Cartesia one of the safest pure-play voice model bets in the category, and the leaderboard result shows the architecture argument is producing measurable quality, not just speed.^[5]^[2] The open question is strategic: Line puts Cartesia in competition with the agent platforms (Retell, Vapi) that are also its model customers, a channel tension ElevenLabs has navigated with mixed results.^[1]

Bottom Line

Cartesia is the latency-and-architecture play in agentic voice: if you are building a live conversational agent where the gap between 40ms and 400ms is the product, its SSM-based Sonic models are the benchmark leader with the deployment flexibility (cloud, VPC, on-device) to follow you into compliance-bound or embedded territory. The trade-off is a smaller voice catalog, weaker long-form expressiveness, a young agent platform, and a credit-based pricing scheme that takes a spreadsheet to forecast.

Recommended for: Voice agent builders and contact-center AI products where time-to-first-audio is user-facing; enterprises needing in-region or on-device speech models; teams wanting TTS, STT, and agent hosting from one well-capitalized vendor.

Not recommended for: Long-form narration and character-voice content production; teams requiring open-source models; builders betting on end-to-end speech-to-speech rather than a modular pipeline.

Outlook: Watch whether the SSM latency edge holds as transformer TTS vendors close the gap, whether Line gains traction without alienating platform customers like Retell AI, and whether the next disclosed numbers (revenue, usage) match the ~$191M of expectations behind them.

Research by Ry Walker Research

Sources