Gradium | Ry Walker Research

Key takeaways

$70M seed announced December 2, 2025 — one of the largest European AI seeds of the year — led by FirstMark Capital and Eurazeo with Xavier Niel, DST Global Partners, and Eric Schmidt participating, three months after the company's September 2025 founding
The team is the research core of Kyutai, the nonprofit Paris lab behind Moshi — the open-source 7B full-duplex audio LLM with ~160ms response latency that beat OpenAI's Advanced Voice Mode to market in summer 2024
Production streaming STT/TTS APIs in five languages (English, French, German, Spanish, Portuguese) with 237 stock voices, instant and pro voice cloning, semantic VAD for turn-taking, and telephony audio formats
Credit-based pricing from a free tier to $1,615/month — 1 TTS character = 1 credit, 1 second of STT = 3 credits — with an on-device TTS offering for edge deployment

FAQ

What is Gradium?

Gradium is a Paris-based voice AI company, spun out of the nonprofit research lab Kyutai, that sells real-time text-to-speech, speech-to-text, and voice-cloning APIs built on its own audio language models.

How much does Gradium cost?

Credit-based monthly tiers from free (45k credits, ~1 hour of TTS) to $1,615/month (45M credits, ~1,000 hours of TTS), where 1 TTS character costs 1 credit and 1 second of STT costs 3 credits; commercial use requires a paid tier.

How is Gradium related to Kyutai and Moshi?

Gradium's founders created and led Kyutai, the nonprofit lab behind the open-source Moshi full-duplex audio model; Gradium is the commercial spinoff productionizing that research line, and the teams remain in close proximity.

How is Gradium different from Cartesia?

Both sell low-latency voice APIs built on novel in-house architectures, but Cartesia is a US company several product generations in, while Gradium is a months-old European entrant differentiating on its full-duplex audio-LLM research lineage, EU base, and five-language launch coverage.

Executive Summary

Gradium is what happens when a frontier research lab decides to invoice: the commercial spinoff of Kyutai, the Xavier Niel-backed nonprofit Paris lab whose open-source Moshi model — a 7B-parameter full-duplex audio LLM with ~160ms response latency — shipped real-time voice conversation in summer 2024, before OpenAI's Advanced Voice Mode reached the public.^[1]^[2]^[3] Founded in September 2025, Gradium came out of stealth on December 2, 2025 with a $70M seed led by FirstMark Capital and Eurazeo, joined by Niel, DST Global Partners, and Eric Schmidt.^[1]

The product is deliberately narrower than the research: production streaming speech-to-text and text-to-speech APIs — plus instant and pro voice cloning — in English, French, German, Spanish, and Portuguese, which the company says were serving paying customers within three months of inception.^[4]^[1] The founding four — Neil Zeghidour (CEO, ex-Google DeepMind), Olivier Teboul (CTO, ex-Google Brain), Laurent Mazaré (Chief Coding Officer, ex-DeepMind/Jane Street), and Alexandre Défossez (CSO, ex-Meta) — were Kyutai's audio research core, and Gradium keeps what it calls "a natural proximity" with the lab's researchers.^[5]^[2]^[4]

Attribute	Value
Company	Gradium (Paris, France)^[1]
Founders	Neil Zeghidour (CEO), Olivier Teboul (CTO), Laurent Mazaré (Chief Coding Officer), Alexandre Défossez (CSO)^[5]
Founded	September 2025; out of stealth December 2, 2025^[1]
Funding	$70M seed led by FirstMark Capital and Eurazeo; Xavier Niel, DST Global Partners, Eric Schmidt^[1]
Lineage	Commercial spinoff of nonprofit lab Kyutai (founded 2023; Moshi, Hibiki)^[2]
Open Source	Kyutai's models (Moshi) are open source on GitHub; Gradium's production models are proprietary^[3]^[6]

Product Overview

Gradium sells the unglamorous half of the voice stack: streaming transcription in, streaming synthesis out, over WebSockets, with the conversational plumbing (semantic voice-activity detection for turn-taking, adaptive delay controls, flush commands) exposed as API primitives rather than hidden inside an agent product.^[7] The target buyer is a developer assembling a real-time voice agent, dubbing pipeline, or game NPC, not an end user; cited verticals include health, customer support, market research, gaming NPCs, and advertising avatars.^[4]

The library ships 237 stock voices across the five launch languages, and custom voices come in two grades: instant clones from a short audio sample on every tier, and pro clones reserved for the upper tiers.^[7]^[8] A speech-to-speech WebSocket — audio in, audio out on a single connection — points at where the Moshi lineage is headed, and an on-device TTS offering targets edge deployment.^[7]^[6]

Key Capabilities

Capability	Description
Streaming TTS	WebSocket streaming synthesis with speed, temperature, and voice-similarity controls^[7]
Streaming STT	Real-time transcription with semantic VAD and flush for turn-taking^[7]
Voice cloning	Instant clones from a short sample on all tiers; pro clones on M tier and up^[7]^[8]
Speech-to-speech	Audio-in/audio-out over a single WebSocket^[7]
Languages	English, French, German, Spanish, Portuguese; more announced as coming^[1]
Telephony formats	Mu-law, A-law, and low-sample-rate PCM support^[7]
Pronunciation control	Custom pronunciation dictionaries and text-rewrite rules^[7]
On-device TTS	Edge/on-device synthesis offering^[6]
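The telephony-formats row refers to G.711 companding, which squeezes 16-bit linear PCM into 8 bits per sample for 8 kHz phone audio. As a rough illustration of what that conversion involves, here is a sketch of the standard mu-law encode/decode step. It is not Gradium's code and calls no Gradium API; the function names are illustrative.

```python
BIAS = 0x84   # 132, added before finding the segment (standard G.711 mu-law)
CLIP = 32635  # largest magnitude that still fits after adding the bias


def linear_to_ulaw(sample: int) -> int:
    """Compress one signed 16-bit PCM sample into one mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the segment (exponent): position of the highest set bit, 7 down to 0.
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # mu-law bytes are stored bit-inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def ulaw_to_linear(byte: int) -> int:
    """Expand one mu-law byte back to an approximate 16-bit PCM sample."""
    byte = ~byte & 0xFF
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


if __name__ == "__main__":
    for pcm in (0, 1000, -1000, 32767):
        encoded = linear_to_ulaw(pcm)
        print(pcm, hex(encoded), ulaw_to_linear(encoded))
```

The round trip is lossy by design: quiet samples keep fine resolution while loud ones are quantized coarsely. That is why format compatibility alone says nothing about call setup, which still needs SIP or a telephony provider.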

Product Surfaces

Surface	Description	Availability
WebSocket + REST APIs	Streaming TTS/STT plus one-shot HTTP synthesis; OpenAPI spec	GA^[7]
Python SDK	First-party SDK; browser access via short-lived tokens	GA^[7]
Framework integrations	LiveKit Agents, Pipecat, OpenClaw plugins	GA^[7]
Studio / demos	studio.gradium.ai playground, Gradbot, agent demo, app gallery	GA^[6]
On-device TTS	Edge SDK offering	Listed; details undisclosed^[6]

Technical Architecture

Gradium builds its own audio language models rather than wrapping anyone else's. The founding team's thesis, laid out in Amplify's profile, is that audio is the one modality where small labs beat big ones because it rewards "clever ideas efficiently executed, not just scale": Moshi reached real-time full-duplex conversation at 7B parameters, trained on 2.1T tokens and 7M hours of audio, against Llama 3.1's 405B parameters and 15T tokens.^[2] The production API models themselves are unnamed in public materials, and, notably for a latency-led pitch, Gradium publishes no time-to-first-audio benchmark numbers on its site or docs as of June 2026; "ultra-low latency" is asserted, with Moshi's ~160ms research result as the lineage proof point.^[6]^[7]^[2]
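With no published time-to-first-audio figures, a buyer's first step is a small harness of their own. The sketch below is vendor-neutral and makes no claim about Gradium's SDK: `start_stream` is a hypothetical callable standing in for whatever client code opens a TTS stream and yields audio chunks, and `fake_stream` is a local stand-in so the example runs offline.

```python
import time
from statistics import median
from typing import Callable, Dict, Iterable, List


def time_to_first_audio(
    start_stream: Callable[[], Iterable[bytes]],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Seconds from issuing the request to receiving the first non-empty chunk."""
    t0 = clock()
    for chunk in start_stream():
        if chunk:  # ignore empty keep-alive frames
            return clock() - t0
    raise RuntimeError("stream ended before any audio arrived")


def summarize(samples: List[float]) -> Dict[str, float]:
    """Report median and worst case; tail latency is what callers notice."""
    return {"median_s": median(samples), "max_s": max(samples)}


def fake_stream() -> Iterable[bytes]:
    # Hypothetical stand-in for a real TTS stream: an empty frame, then audio.
    yield b""
    time.sleep(0.02)
    yield b"\x00\x01" * 160


if __name__ == "__main__":
    runs = [time_to_first_audio(fake_stream) for _ in range(5)]
    print(summarize(runs))
```

To compare vendors, swap `fake_stream` for each real client and run from the region where the agent will be deployed, since network distance can dominate the measurement.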

Key Technical Details

Aspect	Detail
Deployment	Managed cloud API; on-device TTS offering for edge^[6]
Models	Proprietary in-house audio LLMs descended from Kyutai's research line (Moshi, Hibiki)^[2]^[4]
Transport	WebSocket streaming (multiplexed requests per connection) + REST one-shot^[7]
Integrations	LiveKit Agents, Pipecat, OpenClaw; telephony audio formats but no native SIP/phone-number product^[7]
Open Source	Production models closed; sibling lab Kyutai's Moshi is open source^[3]

Strengths

The category's strongest research pedigree: the founders built Moshi, the first open full-duplex audio LLM (~160ms, summer 2024), and ran the only dedicated open audio lab; this is the team incumbents cite, now commercializing its own work.^[2]^[3]
$70M seed with name-brand conviction: a round led by FirstMark and Eurazeo, with Niel, DST, and Eric Schmidt participating, three months after founding is an outlier for a European API company.^[1]
Production speed — streaming STT/TTS APIs serving paying customers within three months of inception, with a credit-metered self-serve funnel from day one.^[4]^[8]
European multilingual launch posture — five languages at launch (vs. the English-first norm), EU jurisdiction, and telephony-grade audio formats make it a natural fit for European contact-center and compliance-sensitive buyers.^[1]^[7]
Agent-aware API design — semantic VAD, adaptive delay controls, and WebSocket multiplexing are conversational-agent primitives, not afterthoughts, and LiveKit/Pipecat plugins slot it into the standard voice-agent stacks.^[7]

Cautions

No published latency numbers — for a company whose entire pitch is "ultra-low latency," the absence of public time-to-first-audio benchmarks on the site or docs as of June 2026 is conspicuous; buyers must benchmark themselves.^[6]^[7]
Months old, in a knife-fight category — production APIs are barely two quarters old, against ElevenLabs, Cartesia, Deepgram, and OpenAI's Realtime API, all with years of production hardening and far larger language coverage.^[1]
Five languages is launch coverage, not parity — ElevenLabs and rivals support dozens of languages; Gradium's "additional languages coming" is a roadmap promise.^[1]
Demo robustness questions — an HN commenter found the public demo produces "weird noises and random words" on nonsense input, a hallucination mode typical of audio-LLM architectures that matters in production edge cases.^[9]
No native telephony product — mu-law/A-law format support is not SIP trunking or phone numbers; teams building phone agents still need LiveKit, Pipecat, or a Vapi-style orchestrator on top.^[7]
Spinoff structure is untested: no governance details are public for the nonprofit-lab-to-commercial-spinoff relationship (Kyutai keeps researching, Gradium productionizes "in proximity"), and part of the research the valuation rests on lives outside the company.^[4]^[2]

What Developers Say

There is no dedicated HN launch thread for Gradium as of June 2026 — only low-engagement Show HN posts for third-party Go and Rust client libraries; substantive discussion happens in the 319-point February 2026 thread on Amplify's Kyutai/Gradium essay, where sentiment runs warmer on the research than on the framing.^[10]^[9]

"Moshi was an amazing tech demo... But, this piece is a fluff piece: 'underfunded' means... around $400 million" — shenberg on Hacker News, counting Kyutai's ~$330M initial backing plus Gradium's $70M^[9]

"I was suspicious that they are not mentioned, but then I realized this is a VC opinion piece" — gorgoiler on Hacker News, on the essay omitting ElevenLabs^[9]

"for a laugh enter nonsense at gradium.ai — You get all kinds of weird noises and random words" — RobMurray on Hacker News^[9]

"Gradium, a commercial company offshoot of Kyutai (open source lab), are focusing on emotion recognition and contextual emotion selection." — sofixa on Hacker News^[10]

One caveat: a Gradium GTM employee (pain_perdu) actively works HN threads — offering free credits, claiming "excellent background noise suppression" and Spanish-English code-switching — so some pro-Gradium framing in comment sections is vendor voice.^[10]

Pricing & Licensing

Pricing is credit-metered rather than per-minute: 1 character of TTS costs 1 credit and 1 second of STT costs 3 credits, across six monthly tiers.^[8]

Tier	Price	Includes
Free	$0	45k credits (~1 hr TTS / 4 hrs STT), 5 instant clones, 3 concurrent streams; non-commercial only
XS	$13/mo	225k credits (~5 hrs TTS / 21 hrs STT); commercial use unlocks here
S	$43/mo	900k credits (~20 hrs TTS / 83 hrs STT)
M	$340/mo	9M credits (~200 hrs TTS / 833 hrs STT), 5 pro clones
L	$1,615/mo	45M credits (~1,000 hrs TTS / 4,167 hrs STT), 20 pro clones, 15 concurrent streams
Tailored	Custom	Unlimited credits, clones, and concurrency

Overage runs $6.90 per additional 100k credits on XS, falling to $3.80 on L. All pricing as of June 2026.^[8]
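The credit arithmetic is easy to check by hand, but a small calculator makes tier comparisons quicker. The sketch below uses only the rates and tier figures quoted above. Two things are assumptions: overage is billed pro rata rather than in whole 100k blocks, and tiers without a quoted overage rate (S and M) simply raise an error.

```python
import math
from dataclasses import dataclass
from typing import Optional

TTS_CREDITS_PER_CHAR = 1    # from the pricing table
STT_CREDITS_PER_SECOND = 3  # from the pricing table


@dataclass(frozen=True)
class Tier:
    name: str
    monthly_usd: float
    credits: int
    overage_usd_per_100k: Optional[float] = None  # quoted for XS and L only


TIERS = {
    "XS": Tier("XS", 13.0, 225_000, 6.90),
    "S": Tier("S", 43.0, 900_000),
    "M": Tier("M", 340.0, 9_000_000),
    "L": Tier("L", 1615.0, 45_000_000, 3.80),
}


def credits_needed(tts_chars: int, stt_seconds: float) -> int:
    """Total credits for a month's TTS characters and STT seconds."""
    return math.ceil(
        tts_chars * TTS_CREDITS_PER_CHAR + stt_seconds * STT_CREDITS_PER_SECOND
    )


def monthly_cost(tier: Tier, credits: int) -> float:
    """Base price plus overage (assumed pro rata, not rounded to 100k blocks)."""
    extra = max(0, credits - tier.credits)
    if extra == 0:
        return tier.monthly_usd
    if tier.overage_usd_per_100k is None:
        raise ValueError(f"no overage rate quoted for tier {tier.name}")
    return tier.monthly_usd + extra / 100_000 * tier.overage_usd_per_100k


def usd_per_stt_hour(tier: Tier) -> float:
    """Effective in-allowance price of one hour of transcription."""
    return tier.monthly_usd / tier.credits * STT_CREDITS_PER_SECOND * 3600


if __name__ == "__main__":
    need = credits_needed(tts_chars=200_000, stt_seconds=10 * 3600)
    print(need, round(monthly_cost(TIERS["XS"], need), 2))
    print(round(usd_per_stt_hour(TIERS["L"]), 3))
```

Under these assumptions, a month of 200k TTS characters plus 10 hours of STT (308k credits) costs about $18.73 on XS with overage versus $43 on S, so stepping up a tier only pays off once overage exceeds the price gap.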

Licensing model: Proprietary managed API; the adjacent Kyutai lab open-sources its research models (Moshi), but Gradium's production models are closed.^[3]^[6]

Hidden costs: Free-tier output is non-commercial; concurrency caps (3–15 streams below Tailored) bind real-time agent fleets early; pro voice clones require the $340/month tier; on-device TTS licensing is unpublished.^[8]^[6]
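Whether a concurrency cap binds before the credit allowance depends on traffic shape. A rough check uses Little's law: average concurrency equals arrival rate times average duration. The sketch below is a back-of-envelope estimate. How streams are counted for an agent that uses STT and TTS together is not stated in the pricing above, so `streams_per_call` and `peak_factor` are hypothetical parameters to tune.

```python
def average_concurrent_streams(
    calls_per_hour: float,
    avg_call_minutes: float,
    streams_per_call: int = 1,  # assumption: may be 2 if STT and TTS count separately
) -> float:
    """Little's law: arrivals per hour times call duration in hours."""
    return (calls_per_hour * avg_call_minutes) / 60.0 * streams_per_call


def fits_cap(
    calls_per_hour: float,
    avg_call_minutes: float,
    cap: int,
    peak_factor: float = 2.0,  # assumption: peaks run about 2x the average
    streams_per_call: int = 1,
) -> bool:
    """True if the estimated peak concurrency stays within the tier's stream cap."""
    avg = average_concurrent_streams(calls_per_hour, avg_call_minutes, streams_per_call)
    return avg * peak_factor <= cap


if __name__ == "__main__":
    # 60 five-minute calls per hour, checked against the Free (3) and L (15) caps.
    print(average_concurrent_streams(60, 5))
    print(fits_cap(60, 5, cap=3), fits_cap(60, 5, cap=15))
```

With 60 five-minute calls per hour the average is 5 concurrent streams. That already exceeds the free tier's cap of 3 and, at a 2x peak, fits within the L tier's 15 only if each call uses a single stream.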

Competitive Positioning

Direct Competitors

Competitor	Differentiation
Cartesia	The closest analog: a research-born (SSM/Stanford) low-latency voice API vendor. Cartesia has a years-long production lead and published latency numbers; Gradium counters with full-duplex audio-LLM lineage and EU positioning
Deepgram Aura	Enterprise TTS/STT with sub-200ms published latency, domain-tuned vocabularies, and on-prem deployment; Gradium has no published benchmarks or on-prem story yet
ElevenLabs	The consumer-mindshare leader with dozens of languages and a full agent platform; Gradium is API-first infrastructure with 5 languages
OpenAI Realtime API	Bundled speech-to-speech inside the OpenAI ecosystem; Gradium is the unbundled, model-independent alternative from the team that beat Advanced Voice Mode to full-duplex^[2]

When to Choose Gradium Over Alternatives

Choose Gradium when: you want frontier audio-LLM research lineage in a self-serve API, European jurisdiction and French/German/Spanish/Portuguese quality matter, or you're betting early on the speech-to-speech direction.
Choose Cartesia when: you want the same research-born low-latency thesis with published benchmarks and production maturity.
Choose Deepgram Aura when: enterprise deployment flexibility (on-prem, private cloud) or domain-tuned vocabulary drives the decision.
Choose ElevenLabs when: voice variety, language breadth, or a bundled conversational-agent platform matters more than picking a pure API vendor.

Ideal Customer Profile

Best fit:

Voice-agent builders on LiveKit or Pipecat who want a low-latency European STT/TTS vendor with agent-native primitives (semantic VAD, flush, adaptive delay)
European products needing first-class French, German, Spanish, or Portuguese voice with EU jurisdiction
Studios, game developers, and language-learning platforms — Gradium's own named verticals — using cloning and expressive synthesis

Poor fit:

Teams needing published latency SLAs, dozens of languages, or on-prem deployment today
Phone-agent builders expecting native SIP/telephony rather than audio-format compatibility
Risk-averse buyers who require a vendor with more than two quarters of production history

Viability Assessment

Factor	Assessment
Financial Health	Exceptional for stage — $70M seed (FirstMark, Eurazeo, Niel, DST, Schmidt) three months post-founding^[1]
Market Position	Credible frontier entrant in a category owned by ElevenLabs, Cartesia, Deepgram, and OpenAI; differentiated on research lineage and Europe^[2]^[1]
Innovation Pace	High — production APIs, 237 voices, cloning, speech-to-speech WebSocket, and framework plugins within months^[4]^[7]
Community/Ecosystem	Thin but warm — no launch thread, third-party Go/Rust clients, Kyutai's open-source halo carries most of the goodwill^[10]^[3]
Long-term Outlook	Strong team and capital; depends on converting research lead into published, benchmarked production advantage before incumbents absorb full-duplex^[2]

The capital and team quality remove the usual seed-stage existence risk: these are four of the most-cited audio researchers in the world, with $70M and a famously efficient lab behind them (Moshi: 7B parameters vs. frontier text models' hundreds of billions).^[1]^[2] The open question is commercial, not technical: the production APIs are young, unbenchmarked in public, and five languages deep, while the incumbents Gradium must displace are shipping weekly.^[7]^[1]

Bottom Line

Gradium is the most research-credentialed new entrant in voice APIs: the Kyutai/Moshi team, which shipped open full-duplex conversation at ~160ms before OpenAI did, is now selling production STT/TTS with $70M from FirstMark, Eurazeo, Niel, DST, and Schmidt. The product is real and self-serve today, but it is two quarters old, publishes no latency benchmarks, and covers five languages in a market where incumbents cover dozens, so the bet is on trajectory, not the current spec sheet.

Recommended for: voice-agent builders (especially on LiveKit/Pipecat) who want frontier audio-model quality with EU jurisdiction and strong French/German/Spanish/Portuguese; teams positioning early for speech-to-speech.

Not recommended for: buyers needing published latency SLAs, broad language coverage, native telephony, or on-prem deployment today.

Outlook: Watch for published time-to-first-audio benchmarks, the speech-to-speech API maturing from WebSocket primitive to flagship product, language expansion beyond the launch five, and how the Kyutai/Gradium open-research-vs-closed-product boundary settles — that boundary is both the moat and the governance risk.

Research by Ry Walker Research • methodology

Sources