Key takeaways
- $70M seed announced December 2, 2025 — one of the largest European AI seeds of the year — led by FirstMark Capital and Eurazeo with Xavier Niel, DST Global Partners, and Eric Schmidt participating, three months after the company's September 2025 founding
- The team is the research core of Kyutai, the nonprofit Paris lab behind Moshi — the open-source 7B full-duplex audio LLM with ~160ms response latency that beat OpenAI's Advanced Voice Mode to market in summer 2024
- Production streaming STT/TTS APIs in five languages (English, French, German, Spanish, Portuguese) with 237 stock voices, instant and pro voice cloning, semantic VAD for turn-taking, and telephony audio formats
- Credit-based pricing from a free tier to $1,615/month — 1 TTS character = 1 credit, 1 second of STT = 3 credits — with an on-device TTS offering for edge deployment
FAQ
What is Gradium?
Gradium is a Paris-based voice AI company, spun out of the nonprofit research lab Kyutai, that sells real-time text-to-speech, speech-to-text, and voice-cloning APIs built on its own audio language models.
How much does Gradium cost?
Gradium sells credit-based monthly tiers ranging from free (45k credits, ~1 hour of TTS) to $1,615/month (45M credits, ~1,000 hours of TTS), plus a custom Tailored tier. 1 TTS character costs 1 credit and 1 second of STT costs 3 credits; commercial use requires a paid tier.
How is Gradium related to Kyutai and Moshi?
Gradium's founders created and led Kyutai, the nonprofit lab behind the open-source Moshi full-duplex audio model; Gradium is the commercial spinoff productionizing that research line, and the teams remain in close proximity.
How is Gradium different from Cartesia?
Both sell low-latency voice APIs built on novel in-house architectures, but Cartesia is a US company several product generations in, while Gradium is a months-old European entrant differentiating on its full-duplex audio-LLM research lineage, EU base, and five-language launch coverage.
Executive Summary
Gradium is what happens when a frontier research lab decides to invoice: the commercial spinoff of Kyutai, the Xavier Niel-backed nonprofit Paris lab whose open-source Moshi model — a 7B-parameter full-duplex audio LLM with ~160ms response latency — shipped real-time voice conversation in summer 2024, before OpenAI's Advanced Voice Mode reached the public.[1][2][3] Founded in September 2025, Gradium came out of stealth on December 2, 2025 with a $70M seed led by FirstMark Capital and Eurazeo, joined by Niel, DST Global Partners, and Eric Schmidt.[1]
The product is deliberately narrower than the research: production streaming speech-to-text and text-to-speech APIs — plus instant and pro voice cloning — in English, French, German, Spanish, and Portuguese, which the company says were serving paying customers within three months of inception.[4][1] The founding four — Neil Zeghidour (CEO, ex-Google DeepMind), Olivier Teboul (CTO, ex-Google Brain), Laurent Mazaré (Chief Coding Officer, ex-DeepMind/Jane Street), and Alexandre Défossez (CSO, ex-Meta) — were Kyutai's audio research core, and Gradium keeps what it calls "a natural proximity" with the lab's researchers.[5][2][4]
| Attribute | Value |
|---|---|
| Company | Gradium (Paris, France)[1] |
| Founders | Neil Zeghidour (CEO), Olivier Teboul (CTO), Laurent Mazaré (Chief Coding Officer), Alexandre Défossez (CSO)[5] |
| Founded | September 2025; out of stealth December 2, 2025[1] |
| Funding | $70M seed led by FirstMark Capital and Eurazeo; Xavier Niel, DST Global Partners, Eric Schmidt[1] |
| Lineage | Commercial spinoff of nonprofit lab Kyutai (founded 2023; Moshi, Hibiki)[2] |
| Open Source | Kyutai's models (Moshi) are open source on GitHub; Gradium's production models are proprietary[3][6] |
Product Overview
Gradium sells the unglamorous half of the voice stack: streaming transcription in, streaming synthesis out, over WebSockets, with the conversational plumbing (semantic voice-activity detection for turn-taking, adaptive delay controls, flush commands) exposed as API primitives rather than hidden inside an agent product.[7] The target buyer is a developer assembling a real-time voice agent, dubbing pipeline, or game NPC, not an end user; cited verticals include health, customer support, market research, gaming NPCs, and advertising avatars.[4]
The library ships 237 stock voices across the five launch languages, and custom voices come in two grades: instant clones from a short audio sample on every tier, and pro clones reserved for the upper tiers.[7][8] A speech-to-speech WebSocket — audio in, audio out on a single connection — points at where the Moshi lineage is headed, and an on-device TTS offering targets edge deployment.[7][6]
Key Capabilities
| Capability | Description |
|---|---|
| Streaming TTS | WebSocket streaming synthesis with speed, temperature, and voice-similarity controls[7] |
| Streaming STT | Real-time transcription with semantic VAD and flush for turn-taking[7] |
| Voice cloning | Instant clones from a short sample on all tiers; pro clones on M tier and up[7][8] |
| Speech-to-speech | Audio-in/audio-out over a single WebSocket[7] |
| Languages | English, French, German, Spanish, Portuguese; more announced as coming[1] |
| Telephony formats | Mu-law, A-law, and low-sample-rate PCM support[7] |
| Pronunciation control | Custom pronunciation dictionaries and text-rewrite rules[7] |
| On-device TTS | Edge/on-device synthesis offering[6] |
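For readers unfamiliar with the telephony formats in the table, the sketch below shows the idea behind mu-law: samples are compressed logarithmically so that quiet speech keeps more resolution in 8 bits. It uses the textbook continuous companding formula with μ = 255, not the bit-exact segmented encoding of G.711, and it is not taken from Gradium's API or SDK. The function names are hypothetical.

```python
import math

MU = 255  # companding parameter used by mu-law telephony


def mu_law_compress(x: float, mu: int = MU) -> float:
    """Compress a sample in [-1, 1] to [-1, 1] on a logarithmic curve."""
    x = max(-1.0, min(1.0, x))  # clamp out-of-range input
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y: float, mu: int = MU) -> float:
    """Invert mu_law_compress: recover the linear sample from a companded one."""
    y = max(-1.0, min(1.0, y))
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def quantize_8bit(y: float) -> int:
    """Map a companded value in [-1, 1] to an integer code in 0..255.

    Illustrative uniform quantizer only; real G.711 uses a segmented code.
    """
    return int(round((y + 1.0) / 2.0 * 255))
```

A quiet sample such as 0.01 compresses to roughly 0.23, which is why low-level speech survives 8-bit quantization better than it would with linear PCM at the same bit depth.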
Product Surfaces
| Surface | Description | Availability |
|---|---|---|
| WebSocket + REST APIs | Streaming TTS/STT plus one-shot HTTP synthesis; OpenAPI spec | GA[7] |
| Python SDK | First-party SDK; browser access via short-lived tokens | GA[7] |
| Framework integrations | LiveKit Agents, Pipecat, OpenClaw plugins | GA[7] |
| Studio / demos | studio.gradium.ai playground, Gradbot, agent demo, app gallery | GA[6] |
| On-device TTS | Edge SDK offering | Listed; details undisclosed[6] |
Technical Architecture
Gradium builds its own audio language models rather than wrapping anyone else's. The founding team's thesis, laid out in Amplify's profile, is that audio is the one modality where small labs beat big ones because it rewards "clever ideas efficiently executed, not just scale": Moshi reached real-time full-duplex conversation with 7B parameters trained on 2.1T tokens and 7M hours of audio, against Llama 3.1's 405B parameters and 15T tokens.[2] The production API models themselves are unnamed in public materials, and, notably for a latency-led pitch, Gradium publishes no time-to-first-audio benchmark numbers on its site or docs as of June 2026; "ultra-low latency" is asserted, with Moshi's ~160ms research result as the lineage proof point.[6][7][2]
Key Technical Details
| Aspect | Detail |
|---|---|
| Deployment | Managed cloud API; on-device TTS offering for edge[6] |
| Models | Proprietary in-house audio LLMs descended from Kyutai's research line (Moshi, Hibiki)[2][4] |
| Transport | WebSocket streaming (multiplexed requests per connection) + REST one-shot[7] |
| Integrations | LiveKit Agents, Pipecat, OpenClaw; telephony audio formats but no native SIP/phone-number product[7] |
| Open Source | Production models closed; sibling lab Kyutai's Moshi is open source[3] |
Strengths
- The category's strongest research pedigree: the founders built Moshi, the first open full-duplex audio LLM (~160ms, summer 2024), and ran the only dedicated open audio lab; this is the team incumbents cite, now commercializing its own work.[2][3]
- $70M seed with name-brand conviction: a round led by FirstMark and Eurazeo, with Niel, DST, and Eric Schmidt participating, closed three months after founding, which makes it an outlier seed for a European API company.[1]
- Production speed — streaming STT/TTS APIs serving paying customers within three months of inception, with a credit-metered self-serve funnel from day one.[4][8]
- European multilingual launch posture — five languages at launch (vs. the English-first norm), EU jurisdiction, and telephony-grade audio formats make it a natural fit for European contact-center and compliance-sensitive buyers.[1][7]
- Agent-aware API design — semantic VAD, adaptive delay controls, and WebSocket multiplexing are conversational-agent primitives, not afterthoughts, and LiveKit/Pipecat plugins slot it into the standard voice-agent stacks.[7]
Cautions
- No published latency numbers — for a company whose entire pitch is "ultra-low latency," the absence of public time-to-first-audio benchmarks on the site or docs as of June 2026 is conspicuous; buyers must benchmark themselves.[6][7]
- Months old, in a knife-fight category — production APIs are barely two quarters old, against ElevenLabs, Cartesia, Deepgram, and OpenAI's Realtime API, all with years of production hardening and far larger language coverage.[1]
- Five languages is launch coverage, not parity — ElevenLabs and rivals support dozens of languages; Gradium's "additional languages coming" is a roadmap promise.[1]
- Demo robustness questions — an HN commenter found the public demo produces "weird noises and random words" on nonsense input, a hallucination mode typical of audio-LLM architectures that matters in production edge cases.[9]
- No native telephony product — mu-law/A-law format support is not SIP trunking or phone numbers; teams building phone agents still need LiveKit, Pipecat, or a Vapi-style orchestrator on top.[7]
- Spinoff structure is untested — the nonprofit-lab-to-commercial-spinoff relationship (Kyutai keeps researching, Gradium productionizes "in proximity") has no governance details public, and the research the valuation is priced on lives partly outside the company.[4][2]
What Developers Say
There is no dedicated HN launch thread for Gradium as of June 2026 — only low-engagement Show HN posts for third-party Go and Rust client libraries; substantive discussion happens in the 319-point February 2026 thread on Amplify's Kyutai/Gradium essay, where sentiment runs warmer on the research than on the framing.[10][9]
"Moshi was an amazing tech demo... But, this piece is a fluff piece: 'underfunded' means... around $400 million" — shenberg on Hacker News, counting Kyutai's ~$330M initial backing plus Gradium's $70M[9]
"I was suspicious that they are not mentioned, but then I realized this is a VC opinion piece" — gorgoiler on Hacker News, on the essay omitting ElevenLabs[9]
"for a laugh enter nonsense at gradium.ai — You get all kinds of weird noises and random words" — RobMurray on Hacker News[9]
"Gradium, a commercial company offshoot of Kyutai (open source lab), are focusing on emotion recognition and contextual emotion selection." — sofixa on Hacker News[10]
One caveat: a Gradium GTM employee (pain_perdu) actively works HN threads — offering free credits, claiming "excellent background noise suppression" and Spanish-English code-switching — so some pro-Gradium framing in comment sections is vendor voice.[10]
Pricing & Licensing
Pricing is credit-metered rather than per-minute: 1 character of TTS costs 1 credit and 1 second of STT costs 3 credits, across six monthly tiers.[8]
| Tier | Price | Includes |
|---|---|---|
| Free | $0 | 45k credits (~1 hr TTS / 4 hrs STT), 5 instant clones, 3 concurrent streams; non-commercial only |
| XS | $13/mo | 225k credits (~5 hrs TTS / 21 hrs STT); commercial use unlocks here |
| S | $43/mo | 900k credits (~20 hrs TTS / 83 hrs STT) |
| M | $340/mo | 9M credits (~200 hrs TTS / 833 hrs STT), 5 pro clones |
| L | $1,615/mo | 45M credits (~1,000 hrs TTS / 4,167 hrs STT), 20 pro clones, 15 concurrent streams |
| Tailored | Custom | Unlimited credits, clones, and concurrency |
Overage runs $6.90 per additional 100k credits on XS, falling to $3.80 on L. All pricing as of June 2026.[8]
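The credit rules and tier table above can be turned into a quick sizing calculation. The sketch below is a minimal illustration, not Gradium's SDK: the class and function names are hypothetical, and the TTS-hours conversion assumes about 45,000 characters per hour of speech, a figure implied by the table (45k credits ≈ 1 hour of TTS) rather than stated by Gradium. The Tailored tier is left out because its price and limits are custom.

```python
from dataclasses import dataclass
from typing import Optional

TTS_CREDITS_PER_CHAR = 1      # from the pricing page: 1 character = 1 credit
STT_CREDITS_PER_SECOND = 3    # from the pricing page: 1 second = 3 credits
# Assumption: implied by "45k credits ~ 1 hr TTS" in the tier table.
ASSUMED_TTS_CHARS_PER_HOUR = 45_000


@dataclass(frozen=True)
class Tier:
    name: str
    price_usd: float
    credits: int


# Monthly tiers as listed in the table (June 2026), smallest first.
TIERS = [
    Tier("Free", 0, 45_000),
    Tier("XS", 13, 225_000),
    Tier("S", 43, 900_000),
    Tier("M", 340, 9_000_000),
    Tier("L", 1615, 45_000_000),
]


def credits_needed(tts_chars: int, stt_seconds: float) -> float:
    """Total credits for a workload of synthesized characters and transcribed seconds."""
    return tts_chars * TTS_CREDITS_PER_CHAR + stt_seconds * STT_CREDITS_PER_SECOND


def tts_hours(credits: float) -> float:
    """Approximate hours of TTS the credits buy, under the chars-per-hour assumption."""
    return credits / (ASSUMED_TTS_CHARS_PER_HOUR * TTS_CREDITS_PER_CHAR)


def stt_hours(credits: float) -> float:
    """Hours of STT the credits buy (exact, given 3 credits per second)."""
    return credits / (STT_CREDITS_PER_SECOND * 3600)


def usd_per_100k_credits(tier: Tier) -> float:
    """Effective in-plan price per 100k credits, comparable to the overage rates."""
    return tier.price_usd / tier.credits * 100_000


def smallest_tier(credits: float, commercial: bool = True) -> Optional[Tier]:
    """Smallest listed tier covering the credits; None means overage or Tailored."""
    for tier in TIERS:
        if commercial and tier.price_usd == 0:
            continue  # the free tier is non-commercial only
        if tier.credits >= credits:
            return tier
    return None
```

As a worked example, a call that synthesizes 3,000 characters and transcribes 300 seconds of audio costs 3,000 + 900 = 3,900 credits. Comparing `usd_per_100k_credits` with the overage rates also shows that overage is priced above the in-plan rate at both ends of the range: about $5.78 in-plan versus $6.90 overage on XS, and about $3.59 versus $3.80 on L.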
Licensing model: Proprietary managed API; the adjacent Kyutai lab open-sources its research models (Moshi), but Gradium's production models are closed.[3][6]
Hidden costs: Free-tier output is non-commercial; concurrency caps (3–15 streams below Tailored) bind real-time agent fleets early; pro voice clones require the $340/month tier; on-device TTS licensing is unpublished.[8][6]
Competitive Positioning
Direct Competitors
| Competitor | Differentiation |
|---|---|
| Cartesia | The closest analog — a research-born (SSM/Stanford) low-latency voice API vendor; Cartesia has years of production lead and published latency numbers, Gradium counters with full-duplex audio-LLM lineage and EU positioning |
| Deepgram Aura | Enterprise TTS/STT with sub-200ms published latency, domain-tuned vocabularies, and on-prem deployment; Gradium has no published benchmarks or on-prem story yet |
| ElevenLabs | The consumer-mindshare leader with dozens of languages and a full agent platform; Gradium is API-first infrastructure with 5 languages |
| OpenAI Realtime API | Bundled speech-to-speech inside the OpenAI ecosystem; Gradium is the unbundled, model-independent alternative from the team that beat Advanced Voice Mode to full-duplex[2] |
When to Choose Gradium Over Alternatives
- Choose Gradium when: you want frontier audio-LLM research lineage in a self-serve API, European jurisdiction and French/German/Spanish/Portuguese quality matter, or you're betting early on the speech-to-speech direction.
- Choose Cartesia when: you want the same research-born low-latency thesis with published benchmarks and production maturity.
- Choose Deepgram Aura when: enterprise deployment flexibility (on-prem, private cloud) or domain-tuned vocabulary drives the decision.
- Choose ElevenLabs when: voice variety, language breadth, or a bundled conversational-agent platform matters more than picking a pure API vendor.
Ideal Customer Profile
Best fit:
- Voice-agent builders on LiveKit or Pipecat who want a low-latency European STT/TTS vendor with agent-native primitives (semantic VAD, flush, adaptive delay)
- European products needing first-class French, German, Spanish, or Portuguese voice with EU jurisdiction
- Studios, game developers, and language-learning platforms — Gradium's own named verticals — using cloning and expressive synthesis
Poor fit:
- Teams needing published latency SLAs, dozens of languages, or on-prem deployment today
- Phone-agent builders expecting native SIP/telephony rather than audio-format compatibility
- Risk-averse buyers who require a vendor with more than two quarters of production history
Viability Assessment
| Factor | Assessment |
|---|---|
| Financial Health | Exceptional for stage — $70M seed (FirstMark, Eurazeo, Niel, DST, Schmidt) three months post-founding[1] |
| Market Position | Credible frontier entrant in a category owned by ElevenLabs, Cartesia, Deepgram, and OpenAI; differentiated on research lineage and Europe[2][1] |
| Innovation Pace | High — production APIs, 237 voices, cloning, speech-to-speech WebSocket, and framework plugins within months[4][7] |
| Community/Ecosystem | Thin but warm: no launch thread and only third-party Go/Rust clients, with Kyutai's open-source halo carrying most of the goodwill[10][3] |
| Long-term Outlook | Strong team and capital; depends on converting research lead into published, benchmarked production advantage before incumbents absorb full-duplex[2] |
The capital and team quality remove the usual seed-stage existence risk — this is four of the most-cited audio researchers in the world with $70M and a famously efficient lab behind them (Moshi: 7B parameters vs. frontier text models' hundreds of billions).[1][2] The open question is commercial, not technical: the production APIs are young, unbenchmarked in public, and five languages deep, while the incumbents Gradium must displace are shipping weekly.[7][1]
Bottom Line
Gradium is the most research-credentialed new entrant in voice APIs: the Kyutai/Moshi team, which shipped open full-duplex conversation at ~160ms before OpenAI did, is now selling production STT/TTS with a $70M seed from FirstMark, Eurazeo, Niel, DST, and Schmidt. The product is real and self-serve today, but it is two quarters old, publishes no latency benchmarks, and covers five languages in a market where incumbents cover dozens, so the bet is on trajectory, not the current spec sheet.
Recommended for: voice-agent builders (especially on LiveKit/Pipecat) who want frontier audio-model quality with EU jurisdiction and strong French/German/Spanish/Portuguese; teams positioning early for speech-to-speech.
Not recommended for: buyers needing published latency SLAs, broad language coverage, native telephony, or on-prem deployment today.
Outlook: Watch for published time-to-first-audio benchmarks, the speech-to-speech API maturing from WebSocket primitive to flagship product, language expansion beyond the launch five, and how the Kyutai/Gradium open-research-vs-closed-product boundary settles — that boundary is both the moat and the governance risk.
Research by Ry Walker Research
Sources
- [1] TechCrunch: Paris-based AI voice startup Gradium nabs $70M seed
- [2] Amplify Partners: Arming the Rebels with GPUs — Gradium, Kyutai, and Audio AI
- [3] Kyutai Labs: Moshi GitHub Repository
- [4] Gradium Blog: Solving Voice
- [5] French Tech Journal: Gradium Wants To Make Voice The New Operating System for AI
- [6] Gradium Website
- [7] Gradium API Documentation
- [8] Gradium Pricing
- [9] Hacker News: Audio is the one area small labs are winning
- [10] Gradium mentions on Hacker News (Algolia search)