News

21 April 2026

xAI Just Launched Voice APIs That Could Shake Up the Entire Industry

Grok's new Speech-to-Text and Text-to-Speech APIs are priced at a fraction of ElevenLabs and Deepgram, and they were built to power Tesla's in-car AI and Starlink's customer support.

Delv Editorial

Delv Team

Quietly — and then very loudly on X — xAI dropped two new voice APIs for Grok. Speech-to-Text (STT) and Text-to-Speech (TTS) are now available via the xAI API, and the pricing is aggressive enough to make every voice AI startup nervous.

What Launched

xAI's voice offering covers both directions of audio AI:

Speech-to-Text

Batch transcription: $0.10 per hour of audio
Streaming transcription: $0.20 per hour of audio
Speaker diarization included
25+ languages supported
Word error rates that reportedly beat ElevenLabs, Deepgram, and AssemblyAI

Text-to-Speech

$4.20 per million characters
Real-time streaming via WebSocket (wss://api.x.ai/v1/realtime)
Expressive voice tags: [laugh], [sigh], <whisper>, <emphasis>
Voices trained to actually sound human

The API is OpenAI-compatible, which means existing integrations can point to a new base URL and largely just work.
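As a rough illustration of what "point to a new base URL" means, the sketch below builds (but does not send) an OpenAI-style text-to-speech request using only the Python standard library. Several things here are assumptions rather than facts from xAI: the REST base URL `https://api.x.ai/v1` is inferred from the WebSocket host quoted above, the `/audio/speech` path follows OpenAI's convention, and the model and voice names are placeholders. Check the official documentation before relying on any of them.

```python
import json
import urllib.request

# Assumption: REST base URL inferred from the WebSocket host quoted in the article.
XAI_BASE_URL = "https://api.x.ai/v1"
OPENAI_BASE_URL = "https://api.openai.com/v1"


def build_tts_request(base_url: str, api_key: str, text: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style text-to-speech request."""
    payload = {
        "model": "example-tts-model",  # hypothetical model name
        "voice": "example-voice",      # hypothetical voice name
        "input": text,
    }
    return urllib.request.Request(
        # The path follows OpenAI's convention; it is assumed, not confirmed, for xAI.
        url=f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Only the base URL changes between providers; the calling code stays the same.
req = build_tts_request(XAI_BASE_URL, "YOUR_API_KEY", "Hello from Grok.")
print(req.get_method(), req.full_url)
```

If the compatibility claim holds, swapping `XAI_BASE_URL` for `OPENAI_BASE_URL` (plus the matching key and model name) would be the only change an existing integration needs.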

Why the Pricing Matters

ElevenLabs, the reigning champion of voice AI, charges roughly $40 or more per million characters for TTS, against xAI's $4.20. Deepgram and AssemblyAI are no slouches either, but both have been positioned as premium infrastructure plays, with streaming STT at roughly $0.65 to $0.68 per hour against xAI's $0.20. That puts xAI at roughly 10x cheaper than ElevenLabs on TTS, in line with comparisons circulating on X.

This isn't a startup trying to buy market share with VC money. xAI built this voice stack to power Tesla's in-car AI and Starlink's customer support systems — meaning it's been battle-tested at real scale before it ever hit a public API.

What Makes It Different

Most TTS products sound like they're reading from a script. The expressive tag system xAI ships — [laugh], <whisper>, <emphasis> — gives developers fine-grained emotional control without requiring a full audio production workflow.
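To make the tag system concrete, here is a minimal sketch of composing tagged input text in Python. The helper names `with_cue` and `wrap` are hypothetical, and the closing-tag form (`</whisper>`) is an assumption: the launch material only shows the opening forms, so the exact syntax should be confirmed against xAI's documentation.

```python
def with_cue(cue: str) -> str:
    """Return a standalone bracketed cue such as [laugh] or [sigh]."""
    return f"[{cue}]"


def wrap(style: str, text: str) -> str:
    """Wrap text in a style tag such as <whisper>.

    The closing-tag form used here is an assumption, not confirmed syntax.
    """
    return f"<{style}>{text}</{style}>"


script = " ".join([
    "I can't believe that worked.",
    with_cue("laugh"),
    wrap("whisper", "Don't tell anyone."),
    "It was " + wrap("emphasis", "really") + " close.",
])
print(script)
```

The resulting string would be sent as the ordinary TTS input text, so the emotional direction travels inline with the words instead of in a separate production step.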

The real-time API uses a proper WebSocket session model with VAD (voice activity detection), turn-taking, and interruption handling built in. It's not a request-response wrapper over audio files — it's designed for live, conversational interfaces.
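The turn-taking and interruption behavior described here can be pictured as a small state machine. The toy model below is not xAI's protocol: the event names (`speech_started`, `reply_audio_started`, and so on) and the `Conversation` class are invented for illustration, and a real session would receive comparable signals over the WebSocket connection.

```python
from enum import Enum


class Turn(Enum):
    LISTENING = "listening"  # waiting for, or hearing, the user
    THINKING = "thinking"    # user finished; a reply is being prepared
    SPEAKING = "speaking"    # agent audio is playing


class Conversation:
    """Toy turn-taking model driven by hypothetical VAD and playback events."""

    def __init__(self) -> None:
        self.state = Turn.LISTENING
        self.interruptions = 0

    def on_event(self, event: str) -> Turn:
        if event == "speech_started":
            # Barge-in: the user speaks while the agent is replying or
            # preparing a reply, so that reply is abandoned.
            if self.state in (Turn.SPEAKING, Turn.THINKING):
                self.interruptions += 1
            self.state = Turn.LISTENING
        elif event == "speech_stopped" and self.state is Turn.LISTENING:
            self.state = Turn.THINKING
        elif event == "reply_audio_started" and self.state is Turn.THINKING:
            self.state = Turn.SPEAKING
        elif event == "reply_audio_finished" and self.state is Turn.SPEAKING:
            self.state = Turn.LISTENING
        # Any other event/state combination is ignored.
        return self.state


convo = Conversation()
events = [
    "speech_started", "speech_stopped", "reply_audio_started",
    "speech_started",  # the user interrupts mid-reply
    "speech_stopped", "reply_audio_started", "reply_audio_finished",
]
for event in events:
    print(f"{event:>22} -> {convo.on_event(event).value}")
```

The difference from a request-response wrapper is the `speech_started` branch: the user can reclaim the turn at any moment, and the session has to drop whatever the agent was doing.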

The Competitive Picture

Provider	TTS Pricing	STT (streaming)
xAI (Grok)	$4.20 / 1M chars	$0.20 / hr
ElevenLabs	~$40+ / 1M chars	N/A
Deepgram	varies	~$0.68 / hr
AssemblyAI	varies	~$0.65 / hr

(Pricing from public documentation and community benchmarks — verify current rates before production use.)
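To see what these rates mean for a budget, the sketch below applies the table's figures to a made-up monthly workload. The rates are the approximate ones quoted above (ElevenLabs is taken at the $40 lower bound), and the workload of 50 million characters and 1,000 streaming hours is purely hypothetical.

```python
# Rates copied from the comparison table (USD); approximate and subject to change.
TTS_USD_PER_MILLION_CHARS = {"xAI (Grok)": 4.20, "ElevenLabs": 40.00}
STT_STREAMING_USD_PER_HOUR = {"xAI (Grok)": 0.20, "Deepgram": 0.68, "AssemblyAI": 0.65}


def tts_cost(provider: str, characters: int) -> float:
    """Cost in USD to synthesize the given number of characters."""
    return TTS_USD_PER_MILLION_CHARS[provider] * characters / 1_000_000


def stt_streaming_cost(provider: str, hours: float) -> float:
    """Cost in USD to transcribe the given hours of streamed audio."""
    return STT_STREAMING_USD_PER_HOUR[provider] * hours


# Hypothetical monthly workload: 50M characters synthesized, 1,000 hours streamed.
for name in TTS_USD_PER_MILLION_CHARS:
    print(f"TTS  {name}: ${tts_cost(name, 50_000_000):,.2f}")
for name in STT_STREAMING_USD_PER_HOUR:
    print(f"STT  {name}: ${stt_streaming_cost(name, 1_000):,.2f}")
```

At these figures the TTS bill comes to about $210 on xAI against about $2,000 on ElevenLabs, a ratio of roughly 9.5x, which is where the "10x cheaper" claim comes from.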

What This Means

xAI is not trying to be a voice AI company. They built voice infrastructure as internal tooling for Tesla and Starlink and are now selling access to the excess capacity — the same playbook AWS ran with cloud compute. That's a structurally different business than a startup whose entire value proposition is audio quality.

If the benchmarks hold up under real-world load, this is a meaningful moment for anyone building voice-enabled AI products. The floor just got lower.

Explore the xAI Voice API →

The Delv editorial team reviews AI tools, MCP servers, Agent Skills, and autonomous agents. Reviews are drafted with AI assistance and human oversight. Every install command and config snippet is verified against the source. We're independent, we don't sell tools, and we say when something isn't worth it.

