xAI Just Launched Voice APIs That Could Shake Up the Entire Industry
Grok's new Speech-to-Text and Text-to-Speech APIs are priced at a fraction of ElevenLabs and Deepgram, and they were built for Tesla cars and Starlink customer support.
Quietly — and then very loudly on X — xAI dropped two new voice APIs for Grok. Speech-to-Text (STT) and Text-to-Speech (TTS) are now available via the xAI API, and the pricing is aggressive enough to make every voice AI startup nervous.
What Launched
xAI's voice offering covers both directions of audio AI:
Speech-to-Text
- Batch transcription: $0.10 per hour of audio
- Streaming transcription: $0.20 per hour of audio
- Speaker diarization included
- 25+ languages supported
- Word error rates that reportedly beat ElevenLabs, Deepgram, and AssemblyAI
Text-to-Speech
- $4.20 per million characters
- Real-time streaming via WebSocket (wss://api.x.ai/v1/realtime)
- Expressive voice tags: [laugh], [sigh], <whisper>, <emphasis>
- Voices trained to actually sound human
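To make the list prices concrete, here is a minimal cost sketch in Python. It uses only the three rates quoted above; the function names and the example workload are hypothetical, and it assumes billing is strictly linear with no minimums or volume tiers, which the announcement does not confirm.

```python
# Rates quoted in the launch pricing (USD).
STT_BATCH_USD_PER_HOUR = 0.10
STT_STREAMING_USD_PER_HOUR = 0.20
TTS_USD_PER_MILLION_CHARS = 4.20


def stt_cost(audio_hours, streaming=False):
    """Estimated transcription cost for a number of audio hours."""
    rate = STT_STREAMING_USD_PER_HOUR if streaming else STT_BATCH_USD_PER_HOUR
    return audio_hours * rate


def tts_cost(characters):
    """Estimated synthesis cost for a number of input characters."""
    return characters / 1_000_000 * TTS_USD_PER_MILLION_CHARS


# Hypothetical monthly workload: 1,000 hours of live calls
# plus 50 million characters of synthesized replies.
monthly = stt_cost(1000, streaming=True) + tts_cost(50_000_000)
print(f"${monthly:,.2f}")  # $410.00
```

Whether expressive tags count toward the character total is not stated, so treat the TTS figure as an estimate.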
The API is OpenAI-compatible, which means existing integrations can point to a new base URL and largely just work.
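In practice, "point to a new base URL" might look like the sketch below, which builds (but does not send) an OpenAI-style speech request using only the standard library. Several things here are assumptions rather than confirmed details: that the xAI HTTP base URL is `https://api.x.ai/v1` (inferred from the WebSocket host), that the path mirrors OpenAI's `/audio/speech`, and the model and voice names, which are placeholders.

```python
import json
import os
import urllib.request

OPENAI_BASE_URL = "https://api.openai.com/v1"
XAI_BASE_URL = "https://api.x.ai/v1"  # assumption: inferred from the WebSocket host


def build_speech_request(base_url, api_key, text, model, voice):
    """Build an OpenAI-style text-to-speech request without sending it."""
    payload = {"model": model, "voice": voice, "input": text}
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",  # assumption: same path as OpenAI's API
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Only the base URL and key change between providers.
req = build_speech_request(
    XAI_BASE_URL,
    os.environ.get("XAI_API_KEY", "placeholder-key"),
    "Hello from the car.",
    model="example-tts-model",  # placeholder, not a real model name
    voice="example-voice",      # placeholder, not a real voice name
)
print(req.full_url)
```

Check the provider's reference for the actual paths and model identifiers before sending anything.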
Why the Pricing Matters
ElevenLabs, the reigning champion of voice AI, charges significantly more for comparable output: roughly $40 or more per million characters against xAI's $4.20, about a 10x gap by some comparisons circulating on X. Deepgram and AssemblyAI are no slouches either, but both have been positioned as premium infrastructure plays.
This isn't a startup trying to buy market share with VC money. xAI built this voice stack to power Tesla's in-car AI and Starlink's customer support systems — meaning it's been battle-tested at real scale before it ever hit a public API.
What Makes It Different
Most TTS products sound like they're reading from a script. The expressive tag system xAI ships — [laugh], <whisper>, <emphasis> — gives developers fine-grained emotional control without requiring a full audio production workflow.
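Since the tags travel inline with the text, a typo in a tag could plausibly end up spoken aloud or rejected. The sketch below is a hypothetical pre-flight check that flags any tag not named in this article. It assumes angle-bracket tags wrap a span and close as `</whisper>`; that closing syntax is a guess, not something the announcement specifies.

```python
import re

# Tags named in the article.
BRACKET_TAGS = {"laugh", "sigh"}
ANGLE_TAGS = {"whisper", "emphasis"}

# Matches [name], <name>, and </name>. The closing form is an assumption.
TAG_PATTERN = re.compile(r"\[(\w+)\]|</?(\w+)>")


def unknown_tags(text):
    """Return every tag in the text that is not in the known sets."""
    unknown = []
    for match in TAG_PATTERN.finditer(text):
        bracket, angle = match.groups()
        if bracket is not None and bracket not in BRACKET_TAGS:
            unknown.append(match.group(0))
        elif angle is not None and angle not in ANGLE_TAGS:
            unknown.append(match.group(0))
    return unknown


script = "Sure thing [laugh] <whisper>between us</whisper> [cough]"
print(unknown_tags(script))  # ['[cough]']
```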
The real-time API uses a proper WebSocket session model with VAD (voice activity detection), turn-taking, and interruption handling built in. It's not a request-response wrapper over audio files — it's designed for live, conversational interfaces.
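For readers new to VAD, here is a toy energy-threshold detector that shows the idea of deciding when a speaker's turn has ended. It is not xAI's implementation, which is not public; the frame size, threshold, and hangover values are arbitrary illustrative choices.

```python
import math


def frame_rms(frame):
    """Root-mean-square energy of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_speech(samples, frame_size=160, threshold=0.02, hangover_frames=3):
    """Return one boolean per full frame: True while speech is active.

    Speech starts on the first loud frame and ends once
    `hangover_frames` consecutive quiet frames have been seen.
    """
    flags = []
    quiet_run = hangover_frames  # start in the "silent" state
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        if frame_rms(frame) >= threshold:
            quiet_run = 0
        else:
            quiet_run += 1
        flags.append(quiet_run < hangover_frames)
    return flags


# Synthetic input: silence, a short 440 Hz tone at 16 kHz, then silence.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(320)]
samples = [0.0] * 320 + tone + [0.0] * 800
flags = detect_speech(samples)
print(flags)
```

The hangover keeps a brief pause from being treated as the end of a turn, which is the kind of decision a conversational session has to make before it lets the model respond.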
The Competitive Picture
| Provider | TTS Pricing | STT (streaming) |
|----------|-------------|-----------------|
| xAI (Grok) | $4.20 / 1M chars | $0.20 / hr |
| ElevenLabs | ~$40+ / 1M chars | N/A |
| Deepgram | varies | ~$0.68 / hr |
| AssemblyAI | varies | ~$0.65 / hr |
(Pricing from public documentation and community benchmarks — verify current rates before production use.)
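The price gaps implied by those figures can be checked with a few lines of arithmetic. The sketch below takes the approximate rates at face value and treats "~$40+" as a lower bound of $40, so the TTS ratio is a floor rather than an exact figure.

```python
# Approximate rates from the comparison table (USD).
XAI_TTS_PER_MILLION = 4.20
ELEVENLABS_TTS_PER_MILLION = 40.0  # listed as "~$40+", used as a lower bound

STT_STREAMING_PER_HOUR = {
    "xAI (Grok)": 0.20,
    "Deepgram": 0.68,
    "AssemblyAI": 0.65,
}

# How many times more expensive than xAI each alternative is.
tts_ratio = ELEVENLABS_TTS_PER_MILLION / XAI_TTS_PER_MILLION
stt_ratios = {
    provider: rate / STT_STREAMING_PER_HOUR["xAI (Grok)"]
    for provider, rate in STT_STREAMING_PER_HOUR.items()
}

print(f"ElevenLabs TTS vs xAI: at least {tts_ratio:.1f}x")
for provider, ratio in stt_ratios.items():
    print(f"{provider} streaming STT vs xAI: {ratio:.2f}x")
```

On these numbers the TTS gap is a little under 10x and the streaming STT gap is a bit over 3x.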
What This Means
xAI is not trying to be a voice AI company. They built voice infrastructure as internal tooling for Tesla and Starlink and are now selling access to the excess capacity — the same playbook AWS ran with cloud compute. That's a structurally different business than a startup whose entire value proposition is audio quality.
If the benchmarks hold up under real-world load, this is a meaningful moment for anyone building voice-enabled AI products. The floor just got lower.