
Local TTS at Zero Cost

ai · tts · local-inference · tools

I was paying OpenAI $15 per million characters for text-to-speech. For development and testing, that adds up fast — generating audio previews, testing voice features, iterating on TTS-heavy workflows. Then I found Kokoro-82M and haven't called a TTS API since.

the numbers

  • Model size: 82 million parameters, 337MB on disk
  • Latency: Under 50ms to first audio on an M-series Mac
  • Quality: Broadcast-level. Not "good for a local model." Actually good.
  • Cost: $0. Runs entirely on your machine.
  • Voices: Multiple speakers, adjustable speed, natural pauses

For comparison, ElevenLabs charges $5/month for 10k characters at their cheapest tier. Google Cloud TTS is $4 per million characters. Azure is similar. Kokoro is free and sounds competitive with all of them for standard use cases.
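To make the gap concrete, here's a back-of-the-envelope comparison at the per-million-character rates above. The 5M-character monthly volume is an arbitrary illustration, not a benchmark:

```python
# Rough monthly TTS cost at each provider's per-million-character rate.
# The 5M-character volume is an arbitrary example figure.
PRICE_PER_MILLION_CHARS = {
    "OpenAI TTS": 15.00,
    "Google Cloud TTS": 4.00,
    "Kokoro (local)": 0.00,
}

monthly_chars = 5_000_000
for provider, rate in PRICE_PER_MILLION_CHARS.items():
    cost = monthly_chars / 1_000_000 * rate
    print(f"{provider}: ${cost:.2f}/month")
# OpenAI TTS: $75.00/month
# Google Cloud TTS: $20.00/month
# Kokoro (local): $0.00/month
```

At that volume the cloud bill is real money every month; the local model's cost stays fixed at the one-time download.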

setting it up

The model runs through Python. Installation is straightforward:

pip install kokoro-onnx soundfile
 
# One-time download: fetch the model weights (kokoro-v1.0.onnx, 337MB)
# and the voice data (voices-v1.0.bin) from the kokoro-onnx GitHub
# releases page, and keep them next to your script.

Basic usage:

from kokoro_onnx import Kokoro
import soundfile as sf
 
# Point the constructor at the downloaded model and voice files
model = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")
 
text = "The best model for the job is almost always the smallest one that works."
audio, sr = model.create(text, voice="af_heart", speed=1.0, lang="en-us")
sf.write("output.wav", audio, sr)

That's it. No API keys. No account setup. No rate limits. No internet connection required after the initial download.

where it shines

Development and testing. For any feature that involves audio output — podcasts, accessibility features, voice interfaces, content previews — you can iterate locally without burning API credits.

Batch processing. Need to convert 500 blog posts to audio? With a cloud API that's a billing event. Locally it's a for loop and a coffee break.
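That for loop can be sketched as follows. The `posts/` directory layout and the `chunk_text` helper are assumptions for illustration; the chunker splits each post on sentence boundaries so that no single `create()` call gets an uncomfortably long input. The Kokoro calls (commented out here, since they need the downloaded model files) mirror the basic-usage snippet above:

```python
import re
from pathlib import Path

def chunk_text(text, max_chars=500):
    """Split text into sentence-aligned chunks of at most max_chars
    (best effort: a single sentence longer than max_chars is kept whole)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# Hypothetical batch loop, reusing the model from the snippet above:
# model = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")
# for post in Path("posts").glob("*.txt"):
#     for i, chunk in enumerate(chunk_text(post.read_text())):
#         audio, sr = model.create(chunk, voice="af_heart", speed=1.0, lang="en-us")
#         sf.write(f"audio/{post.stem}_{i:03d}.wav", audio, sr)
```

Writing one numbered WAV per chunk makes the run resumable and lets you concatenate or inspect segments afterward.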

Privacy-sensitive content. Medical notes, legal documents, internal communications — anything you don't want sent to a third-party API. Local inference means the data never leaves your machine.

Prototyping. Build the entire audio pipeline locally. Validate the UX, the voice quality, the pacing. Only move to a cloud API if local quality doesn't meet production requirements (it probably will).

where it doesn't

Long-form audiobook narration with emotional range — the cloud models are still better at sustained emotional consistency over 30+ minutes. Kokoro handles individual paragraphs and pages well, but the voice can flatten over very long passages.

Non-English languages. Kokoro's English quality is excellent. Other languages are functional but noticeably less natural. If you need production-quality Japanese or French TTS, the cloud APIs still win.

Ultra-low-latency streaming for real-time voice applications. Kokoro's 50ms latency is fast, but cloud streaming APIs are optimized for real-time conversational use cases with adaptive bitrate and interruption handling.

the bigger picture

Kokoro is one example of a pattern: small, specialized models that run locally and match or exceed cloud API quality for specific tasks. Whisper-tiny (39M params) for speech-to-text. DistilBERT (66M) for text classification. These aren't research toys — they're production-capable tools.

The economics are shifting. Cloud APIs charge per request, per token, per character. Local models have a one-time download cost and then run for free. For high-volume use cases, the math isn't even close.

The best model for the job is almost always the smallest one that works. And "works" keeps getting redefined downward.
