Voice AI in Production
At Avoca, I helped build and ship Voice AI agents that answered real phone calls for home services businesses. Not IVR menus — actual conversational agents that scheduled appointments, answered service questions, and handled customer complaints.
The gap between a voice AI demo and a production voice AI system is enormous. Here's what that gap looks like.
the latency budget
Humans expect conversational turn-taking at about 300-500ms. Longer than 800ms and callers notice the delay. Longer than 1.5 seconds and they start saying "hello? are you there?"
Here's our latency budget for a single turn:
- Speech-to-Text: 150-250ms (Deepgram streaming)
- Intent + Routing: 50-100ms (classifier)
- LLM Generation: 200-400ms (time to first token)
- Tool Calls: 0-500ms (API lookups, calendar checks)
- Text-to-Speech: 100-200ms (ElevenLabs streaming)
- Network overhead: 50-100ms
Total target: 550-1550ms
Every component has to be optimized for streaming. We can't wait for the full LLM response before starting TTS. The pipeline is:
- STT streams partial transcripts as the caller speaks
- Endpointing detection decides when the caller is done talking
- LLM starts generating immediately
- TTS starts speaking the first sentence while the LLM is still generating the rest
This "streaming pipeline" architecture cuts perceived latency by ~40% compared to waiting for complete outputs at each stage.
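A minimal sketch of that turn loop, assuming hypothetical stt_stream, llm, tts, and audio_out interfaces in place of the real streaming clients:

# sketch only: stt_stream, llm, tts, and audio_out are stand-ins
# for the real streaming clients
SENTENCE_ENDINGS = (".", "?", "!")

async def run_turn(stt_stream, llm, tts, audio_out):
    # 1. Consume partial transcripts until endpointing fires
    transcript = ""
    async for event in stt_stream:
        transcript = event.text
        if event.speech_final:  # caller finished talking
            break

    # 2. Stream LLM tokens; flush each completed sentence to TTS so
    #    speech starts while generation is still in flight
    sentence = ""
    async for token in llm.stream(transcript):
        sentence += token
        if sentence.rstrip().endswith(SENTENCE_ENDINGS):
            await audio_out.play(await tts.synthesize(sentence))
            sentence = ""
    if sentence.strip():
        await audio_out.play(await tts.synthesize(sentence))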
speech-to-text: the foundation
We tested Deepgram, Whisper (OpenAI), and Google Cloud Speech-to-Text. Deepgram won for our use case.
Why Deepgram:
- Streaming WebSocket API with ~150ms latency for partial transcripts
- Custom vocabulary injection. Home services terms — "HVAC," "tankless water heater," "P-trap" — need to be in the vocabulary or they get mangled
- Endpointing control. We tuned the silence threshold to 700ms — long enough to avoid cutting off slow speakers, short enough to feel responsive
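For reference, here's roughly how those settings map onto Deepgram's /v1/listen WebSocket endpoint. The query parameter names are Deepgram's; the keyword list and boost values are illustrative:

from urllib.parse import urlencode

DEEPGRAM_WS = "wss://api.deepgram.com/v1/listen"

def build_stt_url() -> str:
    params = {
        "encoding": "mulaw",        # 8 kHz phone audio
        "sample_rate": 8000,
        "interim_results": "true",  # stream partial transcripts
        "endpointing": 700,         # ms of silence before end-of-speech
        "keywords": ["HVAC:2", "tankless:2", "P-trap:2"],  # boost domain terms
    }
    return f"{DEEPGRAM_WS}?{urlencode(params, doseq=True)}"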
The transcription accuracy problem nobody warns you about:
- Background noise on job sites: accuracy drops 15-20%
- Heavy accents: another 10-15% degradation
- Speaker crosstalk (customer on speakerphone with family): catastrophic
- Phone audio quality (8kHz) vs microphone audio (16kHz+): noticeable gap
We added a "transcription confidence" check. Below 70% confidence, the agent asks "I want to make sure I got that right — did you say [best transcript]?" This costs a turn but prevents downstream errors.
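The gate itself is a few lines. The 0.70 floor is the threshold we shipped; the confidence score comes back with each transcript:

CONFIDENCE_FLOOR = 0.70

def maybe_confirm(best_transcript: str, confidence: float) -> str | None:
    # Below the floor, spend a turn confirming rather than acting on a guess
    if confidence < CONFIDENCE_FLOOR:
        return (
            "I want to make sure I got that right — "
            f"did you say {best_transcript}?"
        )
    return None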
the LLM layer
The LLM handles conversation management, intent detection, and response generation. Key constraints:
Time-to-first-token matters more than total generation time. The caller hears the first word within 300ms of the LLM starting to generate. Total response length is less important because we're streaming.
Tool call latency is the hidden killer. The LLM decides it needs to check the calendar. That's a 200-500ms API call in the middle of the conversation. The caller hears silence.
Our solution: filler phrases. While waiting for tool results, the agent says "Let me check that for you" or "One moment while I look that up." These are pre-recorded audio clips that fire immediately when a tool call starts. Simple but effective — caller satisfaction scores improved 12% after adding fillers.
import asyncio

async def handle_tool_call(tool_name: str, args: dict):
    # Fire a pre-recorded filler immediately: "Let me check our schedule..."
    filler = select_filler(tool_name)
    # Play the filler and run the tool call concurrently, so the caller
    # never hears dead air while the API lookup is in flight
    _, result = await asyncio.gather(
        stream_audio(filler),
        execute_tool(tool_name, args),
    )
    return result

failure modes in production
The infinite loop. The agent misunderstands, the caller corrects, the agent misunderstands again. After 3 failed attempts at the same information, we escalate to a human. No exceptions. Better to transfer early than frustrate the caller.
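A sketch of that escalation rule; the slot names and tracker shape are illustrative:

MAX_ATTEMPTS = 3

class EscalationTracker:
    def __init__(self):
        self.failed_attempts: dict[str, int] = {}

    def record_failure(self, slot: str) -> bool:
        # Count misses per piece of information ("date", "address", ...)
        # and return True once the call should go to a human
        self.failed_attempts[slot] = self.failed_attempts.get(slot, 0) + 1
        return self.failed_attempts[slot] >= MAX_ATTEMPTS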
The confident wrong answer. The agent books an appointment for Tuesday at 2pm. The caller said Thursday. The STT transcribed "Thursday" as "Tuesday" (they sound similar at 8kHz). Our confirmation step catches most of these: "I have you down for Tuesday, February 11th at 2 PM. Does that sound right?"
The context drift. Long calls (5+ minutes) cause the agent to lose track of details from early in the conversation. We solved this with a state object that gets re-injected on every turn — customer name, service type, preferred date, address. The model can't forget what's explicitly in its context.
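A sketch of that pattern, using the fields listed above (the prompt format is illustrative):

import json
from dataclasses import asdict, dataclass

@dataclass
class CallState:
    customer_name: str | None = None
    service_type: str | None = None
    preferred_date: str | None = None
    address: str | None = None

def build_turn_context(state: CallState, transcript: str) -> str:
    # Re-inject the full state on every turn so details from early in
    # the call stay explicitly in the model's context
    return (
        f"Known call state: {json.dumps(asdict(state))}\n"
        f"Caller just said: {transcript}"
    )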
The emotional caller. Angry customers don't follow conversational patterns. They interrupt, repeat themselves, use profanity. Our agent detects elevated sentiment and switches to a de-escalation prompt: shorter responses, more empathy phrases, faster transfer to human threshold.
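One way to wire that up; the sentiment score source, threshold, and prompt text here are illustrative assumptions:

DEFAULT_PROMPT = "You are a scheduling assistant for a home services business."
DE_ESCALATION_PROMPT = (
    "The caller is upset. Keep responses short, lead with empathy, "
    "and offer a human transfer early."
)

def pick_prompt(anger_score: float) -> tuple[str, int]:
    # Returns (system prompt, failed-attempt threshold for human transfer);
    # the threshold drops when the caller is angry
    if anger_score > 0.7:  # illustrative cutoff
        return DE_ESCALATION_PROMPT, 2
    return DEFAULT_PROMPT, 3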
text-to-speech: the uncanny valley
We used ElevenLabs in production. Play.ht, Cartesia, and Deepgram's TTS are competitive alternatives.
What we learned:
Voice selection is a product decision, not a tech decision. We A/B tested 6 voices. The "professional female" voice had 23% higher completion rates than the "friendly male" voice. For home services, callers expected to talk to a receptionist. Matching that expectation mattered more than voice quality metrics.
Streaming TTS with sentence-level chunking. Send each sentence to TTS as soon as the LLM generates it. Don't wait for the full response. This reduces time-to-speech by 200-400ms.
Pronunciation dictionaries are essential. "HVAC" should be "H-V-A-C," not "hvack." "Roto-Rooter" needs specific emphasis. We maintained a pronunciation dictionary of ~200 industry terms.
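The dictionary is a substitution pass that runs before text reaches TTS. A minimal version, with illustrative entries beyond the "HVAC" example above:

import re

PRONUNCIATIONS = {
    "HVAC": "H-V-A-C",     # spell it out; never "hvack"
    "P-trap": "pee trap",  # illustrative respelling
}

def apply_pronunciations(text: str) -> str:
    # Rewrite known terms before the text reaches TTS
    for written, spoken in PRONUNCIATIONS.items():
        text = re.sub(rf"\b{re.escape(written)}\b", spoken, text)
    return text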
the numbers that mattered
After 6 months in production across multiple home services businesses:
- Average response latency: 850ms (first word of response)
- Call completion rate: 78% (caller gets what they need without human transfer)
- Booking accuracy: 94% (correct service, date, time, address)
- Customer satisfaction (post-call survey): 4.1/5.0
- Cost per call: ~$0.35 (STT + LLM + TTS + telephony)
For comparison, a human receptionist costs $15-25/hour and handles ~8-12 calls/hour. That works out to $1.25-$3.13 per call. Voice AI at $0.35/call is a compelling ROI even at 78% completion.
The 22% that need human transfer aren't failures — they're complex situations (insurance questions, emergency scheduling, complaints) that shouldn't be automated. The AI handles the routine 78% so humans can focus on the hard 22%.
Voice AI works in production. But getting there requires solving a dozen engineering problems that demos never show you.