Voice AI in Production
At Avoca, I helped build and ship Voice AI agents that answered real phone calls for home services businesses. Not IVR menus — actual conversational agents that scheduled appointments, answered service questions, and handled customer complaints.
The gap between a voice AI demo and a production voice AI system is enormous. Here's what that gap looks like.
the latency budget
Humans expect conversational turn-taking at about 300-500ms. Longer than 800ms and callers notice the delay. Longer than 1.5 seconds and they start saying "hello? are you there?"
Here's our latency budget for a single turn:
- Speech-to-Text: 150-250ms (Deepgram streaming)
- Intent + Routing: 50-100ms (classifier)
- LLM Generation: 200-400ms (time to first token)
- Tool Calls: 0-500ms (API lookups, calendar checks)
- Text-to-Speech: 100-200ms (ElevenLabs streaming)
- Network overhead: 50-100ms
Total target: 550-1550ms
Every component has to be optimized for streaming. We can't wait for the full LLM response before starting TTS. The pipeline is:
- STT streams partial transcripts as the caller speaks
- Endpointing detection decides when the caller is done talking
- LLM starts generating immediately
- TTS starts speaking the first sentence while the LLM is still generating the rest
This "streaming pipeline" architecture cuts perceived latency by ~40% compared to waiting for complete outputs at each stage.
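A minimal sketch of that turn loop, assuming hypothetical stt_stream, llm, tts, and audio_out interfaces in place of the real streaming clients:

# sketch only: stt_stream, llm, tts, and audio_out are stand-ins
# for the real streaming clients
SENTENCE_ENDINGS = (".", "?", "!")

async def run_turn(stt_stream, llm, tts, audio_out):
    # 1. Consume partial transcripts until endpointing fires
    transcript = ""
    async for event in stt_stream:
        transcript = event.text
        if event.speech_final:  # caller finished talking
            break

    # 2. Stream LLM tokens; flush each completed sentence to TTS so
    #    speech starts while generation is still in flight
    sentence = ""
    async for token in llm.stream(transcript):
        sentence += token
        if sentence.rstrip().endswith(SENTENCE_ENDINGS):
            await audio_out.play(await tts.synthesize(sentence))
            sentence = ""
    if sentence.strip():
        await audio_out.play(await tts.synthesize(sentence))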
speech-to-text: the foundation
We tested Deepgram, Whisper (OpenAI), and Google Cloud Speech-to-Text. Deepgram won for our use case.
Why Deepgram:
- Streaming WebSocket API with ~150ms latency for partial transcripts
- Custom vocabulary injection. Home services terms — "HVAC," "tankless water heater," "P-trap" — need to be in the vocabulary or they get mangled
- Endpointing control. We tuned the silence threshold to 700ms — long enough to avoid cutting off slow speakers, short enough to feel responsive
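For reference, here's roughly how those settings map onto Deepgram's /v1/listen WebSocket endpoint. The query parameter names are Deepgram's; the keyword list and boost values are illustrative:

from urllib.parse import urlencode

DEEPGRAM_WS = "wss://api.deepgram.com/v1/listen"

def build_stt_url() -> str:
    params = {
        "encoding": "mulaw",        # 8 kHz phone audio
        "sample_rate": 8000,
        "interim_results": "true",  # stream partial transcripts
        "endpointing": 700,         # ms of silence before end-of-speech
        "keywords": ["HVAC:2", "tankless:2", "P-trap:2"],  # boost domain terms
    }
    return f"{DEEPGRAM_WS}?{urlencode(params, doseq=True)}"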
The transcription accuracy problem nobody warns you about:
- Background noise on job sites: accuracy drops 15-20%
- Heavy accents: another 10-15% degradation
- Speaker crosstalk (customer on speakerphone with family): catastrophic
- Phone audio quality (8kHz) vs microphone audio (16kHz+): noticeable gap
We added a "transcription confidence" check. Below 70% confidence, the agent asks "I want to make sure I got that right — did you say [best transcript]?" This costs a turn but prevents downstream errors.
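The gate itself is a few lines. The 0.70 floor is the threshold we shipped; the confidence score comes back with each transcript:

CONFIDENCE_FLOOR = 0.70

def maybe_confirm(best_transcript: str, confidence: float) -> str | None:
    # Below the floor, spend a turn confirming rather than acting on a guess
    if confidence < CONFIDENCE_FLOOR:
        return (
            "I want to make sure I got that right — "
            f"did you say {best_transcript}?"
        )
    return None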
the LLM layer
The LLM handles conversation management, intent detection, and response generation. Key constraints:
Time-to-first-token matters more than total generation time. The caller hears the first word within 300ms of the LLM starting to generate. Total response length is less important because we're streaming.
Tool call latency is the hidden killer. The LLM decides it needs to check the calendar. That's a 200-500ms API call in the middle of the conversation. The caller hears silence.
Our solution: filler phrases. While waiting for tool results, the agent says "Let me check that for you" or "One moment while I look that up." These are pre-recorded audio clips that fire immediately when a tool call starts. Simple but effective — caller satisfaction scores improved 12% after adding fillers.
import asyncio

async def handle_tool_call(tool_name: str, args: dict):
    # Fire a pre-recorded filler immediately: "Let me check our schedule..."
    filler = select_filler(tool_name)
    # Play the filler and run the tool call concurrently, so the caller
    # never hears dead air while the API lookup is in flight
    _, result = await asyncio.gather(
        stream_audio(filler),
        execute_tool(tool_name, args),
    )
    return result

failure modes in production
The infinite loop. The agent misunderstands, the caller corrects, the agent misunderstands again. After 3 failed attempts at the same information, we escalate to a human. No exceptions. Better to transfer early than frustrate the caller.
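A sketch of that escalation rule; the slot names and tracker shape are illustrative:

MAX_ATTEMPTS = 3

class EscalationTracker:
    def __init__(self):
        self.failed_attempts: dict[str, int] = {}

    def record_failure(self, slot: str) -> bool:
        # Count misses per piece of information ("date", "address", ...)
        # and return True once the call should go to a human
        self.failed_attempts[slot] = self.failed_attempts.get(slot, 0) + 1
        return self.failed_attempts[slot] >= MAX_ATTEMPTS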
The confident wrong answer. The agent books an appointment for Tuesday at 2pm. The caller said Thursday. The STT transcribed "Thursday" as "Tuesday" (they sound similar at 8kHz). Our confirmation step catches most of these: "I have you down for Tuesday, February 11th at 2 PM. Does that sound right?"
The context drift. Long calls (5+ minutes) cause the agent to lose track of details from early in the conversation. We solved this with a state object that gets re-injected on every turn — customer name, service type, preferred date, address. The model can't forget what's explicitly in its context.
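A sketch of that pattern, using the fields listed above (the prompt format is illustrative):

import json
from dataclasses import asdict, dataclass

@dataclass
class CallState:
    customer_name: str | None = None
    service_type: str | None = None
    preferred_date: str | None = None
    address: str | None = None

def build_turn_context(state: CallState, transcript: str) -> str:
    # Re-inject the full state on every turn so details from early in
    # the call stay explicitly in the model's context
    return (
        f"Known call state: {json.dumps(asdict(state))}\n"
        f"Caller just said: {transcript}"
    )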
The emotional caller. Angry customers don't follow conversational patterns. They interrupt, repeat themselves, use profanity. Our agent detects elevated sentiment and switches to a de-escalation prompt: shorter responses, more empathy phrases, faster transfer to human threshold.
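One way to wire that up; the sentiment score source, threshold, and prompt text here are illustrative assumptions:

DEFAULT_PROMPT = "You are a scheduling assistant for a home services business."
DE_ESCALATION_PROMPT = (
    "The caller is upset. Keep responses short, lead with empathy, "
    "and offer a human transfer early."
)

def pick_prompt(anger_score: float) -> tuple[str, int]:
    # Returns (system prompt, failed-attempt threshold for human transfer);
    # the threshold drops when the caller is angry
    if anger_score > 0.7:  # illustrative cutoff
        return DE_ESCALATION_PROMPT, 2
    return DEFAULT_PROMPT, 3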
text-to-speech: the uncanny valley
We used ElevenLabs in production. Play.ht, Cartesia, and Deepgram's TTS are competitive alternatives.
What we learned:
Voice selection is a product decision, not a tech decision. We A/B tested 6 voices. The "professional female" voice had 23% higher completion rates than the "friendly male" voice. For home services, callers expected to talk to a receptionist. Matching that expectation mattered more than voice quality metrics.
Streaming TTS with sentence-level chunking. Send each sentence to TTS as soon as the LLM generates it. Don't wait for the full response. This reduces time-to-speech by 200-400ms.
Pronunciation dictionaries are essential. "HVAC" should be "H-V-A-C," not "hvack." "Roto-Rooter" needs specific emphasis. We maintained a pronunciation dictionary of ~200 industry terms.
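The dictionary is a substitution pass that runs before text reaches TTS. A minimal version, with illustrative entries beyond the "HVAC" example above:

import re

PRONUNCIATIONS = {
    "HVAC": "H-V-A-C",     # spell it out; never "hvack"
    "P-trap": "pee trap",  # illustrative respelling
}

def apply_pronunciations(text: str) -> str:
    # Rewrite known terms before the text reaches TTS
    for written, spoken in PRONUNCIATIONS.items():
        text = re.sub(rf"\b{re.escape(written)}\b", spoken, text)
    return text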
the numbers that mattered
After 6 months in production across multiple home services businesses:
- Average response latency: 850ms (first word of response)
- Call completion rate: 78% (caller gets what they need without human transfer)
- Booking accuracy: 94% (correct service, date, time, address)
- Customer satisfaction (post-call survey): 4.1/5.0
- Cost per call: ~$0.35 (STT + LLM + TTS + telephony)
For comparison, a human receptionist costs $15-25/hour and handles ~8-12 calls/hour. That works out to $1.25-$3.13 per call. Voice AI at $0.35/call is a compelling ROI even at 78% completion.
The 22% that need human transfer aren't failures — they're complex situations (insurance questions, emergency scheduling, complaints) that shouldn't be automated. The AI handles the routine 78% so humans can focus on the hard 22%.
Voice AI works in production. But getting there requires solving a dozen engineering problems that demos never show you.