Building AI Agents That Don't Hallucinate
At Avoca, we deployed conversational AI agents that talked to real customers on the phone. Not chatbots with canned responses — actual voice agents handling appointment scheduling, service inquiries, and payment processing.
When an agent hallucinates in a chatbot, the user rolls their eyes and retypes. When an agent hallucinates on a phone call, it books a plumber for next Tuesday at an address that doesn't exist. Different stakes entirely.
Here's what we learned about keeping agents grounded.
hallucination isn't random
The first thing to understand: models don't hallucinate randomly. They hallucinate predictably, in specific situations:
- When asked about specifics they don't have. "What's the customer's address?" without the address in context. The model fills the gap with something plausible.
- When the conversation drifts from grounded context. Turn 1 has the booking details in the system prompt. By turn 15, the model has "forgotten" them in favor of the conversation flow.
- When tool results are ambiguous. The API returns 3 available slots. The model adds a 4th because the user asked for afternoon and none of the 3 were after 3pm.
- When confidence is high but knowledge is stale. The model "knows" business hours are 9-5 because that's common. Your client closes at 4pm on Fridays.
Each failure mode has a different fix.
grounding technique 1: tool use as the single source of truth
The agent should never state facts from its parametric knowledge. Every factual claim must come from a tool call.
# Bad: model answers from training data
# "Our office hours are 9 AM to 5 PM Monday through Friday"
# Good: model calls a tool to get current hours
tools = [
    {
        "name": "get_business_hours",
        "description": "Get current business hours for the location",
        "parameters": {
            "type": "object",
            "properties": {
                "location_id": {"type": "string"},
            },
            "required": ["location_id"],
        },
    }
]
# System prompt reinforcement:
# "NEVER state business hours, pricing, availability, or policies
# from your own knowledge. ALWAYS call the appropriate tool first.
# If no tool exists for the information requested, say you'll
# need to check and transfer to a human."

The system prompt tells the model what it doesn't know. This is more effective than telling it what it does know. "You don't know the schedule. Call get_available_slots." beats "Here is the schedule: ..."
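In practice this also means the agent loop has to execute the tool call and hand the raw result back before the model answers. Here is a minimal sketch of that round trip; the lookup_hours handler is hypothetical and stands in for the real scheduling backend:

def lookup_hours(location_id: str) -> dict:
    # Hypothetical backend call; in production this hits the scheduling system.
    return {"location_id": location_id, "friday_close": "16:00"}

TOOL_HANDLERS = {"get_business_hours": lookup_hours}

def execute_tool_call(tool_call: dict) -> dict:
    """Run the tool the model asked for and wrap the result as a message."""
    handler = TOOL_HANDLERS[tool_call["name"]]
    result = handler(**tool_call["arguments"])
    # The result goes back verbatim: the model quotes it rather than recalling from training data.
    return {"role": "tool", "name": tool_call["name"], "content": result}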
grounding technique 2: context window management
Long conversations cause drift. The model starts "improvising" because the relevant context has scrolled out of its effective attention window.
We solved this with a rolling context summary:
def build_agent_context(conversation: list, booking_state: dict) -> list:
    # Always include: current state of the booking
    state_message = {
        "role": "system",
        "content": f"""Current booking state:
- Customer: {booking_state.get('customer_name', 'Unknown')}
- Service: {booking_state.get('service', 'Not selected')}
- Date/Time: {booking_state.get('datetime', 'Not scheduled')}
- Address: {booking_state.get('address', 'Not confirmed')}
- Status: {booking_state.get('status', 'In progress')}

ONLY use these values when referencing booking details.
Do NOT infer or assume any values not listed above.""",
    }

    # Keep last 6 turns of conversation
    recent = conversation[-12:]  # 6 turns = 12 messages
    return [state_message] + recent

The state message gets injected fresh on every turn. Even if the conversation is 50 turns deep, the model always has the current ground truth at the top of its context.
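A sketch of how that looks per turn; call_model is a placeholder for whatever client your stack uses:

from typing import Callable

def handle_turn(
    user_message: dict,
    conversation: list,
    booking_state: dict,
    call_model: Callable[[list], str],
) -> str:
    conversation.append(user_message)
    # Rebuild the context from scratch every turn: fresh state message + recent turns only.
    messages = build_agent_context(conversation, booking_state)
    reply = call_model(messages)
    conversation.append({"role": "assistant", "content": reply})
    return reply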
grounding technique 3: chain-of-thought verification
Before the agent speaks, make it verify its own claims:
verification_prompt = """Before responding to the customer, verify:
1. Every fact I'm about to state — is it from a tool result or the booking state?
2. Am I adding any information that wasn't in the tool response?
3. If the customer asked something I can't verify, am I admitting that?
If any claim can't be traced to a tool result or the current state, remove it."""

This adds ~200ms of latency per turn. Worth it. Our hallucination rate dropped from ~8% of turns containing at least one fabricated fact to under 1%.
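One way to wire this in is a dedicated verification pass between drafting and speaking. A sketch, under the assumption that the verifier sees only the draft, the tool results, and the booking state (the helper names are illustrative):

from typing import Callable

def verify_draft(
    draft: str,
    tool_results: list,
    booking_state: dict,
    call_model: Callable[[list], str],
) -> str:
    """Run the draft through the verification prompt before it is spoken."""
    messages = [
        {"role": "system", "content": verification_prompt},
        {
            "role": "user",
            "content": (
                f"Tool results: {tool_results}\n"
                f"Booking state: {booking_state}\n"
                f"Draft response: {draft}\n"
                "Return the draft with any unverifiable claims removed."
            ),
        },
    ]
    return call_model(messages)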
grounding technique 4: constrained generation
For critical fields, don't let the model generate freely. Use structured outputs with enums:
confirm_schema = {
    "type": "object",
    "properties": {
        "action": {
            "type": "string",
            "enum": ["confirm_booking", "modify_booking", "cancel", "transfer_to_human"],
        },
        "confirmed_datetime": {
            "type": "string",
            "description": "Must exactly match a slot from get_available_slots",
        },
    },
}

The model can't invent a time slot if the enum only contains slots returned by the API. Constraint beats instruction.
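To make that constraint literal, the schema can be rebuilt each turn so confirmed_datetime is an enum of the slots the API actually returned. A sketch, assuming get_available_slots returns ISO-8601 strings:

def build_confirm_schema(available_slots: list) -> dict:
    """Build the confirmation schema with the datetime enum limited to real slots."""
    return {
        "type": "object",
        "properties": {
            "action": {
                "type": "string",
                "enum": ["confirm_booking", "modify_booking", "cancel", "transfer_to_human"],
            },
            "confirmed_datetime": {
                "type": "string",
                # Populated per turn from the get_available_slots result,
                # so the model literally cannot emit a time it wasn't offered.
                "enum": available_slots,
            },
        },
        "required": ["action"],
    }

# e.g. build_confirm_schema(["2025-06-03T13:00", "2025-06-03T15:30"])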
the verification loop pattern
Our production agents run a three-step loop on every turn:
- Retrieve — call tools to get current facts
- Generate — draft the response
- Verify — check the draft against tool results, strip ungrounded claims
Step 3 uses a lightweight model (GPT-4o-mini) as a fact-checker against the tool results. It's fast, cheap, and catches the ~3% of responses where the primary model added something it shouldn't have.
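Condensed, the loop looks roughly like this. The run_tools, call_model, and call_checker arguments are placeholders for your own wiring (call_checker points at the lightweight model), and the body reuses the build_agent_context and verify_draft sketches above:

from typing import Callable

def agent_turn(
    user_message: dict,
    conversation: list,
    booking_state: dict,
    run_tools: Callable[[list], list],
    call_model: Callable[[list], str],
    call_checker: Callable[[list], str],
) -> str:
    conversation.append(user_message)

    # 1. Retrieve: tool results are the only facts the agent is allowed to state.
    tool_results = run_tools(conversation)

    # 2. Generate: draft from the fresh booking state, recent turns, and tool results.
    messages = build_agent_context(conversation, booking_state)
    messages += [{"role": "tool", "content": str(r)} for r in tool_results]
    draft = call_model(messages)

    # 3. Verify: the cheap model strips anything not traceable to a tool result.
    final = verify_draft(draft, tool_results, booking_state, call_checker)
    conversation.append({"role": "assistant", "content": final})
    return final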
what we still can't fully solve
Tone hallucination. The model sometimes promises things the business can't deliver — "I'll make sure that gets taken care of right away" when the actual resolution requires a 48-hour review. This isn't a factual hallucination but a commitment one. We're still iterating on this.
The broader lesson: hallucination reduction isn't a single technique. It's a defense-in-depth approach — multiple layers, each catching what the others miss.