Multi-Agent Orchestration Lessons
I have 15 specialized agents running in my development setup. At Avoca, the voice AI platform used 23 agents in production across scheduling, routing, escalation, and conversation management. I've built enough multi-agent systems to know when they're the right call and when they're architecture theater.
the split/merge decision
The most important decision in multi-agent design isn't which framework to use. It's when to split one agent into two and when to keep them together.
Split when:
- Two tasks require different system prompts that conflict. A "be concise" agent and a "be thorough" agent can't share a system prompt effectively.
- Tasks have different tool sets. An agent with 30 tools performs worse than two agents with 15 each. Tool selection accuracy degrades with tool count.
- Tasks have different latency requirements. Your real-time response agent and your background analysis agent shouldn't share a call queue.
- You need independent scaling. The classification agent handles 10x the volume of the synthesis agent.
Merge when:
- The agents share more than 80% of their context. If agent B needs everything agent A just computed, they should probably be one agent with two steps.
- The communication overhead exceeds the task complexity. Two agents passing JSON back and forth to do what one agent could do in a single prompt — that's overhead, not architecture.
- You're splitting by arbitrary function boundaries. "Extraction agent" and "formatting agent" don't need to be separate. That's just two steps in a pipeline.
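To make the merge case concrete, here's what "two steps in a pipeline" looks like as a single agent. The step functions below are hypothetical stand-ins for prompt calls, not real extraction logic:

```python
def extract(text: str) -> dict:
    # Stand-in for an extraction prompt: pull fields out of raw text.
    name, _, service = text.partition(" wants ")
    return {"customer": name, "service": service}

def format_confirmation(fields: dict) -> str:
    # Stand-in for a formatting prompt: render the fields for readback.
    return f"{fields['customer']}: {fields['service']}"

def single_agent_pipeline(text: str) -> str:
    # Two steps, one agent: no JSON handoff, no second call queue.
    return format_confirmation(extract(text))

single_agent_pipeline("Dana wants a gutter cleaning")
# → "Dana: a gutter cleaning"
```

Same two responsibilities, zero inter-agent communication overhead.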
the patterns that worked at Avoca
Our voice AI had a specific architecture that evolved through trial and error:
```
Conversation Router (1 agent)
├── Intent Classifier → routes to domain agent
├── Scheduling Agent (tools: calendar, CRM)
├── FAQ Agent (tools: knowledge base)
├── Escalation Agent (tools: transfer, queue)
└── Confirmation Agent (tools: booking, SMS)
```
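Expressed as a static routing table, the topology is just data. The names mirror the diagram above; the tool lists are illustrative, not our exact tool registry:

```python
ROUTES: dict[str, list[str]] = {
    "scheduling": ["calendar", "crm"],
    "faq": ["knowledge_base"],
    "escalation": ["transfer", "queue"],
    "confirmation": ["booking", "sms"],
}

def tools_for(intent: str) -> list[str]:
    # Unknown intents get no tools — the fallback path handles them.
    return ROUTES.get(intent, [])

tools_for("scheduling")  # → ["calendar", "crm"]
```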
Key decisions:
One router, many specialists. The router agent does nothing but classify intent and delegate. It has no domain tools. This keeps it fast (~100ms) and accurate. We tried having the router also handle simple queries directly — accuracy dropped from 94% to 87% because it was trying to do two jobs.
Specialists share no state directly. Each specialist gets a structured handoff from the router: customer info, conversation history, classified intent. No shared memory, no shared database connections, no side channels. This made debugging straightforward — you could replay any handoff and get deterministic behavior.
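A structured handoff like that can be as simple as a frozen dataclass — immutable, serializable, and replayable. The field names here are assumptions for illustration, not the exact Avoca schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Handoff:
    customer_id: str
    intent: str                    # classified intent from the router
    history: tuple[str, ...] = ()  # conversation turns so far, immutable

# Frozen means a specialist can't mutate shared state by accident —
# replaying the same handoff reproduces the same input every time.
h = Handoff(customer_id="c-42", intent="scheduling",
            history=("Hi", "I need a plumber"))
```

Immutability is what makes the replay-based debugging possible: the handoff a specialist received is exactly the handoff you replay.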
The confirmation agent is separate from the booking agent. This was counterintuitive but critical. The booking agent picks a slot. The confirmation agent reads back the details and asks the customer to confirm. Separating them added a natural verification step — the confirmation agent has a "skeptical" system prompt that double-checks the booking details against the original request. This caught mismatched dates, wrong service types, and address errors.
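The spirit of that cross-check can be sketched as a pure function — a stand-in for the skeptical prompt, with illustrative field names:

```python
def verify_booking(request: dict, booking: dict) -> list[str]:
    """Compare the booking the agent made against the original request.
    Returns the mismatched fields; an empty list means safe to confirm."""
    mismatches = []
    for key in ("date", "service_type", "address"):
        if request.get(key) != booking.get(key):
            mismatches.append(key)
    return mismatches

verify_booking(
    {"date": "2024-06-03", "service_type": "drain repair", "address": "12 Elm St"},
    {"date": "2024-06-30", "service_type": "drain repair", "address": "12 Elm St"},
)  # → ["date"]
```

In production the comparison is done by the confirmation agent's prompt rather than exact string matching, but the contract is the same: any mismatch blocks the confirmation.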
what didn't work
Shared memory stores. We tried a Redis-based shared memory where agents could read and write state. In theory, this enabled rich collaboration. In practice, it created race conditions, stale reads, and debugging nightmares. Agent A writes a value, agent B reads it 50ms later, agent C overwrites it. We ripped it out in week 3.
Consensus mechanisms. We built a system where three agents independently evaluated a customer request and voted on the response. It was 3x the latency for a ~2% accuracy improvement. Not worth it for real-time voice. Might make sense for batch processing where latency doesn't matter.
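For batch settings where the latency is acceptable, the voting step itself is trivial — a generic sketch, not our production implementation:

```python
from collections import Counter

def majority_vote(responses: list[str]) -> str:
    # Pick the answer most evaluators agreed on. On a tie, Counter
    # returns whichever answer appeared first (insertion order).
    return Counter(responses).most_common(1)[0][0]

majority_vote(["approve", "approve", "deny"])  # → "approve"
```

The cost isn't the vote — it's the three independent LLM calls feeding it.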
Dynamic agent spawning. Letting the router create new agent instances on the fly sounds flexible. It was chaos. Orphaned agents, memory leaks, inconsistent configurations. Static topology, deployed as a unit, is simpler and more reliable.
orchestration framework choices
I've used LangGraph, CrewAI, and a custom orchestrator. Here's the honest take:
LangGraph works if you think in graphs. State machines with conditional edges. It's the most principled approach but the learning curve is steep and the abstractions are rigid.
CrewAI is fast to prototype but the "role playing" paradigm — agents with backstories and goals — adds a layer of indirection that makes debugging harder. When your "Senior Data Analyst" agent hallucinates, the backstory isn't helping.
Custom orchestrator is what I'd recommend for production. It's more code upfront but you control the routing, error handling, and monitoring. Here's the minimal version:
```python
import asyncio

class Orchestrator:
    def __init__(self, agents: dict[str, Agent]):
        self.agents = agents

    async def route(self, message: str, context: dict) -> str:
        # Step 1: classify intent
        intent = await self.agents["router"].classify(message, context)
        # Step 2: delegate to specialist
        specialist = self.agents.get(intent.agent_name)
        if not specialist:
            return await self.agents["fallback"].respond(message, context)
        # Step 3: execute with timeout
        try:
            result = await asyncio.wait_for(
                specialist.execute(message, context, intent),
                timeout=10.0,
            )
        except asyncio.TimeoutError:
            return await self.agents["fallback"].respond(message, context)
        return result
```
About 50 lines of Python once you flesh it out. No framework dependencies. Full control over routing, timeouts, and fallbacks. Add logging, metrics, and error handling as needed.
the rules I follow now
- Start with one agent. Only split when you have a measured reason.
- Explicit handoffs, not shared memory. Pass structured data between agents.
- Every agent has a fallback. Timeout? Error? Unknown intent? Route to fallback. Always.
- Monitor per-agent, not per-system. If accuracy drops, you need to know which agent degraded.
- Static topology. Deploy agents as a unit. No runtime spawning.
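Per-agent monitoring can start as simply as counters keyed by agent name — a sketch; a real deployment would emit these to a metrics backend instead of holding them in memory:

```python
from collections import defaultdict

class AgentMetrics:
    def __init__(self):
        self.calls = defaultdict(int)
        self.errors = defaultdict(int)

    def record(self, agent: str, ok: bool) -> None:
        self.calls[agent] += 1
        if not ok:
            self.errors[agent] += 1

    def error_rate(self, agent: str) -> float:
        calls = self.calls[agent]
        return self.errors[agent] / calls if calls else 0.0

m = AgentMetrics()
m.record("scheduling", ok=True)
m.record("scheduling", ok=False)
m.error_rate("scheduling")  # → 0.5
```

The point is the keying: a system-wide accuracy number hides which agent degraded; per-agent counters don't.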
Multi-agent systems are powerful when each agent genuinely needs different context, tools, and constraints. For everything else, a well-structured single agent with multiple steps is simpler and easier to maintain.