Multi-Agent Orchestration Lessons
I have 15 specialized agents running in my development setup. At Avoca, the voice AI platform used 23 agents in production across scheduling, routing, escalation, and conversation management. I've built enough multi-agent systems to know when they're the right call and when they're architecture theater.
the split/merge decision
The most important decision in multi-agent design isn't which framework to use. It's when to split one agent into two and when to keep them together.
Split when:
- Two tasks require different system prompts that conflict. A "be concise" agent and a "be thorough" agent can't share a system prompt effectively.
- Tasks have different tool sets. An agent with 30 tools performs worse than two agents with 15 each. Tool selection accuracy degrades with tool count.
- Tasks have different latency requirements. Your real-time response agent and your background analysis agent shouldn't share a call queue.
- You need independent scaling. The classification agent handles 10x the volume of the synthesis agent.
Merge when:
- The agents share more than 80% of their context. If agent B needs everything agent A just computed, they should probably be one agent with two steps.
- The communication overhead exceeds the task complexity. Two agents passing JSON back and forth to do what one agent could do in a single prompt — that's overhead, not architecture.
- You're splitting by arbitrary function boundaries. "Extraction agent" and "formatting agent" don't need to be separate. That's just two steps in a pipeline.
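To make the merge case concrete, here's what "two steps in a pipeline" looks like as a single agent. The step functions below are hypothetical stand-ins for prompt calls, not real extraction logic:

```python
def extract(text: str) -> dict:
    # Stand-in for an extraction prompt: pull fields out of raw text.
    name, _, service = text.partition(" wants ")
    return {"customer": name, "service": service}

def format_confirmation(fields: dict) -> str:
    # Stand-in for a formatting prompt: render the fields for readback.
    return f"{fields['customer']}: {fields['service']}"

def single_agent_pipeline(text: str) -> str:
    # Two steps, one agent: no JSON handoff, no second call queue.
    return format_confirmation(extract(text))

single_agent_pipeline("Dana wants a gutter cleaning")
# → "Dana: a gutter cleaning"
```

Same two responsibilities, zero inter-agent communication overhead.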
the patterns that worked at Avoca
Our voice AI had a specific architecture that evolved through trial and error:
```
Conversation Router (1 agent)
├── Intent Classifier → routes to domain agent
├── Scheduling Agent (tools: calendar, CRM)
├── FAQ Agent (tools: knowledge base)
├── Escalation Agent (tools: transfer, queue)
└── Confirmation Agent (tools: booking, SMS)
```
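Expressed as a static routing table, the topology is just data. The names mirror the diagram above; the tool lists are illustrative, not our exact tool registry:

```python
ROUTES: dict[str, list[str]] = {
    "scheduling": ["calendar", "crm"],
    "faq": ["knowledge_base"],
    "escalation": ["transfer", "queue"],
    "confirmation": ["booking", "sms"],
}

def tools_for(intent: str) -> list[str]:
    # Unknown intents get no tools — the fallback path handles them.
    return ROUTES.get(intent, [])

tools_for("scheduling")  # → ["calendar", "crm"]
```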
Key decisions:
One router, many specialists. The router agent does nothing but classify intent and delegate. It has no domain tools. This keeps it fast (~100ms) and accurate. We tried having the router also handle simple queries directly — accuracy dropped from 94% to 87% because it was trying to do two jobs.
Specialists share no state directly. Each specialist gets a structured handoff from the router: customer info, conversation history, classified intent. No shared memory, no shared database connections, no side channels. This made debugging straightforward — you could replay any handoff and get deterministic behavior.
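A structured handoff like that can be as simple as a frozen dataclass — immutable, serializable, and replayable. The field names here are assumptions for illustration, not the exact Avoca schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Handoff:
    customer_id: str
    intent: str                    # classified intent from the router
    history: tuple[str, ...] = ()  # conversation turns so far, immutable

# Frozen means a specialist can't mutate shared state by accident —
# replaying the same handoff reproduces the same input every time.
h = Handoff(customer_id="c-42", intent="scheduling",
            history=("Hi", "I need a plumber"))
```

Immutability is what makes the replay-based debugging possible: the handoff a specialist received is exactly the handoff you replay.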
The confirmation agent is separate from the booking agent. This was counterintuitive but critical. The booking agent picks a slot. The confirmation agent reads back the details and asks the customer to confirm. Separating them added a natural verification step — the confirmation agent has a "skeptical" system prompt that double-checks the booking details against the original request. This caught mismatched dates, wrong service types, and address errors.
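The spirit of that cross-check can be sketched as a pure function — a stand-in for the skeptical prompt, with illustrative field names:

```python
def verify_booking(request: dict, booking: dict) -> list[str]:
    """Compare the booking the agent made against the original request.
    Returns the mismatched fields; an empty list means safe to confirm."""
    mismatches = []
    for key in ("date", "service_type", "address"):
        if request.get(key) != booking.get(key):
            mismatches.append(key)
    return mismatches

verify_booking(
    {"date": "2024-06-03", "service_type": "drain repair", "address": "12 Elm St"},
    {"date": "2024-06-30", "service_type": "drain repair", "address": "12 Elm St"},
)  # → ["date"]
```

In production the comparison is done by the confirmation agent's prompt rather than exact string matching, but the contract is the same: any mismatch blocks the confirmation.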
what didn't work
Shared memory stores. We tried a Redis-based shared memory where agents could read and write state. In theory, this enabled rich collaboration. In practice, it created race conditions, stale reads, and debugging nightmares. Agent A writes a value, agent B reads it 50ms later, agent C overwrites it. We ripped it out in week 3.
Consensus mechanisms. We built a system where three agents independently evaluated a customer request and voted on the response. It was 3x the latency for a ~2% accuracy improvement. Not worth it for real-time voice. Might make sense for batch processing where latency doesn't matter.
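For batch settings where the latency is acceptable, the voting step itself is trivial — a generic sketch, not our production implementation:

```python
from collections import Counter

def majority_vote(responses: list[str]) -> str:
    # Pick the answer most evaluators agreed on. On a tie, Counter
    # returns whichever answer appeared first (insertion order).
    return Counter(responses).most_common(1)[0][0]

majority_vote(["approve", "approve", "deny"])  # → "approve"
```

The cost isn't the vote — it's the three independent LLM calls feeding it.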
Dynamic agent spawning. Letting the router create new agent instances on the fly sounds flexible. It was chaos. Orphaned agents, memory leaks, inconsistent configurations. Static topology, deployed as a unit, is simpler and more reliable.
orchestration framework choices
I've used LangGraph, CrewAI, and a custom orchestrator. Here's the honest take:
LangGraph works if you think in graphs. State machines with conditional edges. It's the most principled approach but the learning curve is steep and the abstractions are rigid.
CrewAI is fast to prototype but the "role playing" paradigm — agents with backstories and goals — adds a layer of indirection that makes debugging harder. When your "Senior Data Analyst" agent hallucinates, the backstory isn't helping.
Custom orchestrator is what I'd recommend for production. It's more code upfront but you control the routing, error handling, and monitoring. Here's the minimal version:
```python
import asyncio

class Orchestrator:
    def __init__(self, agents: dict[str, Agent]):
        self.agents = agents

    async def route(self, message: str, context: dict) -> str:
        # Step 1: classify intent
        intent = await self.agents["router"].classify(message, context)
        # Step 2: delegate to specialist
        specialist = self.agents.get(intent.agent_name)
        if not specialist:
            return await self.agents["fallback"].respond(message, context)
        # Step 3: execute with timeout
        try:
            result = await asyncio.wait_for(
                specialist.execute(message, context, intent),
                timeout=10.0,
            )
        except asyncio.TimeoutError:
            return await self.agents["fallback"].respond(message, context)
        return result
```
About 50 lines of Python once you flesh it out. No framework dependencies. Full control over routing, timeouts, and fallbacks. Add logging, metrics, and error handling as needed.
the rules I follow now
- Start with one agent. Only split when you have a measured reason.
- Explicit handoffs, not shared memory. Pass structured data between agents.
- Every agent has a fallback. Timeout? Error? Unknown intent? Route to fallback. Always.
- Monitor per-agent, not per-system. If accuracy drops, you need to know which agent degraded.
- Static topology. Deploy agents as a unit. No runtime spawning.
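Per-agent monitoring can start as simply as counters keyed by agent name — a sketch; a real deployment would emit these to a metrics backend instead of holding them in memory:

```python
from collections import defaultdict

class AgentMetrics:
    def __init__(self):
        self.calls = defaultdict(int)
        self.errors = defaultdict(int)

    def record(self, agent: str, ok: bool) -> None:
        self.calls[agent] += 1
        if not ok:
            self.errors[agent] += 1

    def error_rate(self, agent: str) -> float:
        calls = self.calls[agent]
        return self.errors[agent] / calls if calls else 0.0

m = AgentMetrics()
m.record("scheduling", ok=True)
m.record("scheduling", ok=False)
m.error_rate("scheduling")  # → 0.5
```

The point is the keying: a system-wide accuracy number hides which agent degraded; per-agent counters don't.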
Multi-agent systems are powerful when each agent genuinely needs different context, tools, and constraints. For everything else, a well-structured single agent with multiple steps is simpler and easier to maintain.