LLM Agent Orchestration: Coordinating Multiple AI Agents at Scale
How do you get multiple LLM agents to work together reliably? This post covers the four canonical orchestration patterns, major frameworks (LangGraph, CrewAI, AutoGen, MetaGPT), task decomposition, communication protocols (MCP, A2A, ACP, ANP), Mixture-of-Agents architectures, the MAST taxonomy of 14 failure modes, and transactional reliability via SagaLLM.
Based on: 20+ papers (2023–2026) · MAST NeurIPS 2025 · VLDB 2025
Category: Agents & Systems
Level: Advanced
Post: 52 of 52
Why Orchestration
Beyond the Single Agent
Single agents hit a ceiling. Orchestration is how you break through it.
A single LLM agent is powerful but fundamentally limited: it processes one context window at a time, cannot truly parallelize work, and accumulates errors as task complexity grows. Agent orchestration — coordinating multiple specialized agents to collaborate on a shared goal — is the architecture that unlocks a new tier of capability.
But orchestration introduces its own hard problems. How should tasks be divided and assigned? How do agents communicate without losing context? What happens when an agent fails mid-task? How do you prevent infinite loops and cascading errors? The field has matured rapidly: from informal multi-agent experiments in 2023 to formal taxonomies, production-proven frameworks, standardized communication protocols, and NeurIPS-published failure taxonomies by 2025.
The Scale of the Shift
Gartner reported a 1,445% surge in multi-agent system inquiries between Q1 2024 and Q2 2025. Real production deployments show transformative results — but also painful failures. GetOnStack incurred $47,000/week in costs from undetected recursive agent loops. 40% of multi-agent pilots fail within 6 months of production deployment.
1,445%: Surge in multi-agent system inquiries 2024→2025 (Gartner)
58.6%: Kimi K2.6 on SWE-Bench Pro via 300 swarm agents + 4,000 coordinated steps
20×: Faster mortgage approvals with multi-agent orchestration (80% cost reduction)
41.8%: Of MAST failures traced to Specification & System Design errors
Architecture Patterns
The Four Orchestration Patterns
Each pattern embodies a different tradeoff between control, latency, fault tolerance, and debuggability.
Supervisor: centralized control + delegation
Pipeline: sequential, stage by stage
Router: classify & dispatch
Swarm: decentralized peer-to-peer
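To make the control-flow differences concrete, here is a minimal sketch that models agents as plain Python callables. Everything in it (the `Agent` and `Peer` type aliases, the `researcher` and `writer` stubs, the handoff convention) is a hypothetical stand-in for LLM calls, not any framework's API.

```python
from typing import Callable, Optional

Agent = Callable[[str], str]  # hypothetical stand-in for an LLM-backed agent


def pipeline(stages: list[Agent], task: str) -> str:
    """Pipeline: each stage consumes the previous stage's output."""
    for stage in stages:
        task = stage(task)
    return task


def router(routes: dict[str, Agent], classify: Callable[[str], str], task: str) -> str:
    """Router: classify once, then dispatch to exactly one agent."""
    return routes[classify(task)](task)


def supervisor(workers: dict[str, Agent], plan: list[tuple[str, str]]) -> list[str]:
    """Supervisor: a central controller assigns each sub-task to a named worker."""
    return [workers[name](subtask) for name, subtask in plan]


# A peer returns (updated_task, name_of_next_peer or None to stop).
Peer = Callable[[str], tuple[str, Optional[str]]]


def swarm(peers: dict[str, Peer], start: str, task: str, max_hops: int = 5) -> str:
    """Swarm: peers hand off to each other directly, with no central controller."""
    current: Optional[str] = start
    for _ in range(max_hops):  # hop cap guards against endless handoffs
        if current is None:
            break
        task, current = peers[current](task)
    return task


def researcher(task: str) -> str:
    return f"notes({task})"


def writer(task: str) -> str:
    return f"draft({task})"
```

Only the supervisor has a single place where every assignment is visible, which is one reason it tends to be the easiest of the four to debug; the swarm needs an explicit hop cap because nothing else bounds the handoff chain.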
Major Frameworks
LangGraph · CrewAI · AutoGen · MetaGPT
Four production-grade frameworks, each with a distinct design philosophy.
LangChain
LangGraph
Graph-based stateful orchestration. The most flexible and production-mature framework for complex conditional workflows.
Core Abstraction
StateGraph — directed graph with nodes (agents/functions) and conditional edges. Every node transformation updates a shared state object.
Orchestration Pattern
Supervisor via create_supervisor(): central node routes tasks to workers based on LLM reasoning. Also supports hierarchical, flat, and parallel configurations.
State Management
Built-in checkpointing with time-travel debugging. Long-horizon stateful workflows supported natively. LangGraph Studio for visual pipeline debugging.
Best For
Complex branching pipelines, compliance workflows, financial analysis, any workflow requiring conditional routing and auditability.
Weakness
Steep learning curve. Requires upfront graph design investment. Over-engineering risk for simple use cases.
Production Case
11x rebuilt "Alice" AI SDR using LangGraph hierarchical design; achieved human-level 2% reply rates.
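# LangGraph supervisor pattern sketch
# create_supervisor comes from the separate langgraph-supervisor package.
from langgraph.checkpoint.memory import MemorySaver
from langgraph_supervisor import create_supervisor

memory = MemorySaver()  # in-memory checkpointer, for local experiments
# research_agent, writer_agent, reviewer_agent, and llm are assumed to be defined elsewhere.
supervisor = create_supervisor(
    agents=[research_agent, writer_agent, reviewer_agent],
    model=llm,
    prompt="You are a supervisor. Delegate tasks appropriately.",
)
app = supervisor.compile(checkpointer=memory)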
CrewAI
Role-based agent crews. The lowest-friction framework — agents behave like employees with roles, goals, and backstories.
Core Abstraction
Agents with roles + goals + backstories forming a Crew. Three process types: Sequential, Hierarchical (manager_llm coordinates), Consensual (in dev).
Orchestration Pattern
Sequential: task N output feeds task N+1. Hierarchical: manager agent delegates, validates results before proceeding. Explicit manager_llm param required.
State Management
Role-specific memory tiers: short-term (within task), long-term (across tasks), entity-based, contextual. Output of each task available as context.
Best For
Role-delegation workflows, content production pipelines, business process automation. Fastest time-to-prototype.
Weakness
Less flexible for highly conditional or non-hierarchical workflows. Limited support for dynamic agent spawning mid-task.
Learning Curve
Lowest of all frameworks — 20 lines to a working multi-agent system. Strong CLI tooling and templates.
AutoGen
Conversational multi-agent framework. Agents communicate through structured dialogue; each agent can be an LLM, a tool, or a human-in-the-loop.
Core Abstraction
ConversableAgent — any combination of LLM, tools, human input. GroupChat orchestrates multiple agents in turn-based or conditional conversation patterns.
Orchestration Pattern
RoundRobinGroupChat (structured turns), SelectorGroupChat (LLM picks next speaker), GraphFlow (directed graph controlling which agent runs next). Each response broadcast to all participants.
State Management
Conversation history in-memory by default. Context = running message history. AutoGen v0.4+ adds more flexible patterns and Azure integration.
Best For
Iterative critique-and-revise tasks, brainstorming with human-in-the-loop, tasks requiring agents to debate and converge on answers.
Weakness
Context sharding challenges at scale. Less structured for formal business process workflows. Can generate verbose conversation history.
Paper
Wu et al. (2023) arXiv:2308.08155. AutoGen v0.4 was a ground-up redesign by Microsoft; AG2 is a separate community fork that continues the earlier v0.2-style API. Strong Azure/Microsoft 365 integration.
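The turn-taking and broadcast behaviour described above can be sketched without the framework. This is a stdlib-only illustration, not AutoGen's implementation: `round_robin_chat`, the `writer` and `critic` stubs, and the `APPROVE` stop word are all hypothetical.

```python
from typing import Callable

Message = tuple[str, str]                 # (speaker, text)
Speaker = Callable[[list[Message]], str]  # an agent sees the whole shared history


def round_robin_chat(
    agents: dict[str, Speaker],
    task: str,
    max_turns: int = 6,
    stop_word: str = "APPROVE",
) -> list[Message]:
    """Agents speak in fixed order; every reply is appended to one shared history."""
    history: list[Message] = [("user", task)]
    names = list(agents)
    for turn in range(max_turns):       # hard cap so the chat always terminates
        name = names[turn % len(names)]
        reply = agents[name](history)   # broadcast: each agent reads all prior messages
        history.append((name, reply))
        if stop_word in reply:          # explicit termination condition
            break
    return history


def writer(history: list[Message]) -> str:
    drafts = sum(1 for speaker, _ in history if speaker == "writer")
    return f"draft v{drafts + 1}"


def critic(history: list[Message]) -> str:
    return "APPROVE" if history[-1][1] == "draft v2" else "please revise"


log = round_robin_chat({"writer": writer, "critic": critic}, "Write a tagline")
```

Because every agent receives the full history, the context grows with each turn, which is the source of the verbosity noted under Weakness.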
MetaGPT
SOPs encoded into prompt sequences. Agents communicate via structured documents rather than natural language, which helps prevent cascading hallucinations from ambiguous dialogue.
Best For
Software development, document generation, structured workflows where artifacts (not just chat) are the primary output.
Weakness
Rigid role structure. Not easily adapted to workflows outside the software development metaphor. Higher prompt engineering investment.
Paper
Hong et al. (2023) arXiv:2308.00352. ICLR 2024. Jürgen Schmidhuber is a co-author.
OpenAI (Educational)
OpenAI Swarm → Agents SDK
Minimalist two-abstraction framework built on transparency. Experimental/educational, superseded by OpenAI Agents SDK (March 2025) for production.
Two Abstractions
Agents: Python class with system prompt + tools + optional routine. Handoffs: a tool function returns the next agent as its value, transferring control based on context.
Design Philosophy
Strip away all abstraction. Maximum transparency — every routing decision is a visible Python function call. Stateless, client-side, built on Chat Completions API.
Handoff Mechanism
An agent's tool can return another Agent object. The orchestrator switches to the returned agent for the next turn. No hidden routing logic.
Production Evolution
OpenAI Agents SDK (March 2025) adds tracing, guardrails, streaming, and production observability on top of Swarm's conceptual model.
Best For
Learning multi-agent concepts. Low-complexity handoff workflows. Teams prioritizing full control and auditability over framework features.
Weakness
No built-in state management. No checkpointing. Not designed for long-horizon or parallel workflows.
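# OpenAI Swarm handoff sketch (the two department agents are illustrative)
from swarm import Agent
billing_agent = Agent(name="Billing Agent", instructions="Handle billing questions.")
technical_agent = Agent(name="Technical Agent", instructions="Handle technical issues.")

def transfer_to_billing():
    return billing_agent  # returning an Agent hands the conversation to it

def transfer_to_technical():
    return technical_agent

triage_agent = Agent(
    name="Triage Agent",
    instructions="Route customer queries to the right department.",
    functions=[transfer_to_billing, transfer_to_technical],
)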
Task Decomposition & Planning
Breaking Down Hard Tasks
Orchestration starts with planning. How you decompose a query into sub-tasks — and assign them to agents — determines the whole pipeline's quality.
A key challenge in multi-agent orchestration is task decomposition: taking a complex user query and splitting it into sub-tasks that individual agents can solve. Three principles guide good decomposition: Solvability (each sub-task is independently resolvable by an available agent), Completeness (all aspects of the original query are covered), and Non-Redundancy (minimal effective set of sub-tasks).
"Compare the energy consumption and CO₂ footprint of training GPT-4 vs Llama 3 70B, and suggest which is more sustainable for a startup."
The user submits a complex multi-faceted query requiring: (1) retrieval of specific technical data, (2) comparative analysis, and (3) domain-specific recommendation. No single agent can reliably handle all three without accumulating errors.
This is where the meta-agent (orchestrator) takes over — its job is not to answer the question but to design the plan for answering it.
Meta-Agent (Orchestrator) decomposes the query into:
Sub-task A: GPT-4 training energy & CO₂
Sub-task B: Llama 3 70B training energy & CO₂
Sub-task C: Inference cost comparison
Sub-task D: Sustainability recommendation
The meta-agent applies the three AOP (Agent-Oriented Planning) principles (ICLR 2025, Li et al.):
Solvability: Is each sub-task independently solvable by an available agent? (A reward model predicts this without executing agents.)
Completeness: Do the sub-tasks together cover the full original query?
Non-Redundancy: Can any sub-task be merged without losing coverage?
AOP achieved 43.7% on Husky-QA vs. single-agent baselines of 33–36%, and outperformed the HUSKY system (39.6%).
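The three principles can be read as checks over a candidate plan. The sketch below is a toy, set-based version under strong assumptions: query "aspects" and agent "capabilities" are hand-written labels, and `SubTask` and `check_plan` are hypothetical names. AOP itself uses a learned reward model rather than label matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubTask:
    name: str
    covers: frozenset[str]  # aspects of the query this sub-task addresses
    needs: str              # capability an agent must have to solve it


def check_plan(required: set[str], plan: list[SubTask], capabilities: set[str]) -> dict[str, list[str]]:
    """Report violations of solvability, completeness, and non-redundancy."""
    # Solvability: some available agent must offer the needed capability.
    unsolvable = [t.name for t in plan if t.needs not in capabilities]
    # Completeness: the union of covered aspects must include every required aspect.
    covered = set().union(*(t.covers for t in plan))
    missing = sorted(required - covered)
    # Non-redundancy: a sub-task is redundant if the others already cover its aspects.
    redundant = [
        t.name
        for t in plan
        if t.covers <= set().union(*(o.covers for o in plan if o is not t))
    ]
    return {"unsolvable": unsolvable, "missing": missing, "redundant": redundant}


required = {"gpt4_energy", "llama_energy", "inference_cost", "recommendation"}
plan = [
    SubTask("A", frozenset({"gpt4_energy"}), "search"),
    SubTask("B", frozenset({"llama_energy"}), "search"),
    SubTask("C", frozenset({"inference_cost"}), "code"),
    SubTask("D", frozenset({"recommendation"}), "reasoning"),
]
report = check_plan(required, plan, {"search", "code", "reasoning"})
```

For the four sub-tasks of the running example, all three lists come back empty; dropping sub-task D or removing the code-execution capability surfaces the corresponding violation.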
Sub-task A → Reward Model (768-dim MLP) → one of three outcomes:
✓ Accept (solvability > 0.7)
↩ Replan (no agent can solve)
↗ Plan-in-detail (similarity match)
Before assigning any sub-task, AOP uses a reward model (768-dimensional MLP embedder) to predict whether an available agent can actually solve it — without executing the agent. Three outcomes:
Accept: High predicted solvability → assign to best-matching agent
Replan: No agent in the pool can solve this sub-task → decompose further or reformulate
Plan-in-detail: Similarity match against representative works suggests this sub-task needs further specification
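A rough sketch of the three-way decision follows. It assumes the reward model has already produced one solvability score per agent. The 0.7 acceptance threshold comes from the diagram above; the 0.4 lower bound is a made-up placeholder, and using a middle score band for plan-in-detail is a simplification of AOP's similarity match, not its actual rule.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    ACCEPT = "accept"
    REPLAN = "replan"
    PLAN_IN_DETAIL = "plan_in_detail"


def triage(
    scores: dict[str, float],
    accept_above: float = 0.7,   # threshold shown in the walkthrough
    detail_from: float = 0.4,    # hypothetical lower bound for "needs more detail"
) -> tuple[Verdict, Optional[str]]:
    """scores maps agent name -> predicted solvability in [0, 1]."""
    if not scores:
        return Verdict.REPLAN, None
    best = max(scores, key=scores.get)
    if scores[best] > accept_above:
        return Verdict.ACCEPT, best            # assign to the best-matching agent
    if scores[best] >= detail_from:
        return Verdict.PLAN_IN_DETAIL, best    # promising, but under-specified
    return Verdict.REPLAN, None                # no agent in the pool is a fit
```

The point of the gate is that it runs before any agent executes, so a bad assignment costs one cheap prediction instead of a full agent run.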
The DAAO framework (2025) extends this with a variational autoencoder that encodes query difficulty (0–1). Hard queries spawn more layers: L = ⌈d·ℓ⌉. Result: +11.21% over MaAS at only 64% of its inference cost.
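As a worked example of the depth rule, the snippet below evaluates L = ⌈d·ℓ⌉ for a few difficulty values. It assumes ℓ is a configured maximum number of layers, which the text above does not state, and it adds a floor of one layer for d = 0 as a practical guard; `layers_for` is a hypothetical helper name.

```python
import math


def layers_for(difficulty: float, max_layers: int) -> int:
    """Number of layers for a query, L = ceil(d * l), with at least one layer."""
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be in [0, 1]")
    return max(1, math.ceil(difficulty * max_layers))


easy = layers_for(0.25, 4)    # ceil(1.0) = 1
medium = layers_for(0.5, 4)   # ceil(2.0) = 2
hard = layers_for(0.9, 4)     # ceil(3.6) = 4
```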
Sub-task A (GPT-4 CO₂) and Sub-task B (Llama CO₂) → Research Agent (web + RAG)
Sub-task C (Inference cost) → Calculator Agent (code exec)
Sub-task D (Recommend) → Advisor Agent (reasoning)
DyLAN (Dynamic LLM-Powered Agent Network, COLM 2024) improves agent assignment by computing an Agent Importance Score — an unsupervised metric that dynamically selects the optimal agents from a candidate pool for each task, rather than using a fixed team.
Result: MMLU accuracy improved by up to 25% in specific subjects over fixed-team baselines. Agents not contributing to a task are dropped, reducing token waste and hallucination propagation.
Research Agent result + Calculator Agent result + Advisor Agent result → Meta-Agent Synthesizes → Final Answer
The meta-agent aggregates sub-task results into a coherent final answer. This is the most error-prone step — the aggregator must reconcile results that may be contradictory, use different units, or address slightly different formulations of the question.
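One way to make the unit and contradiction problem tangible is to normalize every reported figure before comparing. The sketch below is illustrative only: `reconcile`, the conversion table, the 10% tolerance, and the numbers in the example are all placeholders, not measurements or a published method.

```python
from statistics import median

TO_MWH = {"kWh": 0.001, "MWh": 1.0, "GWh": 1000.0}


def reconcile(reports: dict[str, tuple[float, str]], tolerance: float = 0.10) -> tuple[float, list[str]]:
    """Convert each agent's (value, unit) to MWh and flag agents far from the median."""
    in_mwh = {agent: value * TO_MWH[unit] for agent, (value, unit) in reports.items()}
    center = median(in_mwh.values())
    outliers = sorted(
        agent for agent, v in in_mwh.items() if abs(v - center) > tolerance * abs(center)
    )
    return center, outliers


# Placeholder figures: two agents agree once units are normalized, one does not.
center, outliers = reconcile({
    "research": (2.0, "GWh"),
    "calculator": (2000.0, "MWh"),
    "advisor": (900.0, "MWh"),
})
```

Flagged agents can then be re-queried or excluded before synthesis, rather than letting the aggregator silently average contradictory figures.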
The RL framework (Zhang 2026) formalizes aggregation as decision O4: how to combine partial outputs. Rewards include aggregation quality (semantic consistency + coverage), split correctness, and parallelism speedup. Notably, the stopping decision (O5) has no established RL training method yet — a key open research gap.
Interoperability
Agent Communication Protocols
As multi-agent systems move toward open ecosystems, standardized protocols replace ad-hoc message passing. Four protocols define the emerging stack.
A 2025 survey (Ehtesham et al., arXiv:2505.02279) identified four protocols forming an adoption ladder: from tool access (MCP) to full decentralized agent marketplaces (ANP). A companion paper (Yuan et al., arXiv:2604.02369) analyzing 18 protocols found a critical gap: most excel at communication and syntax but fail at semantic alignment — meaning verification, intent clarification, and context consistency across sessions.
Phase 1
MCP — Model Context Protocol
Anthropic · Nov 2024
Architecture: JSON-RPC client-server. The agent (client) makes typed function calls to tools (servers). Servers expose capabilities via a standardized schema.
Primary function: Secure tool invocation and typed data exchange. Standardizes how an agent accesses external tools (databases, APIs, file systems) without custom integration for each.
Governance
Donated to Agentic AI Foundation (AAIF, Linux Foundation) in December 2025. Co-founded by Anthropic, Block, OpenAI.
Limitation
Designed for tool access, not agent-to-agent delegation. No session state management across calls.
Use When
An agent needs reliable access to tools, databases, or file systems in a standardized way.
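For a sense of what the wire format looks like, here is a minimal sketch of building an MCP-style tool call as a JSON-RPC 2.0 message and unpacking the reply. The `tools/call` method and the `name`/`arguments` params follow the MCP specification; the helper names and the `query_database` tool used in the usage note are hypothetical, and transport, initialization, and capability negotiation are omitted.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests carry a unique id per call


def tool_call_request(tool: str, arguments: dict) -> str:
    """Serialize a tools/call request for the named tool."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


def parse_response(raw: str) -> dict:
    """Return the result payload, or raise if the server sent a JSON-RPC error."""
    message = json.loads(raw)
    if "error" in message:
        err = message["error"]
        raise RuntimeError(f"JSON-RPC error {err['code']}: {err['message']}")
    return message["result"]
```

A call such as `tool_call_request("query_database", {"sql": "SELECT 1"})` yields one self-describing message, which is what lets a single client talk to many tool servers without per-tool integration code.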
Phase 2
ACP — Agent Communication Protocol
AgentUnion / IBM
Architecture: RESTful HTTP with MIME-typed multipart messages. Supports rich, multimodal payloads. DIDs (Decentralized Identifiers) for agent identity.
Primary function: Scalable agent invocation and session management. Designed for agents that need to exchange complex, multi-part content (text + images + structured data) over HTTP.
More complex infrastructure but handles multi-step interactions and multimodal content that MCP's JSON-RPC wasn't designed for.
Use When
Agents need multi-turn stateful interactions, or need to exchange non-text content (images, audio, documents).
Limitation
Higher infrastructure overhead. Less widely adopted than MCP as of 2025.
Phase 3
A2A — Agent-to-Agent Protocol
Google · April 2025
Architecture: Peer-to-peer via capability Agent Cards. Each agent publishes a JSON card describing its capabilities, inputs, outputs, and constraints. Other agents discover and invoke via the card schema.
Primary function: Secure task delegation across enterprise workflows and organizational boundaries. Agents from different vendors or teams can interoperate without custom integration.
Limitation
Requires agents to publish and maintain accurate Agent Cards. Discovery at scale introduces latency.
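The card-based discovery idea can be sketched as follows. The card below is a simplified illustration loosely modeled on an A2A Agent Card: the field names are assumptions to check against the published schema, and the agents, URLs, and `find_agents` helper are hypothetical.

```python
# Simplified, illustrative cards; consult the A2A specification for the real schema.
billing_card = {
    "name": "billing-agent",
    "description": "Handles invoices and refunds.",
    "url": "https://agents.example.com/billing",
    "skills": [
        {"id": "refund", "description": "Issue a refund for an order.", "tags": ["billing", "refund"]},
    ],
}

support_card = {
    "name": "support-agent",
    "description": "Troubleshoots technical issues.",
    "url": "https://agents.example.com/support",
    "skills": [
        {"id": "diagnose", "description": "Diagnose a reported fault.", "tags": ["technical"]},
    ],
}


def find_agents(cards: list[dict], tag: str) -> list[str]:
    """Return the names of agents advertising at least one skill with the given tag."""
    return [card["name"] for card in cards if any(tag in skill["tags"] for skill in card["skills"])]


matches = find_agents([billing_card, support_card], "refund")
```

Delegation quality depends entirely on the card being accurate, which is why stale or over-claiming cards are a practical concern.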
Phase 4
ANP — Agent Network Protocol
W3C · Decentralized
Architecture: W3C Decentralized Identifiers (DIDs) + JSON-LD semantic graphs. Fully decentralized — no central registry or directory. Agents discover each other via DHT-style network.
Primary function: Open network discovery and decentralized collaboration. Enables fully autonomous agent marketplaces where agents find, contract with, and pay each other without human intermediation.
Key Concept
DIDs give agents persistent, verifiable identities independent of any platform. JSON-LD graphs enable semantic capability description beyond simple text tags.
Use Case
Open agent ecosystems, autonomous agent economies, cross-border multi-agent workflows without central authority.
Maturity
Earliest stage of the four protocols. Mostly research and proof-of-concept as of 2025. Most deployments are still on MCP + A2A.
Semantic Gap
Yuan et al. (arXiv:2604.02369): even with ANP's semantic graphs, most protocols push meaning alignment into application-level prompts, so the core interoperability challenge remains unsolved at the protocol layer.
Ensemble Architectures
Mixture of Agents
Can multiple weaker models collectively outperform a single stronger model? MoA says yes — but the devil is in the architecture.
Mixture-of-Agents (Wang et al., arXiv:2406.04692, Together AI) exploits a surprising property: LLMs are "collaborative" — they consistently generate better outputs when shown other models' responses as auxiliary context, even if those models are weaker. This enables layered ensemble architectures that outperform any single model.
Input: User Query
Layer 1: GPT-4o, Claude 3.5, Llama 3 70B, Mixtral 8x7B
↓ all Layer 1 outputs passed to Layer 2 as auxiliary context
Layer 2: GPT-4o, Claude 3.5, Llama 3 70B
↓ aggregated by Aggregator
Output: Aggregator (GPT-4o) → Final Answer
AlpacaEval 2.0: 65.1% for the paper's open-source-only configuration, versus 57.5% for GPT-4o standalone
Each layer passes all its outputs to the next layer as auxiliary context. The key insight: every LLM generates better responses when it can see what other models said first — even weaker models. The aggregator synthesizes all layer outputs into a final answer.
Cost warning: MoA multiplies token consumption by (number of models) × (number of layers). A 3-model × 2-layer MoA costs ~6× more than a single call. Latency also increases significantly unless layers run in parallel.
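The layered flow and its call count can be checked with stub models. This is a sketch of the general idea rather than the paper's implementation: `stub`, `run_moa`, the prompt wording, and the model names are all hypothetical, and the stubs return canned text instead of calling an LLM.

```python
from typing import Callable

Model = Callable[[str], str]
calls: dict[str, int] = {}  # how many times each stub model was invoked


def stub(name: str) -> Model:
    """Build a fake model that records its calls and returns a canned answer."""
    def model(prompt: str) -> str:
        calls[name] = calls.get(name, 0) + 1
        return f"[{name}] answer"
    return model


def run_moa(query: str, layers: list[list[Model]], aggregator: Model) -> str:
    """Each layer sees the query plus every response from the previous layer."""
    previous: list[str] = []
    for layer in layers:
        prompt = query
        if previous:
            prompt += "\n\nResponses from the previous layer:\n" + "\n".join(previous)
        previous = [model(prompt) for model in layer]
    return aggregator(query + "\n\nCandidates:\n" + "\n".join(previous))


proposers = [stub("model_a"), stub("model_b"), stub("model_c")]
final = run_moa("What is 2 + 2?", [proposers, proposers], stub("aggregator"))
proposer_calls = sum(n for name, n in calls.items() if name != "aggregator")
```

With three proposers and two layers, `proposer_calls` is 6, matching the ~6× estimate, and the aggregator adds one more call on top. Prompts in later layers are also longer because they carry the previous layer's responses, so token cost grows faster than call count. Within a layer the calls are independent and can run concurrently.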
Input: User Query
Layer 1: GPT-4o, GPT-4o, GPT-4o, GPT-4o
↓ 4 independent GPT-4o samples passed to aggregation
Output: GPT-4o Aggregator → Final Answer
AlpacaEval 2.0: 65.1% + 6.6% = ~71.7% — Self-MoA beats original MoA across all benchmarks
Li et al. (arXiv:2502.00674, 2025) challenged the assumption that heterogeneous models are better. Their finding: mixing different LLMs frequently lowers average output quality because weaker models drag down the ensemble.
Self-MoA uses the same top model multiple times with different random seeds. The variance between samples from a high-quality model is more beneficial than variance from mixing strong and weak models. Beats original MoA by 6.6% on AlpacaEval 2.0 and 3.8% average across MMLU, CRUX, MATH.
Takeaway: Don't mix for diversity's sake. Use the best available model and exploit within-model variance. MoE (Mixture of Experts, parameter-level) is architecturally distinct — it's internal routing within one model, not external delegation between separate models.
MAST · NeurIPS 2025
Why Multi-Agent Systems Fail
Cemri et al. analyzed 1,600+ annotated execution traces across 7 frameworks and identified 14 failure modes in 3 categories. Inter-annotator agreement: κ = 0.88.
The MAST paper (arXiv:2503.13657, NeurIPS 2025) is the most systematic empirical study of multi-agent LLM failures to date. The core finding: architecture and coordination failures dominate, not model capability failures. Tactical fixes like clearer prompts only yielded +14% improvement — "identified failures require more complex solutions."
FC1 — Specification & System Design Failures
41.77%
FM-1.1
Agents ignore task constraints
The agent accepts input in a wrong format or violates stated constraints (e.g., a chess agent accepting an invalid move notation instead of rejecting it). Root cause: prompts don't enforce constraints at the code level — the LLM must police itself and fails.
FM-1.2
Agents exceed role boundaries
A CEO agent makes unilateral technical decisions it was not assigned. An engineer agent begins managing stakeholder communications. Role boundaries are defined in prompts but not enforced — agents "helpfully" exceed their scope, causing downstream inconsistency.
FM-1.3
Redundant repetition without progress
The agent loop repeats completed steps in circles without advancing toward the goal. Classic "groundhog day" pattern — often triggered by ambiguous success criteria. The agent cannot determine that it has completed a step, so it re-executes it.
FM-1.4
Unexpected context loss
Critical information established earlier in the conversation disappears from the agent's effective context. Causes: context window overflow with truncation from the left, poor context management in the framework, or position-biased attention that de-weights middle-of-context information.
FM-1.5
Agents don't know when to stop
Agents continue executing after the task is complete, over-generating, second-guessing results, or starting new iterations of a completed workflow. Closely related to the RL stopping decision (O5) identified by Zhang (2026) as having no established training method — agents must be told explicitly when done.
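For FM-1.1, the usual remedy is to move the constraint out of the prompt and into code that runs before the agent's input is accepted. The sketch below validates move notation with a regular expression. It assumes UCI-style coordinate notation (for example `e2e4`), which is an illustrative choice; it checks format only, not whether the move is legal on the board, and `accept_move` is a hypothetical helper.

```python
import re

# UCI-style coordinate move: from-square, to-square, optional promotion piece.
UCI_MOVE = re.compile(r"[a-h][1-8][a-h][1-8][qrbn]?")


def accept_move(raw: str) -> str:
    """Normalize and validate a move string; reject anything malformed."""
    move = raw.strip().lower()
    if not UCI_MOVE.fullmatch(move):
        raise ValueError(f"rejected move {raw!r}: expected coordinate notation like 'e2e4'")
    return move
```

Because the check raises instead of relying on the model to notice the problem, a malformed move never reaches downstream agents.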
FC2 — Inter-Agent Misalignment
36.94%
FM-2.1
Dialogue unexpectedly restarts
A new agent brought into the conversation reintroduces itself, re-asks questions already answered, and restarts from scratch — losing all prior context. Occurs because agents receive only a subset of the full conversation history when the framework doesn't properly pass state at handoff.
FM-2.2
Fails to request clarification
Agent proceeds on an ambiguous instruction with a plausible but incorrect interpretation, rather than asking for clarification. The LLM's tendency to be "helpful" and avoid asking questions leads it to confidently execute the wrong task, propagating errors downstream.
FM-2.3
Task focus drifts
The original objective gets replaced by a related but different objective as agents exchange messages. Each agent subtly reframes the goal, and over multiple turns the system ends up solving a different problem from what the user requested.
FM-2.4
Agents withhold critical information
An agent omits information from its output that another agent needs to make a correct decision. Causes: summarization that drops edge cases, implicit assumptions not stated explicitly, or the agent assuming another agent already has the context.
FM-2.5
Agents disregard peer recommendations
A reviewer agent suggests a fix; the author agent acknowledges but doesn't implement it. Or a critic raises a valid concern that the producing agent dismisses without justification. Each agent's confidence in its own output is stronger than its receptiveness to peer correction.
FM-2.6
Reasoning contradicts executed actions
The agent's stated reasoning ("I will do X because...") doesn't match the action it actually takes (does Y instead). The CoT reasoning is disconnected from the tool call or output — the model's reasoning trace is post-hoc rationalization, not the actual decision driver.
FC3 — Task Verification & Termination
21.30%
FM-3.1
Tasks terminate before completion
The system declares success and terminates before the task is actually complete. Often triggered by a plausible-looking intermediate output that the orchestrator misidentifies as the final answer. Particularly common when the success criterion is vague ("done") rather than verifiable ("all 5 test cases pass").
FM-3.2
Verification skipped or incomplete
The verification step is not executed at all, or executes on the wrong output, or applies the wrong success criteria. The agent may "verify" by re-reading its own output — not by executing code, running tests, or comparing against ground truth. Self-verification is systematically overconfident.
FM-3.3
Verification reaches wrong conclusions
Verification runs but produces an incorrect verdict — approving a broken output or rejecting a correct one. Root causes: the verifier uses different assumptions than the producer, the test cases are insufficient, or the verifier LLM is the same model that produced the error (same blind spots).
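The FC3 failure modes share a remedy: verify by executing something, not by asking a model to re-read its own output. The sketch below is a minimal, hypothetical harness (`verify_by_execution` is not from the MAST paper) that runs a candidate function against explicit test cases and reports exactly which ones fail.

```python
from typing import Callable


def verify_by_execution(
    candidate: Callable[[int], int],
    cases: list[tuple[int, int]],
) -> tuple[bool, list[str]]:
    """Run the candidate on each (input, expected) pair and collect failures."""
    failures: list[str] = []
    for arg, expected in cases:
        try:
            got = candidate(arg)
        except Exception as exc:  # a crash counts as a failed case, not a skipped one
            failures.append(f"f({arg}) raised {type(exc).__name__}")
            continue
        if got != expected:
            failures.append(f"f({arg}) = {got}, expected {expected}")
    return (not failures, failures)


square_cases = [(2, 4), (3, 9)]
passed, problems = verify_by_execution(lambda x: x + x, square_cases)  # buggy "square"
```

The buggy candidate passes the first case by coincidence and fails the second, which illustrates why insufficient test cases (FM-3.3) can approve broken output even when verification does run.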
Engineering for Production
Reliability: SagaLLM & Transactional Guarantees
Multi-agent workflows need the same recovery guarantees as distributed databases. SagaLLM adapts the Saga pattern from distributed systems to LLM orchestration.
Traditional software transactions follow ACID guarantees — if any step fails, the whole transaction rolls back. Multi-agent LLM workflows are fundamentally different: steps are long-running, expensive, and non-reversible. SagaLLM (Chang & Geng, VLDB 2025) adapts the Saga pattern from distributed databases — where each step has a compensating transaction (a way to undo its effects) — to LLM planning workflows.
Four Problems SagaLLM Solves
1. Unreliable self-validation — LLMs overestimate their own output correctness. Independent validator agents provide objective verification.
2. Context loss across interactions — Modular checkpointing at each saga step preserves state. If an agent fails mid-workflow, recovery resumes from the last checkpoint, not the start.
3. No transactional safeguards — Compensating transactions allow rollback. If step 4 fails, steps 1–3 execute their compensating logic to undo side effects.
4. Weak inter-agent coordination — Saga-style dependencies make coordination explicit and verifiable, not implicit in prompts.
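The compensating-transaction idea in point 3 can be sketched in a few lines. This is the generic Saga pattern under simplifying assumptions (synchronous steps, compensations that cannot fail), not SagaLLM's implementation; `run_saga` and the booking example are hypothetical.

```python
from typing import Callable

# (step name, action, compensation that undoes the action's side effects)
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> tuple[bool, list[str]]:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    log: list[str] = []
    completed: list[tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"undone:{done_name}")
            return False, log
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True, log


bookings: set[str] = set()


def charge_card() -> None:
    raise RuntimeError("card declined")


ok, log = run_saga([
    ("flight", lambda: bookings.add("flight"), lambda: bookings.discard("flight")),
    ("hotel", lambda: bookings.add("hotel"), lambda: bookings.discard("hotel")),
    ("payment", charge_card, lambda: None),
])
```

After the payment step fails, the hotel and then the flight are released, leaving `bookings` empty. Undoing in reverse order matters because later steps may depend on the effects of earlier ones.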
Production Engineering Principles
Circuit breakers — detect runaway agent loops early; stop execution before costs spiral. GetOnStack learned this the hard way ($47K/week from unchecked loops).
Hard budget limits — maximum token budget per workflow. ZenML found "context rot" begins at 50k–150k tokens regardless of theoretical context windows.
Tool count discipline — "analysis paralysis" when >15 tools exposed simultaneously. Restrict available tools to the task-relevant subset at each stage.
Explicit termination criteria — machine-verifiable success conditions (code runs, tests pass, schema validates), not "the agent decides it's done."
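The first two principles can be combined into one guard that every agent step reports to. The sketch below is a minimal illustration: the class and exception names, the repeat threshold, and the idea of keying on `(agent, action)` pairs are assumptions, and the token counts would come from the model provider's usage data in a real system.

```python
from collections import Counter


class BudgetExceeded(RuntimeError):
    """Raised when the workflow has spent more tokens than allowed."""


class LoopDetected(RuntimeError):
    """Raised when the same agent repeats the same action too many times."""


class CircuitBreaker:
    def __init__(self, max_tokens: int, max_repeats: int = 3) -> None:
        self.max_tokens = max_tokens
        self.max_repeats = max_repeats
        self.tokens_used = 0
        self._seen: Counter = Counter()

    def record(self, agent: str, action: str, tokens: int) -> None:
        """Call once per agent step; raises to halt the workflow."""
        self.tokens_used += tokens
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"{self.tokens_used} tokens used, limit {self.max_tokens}")
        self._seen[(agent, action)] += 1
        if self._seen[(agent, action)] > self.max_repeats:
            raise LoopDetected(f"{agent} repeated {action!r} more than {self.max_repeats} times")
```

Raising an exception, rather than returning a warning the orchestrator might ignore, means a runaway loop stops the workflow by default and has to be explicitly handled to continue.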
The RL Frontier (Zhang 2026)
RL for multi-agent orchestration (arXiv:2605.02801) formalizes the orchestrator's job as 5 decisions: O1 (when to spawn), O2 (whom to delegate), O3 (how to communicate), O4 (how to aggregate), O5 (when to stop). Eight reward families cover outcome quality, parallelism speedup, aggregation accuracy, and team coordination. The stopping decision (O5) has no established RL training method yet — the most critical open research gap in production multi-agent engineering.