Post 52 · LLM Agent Orchestration

Why Orchestration

Beyond the Single Agent

Single agents hit a ceiling. Orchestration is how you break through it.

A single LLM agent is powerful but fundamentally limited: it processes one context window at a time, cannot truly parallelize work, and accumulates errors as task complexity grows. Agent orchestration — coordinating multiple specialized agents to collaborate on a shared goal — is the architecture that unlocks a new tier of capability.

But orchestration introduces its own hard problems. How should tasks be divided and assigned? How do agents communicate without losing context? What happens when an agent fails mid-task? How do you prevent infinite loops and cascading errors? The field has matured rapidly: from informal multi-agent experiments in 2023 to formal taxonomies, production-proven frameworks, standardized communication protocols, and NeurIPS-published failure taxonomies by 2025.

The Scale of the Shift

Gartner reported a 1,445% surge in multi-agent system inquiries between Q1 2024 and Q2 2025. Real production deployments show transformative results — but also painful failures. GetOnStack incurred $47,000/week in costs from undetected recursive agent loops. 40% of multi-agent pilots fail within 6 months of production deployment.

1,445%: surge in multi-agent system inquiries 2024→2025 (Gartner)
58.6%: Kimi K2.6 on SWE-Bench Pro via 300 swarm agents + 4,000 coordinated steps
20×: faster mortgage approvals with multi-agent orchestration (80% cost reduction)
41.8%: of MAST failures traced to Specification & System Design errors

Architecture Patterns

The Four Orchestration Patterns

Each pattern embodies a different tradeoff between control, latency, fault tolerance, and debuggability.

Supervisor: centralized control + delegation
Pipeline: sequential, stage by stage
Router: classify & dispatch
Swarm: decentralized peer-to-peer
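To make the control-flow differences concrete, here is a minimal framework-free sketch of three of the patterns (pipeline, router, supervisor). Every name in it is hypothetical: each "agent" is a plain Python function standing in for an LLM call, and the router's keyword match stands in for an LLM classifier. Swarm-style peer handoffs are left out to keep the example short.

```python
from typing import Callable, Dict, List

Agent = Callable[[str], str]  # hypothetical: an agent maps text in to text out

def research(task: str) -> str:
    return f"notes({task})"

def write(task: str) -> str:
    return f"draft({task})"

def review(task: str) -> str:
    return f"reviewed({task})"

def pipeline(task: str, stages: List[Agent]) -> str:
    """Pipeline: each stage consumes the previous stage's output."""
    for stage in stages:
        task = stage(task)
    return task

def router(task: str, routes: Dict[str, Agent], default: Agent) -> str:
    """Router: classify once, then dispatch to exactly one agent."""
    for keyword, agent in routes.items():
        if keyword in task.lower():
            return agent(task)
    return default(task)

def supervisor(task: str, workers: Dict[str, Agent], plan: List[str]) -> List[str]:
    """Supervisor: a central controller delegates and collects the results."""
    return [workers[name](task) for name in plan]

print(pipeline("agents", [research, write, review]))
# reviewed(draft(notes(agents)))
print(router("Write a summary", {"write": write, "find": research}, default=review))
# draft(Write a summary)
print(supervisor("agents", {"r": research, "w": write}, plan=["r", "w"]))
# ['notes(agents)', 'draft(agents)']
```

The pipeline is the easiest to debug because the order is fixed; the supervisor trades that predictability for the freedom to choose its plan at runtime.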

Major Frameworks

LangGraph · CrewAI · AutoGen · MetaGPT

Four production-grade frameworks, each with a distinct design philosophy, followed by OpenAI's educational Swarm and its production successor, the Agents SDK.

LangChain

LangGraph

Graph-based stateful orchestration. The most flexible and production-mature framework for complex conditional workflows.

Core Abstraction

StateGraph — directed graph with nodes (agents/functions) and conditional edges. Every node transformation updates a shared state object.

Orchestration Pattern

Supervisor via create_supervisor(): central node routes tasks to workers based on LLM reasoning. Also supports hierarchical, flat, and parallel configurations.

State Management

Built-in checkpointing with time-travel debugging. Long-horizon stateful workflows supported natively. LangGraph Studio for visual pipeline debugging.

Best For

Complex branching pipelines, compliance workflows, financial analysis, any workflow requiring conditional routing and auditability.

Weakness

Steep learning curve. Requires upfront graph design investment. Over-engineering risk for simple use cases.

Production Case

11x rebuilt "Alice" AI SDR using LangGraph hierarchical design; achieved human-level 2% reply rates.

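# LangGraph supervisor pattern sketch
# Assumes `pip install langgraph langgraph-supervisor`, and that research_agent,
# writer_agent, reviewer_agent (named, compiled agents) and llm (a chat model)
# are defined elsewhere.
from langgraph.checkpoint.memory import InMemorySaver
from langgraph_supervisor import create_supervisor

memory = InMemorySaver()  # in-process checkpointer; swap for a persistent one in production
supervisor = create_supervisor(
    agents=[research_agent, writer_agent, reviewer_agent],
    model=llm,
    prompt="You are a supervisor. Delegate tasks appropriately."
)
app = supervisor.compile(checkpointer=memory)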

CrewAI

Role-based agent crews. The lowest-friction framework — agents behave like employees with roles, goals, and backstories.

Core Abstraction

Agents with roles + goals + backstories forming a Crew. Three process types: Sequential, Hierarchical (manager_llm coordinates), Consensual (in dev).

Orchestration Pattern

Sequential: task N output feeds task N+1. Hierarchical: manager agent delegates, validates results before proceeding. Explicit manager_llm param required.

State Management

Role-specific memory tiers: short-term (within task), long-term (across tasks), entity-based, contextual. Output of each task available as context.

Best For

Role-delegation workflows, content production pipelines, business process automation. Fastest time-to-prototype.

Weakness

Less flexible for highly conditional or non-hierarchical workflows. Limited support for dynamic agent spawning mid-task.

Learning Curve

Lowest of all frameworks — 20 lines to a working multi-agent system. Strong CLI tooling and templates.

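# CrewAI hierarchical crew sketch
from crewai import Agent, Task, Crew, Process

researcher = Agent(role="Researcher", goal="Find key facts", backstory="...")
writer = Agent(role="Writer", goal="Draft the report", backstory="...")

# Illustrative tasks; in hierarchical mode the manager decides who handles each one.
research_task = Task(description="Collect key facts on the topic", expected_output="A bullet list of facts")
write_task = Task(description="Write a short report from the facts", expected_output="A one-page report")

crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, write_task],
    process=Process.hierarchical,
    manager_llm="gpt-4o",  # required for hierarchical; the model name is a placeholder
)
result = crew.kickoff()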

Microsoft · AutoGen / AG2

AutoGen / AG2

Conversational multi-agent framework. Agents communicate through structured dialogue; each agent can be an LLM, a tool, or a human-in-the-loop.

Core Abstraction

ConversableAgent — any combination of LLM, tools, human input. GroupChat orchestrates multiple agents in turn-based or conditional conversation patterns.

Orchestration Pattern

RoundRobinGroupChat (structured turns), SelectorGroupChat (LLM picks next speaker), GraphFlow (agents arranged as nodes in a directed graph). Each response broadcast to all participants.

State Management

Conversation history in-memory by default. Context = running message history. AutoGen v0.4+ adds more flexible patterns and Azure integration.

Best For

Iterative critique-and-revise tasks, brainstorming with human-in-the-loop, tasks requiring agents to debate and converge on answers.

Weakness

Context sharding challenges at scale. Less structured for formal business process workflows. Can generate verbose conversation history.

Paper

Wu et al. (2023) arXiv:2308.08155. AG2 is a community fork of AutoGen (late 2024) that continues the 0.2-style API; Microsoft's AutoGen v0.4 is a separate redesign. Strong Azure/Microsoft 365 integration.

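# AutoGen GroupChat sketch (classic 0.2-style API, also used by AG2)
from autogen import AssistantAgent, UserProxyAgent, GroupChat, GroupChatManager

llm_config = {"config_list": [{"model": "gpt-4"}]}  # expects OPENAI_API_KEY in the environment
assistant = AssistantAgent("assistant", llm_config=llm_config)
critic = AssistantAgent("critic", llm_config=llm_config)
user_proxy = UserProxyAgent("user", human_input_mode="NEVER", code_execution_config=False)
groupchat = GroupChat(agents=[user_proxy, assistant, critic], messages=[], max_round=10)
# The manager selects the next speaker and broadcasts each message to the group.
manager = GroupChatManager(groupchat=groupchat, llm_config=llm_config)
user_proxy.initiate_chat(manager, message="Draft and critique a launch plan.")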

ICLR 2024

MetaGPT

Assembly-line paradigm with standardized operating procedures. Agents produce and consume structured artifacts, not just dialogue.

Core Abstraction

5 specialized roles: Product Manager → Architect → Project Manager → Engineer → QA Engineer. Each produces structured artifacts (PRDs, system designs, code, tests).

Key Innovation

SOPs encoded into prompt sequences. Agents communicate via structured documents rather than natural language — prevents cascading hallucinations from ambiguous dialogue.

Benchmark Results

HumanEval 85.9% Pass@1. MBPP 87.7% Pass@1. SoftwareDev executability 3.75/4.0 (vs. ChatDev 2.25). Human revisions needed: 0.83 (vs. 2.5).

Best For

Software development, document generation, structured workflows where artifacts (not just chat) are the primary output.

Weakness

Rigid role structure. Not easily adapted to workflows outside the software development metaphor. Higher prompt engineering investment.

Paper

Hong et al. (2023) arXiv:2308.00352. ICLR 2024. Jürgen Schmidhuber is among the co-authors.

OpenAI (Educational)

OpenAI Swarm → Agents SDK

Minimalist two-abstraction framework built on transparency. Experimental/educational, superseded by OpenAI Agents SDK (March 2025) for production.

Two Abstractions

Agents: Python class with system prompt + tools + optional routine. Handoffs: a tool function returns the next agent as its value, transferring control based on context.

Design Philosophy

Strip away all abstraction. Maximum transparency — every routing decision is a visible Python function call. Stateless, client-side, built on Chat Completions API.

Handoff Mechanism

An agent's tool can return another Agent object. The orchestrator switches to the returned agent for the next turn. No hidden routing logic.

Production Evolution

OpenAI Agents SDK (March 2025) adds tracing, guardrails, streaming, and production observability on top of Swarm's conceptual model.

Best For

Learning multi-agent concepts. Low-complexity handoff workflows. Teams prioritizing full control and auditability over framework features.

Weakness

No built-in state management. No checkpointing. Not designed for long-horizon or parallel workflows.

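# OpenAI Swarm handoff sketch
from swarm import Agent

# Specialist agents; names and instructions are illustrative.
billing_agent = Agent(name="Billing Agent", instructions="Handle billing questions.")
technical_agent = Agent(name="Technical Agent", instructions="Handle technical issues.")

def transfer_to_billing():
    return billing_agent   # returning an Agent hands control to it

def transfer_to_technical():
    return technical_agent

triage_agent = Agent(
    name="Triage Agent",
    instructions="Route customer queries to the right department.",
    functions=[transfer_to_billing, transfer_to_technical]
)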
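The handoff mechanism itself needs no framework. The sketch below is a hypothetical stand-in, not Swarm's implementation: `SketchAgent`, `run`, and the keyword checks are invented for illustration, and in a real system the LLM would decide which handoff function to call. It shows the core loop: call the current agent's functions, and if one returns another agent, switch to it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class SketchAgent:
    """Hypothetical stand-in for an agent: a name plus handoff functions."""
    name: str
    # Each function inspects the query and returns the next agent, or None.
    handoffs: List[Callable[[str], Optional["SketchAgent"]]] = field(default_factory=list)

def run(start: SketchAgent, query: str, max_handoffs: int = 5) -> List[str]:
    """Follow handoffs until no function returns another agent."""
    trace = [start.name]
    current = start
    for _ in range(max_handoffs):  # hard cap so a handoff cycle cannot loop forever
        next_agent = None
        for fn in current.handoffs:
            next_agent = fn(query)
            if next_agent is not None:
                break
        if next_agent is None:
            break
        current = next_agent
        trace.append(current.name)
    return trace

billing = SketchAgent("billing")
technical = SketchAgent("technical")

def to_billing(query: str) -> Optional[SketchAgent]:
    return billing if "invoice" in query.lower() else None

def to_technical(query: str) -> Optional[SketchAgent]:
    return technical if "error" in query.lower() else None

triage = SketchAgent("triage", handoffs=[to_billing, to_technical])
print(run(triage, "My invoice is wrong"))  # ['triage', 'billing']
```

Because every routing decision is an ordinary function return, the full path through the agents can be logged or asserted on directly.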

Task Decomposition & Planning

Breaking Down Hard Tasks

Orchestration starts with planning. How you decompose a query into sub-tasks — and assign them to agents — determines the whole pipeline's quality.

A key challenge in multi-agent orchestration is task decomposition: taking a complex user query and splitting it into sub-tasks that individual agents can solve. Three principles guide good decomposition: Solvability (each sub-task is independently resolvable by an available agent), Completeness (all aspects of the original query are covered), and Non-Redundancy (minimal effective set of sub-tasks).

"Compare the energy consumption and CO₂ footprint of training GPT-4 vs Llama 3 70B, and suggest which is more sustainable for a startup."

The user submits a complex multi-faceted query requiring: (1) retrieval of specific technical data, (2) comparative analysis, and (3) domain-specific recommendation. No single agent can reliably handle all three without accumulating errors.

This is where the meta-agent (orchestrator) takes over — its job is not to answer the question but to design the plan for answering it.

Meta-Agent (Orchestrator) decomposes the query into:

Sub-task A: GPT-4 training energy & CO₂
Sub-task B: Llama 3 70B training energy & CO₂
Sub-task C: Inference cost comparison
Sub-task D: Sustainability recommendation

The meta-agent applies the three AOP (Agent-Oriented Planning) principles (ICLR 2025, Li et al.):

Solvability: Is each sub-task independently solvable by an available agent? (A reward model predicts this without executing agents.)
Completeness: Do the sub-tasks together cover the full original query?
Non-Redundancy: Can any sub-task be merged without losing coverage?

AOP achieved 43.7% on Husky-QA vs. single-agent baselines of 33–36%, and outperformed the HUSKY system (39.6%).
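The three principles can be read as checks on a candidate plan. The sketch below is a deliberately simplified illustration using set logic; AOP itself relies on learned components rather than anything like this. The function name `check_plan`, the aspect labels, and the skill names are all assumptions made for the example.

```python
from typing import Dict, List, Set

def check_plan(required: Set[str],
               subtasks: Dict[str, Set[str]],
               skills_needed: Dict[str, str],
               agent_skills: Set[str]) -> Dict[str, List[str]]:
    """Report violations of solvability, completeness, and non-redundancy."""
    # Solvability: every sub-task needs a skill that some agent offers.
    unsolvable = sorted(t for t, skill in skills_needed.items()
                        if skill not in agent_skills)
    # Completeness: the sub-tasks together must cover every required aspect.
    covered = set().union(*subtasks.values())
    missing = sorted(required - covered)
    # Non-redundancy: flag any sub-task whose removal keeps full coverage.
    redundant = []
    for name in subtasks:
        rest = set().union(*(a for other, a in subtasks.items() if other != name))
        if required <= rest:
            redundant.append(name)
    return {"unsolvable": unsolvable, "missing": missing, "redundant": sorted(redundant)}

required = {"gpt4_training", "llama_training", "recommendation"}
subtasks = {
    "A": {"gpt4_training"},
    "B": {"llama_training"},
    "D": {"recommendation"},
}
skills_needed = {"A": "search", "B": "search", "D": "reasoning"}
print(check_plan(required, subtasks, skills_needed, agent_skills={"search", "reasoning"}))
# {'unsolvable': [], 'missing': [], 'redundant': []}
```

An empty report means the plan passes all three checks; any non-empty list tells the planner which principle to repair before dispatching work.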

Sub-task A → Reward Model (768-dim MLP) → one of:

✓ Accept (solvability > 0.7)
↩ Replan (no agent can solve)
↗ Plan-in-detail (similarity match)

Before assigning any sub-task, AOP uses a reward model (768-dimensional MLP embedder) to predict whether an available agent can actually solve it — without executing the agent. Three outcomes:

Accept: High predicted solvability → assign to best-matching agent
Replan: No agent in the pool can solve this sub-task → decompose further or reformulate
Plan-in-detail: Similarity match against representative works suggests this sub-task needs further specification
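The three-way decision can be sketched as a small function. This is an illustration only: in AOP the scores come from the reward model and a similarity search, whereas here they are passed in as plain numbers. The 0.7 acceptance threshold is the one shown above; the `detail_at` threshold and all names are assumptions.

```python
from enum import Enum

class Outcome(Enum):
    ACCEPT = "accept"
    PLAN_IN_DETAIL = "plan-in-detail"
    REPLAN = "replan"

def triage(solvability: float, similarity: float,
           accept_at: float = 0.7, detail_at: float = 0.5) -> Outcome:
    """Decide what to do with a sub-task before any agent is executed."""
    if solvability > accept_at:
        return Outcome.ACCEPT            # assign to the best-matching agent
    if similarity >= detail_at:
        return Outcome.PLAN_IN_DETAIL    # close to known work, needs more detail
    return Outcome.REPLAN                # no agent fits: decompose or reformulate

print(triage(solvability=0.9, similarity=0.2))  # Outcome.ACCEPT
print(triage(solvability=0.4, similarity=0.8))  # Outcome.PLAN_IN_DETAIL
print(triage(solvability=0.2, similarity=0.1))  # Outcome.REPLAN
```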

The DAAO framework (2025) extends this with a variational autoencoder that encodes query difficulty (0–1). Hard queries spawn more layers: L = ⌈d·ℓ⌉. Result: +11.21% over MaAS at only 64% of its inference cost.
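As a worked example of the depth rule L = ⌈d·ℓ⌉, the sketch below assumes d is the predicted difficulty in [0, 1] and ℓ is the maximum number of layers. That reading of ℓ is an assumption, as are the function and parameter names.

```python
import math

def workflow_depth(difficulty: float, max_layers: int) -> int:
    """Number of layers to spawn: ceil(difficulty * max_layers)."""
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be in [0, 1]")
    return math.ceil(difficulty * max_layers)

for d in (0.1, 0.5, 0.9):
    print(d, workflow_depth(d, max_layers=4))
# 0.1 1
# 0.5 2
# 0.9 4
```

Easy queries get a single layer while hard ones use the full depth, which is how a difficulty-aware scheme can save inference cost on simple inputs.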

Sub-task A (GPT-4 CO₂) and Sub-task B (Llama CO₂) → Research Agent (web + RAG)
Sub-task C (Inference cost) → Calculator Agent (code exec)
Sub-task D (Recommend) → Advisor Agent (reasoning)

DyLAN (Dynamic LLM-Powered Agent Network, COLM 2024) improves agent assignment by computing an Agent Importance Score — an unsupervised metric that dynamically selects the optimal agents from a candidate pool for each task, rather than using a fixed team.

Result: MMLU accuracy improved by up to 25% in specific subjects over fixed-team baselines. Agents not contributing to a task are dropped, reducing token waste and hallucination propagation.
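Once importance scores exist, team selection reduces to ranking and pruning. The sketch below covers only that last step; how DyLAN actually computes the scores is not shown, and the score values, agent names, and `min_score` cutoff are made up for illustration.

```python
from typing import Dict, List

def select_team(importance: Dict[str, float], k: int, min_score: float = 0.0) -> List[str]:
    """Keep the k highest-scoring agents, dropping any below min_score."""
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, score in ranked[:k] if score >= min_score]

scores = {"researcher": 0.42, "calculator": 0.31, "advisor": 0.22, "poet": 0.05}
print(select_team(scores, k=3, min_score=0.1))
# ['researcher', 'calculator', 'advisor']
```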

Research Agent result + Calculator Agent result + Advisor Agent result → Meta-Agent synthesizes → Final Answer

The meta-agent aggregates sub-task results into a coherent final answer. This is the most error-prone step — the aggregator must reconcile results that may be contradictory, use different units, or address slightly different formulations of the question.

The RL framework (Zhang 2026) formalizes aggregation as decision O4: how to combine partial outputs. Rewards include aggregation quality (semantic consistency + coverage), split correctness, and parallelism speedup. Notably, the stopping decision (O5) has no established RL training method yet — a key open research gap.

Interoperability

Agent Communication Protocols

As multi-agent systems move toward open ecosystems, standardized protocols replace ad-hoc message passing. Four protocols define the emerging stack.

A 2025 survey (Ehtesham et al., arXiv:2505.02279) identified four protocols forming an adoption ladder: from tool access (MCP) to full decentralized agent marketplaces (ANP). A companion paper (Yuan et al., arXiv:2604.02369) analyzing 18 protocols found a critical gap: most excel at communication and syntax but fail at semantic alignment — meaning verification, intent clarification, and context consistency across sessions.

Phase 1

MCP — Model Context Protocol

Anthropic · Nov 2024

Architecture: JSON-RPC client-server. The agent (client) makes typed function calls to tools (servers). Servers expose capabilities via a standardized schema.

Primary function: Secure tool invocation and typed data exchange. Standardizes how an agent accesses external tools (databases, APIs, file systems) without custom integration for each.

Governance

Donated to Agentic AI Foundation (AAIF, Linux Foundation) in December 2025. Co-founded by Anthropic, Block, OpenAI.

Scale

Sentry MCP server: 60M requests/month. Loblaws "Alfred" agent wraps 50+ internal APIs via MCP.

Limitation

Designed for tool access, not agent-to-agent delegation. No session state management across calls.

Use When

An agent needs reliable access to tools, databases, or file systems in a standardized way.
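For a sense of what MCP traffic looks like on the wire, the sketch below builds a JSON-RPC 2.0 `tools/call` request and unpacks a response using only the standard library. The tool name `query_database` and its arguments are hypothetical, and a real client would also handle initialization, transport, and capability negotiation, none of which is shown.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC requests carry an id so responses can be matched

def tool_call_request(tool: str, arguments: dict) -> str:
    """Serialize a tools/call request for a named tool."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)

def parse_response(raw: str) -> dict:
    """Return the result payload, or raise if the server reported an error."""
    msg = json.loads(raw)
    if "error" in msg:
        raise RuntimeError(f"tool call failed: {msg['error'].get('message')}")
    return msg["result"]

# Hypothetical tool and arguments, for illustration only.
print(tool_call_request("query_database", {"sql": "SELECT 1"}))
```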

Phase 2

ACP — Agent Communication Protocol

AgentUnion / IBM

Architecture: RESTful HTTP with MIME-typed multipart messages. Supports rich, multimodal payloads. DIDs (Decentralized Identifiers) for agent identity.

Primary function: Scalable agent invocation and session management. Designed for agents that need to exchange complex, multi-part content (text + images + structured data) over HTTP.

Key Features

Session persistence across multiple request-response cycles. Streaming support. Native multimodal content types.

Compared to MCP

More complex infrastructure but handles multi-step interactions and multimodal content that MCP's JSON-RPC wasn't designed for.

Use When

Agents need multi-turn stateful interactions, or need to exchange non-text content (images, audio, documents).

Limitation

Higher infrastructure overhead. Less widely adopted than MCP as of 2025.

Phase 3

A2A — Agent-to-Agent Protocol

Google · April 2025

Architecture: Peer-to-peer via capability Agent Cards. Each agent publishes a JSON card describing its capabilities, inputs, outputs, and constraints. Other agents discover and invoke via the card schema.

Primary function: Secure task delegation across enterprise workflows and organizational boundaries. Agents from different vendors or teams can interoperate without custom integration.

Launch

Launched with 50+ technology partners: Atlassian, Box, Cohere, Intuit, LangChain, MongoDB, PayPal, Salesforce.

Key Innovation

Agent Cards enable dynamic capability discovery — an orchestrator can find the right specialist agent at runtime without pre-configuration.

Use When

Cross-organization agent collaboration, enterprise agent marketplaces, workflows spanning multiple vendor systems.

Limitation

Requires agents to publish and maintain accurate Agent Cards. Discovery at scale introduces latency.
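Capability discovery via Agent Cards can be pictured as a lookup over published metadata. The cards below are heavily simplified and are not complete or validated A2A cards; the agent names, URLs, skill ids, and tags are all invented for the example.

```python
from typing import Dict, List, Optional

# Simplified, hypothetical cards: real Agent Cards carry more fields.
CARDS: List[Dict] = [
    {"name": "invoice-agent", "url": "https://agents.example.com/invoice",
     "skills": [{"id": "extract-invoice", "tags": ["finance", "ocr"]}]},
    {"name": "travel-agent", "url": "https://agents.example.com/travel",
     "skills": [{"id": "book-flight", "tags": ["travel"]}]},
]

def find_agent(cards: List[Dict], tag: str) -> Optional[Dict]:
    """Return the first card advertising a skill with the given tag."""
    for card in cards:
        for skill in card.get("skills", []):
            if tag in skill.get("tags", []):
                return card
    return None

match = find_agent(CARDS, "travel")
print(match["name"] if match else "no agent found")  # travel-agent
```

The lookup is only as good as the cards: a stale or inaccurate card routes work to an agent that cannot do it, which is the maintenance burden noted above.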

Phase 4

ANP — Agent Network Protocol

W3C · Decentralized

Architecture: W3C Decentralized Identifiers (DIDs) + JSON-LD semantic graphs. Fully decentralized — no central registry or directory. Agents discover each other via DHT-style network.

Primary function: Open network discovery and decentralized collaboration. Enables fully autonomous agent marketplaces where agents find, contract with, and pay each other without human intermediation.

Key Concept

DIDs give agents persistent, verifiable identities independent of any platform. JSON-LD graphs enable semantic capability description beyond simple text tags.

Use Case

Open agent ecosystems, autonomous agent economies, cross-border multi-agent workflows without central authority.

Maturity

Earliest stage of the four protocols. Mostly research and proof-of-concept as of 2025. Most deployments are still on MCP + A2A.

Semantic Gap

Yuan et al. (arXiv:2604.02369): even with ANP's semantic graphs, most protocols push meaning alignment into application-level prompts, so the core interoperability challenge remains unsolved at the protocol layer.

Ensemble Architectures

Mixture of Agents

Can multiple weaker models collectively outperform a single stronger model? MoA says yes — but the devil is in the architecture.

Mixture-of-Agents (Wang et al., arXiv:2406.04692, Together AI) exploits a surprising property: LLMs are "collaborative" — they consistently generate better outputs when shown other models' responses as auxiliary context, even if those models are weaker. This enables layered ensemble architectures that outperform any single model.

Input: User Query
Layer 1: GPT-4o, Claude 3.5, Llama 3 70B, Mixtral 8x7B
↓ all Layer 1 outputs passed to Layer 2 as auxiliary context
Layer 2: GPT-4o, Claude 3.5, Llama 3 70B
↓ aggregated by Aggregator
Output: Aggregator (GPT-4o) → Final Answer

AlpacaEval 2.0: 65.1% for the paper's open-source-only configuration, beating standalone GPT-4o (57.5%). The proprietary models in the layers above are illustrative.

Each layer passes all its outputs to the next layer as auxiliary context. The key insight: every LLM generates better responses when it can see what other models said first — even weaker models. The aggregator synthesizes all layer outputs into a final answer.

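Cost warning: MoA multiplies the number of model calls by roughly (number of models) × (number of layers). A 3-model × 2-layer MoA makes about 6× as many calls as a single-model query, and later layers also read every earlier response as input, so token cost grows faster than the call count. Layers are inherently sequential; latency stays manageable only if the models within each layer run in parallel.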
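The layered flow and its call count can be traced with stub models. This is a sketch of the general idea rather than the paper's implementation: `make_stub`, the prompt wording, and the model names are assumptions, and each stub simply records that it was called.

```python
from typing import Callable, List

Model = Callable[[str], str]  # hypothetical: a model maps a prompt to a response

def make_stub(name: str, calls: List[str]) -> Model:
    """Build a fake model that logs each call instead of hitting an API."""
    def model(prompt: str) -> str:
        calls.append(name)
        return f"{name}-answer"
    return model

def mixture_of_agents(query: str, layers: List[List[Model]], aggregator: Model) -> str:
    previous: List[str] = []
    for layer in layers:
        # Every model in a layer sees the query plus all prior-layer responses.
        if previous:
            prompt = f"{query}\n\nOther responses:\n" + "\n".join(previous)
        else:
            prompt = query
        previous = [model(prompt) for model in layer]
    final_prompt = f"{query}\n\nCandidate responses:\n" + "\n".join(previous)
    return aggregator(final_prompt)

calls: List[str] = []
proposers = [make_stub(n, calls) for n in ("a", "b", "c")]
answer = mixture_of_agents("q", [proposers, proposers], make_stub("agg", calls))
print(answer, len(calls))  # agg-answer 7
```

With three proposers and two layers the run makes six proposer calls plus one aggregator call. The list comprehension runs each layer serially; issuing those calls concurrently is what keeps latency close to one call per layer.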

Input: User Query
Layer 1: four independent GPT-4o samples
↓ all 4 samples passed to aggregation
Output: GPT-4o Aggregator → Final Answer

AlpacaEval 2.0: Self-MoA beats mixed-model MoA by 6.6% in Li et al.'s own comparison.

Li et al. (arXiv:2502.00674, 2025) challenged the assumption that heterogeneous models are better. Their finding: mixing different LLMs frequently lowers average output quality because weaker models drag down the ensemble.

Self-MoA uses the same top model multiple times with different random seeds. The variance between samples from a high-quality model is more beneficial than variance from mixing strong and weak models. Beats original MoA by 6.6% on AlpacaEval 2.0 and 3.8% average across MMLU, CRUX, MATH.

Takeaway: Don't mix for diversity's sake. Use the best available model and exploit within-model variance. MoE (Mixture of Experts, parameter-level) is architecturally distinct — it's internal routing within one model, not external delegation between separate models.

MAST · NeurIPS 2025

Why Multi-Agent Systems Fail

Cemri et al. analyzed 1,600+ annotated execution traces across 7 frameworks and identified 14 failure modes in 3 categories. Inter-annotator agreement: κ = 0.88.

The MAST paper (arXiv:2503.13657, NeurIPS 2025) is the most systematic empirical study of multi-agent LLM failures to date. The core finding: architecture and coordination failures dominate, not model capability failures. Tactical fixes like clearer prompts only yielded +14% improvement — "identified failures require more complex solutions."

FC1 — Specification & System Design Failures

41.77%

FM-1.1

Agents ignore task constraints

The agent accepts input in a wrong format or violates stated constraints (e.g., a chess agent accepting an invalid move notation instead of rejecting it). Root cause: prompts don't enforce constraints at the code level — the LLM must police itself and fails.

FM-1.2

Agents exceed role boundaries

A CEO agent makes unilateral technical decisions it was not assigned. An engineer agent begins managing stakeholder communications. Role boundaries are defined in prompts but not enforced — agents "helpfully" exceed their scope, causing downstream inconsistency.

FM-1.3

Redundant repetition without progress

The agent loop repeats completed steps in circles without advancing toward the goal. Classic "groundhog day" pattern — often triggered by ambiguous success criteria. The agent cannot determine that it has completed a step, so it re-executes it.

FM-1.4

Unexpected context loss

Critical information established earlier in the conversation disappears from the agent's effective context. Causes: context window overflow with truncation from the left, poor context management in the framework, or position-biased attention that de-weights middle-of-context information.

FM-1.5

Agents don't know when to stop

Agents continue executing after the task is complete, over-generating, second-guessing results, or starting new iterations of a completed workflow. Closely related to the RL stopping decision (O5) identified by Zhang (2026) as having no established training method — agents must be told explicitly when done.
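FM-1.1 traces back to constraints that exist only in the prompt. One mitigation is to validate agent output in code before it reaches the next step. The sketch below does this for the chess example, checking only the format of a move in coordinate notation (not its legality); the function name and the choice of notation are assumptions.

```python
import re

# Coordinate notation: from-square, to-square, optional promotion piece.
MOVE_PATTERN = re.compile(r"[a-h][1-8][a-h][1-8][qrbn]?")

def validated_move(raw: str) -> str:
    """Return a normalized move, or raise instead of passing bad input along."""
    move = raw.strip().lower()
    if not MOVE_PATTERN.fullmatch(move):
        raise ValueError(f"rejected move {raw!r}: expected coordinate notation like 'e2e4'")
    return move

print(validated_move(" E2E4 "))  # e2e4
try:
    validated_move("knight to f3")
except ValueError as exc:
    print(exc)
```

The rejection happens in deterministic code, so the agent cannot talk itself into accepting malformed input.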

FC2 — Inter-Agent Misalignment

36.94%

FM-2.1

Dialogue unexpectedly restarts

A new agent brought into the conversation reintroduces itself, re-asks questions already answered, and restarts from scratch — losing all prior context. Occurs because agents receive only a subset of the full conversation history when the framework doesn't properly pass state at handoff.

FM-2.2

Fails to request clarification

Agent proceeds on an ambiguous instruction with a plausible but incorrect interpretation, rather than asking for clarification. The LLM's tendency to be "helpful" and avoid asking questions leads it to confidently execute the wrong task, propagating errors downstream.

FM-2.3

Task focus drifts

The original objective gets replaced by a related but different objective as agents exchange messages. Each agent subtly reframes the goal, and over multiple turns the system ends up solving a different problem from what the user requested.

FM-2.4

Agents withhold critical information

An agent omits information from its output that another agent needs to make a correct decision. Causes: summarization that drops edge cases, implicit assumptions not stated explicitly, or the agent assuming another agent already has the context.

FM-2.5

Agents disregard peer recommendations

A reviewer agent suggests a fix; the author agent acknowledges but doesn't implement it. Or a critic raises a valid concern that the producing agent dismisses without justification. Each agent's confidence in its own output is stronger than its receptiveness to peer correction.

FM-2.6

Reasoning contradicts executed actions

The agent's stated reasoning ("I will do X because...") doesn't match the action it actually takes (does Y instead). The CoT reasoning is disconnected from the tool call or output — the model's reasoning trace is post-hoc rationalization, not the actual decision driver.

FC3 — Task Verification & Termination

21.30%

FM-3.1

Tasks terminate before completion

The system declares success and terminates before the task is actually complete. Often triggered by a plausible-looking intermediate output that the orchestrator misidentifies as the final answer. Particularly common when the success criterion is vague ("done") rather than verifiable ("all 5 test cases pass").

FM-3.2

Verification skipped or incomplete

The verification step is not executed at all, or executes on the wrong output, or applies the wrong success criteria. The agent may "verify" by re-reading its own output — not by executing code, running tests, or comparing against ground truth. Self-verification is systematically overconfident.

FM-3.3

Verification reaches wrong conclusions

Verification runs but produces an incorrect verdict — approving a broken output or rejecting a correct one. Root causes: the verifier uses different assumptions than the producer, the test cases are insufficient, or the verifier LLM is the same model that produced the error (same blind spots).

Engineering for Production

Reliability: SagaLLM & Transactional Guarantees

Multi-agent workflows need the same recovery guarantees as distributed databases. SagaLLM adapts the Saga pattern from distributed systems to LLM orchestration.

Traditional software transactions follow ACID guarantees — if any step fails, the whole transaction rolls back. Multi-agent LLM workflows are fundamentally different: steps are long-running, expensive, and non-reversible. SagaLLM (Chang & Geng, VLDB 2025) adapts the Saga pattern from distributed databases — where each step has a compensating transaction (a way to undo its effects) — to LLM planning workflows.

Four Problems SagaLLM Solves

1. Unreliable self-validation — LLMs overestimate their own output correctness. Independent validator agents provide objective verification.

2. Context loss across interactions — Modular checkpointing at each saga step preserves state. If an agent fails mid-workflow, recovery resumes from the last checkpoint, not the start.

3. No transactional safeguards — Compensating transactions allow rollback. If step 4 fails, steps 1–3 execute their compensating logic to undo side effects.

4. Weak inter-agent coordination — Saga-style dependencies make coordination explicit and verifiable, not implicit in prompts.
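The compensating-transaction idea from point 3 can be sketched in a few lines. This is the generic Saga pattern, not SagaLLM's implementation, and it leaves out validation and checkpointing; the step names and the `run_saga` helper are hypothetical.

```python
from typing import Callable, List, Tuple

# Each step is (name, action, compensation); compensation undoes the action.
Step = Tuple[str, Callable[[], None], Callable[[], None]]

def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse."""
    completed: List[Step] = []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"done:{name}")
            completed.append((name, action, compensate))
        except Exception as exc:
            log.append(f"failed:{name}:{exc}")
            for done_name, _, undo in reversed(completed):
                undo()
                log.append(f"undone:{done_name}")
            return False
    return True

def book_hotel() -> None:
    raise RuntimeError("no rooms")

log: List[str] = []
ok = run_saga([
    ("flight", lambda: None, lambda: None),  # stand-ins for real side effects
    ("hotel", book_hotel, lambda: None),
], log)
print(ok, log)
# False ['done:flight', 'failed:hotel:no rooms', 'undone:flight']
```

Undoing in reverse order matters: later steps may depend on earlier ones, so their effects have to be unwound first.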

Production Engineering Principles

Circuit breakers — detect runaway agent loops early; stop execution before costs spiral. GetOnStack learned this the hard way ($47K/week from unchecked loops).

Hard budget limits — maximum token budget per workflow. ZenML found "context rot" begins at 50k–150k tokens regardless of theoretical context windows.

Tool count discipline — "analysis paralysis" when >15 tools exposed simultaneously. Restrict available tools to the task-relevant subset at each stage.

Explicit termination criteria — machine-verifiable success conditions (code runs, tests pass, schema validates), not "the agent decides it's done."
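Circuit breakers and hard budgets can share one small guard object that the orchestrator calls after every agent step. The sketch below is a minimal illustration under assumed names and limits (`LoopGuard`, `BudgetExceeded`, the specific thresholds); a production version would also persist counters and emit alerts.

```python
from typing import Optional

class BudgetExceeded(RuntimeError):
    """Raised when a workflow trips one of its hard limits."""

class LoopGuard:
    def __init__(self, max_steps: int, max_tokens: int, max_repeats: int = 3) -> None:
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.max_repeats = max_repeats
        self.steps = 0
        self.tokens = 0
        self.last_action: Optional[str] = None
        self.repeats = 0

    def record(self, action: str, tokens: int) -> None:
        """Call once per agent step; raises as soon as any limit is hit."""
        self.steps += 1
        self.tokens += tokens
        # Count consecutive identical actions to catch loops without progress.
        self.repeats = self.repeats + 1 if action == self.last_action else 1
        self.last_action = action
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit {self.max_steps} exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token budget {self.max_tokens} exceeded")
        if self.repeats >= self.max_repeats:
            raise BudgetExceeded(f"action {action!r} repeated {self.repeats} times in a row")

guard = LoopGuard(max_steps=20, max_tokens=50_000)
try:
    for _ in range(10):
        guard.record("search('GPT-4 energy')", tokens=1_200)
except BudgetExceeded as exc:
    print(f"halted: {exc}")
```

Here the repeat detector fires on the third identical action, long before the step or token limits would, which is the point of layering the checks.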

The RL Frontier (Zhang 2026)

RL for multi-agent orchestration (arXiv:2605.02801) formalizes the orchestrator's job as 5 decisions: O1 (when to spawn), O2 (whom to delegate), O3 (how to communicate), O4 (how to aggregate), O5 (when to stop). Eight reward families cover outcome quality, parallelism speedup, aggregation accuracy, and team coordination. The stopping decision (O5) has no established RL training method yet — the most critical open research gap in production multi-agent engineering.
