🤖
Multi-Agent System Evaluation
MASEval Framework — Visual Interactive Guide
Gap → Topologies → Roles → Multi-Model → Dimensions → Benchmarks → Lifecycle → Frameworks → Pareto → Leaderboard → Errors → Anti-Patterns → Scaling → Context → Race → Cost → Trace → Wizard → Use Cases

The Evaluation Gap

Single-model benchmarks miss everything that matters in multi-agent systems. The unit of analysis must shift from the model to the entire system: topology, orchestration logic, error handling, and framework choice all shape real-world performance.

6 Frameworks Evaluated
4 Topology Patterns
3 Key Benchmarks
5 Evaluation Phases
The Model Fallacy
Downgrading GPT-4o to a weaker model inside AutoGen can cost less performance than keeping the model but running it in a worse-architected framework. System design decisions compound.
What MASEval Measures
Full-system evaluation across four dimensions: task performance, communication efficiency, error resilience, and resource efficiency, measured simultaneously rather than sequentially.
Why GAIA / Tau / MMLU
Multi-step reasoning (GAIA), tool orchestration & policy compliance (Tau-bench), and broad knowledge (MMLU) each stress-test a different dimension of MAS capability.
Why is MAS evaluation harder than single-model evaluation?
Single-model evaluation treats the LLM as a black box: input → output → score. Multi-agent systems introduce emergent behaviour that cannot be predicted from individual agent performance. A system of three strong individual agents may fail catastrophically because of poor orchestration, while a weaker model in a better-designed topology may outperform a frontier model in a rigid pipeline. Errors also compound: a 10% failure rate per step across a 4-agent chain produces a ~34% task failure rate, a non-linearity absent from single-step evaluation.
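The compounding arithmetic is easy to check. A minimal sketch, purely illustrative and not part of the MASEval library:

```python
def chain_failure_rate(per_step_failure: float, steps: int) -> float:
    """Probability that at least one step of a sequential chain fails."""
    return 1 - (1 - per_step_failure) ** steps

# 10% failure per step across a 4-agent chain:
print(round(chain_failure_rate(0.10, 4), 3))  # 0.344
```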
What is the MASEval benchmark suite?
MASEval is a framework-agnostic evaluation library that wraps popular agent frameworks (AutoGen, LangGraph, LlamaIndex, CAMEL, smolagents, Claude Agent SDK) behind a common interface called AgentAdapter. It runs identical benchmark tasks through each system, collects structured traces (messages, tool calls, token counts), and computes system-level metrics. The result: apples-to-apples comparisons across entire architectures, not just models. It is published under MIT license at parameterlab.github.io/MASEval.

System Architectures

The topology of a multi-agent system determines how agents coordinate, how information flows, and, critically, how failures cascade. Click each pattern to explore its structure, coordination mode, and GAIA performance.

Select a topology above to see its properties.
Sequential · GAIA
38.2%
Chain pipeline
Hierarchical · GAIA
51.7%
Manager + workers
Parallel · GAIA
46.3%
Fan-out + aggregator
Mesh · GAIA
44.8%
Peer-to-peer

Agent Role Specialization

Assigning the right role to each agent is as important as choosing the right topology. Click roles to assign them to agent slots, then see how assignment quality affects the system's overall GAIA score.

Available roles: 📋 Planner ⚙ Executor 🔍 Critic 📝 Summarizer 🔗 Coordinator
Click a role above, then click an agent slot below to assign it. Try both optimal and suboptimal configurations.
Assign roles to agents to see the performance impact. Optimal assignment maximises GSR; mismatched roles create bottlenecks.
📋 Planner
Decomposes the goal into sub-tasks. Critical at the root of hierarchical topologies. Without a dedicated Planner, orchestrators try to plan and execute simultaneously, degrading both.
🔍 Critic
Validates outputs before they pass downstream. Adding a Critic agent to any topology improves GAIA GSR by 7–12% on Level 3 tasks by catching hallucinations before they compound.
📝 Summarizer
Compresses inter-agent context to fit within token windows. Essential in mesh and long-chain topologies where cumulative context exceeds model limits. Reduces token cost by 30–40%.

Multi-Model MAS

Real deployments often mix models: a powerful frontier model as orchestrator and cheaper specialist models as workers. Click to assign different LLMs to each agent slot and see the predicted performance and cost impact.

Click an agent card to cycle through available models. The chart updates in real time.
Assign models to agents to see predicted GAIA GSR and monthly token cost shifts.
Orchestrator Effect
Using GPT-4o or Claude 3.5 Sonnet as orchestrator and Claude Haiku as workers can retain 90–95% of full-frontier performance at 40–60% of the token cost. The orchestrator's planning quality matters most; worker quality matters less for simple sub-tasks.
Cost-Performance Sweet Spot
The optimal heterogeneous assignment for hierarchical topology: frontier model (Claude 3.5 / GPT-4o) as orchestrator + mid-tier model (Haiku / GPT-4o mini) as workers. This pattern achieves ~48% GAIA GSR at roughly 55% of the all-frontier token cost.

What We Measure

MASEval evaluates systems across four orthogonal dimensions. Drag the sliders to define your ideal system profile. Select a reference framework to see how real systems compare.

Your Target Profile
● Task Performance 75
● Communication Eff. 65
● Error Resilience 70
● Resource Efficiency 60
Reference Framework
Radar Legend
– Violet polygon = Your target profile
– Amber polygon = Reference framework
Drag sliders to position your ideal operating point. Overlap with reference indicates framework fit.
Task Performance
Measured by GSR (Goal Success Rate) from GAIA: the percentage of tasks where the agent system correctly completes the full multi-step goal. Also includes tool-call success rate from Tau-bench and accuracy from MMLU.
Error Attribution
MASEval classifies failures into three categories: Agent Error (the agent produced wrong output), Environment Error (tool/API failure outside agent control), and User Error (malformed input or invalid goal). Attribution shapes corrective action.

Testing Grounds

MASEval evaluates systems on three benchmark families, each designed to stress-test a different set of MAS capabilities. Click a benchmark to explore what it tests and how topologies perform.

GAIA (General AI Assistants)
466 tasks · GSR metric · Multi-step web + tool tasks
Tau-bench (Tool Augmented Understanding)
120 tasks · Completion + policy rate · Domain workflows
MMLU (Massive Multitask Language Understanding)
14,042 tasks · Accuracy · Broad domain knowledge
Select a benchmark above to see MAS relevance and topology scores.
Why Three Benchmarks?
Each benchmark isolates a distinct axis of MAS performance. GAIA tests end-to-end task completion under tool uncertainty. Tau-bench tests compliance under constrained workflows. MMLU tests knowledge routing across specialized agents. Together they triangulate a system's true capability profile.
Topology Performance Gap
Hierarchical topologies consistently outperform sequential across all three benchmarks. The orchestrator's ability to delegate and retry subtasks independently produces a 13.5 percentage-point GAIA GSR improvement over naive sequential pipelines.

Evaluation Lifecycle

MASEval structures every evaluation run into five reproducible phases. Click each phase to understand what happens, what is collected, and how it maps to the evaluation dimensions.

Open Source
MASEval is published under MIT license at parameterlab.github.io/MASEval. All experiments in the paper are fully reproducible using the provided harness configurations and seed values.
Framework-Agnostic
Adding a new framework requires implementing just two methods: _run_agent() and get_messages(). The adapter pattern keeps evaluation logic decoupled from framework internals.
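A hypothetical sketch of what such an adapter might look like. The two method names come from the text above, but the signatures, the base-class details, and the `EchoAdapter` example are assumptions for illustration only:

```python
class AgentAdapter:
    """Common interface MASEval drives for every wrapped framework (sketch)."""

    def _run_agent(self, task: str) -> str:
        """Run the underlying framework on one task and return its answer."""
        raise NotImplementedError

    def get_messages(self) -> list[dict]:
        """Return the structured message trace collected during the run."""
        raise NotImplementedError


class EchoAdapter(AgentAdapter):
    """Toy adapter wrapping a trivial 'framework', to show the contract."""

    def __init__(self) -> None:
        self._log: list[dict] = []

    def _run_agent(self, task: str) -> str:
        answer = f"echo: {task}"
        self._log.append({"role": "assistant", "content": answer})
        return answer

    def get_messages(self) -> list[dict]:
        return self._log
```

Because the evaluation harness only sees these two methods, swapping AutoGen for LangGraph underneath does not touch any scoring logic.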
CI/CD Ready
Reports export as structured JSON, enabling integration with CI/CD pipelines. Run MASEval on every architecture change to catch system-level regressions before they reach production.
Paper: MASEval: A Framework for Multi-Agent System Evaluation
ParameterLab · arxiv.org/abs/2603.08835 · MIT License
Multi-Agent Systems · GAIA Benchmark · Framework Evaluation · Reproducible Research

Framework vs. Model

The MASEval paper's headline finding: framework choice matters as much as model choice. Toggle between views and metrics to explore why architectural decisions are as consequential as foundation model selection.

Group By
Metric
Framework ≈ Model in impact. The performance spread across frameworks (44.6% → 53.2% GAIA GSR) nearly equals the spread across frontier models (49.7% → 55.4%). Engineers optimizing only model selection are leaving half the performance gains on the table.
Why Frameworks Diverge
Frameworks differ in how they handle retries, tool call batching, inter-agent context passing, and error recovery. These implementation choices accumulate across a multi-step task, producing system-level performance gaps invisible in single-agent benchmarks.
Token Cost Trade-off
Higher-performing frameworks often use more tokens through richer context passing and verification steps. smolagents is the most token-efficient (55.3k avg) while CAMEL has the highest consumption (90.1k). The right choice depends on your cost/performance operating point.

Cost-Performance Pareto Frontier

Every framework + topology combination represents a point in cost/performance space. The Pareto frontier identifies combinations where no other config is strictly better on both axes. Click any point to inspect its config.

Click any dot on the chart to inspect that framework + topology configuration.
Reading the Pareto frontier: points on the frontier (the connected line) are "efficient": no other configuration achieves higher GSR at the same or lower token cost. Points below and to the right of the frontier are suboptimal and should generally be avoided unless they offer other, non-measured advantages (e.g. framework familiarity, latency).
Pareto-Optimal Configs
Claude SDK + Hierarchical leads the frontier: highest GSR (53.2%) at moderate token cost (68.9k). smolagents + Parallel offers the best cost efficiency at acceptable performance. These two represent opposite ends of the efficiency frontier.
Dominated Configs
CAMEL + Mesh is dominated: highest token cost (90.1k) but not the highest GSR. AutoGen + Sequential has similar cost to CAMEL but much lower performance. Neither belongs on the Pareto frontier for most use cases.

MASEval Leaderboard

All 24 framework × topology combinations ranked across every metric. Click any column header to sort. Filter by framework or topology. ★ marks Pareto-optimal configurations.

Framework:
Topology:
# ↕ Framework ↕ Topology ↕ GAIA GSR% ↓ Tau% ↕ MMLU% ↕ Tokens(k) ↕ Latency(s) ↕

Failure Anatomy

How failures cascade (or isolate) depends entirely on topology. Select a topology, then inject a failure into any agent to watch how errors propagate through the system and observe the attribution breakdown.

Error Attribution Breakdown
Agent 65%
Env 25%
User 10%
■ Agent Error ■ Env Error ■ User Error
Select a topology and inject a failure to see propagation behaviour.
Reliability Calculator
Per-agent failure rate 10%
Number of agents 4
Formulas: Sequential system failure = 1 − (1 − p)ⁿ | Parallel system failure ≈ pⁿ | Hierarchical ≈ p + (1 − p)·p_subtask
Sequential Cascade
A failure at step 2 of a 4-agent chain blocks all 3 downstream agents. Error rate compounds: each 10% agent failure probability produces ~34% system failure in a 4-step pipeline. Sequential topologies require robust individual agents.
Parallel Isolation
A failure in Worker-2 of a parallel topology only affects that worker's contribution. Workers 1 and 3 complete normally; the aggregator can proceed with partial results. This makes parallel architectures inherently more resilient to single-agent failures.
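The two formulas above can be compared directly. A small sketch, assuming (as the parallel formula does) redundant workers that must all fail for the task to be lost:

```python
def sequential_failure(p: float, n: int) -> float:
    # Chain fails if any of n agents fails: 1 - (1 - p)^n
    return 1 - (1 - p) ** n

def parallel_failure(p: float, n: int) -> float:
    # Approximation from the formula above: all n redundant workers must fail
    return p ** n

p, n = 0.10, 4
print(round(sequential_failure(p, n), 3))  # 0.344
print(round(parallel_failure(p, n), 6))    # 0.0001
```

The three-orders-of-magnitude gap at the same per-agent failure rate is why parallel topologies are called inherently resilient.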
What is the difference between agent, environment, and user errors?
Agent Error: The agent produced incorrect output, got stuck in a loop, or failed to use a tool correctly. Corrected by improving agent instructions, adding verification steps, or switching frameworks.

Environment Error: A tool API returned an error, a file was inaccessible, or a network timeout occurred, all outside the agent's control. Corrected by improving tool wrappers, adding retry logic, or handling environment-side failures.

User Error: The input task was malformed, contradictory, or impossible. Corrected at the input validation layer before the agent system is invoked.
How does error attribution guide framework selection?
If your dominant error type is Agent Error, prioritise frameworks with built-in reflection and retry mechanisms (LangGraph's conditional edges, AutoGen's critique loops). If Environment Error dominates, prioritise frameworks with robust tool-call error handling and fallback paths. Misattributing agent errors as environment errors leads to investing in the wrong layer of the stack.

Evaluation Anti-Patterns

Most teams evaluate MAS incorrectly, and the mistakes compound. Click each pattern to understand the failure mode and how to fix it.

Root cause: Most anti-patterns stem from applying single-model evaluation thinking to multi-agent systems. The unit of analysis must be the entire system (topology, orchestration, error handling, and model) evaluated simultaneously on the same benchmark.

Token Scaling Visualizer

How does token consumption grow as you add more agents? Different topologies scale at fundamentally different rates. Adjust the base token cost per agent to see how your specific workload will scale.

Base tokens / agent turn 800
Mesh Grows Quadratically
In a fully-connected mesh topology, each agent must process messages from all other agents. With n agents, total context grows as O(n²). At 6 agents this is 36× the single-agent baseline, often exceeding model context windows entirely.
Parallel Stays Flat
Parallel topology workers process tasks independently with minimal cross-agent context. Token cost scales as O(n) for workers plus a fixed aggregation cost. Adding more workers doesn't compound cost, making parallel the most scalable topology for large n.
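The two growth curves can be sketched directly. The fixed aggregation cost of 1,600 tokens is an illustrative assumption, not a number from the post:

```python
def mesh_tokens(base: int, n: int) -> int:
    # Fully connected mesh: every agent re-reads every agent's output, O(n^2)
    return base * n * n

def parallel_tokens(base: int, n: int, aggregation: int = 1_600) -> int:
    # Independent workers, O(n), plus a fixed aggregation step
    return base * n + aggregation

base = 800  # matches the visualizer's default base tokens per agent turn
for n in (2, 4, 6):
    print(n, mesh_tokens(base, n), parallel_tokens(base, n))
```

At n = 6 the mesh figure is 36× the single-agent baseline, exactly the blow-up described above.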

Context Window Growth

As agents exchange messages across turns, cumulative context grows, often hitting model limits before task completion. Use the slider to set your model's context limit and see which topologies breach it first.

Model context limit (k tokens) 32k
Summarizer Agent
Adding a dedicated Summarizer agent compresses inter-agent context by 60–70% at each turn boundary. This delays context-limit breaches, enabling longer multi-step tasks without truncation.
Sliding Window
MASEval supports configurable context window strategies: full history, a last-N-turns sliding window, or hierarchical summary. A sliding window with N=3 gives ~80% task performance at 30% of the context cost of full history.
Partial Observability
Agents in parallel and hierarchical topologies can be configured with partial observability: each agent sees only messages addressed to it. This bounds per-agent context to O(1) regardless of system size.
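The last-N-turns strategy mentioned above is simple to sketch; this helper is hypothetical, not the library's actual API:

```python
def sliding_window(history: list[str], n_turns: int = 3) -> list[str]:
    """Keep only the last N turns of inter-agent message history."""
    return history[-n_turns:]

history = [f"turn {i}" for i in range(10)]
print(sliding_window(history))  # ['turn 7', 'turn 8', 'turn 9']
```

The trade-off is information loss: anything the dropped turns contained must either be re-derived or preserved by a Summarizer agent.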

Topology Race

Watch two topologies run the same task simultaneously. The animated message flows reveal coordination overhead differences in real time. Select your topologies and hit Race.

Hierarchical
Ready
Sequential
Ready

Live Cost Estimator

Estimate your monthly API spend based on task volume, topology, and framework. Adjust the sliders to model your workload and see the cost breakdown.

Tasks / month 5,000
Avg task complexity Medium
Framework
Topology
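Under the hood, any such estimator reduces to tasks × tokens-per-task × price-per-token. A back-of-envelope sketch: the 68.9k tokens/task comes from the leaderboard figures quoted earlier, while the $3-per-million blended price is purely an assumption for illustration:

```python
def monthly_cost(tasks_per_month: int, tokens_per_task_k: float,
                 usd_per_million_tokens: float) -> float:
    """Estimated monthly API spend in USD."""
    total_tokens = tasks_per_month * tokens_per_task_k * 1_000
    return total_tokens / 1_000_000 * usd_per_million_tokens

# 5,000 tasks/month at 68.9k tokens each, assumed $3 per 1M tokens:
print(round(monthly_cost(5_000, 68.9, 3.0), 2))  # 1033.5
```

Swapping in smolagents' 55.3k average drops the same workload's estimate proportionally, which is the whole point of the cost/performance operating-point discussion above.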

Trace Explorer

Every MASEval run produces structured trace artefacts: per-agent message histories, tool call logs, and token usage reports. Click a tab to inspect what each trace type looks like.

Agent Messages
Tool Calls
Token Usage
Task Result
TraceableMixin
All MASEval agents inherit from TraceableMixin, which automatically instruments every send_message() call. No manual logging required β€” traces are collected transparently regardless of which framework is running underneath.
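An illustrative reconstruction of the idea: `TraceableMixin` is named in the post, but this implementation (and the `ToyAgent`/`_deliver` split) is an assumption:

```python
class TraceableMixin:
    """Records every send_message() call, no manual logging required."""

    def __init__(self) -> None:
        self.trace: list[dict] = []

    def send_message(self, recipient: str, content: str) -> None:
        # Instrumentation happens transparently on every send
        self.trace.append({"to": recipient, "content": content})
        self._deliver(recipient, content)

    def _deliver(self, recipient: str, content: str) -> None:
        raise NotImplementedError  # the concrete framework adapter delivers


class ToyAgent(TraceableMixin):
    def _deliver(self, recipient: str, content: str) -> None:
        pass  # a real agent would hand off to its framework here


agent = ToyAgent()
agent.send_message("critic", "please review")
print(agent.trace)  # [{'to': 'critic', 'content': 'please review'}]
```

Because tracing lives in the mixin rather than the framework, the same trace schema comes out whether AutoGen or LangGraph is running underneath.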
Partial Observability
MASEval supports partial observability configurations where sub-agents cannot see each other's full message histories, matching real-world deployment constraints. Traces still capture the full system view for post-hoc analysis even when agents have limited runtime visibility.
MASEval vs. Other Evaluation Libraries
Feature | MASEval | MLflow GenAI | Inspect-AI | DeepEval
Multi-agent orchestration | ✓ Full | ✗ | ~ Partial | ✗
Framework agnosticism | ✓ 6 frameworks | ~ LangChain only | ~ Limited | ~ LangChain
System-level comparison | ✓ Native | ✗ | ~ Manual | ✗
Error attribution (3-way) | ✓ | ✗ | ✗ | ~ Partial
Trace-first evaluation | ✓ | ~ Logs only | ✓ | ✗
GAIA / Tau benchmarks | ✓ Built-in | ✗ | ~ GAIA only | ✗
Reproducibility (seeds) | ✓ | ~ Partial | ✓ | ~ Partial
CI/CD integration | ✓ JSON export | ✓ MLflow UI | ~ Partial | ✓

Framework Selection Wizard

Answer 5 questions about your system requirements and get a ranked recommendation of topology and framework tailored to your use case.

Real-World Use Cases

Different application domains benefit from different MAS architectures. Here are common use cases mapped to their optimal topology and framework based on MASEval benchmark findings.

Pattern: Tasks requiring sequential decision-making with compliance constraints (customer service, legal review) suit Hierarchical topology. Tasks with parallelisable subtasks (research synthesis, code review) suit Parallel topology. Open-ended collaborative tasks (brainstorming, debate) suit Mesh topology.

Reliability Metrics

MAESTRO introduces reliability as a first-class evaluation dimension, measuring not just whether a system succeeds, but how gracefully it handles failure. Adjust the stress parameters to see how topology and retry budget affect system resilience.

Why reliability matters: A MAS that achieves 70% task success under ideal conditions may degrade to 30% under realistic failure rates. MAESTRO separates peak performance from sustained performance; the latter determines production viability.
Failure Rate
–
% tasks failing under stress
MTTR
–
mean time to recovery (steps)
Resilience Score
–
recovered / total failures
Error Budget
–
% budget remaining
Stress-Test Simulator
Agent failure probability 15%
Number of agents 4
Retry budget per agent 2
Topology isolation factor Hierarchical
Failure Rate Formula
P(system_fail) = 1 − ∏(1 − pᵢ × (1 − rᵢ)), where pᵢ = per-agent failure probability and rᵢ = per-agent recovery probability = 1 − (1 − retry_success)^retries
The system fails only if at least one non-recoverable agent failure occurs. Recovery probability compounds across retry attempts.
MTTR Formula
MTTR = Σ (t_recover × pᵢ) / N_agents, where t_recover = retry_budget × 1.5 steps (average backoff overhead)
Mean time to recovery averages the expected recovery steps across all agents weighted by their failure probability.
Resilience Score
R = recovered_failures / total_failures = 1 − (1 − p_retry)^retries. MAESTRO target: R ≥ 0.80. Production threshold: R ≥ 0.90.
A resilience score above 0.80 is MAESTRO's minimum bar for production-grade MAS. Hierarchical topologies score ~15% higher than sequential due to sub-task isolation.
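The formulas above combine as follows for homogeneous agents. The retry_success value of 0.6 is illustrative, not a number from either paper:

```python
def recovery_prob(retry_success: float, retries: int) -> float:
    # r = 1 - (1 - retry_success)^retries
    return 1 - (1 - retry_success) ** retries

def system_failure(p: float, r: float, n_agents: int) -> float:
    # P(system_fail) = 1 - prod(1 - p_i * (1 - r_i)), identical agents
    return 1 - (1 - p * (1 - r)) ** n_agents

r = recovery_prob(retry_success=0.6, retries=2)
fail = system_failure(p=0.15, r=r, n_agents=4)
print(round(r, 2), round(fail, 3))  # 0.84 0.093
```

Note how strongly retries help: with no recovery at all (r = 0), the same four agents would fail about 48% of the time.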
MAESTRO Reliability Findings by Topology
Topology | Failure Rate | MTTR (steps) | Resilience Score | Error Budget Used | Failure Mode

Tool-Use Evaluation

Real MAS deployments depend heavily on external tool calls: search, code execution, database queries, image generation. MAESTRO evaluates each agent's tool-calling capability as a first-class metric, not just task-level success rate.

Click a tool to see evaluation results
Key MAESTRO finding: Tool-calling errors account for 43% of all MAS failures, more than reasoning errors (31%) or coordination failures (26%). The most common error is wrong argument schema (the agent passes malformed input to a tool), not tool unavailability.
Tool Error Taxonomy
HotPotQA Benchmark

HotPotQA requires multi-hop reasoning across multiple documents; agents must chain search tool calls correctly. MAESTRO uses it to stress-test tool sequencing under information retrieval scenarios.

Tasks: 113K
Avg Hops: 2.8
Best MAS Score: 74.3%
Human Baseline: 91.2%
Best MAS: Hierarchical + LangGraph, GPT-4o orchestrator + GPT-3.5 workers. Single-agent GPT-4o scores 61.8%; the 12.5pp gap demonstrates the value of multi-agent tool chaining.
Tool-Use Success Rate by Framework × Tool Category

MAESTRO vs MASEval

Two complementary frameworks for evaluating multi-agent systems. MASEval (arXiv:2603.08835) focuses on topology and framework benchmarking. MAESTRO (arXiv:2601.00481) adds reliability, observability, and tool-use as first-class dimensions.

Side-by-Side Feature Matrix
Capability | MASEval | MAESTRO | Notes
Topology benchmarking | ✓ | ✓ | Both cover sequential, hierarchical, parallel, mesh
Framework comparison (LangGraph/CrewAI/AutoGen) | ✓ | ✓ | MASEval has deeper leaderboard (24 combos)
GAIA benchmark | ✓ | – | MASEval primary benchmark
HotPotQA benchmark | – | ✓ | MAESTRO adds multi-hop search evaluation
Reliability metrics (MTTR, resilience score) | – | ✓ | MAESTRO's primary novel contribution
Stress-testing under failure conditions | – | ✓ | Systematic failure injection at scale
Tool-use capability evaluation | – | ✓ | Error taxonomy: schema, timeout, hallucinated tool
Integrated observability (trace + monitoring) | ~ | ✓ | MASEval has trace export; MAESTRO has live monitoring
Cost-performance Pareto analysis | ✓ | – | MASEval's cost model is more detailed
Framework selection wizard | ✓ | – | MASEval's recommendation engine
Additional Paper Source: MAESTRO
MAESTRO: Multi-Agent Evaluation Suite for Testing, Reliability, and Observability
Ma et al. · arXiv:2601.00481 · 2025
📄 arXiv Paper
Paper Source
MASEval: A Framework for Multi-Agent System Evaluation
ParameterLab · arXiv:2603.08835 · MIT License
📄 arXiv Paper 🔗 Project Page