🤖
Multi-Agent System Evaluation
MASEval Framework — Visual Interactive Guide
Gap → Topologies → Roles → Multi-Model → Dimensions → Benchmarks → Lifecycle → Frameworks → Pareto → Leaderboard → Errors → Anti-Patterns → Scaling → Context → Race → Cost → Trace → Wizard → Use Cases

The Evaluation Gap

Single-model benchmarks miss everything that matters in multi-agent systems. The unit of analysis must shift from the model to the entire system: topology, orchestration logic, error handling, and framework choice all shape real-world performance.

6 Frameworks Evaluated
4 Topology Patterns
3 Key Benchmarks
5 Evaluation Phases
The Model Fallacy
Downgrading GPT-4o to a weaker model inside AutoGen can cost less performance than keeping the model but running it in a worse-architected framework. System design decisions compound.
What MASEval Measures
Full-system evaluation across four dimensions: task performance, communication efficiency, error resilience, and resource efficiency, measured simultaneously rather than sequentially.
Why GAIA / Tau / MMLU
Multi-step reasoning (GAIA), tool orchestration & policy compliance (Tau-bench), and broad knowledge (MMLU) each stress-test a different dimension of MAS capability.
Why is MAS evaluation harder than single-model evaluation?
Single-model evaluation treats the LLM as a black box: input → output → score. Multi-agent systems introduce emergent behaviour that cannot be predicted from individual agent performance. A system of three strong individual agents may fail catastrophically because of poor orchestration, while a weaker model in a better-designed topology may outperform a frontier model in a rigid pipeline. Errors also compound: a 10% failure rate per step across a 4-agent chain produces a ~34% task failure rate, a non-linearity absent from single-step evaluation.
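The compounding arithmetic is easy to check. A minimal sketch, purely illustrative and not part of the MASEval library:

```python
def chain_failure_rate(per_step_failure: float, steps: int) -> float:
    """Probability that at least one step of a sequential chain fails."""
    return 1 - (1 - per_step_failure) ** steps

# 10% failure per step across a 4-agent chain:
print(round(chain_failure_rate(0.10, 4), 3))  # 0.344
```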
What is the MASEval benchmark suite?
MASEval is a framework-agnostic evaluation library that wraps popular agent frameworks (AutoGen, LangGraph, LlamaIndex, CAMEL, smolagents, Claude Agent SDK) behind a common interface called AgentAdapter. It runs identical benchmark tasks through each system, collects structured traces (messages, tool calls, token counts), and computes system-level metrics. The result: apples-to-apples comparisons across entire architectures, not just models. It is published under MIT license at parameterlab.github.io/MASEval.

System Architectures

The topology of a multi-agent system determines how agents coordinate, how information flows, and, critically, how failures cascade. Click each pattern to explore its structure, coordination mode, and GAIA performance.

Select a topology above to see its properties.
Sequential · GAIA
38.2%
Chain pipeline
Hierarchical · GAIA
51.7%
Manager + workers
Parallel · GAIA
46.3%
Fan-out + aggregator
Mesh · GAIA
44.8%
Peer-to-peer

Agent Role Specialization

Assigning the right role to each agent is as important as choosing the right topology. Click roles to assign them to agent slots, then see how assignment quality affects the system's overall GAIA score.

Available roles: 📋 Planner ⚙ Executor 🔍 Critic 📝 Summarizer 🔗 Coordinator
Click a role above, then click an agent slot below to assign it. Try both optimal and suboptimal configurations.
Assign roles to agents to see the performance impact. Optimal assignment maximises GSR; mismatched roles create bottlenecks.
📋 Planner
Decomposes the goal into sub-tasks. Critical at the root of hierarchical topologies. Without a dedicated Planner, orchestrators try to plan and execute simultaneously, degrading both.
🔍 Critic
Validates outputs before they pass downstream. Adding a Critic agent to any topology improves GAIA GSR by 7–12% on Level 3 tasks by catching hallucinations before they compound.
📝 Summarizer
Compresses inter-agent context to fit within token windows. Essential in mesh and long-chain topologies where cumulative context exceeds model limits. Reduces token cost by 30–40%.

Multi-Model MAS

Real deployments often mix models: a powerful frontier model as orchestrator and cheaper specialist models as workers. Click to assign different LLMs to each agent slot and see the predicted performance and cost impact.

Click an agent card to cycle through available models. The chart updates in real time.
Assign models to agents to see predicted GAIA GSR and monthly token cost shifts.
Orchestrator Effect
Using GPT-4o or Claude 3.5 Sonnet as orchestrator and Claude Haiku as workers can retain 90–95% of full-frontier performance at 40–60% of the token cost. The orchestrator's planning quality matters most; worker quality matters less for simple sub-tasks.
Cost-Performance Sweet Spot
The optimal heterogeneous assignment for hierarchical topology: frontier model (Claude 3.5 / GPT-4o) as orchestrator + mid-tier model (Haiku / GPT-4o mini) as workers. This pattern achieves ~48% GAIA GSR at roughly 55% of the all-frontier token cost.

What We Measure

MASEval evaluates systems across four orthogonal dimensions. Drag the sliders to define your ideal system profile. Select a reference framework to see how real systems compare.

Your Target Profile
● Task Performance 75
● Communication Eff. 65
● Error Resilience 70
● Resource Efficiency 60
Reference Framework
Radar Legend
– Violet polygon = Your target profile
– Amber polygon = Reference framework
Drag sliders to position your ideal operating point. Overlap with reference indicates framework fit.
Task Performance
Measured by GSR (Goal Success Rate) from GAIA: the percentage of tasks where the agent system correctly completes the full multi-step goal. Also includes tool-call success rate from Tau-bench and accuracy from MMLU.
Error Attribution
MASEval classifies failures into three categories: Agent Error (the agent produced wrong output), Environment Error (tool/API failure outside agent control), and User Error (malformed input or invalid goal). Attribution shapes corrective action.

Testing Grounds

MASEval evaluates systems on three benchmark families, each designed to stress-test a different set of MAS capabilities. Click a benchmark to explore what it tests and how topologies perform.

GAIA (General AI Assistants)
466 tasks · GSR metric · Multi-step web + tool tasks
Tau-bench (Tool Augmented Understanding)
120 tasks · Completion + policy rate · Domain workflows
MMLU (Massive Multitask Language Understanding)
14,042 tasks · Accuracy · Broad domain knowledge
Select a benchmark above to see MAS relevance and topology scores.
Why Three Benchmarks?
Each benchmark isolates a distinct axis of MAS performance. GAIA tests end-to-end task completion under tool uncertainty. Tau-bench tests compliance under constrained workflows. MMLU tests knowledge routing across specialized agents. Together they triangulate a system's true capability profile.
Topology Performance Gap
Hierarchical topologies consistently outperform sequential across all three benchmarks. The orchestrator's ability to delegate and retry subtasks independently produces a 13.5 percentage-point GAIA GSR improvement over naive sequential pipelines.

Evaluation Lifecycle

MASEval structures every evaluation run into five reproducible phases. Click each phase to understand what happens, what is collected, and how it maps to the evaluation dimensions.

Open Source
MASEval is published under MIT license at parameterlab.github.io/MASEval. All experiments in the paper are fully reproducible using the provided harness configurations and seed values.
Framework-Agnostic
Adding a new framework requires implementing just two methods: _run_agent() and get_messages(). The adapter pattern keeps evaluation logic decoupled from framework internals.
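A hypothetical sketch of what such an adapter might look like. The two method names come from the text above, but the signatures, the base-class details, and the `EchoAdapter` example are assumptions for illustration only:

```python
class AgentAdapter:
    """Common interface MASEval drives for every wrapped framework (sketch)."""

    def _run_agent(self, task: str) -> str:
        """Run the underlying framework on one task and return its answer."""
        raise NotImplementedError

    def get_messages(self) -> list[dict]:
        """Return the structured message trace collected during the run."""
        raise NotImplementedError


class EchoAdapter(AgentAdapter):
    """Toy adapter wrapping a trivial 'framework', to show the contract."""

    def __init__(self) -> None:
        self._log: list[dict] = []

    def _run_agent(self, task: str) -> str:
        answer = f"echo: {task}"
        self._log.append({"role": "assistant", "content": answer})
        return answer

    def get_messages(self) -> list[dict]:
        return self._log
```

Because the evaluation harness only sees these two methods, swapping AutoGen for LangGraph underneath does not touch any scoring logic.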
CI/CD Ready
Reports export as structured JSON, enabling integration with CI/CD pipelines. Run MASEval on every architecture change to catch system-level regressions before they reach production.
Paper: MASEval: A Framework for Multi-Agent System Evaluation
ParameterLab · arxiv.org/abs/2603.08835 · MIT License
Multi-Agent Systems · GAIA Benchmark · Framework Evaluation · Reproducible Research

Framework vs. Model

The MASEval paper's headline finding: framework choice matters as much as model choice. Toggle between views and metrics to explore why architectural decisions are as consequential as foundation model selection.

Group By
Metric
Framework ≈ Model in impact. The performance spread across frameworks (44.6% → 53.2% GAIA GSR) nearly equals the spread across frontier models (49.7% → 55.4%). Engineers optimizing only model selection are leaving half the performance gains on the table.
Why Frameworks Diverge
Frameworks differ in how they handle retries, tool call batching, inter-agent context passing, and error recovery. These implementation choices accumulate across a multi-step task, producing system-level performance gaps invisible in single-agent benchmarks.
Token Cost Trade-off
Higher-performing frameworks often use more tokens through richer context passing and verification steps. smolagents is the most token-efficient (55.3k avg) while CAMEL has the highest consumption (90.1k). The right choice depends on your cost/performance operating point.

Cost-Performance Pareto Frontier

Every framework + topology combination represents a point in cost/performance space. The Pareto frontier identifies combinations where no other config is strictly better on both axes. Click any point to inspect its config.

Click any dot on the chart to inspect that framework + topology configuration.
Reading the Pareto frontier: points on the frontier (the connected line) are "efficient": no other configuration achieves higher GSR at the same or lower token cost. Points below and to the right of the frontier are suboptimal and should generally be avoided unless they offer other, non-measured advantages (e.g. framework familiarity, latency).
Pareto-Optimal Configs
Claude SDK + Hierarchical leads the frontier: highest GSR (53.2%) at moderate token cost (68.9k). smolagents + Parallel offers the best cost efficiency at acceptable performance. These two represent opposite ends of the efficiency frontier.
Dominated Configs
CAMEL + Mesh is dominated: highest token cost (90.1k) but not the highest GSR. AutoGen + Sequential has similar cost to CAMEL but much lower performance. Neither belongs on the Pareto frontier for most use cases.

MASEval Leaderboard

All 24 framework × topology combinations ranked across every metric. Click any column header to sort. Filter by framework or topology. ★ marks Pareto-optimal configurations.

Framework:
Topology:
# ↕ Framework ↕ Topology ↕ GAIA GSR% ↓ Tau% ↕ MMLU% ↕ Tokens(k) ↕ Latency(s) ↕

Failure Anatomy

How failures cascade (or isolate) depends entirely on topology. Select a topology, then inject a failure into any agent to watch how errors propagate through the system and observe the attribution breakdown.

Error Attribution Breakdown
Agent 65%
Env 25%
User 10%
■ Agent Error ■ Env Error ■ User Error
Select a topology and inject a failure to see propagation behaviour.
Reliability Calculator
Per-agent failure rate 10%
Number of agents 4
Formulas: Sequential system failure = 1 − (1 − p)ⁿ | Parallel system failure ≈ pⁿ | Hierarchical ≈ p + (1 − p)·p_subtask
Sequential Cascade
A failure at step 2 of a 4-agent chain blocks all 3 downstream agents. Error rate compounds: each 10% agent failure probability produces ~34% system failure in a 4-step pipeline. Sequential topologies require robust individual agents.
Parallel Isolation
A failure in Worker-2 of a parallel topology only affects that worker's contribution. Workers 1 and 3 complete normally; the aggregator can proceed with partial results. This makes parallel architectures inherently more resilient to single-agent failures.
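The two formulas above can be compared directly. A small sketch, assuming (as the parallel formula does) redundant workers that must all fail for the task to be lost:

```python
def sequential_failure(p: float, n: int) -> float:
    # Chain fails if any of n agents fails: 1 - (1 - p)^n
    return 1 - (1 - p) ** n

def parallel_failure(p: float, n: int) -> float:
    # Approximation from the formula above: all n redundant workers must fail
    return p ** n

p, n = 0.10, 4
print(round(sequential_failure(p, n), 3))  # 0.344
print(round(parallel_failure(p, n), 6))    # 0.0001
```

The three-orders-of-magnitude gap at the same per-agent failure rate is why parallel topologies are called inherently resilient.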
What is the difference between agent, environment, and user errors?
Agent Error: The agent produced incorrect output, got stuck in a loop, or failed to use a tool correctly. Corrected by improving agent instructions, adding verification steps, or switching frameworks.

Environment Error: A tool API returned an error, a file was inaccessible, or a network timeout occurred, all outside the agent's control. Corrected by improving tool wrappers, adding retry logic, or handling environment-side failures.

User Error: The input task was malformed, contradictory, or impossible. Corrected at the input validation layer before the agent system is invoked.
How does error attribution guide framework selection?
If your dominant error type is Agent Error, prioritise frameworks with built-in reflection and retry mechanisms (LangGraph's conditional edges, AutoGen's critique loops). If Environment Error dominates, prioritise frameworks with robust tool-call error handling and fallback paths. Misattributing agent errors as environment errors leads to investing in the wrong layer of the stack.

Evaluation Anti-Patterns

Most teams evaluate MAS incorrectly, and the mistakes compound. Click each pattern to understand the failure mode and how to fix it.

Root cause: Most anti-patterns stem from applying single-model evaluation thinking to multi-agent systems. The unit of analysis must be the entire system (topology, orchestration, error handling, and model) evaluated simultaneously on the same benchmark.

Token Scaling Visualizer

How does token consumption grow as you add more agents? Different topologies scale at fundamentally different rates. Adjust the base token cost per agent to see how your specific workload will scale.

Base tokens / agent turn 800
Mesh Grows Quadratically
In a fully-connected mesh topology, each agent must process messages from all other agents. With n agents, total context grows as O(n²). At 6 agents this is 36× the single-agent baseline, often exceeding model context windows entirely.
Parallel Stays Flat
Parallel topology workers process tasks independently with minimal cross-agent context. Token cost scales as O(n) for workers plus a fixed aggregation cost. Adding more workers doesn't compound cost, making parallel the most scalable topology for large n.
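The two growth curves can be sketched directly. The fixed aggregation cost of 1,600 tokens is an illustrative assumption, not a number from the post:

```python
def mesh_tokens(base: int, n: int) -> int:
    # Fully connected mesh: every agent re-reads every agent's output, O(n^2)
    return base * n * n

def parallel_tokens(base: int, n: int, aggregation: int = 1_600) -> int:
    # Independent workers, O(n), plus a fixed aggregation step
    return base * n + aggregation

base = 800  # matches the visualizer's default base tokens per agent turn
for n in (2, 4, 6):
    print(n, mesh_tokens(base, n), parallel_tokens(base, n))
```

At n = 6 the mesh figure is 36× the single-agent baseline, exactly the blow-up described above.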

Context Window Growth

As agents exchange messages across turns, cumulative context grows, often hitting model limits before task completion. Use the slider to set your model's context limit and see which topologies breach it first.

Model context limit (k tokens) 32k
Summarizer Agent
Adding a dedicated Summarizer agent compresses inter-agent context by 60–70% at each turn boundary. This delays context-limit breaches, enabling longer multi-step tasks without truncation.
Sliding Window
MASEval supports configurable context window strategies: full history, a last-N-turns sliding window, or hierarchical summary. A sliding window with N=3 gives ~80% task performance at 30% of the context cost of full history.
Partial Observability
Agents in parallel and hierarchical topologies can be configured with partial observability: each agent sees only messages addressed to it. This bounds per-agent context to O(1) regardless of system size.
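The last-N-turns strategy mentioned above is simple to sketch; this helper is hypothetical, not the library's actual API:

```python
def sliding_window(history: list[str], n_turns: int = 3) -> list[str]:
    """Keep only the last N turns of inter-agent message history."""
    return history[-n_turns:]

history = [f"turn {i}" for i in range(10)]
print(sliding_window(history))  # ['turn 7', 'turn 8', 'turn 9']
```

The trade-off is information loss: anything the dropped turns contained must either be re-derived or preserved by a Summarizer agent.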

Topology Race

Watch two topologies run the same task simultaneously. The animated message flows reveal coordination overhead differences in real time. Select your topologies and hit Race.

Hierarchical
Ready
Sequential
Ready

Live Cost Estimator

Estimate your monthly API spend based on task volume, topology, and framework. Adjust the sliders to model your workload and see the cost breakdown.

Tasks / month 5,000
Avg task complexity Medium
Framework
Topology
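Under the hood, any such estimator reduces to tasks × tokens-per-task × price-per-token. A back-of-envelope sketch: the 68.9k tokens/task comes from the leaderboard figures quoted earlier, while the $3-per-million blended price is purely an assumption for illustration:

```python
def monthly_cost(tasks_per_month: int, tokens_per_task_k: float,
                 usd_per_million_tokens: float) -> float:
    """Estimated monthly API spend in USD."""
    total_tokens = tasks_per_month * tokens_per_task_k * 1_000
    return total_tokens / 1_000_000 * usd_per_million_tokens

# 5,000 tasks/month at 68.9k tokens each, assumed $3 per 1M tokens:
print(round(monthly_cost(5_000, 68.9, 3.0), 2))  # 1033.5
```

Swapping in smolagents' 55.3k average drops the same workload's estimate proportionally, which is the whole point of the cost/performance operating-point discussion above.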

Trace Explorer

Every MASEval run produces structured trace artefacts: per-agent message histories, tool call logs, and token usage reports. Click a tab to inspect what each trace type looks like.

Agent Messages
Tool Calls
Token Usage
Task Result
TraceableMixin
All MASEval agents inherit from TraceableMixin, which automatically instruments every send_message() call. No manual logging required β€” traces are collected transparently regardless of which framework is running underneath.
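An illustrative reconstruction of the idea: `TraceableMixin` is named in the post, but this implementation (and the `ToyAgent`/`_deliver` split) is an assumption:

```python
class TraceableMixin:
    """Records every send_message() call, no manual logging required."""

    def __init__(self) -> None:
        self.trace: list[dict] = []

    def send_message(self, recipient: str, content: str) -> None:
        # Instrumentation happens transparently on every send
        self.trace.append({"to": recipient, "content": content})
        self._deliver(recipient, content)

    def _deliver(self, recipient: str, content: str) -> None:
        raise NotImplementedError  # the concrete framework adapter delivers


class ToyAgent(TraceableMixin):
    def _deliver(self, recipient: str, content: str) -> None:
        pass  # a real agent would hand off to its framework here


agent = ToyAgent()
agent.send_message("critic", "please review")
print(agent.trace)  # [{'to': 'critic', 'content': 'please review'}]
```

Because tracing lives in the mixin rather than the framework, the same trace schema comes out whether AutoGen or LangGraph is running underneath.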
Partial Observability
MASEval supports partial observability configurations where sub-agents cannot see each other's full message histories, matching real-world deployment constraints. Traces still capture the full system view for post-hoc analysis even when agents have limited runtime visibility.
MASEval vs. Other Evaluation Libraries
Feature | MASEval | MLflow GenAI | Inspect-AI | DeepEval
Multi-agent orchestration | ✓ Full | ✗ | ~ Partial | ✗
Framework agnosticism | ✓ 6 frameworks | ~ LangChain only | ~ Limited | ~ LangChain
System-level comparison | ✓ Native | ✗ | ~ Manual | ✗
Error attribution (3-way) | ✓ | ✗ | ✗ | ~ Partial
Trace-first evaluation | ✓ | ~ Logs only | ✓ | ✗
GAIA / Tau benchmarks | ✓ Built-in | ✗ | ~ GAIA only | ✗
Reproducibility (seeds) | ✓ | ~ Partial | ✓ | ~ Partial
CI/CD integration | ✓ JSON export | ✓ MLflow UI | ~ Partial | ✓

Framework Selection Wizard

Answer 5 questions about your system requirements and get a ranked recommendation of topology and framework tailored to your use case.

Real-World Use Cases

Different application domains benefit from different MAS architectures. Here are common use cases mapped to their optimal topology and framework based on MASEval benchmark findings.

Pattern: Tasks requiring sequential decision-making with compliance constraints (customer service, legal review) suit Hierarchical topology. Tasks with parallelisable subtasks (research synthesis, code review) suit Parallel topology. Open-ended collaborative tasks (brainstorming, debate) suit Mesh topology.

Reliability Metrics

MAESTRO introduces reliability as a first-class evaluation dimension, measuring not just whether a system succeeds, but how gracefully it handles failure. Adjust the stress parameters to see how topology and retry budget affect system resilience.

Why reliability matters: A MAS that achieves 70% task success under ideal conditions may degrade to 30% under realistic failure rates. MAESTRO separates peak performance from sustained performance; the latter determines production viability.
Failure Rate
–
% tasks failing under stress
MTTR
–
mean time to recovery (steps)
Resilience Score
–
recovered / total failures
Error Budget
–
% budget remaining
Stress-Test Simulator
Agent failure probability 15%
Number of agents 4
Retry budget per agent 2
Topology isolation factor Hierarchical
Failure Rate Formula
P(system_fail) = 1 − ∏(1 − pᵢ × (1 − rᵢ)), where pᵢ = per-agent failure probability and rᵢ = per-agent recovery probability = 1 − (1 − retry_success)^retries
The system fails only if at least one non-recoverable agent failure occurs. Recovery probability compounds across retry attempts.
MTTR Formula
MTTR = Σ (t_recover × pᵢ) / N_agents, where t_recover = retry_budget × 1.5 steps (average backoff overhead)
Mean time to recovery averages the expected recovery steps across all agents weighted by their failure probability.
Resilience Score
R = recovered_failures / total_failures = 1 − (1 − p_retry)^retries. MAESTRO target: R ≥ 0.80. Production threshold: R ≥ 0.90.
A resilience score above 0.80 is MAESTRO's minimum bar for production-grade MAS. Hierarchical topologies score ~15% higher than sequential due to sub-task isolation.
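The formulas above combine as follows for homogeneous agents. The retry_success value of 0.6 is illustrative, not a number from either paper:

```python
def recovery_prob(retry_success: float, retries: int) -> float:
    # r = 1 - (1 - retry_success)^retries
    return 1 - (1 - retry_success) ** retries

def system_failure(p: float, r: float, n_agents: int) -> float:
    # P(system_fail) = 1 - prod(1 - p_i * (1 - r_i)), identical agents
    return 1 - (1 - p * (1 - r)) ** n_agents

r = recovery_prob(retry_success=0.6, retries=2)
fail = system_failure(p=0.15, r=r, n_agents=4)
print(round(r, 2), round(fail, 3))  # 0.84 0.093
```

Note how strongly retries help: with no recovery at all (r = 0), the same four agents would fail about 48% of the time.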
MAESTRO Reliability Findings by Topology
Topology | Failure Rate | MTTR (steps) | Resilience Score | Error Budget Used | Failure Mode

Tool-Use Evaluation

Real MAS deployments depend heavily on external tool calls: search, code execution, database queries, image generation. MAESTRO evaluates each agent's tool-calling capability as a first-class metric, not just task-level success rate.

Click a tool to see evaluation results
Key MAESTRO finding: Tool-calling errors account for 43% of all MAS failures, more than reasoning errors (31%) or coordination failures (26%). The most common error is wrong argument schema (the agent passes malformed input to a tool), not tool unavailability.
Tool Error Taxonomy
HotPotQA Benchmark

HotPotQA requires multi-hop reasoning across multiple documents; agents must chain search tool calls correctly. MAESTRO uses it to stress-test tool sequencing under information retrieval scenarios.

Tasks: 113K
Avg Hops: 2.8
Best MAS Score: 74.3%
Human Baseline: 91.2%
Best MAS: Hierarchical + LangGraph, GPT-4o orchestrator + GPT-3.5 workers. Single-agent GPT-4o scores 61.8%; the 12.5pp gap demonstrates the value of multi-agent tool chaining.
Tool-Use Success Rate by Framework × Tool Category

MAESTRO vs MASEval

Two complementary frameworks for evaluating multi-agent systems. MASEval (arXiv:2603.08835) focuses on topology and framework benchmarking. MAESTRO (arXiv:2601.00481) adds reliability, observability, and tool-use as first-class dimensions.

Side-by-Side Feature Matrix
Capability | MASEval | MAESTRO | Notes
Topology benchmarking | ✓ | ✓ | Both cover sequential, hierarchical, parallel, mesh
Framework comparison (LangGraph/CrewAI/AutoGen) | ✓ | ✓ | MASEval has deeper leaderboard (24 combos)
GAIA benchmark | ✓ | – | MASEval primary benchmark
HotPotQA benchmark | – | ✓ | MAESTRO adds multi-hop search evaluation
Reliability metrics (MTTR, resilience score) | – | ✓ | MAESTRO's primary novel contribution
Stress-testing under failure conditions | – | ✓ | Systematic failure injection at scale
Tool-use capability evaluation | – | ✓ | Error taxonomy: schema, timeout, hallucinated tool
Integrated observability (trace + monitoring) | ~ | ✓ | MASEval has trace export; MAESTRO has live monitoring
Cost-performance Pareto analysis | ✓ | – | MASEval's cost model is more detailed
Framework selection wizard | ✓ | – | MASEval's recommendation engine
Additional Paper Source: MAESTRO
MAESTRO: Multi-Agent Evaluation Suite for Testing, Reliability, and Observability
Ma et al. · arXiv:2601.00481 · 2025
📄 arXiv Paper
Paper Source
MASEval: A Framework for Multi-Agent System Evaluation
ParameterLab · arXiv:2603.08835 · MIT License
📄 arXiv Paper 🔗 Project Page