Single-model benchmarks miss everything that matters in multi-agent systems. The unit of analysis must shift from the model to the entire system: topology, orchestration logic, error handling, and framework choice all shape real-world performance.
6 Frameworks Evaluated · 4 Topology Patterns · 3 Key Benchmarks · 5 Evaluation Phases
The Model Fallacy
Downgrading from GPT-4o to a weaker model inside AutoGen may cost less performance than you would gain by moving the same model to a better-architected framework. System design decisions compound.
What MASEval Measures
Full-system evaluation across four dimensions: task performance, communication efficiency, error resilience, and resource efficiency, measured simultaneously rather than sequentially.
Why GAIA / Tau / MMLU
Multi-step reasoning (GAIA), tool orchestration & policy compliance (Tau-bench), and broad knowledge (MMLU) each stress-test a different dimension of MAS capability.
Why is MAS evaluation harder than single-model evaluation?
Single-model evaluation treats the LLM as a black box: input → output → score. Multi-agent systems introduce emergent behaviour that cannot be predicted from individual agent performance. A system with three strong individual agents may fail catastrophically because of poor orchestration. A weaker model in a better-designed topology may outperform a frontier model in a rigid pipeline. Additionally, errors compound: a 10% failure rate per step across a 4-agent chain produces a ~34% task failure rate, a non-linearity absent in single-step evaluation.
What is the MASEval benchmark suite?
MASEval is a framework-agnostic evaluation library that wraps popular agent frameworks (AutoGen, LangGraph, LlamaIndex, CAMEL, smolagents, Claude Agent SDK) behind a common interface called AgentAdapter. It runs identical benchmark tasks through each system, collects structured traces (messages, tool calls, token counts), and computes system-level metrics. The result: apples-to-apples comparisons across entire architectures, not just models. It is published under MIT license at parameterlab.github.io/MASEval.
System Design · Topology Patterns
System Architectures
The topology of a multi-agent system determines how agents coordinate, how information flows, and, critically, how failures cascade. Click each pattern to explore its structure, coordination mode, and GAIA performance.
Sequential · GAIA: 38.2% (chain pipeline)
Hierarchical · GAIA: 51.7% (manager + workers)
Parallel · GAIA: 46.3% (fan-out + aggregator)
Mesh · GAIA: 44.8% (peer-to-peer)
System Design · Agent Roles
Agent Role Specialization
Assigning the right role to each agent is as important as choosing the right topology. Click roles to assign them to agent slots, then see how assignment quality affects the system's overall GAIA score.
Available roles: Planner · Executor · Critic · Summarizer · Coordinator
Click a role above, then click an agent slot below to assign it. Try both optimal and suboptimal configurations.
Assign roles to agents to see the performance impact. Optimal assignment maximises GSR; mismatched roles create bottlenecks.
Planner
Decomposes the goal into sub-tasks. Critical at the root of hierarchical topologies. Without a dedicated Planner, orchestrators try to plan and execute simultaneously, degrading both.
Critic
Validates outputs before they pass downstream. Adding a Critic agent to any topology improves GAIA GSR by 7–12% on Level 3 tasks by catching hallucinations before they compound.
Summarizer
Compresses inter-agent context to fit within token windows. Essential in mesh and long-chain topologies where cumulative context exceeds model limits. Reduces token cost by 30–40%.
System Design · Heterogeneous Models
Multi-Model MAS
Real deployments often mix models: a powerful frontier model as orchestrator and cheaper specialist models as workers. Click to assign different LLMs to each agent slot and see the predicted performance and cost impact.
Click an agent card to cycle through available models. The chart updates in real time.
Assign models to agents to see predicted GAIA GSR and monthly token cost shifts.
Orchestrator Effect
Using GPT-4o or Claude 3.5 Sonnet as orchestrator and Claude Haiku as workers can retain 90–95% of full-frontier performance at 40–60% of the token cost. The orchestrator's planning quality matters most; worker quality matters less for simple sub-tasks.
Cost-Performance Sweet Spot
The optimal heterogeneous assignment for hierarchical topology: frontier model (Claude 3.5 / GPT-4o) as orchestrator + mid-tier model (Haiku / GPT-4o mini) as workers. This pattern achieves ~48% GAIA GSR at roughly 55% of the all-frontier token cost.
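The sweet-spot arithmetic can be sketched with a toy pricing model. The per-1k-token prices, the 40/60 orchestrator/worker token split, and the tier names below are illustrative assumptions, not figures from the paper:

```python
# Illustrative per-1k-token prices in USD; real provider pricing varies.
PRICE_PER_K = {"frontier": 0.015, "mid": 0.003}

def monthly_cost(tokens_k_per_task: float, tasks: int,
                 orch_share: float, orch_tier: str, worker_tier: str) -> float:
    """Price a heterogeneous MAS by splitting each task's tokens between
    the orchestrator and its workers."""
    orch_k = tokens_k_per_task * orch_share
    work_k = tokens_k_per_task * (1 - orch_share)
    per_task = orch_k * PRICE_PER_K[orch_tier] + work_k * PRICE_PER_K[worker_tier]
    return per_task * tasks

all_frontier = monthly_cost(68.9, 5000, 0.4, "frontier", "frontier")
hetero = monthly_cost(68.9, 5000, 0.4, "frontier", "mid")
print(round(hetero / all_frontier, 2))  # 0.52 under these assumed prices
```

Under these assumptions the heterogeneous setup costs about half of all-frontier, in the same ballpark as the ~55% figure above. Note the ratio depends only on the price gap and the token split, not on task volume.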
Evaluation · Measurement Dimensions
What We Measure
MASEval evaluates systems across four orthogonal dimensions. Drag the sliders to define your ideal system profile. Select a reference framework to see how real systems compare.
Your Target Profile
● Task Performance: 75
● Communication Eff.: 65
● Error Resilience: 70
● Resource Efficiency: 60
Reference Framework
Radar Legend
■ Violet polygon = Your target profile
■ Amber polygon = Reference framework
Drag sliders to position your ideal operating point. Overlap with reference indicates framework fit.
Task Performance
Measured by GSR (Goal Success Rate) from GAIA: the percentage of tasks where the agent system correctly completes the full multi-step goal. Also includes tool-call success rate from Tau-bench and accuracy from MMLU.
Error Attribution
MASEval classifies failures into three categories: Agent Error (the agent produced wrong output), Environment Error (tool/API failure outside agent control), and User Error (malformed input or invalid goal). Attribution shapes corrective action.
Evaluation · Benchmarks
Testing Grounds
MASEval evaluates systems on three benchmark families, each designed to stress-test a different set of MAS capabilities. Click a benchmark to explore what it tests and how topologies perform.
Why Three Benchmarks?
Each benchmark isolates a distinct axis of MAS performance. GAIA tests end-to-end task completion under tool uncertainty. Tau-bench tests compliance under constrained workflows. MMLU tests knowledge routing across specialized agents. Together they triangulate a system's true capability profile.
Topology Performance Gap
Hierarchical topologies consistently outperform sequential across all three benchmarks. The orchestrator's ability to delegate and retry subtasks independently produces a 13.5 percentage-point GAIA GSR improvement over naive sequential pipelines.
MASEval Tool · How It Works
Evaluation Lifecycle
MASEval structures every evaluation run into five reproducible phases. Click each phase to understand what happens, what is collected, and how it maps to the evaluation dimensions.
Open Source
MASEval is published under MIT license at parameterlab.github.io/MASEval. All experiments in the paper are fully reproducible using the provided harness configurations and seed values.
Framework-Agnostic
Adding a new framework requires implementing just two methods: _run_agent() and get_messages(). The adapter pattern keeps evaluation logic decoupled from framework internals.
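A toy adapter illustrates the pattern. Only the two method names come from the text; the abstract base class below is a guessed minimal shape of the real AgentAdapter surface:

```python
from abc import ABC, abstractmethod

class AgentAdapter(ABC):
    """Assumed minimal shape of MASEval's adapter base class."""
    @abstractmethod
    def _run_agent(self, task: str) -> str: ...
    @abstractmethod
    def get_messages(self) -> list[dict]: ...

class EchoAdapter(AgentAdapter):
    """Wraps an imaginary one-agent 'framework' that echoes the task,
    recording a message trace as it goes."""
    def __init__(self) -> None:
        self._log: list[dict] = []

    def _run_agent(self, task: str) -> str:
        answer = f"done: {task}"
        self._log.append({"role": "assistant", "content": answer})
        return answer

    def get_messages(self) -> list[dict]:
        return self._log

adapter = EchoAdapter()
print(adapter._run_agent("summarise the PDF"))  # done: summarise the PDF
print(len(adapter.get_messages()))              # 1
```

Because evaluation logic only ever touches these two hooks, the harness never needs to know which framework produced the trace.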
CI/CD Ready
Reports export as structured JSON, enabling integration with CI/CD pipelines. Run MASEval on every architecture change to catch system-level regressions before they reach production.
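A CI gate over such a report might look like the sketch below. The JSON field names and the threshold are assumptions; MASEval's actual export schema may differ:

```python
import json

# Hypothetical exported report; field names are assumed, not MASEval's schema.
report = json.loads(
    '{"framework": "langgraph", "topology": "hierarchical",'
    ' "gaia_gsr": 0.517, "avg_tokens_k": 68.9}'
)

BASELINE_GSR = 0.50  # regression threshold tracked alongside the repo

def passes_gate(report: dict, baseline: float) -> bool:
    """Fail the pipeline when GAIA GSR drops below the tracked baseline."""
    return report["gaia_gsr"] >= baseline

print(passes_gate(report, BASELINE_GSR))  # True
```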
Paper: MASEval: A Framework for Multi-Agent System Evaluation
ParameterLab · arxiv.org/abs/2603.08835 · MIT License
Multi-Agent Systems · GAIA Benchmark · Framework Evaluation · Reproducible Research
Evaluation · Key Finding
Framework vs. Model
The MASEval paper's headline finding: framework choice matters as much as model choice. Toggle between views and metrics to explore why architectural decisions are as consequential as foundation model selection.
Framework ≈ Model in impact. The performance spread across frameworks (44.6%–53.2% GAIA GSR) nearly equals the spread across frontier models (49.7%–55.4%). Engineers optimizing only model selection are leaving half the performance gains on the table.
Why Frameworks Diverge
Frameworks differ in how they handle retries, tool call batching, inter-agent context passing, and error recovery. These implementation choices accumulate across a multi-step task, producing system-level performance gaps invisible in single-agent benchmarks.
Token Cost Trade-off
Higher-performing frameworks often use more tokens through richer context passing and verification steps. smolagents is the most token-efficient (55.3k avg) while CAMEL has the highest consumption (90.1k). The right choice depends on your cost/performance operating point.
Evaluation · Optimization
Cost-Performance Pareto Frontier
Every framework + topology combination represents a point in cost/performance space. The Pareto frontier identifies combinations where no other config is strictly better on both axes. Click any point to inspect its config.
Reading the Pareto frontier: Points on the frontier (connected line) are "efficient": no other configuration achieves higher GSR at the same or lower token cost. Points below and to the right of the frontier are suboptimal and should generally be avoided unless they offer other non-measured advantages (e.g. framework familiarity, latency).
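Frontier membership is easy to compute. Two of the points below use figures from this section (Claude SDK + Hierarchical, and smolagents' token cost); the remaining GSR/cost pairs are illustrative placeholders:

```python
def pareto_frontier(points):
    """Return the names of configurations not dominated on
    (GSR: higher is better, tokens: lower is better)."""
    frontier = []
    for name, gsr, tok in points:
        dominated = any(
            g >= gsr and t <= tok and (g > gsr or t < tok)
            for _, g, t in points
        )
        if not dominated:
            frontier.append(name)
    return frontier

points = [
    ("ClaudeSDK+Hierarchical", 53.2, 68.9),  # GSR and cost from the text
    ("smolagents+Parallel", 46.3, 55.3),     # token cost from the text
    ("CAMEL+Mesh", 50.0, 90.1),              # GSR illustrative, cost from the text
    ("AutoGen+Sequential", 40.0, 88.0),      # both illustrative
]
print(pareto_frontier(points))  # ['ClaudeSDK+Hierarchical', 'smolagents+Parallel']
```

Both dominated points fall out exactly as described in the cards below: CAMEL+Mesh and AutoGen+Sequential are beaten on both axes by at least one other configuration.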
Pareto-Optimal Configs
Claude SDK + Hierarchical leads the frontier: highest GSR (53.2%) at moderate token cost (68.9k). smolagents + Parallel offers the best cost efficiency at acceptable performance. These two represent opposite ends of the efficiency frontier.
Dominated Configs
CAMEL + Mesh is dominated β highest token cost (90.1k) but not highest GSR. AutoGen + Sequential has similar cost to CAMEL but much lower performance. Neither belongs on the Pareto frontier for most use cases.
Reference · Full Results
MASEval Leaderboard
All 24 framework × topology combinations ranked across every metric. Click any column header to sort. Filter by framework or topology. ★ marks Pareto-optimal configurations.
# | Framework | Topology | GAIA GSR % | Tau % | MMLU % | Tokens (k) | Latency (s)
Systems · Error Propagation
Failure Anatomy
How failures cascade (or isolate) depends entirely on topology. Select a topology, then inject a failure into any agent to watch how errors propagate through the system and observe the attribution breakdown.
Error Attribution Breakdown
Agent Error 65% · Env Error 25% · User Error 10%
Reliability Calculator
Per-agent failure rate: 10%
Number of agents: 4
Formula: Sequential system failure = 1 − (1−p)ⁿ · Parallel system failure ≈ pⁿ · Hierarchical ≈ p + (1−p)·p_subtask
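The three formulas translate directly into code. This is plain probability under an independence assumption, not a MASEval API:

```python
def sequential_failure(p: float, n: int) -> float:
    """The chain fails if any of n independent steps fails."""
    return 1 - (1 - p) ** n

def parallel_failure(p: float, n: int) -> float:
    """Approximation: the fan-out fails only if all n workers fail."""
    return p ** n

def hierarchical_failure(p: float, p_subtask: float) -> float:
    """Orchestrator fails outright, or survives and a sub-task fails."""
    return p + (1 - p) * p_subtask

print(round(sequential_failure(0.10, 4), 4))  # 0.3439, i.e. ~34%
print(round(parallel_failure(0.10, 4), 6))    # 0.0001
```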
Sequential Cascade
A failure at step 2 of a 4-agent chain blocks all 3 downstream agents. Error rate compounds: each 10% agent failure probability produces ~34% system failure in a 4-step pipeline. Sequential topologies require robust individual agents.
Parallel Isolation
A failure in Worker-2 of a parallel topology only affects that worker's contribution. Workers 1 and 3 complete normally; the aggregator can proceed with partial results. This makes parallel architectures inherently more resilient to single-agent failures.
What is the difference between agent, environment, and user errors?
Agent Error: The agent produced incorrect output, got stuck in a loop, or failed to use a tool correctly. Corrected by improving agent instructions, adding verification steps, or switching frameworks.
Environment Error: A tool API returned an error, a file was inaccessible, or a network timeout occurred, all outside the agent's control. Corrected by improving tool wrappers, adding retry logic, or handling environment-side failures.
User Error: The input task was malformed, contradictory, or impossible. Corrected at the input validation layer before the agent system is invoked.
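The taxonomy maps naturally onto a small classifier. The trace-event keys below are invented for illustration; MASEval's real attribution logic inspects structured traces:

```python
from enum import Enum

class ErrorClass(Enum):
    AGENT = "agent"
    ENVIRONMENT = "environment"
    USER = "user"

def attribute(event: dict) -> ErrorClass:
    """Three-way attribution in MASEval's taxonomy, using assumed keys."""
    if event.get("input_invalid"):            # malformed or impossible goal
        return ErrorClass.USER
    if event.get("tool_status", 200) >= 500:  # tool/API failed on its own
        return ErrorClass.ENVIRONMENT
    return ErrorClass.AGENT                   # everything else is on the agent

print(attribute({"tool_status": 503}).value)     # environment
print(attribute({"input_invalid": True}).value)  # user
print(attribute({"looped": True}).value)         # agent
```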
How does error attribution guide framework selection?
If your dominant error type is Agent Error, prioritise frameworks with built-in reflection and retry mechanisms (LangGraph's conditional edges, AutoGen's critique loops). If Environment Error dominates, prioritise frameworks with robust tool-call error handling and fallback paths. Misattributing agent errors as environment errors leads to investing in the wrong layer of the stack.
Reference · Common Mistakes
Evaluation Anti-Patterns
Most teams evaluate MAS incorrectly, and the mistakes compound. Click each pattern to understand the failure mode and how to fix it.
Root cause: Most anti-patterns stem from applying single-model evaluation thinking to multi-agent systems. The unit of analysis must be the entire system (topology, orchestration, error handling, and model) evaluated simultaneously on the same benchmark.
Cost Analysis · Scaling
Token Scaling Visualizer
How does token consumption grow as you add more agents? Different topologies scale at fundamentally different rates. Adjust the base token cost per agent to see how your specific workload will scale.
Base tokens / agent turn: 800
Mesh Grows Quadratically
In a fully-connected mesh topology, each agent must process messages from all other agents. With n agents, total context grows as O(n²). At 6 agents, this is 36× the single-agent baseline, often exceeding model context windows entirely.
Parallel Stays Flat
Parallel topology workers process tasks independently with minimal cross-agent context. Token cost scales as O(n) for workers plus a fixed aggregation cost. Adding more workers doesn't compound cost β making parallel the most scalable topology for large n.
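A back-of-envelope scaling model makes the asymptotics concrete. The constant factors are illustrative, not MASEval measurements:

```python
def tokens_per_task(topology: str, n: int, base: int = 800) -> int:
    """Rough token model per task for n agents at `base` tokens per turn."""
    if topology == "mesh":
        return base * n * n              # every agent reads every agent: O(n^2)
    if topology == "parallel":
        return base * n + base           # n independent workers + aggregation: O(n)
    if topology == "sequential":
        return base * n * (n + 1) // 2   # context accumulates along the chain
    raise ValueError(f"unknown topology: {topology}")

print(tokens_per_task("mesh", 6))      # 28800, i.e. 36x the 800-token baseline
print(tokens_per_task("parallel", 6))  # 5600
```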
Systems · Context Management
Context Window Growth
As agents exchange messages across turns, cumulative context grows, often hitting model limits before task completion. Use the slider to set your model's context limit and see which topologies breach it first.
Model context limit: 32k tokens
Summarizer Agent
Adding a dedicated Summarizer agent compresses inter-agent context by 60–70% at each turn boundary. This delays context-limit breaches, enabling longer multi-step tasks without truncation.
Sliding Window
MASEval supports configurable context window strategies: full history, last-N-turns sliding window, or hierarchical summary. Sliding window with N=3 gives ~80% task performance at 30% the context cost of full history.
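A last-N-turns strategy is only a few lines. The message shape follows the common role/content convention; the function name is ours, not MASEval's:

```python
def sliding_window(history: list[dict], n_turns: int = 3) -> list[dict]:
    """Keep any system messages plus the most recent n_turns messages."""
    system = [m for m in history if m.get("role") == "system"]
    rest = [m for m in history if m.get("role") != "system"]
    return system + rest[-n_turns:]

history = [{"role": "system", "content": "You are a worker agent."}] + [
    {"role": "assistant", "content": f"turn {i}"} for i in range(10)
]
print([m["content"] for m in sliding_window(history)])
# ['You are a worker agent.', 'turn 7', 'turn 8', 'turn 9']
```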
Partial Observability
Agents in parallel and hierarchical topologies can be configured with partial observability β each agent sees only messages addressed to it. This bounds per-agent context to O(1) regardless of system size.
Comparison · Live Simulation
Topology Race
Watch two topologies run the same task simultaneously. The animated message flows reveal coordination overhead differences in real time. Select your topologies and hit Race.
Hierarchical: Ready
Sequential: Ready
Reference · Cost Planning
Live Cost Estimator
Estimate your monthly API spend based on task volume, topology, and framework. Adjust the sliders to model your workload and see the cost breakdown.
Tasks / month: 5,000
Avg task complexity: Medium
Framework
Topology
MASEval Tool · Observability
Trace Explorer
Every MASEval run produces structured trace artefacts: per-agent message histories, tool call logs, and token usage reports. Click a tab to inspect what each trace type looks like.
Tabs: Agent Messages · Tool Calls · Token Usage · Task Result
TraceableMixin
All MASEval agents inherit from TraceableMixin, which automatically instruments every send_message() call. No manual logging required; traces are collected transparently regardless of which framework is running underneath.
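The mixin idea can be sketched with `__init_subclass__`: wrap whatever `send_message` a subclass defines so every call is recorded. The attribute names here are assumptions about MASEval's internals, not its actual implementation:

```python
import functools

class TraceableMixin:
    """Sketch: transparently record every send_message() call."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if "send_message" in cls.__dict__:
            inner = cls.send_message

            @functools.wraps(inner)
            def traced(self, *args, **kw):
                # Lazily create a per-instance trace list, then log the call.
                self.__dict__.setdefault("_trace", []).append(
                    {"args": args, "kwargs": kw}
                )
                return inner(self, *args, **kw)

            cls.send_message = traced

class Agent(TraceableMixin):
    def send_message(self, to: str, content: str) -> str:
        return f"{to} <- {content}"

a = Agent()
a.send_message("critic", "draft ready")
print(a._trace)  # [{'args': ('critic', 'draft ready'), 'kwargs': {}}]
```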
Partial Observability
MASEval supports partial observability configurations where sub-agents cannot see each other's full message histories β matching real-world deployment constraints. Traces still capture the full system view for post-hoc analysis even when agents have limited runtime visibility.
MASEval vs. Other Evaluation Libraries

| Feature | MASEval | MLflow GenAI | Inspect-AI | DeepEval |
| Multi-agent orchestration | ✓ Full | ✗ | ~ Partial | ✗ |
| Framework agnosticism | ✓ 6 frameworks | ~ LangChain only | ~ Limited | ~ LangChain |
| System-level comparison | ✓ Native | ✗ | ~ Manual | ✗ |
| Error attribution (3-way) | ✓ | ✗ | ✗ | ~ Partial |
| Trace-first evaluation | ✓ | ~ Logs only | ✗ | ✗ |
| GAIA / Tau benchmarks | ✓ Built-in | ✗ | ~ GAIA only | ✗ |
| Reproducibility (seeds) | ✓ | ~ Partial | ✗ | ~ Partial |
| CI/CD integration | ✓ JSON export | ✓ MLflow UI | ~ Partial | ✗ |
Recommendations · Selection Tool
Framework Selection Wizard
Answer 5 questions about your system requirements and get a ranked recommendation of topology and framework tailored to your use case.
Applications · Use Case Map
Real-World Use Cases
Different application domains benefit from different MAS architectures. Here are common use cases mapped to their optimal topology and framework based on MASEval benchmark findings.
MAESTRO introduces reliability as a first-class evaluation dimension, measuring not just whether a system succeeds but how gracefully it handles failure. Adjust the stress parameters to see how topology and retry budget affect system resilience.
Why reliability matters: A MAS that achieves 70% task success under ideal conditions may degrade to 30% under realistic failure rates. MAESTRO separates peak performance from sustained performance; the latter determines production viability.
Mean time to recovery averages the expected recovery steps across all agents weighted by their failure probability.
Resilience Score
R = recovered_failures / total_failures ≈ 1 − (1 − p_retry)^retries
MAESTRO target: R ≥ 0.80 · Production threshold: R ≥ 0.90
A resilience score above 0.80 is MAESTRO's minimum bar for production-grade MAS. Hierarchical topologies score ~15% higher than sequential due to sub-task isolation.
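The resilience formula in code, with the retry-independence assumption made explicit:

```python
def resilience(p_retry: float, retries: int) -> float:
    """R: probability that at least one of `retries` independent retry
    attempts (each succeeding with probability p_retry) recovers a failure."""
    return 1 - (1 - p_retry) ** retries

print(round(resilience(0.6, 2), 2))  # 0.84 -> clears MAESTRO's R >= 0.80 bar
print(round(resilience(0.6, 1), 2))  # 0.6  -> below the bar
```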
MAESTRO Reliability Findings by Topology
Topology | Failure Rate | MTTR (steps) | Resilience Score | Error Budget Used | Failure Mode
MAESTRO · Tool Use
Tool-Use Evaluation
Real MAS deployments depend heavily on external tool calls: search, code execution, database queries, image generation. MAESTRO evaluates each agent's tool-calling capability as a first-class metric, not just task-level success rate.
Click a tool to see evaluation results
Key MAESTRO finding: Tool-calling errors account for 43% of all MAS failures, more than reasoning errors (31%) or coordination failures (26%). The most common error is wrong argument schema (agent passes malformed input to a tool), not tool unavailability.
Tool Error Taxonomy
HotPotQA Benchmark
HotPotQA requires multi-hop reasoning across multiple documents; agents must chain search tool calls correctly. MAESTRO uses it to stress-test tool sequencing under information retrieval scenarios.
Tasks: 113K · Avg Hops: 2.8 · Best MAS Score: 74.3% · Human Baseline: 91.2%
Best MAS: Hierarchical + LangGraph, GPT-4o orchestrator + GPT-3.5 workers. Single-agent GPT-4o scores 61.8%; the 12.5pp gap demonstrates the value of multi-agent tool chaining.
Tool-Use Success Rate by Framework × Tool Category
MAESTRO · Framework Comparison
MAESTRO vs MASEval
Two complementary frameworks for evaluating multi-agent systems. MASEval (arXiv:2603.08835) focuses on topology and framework benchmarking. MAESTRO (arXiv:2601.00481) adds reliability, observability, and tool-use as first-class dimensions.
Side-by-Side Feature Matrix
| Capability | MASEval | MAESTRO | Notes |
| Topology benchmarking | ✓ | ✓ | Both cover sequential, hierarchical, parallel, mesh |