πŸ“Š
MAS Eval Metrics
Post 26 Β· Evaluation
95 evaluation metrics for LLM multi-agent systems β€” verified from 61 directly-read papers. Includes real-world importance ratings.
The Measurement Gap in Multi-Agent AI
Single-model benchmarks break down when applied to systems where multiple agents coordinate, communicate, and fail together. 130+ research papers reveal a rich taxonomy of specialized metrics β€” most unknown to practitioners.
95
Unique Metrics
61
Papers Read
8
Metric Categories
23
Critical/High Impact
The core insight: A 2.1% accuracy difference between two MAS systems can mask a 12.8% difference in information divergence (IDS) and an 80% difference in unnecessary path ratio (UPR). Outcome metrics alone are insufficient β€” process metrics are essential. (GEMMAS, arxiv:2507.13190)
Why Task Metrics Fail
GSR and accuracy measure final outcomes but ignore how agents coordinated to reach them. A system that succeeds through chaotic, expensive communication is scored identically to an efficient one.
Process-Level Metrics
IDS, UPR, TES, IC, RC and CORE measure the quality of inter-agent communication β€” the hidden layer that determines whether multi-agent collaboration actually adds value over a single model.
Safety in MAS is Emergent
Trust paradox (TVP): increasing inter-agent trust improves task success but amplifies OER and AD β€” information over-exposure and authorization drift. Single-agent safety metrics don't capture this.
Why single-model metrics are insufficient for MAS
In single-agent evaluation, success means the model produced the right output. In multi-agent systems, success requires: (1) correct task outcome, (2) efficient communication, (3) appropriate coordination among roles, (4) safe information handling, and (5) robust planning through failures. Each dimension needs its own metric family. The AgentBoard benchmark introduced fine-grained progress metrics precisely because binary success/failure was masking meaningful differences between agents that partially complete tasks vs. those that fail immediately.
How these metrics were extracted from 130+ papers
The metrics catalogued here were extracted from a systematic review of 130+ papers on LLM-based multi-agent systems, spanning 2023–2026. For this visual, 9 papers were read in full to verify formulas directly: GEMMAS (arxiv:2507.13190), Collab-Overcooked (arxiv:2502.20073), Trust-Vulnerability Paradox (arxiv:2510.18563), COLLAB (Mahmud et al., NeurIPS 2025), CoordiLang (coordination benchmarks), Ο„-bench (Yao et al., 2024), and survey papers by Farooq et al., Mohammadi et al., and Shah et al. Metrics IDS, UPR, OER, AD, ANU, TES, ITES, PC, IC, RC, pass^k all have formulas confirmed directly from primary sources.
Task Performance Metrics β†’
Task Performance Metrics
The foundational layer β€” measuring whether agents accomplish their goals. These metrics form the baseline, but their inadequacy for MAS motivated the creation of all subsequent metric categories.
GSR
Goal Success Rate
Task
Primary end-to-end metric measuring whether the agent achieved the stated goal. Used across GAIA (466 tasks), Tau-bench (120 tasks), and AgentBench. The MASEval framework found that framework choice alone (AutoGen vs Claude SDK) causes an 8.6-percentage-point GSR range — nearly equal to the model-choice gap across frontier LLMs.
GSR = Completed Tasks / Total Tasks × 100%

Example values (GAIA benchmark):
Claude SDK: 53.2% · AutoGen: 51.7% · LangGraph: 48.9% · LlamaIndex: 46.1% · smolagents: 44.6%
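As a minimal sketch, GSR is just a completion ratio. The outcome list below is hypothetical, sized to reproduce a 53.2% rate:

```python
def goal_success_rate(results):
    """GSR = completed tasks / total tasks, as a percentage."""
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)

# Hypothetical run: 248 of 466 tasks completed.
outcomes = [True] * 248 + [False] * 218
print(round(goal_success_rate(outcomes), 1))  # 53.2
```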
Benchmark Context
GAIA (466 tasks): Real-world generalist tasks. Primary metric: GSR.

Tau-bench (120 tasks): Policy compliance in tool-use scenarios.

MMLU (14,042 tasks): Knowledge accuracy across domains.

AgentBoard: Fine-grained progress across 9 environments.
The Limitation
Task metrics treat a system as a black box. Two systems with identical GSR can differ by: 10Γ— in token cost, 80% in path redundancy (UPR), and 12.8% in information diversity (IDS). The AgentBoard paper introduced progress rate precisely to surface these differences β€” binary success hides partial progress that matters for long-horizon tasks.
Communication Quality Metrics β†’
Communication Quality Metrics
The hidden layer of MAS evaluation. These metrics assess the quality, diversity, and efficiency of inter-agent message exchange β€” the dimension where MAS either adds value or creates expensive redundancy.
Metrics in this category: IDS · UPR · CORE · AF · TES · ITES · PC · IC · RC
IDS
Information Divergence Score
Measures semantic divergence in agent messages using weighted pairwise similarity. Low IDS = agents echo each other. High IDS = rich, diverse information flow. Key finding: 2.1% accuracy gap masked 12.8% IDS difference on GSM8K.
IDS = Ξ£(i,j) w_ijΒ·(1βˆ’SS_total[i,j]) / Ξ£(i,j) w_ij where: w_ij = max(S_ij,S_ji) + max(T_ij,T_ji) SS_total = 0.5Β·SS_syn + 0.5Β·SS_sem Source: GEMMAS (arxiv:2507.13190) GSM8K: same accuracy β‰  same quality System A: accuracy 87.3%, IDS 0.71 System B: accuracy 89.4%, IDS 0.84 ← 12.8% richer
Key insight from GEMMAS (arxiv:2507.13190): Outcome-only metrics are insufficient. A system with 80% higher UPR (UPR = 1 βˆ’ |P_necessary|/|P_all|) β€” meaning 80% more redundant reasoning paths β€” can achieve identical accuracy. This redundancy directly translates to higher computational costs and latency. IDS and UPR together expose the hidden inefficiency that GSR cannot see.
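A sketch of IDS and UPR under the GEMMAS formulas. The matrices `S` and `T` are taken as given directional weight matrices over the interaction DAG; their construction follows the paper and is not reproduced here:

```python
def ids(ss_syn, ss_sem, S, T):
    """Information Divergence Score (GEMMAS, arxiv:2507.13190).
    ss_syn, ss_sem: pairwise syntactic/semantic similarity matrices;
    S, T: directional weight matrices (assumed given)."""
    n = len(ss_syn)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            w = max(S[i][j], S[j][i]) + max(T[i][j], T[j][i])
            ss_total = 0.5 * ss_syn[i][j] + 0.5 * ss_sem[i][j]
            num += w * (1.0 - ss_total)
            den += w
    return num / den if den else 0.0

def upr(all_paths, necessary_paths):
    """Unnecessary Path Ratio: UPR = 1 - |P_necessary| / |P_all|."""
    return 1.0 - len(necessary_paths) / len(all_paths)

# Two agents that largely echo each other (similarity 0.9) -> low IDS.
sim = [[1.0, 0.9], [0.9, 1.0]]
ones = [[0, 1], [1, 0]]
print(round(ids(sim, sim, ones, ones), 2))  # 0.1
```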
GEMMAS Framework
Introduced IDS and UPR by modeling agent interactions as a directed acyclic graph (DAG). Evaluated across 5 benchmarks. Enables process-level diagnostics invisible to outcome metrics.
Collab-Overcooked
Introduced TES, ITES, PC, IC, RC across 30 tasks in 6 complexity levels. Finding: LLMs excel at goal interpretation but struggle with active collaboration and continuous adaptation as complexity increases.
C2C Framework (AF)
Alignment Factor measures how well agent task understanding is aligned through communication. Systems optimizing AF reduced task completion time by ~40% with acceptable communication costs across 5–17 agents.
Coordination Quality Metrics
Beyond communication, these metrics measure how well agents actually coordinate β€” from game-theoretic utility to tool-use efficiency and the emergence of groupthink.
ANU β€” Average Normalized Utility
From the COLLAB benchmark (Mahmud et al., NeurIPS 2025) which adapts Distributed Constraint Optimization Problems (DCOPs).

Formula: ANU = (U_achieved βˆ’ U_min) / (U_max βˆ’ U_min)

Normalizes achieved utility against optimal DCOP solver bounds. Finding: LLMs underperform symbolic solvers but are competitive in sparse constraint regimes. Instruction modality matters β€” image-based instructions sometimes outperform text.
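A minimal sketch of the per-instance normalization (averaging across instances to get the "average" in ANU is left to the caller):

```python
def anu(u_achieved, u_min, u_max):
    """Normalized utility for one instance: min-max normalization of
    achieved utility against optimal DCOP solver bounds."""
    if u_max == u_min:          # degenerate instance: bounds coincide
        return 1.0 if u_achieved >= u_max else 0.0
    return (u_achieved - u_min) / (u_max - u_min)

print(anu(75, 50, 100))  # 0.5
```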
CSS & TUE β€” From TRiSM
Component Synergy Score (CSS): Quantifies quality of inter-agent collaboration as a composite score.

Tool Utilization Efficacy (TUE): Evaluates whether agents use the right tools at the right time β€” not just whether they use tools at all.
Conformity & Independence
From BenchForm benchmark. Conformity Rate: how often agents abandon correct answers under group pressure. Independence Rate: inverse. Critical for multi-agent debate systems to avoid groupthink.
Coordination findings. COLLAB: with ANU = (U_achieved − U_min) / (U_max − U_min), LLMs are competitive with symbolic DCOP solvers in sparse constraint regimes, and instruction modality matters. CoordiLang: LLMs' zero-shot coordination drop (3.7%) is far smaller than RL agents' (16.2%) — language is a powerful coordination substrate. MultiAgentBench (Shah et al.): graph-structured communication outperforms the other topologies tested, and cognitive planning prompting adds +3% across tasks.
Safety & Trust Metrics β†’
Safety & Trust Metrics
Multi-agent safety is fundamentally different from single-model safety. Trust between agents creates new attack surfaces β€” the Trust-Vulnerability Paradox shows that improving coordination and improving security are often in direct conflict.
OER β€” Over-Exposure Rate
Formula: OER(S,τ) = (1/|G(S,τ)|) × Σ_{ℓ∈G(S,τ)} 𝟙(O_ℓ ∖ A* ≠ ∅)

Counts task instances where agent outputs exceed Minimum Necessary Information set A*. Evaluated per system S and task type Ο„ across instance set G. Key finding: higher inter-agent trust consistently increases OER even as it improves task success β€” the Trust-Vulnerability Paradox.
AD β€” Alignment Dispersion
Formula: AD(S) = Ξ£_Ο„ w_τ·(OER(S,Ο„) βˆ’ OERΜ„(S))Β²
where OERΜ„(S) = Ξ£_Ο„ w_τ·OER(S,Ο„) (weighted mean)

Measures the variance of OER across task types Ο„, weighted by task frequency w_Ο„. High AD means the system's exposure control is inconsistent β€” safe in some task contexts, unsafe in others. A more robust measure than OER alone.
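Both exposure metrics can be sketched together, assuming each instance's output and the Minimum Necessary Information set A* are represented as sets of disclosed items (the item names below are hypothetical):

```python
def oer(outputs, mni):
    """Over-Exposure Rate: fraction of instances whose output contains
    anything outside the Minimum Necessary Information set A* (mni)."""
    return sum(1 for out in outputs if set(out) - set(mni)) / len(outputs)

def alignment_dispersion(oer_by_task, freq):
    """AD(S): frequency-weighted variance of OER across task types."""
    total = sum(freq.values())
    w = {t: freq[t] / total for t in freq}
    mean = sum(w[t] * oer_by_task[t] for t in oer_by_task)
    return sum(w[t] * (oer_by_task[t] - mean) ** 2 for t in oer_by_task)

# One of two instances leaks an item ("ssn") beyond A*.
print(oer([{"diagnosis"}, {"diagnosis", "ssn"}], {"diagnosis"}))  # 0.5
```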
ERS β€” Effective Robustness Score
Formula: task_effectiveness Γ— (1 βˆ’ ASR)

Introduced in TAMAS benchmark (300 adversarial instances, 6 attack types, 211 tools). Measures the safety-effectiveness tradeoff. Multi-agent systems scored significantly lower ERS than single-agent baselines under adversarial conditions.
ECS β€” Ethical Cooperation Score
Formula: cooperation Γ— autonomy Γ— integrity Γ— fairness

From Constitutional Multi-Agent Governance (CMAG). Penalizes cooperation achieved through manipulation. Unconstrained optimization gets 0.873 cooperation but ECS of 0.645 β€” CMAG achieves 0.741 ECS with only modest cooperation reduction. Hub-periphery exposure disparities reduced by 60%.
The Trust-Vulnerability Paradox (TVP, arxiv:2510.18563): Increasing inter-agent trust enhances coordination but expands OER(S,Ο„) β€” exposures beyond the Minimum Necessary Information set A*. Formalized across 3 macro scenes and 19 sub-scenes. AD(S) = Ξ£_Ο„ w_τ·(OER(S,Ο„)βˆ’OERΜ„(S))Β² measures whether this over-exposure is consistent or task-dependent. High AD means the system is unpredictably risky.
SafeR, SuccR, and SafeR@S — the embodied safety triad
Introduced in Safe-BeAl for LLM-based embodied agents performing daily tasks (cooking, cleaning, organizing). SafeR = proportion of scripts classified as safe. SuccR = proportion completing the task. SafeR@S = proportion that are safe among those that succeed. Safe-BeAl improved SafeR by 8.55–15.22% while preserving SuccR, demonstrating that safety and task completion can be jointly optimized. LLM agents without alignment exhibit unsafe behaviors even without adversarial inputs β€” purely from hallucinations and misalignment with physical-world knowledge.
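The triad is straightforward to compute from per-script (safe, successful) labels — a sketch with hypothetical labels, not Safe-BeAl's pipeline:

```python
def safety_triad(scripts):
    """scripts: list of (is_safe, is_successful) labels per generated
    script. Returns (SafeR, SuccR, SafeR@S)."""
    n = len(scripts)
    safer = sum(1 for safe, _ in scripts if safe) / n
    succr = sum(1 for _, ok in scripts if ok) / n
    safe_among_succ = [safe for safe, ok in scripts if ok]
    safer_at_s = (sum(safe_among_succ) / len(safe_among_succ)
                  if safe_among_succ else 0.0)
    return safer, succr, safer_at_s

labels = [(True, True), (True, False), (False, True), (False, False)]
print(safety_triad(labels))  # (0.5, 0.5, 0.5)
```

Note how SafeR@S conditions on success: a system can have high SafeR overall yet be unsafe precisely on the scripts that actually complete the task.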
ASR and FPR in adversarial MAS evaluation
Attack Success Rate (ASR) measures vulnerability to adversarial inputs (prompt injection, jailbreaking, AiTM attacks). Key data from Wang et al. (IEEE 2025): AutoDefense reduced ASR from 55.74% to 7.95% β€” an 85.7% relative reduction β€” by using a multi-agent defense pipeline where a defense agent screens LLM outputs before delivery. AgentMonitor (Wang et al.) achieved βˆ’6.2% harmful content and +1.8% helpful content simultaneously β€” a dual optimization proving that safety and helpfulness are not in conflict when measured separately. False Positive Rate (FPR) is critical to avoid over-refusal: WaltzRL reduced unsafe responses from 39.0% to 4.6% and over-refusals from 45.3% to 9.9% simultaneously using a two-agent RL feedback loop.
Planning & Progress Metrics β†’
Planning & Progress Metrics
Long-horizon task evaluation requires metrics that capture intermediate progress, not just final outcomes. These metrics distinguish agents that fail immediately from those that make meaningful partial progress.
CheckPoint β€” Mobile-Bench
Category-based metric assessing whether mobile agents reach essential decision points during planning. Introduced alongside PassRate in Mobile-Bench (832 data entries, 103 APIs, 200+ multi-APP tasks).

Why it matters: An agent can fail PassRate (final task success) while still correctly navigating early checkpoints β€” sequential errors mean later checkpoints often fail even if early planning was correct.
Progress Rate β€” AgentBoard
Fine-grained progress metric from the NeurIPS 2024 AgentBoard benchmark. Captures incremental advancement across 9 environments including web navigation, household tasks, and scientific discovery.

Key finding: Using only final success rates, researchers were missing meaningful differences between agents. Progress Rate revealed that some "failing" agents were nonetheless making 70–80% progress toward task completion.
Three task complexity tiers (Mobile-Bench): SAST (Single App, Single Task), SAMT (Single App, Multi Task), MAMT (Multi App, Multi Task). CheckPoint performance degrades rapidly in MAMT β€” multi-app coordination failures cascade through planning checkpoints, revealing where LLM mobile agents break down.
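A sketch of CheckPoint-style scoring — not Mobile-Bench's exact scorer, but it captures the sequential-credit idea described above: once a required checkpoint is missed, later ones earn no credit. The trace and checkpoint names are hypothetical:

```python
def checkpoint_score(trace, checkpoints):
    """Credit the longest prefix of required checkpoints that appears,
    in order, as a subsequence of the agent's action trace."""
    steps = iter(trace)
    hit = 0
    for cp in checkpoints:
        if any(step == cp for step in steps):
            hit += 1    # reached, somewhere after the previous checkpoint
        else:
            break       # missed: later checkpoints get no credit
    return hit / len(checkpoints)

# The agent skips "checkout", so "pay" earns no credit either.
print(checkpoint_score(["open_app", "search", "pay"],
                       ["open_app", "search", "checkout", "pay"]))  # 0.5
```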
Planning Evaluation Hierarchy
Milestone Completion
High-level: did the agent achieve major intermediate goals? Used in GAIA and AgentBoard. Binary per milestone.
CheckPoint Score
Mid-level: did the agent correctly navigate decision points during execution? Category-based, continuous score.
ITES (Incremental TES)
Micro-level: what was the contribution of each individual action to overall progress? Per-step attribution.
Domain-Specific Metrics β†’
Domain-Specific Metrics
Multi-agent systems in healthcare, game theory, and creative generation require metrics tailored to domain constraints. Generic task metrics cannot capture clinical correctness, strategic equilibria, or creative novelty.
MedAgentBoard finding: Multi-agent collaboration benefits specific scenarios (clinical workflow automation: +task completeness) but does not consistently outperform advanced single LLMs or specialized conventional methods. The evaluation suite uses 4 metric families across 4 task types.
AUROC & AUPRC β€” EHR Tasks
AUROC: P(score_pos > score_neg) β€” gold standard for EHR predictive tasks (mortality, readmission).

AUPRC: Preferred for imbalanced clinical datasets where positive events (ICU readmissions, mortality) are rare. AUPRC is more sensitive to performance on the minority class that matters most clinically.
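AUROC can be computed directly from the probabilistic definition above, counting ties as 1/2 — an O(n_pos × n_neg) sketch that is fine for small evaluation sets:

```python
def auroc(scores, labels):
    """AUROC = P(score_pos > score_neg); ties count as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# One mis-ranked positive/negative pair out of four: 3/4.
print(auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```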
ROUGE-L & SARI β€” Lay Summaries
ROUGE-L: Longest common subsequence overlap between generated and reference medical summaries. Measures content coverage.

SARI: System output Against References and Input β€” averages add/keep/delete F1 scores. Captures simplification quality beyond n-gram overlap. Used for lay medical text generation where both accuracy and readability matter.
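A sketch of the F1 variant of ROUGE-L via the classic LCS dynamic program (reference implementations differ in tokenization and in the β weighting of the F-measure; whitespace splitting here is a simplification):

```python
def rouge_l_f1(candidate, reference):
    """ROUGE-L as F1 over the longest common subsequence of tokens."""
    c, r = candidate.split(), reference.split()
    # Classic O(|c|*|r|) LCS dynamic program.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, tc in enumerate(c):
        for j, tr in enumerate(r):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if tc == tr
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

# LCS "the cat on mat" (4 tokens) over two 5-token texts.
print(round(rouge_l_f1("the cat sat on mat", "the cat on the mat"), 2))  # 0.8
```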
LLM-as-Judge (Free-form QA)
For open-ended medical QA, MedAgentBoard uses LLM-as-a-judge scoring assessing: semantic correctness, clinical relevance, and factual consistency. Human expert panels additionally judge correctness, completeness, accuracy, and coherence for clinical workflow automation tasks.
Discordance & Undertriage Rates
For clinical triage MAS: discordance rate measures disagreement between agent recommendation and clinical gold standard. Undertriage rate measures dangerous under-prioritization. These domain-specific safety metrics are absent from general MAS benchmarks.
Multi-Dimensional Evaluation Frameworks
Beyond individual metrics, researchers have proposed holistic frameworks that evaluate agentic AI across multiple axes simultaneously β€” recognizing that no single metric captures system quality.
5-Axis Balanced Framework (Shukla 2025)
Axis 1: Capability & Efficiency β€” task success, latency, token cost
Axis 2: Robustness & Adaptability β€” failure recovery, distribution shift
Axis 3: Safety & Ethics β€” OER, AD, ECS, harm indices
Axis 4: Human-Centered Interaction β€” trust, transparency, explainability
Axis 5: Economic & Sustainability β€” cost, environmental impact

Industry deployments show 20–60% productivity gains but routinely omit Axes 3–5.
Goal-Drift Score
Novel indicator measuring deviation from the original user intent over long-horizon tasks. An agent that starts pursuing an aligned goal but gradually drifts toward misaligned objectives as context accumulates can still score well on traditional outcome metrics — only a goal-drift score exposes the deviation.
Harm-Reduction Index
Measures reduction in harmful outcomes relative to a baseline: (Baseline_harm βˆ’ Agent_harm) / Baseline_harm. Requires defining a baseline system and harm taxonomy. Part of the 5-axis framework's safety dimension β€” currently absent from most MAS evaluation pipelines.
Two-Dimensional Taxonomy (Mohammadi et al.)
Dimension 1: Evaluation Objectives β€” what to evaluate (behavior, capabilities, reliability, safety). Dimension 2: Evaluation Process β€” how to evaluate (interaction modes, benchmarks, metric computation, tooling). Highlights enterprise gaps: role-based access, dynamic long-horizon interactions, compliance.
Three-Tier Value Framework
From application-driven value alignment survey: Macro level (societal values), Meso level (organizational policies), Micro level (agent behavior). Evaluation datasets and methods are mapped to each tier, enabling alignment-aware evaluation that connects individual agent metrics to societal outcomes.
Interactive Metric Explorer β†’
The Full Metric Catalogue
All 95 metrics, verified from 61 papers. Each metric carries a real-world importance rating (1–5) and a practitioner impact note; the Critical and High-impact subset is summarized below.
Critical + High Importance (29 metrics)
Critical (3): GSR, Accuracy, ASR β€” universal leaderboard and security standards.
High (26): pass@k, pass^k, pGSR, Sub.R, Sup-GSR, Sys-GSR, Elo, OER, AD, ERS, ORR, PDR, ASV, Robustness, FPR, CheckPoint, Progress Rate, AUROC, AUPRC, ROUGE-L, plus 6 production metrics: TTFT, CPT, TSA, TCR, MCR, HR — together, the production-grade MAS evaluation stack.
Biggest Real-World Gaps
ORR (over-refusal) is critical for production but rarely reported. Goal-Drift and Harm-Reduction are conceptually defined but lack implementations. AD (alignment dispersion) reveals unpredictable safety behavior β€” almost never measured despite legal exposure. ASV is the most interpretable safety metric but absent from most MAS evaluation pipelines.
Recommended Practitioner Stack
For production MAS: pass^k (reliability) + pGSR (partial credit) + ORR+ASR+ERS (safety triangle) + OER+AD (exposure consistency) + IDS+UPR (communication audit) + ANU (coordination utility) + CheckPoint (planning progress). This 11-metric stack covers all 8 metric categories, using formulas confirmed from primary sources.
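The reliability anchor of this stack, pass^k, can be sketched with its standard unbiased estimator: given c successes out of n sampled trials per task, the probability that all k i.i.d. trials would succeed is estimated as C(c,k)/C(n,k), averaged over tasks (this follows the τ-bench-style definition; averaging is left to the caller):

```python
from math import comb

def pass_hat_k(n, c, k):
    """pass^k estimator for one task: probability that ALL k i.i.d.
    trials succeed, given c of n sampled trials succeeded."""
    if k > c:
        return 0.0
    return comb(c, k) / comb(n, k)

# 2 of 4 trials passed: a single draw succeeds half the time,
# but demanding 2-for-2 drops estimated reliability to 1/6.
print(pass_hat_k(4, 2, 1), pass_hat_k(4, 2, 2))  # 0.5 0.16666666666666666
```

This is why pass^k punishes flaky systems far harder than pass@k: redundancy of failures compounds multiplicatively with k.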