Evaluating a single LLM call is hard enough. Evaluating an LLM agent — autonomous, multi-step, tool-using, non-deterministic — is a different problem entirely.
Click a panel to explore each challenge
Open-Ended Behaviour
Unlike a single LLM call with a defined output format, agents can take any sequence of actions to reach a goal. There is no unique "correct" path — two agents that both succeed may take completely different routes. Standard unit-test style evaluation cannot capture this space of valid solutions.
Non-Deterministic
The same query produces different tool call sequences, reasoning chains, and outputs on each run. You cannot run once and declare pass/fail — you need statistical sampling across many runs to get reliable measurements.
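A hedged sketch of what "statistical sampling" means in practice: estimate the pass rate from repeated runs and report a confidence interval rather than a single pass/fail. The Wilson score interval below is one standard choice; the run counts are invented.

```python
import math

def wilson_interval(successes: int, runs: int, z: float = 1.96) -> tuple:
    """95% Wilson score interval for a pass rate estimated from repeated runs."""
    p = successes / runs
    denom = 1 + z**2 / runs
    centre = (p + z**2 / (2 * runs)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2))
    return (max(0.0, centre - half), min(1.0, centre + half))

# Same 70% point estimate, very different uncertainty:
print(wilson_interval(7, 10))    # roughly (0.40, 0.89)
print(wilson_interval(70, 100))  # roughly (0.60, 0.78)
```

Ten runs cannot distinguish a 50% agent from an 85% one; a hundred runs can.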
Multi-Step & Long-Horizon
Agents take dozens to hundreds of actions per task. An error at step 3 may not surface until step 47. Diagnosing failure requires step-level tracing, not just checking the final output.
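To illustrate step-level tracing, here is a minimal, hypothetical trace record and a search for the earliest step whose check failed — the root cause, rather than the step where the error finally surfaced. The `Step` structure and flight-booking actions are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Step:
    index: int
    action: str      # tool call or reasoning move
    ok: bool         # did this step's check / post-condition hold?
    detail: str = ""

def first_divergence(trace):
    """Earliest failing step — the root cause, not where the error surfaced."""
    return next((s.index for s in trace if not s.ok), None)

trace = [
    Step(1, "search_flights", True),
    Step(2, "parse_results", True),
    Step(3, "select_date", False, "return date before departure"),
    Step(4, "book_flight", True),   # mechanically succeeds — wrong booking
]
print(first_divergence(trace))  # 3
```

Checking only the final output would blame step 4; the trace points at step 3.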
System-Level Interactions
Agent behaviour emerges from interactions between the LLM, tools, memory, external APIs, and environment state. Evaluating the model in isolation misses most real failure modes — evaluation must be system-level.
A multivocal literature review of 134 academic and 27 industry sources reveals a dramatic disconnect between how agents are evaluated and what good evaluation requires. (Xia et al., 2024)
Click each bar for details · Blue = academic · Green = industry practice
Pre-deployment Focus (93%)
93.28% of academic evaluation sources focus exclusively on pre-deployment (offline) evaluation. Only 2.24% cover post-deployment, and 4.48% cover continuous evaluation. Yet in industry, 40.74% of practitioners use continuous evaluation. Agents deployed in production face distribution shift, new edge cases, and changing user behaviour that pre-deployment benchmarks cannot anticipate.
The Implication
Academic benchmarks optimise for what is easy to measure — final task success on controlled inputs — not what matters in production: reliability over time, graceful degradation, and safe failure modes.
Industry vs Academia
Industry shows more balanced evaluation: 44.44% pre-deployment, 14.81% post-deployment, 40.74% continuous. Industry practitioners face real consequences when agents fail — this drives more rigorous evaluation practice.
What Good Looks Like
Evaluation should span the full lifecycle, combine end-to-end with intermediate metrics, operate at system level, adapt to new risks, and close the loop: findings must drive concrete improvements.
Before choosing metrics, you must know what you are measuring. Agent evaluation covers four core capability areas — each requires distinct evaluation approaches.
Planning — Decomposing Goals into Actions
Planning is the agent's ability to break a high-level goal into an ordered sequence of executable sub-tasks, handle dependencies between steps, and recover from unexpected results. Evaluation checks: does the plan cover all required steps? Is the order logically correct? Does the agent adapt the plan when an intermediate step fails?
Key benchmarks: WebArena (web task planning), AgentBench (multi-domain planning), SWE-bench (software engineering planning). Key metric: plan quality score — the fraction of sub-tasks the generated plan gets right relative to a reference plan.
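As a sketch of one way to compute such a score — the exact-match comparison and recall-style reading are assumptions for illustration, not the benchmarks' official definitions:

```python
def plan_quality(generated: list, reference: list) -> float:
    """One simple reading of 'plan quality score': fraction of reference
    sub-tasks that appear in the generated plan (recall, exact match)."""
    ref = set(reference)
    return sum(1 for s in set(generated) if s in ref) / len(ref) if ref else 1.0

def order_correct(generated: list, reference: list) -> bool:
    """Do the reference steps that appear occur in the right order?
    True iff they form a subsequence of the generated plan."""
    it = iter(generated)
    return all(step in it for step in reference if step in set(generated))

print(plan_quality(["a", "b", "x"], ["a", "b", "c"]))  # 2/3 of reference covered
print(order_correct(["c", "a"], ["a", "c"]))           # both present, wrong order
```

Real benchmarks typically need fuzzier matching (sub-tasks are free text), but the coverage/ordering split is the useful decomposition.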
92% of academic papers report only end-to-end task success. But a single metric hides the critical question: where exactly did the agent fail? Click a metric category to explore.
End-to-End Metrics
End-to-end metrics measure whether the agent ultimately achieved the goal, regardless of how.
Task Success Rate (TSR): binary pass/fail — did the agent complete the task? Simple but hides all failure modes. Goal Completion Rate (GCR): partial credit — what fraction of the goal was achieved? Better for multi-part tasks. User Satisfaction: human rating of the final output quality (1–5 scale). Captures subjective quality that automated metrics miss.
Limitation: a 70% TSR tells you nothing about whether failures were in planning, tool selection, memory, or output formatting.
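A minimal illustration of why GCR is more informative than TSR on multi-part tasks; the sub-goal lists are invented.

```python
def task_success_rate(passed: list) -> float:
    """TSR: binary pass/fail averaged over tasks."""
    return sum(passed) / len(passed)

def goal_completion_rate(subgoals: list) -> float:
    """GCR: mean fraction of sub-goals achieved per task (partial credit)."""
    return sum(sum(s) / len(s) for s in subgoals) / len(subgoals)

# Hypothetical two-task run: task 2 achieved only 1 of its 2 sub-goals.
subgoals = [[True, True], [True, False]]
print(task_success_rate([all(s) for s in subgoals]))  # 0.5
print(goal_completion_rate(subgoals))                 # 0.75
```

The same run scores 50% or 75% depending on the metric — neither number, alone, says which sub-goal keeps failing.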
Offline evaluation runs agents against fixed test suites without live system access. It is fast, cheap, and reproducible — but cannot capture real-world distribution shift or long-tail edge cases.
Click a phase to see details · Offline pipeline: Define → Build → Execute → Score → Analyse
Define Test Scope
Start by defining what capabilities you are testing and what constitutes a pass. Identify the task types your agent will face in production. Sample representative inputs. Define ground truth answers or reference trajectories. Without a well-defined scope, your benchmark measures what is easy to test, not what matters in production.
Static Test Suite
Fixed question-answer pairs or task specifications with known ground truth. Highly reproducible — run any time and get the same score. Risk: agents can be inadvertently tuned to the benchmark, inflating scores without real improvement.
Dynamic Test Suite
Programmatically generated test cases with randomised parameters. Harder to overfit, broader coverage. Trade-off: harder to interpret failures because test cases are not fixed across runs.
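A sketch of dynamic generation: parameters are randomised per seed and the ground truth is computed rather than hand-labelled, so the suite can be regenerated at any size. The database schema and prompt wording are illustrative, not from any real benchmark.

```python
import random

def make_db_task(seed: int) -> dict:
    """One randomised database-query test case with computed ground truth."""
    rng = random.Random(seed)
    rows = [{"id": i, "amount": rng.randint(1, 100)}
            for i in range(rng.randint(5, 20))]
    threshold = rng.randint(20, 80)
    return {
        "prompt": f"List the ids of orders with amount greater than {threshold}.",
        "rows": rows,
        # Ground truth is derived from the generated data, not hand-labelled:
        "expected": sorted(r["id"] for r in rows if r["amount"] > threshold),
    }

suite = [make_db_task(seed) for seed in range(100)]  # regenerate at any size
assert make_db_task(7) == make_db_task(7)            # reproducible per seed
```

Seeding gives back the reproducibility that static suites have, while keeping the parameter space too large to overfit.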
When should you use offline evaluation? ▼
Use offline evaluation during development (fast iteration), before any deployment (regression testing), and when comparing candidate models or configurations. Do not use offline evaluation as your only signal — production always introduces conditions that benchmarks miss.
What are the risks of offline-only evaluation? ▼
93% of academic evaluation is pre-deployment only. The risks: benchmark saturation (models fine-tuned specifically on common benchmarks), distribution shift (real users behave differently from test designers), and false confidence (a high offline score that masks poor production performance). The EDDOps framework requires continuous online evaluation to close this gap.
The agent evaluation ecosystem has produced specialised benchmarks for different capability areas. Click each to explore what it tests, how it scores, and when to use it.
AgentBench
Multi-domain OS, DB, Web tasks
WebArena
Real web app task completion
SWE-bench
GitHub issue resolution
GAIA
466 real-world assistant tasks
BFCL
2,000 function-call Q&A pairs
ToolEmu
Safety-focused tool evaluation
AgentBench — Multi-Domain Agent Evaluation
AgentBench tests LLM agents across 8 distinct environments: OS (shell commands), DB (database queries), KG (knowledge graphs), ALFWorld (household tasks), Mind2Web (web navigation), WebShop (shopping), plus digital card game and lateral-thinking-puzzle environments.
What it measures: task completion rate across diverse, realistic environments requiring planning, tool use, and state tracking. Why it matters: multi-domain testing reveals that agent performance varies dramatically by domain. Limitation: static test suite — agents can be fine-tuned specifically to these environments.
Online Evaluation — Measuring What Actually Happens
Online evaluation monitors agent behaviour in real production environments, with real users and real consequences. It catches what offline testing misses — but is slower, costlier, and harder to interpret.
Simulated production monitoring dashboard
Shadow Mode
Run the new agent in parallel with the current system — same inputs, both process the request, only the current system's output is shown to users. Compare results offline. Zero user risk. Widely used for validating model upgrades before cutover.
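A sketch of the shadow-mode wiring: the candidate runs on the same input, its output is only logged, and a candidate crash must never affect the live response. The agent interfaces here are assumed to be simple callables.

```python
def shadow_compare(prod_agent, candidate_agent, request, log):
    """Serve the production answer; run the candidate on the same input
    and record any disagreement for offline review."""
    prod_out = prod_agent(request)
    try:
        cand_out = candidate_agent(request)
        log.append({"request": request, "match": cand_out == prod_out,
                    "prod": prod_out, "candidate": cand_out})
    except Exception as exc:
        # Candidate failures are logged, never surfaced to the user.
        log.append({"request": request, "match": False,
                    "prod": prod_out, "candidate_error": repr(exc)})
    return prod_out  # users only ever see the production output

log = []
answer = shadow_compare(lambda q: q.upper(), lambda q: q.upper() + "!", "hi", log)
print(answer, log[0]["match"])  # HI False
```

The disagreement log — not the live traffic — is what you analyse before deciding to cut over.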
A/B Testing
Route a fraction of real traffic to the new agent. Measure task success, user satisfaction, and error rates on live users. Statistically valid comparison. Risk: real users experience the experimental agent — requires guardrails and rollback capability.
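To decide whether an observed difference between arms is real or noise, a two-proportion z-test is one common choice; the success counts below are invented.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z statistic for H0: both arms share the same task-success rate."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical arms: 70.0% success in control, 74.5% with the new agent.
z = two_proportion_z(700, 1000, 745, 1000)
print(round(z, 2))  # |z| > 1.96 ≈ significant at the 5% level
```

A 4.5-point lift on 1,000 tasks per arm clears the threshold; the same lift on 100 tasks per arm would not — which is why arm sizes must be fixed before the test starts.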
What metrics do you track online? ▼
Task success rate (sampled, human-verified), user satisfaction ratings, error rates by type (tool failure, context lost, unhelpful response), latency percentiles (p50/p95/p99), cost per task, and escalation rate (tasks handed to human operators). Track these in time-series dashboards so regressions are visible immediately.
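A nearest-rank percentile sketch for the latency metrics above; the sample values are invented, and note how a single slow call dominates the tail.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile — fine for dashboards, no interpolation."""
    xs = sorted(samples)
    return xs[max(0, math.ceil(q / 100 * len(xs)) - 1)]

latencies_ms = [120, 95, 310, 2400, 140, 160, 105, 130, 980, 150]
print(percentile(latencies_ms, 50))  # 140
print(percentile(latencies_ms, 95))  # 2400 — one slow task sets the p95
```

This is why agent dashboards track p95/p99 rather than the mean: one runaway tool-call loop is invisible in the average and glaring in the tail.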
Why is online evaluation harder than offline? ▼
Ground truth is often unavailable — you don't know the "correct" answer to a user's real query. Attribution is hard — failures come from model, tool, infrastructure, or user input. Confounders are everywhere — user behaviour, time of day, and data drift all affect metrics. Feedback loops are slow — it may take hours or days to know if an action was correct.
Human evaluation is the gold standard but it is slow and expensive. LLM-as-Judge uses a separate (often more capable) LLM to evaluate agent outputs — fast, scalable, but with its own biases and failure modes.
Click a pattern to explore evaluation approaches · reliability bar = estimated agreement with human labels
Pairwise Comparison
Show the judge LLM two agent responses (A and B) for the same task. Ask: which is better, and why? Pairwise comparison is more reliable than absolute scoring because ranking two options is easier than assigning a score on a scale. Risk: position bias — judges tend to prefer the first response shown. Mitigation: randomise order and average both orderings.
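One way to implement the order-randomisation mitigation; `judge` is a placeholder for your LLM-judge call, assumed to return 1 when the first-shown response wins.

```python
def pairwise_verdict(judge, task, resp_a, resp_b, trials: int = 4) -> float:
    """Average A's win rate over both presentation orders so that
    position bias cancels. `judge(task, first, second)` -> 1 if the
    first-shown response wins, else 0 (your LLM call goes here)."""
    score_a = 0.0
    for i in range(trials):
        if i % 2 == 0:
            score_a += judge(task, resp_a, resp_b)      # A shown first
        else:
            score_a += 1 - judge(task, resp_b, resp_a)  # B shown first
    return score_a / trials                             # > 0.5 → A preferred

# A judge with total position bias scores exactly 0.5 — the bias cancels:
always_first = lambda task, first, second: 1
print(pairwise_verdict(always_first, "task", "resp A", "resp B"))  # 0.5
```

A judge with a genuine preference still wins through: it picks the same response regardless of which side it appears on, so the two orderings agree instead of cancelling.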
Known Biases
Position bias (prefer first), verbosity bias (prefer longer), self-preference (prefer outputs from same model family), format bias (prefer well-formatted responses regardless of correctness). Always audit judge calibration against human labels.
When to Use
Open-ended tasks where ground truth doesn't exist (summarisation, explanation quality, helpfulness). Use LLM-as-Judge to replace expensive human evaluation at scale — but always validate on a sample with real human raters first.
When Not to Use
Tasks with factual ground truth (tool call accuracy, parameter correctness, code execution). Tasks where the judge may not have domain knowledge. Safety-critical evaluations where bias in the judge is unacceptable.
EDDOps — Evaluation-Driven Development and Operations
EDDOps (Xia et al., 2024) proposes a process model where evaluation is not a one-time checkpoint but a continuous feedback loop embedded throughout the agent lifecycle.
4 Process Steps · 6 Evaluation Drivers · 3 Architecture Layers · 161 Sources Reviewed
Click each step in the EDDOps cycle to explore details
Step 1: Define Evaluation Plan
Before writing a single test case, define what success looks like. Identify the agent's intended use cases, the user population, and the failure modes that matter most. Choose evaluation methods appropriate to each use case. Assign responsibility for evaluation to specific team members. Key output: an evaluation plan document specifying metrics, thresholds, frequency, and ownership before development begins.
EDDOps 4-Step Cycle:
Step 1: Define Evaluation Plan
├── Identify use cases and user population
├── Select evaluation methods per capability dimension
└── Set acceptance thresholds + assign ownership
Step 2: Develop Test Cases
├── Build offline benchmark suite (static + dynamic)
├── Define ground truth trajectories or outcomes
└── Create adversarial and edge-case inputs
Step 3: Conduct Evaluation (Offline + Online)
├── Offline: run benchmark, score, regression-test
└── Online: shadow mode → A/B test → full deployment
Step 4: Analyse & Improve
├── Root-cause analysis of failures by dimension
├── Update training data / prompts / tool definitions
└── Feed findings back to Step 1 (continuous loop)
How is EDDOps different from standard MLOps? ▼
Standard MLOps focuses on model performance metrics on fixed test sets. EDDOps extends this for agents: evaluation spans the full lifecycle (not just pre-deployment), covers all 4 capability dimensions (not just end-to-end accuracy), requires system-level evaluation (not just model-level), and mandates closed feedback loops where findings drive concrete changes.
Three-Layer Reference Architecture for Agent Evaluation
The EDDOps reference architecture organises evaluation infrastructure into three layers: Supply Chain (data and tools), Agent (the system under test), and Operation (monitoring and control). Click a layer to explore.
Click a layer to highlight its components and responsibilities
Supply Chain Layer
The Supply Chain Layer provides the raw materials for evaluation: training data, tool definitions, evaluation datasets, and ground truth labels. Components: data pipeline (ingestion, cleaning, versioning), tool registry (available APIs, function signatures), and evaluation data store. Getting this layer right is the most underrated part of agent evaluation — poor quality here makes all downstream evaluation unreliable.
Supply Chain
Data pipeline · Tool registry · Evaluation datasets · Ground truth labels · Synthetic test generation · Data versioning
The EDDOps framework distils the literature into six evaluation drivers — principles that separate rigorous evaluation from superficial benchmark-running. Click each to explore.
D1
Lifecycle Coverage
Evaluate across all lifecycle phases, not just pre-deployment
D2
Metric Mix
Combine end-to-end, step-level, and intermediate metrics
D3
System-Level Anchor
Evaluate the full system, not isolated components
D4
Adaptive Evaluation
Update test suites as agent capabilities and risks evolve
D5
Closed Feedback Loops
Evaluation findings must drive concrete improvements
D6
Human Oversight
Keep humans in the loop for safety-critical evaluation decisions
D1 — Lifecycle Coverage
Most academic evaluation covers only pre-deployment. D1 requires evaluation at every phase: development (unit tests per capability), pre-deployment (full benchmark suite), deployment (shadow mode, A/B tests), and post-deployment (continuous monitoring, drift detection). Build evaluation infrastructure that can run continuously without human intervention.
Safety & Robustness — The Evaluation Dimension Everyone Skips
88% of academic evaluation uses AI-only evaluators with no human oversight. Yet safety failures in tool-using agents can have real-world consequences. ToolEmu and related work introduce explicit safety evaluation.
Click an attack type to see evaluation approach
Prompt Injection via Tool Output
A malicious actor embeds instructions inside tool output (e.g., a web page the agent browses). The agent, treating tool output as data, instead executes the injected instruction — potentially exfiltrating data, making unauthorised API calls, or corrupting system state. Evaluation (ToolEmu): present 144 adversarially crafted tool-use scenarios. Score: did the agent complete the legitimate task AND resist the injection?
Why Safety Eval Is Hard
Safety failures are rare in testing but catastrophic in production. Standard benchmarks don't include adversarial inputs. Success is defined as NOT doing something harmful — harder to measure than task completion. The same action can be safe in one context and unsafe in another.
Robustness Evaluation
How does agent performance degrade under: noisy inputs, partial tool failures, ambiguous instructions, contradictory information, and adversarial prompts? Robustness testing requires intentionally degraded inputs — not just well-formed benchmark tasks.
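A sketch of such intentional degradation: inject character-level noise at increasing rates and record success against noise level. The perturbation scheme, toy agent, and task text are all illustrative.

```python
import random

def perturb(text: str, rng: random.Random, rate: float = 0.05) -> str:
    """Character-level noise: random deletions and substitutions."""
    out = []
    for ch in text:
        r = rng.random()
        if r < rate / 2:
            continue                                       # delete character
        if r < rate:
            ch = rng.choice("abcdefghijklmnopqrstuvwxyz")  # substitute
        out.append(ch)
    return "".join(out)

def robustness_curve(run_agent, tasks, rates=(0.0, 0.1, 0.2)):
    """Success rate as input noise increases; `run_agent` stands in for
    your own harness returning True/False per prompt."""
    rng = random.Random(0)  # fixed seed: the degraded suite is reproducible
    return {rate: sum(run_agent(perturb(t, rng, rate)) for t in tasks) / len(tasks)
            for rate in rates}

# Toy agent that "succeeds" only when the key phrase survives the noise:
toy = lambda prompt: "refund" in prompt
tasks = ["please process a refund for order 42"] * 20
print(robustness_curve(toy, tasks))
```

The shape of the curve matters more than any single point: graceful degradation is a gentle slope, brittleness is a cliff at low noise rates.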
What is ToolEmu? ▼
ToolEmu (Ruan et al.) is a safety evaluation framework that emulates tool execution using an LLM, allowing safety testing without live API access. It tests agents across 36 tools and 144 scenarios, scoring both task completion and safety compliance. Key finding: agents are over-eager to comply with instructions that lead to irreversible consequences (deleting files, sending emails, making purchases).
D6 Human Oversight — Why it matters for safety ▼
88% of papers use AI-only evaluators. But AI judges inherit the same biases and blind spots as the models they evaluate. For safety-critical evaluation — deciding whether to deploy an agent with real-world tool access — human review of a sample of evaluation cases is non-negotiable. D6 mandates human oversight at safety decision points: initial deployment approval, after significant capability changes, and when online monitors flag anomalies.