Evaluating a single LLM call is hard enough. Evaluating an LLM agent — autonomous, multi-step, tool-using, non-deterministic — is a different problem entirely.
Click a panel to explore each challenge
Open-Ended Behaviour
Unlike a single LLM call with a defined output format, agents can take any sequence of actions to reach a goal. There is no unique "correct" path — two agents that both succeed may take completely different routes. Standard unit-test style evaluation cannot capture this space of valid solutions.
Non-Deterministic
The same query produces different tool call sequences, reasoning chains, and outputs on each run. You cannot run once and declare pass/fail — you need statistical sampling across many runs to get reliable measurements.
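A hedged sketch of what "statistical sampling" means in practice: estimate the pass rate from repeated runs and report a confidence interval rather than a single pass/fail. The Wilson score interval below is one standard choice; the run counts are invented.

```python
import math

def wilson_interval(successes: int, runs: int, z: float = 1.96) -> tuple:
    """95% Wilson score interval for a pass rate estimated from repeated runs."""
    p = successes / runs
    denom = 1 + z**2 / runs
    centre = (p + z**2 / (2 * runs)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2))
    return (max(0.0, centre - half), min(1.0, centre + half))

# Same 70% point estimate, very different uncertainty:
print(wilson_interval(7, 10))    # roughly (0.40, 0.89)
print(wilson_interval(70, 100))  # roughly (0.60, 0.78)
```

Ten runs cannot distinguish a 50% agent from an 85% one; a hundred runs can.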
Multi-Step & Long-Horizon
Agents take dozens to hundreds of actions per task. An error at step 3 may not surface until step 47. Diagnosing failure requires step-level tracing, not just checking the final output.
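To illustrate step-level tracing, here is a minimal, hypothetical trace record and a search for the earliest step whose check failed — the root cause, rather than the step where the error finally surfaced. The `Step` structure and flight-booking actions are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Step:
    index: int
    action: str      # tool call or reasoning move
    ok: bool         # did this step's check / post-condition hold?
    detail: str = ""

def first_divergence(trace):
    """Earliest failing step — the root cause, not where the error surfaced."""
    return next((s.index for s in trace if not s.ok), None)

trace = [
    Step(1, "search_flights", True),
    Step(2, "parse_results", True),
    Step(3, "select_date", False, "return date before departure"),
    Step(4, "book_flight", True),   # mechanically succeeds — wrong booking
]
print(first_divergence(trace))  # 3
```

Checking only the final output would blame step 4; the trace points at step 3.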
System-Level Interactions
Agent behaviour emerges from interactions between the LLM, tools, memory, external APIs, and environment state. Evaluating the model in isolation misses most real failure modes — evaluation must be system-level.
A multivocal literature review of 134 academic and 27 industry sources reveals a dramatic disconnect between how agents are evaluated and what good evaluation requires. (Xia et al., 2024)
Click each bar for details · Blue = academic · Green = industry practice
Pre-deployment Focus (93%)
93.28% of academic evaluation sources focus exclusively on pre-deployment (offline) evaluation. Only 2.24% cover post-deployment, and 4.48% cover continuous evaluation. Yet in industry, 40.74% of practitioners use continuous evaluation. Agents deployed in production face distribution shift, new edge cases, and changing user behaviour that pre-deployment benchmarks cannot anticipate.
The Implication
Academic benchmarks optimise for what is easy to measure — final task success on controlled inputs — not what matters in production: reliability over time, graceful degradation, and safe failure modes.
Industry vs Academia
Industry shows more balanced evaluation: 44.44% pre-deployment, 14.81% post-deployment, 40.74% continuous. Industry practitioners face real consequences when agents fail — this drives more rigorous evaluation practice.
What Good Looks Like
Evaluation should span the full lifecycle, combine end-to-end with intermediate metrics, operate at system level, adapt to new risks, and close the loop: findings must drive concrete improvements.
Before choosing metrics, you must know what you are measuring. Agent evaluation covers four core capability areas — each requires distinct evaluation approaches.
Planning — Decomposing Goals into Actions
Planning is the agent's ability to break a high-level goal into an ordered sequence of executable sub-tasks, handle dependencies between steps, and recover from unexpected results. Evaluation checks: does the plan cover all required steps? Is the order logically correct? Does the agent adapt the plan when an intermediate step fails?
Key benchmarks: WebArena (web task planning), AgentBench (multi-domain planning), SWE-bench (software engineering planning). Key metric: plan quality score — the fraction of sub-tasks the generated plan gets right relative to a reference plan.
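As a sketch of one way to compute such a score — the exact-match comparison and recall-style reading are assumptions for illustration, not the benchmarks' official definitions:

```python
def plan_quality(generated: list, reference: list) -> float:
    """One simple reading of 'plan quality score': fraction of reference
    sub-tasks that appear in the generated plan (recall, exact match)."""
    ref = set(reference)
    return sum(1 for s in set(generated) if s in ref) / len(ref) if ref else 1.0

def order_correct(generated: list, reference: list) -> bool:
    """Do the reference steps that appear occur in the right order?
    True iff they form a subsequence of the generated plan."""
    it = iter(generated)
    return all(step in it for step in reference if step in set(generated))

print(plan_quality(["a", "b", "x"], ["a", "b", "c"]))  # 2/3 of reference covered
print(order_correct(["c", "a"], ["a", "c"]))           # both present, wrong order
```

Real benchmarks typically need fuzzier matching (sub-tasks are free text), but the coverage/ordering split is the useful decomposition.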
92% of academic papers report only end-to-end task success. But a single metric hides the critical question: where exactly did the agent fail? Click a metric category to explore.
End-to-End Metrics
End-to-end metrics measure whether the agent ultimately achieved the goal, regardless of how.
Task Success Rate (TSR): binary pass/fail — did the agent complete the task? Simple but hides all failure modes. Goal Completion Rate (GCR): partial credit — what fraction of the goal was achieved? Better for multi-part tasks. User Satisfaction: human rating of the final output quality (1–5 scale). Captures subjective quality that automated metrics miss.
Limitation: a 70% TSR tells you nothing about whether failures were in planning, tool selection, memory, or output formatting.
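A minimal illustration of why GCR is more informative than TSR on multi-part tasks; the sub-goal lists are invented.

```python
def task_success_rate(passed: list) -> float:
    """TSR: binary pass/fail averaged over tasks."""
    return sum(passed) / len(passed)

def goal_completion_rate(subgoals: list) -> float:
    """GCR: mean fraction of sub-goals achieved per task (partial credit)."""
    return sum(sum(s) / len(s) for s in subgoals) / len(subgoals)

# Hypothetical two-task run: task 2 achieved only 1 of its 2 sub-goals.
subgoals = [[True, True], [True, False]]
print(task_success_rate([all(s) for s in subgoals]))  # 0.5
print(goal_completion_rate(subgoals))                 # 0.75
```

The same run scores 50% or 75% depending on the metric — neither number, alone, says which sub-goal keeps failing.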
Offline evaluation runs agents against fixed test suites without live system access. It is fast, cheap, and reproducible — but cannot capture real-world distribution shift or long-tail edge cases.
Click a phase to see details · Offline pipeline: Define → Build → Execute → Score → Analyse
Define Test Scope
Start by defining what capabilities you are testing and what constitutes a pass. Identify the task types your agent will face in production. Sample representative inputs. Define ground truth answers or reference trajectories. Without a well-defined scope, your benchmark measures what is easy to test, not what matters in production.
Static Test Suite
Fixed question-answer pairs or task specifications with known ground truth. Highly reproducible — run any time and get the same score. Risk: agents can be inadvertently tuned to the benchmark, inflating scores without real improvement.
Dynamic Test Suite
Programmatically generated test cases with randomised parameters. Harder to overfit, broader coverage. Trade-off: harder to interpret failures because test cases are not fixed across runs.
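A sketch of dynamic generation: parameters are randomised per seed and the ground truth is computed rather than hand-labelled, so the suite can be regenerated at any size. The database schema and prompt wording are illustrative, not from any real benchmark.

```python
import random

def make_db_task(seed: int) -> dict:
    """One randomised database-query test case with computed ground truth."""
    rng = random.Random(seed)
    rows = [{"id": i, "amount": rng.randint(1, 100)}
            for i in range(rng.randint(5, 20))]
    threshold = rng.randint(20, 80)
    return {
        "prompt": f"List the ids of orders with amount greater than {threshold}.",
        "rows": rows,
        # Ground truth is derived from the generated data, not hand-labelled:
        "expected": sorted(r["id"] for r in rows if r["amount"] > threshold),
    }

suite = [make_db_task(seed) for seed in range(100)]  # regenerate at any size
assert make_db_task(7) == make_db_task(7)            # reproducible per seed
```

Seeding gives back the reproducibility that static suites have, while keeping the parameter space too large to overfit.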
When should you use offline evaluation? ▼
Use offline evaluation during development (fast iteration), before any deployment (regression testing), and when comparing candidate models or configurations. Do not use offline evaluation as your only signal — production always introduces conditions that benchmarks miss.
What are the risks of offline-only evaluation? ▼
93% of academic evaluation is pre-deployment only. The risks: benchmark saturation (models fine-tuned specifically on common benchmarks), distribution shift (real users behave differently from test designers), and false confidence (a high offline score that masks poor production performance). The EDDOps framework requires continuous online evaluation to close this gap.
The agent evaluation ecosystem has produced specialised benchmarks for different capability areas. Click each to explore what it tests, how it scores, and when to use it.
AgentBench
Multi-domain OS, DB, Web tasks
WebArena
Real web app task completion
SWE-bench
GitHub issue resolution
GAIA
466 real-world assistant tasks
BFCL
2,000 function-call Q&A pairs
ToolEmu
Safety-focused tool evaluation
AgentBench — Multi-Domain Agent Evaluation
AgentBench tests LLM agents across 8 distinct environments: OS (shell commands), DB (database queries), KG (knowledge graphs), ALFWorld (household tasks), Mind2Web (web navigation), WebShop (shopping), plus digital card game and lateral-thinking-puzzle environments.
What it measures: task completion rate across diverse, realistic environments requiring planning, tool use, and state tracking. Why it matters: multi-domain testing reveals that agent performance varies dramatically by domain. Limitation: static test suite — agents can be fine-tuned specifically to these environments.
Online Evaluation — Measuring What Actually Happens
Online evaluation monitors agent behaviour in real production environments, with real users and real consequences. It catches what offline testing misses — but is slower, costlier, and harder to interpret.
Simulated production monitoring dashboard
Shadow Mode
Run the new agent in parallel with the current system — same inputs, both process the request, only the current system's output is shown to users. Compare results offline. Zero user risk. Widely used for validating model upgrades before cutover.
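A sketch of the shadow-mode wiring: the candidate runs on the same input, its output is only logged, and a candidate crash must never affect the live response. The agent interfaces here are assumed to be simple callables.

```python
def shadow_compare(prod_agent, candidate_agent, request, log):
    """Serve the production answer; run the candidate on the same input
    and record any disagreement for offline review."""
    prod_out = prod_agent(request)
    try:
        cand_out = candidate_agent(request)
        log.append({"request": request, "match": cand_out == prod_out,
                    "prod": prod_out, "candidate": cand_out})
    except Exception as exc:
        # Candidate failures are logged, never surfaced to the user.
        log.append({"request": request, "match": False,
                    "prod": prod_out, "candidate_error": repr(exc)})
    return prod_out  # users only ever see the production output

log = []
answer = shadow_compare(lambda q: q.upper(), lambda q: q.upper() + "!", "hi", log)
print(answer, log[0]["match"])  # HI False
```

The disagreement log — not the live traffic — is what you analyse before deciding to cut over.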
A/B Testing
Route a fraction of real traffic to the new agent. Measure task success, user satisfaction, and error rates on live users. Statistically valid comparison. Risk: real users experience the experimental agent — requires guardrails and rollback capability.
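To decide whether an observed difference between arms is real or noise, a two-proportion z-test is one common choice; the success counts below are invented.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z statistic for H0: both arms share the same task-success rate."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical arms: 70.0% success in control, 74.5% with the new agent.
z = two_proportion_z(700, 1000, 745, 1000)
print(round(z, 2))  # |z| > 1.96 ≈ significant at the 5% level
```

A 4.5-point lift on 1,000 tasks per arm clears the threshold; the same lift on 100 tasks per arm would not — which is why arm sizes must be fixed before the test starts.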
What metrics do you track online? ▼
Task success rate (sampled, human-verified), user satisfaction ratings, error rates by type (tool failure, context lost, unhelpful response), latency percentiles (p50/p95/p99), cost per task, and escalation rate (tasks handed to human operators). Track these in time-series dashboards so regressions are visible immediately.
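A nearest-rank percentile sketch for the latency metrics above; the sample values are invented, and note how a single slow call dominates the tail.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile — fine for dashboards, no interpolation."""
    xs = sorted(samples)
    return xs[max(0, math.ceil(q / 100 * len(xs)) - 1)]

latencies_ms = [120, 95, 310, 2400, 140, 160, 105, 130, 980, 150]
print(percentile(latencies_ms, 50))  # 140
print(percentile(latencies_ms, 95))  # 2400 — one slow task sets the p95
```

This is why agent dashboards track p95/p99 rather than the mean: one runaway tool-call loop is invisible in the average and glaring in the tail.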
Why is online evaluation harder than offline? ▼
Ground truth is often unavailable — you don't know the "correct" answer to a user's real query. Attribution is hard — failures come from model, tool, infrastructure, or user input. Confounders are everywhere — user behaviour, time of day, and data drift all affect metrics. Feedback loops are slow — it may take hours or days to know if an action was correct.
Human evaluation is the gold standard but it is slow and expensive. LLM-as-Judge uses a separate (often more capable) LLM to evaluate agent outputs — fast, scalable, but with its own biases and failure modes.
Click a pattern to explore evaluation approaches · reliability bar = estimated agreement with human labels
Pairwise Comparison
Show the judge LLM two agent responses (A and B) for the same task. Ask: which is better, and why? Pairwise comparison is more reliable than absolute scoring because ranking two options is easier than assigning a score on a scale. Risk: position bias — judges tend to prefer the first response shown. Mitigation: randomise order and average both orderings.
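One way to implement the order-randomisation mitigation; `judge` is a placeholder for your LLM-judge call, assumed to return 1 when the first-shown response wins.

```python
def pairwise_verdict(judge, task, resp_a, resp_b, trials: int = 4) -> float:
    """Average A's win rate over both presentation orders so that
    position bias cancels. `judge(task, first, second)` -> 1 if the
    first-shown response wins, else 0 (your LLM call goes here)."""
    score_a = 0.0
    for i in range(trials):
        if i % 2 == 0:
            score_a += judge(task, resp_a, resp_b)      # A shown first
        else:
            score_a += 1 - judge(task, resp_b, resp_a)  # B shown first
    return score_a / trials                             # > 0.5 → A preferred

# A judge with total position bias scores exactly 0.5 — the bias cancels:
always_first = lambda task, first, second: 1
print(pairwise_verdict(always_first, "task", "resp A", "resp B"))  # 0.5
```

A judge with a genuine preference still wins through: it picks the same response regardless of which side it appears on, so the two orderings agree instead of cancelling.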
Known Biases
Position bias (prefer first), verbosity bias (prefer longer), self-preference (prefer outputs from same model family), format bias (prefer well-formatted responses regardless of correctness). Always audit judge calibration against human labels.
When to Use
Open-ended tasks where ground truth doesn't exist (summarisation, explanation quality, helpfulness). Use LLM-as-Judge to replace expensive human evaluation at scale — but always validate on a sample with real human raters first.
When Not to Use
Tasks with factual ground truth (tool call accuracy, parameter correctness, code execution). Tasks where the judge may not have domain knowledge. Safety-critical evaluations where bias in the judge is unacceptable.
EDDOps — Evaluation-Driven Development and Operations
EDDOps (Xia et al., 2024) proposes a process model where evaluation is not a one-time checkpoint but a continuous feedback loop embedded throughout the agent lifecycle.
4 Process Steps · 6 Evaluation Drivers · 3 Architecture Layers · 161 Sources Reviewed
Click each step in the EDDOps cycle to explore details
Step 1: Define Evaluation Plan
Before writing a single test case, define what success looks like. Identify the agent's intended use cases, the user population, and the failure modes that matter most. Choose evaluation methods appropriate to each use case. Assign responsibility for evaluation to specific team members. Key output: an evaluation plan document specifying metrics, thresholds, frequency, and ownership before development begins.
EDDOps 4-Step Cycle:
Step 1: Define Evaluation Plan
├── Identify use cases and user population
├── Select evaluation methods per capability dimension
└── Set acceptance thresholds + assign ownership
Step 2: Develop Test Cases
├── Build offline benchmark suite (static + dynamic)
├── Define ground truth trajectories or outcomes
└── Create adversarial and edge-case inputs
Step 3: Conduct Evaluation (Offline + Online)
├── Offline: run benchmark, score, regression-test
└── Online: shadow mode → A/B test → full deployment
Step 4: Analyse & Improve
├── Root-cause analysis of failures by dimension
├── Update training data / prompts / tool definitions
└── Feed findings back to Step 1 (continuous loop)
How is EDDOps different from standard MLOps? ▼
Standard MLOps focuses on model performance metrics on fixed test sets. EDDOps extends this for agents: evaluation spans the full lifecycle (not just pre-deployment), covers all 4 capability dimensions (not just end-to-end accuracy), requires system-level evaluation (not just model-level), and mandates closed feedback loops where findings drive concrete changes.
Three-Layer Reference Architecture for Agent Evaluation
The EDDOps reference architecture organises evaluation infrastructure into three layers: Supply Chain (data and tools), Agent (the system under test), and Operation (monitoring and control). Click a layer to explore.
Click a layer to highlight its components and responsibilities
Supply Chain Layer
The Supply Chain Layer provides the raw materials for evaluation: training data, tool definitions, evaluation datasets, and ground truth labels. Components: data pipeline (ingestion, cleaning, versioning), tool registry (available APIs, function signatures), and evaluation data store. Getting this layer right is the most underrated part of agent evaluation — poor quality here makes all downstream evaluation unreliable.
Supply Chain
Data pipeline · Tool registry · Evaluation datasets · Ground truth labels · Synthetic test generation · Data versioning
The EDDOps framework distils the literature into six evaluation drivers — principles that separate rigorous evaluation from superficial benchmark-running. Click each to explore.
D1
Lifecycle Coverage
Evaluate across all lifecycle phases, not just pre-deployment
D2
Metric Mix
Combine end-to-end, step-level, and intermediate metrics
D3
System-Level Anchor
Evaluate the full system, not isolated components
D4
Adaptive Evaluation
Update test suites as agent capabilities and risks evolve
D5
Closed Feedback Loops
Evaluation findings must drive concrete improvements
D6
Human Oversight
Keep humans in the loop for safety-critical evaluation decisions
D1 — Lifecycle Coverage
Most academic evaluation covers only pre-deployment. D1 requires evaluation at every phase: development (unit tests per capability), pre-deployment (full benchmark suite), deployment (shadow mode, A/B tests), and post-deployment (continuous monitoring, drift detection). Build evaluation infrastructure that can run continuously without human intervention.
Safety & Robustness — The Evaluation Dimension Everyone Skips
88% of academic evaluation uses AI-only evaluators with no human oversight. Yet safety failures in tool-using agents can have real-world consequences. ToolEmu and related work introduce explicit safety evaluation.
Click an attack type to see evaluation approach
Prompt Injection via Tool Output
A malicious actor embeds instructions inside tool output (e.g., a web page the agent browses). The agent, treating tool output as data, instead executes the injected instruction — potentially exfiltrating data, making unauthorised API calls, or corrupting system state. Evaluation (ToolEmu): present 144 adversarially crafted tool-use scenarios. Score: did the agent complete the legitimate task AND resist the injection?
Why Safety Eval Is Hard
Safety failures are rare in testing but catastrophic in production. Standard benchmarks don't include adversarial inputs. Success is defined as NOT doing something harmful — harder to measure than task completion. The same action can be safe in one context and unsafe in another.
Robustness Evaluation
How does agent performance degrade under: noisy inputs, partial tool failures, ambiguous instructions, contradictory information, and adversarial prompts? Robustness testing requires intentionally degraded inputs — not just well-formed benchmark tasks.
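A sketch of such intentional degradation: inject character-level noise at increasing rates and record success against noise level. The perturbation scheme, toy agent, and task text are all illustrative.

```python
import random

def perturb(text: str, rng: random.Random, rate: float = 0.05) -> str:
    """Character-level noise: random deletions and substitutions."""
    out = []
    for ch in text:
        r = rng.random()
        if r < rate / 2:
            continue                                       # delete character
        if r < rate:
            ch = rng.choice("abcdefghijklmnopqrstuvwxyz")  # substitute
        out.append(ch)
    return "".join(out)

def robustness_curve(run_agent, tasks, rates=(0.0, 0.1, 0.2)):
    """Success rate as input noise increases; `run_agent` stands in for
    your own harness returning True/False per prompt."""
    rng = random.Random(0)  # fixed seed: the degraded suite is reproducible
    return {rate: sum(run_agent(perturb(t, rng, rate)) for t in tasks) / len(tasks)
            for rate in rates}

# Toy agent that "succeeds" only when the key phrase survives the noise:
toy = lambda prompt: "refund" in prompt
tasks = ["please process a refund for order 42"] * 20
print(robustness_curve(toy, tasks))
```

The shape of the curve matters more than any single point: graceful degradation is a gentle slope, brittleness is a cliff at low noise rates.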
What is ToolEmu? ▼
ToolEmu (Ruan et al.) is a safety evaluation framework that emulates tool execution using an LLM, allowing safety testing without live API access. It tests agents across 36 tools and 144 scenarios, scoring both task completion and safety compliance. Key finding: agents are over-eager to comply with instructions that lead to irreversible consequences (deleting files, sending emails, making purchases).
D6 Human Oversight — Why it matters for safety ▼
88% of papers use AI-only evaluators. But AI judges inherit the same biases and blind spots as the models they evaluate. For safety-critical evaluation — deciding whether to deploy an agent with real-world tool access — human review of a sample of evaluation cases is non-negotiable. D6 mandates human oversight at safety decision points: initial deployment approval, after significant capability changes, and when online monitors flag anomalies.