Gap
›
Problem
›
Logprobs
›
Criteria
›
Tournament
›
Results
›
Insights
Section 01 — Overview
The Verification Gap
LLMs generate multiple candidate solutions, but picking the best one is itself a hard problem. The gap between Pass@1 (random pick) and Oracle (perfect pick) is where verification lives.
86.4%
Terminal-Bench 2
77.8%
SWE-Bench Verified
G=20
Score Granularity
C×K×G
Scaling Axes
The Setup
A coding agent runs N times on each task, producing N candidate trajectories. Pass@1 randomly picks one. Oracle always picks the correct one. LLM-as-a-Verifier ranks them to beat random — closing the gap toward oracle.
The Key Insight
Instead of asking the LLM "is this correct? (yes/no)", extract the probability distribution over score tokens A–T via logprobs. More signal → better ranking → better trajectory selection.
Three Scaling Axes
C — decompose evaluation into multiple criteria (e.g., spec adherence, output match, error signals). K — repeat verification to reduce noise. G — use G score tokens for fine-grained probability mass.
What is a trajectory?
▶
A trajectory τ is the full execution trace of an agent solving a task: the sequence of observations, actions (commands, file edits, bash outputs), and reasoning steps from start to final state. For Terminal-Bench 2 and SWE-bench, trajectories include terminal commands, their outputs, error messages, and the agent's responses. The verifier reads the entire trace — not just the final answer — to evaluate solution quality.
Why not just run tests?
▶
Many real-world tasks lack reliable automated tests. SWE-bench has unit tests, but Terminal-Bench 2 involves arbitrary shell tasks where "correctness" requires semantic understanding. The verifier also catches cases where tests pass but the trajectory took an incorrect or fragile path β e.g., hardcoding expected outputs. LLM verification is complementary to, not a replacement for, test-based evaluation.
Why Scoring is Hard →
Section 02 — The Problem
Why Scoring Trajectories is Hard
A single binary verdict ("correct" / "incorrect") discards most of the LLM's signal. The challenge is extracting a continuous, calibrated quality score from a discrete token generator.
Binary scoring: The LLM outputs a single token — "Yes" or "No", "Pass" or "Fail". Two very different trajectories (one almost correct, one completely wrong) receive the same score of 0. No gradient. No ranking ability.
Problem with binary: If 4 out of 5 trajectories are labelled "incorrect", you have no way to rank them. Random selection from the pool is as good as the verifier's choice.
What is a logprob?
▶
A logprob (log probability) is the natural log of the probability the model assigns to a token at a given position. If the model has 60% confidence that the next token is "A", the logprob is ln(0.6) ≈ −0.51. Modern APIs like Gemini expose the top-K logprobs at each position. LLM-as-a-Verifier uses logprobs=20 to capture the full distribution over the 20 score tokens (A–T).
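The relationship between a probability and its logprob can be checked in a couple of lines (a minimal sketch; `logprob_A` is just an illustrative variable name):

```python
import math

# A logprob is ln(p): if the model is 60% confident the next token is "A",
# the stored value is ln(0.6) ≈ -0.51.
logprob_A = math.log(0.6)
prob_A = math.exp(logprob_A)  # exponentiating recovers the probability

print(round(logprob_A, 2))  # -0.51
print(round(prob_A, 2))     # 0.6
```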
The logprob advantage: At the score token position, the LLM assigns probability to every letter A–T simultaneously. This gives a full probability distribution — effectively a soft, continuous score — without any extra API calls.
How does Gemini expose logprobs?
▶
Via response_logprobs=True and logprobs=20 parameters in the Gemini API (Vertex AI). The response includes position_logprobs: for each generated token position, the top-20 alternative tokens and their log-probabilities. The verifier locates the closing XML tag </score_A> in the stream, then reads logprobs at the next position — where the score letter appears.
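The tag-location step can be sketched over a simplified stand-in for the response structure. Here `position_logprobs` is assumed to be a list of (generated_token, top_candidates) pairs, where top_candidates maps alternative tokens to logprobs; the real Vertex AI response object is richer.

```python
# Simplified sketch: walk the token stream until the tag closes, then
# return the logprob candidates at the next position.
def find_score_candidates(position_logprobs, tag="</score_A>"):
    stream = ""
    for i, (token, _) in enumerate(position_logprobs):
        stream += token
        if stream.endswith(tag):
            return position_logprobs[i + 1][1]  # logprobs where the letter appears
    return None

fake = [("</score_A>", {}), ("A", {"A": -0.22, "B": -1.9})]
print(find_score_candidates(fake))  # {'A': -0.22, 'B': -1.9}
```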
Explore Logprob Scoring →
Section 03 — The Algorithm
Logprob-Based Score Extraction
The verifier asks the LLM to score a trajectory and extracts the probability mass over letters A–T at the score position. This probability distribution is collapsed into a single expected-value score in [0, 1].
Letter Scale (A → T) — G = 20 Granularity
A = 20 (perfect) · T = 1 (total failure) · Normalised to [0, 1]
Live Logprob Visualiser (illustrative — Gaussian shape is a simplification)
Drag the slider to simulate different trajectory quality levels. Watch the probability mass shift across letters.
Expected score: — · Normalised: —
# Score extraction (verifier_core.py)
from math import exp

# Letter scale: A→20 (best) ... T→1 (worst)
VALID_TOKENS = {chr(ord("A") + i): 20 - i for i in range(20)}

# 1. Find score token position in logprob stream
tag_logprobs = _find_tag_logprobs(position_logprobs, "</score_A>")
# 2. Aggregate probabilities over valid letters A–T
probs = {}
for token, logprob in tag_logprobs:
    if token.upper() in VALID_TOKENS:
        raw_val = VALID_TOKENS[token.upper()]  # A→20, B→19, ..., T→1
        probs[raw_val] = exp(logprob)
# 3. Compute expected value
expected = sum(v * p for v, p in probs.items()) / sum(probs.values())
# 4. Normalise to [0, 1]
score = (expected - 1) / (20 - 1)  # min=1 (T), max=20 (A)
Why Expected Value?
Taking E[score] over the distribution integrates all the model's uncertainty. A trajectory where the model is 55% "A" and 45% "B" gets a higher score than one where it's 55% "A" but also 30% "F" — correctly penalising inconsistency.
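A quick worked example (illustrative distributions, not real verifier output) confirms the ordering:

```python
def expected_score(probs):
    # probs maps raw letter values (A=20 ... T=1) to probability mass
    e = sum(v * p for v, p in probs.items()) / sum(probs.values())
    return (e - 1) / 19  # normalise raw 1..20 to [0, 1]

consistent = {20: 0.55, 19: 0.45}    # 55% "A", 45% "B"
inconsistent = {20: 0.55, 15: 0.30}  # 55% "A", 30% "F"

print(round(expected_score(consistent), 3))    # 0.976
print(round(expected_score(inconsistent), 3))  # 0.907
```

Even though both distributions peak at "A", the mass on "F" drags the second trajectory's expected score down.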
Why 20 Letters?
G=20 (A–T) provides fine-grained signal while staying within the top-20 logprob budget exposed by the API. Fewer bins = coarser scores. More bins would require API support for top-K > 20. G=20 is the sweet spot for Gemini's API.
Fallback
If logprobs are unavailable (wrong API tier), the verifier falls back to regex extraction — finding the letter between the XML tags. This gives a hard score (one of 20 values) with no gradient information.
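A hypothetical version of that fallback (the function name is illustrative; the tag format follows the prompt structure described in this article):

```python
import re

# Regex fallback: recover the hard score letter from the XML tags when
# logprobs are unavailable.
def regex_score(text, tag="score_A"):
    m = re.search(rf"<{tag}>\s*([A-T])\s*</{tag}>", text)
    if m is None:
        return None
    raw = 20 - (ord(m.group(1)) - ord("A"))  # A→20 ... T→1
    return (raw - 1) / 19                    # hard score, one of 20 values

print(regex_score("<score_A>A</score_A>"))  # 1.0
print(regex_score("<score_A>T</score_A>"))  # 0.0
```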
Criteria Decomposition →
Section 04 — Criteria
Criteria Decomposition (C = 3)
Rather than a single holistic score, the verifier evaluates each trajectory pair on C = 3 independent criteria. This decomposes the evaluation problem and reduces the chance that one aspect dominates the score.
⚠ Bar values below are illustrative — not real scores from the repo.
Specification Adherence: Verifies that the agent met the exact requirements — correct file paths, installation locations, output formats, and explicit constraints — rather than solving a superficially similar but fundamentally different problem. Rewards strict compliance over approximate solutions.
# Prompt structure for each criterion
create_prompt_for_criterion(
    task_desc,       # Problem statement
    trajectory_A,    # Full trace of agent A
    trajectory_B,    # Full trace of agent B
    criterion_name,  # e.g. "Specification Adherence"
    criterion_desc,  # Detailed evaluation rubric
    scale_desc,      # "A = perfect match, T = total failure"
)
# Returns XML with <score_A>X</score_A> <score_B>Y</score_B>
Pairwise evaluation: Each API call scores two trajectories simultaneously — extracting both score_A and score_B from a single response. This halves the cost versus scoring each trajectory independently.
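A minimal sketch of how such a pairwise prompt might be assembled (a plain f-string template; the repo's actual wording will differ):

```python
def create_prompt_for_criterion(task_desc, trajectory_A, trajectory_B,
                                criterion_name, criterion_desc, scale_desc):
    # Assemble one evaluation prompt covering both trajectories at once
    return (
        f"Task:\n{task_desc}\n\n"
        f"Trajectory A:\n{trajectory_A}\n\n"
        f"Trajectory B:\n{trajectory_B}\n\n"
        f"Criterion: {criterion_name}\n{criterion_desc}\n"
        f"Scale: {scale_desc}\n"
        "Answer with <score_A>X</score_A> <score_B>Y</score_B>, "
        "where X and Y are letters A-T."
    )

prompt = create_prompt_for_criterion(
    "Install the package under /opt", "trace A ...", "trace B ...",
    "Specification Adherence", "Did the agent meet the exact requirements?",
    "A = perfect match, T = total failure",
)
```

Because both traces appear in one context, a single response yields two scores.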
Why Decompose?
A holistic score conflates dimensions. A trajectory might spec-adhere perfectly but have unresolved errors. Decomposition lets the verifier assign partial credit across dimensions, producing a more nuanced ranking. The final score is the sum across all C criteria and K verification passes.
Final Aggregation
For each pair (i, j):
s_i = Σ_{c∈C} Σ_{k∈K} score(τ_i, c, k) / (C × K)
Win awarded if s_i > s_j. Tie if |s_i − s_j| below threshold. Used in tournament selection.
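The aggregation rule translates directly into code (illustrative helper and numbers; the tie-threshold value is an assumption):

```python
def aggregate_winner(scores_i, scores_j, tie_threshold=0.01):
    # scores_x[c][k]: normalised score of trajectory x, criterion c, pass k
    C, K = len(scores_i), len(scores_i[0])
    s_i = sum(sum(row) for row in scores_i) / (C * K)
    s_j = sum(sum(row) for row in scores_j) / (C * K)
    if abs(s_i - s_j) < tie_threshold:
        return "tie"
    return "i" if s_i > s_j else "j"

# C=3 criteria, K=2 passes each:
result = aggregate_winner([[0.9, 0.8], [0.7, 0.7], [0.8, 0.9]],
                          [[0.5, 0.6], [0.4, 0.5], [0.6, 0.5]])
print(result)  # i
```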
Tournament Selection →
Section 05 — Selection
Round-Robin Tournament Selection
With N candidate trajectories per task, every possible pair is scored. The trajectory that wins the most head-to-head matchups — aggregated across all C criteria and K passes — is selected as the final answer.
6 pairwise comparisons · 72 API calls (C=3 × K=4 × 6 pairs)
Click canvas to run a new random tournament
How it works: All N(N−1)/2 trajectory pairs are evaluated. Each pair gets a score from the verifier (aggregated over C criteria × K passes). The trajectory with the most wins is declared the best candidate for this task.
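The win-counting loop can be sketched in a few lines, with `match(i, j)` standing in for the pairwise verifier (it returns "i", "j", or "tie"):

```python
from itertools import combinations

# Round-robin selection: count head-to-head wins over all pairs,
# return the index of the trajectory with the most wins.
def round_robin_winner(n, match):
    wins = [0] * n
    for i, j in combinations(range(n), 2):  # all N(N-1)/2 pairs
        result = match(i, j)
        if result == "i":
            wins[i] += 1
        elif result == "j":
            wins[j] += 1
    return max(range(n), key=lambda t: wins[t])

# Toy matcher: the lower-indexed trajectory always wins.
print(round_robin_winner(4, lambda i, j: "i"))  # 0
```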
Why Round-Robin?
A single ranking score might be miscalibrated between trajectories from different agents. Pairwise comparison is more robust — each matchup is a direct, contextualised comparison. The LLM sees both trajectories simultaneously, enabling contrastive evaluation.
K = 4 Passes
Each pair is evaluated K=4 times with independent API calls. Results are averaged. This reduces variance from LLM stochasticity — the verifier may give slightly different scores on each call, and averaging smooths this out.
Caching
Results are cached with key "{crit}|{task}|{i,j}|{rep}" and persisted to JSON. This enables reproducibility and avoids redundant API calls. Concurrent scoring via ThreadPoolExecutor speeds up large benchmark runs.
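A sketch of the cache-then-score pattern under the key format above; `score_fn` is a stand-in for one verifier API call, and the helper names are illustrative:

```python
import json
from concurrent.futures import ThreadPoolExecutor

cache = {}

def cached_score(crit, task, pair, rep, score_fn):
    # Compose the cache key and call the scorer only on a miss
    key = f"{crit}|{task}|{pair}|{rep}"
    if key not in cache:
        cache[key] = score_fn()
    return cache[key]

def run_all(jobs, score_fn, workers=8):
    # Score all (criterion, task, pair, rep) jobs concurrently
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(cached_score, *job, score_fn) for job in jobs]
        return [f.result() for f in futures]

jobs = [("spec", "task-1", (0, 1), rep) for rep in range(4)]  # K=4 passes
results = run_all(jobs, lambda: 0.8)
persisted = json.dumps({str(k): v for k, v in cache.items()})  # persist to JSON
```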
See Benchmark Results →
Section 06 — Results
Benchmark Results
LLM-as-a-Verifier closes a significant portion of the gap between random selection and oracle on two challenging coding benchmarks.
Pass@1 (Random)
81.8%
Baseline: randomly picking one of the available trajectories. 73/89 tasks solved on average.
LLM-as-a-Verifier
86.4% ±0.3
76.9/89 tasks. +4.6pp over random. Consistent across 3 repeated evaluations (standard error ±0.3).
Oracle
89.9%
Ceiling: always picking the correct trajectory. 80/89 tasks. Verifier recovers about 57% of the 8.1pp gap.
Key Insights →
Section 07 — Insights
Key Insights & Connections
LLM-as-a-Verifier synthesises ideas from reward modelling, process reward models, and LLM-as-a-judge into a practical, scalable framework for trajectory selection.
Soft vs Hard Scoring
The core innovation: treating the LLM as a probability estimator, not a classifier. By reading logprobs instead of the generated token, we extract a continuous signal. This is analogous to how temperature-scaled softmax gives calibrated confidence in classification — the distribution is more informative than the argmax.
Relation to PRMs
Process Reward Models (PRMs) score intermediate steps of reasoning. LLM-as-a-Verifier scores entire trajectories — a coarser but more general approach. The C-criteria decomposition is a step toward process-level granularity without requiring step-level annotations.
Relation to LLM-as-a-Judge
LLM-as-a-Judge (MT-Bench, Chatbot Arena) uses LLMs to score free-form outputs. LLM-as-a-Verifier extends this to: (1) code/shell trajectories rather than text, (2) logprob extraction rather than token output, (3) criteria decomposition, and (4) tournament aggregation. It's a more structured, signal-rich version of the same idea.
Scalable Test-Time Compute
This framework is a form of inference-time scaling: run the agent N times, then invest compute in verification to select the best output. Combined with Best-of-N sampling, it represents a practical recipe for improving coding agent performance without retraining.
The C × K × G product space: Each dimension is independently scalable. More criteria (C↑) captures more evaluation aspects. More passes (K↑) reduces noise. Higher granularity (G↑) extracts more signal per call. Total information scales as C × K × G — and the API cost scales similarly, making it a tunable compute–accuracy tradeoff.
What model is used as the verifier?
▶
Gemini 2.5 Flash (via Vertex AI) is the verifier model in the published results. It was chosen for its speed, cost-effectiveness, and support for response_logprobs=True with logprobs=20. The framework is model-agnostic — any API that exposes per-token logprobs could serve as the verifier. Note: the evaluated agent trajectories are from other models (the trajectories stored in data/), not from Gemini.
What are the limitations?
▶
1. Long context: Full trajectories can be thousands of tokens. Gemini 2.5 Flash's context is large, but very long traces may exceed limits or degrade attention quality.
2. API dependency: Requires logprob access, which not all providers offer (OpenAI limits logprobs to top-5; Anthropic does not expose them).
3. Verifier bias: The verifier model may share blind spots with the agent — if Gemini generates and verifies, it may consistently miss the same errors.
4. Cost: C=3 × K=4 × N(N−1)/2 API calls per task. For N=5 trajectories, that's 3 × 4 × 10 = 120 calls per task, scaling quadratically with N.
5. Swing tasks only: The verifier only helps on tasks where some trajectories succeed and some fail. All-pass and all-fail tasks are unaffected.
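The cost arithmetic in limitation 4 is easy to verify with a one-liner:

```python
# API calls per task: C criteria x K passes x N(N-1)/2 pairs
def api_calls(n, c=3, k=4):
    return c * k * n * (n - 1) // 2

print(api_calls(5))   # 120
print(api_calls(10))  # 540: cost grows quadratically with N
```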
How does this connect to reinforcement learning?
▶
The verifier produces a reward signal R(τ) ∈ [0, 1] for any trajectory τ. This is exactly what a reward model does in RLHF. LLM-as-a-Verifier can be seen as a lightweight, training-free reward model for code trajectories — using the verifier LLM's internal probability estimates rather than a learned scalar head. The tournament selection then acts as a best-of-N policy, equivalent to rejection sampling with the verifier as the reward function.
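The best-of-N view fits in a few lines (the reward values below are purely illustrative stand-ins for verifier scores):

```python
# Best-of-N selection with the verifier as the reward function R(tau):
# equivalent to rejection sampling against a reward model.
def best_of_n(trajectories, reward):
    return max(trajectories, key=reward)

rewards = {"tau_1": 0.4, "tau_2": 0.9, "tau_3": 0.6}
best = best_of_n(list(rewards), rewards.get)
print(best)  # tau_2
```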