Gap
›
Problem
›
Logprobs
›
Criteria
›
Tournament
›
Results
›
Insights
Section 01 — Overview
The Verification Gap
LLMs generate multiple candidate solutions, but picking the best one is itself a hard problem. The gap between Pass@1 (random pick) and Oracle (perfect pick) is where verification lives.
86.4%
Terminal-Bench 2
77.8%
SWE-Bench Verified
G=20
Score Granularity
C×K×G
Scaling Axes
The Setup
A coding agent runs N times on each task, producing N candidate trajectories. Pass@1 randomly picks one. Oracle always picks the correct one. LLM-as-a-Verifier ranks them to beat random — closing the gap toward oracle.
The Key Insight
Instead of asking the LLM "is this correct? (yes/no)", extract the probability distribution over score tokens A–T via logprobs. More signal → better ranking → better trajectory selection.
Three Scaling Axes
C — decompose evaluation into multiple criteria (e.g., spec adherence, output match, error signals). K — repeat verification to reduce noise. G — use G score tokens for fine-grained probability mass.
What is a trajectory?
▶
A trajectory τ is the full execution trace of an agent solving a task: the sequence of observations, actions (commands, file edits, bash outputs), and reasoning steps from start to final state. For Terminal-Bench 2 and SWE-bench, trajectories include terminal commands, their outputs, error messages, and the agent's responses. The verifier reads the entire trace — not just the final answer — to evaluate solution quality.
Why not just run tests?
▶
Many real-world tasks lack reliable automated tests. SWE-bench has unit tests, but Terminal-Bench 2 involves arbitrary shell tasks where "correctness" requires semantic understanding. The verifier also catches cases where tests pass but the trajectory took an incorrect or fragile path β e.g., hardcoding expected outputs. LLM verification is complementary to, not a replacement for, test-based evaluation.
Why Scoring is Hard →
Section 02 — The Problem
Why Scoring Trajectories is Hard
A single binary verdict ("correct" / "incorrect") discards most of the LLM's signal. The challenge is extracting a continuous, calibrated quality score from a discrete token generator.
Binary scoring: The LLM outputs a single token — "Yes" or "No", "Pass" or "Fail". Two very different trajectories (one almost correct, one completely wrong) receive the same score of 0. No gradient. No ranking ability.
Problem with binary: If 4 out of 5 trajectories are labelled "incorrect", you have no way to rank them. Random selection from the pool is as good as the verifier's choice.
What is a logprob?
▶
A logprob (log probability) is the natural log of the probability the model assigns to a token at a given position. If the model has 60% confidence that the next token is "A", the logprob is ln(0.6) ≈ −0.51. Modern APIs like Gemini expose the top-K logprobs at each position. LLM-as-a-Verifier uses logprobs=20 to capture the full distribution over the 20 score tokens (A–T).
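The relationship between a probability and its logprob can be checked in a couple of lines (a minimal sketch; `logprob_A` is just an illustrative variable name):

```python
import math

# A logprob is ln(p): if the model is 60% confident the next token is "A",
# the stored value is ln(0.6) ≈ -0.51.
logprob_A = math.log(0.6)
prob_A = math.exp(logprob_A)  # exponentiating recovers the probability

print(round(logprob_A, 2))  # -0.51
print(round(prob_A, 2))     # 0.6
```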
The logprob advantage: At the score token position, the LLM assigns probability to every letter A–T simultaneously. This gives a full probability distribution — effectively a soft, continuous score — without any extra API calls.
How does Gemini expose logprobs?
▶
Via response_logprobs=True and logprobs=20 parameters in the Gemini API (Vertex AI). The response includes position_logprobs: for each generated token position, the top-20 alternative tokens and their log-probabilities. The verifier locates the closing XML tag </score_A> in the stream, then reads logprobs at the next position — where the score letter appears.
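The tag-location step can be sketched over a simplified stand-in for the response structure. Here `position_logprobs` is assumed to be a list of (generated_token, top_candidates) pairs, where top_candidates maps alternative tokens to logprobs; the real Vertex AI response object is richer.

```python
# Simplified sketch: walk the token stream until the tag closes, then
# return the logprob candidates at the next position.
def find_score_candidates(position_logprobs, tag="</score_A>"):
    stream = ""
    for i, (token, _) in enumerate(position_logprobs):
        stream += token
        if stream.endswith(tag):
            return position_logprobs[i + 1][1]  # logprobs where the letter appears
    return None

fake = [("</score_A>", {}), ("A", {"A": -0.22, "B": -1.9})]
print(find_score_candidates(fake))  # {'A': -0.22, 'B': -1.9}
```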
Explore Logprob Scoring →
Section 03 — The Algorithm
Logprob-Based Score Extraction
The verifier asks the LLM to score a trajectory and extracts the probability mass over letters A–T at the score position. This probability distribution is collapsed into a single expected-value score in [0, 1].
Letter Scale (A → T) — G = 20 Granularity
A = 20 (perfect) · T = 1 (total failure) · Normalised to [0, 1]
Live Logprob Visualiser (illustrative — Gaussian shape is a simplification)
Drag the slider to simulate different trajectory quality levels. Watch the probability mass shift across letters.
Expected score: — · Normalised: —
# Score extraction (verifier_core.py)
from math import exp

# Letter scale: A→20 (best) ... T→1 (worst)
VALID_TOKENS = {chr(ord("A") + i): 20 - i for i in range(20)}

# 1. Find score token position in logprob stream
tag_logprobs = _find_tag_logprobs(position_logprobs, "</score_A>")
# 2. Aggregate probabilities over valid letters A–T
probs = {}
for token, logprob in tag_logprobs:
    if token.upper() in VALID_TOKENS:
        raw_val = VALID_TOKENS[token.upper()]  # A→20, B→19, ..., T→1
        probs[raw_val] = exp(logprob)
# 3. Compute expected value
expected = sum(v * p for v, p in probs.items()) / sum(probs.values())
# 4. Normalise to [0, 1]
score = (expected - 1) / (20 - 1)  # min=1 (T), max=20 (A)
Why Expected Value?
Taking E[score] over the distribution integrates all the model's uncertainty. A trajectory where the model is 55% "A" and 45% "B" gets a higher score than one where it's 55% "A" but also 30% "F" — correctly penalising inconsistency.
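A quick worked example (illustrative distributions, not real verifier output) confirms the ordering:

```python
def expected_score(probs):
    # probs maps raw letter values (A=20 ... T=1) to probability mass
    e = sum(v * p for v, p in probs.items()) / sum(probs.values())
    return (e - 1) / 19  # normalise raw 1..20 to [0, 1]

consistent = {20: 0.55, 19: 0.45}    # 55% "A", 45% "B"
inconsistent = {20: 0.55, 15: 0.30}  # 55% "A", 30% "F"

print(round(expected_score(consistent), 3))    # 0.976
print(round(expected_score(inconsistent), 3))  # 0.907
```

Even though both distributions peak at "A", the mass on "F" drags the second trajectory's expected score down.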
Why 20 Letters?
G=20 (A–T) provides fine-grained signal while staying within the top-20 logprob budget exposed by the API. Fewer bins = coarser scores. More bins would require API support for top-K > 20. G=20 is the sweet spot for Gemini's API.
Fallback
If logprobs are unavailable (wrong API tier), the verifier falls back to regex extraction — finding the letter between the XML tags. This gives a hard score (one of 20 values) with no gradient information.
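A hypothetical version of that fallback (the function name is illustrative; the tag format follows the prompt structure described in this article):

```python
import re

# Regex fallback: recover the hard score letter from the XML tags when
# logprobs are unavailable.
def regex_score(text, tag="score_A"):
    m = re.search(rf"<{tag}>\s*([A-T])\s*</{tag}>", text)
    if m is None:
        return None
    raw = 20 - (ord(m.group(1)) - ord("A"))  # A→20 ... T→1
    return (raw - 1) / 19                    # hard score, one of 20 values

print(regex_score("<score_A>A</score_A>"))  # 1.0
print(regex_score("<score_A>T</score_A>"))  # 0.0
```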
Criteria Decomposition →
Section 04 — Criteria
Criteria Decomposition (C = 3)
Rather than a single holistic score, the verifier evaluates each trajectory pair on C = 3 independent criteria. This decomposes the evaluation problem and reduces the chance that one aspect dominates the score.
⚠ Bar values below are illustrative — not real scores from the repo.
Specification Adherence: Verifies that the agent met the exact requirements — correct file paths, installation locations, output formats, and explicit constraints — rather than solving a superficially similar but fundamentally different problem. Rewards strict compliance over approximate solutions.
# Prompt structure for each criterion
create_prompt_for_criterion(
    task_desc,       # Problem statement
    trajectory_A,    # Full trace of agent A
    trajectory_B,    # Full trace of agent B
    criterion_name,  # e.g. "Specification Adherence"
    criterion_desc,  # Detailed evaluation rubric
    scale_desc,      # "A = perfect match, T = total failure"
)
# Returns XML with <score_A>X</score_A> <score_B>Y</score_B>
Pairwise evaluation: Each API call scores two trajectories simultaneously — extracting both score_A and score_B from a single response. This halves the cost versus scoring each trajectory independently.
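A minimal sketch of how such a pairwise prompt might be assembled (a plain f-string template; the repo's actual wording will differ):

```python
def create_prompt_for_criterion(task_desc, trajectory_A, trajectory_B,
                                criterion_name, criterion_desc, scale_desc):
    # Assemble one evaluation prompt covering both trajectories at once
    return (
        f"Task:\n{task_desc}\n\n"
        f"Trajectory A:\n{trajectory_A}\n\n"
        f"Trajectory B:\n{trajectory_B}\n\n"
        f"Criterion: {criterion_name}\n{criterion_desc}\n"
        f"Scale: {scale_desc}\n"
        "Answer with <score_A>X</score_A> <score_B>Y</score_B>, "
        "where X and Y are letters A-T."
    )

prompt = create_prompt_for_criterion(
    "Install the package under /opt", "trace A ...", "trace B ...",
    "Specification Adherence", "Did the agent meet the exact requirements?",
    "A = perfect match, T = total failure",
)
```

Because both traces appear in one context, a single response yields two scores.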
Why Decompose?
A holistic score conflates dimensions. A trajectory might spec-adhere perfectly but have unresolved errors. Decomposition lets the verifier assign partial credit across dimensions, producing a more nuanced ranking. The final score is the sum across all C criteria and K verification passes.
Final Aggregation
For each pair (i, j):
s_i = Σ_{c∈C} Σ_{k∈K} score(τ_i, c, k) / (C × K)
Win awarded if s_i > s_j. Tie if |s_i − s_j| below threshold. Used in tournament selection.
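The aggregation rule translates directly into code (illustrative helper and numbers; the tie-threshold value is an assumption):

```python
def aggregate_winner(scores_i, scores_j, tie_threshold=0.01):
    # scores_x[c][k]: normalised score of trajectory x, criterion c, pass k
    C, K = len(scores_i), len(scores_i[0])
    s_i = sum(sum(row) for row in scores_i) / (C * K)
    s_j = sum(sum(row) for row in scores_j) / (C * K)
    if abs(s_i - s_j) < tie_threshold:
        return "tie"
    return "i" if s_i > s_j else "j"

# C=3 criteria, K=2 passes each:
result = aggregate_winner([[0.9, 0.8], [0.7, 0.7], [0.8, 0.9]],
                          [[0.5, 0.6], [0.4, 0.5], [0.6, 0.5]])
print(result)  # i
```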
Tournament Selection →
Section 05 — Selection
Round-Robin Tournament Selection
With N candidate trajectories per task, every possible pair is scored. The trajectory that wins the most head-to-head matchups — aggregated across all C criteria and K passes — is selected as the final answer.
6 pairwise comparisons · 72 API calls (C=3 × K=4 × 6 pairs)
Click canvas to run a new random tournament
How it works: All N(N−1)/2 trajectory pairs are evaluated. Each pair gets a score from the verifier (aggregated over C criteria × K passes). The trajectory with the most wins is declared the best candidate for this task.
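The win-counting loop can be sketched in a few lines, with `match(i, j)` standing in for the pairwise verifier (it returns "i", "j", or "tie"):

```python
from itertools import combinations

# Round-robin selection: count head-to-head wins over all pairs,
# return the index of the trajectory with the most wins.
def round_robin_winner(n, match):
    wins = [0] * n
    for i, j in combinations(range(n), 2):  # all N(N-1)/2 pairs
        result = match(i, j)
        if result == "i":
            wins[i] += 1
        elif result == "j":
            wins[j] += 1
    return max(range(n), key=lambda t: wins[t])

# Toy matcher: the lower-indexed trajectory always wins.
print(round_robin_winner(4, lambda i, j: "i"))  # 0
```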
Why Round-Robin?
A single ranking score might be miscalibrated between trajectories from different agents. Pairwise comparison is more robust — each matchup is a direct, contextualised comparison. The LLM sees both trajectories simultaneously, enabling contrastive evaluation.
K = 4 Passes
Each pair is evaluated K=4 times with independent API calls. Results are averaged. This reduces variance from LLM stochasticity — the verifier may give slightly different scores on each call, and averaging smooths this out.
Caching
Results are cached with key "{crit}|{task}|{i,j}|{rep}" and persisted to JSON. This enables reproducibility and avoids redundant API calls. Concurrent scoring via ThreadPoolExecutor speeds up large benchmark runs.
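A sketch of the cache-then-score pattern under the key format above; `score_fn` is a stand-in for one verifier API call, and the helper names are illustrative:

```python
import json
from concurrent.futures import ThreadPoolExecutor

cache = {}

def cached_score(crit, task, pair, rep, score_fn):
    # Compose the cache key and call the scorer only on a miss
    key = f"{crit}|{task}|{pair}|{rep}"
    if key not in cache:
        cache[key] = score_fn()
    return cache[key]

def run_all(jobs, score_fn, workers=8):
    # Score all (criterion, task, pair, rep) jobs concurrently
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(cached_score, *job, score_fn) for job in jobs]
        return [f.result() for f in futures]

jobs = [("spec", "task-1", (0, 1), rep) for rep in range(4)]  # K=4 passes
results = run_all(jobs, lambda: 0.8)
persisted = json.dumps({str(k): v for k, v in cache.items()})  # persist to JSON
```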
See Benchmark Results →
Section 06 — Results
Benchmark Results
LLM-as-a-Verifier closes a significant portion of the gap between random selection and oracle on two challenging coding benchmarks.
Pass@1 (Random)
81.8%
Baseline: randomly picking one of the available trajectories. 73/89 tasks solved on average.
LLM-as-a-Verifier
86.4% ±0.3
76.9/89 tasks. +4.6pp over random. Consistent across 3 repeated evaluations (standard error ±0.3).
Oracle
89.9%
Ceiling: always picking the correct trajectory. 80/89 tasks. Verifier recovers about 57% of the 8.1pp gap.
Key Insights →
Section 07 — Insights
Key Insights & Connections
LLM-as-a-Verifier synthesises ideas from reward modelling, process reward models, and LLM-as-a-judge into a practical, scalable framework for trajectory selection.
Soft vs Hard Scoring
The core innovation: treating the LLM as a probability estimator, not a classifier. By reading logprobs instead of the generated token, we extract a continuous signal. This is analogous to how temperature-scaled softmax gives calibrated confidence in classification — the distribution is more informative than the argmax.
Relation to PRMs
Process Reward Models (PRMs) score intermediate steps of reasoning. LLM-as-a-Verifier scores entire trajectories — a coarser but more general approach. The C-criteria decomposition is a step toward process-level granularity without requiring step-level annotations.
Relation to LLM-as-a-Judge
LLM-as-a-Judge (MT-Bench, Chatbot Arena) uses LLMs to score free-form outputs. LLM-as-a-Verifier extends this to: (1) code/shell trajectories rather than text, (2) logprob extraction rather than token output, (3) criteria decomposition, and (4) tournament aggregation. It's a more structured, signal-rich version of the same idea.
Scalable Test-Time Compute
This framework is a form of inference-time scaling: run the agent N times, then invest compute in verification to select the best output. Combined with Best-of-N sampling, it represents a practical recipe for improving coding agent performance without retraining.
The C × K × G product space: Each dimension is independently scalable. More criteria (C↑) captures more evaluation aspects. More passes (K↑) reduces noise. Higher granularity (G↑) extracts more signal per call. Total information scales as C × K × G — and the API cost scales similarly, making it a tunable compute–accuracy tradeoff.
What model is used as the verifier?
▶
Gemini 2.5 Flash (via Vertex AI) is the verifier model in the published results. It was chosen for its speed, cost-effectiveness, and support for response_logprobs=True with logprobs=20. The framework is model-agnostic — any API that exposes per-token logprobs could serve as the verifier. Note: the evaluated agent trajectories are from other models (the trajectories stored in data/), not from Gemini.
What are the limitations?
▶
1. Long context: Full trajectories can be thousands of tokens. Gemini 2.5 Flash's context is large, but very long traces may exceed limits or degrade attention quality.
2. API dependency: Requires logprob access, which not all providers offer (OpenAI limits logprobs to top-5; Anthropic does not expose them).
3. Verifier bias: The verifier model may share blind spots with the agent — if Gemini generates and verifies, it may consistently miss the same errors.
4. Cost: C=3 × K=4 × N(N−1)/2 API calls per task. For N=5 trajectories, that's 3 × 4 × 10 = 120 calls per task, scaling quadratically with N.
5. Swing tasks only: The verifier only helps on tasks where some trajectories succeed and some fail. All-pass and all-fail tasks are unaffected.
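The cost arithmetic in limitation 4 is easy to verify with a one-liner:

```python
# API calls per task: C criteria x K passes x N(N-1)/2 pairs
def api_calls(n, c=3, k=4):
    return c * k * n * (n - 1) // 2

print(api_calls(5))   # 120
print(api_calls(10))  # 540: cost grows quadratically with N
```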
How does this connect to reinforcement learning?
▶
The verifier produces a reward signal R(τ) ∈ [0, 1] for any trajectory τ. This is exactly what a reward model does in RLHF. LLM-as-a-Verifier can be seen as a lightweight, training-free reward model for code trajectories — using the verifier LLM's internal probability estimates rather than a learned scalar head. The tournament selection then acts as a best-of-N policy, equivalent to rejection sampling with the verifier as the reward function.
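The best-of-N view fits in a few lines (the reward values below are purely illustrative stand-ins for verifier scores):

```python
# Best-of-N selection with the verifier as the reward function R(tau):
# equivalent to rejection sampling against a reward model.
def best_of_n(trajectories, reward):
    return max(trajectories, key=reward)

rewards = {"tau_1": 0.4, "tau_2": 0.9, "tau_3": 0.6}
best = best_of_n(list(rewards), rewards.get)
print(best)  # tau_2
```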