Search Agents: Faster & Smarter
RAG Optimization — Tool Stack + Planner Training

Search Agents Are Bottlenecked by Their Own Research

RAG agents spend most of their time searching — but most of that time is wasted on bad searches, redundant retrievals, and oversized context. Contextual AI's research identifies two axes to fix this: optimize the search tool, and train the planner to use it more efficiently.

Headline numbers: 60.7% best accuracy · 2× speed gain · ~3 tool calls after convergence · 2 design axes.
The reranker matters most. Without it, nDCG@5 drops from 0.203 to 0.089 — a 56% quality collapse. But the reranker also accounts for ~1.5s of latency per search call. This tension between quality and speed is what the research resolves.
Why Search Agents Stall
Each research loop involves embedding the query, running ANN + BM25 retrieval, reranking the results, and feeding them into the LLM. The reranker alone takes ~1.5 seconds. Across 5–10 searches per query, latency compounds fast.
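The compounding is easy to see in a toy latency budget. A minimal sketch: the ~1.5 s reranker figure comes from this post, while the other stage costs are illustrative assumptions:

```python
# Hypothetical per-call latency budget (seconds). The ~1.5 s reranker cost
# is from the post; the other values are assumptions for illustration.
STAGE_LATENCY = {
    "embed_query": 0.05,     # assumption
    "ann_retrieval": 0.05,   # assumption (ANN lookup is fast)
    "bm25_retrieval": 0.05,  # assumption (inverted index is fast)
    "rerank": 1.5,           # dominant stage, per the post
}

def search_call_latency() -> float:
    """Latency of one search tool call, excluding LLM reasoning."""
    return sum(STAGE_LATENCY.values())

def query_latency(n_searches: int) -> float:
    """Search-side latency across a whole research loop."""
    return n_searches * search_call_latency()

# 5-10 searches per query compound the reranker cost fast:
for n in (5, 10):
    print(f"{n} searches -> {query_latency(n):.1f} s of search latency")
```

With these stand-in numbers, ten searches already spend over 16 seconds inside the search tool alone, which is why both axes below attack that loop.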
The Two Axes
Axis 1: Search tool configuration — which components (embedding, reranker, retrieval method) to use and at what quality level.
Axis 2: Planner training — teaching the model when to search, how many times, and when to stop.
BrowseComp-Plus
The evaluation benchmark. A harder version of BrowseComp requiring multi-hop reasoning across many documents. The baseline untrained Strong config achieves 50% accuracy. Trained configs push to 60.7%.

Search Strategy in Action: Trained vs Untrained

A trained planner issues exactly as many searches as needed and stops. An untrained planner keeps searching even after finding sufficient evidence.


CER-C: How Quickly Does the Agent Find Evidence?

Traditional metrics like nDCG measure final retrieval quality. CER-C (Cumulative Evidence Recall — Curves) measures something different: how quickly the agent finds relevant documents per token of context consumed. It's a trajectory-level metric.

CER-C (Cumulative Evidence Recall Curves)
x-axis: context tokens consumed (bucketed at 10K intervals)
y-axis: fraction of known-relevant documents found
Efficient agent: steep initial rise (finds docs early, uses few tokens)
Wasteful agent: slow crawl (needs many tokens to find the same docs)
Key insight: a fast tool + trained planner = steeper curve = better CER-C
What a steep curve means: the agent is finding relevant documents efficiently — each additional token of context contributes to evidence accumulation. A flat curve means the agent is burning tokens on unhelpful searches.
Why Per-Token, Not Just Accuracy?
Final accuracy hides how the agent got there. An agent that found all evidence in 20K tokens is far better than one that needed 80K tokens for the same result — even if both score identically on accuracy. CER-C makes this efficiency visible and comparable.
How Relevant Docs Are Labeled
Ground-truth relevant documents are known for each query. The metric tracks what fraction of those labeled documents have been retrieved at each 10K-token checkpoint in the agent's context window. Simple, interpretable, actionable.
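The metric itself is a short computation. A minimal sketch of CER-C, assuming each retrieval event is logged as (cumulative context tokens, doc id) and the gold-relevant doc ids are known:

```python
def cer_curve(events, gold_docs, bucket=10_000, max_tokens=100_000):
    """Cumulative evidence recall per token budget (CER-C sketch).

    events: list of (cumulative_context_tokens, doc_id) in retrieval order.
    gold_docs: set of known-relevant doc ids for this query.
    Returns recall at each `bucket`-token checkpoint.
    """
    found = set()
    curve = []
    events = sorted(events)
    i = 0
    for budget in range(bucket, max_tokens + 1, bucket):
        # Consume every retrieval event that fits inside this token budget.
        while i < len(events) and events[i][0] <= budget:
            doc = events[i][1]
            if doc in gold_docs:
                found.add(doc)
            i += 1
        curve.append(len(found) / len(gold_docs))
    return curve

# Efficient agent: finds both gold docs within 20K tokens -> steep curve.
efficient = cer_curve([(8_000, "d1"), (15_000, "d2")], {"d1", "d2"},
                      max_tokens=40_000)
# Wasteful agent: same docs, but only after 35K tokens -> flat curve.
wasteful = cer_curve([(12_000, "d1"), (35_000, "d2")], {"d1", "d2"},
                     max_tokens=40_000)
print(efficient)  # [0.5, 1.0, 1.0, 1.0]
print(wasteful)   # [0.0, 0.5, 0.5, 1.0]
```

Both agents end at recall 1.0, which is exactly the point: final accuracy is identical, but the per-token curve exposes the efficiency gap.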

The Search Tool Stack: Three Components, Three Trade-offs

The search tool has three moving parts: the embedding model (dimension controls recall vs. latency), the reranker (size controls quality vs. speed), and the retrieval method (sparse + dense hybrid vs. ANN-only). Three configurations were tested:

Fast config: 512-dim embeddings, 2B reranker, hybrid retrieval, top-50. Best latency (~13s). Planner training erases its quality gap with Strong at no extra cost.
Strong config: 512-dim embeddings, 6B reranker, hybrid retrieval. Balanced quality and speed (~26s).
Max config: 4096-dim embeddings, 6B reranker, hybrid retrieval, top-200. Highest quality, highest latency (~52s).
Embedding Dimensions
MRL (Matryoshka Representation Learning) embeddings let you choose the dimension at inference time. Comparing 512-dim vs 4096-dim: the larger gains 13% recall and 11% nDCG but costs 7× the retrieval latency. Since the reranker dominates end-to-end latency anyway, this trade-off often isn't worth it.
The Reranker
The 6B reranker delivers 27% quality improvement over the 2B at roughly 2x latency cost. Without any reranker, nDCG@5 drops from 0.203 to 0.089 — a 56% collapse. The reranker is non-negotiable for quality; the question is which size to use.
Hybrid Retrieval
ANN (dense vector) + BM25 (sparse keyword) retrieval combined adds 11% quality at modest latency overhead relative to total reasoning time. BM25 catches exact-match queries that dense embeddings miss. Hybrid is almost always worth the small overhead.
What is Matryoshka Representation Learning (MRL)?
MRL trains a single embedding model such that prefixes of the embedding vector are themselves valid, lower-dimensional embeddings. This means you can truncate a 4096-dim embedding to 512-dim and still get a semantically meaningful representation — just with less nuance. At query time, you choose the dimension: 512 for speed, 4096 for maximum recall.
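Truncation plus renormalization is all MRL asks of you at query time. A minimal sketch with NumPy (the 4096-dim vector here is random stand-in data, not a real embedding):

```python
import numpy as np

def truncate_mrl(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Truncate an MRL embedding to its first `dim` dims and re-normalize.

    MRL-trained models make each prefix a valid lower-dimensional
    embedding, so cosine similarity on truncated vectors stays meaningful.
    """
    prefix = embedding[:dim]
    return prefix / np.linalg.norm(prefix)

rng = np.random.default_rng(0)
full = rng.normal(size=4096)    # stand-in for a 4096-dim embedding
fast = truncate_mrl(full, 512)  # 512-dim: the speed-oriented config
assert fast.shape == (512,)
assert abs(float(np.linalg.norm(fast)) - 1.0) < 1e-9
```

The same index can then be queried at 512 dims for speed or 4096 dims for maximum recall, without re-embedding the corpus.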
Why does the reranker dominate latency?
ANN retrieval is approximate nearest-neighbor lookup, which is very fast even at scale (~50ms). BM25 is inverted-index keyword search, also fast. But the reranker runs a cross-encoder over every (query, doc) candidate pair. With top-50 or top-200 candidates, that's 50–200 forward passes through a 2B or 6B parameter model, taking ~1.5 seconds and dwarfing the upstream retrieval.
Why does hybrid retrieval work?
Dense embeddings capture semantic meaning — great for paraphrases and conceptual queries. Sparse BM25 captures exact keyword matches — great for proper nouns, model names, acronyms, and rare terms. Real queries often have both aspects. Union of both retrieval sets fed into the reranker gives the reranker more signal to work with, improving final precision.
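The union-then-rerank flow can be sketched in a few lines. All the interfaces here (`dense_index`, `bm25_index`, `reranker`) are hypothetical callables for illustration, not a real API:

```python
def hybrid_retrieve(query, dense_index, bm25_index, reranker, k=50):
    """Hybrid retrieval sketch: union dense + sparse candidates, rerank.

    dense_index / bm25_index: callables returning [(doc_id, score), ...].
    reranker: callable scoring a (query, doc_id) pair.
    """
    dense_hits = dict(dense_index(query, k))    # semantic matches
    sparse_hits = dict(bm25_index(query, k))    # exact keyword matches
    candidates = set(dense_hits) | set(sparse_hits)  # union of both sets
    # The cross-encoder reranker sees every candidate from either source,
    # so BM25 can surface exact-match docs the dense index missed.
    scored = sorted(candidates, key=lambda d: reranker(query, d),
                    reverse=True)
    return scored[:k]

# Toy indexes: dense misses the rare proper noun, BM25 catches it.
dense = lambda q, k: [("doc_semantic", 0.9)]
bm25 = lambda q, k: [("doc_exact_match", 12.3)]
rerank = lambda q, d: {"doc_semantic": 0.4, "doc_exact_match": 0.8}[d]
print(hybrid_retrieve("rare model name", dense, bm25, rerank))
# ['doc_exact_match', 'doc_semantic']
```

Note that raw dense and BM25 scores are never mixed; the reranker provides the single comparable score, which sidesteps score-normalization headaches.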

Component Ablation: What Does Each Part Buy?

Remove each component one at a time and watch quality drop. The reranker is irreplaceable; hybrid retrieval and high-dim embeddings each add a meaningful but smaller contribution.


Two Recipes to Teach the Planner When to Search and When to Stop

A better search tool only helps if the planner uses it well. Two training recipes were tested: RL with outcome reward (GRPO) and on-policy distillation from a large teacher model. The best result combines both.

Recipe 1 — GRPO: The planner generates a full trajectory (search → read → answer). A binary reward (+1 correct, 0 wrong) is applied. The model learns search behavior through trial and error. Simple but sparse signal.
Recipe 1: RL with Outcome Reward (GRPO)
reward: +1 if final answer correct, 0 if wrong
signal: binary, applied only at completion
based on: the Search-R1 approach

Recipe 2: On-policy Distillation
teacher: Qwen3-235B (large reasoning model)
student: smaller planner model
signal: dense per-token via reverse KL divergence
result: 39% KL reduction in 50 steps

Combined (two-stage, best result)
Stage 1: 50 steps of distillation (learn search behavior)
Stage 2: 30 steps of GRPO + CLP (tune for efficiency)
Why two stages? Distillation teaches the planner how to search (what a good search trajectory looks like). GRPO then teaches it when to stop (efficiency). Pure GRPO from random initialization is too slow — the sparse binary signal struggles to teach nuanced search behavior. Distillation first gives GRPO a strong starting point.
Recipe 1: GRPO
Grouped Relative Policy Optimization. The model generates multiple trajectory rollouts per query. Correct trajectories are rewarded, incorrect ones are penalized. The policy gradient update favors trajectories that led to correct answers. No teacher model needed — but signal is sparse.
Recipe 2: On-policy Distillation
The student model generates its own trajectories (on-policy). The teacher (Qwen3-235B) evaluates each token of the student's trajectory and provides a dense supervision signal via reverse KL divergence. 50 steps cut teacher-student divergence by 39% — far faster than GRPO alone.
What is GRPO?
Grouped Relative Policy Optimization is a reinforcement learning algorithm for LLMs. For each query, the model generates a group of trajectories. Within the group, trajectories that scored better than average receive positive rewards; worse ones receive negative. The policy is updated to increase the probability of better trajectories. It's computationally cheaper than PPO because it doesn't require a separate value model.
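The group-relative advantage at the heart of GRPO is a one-liner per trajectory. A minimal sketch using the common mean/std normalization (the exact normalization used here is an assumption; this post only specifies the binary outcome reward):

```python
def group_relative_advantages(rewards):
    """GRPO's core idea in one function (sketch).

    Within a group of trajectories for the same query, each trajectory's
    advantage is its reward minus the group mean, divided by the group
    std. No learned value model is needed, unlike PPO.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # avoid div-by-zero when all rewards tie
    return [(r - mean) / std for r in rewards]

# 4 rollouts for one query, binary outcome reward (+1 correct, 0 wrong):
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advs)  # [1.0, -1.0, -1.0, 1.0]
```

The policy gradient then pushes probability toward trajectories with positive advantage; when every rollout in a group ties, all advantages are zero and the query contributes no update.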
Why reverse KL divergence?
Reverse KL (KL[student || teacher]) encourages the student to concentrate probability mass on modes the teacher assigns high probability. This is "mode-seeking" behavior — the student learns to be confidently correct in regions where the teacher is confident. Forward KL would be "mean-seeking" and spread probability, causing the student to hedge on uncertain regions. For search behavior learning, mode-seeking produces sharper, more decisive search strategies.
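The mode-seeking effect is visible even on a three-token toy distribution. A minimal sketch (all distributions are made up for illustration):

```python
import math

def reverse_kl(student_probs, teacher_probs):
    """Reverse KL: KL(student || teacher) = sum_x s(x) * log(s(x)/t(x)).

    Mode-seeking: the student pays heavily for putting mass where the
    teacher puts almost none, so it concentrates on teacher modes.
    """
    return sum(s * math.log(s / t)
               for s, t in zip(student_probs, teacher_probs) if s > 0)

teacher = [0.90, 0.09, 0.01]  # teacher confident in token 0
hedging = [0.40, 0.30, 0.30]  # student spreading mass ("hedging")
sharp = [0.85, 0.10, 0.05]    # student concentrated on the mode
print(reverse_kl(hedging, teacher))  # large: mass on low-teacher regions
print(reverse_kl(sharp, teacher))    # small: matches the teacher's mode
```

The hedging student is penalized far more than the sharp one, which is exactly the gradient pressure that produces decisive search strategies.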
Why two stages rather than just one?
Distillation alone teaches good search behavior but doesn't optimize for efficiency (tool call count). GRPO alone has sparse signal — it takes many steps to learn what a good trajectory looks like from scratch. Two-stage: distillation gives GRPO a warm start with already-reasonable search behavior, then GRPO + CLP penalty fine-tunes for efficiency. Combined, 80 total steps achieves what neither recipe could alone.

Training Convergence: Accuracy Up, Tool Calls Down

As training progresses, two things happen simultaneously: accuracy rises and redundant tool calls are eliminated. The dashed line marks the switch from distillation to GRPO+CLP.

The Conditional Log Penalty: Teaching Agents to Stop Searching

Training the planner to be accurate is step one. Step two is teaching it to be efficient — to not make unnecessary search calls. Three penalty formulas were tested. Only one preserves correct incentives while giving headroom for complex queries.

At a glance: 3 formulas tested · 13.4× CLP headroom vs linear · linear hits zero at tc = 4 · CLP stays positive at tc = 54.
Three Penalty Formulas for Tool Call Count (tc)
em = 1 if the answer is correct, 0 if wrong.
Additive: R = em + λ·tc. BROKEN: a wrong answer with many calls can outscore a correct one.
Linear multiplicative: R = em × (1 - α·tc). NARROW: reward hits 0 at tc = 4.
CLP (winner): R = em × max(0, 1 - ε·log(1 + tc)), with ε = 0.15 optimal; reward stays positive until tc ≈ 54.
CLP insight: the first search call is expensive (skip it on easy questions), but additional searches are cheap, so the penalty should be logarithmic.
Why Additive Fails
R = em + λ·tc means a wrong answer with many tool calls still gets a positive reward (λ·tc). It breaks "separation": the guarantee that a correct answer always beats a wrong one. The agent can be rewarded for being wrong as long as it tries hard.
Why Linear Fails
Linear multiplicative R = em × (1 - α·tc) hits zero at tc=4 (with standard α). Complex multi-hop queries legitimately need 5–10 searches. Any penalty that zeroes out the reward at tc=4 punishes the agent for doing thorough research on hard questions — the exact behavior you want on BrowseComp.
Why Log Scale Works
Logarithm grows slowly — doubling tool calls adds a fixed increment of penalty, not a proportional one. Early calls are penalized steeply (skip easy searches), later calls are penalized gently (allow deep research). This asymmetry matches the real cost structure: the first search is almost always worth doing, the 10th rarely is.
What does "breaks separation" mean for additive?
Separation is the requirement that R(correct answer) > R(wrong answer) always. With additive reward R = em + λ·tc, if a wrong answer uses tc=10 tool calls and λ=0.2, it gets reward 0 + 2.0 = 2.0. A correct answer with tc=0 gets 1.0 + 0 = 1.0. The wrong, inefficient answer wins. This trains the planner to make many calls even when it knows it's wrong — a pathological failure mode.
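All three reward shapes, and the separation failure above, can be checked directly. A minimal sketch assuming a natural log in CLP and illustrative λ and α values; only ε = 0.15 is taken from this post:

```python
import math

# em = 1 if the answer is correct else 0; tc = tool call count.
def additive(em, tc, lam=0.2):
    return em + lam * tc                 # BROKEN: breaks separation

def linear_mult(em, tc, alpha=0.25):     # alpha chosen so reward=0 at tc=4
    return em * (1 - alpha * tc)

def clp(em, tc, eps=0.15):               # log base is an assumption
    return em * max(0.0, 1 - eps * math.log(1 + tc))

# Additive: a wrong answer with 10 calls (reward 2.0) beats a correct
# answer with 0 calls (reward 1.0) -- the pathological case.
assert additive(em=0, tc=10) > additive(em=1, tc=0)

# Linear multiplicative zeroes out at tc=4, punishing deep research.
assert linear_mult(em=1, tc=4) == 0

# CLP: steep early penalty, gentle later, still positive at tc=54.
assert clp(em=1, tc=54) > 0
assert clp(em=1, tc=1) < clp(em=1, tc=0) == 1.0
```

Under CLP the reward for a correct answer decays slowly with each extra search, so thorough multi-hop research is never zeroed out, while a wrong answer always scores 0 regardless of effort.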
How was ε = 0.15 chosen?
It was swept empirically. At ε = 0.15, CLP stays positive (reward > 0) until tc ≈ 54 tool calls — well beyond any realistic query complexity. Lower ε values (< 0.05) are too permissive and don't meaningfully penalize redundant searches. Higher values (> 0.30) start to discourage legitimate multi-hop research. 0.15 was the sweet spot on the development set.

Compound Gains: Tool + Planner Together

Tool optimization and planner training compound — combining both beats either alone. The headline result: a trained planner on the fastest (cheapest) tool matches an untrained planner on the strongest (most expensive) tool, at half the latency.

At a glance: 50.1% trained Fast accuracy · 50.0% untrained Strong accuracy · 60.7% trained Max accuracy · ~3 tool calls for every trained config.
The key insight: "A trained planner on the fastest tool gathers evidence more efficiently than an untrained planner with the strongest tool." This means you don't need to pay for the Max config — train the planner on Fast and get Strong-quality results at Fast speed.
Out-of-Distribution Generalization
Training happened on NQ + HotpotQA (1–2 hop questions). Evaluation was on harder out-of-distribution benchmarks: MuSiQue, Bamboogle, 2WikiMultihop, TriviaQA, PopQA, and BrowseComp-Plus. Consistent improvements across all — the planner learns generalizable search strategy, not dataset-specific patterns.
Why Tool Calls Converge to ~3
With CLP shaping, the trained planner learns to use exactly as many searches as needed — regardless of which retrieval stack it's on. Fast config with low-quality retrieval: ~3 calls. Max config with high-quality retrieval: also ~3 calls. The planner adapts its confidence threshold to the tool's capability.
Production Implication
Deploy the Fast tool config (cheaper embedding, smaller reranker) and invest in planner training. You get Strong-config quality at Fast-config cost. For most production RAG systems, this is the optimal operating point: minimize infrastructure cost, train the planner to compensate.
Which benchmarks were used?
Eight benchmarks appear across training and evaluation: NQ (Natural Questions, single-hop) and HotpotQA (two-hop) for training, plus six out-of-distribution evaluations: MuSiQue (multi-hop, compositional), Bamboogle (adversarial, hard), 2WikiMultihop (multi-hop across Wikipedia), TriviaQA (trivia), PopQA (popularity-stratified), and BrowseComp-Plus (hardest: requires web-scale search across many documents).
Why does training on easy questions help with hard ones?
The training tasks (NQ, HotpotQA) teach the planner general search strategy: query formulation, evidence synthesis, when a document is relevant, when to refine the search. These are transferable skills. Hard multi-hop queries on BrowseComp-Plus require the same underlying behaviors, just chained more times. The CLP penalty generalizes too — the planner learns to be parsimonious with searches across all complexity levels.

Benchmark Matrix: Accuracy Across All Evaluations

Trained configs improve consistently across every benchmark in the matrix, from simple single-hop (NQ) to adversarial multi-hop (BrowseComp-Plus).


What Should You Deploy?

The recommender weighs latency sensitivity, quality requirements, and cost constraints to suggest the best config and training recipe for your use case.

Fast + Trained
Best for latency-sensitive or cost-constrained deployments. 512-dim embeddings, 2B reranker, hybrid retrieval. Trained planner closes the gap to Strong accuracy. Latency: ~13s. Cost: lowest.
Strong + Trained
Best balanced choice for most production systems. 512-dim, 6B reranker, hybrid. Trained planner pushes to 57%+ accuracy. Latency: ~26s. Recommended for general RAG applications.
Max + Trained
Best for quality-first applications where latency is acceptable. 4096-dim, 6B reranker, hybrid, top-200. Trained planner achieves 60.7% on BrowseComp-Plus. Latency: ~52s. Cost: highest.
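The recommendation logic can be approximated with a rule-based stand-in. A toy sketch with made-up thresholds, not the post's actual recommender:

```python
def recommend(latency_sensitivity, quality_requirement, cost_sensitivity):
    """Toy stand-in for the interactive recommender (thresholds assumed).

    Each input is a 0-10 priority. Quality-first deployments get Max;
    latency- or cost-dominated ones get Fast; Strong is the balanced
    default. All configs assume a trained planner.
    """
    if quality_requirement > max(latency_sensitivity, cost_sensitivity):
        return "Max + Trained"    # 4096-dim, 6B reranker, top-200, ~52s
    if max(latency_sensitivity, cost_sensitivity) > quality_requirement:
        return "Fast + Trained"   # 512-dim, 2B reranker, ~13s, lowest cost
    return "Strong + Trained"     # balanced default, ~26s

print(recommend(latency_sensitivity=8, quality_requirement=4,
                cost_sensitivity=6))   # Fast + Trained
print(recommend(latency_sensitivity=5, quality_requirement=5,
                cost_sensitivity=5))   # Strong + Trained
```

Whatever the priorities, the training recipe is constant: every recommended configuration pairs its tool stack with a trained planner, since that is where the compound gains come from.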