The same model. Two different prompts. Dramatically different results – but only on the tasks that are actually hard. On olympiad math, hardware optimization, and complex multi-hop reasoning, even frontier models leave significant accuracy on the table with generic prompts.
Why not just use ChatGPT with a simple prompt?
For everyday questions, you're right – modern LLMs already handle them well. GEPA targets a different class of problems: hard, specialized tasks where even frontier models fall short without the right instructions. The paper tests on NPUEval (AMD hardware kernel optimization), AIME 2025 (olympiad math), and HotpotQA (multi-hop reasoning). On NPUEval, GPT-4o with a generic prompt achieves only 4–19% vector utilization. With a GEPA-evolved prompt, the same model reaches 30.52% – no fine-tuning, no new weights, just better instructions discovered automatically.
✗ Baseline Prompt (GPT-4o on NPUEval)
Write an optimized kernel for the
following operation on AMD hardware:
{task}
Task (from paper's NPUEval benchmark)
"Implement a matrix multiply kernel for the AMD XDNA2 neural processing unit."
Result
Vector utilization: 4–19%. The model writes generic code with no awareness of the XDNA2 tile architecture, memory-layout constraints, or compiler intrinsics. Paper result: GPT-4o baseline.
✓ GEPA-Optimized Prompt (same model)
Write an optimized kernel for AMD
XDNA2. Requirements:
- Use AIE tile vector intrinsics
- Align buffers to 32-byte boundaries
- Prefer ping-pong buffering for
memory latency hiding
- Avoid scalar fallbacks in hot loops
- Verify with compiler profiler output
Task: {task}
Same task
"Implement a matrix multiply kernel for the AMD XDNA2 neural processing unit."
Result
Vector utilization: 30.52%. GEPA reflected on compiler errors and profiling feedback to evolve domain-specific constraints. Paper result: ~1.5–7× improvement over all baselines.
4–19%
GPT-4o baseline (NPUEval)
30.52%
Same model with GEPA prompt
Up to 35×
Fewer rollouts than RL methods
The model already has the capability – it just needs the right instructions to activate it. GEPA's job is to automatically discover those instructions by letting the model read its own failures and write better prompts. No fine-tuning. No reward model. Just reflection.
GEPA has four key components that form a closed loop. Click any node to learn more, then navigate to its interactive section.
Paper Map – Click any component to explore
GEPA vs. RL-based Approaches
Property
RL-based (e.g. GRPO)
MIPROv2 (non-RL optimizer)
GEPA
Weight Updates?
Yes – billions of parameters updated via backprop
No – prompt-only optimizer
No – model weights never touched. Only the prompt text changes.
Complexity
High – requires policy-gradient training
Medium – structured LLM-based search
Low – just LLM calls
Reward Model
Required – must be designed per task
None – uses task accuracy metrics
None – uses model self-reflection
Interpretability
Black-box – why did the policy improve?
Partial – prompts readable, process opaque
Human-readable – read the reflection
Compute
High – gradient computation + backprop
Moderate – more calls than manual prompting
Moderate – inference-only
Transferability
Partial – policy tied to one model
Good – prompts transfer across models
Strong – prompts work across models
vs GEPA (AIME-2025)
Worse – GEPA beats it by ~20%
Worse – GEPA beats it by +12%
Best overall
MIPROv2 was the leading non-RL prompt optimizer prior to GEPA. GEPA outperforms it by +12% on AIME-2025 (verified from the paper abstract). GRPO is the primary RL baseline used in the paper.
Before understanding GEPA, it helps to understand the evolutionary algorithm framework it builds on. Evolutionary algorithms maintain a population of candidates, select the fittest, and breed the next generation.
🧬 Population
Start with N candidate prompts. These might be slight variations of a baseline, or a mix of hand-written options. Each candidate is a complete prompt template that can be evaluated on real tasks.
For GEPA, the initial population is often just one prompt: the simplest possible instruction. The algorithm builds from there.
Prompt A: "Answer: {q}" score: ?
Prompt B: "Solve: {q}" score: ?
Prompt C: "Q: {q} A:" score: ?
📊 Evaluate
Run each prompt on a small validation set (20–50 examples). Score = accuracy on those examples. This is the fitness function – it determines which prompts survive.
Unlike RL, there's no reward model here – just direct task accuracy. This makes the fitness signal clean and interpretable.
Prompt A: "Answer: {q}" score: 0.52 ✓
Prompt B: "Solve: {q}" score: 0.41
Prompt C: "Q: {q} A:" score: 0.38
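The Evaluate step can be sketched in a few lines. This is an illustrative sketch, not the paper's code: `call_llm` is a hypothetical stand-in for any model API, and fitness is plain exact-match accuracy on a labelled validation set.

```python
def fitness(prompt_template, val_set, call_llm):
    """Score a prompt: fraction of validation examples answered correctly."""
    correct = 0
    for question, gold_answer in val_set:
        # Fill the template and run one forward pass (one "rollout").
        answer = call_llm(prompt_template.format(q=question))
        correct += int(answer.strip() == gold_answer)
    return correct / len(val_set)

# Toy run with a fake model that always answers "4".
val_set = [("2+2", "4"), ("3+3", "6")]
score = fitness("Answer: {q}", val_set, lambda prompt: "4")
```

With the fake model above, only the first example matches, so the fitness score is 0.5. In a real run, `call_llm` would wrap an actual model call and `val_set` would hold 20–50 labelled examples.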
🏆 Select
Keep only the top-K prompts by validation score. The rest are discarded. This selection pressure forces the population to improve generation over generation.
GEPA typically keeps the top 2–3 prompts as "survivors" that seed the next generation of mutations.
Prompt A: "Answer: {q}" score: 0.52 ✓ KEPT
Prompt B: "Solve: {q}" score: 0.41 ✗ DISCARDED
Prompt C: "Q: {q} A:" score: 0.38 ✗ DISCARDED
🔁 Mutate
Generate M new prompt variants from the survivors. Traditional EA uses random mutation. GEPA's key innovation: the mutation is guided by the model's reflection on its own failures.
Instead of randomly tweaking words, GEPA asks the LLM: "Why did this prompt fail? How should it be improved?" The answer drives targeted mutations.
Prompt A (parent) "Answer: {q}"
→ reflect on failures → generate children
"Step by step, answer: {q}" new candidate
"Think carefully, then answer: {q}" new candidate
Population Fitness Over Generations
Prompt Fitness Across 8 Generations
Each line = one prompt candidate. Watch the population converge.
Traditional evolutionary algorithms use random mutation: random word swaps, insertions, deletions. GEPA's mutation is guided – the model reads its failure cases and writes targeted improvements. This is what makes GEPA sample-efficient.
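Random mutation, the traditional EA baseline, can be made concrete with a blind word swap – nothing about the edit is informed by failure cases. A minimal illustrative sketch (not from the paper):

```python
import random

def random_mutation(prompt, rng=None):
    """Swap two random words. The mutation knows nothing about why the prompt fails."""
    rng = rng or random.Random()
    words = prompt.split()
    if len(words) >= 2:
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

# A seeded RNG makes the toy example reproducible.
mutated = random_mutation("Solve the problem step by step", random.Random(0))
```

The mutated prompt contains the same words in a shuffled order – it may accidentally help, but it cannot target the specific weakness that caused a failure. That targeting is exactly what GEPA's reflection adds.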
⚠️ Exploration vs. Exploitation – A General EA Trade-off
This is a fundamental challenge in any evolutionary algorithm. Exploration means trying diverse, potentially very different prompt variants; exploitation means refining what already works. Too much exploitation → the population converges early on a locally good but globally suboptimal prompt. Too much exploration → no candidate gets refined enough to reach peak performance.
GEPA addresses this by keeping a small population (top-K=3 survivors), generating several candidates per generation (M=5 mutations), and using reflection-guided mutation – which reduces the cost of exploration because each new candidate is directed, not random. That said, early convergence remains a real risk, especially on harder tasks with smaller validation sets.
GEPA is a closed loop that runs for a fixed number of iterations – called a budget – typically 5–10 generations. "Budget" here means compute budget: each generation costs LLM API calls, so you set a limit upfront. Each generation improves the prompt using the model's own reasoning as a guide.
GEPA Loop – Click any node to inspect
Step 1 – Initialize
Start with a baseline prompt – often the simplest possible instruction, like "Answer: {question}". No manual engineering is required. The algorithm will improve it automatically. The initial prompt forms the seed of the first-generation population.
Step 2 – Evaluate
Run each prompt in the population on a held-out validation set of 20–50 examples. Record which examples each prompt gets right and which it gets wrong. The validation accuracy is the fitness score – no separate reward model is needed.
Step 3 – Reflect
Feed the failed examples to the LLM and ask: "Given these failures, what is wrong with the current prompt? What should it emphasize or clarify?" The model produces a structured reflection identifying the prompt's weaknesses. This step replaces the reward function in RL-based approaches.
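The Reflect step amounts to assembling the failure cases into a single diagnostic query. A sketch under assumptions: `call_llm` is a hypothetical wrapper around any chat-completion API, failures are recorded as (question, model_answer, gold_answer) triples, and the query wording is illustrative rather than the paper's exact template.

```python
def reflect(call_llm, current_prompt, failures):
    """Ask the model to diagnose the prompt from its own failure cases."""
    report = "\n".join(
        f"- Q: {q}\n  model answered: {a}\n  correct answer: {g}"
        for q, a, g in failures
    )
    query = (
        f"Current prompt:\n{current_prompt}\n\n"
        f"Failed examples:\n{report}\n\n"
        "Given these failures, what is wrong with the current prompt? "
        "What should it emphasize or clarify?"
    )
    return call_llm(query)  # natural-language diagnosis, fully readable
```

The return value is plain text, which is what makes the whole loop inspectable: the diagnosis that drives the next mutation can simply be read.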
Step 4 – Mutate
Generate M=5 new prompt candidates using the reflection as a guide. Each candidate attempts to address the identified weaknesses. The LLM writes improved prompts based on its own diagnosis โ the mutations are targeted, not random.
Step 5 – Select
Evaluate all new candidates plus the survivors from the previous generation. Keep the top-K=3 by validation accuracy. The best prompt of this generation becomes the starting point for the next reflection step.
Step 6 – Iterate
Repeat Evaluate → Reflect → Mutate → Select until convergence or the compute budget is exhausted. GEPA typically converges in 5–10 generations. The final best prompt in the population is returned as the optimized output.
Each generation, GEPA reads failure cases and rewrites the prompt. Watch how a generic two-word instruction grows into a precise, task-specific guide.
Generation 0 – Baseline
Solve the problem:
{problem}
Score: 46.6%
Generation 3 – Improving
Solve step by step.
Show all intermediate
values clearly.
Verify your answer.
Problem: {problem}
Score: 53.2%
Generation 6 – Optimized
Solve step by step:
1. Identify formula/method
2. Substitute values
3. Show each calculation
4. Verify result against
all constraints
Problem: {problem}
Score: 60.0% ✓
The prompt text is the only thing that changed. Same model, same weights – but the instructions went from 4 words to a structured checklist, guided entirely by the model's own failure analysis.
One Full Generation – Step by Step
What Happens Inside a Single Generation
We start with the current best prompt. In generation 1, this is the baseline. By generation 5, it already contains reflection-guided improvements.
Current best prompt:
"Solve step by step. Show all intermediate values. Problem: {problem}"
Population size: 3 prompts | Generation: 3 of 10
Run the current best prompt on the validation set. Each example is scored as pass ✓ or fail ✗. The accuracy is the fitness score for this prompt.
✓ Q: "A train travels 60 mph for 2 h. Distance?" → "120 miles" ✓
The model reads the 2 failed examples and diagnoses what the prompt is missing. This is the reflection โ written in natural language, fully readable.
Model Reflection Output
"The prompt asks to show intermediate values, but doesn't specify how to handle geometry problems (where shape properties must be confirmed first) or ratio chains (where each ratio must be normalised before combining). The prompt should explicitly require: (1) confirming shape type and properties before applying formulas, (2) normalising ratios to a common term before computing compound ratios."
Using the reflection, the model generates 3 new prompt candidates. Each addresses the identified weaknesses in a slightly different way.
Candidate A: "Solve step by step. For geometry: confirm shape type first. For ratios: normalise to common term. Show all values. Problem: {problem}"
Candidate B: "Identify problem type (geometry/algebra/ratio). Apply type-specific rules. Show each calculation. Verify result. Problem: {problem}"
Candidate C: "Step-by-step solution: (1) identify formula/method for this problem type, (2) substitute known values, (3) compute carefully, (4) verify. Problem: {problem}"
All 3 new candidates + the 3 existing survivors are evaluated on the validation set. The top 3 by score become the next generation's population.
Candidate A: 80% ✓ KEPT
Candidate B: 75% ✓ KEPT
Previous best: 60% ✓ KEPT
Candidate C: 55% ✗ DROPPED
Other survivors: 50% ✗ DROPPED
Generation complete. Best prompt improved from 60% → 80%. Next generation begins.
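The selection rule in this walkthrough is just a sort-and-truncate over scored candidates. A minimal sketch (the function name and the list-of-pairs representation are assumptions for illustration):

```python
def select_top_k(scored_candidates, k=3):
    """Keep the k highest-scoring prompts.

    scored_candidates is a list of (prompt, score) pairs.
    """
    ranked = sorted(scored_candidates, key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# The pool from the walkthrough: 3 new candidates + 2 previous survivors.
pool = [("Candidate A", 0.80), ("Candidate B", 0.75), ("Previous best", 0.60),
        ("Candidate C", 0.55), ("Other survivor", 0.50)]
survivors = select_top_k(pool, k=3)
```

Applied to the pool above, the survivors are Candidate A, Candidate B, and the previous best – exactly the outcome shown in the walkthrough table.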
Algorithm Pseudocode
GEPA(task, baseline_prompt, budget):              # budget = max number of generations (e.g. 10)
    population = [baseline_prompt]
    for gen in range(budget):                     # each iteration = 1 generation of LLM calls
        scores = evaluate(population, val_set)    # needs ground-truth labels
        failures = get_failures(best(population), val_set)
        refl = reflect(model, failures)           # no reward model
        new_prompts = mutate(model, best(population), refl, M=5)
        population = select_top_k(population + new_prompts, k=3)
    return best(population)
⚠️ Ground Truth Required
GEPA needs a labelled validation set – example inputs paired with known correct answers – to score each prompt. The fitness signal is simply: how many examples did this prompt get right? This means GEPA works best for tasks with clear, measurable correctness: math, QA, code generation, instruction following. For open-ended tasks like creative writing or summarization where there is no single correct answer, applying GEPA is significantly harder.
⚠️ Important: GEPA does NOT update model weights
GEPA is a prompt optimizer, not a training algorithm. The LLM's parameters – all its billions of weights – are completely frozen throughout. What GEPA changes is only the text of the prompt: a string of instructions passed to the model at inference time. The paper describes it as working via "natural language reflection to learn high-level rules" rather than gradient descent. You could run GEPA on any model you only have API access to – like GPT-4o – without ever seeing its weights.
No reward model. No policy gradient. No value function. No backpropagation. Just the model reading its own mistakes and writing better instructions. Every component is a forward-pass LLM call.
Reflection is what separates GEPA from naive evolutionary search. Instead of random mutations, the model diagnoses exactly what the prompt is missing – then writes a fix.
Interactive Reflection Demo
Current Prompt
Answer the question:
{question}
Failed Example
Q: Who directed the 2010 film starring the lead actor of Inception?
Model: "Christopher Nolan"
(✗ skips intermediate reasoning)
Reflection Output
✗ The prompt doesn't ask the model to identify intermediate entities first.
Fix: Require step-by-step entity resolution before answering.
Generation 0
Impact of Reflection on Learning
Random Mutation vs. Reflection-Guided Mutation
Without reflection, mutations are random word swaps – performance is noisy and improvement is slow.
Real Reflection Examples (from paper)
HotpotQA (Multi-hop)
"The prompt does not instruct the model to identify which documents are relevant before answering. Add: First identify relevant facts from each document, then connect them to reach the answer."
AIME (Competition Math)
"The model skips algebraic steps and jumps to a numerical answer. Add: Show every algebraic manipulation. Do not skip arithmetic or combine steps. Verify the answer by substitution."
IFBench (Instruction Following)
"The model does not self-check its output format before responding. Add: Before writing your response, re-read the format constraints. Verify your answer satisfies all of them."
This is why GEPA is interpretable: you can read the reflection and understand exactly why the prompt changed. RL-based methods update weights invisibly; GEPA updates prompts visibly.
Next: see how GEPA performs on real benchmarks → Experiments
Results
Experiments: Does It Actually Work?
GEPA was evaluated on 4 diverse benchmarks against baseline prompting, manual prompt engineering, and RL-based prompt optimization methods.
4/4
Benchmarks won vs RL
+47%
Avg. improvement vs baseline
2
Models tested (open + closed)
Performance by Benchmark
Illustrative – based on paper-reported trends
AIME spotlight (verified from paper): GEPA outperforms MIPROv2 – the leading non-RL prompt optimizer – by +12% on AIME-2025. MIPROv2 uses structured LLM-based search but lacks reflection: it cannot diagnose why a prompt failed. GEPA's reflection mechanism bridges that gap, particularly on tasks requiring precise multi-step reasoning where generic optimization stalls.
Compute Efficiency vs. RL Methods
Rollout Count Comparison (HotpotQA task)
Numbers from paper (HotpotQA with Qwen3-8B)
What is a rollout?
In RL methods (e.g. GRPO)
One rollout = the model generating one response to one input. To estimate a policy gradient reliably, you need thousands of rollouts for the same input – gradient estimation from sparse rewards is statistically noisy and requires many samples to average out.
In GEPA
One rollout = one LLM inference call. Rollouts happen when evaluating prompts on the validation set, generating a reflection, or producing new candidate prompts. GEPA needs far fewer because one reflection call gives a direct, interpretable diagnosis – far more signal per call than noisy gradient estimation.
GEPA achieves better results using 6,438 rollouts vs GRPO's 24,000 rollouts – nearly 4× fewer on this task (the paper's headline figure of up to 35× is the best case across benchmarks). RL needs thousands of samples to estimate gradients reliably; GEPA gets comparable signal from a single reflection call.
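For intuition, here is a back-of-the-envelope rollout count under assumed settings (population of 3 survivors, M=5 mutations, a 50-example validation set, 10 generations). These are illustrative values consistent with the numbers used throughout this article, not the paper's exact accounting:

```python
VAL_SIZE = 50        # validation examples (assumed)
M = 5                # new candidates per generation (assumed)
GENERATIONS = 10     # compute budget (assumed)

# Each generation: score M new candidates on the full validation set,
# plus one reflection call and M prompt-writing calls.
per_generation = M * VAL_SIZE + 1 + M

# One initial evaluation of the baseline, then the generation loop.
total_rollouts = VAL_SIZE + GENERATIONS * per_generation
```

Under these assumptions the run costs a few thousand forward passes – the same order of magnitude as the paper's 6,438 for HotpotQA – while policy-gradient methods burn tens of thousands just to average out gradient noise.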
Cross-Model Prompt Transfer
Optimized prompts transfer across model families
A prompt optimized on GPT-4 retains most of its gains when applied to Qwen3-8B, and vice versa. This suggests GEPA discovers task-level improvements, not model-specific hacks.
Next: see what happens when you remove each component → Ablations
What Matters
What Breaks Without Each Component?
The paper's ablation study isolates each component. Toggle features off to see what happens to the performance curve.
Ablation Simulator โ Toggle components
🪞 Reflection
Removing reflection drops performance by ~31 points. Random mutations without diagnosis fail to fix the right things.
🧬 Selection
Without selection pressure, the population drifts. Quality stagnates around 55% because weak prompts aren't eliminated.
📊 Validation Eval
Without a fitness signal, there's no direction. The algorithm performs a random walk – no better than chance after the first generation.
GEPA's success challenges a common assumption: that complex optimization requires complex machinery. Sometimes the model's own reasoning is the best optimizer.
🌱
Simple ≠ Worse
An evolutionary loop with model reflection beats RL-based baselines on 4/4 benchmarks, with no backpropagation and no reward model.
💪
Models Can Self-Improve
The same model being optimized can diagnose its own failures and write better instructions. No external judge needed.
🌍
Prompts Are Portable
Prompts evolved for one model transfer effectively to other model families. GEPA finds task-level improvements, not model-specific tricks.
Known Limitations
🏷️
Requires Ground Truth Labels
GEPA scores prompts by measuring accuracy on a labelled validation set – inputs paired with known correct answers. Without ground truth, there is no fitness signal to guide evolution. This limits GEPA to tasks where correctness is measurable: math, QA, code, structured instruction-following.
✍️
Hard to Apply to Open-Ended Tasks
For subjective tasks – creative writing, summarization, open-ended dialogue – there is no single correct answer to compare against. GEPA cannot easily evaluate whether one prompt produces "better" creative output than another without a human judge or a separate scoring model, which reintroduces the complexity it was designed to avoid.
When Should You Use GEPA?
Decision Guide โ Is GEPA Right for Your Task?
Before vs. After GEPA Optimization
Prompt Quality Over Time
Baseline (no opt.) | GEPA optimized
RL isn't always the answer. When a model can articulate why it failed and write a better instruction, you don't need a reward function – you need a conversation. GEPA is that conversation, automated.