Visual Summary
GEPA: Reflective Prompt Evolution Interactive Explorer
Why Prompt Optimization Matters
The same model. Two different prompts. Dramatically different results, but only on the tasks that are actually hard. On olympiad math, hardware kernel optimization, and complex multi-hop reasoning, even frontier models leave significant accuracy on the table with generic prompts.
Why not just use ChatGPT with a simple prompt?

For everyday questions, you're right: modern LLMs already handle them well. GEPA targets a different class of problems: hard, specialized tasks where even frontier models fall short without the right instructions. The paper tests on NPUEval (AMD hardware kernel optimization), AIME 2025 (olympiad math), and HotpotQA (multi-hop reasoning). On NPUEval, GPT-4o with a generic prompt achieves only 4–19% vector utilization. With a GEPA-evolved prompt, the same model reaches 30.52%. No fine-tuning, no new weights, just better instructions discovered automatically.

โŒ Baseline Prompt (GPT-4o on NPUEval)
Write an optimized kernel for the following operation on AMD hardware: {task}
Task (from paper's NPUEval benchmark)
"Implement a matrix multiply kernel for the AMD XDNA2 neural processing unit."
Result
Vector utilization: 4–19% ❌ Model writes generic code with no awareness of XDNA2 tile architecture, memory layout constraints, or compiler intrinsics. Paper result: GPT-4o baseline.
✅ GEPA-Optimized Prompt (same model)
Write an optimized kernel for AMD XDNA2.
Requirements:
- Use AIE tile vector intrinsics
- Align buffers to 32-byte boundaries
- Prefer ping-pong buffering for memory latency hiding
- Avoid scalar fallbacks in hot loops
- Verify with compiler profiler output
Task: {task}
Same task
"Implement a matrix multiply kernel for the AMD XDNA2 neural processing unit."
Result
Vector utilization: 30.52% ✓ GEPA reflected on compiler errors and profiling feedback to evolve domain-specific constraints. Paper result: ~1.5–7× improvement over all baselines.
4–19%
GPT-4o baseline (NPUEval)
30.52%
Same model with GEPA prompt
35×
Fewer rollouts than RL methods

The model already has the capability; it just needs the right instructions to activate it. GEPA's job is to automatically discover those instructions by letting the model read its own failures and write better prompts. No fine-tuning. No reward model. Just reflection.

Next: see how all the pieces fit together. Paper Overview →
Paper at a Glance
GEPA has four key components that form a closed loop. Click any node to learn more, then navigate to its interactive section.
Paper Map: click any component to explore

GEPA vs. RL-based Approaches
| Property | RL-based (e.g. GRPO) | MIPROv2 (non-RL optimizer) | GEPA |
|---|---|---|---|
| Weight updates? | Yes: billions of parameters updated via backprop | No: prompt-only optimizer | No: model weights never touched; only the prompt text changes |
| Complexity | High: requires policy-gradient training | Medium: structured LLM-based search | Low: just LLM calls |
| Reward model | Required: must be designed per task | None: uses task accuracy metrics | None: uses model self-reflection |
| Interpretability | Black-box: why did the policy improve? | Partial: prompts readable, process opaque | Human-readable: read the reflection |
| Compute | High: gradient computation + backprop | Moderate: more calls than manual prompting | Moderate: inference-only |
| Transferability | Partial: policy tied to one model | Good: prompts transfer across models | Strong: prompts work across models |
| vs. GEPA (AIME 2025) | Worse: GEPA wins by ~20% | Worse: GEPA wins by +12% | Best overall |
MIPROv2 was the leading non-RL prompt optimizer prior to GEPA; GEPA outperforms it by +12% on AIME 2025 (per the paper's abstract). GRPO is the primary RL baseline used in the paper.
Next: understand the evolutionary algorithm that powers GEPA. Evolutionary Algorithms →
Evolutionary Algorithms: The Intuition
Before understanding GEPA, it helps to understand the evolutionary algorithm framework it builds on. Evolutionary algorithms maintain a population of candidates, select the fittest, and breed the next generation.
🧬 Population

Start with N candidate prompts. These might be slight variations of a baseline, or a mix of hand-written options. Each candidate is a complete prompt template that can be evaluated on real tasks.

For GEPA, the initial population is often just one prompt: the simplest possible instruction. The algorithm builds from there.

Prompt A: "Answer: {q}"
score: ?
Prompt B: "Solve: {q}"
score: ?
Prompt C: "Q: {q} A:"
score: ?

Population Fitness Over Generations
Prompt Fitness Across 8 Generations
Each line = one prompt candidate. Watch the population converge.

Traditional evolutionary algorithms use random mutation: random word swaps, insertions, deletions. GEPA's mutation is guided: the model reads its failure cases and writes targeted improvements. This is what makes GEPA sample-efficient.
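The contrast can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's implementation: `llm` stands in for whatever model call you use, and the function names are made up here.

```python
import random

def random_mutate(prompt: str, rng: random.Random) -> str:
    """Classic EA mutation: blind word-level edits, no task signal."""
    words = prompt.split()
    i = rng.randrange(len(words))
    op = rng.choice(["delete", "duplicate", "swap"])
    if op == "delete" and len(words) > 1:
        words.pop(i)
    elif op == "duplicate":
        words.insert(i, words[i])
    else:  # swap two positions (may be a no-op if they coincide)
        j = rng.randrange(len(words))
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

def guided_mutate(prompt: str, failures: list[str], llm) -> str:
    """GEPA-style mutation: ask the model to diagnose failures and rewrite."""
    request = (
        "Current prompt:\n" + prompt + "\n\n"
        "Failed examples:\n" + "\n".join(failures) + "\n\n"
        "Diagnose what the prompt is missing, then write an improved prompt."
    )
    return llm(request)  # a single forward pass; no gradients, no reward model
```

The random variant needs many tries to stumble onto a useful edit; the guided variant spends one LLM call to produce a directed one.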

⚖ Exploration vs. Exploitation: A General EA Trade-off

This is a fundamental challenge in any evolutionary algorithm. Exploration means trying diverse, potentially very different prompt variants; exploitation means refining what already works. Too much exploitation → the population converges early on a locally good but globally suboptimal prompt. Too much exploration → no candidate gets refined enough to reach peak performance.

GEPA addresses this by keeping a small population (top-K=3 survivors), generating several candidates per generation (M=5 mutations), and using reflection-guided mutation, which reduces the cost of exploration because each new candidate is directed, not random. That said, early convergence remains a real risk, especially on harder tasks with smaller validation sets.

Next: see the full GEPA algorithm step by step. How GEPA Works →
How GEPA Works: Step by Step
GEPA is a closed loop that runs for a fixed number of iterations, called a budget, typically 5–10 generations. "Budget" here means compute budget: each generation costs LLM API calls, so you set a limit upfront. Each generation improves the prompt using the model's own reasoning as a guide.
GEPA Loop: click any node to inspect
Step 1: Initialize

Start with a baseline prompt, often the simplest possible instruction, like "Answer: {question}". No manual engineering is required. The algorithm will improve it automatically. The initial prompt forms the seed of the first-generation population.


How the Prompt Text Actually Evolves
Prompt Evolution Timeline: Generation 0 → 3 → 6

Each generation, GEPA reads failure cases and rewrites the prompt. Watch how a generic two-word instruction grows into a precise, task-specific guide.

Generation 0: Baseline
Solve the problem: {problem}
Score: 46.6%
Generation 3: Improving
Solve step by step. Show all intermediate values clearly. Verify your answer. Problem: {problem}
Score: 53.2%
Generation 6: Optimized
Solve step by step:
1. Identify formula/method
2. Substitute values
3. Show each calculation
4. Verify result against all constraints
Problem: {problem}
Score: 60.0% ✓

The prompt text is the only thing that changed. Same model, same weights, but the instructions went from a four-word request to a structured checklist, guided entirely by the model's own failure analysis.


One Full Generation: Step by Step
What Happens Inside a Single Generation

We start with the current best prompt. In generation 1, this is the baseline. By generation 5, it already contains reflection-guided improvements.

Current best prompt: "Solve step by step. Show all intermediate values. Problem: {problem}" Population size: 3 prompts | Generation: 3 of 10

Algorithm Pseudocode
GEPA(task, baseline_prompt, budget):
    # budget = max number of generations (e.g. 10)
    population = [baseline_prompt]
    for gen in range(budget):    # each iteration = one generation of LLM calls
        scores   = evaluate(population, val_set)     # needs ground-truth labels
        failures = get_failures(best(population), val_set)
        refl     = reflect(model, failures)          # no reward model
        new_prompts = mutate(model, best(population), refl, M=5)
        population  = select_top_k(population + new_prompts, k=3)
    return best(population)
⚠ Ground Truth Required

GEPA needs a labelled validation set (example inputs paired with known correct answers) to score each prompt. The fitness signal is simply: how many examples did this prompt get right? This means GEPA works best for tasks with clear, measurable correctness: math, QA, code generation, instruction following. For open-ended tasks like creative writing or summarization where there is no single correct answer, applying GEPA is significantly harder.
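Putting the loop and the labelled-set fitness signal together, a minimal runnable sketch might look like this. The `model`, `reflect`, and `mutate` interfaces are placeholders for real LLM calls; names and defaults are illustrative, not the paper's code.

```python
from typing import Callable

def fitness(prompt: str, model: Callable[[str], str],
            val_set: list[tuple[str, str]]) -> float:
    """Fraction of labelled examples the prompt answers exactly right."""
    correct = sum(model(prompt.format(q=q)) == answer for q, answer in val_set)
    return correct / len(val_set)

def gepa(model, reflect, mutate, baseline: str, val_set,
         budget: int = 10, k: int = 3, m: int = 5) -> str:
    """Minimal GEPA loop: evaluate, reflect on failures, mutate, select."""
    population = [baseline]
    for _ in range(budget):
        ranked = sorted(population, reverse=True,
                        key=lambda p: fitness(p, model, val_set))
        best = ranked[0]
        failures = [(q, a) for q, a in val_set if model(best.format(q=q)) != a]
        diagnosis = reflect(best, failures)   # plain LLM call, no reward model
        # keep top-k survivors, add m reflection-guided mutants of the best
        population = ranked[:k] + [mutate(best, diagnosis) for _ in range(m)]
    return max(population, key=lambda p: fitness(p, model, val_set))
```

Every step is a forward-pass call or a sort over scores; nothing here touches model weights.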

⚠ Important: GEPA does NOT update model weights

GEPA is a prompt optimizer, not a training algorithm. The LLM's parameters, all its billions of weights, are completely frozen throughout. What GEPA changes is only the text of the prompt: a string of instructions passed to the model at inference time. The paper describes it as working via "natural language reflection to learn high-level rules" rather than gradient descent. You could run GEPA on any model you can only reach through an API, with no access to the weights, such as GPT-4o.

No reward model. No policy gradient. No value function. No backpropagation. Just the model reading its own mistakes and writing better instructions. Every component is a forward-pass LLM call.

Next: dive deeper into the reflection mechanism. The Reflection Mechanism →
The Reflection Mechanism
Reflection is what separates GEPA from naive evolutionary search. Instead of random mutations, the model diagnoses exactly what the prompt is missing, then writes a fix.
Interactive Reflection Demo
Current Prompt
Answer the question: {question}
Failed Example
Q: Who directed the 2010 film starring the lead actor of Inception? Model: "Christopher Nolan" (✗ skips the intermediate reasoning: it never identifies the lead actor of Inception before answering)
Reflection Output
⚠ The prompt doesn't ask the model to identify intermediate entities first. Fix: Require step-by-step entity resolution before answering.
Generation 0
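A reflection step like the one in the demo above boils down to assembling a single LLM request from the current prompt and its failure cases. Here is a hypothetical sketch; the template wording is invented for illustration and is not taken from the paper.

```python
REFLECTION_TEMPLATE = """\
You are improving a prompt by analyzing its failures.

Current prompt:
{prompt}

Failed examples (question, model output, expected answer):
{failures}

First diagnose what instruction the prompt is missing,
then propose a fix as a single sentence starting with "Fix:".
"""

def build_reflection_request(prompt: str,
                             failures: list[tuple[str, str, str]]) -> str:
    """Render the failure cases and slot them into the reflection template."""
    rendered = "\n".join(f"Q: {q}\nModel: {got}\nExpected: {want}"
                         for q, got, want in failures)
    return REFLECTION_TEMPLATE.format(prompt=prompt, failures=rendered)
```

The model's reply to this request is the diagnosis shown in the demo; the proposed fix then seeds the next mutation.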

Impact of Reflection on Learning
Random Mutation vs. Reflection-Guided Mutation
Without reflection, mutations are random word swaps; performance is noisy and improvement is slow.

Real Reflection Examples (from paper)
HotpotQA (Multi-hop)

"The prompt does not instruct the model to identify which documents are relevant before answering. Add: First identify relevant facts from each document, then connect them to reach the answer."

AIME (Competition Math)

"The model skips algebraic steps and jumps to a numerical answer. Add: Show every algebraic manipulation. Do not skip arithmetic or combine steps. Verify the answer by substitution."

IFBench (Instruction Following)

"The model does not self-check its output format before responding. Add: Before writing your response, re-read the format constraints. Verify your answer satisfies all of them."

This is why GEPA is interpretable: you can read the reflection and understand exactly why the prompt changed. RL-based methods update weights invisibly; GEPA updates prompts visibly.

Next: see how GEPA performs on real benchmarks. Experiments →
Experiments: Does It Actually Work?
GEPA was evaluated on 4 diverse benchmarks against baseline prompting, manual prompt engineering, and RL-based prompt optimization methods.
4/4
Benchmarks won vs RL
+47%
Avg. improvement vs baseline
2
Models tested (open + closed)
Performance by Benchmark
Illustrative, based on paper-reported trends

Compute Efficiency vs. RL Methods
Rollout Count Comparison (HotpotQA task)
Numbers from paper (HotpotQA with Qwen3-8B)
What is a rollout?
In RL methods (e.g. GRPO)

One rollout = the model generating one response to one input. To estimate a policy gradient reliably, you need many rollouts per input: gradient estimation from sparse rewards is statistically noisy and requires many samples to average out.

In GEPA

One rollout = one LLM inference call. Rollouts happen when evaluating prompts on the validation set, generating a reflection, or producing new candidate prompts. GEPA needs far fewer because one reflection call gives a direct, interpretable diagnosis: far more signal per call than noisy gradient estimation.

GEPA achieves better results using 6,438 rollouts vs. GRPO's 24,000 rollouts, nearly 4× fewer. RL needs thousands of samples to estimate gradients reliably; GEPA gets equivalent signal from a single reflection call.
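The "nearly 4×" figure follows directly from the reported counts:

```python
# Rollout counts reported for HotpotQA with Qwen3-8B.
grpo_rollouts = 24_000
gepa_rollouts = 6_438

ratio = grpo_rollouts / gepa_rollouts
print(f"GEPA used {ratio:.1f}x fewer rollouts on this task")
# prints "GEPA used 3.7x fewer rollouts on this task"
```

(The 35× headline number earlier refers to the paper's best case across tasks, not this specific HotpotQA run.)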

Cross-Model Prompt Transfer
Optimized prompts transfer across model families

A prompt optimized on GPT-4 retains most of its gains when applied to Qwen3-8B, and vice versa. This suggests GEPA discovers task-level improvements, not model-specific hacks.

Next: see what happens when you remove each component. Ablations →
What Breaks Without Each Component?
The paper's ablation study isolates each component. Toggle features off to see what happens to the performance curve.
Ablation Simulator โ€” Toggle components
🪞 Reflection

Removing reflection drops performance by ~31 points. Random mutations without diagnosis fail to fix the right things.

🧬 Selection

Without selection pressure, the population drifts. Quality stagnates around 55% because weak prompts aren't eliminated.

📊 Validation Eval

Without a fitness signal, there's no direction. The algorithm performs a random walk: no better than chance after the first generation.

Next: key lessons and limitations. Key Takeaways →
What GEPA Teaches Us
GEPA's success challenges a common assumption: that complex optimization requires complex machinery. Sometimes the model's own reasoning is the best optimizer.
🌱
Simple ≠ Worse

An evolutionary loop with model reflection beats RL-based baselines on 4/4 benchmarks, with no backpropagation and no reward model.

🪞
Models Can Self-Improve

The same model being optimized can diagnose its own failures and write better instructions. No external judge needed.

🔀
Prompts Are Portable

Prompts evolved for one model transfer effectively to other model families. GEPA finds task-level improvements, not model-specific tricks.


Known Limitations
๐Ÿท๏ธ
Requires Ground Truth Labels

GEPA scores prompts by measuring accuracy on a labelled validation set: inputs paired with known correct answers. Without ground truth, there is no fitness signal to guide evolution. This limits GEPA to tasks where correctness is measurable: math, QA, code, structured instruction-following.

โœ๏ธ
Hard to Apply to Open-Ended Tasks

For subjective tasks such as creative writing, summarization, and open-ended dialogue, there is no single correct answer to compare against. GEPA cannot easily evaluate whether one prompt produces "better" creative output than another without a human judge or a separate scoring model, which reintroduces the complexity it was designed to avoid.


When Should You Use GEPA?
Decision Guide: Is GEPA Right for Your Task?

Before vs. After GEPA Optimization
Prompt Quality Over Time
Baseline (no opt.) GEPA optimized

RL isn't always the answer. When a model can articulate why it failed and write a better instruction, you don't need a reward function; you need a conversation. GEPA is that conversation, automated.
