The same model. Two different prompts. Dramatically different results – but only on the tasks that are actually hard. On olympiad math, hardware optimization, and complex multi-hop reasoning, even frontier models leave significant accuracy on the table with generic prompts.
Why not just use ChatGPT with a simple prompt?
For everyday questions, you're right – modern LLMs already handle them well. GEPA targets a different class of problems: hard, specialized tasks where even frontier models fall short without the right instructions. The paper tests on NPUEval (AMD hardware kernel optimization), AIME 2025 (olympiad math), and HotpotQA (multi-hop reasoning). On NPUEval, GPT-4o with a generic prompt achieves only 4–19% vector utilization. With a GEPA-evolved prompt, the same model reaches 30.52% – no fine-tuning, no new weights, just better instructions discovered automatically.
✗ Baseline Prompt (GPT-4o on NPUEval)
Write an optimized kernel for the
following operation on AMD hardware:
{task}
Task (from paper's NPUEval benchmark)
"Implement a matrix multiply kernel for the AMD XDNA2 neural processing unit."
Result
Vector utilization: 4–19%. The model writes generic code with no awareness of the XDNA2 tile architecture, memory-layout constraints, or compiler intrinsics. Paper result: GPT-4o baseline.
✓ GEPA-Optimized Prompt (same model)
Write an optimized kernel for AMD
XDNA2. Requirements:
- Use AIE tile vector intrinsics
- Align buffers to 32-byte boundaries
- Prefer ping-pong buffering for
memory latency hiding
- Avoid scalar fallbacks in hot loops
- Verify with compiler profiler output
Task: {task}
Same task
"Implement a matrix multiply kernel for the AMD XDNA2 neural processing unit."
Result
Vector utilization: 30.52%. GEPA reflected on compiler errors and profiling feedback to evolve domain-specific constraints. Paper result: ~1.5–7× improvement over all baselines.
4–19%
GPT-4o baseline (NPUEval)
30.52%
Same model with GEPA prompt
Up to 35×
Fewer rollouts than RL methods
The model already has the capability – it just needs the right instructions to activate it. GEPA's job is to automatically discover those instructions by letting the model read its own failures and write better prompts. No fine-tuning. No reward model. Just reflection.
GEPA has four key components that form a closed loop. Click any node to learn more, then navigate to its interactive section.
Paper Map – Click any component to explore
GEPA vs. RL-based Approaches
Property
RL-based (e.g. GRPO)
MIPROv2 (non-RL optimizer)
GEPA
Weight Updates?
Yes – billions of parameters updated via backprop
No – prompt-only optimizer
No – model weights never touched. Only the prompt text changes.
Complexity
High – requires policy-gradient training
Medium – structured LLM-based search
Low – just LLM calls
Reward Model
Required – must be designed per task
None – uses task accuracy metrics
None – uses model self-reflection
Interpretability
Black-box – why did the policy improve?
Partial – prompts readable, process opaque
Human-readable – read the reflection
Compute
High – gradient computation + backprop
Moderate – more calls than manual prompting
Moderate – inference-only
Transferability
Partial – policy tied to one model
Good – prompts transfer across models
Strong – prompts work across models
vs GEPA (AIME-2025)
Worse – GEPA beats it by ~20%
Worse – GEPA beats it by +12%
Best overall
MIPROv2 was the leading non-RL prompt optimizer prior to GEPA. GEPA outperforms it by +12% on AIME-2025 (verified from the paper abstract). GRPO is the primary RL baseline used in the paper.
Before understanding GEPA, it helps to understand the evolutionary algorithm framework it builds on. Evolutionary algorithms maintain a population of candidates, select the fittest, and breed the next generation.
🧬 Population
Start with N candidate prompts. These might be slight variations of a baseline, or a mix of hand-written options. Each candidate is a complete prompt template that can be evaluated on real tasks.
For GEPA, the initial population is often just one prompt: the simplest possible instruction. The algorithm builds from there.
Prompt A: "Answer: {q}" score: ?
Prompt B: "Solve: {q}" score: ?
Prompt C: "Q: {q} A:" score: ?
📊 Evaluate
Run each prompt on a small validation set (20–50 examples). Score = accuracy on those examples. This is the fitness function – it determines which prompts survive.
Unlike RL, there's no reward model here – just direct task accuracy. This makes the fitness signal clean and interpretable.
Prompt A: "Answer: {q}" score: 0.52 ✓
Prompt B: "Solve: {q}" score: 0.41
Prompt C: "Q: {q} A:" score: 0.38
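The Evaluate step can be sketched in a few lines. This is an illustrative sketch, not the paper's code: `call_llm` is a hypothetical stand-in for any model API, and fitness is plain exact-match accuracy on a labelled validation set.

```python
def fitness(prompt_template, val_set, call_llm):
    """Score a prompt: fraction of validation examples answered correctly."""
    correct = 0
    for question, gold_answer in val_set:
        # Fill the template and run one forward pass (one "rollout").
        answer = call_llm(prompt_template.format(q=question))
        correct += int(answer.strip() == gold_answer)
    return correct / len(val_set)

# Toy run with a fake model that always answers "4".
val_set = [("2+2", "4"), ("3+3", "6")]
score = fitness("Answer: {q}", val_set, lambda prompt: "4")
```

With the fake model above, only the first example matches, so the fitness score is 0.5. In a real run, `call_llm` would wrap an actual model call and `val_set` would hold 20–50 labelled examples.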
🏆 Select
Keep only the top-K prompts by validation score. The rest are discarded. This selection pressure forces the population to improve generation over generation.
GEPA typically keeps the top 2–3 prompts as "survivors" that seed the next generation of mutations.
Prompt A: "Answer: {q}" score: 0.52 ✓ KEPT
Prompt B: "Solve: {q}" score: 0.41 ✗ DISCARDED
Prompt C: "Q: {q} A:" score: 0.38 ✗ DISCARDED
🔁 Mutate
Generate M new prompt variants from the survivors. Traditional EA uses random mutation. GEPA's key innovation: the mutation is guided by the model's reflection on its own failures.
Instead of randomly tweaking words, GEPA asks the LLM: "Why did this prompt fail? How should it be improved?" The answer drives targeted mutations.
Prompt A (parent) "Answer: {q}"
→ reflect on failures → generate children
"Step by step, answer: {q}" new candidate
"Think carefully, then answer: {q}" new candidate
Population Fitness Over Generations
Prompt Fitness Across 8 Generations
Each line = one prompt candidate. Watch the population converge.
Traditional evolutionary algorithms use random mutation: random word swaps, insertions, deletions. GEPA's mutation is guided – the model reads its failure cases and writes targeted improvements. This is what makes GEPA sample-efficient.
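Random mutation, the traditional EA baseline, can be made concrete with a blind word swap – nothing about the edit is informed by failure cases. A minimal illustrative sketch (not from the paper):

```python
import random

def random_mutation(prompt, rng=None):
    """Swap two random words. The mutation knows nothing about why the prompt fails."""
    rng = rng or random.Random()
    words = prompt.split()
    if len(words) >= 2:
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

# A seeded RNG makes the toy example reproducible.
mutated = random_mutation("Solve the problem step by step", random.Random(0))
```

The mutated prompt contains the same words in a shuffled order – it may accidentally help, but it cannot target the specific weakness that caused a failure. That targeting is exactly what GEPA's reflection adds.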
⚠️ Exploration vs. Exploitation – A General EA Trade-off
This is a fundamental challenge in any evolutionary algorithm. Exploration means trying diverse, potentially very different prompt variants; exploitation means refining what already works. Too much exploitation → the population converges early on a locally good but globally suboptimal prompt. Too much exploration → no candidate gets refined enough to reach peak performance.
GEPA addresses this by keeping a small population (top-K=3 survivors), generating several candidates per generation (M=5 mutations), and using reflection-guided mutation – which reduces the cost of exploration because each new candidate is directed, not random. That said, early convergence remains a real risk, especially on harder tasks with smaller validation sets.
GEPA is a closed loop that runs for a fixed number of iterations – called a budget – typically 5–10 generations. "Budget" here means compute budget: each generation costs LLM API calls, so you set a limit upfront. Each generation improves the prompt using the model's own reasoning as a guide.
GEPA Loop – Click any node to inspect
Step 1 – Initialize
Start with a baseline prompt – often the simplest possible instruction, like "Answer: {question}". No manual engineering is required. The algorithm will improve it automatically. The initial prompt forms the seed of the first-generation population.
Step 2 – Evaluate
Run each prompt in the population on a held-out validation set of 20–50 examples. Record which examples each prompt gets right and which it gets wrong. The validation accuracy is the fitness score – no separate reward model is needed.
Step 3 – Reflect
Feed the failed examples to the LLM and ask: "Given these failures, what is wrong with the current prompt? What should it emphasize or clarify?" The model produces a structured reflection identifying the prompt's weaknesses. This step replaces the reward function in RL-based approaches.
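The Reflect step amounts to assembling the failure cases into a single diagnostic query. A sketch under assumptions: `call_llm` is a hypothetical wrapper around any chat-completion API, failures are recorded as (question, model_answer, gold_answer) triples, and the query wording is illustrative rather than the paper's exact template.

```python
def reflect(call_llm, current_prompt, failures):
    """Ask the model to diagnose the prompt from its own failure cases."""
    report = "\n".join(
        f"- Q: {q}\n  model answered: {a}\n  correct answer: {g}"
        for q, a, g in failures
    )
    query = (
        f"Current prompt:\n{current_prompt}\n\n"
        f"Failed examples:\n{report}\n\n"
        "Given these failures, what is wrong with the current prompt? "
        "What should it emphasize or clarify?"
    )
    return call_llm(query)  # natural-language diagnosis, fully readable
```

The return value is plain text, which is what makes the whole loop inspectable: the diagnosis that drives the next mutation can simply be read.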
Step 4 – Mutate
Generate M=5 new prompt candidates using the reflection as a guide. Each candidate attempts to address the identified weaknesses. The LLM writes improved prompts based on its own diagnosis โ the mutations are targeted, not random.
Step 5 – Select
Evaluate all new candidates plus the survivors from the previous generation. Keep the top-K=3 by validation accuracy. The best prompt of this generation becomes the starting point for the next reflection step.
Step 6 – Iterate
Repeat Evaluate → Reflect → Mutate → Select until convergence or the compute budget is exhausted. GEPA typically converges in 5–10 generations. The final best prompt in the population is returned as the optimized output.
Each generation, GEPA reads failure cases and rewrites the prompt. Watch how a generic two-word instruction grows into a precise, task-specific guide.
Generation 0 – Baseline
Solve the problem:
{problem}
Score: 46.6%
Generation 3 – Improving
Solve step by step.
Show all intermediate
values clearly.
Verify your answer.
Problem: {problem}
Score: 53.2%
Generation 6 – Optimized
Solve step by step:
1. Identify formula/method
2. Substitute values
3. Show each calculation
4. Verify result against
all constraints
Problem: {problem}
Score: 60.0% ✓
The prompt text is the only thing that changed. Same model, same weights – but the instructions went from 4 words to a structured checklist, guided entirely by the model's own failure analysis.
One Full Generation – Step by Step
What Happens Inside a Single Generation
We start with the current best prompt. In generation 1, this is the baseline. By generation 5, it already contains reflection-guided improvements.
Current best prompt:
"Solve step by step. Show all intermediate values. Problem: {problem}"
Population size: 3 prompts | Generation: 3 of 10
Run the current best prompt on the validation set. Each example is scored as pass ✓ or fail ✗. The accuracy is the fitness score for this prompt.
✓ Q: "A train travels 60 mph for 2 h. Distance?" → "120 miles" ✓
The model reads the 2 failed examples and diagnoses what the prompt is missing. This is the reflection โ written in natural language, fully readable.
Model Reflection Output
"The prompt asks to show intermediate values, but doesn't specify how to handle geometry problems (where shape properties must be confirmed first) or ratio chains (where each ratio must be normalised before combining). The prompt should explicitly require: (1) confirming shape type and properties before applying formulas, (2) normalising ratios to a common term before computing compound ratios."
Using the reflection, the model generates 3 new prompt candidates. Each addresses the identified weaknesses in a slightly different way.
Candidate A: "Solve step by step. For geometry: confirm shape type first. For ratios: normalise to common term. Show all values. Problem: {problem}"
Candidate B: "Identify problem type (geometry/algebra/ratio). Apply type-specific rules. Show each calculation. Verify result. Problem: {problem}"
Candidate C: "Step-by-step solution: (1) identify formula/method for this problem type, (2) substitute known values, (3) compute carefully, (4) verify. Problem: {problem}"
All 3 new candidates + the 3 existing survivors are evaluated on the validation set. The top 3 by score become the next generation's population.
Candidate A: 80% ✓ KEPT
Candidate B: 75% ✓ KEPT
Previous best: 60% ✓ KEPT
Candidate C: 55% ✗ DROPPED
Other survivors: 50% ✗ DROPPED
Generation complete. Best prompt improved from 60% → 80%. Next generation begins.
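The selection rule in this walkthrough is just a sort-and-truncate over scored candidates. A minimal sketch (the function name and the list-of-pairs representation are assumptions for illustration):

```python
def select_top_k(scored_candidates, k=3):
    """Keep the k highest-scoring prompts.

    scored_candidates is a list of (prompt, score) pairs.
    """
    ranked = sorted(scored_candidates, key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# The pool from the walkthrough: 3 new candidates + 2 previous survivors.
pool = [("Candidate A", 0.80), ("Candidate B", 0.75), ("Previous best", 0.60),
        ("Candidate C", 0.55), ("Other survivor", 0.50)]
survivors = select_top_k(pool, k=3)
```

Applied to the pool above, the survivors are Candidate A, Candidate B, and the previous best – exactly the outcome shown in the walkthrough table.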
Algorithm Pseudocode
GEPA(task, baseline_prompt, budget):              # budget = max number of generations (e.g. 10)
    population = [baseline_prompt]
    for gen in range(budget):                     # each iteration = 1 generation of LLM calls
        scores = evaluate(population, val_set)    # needs ground-truth labels
        failures = get_failures(best(population), val_set)
        refl = reflect(model, failures)           # no reward model
        new_prompts = mutate(model, best(population), refl, M=5)
        population = select_top_k(population + new_prompts, k=3)
    return best(population)
⚠️ Ground Truth Required
GEPA needs a labelled validation set – example inputs paired with known correct answers – to score each prompt. The fitness signal is simply: how many examples did this prompt get right? This means GEPA works best for tasks with clear, measurable correctness: math, QA, code generation, instruction following. For open-ended tasks like creative writing or summarization where there is no single correct answer, applying GEPA is significantly harder.
⚠️ Important: GEPA does NOT update model weights
GEPA is a prompt optimizer, not a training algorithm. The LLM's parameters – all its billions of weights – are completely frozen throughout. What GEPA changes is only the text of the prompt: a string of instructions passed to the model at inference time. The paper describes it as working via "natural language reflection to learn high-level rules" rather than gradient descent. You could run GEPA on any model you only have API access to – like GPT-4o – without ever seeing its weights.
No reward model. No policy gradient. No value function. No backpropagation. Just the model reading its own mistakes and writing better instructions. Every component is a forward-pass LLM call.
Reflection is what separates GEPA from naive evolutionary search. Instead of random mutations, the model diagnoses exactly what the prompt is missing – then writes a fix.
Interactive Reflection Demo
Current Prompt
Answer the question:
{question}
Failed Example
Q: Who directed the 2010 film starring the lead actor of Inception?
Model: "Christopher Nolan"
(✗ skips intermediate reasoning)
Reflection Output
✗ The prompt doesn't ask the model to identify intermediate entities first.
Fix: Require step-by-step entity resolution before answering.
Generation 0
Impact of Reflection on Learning
Random Mutation vs. Reflection-Guided Mutation
Without reflection, mutations are random word swaps – performance is noisy and improvement is slow.
Real Reflection Examples (from paper)
HotpotQA (Multi-hop)
"The prompt does not instruct the model to identify which documents are relevant before answering. Add: First identify relevant facts from each document, then connect them to reach the answer."
AIME (Competition Math)
"The model skips algebraic steps and jumps to a numerical answer. Add: Show every algebraic manipulation. Do not skip arithmetic or combine steps. Verify the answer by substitution."
IFBench (Instruction Following)
"The model does not self-check its output format before responding. Add: Before writing your response, re-read the format constraints. Verify your answer satisfies all of them."
This is why GEPA is interpretable: you can read the reflection and understand exactly why the prompt changed. RL-based methods update weights invisibly; GEPA updates prompts visibly.
Next: see how GEPA performs on real benchmarks → Experiments
Results
Experiments: Does It Actually Work?
GEPA was evaluated on 4 diverse benchmarks against baseline prompting, manual prompt engineering, and RL-based prompt optimization methods.
4/4
Benchmarks won vs RL
+47%
Avg. improvement vs baseline
2
Models tested (open + closed)
Performance by Benchmark
Illustrative – based on paper-reported trends
AIME spotlight (verified from paper): GEPA outperforms MIPROv2 – the leading non-RL prompt optimizer – by +12% on AIME-2025. MIPROv2 uses structured LLM-based search but lacks reflection: it cannot diagnose why a prompt failed. GEPA's reflection mechanism bridges that gap, particularly on tasks requiring precise multi-step reasoning where generic optimization stalls.
Compute Efficiency vs. RL Methods
Rollout Count Comparison (HotpotQA task)
Numbers from paper (HotpotQA with Qwen3-8B)
What is a rollout?
In RL methods (e.g. GRPO)
One rollout = the model generating one response to one input. To estimate a policy gradient reliably, you need thousands of rollouts for the same input – gradient estimation from sparse rewards is statistically noisy and requires many samples to average out.
In GEPA
One rollout = one LLM inference call. Rollouts happen when evaluating prompts on the validation set, generating a reflection, or producing new candidate prompts. GEPA needs far fewer because one reflection call gives a direct, interpretable diagnosis – far more signal per call than noisy gradient estimation.
GEPA achieves better results using 6,438 rollouts vs GRPO's 24,000 rollouts – nearly 4× fewer on this task (the paper's headline figure of up to 35× is the best case across benchmarks). RL needs thousands of samples to estimate gradients reliably; GEPA gets comparable signal from a single reflection call.
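For intuition, here is a back-of-the-envelope rollout count under assumed settings (population of 3 survivors, M=5 mutations, a 50-example validation set, 10 generations). These are illustrative values consistent with the numbers used throughout this article, not the paper's exact accounting:

```python
VAL_SIZE = 50        # validation examples (assumed)
M = 5                # new candidates per generation (assumed)
GENERATIONS = 10     # compute budget (assumed)

# Each generation: score M new candidates on the full validation set,
# plus one reflection call and M prompt-writing calls.
per_generation = M * VAL_SIZE + 1 + M

# One initial evaluation of the baseline, then the generation loop.
total_rollouts = VAL_SIZE + GENERATIONS * per_generation
```

Under these assumptions the run costs a few thousand forward passes – the same order of magnitude as the paper's 6,438 for HotpotQA – while policy-gradient methods burn tens of thousands just to average out gradient noise.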
Cross-Model Prompt Transfer
Optimized prompts transfer across model families
A prompt optimized on GPT-4 retains most of its gains when applied to Qwen3-8B, and vice versa. This suggests GEPA discovers task-level improvements, not model-specific hacks.
Next: see what happens when you remove each component → Ablations
What Matters
What Breaks Without Each Component?
The paper's ablation study isolates each component. Toggle features off to see what happens to the performance curve.
Ablation Simulator โ Toggle components
🪞 Reflection
Removing reflection drops performance by ~31 points. Random mutations without diagnosis fail to fix the right things.
🧬 Selection
Without selection pressure, the population drifts. Quality stagnates around 55% because weak prompts aren't eliminated.
📊 Validation Eval
Without a fitness signal, there's no direction. The algorithm performs a random walk – no better than chance after the first generation.
GEPA's success challenges a common assumption: that complex optimization requires complex machinery. Sometimes the model's own reasoning is the best optimizer.
🌱
Simple ≠ Worse
An evolutionary loop with model reflection beats RL-based baselines on 4/4 benchmarks, with no backpropagation and no reward model.
💪
Models Can Self-Improve
The same model being optimized can diagnose its own failures and write better instructions. No external judge needed.
🌍
Prompts Are Portable
Prompts evolved for one model transfer effectively to other model families. GEPA finds task-level improvements, not model-specific tricks.
Known Limitations
🏷️
Requires Ground Truth Labels
GEPA scores prompts by measuring accuracy on a labelled validation set – inputs paired with known correct answers. Without ground truth, there is no fitness signal to guide evolution. This limits GEPA to tasks where correctness is measurable: math, QA, code, structured instruction-following.
✍️
Hard to Apply to Open-Ended Tasks
For subjective tasks – creative writing, summarization, open-ended dialogue – there is no single correct answer to compare against. GEPA cannot easily evaluate whether one prompt produces "better" creative output than another without a human judge or a separate scoring model, which reintroduces the complexity it was designed to avoid.
When Should You Use GEPA?
Decision Guide โ Is GEPA Right for Your Task?
Before vs. After GEPA Optimization
Prompt Quality Over Time
Baseline (no opt.) | GEPA optimized
RL isn't always the answer. When a model can articulate why it failed and write a better instruction, you don't need a reward function – you need a conversation. GEPA is that conversation, automated.