Visual Summary
GRPO β€” Group Relative Policy Optimization
Why LLMs Need Reinforcement Learning
Pretraining teaches an AI to predict the next word from the internet. But "predict the next word" is very different from "give correct math answers." Reinforcement learning bridges this gap β€” and GRPO does it cheaply and simply.
TL;DR β€” The Paper in One Paragraph

In 2024, DeepSeek introduced GRPO while training DeepSeekMath 7B on mathematical reasoning. Their key insight: instead of training a separate "judge" model (the critic in PPO) to evaluate each answer, just generate 8 answers for the same question and compare them to each other. Better-than-average answers get reinforced, worse-than-average get penalized. This cut memory usage roughly in half while matching PPO's performance β€” and the same idea later powered DeepSeek-R1, one of the most influential reasoning models of 2024–2025.

51.7%
MATH benchmark after GRPO (was 46.8%)
~50%
Memory saved vs PPO (no critic model)
G=8
Answers generated per question (typical)
Ξ²=0.04
KL penalty used in original paper
The Gap Between Pretraining and Being Helpful

A pretrained LLM learns from trillions of tokens of internet text. It becomes excellent at one thing: predicting the statistically likely next word. But this creates a gap:

πŸ“š
What Pretraining Teaches
"Given 'The capital of France is', predict the next word." The model learns statistical patterns β€” common word sequences, facts, styles β€” from the web.
🎯
What We Actually Need
"Solve this math problem correctly." "Follow these instructions." "Give a safe answer." These require outcomes to be correct, not just statistically plausible.

Reinforcement learning closes this gap: the model tries answers, gets scored on correctness, and learns to produce better answers over time.

The 3-Stage Pipeline: From Pretrained to Aligned
Stage 1 β€” Pretraining
Train on 10T+ tokens of web text. Model learns language, facts, reasoning patterns. Weeks on thousands of GPUs.
Stage 2 β€” SFT
Supervised fine-tuning on curated instruction-answer pairs. Teaches the right format and style of responses.
Stage 3 β€” RL (GRPO)
Trial and error: generate answers, score them, reinforce the good ones. This is where GRPO lives.
Why Math? The Verifiable Reward Advantage

GRPO works especially well for tasks where correctness can be automatically verified. You don't need a human or a reward model to judge β€” just check the answer.

Easy to verify automatically
πŸ”’ Math β€” compare to known answer
πŸ’» Code β€” run the tests
🌍 Factual questions β€” lookup
β™Ÿ Games / puzzles β€” rule checker
Hard β€” needs human / reward model
✍️ Creative writing β€” subjective
🀝 Open-ended chat β€” preference
🎨 Style / tone β€” personal taste
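A verifiable reward can be as simple as a string comparison. Below is a minimal sketch of a rule-based math verifier; the regex-based answer extraction is an illustrative assumption (production verifiers normalise fractions, LaTeX, units, and so on):

```python
import re

def math_reward(completion: str, correct_answer: str) -> float:
    """Rule-based reward: 1.0 if the last number in the completion
    matches the known answer, else 0.0 (simplified sketch)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if numbers and numbers[-1] == correct_answer:
        return 1.0
    return 0.0

print(math_reward("2 + 2*6 = 2 + 12 = 14", "14"))  # 1.0
print(math_reward("2 + 2*6 = 4*6 = 24", "14"))     # 0.0
```

Because the check is pure string logic, it costs no GPU memory at all β€” exactly the "rule-based reward" setting GRPO thrives in.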
Everyday Analogy

"Pretraining is like reading every book in a library β€” you absorb enormous knowledge. But knowing facts doesn't mean you can pass an exam. Reinforcement learning is like actually taking practice exams and learning from your mistakes. GRPO is a clever way to run those practice exams cheaply β€” have the student answer each question 8 times, then learn from which answers were above-average."

The "Aha Moment" β€” Emergence Without Being Trained For It

DeepSeek-R1-Zero (trained with only GRPO, no SFT) spontaneously developed self-verification. Nobody programmed it β€” the model discovered that double-checking answers leads to higher reward, and its responses visibly change character over the course of training.
The existing RL method, PPO, works but is expensive. So what exactly is the problem with PPO?
PPO: Powerful but Expensive
Proximal Policy Optimization (PPO), introduced in 2017, was the default RL algorithm for training LLMs. It works well but requires maintaining four separate models in memory simultaneously β€” a crippling cost at scale.
PPO's 4-Model Architecture

To train one model, PPO requires four models loaded in GPU memory at the same time: the policy being trained, a frozen reference model, a reward model, and a critic.

Each 1B parameters β‰ˆ 2 GB in bfloat16. The critic model is the same size as the policy model, effectively doubling the cost.

PPO Step by Step β€” What Makes It Complex
1
Generate one response per prompt
Policy model generates a single answer for each training question.
2
Score with reward model
A frozen reward model (itself a large LLM) gives a scalar quality score to the response.
3
Estimate value with critic model ← The expensive part
A second large model (the critic, same size as the policy) predicts per-token "how good is the trajectory from here?" values. This model must be trained alongside the policy β€” adding another gradient update step.
4
Compute advantage with GAE
Generalised Advantage Estimation (GAE) combines critic values and rewards into a per-token advantage. Requires tuning two additional hyperparameters (Ξ³, Ξ»).
5
Update policy with clipped objective
Apply KL-penalised PPO loss. Typically 2–4 gradient updates per batch using stored old-policy logits.

The core problem: The critic model needs to estimate "how good will this sequence be?" for every token. But LLM responses are graded as a whole (is the final answer correct?), not token by token. Training a critic for this task is noisy, expensive, and adds complexity. GRPO asks: do we really need the critic?

PPO vs GRPO β€” Simulated Training Trajectory

PPO's critic takes time to warm up, so early training steps are noisy; GRPO converges more smoothly because the group baseline is always fresh.

GRPO's answer: replace the critic with a group of completions.
Group Sampling: The Key Insight
Instead of generating one answer and asking a critic "how good was that?", GRPO generates G answers for the same question and asks "which ones were above average?" The group is the baseline β€” no critic needed.
The Rule in Plain English

For every training question: generate G different answers using the current model. Score each one. Compare scores within the group. Reward the above-average answers, penalise the below-average ones.
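One group-sampling round can be sketched in a few lines. The `fake_generate` function is a stand-in assumption (a real setup samples completions from the LLM) so the example runs anywhere:

```python
import random

random.seed(0)  # deterministic for the example

def fake_generate(prompt: str) -> str:
    # Stand-in for sampling from the policy (assumption): answers
    # "2 + 2 * 6" correctly ~60% of the time, else forgets precedence.
    return "14" if random.random() < 0.6 else "24"

G = 8
group = [fake_generate("Calculate 2 + 2 * 6") for _ in range(G)]
rewards = [1.0 if answer == "14" else 0.0 for answer in group]
baseline = sum(rewards) / G   # the group mean replaces PPO's critic
print(rewards, "baseline:", baseline)
```

Answers scoring above `baseline` will be reinforced; those below will be suppressed.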

Example: Group Sampling in Action

Question: Calculate 2 + 2 Γ— 6  (correct answer = 14, order of operations applies)

Why Group Size G Matters
G = 2 β€” Unstable
Only 1 comparison. If both are right or both wrong, advantage is 0 for all. Wastes training steps. High variance.
G = 8–16 β€” Sweet spot
Enough diversity to get reliable estimates of which answers are better. Used by DeepSeekMath (G=8) and DeepSeek-R1.
G = 64+ β€” Diminishing returns
Better statistics but GΓ— memory and compute cost for generation. Usually not worth it past G=32.

πŸ€” Thought Experiment: What if G = 1?

With G=1 you have only a single answer. Let's compute the advantage:

mean(r) = r₁
std(r) = 0                  ← only one value, no spread
A₁ = (r₁ βˆ’ r₁) / (0 + Ξ΅) = 0

With one completion every advantage is exactly 0 β€” regardless of whether the answer was right or wrong. Loss = 0, gradient = 0, nothing is learned. You mathematically need at least G = 2 to get any signal. In practice G = 8–16 provides stable, informative baselines.
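The zero-signal arithmetic above can be checked directly. This sketch assumes the population standard deviation with the Ξ΅ guard, matching the advantage formula:

```python
def advantages(rewards, eps=1e-8):
    # Group-relative advantage: z-score each reward within its group
    # (population std; eps guards against division by zero).
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

print(advantages([1.0]))                 # [0.0]  G=1: no signal at all
print(advantages([1.0, 0.0]))            # β‰ˆ [+1.0, -1.0]
print(advantages([1.0, 1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0, 0.0]  all same: no signal
```

The last line also shows the "all same reward" failure mode: identical rewards give zero advantages no matter how large G is.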

Sampling Temperature β†’ Group Diversity β†’ Learning Signal

When generating G completions, the sampling temperature controls how varied the answers are β€” and variety directly determines how much the model can learn. At very low temperature the completions are near-identical (std β‰ˆ 0, no signal), while moderate temperatures around T = 0.8 produce a productive mix of right and wrong answers.
Reward Hacking: When the Model Games the Reward

GRPO optimises reward β€” but what if the reward function can be gamed? Two very different strategies, say genuine step-by-step reasoning versus a memorised shortcut, can both earn reward = 1 for a correct final answer, and GRPO cannot tell them apart.
The Difficulty Sweet Spot

GRPO needs tasks that the model sometimes gets right and sometimes gets wrong. If every answer is correct (too easy) or every answer is wrong (too hard), std = 0 and every advantage is 0: no gradient, no learning.
GRPO vs PPO β€” Models Required

GRPO eliminates the critic entirely. The reference model is kept frozen (inference only) to compute the KL penalty. The reward model may be a rule-based function (e.g. "is the answer correct?") requiring no GPU memory beyond the policy itself.

Everyday Analogy

"Imagine you're a teacher grading essays without a rubric. PPO's approach: hire a specialist (critic) to predict how good each essay will be before you've read it. GRPO's approach: have each student write 8 drafts, then rank them. Essays above the group average pass; below-average fail. No specialist needed β€” the group itself is the standard."

Once we have G scored answers, how do we compute the advantage signal?
Advantage: How Much Better Than Average?
Each completion gets a reward score (e.g. 1 = correct, 0 = wrong). The advantage normalises these scores within the group: how many standard deviations above or below the group mean was this answer? Positive = reinforce, negative = suppress.
The Formula
        r_i βˆ’ mean(r₁, rβ‚‚, …, r_G)
A_i = ──────────────────────────────
        std(r₁, rβ‚‚, …, r_G) + Ξ΅

r_i = reward for completion i  |  Ξ΅ = 1e-8 (prevents division by zero)  |  A_i is the same for every token in completion i

Worked Example: Six Completions, Advantages Computed

Take six completions with rewards 1.0, 1.0, 0.0, 0.0, 0.0 and 0.5. The group mean is β‰ˆ 0.42 and the standard deviation β‰ˆ 0.45, so the two correct answers get advantage β‰ˆ +1.30, the three wrong answers β‰ˆ βˆ’0.93, and the partial-credit answer β‰ˆ +0.19.
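Plugging the six reward values above straight into the advantage formula:

```python
rewards = [1.0, 1.0, 0.0, 0.0, 0.0, 0.5]   # the six rewards above
G = len(rewards)
mean = sum(rewards) / G
std = (sum((r - mean) ** 2 for r in rewards) / G) ** 0.5
adv = [(r - mean) / (std + 1e-8) for r in rewards]
print(f"mean={mean:.2f}  std={std:.2f}")   # mean=0.42  std=0.45
for r, a in zip(rewards, adv):
    print(f"reward={r:.1f}  advantage={a:+.2f}")
```

Note that advantages within a group always sum to (approximately) zero: reinforcement of the winners is exactly balanced by suppression of the losers.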
What the Advantage Signal Does
A > 0 (above average)
The model is nudged to increase the probability of producing this kind of answer. Tokens in this completion get positive gradient signal.
β†’ "Do more of this"
A < 0 (below average)
The model is nudged to decrease the probability of producing this kind of answer. Tokens in this completion get negative gradient signal.
β†’ "Do less of this"
Key insight: When all completions score identically (e.g., all correct or all wrong), all advantages are 0 β€” no gradient signal, no learning. This is why diverse groups (some correct, some wrong) are essential for training to make progress.
⚠ The "All Same Reward" Problem

If the task is too easy (model gets every answer right, all rewards = 1) or too hard (all wrong, rewards = 0), std(r) = 0 and all advantages = 0. No learning happens. GRPO requires problems at the right difficulty β€” hard enough that the model sometimes gets it wrong, easy enough that it sometimes gets it right. This is called the "difficulty sweet spot."

GRPO's Algorithmic Ancestors: From REINFORCE to GRPO

GRPO didn't appear from nowhere β€” it's a natural evolution of classic policy-gradient methods, each with its own answer to the baseline problem: REINFORCE uses raw returns (high variance), REINFORCE-with-baseline subtracts an average reward, PPO trains a critic, and GRPO uses the group mean.
What Advantages Look Like Across Many Batches

Because we z-score normalise within each group, advantages across many training batches are approximately standard-normal β€” centred at 0 with a spread of ~1. This is by design.

Now we have advantage values. How do we use them to update the model?
The Loss: 3 Components Working Together
The GRPO loss combines three ideas: a policy ratio (how much has the model changed?), a clip (don't change too much), and a KL penalty (don't drift too far from the original model). Together they make training stable.
The Complete GRPO Loss
L = βˆ’ min( r_t(ΞΈ) Β· A_i , clip(r_t(ΞΈ), 1βˆ’Ξ΅, 1+Ξ΅) Β· A_i ) + Ξ² Β· KL(Ο€_ΞΈ β€– Ο€_ref)

where:
  r_t(ΞΈ) = Ο€_ΞΈ(token_t) / Ο€_ΞΈ_old(token_t)    ← policy ratio per token
  A_i = (r_i βˆ’ mean(r)) / std(r)               ← group-relative advantage
  Ξ΅ = 0.2   (clip range, typical value)
  Ξ² = 0.04  (KL penalty weight, from the DeepSeekMath paper)

This loss is computed per token, then averaged across all tokens in all completions in the batch.

Token-by-Token: Where the Loss Comes From

The GRPO loss is computed for every token in every completion. The answer "The answer is 14." spans six tokens, and each token contributes its own term to the total loss.
Component 1 β€” Policy Ratio: Measuring How Much the Model Changed

For every token in every completion, we ask: how much more (or less) likely does the new policy consider this token compared to the old policy?

r_t(ΞΈ) = Ο€_ΞΈ(token_t | context) / Ο€_ΞΈ_old(token_t | context)
r = 1.0
New policy identical to old. No change for this token.
r = 1.5
New policy 50% more likely to produce this token.
r = 0.5
New policy 50% less likely to produce this token.
Example: Policy Ratio in Action

Suppose the old policy assigned probability Ο€_old = 0.30 to a token and the new policy assigns Ο€_ΞΈ = 0.45. The ratio is r_t = 0.45 / 0.30 = 1.5 β€” outside the clip range [0.8, 1.2] when Ξ΅ = 0.2, so clipping will cap this token's contribution.
Component 2 β€” Clipping: The Safety Belt

Without clipping, a high advantage could cause a massive update β€” the model overcorrects and becomes unstable. Clipping caps the policy ratio to [1βˆ’Ξ΅, 1+Ξ΅], limiting how much the model can change per step.
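The clipped surrogate is a one-liner; here it is sketched in objective form (higher is better; the loss negates it):

```python
def clipped_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    # PPO/GRPO clipped surrogate for one token.
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: the gain is capped once the ratio exceeds 1 + eps,
# so there is no incentive to push this token's probability any higher.
print(clipped_objective(1.5, +1.0))   # 1.2, not 1.5
# Inside the trust region nothing changes:
print(clipped_objective(1.1, +1.0))   # 1.1
# Negative advantage, ratio already below 1 - eps: the clipped branch is
# selected, and since it is constant in theta, the gradient vanishes.
print(clipped_objective(0.5, -1.0))   # -0.8
```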

Component 3 β€” KL Penalty: Staying Close to Home

KL divergence measures how far the trained model has drifted from the original SFT model. A penalty keeps the model's "personality" stable β€” it becomes better at math without becoming incoherent in everything else.

Per-token estimate:  KL(Ο€_ΞΈ β€– Ο€_ref) β‰ˆ log Ο€_ΞΈ(token) βˆ’ log Ο€_ref(token)
DeepSeekMath uses an improved, lower-variance estimator:
  KL β‰ˆ exp(log Ο€_ref βˆ’ log Ο€_ΞΈ) βˆ’ (log Ο€_ref βˆ’ log Ο€_ΞΈ) βˆ’ 1
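Both estimators are two-liners. The sketch below contrasts the naive difference-of-logs estimate with the always-nonnegative estimator from the paper; the example token probabilities are assumptions:

```python
import math

def kl_k1(logp_theta: float, logp_ref: float) -> float:
    # Naive per-token estimate: can be negative, higher variance.
    return logp_theta - logp_ref

def kl_k3(logp_theta: float, logp_ref: float) -> float:
    # Improved estimator: exp(d) - d - 1 with d = log pi_ref - log pi_theta.
    # Always >= 0, and zero exactly when the two policies agree.
    d = logp_ref - logp_theta
    return math.exp(d) - d - 1

lt, lr = math.log(0.45), math.log(0.35)  # assumed token probabilities
print("k1:", round(kl_k1(lt, lr), 4), "k3:", round(kl_k3(lt, lr), 4))
```

Since exp(d) βˆ’ d βˆ’ 1 β‰₯ 0 for every d, the per-token penalty can never accidentally reward drift, which is one reason the paper prefers it.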
Putting It All Together: One Training Step
β‘  Sample G completions per prompt using old policy
β‘‘ Score each with reward function (rule-based or model)
β‘’ Compute advantages: A_i = (r_i βˆ’ mean) / std
β‘£ Compute loss = βˆ’min(rΒ·A, clip(r,1Β±Ξ΅)Β·A) + Ξ²Β·KL
β‘€ Backpropagate, update policy weights, repeat
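The five steps can be sketched end to end on a toy problem. Everything here is a deliberately simplified assumption: the "policy" is a single logit over two candidate answers, the KL penalty is approximated by a quadratic pull toward the reference logit, and the clip is inactive because only one update is taken per batch (so the ratio is always 1):

```python
import math
import random

random.seed(1)

def softmax2(logit: float) -> dict:
    """Toy two-answer policy: P("14") = sigmoid(logit)."""
    p = 1 / (1 + math.exp(-logit))
    return {"14": p, "24": 1 - p}

def grpo_step(logit: float, G: int = 8, lr: float = 1.0,
              beta: float = 0.04, logit_ref: float = 0.0) -> float:
    probs = softmax2(logit)
    # β‘  sample G completions from the current policy
    group = random.choices(list(probs), weights=list(probs.values()), k=G)
    # β‘‘ rule-based reward: 1 if the answer is correct
    rewards = [1.0 if a == "14" else 0.0 for a in group]
    # β‘’ group-relative advantages
    mean = sum(rewards) / G
    std = (sum((r - mean) ** 2 for r in rewards) / G) ** 0.5
    advs = [(r - mean) / (std + 1e-8) for r in rewards]
    # β‘£β€“β‘€ policy-gradient update (clip inactive: single update per batch),
    # with the KL penalty approximated by a pull toward the reference logit.
    grad = 0.0
    for a, adv in zip(group, advs):
        # d log P(a) / d logit for the two-way softmax
        dlogp = (1 - probs["14"]) if a == "14" else -probs["14"]
        grad += adv * dlogp
    grad = grad / G - beta * (logit - logit_ref)
    return logit + lr * grad

logit = 0.0
for _ in range(50):
    logit = grpo_step(logit)
print("P(correct answer):", round(softmax2(logit)["14"], 3))
```

Even this toy version shows the mechanism: completions scoring above the group mean push the logit up, and the probability of the correct answer climbs well above its initial 50%.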
KL Divergence Over Training Time: Why the Penalty Is Necessary

Without the KL penalty, the policy drifts further and further from the reference model over training; with Ξ² = 0.04 the divergence stabilises. Higher Ξ² pins the policy tighter to the reference but slows learning.
GRPO's Blind Spot: The Credit Assignment Problem

GRPO assigns the same advantage to every token in a completion β€” good or bad. If the final answer is correct, even filler tokens like "Let me think…" get reinforced; PPO's per-token critic, by contrast, can assign different credit to different tokens.
Outcome Reward vs Process Reward (PRM)

Standard GRPO uses outcome rewards β€” only the final answer is scored. Process reward models (PRMs) instead score each intermediate reasoning step, injecting signal throughout the reasoning trace rather than only at the end.

Worked Example: GRPO Loss Computed Step by Step

Let's trace the full loss for the token "4" in the correct answer "The answer is 14." β€” with real numbers:

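The interactive walkthrough is not reproduced here, so the sketch below traces one token's loss with assumed numbers: Ο€_old = 0.30, Ο€_ΞΈ = 0.45, Ο€_ref = 0.35, and an assumed advantage of +1.30 (a completion well above its group mean):

```python
import math

# Assumed numbers for one token ("4") in a correct completion:
pi_old = 0.30      # old policy's probability for this token (assumption)
pi_new = 0.45      # new policy's probability (assumption)
pi_ref = 0.35      # frozen reference model's probability (assumption)
A = 1.30           # group-relative advantage of this completion (assumption)
eps, beta = 0.2, 0.04

# Step 1: policy ratio
ratio = pi_new / pi_old                      # 1.5
# Step 2: clipped surrogate
clipped = max(1 - eps, min(1 + eps, ratio))  # 1.2 (ratio exceeds 1 + eps)
surrogate = min(ratio * A, clipped * A)      # min(1.95, 1.56) = 1.56
# Step 3: per-token KL (improved estimator from the paper)
d = math.log(pi_ref) - math.log(pi_new)
kl = math.exp(d) - d - 1
# Step 4: per-token loss = negated surrogate + KL penalty
loss = -surrogate + beta * kl
print(f"ratio={ratio:.2f} surrogate={surrogate:.3f} kl={kl:.4f} loss={loss:.4f}")
```

The loss is negative because the token is being reinforced: gradient descent on it increases this token's probability, but only up to the clip ceiling.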
Everyday Analogy for the Full Loss

"The loss is a three-way contract with the student (AI): Clip = 'You can change your opinions, but not by more than 20% in any single lesson.' Advantage = 'Focus changes on the answers that were above or below your class average.' KL penalty = 'Don't change your personality so drastically that you forget how to speak English while learning math.'"

How well did it work? And who uses GRPO today?
GRPO in the Wild
GRPO's debut in DeepSeekMath was impressive. Its real breakthrough came when DeepSeek used the same idea in DeepSeek-R1 β€” a reasoning model that matched OpenAI o1 at a fraction of the cost, triggering an industry-wide shift toward GRPO-style training.
DeepSeekMath: Before and After GRPO

GRPO added ~5pp on MATH and ~5pp on GSM8K over a strong SFT baseline β€” with no increase in model size.

The GRPO Family Tree β€” Click Any Model

GRPO spawned an entire lineage of reasoning models, from DeepSeek-R1 and its distilled variants to community models trained with TRL's GRPOTrainer.
PPO vs GRPO β€” Full Comparison
Property: PPO | GRPO
Models in memory: PPO 4 (policy, reference, reward, critic) | GRPO 2 (policy + reference)
Critic model: PPO required (same size as policy) | GRPO not needed
Advantage estimation: PPO GAE (per token, trained critic) | GRPO group normalisation (simple statistics)
Completions per prompt: PPO 1 | GRPO G (typically 8–16)
Memory footprint (7B model): PPO ~60 GB+ | GRPO ~30 GB
Hyperparameters: PPO many (Ξ³, Ξ», critic LR, clip range, KL weight…) | GRPO few (G, Ξ΅, Ξ²)
Best suited for: PPO general RLHF (open-ended tasks) | GRPO verifiable tasks (math, code)
Who Uses GRPO Today (2024–2025)
πŸ”’
DeepSeek-R1
The model that shocked the industry in Jan 2025. Pure GRPO training with rule-based rewards matched OpenAI o1 at a fraction of the compute cost.
πŸ€—
TRL / Hugging Face
GRPOTrainer in the TRL library made GRPO accessible to everyone. Community fine-tuning of Qwen, Llama, Phi models with GRPO is now routine.
πŸŽ“
Research frontier
DAPO, REINFORCE++, Dr. GRPO are active 2025 papers improving GRPO's stability and extending it to non-binary rewards.
Open Questions (Active Research)
Does GRPO work for open-ended generation (not just verifiable tasks)? β–Ύ
Partially. When rewards are binary (right/wrong), GRPO works cleanly. For soft rewards (a learned reward model scoring helpfulness), the group normalisation still works but can be noisier. Active research (e.g., DAPO) addresses this with leave-one-out baselines and dynamic sampling.
What happens when the reward function is imperfect? β–Ύ
Reward hacking. If the reward function can be "gamed" (e.g., always outputting the correct number without the working), the model will find and exploit it. The KL penalty slows this down but doesn't eliminate it. Careful reward design is as important as the algorithm itself.
Is GRPO better than PPO in all cases? β–Ύ
Not necessarily. PPO's critic model, despite its cost, gives per-token credit assignment β€” GRPO assigns the same advantage to every token in a completion. For long completions, early tokens that led to a wrong answer get the same signal as later tokens. Some research (Dr. GRPO) proposes fixes for this.
Can GRPO train from scratch (no SFT step)? β–Ύ
DeepSeek-R1-Zero showed yes β€” but the model develops strange formatting and language mixing without SFT. The standard pipeline remains: pretrain β†’ SFT β†’ GRPO. The SFT step teaches the model to produce structured responses; GRPO then optimises correctness.
GRPO Hyperparameters at a Glance

The three knobs interact; together they shape convergence speed and final performance:

Ξ΅ (clip range, typically 0.20) β€” controls the maximum update size. Too small β†’ slow; too large β†’ unstable.
Ξ² (KL weight, typically 0.04) β€” controls the drift penalty. 0 = free drift; high = slow but safe.
G (group size, typically 8) β€” more completions give a better baseline but cost more compute.

The big takeaway: GRPO replaced a complex, expensive critic model with a simple statistical idea β€” compare answers within a group. That single insight cut the memory cost of RL training in half, enabled training on consumer hardware, and ultimately powered DeepSeek-R1's 2025 breakthrough. Sometimes the best algorithm is the one you can actually run.