Pretraining ⟶ SFT ⟶ Reward Model ⟶ PPO / RL (or DPO) ⟶ Aligned LLM
Post 44 · Training & Alignment
The LLM Training Pipeline
How a raw language model goes from predicting text on the internet to a helpful, harmless, honest assistant — in three stages.
Stage 1
Pre-training
Learn language from raw text
📚 Trillions of tokens
💻 Months on thousands of GPUs
🧠 Next-token prediction
💰 $1M – $100M+ compute
➜
Stage 2
Supervised Fine-tuning
Learn to follow instructions
📝 10K – 1M instruction pairs
⏱ Hours to days
💬 Chat format training
💸 Much cheaper than pre-training
➜
Stage 3
Alignment (RLHF)
Align with human values
👨⚖️ Human preference labels
🎯 Reward model training
🔄 PPO reinforcement loop
⚖ KL divergence constraint
The Base Model
After pretraining, the model knows grammar, facts, reasoning patterns, and code. But it doesn't know how to have a conversation. Ask it a question and it might continue your text, not answer it.
After SFT
The model learns to respond to instructions — it now produces answers, not text completions. But it may still produce harmful, dishonest, or unhelpful responses because it hasn't learned human preferences.
After RLHF
The model is nudged toward responses that humans rate highly: helpful, harmless, and honest. It learns to refuse harmful requests, hallucinate less often, and match human communication preferences.
Key papers: InstructGPT (Ouyang et al., 2022) — the paper that introduced this 3-stage pipeline publicly. Llama 2 (Touvron et al., 2023) — the most detailed public description of SFT + RLHF. Constitutional AI (Bai et al., 2022) — Anthropic's variant.
Framework
What Changes at Each Stage
The objective, data, and model behaviour are fundamentally different at each stage.
| Dimension | Stage 1: Pretraining | Stage 2: SFT | Stage 3: RLHF |
|---|---|---|---|
| Objective | Minimise next-token prediction loss | Minimise cross-entropy on (instruction, response) pairs | Maximise reward while staying close to SFT policy |
| Data | Raw web text, books, code (trillions of tokens) | Human-written (instruction, response) pairs (thousands–millions) | Human preference rankings of model outputs (thousands–tens of thousands) |
| Loss function | Cross-entropy over all tokens | Cross-entropy on response tokens only | PPO objective + KL penalty (or DPO loss) |
| What model learns | Language, facts, reasoning, code | Instruction-following format and style | Human value alignment, safety, tone |
| Compute | Dominant cost ($1M–$100M+) | ~1–5% of pretraining cost | ~1–10% of pretraining cost |
| Output quality | Coherent text — not conversational | Helpful responses — not always safe | Helpful, harmless, and honest |
| Real example | GPT-3 base, Llama-2-base | Alpaca, Vicuna, Llama-2-chat (SFT only) | ChatGPT, Claude, Llama-2-chat (full) |
The foundation. A randomly-initialised neural network learns grammar, facts, reasoning, code, and world knowledge by predicting the next word — billions of times.
The Core Objective
Given tokens x₁, x₂, …, xₙ, predict xₙ₊₁.
Loss: L = −(1/N) Σ log P(xₜ | x₁…xₜ₋₁)
Minimising this loss forces the model to learn everything needed to predict text well — meaning it must internalise grammar, facts, reasoning, and style from the training data.
Why it works: There is no shortcut to predicting the next word well. The model must build internal representations of syntax, semantics, world knowledge, and even theory of mind — because all of these improve predictions.
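As a rough illustration of the loss above, the sketch below computes the average negative log-likelihood for a toy 4-token vocabulary. The logits and target ids are made-up values, and the helper names (`softmax`, `next_token_loss`) are hypothetical, not taken from any particular library.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, target_ids):
    """Average negative log-likelihood of the true next tokens."""
    nll = 0.0
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        nll -= math.log(probs[target])
    return nll / len(target_ids)

# Toy 4-token vocabulary, 3 prediction positions (all values made up).
logits = [
    [2.0, 0.5, 0.1, 0.1],   # confident and correct (target 0)
    [0.1, 0.2, 3.0, 0.3],   # confident and correct (target 2)
    [0.0, 0.0, 0.0, 0.0],   # uniform: the model has no idea
]
targets = [0, 2, 3]

loss = next_token_loss(logits, targets)
perplexity = math.exp(loss)  # exp(loss) is the usual perplexity measure
print(f"loss={loss:.3f} perplexity={perplexity:.3f}")
```

A model that spreads probability uniformly over V tokens scores a loss of log V at that position, so any structure the model learns shows up as loss below that baseline.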
Key Architecture: Transformer Decoder
Input tokens → Token Embeddings + Positional Encodings
↓
× N Transformer Blocks:
Causal Self-Attention (can't look forward)
Layer Norm
Feed-Forward Network (MLP)
Residual connections
↓
Final Layer Norm
↓
Linear + Softmax → Probability over vocab
The causal mask is what makes this "autoregressive" — each token can only attend to previous tokens, forcing left-to-right prediction.
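The causal mask can be sketched in a few lines. This is a minimal, pure-Python illustration rather than how a real framework implements attention: the function names are hypothetical and the attention scores are toy values.

```python
import math

def causal_mask(T):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(T)] for i in range(T)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores, ignoring masked-out positions."""
    visible = [s for s, keep in zip(scores, mask_row) if keep]
    m = max(visible)  # stability shift, computed over visible positions only
    exps = [math.exp(s - m) if keep else 0.0 for s, keep in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

T = 4
mask = causal_mask(T)
# Toy attention scores: every query scores every key equally.
scores = [[0.0] * T for _ in range(T)]
weights = [masked_softmax(scores[i], mask[i]) for i in range(T)]

for row in weights:
    print(row)
```

With equal scores, position 0 puts all its weight on itself, position 1 splits it over two tokens, and so on. Future positions always receive exactly zero weight.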
Stage 1 — Data
The Pretraining Data Pipeline
Getting clean, diverse, high-quality text at scale is arguably harder than training the model itself.
Data Sources & Mix
Approximate mix based on GPT-3, Llama-2, and similar models. Exact ratios are proprietary and differ across models.
Data Cleaning Pipeline
1. Web Crawl
CommonCrawl supplies petabytes of raw HTML from the public web; C4 and RefinedWeb are filtered corpora derived from it. Starting point for most models.
2. Deduplication
Exact and near-duplicate removal, typically with MinHash LSH. Duplicates lower the training loss without teaching the model anything new.
3. Quality Filtering
Classifier-based filtering (trained on curated text like Wikipedia). Rule-based heuristics remove spam, boilerplate, and low-information text.
4. Tokenisation
BPE (Byte-Pair Encoding) or SentencePiece tokeniser trained on the cleaned corpus. Vocabulary size typically 32K–100K tokens.
5. Packing & Batching
Documents concatenated and packed into fixed-length context windows (2K–128K tokens). Special tokens mark document boundaries.
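The final packing step can be sketched as follows. This is a simplified illustration: `pack_documents` is a hypothetical helper, the token ids are toy values, and dropping the trailing partial window is one possible choice (real pipelines may pad or carry the remainder over instead).

```python
def pack_documents(docs, context_len, eos_id):
    """Concatenate tokenised documents, separated by eos_id, and cut the
    stream into fixed-length training sequences. The trailing partial
    sequence is dropped in this sketch."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)  # marks the document boundary
    n_full = len(stream) // context_len
    return [stream[i * context_len:(i + 1) * context_len] for i in range(n_full)]

# Toy token ids; 0 is a hypothetical end-of-document token.
docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
sequences = pack_documents(docs, context_len=4, eos_id=0)
print(sequences)
```

Note that the second window straddles two documents. Packing avoids wasting compute on padding, at the cost of windows that mix unrelated documents.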
2T · Tokens in the Llama-2 pretraining corpus
175B · GPT-3 parameters (trained on 300B tokens)
~72K · A100 GPU days for Llama-2 70B (about 1.7M GPU hours)
~30% · Data retained after quality filtering
Stage 1 — Mechanism
Next-Token Prediction in Depth
The elegantly simple objective that produces surprisingly capable models.
The Training Loop
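# Simplified pretraining loop (PyTorch-style; model, dataloader, optimizer defined elsewhere)
import torch.nn.functional as F

for batch in dataloader:
    tokens = batch["input_ids"]        # [B, T]
    inputs = tokens[:, :-1]            # x₁…xₙ₋₁
    targets = tokens[:, 1:]            # x₂…xₙ
    logits = model(inputs)             # [B, T-1, vocab]
    # Flatten so every token position counts as one classification example
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()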
Key insight: Each forward pass produces a training signal at every token position, and a transformer computes all of them in parallel rather than stepping through the sequence one token at a time. This is why transformers use training hardware so much more efficiently than RNNs.
What the Model Actually Learns
Syntax & Grammar
To predict the next word correctly, the model must understand sentence structure. Grammatically incorrect predictions score poorly → grammar is learned implicitly.
World Knowledge
"The Eiffel Tower is located in ___" → model must know geography to predict "Paris". All factual knowledge is encoded in weights through the prediction objective.
Reasoning Patterns
Mathematical proofs, logical arguments, and code follow structured patterns. Predicting the next token in a proof requires understanding the reasoning structure leading up to it.
Stage 1 — Scaling
Scaling Laws & Chinchilla
How much data and compute do you need? The Chinchilla paper (Hoffmann et al., 2022) answered this — and changed how the entire industry trains LLMs.
Before Chinchilla: Over-parameterised Models
GPT-3 mistake: 175B parameters trained on only 300B tokens (~1.7 tokens per parameter). Chinchilla showed this is deeply suboptimal — you'd do better with a much smaller model trained on more data.
The Chinchilla Rule
Optimal training tokens ≈ 20 × N, where N = number of model parameters
e.g. 7B params → train on 140B tokens
70B params → train on 1.4T tokens
Chinchilla (70B params, 1.4T tokens) outperformed Gopher (280B params, 300B tokens) — a 4× larger model trained on far fewer tokens — on nearly every benchmark. Smaller + more data wins.
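The arithmetic behind the rule can be sketched as a small calculator. One assumption here is not stated in the text above: training compute is approximated as C ≈ 6·N·D FLOPs (N parameters, D tokens), a common rule of thumb. The function names are hypothetical.

```python
import math

TOKENS_PER_PARAM = 20       # rule of thumb quoted above
FLOPS_PER_PARAM_TOKEN = 6   # assumption: C ≈ 6·N·D

def optimal_tokens(n_params):
    """Chinchilla-style token budget for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params

def training_flops(n_params, n_tokens):
    """Approximate training compute under the C ≈ 6·N·D assumption."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens

def compute_optimal(flops_budget):
    """Solve C = 6·N·(20·N) for N, then set D = 20·N."""
    n_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, optimal_tokens(n_params)

# Round trip: Chinchilla's own budget should give back ~70B params, ~1.4T tokens.
chinchilla_flops = training_flops(70e9, optimal_tokens(70e9))
n, d = compute_optimal(chinchilla_flops)
print(f"budget={chinchilla_flops:.2e} FLOPs -> N={n:.2e}, D={d:.2e}")

# GPT-3 for comparison: far below 20 tokens per parameter.
gpt3_ratio = 300e9 / 175e9
print(f"GPT-3 tokens per parameter: {gpt3_ratio:.1f}")
```

Under these assumptions, both the optimal parameter count and the optimal token count grow with the square root of the compute budget.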
Scaling Laws Visualised
The Chinchilla frontier — given a fixed compute budget C, the optimal point balances parameters and tokens equally (roughly). Points to the left are undertrained; to the right, over-parameterised.
The base model knows how to complete text. SFT teaches it to respond to instructions — the difference between a raw model and an assistant.
What Changes During SFT
1. Same Architecture, Different Data
SFT uses the same transformer and the same cross-entropy loss, but the training data is (instruction, response) pairs instead of raw web text.
2. Loss Only on Response Tokens
The instruction tokens are masked, so loss is computed only on the response portion. This stops the model from trying to predict the user's question.
3. Small Dataset, Big Impact
Remarkably, even 1,000–10,000 high-quality instruction pairs can produce a strong instruction-following model. Quality beats quantity here.
4. Learning Rate is Lower
SFT uses a much lower learning rate than pretraining (typically 10–100× lower) to update the weights without catastrophic forgetting of pretraining knowledge.
The SFT Loss Function
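# SFT: only compute loss on response tokens
# (PyTorch-style; model, sft_dataloader, optimizer defined elsewhere)
import torch.nn.functional as F

for batch in sft_dataloader:
    input_ids = batch["input_ids"]     # [B, T]
    labels = batch["labels"]           # [B, T]
    # labels = -100 (ignore) for instruction tokens
    # labels = token_id for response tokens
    logits = model(input_ids)          # [B, T, vocab]
    # Shift by one so position t is scored against token t+1
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,             # skip instruction tokens
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()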
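The loop above assumes the `labels` tensor already has instruction tokens masked. A minimal sketch of how such labels might be built is shown below. `build_sft_example` is a hypothetical helper, the token ids are toy values, and supervising the end-of-sequence token is an assumption of this sketch.

```python
IGNORE_INDEX = -100  # same sentinel the loss uses to skip positions

def build_sft_example(prompt_ids, response_ids, eos_id):
    """Concatenate prompt and response; mask the prompt in the labels."""
    input_ids = prompt_ids + response_ids + [eos_id]
    # Prompt positions contribute no loss; response (and EOS) positions do.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids + [eos_id]
    return input_ids, labels

# Toy token ids standing in for a tokenised chat turn.
prompt_ids = [101, 7, 8, 9, 102]
response_ids = [21, 22, 23]
input_ids, labels = build_sft_example(prompt_ids, response_ids, eos_id=2)

n_supervised = sum(1 for t in labels if t != IGNORE_INDEX)
print(input_ids)
print(labels)
print(f"{n_supervised} of {len(labels)} positions contribute to the loss")
```

Including the end-of-sequence token in the supervised span is what lets the model learn when to stop generating.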
LIMA result: Zhou et al. (2023) showed that training on just 1,000 carefully curated instruction pairs produced a model competitive with models trained on much larger datasets. "Alignment may be much easier than previously thought."
Stage 2 — Data
SFT Data Format & Chat Templates
How instructions and responses are structured for training — and why the format matters.
Chat Template (Llama 2 style)
[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is the capital of France? [/INST]
The capital of France is Paris.
Paris has been the country's capital since
987 AD and is home to landmarks like the
Eiffel Tower and the Louvre.
</s>
Loss is computed only on the response text. The instruction, system prompt, and special tokens are masked with -100.
Key SFT Datasets
| Dataset | Size | Source |
|---|---|---|
| FLAN | 1.8M+ | Task-specific instructions, Google |
| Alpaca | 52K | GPT-3.5 generated, Stanford |
| ShareGPT | 90K | Real ChatGPT conversations |
| Dolly | 15K | Human-written, Databricks |
| OpenAssistant | 161K | Human-written, crowd |
| LIMA | 1K | Curated high-quality, Meta |
Data quality insight: Llama 2 used 27,540 high-quality SFT examples, far fewer than many open-source instruction datasets contain, and prioritised diversity and quality over volume.
SFT teaches format. RLHF teaches values. It fine-tunes the model using human preference signals — pushing it toward responses people actually prefer.
Why SFT Alone Isn't Enough
The Imitation Problem
SFT trains the model to imitate human-written responses. But human annotators aren't always right, consistent, or writing the absolute best possible response. The model learns to imitate, not to be good.
Distributional Mismatch
SFT data is written before seeing what the model generates. In deployment, the model's own outputs become the inputs for multi-turn conversations. RLHF trains on the model's own distribution.
Hard to Specify Good Behaviour
It's easier for humans to compare two responses ("which is better?") than to write the ideal response from scratch. RLHF leverages this comparative judgement signal.
The 3-Step RLHF Process
Step 1
Collect Human Preferences
Show annotators pairs of model outputs for the same prompt. They pick which is better. Thousands of such comparisons are collected.
↓
Step 2
Train a Reward Model
A separate model is trained to predict the human preference score for any (prompt, response) pair. This is the proxy for "what humans want".
↓
Step 3
RL Fine-tuning with PPO
The SFT model is updated using PPO — it generates outputs, gets reward scores, and is nudged toward higher-reward responses while a KL penalty keeps it from drifting too far from SFT.
Stage 3 — Reward Model
Training the Reward Model
A neural network that learns to score how "good" a response is — trained entirely on human preference comparisons, no explicit reward function needed.
Reward Model Architecture
Base: SFT model weights (same architecture)
+ Replace final language-model head
+ Add scalar output head: R(prompt, response) → ℝ
Training objective:
Maximise P(human prefers y_w over y_l)
= σ(R(x, y_w) − R(x, y_l))
# y_w = preferred ("winner") response
# y_l = dispreferred ("loser") response
Why initialise from SFT? The reward model needs to understand language to score responses. Starting from SFT weights means it inherits the SFT model's language understanding, so training only has to teach it how to score responses rather than language from scratch.
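The training objective above can be written as a pairwise loss: the negative log of the probability that the preferred response wins. The sketch below uses made-up scalar rewards and hypothetical function names. In practice the rewards would come from the scalar output head.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def preference_probability(r_w, r_l):
    """P(human prefers y_w over y_l) = σ(R(x, y_w) − R(x, y_l))."""
    return sigmoid(r_w - r_l)

def pairwise_loss(r_w, r_l):
    """Negative log-likelihood of the human's choice."""
    return -math.log(preference_probability(r_w, r_l))

tie = pairwise_loss(0.0, 0.0)       # model cannot tell the responses apart
correct = pairwise_loss(2.0, -1.0)  # preferred response scored higher
wrong = pairwise_loss(-1.0, 2.0)    # preferred response scored lower

print(f"tie={tie:.3f} correct={correct:.3f} wrong={wrong:.3f}")
```

Only the difference between the two rewards matters, so the absolute scale of the reward model's outputs is not pinned down by this objective.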
What Reward Models Learn to Prefer
Higher reward ↑
Helpful, complete answers · Honest ("I don't know") · Appropriate refusals · Clear reasoning · Balanced perspectives · Good formatting
Lower reward ↓
Harmful instructions · Hallucinated facts stated confidently · Sycophantic yes-saying · Unnecessary verbosity · Irrelevant digressions · Harmful stereotypes
Reward hacking warning: The reward model is an imperfect proxy. PPO can "game" it — finding responses that get high reward scores while being actually unhelpful. This is why the KL penalty is critical.
Stage 3 — PPO
PPO Training Loop
Proximal Policy Optimisation — the RL algorithm that fine-tunes the language model using reward signals while preventing catastrophic drift.
The PPO Objective
Maximize:
E[R(x,y)] − β · KL(π_θ || π_ref)
# R(x,y) = reward model score
# KL(·||·) = KL divergence from SFT policy
# β = KL penalty coefficient (typically small, ~0.01–0.2)
# π_θ = current RL policy (being trained)
# π_ref = frozen SFT policy (reference)
The KL penalty is critical. Without it, the model would maximise reward by producing nonsensical strings that fool the reward model. The KL penalty forces the model to stay close to what the SFT model would generate — keeping it grounded.
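To make the trade-off concrete, here is a rough sketch of a KL-penalised reward for a single sampled response. It makes a simplifying assumption: the KL term is estimated from that one sample as the sum of per-token log-probability differences. The function names and numbers are hypothetical, and the PPO update itself (clipping, value model) is not shown.

```python
def sequence_kl_estimate(policy_logprobs, ref_logprobs):
    """Single-sample estimate of KL(π_θ || π_ref): the sum over tokens of
    log π_θ(token) − log π_ref(token) for a response sampled from π_θ."""
    return sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))

def penalised_reward(reward, policy_logprobs, ref_logprobs, beta):
    """R(x, y) − β · (estimated KL from the reference policy)."""
    return reward - beta * sequence_kl_estimate(policy_logprobs, ref_logprobs)

# Per-token log-probs of the sampled response under the frozen SFT reference.
ref_lp = [-1.0, -2.0, -1.5]

# Policy identical to the reference: no penalty.
same = penalised_reward(1.0, ref_lp, ref_lp, beta=0.1)

# Policy has become much more confident in these tokens than the reference.
drifted_lp = [-0.2, -0.5, -0.3]
drifted = penalised_reward(1.0, drifted_lp, ref_lp, beta=0.1)

print(f"same={same:.2f} drifted={drifted:.2f}")
```

With the same raw reward of 1.0, the drifted policy is credited with only 0.65, so drifting away from the reference has to buy a correspondingly larger reward to be worthwhile.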
PPO Iteration Cycle
💬
Generate
RL policy samples responses for batch of prompts
🎯
Score
Reward model assigns scalar reward to each response
⚙
Update
PPO gradient update on RL policy weights
📏
KL Check
Measure drift from SFT ref; apply penalty if needed
4 simultaneous models in memory: (1) RL policy being trained, (2) frozen reference SFT model for KL, (3) reward model, (4) value model (PPO critic). This is why RLHF is compute-intensive.
Training Dynamics Across All 3 Stages
Illustrative training dynamics. Pretraining loss decreases over billions of steps. SFT converges quickly. RLHF reward increases while KL divergence is constrained.
Stage 3 — Modern Alternative
DPO — Direct Preference Optimisation
A simpler, more stable alternative to RLHF that eliminates the reward model entirely — achieving comparable alignment without the 4-model complexity of PPO.
RLHF vs DPO
| Aspect | RLHF (PPO) | DPO |
|---|---|---|
| Reward model | Explicit, trained separately | Implicit in the objective |
| Models in memory | 4 (policy, ref, reward, value) | 2 (policy, reference) |
| Training stability | Tricky — PPO is sensitive | Stable — supervised-like |
| Hyperparameters | Many (β, clip range, etc.) | Mainly just β |
| Data format | Preference pairs + prompts | Same preference pairs |
| Quality | State of art for complex alignment | Competitive, often slightly lower |
| Used by | ChatGPT, Llama-2-chat | Zephyr, Mixtral-Instruct, many OSS |
The DPO Objective
Minimize:
−log σ( β log(π_θ(y_w|x)/π_ref(y_w|x))
− β log(π_θ(y_l|x)/π_ref(y_l|x)) )
# y_w = preferred response
# y_l = dispreferred response
# Reward model replaced by log probability ratio
Key insight (Rafailov et al., 2023): The optimal policy under RLHF can be expressed in closed form as a function of the reference policy. DPO directly optimises for this — removing the need for an explicit reward model entirely.
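The DPO objective above translates almost directly into code. The sketch below evaluates it for one preference pair using made-up sequence log-probabilities. `dpo_loss` is a hypothetical helper, and a real implementation would obtain these log-probabilities by summing token log-probs from the policy and the frozen reference model.

```python
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta):
    """DPO loss for one preference pair.
    pi_w, pi_l:   log π_θ(y_w|x), log π_θ(y_l|x)   (policy being trained)
    ref_w, ref_l: log π_ref(y_w|x), log π_ref(y_l|x) (frozen reference)"""
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    # −log σ(margin) is the same as log(1 + exp(−margin))
    return math.log1p(math.exp(-margin))

# At initialisation the policy equals the reference, so the margin is zero.
at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0, beta=0.1)

# After training: preferred response more likely, dispreferred less likely.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0, beta=0.1)

print(f"at_init={at_init:.3f} improved={improved:.3f}")
```

The loss depends only on how the policy's log-probabilities move relative to the reference, which is the sense in which the log-probability ratio stands in for an explicit reward model.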
Practical impact: DPO made alignment accessible to the open-source community — no need for 4× GPU memory and complex PPO tuning. Models like Zephyr-7B achieved competitive performance with much less compute.
Summary
The Full Pipeline at a Glance
From a random initialisation to an aligned assistant — everything that happens across all 3 stages.
Stage 1
Pre-training
Randomly init → Language model that can complete text
Trillions of tokens · Months · $1M–$100M+
⬇
Stage 2
Supervised Fine-tuning
Text completer → Instruction follower
1K–1M examples · Hours–days · Much cheaper
⬇
Stage 3a
Reward Model Training
Human preference pairs → scalar reward signal
~10K–100K comparisons · Days
⬇
Stage 3b
RL Fine-tuning (PPO or DPO)
Instruction follower → Aligned assistant
~1K–10K prompts · Days · 4× memory
Real Models at Each Stage
Base Models (Post Stage 1)
GPT-3 (175B), Llama-2-base (7B/13B/70B), Mistral-7B-base, Falcon-40B-base, MPT-7B. These are text completers — powerful but not assistants.
SFT Models (Post Stage 2)
Alpaca-7B, Vicuna-13B, FLAN-T5, text-davinci-001. Can follow instructions but may produce harmful or inconsistent responses.
Aligned Models (Post Stage 3)
ChatGPT (GPT-3.5/4), Claude 1/2/3, Llama-2-chat, Zephyr-7B (DPO). Production-grade assistants that refuse harmful requests and help reliably.
What comes next? Modern pipelines are evolving: RLHF → DPO → RLAIF (Constitutional AI, AI feedback instead of human feedback) → Rejection Sampling → Iterative DPO. The 3-stage pipeline described here is the foundation — but the field keeps building on it.