🧠
LLM Training Pipeline
Pretraining → SFT → RLHF · Visual Summary
Pretraining → SFT → Reward Model → PPO / RL (or DPO) → Aligned LLM
The LLM Training Pipeline
How a raw language model goes from predicting text on the internet to a helpful, harmless, honest assistant — in three stages.
Stage 1
Pre-training
Learn language from raw text
📚 Trillions of tokens
💻 Months on thousands of GPUs
🧠 Next-token prediction
💰 $1M – $100M+ compute
Stage 2
Supervised Fine-tuning
Learn to follow instructions
📝 10K – 1M instruction pairs
⏱ Hours to days
💬 Chat format training
💸 Much cheaper than pre-training
Stage 3
Alignment (RLHF)
Align with human values
👨‍⚖️ Human preference labels
🎯 Reward model training
🔄 PPO reinforcement loop
⚖ KL divergence constraint
The Base Model
After pretraining, the model knows grammar, facts, reasoning patterns, and code. But it doesn't know how to have a conversation. Ask it a question and it might continue your text, not answer it.
After SFT
The model learns to respond to instructions — it now produces answers, not text completions. But it may still produce harmful, dishonest, or unhelpful responses because it hasn't learned human preferences.
After RLHF
The model is nudged toward responses that humans rate highly: helpful, harmless, and honest. It learns to refuse harmful requests, hallucinate less often, and match human communication preferences.
Key papers: InstructGPT (Ouyang et al., 2022), the paper that popularised this 3-stage pipeline. Llama 2 (Touvron et al., 2023), one of the most detailed public descriptions of SFT + RLHF. Constitutional AI (Bai et al., 2022), Anthropic's variant.
What Changes at Each Stage
The objective, data, and model behaviour are fundamentally different at each stage.
Dimension | Stage 1: Pretraining | Stage 2: SFT | Stage 3: RLHF
Objective | Minimise next-token prediction loss | Minimise cross-entropy on (instruction, response) pairs | Maximise reward while staying close to SFT policy
Data | Raw web text, books, code (trillions of tokens) | Human-written (instruction, response) pairs (thousands–millions) | Human preference rankings of model outputs (thousands–tens of thousands)
Loss function | Cross-entropy over all tokens | Cross-entropy on response tokens only | PPO objective + KL penalty (or DPO loss)
What model learns | Language, facts, reasoning, code | Instruction-following format and style | Human value alignment, safety, tone
Compute | Dominant cost ($1M–$100M+) | ~1–5% of pretraining cost | ~1–10% of pretraining cost
Output quality | Coherent text, not conversational | Helpful responses, not always safe | Helpful, harmless, and honest
Real example | GPT-3 base, Llama-2-base | Alpaca, Vicuna, Llama-2-chat (SFT only) | ChatGPT, Claude, Llama-2-chat (full)
Stage 1
Pre-training — Learning Language from Scratch
The foundation. A randomly-initialised neural network learns grammar, facts, reasoning, code, and world knowledge by predicting the next word — billions of times.
The Core Objective
Given tokens x₁, x₂, …, xₙ, predict xₙ₊₁.

Loss: L = −(1/N) Σ log P(xₜ | x₁…xₜ₋₁)

Minimising this loss forces the model to learn everything needed to predict text well — meaning it must internalise grammar, facts, reasoning, and style from the training data.
Why it works: There is no shortcut to predicting the next word well. The model must build internal representations of syntax, semantics, world knowledge, and even theory of mind — because all of these improve predictions.
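As a small worked example of the loss above, the sketch below computes the average negative log-likelihood for two sets of made-up probabilities. The function name `next_token_loss` and the numbers are illustrative assumptions, not values from any real model.

```python
import math

def next_token_loss(token_probs):
    """Average negative log-likelihood: L = -(1/N) * sum(log P(x_t | x_<t)).

    token_probs holds the probability the model assigned to the *correct*
    next token at each position.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical probabilities for the true next token at four positions.
confident = [0.9, 0.8, 0.95, 0.85]   # model usually right
uncertain = [0.2, 0.1, 0.3, 0.25]    # model usually unsure

print(next_token_loss(confident))    # small loss
print(next_token_loss(uncertain))    # much larger loss
```

A model that puts more probability mass on the correct token gets a lower loss, which is the only pressure pretraining applies.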
Key Architecture: Transformer Decoder
Input tokens → Token Embeddings + Positional Encodings
  ↓
× N Transformer Blocks:
   Causal Self-Attention (can't look forward)
   Layer Norm
   Feed-Forward Network (MLP)
   Residual connections
  ↓
Final Layer Norm
  ↓
Linear + Softmax → Probability over vocab
The causal mask is what makes this "autoregressive" — each token can only attend to previous tokens, forcing left-to-right prediction.
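The causal mask can be illustrated with a minimal pure-Python sketch. It assumes the raw attention scores are already computed (a real block would derive them from query and key projections), and the helper names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """scores[i][j] is the raw score of query position i on key position j.

    Positions j > i are set to -inf before the softmax, so each token
    can only attend to itself and earlier tokens.
    """
    T = len(scores)
    weights = []
    for i in range(T):
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(T)]
        weights.append(softmax(masked))
    return weights

# Uniform scores: each row spreads attention evenly over visible positions.
w = causal_attention_weights([[0.0] * 3 for _ in range(3)])
print(w[0])  # [1.0, 0.0, 0.0]
print(w[1])  # [0.5, 0.5, 0.0]
```

Masked positions receive exactly zero weight, so nothing from future tokens can leak into the prediction at position i.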
The Pretraining Data Pipeline
Getting clean, diverse, high-quality text at scale is arguably harder than training the model itself.
Data Sources & Mix
Approximate mix based on GPT-3, Llama-2, and similar models. Exact ratios are proprietary and differ across models.
Data Cleaning Pipeline
1
Web Crawl
CommonCrawl, C4, RefinedWeb — petabytes of raw HTML from the public web. Starting point for most models.
2
Deduplication
Exact and near-duplicate (fuzzy) deduplication, typically with MinHash LSH. Duplicates lower the training loss through memorisation without teaching the model anything new.
3
Quality Filtering
Classifier-based filtering (trained on curated text like Wikipedia). Rule-based heuristics remove spam, boilerplate, and low-information text.
4
Tokenisation
BPE (Byte-Pair Encoding) or unigram tokeniser, often built with a library such as SentencePiece, trained on the cleaned corpus. Vocabulary size typically 32K–100K tokens.
5
Packing & Batching
Documents concatenated and packed into fixed-length context windows (2K–128K tokens). Special tokens mark document boundaries.
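Two of the steps above, deduplication and packing, can be sketched in a few lines. This is a toy illustration under stated assumptions: it computes MinHash signatures but omits the LSH banding used at scale, the hash construction (seeded SHA-1) is one arbitrary choice, and all function names and token IDs are hypothetical.

```python
import hashlib

def shingles(text, k=3):
    """Set of overlapping k-word shingles for one document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(text, num_hashes=64):
    """One minimum hash value per seeded hash function."""
    doc_shingles = shingles(text)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in doc_shingles
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def pack_documents(token_docs, context_len, eos_id):
    """Concatenate tokenised documents, separated by eos_id, and cut the
    stream into full windows of context_len tokens (the remainder is dropped)."""
    stream = []
    for doc in token_docs:
        stream.extend(doc)
        stream.append(eos_id)
    n_full = len(stream) // context_len
    return [stream[i * context_len:(i + 1) * context_len] for i in range(n_full)]

a = "the quick brown fox jumps over the lazy dog near the river bank"
b = "the quick brown fox jumps over the lazy dog near the river shore"
c = "large language models are trained on trillions of tokens of text"
sig_a, sig_b, sig_c = (minhash_signature(t) for t in (a, b, c))
print(estimated_jaccard(sig_a, sig_b))  # high: near-duplicates
print(estimated_jaccard(sig_a, sig_c))  # low: unrelated documents

packed = pack_documents([[1, 2, 3], [4, 5], [6, 7, 8, 9]], context_len=4, eos_id=0)
print(packed)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

A pipeline would drop one of any pair whose estimated similarity exceeds a chosen threshold. Note that packing lets a window span two documents, which is why the boundary token matters.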
2T+ · Tokens in Llama-2 pretraining corpus
175B · GPT-3 parameters (300B tokens)
~1.7M · A100 GPU hours for Llama-2 70B
~30% · Data retained after quality filtering
Next-Token Prediction in Depth
The elegantly simple objective that produces surprisingly capable models.
The Training Loop
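# Simplified pretraining loop (PyTorch; model, dataloader, optimizer defined elsewhere)
import torch.nn.functional as F

for batch in dataloader:
    tokens = batch["input_ids"]        # [B, T]
    inputs = tokens[:, :-1]            # x₁…xₙ₋₁
    targets = tokens[:, 1:]            # x₂…xₙ
    logits = model(inputs)             # [B, T-1, vocab]
    # F.cross_entropy expects [N, vocab] logits and [N] targets
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()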
Key insight: Each forward pass produces a training signal at every token position, and the causal mask lets all of them be computed in parallel. An RNN has to step through positions one at a time, which is why transformers make much better use of each GPU-second.
What the Model Actually Learns
Syntax & Grammar
To predict the next word correctly, the model must understand sentence structure. Grammatically incorrect predictions score poorly → grammar is learned implicitly.
World Knowledge
"The Eiffel Tower is located in ___" → model must know geography to predict "Paris". All factual knowledge is encoded in weights through the prediction objective.
Reasoning Patterns
Mathematical proofs, logical arguments, and code follow structured patterns. Predicting the next token in a proof requires understanding the reasoning structure leading up to it.
Scaling Laws & Chinchilla
How much data and compute do you need? The Chinchilla paper (Hoffmann et al., 2022) answered this — and changed how the entire industry trains LLMs.
Before Chinchilla: Over-parameterised Models
GPT-3 mistake: 175B parameters trained on only 300B tokens (~1.7 tokens per parameter). Chinchilla showed this is deeply suboptimal — you'd do better with a much smaller model trained on more data.
The Chinchilla Rule
Optimal training tokens ≈ 20 × N, where N = number of model parameters
e.g. 7B params → train on 140B tokens
70B params → train on 1.4T tokens
Chinchilla (70B params, 1.4T tokens) outperformed Gopher (280B params, 300B tokens) — a 4× larger model trained on far fewer tokens — on nearly every benchmark. Smaller + more data wins.
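The rule of thumb above is easy to check numerically. The sketch below also estimates training compute with the widely used approximation C ≈ 6·N·D FLOPs, which is an assumption brought in for illustration and not stated in this post. Function names are hypothetical.

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Compute-optimal token count under the ~20 tokens/parameter rule."""
    return tokens_per_param * n_params

def training_flops(n_params, n_tokens):
    """Assumption: the common C ≈ 6 * N * D estimate of training FLOPs."""
    return 6 * n_params * n_tokens

B = 10**9  # one billion

print(chinchilla_tokens(7 * B) // B)    # 140  (billion tokens)
print(chinchilla_tokens(70 * B) // B)   # 1400 (billion tokens, i.e. 1.4T)

# GPT-3: 300B tokens for 175B parameters, far below 20 tokens per parameter.
print(round(300 * B / (175 * B), 2))    # 1.71

# Estimated compute for a Chinchilla-style 70B model on 1.4T tokens.
print(training_flops(70 * B, 1400 * B))
```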
Scaling Laws Visualised
The Chinchilla frontier — given a fixed compute budget C, the optimal point balances parameters and tokens equally (roughly). Points to the left are undertrained; to the right, over-parameterised.
Stage 2
Supervised Fine-tuning (SFT) — Learning to Follow Instructions
The base model knows how to complete text. SFT teaches it to respond to instructions — the difference between a raw model and an assistant.
What Changes During SFT
1
Same Architecture, Different Data
SFT uses the same transformer and the same cross-entropy loss — but the training data is (instruction, response) pairs instead of raw web text.
2
Loss Only on Response Tokens
The instruction tokens are masked — loss is computed only on the response portion. This stops the model from trying to predict the user's question.
3
Small Dataset, Big Impact
Remarkably, even 1,000–10,000 high-quality instruction pairs can produce a strong instruction-following model. Quality beats quantity here.
4
Learning Rate is Lower
SFT uses a much lower learning rate than pretraining (typically 10–100× lower) to update the weights without catastrophic forgetting of pretraining knowledge.
The SFT Loss Function
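# SFT: only compute loss on response tokens (PyTorch; model, optimizer defined elsewhere)
import torch.nn.functional as F

for batch in sft_dataloader:
    input_ids = batch["input_ids"]     # [B, T]
    labels = batch["labels"]           # [B, T]
    # labels = -100 (ignore) for instruction tokens
    # labels = token_id for response tokens
    logits = model(input_ids)          # [B, T, vocab]
    # Shift so the logits at position t are scored against token t+1
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,             # skip instruction tokens
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()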
LIMA result: Zhou et al. (2023) showed that fine-tuning on just 1,000 carefully curated instruction pairs produced a model competitive with models trained on much larger datasets. The implication is that most capability comes from pretraining, and alignment may need far less data than previously assumed.
SFT Data Format & Chat Templates
How instructions and responses are structured for training — and why the format matters.
Chat Template (Llama 2 style)
[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

What is the capital of France? [/INST] The capital of France is Paris. Paris has been the country's capital since 987 AD and is home to landmarks like the Eiffel Tower and the Louvre. </s>
Loss is computed only on the response text (everything after [/INST]). The instruction, system prompt, and special tokens are masked with -100.
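The masking described here can be sketched without any ML library. The token IDs below are made up, and `build_labels` and `masked_mean_loss` are hypothetical helper names, not part of any framework.

```python
IGNORE_INDEX = -100  # label value the loss function is told to skip

def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask every prompt position."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

def masked_mean_loss(per_token_losses, labels):
    """Average the per-token losses over unmasked positions only."""
    kept = [loss for loss, y in zip(per_token_losses, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept)

# Hypothetical token IDs: three prompt tokens, two response tokens.
input_ids, labels = build_labels([5, 6, 7], [8, 9])
print(input_ids)  # [5, 6, 7, 8, 9]
print(labels)     # [-100, -100, -100, 8, 9]

# The large losses on prompt positions do not affect the result.
print(masked_mean_loss([9.0, 9.0, 9.0, 1.0, 3.0], labels))  # 2.0
```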
Key SFT Datasets
Dataset | Size | Source
FLAN | 1.8M+ | Task-specific instructions, Google
Alpaca | 52K | GPT-3.5 generated, Stanford
ShareGPT | 90K | Real ChatGPT conversations
Dolly | 15K | Human-written, Databricks
OpenAssistant | 161K | Human-written, crowd
LIMA | 1K | Curated high-quality, Meta
Data quality insight: Llama 2 used 27,540 high-quality SFT annotations, far fewer than many open-source instruction datasets, and prioritised diversity and quality over volume.
Stage 3
Alignment — Reinforcement Learning from Human Feedback (RLHF)
SFT teaches format. RLHF teaches values. It fine-tunes the model using human preference signals — pushing it toward responses people actually prefer.
Why SFT Alone Isn't Enough
The Imitation Problem
SFT trains the model to imitate human-written responses. But human annotators aren't always right, consistent, or writing the absolute best possible response. The model learns to imitate, not to be good.
Distributional Mismatch
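SFT data is written by humans before anyone sees what the model itself generates. In deployment, the model conditions on its own earlier outputs, especially in multi-turn conversations, so it ends up in situations the SFT data never covered. RLHF trains on samples from the model's own distribution.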
Hard to Specify Good Behaviour
It's easier for humans to compare two responses ("which is better?") than to write the ideal response from scratch. RLHF leverages this comparative judgement signal.
The 3-Step RLHF Process
Step 1
Collect Human Preferences
Show annotators pairs of model outputs for the same prompt. They pick which is better. Thousands of such comparisons are collected.
Step 2
Train a Reward Model
A separate model is trained to predict the human preference score for any (prompt, response) pair. This is the proxy for "what humans want".
Step 3
RL Fine-tuning with PPO
The SFT model is updated using PPO — it generates outputs, gets reward scores, and is nudged toward higher-reward responses while a KL penalty keeps it from drifting too far from SFT.
Training the Reward Model
A neural network that learns to score how "good" a response is — trained entirely on human preference comparisons, no explicit reward function needed.
Reward Model Architecture
Base: SFT model weights (same architecture)
  + Replace final language-model head
  + Add scalar output head: R(prompt, response) → ℝ

Training objective:
  Maximise P(human prefers y_w over y_l)
  = σ(R(x, y_w) − R(x, y_l))

# y_w = preferred ("winner") response
# y_l = dispreferred ("loser") response
Why initialise from SFT? The reward model needs to understand language to score responses. Starting from SFT weights means it inherits the SFT model's language understanding — it only needs to learn the scoring head, not language from scratch.
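The training objective above translates directly into a pairwise loss: the negative log of σ(R(x, y_w) − R(x, y_l)). A minimal sketch, with made-up scalar rewards standing in for the reward model's outputs and hypothetical function names:

```python
import math

def preference_probability(r_w, r_l):
    """P(human prefers y_w over y_l) = sigmoid(R(x, y_w) - R(x, y_l))."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))

def pairwise_loss(r_w, r_l):
    """Negative log-likelihood of the observed human preference."""
    return -math.log(preference_probability(r_w, r_l))

print(preference_probability(1.0, 1.0))  # 0.5: equal scores, no preference
print(pairwise_loss(2.0, 0.0))           # small: winner already scored higher
print(pairwise_loss(0.0, 2.0))           # large: ranking is the wrong way round
```

Only the difference between the two scores matters, so the reward scale has no absolute meaning.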
What Reward Models Learn to Prefer
Higher reward ↑
Helpful, complete answers · Honest ("I don't know") · Appropriate refusals · Clear reasoning · Balanced perspectives · Good formatting
Lower reward ↓
Harmful instructions · Hallucinated facts stated confidently · Sycophantic yes-saying · Unnecessary verbosity · Irrelevant digressions · Harmful stereotypes
Reward hacking warning: The reward model is an imperfect proxy. PPO can "game" it by finding responses that score highly while being unhelpful in practice. This is why the KL penalty is critical.
PPO Training Loop
Proximal Policy Optimisation — the RL algorithm that fine-tunes the language model using reward signals while preventing catastrophic drift.
The PPO Objective
Maximize:
 E[R(x,y)] − β · KL(π_θ || π_ref)

# R(x,y) = reward model score
# KL(·||·) = KL divergence from SFT policy
# β = KL penalty coefficient (typically small, roughly 0.01–0.1; InstructGPT reports 0.02)
# π_θ = current RL policy (being trained)
# π_ref = frozen SFT policy (reference)
The KL penalty is critical. Without it, the policy can drift toward degenerate outputs, even nonsensical strings, that happen to fool the reward model. The KL penalty forces the model to stay close to what the SFT model would generate, keeping it grounded.
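To make the trade-off concrete, the sketch below computes a KL-penalised reward for one sampled response. It assumes the KL term is estimated from the sampled tokens as the sum of log-probability differences, which is one common estimator. Real PPO implementations often apply the penalty per token. All numbers and names are illustrative.

```python
def kl_estimate(policy_logprobs, ref_logprobs):
    """Sample-based estimate of KL(pi_theta || pi_ref) for one response:
    sum over tokens of log pi_theta(token) - log pi_ref(token)."""
    return sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))

def penalised_reward(reward, policy_logprobs, ref_logprobs, beta):
    """R(x, y) - beta * KL estimate, the quantity the policy maximises."""
    return reward - beta * kl_estimate(policy_logprobs, ref_logprobs)

# Policy identical to the reference: no penalty.
same = penalised_reward(2.0, [-1.0, -2.0], [-1.0, -2.0], beta=0.1)

# Policy has drifted: it is far more confident than the reference.
drifted = penalised_reward(2.0, [-0.1, -0.2], [-1.1, -2.2], beta=0.1)

print(same)     # 2.0
print(drifted)  # about 1.7: same raw reward, reduced by the drift penalty
```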
PPO Iteration Cycle
💬
Generate
RL policy samples responses for batch of prompts
🎯
Score
Reward model assigns scalar reward to each response
Update
PPO gradient update on RL policy weights
📏
KL Check
Measure drift from SFT ref; apply penalty if needed
4 simultaneous models in memory: (1) RL policy being trained, (2) frozen reference SFT model for KL, (3) reward model, (4) value model (PPO critic). This is why RLHF is compute-intensive.
Training Dynamics Across All 3 Stages
Illustrative training dynamics. Pretraining loss decreases over billions of steps. SFT converges quickly. RLHF reward increases while KL divergence is constrained.
DPO — Direct Preference Optimisation
A simpler, more stable alternative to PPO-based RLHF that removes the explicit reward model, achieving comparable alignment without the 4-model complexity of PPO.
RLHF vs DPO
Aspect | RLHF (PPO) | DPO
Reward model | Explicit, trained separately | Implicit in the objective
Models in memory | 4 (policy, ref, reward, value) | 2 (policy, reference)
Training stability | Tricky: PPO is sensitive | Stable: supervised-like
Hyperparameters | Many (β, clip range, etc.) | Mainly just β
Data format | Preference pairs + prompts | Same preference pairs
Quality | State of the art for complex alignment | Competitive, often slightly lower
Used by | ChatGPT, Llama-2-chat | Zephyr, Mistral instruct, many OSS
The DPO Objective
Minimize:
 −log σ( β log(π_θ(y_w|x)/π_ref(y_w|x))
            − β log(π_θ(y_l|x)/π_ref(y_l|x)) )

# y_w = preferred response
# y_l = dispreferred response
# Reward model replaced by log probability ratio
Key insight (Rafailov et al., 2023): The optimal policy under RLHF can be expressed in closed form as a function of the reference policy. DPO directly optimises for this — removing the need for an explicit reward model entirely.
Practical impact: DPO made alignment accessible to the open-source community — no need for 4× GPU memory and complex PPO tuning. Models like Zephyr-7B achieved competitive performance with much less compute.
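The DPO objective can be written as a short function over sequence log-probabilities. This is a minimal single-pair sketch. The inputs are made-up values of log π(y|x) summed over response tokens, and `dpo_loss` is a hypothetical name, not a library API.

```python
import math

def dpo_loss(policy_logp_w, ref_logp_w, policy_logp_l, ref_logp_l, beta=0.1):
    """-log sigmoid( beta * log(pi_theta(y_w|x) / pi_ref(y_w|x))
                   - beta * log(pi_theta(y_l|x) / pi_ref(y_l|x)) )

    Inputs are log-probabilities, so each log-ratio is a difference.
    """
    margin = beta * (policy_logp_w - ref_logp_w) - beta * (policy_logp_l - ref_logp_l)
    # -log(sigmoid(m)) == log(1 + exp(-m))
    return math.log1p(math.exp(-margin))

# Policy equal to the reference: margin is 0, loss is log(2).
at_reference = dpo_loss(-10.0, -10.0, -12.0, -12.0)

# Policy raises the preferred response and lowers the dispreferred one.
improved = dpo_loss(-8.0, -10.0, -14.0, -12.0)

print(at_reference)  # about 0.693
print(improved)      # lower than at_reference
```

The loss falls only when the policy shifts probability toward y_w and away from y_l relative to the reference, which is how the log-probability ratio stands in for a reward model.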
Chinchilla Scaling Calculator
Given a training budget, compute the optimal model size and token count according to Chinchilla scaling laws.
Training Budget: model parameters 7B · GPU count (A100 80GB) 64 · training days 14
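A rough version of this calculation can be sketched as follows. Several inputs are assumptions that do not come from this post: training compute is approximated as C ≈ 6·N·D, the A100 is taken at its 312 TFLOP/s dense BF16 peak, and the 40% hardware utilisation is a guess that varies widely in practice. Function names are hypothetical.

```python
import math

A100_PEAK_FLOPS = 312e12  # dense BF16 peak throughput of one A100, FLOP/s

def compute_budget(gpu_count, days, utilisation=0.4):
    """Total usable FLOPs. The utilisation value is an assumed figure."""
    return gpu_count * days * 86_400 * A100_PEAK_FLOPS * utilisation

def chinchilla_optimal(compute_flops, tokens_per_param=20):
    """Solve C = 6 * N * D with D = tokens_per_param * N for N and D."""
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    return n_params, tokens_per_param * n_params

budget = compute_budget(gpu_count=64, days=14)
n_params, n_tokens = chinchilla_optimal(budget)
print(f"{budget:.2e} FLOPs")
print(f"{n_params / 1e9:.1f}B parameters, {n_tokens / 1e9:.0f}B tokens")
```

Under these assumed numbers, 64 A100s for 14 days lands at roughly 9B parameters, so the result is only as reliable as the utilisation guess.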
The Full Pipeline at a Glance
From a random initialisation to an aligned assistant — everything that happens across all 3 stages.
Stage 1 · Pre-training · Random initialisation → language model that can complete text · Trillions of tokens · Months · $1M–$100M+
Stage 2 · Supervised Fine-tuning · Text completer → instruction follower · 1K–1M examples · Hours–days · Much cheaper
Stage 3a · Reward Model Training · Human preference pairs → scalar reward signal · ~10K–100K comparisons · Days
Stage 3b · RL Fine-tuning (PPO or DPO) · Instruction follower → aligned assistant · ~1K–10K prompts · Days · 4 models in memory with PPO
Real Models at Each Stage
Base Models (Post Stage 1)
GPT-3 (175B), Llama-2-base (7B/13B/70B), Mistral-7B-base, Falcon-40B-base, MPT-7B. These are text completers — powerful but not assistants.
SFT Models (Post Stage 2)
Alpaca-7B, Vicuna-13B, FLAN-T5, text-davinci-001. Can follow instructions but may produce harmful or inconsistent responses.
Aligned Models (Post Stage 3)
ChatGPT (GPT-3.5/4), Claude 1/2/3, Llama-2-chat, Zephyr-7B (DPO). Production-grade assistants that refuse harmful requests and help reliably.
What comes next? Modern pipelines keep evolving beyond PPO-based RLHF: DPO, RLAIF (Constitutional AI, which uses AI feedback instead of human feedback), rejection sampling, and iterative DPO. The 3-stage pipeline described here is the foundation, and the field keeps building on it.