LLM Training Pipeline — Pretraining → SFT → RLHF

Post 44 · Training & Alignment

The LLM Training Pipeline

How a raw language model goes from predicting text on the internet to a helpful, harmless, honest assistant — in three stages.

Stage 1

Pre-training

Learn language from raw text

📚 Trillions of tokens
💻 Months on thousands of GPUs
🧠 Next-token prediction
💰 $1M – $100M+ compute

➜

Stage 2

Supervised Fine-tuning

Learn to follow instructions

📝 10K – 1M instruction pairs
⏱ Hours to days
💬 Chat format training
💸 Much cheaper than pre-training

➜

Stage 3

Alignment (RLHF)

Align with human values

👨‍⚖️ Human preference labels
🎯 Reward model training
🔄 PPO reinforcement loop
⚖ KL divergence constraint

The Base Model

After pretraining, the model knows grammar, facts, reasoning patterns, and code. But it doesn't know how to have a conversation. Ask it a question and it might continue your text, not answer it.

After SFT

The model learns to respond to instructions — it now produces answers, not text completions. But it may still produce harmful, dishonest, or unhelpful responses because it hasn't learned human preferences.

After RLHF

The model is nudged toward responses that humans rate highly: helpful, harmless, and honest. It learns to refuse harmful requests, hallucinate less often, and match human communication preferences.

    Key papers: InstructGPT (Ouyang et al., 2022) — the paper that introduced this 3-stage pipeline publicly. Llama 2 (Touvron et al., 2023) — the most detailed public description of SFT + RLHF. Constitutional AI (Bai et al., 2022) — Anthropic's variant.
  

Framework

What Changes at Each Stage

The objective, data, and model behaviour are fundamentally different at each stage.

Dimension	Stage 1: Pretraining	Stage 2: SFT	Stage 3: RLHF
Objective	Minimise next-token prediction loss	Minimise cross-entropy on (instruction, response) pairs	Maximise reward while staying close to SFT policy
Data	Raw web text, books, code (trillions of tokens)	Human-written (instruction, response) pairs (thousands–millions)	Human preference rankings of model outputs (tens of thousands to 1M+)
Loss function	Cross-entropy over all tokens	Cross-entropy on response tokens only	PPO objective + KL penalty (or DPO loss)
What model learns	Language, facts, reasoning, code	Instruction-following format and style	Human value alignment, safety, tone
Compute	Dominant cost ($1M–$100M+)	~1–5% of pretraining cost	~1–10% of pretraining cost
Output quality	Coherent text — not conversational	Helpful responses — not always safe	Helpful, harmless, and honest
Real example	GPT-3 base, Llama-2-base	Alpaca, Vicuna, Llama-2-chat (SFT only)	ChatGPT, Claude, Llama-2-chat (full)

The foundation. A randomly-initialised neural network learns grammar, facts, reasoning, code, and world knowledge by predicting the next token, trillions of times over.

The Core Objective

        Given tokens x₁, x₂, …, xₙ, predict xₙ₊₁.

        Loss: L = −(1/N) Σ log P(xₜ | x₁…xₜ₋₁)

        Minimising this loss forces the model to learn everything needed to predict text well — meaning it must internalise grammar, facts, reasoning, and style from the training data.

        Why it works: There is no shortcut to predicting the next word well. The model must build internal representations of syntax, semantics, world knowledge, and even theory of mind — because all of these improve predictions.
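To make the loss formula concrete, here is a minimal sketch in plain Python. The probabilities are made-up numbers standing in for what a model might assign to the correct next token at four positions; `next_token_loss` is a hypothetical helper name, not part of any library.

```python
import math

def next_token_loss(token_probs):
    """Average negative log-likelihood of the correct next tokens:
    L = -(1/N) * sum(log P(x_t | x_1..x_{t-1}))."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical probabilities assigned to the true next token
# at four positions (illustrative values only).
confident = [0.9, 0.8, 0.95, 0.7]
uncertain = [0.2, 0.1, 0.3, 0.25]

loss_confident = next_token_loss(confident)
loss_uncertain = next_token_loss(uncertain)
print(f"confident model: {loss_confident:.3f}")
print(f"uncertain model: {loss_uncertain:.3f}")
```

The model that puts more probability on the correct tokens gets the lower loss, which is the only signal pretraining needs.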
      

Key Architecture: Transformer Decoder

        Input tokens → Token Embeddings + Positional Encodings

          ↓

        × N Transformer Blocks:

           Causal Self-Attention (can't look forward)

           Layer Norm

           Feed-Forward Network (MLP)

           Residual connections

          ↓

        Final Layer Norm

          ↓

        Linear + Softmax → Probability over vocab

The causal mask is what makes this "autoregressive" — each token can only attend to previous tokens, forcing left-to-right prediction.
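The effect of the causal mask can be sketched with a toy score matrix. This is an illustrative stdlib-only version, assuming a single attention head and made-up scores; real implementations apply the same idea to tensors.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix, where position t
    may only attend to positions 0..t (the causal mask)."""
    T = len(scores)
    weights = []
    for t in range(T):
        # Future positions (s > t) are set to -inf, so exp(-inf) == 0.0.
        masked = [scores[t][s] if s <= t else -math.inf for s in range(T)]
        row_max = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(v - row_max) for v in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy 3x3 attention scores (made-up values).
scores = [
    [0.5, 2.0, 1.0],
    [1.0, 1.0, 3.0],
    [0.2, 0.4, 0.6],
]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, and every entry above the diagonal is exactly zero, however large the raw score was.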

Stage 1 — Data

The Pretraining Data Pipeline

Getting clean, diverse, high-quality text at scale is arguably harder than training the model itself.

Data Sources & Mix

Approximate mix based on GPT-3, Llama-2, and similar models. Exact ratios are proprietary and differ across models.

Data Cleaning Pipeline

1

Web Crawl

CommonCrawl, C4, RefinedWeb — petabytes of raw HTML from the public web. Starting point for most models.

2

Deduplication

Near-exact and fuzzy deduplication using MinHash LSH. Duplicates inflate loss reduction without teaching anything new.

3

Quality Filtering

Classifier-based filtering (trained on curated text like Wikipedia). Rule-based heuristics remove spam, boilerplate, and low-information text.

4

Tokenisation

BPE (Byte-Pair Encoding) or unigram tokeniser, often built with a library such as SentencePiece, trained on the cleaned corpus. Vocabulary size typically 32K–100K tokens.

5

Packing & Batching

Documents concatenated and packed into fixed-length context windows (2K–128K tokens). Special tokens mark document boundaries.
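The fuzzy deduplication in step 2 rests on MinHash signatures. The sketch below shows only the signature and similarity estimate, not the LSH banding that makes it scale; the shingle size, the number of hashes, and the use of MD5 as the hash family are all illustrative assumptions.

```python
import hashlib

def shingles(text, k=3):
    """Set of overlapping k-word shingles from a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(shingle_set, num_hashes=64):
    """For each seeded hash function, keep the minimum hash value
    seen over all shingles."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
doc_c = "training large language models requires clean and diverse text data"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
sim_ab = estimated_jaccard(sig_a, sig_b)
sim_ac = estimated_jaccard(sig_a, sig_c)
print(f"near-duplicate pair: {sim_ab:.2f}, unrelated pair: {sim_ac:.2f}")
```

A pipeline would drop one document from any pair whose estimated similarity exceeds a chosen threshold; the threshold itself is a tuning decision.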

2T+

Tokens in Llama-2
pretraining corpus

175B

GPT-3 parameters
(300B tokens)

~1.7M

A100 GPU hours
for Llama-2 70B

~30%

Data retained after
quality filtering

Stage 1 — Mechanism

Next-Token Prediction in Depth

The elegantly simple objective that produces surprisingly capable models.

The Training Loop

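# Simplified pretraining loop (PyTorch-style; model, dataloader and
# optimizer are assumed to be defined elsewhere)
import torch.nn.functional as F

for batch in dataloader:
    tokens  = batch["input_ids"]       # [B, T]
    inputs  = tokens[:, :-1]           # x₁…xₙ₋₁  -> [B, T-1]
    targets = tokens[:, 1:]            # x₂…xₙ    -> [B, T-1]

    logits = model(inputs)             # [B, T-1, vocab]
    # cross_entropy expects [N, vocab] logits and [N] targets
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()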

        Key insight: Each forward pass produces a training signal at every predicted token position, and a transformer computes all of those positions in parallel instead of one step at a time. This is why transformers use GPU time so much more efficiently than RNNs.
      

What the Model Actually Learns

Syntax & Grammar

To predict the next word correctly, the model must understand sentence structure. Grammatically incorrect predictions score poorly → grammar is learned implicitly.

World Knowledge

"The Eiffel Tower is located in ___" → model must know geography to predict "Paris". All factual knowledge is encoded in weights through the prediction objective.

Reasoning Patterns

Mathematical proofs, logical arguments, and code follow structured patterns. Predicting the next token in a proof requires understanding the reasoning structure leading up to it.

Stage 1 — Scaling

Scaling Laws & Chinchilla

How much data and compute do you need? The Chinchilla paper (Hoffmann et al., 2022) answered this — and changed how the entire industry trains LLMs.

Before Chinchilla: Over-parameterised Models

        GPT-3 mistake: 175B parameters trained on only 300B tokens (~1.7 tokens per parameter). Chinchilla showed this is deeply suboptimal — you'd do better with a much smaller model trained on more data.
      

The Chinchilla Rule

Optimal training tokens ≈

20 × N

where N = number of model parameters

e.g. 7B params → train on 140B tokens
70B params → train on 1.4T tokens

        Chinchilla (70B params, 1.4T tokens) outperformed Gopher (280B params, 300B tokens) — a 4× larger model trained on far fewer tokens — on nearly every benchmark. Smaller + more data wins.
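The 20 × N rule can be turned into a small calculator. This sketch additionally assumes the common approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6 · N · D), which is not stated in this article; the function names are hypothetical.

```python
import math

TOKENS_PER_PARAM = 20        # the rule of thumb quoted above
FLOPS_PER_PARAM_TOKEN = 6    # assumption: C ≈ 6 · N · D training FLOPs

def optimal_tokens(n_params):
    """Chinchilla-style token budget for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params

def training_flops(n_params, n_tokens):
    """Approximate total training compute under the 6 · N · D assumption."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens

def optimal_size_for_budget(flops):
    """Solve C = 6 · N · (20 · N) for N, then set D = 20 · N."""
    n_params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, optimal_tokens(n_params)

for n in (7e9, 70e9):
    print(f"{n / 1e9:.0f}B params -> {optimal_tokens(n) / 1e9:.0f}B tokens")

# GPT-3's ratio, for comparison with the 20 tokens/param rule.
gpt3_ratio = 300e9 / 175e9
print(f"GPT-3: {gpt3_ratio:.1f} tokens per parameter")

# Round trip: the budget that trains 70B on 1.4T tokens maps back to 70B.
budget = training_flops(70e9, optimal_tokens(70e9))
n_opt, d_opt = optimal_size_for_budget(budget)
```

Treat the outputs as rough planning numbers; the fitted constants in the scaling-law literature vary with the dataset and the fitting method.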
      

Scaling Laws Visualised

The Chinchilla frontier — given a fixed compute budget C, the optimal point balances parameters and tokens equally (roughly). Points to the left are undertrained; to the right, over-parameterised.

The base model knows how to complete text. SFT teaches it to respond to instructions — the difference between a raw model and an assistant.

What Changes During SFT

1

Same Architecture, Different Data

SFT uses the same transformer and the same cross-entropy loss — but the training data is (instruction, response) pairs instead of raw web text.

2

Loss Only on Response Tokens

The instruction tokens are masked — loss is computed only on the response portion. This stops the model from trying to predict the user's question.

3

Small Dataset, Big Impact

Remarkably, even 1,000–10,000 high-quality instruction pairs can produce a strong instruction-following model. Quality beats quantity here.

4

Learning Rate is Lower

SFT uses a much lower learning rate than pretraining (typically 10–100× lower) to update the weights without catastrophic forgetting of pretraining knowledge.

The SFT Loss Function

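# SFT: only compute loss on response tokens (PyTorch-style; model,
# sft_dataloader and optimizer are assumed to be defined elsewhere)
import torch.nn.functional as F

for batch in sft_dataloader:
    input_ids = batch["input_ids"]     # [B, T]
    labels    = batch["labels"]        # [B, T]
    # labels = -100 (ignore) for instruction tokens
    # labels = token_id for response tokens

    logits = model(input_ids)          # [B, T, vocab]
    # Shift so that position t predicts token t+1
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,             # skip instruction tokens
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()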

        LIMA result: Zhou et al. (2023) showed that training on just 1,000 carefully curated instruction pairs produced a model competitive with models trained on much larger datasets. "Alignment may be much easier than previously thought."
      

Stage 2 — Data

SFT Data Format & Chat Templates

How instructions and responses are structured for training — and why the format matters.

Chat Template (Llama 2 style)

[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

What is the capital of France? [/INST]

The capital of France is Paris.
Paris has been the country's capital since
987 AD and is home to landmarks like the
Eiffel Tower and the Louvre.

</s>

Loss is computed only on the response text. The instruction, system prompt, and prompt-side special tokens are masked with -100.
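Building the masked labels for one example might look like the sketch below. The whitespace tokenizer is a toy stand-in for a real one, `build_example` is a hypothetical helper, and keeping the closing `</s>` in the loss (so the model learns when to stop) is an assumption about common practice rather than something this article specifies.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def build_example(system, user, response, tokenize):
    """Return (input_ids, labels) for one Llama-2-style chat example.
    `tokenize` is any function mapping a string to a list of token ids."""
    prompt = f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"
    prompt_ids = tokenize(prompt)
    response_ids = tokenize(response)
    input_ids = prompt_ids + response_ids
    # Mask every prompt token; keep the real ids for response tokens.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels

# Hypothetical toy tokenizer: one id per whitespace-separated word.
vocab = {}
def toy_tokenize(text):
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]

input_ids, labels = build_example(
    "You are a helpful assistant.",
    "What is the capital of France?",
    "The capital of France is Paris. </s>",
    toy_tokenize,
)
print(labels)
```

Only the trailing response positions carry real token ids; everything before them is -100 and contributes nothing to the gradient.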

Key SFT Datasets

Dataset	Size	Source
FLAN	1.8M+	Task-specific instructions, Google
Alpaca	52K	GPT-3.5 generated, Stanford
ShareGPT	90K	Real ChatGPT conversations
Dolly	15K	Human-written, Databricks
OpenAssistant	161K	Human-written, crowd
LIMA	1K	Curated high-quality, Meta

        Data quality insight: Llama 2 used 27,540 high-quality SFT examples, far fewer than many open-source models, prioritising diversity and quality over volume.
      

SFT teaches format. RLHF teaches values. It fine-tunes the model using human preference signals — pushing it toward responses people actually prefer.

Why SFT Alone Isn't Enough

The Imitation Problem

SFT trains the model to imitate human-written responses. But human annotators aren't always right, consistent, or writing the absolute best possible response. The model learns to imitate, not to be good.

Distributional Mismatch

SFT data is written before seeing what the model generates. In deployment, the model's own outputs become the inputs for multi-turn conversations. RLHF trains on the model's own distribution.

Hard to Specify Good Behaviour

It's easier for humans to compare two responses ("which is better?") than to write the ideal response from scratch. RLHF leverages this comparative judgement signal.

The 3-Step RLHF Process

Step 1

Collect Human Preferences

Show annotators pairs of model outputs for the same prompt. They pick which is better. Thousands of such comparisons are collected.

↓

Step 2

Train a Reward Model

A separate model is trained to predict the human preference score for any (prompt, response) pair. This is the proxy for "what humans want".

↓

Step 3

RL Fine-tuning with PPO

The SFT model is updated using PPO — it generates outputs, gets reward scores, and is nudged toward higher-reward responses while a KL penalty keeps it from drifting too far from SFT.

Stage 3 — Reward Model

Training the Reward Model

A neural network that learns to score how "good" a response is — trained entirely on human preference comparisons, no explicit reward function needed.

Reward Model Architecture

        Base: SFT model weights (same architecture)

          + Replace final language-model head

          + Add scalar output head: R(prompt, response) → ℝ

        Training objective:

          Maximise P(human prefers y_w over y_l)

          = σ(R(x, y_w) − R(x, y_l))

        # y_w = preferred ("winner") response

        # y_l = dispreferred ("loser") response

        Why initialise from SFT? The reward model needs to understand language to score responses. Starting from SFT weights means it inherits the SFT model's language understanding, so training only has to teach it how to score responses, not language from scratch.
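The training objective above corresponds to a simple pairwise loss. A minimal sketch in plain Python, with scalar rewards standing in for the outputs of the reward head (function names are hypothetical):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def preference_probability(reward_w, reward_l):
    """P(human prefers y_w over y_l) = sigma(R(x, y_w) - R(x, y_l))."""
    return sigmoid(reward_w - reward_l)

def pairwise_loss(reward_w, reward_l):
    """Negative log-likelihood of the observed human preference."""
    return -math.log(preference_probability(reward_w, reward_l))

# Made-up reward scores for a preferred and a dispreferred response.
agrees_with_human = pairwise_loss(reward_w=2.0, reward_l=0.0)
disagrees_with_human = pairwise_loss(reward_w=0.0, reward_l=2.0)
print(f"{agrees_with_human:.3f} vs {disagrees_with_human:.3f}")
```

Only the difference between the two rewards matters, so the absolute scale of the reward model's output is not pinned down by this objective.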
      

What Reward Models Learn to Prefer

Higher reward ↑

Helpful, complete answers · Honest ("I don't know") · Appropriate refusals · Clear reasoning · Balanced perspectives · Good formatting

Lower reward ↓

Harmful instructions · Hallucinated facts stated confidently · Sycophantic yes-saying · Unnecessary verbosity · Irrelevant digressions · Harmful stereotypes

        Reward hacking warning: The reward model is an imperfect proxy. PPO can "game" it — finding responses that get high reward scores while being actually unhelpful. This is why the KL penalty is critical.
      

Stage 3 — PPO

PPO Training Loop

Proximal Policy Optimisation — the RL algorithm that fine-tunes the language model using reward signals while preventing catastrophic drift.

The PPO Objective

        Maximize:

         E[R(x,y)] − β · KL(π_θ || π_ref)

        # R(x,y)   = reward model score

        # KL(·||·)  = KL divergence from SFT policy

        # β         = KL penalty coefficient (typically small: 0.02 in InstructGPT, 0.01 or 0.005 in Llama 2)

        # π_θ       = current RL policy (being trained)

        # π_ref     = frozen SFT policy (reference)

        The KL penalty is critical. Without it, the model would maximise reward by producing nonsensical strings that fool the reward model. The KL penalty forces the model to stay close to what the SFT model would generate — keeping it grounded.
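In practice the KL term is usually estimated from the sampled response itself. The sketch below assumes the common single-sample estimate (the summed per-token log-probability ratio between policy and reference); the log-probabilities, reward, and β value are made-up numbers for illustration.

```python
def kl_penalised_reward(reward, policy_logprobs, ref_logprobs, beta):
    """Reward-model score minus beta times a single-sample KL estimate.

    policy_logprobs / ref_logprobs: per-token log-probabilities of the
    sampled response under the RL policy and the frozen SFT reference.
    """
    log_ratio = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * log_ratio

# Policy identical to the reference: no penalty is applied.
no_drift = kl_penalised_reward(1.0, [-1.0, -2.0], [-1.0, -2.0], beta=0.1)

# Policy assigns much higher probability to its sample than the
# reference does: the same raw reward is worth less.
drifted = kl_penalised_reward(1.0, [-0.5, -0.5], [-2.0, -3.0], beta=0.1)
print(no_drift, drifted)
```

A response the reference model finds very unlikely has to earn a correspondingly higher reward-model score to be worth generating.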
      

PPO Iteration Cycle

💬

Generate

RL policy samples responses for batch of prompts

🎯

Score

Reward model assigns scalar reward to each response

⚙

Update

PPO gradient update on RL policy weights

📏

KL Check

Measure drift from the SFT reference; the KL penalty is subtracted from the reward

        4 simultaneous models in memory: (1) RL policy being trained, (2) frozen reference SFT model for KL, (3) reward model, (4) value model (PPO critic). This is why RLHF is compute-intensive.
      

Training Dynamics Across All 3 Stages

Illustrative training dynamics. Pretraining loss decreases over hundreds of thousands of optimiser steps. SFT converges quickly. RLHF reward increases while KL divergence is constrained.

Stage 3 — Modern Alternative

DPO — Direct Preference Optimisation

A simpler, more stable alternative to PPO-based RLHF that eliminates the explicit reward model, achieving comparable alignment without the 4-model complexity of PPO.

RLHF vs DPO

Aspect	RLHF (PPO)	DPO
Reward model	Explicit, trained separately	Implicit in the objective
Models in memory	4 (policy, ref, reward, value)	2 (policy, reference)
Training stability	Tricky — PPO is sensitive	Stable — supervised-like
Hyperparameters	Many (β, clip range, etc.)	Mainly just β
Data format	Preference pairs + prompts	Same preference pairs
Quality	State of the art for complex alignment	Competitive, often slightly lower
Used by	ChatGPT, Llama-2-chat	Zephyr, Mixtral-8x7B-Instruct, many OSS

The DPO Objective

        Minimize:

         −log σ( β log(π_θ(y_w|x)/π_ref(y_w|x))

                    − β log(π_θ(y_l|x)/π_ref(y_l|x)) )

        # y_w = preferred response

        # y_l = dispreferred response

        # Reward model replaced by log probability ratio

        Key insight (Rafailov et al., 2023): The optimal policy under RLHF can be expressed in closed form as a function of the reference policy. DPO directly optimises for this — removing the need for an explicit reward model entirely.
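For a single preference pair, the DPO objective above reduces to a few lines. This sketch takes sequence-level log-probabilities as plain floats (made-up values) rather than computing them from a model; `dpo_loss` is a hypothetical helper name.

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta):
    """-log sigma(beta * [(log-ratio of y_w) - (log-ratio of y_l)])."""
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(m)) is the same as log(1 + exp(-m)).
    return math.log1p(math.exp(-margin))

# Policy equal to the reference: margin is 0, loss is log(2).
at_init = dpo_loss(-5.0, -5.0, -5.0, -5.0, beta=0.1)

# Policy has raised y_w and lowered y_l relative to the reference.
better = dpo_loss(-4.0, -6.0, -5.0, -5.0, beta=0.1)

# Policy has moved the wrong way.
worse = dpo_loss(-6.0, -4.0, -5.0, -5.0, beta=0.1)
print(f"{better:.3f} < {at_init:.3f} < {worse:.3f}")
```

The loss only needs log-probabilities from the policy and the frozen reference, which is why no separate reward or value model has to be held in memory.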
      

        Practical impact: DPO made alignment accessible to the open-source community — no need for 4× GPU memory and complex PPO tuning. Models like Zephyr-7B achieved competitive performance with much less compute.
      

Summary

The Full Pipeline at a Glance

From a random initialisation to an aligned assistant — everything that happens across all 3 stages.

Stage 1 Pre-training Random initialisation → Language model that can complete text Trillions of tokens · Months · $1M–$100M+

⬇

Stage 2 Supervised Fine-tuning Text completer → Instruction follower 1K–1M examples · Hours–days · Much cheaper

⬇

Stage 3a Reward Model Training Human preference pairs → scalar reward signal ~10K–1M+ comparisons · Days

⬇

Stage 3b RL Fine-tuning (PPO or DPO) Instruction follower → Aligned assistant ~1K–10K prompts · Days · 4 models in memory for PPO, 2 for DPO

Real Models at Each Stage

Base Models (Post Stage 1)

GPT-3 (175B), Llama-2-base (7B/13B/70B), Mistral-7B-base, Falcon-40B-base, MPT-7B. These are text completers — powerful but not assistants.

SFT Models (Post Stage 2)

Alpaca-7B, Vicuna-13B, FLAN-T5, text-davinci-001. Can follow instructions but may produce harmful or inconsistent responses.

Aligned Models (Post Stage 3)

ChatGPT (GPT-3.5/4), Claude 1/2/3, Llama-2-chat, Zephyr-7B (DPO). Production-grade assistants that refuse harmful requests and help reliably.

    What comes next? Modern pipelines are evolving: RLHF → DPO → RLAIF (Constitutional AI, AI feedback instead of human feedback) → Rejection Sampling → Iterative DPO. The 3-stage pipeline described here is the foundation — but the field keeps building on it.