The Discovery
In 2020, OpenAI scaled GPT to 175 billion parameters and found something nobody designed. By seeing a few input→output examples in the prompt, the frozen model learned new tasks instantly. No retraining. No weight updates. Just examples in a prompt.
TL;DR: In 2020, OpenAI scaled GPT to 175 billion parameters and found something unexpected: the model could perform brand-new tasks just by reading a few examples in the prompt, without any retraining. They called it in-context learning. Nobody designed this. It emerged. Three years of research later, we still don't fully understand it.
Click the buttons to see how the same frozen model changes behavior based on examples it sees in the prompt.
- ✗ Thousands of labelled examples
- ✗ Gradient updates, backprop
- ✗ New model checkpoint saved
- ✗ Hours of GPU compute
- ✗ Weights permanently changed
- ✗ One model per task
- ✓ 0–8 examples in the prompt
- ✓ Zero gradient updates
- ✓ No new checkpoint needed
- ✓ Milliseconds to adapt
- ✓ Weights completely frozen
- ✓ One model, infinite tasks
One frozen model can perform classification, translation, summarization, and coding, just by changing the examples in the prompt.
New task in seconds, not hours. No training infrastructure needed. Change the prompt and you have a new capability.
No GPU cluster needed for adaptation. API access is sufficient. ICL democratized LLM capabilities for millions of developers.
How It Works
Every ICL prompt has the same anatomy: a sequence of (input, label) pairs, called demonstrations, followed by a test input. The model predicts the label for the final input. Simple in structure, mysterious in operation.
Every ICL prompt follows this structure. The model "reads" all demonstrations, infers the pattern, and fills in the final label.
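The anatomy above can be sketched as a plain string template. A minimal sketch: the `Input:`/`Label:` wording is an illustrative convention, not something ICL requires.

```python
# Minimal sketch of ICL prompt anatomy: k (input, label) demonstrations,
# then the test input with its label left blank for the model to fill in.
def build_icl_prompt(demos, test_input):
    blocks = [f"Input: {x}\nLabel: {y}" for x, y in demos]
    blocks.append(f"Input: {test_input}\nLabel:")  # model predicts this label
    return "\n\n".join(blocks)

demos = [
    ("The movie was fantastic!", "Positive"),
    ("Terrible service, never again.", "Negative"),
]
prompt = build_icl_prompt(demos, "A delightful surprise.")
print(prompt)
```

Any demonstration format works as long as it is consistent; the model picks up whatever pattern the demos establish.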
Adjust the sliders to see how k (number of examples), order, and label correctness affect accuracy. The results are surprising.
Key insight: Label correctness barely matters; order matters far more than whether labels are right or wrong!
Every additional demonstration is another pass through the full context. ICL is powerful but not free: cost scales linearly with k.
Attention heatmap: the final prediction token (right) attends strongly across demonstration tokens, especially similar examples.
Theory 1: Task Location
Min et al. (2022) ran a provocative experiment: what if you scrambled all the labels? They expected performance to collapse. Instead, accuracy barely dropped. Their conclusion: demonstrations don't teach the model the task; they locate which pre-trained task to use.
Min et al. (2022) swapped all labels to random wrong ones. Click "Shuffle Labels" to see what happened.
Result: With correct labels: ~89% accuracy. With random (wrong) labels: ~85% accuracy. Only a four-point drop. The labels aren't the point.
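The manipulation itself is easy to reproduce. A sketch of the data side only, assuming a binary label space; the model call that would measure accuracy is omitted.

```python
import random

# Sketch of the Min et al. (2022) manipulation: keep every input, but
# replace its gold label with one drawn uniformly from the label space.
def shuffle_labels(demos, label_space, seed=0):
    rng = random.Random(seed)
    return [(x, rng.choice(label_space)) for x, _ in demos]

demos = [
    ("great film", "Positive"),
    ("awful plot", "Negative"),
    ("loved every minute", "Positive"),
    ("a total mess", "Negative"),
]
random_demos = shuffle_labels(demos, ["Positive", "Negative"])
# Inputs, label space, and format are all preserved; only label
# correctness is destroyed -- the thing that turns out to matter least.
print(random_demos)
```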
Shows the model what kind of inputs to expect β sentiment text, not arithmetic or translation.
Establishes the label space: "Positive"/"Negative", not "Yes"/"No" or "1"/"0".
Makes clear this is a classification task vs. a generation or translation task.
Activates the right "subroutine" in the pre-trained model: locates the task in weight space.
"The model already knows sentiment analysis from pre-training. The demonstrations say: do THAT."
Key finding: Random inputs (scrambled text, not just wrong labels) drop accuracy by 16%. So inputs matter more than labels. The content of the input, not the correctness of the label, drives ICL performance.
Theory 2: Implicit Gradient Descent
Von Oswald et al. (2022) showed something remarkable: for a transformer with linear self-attention processing in-context demonstrations of a linear regression task, the forward pass can be made mathematically equivalent to one step of gradient descent. The model isn't just reading examples; it's running an optimizer.
Both standard gradient descent and the ICL forward pass minimize the same objective function. One updates weights explicitly; the other does it implicitly through attention.
- ▸ Each demonstration = 1 training example
- ▸ Attention mechanism = the optimizer
- ▸ Forward pass = the training loop
- ▸ More examples = more gradient steps
- ▸ Why more examples help: more steps
- ▸ Why order matters: steps are sequential
- ▸ Quality bounded by pre-training
- ▸ Early demos = early gradient steps
Drag the slider to add more demonstrations. Each new example moves the optimizer one step closer to the task optimum.
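The correspondence can be checked numerically in the simplest setting the theory covers: linear regression with unnormalized (linear) attention. A sketch under that assumption; real transformers use softmax attention, where the match is only approximate.

```python
import numpy as np

# One explicit GD step on linear regression from w = 0 vs. a linear-attention
# pass over the same demonstrations: both yield the identical test prediction.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))        # 8 demonstrations, 4-dim inputs
y = X @ rng.normal(size=4)         # labels from a hidden linear task
x_test = rng.normal(size=4)
eta = 0.1                          # learning rate

# Explicit optimizer: w1 = -eta * grad of 0.5 * ||X w - y||^2 at w = 0.
w1 = eta * X.T @ y
pred_gd = w1 @ x_test

# Implicit optimizer: query = x_test, keys = rows of X, values = y,
# combined by unnormalized dot-product attention.
pred_attn = eta * (X @ x_test) @ y

print(pred_gd, pred_attn)
```

More demonstrations (rows of `X`) sharpen the implicit step the same way more data sharpens the explicit one.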
Critique: The GD theory explains why more examples help and why order matters. But it doesn't explain why random labels barely hurt: gradient descent with wrong labels should fail badly. Theory 3 fills this gap.
Theory 3: Bayesian Inference
Xie et al. (2022) propose the cleanest theory: ICL is Bayesian inference. The model has a prior over tasks from pre-training. Each demonstration is evidence. The model updates its belief about which task you want, then predicts accordingly.
Watch how the model's task belief distribution sharpens as you add more demonstrations. Click "Add Demo" to step through.
The long-tailed distribution of tasks in pre-training data IS the prior. Common tasks are easy to locate with few examples.
The hidden concept C is never told to the model. It infers C from demonstrations, then uses C to predict the test label.
Why random labels barely hurt: Wrong labels are weak, misleading evidence, while the prior from pre-training is strong. The model's posterior is dominated by the prior, not by a handful of wrong demonstrations. This is why a few random labels barely move the needle.
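The update rule is ordinary Bayes. A toy sketch with an invented three-task prior and invented per-demo likelihoods; real models never represent these distributions explicitly.

```python
# Toy Bayesian view of ICL: a prior over tasks, updated once per demonstration.
def update_belief(prior, likelihood, n_demos):
    post = dict(prior)
    for _ in range(n_demos):
        post = {t: post[t] * likelihood[t] for t in post}   # multiply in evidence
        z = sum(post.values())
        post = {t: p / z for t, p in post.items()}          # renormalize
    return post

# Invented numbers: pre-training makes sentiment the most common task...
prior = {"sentiment": 0.6, "topic": 0.3, "translation": 0.1}
# ...and each sentiment demo fits "sentiment" far better than the alternatives.
likelihood = {"sentiment": 0.8, "topic": 0.3, "translation": 0.1}

belief = update_belief(prior, likelihood, n_demos=4)
print(belief)  # belief sharpens toward "sentiment" as demos accumulate
```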
| Property | Task Location | Implicit GD | Bayesian |
|---|---|---|---|
| Random labels barely hurt | ✓ labels just locate | ✗ GD needs correct signal | ✓ strong prior dominates |
| Order sensitivity | Partial | ✓ steps are sequential | Partial |
| More examples = better | ✓ | ✓ | ✓ |
| Explains emergence at scale | Partial | Partial | ✓ |
Chain-of-Thought
Standard ICL shows input→output. Chain-of-Thought shows input→reasoning→output. Adding intermediate steps to demonstrations dramatically improves performance on multi-step reasoning tasks: math, logic, commonsense.
Toggle between approaches and watch how the reasoning changes, and how accuracy on math problems jumps from 20% to 70%.
Complex task → sequence of simple tasks. Each reasoning step is ICL-easy. The model solves hard problems by chaining simple ones.
Intermediate steps can be checked. Errors caught earlier. Humans can trace the reasoning to find exactly where it went wrong.
Only works reliably above ~100B parameters. CoT requires genuine reasoning capacity β small models produce fluent-but-wrong chains.
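Side by side, the two demonstration styles differ only in the reasoning text. The arithmetic problem below is the classic tennis-balls example from the CoT literature, lightly paraphrased.

```python
# Standard demo: input -> answer only.
standard_demo = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: 11"
)

# Chain-of-thought demo: input -> reasoning steps -> answer.
cot_demo = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11."
)

print(cot_demo)
```

The extra tokens are the point: each intermediate sentence is a small, ICL-easy step the model can imitate.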
Work through this decision tree to find the right prompting strategy for your task.
Zero-shot CoT: Simply appending "Let's think step by step." to a prompt activates chain-of-thought reasoning; no examples needed. This phrase has become one of the most-studied "magic strings" in NLP research.
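In code, the zero-shot variant is a one-liner. A minimal sketch; the `Q:`/`A:` wrapper format is an illustrative assumption.

```python
def zero_shot_cot(question):
    # The Kojima et al. trigger phrase replaces all demonstrations.
    return f"Q: {question}\nA: Let's think step by step."

prompt = zero_shot_cot("If I have 3 apples and eat one, how many remain?")
print(prompt)
```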
What This Means
ICL is now the default way to use LLMs. Understanding its mechanics, and its limits, is the difference between prompts that work and prompts that don't.
2023 finding (Hendel et al.): a few ICL demonstrations can be compressed into a single "task vector", an activation injected at a specific layer. Same performance, 1/k the inference cost. The model encodes "the task" as a direction in activation space.
Bounded by model's max tokens. Can't fit 100 examples. Token budget forces hard trade-offs between demo count and input length.
A k-shot prompt costs roughly (k+1)× the input tokens of a zero-shot one. Expensive at API scale: fine-tuning becomes cost-effective beyond ~10,000 calls/day with the same k examples.
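The trade-off is easy to put in numbers. A back-of-envelope sketch; the call volume and per-demo token count below are illustrative assumptions, not measured figures.

```python
# Extra prompt tokens spent per day just on carrying k demonstrations.
def daily_demo_tokens(calls_per_day, k, tokens_per_demo):
    return calls_per_day * k * tokens_per_demo

# Assumed workload: 10,000 calls/day, 8 demos of ~50 tokens each.
extra = daily_demo_tokens(calls_per_day=10_000, k=8, tokens_per_demo=50)
print(extra)  # tokens/day a fine-tuned model would not have to pay for
```

At this assumed volume the demonstrations alone cost 4 million prompt tokens per day, which is the kind of number that makes a one-off fine-tune attractive.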
Learning disappears after the conversation. The model doesn't remember previous interactions. Every new call starts fresh from frozen weights.
Performance can swing 30% just from shuffling the examples. ICL is sensitive to the exact sequence. Always validate ordering on a held-out set.
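Validating orderings can be automated. A sketch: `accuracy` stands in for a real evaluation call against the model on a held-out set, and the toy scorer below is invented purely to make the example runnable.

```python
import itertools

def best_ordering(demos, heldout, accuracy, max_perms=24):
    """Try up to max_perms orderings; keep the one scoring best on heldout."""
    perms = itertools.islice(itertools.permutations(demos), max_perms)
    return max(perms, key=lambda order: accuracy(list(order), heldout))

# Toy scorer: pretend that ending with the hardest demo works best.
toy_accuracy = lambda order, _heldout: 1.0 if order[-1] == "hard demo" else 0.5

best = best_ordering(["easy demo", "medium demo", "hard demo"], None, toy_accuracy)
print(best)
```

Capping the permutation count matters: k demos have k! orderings, so exhaustive search stops being practical past k ≈ 5.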