The Discovery
In 2020, OpenAI scaled GPT to 175 billion parameters and found something nobody designed. By seeing a few input→output examples in the prompt, the frozen model learned new tasks instantly. No retraining. No weight updates. Just examples in a prompt.
TL;DR: In 2020, OpenAI scaled GPT to 175 billion parameters and found something unexpected: the model could perform brand-new tasks just by reading a few examples in the prompt, without any retraining. They called it in-context learning. Nobody designed this. It emerged. Three years of research later, we still don't fully understand it.
Click the buttons to see how the same frozen model changes behavior based on examples it sees in the prompt.
- ✗ Thousands of labelled examples
- ✗ Gradient updates, backprop
- ✗ New model checkpoint saved
- ✗ Hours of GPU compute
- ✗ Weights permanently changed
- ✗ One model per task
- ✓ 0–8 examples in the prompt
- ✓ Zero gradient updates
- ✓ No new checkpoint needed
- ✓ Milliseconds to adapt
- ✓ Weights completely frozen
- ✓ One model, infinite tasks
One frozen model can perform classification, translation, summarization, and coding, just by changing the examples in the prompt.
New task in seconds, not hours. No training infrastructure needed. Change the prompt and you have a new capability.
No GPU cluster needed for adaptation. API access is sufficient. ICL democratized LLM capabilities for millions of developers.
How It Works
Every ICL prompt has the same anatomy: a sequence of (input, label) pairs, called demonstrations, followed by a test input. The model predicts the label for the final input. Simple in structure, mysterious in operation.
Every ICL prompt follows this structure. The model "reads" all demonstrations, infers the pattern, and fills in the final label.
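The anatomy above can be sketched as a plain string template. A minimal sketch: the `Input:`/`Label:` wording is an illustrative convention, not something ICL requires.

```python
# Minimal sketch of ICL prompt anatomy: k (input, label) demonstrations,
# then the test input with its label left blank for the model to fill in.
def build_icl_prompt(demos, test_input):
    blocks = [f"Input: {x}\nLabel: {y}" for x, y in demos]
    blocks.append(f"Input: {test_input}\nLabel:")  # model predicts this label
    return "\n\n".join(blocks)

demos = [
    ("The movie was fantastic!", "Positive"),
    ("Terrible service, never again.", "Negative"),
]
prompt = build_icl_prompt(demos, "A delightful surprise.")
print(prompt)
```

Any demonstration format works as long as it is consistent; the model picks up whatever pattern the demos establish.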
Adjust the sliders to see how k (number of examples), order, and label correctness affect accuracy. The results are surprising.
Key insight: Label correctness barely matters; order matters far more than whether labels are right or wrong!
Every additional demonstration is another pass through the full context. ICL is powerful but not free: cost scales linearly with k.
Attention heatmap: the final prediction token (right) attends strongly across demonstration tokens, especially similar examples.
Theory 1: Task Location
Min et al. (2022) ran a provocative experiment: what if you scrambled all the labels? They expected performance to collapse. Instead, accuracy barely dropped. Their conclusion: demonstrations don't teach the model the task; they locate which pre-trained task to use.
Min et al. (2022) swapped all labels to random wrong ones. Click "Shuffle Labels" to see what happened.
Result: With correct labels: ~89% accuracy. With random (wrong) labels: ~85% accuracy. Only a four-point drop. The labels aren't the point.
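The manipulation itself is easy to reproduce. A sketch of the data side only, assuming a binary label space; the model call that would measure accuracy is omitted.

```python
import random

# Sketch of the Min et al. (2022) manipulation: keep every input, but
# replace its gold label with one drawn uniformly from the label space.
def shuffle_labels(demos, label_space, seed=0):
    rng = random.Random(seed)
    return [(x, rng.choice(label_space)) for x, _ in demos]

demos = [
    ("great film", "Positive"),
    ("awful plot", "Negative"),
    ("loved every minute", "Positive"),
    ("a total mess", "Negative"),
]
random_demos = shuffle_labels(demos, ["Positive", "Negative"])
# Inputs, label space, and format are all preserved; only label
# correctness is destroyed -- the thing that turns out to matter least.
print(random_demos)
```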
Shows the model what kind of inputs to expect β sentiment text, not arithmetic or translation.
Establishes the label space: "Positive"/"Negative", not "Yes"/"No" or "1"/"0".
Makes clear this is a classification task vs. a generation or translation task.
Activates the right "subroutine" in the pre-trained model: locates the task in weight space.
"The model already knows sentiment analysis from pre-training. The demonstrations say: do THAT."
Key finding: Random inputs (scrambled text, not just wrong labels) drop accuracy by 16%. So inputs matter more than labels. The content of the input, not the correctness of the label, drives ICL performance.
Theory 2: Implicit Gradient Descent
Von Oswald et al. (2022) showed something remarkable: for a transformer with linear self-attention processing in-context demonstrations of a linear regression task, the forward pass can be made mathematically equivalent to one step of gradient descent. The model isn't just reading examples; it's running an optimizer.
Both standard gradient descent and the ICL forward pass minimize the same objective function. One updates weights explicitly; the other does it implicitly through attention.
- ▸ Each demonstration = 1 training example
- ▸ Attention mechanism = the optimizer
- ▸ Forward pass = the training loop
- ▸ More examples = more gradient steps
- ▸ Why more examples help: more steps
- ▸ Why order matters: steps are sequential
- ▸ Quality bounded by pre-training
- ▸ Early demos = early gradient steps
Drag the slider to add more demonstrations. Each new example moves the optimizer one step closer to the task optimum.
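The correspondence can be checked numerically in the simplest setting the theory covers: linear regression with unnormalized (linear) attention. A sketch under that assumption; real transformers use softmax attention, where the match is only approximate.

```python
import numpy as np

# One explicit GD step on linear regression from w = 0 vs. a linear-attention
# pass over the same demonstrations: both yield the identical test prediction.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))        # 8 demonstrations, 4-dim inputs
y = X @ rng.normal(size=4)         # labels from a hidden linear task
x_test = rng.normal(size=4)
eta = 0.1                          # learning rate

# Explicit optimizer: w1 = -eta * grad of 0.5 * ||X w - y||^2 at w = 0.
w1 = eta * X.T @ y
pred_gd = w1 @ x_test

# Implicit optimizer: query = x_test, keys = rows of X, values = y,
# combined by unnormalized dot-product attention.
pred_attn = eta * (X @ x_test) @ y

print(pred_gd, pred_attn)
```

More demonstrations (rows of `X`) sharpen the implicit step the same way more data sharpens the explicit one.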
Critique: The GD theory explains why more examples help and why order matters. But it doesn't explain why random labels barely hurt: gradient descent with wrong labels should fail badly. Theory 3 fills this gap.
Theory 3: Bayesian Inference
Xie et al. (2022) propose the cleanest theory: ICL is Bayesian inference. The model has a prior over tasks from pre-training. Each demonstration is evidence. The model updates its belief about which task you want, then predicts accordingly.
Watch how the model's task belief distribution sharpens as you add more demonstrations. Click "Add Demo" to step through.
The long-tailed distribution of tasks in pre-training data IS the prior. Common tasks are easy to locate with few examples.
The hidden concept C is never told to the model. It infers C from demonstrations, then uses C to predict the test label.
Why random labels barely hurt: Wrong labels are weak, misleading evidence, while the prior from pre-training is strong. The model's posterior is dominated by the prior, not by a handful of wrong demonstrations. This is why a few random labels barely move the needle.
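The update rule is ordinary Bayes. A toy sketch with an invented three-task prior and invented per-demo likelihoods; real models never represent these distributions explicitly.

```python
# Toy Bayesian view of ICL: a prior over tasks, updated once per demonstration.
def update_belief(prior, likelihood, n_demos):
    post = dict(prior)
    for _ in range(n_demos):
        post = {t: post[t] * likelihood[t] for t in post}   # multiply in evidence
        z = sum(post.values())
        post = {t: p / z for t, p in post.items()}          # renormalize
    return post

# Invented numbers: pre-training makes sentiment the most common task...
prior = {"sentiment": 0.6, "topic": 0.3, "translation": 0.1}
# ...and each sentiment demo fits "sentiment" far better than the alternatives.
likelihood = {"sentiment": 0.8, "topic": 0.3, "translation": 0.1}

belief = update_belief(prior, likelihood, n_demos=4)
print(belief)  # belief sharpens toward "sentiment" as demos accumulate
```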
| Property | Task Location | Implicit GD | Bayesian |
|---|---|---|---|
| Random labels barely hurt | ✓ labels just locate | ✗ GD needs correct signal | ✓ strong prior dominates |
| Order sensitivity | Partial | ✓ steps are sequential | Partial |
| More examples = better | ✓ | ✓ | ✓ |
| Explains emergence at scale | Partial | Partial | ✓ |
Chain-of-Thought
Standard ICL shows input→output. Chain-of-Thought shows input→reasoning→output. Adding intermediate steps to demonstrations dramatically improves performance on multi-step reasoning tasks: math, logic, commonsense.
Toggle between approaches and watch how the reasoning changes, and how accuracy on math problems jumps from 20% to 70%.
Complex task → sequence of simple tasks. Each reasoning step is ICL-easy. The model solves hard problems by chaining simple ones.
Intermediate steps can be checked. Errors caught earlier. Humans can trace the reasoning to find exactly where it went wrong.
Only works reliably above ~100B parameters. CoT requires genuine reasoning capacity β small models produce fluent-but-wrong chains.
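Side by side, the two demonstration styles differ only in the reasoning text. The arithmetic problem below is the classic tennis-balls example from the CoT literature, lightly paraphrased.

```python
# Standard demo: input -> answer only.
standard_demo = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: 11"
)

# Chain-of-thought demo: input -> reasoning steps -> answer.
cot_demo = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11."
)

print(cot_demo)
```

The extra tokens are the point: each intermediate sentence is a small, ICL-easy step the model can imitate.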
Work through this decision tree to find the right prompting strategy for your task.
Zero-shot CoT: Simply appending "Let's think step by step." to a prompt activates chain-of-thought reasoning; no examples needed. This phrase has become one of the most-studied "magic strings" in NLP research.
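In code, the zero-shot variant is a one-liner. A minimal sketch; the `Q:`/`A:` wrapper format is an illustrative assumption.

```python
def zero_shot_cot(question):
    # The Kojima et al. trigger phrase replaces all demonstrations.
    return f"Q: {question}\nA: Let's think step by step."

prompt = zero_shot_cot("If I have 3 apples and eat one, how many remain?")
print(prompt)
```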
What This Means
ICL is now the default way to use LLMs. Understanding its mechanics, and its limits, is the difference between prompts that work and prompts that don't.
2023 finding (Hendel et al.): a few ICL demonstrations can be compressed into a single "task vector", an activation injected at a specific layer. Same performance, 1/k the inference cost. The model encodes "the task" as a direction in activation space.
Bounded by model's max tokens. Can't fit 100 examples. Token budget forces hard trade-offs between demo count and input length.
A k-shot prompt costs roughly (k+1)× the input tokens of a zero-shot one. Expensive at API scale: fine-tuning becomes cost-effective beyond ~10,000 calls/day with the same k examples.
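The trade-off is easy to put in numbers. A back-of-envelope sketch; the call volume and per-demo token count below are illustrative assumptions, not measured figures.

```python
# Extra prompt tokens spent per day just on carrying k demonstrations.
def daily_demo_tokens(calls_per_day, k, tokens_per_demo):
    return calls_per_day * k * tokens_per_demo

# Assumed workload: 10,000 calls/day, 8 demos of ~50 tokens each.
extra = daily_demo_tokens(calls_per_day=10_000, k=8, tokens_per_demo=50)
print(extra)  # tokens/day a fine-tuned model would not have to pay for
```

At this assumed volume the demonstrations alone cost 4 million prompt tokens per day, which is the kind of number that makes a one-off fine-tune attractive.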
Learning disappears after the conversation. The model doesn't remember previous interactions. Every new call starts fresh from frozen weights.
Performance can swing 30% just from shuffling the examples. ICL is sensitive to the exact sequence. Always validate ordering on a held-out set.
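Validating orderings can be automated. A sketch: `accuracy` stands in for a real evaluation call against the model on a held-out set, and the toy scorer below is invented purely to make the example runnable.

```python
import itertools

def best_ordering(demos, heldout, accuracy, max_perms=24):
    """Try up to max_perms orderings; keep the one scoring best on heldout."""
    perms = itertools.islice(itertools.permutations(demos), max_perms)
    return max(perms, key=lambda order: accuracy(list(order), heldout))

# Toy scorer: pretend that ending with the hardest demo works best.
toy_accuracy = lambda order, _heldout: 1.0 if order[-1] == "hard demo" else 0.5

best = best_ordering(["easy demo", "medium demo", "hard demo"], None, toy_accuracy)
print(best)
```

Capping the permutation count matters: k demos have k! orderings, so exhaustive search stops being practical past k ≈ 5.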