Visual Summary
How AI Generates Text — Sampling Techniques
Why AI Writing Goes Wrong
AI language models are excellent at understanding text, but something goes wrong the moment they start generating it. Even the smartest models — the same ones that ace reading comprehension tests — produce repetitive, robotic, or nonsensical text. Why? And what's the fix?
TL;DR โ€” The Paper in One Paragraph

In 2019, researchers at the University of Washington discovered that the way AI picks its next word — the decoding strategy — matters more than the model itself. They showed that 6 different strategies produce wildly different text from the same model, and proposed Nucleus Sampling as the fix that gets closest to human-like writing. This interactive explorer lets you see exactly why each method works or fails.

6
Decoding strategies compared
~0%
Repetition with Nucleus Sampling
98%
Of greedy-decoded text is repetitive by 200 tokens
2019
Year published — nucleus sampling is still widely used in LLMs today
Watch the Degeneration Happen

Both texts start with the same prompt: "Once upon a time, there was a curious robot who". Press Generate to see how a human continues the story vs what an AI does when it only picks the most likely word each time.

Human Writer
AI (Greedy — always picks the most likely word)
What is a "Decoding Strategy"?

Every time an AI writes a word, it first computes a probability score for every word in its vocabulary (often 50,000+ words). The scores look like this:

The AI assigns each word a score. A decoding strategy is the rule for which word to pick next. Should you always pick the top one? Pick randomly? Pick from only the top 10? The answer changes everything.

Everyday Analogy

"Imagine you're finishing someone's sentence and you have a dictionary in front of you with a score next to each word. The strategy is your rule: always pick the highest score? Roll a dice? Only consider the top 5? Each rule gives a completely different result โ€” even though the dictionary (the AI model) is identical."

Now let's explore each strategy one by one, starting with the simplest: ① Greedy Decoding →
How the Model Picks Its Next Word
Before comparing strategies, you need to understand the machinery underneath. Every sampling technique works with two things: logits (raw scores the model produces) and probabilities (what sampling actually uses). Here's where both come from.
The Full Pipeline: From Your Prompt to a Probability Score

Every time the model needs to pick the next word, it runs through this pipeline — compress the entire conversation into a single vector, then score every word in the vocabulary against it:

The Logit Formula (One Line)
logit(word_i) = h · W_i = h₁×W_i₁ + h₂×W_i₂ + … + h_d×W_id
h
The hidden state — a vector of ~4,096 numbers that summarises everything the model knows about the context so far. It's the output of all the transformer layers.
W_i
The word weight row for word i — also ~4,096 numbers, one row of the vocabulary weight matrix. Encodes "what this word looks like." Learned during training.
·
The dot product — multiply matching elements, sum them all. Higher = word i fits this context well.
Interactive: Watch the Dot Product Compute a Logit

Simplified to 4 dimensions (real models use ~4,096). Context: "The capital of France is ___". Select a word to see its dot product with the context vector computed step by step.
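The same computation can be sketched in a few lines of Python. The 4-dimensional vectors below are invented numbers purely for illustration — real models use ~4,096 learned dimensions:

```python
# Toy 4-dimensional version of the logit computation shown above.
# All vector values are made up for illustration.

def dot(h, w):
    """Dot product: multiply matching elements, sum them all."""
    return sum(hi * wi for hi, wi in zip(h, w))

# Hidden state summarising the context "The capital of France is ___"
h = [0.9, -0.3, 0.7, 0.2]

# One learned weight row per candidate word (rows of the matrix W)
W = {
    "Paris":  [1.0, -0.2, 0.8, 0.1],
    "city":   [0.4,  0.1, 0.3, 0.5],
    "cheese": [-0.6, 0.9, -0.4, 0.2],
}

logits = {word: dot(h, row) for word, row in W.items()}
for word, z in sorted(logits.items(), key=lambda kv: -kv[1]):
    print(f"{word:>7}: logit = {z:+.2f}")
```

The word whose weight row points in the same direction as the context vector ("Paris" here) wins the highest logit, exactly as in the matchmaker analogy below.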

Where Do the Weights W Come From?

W is learned during training. Every time the model predicted the wrong word, training slightly adjusted the weights so the correct word scores higher. After trillions of tokens, the weights encode a deep understanding of which words follow which contexts.

📚
Llama-3 (8B)
h = 4,096 dims
W = 128k × 4,096
= ~500M logit weights
🧠
GPT-4 class
h ≈ 12,288 dims
W ≈ 100k × 12,288
= ~1.2B logit weights
⏱
Per token
All ~50k–100k dot
products computed
in one GPU step
Everyday Analogy

"Imagine you're a matchmaker. You have a client profile h ('wants: outdoorsy, funny, tall'). You have a database of candidates, each with their own profile W_i. The dot product is the compatibility score. The candidate with the highest score is your top pick. That's what the model does โ€” with 4,096 traits and 50,000 candidates, in milliseconds."

What Are Logits โ€” and How Do They Become Probabilities?

The dot products above are called logits — raw confidence scores, one per word. They can't be used directly for sampling because they can be negative and don't sum to 100%. Softmax converts them:

Logits — Raw scores (NOT probabilities)
✗ Can be negative
✗ Don't add up to 100%
✗ Can't be used directly
After Softmax — Probabilities
✓ Always between 0% and 100%
✓ All words sum to exactly 100%
✓ Ready to use for sampling
3 Key Properties of Logits
1
Any real number โ€” positive or negative
A word with logit -5.0 has very low (but nonzero) probability. A word with logit +5.0 has a high probability.
2
Only the gaps between logits matter
[4, 2, -1] gives identical probabilities to [104, 102, 99]. Temperature works by dividing the logits (rescaling the gaps), not adding to them.
3
Softmax converts logits → probabilities
Squashes any set of numbers into values between 0 and 1 that sum to exactly 1.0 (100%). The bridge between raw scores and usable probabilities.
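Softmax itself is a few lines of code. This sketch also checks key property 2: shifting every logit by +100 leaves the probabilities untouched:

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    # Subtracting the max is a standard numerical-stability trick --
    # it changes nothing, because only the gaps between logits matter.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

p1 = softmax([4.0, 2.0, -1.0])
p2 = softmax([104.0, 102.0, 99.0])
print([round(p, 4) for p in p1])  # same values as p2
assert all(abs(a - b) < 1e-12 for a, b in zip(p1, p2))
```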
Interactive: Logit Gaps Control the Probability Distribution

Compress or expand the logit gaps. Zero gaps → all words equal. Large gaps → one word dominates. This is the core mechanic that temperature sampling manipulates.

Logit gap (spread) gap = 1.0×
Everyday Analogy

"Logits are like raw exam scores before grading. Student A: 85, B: 72, C: 40. Those numbers can't tell you the pass rate directly โ€” you need to normalise them (softmax). And if all scores are within 1 point (small gap), it's a near-tie. If one student scores 200 points above everyone (large gap), they dominate โ€” just like a high-logit word dominates probability."

Foundation set. Now let's explore what each strategy does with these probability scores: ① Greedy Decoding →
Greedy Decoding
The simplest rule: at every step, pick the word with the highest probability. No randomness. No planning. Just always take the "safest" choice.
The Rule in Plain English

Look at all the word scores. Find the highest one. Pick that word. Move to the next word. Repeat. Forever.
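The rule above fits in a few lines of Python. TOY_MODEL is an invented next-word lookup table standing in for a real model; note how the loop built into the table ("big" leads back to "the") traps greedy in repetition:

```python
def greedy_pick(probs):
    """The entire greedy rule: return the highest-probability word."""
    return max(probs, key=probs.get)

def greedy_generate(model, word, steps):
    """Repeatedly append the single most likely next word."""
    out = [word]
    for _ in range(steps):
        word = greedy_pick(model[word])
        out.append(word)
    return " ".join(out)

# Toy next-word probability table (invented for illustration).
TOY_MODEL = {
    "robot":  {"walked": 0.6, "dreamed": 0.4},
    "walked": {"to": 0.7, "past": 0.3},
    "to":     {"the": 0.9, "a": 0.1},
    "the":    {"door": 0.5, "city": 0.4, "moon": 0.1},
    "door":   {"was": 0.8, "opened": 0.2},
    "was":    {"very": 0.6, "old": 0.4},
    "very":   {"big": 0.7, "small": 0.3},
    "big":    {"the": 0.9, "and": 0.1},   # loops back -> repetition trap
}

print(greedy_generate(TOY_MODEL, "robot", 12))
# robot walked to the door was very big the door was very big
```

The output already shows the degeneration: once the highest-probability path re-enters the loop, it cycles forever.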

Step-by-Step Greedy Generation

Seed: "Once upon a time, a curious robot" โ€” press Next Word to see greedy pick at each step.

Once upon a time, a curious robot
Step 0 of 8
The Repetition Trap

Here's what happens after 50+ words. Greedy gets trapped in a loop — the same high-probability phrase keeps winning, over and over:

"The robot walked to the door . The door was very big . The door was very big . The door was very big . The door was very big . The door was very big . The door was very big . The door was very big ."

This is called neural text degeneration — the model isn't broken, it's just doing exactly what it was told. "Very big" is always the highest-probability continuation, so it gets picked every single time.

Everyday Analogy

"Greedy decoding is like texting with predictive text and always tapping the first suggestion. Your message quickly becomes 'the the the the the' โ€” grammatically correct, completely meaningless. The phone is doing exactly what you asked; the rule is just bad."

Key weakness: Greedy never looks more than one word ahead. It can't see that picking a slightly-less-likely word now might lead to a much better sentence overall. It's the ultimate short-sighted strategy.

What if the AI planned several words ahead at once? ② Beam Search →
Beam Search
Beam search fixes greedy's short-sightedness by exploring several possible continuations simultaneously — keeping the best ones, discarding the rest. It's like a chess player thinking 5 moves ahead instead of just 1.
The Rule in Plain English

Keep track of the top W sequences at all times (W = "beam width"). At each step, expand every sequence by one word, score them all, keep only the best W. At the end, pick the overall winner.
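Here's a minimal sketch of that rule in Python, again with an invented next-word table in place of a real model. With width 2 the search finds a sequence that beats the width-1 (greedy) choice:

```python
import math

def beam_search(model, seed, steps, width):
    """Keep the `width` best sequences; expand each by one word per step.
    Scores are summed log-probabilities (= products of probabilities)."""
    beams = [([seed], 0.0)]                # (sequence, log-prob so far)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for word, prob in model[seq[-1]].items():
                candidates.append((seq + [word], score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:width]         # prune to the best `width`
    return beams[0]                        # overall winner

# Toy table where greedy's first pick ("was", 0.6) leads to a worse
# overall sequence than the less likely "climbed" (0.4 x 0.9 = 0.36).
TOY_MODEL = {
    "robot":   {"was": 0.6, "climbed": 0.4},
    "was":     {"big": 0.5, "old": 0.5},
    "climbed": {"up": 0.9, "out": 0.1},
}

seq, logp = beam_search(TOY_MODEL, "robot", steps=2, width=2)
print(" ".join(seq), round(math.exp(logp), 2))  # robot climbed up 0.36
```

Width 1 reproduces greedy ("robot was big", probability 0.30); width 2 spots the better overall path. This is exactly the planning advantage — and, as the paper shows below, also exactly the problem for open-ended text.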

Beam Search Tree — Watch it Explore and Prune
Beam width W W = 3

W=1 is identical to greedy. Higher W explores more paths. The highlighted path is the winner.

Seed: "a curious robot" โ†’ exploring 3 steps ahead with W beams

The Surprising Problem: Planning Ahead Makes Things Worse

You'd expect more planning = better text. The paper found the opposite for open-ended generation:

Human text (natural)
"She climbed the worn stone steps slowly, each one a memory. At the top, the sea stretched endlessly โ€” grey, immense, indifferent. She didn't feel small. She felt free."
Beam Search (W=5)
"She was the most important person in the world and she was the best person in the world and she was very very happy and she was very very good and very very ..."

Beam search finds the highest-probability text — but humans don't write high-probability text! We use surprising words, specific details, and unexpected turns. Generic phrases like "very very good" score high because they're common, not because they're interesting.

Everyday Analogy

"Beam search is like a chef planning a 5-course meal by scoring thousands of menu combinations and picking the highest-rated one. The problem: 'chicken + salad + bread + soup + water' scores highest because those are the most common items โ€” but it's the most boring meal imaginable. The best chefs make unexpected combinations."

Key insight from the paper: Human text is NOT the highest-probability text. Humans regularly pick surprising, specific words. Any strategy that maximizes probability will always diverge from how humans actually write.

Both deterministic methods fail. What if we add some randomness? ③ Temperature Sampling →
Temperature Sampling
Instead of always picking the top word, what if we used the probabilities as a menu — rolling a weighted die? Temperature controls how adventurous the die is. Too cold = same word every time. Too hot = random gibberish.
The Rule in Plain English

Before rolling the dice, reshape the probability scores using a temperature knob T:
Low T (e.g. 0.3) → make the highest-scoring word even more dominant (safe, boring).
High T (e.g. 1.5) → flatten all scores so every word seems equally likely (wild, random).
T = 1.0 → use the original probabilities unchanged (neutral baseline).
The allowed range varies by provider: Anthropic's Claude caps at 1.0; OpenAI and Google allow up to 2.0. Mathematically there's no upper bound — but above ~1.5 text rapidly becomes incoherent.

See the Temperature Effect in Real Time

Drag the temperature slider and watch the word probabilities reshape. The taller the bar, the more likely that word gets picked.

Temperature T T = 1.0
The Three Temperature Zones
🥶
T < 0.5 — Too Cold
The highest-probability word dominates completely. Becomes nearly identical to greedy. Text is fluent but repetitive and generic.
"The robot walked to the door. The door was very big. The door was very big..."
😊
T ≈ 0.7–1.0 — Balanced
Original probabilities preserved or slightly sharpened. Some randomness and variety. Often a reasonable starting point.
"The robot walked through the city, searching for something it couldn't name."
🔥
T > 1.2 — Too Hot
All words become nearly equally likely. Rare, strange words get picked just as often as common ones. Text becomes incoherent.
"The robot banana danced kaleidoscope beneath seventeen purple mathematics."
Everyday Analogy

"Temperature is like adjusting the seasoning on food. Too little salt (low T) = bland, predictable, boring. Perfect amount = delicious and interesting. Too much salt (high T) = inedible chaos. The problem with temperature is there's no 'perfect' value that works for all sentences โ€” sometimes the distribution naturally needs more freedom, sometimes less."

Key weakness: Temperature is a one-size-fits-all knob. It reshapes ALL the probabilities by the same amount, regardless of whether the current word slot is obvious (only one good choice) or ambiguous (many equally good choices). It needs a smarter companion.

The Mathematics — How Temperature Actually Works

You've seen in the Foundation section that logits are raw scores and softmax converts them to probabilities. Temperature adds one step between them — dividing every logit by T before softmax is applied:

The Temperature Formula (Step by Step)
1
Start with raw logits (model's raw scores)
zโ‚, zโ‚‚, zโ‚ƒ, ... , z_V (one score per word in vocabulary V)
Example: z_Paris = 4.2, z_city = 1.6, z_cheese = -1.4
2
Divide every logit by temperature T
scaled logit = zแตข / T
T < 1 โ†’ dividing by a small number makes the scores more extreme (gaps get bigger)
T > 1 โ†’ dividing by a large number makes the scores closer together (gaps shrink)
T = 1 โ†’ scores unchanged
3
Apply softmax to convert to probabilities (0โ€“100% each, all summing to 100%)
exp(zแตข / T) p(word_i) = โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ ฮฃโฑผ exp(zโฑผ / T)
exp() = e^x, a standard math function. The denominator sums over all words, ensuring probabilities add to 1.
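The three steps can be written directly in Python. The logits are the invented example values from step 1:

```python
import math

def temperature_softmax(logits, T):
    """Steps 2 and 3 above: divide every logit by T, then softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)                       # stability shift; gaps unchanged
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Step-1 example logits: Paris, city, cheese
logits = [4.2, 1.6, -1.4]

for T in (0.5, 1.0, 2.0):
    probs = temperature_softmax(logits, T)
    print(f"T={T}:", [f"{p:.1%}" for p in probs])
```

Running it shows the top word's share growing at T=0.5 and shrinking at T=2.0, with T=1.0 in between.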
Why T Can Go From 0 to Infinity — and What Happens at the Extremes
→ 0
T → 0
One word gets 100%. Identical to Greedy.
🥶
T = 0.5
Sharper than original. Top word very dominant.
✓
T = 1.0
Original distribution. No change.
🔥
T = 2.0
Flatter. All words more equally likely.
💥
T → ∞
Perfectly uniform. All words equally likely.
What Happens at T = โˆž?

As T grows very large, every scaled logit zᵢ/T → 0. And exp(0) = 1 for every word. So every word gets the same numerator, and the formula becomes:

p(word_i) = 1 / |V| (identical for every word)

With a vocabulary of ~50,000 words, each word gets exactly 0.002% probability. The AI is rolling a 50,000-sided die. "Paris" has the exact same chance as "banana" or "seventeen" or any nonsense word. The model's learned knowledge becomes completely irrelevant โ€” it's pure noise.

At T=1: "Paris" = 52%, "banana" = 0.001%
→
At T=∞: "Paris" = 0.002%, "banana" = 0.002%
What Happens at T โ†’ 0?

As T shrinks toward 0, dividing by a tiny number amplifies score gaps enormously. The highest-scoring word's probability approaches 100%; all others approach 0%. This is exactly greedy decoding — always picking the single most likely word, no randomness.

As T → 0: p(argmax word) → 1.0, p(all others) → 0.0

This is why T=0 is impossible to compute directly (division by zero) — but T=0.01 is effectively identical to greedy decoding in practice.
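A quick numerical check of both extremes, using the same invented example logits (the helper is redefined so the snippet is self-contained). A tiny T hands the top word essentially all the probability; a huge T flattens everything toward uniform:

```python
import math

def temperature_softmax(logits, T):
    """Divide every logit by T, then softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.2, 1.6, -1.4]                   # Paris, city, cheese

cold = temperature_softmax(logits, 0.01)    # effectively greedy
hot  = temperature_softmax(logits, 1000.0)  # effectively uniform

print([round(p, 3) for p in cold])  # top word gets essentially 100%
print([round(p, 3) for p in hot])   # every word close to 1/3
```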

Interactive: Extend the Slider to Extremes

The slider below extends to T=50 so you can watch the distribution flatten toward uniformity. Compare what you see to the standard 0–2 range above.

Temperature T T = 1.00
Top-K Sampling
Instead of choosing from all 50,000 words, what if we kept only the top K most likely words and picked randomly from those? This avoids incoherence while adding variety. The catch: K is a fixed number that can't know what the context needs.
The Rule in Plain English

Sort all words by probability. Keep only the top K. Throw away all the rest (set their probability to zero). Randomly pick from the remaining K words using their relative probabilities.
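The whole rule is a sort, a slice, and a weighted draw. The probability table below is invented for illustration, and Python's random.choices renormalises the surviving weights automatically:

```python
import random

def top_k_sample(probs, k, rng=random):
    """Keep the k most likely words, zero out the rest, sample."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    words = [w for w, _ in top]
    weights = [p for _, p in top]          # choices() renormalises these
    return rng.choices(words, weights=weights)[0]

# Invented next-word probabilities for "The capital of France is ___"
probs = {"Paris": 0.52, "located": 0.18, "a": 0.08,
         "beautiful": 0.05, "cheese": 0.01}

# With K=2, only "Paris" and "located" can ever be picked
picks = {top_k_sample(probs, k=2) for _ in range(200)}
print(picks)
```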

Adjust K — See Which Words Are Allowed
Top-K value K = 5
The Core Problem: K Can't Adapt to Context

Here are two different sentences the AI might be completing. The right K is completely different for each:

Situation A — Obvious context
Prompt: "The capital of France is ___"
Only 1–2 words are sensible ("Paris", "located"). K=50 would let the AI say "cheese" or "beautiful" — wrong!
Ideal K: 2–3
Situation B — Creative context
Prompt: "She looked up at the night sky and felt ___"
Dozens of words work: "lonely", "small", "infinite", "hopeful", "amazed"... K=3 would cut out most good options!
Ideal K: 20–50

Since K is fixed before generation starts, it can't know which situation it's in. This is the fundamental limitation that Nucleus Sampling was designed to solve.

Everyday Analogy

"Imagine a music app that always shows you exactly 5 song recommendations, no matter what. On a lazy Sunday when you feel like anything, 5 options is too limiting. On a busy morning when you just want your usual commute playlist, 5 options has too much noise. The right number depends on the situation โ€” but Top-K forces you to pick a number before you know the situation."

Key weakness: K is blind to context. When the distribution is peaked (one clear winner), even K=50 includes nonsense. When the distribution is flat (many good choices), K=5 cuts out most of the good options. We need K to adapt automatically.

What if instead of a fixed K, we used the distribution itself to decide how many words to consider? ⑤ Nucleus Sampling →
Nucleus Sampling (Top-p)
The insight: instead of a fixed count of words, keep the smallest group of words whose combined probability reaches a threshold p. This group — the "nucleus" — automatically shrinks when context is obvious and expands when context is creative.
The Rule in Plain English

Sort words by probability (highest first). Keep adding words to a list until their combined probability reaches p (e.g., 90%). That list is the "nucleus." Pick randomly from only those words. The size of the list changes automatically every single word.
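The rule in code. The two invented distributions mirror the peaked and flat contexts from the Top-K section; watch the nucleus size adapt on its own:

```python
import random

def nucleus(probs, p):
    """Smallest set of top-probability words whose total reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, running = [], 0.0
    for word, prob in ranked:
        kept.append(word)
        running += prob
        if running >= p:
            break                          # nucleus complete; tail discarded
    return kept

def nucleus_sample(probs, p, rng=random):
    kept = nucleus(probs, p)
    return rng.choices(kept, weights=[probs[w] for w in kept])[0]

# Peaked context: "The capital of France is ___"
peaked = {"Paris": 0.80, "located": 0.12, "a": 0.04, "cheese": 0.04}
# Flat context: "She looked up and felt ___"
flat = {w: 0.125 for w in ("lonely", "small", "infinite", "hopeful",
                           "amazed", "calm", "alive", "free")}

print(len(nucleus(peaked, 0.9)), len(nucleus(flat, 0.9)))  # 2 8
```

Same threshold p=0.9, but the nucleus holds 2 words in the peaked case and all 8 in the flat case — the adaptivity that a fixed K can't provide.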

Watch the Nucleus Form — and Change Size

The shaded region is the nucleus — the words the AI is allowed to pick from. Toggle between a peaked and flat distribution to see the nucleus automatically resize.

Threshold p p = 0.90
Nucleus vs Top-K — The Same Two Contexts
Peaked context: "The capital of France is ___"
Top-K (K=50)
Includes 48 irrelevant words. Might pick "cheese" or "beautiful" with small but nonzero probability.
Nucleus (p=0.9)
Nucleus contains only "Paris" + "located". Automatically excludes all wrong answers. ✓
Flat context: "She looked up and felt ___"
Top-K (K=5)
Only 5 words allowed. Cuts out "infinite", "small", "hopeful", "amazed" — all equally good options!
Nucleus (p=0.9)
Nucleus expands to include ~20 words. All good creative choices remain available. ✓
Everyday Analogy

"Nucleus sampling is like a smart waiter at a restaurant. Instead of always offering you the same fixed number of dishes, they look at today's menu quality. On a day when one dish is clearly the star, they say 'you should really try this one.' On a day when everything is equally good, they bring you the full menu. The list size adapts to what makes sense."

Why Not Just Sample from the Full Distribution?

Sampling from all 50,000+ words means the 0.0001% words (completely random, bizarre words) occasionally get picked. Nucleus sampling cuts off this "unreliable tail" — the long list of words the model assigns near-zero probability to. The tail is where incoherence lives.

The red region is the unreliable tail. Nucleus sampling never picks from there. Pure sampling occasionally does โ€” that's where "banana danced kaleidoscope" comes from.

Key advantage: Nucleus sampling always maintains a coherent range of choices — not too few (no creativity), not too many (no incoherence). The nucleus automatically adjusts at every single word, informed by the model's own confidence.

Let's put all the settings side by side and see the full picture: Compare All →
All 6 Methods — Side by Side
Same model. Same seed. Six different decoding settings. Here's what comes out — and why it matters for every AI system you use today.
Quality vs. Diversity — The Fundamental Trade-off

Every decoding strategy lives somewhere on this chart. The goal is the top-right corner — high quality AND high diversity. Click any point to highlight it.

Same Prompt — 6 Different Outputs

Prompt: "Once upon a time, there was a curious robot who"

① Greedy Decoding Degenerate
"...walked to the door . The door was very big . The door was very big . The door was very big . The door was very big . The door was very big ."
② Beam Search (W=5) Generic
"...was the most curious robot in the world and it was very happy and it was the best robot in the world and it was always very very happy and it was ..."
③ Temperature T=0.3 Stiff
"...walked through the city every day. The city was very big and very busy. The robot worked in a small shop. The shop was very old and very quiet."
③ Temperature T=1.8 Incoherent
"...banana elephant spinning galaxy beneath seventeen purple mathematics answered crackle hummingbird forgotten yesterday telescope river moonbeam seventeen ."
④ Top-K Sampling (K=5) Okay
"...walked through the old city streets and found the door to a very dark room. The room was old and the door was heavy. The robot looked around slowly."
⑤ Nucleus Sampling (p=0.9) Best
"...had never seen a sunrise. One morning it climbed to the highest rooftop in the city, stretched its metal arms wide, and watched the sky shift from dark purple to gold. For the first time, it understood why humans called beautiful things priceless."
Summary: Strengths, Weaknesses, When to Use
Method · Diversity · Coherence · Best for · Avoid when
① Greedy · None · High (short) · Short answers, fact lookup · Any creative writing
② Beam Search · Very low · High (short) · Translation, summarization · Open-ended generation
③ Temperature · Tunable · Tunable · When you know the right T · Long text (T drifts wrong)
④ Top-K · Medium · Medium · Moderate creative text · Highly variable contexts
⑤ Nucleus (Top-p) · High, adaptive · High · Most creative generation · p set too low (p < 0.5 behaves almost like greedy)
These Settings Are in Every AI System You Use
💬
ChatGPT / Claude
Typically described as using nucleus sampling (p≈0.9) plus a mild temperature for chat; structured outputs such as code usually run with near-deterministic, low-temperature settings.
🎵
AI Music / Image Prompts
High temperature (T≈1.2) for creative variety. Lower temperature when following strict style guidelines.
🔍
Search / Translation
Beam search dominates here — you want the single best translation or most relevant answer, not creative variation.
Open Questions (Still Active Research in 2024)
Can we auto-tune p and T without user input? ▾
Modern systems like Mirostat and DynaTemp try to adapt temperature during generation to keep the "surprise level" of the text constant. Instead of a fixed T, they monitor how surprised the model is at each word and adjust T in real time.
Does nucleus sampling work for code and math? ▾
Less well. Code and math have many "wrong" answers and few "right" ones — the distribution is often very peaked. Nucleus sampling can still introduce errors by picking from the tail. Greedy or beam search with verification tends to work better for structured outputs.
What about combining methods? ▾
Yes — most production systems combine techniques. A common recipe: Top-K + Nucleus + Temperature together. You first apply Top-K (remove obvious garbage), then Nucleus (remove the long tail), then Temperature (fine-tune the balance). Most production APIs expose all three knobs side by side.
Is there a theoretically optimal decoding strategy? ▾
Not yet proven. The 2019 paper showed nucleus sampling is empirically the best, but there's no mathematical proof it's optimal. Research continues on whether there's a fundamentally better approach for matching the distribution of human text.
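The Top-K + Nucleus + Temperature recipe mentioned above can be sketched as follows. The ordering and the values of k, p, and T here are illustrative assumptions; real implementations differ in details:

```python
import math
import random

def softmax(zs):
    """Logits -> probabilities (max-shifted for numerical stability)."""
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    s = sum(es)
    return [e / s for e in es]

def combined_sample(logits, k, p, T, rng=random):
    """Illustrative recipe: Top-K filter, then nucleus filter, then
    temperature-reshaped sampling. `logits` maps word -> raw score."""
    words = sorted(logits, key=logits.get, reverse=True)
    probs = dict(zip(words, softmax([logits[w] for w in words])))
    # 1. Top-K: drop everything outside the k most likely words
    survivors = words[:k]
    # 2. Nucleus: within those, keep the smallest set reaching p
    kept, running = [], 0.0
    for w in survivors:
        kept.append(w)
        running += probs[w]
        if running >= p:
            break
    # 3. Temperature: reshape the survivors' logits, then sample
    final = softmax([logits[w] / T for w in kept])
    return rng.choices(kept, weights=final)[0]

# Invented logits for "The capital of France is ___"
logits = {"Paris": 4.2, "located": 2.1, "a": 1.0,
          "beautiful": 0.4, "cheese": -1.4}
print(combined_sample(logits, k=3, p=0.95, T=0.8))
```

Each stage only ever removes or reweights candidates, so the filters compose cleanly; "cheese" and "beautiful" can never be drawn with these settings.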

The big takeaway: The AI model is not what makes text feel human or robotic. The same model, with the same weights, produces wildly different text depending on the decoding strategy. Nucleus sampling is currently the closest we have to replicating the natural unpredictability of human writing — while keeping it coherent.