Speculative Decoding
Visual Summary — Post 38

Speculative Decoding

Generate several tokens in one target-model call instead of one — by having a small, fast drafter propose candidates and the large model verify them all in parallel. Identical outputs, 2–3× faster.

2–3×
Speedup on T5-XXL
2.5×
Speedup on Chinchilla 70B
10×
Speedup with memory offload
0
Output quality loss
2022
Leviathan et al. paper
The key insight: In autoregressive decoding, generating K tokens requires K serial forward passes through the large model — but the real bottleneck is memory bandwidth, not computation. Loading the model's weights once and running a forward pass over a batch of K tokens takes nearly the same wall-clock time as a single-token forward pass. Speculative decoding exploits this: a cheap drafter proposes K tokens, the large model validates all of them in one pass, and any accepted tokens are kept for free.
The Problem
Autoregressive LLMs generate one token per forward pass. Producing 100 tokens requires 100 passes through the full model — sequential, slow, and memory-bandwidth-bound. There's no way to parallelize without changing the output distribution.
The Solution
Use a small draft model to speculatively generate K candidate tokens. The large target model then validates all K candidates in a single batched forward pass — accepting correct tokens and resampling the first wrong one.
The Guarantee
The acceptance–rejection sampling scheme is lossless: the distribution of final outputs is mathematically identical to sampling directly from the target model, token by token. No quality tradeoff.
Why "Speculative"?
The name borrows from speculative execution in CPUs: predict the future ahead of time and do the work early. If the prediction was right, you've saved real time. If wrong, you discard and recover. The draft model is the CPU's branch predictor — fast and usually right.
Leviathan et al. (2022) note that "hard language-modeling tasks often include easier subtasks that can be approximated well by more efficient models." The draft model handles the easy parts; the target model handles the hard ones — but both work together in one joint forward pass.

Why Is Decoding So Slow?

The bottleneck is not arithmetic — it's memory bandwidth. Understanding this is why speculative decoding works at all.

The Memory Bandwidth Bottleneck
During a forward pass, the GPU must load every layer's weight matrix from high-bandwidth memory (HBM) into the processor's compute cores. For a 70B-parameter model at FP16, that's ~140 GB of data per forward pass.

The actual matrix multiplications — the "computation" — are trivially fast. The bottleneck is the memory read. This means generating 1 token or 8 tokens in a batch takes nearly identical wall-clock time.
The Free Lunch
If you pass a sequence of K tokens into the model instead of 1, the model's self-attention naturally produces logits for all K positions in that single forward pass — at essentially no extra latency cost.

This is the core free lunch speculative decoding exploits: the target model can verify K draft tokens in one pass almost as fast as it generates 1 token. If even 3–4 of those K tokens are accepted, you've produced 3–4 tokens for the price of a single target pass.

Memory access vs. compute cost during inference (illustrative). The flat "load weights" bar dominates — adding tokens in a batch barely changes total time.

Sequential Decoding
O(n) forward passes for n tokens.
Each pass: load full model weights, compute 1 token. Memory bandwidth cost paid n times.
Speculative Decoding
O(n/k) target-model passes.
k ≈ average tokens accepted per call. Memory bandwidth cost paid n/k times.
Perfect Draft Model
O(1) theoretical complexity.
If the drafter is always right, one target pass generates all remaining tokens.
Arithmetic intensity mismatch: Modern GPUs are compute-bound when doing large batch matrix multiplications (e.g., training). But autoregressive inference with batch size 1 is memory-bound — the GPU is mostly waiting for data, not doing math. Speculative decoding makes the memory access worth more by amortizing it over multiple tokens.
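The bandwidth argument above can be sanity-checked with a back-of-envelope calculation. The bandwidth figure below is an assumption for illustration (roughly in the range of modern HBM), not a measured value:

```python
# Back-of-envelope: weight streaming dominates single-stream decoding.
# Assumed: 70B parameters at FP16, ~2 TB/s HBM bandwidth (illustrative figure).
params = 70e9
bytes_per_param = 2                      # FP16
hbm_bandwidth = 2e12                     # bytes/s (assumed)

weight_bytes = params * bytes_per_param  # bytes streamed per forward pass
load_floor_ms = weight_bytes / hbm_bandwidth * 1e3

print(f"weights per pass: {weight_bytes / 1e9:.0f} GB")   # 140 GB
print(f"weight-load floor: {load_floor_ms:.0f} ms")       # 70 ms
# This floor is paid once per pass whether the pass scores 1 token or K,
# which is why verifying K draft tokens is nearly free.
```

At these assumed numbers, the weight load alone sets a ~70 ms floor per pass; scoring a few extra token positions changes the total only marginally.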

The Algorithm

Two models, two phases, one loop. The draft model proposes, the target model judges.

Step 1
Draft Phase — Small Model Proposes K Tokens
The small drafter model autoregressively generates K candidate tokens from the current context. It runs K serial forward passes — but these are fast because the model is tiny (e.g., 10–100× smaller than the target). The drafter stores the probability distribution p(xᵢ) it assigned to each draft token.
Step 2
Verify Phase — Target Model Scores All K Tokens At Once
The target model receives the current context plus all K draft tokens in a single batched forward pass. It outputs logits — and thus probability distributions q(xᵢ) — for each of the K positions simultaneously. This one pass replaces what would otherwise be K sequential target-model calls.
Step 3
Accept / Reject — Modified Rejection Sampling
Scan draft tokens left to right. For each token xᵢ: accept with probability min(1, q(xᵢ)/p(xᵢ)). If accepted, keep it and move to xᵢ₊₁. If rejected, resample one token from the corrected distribution normalize(max(0, q−p)) and discard all tokens after it. Always append one final token sampled from the target model's distribution.
Step 4
Repeat — With Adaptive K
Continue from the new context. Implementations typically adapt K dynamically: if all K tokens were accepted, increase K (the drafter is doing well); if most were rejected, decrease K (the drafter is unreliable for this context). Typical range: 3–12 draft tokens per iteration.
Why one extra token? After accepting or rejecting the K draft tokens, the algorithm always generates exactly one more token directly from the target model's output distribution at the last verified position. This ensures the algorithm always makes progress (at least 1 new token per target-model call) and maintains the correct output distribution.
Greedy Decoding Case
With greedy decoding (argmax), acceptance is deterministic: a draft token is accepted if and only if it equals the token the target model would have chosen (argmax of q). No randomness needed — just compare the two argmax tokens at each position.
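In the greedy case, verification reduces to a prefix match against the target's argmax at each position. A minimal sketch with toy logits (the function name is illustrative, not from the paper):

```python
import numpy as np

def greedy_accept_count(target_logits, draft_tokens):
    """Count the leading draft tokens that match the target model's argmax.
    target_logits: (K, V) array of target logits at each draft position."""
    n = 0
    for i, x in enumerate(draft_tokens):
        if int(np.argmax(target_logits[i])) == x:
            n += 1          # exact match: keep and continue
        else:
            break           # first mismatch: discard this token and the rest
    return n

# Toy example: target agrees at positions 0 and 1, disagrees at position 2.
logits = np.array([[0.1, 2.0, 0.3],
                   [1.5, 0.2, 0.1],
                   [0.0, 0.1, 3.0]])
print(greedy_accept_count(logits, [1, 0, 0]))   # → 2
```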
Stochastic Sampling Case
With temperature sampling, the acceptance probability is min(1, q(xᵢ)/p(xᵢ)) — the ratio of target to draft probability at the actual draft token. If the target agrees or would have been even more likely to produce xᵢ, accept. Otherwise accept probabilistically.
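The stochastic accept/reject loop from Step 3 can be sketched end-to-end on toy distributions. This is a NumPy sketch with an illustrative function name, not the paper's reference implementation:

```python
import numpy as np

def speculative_step(draft_probs, target_probs, draft_tokens, rng):
    """One draft-then-verify iteration. draft_probs: (K, V); target_probs:
    (K+1, V); draft_tokens: the K tokens the drafter sampled.
    Returns the tokens committed this step (always at least one)."""
    out = []
    for i, x in enumerate(draft_tokens):
        # Accept with probability min(1, q(x)/p(x))
        if rng.random() < min(1.0, target_probs[i][x] / draft_probs[i][x]):
            out.append(int(x))
        else:
            # Reject: resample from normalize(max(0, q - p)), discard the rest
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            out.append(int(rng.choice(len(residual), p=residual / residual.sum())))
            return out
    # All K accepted: append the bonus token from the target's next distribution
    out.append(int(rng.choice(target_probs.shape[1], p=target_probs[-1])))
    return out

rng = np.random.default_rng(0)
q = np.full((4, 4), 0.25)        # target distributions for K+1 = 4 positions
p = np.full((3, 4), 0.25)        # drafter happens to match the target exactly
print(speculative_step(p, q, [0, 1, 2], rng))   # all 3 accepted + 1 bonus token
```

Because the toy drafter matches the target exactly, every acceptance ratio is 1 and all three draft tokens are kept, plus the bonus token.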

Acceptance Sampling — The Math

The acceptance rule is designed to preserve the target distribution exactly. Here's how it works.

// Draft probability of token x at position i
p(x) = drafter's probability for the proposed token

// Target probability of token x at position i
q(x) = target model's probability for the same token

// Acceptance probability
α(x) = min( 1, q(x) / p(x) )

// If rejected: resample from corrected distribution
p'(x) = normalize( max(0, q(x) − p(x)) )
Case 1: q(x) ≥ p(x)
The target model is at least as likely to produce this token as the drafter. Accept always (α = 1). The drafter was conservative — the target would have been happy with this choice or even more enthusiastic about it.
Case 2: q(x) < p(x)
The target model is less likely than the drafter to produce this token. Accept with probability q(x)/p(x) < 1. The drafter was overconfident. Partial acceptance keeps the expected distribution correct.
Acceptance Calculator — Worked Example

With draft p(x) = 0.40 and target q(x) = 0.60: the q/p ratio is 1.50, so the acceptance probability is α = min(1, 1.50) = 1.00 — always accept.
Why does this preserve the target distribution?
Proof sketch: Let x be any token. Its probability in the output is:

P(output = x) = P(draft proposed x) × P(accept x) + P(reject at this position) × P(resample = x)

= p(x) · min(1, q(x)/p(x)) + [1 − Σ_y p(y)·min(1, q(y)/p(y))] · p'(x)

After algebra, this simplifies to exactly q(x). The corrected distribution p'(x) = normalize(max(0, q−p)) is precisely designed so this identity holds. The method is provably lossless.
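The identity can be checked numerically on a toy vocabulary: compute p(x)·α(x) plus the rejection mass routed through p′(x), and compare against q. The distributions below are arbitrary illustrative values:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])    # drafter distribution (toy)
q = np.array([0.2, 0.3, 0.5])    # target distribution (toy)

alpha = np.minimum(1.0, q / p)             # per-token acceptance probability
residual = np.maximum(q - p, 0.0)
p_prime = residual / residual.sum()        # corrected resampling distribution

reject_mass = 1.0 - np.sum(p * alpha)      # probability a rejection occurs
output_dist = p * alpha + reject_mass * p_prime

print(output_dist)                          # [0.2 0.3 0.5] — exactly q
assert np.allclose(output_dist, q)
```

Here p·α = [0.2, 0.3, 0.2], the rejection mass is 0.3, and p′ = [0, 0, 1], so the rejected mass lands exactly on the token the drafter under-weighted.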
What happens at the boundary — when q and p are very different?
If the drafter and target are very different, almost every draft token gets rejected, and the algorithm degrades gracefully to standard decoding: in the worst case the first draft token is rejected on every iteration, one token is resampled from the corrected distribution p'(x), and the next iteration starts fresh. The output is still distributed exactly as q — but no speedup is gained, and the drafting overhead makes it slightly slower than plain decoding. The algorithm is safe even when the drafter is terrible.
Does it work for temperature sampling and top-p/top-k?
Yes. The acceptance probability formula works for any distribution q(x) — including those produced by temperature scaling, top-p (nucleus) sampling, or top-k. The key is that both p(x) and q(x) are defined over the same vocabulary at each position. As long as the drafter uses the same tokenizer, any sampling strategy on the target model is compatible. However, high-temperature sampling (flat distributions) reduces acceptance rates because the target has lower certainty, making the draft less likely to match.

Expected Speedup Analysis

How many tokens does speculative decoding generate per target-model call? The answer depends on the acceptance rate and K.

// Let α = per-token acceptance rate (average), K = draft tokens

Expected tokens per target call:
E[tokens] = (1 − αᴷ⁺¹) / (1 − α)

// Example: α = 0.8, K = 5
E[tokens] = (1 − 0.8⁶) / (1 − 0.8) = (1 − 0.262) / 0.2 ≈ 3.69 tokens per call

// Effective speedup (ignoring draft model cost):
Speedup ≈ E[tokens] = 3.69× (if draft model is free)
Important caveat: The draft model is not free — it adds latency too. The real speedup is E[tokens] × (target latency) / (K × draft latency + target latency). Since the drafter is much smaller, its K passes are still fast. In practice: drafter is 10–100× faster per token, so the overhead is small relative to the savings.
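Both formulas fit in a few lines. The 20× figure for drafter speed below is an assumption for illustration:

```python
def expected_tokens(alpha, k):
    """E[tokens per target call] = (1 - alpha^(K+1)) / (1 - alpha)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def net_speedup(alpha, k, draft_speed_ratio):
    """Wall-clock speedup with the drafter's cost included.
    draft_speed_ratio: how many times faster one draft pass is (assumed 20x)."""
    draft_cost = k / draft_speed_ratio   # K draft passes, in units of target passes
    return expected_tokens(alpha, k) / (1.0 + draft_cost)

print(round(expected_tokens(0.8, 5), 2))      # 3.69, matching the example above
print(round(net_speedup(0.8, 5, 20), 2))      # 2.95 after drafter overhead
```

With α = 0.8 and K = 5, a free drafter would give 3.69×; charging for five draft passes at 1/20 the target's cost brings the net speedup to about 2.95×.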
Speedup Calculator — Worked Example

With acceptance rate α = 0.75, K = 5 draft tokens, and a drafter 20× faster than the target: ≈ 3.36 expected tokens per call, ≈ 2.4× net speedup, ≈ 12% draft overhead.
Best Case: Code Generation
Code has high predictability — keywords, variable names, boilerplate. Acceptance rates reach 80–90%. With K=7, E[tokens] ≈ 5–6. Speculative decoding shines here.
Average Case: Summarization
Input-grounded tasks give the drafter strong context clues. Typical acceptance rate 60–80%. E[tokens] ≈ 3–4. Still a major win.
Hard Case: High-Temperature Sampling
Creative generation at high temperature randomizes outputs. Acceptance rates drop to 40–50%. Speedup shrinks to 1.5–2×. The method still works, but the benefit is reduced.

Token Generation Simulator

Watch speculative decoding in action. The drafter proposes tokens; the target model accepts or rejects each one.

[Interactive simulator — legend: context token · draft (proposed) · accepted · rejected · ~ resampled; live counters: steps so far, total tokens generated, avg tokens per step.]

Variants & Extensions

Speculative decoding spawned a family of techniques. Each variant addresses a different limitation or extends the core idea.

Variant Comparison
Variant | Drafter Type | Key Innovation | Speedup | Quality Loss
Leviathan et al. (2022) | Separate smaller model | Original draft-then-verify framework | 2–3× | None
Speculative Sampling (Chen et al.) | Separate smaller model | Modified rejection sampling proof for stochastic decoding | 2–2.5× | None
Medusa | Heads on target model | Lightweight prediction heads; no separate model needed | 2.2–3.6× | None (with tree decoding)
EAGLE | Autoregressor on features | Drafts in the internal feature space, not token space | 2.7–6.5× | None
SpecInfer | Multiple draft models | Token trees: multiple draft paths verified in one pass | 1.5–3.5× | None
Common thread: Every variant preserves the core invariant — the output distribution matches the target model's distribution. The differences are in how draft candidates are generated (separate model, attached heads, feature-level autoregression, or token trees) and how those candidates are verified (single sequence vs. tree-structured batch).

Real-World Results

Benchmarks from Leviathan et al. (2022), Chen et al. (2023), and HuggingFace production deployments.

Speedup vs Standard Decoding (relative wall-clock time)
T5-XXL (11B) — standard: 1.0×
T5-XXL — Speculative (2×): 2.0×
T5-XXL — Speculative (3×): 3.0×
Chinchilla 70B — standard: 1.0×
Chinchilla 70B — Spec. Sampling: 2.5×
Any model — INT8 + offload: 10×
When is speculative decoding most effective?
Best results on input-grounded tasks: summarization, translation, code completion from a stub, and automatic speech recognition transcription. These tasks have high token predictability from context — giving the drafter a strong signal and raising the acceptance rate above 75%. Speedup scales with the quality of the draft model relative to the task.
Does it work at batch size > 1?
Classic speculative decoding is designed for batch size 1, where memory bandwidth is the primary bottleneck. At larger batch sizes, the GPU becomes compute-bound (matrix multiplications are the bottleneck, not memory), and adding K draft tokens adds real computation cost. At batch sizes ≥ 8, the benefit diminishes significantly. Extensions like SpecInfer and batched speculative decoding address this limitation for throughput-focused deployments.
Requirements for a good draft model
Same tokenizer: The drafter must use the exact same tokenizer as the target model. Re-encoding tokens between models would be too slow and introduce alignment errors.

Size gap: The drafter should be at least 10× smaller than the target. Otherwise the draft passes eat too much of the time saved from fewer target passes.

Distribution overlap: The drafter should assign high probability to the same tokens as the target. A drafter fine-tuned on similar data, or a smaller model from the same family, works best.
Production adoption and ecosystem support
Speculative decoding is now available out-of-the-box in HuggingFace Transformers via model.generate(assistant_model=...). It's also supported in vLLM, TensorRT-LLM, and llama.cpp. Major deployment frameworks have adopted it as a first-class latency optimization for single-user, real-time inference scenarios.

Paper Sources

Primary references used in this visual summary.

Primary Paper
Leviathan, Y., Kalman, M., & Matias, Y. (2022)
Fast Inference from Transformers via Speculative Decoding
Google Research. arXiv:2211.17192 — ICML 2023

The founding paper. Introduces the draft-then-verify framework, proves distribution-preserving acceptance sampling, and demonstrates 2–3× speedup on T5-XXL without any model changes or quality loss.

Concurrent Work — Speculative Sampling
Chen, C., Borgeaud, S., Irving, G., Lespiau, J-B., Sifre, L., & Jumper, J. (2023)
Accelerating Large Language Model Decoding with Speculative Sampling
DeepMind. arXiv:2302.01318

Independently develops the same idea for stochastic sampling. Proves distribution preservation for general sampling with the modified rejection scheme. Demonstrates 2–2.5× speedup on Chinchilla 70B.

Additional Resources
HuggingFace Blog (2023) — Assisted Generation: a new direction toward low-latency text generation
Cai et al. (2024) — Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. arXiv:2401.10774
Li et al. (2024) — EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. arXiv:2401.15077
Miao et al. (2023) — SpecInfer: Accelerating Generative LLM Serving with Tree-based Speculative Inference. arXiv:2305.09781