Visual Summary
Tokenizers — The Hidden Step That Shapes Every LLM
The Hidden Step
Every time you type a message to ChatGPT, Claude, or Gemini, your text is silently broken into fragments before the model ever sees it. This invisible step — tokenization — shapes model cost, language support, and what an LLM can reason about. Yet it's almost never discussed.
TL;DR — The Survey in One Paragraph

Mielke et al. (arXiv 2112.10508, 2021) survey how text can be represented at every granularity — from raw bytes to whole words — and trace the evolution of open-vocabulary modeling. The core finding: there is no silver bullet. Every tokenization strategy makes trade-offs between vocabulary size, sequence length, unknown-word handling, and multilingual fairness. The subword goldilocks zone (BPE, WordPiece, Unigram) currently dominates production LLMs — but Meta's 2024 Byte Latent Transformer suggests the entire paradigm may be about to change. Understanding tokenization is the prerequisite to understanding why models are biased, expensive, and sometimes wrong.

50,257
GPT-2 vocabulary size
30,522
BERT vocabulary size
2.36×
Korean token cost vs English
~4 bytes
avg bytes per subword token
Live Tokenizer — Pick an Algorithm

See how the same sentence gets split differently depending on the tokenizer. Each colored chip is one token.

Why Does Tokenization Matter?
💰
Cost
LLM APIs charge per token. More tokens = higher bill. Non-English text can cost 2–3× more for the same semantic content.
📐
Context Window
All models have a token limit. Inefficient tokenization means less text fits in the window. Poor tokenizers waste your context budget.
⚖️
Fairness
English gets efficient encoding. Korean, Arabic, and Hindi speakers pay a "token tax" — higher costs and worse model accuracy.
Guess the Token Count — Can You Beat the Model?

Before reading further, guess how many tokens GPT-4 uses for each sentence. Drag the slider, then click Reveal. Most people guess too low for non-English text.

The Kitchen Prep Analogy

"Tokenization is like a chef who pre-chops every ingredient before cooking. The way they chop determines the recipe's speed, cost, and flavour — but nobody reads the prep-cook's instructions. They just taste the final dish. Understanding the chopping reveals why some cuisines (languages) are cheaper and faster to prepare than others."

Every tokenizer makes a choice. What are the options? The Spectrum →
The Spectrum
Text can be broken at any granularity: whole words, individual characters, raw bytes, or anything in between. Each extreme has a fatal flaw. Subword tokenization occupies the sweet spot — and it's why GPT-2, BERT, T5, LLaMA, and Claude all use it.
Drag the Slider — See How "unhappiness" Gets Split

Move from word-level (left) to character-level (right). Watch the token count, vocabulary size, and OOV risk change.

The Fatal Flaws at Each Extreme
❌ Word-Level
Vocabulary: 30K–100K words. Fatal flaw: OOV (Out-of-Vocabulary). Any unseen word — "grokked", "ChatGPT", "COVID-19" — collapses to a single [UNK] token, erasing all meaning.
★ Subword (Goldilocks)
Vocabulary: 30K–50K. Best of both worlds: no unknowns, reasonable sequence length, handles morphology. BPE splits "unhappiness" → ["un","happi","ness"]. Unknown words break into known parts.
un happi ness
★ Used by GPT-2, BERT, T5, LLaMA, Claude
❌ Character-Level
Vocabulary: ~256 chars. Fatal flaw: sequence explosion. A 10-word sentence → 50+ tokens, exhausting context windows and crushing performance on long-range tasks.
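The trade-off is easy to see in code. Here is a rough sketch comparing the three granularities on one sentence; the "subword" splitter is a hand-made toy with a made-up affix list, not a trained model:

```python
# Sketch: token counts for the same sentence at three granularities.
sentence = "unhappiness is unavoidable"

word_tokens = sentence.split()                 # word-level
char_tokens = list(sentence.replace(" ", ""))  # character-level

# Toy subword split using a hand-picked affix list (illustrative only)
AFFIXES = ["un", "ness", "able", "avoid", "happi", "is"]

def toy_subword(word):
    parts, rest = [], word
    while rest:
        for a in sorted(AFFIXES, key=len, reverse=True):
            if rest.startswith(a):
                parts.append(a)
                rest = rest[len(a):]
                break
        else:
            parts.append(rest[0])  # fall back to single characters
            rest = rest[1:]
    return parts

subword_tokens = [t for w in word_tokens for t in toy_subword(w)]

print(len(word_tokens), word_tokens)        # 3 tokens
print(len(subword_tokens), subword_tokens)  # 7 tokens
print(len(char_tokens))                     # 24 tokens
```

Word-level is shortest but brittle (one unseen word and it breaks); character-level triples the length; subword sits in between while reusing known parts.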
Which Model Uses Which Tokenizer?
Model | Tokenizer | Vocab Size | Released
GPT-2 / GPT-3 | Byte-level BPE | 50,257 | 2019 / 2020
BERT | WordPiece | 30,522 | 2018
T5 | Unigram (SentencePiece) | 32,000 | 2019
RoBERTa | Byte-level BPE | 50,265 | 2019
LLaMA 3 | Byte-level BPE (tiktoken) | 128,000 | 2024
ALBERT | Unigram (SentencePiece) | 30,000 | 2019
Same Sentence — 5 Models — Click a Bar to Compare

How many tokens does each model need for the same sentence? More tokens = more cost + less context window. Click any model bar to see its split.

Type Your Own Text — See It Tokenized Live

Type anything — your name, a sentence in your language, a piece of code. See simulated BPE tokenization appear in real time.

Tip: try your name, a Korean/Arabic sentence, or emoji — and watch the token count change.
Subword wins. Which subword algorithm — and how does it work? Byte Pair Encoding →
Byte Pair Encoding
BPE was originally a data-compression algorithm from 1994. Philip Gage never imagined it would become the foundation of GPT-2, GPT-3, and most modern LLMs. The idea: repeatedly merge the most frequent pair of adjacent symbols until you hit your vocabulary budget.
Step-by-Step BPE Merge Animator

Watch BPE build its vocabulary from scratch. Each step finds the most frequent pair and merges it into a new token.

Step 0 of 8 — initial character vocabulary
The BPE Merge Rule
Algorithm BPE(corpus, target_vocab_size):
    vocab ← all unique characters in corpus
    while len(vocab) < target_vocab_size:
        pairs ← count_adjacent_pairs(corpus)
        best ← argmax(pairs)            // most frequent pair
        vocab ← vocab + [best]          // add merged token
        corpus ← replace(corpus, best)  // update corpus
    return vocab

The key insight: merging the most frequent pair gives the maximum compression gain per step. Run 50,000 merges and you get GPT-2's vocabulary.
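The merge loop above can be sketched as runnable Python. This is a teaching-scale trainer run on the classic low/lower/newest/widest toy corpus, not a production tokenizer:

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Minimal BPE trainer: words maps word -> frequency; returns the
    learned merge list and the final symbol corpus."""
    # Represent each word as a tuple of symbols (characters to start)
    corpus = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for symbols, freq in corpus.items():  # apply the merge everywhere
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        corpus = merged
    return merges, corpus

merges, corpus = bpe_train({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 4)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
```

Note how the first two merges build "est" (shared by "newest" and "widest") before any whole word: frequent fragments win slots first.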

What Is the Vocabulary Budget? — Drag the Slider

The vocabulary budget is the maximum number of tokens a model is allowed to know. Think of it like a fixed number of "slots" on a shelf — you decide upfront how many slots exist, then BPE fills them one merge at a time. Drag the slider to see the trade-offs at different budget sizes.

Budget size: 32K
How the budget shapes the vocabulary
Too small (≤ 8K)
BPE runs out of merges quickly. Common words like "running" get split into parts. More tokens per sentence → fills context window faster.
Sweet spot (32K–50K)
Enough merges to capture most common English words + frequent suffixes/prefixes. BERT: 30,522. GPT-2: 50,257. The industry standard.
Very large (128K+)
LLaMA 3 uses 128K. More budget = whole words in more languages fit as single tokens. Better multilingual fairness but bigger embedding matrix.
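The "bigger embedding matrix" cost is easy to quantify: the embedding table alone holds vocab_size × d_model parameters. A quick back-of-envelope, where d_model = 4096 is an assumed hidden size chosen for illustration:

```python
# Embedding-table parameter count at different vocabulary budgets.
# d_model = 4096 is an assumption for illustration, not any model's spec.
d_model = 4096
for vocab in (8_000, 32_000, 50_257, 128_000):
    params = vocab * d_model
    print(f"{vocab:>7,} tokens -> {params / 1e6:7.1f}M embedding parameters")
```

Going from a 50K to a 128K vocabulary adds roughly 300M embedding parameters at this width, which is the price of the fairness gain.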
BPE Tokenization — Click a Word

See how a trained BPE model splits common words into subword tokens. The split reflects which merge sequences were most frequent during training.

What Makes Byte-Level BPE Special (GPT-2)
❌ Regular BPE
Starts from characters. Fails on emoji, accented characters, rare Unicode. The word "café" might produce [UNK] for the "é".
✓ Byte-level BPE (GPT-2)
Starts from 256 raw bytes. Any UTF-8 text — emoji, Arabic, Chinese, code — is always encodable. Zero unknown tokens. Ever.
GPT-2 vocabulary breakdown:
  256 base bytes ← covers all possible bytes
+ 50,000 BPE merges ← learned from web text
+ 1 special token ← <|endoftext|>
= 50,257 total tokens
GPT-2 GPT-3 GPT-4 RoBERTa BART LLaMA 3
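The "zero unknown tokens" guarantee follows from UTF-8 itself: every string decomposes into byte values 0–255, all of which sit in the base vocabulary. A quick check:

```python
# Why byte-level BPE never needs [UNK]: every string is a sequence of
# byte values 0-255, and all 256 are in the base vocabulary.
for text in ["cafe", "café", "🚀", "안녕"]:
    raw = text.encode("utf-8")
    print(f"{text!r}: {len(text)} chars -> {len(raw)} bytes {list(raw)}")
# "café" is 4 chars but 5 bytes: the é encodes as the pair [195, 169]
```

Whether those bytes later merge into nice subwords depends on training data, but encodability is guaranteed up front.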
BPE merges by raw frequency. WordPiece asks: is this pair surprising? WordPiece & Unigram →
WordPiece & Unigram
BERT uses WordPiece; T5 uses Unigram. Both address a weakness in BPE: raw frequency doesn't always produce the most meaningful units. WordPiece merges the pair with the highest mutual information (pairs that co-occur more often than chance predicts). Unigram goes in reverse, starting large and pruning down.
BPE vs WordPiece — The Core Difference
The ## Prefix — BERT's Word-Boundary Signal

WordPiece uses ## to mark continuation tokens — subwords that are not the start of a word. This lets the model know where word boundaries lie, even after tokenization.
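The greedy longest-match-first procedure BERT's tokenizer uses at inference time can be sketched directly; the tiny vocabulary here is made up for illustration:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first WordPiece segmentation.
    Continuation pieces (not at the start of a word) carry the ## prefix."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:            # try the longest substring first
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub      # mark non-initial pieces
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]          # nothing matches at this position
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary, invented for illustration
VOCAB = {"un", "##happi", "##ness", "play", "##ing", "[UNK]"}
print(wordpiece_tokenize("unhappiness", VOCAB))  # ['un', '##happi', '##ness']
print(wordpiece_tokenize("playing", VOCAB))      # ['play', '##ing']
```

Because only continuation pieces carry ##, detokenization is trivial: join everything and delete the markers.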

Unigram — Top-Down Pruning with Viterbi

Unigram flips the script: start with a huge vocabulary (~50K) and prune the least important tokens until the target size is reached. During inference, it finds the most probable segmentation using the Viterbi algorithm.
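The Viterbi step is a short dynamic program: score every way to cover the string with known pieces and keep the best. The piece log-probabilities below are invented for illustration:

```python
import math

def viterbi_segment(text, logprob):
    """Most probable segmentation under a unigram token model.
    best[i] holds the best score for text[:i]; back[i] the split point."""
    n = len(text)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logprob and best[start] + logprob[piece] > best[end]:
                best[end] = best[start] + logprob[piece]
                back[end] = start
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

# Toy unigram model: log-probabilities of candidate pieces (made up)
LP = {"un": -2.0, "happi": -3.0, "ness": -2.5, "happiness": -9.0,
      "u": -5.0, "n": -4.0, "h": -5.0, "a": -5.0, "p": -5.0,
      "i": -5.0, "e": -5.0, "s": -5.0}
print(viterbi_segment("unhappiness", LP))  # ['un', 'happi', 'ness']
```

Here "un" + "happi" + "ness" (score −7.5) beats "un" + "happiness" (−11.0) and any character fallback, so Viterbi picks the three-piece split.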

Algorithm Comparison
Property | BPE | WordPiece | Unigram
Direction | Bottom-up merge | Bottom-up merge | Top-down prune
Selection | Max frequency | Max mutual information | Min perplexity loss
Segmentation | Deterministic | Deterministic | Probabilistic (Viterbi)
Continuation marker | None | ## prefix | None
Used by | GPT-2, RoBERTa, BART | BERT, DistilBERT, ELECTRA | T5, ALBERT, XLNet
BPE and WordPiece assume spaces as word boundaries. SentencePiece doesn't. The Token Tax →
The Token Tax
Token fertility — tokens per word — reliably predicts model accuracy across languages. English gets 1.0×. Korean pays 2.36×. Russian and Hebrew pay 3×. This isn't a bug; it's baked into every tokenizer trained predominantly on English text.
1.0×
English fertility (baseline)
2.36×
Korean token fertility
3×
Russian / Hebrew fertility
2–3×
Economic cost multiplier
The Price Tag — Same Message, Different Cost

The same semantic content costs vastly different amounts depending on language. Click any price tag to see the breakdown.

Token Fertility by Language — Click a Region

Fertility rate = average tokens needed per word, relative to English. Higher fertility = more expensive inference and lower model accuracy.

Token Tax Calculator

Estimate the token overhead for the same content in different languages. API cost at $0.01 per 1K tokens.

Base English text (estimated tokens):
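The calculator's arithmetic is simple enough to sketch in a few lines: fertility multiplies both the token count and the API bill. The fertility figures are the ones quoted above; the price matches the $0.01 per 1K tokens the calculator assumes:

```python
# Token-tax arithmetic: fertility scales tokens and therefore cost.
# Fertility figures are the ones quoted in this article; the price is
# the calculator's $0.01 / 1K-token assumption.
FERTILITY = {"English": 1.0, "Korean": 2.36, "Russian": 3.0, "Hebrew": 3.0}
PRICE_PER_1K = 0.01  # dollars per 1,000 tokens

def token_tax(base_english_tokens, language):
    tokens = base_english_tokens * FERTILITY[language]
    return tokens, tokens / 1000 * PRICE_PER_1K

for lang in FERTILITY:
    tokens, cost = token_tax(10_000, lang)
    print(f"{lang:>7}: {tokens:>8,.0f} tokens  ${cost:.2f}")
```

A message that costs $0.10 in English costs about $0.24 in Korean and $0.30 in Russian or Hebrew, before any accuracy gap is even considered.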
Why Does the Tax Exist?
Training Data Bias
Internet text used to train tokenizers is ~50% English. English subword units get most of the 50,000 merge budget. Other languages get the leftovers.
Byte-to-Character Ratio
Latin script ≈ 1 byte/char. Arabic, CJK characters ≈ 2–4 bytes/char. Byte-level BPE penalizes complex scripts at the very foundation.
The Merge Budget
50,000 BPE merges must cover all languages. English consumes the lion's share, leaving less representational capacity for every other language.
Research Finding

"Tokenization is a tax that low-resource languages cannot afford to pay, charged on every token, in every layer. Scale partially recovers the gap — meaning smaller models spend raw parameter budget reconstructing what should have been clean input from the start."

Fertility vs Model Accuracy — The Negative Correlation

Languages with higher token fertility consistently show lower downstream model accuracy. The correlation is strong and holds across model families.

The problem is real. SentencePiece was built specifically to solve it. SentencePiece — The Fix →
SentencePiece & Byte-BPE
The token tax is compounded by another assumption: every algorithm so far treats spaces as word boundaries. SentencePiece eliminates that assumption entirely, treating raw text as a plain character stream and encoding the space as just another symbol (▁). One change, universal coverage.
The Space Boundary Problem

Traditional tokenizers pre-tokenize on whitespace — but many languages have no whitespace. SentencePiece skips pre-tokenization entirely.

SentencePiece ▁ Demo — The Space Becomes a Token

Select a language to see how SentencePiece handles it. The ▁ symbol marks where a space existed in the original text — making decoding fully reversible.
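The ▁ mechanism can be sketched in a few lines. This shows only the space-marking step that makes decoding reversible; real SentencePiece also learns the subword splits:

```python
# Sketch of SentencePiece's reversible pre-processing: spaces become the
# visible marker ▁ (U+2581), so detokenization is exact string reversal.
MARKER = "\u2581"  # ▁

def sp_encode(text):
    return MARKER + text.replace(" ", MARKER)

def sp_decode(pieces):
    return "".join(pieces).replace(MARKER, " ").lstrip(" ")

encoded = sp_encode("Hello world")
print(encoded)  # ▁Hello▁world

# A possible learned segmentation; joining it recovers the input exactly
pieces = ["▁Hello", "▁wor", "ld"]
assert sp_decode(pieces) == "Hello world"
```

Because the space survives as a real symbol, no information is lost between tokenization and detokenization, which is exactly what whitespace pre-tokenizers give up.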

SentencePiece at a Glance
50K
sentences/sec
6MB
memory footprint
50+
languages, same pipeline
0
unknown tokens ever
Byte-Level BPE — "café" and 🚀 Without [UNK]
tiktoken — OpenAI's Production Tokenizer Speed

tiktoken (used in GPT-4) achieves 3–6× faster tokenization than comparable open-source implementations, using a Rust core under a Python interface.

Now you understand the problem. Here's the fix designed for it. The Future →
The Future of Tokenization
The 2021 survey concluded: "there is and likely will never be a silver bullet." Three years later, Meta's Byte Latent Transformer challenged that. The future might not be better tokenization — it might be no tokenization at all.
The Tokenization Timeline — Click a Milestone
Meta BLT — "No Tokenization" — How It Works

The Byte Latent Transformer (2024) operates directly on raw bytes. Instead of fixed tokens, it groups bytes into dynamic "patches" based on entropy — complex regions get more compute, simple regions get less.
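A toy sketch of entropy-driven patching: real BLT uses a small byte-level language model's next-byte entropy to place patch boundaries, while the stand-in "surprise" score below is a simple frequency heuristic invented purely for illustration:

```python
# Toy sketch in the spirit of BLT's dynamic patching: start a new patch
# where a byte looks "surprising". Real BLT uses a learned byte-LM's
# entropy; this frequency-based score is a stand-in for illustration.
import math
from collections import Counter

def patch_boundaries(data: bytes, threshold: float = 3.0):
    counts = Counter(data)
    total = len(data)
    patches, current = [], bytearray()
    for b in data:
        surprise = -math.log2(counts[b] / total)  # rare byte => high surprise
        if current and surprise > threshold:
            patches.append(bytes(current))        # close the current patch
            current = bytearray()
        current.append(b)
    if current:
        patches.append(bytes(current))
    return patches

text = "aaaaaaaaaaaaaaaaQz".encode()
print(patch_boundaries(text))  # [b'aaaaaaaaaaaaaaaa', b'Q', b'z']
```

The predictable run of "a" bytes collapses into one big cheap patch, while the surprising "Q" and "z" each get their own, which is the intuition behind spending compute only where the text is complex.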

50%
fewer FLOPs at inference vs LLaMA 3
Equal
quality to LLaMA 3 at same compute
Better
on non-English and rare scripts
What This Means for You
Choosing a tokenizer for your RAG pipeline
Prefer SentencePiece-based models (T5, ALBERT) for multilingual RAG. For English-only, BPE (OpenAI embeddings, RoBERTa) is the standard. Always check token fertility for your target language before budgeting API costs.
Cost estimation: multiply by fertility rate for non-English
If your budget assumes English token counts, multiply by the fertility rate for your actual language. A Korean chatbot costs 2.36× more than an English one at identical message length. A Russian document pipeline costs ~3×.
Why LLaMA 3 uses a 128K vocabulary (not 50K)
Larger vocabularies allocate more merge budget to non-English languages, reducing their fertility penalty. LLaMA 3's 128K-token vocabulary is a deliberate multilingual-fairness choice: the same words tokenize more efficiently in Korean, Arabic, and Hindi than under LLaMA 2.
When character-level is still the right choice
Character-level tokenization still wins for: fraud detection (character-level typos matter), OCR post-processing (noisy character sequences), password strength analysis, and scripts written without explicit word boundaries (Thai, Tibetan). Subword is not always best.
The tokenization-free future: watch BLT
Meta's BLT (Byte Latent Transformer, 2024) eliminates tokenization entirely. It processes raw bytes with dynamic patch boundaries — allocating compute where text is complex, skipping where it's predictable. If this approach scales, the entire subword tokenization ecosystem (BPE, WordPiece, SentencePiece, tiktoken) becomes legacy infrastructure.
Try All Algorithms — Side by Side

Pick a sentence and see how each algorithm tokenizes it. The token count difference shows the efficiency gap.

The Survey's Key Findings
The paper's verdict (2021)
"There is and likely will never be a silver bullet singular solution for all applications." — Mielke et al.
What 2024 says
Meta BLT's 50% FLOP reduction suggests tokenization itself may be the next thing to eliminate — not optimize.
50,257 GPT-2 vocab | 2.36× Korean tax | 50% fewer FLOPs (BLT) | No silver bullet
No silver bullet