Every time you type a message to ChatGPT, Claude, or Gemini, your text is silently broken into fragments before the model ever sees it. This invisible step — tokenization — shapes model cost, language support, and what an LLM can reason about. Yet it's almost never discussed.
TL;DR — The Survey in One Paragraph
Mielke et al. (arXiv 2112.10508, 2021) survey how text can be represented at every granularity — from raw bytes to whole words — and trace the evolution of open-vocabulary modeling. The core finding: there is no silver bullet. Every tokenization strategy makes trade-offs between vocabulary size, sequence length, unknown-word handling, and multilingual fairness. The subword goldilocks zone (BPE, WordPiece, Unigram) currently dominates production LLMs — but Meta's 2024 Byte Latent Transformer suggests the entire paradigm may be about to change. Understanding tokenization is the prerequisite to understanding why models are biased, expensive, and sometimes wrong.
50,257
GPT-2 vocabulary size
30,522
BERT vocabulary size
2.36×
Korean token cost vs English
~4 bytes
avg bytes per subword token
Live Tokenizer — Pick an Algorithm
See how the same sentence gets split differently depending on the tokenizer. Each colored chip is one token.
? tokens
Why Does Tokenization Matter?
💰
Cost
LLM APIs charge per token. More tokens = higher bill. Non-English text can cost 2–3× more for the same semantic content.
📐
Context Window
All models have a token limit. Inefficient tokenization means less text fits in the window. Poor tokenizers waste your context budget.
⚖️
Fairness
English gets efficient encoding. Korean, Arabic, and Hindi speakers pay a "token tax" — higher costs and worse model accuracy.
Guess the Token Count — Can You Beat the Model?
Before reading further, guess how many tokens GPT-4 uses for each sentence. Drag the slider, then click Reveal. Most people guess too low for non-English text.
The Kitchen Prep Analogy
"Tokenization is like a chef who pre-chops every ingredient before cooking. The way they chop determines the recipe's speed, cost, and flavour — but nobody reads the prep-cook's instructions. They just taste the final dish. Understanding the chopping reveals why some cuisines (languages) are cheaper and faster to prepare than others."
Every tokenizer makes a choice. What are the options? The Spectrum →
The Landscape
The Spectrum
Text can be broken at any granularity: whole words, individual characters, raw bytes, or anything in between. Each extreme has a fatal flaw. Subword tokenization occupies the sweet spot — and it's why GPT-2, BERT, T5, LLaMA, and Claude all use it.
Drag the Slider — See How "unhappiness" Gets Split
Move from word-level (left) to character-level (right). Watch the token count, vocabulary size, and OOV risk change.
The Fatal Flaws at Each Extreme
❌ Word-Level
Vocabulary: 30K–100K words. Fatal flaw: OOV (Out-of-Vocabulary). Any unseen word — "grokked", "ChatGPT", "COVID-19" — collapses to a single [UNK] token, erasing all meaning.
★ Subword (Goldilocks)
Vocabulary: 30K–50K. Best of both worlds: no unknowns, reasonable sequence length, handles morphology. BPE splits "unhappiness" → ["un","happi","ness"]. Unknown words break into known parts.
unhappiness
★ Used by GPT-2, BERT, T5, LLaMA, Claude
❌ Character-Level
Vocabulary: ~256 chars. Fatal flaw: sequence explosion. A 10-word sentence → 50+ tokens, exhausting context windows and crushing performance on long-range tasks.
Which Model Uses Which Tokenizer?
Model
Tokenizer
Vocab Size
Released
GPT-2 / GPT-3
Byte-level BPE
50,257
2019/2020
BERT
WordPiece
30,522
2018
T5
Unigram (SentencePiece)
32,000
2019
RoBERTa
Byte-level BPE
50,265
2019
LLaMA 3
Byte-level BPE (tiktoken)
128,256
2024
ALBERT
Unigram (SentencePiece)
30,000
2019
Same Sentence — 5 Models — Click a Bar to Compare
How many tokens does each model need for the same sentence? More tokens = more cost + less context window. Click any model bar to see its split.
Type Your Own Text — See It Tokenized Live
Type anything — your name, a sentence in your language, a piece of code. See simulated BPE tokenization appear in real time.
0 tokens
Tip: try your name, a Korean/Arabic sentence, or emoji — and watch the token count change.
Subword wins. Which subword algorithm — and how does it work? Byte Pair Encoding →
The Algorithm
Byte Pair Encoding
BPE was originally a data-compression algorithm from 1994. Philip Gage never imagined it would become the foundation of GPT-2, GPT-3, and most modern LLMs. The idea: repeatedly merge the most frequent pair of adjacent symbols until you hit your vocabulary budget.
Step-by-Step BPE Merge Animator
Watch BPE build its vocabulary from scratch. Each step finds the most frequent pair and merges it into a new token.
Step 0 of 8 — initial character vocabulary
The BPE Merge Rule
Algorithm BPE(corpus, target_vocab_size):
    vocab ← all unique characters in corpus
    while len(vocab) < target_vocab_size:
        pairs  ← count_adjacent_pairs(corpus)
        best   ← argmax(pairs)           // most frequent pair
        vocab  ← vocab + [best]          // add merged token
        corpus ← replace(corpus, best)   // update corpus
    return vocab
The key insight: merging the most frequent pair gives the maximum compression gain per step. Run 50,000 merges and you get GPT-2's vocabulary.
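The pseudocode above can be sketched as runnable Python. This is a toy trainer over a tiny made-up corpus, for illustration only; production trainers operate on word frequencies and byte sequences at far larger scale.

```python
from collections import Counter

def bpe_train(corpus, target_vocab_size):
    """Minimal BPE trainer following the merge rule above (toy sketch)."""
    words = [tuple(w) for w in corpus]          # start from characters
    vocab = set(ch for w in words for ch in w)
    merges = []
    while len(vocab) < target_vocab_size:
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)        # most frequent pair
        merged = best[0] + best[1]
        vocab.add(merged)                       # add merged token
        merges.append(best)
        # Replace every occurrence of the pair in the corpus.
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(tuple(out))
        words = new_words
    return vocab, merges

corpus = ["low", "low", "lower", "newest", "newest", "widest"]
vocab, merges = bpe_train(corpus, target_vocab_size=15)
print(merges[0])  # ('l', 'o'): the most frequent adjacent pair merges first
```

Swap the toy corpus for web-scale text and set `target_vocab_size` to 50,257 and this same loop, in principle, produces a GPT-2-style vocabulary.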
What Is the Vocabulary Budget? — Drag the Slider
The vocabulary budget is the maximum number of tokens a model is allowed to know. Think of it like a fixed number of "slots" on a shelf — you decide upfront how many slots exist, then BPE fills them one merge at a time. Drag the slider to see the trade-offs at different budget sizes.
Budget size: 32K
How the budget shapes the vocabulary
Too small (≤ 8K)
BPE runs out of merges quickly. Common words like "running" get split into parts. More tokens per sentence → fills context window faster.
Sweet spot (32K–50K)
Enough merges to capture most common English words + frequent suffixes/prefixes. BERT: 30,522. GPT-2: 50,257. The industry standard.
Very large (128K+)
LLaMA 3 uses 128K. More budget = whole words in more languages fit as single tokens. Better multilingual fairness but bigger embedding matrix.
BPE Tokenization — Click a Word
See how a trained BPE model splits common words into subword tokens. The split reflects which merge sequences were most frequent during training.
What Makes Byte-Level BPE Special (GPT-2)
❌ Regular BPE
Starts from characters. Fails on emoji, accented characters, rare Unicode. The word "café" might produce [UNK] for the "é".
✓ Byte-level BPE (GPT-2)
Starts from 256 raw bytes. Any UTF-8 text — emoji, Arabic, Chinese, code — is always encodable. Zero unknown tokens. Ever.
GPT-2 vocabulary breakdown:
256 base bytes ← covers all possible bytes
+ 50,000 BPE merges ← learned from web text
+ 1 special token ← <|endoftext|>
= 50,257 total tokens
GPT-2 · GPT-3 · GPT-4 · RoBERTa · BART · LLaMA 3
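Both the byte-level guarantee and the vocabulary arithmetic can be checked directly in Python; UTF-8 encoding is the only machinery needed.

```python
# Byte-level BPE starts from raw UTF-8 bytes, so "café" and emoji always encode.
text = "café 🚀"
raw = text.encode("utf-8")
print(list(raw))         # every value is in 0..255, i.e. a base-vocabulary token
print(len(raw))          # 10 bytes: c, a, f, 2 for é, space, 4 for the rocket

# GPT-2's vocabulary arithmetic:
print(256 + 50_000 + 1)  # 50257: base bytes + learned merges + <|endoftext|>
```

Because every possible byte is already in the base vocabulary, there is no input that can ever fall back to [UNK].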
BPE merges by raw frequency. WordPiece asks: is this pair surprising? WordPiece & Unigram →
The Alternatives
WordPiece & Unigram
BERT uses WordPiece; T5 uses Unigram. Both fill a gap in BPE: raw frequency doesn't always produce the most meaningful units. WordPiece uses mutual information. Unigram goes in reverse — starting large and pruning down.
BPE vs WordPiece — The Core Difference
The ## Prefix — BERT's Word-Boundary Signal
WordPiece uses ## to mark continuation tokens — subwords that are not the start of a word. This lets the model know where word boundaries lie, even after tokenization.
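A minimal sketch of the greedy longest-match-first segmentation WordPiece uses at inference time. The mini-vocabulary here is hypothetical, invented for illustration; a real BERT vocabulary has 30,522 entries.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first WordPiece segmentation (toy sketch).
    Continuation pieces carry the ## prefix."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # mark non-initial pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                           # shrink and retry
        if piece is None:
            return ["[UNK]"]                   # nothing matched at this position
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical mini-vocabulary for illustration:
vocab = {"un", "happy", "##happi", "##ness", "play", "##ing"}
print(wordpiece_tokenize("unhappiness", vocab))  # ['un', '##happi', '##ness']
print(wordpiece_tokenize("playing", vocab))      # ['play', '##ing']
```

At decode time, any piece starting with ## is glued to the previous one with no space, which is exactly the boundary signal the prefix encodes.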
Unigram — Top-Down Pruning with Viterbi
Unigram flips the script: start with a huge vocabulary (~50K) and prune the least important tokens until the target size is reached. During inference, it finds the most probable segmentation using the Viterbi algorithm.
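The Viterbi step can be sketched as a short dynamic program. The piece log-probabilities below are made up for illustration; a trained Unigram model learns them from data via EM.

```python
import math

def viterbi_segment(text, logprobs):
    """Most-probable segmentation under a unigram piece model (toy sketch).
    logprobs maps pieces to log-probabilities; unknown spans are disallowed."""
    n = len(text)
    best = [float("-inf")] * (n + 1)   # best[i] = best log-prob of text[:i]
    best[0] = 0.0
    back = [0] * (n + 1)               # back[i] = start index of the last piece
    for i in range(1, n + 1):
        for j in range(i):
            piece = text[j:i]
            if piece in logprobs and best[j] + logprobs[piece] > best[i]:
                best[i] = best[j] + logprobs[piece]
                back[i] = j
    # Recover the segmentation by walking the backpointers.
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]

# Hypothetical unigram log-probabilities for illustration:
logprobs = {"un": math.log(0.1), "happi": math.log(0.05), "ness": math.log(0.08),
            "u": math.log(0.01), "n": math.log(0.01), "unhappi": math.log(0.001)}
print(viterbi_segment("unhappiness", logprobs))  # ['un', 'happi', 'ness']
```

Because segmentation maximizes total probability rather than following fixed merge rules, the same word can legitimately segment differently as the model's probabilities change, which is what enables subword regularization during training.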
Algorithm Comparison
Property
BPE
WordPiece
Unigram
Direction
Bottom-up merge
Bottom-up merge
Top-down prune
Selection
Max frequency
Max mutual information
Min perplexity loss
Segmentation
Deterministic
Deterministic
Probabilistic (Viterbi)
Continuation marker
None
## prefix
None
Used by
GPT-2, RoBERTa, BART
BERT, DistilBERT, Electra
T5, ALBERT, XLNet
BPE and WordPiece assume spaces as word boundaries. SentencePiece doesn't. The Token Tax →
The Consequences
The Token Tax
Token fertility — tokens per word — reliably predicts model accuracy across languages. English gets 1.0×. Korean pays 2.36×. Russian and Hebrew pay 3×. This isn't a bug; it's baked into every tokenizer trained predominantly on English text.
1.0×
English fertility (baseline)
2.36×
Korean token fertility
3×
Russian / Hebrew fertility
4×
Economic cost multiplier
The Price Tag — Same Message, Different Cost
The same semantic content costs vastly different amounts depending on language. Click any price tag to see the breakdown.
Token Fertility by Language — Click a Region
Fertility rate = average tokens needed per word, relative to English. Higher fertility = more expensive inference and lower model accuracy.
Token Tax Calculator
Estimate the token overhead for the same content in different languages. API cost at $0.01 per 1K tokens.
Base English text (estimated tokens):
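The calculator's arithmetic is simple enough to sketch directly. Fertility figures are the ones quoted in this article, and the price is the hypothetical $0.01 per 1K tokens above.

```python
def token_tax(english_tokens, fertility, price_per_1k=0.01):
    """Estimated token count and API cost for equivalent content
    at a given fertility rate (illustrative sketch)."""
    tokens = english_tokens * fertility
    return tokens, tokens * price_per_1k / 1000

# Same semantic content, one million English-equivalent tokens:
for lang, fert in [("English", 1.0), ("Korean", 2.36), ("Russian", 3.0)]:
    tokens, cost = token_tax(1_000_000, fert)
    print(f"{lang:8s} {tokens:>12,.0f} tokens  ${cost:.2f}")
```

At this rate the English bill is $10.00, the Korean bill $23.60, and the Russian bill $30.00 for the same message volume.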
Why Does the Tax Exist?
Training Data Bias
Internet text used to train tokenizers is ~50% English. English subword units get most of the 50,000 merge budget. Other languages get the leftovers.
Byte-to-Character Ratio
Latin script ≈ 1 byte/char. Arabic ≈ 2 bytes/char; CJK characters ≈ 3 bytes/char. Byte-level BPE penalizes complex scripts at the very foundation.
The Merge Budget
50,000 BPE merges must cover all languages. English consumes the lion's share, leaving less representational capacity for every other language.
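The byte-to-character ratio is directly observable from UTF-8 encoding, with no tokenizer involved at all:

```python
# UTF-8 bytes per character across scripts: the root of the byte-level penalty.
for ch in ["a", "é", "ر", "한", "字"]:     # Latin, accented Latin, Arabic, Hangul, CJK
    print(ch, len(ch.encode("utf-8")))     # 1, 2, 2, 3, 3
```

A byte-level tokenizer therefore starts with a longer input sequence for non-Latin scripts before a single merge is applied.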
Research Finding
"Tokenization is a tax that low-resource languages cannot afford to pay, charged on every token, in every layer. Scale partially recovers the gap — meaning smaller models spend raw parameter budget reconstructing what should have been clean input from the start."
Fertility vs Model Accuracy — The Negative Correlation
Languages with higher token fertility consistently show lower downstream model accuracy. The correlation is strong and holds across model families.
The token tax exists because every algorithm so far assumes spaces = word boundaries. SentencePiece eliminates that assumption entirely — treating raw text as a byte stream and encoding the space as just another character (▁). One change, universal coverage.
The Space Boundary Problem
Traditional tokenizers pre-tokenize on whitespace — but many languages have no whitespace. SentencePiece skips pre-tokenization entirely.
SentencePiece ▁ Demo — The Space Becomes a Token
Select a language to see how SentencePiece handles it. The ▁ symbol marks where a space existed in the original text — making decoding fully reversible.
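A simplified sketch of the reversibility trick. Real SentencePiece also learns the segmentation itself; the part shown here is only the space-to-▁ mapping that makes decoding lossless.

```python
def sp_pretokenize(text):
    """Mark spaces with ▁ so they survive segmentation (simplified sketch)."""
    return text.replace(" ", "▁")

def sp_decode(pieces):
    """Concatenate pieces and restore spaces; no information is lost."""
    return "".join(pieces).replace("▁", " ")

marked = sp_pretokenize("Hello world again")    # 'Hello▁world▁again'
pieces = [marked[:5], marked[5:9], marked[9:]]  # one arbitrary split
print(pieces)                                   # ['Hello', '▁wor', 'ld▁again']
print(sp_decode(pieces))                        # 'Hello world again'
```

Because the space is an ordinary symbol inside the pieces, any segmentation whatsoever decodes back to the exact original string, whitespace included.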
SentencePiece at a Glance
50K
sentences/sec
6MB
memory footprint
50+
languages, same pipeline
0
unknown tokens ever
Byte-Level BPE — "café" and 🚀 Without [UNK]
tiktoken — OpenAI's Production Tokenizer Speed
tiktoken (used in GPT-4) achieves 3–6× faster tokenization than comparable open-source implementations, using a Rust core under a Python interface.
Now you understand the problem. Here's the fix designed for it. The Future →
The Legacy & Next Steps
The Future of Tokenization
The 2021 survey concluded: "there is and likely will never be a silver bullet." Three years later, Meta's Byte Latent Transformer challenged that. The future might not be better tokenization — it might be no tokenization at all.
The Tokenization Timeline — Click a Milestone
Meta BLT — "No Tokenization" — How It Works
The Byte Latent Transformer (2024) operates directly on raw bytes. Instead of fixed tokens, it groups bytes into dynamic "patches" based on entropy — complex regions get more compute, simple regions get less.
50%
fewer FLOPs at inference vs LLaMA 3
Equal
quality to LLaMA 3 at same compute
Better
on non-English and rare scripts
What This Means for You
Choosing a tokenizer for your RAG pipeline▼
Prefer SentencePiece-based models (T5, ALBERT) for multilingual RAG. For English-only, BPE (OpenAI embeddings, RoBERTa) is the standard. Always check token fertility for your target language before budgeting API costs.
Cost estimation: multiply by fertility rate for non-English▼
If your budget assumes English token counts, multiply by the fertility rate for your actual language. A Korean chatbot costs 2.36× more than an English one at identical message length. A Russian document pipeline costs ~3×.
Why LLaMA 3 uses a 128K vocabulary (not 50K)▼
Larger vocabularies allocate more merge budget to non-English languages, reducing their fertility penalty. LLaMA 3's 128K vocabulary is a deliberate multilingual fairness choice — the same words tokenize more efficiently in Korean, Arabic, and Hindi than they did under LLaMA 2.
When character-level is still the right choice▼
Character-level tokenization still wins for: fraud detection (character-level typos matter), OCR post-processing (noisy character sequences), password strength analysis, and languages with character-based morphology (Thai, Tibetan). Subword is not always best.
The tokenization-free future: watch BLT▼
Meta's BLT (Byte Latent Transformer, 2024) eliminates tokenization entirely. It processes raw bytes with dynamic patch boundaries — allocating compute where text is complex, skipping where it's predictable. If this approach scales, the entire subword tokenization ecosystem (BPE, WordPiece, SentencePiece, tiktoken) becomes legacy infrastructure.
Try All Algorithms — Side by Side
Pick a sentence and see how each algorithm tokenizes it. The token count difference shows the efficiency gap.
The Survey's Key Findings
The paper's verdict (2021)
"There is and likely will never be a silver bullet singular solution for all applications." — Mielke et al.
What 2024 says
Meta BLT's 50% FLOP reduction suggests tokenization itself may be the next thing to eliminate — not optimize.