Every time you type a message to ChatGPT, Claude, or Gemini, your text is silently broken into fragments before the model ever sees it. This invisible step — tokenization — shapes model cost, language support, and what an LLM can reason about. Yet it's almost never discussed.
TL;DR — The Survey in One Paragraph
Mielke et al. (arXiv 2112.10508, 2021) survey how text can be represented at every granularity — from raw bytes to whole words — and trace the evolution of open-vocabulary modeling. The core finding: there is no silver bullet. Every tokenization strategy makes trade-offs between vocabulary size, sequence length, unknown-word handling, and multilingual fairness. The subword goldilocks zone (BPE, WordPiece, Unigram) currently dominates production LLMs — but Meta's 2024 Byte Latent Transformer suggests the entire paradigm may be about to change. Understanding tokenization is the prerequisite to understanding why models are biased, expensive, and sometimes wrong.
50,257
GPT-2 vocabulary size
30,522
BERT vocabulary size
2.36×
Korean token cost vs English
~4 bytes
avg bytes per subword token
Live Tokenizer — Pick an Algorithm
See how the same sentence gets split differently depending on the tokenizer. Each colored chip is one token.
? tokens
Why Does Tokenization Matter?
💰
Cost
LLM APIs charge per token. More tokens = higher bill. Non-English text can cost 2–3× more for the same semantic content.
📐
Context Window
All models have a token limit. Inefficient tokenization means less text fits in the window. Poor tokenizers waste your context budget.
⚖️
Fairness
English gets efficient encoding. Korean, Arabic, and Hindi speakers pay a "token tax" — higher costs and worse model accuracy.
Guess the Token Count — Can You Beat the Model?
Before reading further, guess how many tokens GPT-4 uses for each sentence. Drag the slider, then click Reveal. Most people guess too low for non-English text.
The Kitchen Prep Analogy
"Tokenization is like a chef who pre-chops every ingredient before cooking. The way they chop determines the recipe's speed, cost, and flavour — but nobody reads the prep-cook's instructions. They just taste the final dish. Understanding the chopping reveals why some cuisines (languages) are cheaper and faster to prepare than others."
Every tokenizer makes a choice. What are the options? The Spectrum →
The Landscape
The Spectrum
Text can be broken at any granularity: whole words, individual characters, raw bytes, or anything in between. Each extreme has a fatal flaw. Subword tokenization occupies the sweet spot — and it's why GPT-2, BERT, T5, LLaMA, and Claude all use it.
Drag the Slider — See How "unhappiness" Gets Split
Move from word-level (left) to character-level (right). Watch the token count, vocabulary size, and OOV risk change.
The Fatal Flaws at Each Extreme
❌ Word-Level
Vocabulary: 30K–100K words. Fatal flaw: OOV (Out-of-Vocabulary). Any unseen word — "grokked", "ChatGPT", "COVID-19" — collapses to a single [UNK] token, erasing all meaning.
★ Subword (Goldilocks)
Vocabulary: 30K–50K. Best of both worlds: no unknowns, reasonable sequence length, handles morphology. BPE splits "unhappiness" → ["un","happi","ness"]. Unknown words break into known parts.
unhappiness
★ Used by GPT-2, BERT, T5, LLaMA, Claude
❌ Character-Level
Vocabulary: ~256 chars. Fatal flaw: sequence explosion. A 10-word sentence → 50+ tokens, exhausting context windows and crushing performance on long-range tasks.
Which Model Uses Which Tokenizer?
Model
Tokenizer
Vocab Size
Released
GPT-2 / GPT-3
Byte-level BPE
50,257
2019/2020
BERT
WordPiece
30,522
2018
T5
Unigram (SentencePiece)
32,000
2019
RoBERTa
Byte-level BPE
50,265
2019
LLaMA 3
Byte-level BPE (tiktoken)
128,256
2024
ALBERT
Unigram (SentencePiece)
30,000
2019
Same Sentence — 5 Models — Click a Bar to Compare
How many tokens does each model need for the same sentence? More tokens = more cost + less context window. Click any model bar to see its split.
Type Your Own Text — See It Tokenized Live
Type anything — your name, a sentence in your language, a piece of code. See simulated BPE tokenization appear in real time.
0 tokens
Tip: try your name, a Korean/Arabic sentence, or emoji — and watch the token count change.
Subword wins. Which subword algorithm — and how does it work? Byte Pair Encoding →
The Algorithm
Byte Pair Encoding
BPE was originally a data-compression algorithm from 1994. Philip Gage never imagined it would become the foundation of GPT-2, GPT-3, and most modern LLMs. The idea: repeatedly merge the most frequent pair of adjacent symbols until you hit your vocabulary budget.
Step-by-Step BPE Merge Animator
Watch BPE build its vocabulary from scratch. Each step finds the most frequent pair and merges it into a new token.
Step 0 of 8 — initial character vocabulary
The BPE Merge Rule
Algorithm BPE(corpus, target_vocab_size):
    vocab ← all unique characters in corpus
    while len(vocab) < target_vocab_size:
        pairs  ← count_adjacent_pairs(corpus)
        best   ← argmax(pairs)           // most frequent pair
        vocab  ← vocab + [best]          // add merged token
        corpus ← replace(corpus, best)   // update corpus
    return vocab
The key insight: merging the most frequent pair gives the maximum compression gain per step. Run 50,000 merges and you get GPT-2's vocabulary.
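The pseudocode above can be sketched as runnable Python. This is a toy trainer over a tiny made-up corpus, for illustration only; production trainers operate on word frequencies and byte sequences at far larger scale.

```python
from collections import Counter

def bpe_train(corpus, target_vocab_size):
    """Minimal BPE trainer following the merge rule above (toy sketch)."""
    words = [tuple(w) for w in corpus]          # start from characters
    vocab = set(ch for w in words for ch in w)
    merges = []
    while len(vocab) < target_vocab_size:
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)        # most frequent pair
        merged = best[0] + best[1]
        vocab.add(merged)                       # add merged token
        merges.append(best)
        # Replace every occurrence of the pair in the corpus.
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(tuple(out))
        words = new_words
    return vocab, merges

corpus = ["low", "low", "lower", "newest", "newest", "widest"]
vocab, merges = bpe_train(corpus, target_vocab_size=15)
print(merges[0])  # ('l', 'o'): the most frequent adjacent pair merges first
```

Swap the toy corpus for web-scale text and set `target_vocab_size` to 50,257 and this same loop, in principle, produces a GPT-2-style vocabulary.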
What Is the Vocabulary Budget? — Drag the Slider
The vocabulary budget is the maximum number of tokens a model is allowed to know. Think of it like a fixed number of "slots" on a shelf — you decide upfront how many slots exist, then BPE fills them one merge at a time. Drag the slider to see the trade-offs at different budget sizes.
Budget size: 32K
How the budget shapes the vocabulary
Too small (≤ 8K)
BPE runs out of merges quickly. Common words like "running" get split into parts. More tokens per sentence → fills context window faster.
Sweet spot (32K–50K)
Enough merges to capture most common English words + frequent suffixes/prefixes. BERT: 30,522. GPT-2: 50,257. The industry standard.
Very large (128K+)
LLaMA 3 uses 128K. More budget = whole words in more languages fit as single tokens. Better multilingual fairness but bigger embedding matrix.
BPE Tokenization — Click a Word
See how a trained BPE model splits common words into subword tokens. The split reflects which merge sequences were most frequent during training.
What Makes Byte-Level BPE Special (GPT-2)
❌ Regular BPE
Starts from characters. Fails on emoji, accented characters, rare Unicode. The word "café" might produce [UNK] for the "é".
✓ Byte-level BPE (GPT-2)
Starts from 256 raw bytes. Any UTF-8 text — emoji, Arabic, Chinese, code — is always encodable. Zero unknown tokens. Ever.
GPT-2 vocabulary breakdown:
256 base bytes ← covers all possible bytes
+ 50,000 BPE merges ← learned from web text
+ 1 special token ← <|endoftext|>
= 50,257 total tokens
GPT-2 · GPT-3 · GPT-4 · RoBERTa · BART · LLaMA 3
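Both the byte-level guarantee and the vocabulary arithmetic can be checked directly in Python; UTF-8 encoding is the only machinery needed.

```python
# Byte-level BPE starts from raw UTF-8 bytes, so "café" and emoji always encode.
text = "café 🚀"
raw = text.encode("utf-8")
print(list(raw))         # every value is in 0..255, i.e. a base-vocabulary token
print(len(raw))          # 10 bytes: c, a, f, 2 for é, space, 4 for the rocket

# GPT-2's vocabulary arithmetic:
print(256 + 50_000 + 1)  # 50257: base bytes + learned merges + <|endoftext|>
```

Because every possible byte is already in the base vocabulary, there is no input that can ever fall back to [UNK].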
BPE merges by raw frequency. WordPiece asks: is this pair surprising? WordPiece & Unigram →
The Alternatives
WordPiece & Unigram
BERT uses WordPiece; T5 uses Unigram. Both fill a gap in BPE: raw frequency doesn't always produce the most meaningful units. WordPiece uses mutual information. Unigram goes in reverse — starting large and pruning down.
BPE vs WordPiece — The Core Difference
The ## Prefix — BERT's Word-Boundary Signal
WordPiece uses ## to mark continuation tokens — subwords that are not the start of a word. This lets the model know where word boundaries lie, even after tokenization.
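A minimal sketch of the greedy longest-match-first segmentation WordPiece uses at inference time. The mini-vocabulary here is hypothetical, invented for illustration; a real BERT vocabulary has 30,522 entries.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first WordPiece segmentation (toy sketch).
    Continuation pieces carry the ## prefix."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # mark non-initial pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                           # shrink and retry
        if piece is None:
            return ["[UNK]"]                   # nothing matched at this position
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical mini-vocabulary for illustration:
vocab = {"un", "happy", "##happi", "##ness", "play", "##ing"}
print(wordpiece_tokenize("unhappiness", vocab))  # ['un', '##happi', '##ness']
print(wordpiece_tokenize("playing", vocab))      # ['play', '##ing']
```

At decode time, any piece starting with ## is glued to the previous one with no space, which is exactly the boundary signal the prefix encodes.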
Unigram — Top-Down Pruning with Viterbi
Unigram flips the script: start with a huge vocabulary (~50K) and prune the least important tokens until the target size is reached. During inference, it finds the most probable segmentation using the Viterbi algorithm.
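The Viterbi step can be sketched as a short dynamic program. The piece log-probabilities below are made up for illustration; a trained Unigram model learns them from data via EM.

```python
import math

def viterbi_segment(text, logprobs):
    """Most-probable segmentation under a unigram piece model (toy sketch).
    logprobs maps pieces to log-probabilities; unknown spans are disallowed."""
    n = len(text)
    best = [float("-inf")] * (n + 1)   # best[i] = best log-prob of text[:i]
    best[0] = 0.0
    back = [0] * (n + 1)               # back[i] = start index of the last piece
    for i in range(1, n + 1):
        for j in range(i):
            piece = text[j:i]
            if piece in logprobs and best[j] + logprobs[piece] > best[i]:
                best[i] = best[j] + logprobs[piece]
                back[i] = j
    # Recover the segmentation by walking the backpointers.
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]

# Hypothetical unigram log-probabilities for illustration:
logprobs = {"un": math.log(0.1), "happi": math.log(0.05), "ness": math.log(0.08),
            "u": math.log(0.01), "n": math.log(0.01), "unhappi": math.log(0.001)}
print(viterbi_segment("unhappiness", logprobs))  # ['un', 'happi', 'ness']
```

Because segmentation maximizes total probability rather than following fixed merge rules, the same word can legitimately segment differently as the model's probabilities change, which is what enables subword regularization during training.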
Algorithm Comparison
Property
BPE
WordPiece
Unigram
Direction
Bottom-up merge
Bottom-up merge
Top-down prune
Selection
Max frequency
Max mutual information
Min perplexity loss
Segmentation
Deterministic
Deterministic
Probabilistic (Viterbi)
Continuation marker
None
## prefix
None
Used by
GPT-2, RoBERTa, BART
BERT, DistilBERT, Electra
T5, ALBERT, XLNet
BPE and WordPiece assume spaces as word boundaries. SentencePiece doesn't. The Token Tax →
The Consequences
The Token Tax
Token fertility — tokens per word — reliably predicts model accuracy across languages. English gets 1.0×. Korean pays 2.36×. Russian and Hebrew pay 3×. This isn't a bug; it's baked into every tokenizer trained predominantly on English text.
1.0×
English fertility (baseline)
2.36×
Korean token fertility
3×
Russian / Hebrew fertility
4×
Economic cost multiplier
The Price Tag — Same Message, Different Cost
The same semantic content costs vastly different amounts depending on language. Click any price tag to see the breakdown.
Token Fertility by Language — Click a Region
Fertility rate = average tokens needed per word, relative to English. Higher fertility = more expensive inference and lower model accuracy.
Token Tax Calculator
Estimate the token overhead for the same content in different languages. API cost at $0.01 per 1K tokens.
Base English text (estimated tokens):
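The calculator's arithmetic is simple enough to sketch directly. Fertility figures are the ones quoted in this article, and the price is the hypothetical $0.01 per 1K tokens above.

```python
def token_tax(english_tokens, fertility, price_per_1k=0.01):
    """Estimated token count and API cost for equivalent content
    at a given fertility rate (illustrative sketch)."""
    tokens = english_tokens * fertility
    return tokens, tokens * price_per_1k / 1000

# Same semantic content, one million English-equivalent tokens:
for lang, fert in [("English", 1.0), ("Korean", 2.36), ("Russian", 3.0)]:
    tokens, cost = token_tax(1_000_000, fert)
    print(f"{lang:8s} {tokens:>12,.0f} tokens  ${cost:.2f}")
```

At this rate the English bill is $10.00, the Korean bill $23.60, and the Russian bill $30.00 for the same message volume.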
Why Does the Tax Exist?
Training Data Bias
Internet text used to train tokenizers is ~50% English. English subword units get most of the 50,000 merge budget. Other languages get the leftovers.
Byte-to-Character Ratio
Latin script ≈ 1 byte/char. Arabic ≈ 2 bytes/char; CJK characters ≈ 3 bytes/char. Byte-level BPE penalizes complex scripts at the very foundation.
The Merge Budget
50,000 BPE merges must cover all languages. English consumes the lion's share, leaving less representational capacity for every other language.
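The byte-to-character ratio is directly observable from UTF-8 encoding, with no tokenizer involved at all:

```python
# UTF-8 bytes per character across scripts: the root of the byte-level penalty.
for ch in ["a", "é", "ر", "한", "字"]:     # Latin, accented Latin, Arabic, Hangul, CJK
    print(ch, len(ch.encode("utf-8")))     # 1, 2, 2, 3, 3
```

A byte-level tokenizer therefore starts with a longer input sequence for non-Latin scripts before a single merge is applied.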
Research Finding
"Tokenization is a tax that low-resource languages cannot afford to pay, charged on every token, in every layer. Scale partially recovers the gap — meaning smaller models spend raw parameter budget reconstructing what should have been clean input from the start."
Fertility vs Model Accuracy — The Negative Correlation
Languages with higher token fertility consistently show lower downstream model accuracy. The correlation is strong and holds across model families.
The token tax exists because every algorithm so far assumes spaces = word boundaries. SentencePiece eliminates that assumption entirely — treating raw text as a byte stream and encoding the space as just another character (▁). One change, universal coverage.
The Space Boundary Problem
Traditional tokenizers pre-tokenize on whitespace — but many languages have no whitespace. SentencePiece skips pre-tokenization entirely.
SentencePiece ▁ Demo — The Space Becomes a Token
Select a language to see how SentencePiece handles it. The ▁ symbol marks where a space existed in the original text — making decoding fully reversible.
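A simplified sketch of the reversibility trick. Real SentencePiece also learns the segmentation itself; the part shown here is only the space-to-▁ mapping that makes decoding lossless.

```python
def sp_pretokenize(text):
    """Mark spaces with ▁ so they survive segmentation (simplified sketch)."""
    return text.replace(" ", "▁")

def sp_decode(pieces):
    """Concatenate pieces and restore spaces; no information is lost."""
    return "".join(pieces).replace("▁", " ")

marked = sp_pretokenize("Hello world again")    # 'Hello▁world▁again'
pieces = [marked[:5], marked[5:9], marked[9:]]  # one arbitrary split
print(pieces)                                   # ['Hello', '▁wor', 'ld▁again']
print(sp_decode(pieces))                        # 'Hello world again'
```

Because the space is an ordinary symbol inside the pieces, any segmentation whatsoever decodes back to the exact original string, whitespace included.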
SentencePiece at a Glance
50K
sentences/sec
6MB
memory footprint
50+
languages, same pipeline
0
unknown tokens ever
Byte-Level BPE — "café" and 🚀 Without [UNK]
tiktoken — OpenAI's Production Tokenizer Speed
tiktoken (used in GPT-4) achieves 3–6× faster tokenization than comparable open-source implementations, using a Rust core under a Python interface.
Now you understand the problem. Here's the fix designed for it. The Future →
The Legacy & Next Steps
The Future of Tokenization
The 2021 survey concluded: "there is and likely will never be a silver bullet." Three years later, Meta's Byte Latent Transformer challenged that. The future might not be better tokenization — it might be no tokenization at all.
The Tokenization Timeline — Click a Milestone
Meta BLT — "No Tokenization" — How It Works
The Byte Latent Transformer (2024) operates directly on raw bytes. Instead of fixed tokens, it groups bytes into dynamic "patches" based on entropy — complex regions get more compute, simple regions get less.
50%
fewer FLOPs at inference vs LLaMA 3
Equal
quality to LLaMA 3 at same compute
Better
on non-English and rare scripts
What This Means for You
Choosing a tokenizer for your RAG pipeline▼
Prefer SentencePiece-based models (T5, ALBERT) for multilingual RAG. For English-only, BPE (OpenAI embeddings, RoBERTa) is the standard. Always check token fertility for your target language before budgeting API costs.
Cost estimation: multiply by fertility rate for non-English▼
If your budget assumes English token counts, multiply by the fertility rate for your actual language. A Korean chatbot costs 2.36× more than an English one at identical message length. A Russian document pipeline costs ~3×.
Why LLaMA 3 uses a 128K vocabulary (not 50K)▼
Larger vocabularies allocate more merge budget to non-English languages, reducing their fertility penalty. LLaMA 3's 128K vocabulary is a deliberate multilingual fairness choice — the same words tokenize more efficiently in Korean, Arabic, and Hindi than they did under LLaMA 2.
When character-level is still the right choice▼
Character-level tokenization still wins for: fraud detection (character-level typos matter), OCR post-processing (noisy character sequences), password strength analysis, and languages with character-based morphology (Thai, Tibetan). Subword is not always best.
The tokenization-free future: watch BLT▼
Meta's BLT (Byte Latent Transformer, 2024) eliminates tokenization entirely. It processes raw bytes with dynamic patch boundaries — allocating compute where text is complex, skipping where it's predictable. If this approach scales, the entire subword tokenization ecosystem (BPE, WordPiece, SentencePiece, tiktoken) becomes legacy infrastructure.
Try All Algorithms — Side by Side
Pick a sentence and see how each algorithm tokenizes it. The token count difference shows the efficiency gap.
The Survey's Key Findings
The paper's verdict (2021)
"There is and likely will never be a silver bullet singular solution for all applications." — Mielke et al.
What 2024 says
Meta BLT's 50% FLOP reduction suggests tokenization itself may be the next thing to eliminate — not optimize.