When ChatGPT reads your 100-page document, it stores gigabytes of float16 vectors per layer just to answer one question. When Spotify searches 100 million songs for "similar to this one," it must scan 200GB of embeddings in milliseconds. Both come down to the same question: can we compress high-dimensional vectors 4× without losing quality? TurboQuant answers yes, provably, in 0.001 seconds per vector.
Scenario 1: ChatGPT reads your document (KV Cache)
You upload a 100-page contract to an LLM and ask: "Are there any auto-renewal clauses?" Before it can answer, the model must read every word.
For each word (token), the model computes a key vector and a value vector, each a list of 4,096 numbers in float16 (2 bytes each). Together they form the KV cache, the model's working memory for your conversation.
At 128,000 tokens (≈100 pages), the keys alone take 128,000 × 4,096 × 2 bytes = 1GB per layer; with the values, 2GB. A 32-layer model: 64GB of KV cache, most of an 80GB A100, for a single conversation.
Scenario 2: Spotify finds your next song (Vector Database)
Spotify's recommendation engine converts every song into a 512-number embedding, a point in 512-dimensional space where similar songs are close together. When you press "go to song radio," it finds the 50 nearest songs to the one you're playing.
At 100 million songs × 512 dimensions × 4 bytes = 200GB of embeddings. Searching this naively means computing 100M distances per query, in real time, for millions of concurrent users.
Quantizing from 32-bit float to 4-bit (8× compression): 200GB → 25GB. Distance computations are 8× cheaper. Same recommendations, a fraction of the hardware.
What exactly is a "vector"?
A vector is just a list of numbers. For an LLM key vector, each number encodes one abstract feature of a token's meaning. For a song embedding, each number encodes one aspect of the song's musical style.
In both cases: more numbers = more expressive = harder to store. A 4,096-dimensional LLM key stored as float16 = 8,192 bytes per token. Quantized to 3.5 bits per coordinate = 1,792 bytes. Same meaning, 4.6× smaller.
Think of a vector like GPS coordinates, but instead of 2 numbers (lat, long) you have 4,096. Each number pins down a different abstract "direction" in meaning-space. Compressing it is like rounding your GPS from 6 decimal places to 2: you still end up on the right street, just not the exact centimeter. TurboQuant proves you can do this rounding optimally: no more precision lost than mathematically necessary.
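The storage arithmetic is easy to verify in a few lines of Python (a back-of-envelope sketch; the helper name is ours, not any library's):

```python
# Back-of-envelope check of the per-token storage numbers.
d = 4096  # dimensions of one LLM key vector

def bytes_per_token(bits_per_coord: float) -> float:
    """Storage for one d-dimensional vector at the given bit width."""
    return d * bits_per_coord / 8

fp16_bytes = bytes_per_token(16)   # 8192.0 bytes per token
q_bytes = bytes_per_token(3.5)     # 1792.0 bytes per token
print(fp16_bytes, q_bytes, round(fp16_bytes / q_bytes, 2))  # ratio ~4.57
```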
Offline quantization (Product Quantization, RaBitQ): train first. Scan your full dataset, cluster it, build a codebook; only then can you compress new vectors. That takes 37–497 seconds, which is useless if data arrives in real time.
Online quantization (TurboQuant): no training. The compression formula is derived purely from mathematics (random rotation + Beta distribution + Lloyd-Max), so each new vector is compressed independently, instantly: 0.001 seconds per vector, on any vector, at any time.
Offline vs Online: Interactive Comparison
The Approximate Nearest Neighbor (ANN) Search Problem
To find the most similar vector in a database of N vectors, the naive approach computes the distance to all N vectors: O(N × d) operations. At 100M songs with 512 dimensions, that's 51 billion multiplications per query.
Approximate Nearest Neighbor (ANN) search accepts a small quality loss in exchange for speed. Quantization is the key enabler: compressed vectors are smaller, so they fit in CPU cache, which means fewer cache misses and faster comparisons.
The tradeoff: the more you compress, the faster the search but the lower the recall. TurboQuant hits the sweet spot: better recall than Product Quantization at the same compression ratio.
"Online" is like a coffee-shop barista who memorized the recipes (the math) and can make any drink immediately. "Offline" is like a barista who has to attend a 10-hour training course before serving the first customer. For a busy airport café (a real-time vector DB), the offline barista is useless; you need someone who can start serving the moment the doors open.
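To make the O(N × d) cost of the naive scan concrete, here is a minimal brute-force nearest-neighbor search in plain Python (a toy-scale sketch; real systems vectorize and batch this):

```python
import random

random.seed(0)
N, d = 1000, 16                     # toy scale; Spotify would be 100M x 512
database = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
query = [random.gauss(0, 1) for _ in range(d)]

def sq_dist(a, b):
    """Squared Euclidean distance: d multiplications per vector pair."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# One exact query costs N * d multiply-adds. At 100M vectors x 512 dims
# that is 51.2 billion multiplications, which is why ANN search trades
# a little recall for a much cheaper scan over compressed vectors.
nearest = min(range(N), key=lambda i: sq_dist(database[i], query))
print(nearest)
```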
Concept 1: What exactly is a "coordinate"?
When ChatGPT reads your contract and encounters the word "renewal", it doesn't store the word itself; it stores a list of ~4,096 numbers called a key vector. Each number (= one coordinate) measures how strongly that token activates one abstract "detector" the model learned during training.
You can think of them like a nutrition label: each row is a different dimension of meaning. Some dimensions are highly activated ("renewal" strongly triggers legal-term detectors), others are near zero (it barely activates colour or food detectors).
Click any word below to see its key vector. Notice how different words light up completely different coordinates, and how some coordinates are always much bigger than others.
Concept 2: What does "energy" mean?
"Energy" in a coordinate just means its squared value: how much it contributes to the total size of the vector. If coordinate #3 has value 3.2, its energy is 3.2² = 10.24. If coordinate #7 has value 0.1, its energy is only 0.01.
For the word "renewal", maybe 3 coordinates hold 70% of all the energy, and the other 4,093 coordinates hold the remaining 30%. This is called a spiky distribution. It's a disaster for compression, because the same scheme has to handle both the big spikes and the tiny near-zero values.
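The energy computation itself is one line. Here is a tiny example with a made-up "spiky" 8-dimensional vector (the values are illustrative, not real model activations):

```python
# Energy of a coordinate = its squared value. This toy vector is
# "spiky": two coordinates carry almost all of the total energy.
vec = [3.2, 0.1, -2.5, 0.05, 0.02, 0.1, -0.04, 0.08]

energies = [x * x for x in vec]            # per-coordinate energy
total = sum(energies)                      # squared length of the vector
top2 = sum(sorted(energies, reverse=True)[:2])
print(round(top2 / total, 3))              # the top 2 coordinates dominate
```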
Concept 3: Why "naive compression" fails on spiky vectors
The simplest way to compress a number is to round it to the nearest step, like rounding 3.247 to 3.2. You divide the possible range into equally spaced "bins" and store which bin the value falls in. This is called uniform quantization.
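A minimal uniform quantizer makes the failure mode visible: one spiky outlier forces a wide range, so almost every value near zero lands far from its bin center (a sketch with made-up parameters, not TurboQuant's quantizer):

```python
def uniform_quantize(x: float, r: float = 4.0, bits: int = 2) -> float:
    """Round x into one of 2**bits equal bins on [-r, r]; return bin center."""
    levels = 2 ** bits
    step = 2 * r / levels
    x = max(-r, min(r - 1e-9, x))          # clamp into range
    idx = int((x + r) / step)              # which bin x falls in
    return -r + (idx + 0.5) * step         # center of that bin

# The outlier 3.9 forces r = 4, so the 4 bin centers are -3, -1, 1, 3.
# Small values like 0.05 get rounded all the way to 1.0: huge error.
print([uniform_quantize(v) for v in (3.9, 0.05, -0.1, -3.0)])
```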
Concept 4: What the random rotation actually does
A rotation doesn't change the meaning of the vector; it just mixes the coordinates together. Think of it like stirring a glass of water with red food dye at the bottom: the total amount of dye (energy) stays the same, but it spreads evenly throughout the glass.
Imagine storing a recipe as a rack of 16 spice jars. For "renewal", jar #3 (legal-ness) is completely full, jar #1 (formality) is half full, and most jars are almost empty. If you compress this with equal-sized labels, you waste 14 labels on near-empty jars.
The rotation is like pouring all the spices into a blender, mixing them into 16 equal portions, and refilling the jars. Now every jar is equally full, so every label carries equal value. Compression works perfectly on equal jars.
Seeing it across many tokens: the histogram view
The single-vector view above shows one token. Let's zoom out: here are the coordinate values across all 500 tokens in your contract. Each token has its own vector, and we're plotting all coordinate values in one histogram. The shape of this histogram is what determines compression quality.
Before (coral): most values cluster near 0, with a few huge spikes, like the spice rack. Compressing this wastes most bins on the crowded center.
After (teal): a smooth bell curve, with no spikes and no wasted bins. Every compression bin covers roughly the same number of values.
After rotating the "renewal" key vector, each of the 4,096 coordinates is a tiny random mixture of all the original values. No single original value dominates anymore.
Because each coordinate is now made up of thousands of tiny contributions, they all end up in a predictable, bell-shaped range: not too big, not too small. Mathematicians call this shape the Beta distribution.
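You can watch the energy spread with a randomized Hadamard transform, one cheap orthogonal rotation commonly used for exactly this purpose (a sketch of the idea; the paper's exact choice of rotation may differ):

```python
import random

def hadamard(v):
    """Fast Walsh-Hadamard transform (length must be a power of 2),
    normalized so vector length (total energy) is preserved."""
    v = list(v)
    n, h = len(v), 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    norm = n ** 0.5
    return [x / norm for x in v]

random.seed(1)
d = 16
signs = [random.choice((-1, 1)) for _ in range(d)]   # random sign flips
spiky = [4.0] + [0.01] * (d - 1)                     # one giant coordinate
rotated = hadamard([s * x for s, x in zip(signs, spiky)])

energy = lambda u: sum(x * x for x in u)
print(round(energy(spiky), 4), round(energy(rotated), 4))  # same total energy
print(round(max(abs(x) for x in rotated), 2))  # the 4.0 spike is now ~1.0
```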
Seeing the Shape: One Coordinate Across Many Tokens
Imagine taking coordinate #7 (Duration-ref) from the key vector of every word in your contract: "renewal", "days", "party", "shall", "terminate"... all 10,000 tokens. Plotting all those values gives you a histogram. Before rotation, the shape is jagged and unpredictable. After rotation, it's a clean bell curve. The same clean bell curve appears for coordinate #1, #2, #3… all of them.
The Concentration Effect: More Dimensions = More Predictable
There's a bonus: the bigger the vector (more dimensions d), the narrower the bell curve gets. A small model with d=64 has a wide, spread-out distribution. A large model like Llama (d=4096) has a very tight, concentrated one: all coordinates are packed into a tiny range, making compression even more precise.
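A quick simulation shows the concentration effect: the spread of any single coordinate of a random unit vector shrinks like 1/√d (a sketch; the sample counts are arbitrary):

```python
import math
import random

def coord_std(d, trials=1000, seed=0):
    """Empirical std of the first coordinate of random unit vectors."""
    rng = random.Random(seed)
    vals = []
    for _ in range(trials):
        v = [rng.gauss(0, 1) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        vals.append(v[0] / norm)          # one coordinate of a unit vector
    mean = sum(vals) / trials
    return math.sqrt(sum((x - mean) ** 2 for x in vals) / trials)

# Theory: the std is about 1/sqrt(d), so bigger models have tighter curves.
print(round(coord_std(64), 3), round(1 / math.sqrt(64), 3))        # ~0.125
print(round(coord_std(4096, trials=300), 4), round(1 / math.sqrt(4096), 4))
```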
Why This Lets Each Coordinate Be Compressed Independently
In plain English: before rotation, if you know coordinate #3 is big, you can guess that coordinate #7 might also be big. They're correlated, carrying redundant information. After rotation, knowing one coordinate tells you nothing about the others. Each can be compressed in isolation, independently, without needing to look at the rest.
Imagine trying to compress a weather report. Temperature, humidity, and "feels like" are highly correlated: if it's 35°C and humid, "feels like" is always around 40°C. You're wasting space storing all three independently. After "rotating" (de-correlating), each measurement is truly independent; you can compress each with no knowledge of the others, and nothing is wasted. The Beta distribution is the mathematical proof that rotation achieves this.
Given a known distribution, Lloyd-Max finds the quantization thresholds that minimize mean squared error. The rule: each threshold lies halfway between adjacent centroids, and each centroid is the mean of its bin. Repeat until convergence. For a Gaussian, this puts bins densely in the center and sparsely at the tails.
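The two rules translate directly into code. This minimal 1-D Lloyd-Max loop on Gaussian samples recovers the classic 2-bit centroids near ±0.45 and ±1.51 (a sketch with arbitrary sample sizes, not the paper's implementation):

```python
import random

def lloyd_max(samples, centroids, iters=50):
    """Alternate the two rules: thresholds = midpoints between centroids,
    centroids = mean of the samples falling in each bin."""
    for _ in range(iters):
        thresholds = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
        bins = [[] for _ in centroids]
        for x in samples:
            bins[sum(x > t for t in thresholds)].append(x)   # assign to bin
        centroids = [sum(b) / len(b) if b else c
                     for b, c in zip(bins, centroids)]        # re-center
    return sorted(centroids)

rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(20000)]
print([round(c, 2) for c in lloyd_max(data, [-3.0, -1.0, 1.0, 3.0])])
# Centroids crowd toward the dense center, sparse at the tails.
```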
Interactive Quantizer Demo
Rate-Distortion: How Close to the Theoretical Limit?
Lloyd-Max is like a parking-lot designer who puts more spaces near the entrance. Instead of equal spacing, spaces cluster where cars are most likely to park: fewer steps for everyone on average. The theoretical lower bound is the absolute minimum you could reach with infinite design time; TurboQuant gets within a factor of 2.
The core operation in attention is the inner product ⟨query, key⟩. If your quantization systematically overestimates these products, the attention scores will be biased: some tokens will seem more relevant than they are. TurboQuant_PROD adds a single extra bit per vector that mathematically removes this bias.
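TurboQuant_PROD's exact 1-bit correction is beyond this sketch, but the underlying idea, making the quantized value correct on average so inner products stay unbiased, can be shown with generic stochastic rounding (our illustration, not the paper's mechanism):

```python
import random

def stochastic_round(x, step, rng):
    """Round down or up at random so the result equals x on average:
    E[round(x)] = x, which keeps inner products with x unbiased."""
    lo = (x // step) * step
    frac = (x - lo) / step                 # probability of rounding up
    return lo + step if rng.random() < frac else lo

rng = random.Random(0)
key, query, step = 0.7, 2.0, 1.0
trials = 200_000
est = sum(stochastic_round(key, step, rng) * query
          for _ in range(trials)) / trials
print(round(est, 2), key * query)  # the estimate hovers around the true 1.4
```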
Bias Visualization: Estimated vs True Inner Product
The QJL Residual Correction: 3 Steps
Like correcting a systematic rating bias: if Yelp reviewers always over-rate expensive restaurants, you ask one extra trusted reviewer to rate a comparable cheap one and use the difference to calibrate. The 1-bit QJL correction is that trusted reviewer: cheap, fast, and it neutralizes the systematic error.
Distortion Rate: How Close to the Limit?
LongBench: KV Cache Quality vs Bits Per Value
Indexing Speed Comparison
Recall@10 vs Product Quantization
Head-to-Head Scorecard
The Most Important Question: Does the Output Actually Look the Same?
Numbers like "3.5 bits" and "0.001 distortion" are hard to feel. The real test: give the same compressed model the same question β does the answer change? Below are real responses from Llama-3.1-8B, side by side: full precision vs TurboQuant at 3.5 bits.
Your Personal Savings Calculator
What does 4× compression actually mean for your workload? Adjust the sliders to see the memory and cost impact for your specific use case.
4× longer context windows on the same hardware. Or the same context, 4× cheaper. Inner-product-unbiased quantization means no model degradation: the model doesn't even know it's compressed.
Millisecond index building makes real-time quantization practical for streaming data. Better recall than PQ means fewer missed results in RAG pipelines. No offline training phase needed.
Near information-theoretic optimality with a strikingly simple algorithm. Proves random preprocessing + independent scalar quantizers can match the best achievable bound within a small constant factor.
AI Inference Efficiency Landscape
Open Problems
TurboQuant shows that theoretical optimality and practical simplicity aren't in conflict. A random rotation, plus a known distribution, plus a classical algorithm, and you're within a constant factor of the best possible. The three steps took decades of separate research to develop; the insight was combining them into a single online pipeline.