Visual Summary
Post 56 · Representations · Advanced
Post 56 Representations ColBERT · Weaviate · 2024 Late Interaction
Multi-Vector Embeddings
Beyond the Single Point
A single embedding vector must compress an entire document into one point in space. For a sentence, that's fine. For a 500-word technical paper — or a PDF full of charts — that compression destroys the exact details retrieval depends on. Multi-vector embeddings keep a separate contextualised embedding for every token. Documents become constellations, not points. Matching becomes precise, not averaged.
170×: lower latency vs BERT reranker
+12.5: nDCG@10 on long docs (M3)
10×: index compression (v1→v2)
128: dims per token (ColBERT)
52.1: BEIR avg nDCG (Jina-ColBERT-v2)
The Core Shift
Single-vector: one point represents the entire document. Multi-vector (ColBERT): every token gets its own 128-dim contextualised vector. The document becomes a matrix — shape (N_tokens × 128). Retrieval compares each query token against every document token and takes the best match per query token.
Why It Matters Now
ColBERT's MaxSim score matches BERT-large reranker quality at 170× lower latency. ColBERTv2 compresses the index 10× via residual quantization. ColPali extends this to visual documents — skip OCR entirely; retrieve from raw PDF page images. Weaviate v1.29 ships native multi-vector support.
The Pooling Problem
Why averaging token embeddings into a single vector loses the details that retrieval depends on
✖ Single-Vector Failure Mode
Query: "bank loan interest rate"
Document A: "The river bank by the old mill floods in spring. Loan of a fishing rod costs $5 a day."
Document B: "A financial institution offering personal loans at a competitive rate."

Pooling collapses both documents to similar centroid embeddings. Document A scores high on "bank" and "loan" — wrong senses. Single-vector retrieval may rank it ahead of B.
✔ Multi-Vector Solution
BERT produces contextualised embeddings — "bank" in a fishing context and "bank" in a finance context occupy different positions in the 768-dim space. ColBERT preserves all token embeddings. When "bank" from the query finds its best match, it matches the financial-institution token, not the river-bank token. No compression = no context collapse.
Why Pooling Destroys Rare-Term Signals
The Centroid Effect
Mean pooling averages across all tokens, so common words (the, is, of) dominate numerically. A single rare but highly relevant term, such as a drug name, legal clause, or error code, carries only 1/N of the weight in the pooled vector, where N is the document length in tokens. The longer the document, the weaker that signal becomes.
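The dilution can be seen in a toy sketch. It assumes plain mean pooling and synthetic 2-D embeddings: one rare token pointing along one axis and identical filler tokens pointing along the other. The helper names are illustrative and no real model is involved.

```python
import numpy as np

RARE = np.array([1.0, 0.0])    # direction of the rare, relevant term (synthetic)
COMMON = np.array([0.0, 1.0])  # direction shared by all filler tokens (synthetic)

def build_doc(n_common: int) -> np.ndarray:
    """Token matrix with one rare token followed by n_common filler tokens."""
    return np.vstack([RARE] + [COMMON] * n_common)

def pooled_similarity(n_common: int) -> float:
    """Cosine similarity between the rare term and the mean-pooled document."""
    pooled = build_doc(n_common).mean(axis=0)
    return float(pooled @ RARE / np.linalg.norm(pooled))

def best_token_similarity(n_common: int) -> float:
    """Best per-token match for the rare term, as late interaction would use."""
    return float((build_doc(n_common) @ RARE).max())

for n_common in (1, 9, 99):
    print(
        f"doc length {n_common + 1:>3}: "
        f"pooled={pooled_similarity(n_common):.3f}  "
        f"best token={best_token_similarity(n_common):.3f}"
    )
```

In this setup the pooled similarity shrinks as filler tokens are added, while the best per-token match is unaffected by document length.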
Word Order Lost
"Not recommended" and "recommended, not unnecessary" pool to nearly identical vectors. Negation, qualification, and sequence — all destroyed. Multi-vector preserves position: each token's embedding captures its local context via BERT's attention, retaining meaning modifiers.
Visual Documents Fail Completely
For PDFs with charts, tables, and infographics, text-based embeddings can't see the information at all — OCR either misses it or produces garbled text. ColPali bypasses OCR entirely: treat each page as an image, produce visual patch embeddings, apply late interaction over patches.
MaxSim: Late Interaction
The scoring function that makes multi-vector retrieval precise — explore it with the interactive heatmap below
The Formula
# ColBERT late interaction score
S(q, d) = Σ_i max_j (Eq[i] · Ed[j]ᵀ)
# where:
#   Eq[i]  = embedding of i-th QUERY token (128-dim)
#   Ed[j]  = embedding of j-th DOC token (128-dim)
#   ·      = dot product (= cosine sim, vectors L2-normalised)
#   max_j  = best-matching doc token for this query token
#   Σ_i    = sum over all query tokens
Intuition
Think of it as a panel of specialist judges. Each query token is one judge asking: "Which document token is most relevant to me specifically?" Each judge picks their own best match (the max). Their individual verdicts are summed. No judge is overruled by the others — a rare technical term gets its own vote, not an averaged one.

Because BERT embeddings are contextualised, "bank" in a finance query finds the finance-sense "bank" in the document, not the river-bank token. Context collapses in pooling; it's preserved in late interaction.
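The scoring rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the ColBERT codebase: the embeddings are random stand-ins for real token vectors, and `l2_normalise` and `maxsim` are names chosen here.

```python
import numpy as np

def l2_normalise(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late interaction score.

    query_emb: (n_query_tokens, dim), doc_emb: (n_doc_tokens, dim),
    both with L2-normalised rows.
    """
    sim = query_emb @ doc_emb.T          # (n_query_tokens, n_doc_tokens)
    best_per_query_token = sim.max(axis=1)  # each "judge" picks its best match
    return float(best_per_query_token.sum())

# Random stand-ins for contextualised token embeddings (assumption)
rng = np.random.default_rng(0)
query = l2_normalise(rng.normal(size=(4, 128)))
doc = l2_normalise(rng.normal(size=(30, 128)))

score = maxsim(query, doc)
self_score = maxsim(query, query)  # every token matches itself perfectly
print(score, self_score)
```

Because rows are unit length, each query token contributes at most 1.0, so the score is bounded by the number of query tokens.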
Interactive MaxSim Heatmap
Each cell shows cosine similarity between a query token (row) and a document token (column). Step through to see how MaxSim selects the best match per row and accumulates the score.
ColBERT Architecture Details
Base model: BERT-base (12 layers, 110M params)
Special tokens: [Q] prepended to queries, [D] to documents
Query padding: Padded with [MASK] tokens to fixed length 32 — a form of soft query expansion
Dimension reduction: Linear projection from BERT's 768 dims to 128 dims, then L2-normalise
Document filtering: Punctuation token embeddings are discarded before indexing
Pre-computation: All document token embeddings computed offline; only query embeddings computed at search time
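The details listed above can be strung together as a simplified sketch. Everything model-related is a stand-in: `fake_bert` returns random vectors instead of real BERT outputs, the projection matrix is random rather than learned, and [CLS]/[SEP] handling is omitted. Only the shapes and the order of steps are meant to be illustrative.

```python
import string
import numpy as np

QUERY_LEN = 32                  # fixed query length after [MASK] padding
BERT_DIM, COLBERT_DIM = 768, 128
PUNCTUATION = set(string.punctuation)

rng = np.random.default_rng(0)
projection = rng.normal(size=(BERT_DIM, COLBERT_DIM))  # stand-in for the learned linear layer

def fake_bert(tokens: list[str]) -> np.ndarray:
    """Placeholder encoder: one random 768-dim vector per token."""
    return rng.normal(size=(len(tokens), BERT_DIM))

def project_and_normalise(hidden: np.ndarray) -> np.ndarray:
    """768 -> 128 linear projection followed by row-wise L2 normalisation."""
    reduced = hidden @ projection
    return reduced / np.linalg.norm(reduced, axis=1, keepdims=True)

def encode_query(tokens: list[str]) -> np.ndarray:
    """Prepend [Q], then pad with [MASK] up to the fixed query length."""
    padded = (["[Q]"] + tokens)[:QUERY_LEN]
    padded += ["[MASK]"] * (QUERY_LEN - len(padded))
    return project_and_normalise(fake_bert(padded))

def encode_document(tokens: list[str]) -> np.ndarray:
    """Prepend [D], encode, then drop punctuation token embeddings."""
    marked = ["[D]"] + tokens
    hidden = fake_bert(marked)
    keep = [i for i, tok in enumerate(marked) if tok not in PUNCTUATION]
    return project_and_normalise(hidden[keep])

# Documents are encoded offline; only the query is encoded at search time.
doc_matrix = encode_document(["a", "bank", "offers", "loans", ".", "rates", "vary", "!"])
query_matrix = encode_query(["bank", "loan", "interest", "rate"])
print(query_matrix.shape, doc_matrix.shape)
```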
The ColBERT Family
Five systems that shaped how multi-vector retrieval is built and deployed
Storage vs. Quality Trade-offs
Multi-vector costs 4–10× more storage than single-vector — the calculator shows you exactly what that means for your deployment
Storage Calculator
Default inputs: 100K documents, 100 avg tokens per doc, 128 embedding dimensions
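A rough version of the calculator's arithmetic, as a sketch under stated assumptions: raw float32 values (4 bytes each), no index or metadata overhead, and a 1536-dim single vector as the baseline, which is a figure picked for comparison rather than something the calculator specifies. The function names are illustrative.

```python
def index_size_bytes(n_docs: int, vectors_per_doc: int, dims: int,
                     bytes_per_value: float = 4.0) -> float:
    """Raw vector storage only: docs x vectors per doc x dims x bytes per value."""
    return n_docs * vectors_per_doc * dims * bytes_per_value

def to_gib(n_bytes: float) -> float:
    return n_bytes / 2**30

# Calculator defaults: 100K docs, 100 tokens per doc, 128 dims per token
multi_vector = index_size_bytes(100_000, 100, 128)
# Assumed baseline: one 1536-dim vector per document
single_vector = index_size_bytes(100_000, 1, 1536)

overhead = multi_vector / single_vector
print(f"multi-vector:  {to_gib(multi_vector):.2f} GiB")
print(f"single-vector: {to_gib(single_vector):.2f} GiB")
print(f"overhead:      {overhead:.1f}x")
```

Changing `bytes_per_value` to a smaller number (for example 2.0 for 16-bit storage) shows how much of the gap precision alone can close before any residual compression is applied.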
MS-MARCO Benchmark — Quality vs. Storage
System | MRR@10 | MS-MARCO index size | Latency (GPU) | Notes
BM25 | 18.7 | ~3.5 GiB | 62 ms | Lexical only; no semantics
DPR (dense) | ~31 | ~25 GiB | <1 ms | Single vector, 768 dims
BERT-large reranker | 36.5 | ~25 GiB | 32,900 ms | Expensive cross-encoder
ColBERT v1 | 36.0 | 154 GiB | 458 ms | Matches BERT-large at 170× lower latency
ColBERT v2 | 39.7 | 16 GiB | ~140 ms | 10× storage reduction via residual compression
PLAID ColBERTv2 | 39.8 | 21.6 GiB | 11–38 ms | 3-stage filtering; 7× faster than vanilla v2
ColBERTv2 Compression
v1 stored each token vector at 16-bit precision (128 dims × 2 bytes = 256 bytes). v2 uses residual compression: cluster all token vectors via k-means, then store a centroid ID plus a 1–2 bit quantised residual per dimension. Result: 20–36 bytes per vector vs 256 bytes, roughly 10× compression with near-zero quality loss (MRR 39.7 vs 39.7 vanilla).
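The byte counts and the centroid-plus-residual idea can be sketched as follows. This is a toy illustration, not the ColBERTv2 implementation: the 4-byte centroid ID is an assumption, centroids are sampled from the data instead of being fitted with k-means, and the bucket boundaries are simple quantiles of the residuals.

```python
import numpy as np

def bytes_per_vector(dims: int = 128, bits_per_dim: int = 1,
                     centroid_id_bytes: int = 4) -> int:
    """Centroid ID plus a packed b-bit code for every dimension."""
    return centroid_id_bytes + (dims * bits_per_dim) // 8

def compress(vectors: np.ndarray, centroids: np.ndarray, bits: int = 2):
    """Return (centroid ids, per-dimension bucket codes, bucket values)."""
    ids = np.argmax(vectors @ centroids.T, axis=1)  # nearest centroid (unit vectors)
    residuals = vectors - centroids[ids]
    n_buckets = 2 ** bits
    # Bucket boundaries and representative values taken from residual quantiles
    edges = np.quantile(residuals, np.arange(1, n_buckets) / n_buckets)
    values = np.quantile(residuals, (np.arange(n_buckets) + 0.5) / n_buckets)
    codes = np.searchsorted(edges, residuals)       # integers in [0, n_buckets)
    return ids, codes, values

def decompress(ids, codes, values, centroids) -> np.ndarray:
    """Approximate vector = centroid + dequantised residual."""
    return centroids[ids] + values[codes]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 128))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
centroids = vectors[rng.choice(len(vectors), size=16, replace=False)]  # k-means stand-in

ids, codes, values = compress(vectors, centroids, bits=2)
approx = decompress(ids, codes, values, centroids)

err_centroid_only = np.linalg.norm(vectors - centroids[ids], axis=1).mean()
err_with_residual = np.linalg.norm(vectors - approx, axis=1).mean()
print(bytes_per_vector(bits_per_dim=1), bytes_per_vector(bits_per_dim=2))
print(err_centroid_only, err_with_residual)
```

With 128 dims, 1-bit and 2-bit codes account for the 20-byte and 36-byte figures quoted above; the quantised residual is what recovers accuracy beyond the centroid alone.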
MRL: Orthogonal Compression
Matryoshka Representation Learning trains vectors so their first-D dimensions are always a valid representation. OpenAI text-embedding-3-large (3072 dims) truncated to 256 dims still outperforms ada-002 (1536 dims). MRL reduces vector width; multi-vector increases vector count. Jina-ColBERT-v2 applies both simultaneously: 128→64 dims with only 1.5% quality loss.
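Mechanically, MRL-style truncation of a token matrix is just a slice followed by re-normalisation. The sketch below uses random vectors, so it only demonstrates shapes and storage; keeping quality after truncation depends on the model having been trained with a Matryoshka objective. `truncate_mrl` is a name chosen here.

```python
import numpy as np

def truncate_mrl(token_embs: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions of each token vector and re-normalise."""
    cut = token_embs[:, :dims]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

rng = np.random.default_rng(0)
full = rng.normal(size=(42, 128))                 # stand-in for a (N_tokens, 128) matrix
full /= np.linalg.norm(full, axis=1, keepdims=True)

half = truncate_mrl(full, 64)                     # 128 -> 64 dims per token
saving = 1 - half.size / full.size
print(full.shape, half.shape, f"{saving:.0%} fewer stored values")
```

The number of token vectors is unchanged; only their width shrinks, which is why this combines cleanly with multi-vector storage.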
Weaviate + Jina-ColBERT-v2
Production multi-vector retrieval with Weaviate v1.29 — named vectors, automatic embedding via JinaAI API, variable-length token matrices
1. Schema: Named Multi-Vector
# Weaviate v1.29+ (assumes a Python client recent enough to expose Configure.MultiVectors)
import weaviate
from weaviate.classes.config import Configure

client = weaviate.connect_to_weaviate_cloud(...)

client.collections.create(
    name="Documents",
    vector_config=[
        # One named multi-vector, embedded automatically by the JinaAI integration
        Configure.MultiVectors.text2vec_jinaai(
            name="multi_vector",
            source_properties=["text"],
            model="jina-colbert-v2"
        )
    ]
)
2. Inspect Embedding Shape
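# Fetch one object together with its vectors
collection = client.collections.get("Documents")
response = collection.query.fetch_objects(limit=1, include_vector=True)

# Single-vector → flat list
# Multi-vector → list of lists
v = response.objects[0].vector['multi_vector']
type(v)      # list
type(v[0])   # list ← token vector
len(v)       # 22–30 ← varies by doc length
len(v[0])    # 128 ← fixed dim per token

# Shape: (N_tokens, 128)
# e.g. "Hello world" → (4, 128)
#      "A long paragraph..." → (42, 128)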
3. Semantic Search (near_text)
from weaviate.classes.query import MetadataQuery

collection = client.collections.get("Documents")

# The query text is embedded into a token matrix and scored against the named multi-vector
results = collection.query.near_text(
    query="bank loan interest rate",
    target_vector="multi_vector",
    limit=5,
    return_metadata=MetadataQuery(distance=True)
)

for obj in results.objects:
    print(obj.properties["text"], obj.metadata.distance)
4. Self-Provided Embeddings
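# Bring your own ColBERT vectors: declare the named vector as self-provided
# (this goes in the collection's vector configuration at creation time)
Configure.MultiVectors.self_provided(
    name="multi_vector"
)

# Insert with custom embeddings
collection = client.collections.get("Documents")
with collection.batch.dynamic() as batch:
    for doc in docs:  # docs: your own iterable of {"text": ...} records
        batch.add_object(
            properties={"text": doc["text"]},
            vector={
                "multi_vector": get_colbert_embedding(
                    doc["text"]  # returns (N, 128) matrix
                )
            }
        )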
Jina-ColBERT-v2
128 dims per token • 8,192-token context • 89+ languages • MRL-compatible: can truncate to 64 dims (1.5% quality loss, 50% storage saving) • 52.1 avg nDCG@10 on 14 BEIR benchmarks
Storage Reality (Weaviate)
Single-vector, 1536 dims: ~6 KB/object • Multi-vector, 64 vectors × 96 dims: ~25 KB/object • ~4× overhead • For 1M docs: ~6 GB → ~25 GB. PLAID + residual compression substantially narrows this gap.
Query Types
near_text — automatic embedding via JinaAI API • near_vector — provide your own embedding matrix • hybrid — combine BM25 + multi-vector MaxSim scoring
Practice Exercises
Three browser exercises + one live lab to build intuition for multi-vector retrieval
1  MaxSim Puzzle — Identify the Best Matches
Below is a similarity matrix for query "null pointer exception fix" against a code document. Each cell is the cosine similarity (0–100) between a query token and a document token. Click the cell in each row that you think is the MaxSim winner (highest in that row), then check your score.
2  System Matcher — Choose the Right Embedding Approach
For each retrieval scenario, select the best embedding approach. One correct answer per scenario.
3  Compression Trade-off Quiz
Test your understanding of the ColBERTv2 compression and MRL techniques.
  Live Lab — MaxSim Token Walkthrough
Enter a query and a passage. gpt-4o-mini simulates ColBERT's MaxSim — it identifies which passage tokens are the best match for each query token, explains the context-sensitive matching, and shows how the score would accumulate. Educational (no actual embeddings computed). Optional OpenAI key (~$0.001/run).