Post 56
Representations
ColBERT · Weaviate · 2024
Late Interaction
Multi-Vector Embeddings
Beyond the Single Point
A single embedding vector must compress an entire document into one point in space. For a sentence, that's fine. For a 500-word technical paper — or a PDF full of charts — that compression destroys the exact details retrieval depends on. Multi-vector embeddings keep a separate contextualised embedding for every token. Documents become constellations, not points. Matching becomes precise, not averaged.
170×
Lower latency vs BERT reranker
+12.5
nDCG@10 on long docs (M3)
10×
Index compression (v1→v2)
128
Dims per token (ColBERT)
52.1
BEIR avg nDCG (Jina-ColBERT-v2)
The Core Shift
Single-vector: one point represents the entire document. Multi-vector (ColBERT): every token gets its own 128-dim contextualised vector. The document becomes a matrix — shape (N_tokens × 128). Retrieval compares each query token against every document token and takes the best match per query token.
Why It Matters Now
ColBERT's MaxSim scoring approaches BERT reranker quality at up to 170× lower latency. ColBERTv2 compresses the index 10× via residual quantization. ColPali extends this to visual documents: skip OCR entirely and retrieve from raw PDF page images. Weaviate v1.29 ships native multi-vector support.
The Pooling Problem
Why averaging token embeddings into a single vector loses the details that retrieval depends on
✖ Single-Vector Failure Mode
Query: "bank loan interest rate"
Document A: "The river bank by the old mill floods in spring. Loan of a fishing rod costs $5 a day."
Document B: "A financial institution offering personal loans at a competitive rate."
Pooling collapses both documents to similar centroid embeddings. Document A scores high on "bank" and "loan", but in the wrong senses. Single-vector retrieval may rank it ahead of B.
✔ Multi-Vector Solution
BERT produces contextualised embeddings — "bank" in a fishing context and "bank" in a finance context occupy different positions in the 768-dim space. ColBERT preserves all token embeddings. When "bank" from the query finds its best match, it matches the financial-institution token, not the river-bank token. No compression = no context collapse.
Why Pooling Destroys Rare-Term Signals
The Centroid Effect
Pooling averages across all tokens. Common words (the, is, of) dominate numerically. A single rare but highly relevant term — a drug name, legal clause, error code — contributes less than 1/N of the final vector, where N is document length. The longer the document, the worse this gets.
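The dilution can be seen in a toy setting. The sketch below assumes orthogonal unit vectors stand in for token embeddings (real BERT embeddings are not orthogonal, so the numbers are illustrative only), and `pooled_vs_maxsim` is a hypothetical helper name. One rare token sits among N−1 identical common tokens, and the query asks for the rare token.

```python
import numpy as np

def pooled_vs_maxsim(n_tokens: int) -> tuple[float, float]:
    """Compare how well one rare token is matched under mean pooling vs. MaxSim.

    Toy assumption: every common token is the unit vector e0 and the single
    rare token is the orthogonal unit vector e1.
    """
    common = np.array([1.0, 0.0])
    rare = np.array([0.0, 1.0])
    doc = np.vstack([common] * (n_tokens - 1) + [rare])  # shape (n_tokens, 2)

    query = rare  # the query consists of the rare term only

    # Single-vector: mean-pool the document, then cosine similarity with the query.
    pooled = doc.mean(axis=0)
    pooled_score = float(pooled @ query / np.linalg.norm(pooled))

    # Multi-vector: the query token picks its best-matching document token.
    maxsim_score = float((doc @ query).max())
    return pooled_score, maxsim_score

for n in (10, 100, 1000):
    pooled, best = pooled_vs_maxsim(n)
    print(f"N={n:5d}  pooled cosine={pooled:.4f}  best token match={best:.1f}")
```

In this setup the pooled similarity shrinks as the document grows, while the per-token match stays at 1.0 regardless of length.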
Word Order Lost
"Not recommended" and "recommended, not unnecessary" pool to nearly identical vectors. Negation, qualification, and sequence — all destroyed. Multi-vector preserves position: each token's embedding captures its local context via BERT's attention, retaining meaning modifiers.
Visual Documents Fail Completely
For PDFs with charts, tables, and infographics, text-based embeddings can't see the information at all — OCR either misses it or produces garbled text. ColPali bypasses OCR entirely: treat each page as an image, produce visual patch embeddings, apply late interaction over patches.
MaxSim: Late Interaction
The scoring function that makes multi-vector retrieval precise — explore it with the interactive heatmap below
The Formula
# ColBERT late interaction score
S(q, d) = Σᵢ maxⱼ (Eq[i] · Ed[j]ᵀ)
# where:
# Eq[i] = embedding of i-th QUERY token (128-dim)
# Ed[j] = embedding of j-th DOC token (128-dim)
# · = dot product (= cosine sim, vectors L2-normalised)
# max_j = best-matching doc token for this query token
# Σ_i = sum over all query tokens
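A minimal NumPy sketch of the formula above, assuming both matrices are already L2-normalised so that the dot product equals cosine similarity. The random matrices are stand-ins for real ColBERT token embeddings, and the helper names `l2_normalise` and `maxsim` are illustrative.

```python
import numpy as np

def l2_normalise(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def maxsim(Eq: np.ndarray, Ed: np.ndarray) -> float:
    """Late interaction score: S(q, d) = sum_i max_j Eq[i] . Ed[j]."""
    sim = Eq @ Ed.T                       # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())   # best doc token per query token, summed

# Random stand-ins for token embeddings (assumption: 4 query tokens, 30 doc tokens).
rng = np.random.default_rng(0)
Eq = l2_normalise(rng.normal(size=(4, 128)))
Ed = l2_normalise(rng.normal(size=(30, 128)))

score = maxsim(Eq, Ed)
print(f"MaxSim score: {score:.3f} (upper bound = {Eq.shape[0]} query tokens)")
```

Because every row is unit length, each query token contributes at most 1, so the score is bounded by the number of query tokens.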
Intuition
Think of it as a panel of specialist judges. Each query token is one judge asking: "Which document token is most relevant to me specifically?" Each judge picks their own best match (the max). Their individual verdicts are summed. No judge is overruled by the others: a rare technical term gets its own vote, not an averaged one.
Because BERT embeddings are contextualised, "bank" in a finance query finds the finance-sense "bank" in the document, not the river-bank token. Context collapses in pooling; it's preserved in late interaction.
Interactive MaxSim Heatmap
Each cell shows cosine similarity between a query token (row) and a document token (column). Step through to see how MaxSim selects the best match per row and accumulates the score.
ColBERT Architecture Details
• Base model: BERT-base (12 layers, 110M params)
• Special tokens: [Q] prepended to queries, [D] to documents
• Query padding: Padded with [MASK] tokens to fixed length 32, a form of soft query expansion
• Dimension reduction: Linear projection BERT's 768 dims → 128 dims, then L2-normalise
• Document filtering: Punctuation token embeddings are discarded before indexing
• Pre-computation: All document token embeddings computed offline; only query embeddings computed at search time
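The steps in this list can be sketched without a real BERT model. This is a simplified illustration, not ColBERT's actual code: the hidden states and the projection matrix `W` are random stand-ins, the helper names are hypothetical, and BERT's own [CLS]/[SEP] tokens are omitted for brevity.

```python
import string
import numpy as np

QUERY_MAXLEN = 32                      # fixed query length from the list above
PUNCTUATION = set(string.punctuation)  # single-character punctuation tokens

def pad_query(tokens: list[str]) -> list[str]:
    """Prepend [Q], then pad with [MASK] up to QUERY_MAXLEN tokens."""
    padded = (["[Q]"] + tokens)[:QUERY_MAXLEN]
    return padded + ["[MASK]"] * (QUERY_MAXLEN - len(padded))

def project(hidden: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Linear projection 768 -> 128, then L2-normalise each token vector."""
    out = hidden @ W                                   # (n_tokens, 128)
    return out / np.linalg.norm(out, axis=1, keepdims=True)

def drop_punctuation(tokens: list[str], embs: np.ndarray):
    """Discard embeddings of punctuation tokens before indexing."""
    keep = [i for i, t in enumerate(tokens) if t not in PUNCTUATION]
    return [tokens[i] for i in keep], embs[keep]

rng = np.random.default_rng(0)
W = rng.normal(size=(768, 128))                        # stand-in for the learned projection
doc_tokens = ["[D]", "loans", ",", "rates", "."]
doc_hidden = rng.normal(size=(len(doc_tokens), 768))   # stand-in for BERT outputs

kept_tokens, kept = drop_punctuation(doc_tokens, project(doc_hidden, W))
print(kept_tokens, kept.shape)
print(pad_query(["bank", "loan"])[:5])
```

Document embeddings produced this way would be computed once offline; at search time only the padded query goes through the encoder.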
The ColBERT Family
Five systems that shaped how multi-vector retrieval is built and deployed
Storage vs. Quality Trade-offs
Multi-vector costs 4–10× more storage than single-vector
MS-MARCO Benchmark — Quality vs. Storage
| System | MRR@10 | MS-MARCO Index | Latency (GPU) | Notes |
|---|---|---|---|---|
| BM25 | 18.7 | ~3.5 GiB | 62 ms | Lexical only; no semantics |
| DPR (dense) | ~31 | ~25 GiB | <1 ms | Single vector, 768 dims |
| BERT-large reranker | 36.5 | ~25 GiB | 32,900 ms | Expensive cross-encoder |
| ColBERT v1 | 36.0 | 154 GiB | 458 ms | Near BERT-large quality; ~72× lower end-to-end latency by these figures |
| ColBERT v2 | 39.7 | 16 GiB | ~140 ms | 10× storage reduction via residual compression |
| PLAID ColBERTv2 | 39.8 | 21.6 GiB | 11–38 ms | 3-stage filtering; 7× faster than vanilla v2 |
ColBERTv2 Compression
v1 stored each 128-dim token vector at 16-bit precision (256 bytes). v2 uses residual compression: cluster all token vectors via k-means, then store a centroid ID plus a 1–2 bit quantised residual per dimension. Result: 20–36 bytes per vector vs 256 bytes, roughly 10× compression with near-zero quality loss (MRR 39.7 vs 39.7 vanilla).
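The 20–36 byte range follows from simple arithmetic. The sketch below assumes a 4-byte centroid ID, chosen because it reproduces the figures quoted above; `bytes_per_vector` is a hypothetical helper, not part of any ColBERT library.

```python
UNCOMPRESSED_BYTES = 256  # per-vector size quoted above for ColBERT v1

def bytes_per_vector(dim: int = 128, residual_bits: int = 1,
                     centroid_id_bytes: int = 4) -> int:
    """Centroid ID plus `residual_bits` bits for each of `dim` dimensions."""
    return centroid_id_bytes + dim * residual_bits // 8

for bits in (1, 2):
    size = bytes_per_vector(residual_bits=bits)
    ratio = UNCOMPRESSED_BYTES / size
    print(f"{bits}-bit residual: {size} bytes per vector, {ratio:.1f}x smaller")
```

The two settings bracket the quoted ~10× figure: the 1-bit variant compresses more, the 2-bit variant less.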
MRL: Orthogonal Compression
Matryoshka Representation Learning trains vectors so their first-D dimensions are always a valid representation. OpenAI text-embedding-3-large (3072 dims) truncated to 256 dims still outperforms ada-002 (1536 dims). MRL reduces vector width; multi-vector increases vector count. Jina-ColBERT-v2 applies both simultaneously: 128→64 dims with only 1.5% quality loss.
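Mechanically, MRL truncation is a slice followed by re-normalisation. The sketch below shows that step on a token matrix; the random input only demonstrates shapes, since truncation preserves quality only for models trained with MRL. `truncate_mrl` is a hypothetical helper name.

```python
import numpy as np

def truncate_mrl(embs: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions of each vector, then re-normalise."""
    cut = embs[..., :dims]
    return cut / np.linalg.norm(cut, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
token_embs = rng.normal(size=(10, 128))   # stand-in for a (N_tokens, 128) matrix
short = truncate_mrl(token_embs, 64)      # (N_tokens, 64): half the storage

print(token_embs.shape, "->", short.shape)
```

Re-normalising keeps dot products interpretable as cosine similarities, so the MaxSim score still works on the shortened vectors.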
Weaviate + Jina-ColBERT-v2
Production multi-vector retrieval with Weaviate v1.29 — named vectors, automatic embedding via JinaAI API, variable-length token matrices
1. Schema: Named Multi-Vector
# Weaviate v1.29+ (assumes a recent Python client that provides
# `vector_config` and `Configure.MultiVectors`)
import weaviate
from weaviate.classes.config import Configure

client = weaviate.connect_to_weaviate_cloud(...)

client.collections.create(
    name="Documents",
    vector_config=[
        # One named multi-vector, embedded by the JinaAI ColBERT integration
        Configure.MultiVectors.text2vec_jinaai(
            name="multi_vector",
            source_properties=["text"],
            model="jina-colbert-v2",
        )
    ],
)
2. Inspect Embedding Shape
# Single-vector → flat list
# Multi-vector → list of lists
collection = client.collections.get("Documents")
response = collection.query.fetch_objects(limit=1, include_vector=True)
v = response.objects[0].vector['multi_vector']
type(v)     # list
type(v[0])  # list ← token vector
len(v)      # 22–30 ← varies by doc length
len(v[0])   # 128 ← fixed dim per token
# Shape: (N_tokens, 128)
# e.g. "Hello world" → (4, 128)
# "A long paragraph..." → (42, 128)
3. Semantic Search (near_text)
from weaviate.classes.query import MetadataQuery

collection = client.collections.get("Documents")
results = collection.query.near_text(
    query="bank loan interest rate",
    target_vector="multi_vector",
    limit=5,
    return_metadata=MetadataQuery(distance=True),
)
for obj in results.objects:
    print(obj.properties["text"], obj.metadata.distance)
4. Self-Provided Embeddings
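# Bring your own ColBERT vectors: declare the named vector in the
# collection's vector config
Configure.MultiVectors.self_provided(
    name="multi_vector"
)

# Insert with custom embeddings; `docs` and `get_colbert_embedding`
# are your own data and encoder
with collection.batch.dynamic() as batch:
    for doc in docs:
        batch.add_object(
            properties={"text": doc["text"]},
            vector={
                "multi_vector": get_colbert_embedding(
                    doc["text"]  # returns (N, 128) matrix
                )
            },
        )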
Jina-ColBERT-v2
128 dims per token • 8,192-token context • 89+ languages • MRL-compatible: can truncate to 64 dims (1.5% quality loss, 50% storage saving) • 52.1 avg nDCG@10 on 14 BEIR benchmarks
Storage Reality (Weaviate)
Single-vector, 1536 dims: ~6 KB/object • Multi-vector, 64 vectors × 96 dims: ~25 KB/object • ~4× overhead • For 1M docs: ~6 GB → ~25 GB. PLAID + residual compression substantially narrows this gap.
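These figures can be reproduced with a few lines of arithmetic. The sketch assumes uncompressed float32 values (4 bytes each) and ignores index and metadata overhead; `vector_bytes` is a hypothetical helper.

```python
def vector_bytes(n_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for one object's vectors, assuming float32 by default."""
    return n_vectors * dims * bytes_per_value

single = vector_bytes(1, 1536)   # one 1536-dim vector
multi = vector_bytes(64, 96)     # 64 token vectors of 96 dims each

n_docs = 1_000_000
print(f"single-vector: {single} B/object, {single * n_docs / 1e9:.1f} GB per 1M docs")
print(f"multi-vector:  {multi} B/object, {multi * n_docs / 1e9:.1f} GB per 1M docs")
print(f"overhead: {multi / single:.1f}x")
```

Changing `n_vectors`, `dims`, or `bytes_per_value` shows how document length, MRL truncation, and quantisation each move the total.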
Query Types
near_text: automatic embedding via JinaAI API • near_vector: provide your own embedding matrix • hybrid: combine BM25 + multi-vector MaxSim scoring
Practice Exercises
Three browser exercises + one live lab to build intuition for multi-vector retrieval
1 MaxSim Puzzle — Identify the Best Matches
Below is a similarity matrix for the query "null pointer exception fix" against a code document. Each cell is the cosine similarity (0–100) between a query token and a document token. Click the cell in each row that you think is the MaxSim winner (highest in that row), then check your score.
2 System Matcher — Choose the Right Embedding Approach
For each retrieval scenario, select the best embedding approach. One correct answer per scenario.
3 Compression Trade-off Quiz
Test your understanding of the ColBERTv2 compression and MRL techniques.
★ Live Lab — MaxSim Token Walkthrough
Enter a query and a passage. gpt-4o-mini simulates ColBERT's MaxSim — it identifies which passage tokens are the best match for each query token, explains the context-sensitive matching, and shows how the score would accumulate. Educational (no actual embeddings computed). Optional OpenAI key (~$0.001/run).
Related Posts
Build the complete representation and retrieval picture
Post 08 — Embeddings & Similarity
Single-vector embeddings from first principles — word2vec, BERT pooling, cosine similarity, and why dot product search works. The foundation that Post 56 extends to multi-vector.
Post 21 — RAG
Retrieval-Augmented Generation: vector databases, chunking strategies, and how retrieved context improves generation quality. Multi-vector embeddings directly upgrade the retrieval step in any RAG pipeline.
Post 55 — Securing MCP
MCP agents with access to vector databases are a key attack surface — understanding multi-vector retrieval helps scope what data an agent can reach via semantic search and why DLP scanning of retrieved content matters.