Multi-Vector Embeddings — Beyond the Single Point

Post 56 Representations ColBERT · Weaviate · 2024 Late Interaction


A single embedding vector must compress an entire document into one point in space. For a sentence, that's fine. For a 500-word technical paper — or a PDF full of charts — that compression destroys the exact details retrieval depends on. Multi-vector embeddings keep a separate contextualised embedding for every token. Documents become constellations, not points. Matching becomes precise, not averaged.

170× lower latency vs BERT reranker
+12.5 nDCG@10 on long docs (M3)
10× index compression (v1→v2)
128 dims per token (ColBERT)
52.1 BEIR avg nDCG (Jina-ColBERT-v2)

The Core Shift

Single-vector: one point represents the entire document. Multi-vector (ColBERT): every token gets its own 128-dim contextualised vector. The document becomes a matrix — shape (N_tokens × 128). Retrieval compares each query token against every document token and takes the best match per query token.

Why It Matters Now

ColBERT's MaxSim scoring comes close to BERT-large reranker quality (36.0 vs 36.5 MRR@10 on MS-MARCO) at a fraction of the latency; the headline figure is 170× lower. ColBERTv2 compresses the index roughly 10× via residual compression. ColPali extends the approach to visual documents: skip OCR entirely and retrieve from raw PDF page images. Weaviate v1.29 ships native multi-vector support.

The Pooling Problem

Why averaging token embeddings into a single vector loses the details that retrieval depends on

✖ Single-Vector Failure Mode

Query: "bank loan interest rate"
Document A: "The river bank by the old mill floods in spring. Loan of a fishing rod costs $5 a day."
Document B: "A financial institution offering personal loans at a competitive rate."

Pooling collapses both documents to similar centroid embeddings. Document A scores high on "bank" and "loan" — wrong senses. Single-vector retrieval may rank it ahead of B.

✔ Multi-Vector Solution

BERT produces contextualised embeddings: "bank" in a fishing context and "bank" in a finance context occupy different positions in the 768-dim space. ColBERT keeps every token embedding (projected to 128 dims) instead of averaging them. When "bank" from the query looks for its best match, it lands on the financial-institution token, not the river-bank token. No pooling, no context collapse.

Why Pooling Destroys Rare-Term Signals

The Centroid Effect

Mean pooling averages across all tokens, so common words (the, is, of) dominate numerically. A single rare but highly relevant term (a drug name, legal clause, or error code) contributes only 1/N of the final vector, where N is the document length in tokens. The longer the document, the worse this gets.
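The dilution can be checked with a toy example. This is a minimal sketch that assumes plain mean pooling and made-up 2-dim token vectors; the vectors and the helper names are illustrative, not values from any real model.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    n = len(token_vectors)
    dims = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dims)]

# Hypothetical toy vectors: common words point along axis 0,
# the single rare term points along axis 1.
COMMON = [1.0, 0.0]
RARE = [0.0, 1.0]

def rare_term_weight(doc_length):
    """Size of the rare term's component in the pooled vector."""
    tokens = [COMMON] * (doc_length - 1) + [RARE]
    return mean_pool(tokens)[1]

short_doc = rare_term_weight(10)   # rare term is 1 of 10 tokens
long_doc = rare_term_weight(500)   # rare term is 1 of 500 tokens
print(short_doc, long_doc)
```

The rare term's component shrinks in proportion to document length, which is the effect multi-vector representations avoid by never averaging.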

Word Order Lost

"Not recommended" and "recommended, not unnecessary" pool to nearly identical vectors. Negation, qualification, and sequence — all destroyed. Multi-vector preserves position: each token's embedding captures its local context via BERT's attention, retaining meaning modifiers.

Visual Documents Fail Completely

For PDFs with charts, tables, and infographics, text-based embeddings can't see the information at all — OCR either misses it or produces garbled text. ColPali bypasses OCR entirely: treat each page as an image, produce visual patch embeddings, apply late interaction over patches.

MaxSim: Late Interaction

The scoring function that makes multi-vector retrieval precise

The Formula

# ColBERT late interaction score
S(q, d) = Σᵢ  maxⱼ  (Eq[i] · Ed[j]ᵀ)

# where:
# Eq[i] = embedding of i-th QUERY token (128-dim)
# Ed[j] = embedding of j-th DOC token (128-dim)
# · = dot product (= cosine sim, vectors L2-normalised)
# max_j = best-matching doc token for this query token
# Σ_i = sum over all query tokens
      

Intuition

Think of it as a panel of specialist judges. Each query token is one judge asking: "Which document token is most relevant to me specifically?" Each judge picks their own best match (the max). Their individual verdicts are summed. No judge is overruled by the others — a rare technical term gets its own vote, not an averaged one.

Because BERT embeddings are contextualised, "bank" in a finance query finds the finance-sense "bank" in the document, not the river-bank token. Context collapses in pooling; it's preserved in late interaction.
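The scoring rule is short enough to write out directly. The sketch below uses pure Python and hand-made 2-dim vectors in place of real 128-dim ColBERT embeddings; the function names and toy data are assumptions for illustration only.

```python
import math

def l2_normalise(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best dot product against any doc token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy data: two query tokens pointing in different directions.
query = [l2_normalise(v) for v in [[1.0, 0.0], [0.0, 1.0]]]
# doc_a has a dedicated match for each query token.
doc_a = [l2_normalise(v) for v in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]]
# doc_b has only one "averaged" token sitting between the two.
doc_b = [l2_normalise(v) for v in [[1.0, 1.0]]]

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
print(score_a, score_b)
```

Each query token picks its own best match, so a document with a precise match for every query token (`doc_a`) outscores one whose tokens are only roughly similar to all of them (`doc_b`).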

[Interactive MaxSim heatmap: rows are query tokens, columns are document tokens, and each cell is their cosine similarity. MaxSim takes the highest cell in each row and sums those values into the total score.]

ColBERT Architecture Details

• Base model: BERT-base (12 layers, 110M params)
• Special tokens: [Q] prepended to queries, [D] to documents
• Query padding: Padded with [MASK] tokens to fixed length 32 — a form of soft query expansion
• Dimension reduction: Linear projection BERT's 768 dims → 128 dims, then L2-normalise
• Document filtering: Punctuation token embeddings are discarded before indexing
• Pre-computation: All document token embeddings computed offline; only query embeddings computed at search time
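The preprocessing steps in this list can be sketched in a few functions. This is a simplified illustration, not ColBERT's actual code: it works on already-tokenised strings, uses a toy 4→2 projection in place of the 768→128 one, and all names and values are hypothetical.

```python
import math
import string

QUERY_LEN = 32  # fixed query length from the list above

def pad_query(tokens, length=QUERY_LEN):
    """Prepend [Q], truncate if needed, then pad with [MASK] to a fixed length."""
    tokens = (["[Q]"] + list(tokens))[:length]
    return tokens + ["[MASK]"] * (length - len(tokens))

def drop_punctuation(tokens, vectors):
    """Discard embeddings whose token is pure punctuation (document side only)."""
    kept = [
        (tok, vec)
        for tok, vec in zip(tokens, vectors)
        if not all(ch in string.punctuation for ch in tok)
    ]
    return [tok for tok, _ in kept], [vec for _, vec in kept]

def project_and_normalise(vec, weights):
    """Apply a linear projection (one output dim per row of weights), then L2-normalise."""
    projected = [sum(w * x for w, x in zip(row, vec)) for row in weights]
    norm = math.sqrt(sum(p * p for p in projected))
    return [p / norm for p in projected]

padded = pad_query(["bank", "loan", "interest", "rate"])

doc_tokens = ["[D]", "personal", "loans", ",", "competitive", "rate", "."]
# Stand-in 4-dim "encoder outputs"; a real model would produce 768 dims.
doc_vectors = [[float(i), 1.0, 0.0, 2.0] for i in range(len(doc_tokens))]
kept_tokens, kept_vectors = drop_punctuation(doc_tokens, doc_vectors)

toy_weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 1.0]]  # toy 4 -> 2 projection
doc_matrix = [project_and_normalise(v, toy_weights) for v in kept_vectors]
print(len(padded), kept_tokens, len(doc_matrix))
```

Only the document side runs offline; at search time the same projection and normalisation are applied to the padded query.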

The ColBERT Family

Five systems that shaped how multi-vector retrieval is built and deployed

Storage vs. Quality Trade-offs

Multi-vector costs 4–10× more storage than single-vector; the exact overhead depends on document count, tokens per document, and embedding width

Storage Calculator

Default inputs: 100K documents • 100 avg tokens per doc • 128 embedding dimensions
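The same arithmetic can be reproduced offline. The sketch below multiplies out the three inputs above for raw, uncompressed vectors; the bytes-per-value choices (2 for 16-bit floats, 4 for 32-bit) and the 768-dim single-vector baseline are assumptions, and index overhead is ignored.

```python
def multi_vector_bytes(num_docs, tokens_per_doc, dims, bytes_per_value):
    """Raw size of one vector per token, with no compression or index overhead."""
    return num_docs * tokens_per_doc * dims * bytes_per_value

def single_vector_bytes(num_docs, dims, bytes_per_value):
    """Raw size of one vector per document."""
    return num_docs * dims * bytes_per_value

NUM_DOCS = 100_000
TOKENS_PER_DOC = 100
DIMS = 128

multi_fp16 = multi_vector_bytes(NUM_DOCS, TOKENS_PER_DOC, DIMS, bytes_per_value=2)
multi_fp32 = multi_vector_bytes(NUM_DOCS, TOKENS_PER_DOC, DIMS, bytes_per_value=4)
single_fp32 = single_vector_bytes(NUM_DOCS, dims=768, bytes_per_value=4)

for label, size in [
    ("multi-vector fp16", multi_fp16),
    ("multi-vector fp32", multi_fp32),
    ("single-vector 768-dim fp32", single_fp32),
]:
    print(f"{label}: {size / 1e9:.2f} GB")
```

Tokens per document is the dominant factor: doubling average document length doubles the multi-vector index while leaving the single-vector index unchanged.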

MS-MARCO Benchmark — Quality vs. Storage

System	MRR@10	MS-MARCO Index	Latency (GPU)	Notes
BM25	18.7	~3.5 GiB	62 ms	Lexical only; no semantics
DPR (dense)	~31	~25 GiB	<1 ms	Single vector, 768 dims
BERT-large reranker	36.5	~25 GiB	32,900 ms	Expensive cross-encoder
ColBERT v1	36.0	154 GiB	458 ms	Within 0.5 MRR@10 of BERT-large; about 72× lower latency on these figures (32,900 ms vs 458 ms)
ColBERT v2	39.7	16 GiB	~140 ms	10× storage reduction via residual compression
PLAID ColBERTv2	39.8	21.6 GiB	11–38 ms	3-stage filtering; 7× faster than vanilla v2

ColBERTv2 Compression

v1 stored each 128-dim token vector at 2 bytes per dimension (256 bytes). v2 uses residual compression: cluster all token vectors via k-means, then store a centroid ID plus a 1–2 bit quantised residual per dimension. Result: 20–36 bytes per vector vs 256 bytes, roughly 10× compression with near-zero quality loss (MRR@10 stays at 39.7).
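The 20–36 byte figure follows from simple arithmetic, and the encode/decode idea fits in a few lines. The sketch below is a simplification: it assumes a 4-byte centroid ID, uses a fixed step with a sign bit in place of ColBERTv2's learned quantisation buckets, and takes hand-picked centroids instead of running k-means.

```python
def compressed_bytes(dims, bits_per_dim, centroid_id_bytes=4):
    """Bytes per vector: centroid ID plus a few bits of residual per dimension."""
    return centroid_id_bytes + dims * bits_per_dim // 8

def nearest_centroid(vec, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    def sq_dist(c):
        return sum((v - x) ** 2 for v, x in zip(vec, c))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))

def encode(vec, centroids):
    """Return (centroid ID, one sign bit per dimension of the residual)."""
    cid = nearest_centroid(vec, centroids)
    residual = [v - c for v, c in zip(vec, centroids[cid])]
    return cid, [1 if r >= 0 else 0 for r in residual]

def decode(cid, bits, centroids, step):
    """Approximate the vector as centroid +/- a fixed step per dimension."""
    return [c + (step if b else -step) for c, b in zip(centroids[cid], bits)]

one_bit = compressed_bytes(128, 1)
two_bit = compressed_bytes(128, 2)

centroids = [[1.0, 0.0], [0.0, 1.0]]  # toy centroids, not learned
cid, bits = encode([0.9, 0.2], centroids)
approx = decode(cid, bits, centroids, step=0.1)
print(one_bit, two_bit, cid, bits, approx)
```

Most of the information lives in the centroid; the residual bits only nudge the reconstruction toward the original vector, which is why so few bits per dimension are enough.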

MRL: Orthogonal Compression

Matryoshka Representation Learning trains vectors so their first-D dimensions are always a valid representation. OpenAI text-embedding-3-large (3072 dims) truncated to 256 dims still outperforms ada-002 (1536 dims). MRL reduces vector width; multi-vector increases vector count. Jina-ColBERT-v2 applies both simultaneously: 128→64 dims with only 1.5% quality loss.
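Applied to a token matrix, MRL-style truncation keeps the leading dimensions of every row. The sketch below assumes the vectors are re-normalised after truncation so that dot products remain cosine similarities; the toy 4-dim matrix stands in for a real (N_tokens × 128) one.

```python
import math

def truncate_and_renormalise(vec, keep_dims):
    """Keep the first keep_dims values, then rescale to unit length."""
    head = vec[:keep_dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy matrix: 2 tokens x 4 dims (a real ColBERT matrix would be N x 128).
token_matrix = [
    [0.6, 0.0, 0.8, 0.0],
    [0.0, 0.8, 0.0, 0.6],
]

# Halve the width, mirroring the 128 -> 64 case described above.
halved = [truncate_and_renormalise(row, keep_dims=2) for row in token_matrix]
print(halved)
```

The token count is untouched, so MaxSim works exactly as before; only the per-token width, and therefore storage, is halved.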

Weaviate + Jina-ColBERT-v2

Production multi-vector retrieval with Weaviate v1.29 — named vectors, automatic embedding via JinaAI API, variable-length token matrices

1. Schema: Named Multi-Vector

# Weaviate v1.29+ with a recent weaviate-client v4 release (assumed);
# check parameter names against your installed client version
import weaviate
from weaviate.classes.config import Configure

client = weaviate.connect_to_weaviate_cloud(...)

client.collections.create(
  name="Documents",
  # The multi-vector vectorizer is passed directly as a vector config entry,
  # not nested inside vector_index_config
  vector_config=[
    Configure.MultiVectors.text2vec_jinaai(
      name="multi_vector",
      source_properties=["text"],
      model="jina-colbert-v2"
    )
  ]
)

2. Inspect Embedding Shape

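# Single-vector → flat list
# Multi-vector → list of lists
collection = client.collections.get("Documents")
# Vectors are only returned when requested explicitly
response = collection.query.fetch_objects(limit=1, include_vector=True)
v = response.objects[0].vector['multi_vector']

type(v)          # list
type(v[0])        # list  ← token vector
len(v)            # 22–30  ← varies by doc length
len(v[0])         # 128  ← fixed dim per token

# Shape: (N_tokens, 128)
# e.g. "Hello world" → (4, 128)
# "A long paragraph..." → (42, 128)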

3. Semantic Search (near_text)

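from weaviate.classes.query import MetadataQuery

collection = client.collections.get("Documents")

# Query text is embedded by the configured vectorizer, then scored against
# the named multi-vector
results = collection.query.near_text(
  query="bank loan interest rate",
  target_vector="multi_vector",
  limit=5,
  return_metadata=MetadataQuery(distance=True)
)
for obj in results.objects:
  print(obj.properties["text"], obj.metadata.distance)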

4. Self-Provided Embeddings

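# Bring your own ColBERT vectors: use this entry when creating the collection
Configure.MultiVectors.self_provided(
  name="multi_vector"
)

# Insert with custom embeddings
# docs is your own iterable of records and get_colbert_embedding your own
# encoder; neither is part of the Weaviate client
with collection.batch.dynamic() as batch:
  for doc in docs:
    batch.add_object(
      properties={"text": doc["text"]},
      vector={
        "multi_vector": get_colbert_embedding(
          doc["text"]  # returns (N, 128) matrix
        )
      }
    )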

Jina-ColBERT-v2

128 dims per token • 8,192-token context • 89+ languages • MRL-compatible: can truncate to 64 dims (1.5% quality loss, 50% storage saving) • 52.1 avg nDCG@10 on 14 BEIR benchmarks

Storage Reality (Weaviate)

Single-vector 1536 dims: ~6 KB/object • Multi-vector, 64 token vectors × 96 dims: ~25 KB/object (both at 4 bytes per value) • ~4× overhead • For 1M docs: ~6 GB → ~25 GB. PLAID + residual compression substantially narrows this gap.

Query Types

near_text — automatic embedding via JinaAI API • near_vector — provide your own embedding matrix • hybrid — combine BM25 + multi-vector MaxSim scoring


Practice Exercises

Three browser exercises + one live lab to build intuition for multi-vector retrieval

1 MaxSim Puzzle — Identify the Best Matches

Below is a similarity matrix for query "null pointer exception fix" against a code document. Each cell is the cosine similarity (0–100) between a query token and a document token. Click the cell in each row that you think is the MaxSim winner (highest in that row), then check your score.

2 System Matcher — Choose the Right Embedding Approach

For each retrieval scenario, select the best embedding approach. One correct answer per scenario.

3 Compression Trade-off Quiz

Test your understanding of the ColBERTv2 compression and MRL techniques.

★ Live Lab — MaxSim Token Walkthrough

Enter a query and a passage. gpt-4o-mini simulates ColBERT's MaxSim — it identifies which passage tokens are the best match for each query token, explains the context-sensitive matching, and shows how the score would accumulate. Educational (no actual embeddings computed). Optional OpenAI key (~$0.001/run).


Build the complete representation and retrieval picture

Post 08 — Embeddings & Similarity

Single-vector embeddings from first principles — word2vec, BERT pooling, cosine similarity, and why dot product search works. The foundation that Post 56 extends to multi-vector.

Post 21 — RAG

Retrieval-Augmented Generation: vector databases, chunking strategies, and how retrieved context improves generation quality. Multi-vector embeddings directly upgrade the retrieval step in any RAG pipeline.

Post 55 — Securing MCP

MCP agents with access to vector databases are a key attack surface — understanding multi-vector retrieval helps scope what data an agent can reach via semantic search and why DLP scanning of retrieved content matters.
