Face verification asks: are these two images of the same person? A softmax classifier cannot answer this — it only labels known classes. Metric learning solves a fundamentally different problem: learn a distance function where same-class inputs are close and different-class inputs are far.
Open-set: unknown identities at test time · 1-shot: works with one example per identity · unbounded identities supported at inference · published at CVPR 2005.
Classification (Closed-Set)
A softmax classifier maps inputs to a fixed set of N class labels. At test time you can only recognise those N classes. Adding a new person requires retraining the entire model.
Metric Learning (Open-Set)
A metric learning model maps inputs to a continuous embedding space. Two images are compared by their Euclidean distance. New identities are supported instantly — no retraining needed.
Why This Matters
Real-world verification systems must handle identities never seen during training. A border control system sees millions of new passports. Metric learning handles this naturally: new identities require no retraining.
A Siamese network consists of two identical subnetworks G_W processing two inputs simultaneously. The critical constraint: both subnetworks share the exact same weights W. This ensures both inputs are mapped into the same embedding space by the same transformation.
Weight Sharing
Shared weights guarantee both inputs are processed by the exact same function. Without this, G_W(X₁) and G_W(X₂) would live in incomparable spaces, and the distance between them would be meaningless.
Energy-Based Model
The distance D_W is an energy function. Genuine pairs are low-energy (close), impostor pairs are high-energy (far). Training lowers the energy of genuine pairs and raises the energy of impostor pairs up to the margin.
Gradient Flow
Gradients flow from the contrastive loss through D_W into both branches simultaneously. Both networks update their shared weights in lockstep: each training step uses both inputs of the pair.
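The weight-sharing and distance computation can be sketched in a few lines of NumPy, with a single shared linear map standing in for G_W (the paper uses a convolutional network; shapes and names here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for G_W: one weight matrix shared by BOTH branches.
W = 0.1 * rng.standard_normal((16, 64))  # 64-dim "image" -> 16-dim embedding

def G(W, x):
    """Embed an input with the shared weights W (a linear sketch of G_W)."""
    return W @ x

def D(W, x1, x2):
    """Energy of a pair: Euclidean distance between the two embeddings."""
    return float(np.linalg.norm(G(W, x1) - G(W, x2)))

x1, x2 = rng.standard_normal(64), rng.standard_normal(64)
print(f"D_W(x1, x2) = {D(W, x1, x2):.3f}")  # comparable: same W on both sides
print(f"D_W(x1, x1) = {D(W, x1, x1):.3f}")  # identical inputs -> distance 0
```

Because a single `W` feeds both branches, the distance is symmetric and an input is always at distance zero from itself, which is exactly the property weight sharing buys.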
The loss has two terms that oppose each other. The genuine term pulls same-class pairs together. The impostor term pushes different-class pairs apart — but only until they reach the margin m. Beyond the margin, no gradient flows: the pair is considered “good enough.”
L(W, Y, X₁, X₂) = (1−Y) · ½ · D_W² + Y · ½ · max(0, m − D_W)²
Y = 0 (genuine pair, same class): L = ½ · D_W²
Y = 1 (impostor pair, different class): L = ½ · max(0, m − D_W)²
D_W = ‖G_W(X₁) − G_W(X₂)‖₂, with margin hyperparameter m
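The two-branch loss translates directly into code. A minimal sketch, where `d` is the precomputed distance and the default margin is illustrative:

```python
def contrastive_loss(d: float, y: int, m: float = 1.0) -> float:
    """L = (1-Y) * 1/2 * D^2  +  Y * 1/2 * max(0, m - D)^2."""
    if y == 0:                            # genuine pair: quadratic pull to 0
        return 0.5 * d * d
    return 0.5 * max(0.0, m - d) ** 2     # impostor pair: hinged push to m

print(contrastive_loss(0.5, 0))   # 0.125  genuine at d=0.5
print(contrastive_loss(0.5, 1))   # 0.125  impostor inside the margin
print(contrastive_loss(1.5, 1))   # 0.0    impostor beyond the margin
```

Note the asymmetry: a genuine pair is penalised at any nonzero distance, while an impostor pair past the margin costs nothing.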
Genuine Pair Loss
For Y=0 (same class), the loss is simply ½·D_W², a quadratic bowl with its minimum at D_W=0. Every genuine pair contributes to the gradient, pulling the embeddings closer together.
Genuine Gradient: ∂L/∂D_W = D_W
The gradient is proportional to the current distance. When the pair is far apart, the gradient is large and pulls aggressively. When nearly converged (D_W≈0), the gradient vanishes naturally; no gradient clipping is needed.
Impostor Gradient: ∂L/∂D_W = −(m−D_W) if D_W<m, else 0
The gradient is negative (pushing apart) when D_W<m. When D_W≥m the pair already satisfies the constraint: zero gradient, zero cost. This is the "hinge" that makes the loss efficient.
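Both closed-form gradients can be verified against the loss with central finite differences. A quick sanity sketch (not part of the original paper); the test points are chosen away from the kink at d = m:

```python
def loss(d, y, m=1.0):
    return 0.5 * d * d if y == 0 else 0.5 * max(0.0, m - d) ** 2

def grad(d, y, m=1.0):
    if y == 0:
        return d                          # genuine: dL/dD = D
    return -(m - d) if d < m else 0.0     # impostor: hinge, zero past m

eps = 1e-6
for d in (0.2, 0.8, 1.5):                 # away from the non-smooth point d=m
    for y in (0, 1):
        numeric = (loss(d + eps, y) - loss(d - eps, y)) / (2 * eps)
        assert abs(numeric - grad(d, y)) < 1e-4
print("analytic gradients match finite differences")
```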
The margin m defines the boundary between “close enough to be impostor-safe” and “still needs pushing.” Too small and impostors cluster at the boundary. Too large and training is slow with many wasted gradient steps on already-separated pairs.
Key insight: the margin creates a "dead zone" for impostor pairs with D_W > m; they contribute zero gradient. This is computationally efficient and prevents over-separation, which can distort the embedding geometry.
Too Small m
Impostors only need to be slightly separated. The model may learn embeddings where different classes cluster near each other just beyond m. Verification accuracy suffers because the decision boundary is too tight.
Well-Calibrated m
Impostors are pushed far enough apart that a verification threshold can cleanly separate them from genuine pairs. Typical values: m=1.0 to m=2.0 for unit-normalised embeddings. Set by validation.
Too Large m
The dead zone shrinks — almost all impostor pairs contribute gradient even when well-separated. Training slows and the loss landscape becomes harder to optimise. Risk of over-pushing geometrically close classes.
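The dead-zone effect behind these trade-offs can be seen numerically. The distances and margins below are made up purely for illustration:

```python
def impostor_loss(d, m):
    return 0.5 * max(0.0, m - d) ** 2     # zero in the "dead zone" d >= m

distances = [0.2, 0.6, 1.0, 1.4, 1.8]     # hypothetical impostor distances
for m in (0.5, 1.0, 2.0):
    active = [d for d in distances if d < m]   # pairs that still push apart
    print(f"m={m}: active impostors {active}")
# m=0.5 leaves only [0.2] active; m=2.0 keeps all five pairs contributing.
```

A small margin leaves almost every impostor in the dead zone (tight boundaries); a large margin keeps nearly all of them active (slow training, risk of over-pushing).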
Training iterates over pairs (X₁, X₂, Y). For each pair, both inputs pass through G_W, producing embeddings; D_W is computed, then the loss and its gradient; the shared weights are updated via SGD.
Pair Construction
For each batch, sample genuine pairs (different images of the same identity) and impostor pairs (images of different identities). Balanced sampling is critical: too many easy impostors (D_W > m) contribute no gradient and slow training significantly.
Gradient Update
∂L/∂W = ∂L/∂D_W · ∂D_W/∂G_W · ∂G_W/∂W. The loss gradient flows back through the distance function and into both branches. Both branches update the same weight matrix W simultaneously.
Convergence
Training typically converges in 10–50 epochs. The loss decreases as genuine pairs cluster and impostors spread. The “dead zone” effect means that once a pair is correctly placed, it stops consuming gradient budget.
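A complete SGD step can be sketched in NumPy for a linear G_W(x) = W·x, where the chain rule gives ∂D/∂W = (e₁−e₂)(x₁−x₂)ᵀ / D. The toy data, shapes, and learning rate are all illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(1)
m, lr = 1.0, 0.01

def sgd_step(W, x1, x2, y):
    """One contrastive-loss SGD update of the shared weight matrix W."""
    diff = W @ x1 - W @ x2                     # e1 - e2
    d = float(np.linalg.norm(diff)) + 1e-9     # D_W (epsilon avoids /0)
    g = d if y == 0 else (-(m - d) if d < m else 0.0)   # dL/dD
    grad_W = g * np.outer(diff / d, x1 - x2)   # chain rule: dL/dW
    return W - lr * grad_W, d

# Toy "identities": two cluster centres with small per-image noise.
centres = rng.standard_normal((2, 8))
imgs = {c: centres[c] + 0.1 * rng.standard_normal((4, 8)) for c in (0, 1)}

W = 0.1 * rng.standard_normal((4, 8))          # shared linear embedding
for _ in range(20):                            # balanced pair schedule
    W, _ = sgd_step(W, imgs[0][0], imgs[0][1], 0)   # genuine, identity 0
    W, _ = sgd_step(W, imgs[1][0], imgs[1][1], 0)   # genuine, identity 1
    W, _ = sgd_step(W, imgs[0][0], imgs[1][0], 1)   # impostor
```

Note that both branches share one `W`, so a single `grad_W` accounts for the pair; the impostor branch goes silent as soon as `d >= m`.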
The entire goal of contrastive training is to reshape the embedding space. Before training, points are scattered randomly; after training, same-class points cluster tightly and different-class clusters are well separated.
5 classes · 5 points per class · 2D projection of embedding space
Intraclass Distance
After training, all points from the same identity cluster near their class centroid. The intraclass distance approaches 0 as training progresses. The genuine loss term directly minimises this.
Interclass Separation
Different-class clusters are separated by at least the margin m. The hinge structure means that once D_W ≥ m the gradient stops, so clusters don't drift unnecessarily far apart. The geometry is tight and interpretable.
Generalisation
Critically, the embedding function G_W generalises: new identities never seen during training are placed sensibly relative to existing clusters. What the network learns is the metric itself, not a fixed set of classes.
Given two face images, the system decides: same person or different? Both images pass through G_W (the shared CNN), producing embedding vectors. Their Euclidean distance D_W is compared to a threshold τ. Below the threshold: same person. Above: different person.
Worked example: D_W = 0.40, τ = 0.80 → D_W < τ → SAME PERSON (accept).
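The decision rule itself is a one-line threshold test. A sketch, using the example values D_W = 0.40 and τ = 0.80 from above:

```python
def verify(d: float, tau: float) -> bool:
    """Same person iff the embedding distance falls below the threshold."""
    return d < tau

print(verify(0.40, 0.80))   # True  -> accept: same person
print(verify(0.95, 0.80))   # False -> reject: different person
```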
How is the threshold τ set?
The threshold is selected on a held-out validation set to minimise the Equal Error Rate (EER) — the point where the False Accept Rate equals the False Reject Rate. In production, you adjust the threshold to trade off FAR vs FRR depending on security requirements.
What is FAR vs FRR?
False Accept Rate (FAR): an impostor pair is accepted (D_W < τ when it shouldn't be). False Reject Rate (FRR): a genuine pair is rejected (D_W ≥ τ when it shouldn't be). Lower threshold → lower FAR, higher FRR. Higher threshold → higher FAR, lower FRR.
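FAR, FRR, and an EER-style threshold sweep are straightforward to compute from validation distances. The distance values below are invented for illustration only:

```python
def far_frr(genuine, impostor, tau):
    far = sum(d < tau for d in impostor) / len(impostor)   # impostors accepted
    frr = sum(d >= tau for d in genuine) / len(genuine)    # genuines rejected
    return far, frr

genuine  = [0.2, 0.3, 0.5, 0.7]    # distances for same-identity pairs
impostor = [0.6, 0.9, 1.1, 1.4]    # distances for different-identity pairs

# Sweep thresholds and keep the one where FAR and FRR are closest (~ EER):
taus = [t / 100 for t in range(151)]
gap, tau_eer = min((abs(far_frr(genuine, impostor, t)[0]
                        - far_frr(genuine, impostor, t)[1]), t) for t in taus)
far, frr = far_frr(genuine, impostor, tau_eer)
print(f"tau={tau_eer:.2f}  FAR={far:.2f}  FRR={frr:.2f}")
```

In production the chosen τ would then be nudged up or down from the EER point depending on whether false accepts or false rejects are costlier.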
What architecture does G_W use?
In the original 2005 paper, G_W is a convolutional network applied to 96×96 greyscale face images. The output is an n-dimensional embedding. Modern implementations use deep ResNets or Vision Transformers, with n=128 or 512 dimensions and L2-normalised outputs.
The paper benchmarks on two standard face databases of the era. On AT&T (40 subjects, 10 images each), the contrastive approach achieves near-perfect separation. On FERET (1,196 subjects, larger variation), it outperforms all PCA and kernel-based baselines.
AT&T subjects: 40 · FERET subjects: 1,196 · AT&T verification accuracy: 98.7% · error reduction vs PCA baseline: 6%
AT&T Face Database
400 images of 40 subjects (10 per person). Variation in lighting, expression, and accessories. Standard 5-fold cross-validation. Contrastive loss achieves 98.7% — significantly better than eigenface baselines.
FERET Database
1,196 subjects with multiple captures (different dates, cameras, conditions). More realistic than AT&T. The contrastive method shows particularly large gains on this harder dataset where generalisation matters most.
vs Baselines
Compared against PCA (eigenfaces), kernel-PCA, and Fisher’s LDA. The Siamese contrastive approach outperforms all on both datasets. The gap widens for identities with larger intra-class variation.
Contrastive loss is the foundation of modern metric learning, self-supervised learning, and multimodal AI. Every major similarity-learning system from FaceNet to CLIP traces its core idea back to this 2005 paper.
Triplet Loss (FaceNet, 2015)
Uses three examples: anchor A, positive P (same class), negative N (different class). Minimises the hinge max(0, d(A,P) − d(A,N) + margin). More efficient than pairs: each triplet constrains two distances at once.
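As a sketch, the triplet hinge with a made-up margin value, taking the two precomputed distances as inputs:

```python
def triplet_loss(d_ap: float, d_an: float, margin: float = 0.2) -> float:
    """FaceNet-style hinge: zero once the negative sits `margin` further
    from the anchor than the positive does."""
    return max(0.0, d_ap - d_an + margin)

print(triplet_loss(0.5, 1.0))   # 0.0: constraint already satisfied
print(triplet_loss(0.9, 0.8))   # ~0.3: negative too close, gradient flows
```

Like the contrastive hinge, satisfied triplets contribute zero loss, which is why triplet mining (choosing hard examples) matters so much in practice.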
NT-Xent / SimCLR (2020)
Extends the idea to N−1 negatives per anchor (all other samples in the batch). No explicit margin; a temperature-scaled softmax takes its place. Powers modern contrastive self-supervised vision models such as SimCLR and MoCo.
CLIP (OpenAI, 2021)
Cross-modal contrastive learning: image–text pairs. 400M pairs from the internet. The same fundamental idea — pull matching pairs together, push non-matching pairs apart — but across modalities at massive scale.