Continual Learning
Teaching Models to Remember Without Forgetting · Visual Summary
Post 48 · Training & Alignment · Turing Post · Nov 2025 · Google & Meta FAIR
Humans learn their whole lives without forgetting how to walk when they learn to swim. Neural networks can't. When trained on new data, they catastrophically overwrite what they already knew. Continual learning is the discipline of fixing this — and two recent breakthroughs from Google (Nested Learning + HOPE) and Meta FAIR (Sparse Memory Finetuning) are bringing us closer than ever.
The Core Problem
A model trained sequentially on Task 2 after Task 1 overwrites the weights that encoded Task 1 knowledge, and its performance on Task 1 can collapse, in severe cases to near zero. This is catastrophic forgetting.
Why It Matters Now
LLMs have a hard boundary: knowledge frozen at pretraining cutoff. They can't update their weights at inference time. Continual learning would let deployed models absorb new facts without full retraining.
Catastrophic Forgetting
Why neural networks can't learn sequentially

Identified as early as 1989–1990 by McCloskey and Cohen (1989) and Ratcliff (1990), catastrophic forgetting is not a bug: it is a structural consequence of how gradient descent works. When a model trains on Task 2, the weight updates are computed to minimize Task 2 loss only. These updates push the weights away from the optimum for Task 1, destroying previously encoded knowledge.

The cruel irony: if you train on Task 1 and Task 2 interleaved (mixed together), forgetting doesn't happen. The problem is purely the sequential training procedure, not the model's capacity.

[Interactive figure: Catastrophic Forgetting Simulator. Task 1 and Task 2 performance plotted across the task boundary, with sequential vs. interleaved training modes.]
Key insight: The problem is not capacity — a model with enough parameters could hold both tasks. The problem is the optimizer doesn't know it needs to preserve Task 1 knowledge while learning Task 2. Standard SGD treats all weights as fair game.
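The sequential-versus-interleaved contrast can be reproduced on a deliberately tiny problem. The sketch below is an illustration under stated assumptions, not the simulator from the original page: the two "tasks" are hypothetical single-example regression problems for a two-weight linear model, chosen so that one weight vector, w = (1, -1), solves both.

```python
def sgd_step(w, x, y, lr=0.1):
    """One SGD step on the squared error 0.5 * (w.x - y)^2."""
    err = w[0] * x[0] + w[1] * x[1] - y
    return [w[0] - lr * err * x[0], w[1] - lr * err * x[1]]


def loss(w, x, y):
    return 0.5 * (w[0] * x[0] + w[1] * x[1] - y) ** 2


# Hypothetical toy tasks: each is one (input, target) pair.
TASK1 = ((1.0, 0.0), 1.0)  # satisfied when w0 = 1
TASK2 = ((1.0, 1.0), 0.0)  # satisfied when w0 + w1 = 0
# w = (1, -1) satisfies both, so capacity is not the bottleneck.


def train_sequential(steps=400):
    w = [0.0, 0.0]
    for _ in range(steps):          # phase 1: Task 1 only
        w = sgd_step(w, *TASK1)
    for _ in range(steps):          # phase 2: Task 2 only
        w = sgd_step(w, *TASK2)
    return w


def train_interleaved(steps=400):
    w = [0.0, 0.0]
    for _ in range(steps):          # alternate the two tasks
        w = sgd_step(w, *TASK1)
        w = sgd_step(w, *TASK2)
    return w


w_seq = train_sequential()
w_mix = train_interleaved()
print("sequential  Task 1 loss:", loss(w_seq, *TASK1))
print("interleaved Task 1 loss:", loss(w_mix, *TASK1))
```

Sequential training ends near w = (0.5, -0.5): Task 2 is solved, but the Task 1 loss climbs back to about 0.125. Interleaved training lands on the shared solution, where both losses are close to zero.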

Effective continual learning requires more than just preventing forgetting. A complete system also needs: fast adaptation to new tasks, ability to leverage task similarities (forward and backward knowledge transfer), task-agnostic behavior where possible, and efficiency in memory and compute.

→ Forward Transfer
Task 1 helps Task 2 — learning to recognize cats makes it easier to recognize dogs. A good CL system should preserve and exploit these relationships.
← Backward Transfer
Task 2 actually improves Task 1 performance. Rare and harder to achieve in neural networks — but the gold standard for true continual learning.
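Forward and backward transfer are often quantified from an accuracy matrix R, where R[i][j] is the accuracy on task j after training through task i. The sketch below uses one common formulation of the two metrics; the accuracy numbers and the `baseline` vector (accuracy of an untrained model on each task) are made up for illustration.

```python
def backward_transfer(R):
    """Average change on earlier tasks after finishing the last task.

    Negative values indicate forgetting; positive values mean later
    tasks improved earlier ones.
    """
    T = len(R)
    return sum(R[T - 1][i] - R[i][i] for i in range(T - 1)) / (T - 1)


def forward_transfer(R, baseline):
    """Average head start on a task before it has been trained on."""
    T = len(R)
    return sum(R[i - 1][i] - baseline[i] for i in range(1, T)) / (T - 1)


# Hypothetical two-task run (rows: after training task i; cols: task j).
R = [
    [0.90, 0.20],
    [0.40, 0.85],
]
baseline = [0.10, 0.10]  # assumed accuracy of a randomly initialized model

bwt = backward_transfer(R)
fwt = forward_transfer(R, baseline)
print("backward transfer:", bwt)
print("forward transfer:", fwt)
```

In this made-up run, backward transfer is negative (Task 1 accuracy fell from 0.90 to 0.40 after Task 2) while forward transfer is slightly positive (Task 1 training lifted Task 2 above the baseline).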
The Plasticity–Stability Tradeoff
Every continual learning system must find the right balance

At the heart of continual learning is a fundamental tension: a model that learns too easily (high plasticity) will overwrite old knowledge. A model that preserves old knowledge too strongly (high stability) won't be able to absorb new information. The goal is to find the sweet spot.

[Interactive figure: Plasticity–Stability Explorer. A slider from stable to plastic trades off Old Task Retention (how well the model remembers previously learned knowledge) against New Task Learning (how quickly the model adapts to new tasks and data).]
The human solution: The brain maintains this balance through synaptic consolidation — important connections are strengthened and made resistant to change, while new connections form freely. Memory layers and parameter regularization are machine learning's attempt to replicate this.
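Parameter regularization makes the tradeoff a single knob. The sketch below is a simplified stand-in, not any specific published method: it finetunes a two-weight linear model on a new task while adding a plain L2 penalty, weighted by a hypothetical strength `lam`, that pulls the weights back toward their old-task values. The toy tasks (old task wants w0 = 1, new task wants w0 + w1 = 0) are assumptions for illustration.

```python
def finetune_with_penalty(lam, steps=2000, lr=0.05):
    """Minimize new-task loss + (lam / 2) * ||w - w_old||^2 by gradient descent."""
    w_old = [1.0, 0.0]          # weights after learning the old task
    w = list(w_old)
    for _ in range(steps):
        s = w[0] + w[1]         # new-task error (its target is 0)
        g0 = s + lam * (w[0] - w_old[0])
        g1 = s + lam * (w[1] - w_old[1])
        w = [w[0] - lr * g0, w[1] - lr * g1]
    old_loss = 0.5 * (w[0] - 1.0) ** 2      # old task: w0 should be 1
    new_loss = 0.5 * (w[0] + w[1]) ** 2     # new task: w0 + w1 should be 0
    return old_loss, new_loss


lams = (0.0, 0.5, 2.0, 10.0)
results = {lam: finetune_with_penalty(lam) for lam in lams}
for lam in lams:
    old_loss, new_loss = results[lam]
    print(f"lam={lam:5.1f}  old-task loss={old_loss:.4f}  new-task loss={new_loss:.4f}")
```

Raising `lam` moves the model toward the stable end (lower old-task loss, higher new-task loss); lowering it moves toward the plastic end. Methods in this family typically replace the uniform penalty with a per-weight importance estimate.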
3 Continual Learning Scenarios
How difficult is the problem? It depends on what the model knows at test time.

Researchers characterize CL problems by three scenarios, defined by what task information is available at test time. They range from easy (full task info given) to hard (no task info, all classes ever seen).

Easiest: Task-IL. Task identity is given at test time.
Medium: Domain-IL. Same task, but the input distribution shifts.
Hardest: Class-IL. All classes seen so far, with no task identity.
Important: These three scenarios are independent of whether the learning setup is task-based (explicit task boundaries) or task-free (smooth distribution shifts). Each scenario can occur in either setup.
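The practical difference between Task-IL and Class-IL shows up at prediction time. A minimal sketch, assuming a single output layer over all classes and a hypothetical split of four classes into two tasks: with a task identity the model only has to choose among that task's classes, while without it every class seen so far competes.

```python
# Hypothetical split: task 0 owns classes 0-1, task 1 owns classes 2-3.
TASK_CLASSES = {0: [0, 1], 1: [2, 3]}


def predict(logits, task_id=None):
    """Class-IL when task_id is None, Task-IL when a task identity is given."""
    if task_id is None:
        candidates = range(len(logits))      # all classes compete
    else:
        candidates = TASK_CLASSES[task_id]   # only this task's classes compete
    return max(candidates, key=lambda c: logits[c])


# Made-up logits for an example whose true class is 1 (a task-0 class).
logits = [0.1, 2.0, 2.5, 0.3]

task_il_pred = predict(logits, task_id=0)
class_il_pred = predict(logits)
print("Task-IL prediction:", task_il_pred)
print("Class-IL prediction:", class_il_pred)
```

The same logits give the right answer under Task-IL and the wrong one under Class-IL, because a class from the other task scores slightly higher. Domain-IL does not appear here because it shares one label space across domains, so there is nothing to mask.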
6 General Methods
The toolkit for fighting catastrophic forgetting
Nested Learning — Google Research
What if every part of a neural network is already a memory system? Let different parts update at different speeds.

The brain continuously learns because different regions update at different speeds: the hippocampus encodes new episodic memories quickly, while the cortex consolidates knowledge slowly over time. Google Research asked: does this already exist inside Transformers?

The answer is yes. Every component of a neural network can be reframed as an associative memory — a system that maps keys to values and updates itself to improve this mapping:

Attention Layer
Maps query tokens (keys) to contextual representations (values). Updates every forward pass — instantaneous, infinite speed.
Optimizer State
Momentum, Adam's first/second moment — stores recent gradient history. Updates every batch. Medium speed.
Model Weights (MLP)
Long-term parametric memory. Updates slowly over training. Frozen after pretraining in standard Transformers.
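To make "associative memory" concrete, here is a minimal sketch of the simplest version: a linear memory matrix that maps keys to values and corrects itself with a delta-rule update. This is a generic textbook construction used for illustration, not code from the Nested Learning paper; the key-value pairs and learning rate are arbitrary.

```python
def recall(M, key):
    """Read from the memory: value = M @ key."""
    return [sum(m * k for m, k in zip(row, key)) for row in M]


def write(M, key, value, lr=0.5):
    """Delta-rule update: nudge M so that recall(M, key) moves toward value."""
    pred = recall(M, key)
    return [
        [m + lr * (v - p) * k for m, k in zip(row, key)]
        for row, v, p in zip(M, value, pred)
    ]


# Two arbitrary (key, value) pairs with unit-norm, non-orthogonal keys.
pairs = [
    ((1.0, 0.0), (2.0, -1.0)),
    ((0.6, 0.8), (0.5, 3.0)),
]

M = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(200):            # repeated exposure to both pairs
    for key, value in pairs:
        M = write(M, key, value)

for key, value in pairs:
    print("key", key, "-> recalled", recall(M, key), "target", value)
```

Each write reduces the error on its own key and slightly disturbs the other; cycling through both pairs settles on a matrix that recalls both. The three components above differ mainly in how often such an update happens and what plays the role of keys and values.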
▶ Neural Learning Module — Memory Hierarchy
A Transformer is already a nested learning system. NL makes this explicit and adds more levels.
Level 1: Attention Memory (updates every token)
Recomputes token-to-token relationships every forward pass. No persistent update; fully dynamic. Captures the immediate context window.
Level 2: Optimizer State (Momentum / Adam) (updates every batch)
Stores running statistics of recent gradients. Acts as short-term memory for the learning process itself: which direction the weights have been moving.
Level 3: Model Weights (MLP / FFN) (updates every few steps)
Long-term parametric memory. Encodes world knowledge accumulated over pretraining on trillions of tokens. Updates slowly, once per gradient step.
The Transformer gap: Standard Transformers only have Levels 1 and 3 — attention (fast) and MLP (slow, frozen after pretraining). There is no mechanism to update long-term weights continuously after deployment. This is the gap Nested Learning addresses.

Nested Learning generalizes this into a structured hierarchy called the Continuum Memory System (CMS), where instead of one MLP (a single slow memory), there are multiple MLPs, each updating at a different frequency:

▶ Continuum Memory System (CMS)
Each memory block watches its own recent chunk of data. Faster levels capture short-term details; slower levels compress longer trends into stable, abstract knowledge.
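The multi-frequency idea can be sketched with scalar stand-ins for the memory blocks. In the toy below, each level buffers the incoming stream and updates its state only once per `period` tokens, toward the mean of the chunk it just saw. The periods (1, 10, 100), the scalar state, and the simple averaging rule are all assumptions for illustration; the actual CMS uses MLP blocks with learned update rules.

```python
class MemoryLevel:
    """A scalar memory that updates once every `period` observations."""

    def __init__(self, period, lr=0.5):
        self.period = period
        self.lr = lr
        self.state = 0.0
        self.buffer = []
        self.updates = 0

    def observe(self, x):
        self.buffer.append(x)
        if len(self.buffer) == self.period:
            chunk_mean = sum(self.buffer) / self.period
            # Move the state part of the way toward the chunk summary.
            self.state += self.lr * (chunk_mean - self.state)
            self.buffer.clear()
            self.updates += 1


# Fast, medium, and slow levels (hypothetical periods).
levels = [MemoryLevel(1), MemoryLevel(10), MemoryLevel(100)]

# A stream that is 0 for a long time, then switches to 1 near the end.
stream = [0.0] * 950 + [1.0] * 50
for x in stream:
    for level in levels:
        level.observe(x)

for level in levels:
    print(f"period={level.period:3d}  updates={level.updates:4d}  state={level.state:.4f}")
```

After the late switch, the fast level has almost fully adopted the new value, the medium level is most of the way there, and the slow level has barely moved. That spread is the "continuum" the name refers to.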
HOPE — Google's Continual Learning Architecture
Hierarchical, Optimizing, Persistent, Evolving — where all Nested Learning ideas converge

HOPE is the architecture that instantiates Nested Learning in practice. It combines three components into a unified self-referential learning system:

1. Titans-Style Self-Modifying Sequence Models
Combines short-term attention with a neural long-term memory module that can learn and store information even at test time (not just during training). This closes the gap left open by standard Transformers.
2. Nested Learning View of Optimization
The optimizer is treated as an explicit memory module. HOPE generalizes momentum/Adam into a structured hierarchy where each level has its own update rule and operates at its own speed.
3. Continuum Memory System (CMS)
Multiple MLPs updating at different frequencies — creating a continuum from fast ephemeral memory to slow stable knowledge. Replaces the single static FFN of a standard Transformer.
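The claim that an optimizer is a memory module is easiest to see with plain momentum. The sketch below uses the exponential-moving-average form (the same shape as Adam's first moment) on a made-up gradient sequence; it illustrates the "optimizer state as memory" view only and is not HOPE's update rule.

```python
def momentum_update(m, grad, beta=0.9):
    """Exponential moving average of gradients: a compressed gradient history."""
    return beta * m + (1 - beta) * grad


# Made-up gradient stream: a long run of +1, then a few steps of -1.
grads = [1.0] * 20 + [-1.0] * 3

m = 0.0
history = []
for g in grads:
    m = momentum_update(m, g)
    history.append(m)

print("after 20 positive gradients:", round(history[19], 4))
print("after 3 negative gradients: ", round(history[-1], 4))
```

Even after three negative gradients in a row, the buffer is still positive: it "remembers" the earlier direction. The decay rate `beta` sets how long that memory lasts, which is the sense in which each level in a nested system has its own timescale.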
▶ HOPE vs Transformer — Memory System Comparison
| Component | Transformer | HOPE |
| --- | --- | --- |
| Fast memory | Attention: per-token, dynamic | Attention: same, plus Titans self-modifying memory |
| Medium memory | None | CMS mid-frequency MLPs: update every 10–100 tokens |
| Slow memory | MLP / FFN: static after pretraining | CMS slow MLP: updates continuously, never freezes |
| Update rules | Single global optimizer | Each level has its own update rule (nested optimization) |
| Catastrophic forgetting | Severe (sequential tasks) | Greatly reduced (sparse, multi-scale updates) |
| Perplexity (1.3B / 100B tokens; lower is better) | 18.53 | 15.11 |
Sparse Memory Finetuning — Meta FAIR
A plug-in solution for current Transformers: update only the memory slots that matter

While HOPE requires building a new architecture, Meta FAIR's approach works with existing models. The key idea: replace one of the Transformer's FFN layers with a memory layer — a sparse attention lookup into millions of learned key-value slots.

During finetuning on new knowledge, only a tiny fraction of those slots are updated — the ones specifically relevant to the new information. Everything else stays frozen. This prevents forgetting almost entirely.

▶ Memory Layer — How It Works
A memory layer replaces an FFN with a sparse lookup into a giant table of key-value memory slots. Only the top-32 matching slots are used on any forward pass.
Standard FFN
Input → Dense weight matrix
→ All parameters activated
→ Single output
⚠ Training updates affect all weights — high forgetting risk
Memory Layer
Input → Query vector
→ Find top-32 matching slots
→ Combine + gate output
✔ Only 0.0002%–0.03% of parameters touched per pass
[Figure: Memory Slot Access Pattern (1M slots). Legend: active (top-32), inactive, TF-IDF selected for finetuning.]
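The lookup path described above can be sketched in a few lines. This is a brute-force toy under stated assumptions: 1,000 random slots instead of millions, k = 4 instead of 32, dot-product scoring, softmax weighting over the selected slots, and no output gate. Production memory layers rely on more efficient lookup schemes than scoring every key.

```python
import heapq
import math
import random


def memory_lookup(query, keys, values, k=4):
    """Return the softmax-weighted mix of the k best-matching value slots."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    top = heapq.nlargest(k, range(len(keys)), key=scores.__getitem__)
    best = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - best) for i in top]   # stable softmax
    total = sum(weights)
    dim = len(values[0])
    output = [
        sum(w / total * values[i][d] for w, i in zip(weights, top))
        for d in range(dim)
    ]
    return output, top


random.seed(0)
NUM_SLOTS, DIM = 1000, 8   # toy sizes, far smaller than a real memory layer
keys = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_SLOTS)]
values = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_SLOTS)]
query = [random.gauss(0, 1) for _ in range(DIM)]

output, active = memory_lookup(query, keys, values)
fraction = len(active) / NUM_SLOTS
print("active slots:", sorted(active))
print("fraction of slots touched:", fraction)
```

Only the slots in `active` contribute to the output, so only they would receive gradient on this forward pass. That sparsity is what makes targeted finetuning possible.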

The brilliant part of Sparse Memory Finetuning is how it identifies which slots to update. It borrows TF-IDF from classical information retrieval:

▶ TF-IDF Slot Selection
Term Frequency (TF)
How often is this memory slot accessed during the new training batch? Slots accessed frequently for the new data are likely relevant to it.
Inverse Document Freq (IDF)
How rarely was this slot accessed during pretraining? Rare slots are more specific — likely storing unique rather than common knowledge.
High TF-IDF slot = frequently used for the new fact AND rarely used in general → highly specific to the new knowledge → safe to finetune without harming existing knowledge.
Only the small set of high-TF-IDF slots is updated. Everything else (99.97%+ of parameters) stays frozen.
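The selection rule can be sketched with the textbook TF-IDF formula. The exact weighting used in the Meta FAIR work may differ; here TF is a slot's share of accesses in the new batch, IDF is a smoothed log of how few background (pretraining-like) batches touched the slot, and all counts are made up.

```python
import math
from collections import Counter


def tfidf_scores(batch_counts, background_batches):
    """Score each slot accessed in the new batch by TF * IDF."""
    total = sum(batch_counts.values())
    n = len(background_batches)
    scores = {}
    for slot, count in batch_counts.items():
        df = sum(1 for seen in background_batches if slot in seen)
        tf = count / total
        idf = math.log((n + 1) / (df + 1))   # smoothed, so df = 0 is safe
        scores[slot] = tf * idf
    return scores


def select_slots(batch_counts, background_batches, top_t):
    scores = tfidf_scores(batch_counts, background_batches)
    return sorted(scores, key=scores.get, reverse=True)[:top_t]


# Hypothetical access counts for one finetuning batch.
batch_counts = Counter({7: 10, 3: 10, 42: 2})
# Hypothetical background: slot 3 is used everywhere, slot 7 once, slot 42 never.
background = [{3, 7} if i == 0 else {3} for i in range(100)]

selected = select_slots(batch_counts, background, top_t=2)
print("slots chosen for finetuning:", selected)
```

Slot 3 is accessed as often as slot 7 in the new batch, but because it is also used by every background batch its score drops to zero, and it stays frozen.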
Learning vs. Forgetting Frontier
How do full fine-tuning, LoRA, and memory layers compare? Teaching models new TriviaQA facts.

Meta FAIR tested three approaches on the task of injecting new facts (TriviaQA) while preserving existing knowledge (NaturalQuestions). The results reveal a stark tradeoff between learning and forgetting:

▶ Forgetting Comparison — NaturalQuestions Performance Drop
* Drop in NaturalQuestions performance after finetuning on TriviaQA. Lower forgetting is better.
| Method | NQ Drop (%) | TriviaQA Gain | Verdict |
| --- | --- | --- | --- |
| Full Fine-Tuning | −89% | High ↑ | Learns fast, forgets everything |
| LoRA | −71% | Medium ↑ | Better control, still degrades |
| Sparse Memory FT | −11% | Good ↑ | Best balance: targeted updates |
The core advantage: Because memory layers are sparse, finetuning touches only ~0.03% of parameters. These targeted updates don't interfere with the dense knowledge encoded in the rest of the model — it's surgical rather than broad-spectrum.
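Once the slots are chosen, the update itself is just a masked gradient step. A minimal sketch with a hypothetical ten-slot value table and made-up gradients, showing that rows outside the selected set are returned unchanged:

```python
def sparse_update(values, grads, selected, lr=0.1):
    """Apply a gradient step only to the rows whose index is in `selected`."""
    return [
        [v - lr * g for v, g in zip(row, grad_row)] if i in selected else list(row)
        for i, (row, grad_row) in enumerate(zip(values, grads))
    ]


# Hypothetical value table (10 slots, 2 dims) and a dense gradient for it.
values = [[1.0, 1.0] for _ in range(10)]
grads = [[0.5, -0.5] for _ in range(10)]
selected = {2, 7}    # e.g. the slots picked by the TF-IDF ranking

updated = sparse_update(values, grads, selected)
changed = [i for i in range(len(values)) if updated[i] != values[i]]
print("rows changed:", changed)
print("fraction of rows touched:", len(changed) / len(values))
```

In a real framework the same effect would usually come from masking gradients or restricting the optimizer to the selected rows; the point is that everything outside the mask is bit-for-bit identical after the step.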

Here's how the two approaches compare at a high level:

| Dimension | Google HOPE / Nested Learning | Meta Sparse Memory FT |
| --- | --- | --- |
| Approach | New architecture with multi-speed memory hierarchy | Plug-in memory layer for existing Transformers |
| Deployment | Requires training from scratch | Add memory layer to existing pretrained model |
| Update mechanism | Multiple MLPs at different frequencies | Sparse attention into millions of key-value slots |
| Slot selection | Frequency-based (update schedule) | TF-IDF relative to pretraining |
| Best for | New models; continual deployment from day one | Updating existing deployed models with new facts |
| Scalability | Unclear at 100B–1T params (early results) | Requires memory layer during pretraining |
Key Takeaways
01 Catastrophic forgetting is caused by the sequential training procedure, not by capacity. Interleaved training eliminates it but is impractical at scale.
02 The plasticity-stability tradeoff is fundamental: a model that learns fast forgets fast. Every CL method is a different way of managing this tradeoff.
03 Google's Nested Learning reframes neural networks as hierarchies of associative memories, each operating at its own speed. HOPE instantiates this with multi-frequency MLPs.
04 Meta's Sparse Memory Finetuning is immediately practical: TF-IDF selects the tiny subset of memory slots relevant to new knowledge, leaving everything else frozen. NaturalQuestions performance drops by only 11%, versus 89% for full finetuning.
05 Continual learning is the missing ingredient in deployed LLMs: the bridge between static pretraining knowledge and an ever-changing world.