Continual Learning — Teaching Models to Remember Without Forgetting

Post 48 Training & Alignment Turing Post · Nov 2025 Google & Meta FAIR

Humans learn their whole lives without forgetting how to walk when they learn to swim. Neural networks trained the standard way can't. When trained on new data, they catastrophically overwrite what they already knew. Continual learning is the discipline of fixing this, and two recent proposals attack the problem from different directions: Google's Nested Learning (with its HOPE architecture) and Meta FAIR's Sparse Memory Finetuning.

The Core Problem

A model trained sequentially on Task 2 after Task 1 overwrites the weights that encoded Task 1 knowledge, and its performance on Task 1 can collapse, in the worst case to near zero. This is catastrophic forgetting.

Why It Matters Now

LLMs have a hard boundary: knowledge frozen at pretraining cutoff. They can't update their weights at inference time. Continual learning would let deployed models absorb new facts without full retraining.

Catastrophic Forgetting

Why neural networks can't learn sequentially — and the interactive proof

First described by McCloskey and Cohen (1989) and by Ratcliff (1990), catastrophic forgetting is not a bug. It is a structural consequence of how gradient descent works. When a model trains on Task 2, the weight updates are computed to minimize Task 2 loss only. These updates push weights away from the optimum for Task 1, destroying previously encoded knowledge.

The cruel irony: if you train on Task 1 and Task 2 interleaved (mixed together), forgetting doesn't happen. The problem is purely the sequential training procedure, not the model's capacity.

▶ Catastrophic Forgetting Simulator

Interactive chart: Task 1 and Task 2 performance over training, with the task boundary marked, in sequential and interleaved modes. In sequential mode, Task 1 performance collapses once training on Task 2 begins.

Key insight: The problem is not capacity — a model with enough parameters could hold both tasks. The problem is the optimizer doesn't know it needs to preserve Task 1 knowledge while learning Task 2. Standard SGD treats all weights as fair game.
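The point about capacity versus procedure can be reproduced in a few lines. The sketch below is a toy illustration, not taken from either paper: a two-weight linear model, two made-up tasks that a single weight vector `(1, -1)` solves exactly, and plain gradient descent run either sequentially or interleaved. All names and values are assumptions chosen for the example.

```python
# Toy model: y = w[0]*x[0] + w[1]*x[1]. Both tasks are solved exactly by w = (1, -1),
# so capacity is not the limiting factor. All values here are illustrative.
TASK_1 = ((1.0, 0.0), 1.0)   # (input, target): requires w[0] == 1
TASK_2 = ((1.0, 1.0), 0.0)   # requires w[0] + w[1] == 0


def loss(w, task):
    x, y = task
    return (w[0] * x[0] + w[1] * x[1] - y) ** 2


def sgd_step(w, task, lr=0.1):
    # One gradient step on the squared error of a single task.
    x, y = task
    err = w[0] * x[0] + w[1] * x[1] - y
    return (w[0] - lr * 2 * err * x[0], w[1] - lr * 2 * err * x[1])


def train_sequential(steps=500):
    # All of Task 1 first, then all of Task 2.
    w = (0.0, 0.0)
    for _ in range(steps):
        w = sgd_step(w, TASK_1)
    for _ in range(steps):
        w = sgd_step(w, TASK_2)
    return w


def train_interleaved(steps=500):
    # Alternate one step on each task.
    w = (0.0, 0.0)
    for _ in range(steps):
        w = sgd_step(w, TASK_1)
        w = sgd_step(w, TASK_2)
    return w


w_seq = train_sequential()
w_mix = train_interleaved()
print("sequential  task1/task2 loss:", loss(w_seq, TASK_1), loss(w_seq, TASK_2))
print("interleaved task1/task2 loss:", loss(w_mix, TASK_1), loss(w_mix, TASK_2))
```

In this setup the sequential run ends with Task 2 solved and a Task 1 loss of about 0.25, because the Task 2 updates drag `w[0]` away from 1. The interleaved run drives both losses to essentially zero with the same model and the same optimizer.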

Effective continual learning requires more than just preventing forgetting. A complete system also needs: fast adaptation to new tasks, ability to leverage task similarities (forward and backward knowledge transfer), task-agnostic behavior where possible, and efficiency in memory and compute.

→ Forward Transfer

Task 1 helps Task 2 — learning to recognize cats makes it easier to recognize dogs. A good CL system should preserve and exploit these relationships.

← Backward Transfer

Task 2 actually improves Task 1 performance. Rare and harder to achieve in neural networks — but the gold standard for true continual learning.
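Both kinds of transfer are usually measured from an accuracy matrix. The sketch below uses definitions that are common in the continual learning literature but are not given in this post, so treat them as an assumption: `R[i][j]` is test accuracy on task `j` after training on tasks `0..i`, and `baseline[j]` is the accuracy of an untrained model on task `j`. The numbers are made up.

```python
def backward_transfer(R):
    """Average change in accuracy on earlier tasks once all training is done.

    Negative values indicate forgetting; positive values indicate backward transfer.
    """
    T = len(R)
    return sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)


def forward_transfer(R, baseline):
    """Average accuracy on a task before training on it, relative to an untrained baseline."""
    T = len(R)
    return sum(R[j - 1][j] - baseline[j] for j in range(1, T)) / (T - 1)


# Hypothetical accuracy matrix for three tasks (rows: after training on task i).
R = [
    [0.90, 0.20, 0.10],
    [0.70, 0.85, 0.15],
    [0.60, 0.75, 0.80],
]
baseline = [0.10, 0.10, 0.10]

print("backward transfer:", backward_transfer(R))
print("forward transfer:", forward_transfer(R, baseline))
```

With these made-up numbers the backward transfer is negative (the model forgets) while the forward transfer is slightly positive (earlier tasks help a little on later ones).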

The Plasticity–Stability Tradeoff

Every continual learning system must find the right balance

At the heart of continual learning is a fundamental tension: a model that learns too easily (high plasticity) will overwrite old knowledge. A model that preserves old knowledge too strongly (high stability) won't be able to absorb new information. The goal is to find the sweet spot.

▶ Plasticity–Stability Explorer

Interactive slider from Stable to Plastic, trading off two quantities:

Old Task Retention: how well the model remembers previously learned knowledge.

New Task Learning: how quickly the model adapts to new tasks and data.

The human solution: The brain maintains this balance through synaptic consolidation — important connections are strengthened and made resistant to change, while new connections form freely. Memory layers and parameter regularization are machine learning's attempt to replicate this.
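One way to see the tradeoff numerically is to add a penalty that pulls the weights back toward their old values while training on the new task. The sketch below uses a plain L2 penalty as a simplified stand-in for parameter regularization; it is not the method of either paper, and the toy tasks, `W_OLD`, and the `stability` knob are assumptions made for illustration.

```python
W_OLD = (1.0, 0.0)            # weights after Task 1 (hypothetical toy values)
TASK_1 = ((1.0, 0.0), 1.0)    # (input, target) for the old task
TASK_2 = ((1.0, 1.0), 0.0)    # (input, target) for the new task


def loss(w, task):
    x, y = task
    return (w[0] * x[0] + w[1] * x[1] - y) ** 2


def train_task2(stability, steps=200):
    """Minimize task2_loss(w) + stability * ||w - W_OLD||^2 by gradient descent."""
    x, y = TASK_2
    lr = 1.0 / (4.0 + 2.0 * stability)   # 1 / largest curvature of this quadratic objective
    w = W_OLD
    for _ in range(steps):
        err = w[0] * x[0] + w[1] * x[1] - y
        grad = tuple(2 * err * x[i] + 2 * stability * (w[i] - W_OLD[i]) for i in range(2))
        w = tuple(w[i] - lr * grad[i] for i in range(2))
    return w


for stability in (0.0, 0.5, 2.0, 10.0):
    w = train_task2(stability)
    print(f"stability={stability:5.1f}  old-task loss={loss(w, TASK_1):.3f}  "
          f"new-task loss={loss(w, TASK_2):.3f}")
```

As `stability` grows, the old-task loss shrinks and the new-task loss rises. A uniform penalty like this only moves along the tradeoff curve; the methods discussed later try to bend the curve by choosing which parameters to protect.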

3 Continual Learning Scenarios

How difficult is the problem? It depends on what the model knows at test time.

Researchers characterize CL problems by three scenarios, defined by what task information is available at test time. They range from easy (full task info given) to hard (no task info, all classes ever seen).

Task-IL (easiest): task identity is given at test time.

Domain-IL (medium): same task, but the input distribution shifts.

Class-IL (hardest): all classes seen so far, with no task identity.

Important: These three scenarios are independent of whether the learning setup is task-based (explicit task boundaries) or task-free (smooth distribution shifts). Each scenario can occur in either setup.
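The practical difference between Task-IL and Class-IL shows up at prediction time. The sketch below is a minimal illustration with a hypothetical class split and made-up logits: when the task identity is known, the model only has to choose among that task's classes; without it, every class competes.

```python
# Hypothetical split: task 0 owns classes {0, 1}, task 1 owns classes {2, 3}.
TASK_CLASSES = {0: [0, 1], 1: [2, 3]}


def predict(logits, task_id=None):
    """Return the predicted class index.

    task_id given  -> Task-IL: choose only among that task's classes.
    task_id absent -> Class-IL: choose among every class seen so far.
    """
    candidates = TASK_CLASSES[task_id] if task_id is not None else range(len(logits))
    return max(candidates, key=lambda c: logits[c])


logits = [0.1, 0.7, 0.9, 0.2]       # made-up scores; the true class is 1 (task 0)
print(predict(logits, task_id=0))   # Task-IL prediction
print(predict(logits))              # Class-IL prediction
```

Here the Task-IL prediction is correct, while the Class-IL prediction is pulled toward a class from the other task. Domain-IL is not shown: it would use a single shared output space whose input distribution changes between tasks.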

6 General Methods

The toolkit for fighting catastrophic forgetting

Nested Learning — Google Research

What if every part of a neural network is already a memory system? Let different parts update at different speeds.

The brain continuously learns because different regions update at different speeds: the hippocampus encodes new episodic memories quickly, while the cortex consolidates knowledge slowly over time. Google Research asked: does this already exist inside Transformers?

The answer is yes. Every component of a neural network can be reframed as an associative memory — a system that maps keys to values and updates itself to improve this mapping:

Attention Layer

Maps keys to values within the current context and is queried by each token. It is recomputed on every forward pass, which makes it the fastest memory in the hierarchy.

Optimizer State

Momentum, Adam's first/second moment — stores recent gradient history. Updates every batch. Medium speed.

Model Weights (MLP)

Long-term parametric memory. Updates slowly over training. Frozen after pretraining in standard Transformers.
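The claim that optimizer state is a memory can be checked directly for momentum. The sketch below is a minimal illustration with made-up gradient values: the running momentum buffer (in the exponential-moving-average form used for Adam's first moment) equals a decaying weighted sum over every gradient seen so far, so it is a lossy summary of gradient history.

```python
def momentum_trace(grads, beta=0.9):
    """Exponential moving average of gradients, updated once per batch."""
    m = 0.0
    history = []
    for g in grads:
        m = beta * m + (1 - beta) * g
        history.append(m)
    return history


def explicit_weighted_sum(grads, beta=0.9):
    """The same quantity written as a decaying weighted sum over all past gradients."""
    t = len(grads)
    return sum((1 - beta) * beta ** (t - 1 - i) * g for i, g in enumerate(grads))


grads = [1.0, -0.5, 2.0, 0.25]   # made-up gradient values
print(momentum_trace(grads)[-1], explicit_weighted_sum(grads))
```

The two printed numbers agree up to floating-point error. Older gradients are still present in the buffer, only with exponentially smaller weight.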

▶ Neural Learning Module — Memory Hierarchy

A Transformer is already a nested learning system. NL makes this explicit and adds more levels.

Level 1: Attention Memory (updates every token)

Recomputes token-to-token relationships every forward pass. No persistent update; fully dynamic. Captures the immediate context window.

Level 2: Optimizer State, Momentum / Adam (updates every batch)

Stores running statistics of recent gradients. Acts as short-term memory for the learning process itself: which direction the weights have been moving.

Level 3: Model Weights, MLP / FFN (updates every few steps)

Long-term parametric memory. Encodes world knowledge accumulated over pretraining on trillions of tokens. Updates slowly, once per gradient step.

The Transformer gap: Standard Transformers only have Levels 1 and 3 — attention (fast) and MLP (slow, frozen after pretraining). There is no mechanism to update long-term weights continuously after deployment. This is the gap Nested Learning addresses.

Nested Learning generalizes this into a structured hierarchy called the Continuum Memory System (CMS), where instead of one MLP (a single slow memory), there are multiple MLPs, each updating at a different frequency:

▶ Continuum Memory System (CMS)

Each memory block watches its own recent chunk of data. Faster levels capture short-term details; slower levels compress longer trends into stable, abstract knowledge.
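The multi-frequency idea can be sketched with a toy update schedule. This is an illustration of the scheduling only, under loud assumptions: each "memory block" is a single number rather than an MLP, the update is a simple move toward the mean of its chunk, and the periods (1, 8, 64) and the token stream are made up. It is not the update rule from the Nested Learning paper.

```python
class MemoryLevel:
    """One memory block that updates once per `period` tokens, using the mean of its chunk."""

    def __init__(self, period, lr):
        self.period = period
        self.lr = lr
        self.value = 0.0      # stand-in for the block's parameters
        self.chunk = []       # tokens seen since the last update
        self.num_updates = 0

    def observe(self, x):
        self.chunk.append(x)
        if len(self.chunk) == self.period:
            target = sum(self.chunk) / len(self.chunk)
            self.value += self.lr * (target - self.value)
            self.chunk = []
            self.num_updates += 1


levels = [
    MemoryLevel(period=1, lr=0.5),    # fast: updates on every token
    MemoryLevel(period=8, lr=0.5),    # medium
    MemoryLevel(period=64, lr=0.5),   # slow: one update per 64 tokens
]
stream = [float(t % 4) for t in range(64)]   # made-up token stream
for x in stream:
    for level in levels:
        level.observe(x)

print([(lvl.period, lvl.num_updates, round(lvl.value, 3)) for lvl in levels])
```

Over 64 tokens the fast level updates 64 times and tracks the latest inputs, while the slow level updates once and only sees the long-run average of its chunk.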

HOPE — Google's Continual Learning Architecture

Hierarchical, Optimizing, Persistent, Evolving — where all Nested Learning ideas converge

HOPE is the architecture that instantiates Nested Learning in practice. It combines three components into a unified self-referential learning system:

1. Titans-Style Self-Modifying Sequence Models

Combines short-term attention with a neural long-term memory module that can learn and store information even at test time (not just during training). This closes the gap left open by standard Transformers.

2. Nested Learning View of Optimization

The optimizer is treated as an explicit memory module. HOPE generalizes momentum/Adam into a structured hierarchy where each level has its own update rule and operates at its own speed.

3. Continuum Memory System (CMS)

Multiple MLPs updating at different frequencies — creating a continuum from fast ephemeral memory to slow stable knowledge. Replaces the single static FFN of a standard Transformer.
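To make "learning at test time" in the first component concrete, here is a heavily simplified sketch of a memory that is written by gradient steps while the model is being used. It assumes a linear memory matrix and a squared-error objective, and it omits the momentum, forgetting, and deep-memory parts of Titans-style modules. The function names and the stored association are hypothetical.

```python
def read(M, k):
    """Linear memory read: returns M @ k for a matrix stored as a list of rows."""
    return [sum(row[j] * k[j] for j in range(len(k))) for row in M]


def write(M, k, v, lr=0.1):
    """One gradient step on ||M k - v||^2 with respect to M (a delta-rule style update)."""
    err = [p - t for p, t in zip(read(M, k), v)]
    return [[M[i][j] - lr * 2 * err[i] * k[j] for j in range(len(k))]
            for i in range(len(M))]


M = [[0.0, 0.0], [0.0, 0.0]]              # empty memory
key, value = [1.0, 0.0], [2.0, -1.0]      # made-up association seen at test time
for _ in range(50):
    M = write(M, key, value)

print(read(M, key))
```

After the repeated writes, reading the memory with the same key returns approximately the stored value, even though no conventional training loop ran.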

▶ HOPE vs Transformer — Memory System Comparison

Component	Transformer	HOPE
Fast memory	Attention — per-token, dynamic	Attention — same, plus Titans self-modifying memory
Medium memory	None	CMS mid-freq MLPs — update every 10–100 tokens
Slow memory	MLP / FFN — static after pretraining	CMS slow MLP — updates continuously, never freezes
Update rules	Single global optimizer	Each level has its own update rule — nested optimization
Catastrophic forgetting	Severe (sequential tasks)	Greatly reduced (sparse, multi-scale updates)
Perplexity (1.3B / 100B tokens)	18.53	15.11 (lower = better)

Sparse Memory Finetuning — Meta FAIR

A plug-in solution for current Transformers: update only the memory slots that matter

While HOPE requires building a new architecture, Meta FAIR's approach works with existing models. The key idea: replace one of the Transformer's FFN layers with a memory layer — a sparse attention lookup into millions of learned key-value slots.

During finetuning on new knowledge, only a tiny fraction of those slots are updated: the ones specifically relevant to the new information. Everything else stays frozen, which sharply limits forgetting.

▶ Memory Layer — How It Works

A memory layer replaces an FFN with a sparse lookup into a giant table of key-value memory slots. Only the top-32 matching slots are used on any forward pass.

Standard FFN: Input → dense weight matrix → all parameters activated → single output.
⚠ Training updates affect all weights, so the risk of forgetting is high.

Memory Layer: Input → query vector → find top-32 matching slots → combine + gate output.
✔ Only 0.0002%–0.03% of parameters touched per pass.

Figure: memory slot access pattern (1M slots), marking active (top-32) slots, inactive slots, and slots selected by TF-IDF for finetuning.
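The lookup itself can be sketched as a top-k attention over a key-value table. The version below is a minimal illustration under stated assumptions: four slots instead of millions, `k=2` instead of 32, a brute-force scan over all keys (real memory layers use more efficient key search), and no output gating. The keys, values, and query are made up.

```python
import math


def memory_lookup(query, keys, values, k=2):
    """Sparse key-value lookup: score all keys, keep the top-k, softmax-weight their values."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)                   # subtract the max for stability
    weights = [math.exp(scores[i] - peak) for i in top]
    total = sum(weights)
    dim = len(values[0])
    output = [sum(w / total * values[i][d] for w, i in zip(weights, top))
              for d in range(dim)]
    return top, output


keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]    # 4 slots instead of ~1M
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [9.0, 9.0]]
slots, out = memory_lookup([1.0, 0.2], keys, values, k=2)
print(slots, out)
```

Only the selected slots contribute to the output, so only their keys and values would receive gradient on this pass. The slot with value `[9.0, 9.0]` is never touched by this query.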

The brilliant part of Sparse Memory Finetuning is how it identifies which slots to update. It borrows TF-IDF from classical information retrieval:

▶ TF-IDF Slot Selection

Term Frequency (TF)

How often is this memory slot accessed during the new training batch? Slots accessed frequently for the new data are likely relevant to it.

Inverse Document Freq (IDF)

How rarely was this slot accessed during pretraining? Rare slots are more specific — likely storing unique rather than common knowledge.

High TF-IDF slot = frequently used for the new fact AND rarely used in general → highly specific to the new knowledge → safe to finetune without harming existing knowledge.

Only the small set of high-TF-IDF slots is updated. Everything else (99.97%+ of parameters) stays frozen.
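The selection rule can be sketched as follows. The exact formula is an assumption: a standard smoothed TF-IDF, with term frequency taken from slot accesses in the new-data batch and document frequency taken from how many background (pretraining) batches accessed each slot. Function names, slot indices, and counts are hypothetical.

```python
import math


def tfidf_scores(batch_counts, background_batches_with_slot, num_background_batches):
    """Score each slot accessed in the current batch.

    batch_counts: slot -> number of accesses in the new-data batch.
    background_batches_with_slot: slot -> number of background batches that accessed it.
    """
    total = sum(batch_counts.values())
    scores = {}
    for slot, count in batch_counts.items():
        tf = count / total
        df = background_batches_with_slot.get(slot, 0)
        idf = math.log((num_background_batches + 1) / (df + 1))
        scores[slot] = tf * idf
    return scores


def select_slots(scores, top_t):
    """Indices of the top_t highest-scoring slots; only these would be finetuned."""
    return sorted(scores, key=scores.get, reverse=True)[:top_t]


# Made-up counts: slot 3 is hot everywhere, slot 7 is hot only for the new data.
batch_counts = {3: 10, 7: 10, 42: 2}
background = {3: 1000, 7: 5, 42: 0}
scores = tfidf_scores(batch_counts, background, num_background_batches=1000)
print(select_slots(scores, top_t=2))
```

Slot 3 is accessed as often as slot 7 on the new data, but because it is also accessed in every background batch its score is zero, so it stays frozen. Slots 7 and 42 are the ones selected for updating.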

Learning vs. Forgetting Frontier

How do full fine-tuning, LoRA, and memory layers compare? Teaching models new TriviaQA facts.

Meta FAIR tested three approaches on the task of injecting new facts (TriviaQA) while preserving existing knowledge (NaturalQuestions). The results reveal a stark tradeoff between learning and forgetting:

▶ Forgetting Comparison — NaturalQuestions Performance Drop

* Drop in NaturalQuestions performance after finetuning on TriviaQA. Lower forgetting is better.

Method	NQ Drop (%)	TriviaQA Gain	Verdict
Full Fine-Tuning	−89%	High ↑	Learns fast, forgets everything
LoRA	−71%	Medium ↑	Better control, still degrades
Sparse Memory FT	−11%	Good ↑	Best balance — targeted updates

The core advantage: Because memory layers are sparse, finetuning touches only ~0.03% of parameters. These targeted updates don't interfere with the dense knowledge encoded in the rest of the model — it's surgical rather than broad-spectrum.

Here's how the two approaches compare at a high level:

Dimension	Google HOPE / Nested Learning	Meta Sparse Memory FT
Approach	New architecture with multi-speed memory hierarchy	Plug-in memory layer for existing Transformers
Deployment	Requires training from scratch	Add memory layer to existing pretrained model
Update mechanism	Multiple MLPs at different frequencies	Sparse attention into millions of key-value slots
Slot selection	Frequency-based (update schedule)	TF-IDF relative to pretraining
Best for	New models; continual deployment from day one	Updating existing deployed models with new facts
Scalability	Unclear at 100B–1T params (early results)	Requires memory layer during pretraining

Key Takeaways

01 Catastrophic forgetting is caused by the sequential training procedure, not by capacity. Interleaved training eliminates it but is impractical at scale.

02 The plasticity-stability tradeoff is fundamental: a model that learns fast forgets fast. Every CL method is a different way of managing this tradeoff.

03 Google's Nested Learning reframes neural networks as hierarchies of associative memories, each operating at its own speed. HOPE instantiates this with multi-frequency MLPs.

04 Meta's Sparse Memory Finetuning is immediately practical: TF-IDF selects the tiny subset of memory slots relevant to new knowledge, leaving everything else frozen. NaturalQuestions performance drops by only 11%, versus 89% for full finetuning.

05 Continual learning is the missing ingredient in deployed LLMs: the bridge between static pretraining knowledge and an ever-changing world.
