🔭
Natural Language Autoencoders
Reading the Mind of an LLM · Visual Summary
Post 49 · Safety & Governance · Anthropic · May 2026 · Mechanistic Interpretability
LLMs process the world through high-dimensional activation vectors, but what do those vectors actually mean? Anthropic's NLA paper introduces a system that translates activation vectors directly into natural language explanations, then reconstructs the original activations from those explanations. If the reconstruction is faithful, the explanation must carry real information about the activation.
The Core Idea
Train two models jointly: one that writes explanations from activations (Activation Verbalizer), and one that reconstructs activations from those explanations (Activation Reconstructor). High reconstruction fidelity = the explanation captures real information.
Why This Matters
NLAs can audit what a model is actually thinking, not just what it says. They surface hidden evaluation awareness, training data artifacts, and unverbalised safety-relevant states, and they have been used in pre-deployment audits of Claude models.
The Interpretability Landscape
Where NLAs sit in the spectrum of methods for understanding LLM internals

Mechanistic interpretability asks: what computations does an LLM actually perform? Existing approaches split into two camps, each with serious limitations. NLAs occupy a new position between them.

Figure: interpretability method spectrum, running from unsupervised methods to supervised methods, with NLA positioned between the two.
Unsupervised Methods
Logit Lens — project mid-layer activations to vocabulary; see what token the model is "thinking"

Sparse Autoencoders (SAEs) — decompose activations into sparse linear combinations of learned features

Strength: no labels needed. Weakness: limited to a fixed vocabulary or feature set.
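To make the logit lens concrete, here is a minimal NumPy sketch. Everything in it is a stand-in chosen for illustration: the five-word vocabulary, the random orthonormal unembedding matrix `W_U`, and the hand-built activation `h_mid` are assumptions, and the layer norm that real models usually apply before unembedding is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "mat", "dog"]   # hypothetical toy vocabulary
d_model = 16

# Stand-in unembedding matrix with orthonormal columns, one column per token.
q, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))
W_U = q[:, :len(vocab)]

def logit_lens(h, W_U, vocab, top_k=3):
    """Project an intermediate activation onto the vocabulary; return the top tokens."""
    logits = h @ W_U
    probs = np.exp(logits - logits.max())      # numerically stable softmax
    probs /= probs.sum()
    top = np.argsort(probs)[::-1][:top_k]
    return [(vocab[i], float(probs[i])) for i in top]

# A synthetic mid-layer activation pointing along the "cat" unembedding direction.
h_mid = 2.0 * W_U[:, vocab.index("cat")]
top_tokens = logit_lens(h_mid, W_U, vocab)
print(top_tokens)
```

The readout can only ever name tokens from `vocab`, which is the fixed-vocabulary limitation noted above.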
⭐ Natural Language Autoencoders
AV + AR — verbalizer writes free-form text; reconstructor validates the explanation reconstructs the original activation

Strength: expressive and validated. Weakness: expensive to train.
Supervised Methods
Activation Oracles — predict ground-truth labels (sentiment, topic) from activations; requires labelled data

Probing Classifiers — linear probes trained on specific concepts

Strength: accurate on trained concepts. Weakness: narrow, and requires labels.
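For comparison, a linear probe can be sketched in a few lines. This is an illustrative setup on synthetic data, not a method from the paper: the "activations" are Gaussian noise, the binary concept is defined by a hidden direction `concept_direction`, and the probe is fit by least squares rather than the logistic regression often used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_train, n_test = 16, 2000, 1000

# Hidden direction that defines the concept label (synthetic ground truth).
concept_direction = rng.normal(size=d_model)

def make_data(n):
    """Synthetic activations with a +1/-1 label given by the hidden direction."""
    h = rng.normal(size=(n, d_model))
    y = np.where(h @ concept_direction > 0, 1.0, -1.0)
    return h, y

h_train, y_train = make_data(n_train)
h_test, y_test = make_data(n_test)

# Fit the probe by least squares on the +1/-1 labels, with a bias column.
X_train = np.hstack([h_train, np.ones((n_train, 1))])
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

X_test = np.hstack([h_test, np.ones((n_test, 1))])
accuracy = float(np.mean(np.sign(X_test @ w) == y_test))
print(f"probe accuracy: {accuracy:.3f}")
```

The probe answers exactly one pre-specified question and needs the labels `y_train` to do so, which is the narrowness described above.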
The key gap NLA fills: Unsupervised methods are expressive but can't verify their explanations. Supervised methods are verifiable but require pre-specified labels. NLA generates free-form natural language (unsupervised) and then validates it by reconstruction (supervised-style). This combines the strengths of both camps, at the price of expensive training and inference.
The NLA Objective — Math First
Formalizing "a good explanation is one that lets you reconstruct the original"

The central claim is elegant: if a text description contains enough information to reconstruct the original activation vector, then the description genuinely explains that activation. This turns the vague notion of "a good explanation" into a concrete, measurable objective.

Setup. Let $h_l \in \mathbb{R}^d$ be the activation vector at layer $l$ of the target LLM. The Activation Verbalizer maps activations to a text explanation, $z = \mathrm{AV}(h_l)$, and the Activation Reconstructor maps text back to a reconstructed activation, $\hat{h}_l = \mathrm{AR}(z)$.

Training objective (minimize reconstruction error):

$$\mathcal{L} = \mathbb{E}\left[\, \lVert h_l - \mathrm{AR}(\mathrm{AV}(h_l)) \rVert^2 \,\right]$$

Performance metric, Fraction of Variance Explained (FVE):

$$\mathrm{FVE} = 1 - \frac{\mathbb{E}\left[\lVert h_l - \hat{h}_l \rVert^2\right]}{\mathbb{E}\left[\lVert h_l - \mu_h \rVert^2\right]}$$

where $\mu_h$ is the mean activation (the trivial "predict the mean" baseline).

Interpretation:
FVE = 0: explanation no better than predicting the mean (useless)
FVE = 1: perfect reconstruction (explanation captures everything)
FVE = 0.6–0.8: achieved in practice (substantial, not perfect)
Eq 1: NLA objective and FVE metric
Figure (FVE meter, scale 0.0 to 1.0): 0.0–0.2 trivial; 0.2–0.5 weak; 0.5–0.7 informative; 0.7–0.8 NLA range; 0.9–1.0 near-perfect.

FVE is a meaningful score because its denominator is the variance of activations around their mean: a reconstruction earns credit only for what it explains beyond the average activation. An FVE of 0.7 means the reconstruction error is 70% lower than the error of always predicting the mean activation, which is a substantial amount of information to carry in free-form text.
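A short NumPy sketch shows how the FVE formula behaves at its reference points. The activations here are synthetic Gaussian vectors, and the "70% reconstruction" is built by hand so that its residual is exactly 30% of the variance; neither comes from the paper.

```python
import numpy as np

def fraction_of_variance_explained(h, h_hat):
    """FVE = 1 - E||h - h_hat||^2 / E||h - mean(h)||^2 over a batch of activations."""
    h = np.asarray(h, dtype=float)
    h_hat = np.asarray(h_hat, dtype=float)
    mu = h.mean(axis=0, keepdims=True)                     # "predict the mean" baseline
    residual = np.mean(np.sum((h - h_hat) ** 2, axis=1))   # reconstruction error
    total = np.mean(np.sum((h - mu) ** 2, axis=1))         # variance around the mean
    return 1.0 - residual / total

rng = np.random.default_rng(0)
h = rng.normal(size=(1000, 64))                            # synthetic stand-in activations
mu = h.mean(axis=0, keepdims=True)

fve_mean = fraction_of_variance_explained(h, np.broadcast_to(mu, h.shape))
fve_perfect = fraction_of_variance_explained(h, h)

# Hand-built reconstruction whose squared residual is 30% of the variance.
h_partial = h - np.sqrt(0.3) * (h - mu)
fve_partial = fraction_of_variance_explained(h, h_partial)

print(fve_mean, fve_partial, fve_perfect)
```

Predicting the mean scores 0, a perfect reconstruction scores 1, and the hand-built case lands at 0.7 because its residual keeps 30% of the variance around the mean.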

Crucially, FVE measures information preservation, not subjective explanation quality. An explanation can be fluent and plausible but have low FVE (confabulation). Conversely, a terse cryptic explanation might achieve high FVE if its tokens precisely encode the activation geometry.

AV + AR Architecture
Two full LLMs, jointly trained, communicating through natural language

Both the Activation Verbalizer (AV) and Activation Reconstructor (AR) are initialized as full copies of the target LLM being interpreted. This is a deliberate design choice: the same model that produced the activations is also used to explain and reconstruct them — it speaks the right "language".

NLA pipeline, stage by stage:
Input context: tokens $x_{1:n}$ (the original prompt)
Target LLM: forward pass through the model being studied
$h_l \in \mathbb{R}^d$: layer activation (what we want to explain)
AV (Activation Verbalizer): LLM copy that writes text
$z$: text explanation in natural language
AR (Activation Reconstructor): LLM copy that reads text
$\hat{h}_l$: reconstructed activation, scored by $\lVert h_l - \hat{h}_l \rVert^2$

A critical design detail: the AV and AR are not connected by backpropagation. They are trained jointly but with independent gradient flows — the AV is updated via RL (reward = AR's reconstruction quality), while the AR is updated via standard regression. This avoids the unstable gradient flows that would arise from differentiating through the text generation process.

Activation Verbalizer (AV)
Input: activation vector $h_l$ (injected into residual stream)
Output: free-form text explanation z
Training: GRPO reinforcement learning
Reward: reconstruction quality from AR
Regularization: KL divergence toward initialization
Activation Reconstructor (AR)
Input: text explanation z
Output: reconstructed vector $\hat{h}_l$
Training: standard supervised regression
Loss: MSE between $h_l$ and $\hat{h}_l$
Role: the "verifier" that validates AV explanations
Training Procedure
Four stages from cold start to convergence
1. Supervised Warm-Start
Fine-tune the AV on a text-summarization proxy task using Claude-generated summaries of the context that produced each activation. This gives a warm start at FVE ≈ 0.3–0.4 before RL begins, avoiding the cold-start problem where early RL rewards are too sparse.
2. AR Update: Supervised Regression
Sample a batch of explanations from the current AV. Train the AR to reconstruct the original activations from those explanations using standard MSE regression. The AR learns the "vocabulary" the AV is currently using to encode activations.
3. AV Update: GRPO Reinforcement Learning
Generate multiple explanation candidates from the AV. Use the AR to evaluate reconstruction quality for each. Assign rewards based on FVE. Update the AV via GRPO (Group Relative Policy Optimization) with KL regularization toward the warm-start initialization to prevent reward hacking and maintain explanation fluency.
4. Joint Iteration: Repeat Until Convergence
Steps 2 and 3 alternate without backpropagation coupling between the two models. Each cycle, the AR adapts to the AV's evolving explanation style while the AV improves its explanations to maximize AR reconstruction. FVE grows from ~0.35 to 0.6–0.8 over training. Evaluate on held-out prediction tasks throughout.
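The alternation of an AR regression step and an AV policy-gradient step can be illustrated with a deliberately tiny toy, sketched below under strong assumptions. It is not the paper's setup: the "AV" is a linear softmax policy over eight discrete codes instead of an LLM writing text, the "AR" is a lookup table with one vector per code, the activations are synthetic clusters, and the warm start and KL term are left out. The reward is the negative squared reconstruction error, which gives the same group-normalized advantages as a per-sample FVE.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_codes, n_clusters, group_size = 8, 8, 4, 8
centers = 3.0 * rng.normal(size=(n_clusters, d))

def sample_activations(n):
    """Synthetic activations: noisy points around a few cluster centers."""
    labels = rng.integers(n_clusters, size=n)
    return centers[labels] + 0.5 * rng.normal(size=(n, d))

def fve(h, h_hat):
    mu = h.mean(axis=0, keepdims=True)
    return 1.0 - np.sum((h - h_hat) ** 2) / np.sum((h - mu) ** 2)

W = np.zeros((d, n_codes))        # toy AV: policy logits over codes = h @ W
C = sample_activations(n_codes)   # toy AR: one reconstruction vector per code

def sample_codes(h, n_samples):
    """Sample n_samples codes per activation from the toy AV policy."""
    logits = h @ W
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    u = rng.random(size=(h.shape[0], n_samples, 1))
    z = (u > probs.cumsum(axis=1)[:, None, :]).sum(axis=2)   # inverse-CDF sampling
    return np.minimum(z, n_codes - 1), probs

h_eval = sample_activations(2000)
fve_before = fve(h_eval, C[np.argmax(h_eval @ W, axis=1)])

for step in range(300):
    h = sample_activations(256)

    # AR update: move each sampled code's vector toward the activations it described.
    z = sample_codes(h, 1)[0][:, 0]
    for v in np.unique(z):
        C[v] += 0.1 * (h[z == v].mean(axis=0) - C[v])

    # AV update: sample a group of codes per activation and score each with the AR.
    z, probs = sample_codes(h, group_size)
    rewards = -np.sum((h[:, None, :] - C[z]) ** 2, axis=2)
    adv = rewards - rewards.mean(axis=1, keepdims=True)
    adv /= rewards.std(axis=1, keepdims=True) + 1e-8
    # The gradient of log-softmax with respect to the logits is onehot(z) - probs.
    grad_logits = (adv[:, :, None] * (np.eye(n_codes)[z] - probs[:, None, :])).mean(axis=1)
    W += 0.05 * (h.T @ grad_logits) / h.shape[0]

fve_after = fve(h_eval, C[np.argmax(h_eval @ W, axis=1)])
print(f"FVE before: {fve_before:.2f}, after: {fve_after:.2f}")
```

No gradient passes between the two components: the table `C` is fit by regression on sampled codes, and the policy `W` only ever sees scalar rewards, mirroring the decoupling described in step 4.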
Why GRPO for the AV? The AV generates discrete text tokens — you can't backpropagate through text generation. GRPO solves this by treating the AV as a policy: sample multiple explanations, compare their reconstruction quality, and update toward better ones using policy gradient. The KL penalty prevents the AV from gaming the AR with encoded gibberish (steganography).
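AV GRPO update:

$\pi_\theta(z \mid h_l)$: AV policy (generates explanation $z$ given activation $h_l$)
$\pi_{\mathrm{ref}}(z \mid h_l)$: reference policy (warm-start initialization)
$r(z) = \mathrm{FVE}(h_l, \mathrm{AR}(z))$: reconstruction quality as reward

$$J(\theta) = \mathbb{E}\left[\, r(z) \,\right] - \beta \cdot \mathrm{KL}\left[\, \pi_\theta \,\Vert\, \pi_{\mathrm{ref}} \,\right]$$

AR regression update:

$$\mathcal{L}_{\mathrm{AR}} = \mathbb{E}\left[\, \lVert h_l - \mathrm{AR}_\phi(z) \rVert^2 \,\right], \quad z \sim \pi_\theta(\cdot \mid h_l)$$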
Eq 2: GRPO objective for AV and MSE loss for AR
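The "compare within a group" idea can be made concrete with a small sketch. The group-relative normalization below follows the general GRPO formulation (reward minus group mean, divided by group standard deviation); whether the paper uses exactly this form is an assumption, and the reward and log-probability values are made up for illustration.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sampled explanation relative to its own group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def kl_penalized_objective(rewards, logp_policy, logp_ref, beta):
    """Monte Carlo estimate of J = E[r(z)] - beta * KL[pi_theta || pi_ref].

    Assumes the explanations were sampled from pi_theta, so the KL term is
    estimated by the mean of log pi_theta(z) - log pi_ref(z).
    """
    rewards = np.asarray(rewards, dtype=float)
    kl_estimate = np.mean(np.asarray(logp_policy) - np.asarray(logp_ref))
    return rewards.mean() - beta * kl_estimate

# Hypothetical rewards for four explanations of the same activation.
rewards = [0.2, 0.5, 0.8, 0.5]
advantages = group_relative_advantages(rewards)

# Hypothetical sequence log-probabilities under the current and reference policies.
logp_policy = [-10.0, -12.0, -11.0, -9.0]
logp_ref = [-11.0, -12.0, -12.0, -10.0]
objective = kl_penalized_objective(rewards, logp_policy, logp_ref, beta=0.1)

print(advantages, objective)
```

Explanations that reconstruct better than their siblings get positive advantages and are reinforced, while the KL term lowers the objective as the policy drifts away from the reference.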
Case Studies
What NLAs found inside Claude — three real discoveries from Anthropic
🎤
Poetry Planning
Opus 4.6 plans rhyme schemes
🌐
Language Switching
Training data artifact discovered
🔧
Tool Misreporting
Pre-deployment safety audit
Quantitative Evaluations
Five prediction tasks measuring whether NLA explanations are genuinely informative

To verify that NLA explanations grow more informative during training (not just fluent or plausible), the paper introduces five held-out prediction tasks. A downstream model attempts to predict properties of the context from the NLA-generated explanation alone. Better explanations → higher prediction accuracy.

Figure: prediction task performance, NLA vs. baseline (Haiku 4.5). Baseline = random/chance prediction. Higher bars = explanation contains more recoverable information about context.

The paper evaluates NLA performance across three Claude models: Haiku 3.5, Haiku 4.5, and Opus 4.6. A consistent finding: NLA explanations become more informative throughout training across all three models, confirming the training procedure is genuinely improving explanation quality rather than just increasing fluency.

| Model | Warm-Start FVE | Final FVE | Prediction Task Gain |
| --- | --- | --- | --- |
| Claude Haiku 3.5 | ~0.30 | ~0.62 | Consistent improvement |
| Claude Haiku 4.5 | ~0.33 | ~0.70 | Consistent improvement |
| Claude Opus 4.6 | ~0.38 | ~0.78 | Best performance |
Limitations
Honest accounting of what NLAs cannot do — important for safe deployment
⚠️
Confabulation
Explanations contain verifiably false claims about the context. The AV can generate plausible-sounding text that doesn't reflect actual activation content. Thematic fidelity often holds even when specific facts are wrong — the "vibe" is right but details are hallucinated. FVE measures information preservation, not factual accuracy.
🔍
No Mechanistic Grounding
NLAs generate holistic text explanations but cannot identify which components of the activation vector drove which parts of the explanation. Unlike sparse autoencoders (which decompose into interpretable features), NLAs treat the activation as a black box and describe its overall effect.
💬
Excessive Expressivity
Because AV is a full LLM, it can draw inferences beyond what is stored in the activation — using world knowledge from pretraining. An explanation might be accurate because the AV correctly inferred something from partial activation information, not because that information was actually encoded.
💻
Computational Cost
Training requires joint RL optimization of two full LLMs. Inference generates hundreds of tokens per activation vector. This makes NLAs expensive to run at scale — unsuitable for online monitoring of every activation in a deployed model.
🃏
Degenerate Objectives
Two failure modes: (1) Verbatim copying — AV simply reproduces the input context, which trivially allows reconstruction but isn't an "explanation". (2) Steganography — AV encodes activation information in imperceptible patterns in the text, achieving high FVE while generating uninterpretable output. KL regularization helps but doesn't fully prevent these.
Applications
Where NLAs are already being used at Anthropic
🔎 Pre-Deployment Auditing
NLAs surfaced cases where Claude models internally suspected they were being tested — "evaluation awareness" — without explicitly expressing this in their outputs. This hidden state would be undetectable by behavioral testing alone.
Used on Claude models
🛡️ Safety Diagnostics
Identified safety-relevant internal states — including the language-switching artifact (malformed training data) and tool call misreporting patterns — during pre-deployment safety reviews before they caused user-facing issues.
Pre-deployment pipeline
🤖 Automated Auditing Agents
NLA-equipped agents — able to read and query model activations — outperformed baselines on investigating misaligned model behavior, even without access to training data. Opens a path toward automated interpretability-based alignment verification.
Beats behavioral baselines
Broader vision: The paper sketches "general-purpose activation language models" — systems capable of flexibly reading and writing between activation space and natural language. Rather than just describing activations, such a system could edit them, compose them, or query them like a database. NLA is the first step toward this.
Key Takeaways
01. NLAs frame interpretability as an autoencoding problem: a good explanation is one that faithfully reconstructs the original activation. FVE = 0.6–0.8 in practice.
02. The AV and AR are full LLM copies of the target model, trained jointly but without backpropagation coupling: the AV via GRPO RL, the AR via supervised MSE regression.
03. Real case studies surfaced poetry planning (causal validation via steering), a language-switching training artifact, and hidden evaluation awareness in deployed models.
04. Key limitations: confabulation, no mechanistic grounding, excessive AV expressivity, and high compute cost. Not a drop-in replacement for SAEs.
05. Already used in Anthropic's pre-deployment pipeline, surfacing hidden internal states that behavioral testing would miss entirely.