Natural Language Autoencoders — Reading the Mind of an LLM


Post 49 · Safety & Governance · Anthropic · May 2026 · Mechanistic Interpretability


LLMs process the world through high-dimensional activation vectors, but what do those vectors actually mean? Anthropic's NLA paper introduces a system that translates activation vectors directly into natural language explanations, then reconstructs the original activations from those explanations. If the reconstruction is faithful, the explanation must carry real information about the activation.

The Core Idea

Train two models jointly: one that writes explanations from activations (Activation Verbalizer), and one that reconstructs activations from those explanations (Activation Reconstructor). High reconstruction fidelity = the explanation captures real information.

Why This Matters

NLAs can audit what a model is actually thinking, not just what it says. They surface hidden evaluation awareness, training data artifacts, and unverbalised safety-relevant states. Anthropic has used them in pre-deployment audits of Claude models.

The Interpretability Landscape

Where NLAs sit in the spectrum of methods for understanding LLM internals

Mechanistic interpretability asks: what computations does an LLM actually perform? Existing approaches split into two camps, each with serious limitations. NLAs occupy a new position between them.

Figure: Interpretability method spectrum, running from unsupervised methods to supervised methods, with NLA placed between the two.

Unsupervised Methods

Logit Lens — project mid-layer activations to vocabulary; see what token the model is "thinking"

Sparse Autoencoders (SAEs) — decompose activations into sparse linear combinations of learned features

Pro: no labels needed. Con: limited to a fixed vocabulary or feature set.
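To make the logit lens idea concrete, here is a minimal sketch under toy assumptions: the vocabulary, the unembedding matrix `unembed`, and the activation `h` are all made up for illustration, and the final layer norm that real implementations typically apply before unembedding is left out.

```python
import numpy as np


def logit_lens(h, unembed, vocab, k=3):
    """Project a mid-layer activation onto the vocabulary and return the top-k tokens."""
    logits = h @ unembed                     # (d,) @ (d, V) -> (V,)
    top = np.argsort(logits)[::-1][:k]       # indices of the k largest logits
    return [(vocab[i], float(logits[i])) for i in top]


# Hypothetical 3-dimensional model with a 4-token vocabulary.
vocab = ["cat", "dog", "moon", "rhyme"]
unembed = np.array([
    [1.0, 0.0, 0.0, 0.5],
    [0.0, 1.0, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
])
h = np.array([0.2, 0.1, 2.0])

top_tokens = logit_lens(h, unembed, vocab, k=2)
print(top_tokens)  # "moon" ranks first, then "cat"
```

The readout can only ever be a token from the fixed vocabulary, which is the limitation noted above.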

⭐ Natural Language Autoencoders

AV + AR — verbalizer writes free-form text; reconstructor validates the explanation reconstructs the original activation

Pro: expressive and validated. Con: expensive to train.

Supervised Methods

Activation Oracles — predict ground-truth labels (sentiment, topic) from activations; requires labelled data

Probing Classifiers — linear probes trained on specific concepts

Pro: accurate on trained concepts. Con: narrow; requires labels.

The key gap NLA fills: unsupervised methods need no labels but can't verify their explanations. Supervised methods are verifiable but require pre-specified labels. NLA generates free-form natural language (unsupervised) and then validates it by reconstruction (supervised-style). It combines the strengths of both camps, at new costs.

The NLA Objective — Math First

Formalizing "a good explanation is one that lets you reconstruct the original"

The central claim is elegant: if a text description contains enough information to reconstruct the original activation vector, then the description genuinely explains that activation. This turns the vague notion of "a good explanation" into a concrete, measurable objective.

Setup:
$h_l \in \mathbb{R}^d$: activation vector at layer $l$ of the target LLM
$\mathrm{AV}(h_l) \to z$: Activation Verbalizer, maps an activation to a text explanation $z$
$\mathrm{AR}(z) \to \hat{h}_l$: Activation Reconstructor, maps text to a reconstructed activation

Training objective (minimize reconstruction error):
$$L = \mathbb{E}\big[\lVert h_l - \mathrm{AR}(\mathrm{AV}(h_l)) \rVert^2\big]$$

Performance metric, Fraction of Variance Explained (FVE):
$$\mathrm{FVE} = 1 - \frac{\mathbb{E}\big[\lVert h_l - \hat{h}_l \rVert^2\big]}{\mathbb{E}\big[\lVert h_l - \mu_h \rVert^2\big]}$$
where $\mu_h$ is the mean activation (the trivial "predict the mean" baseline).

Interpretation:
FVE = 0: explanation no better than predicting the mean (useless)
FVE = 1: perfect reconstruction (explanation captures everything)
FVE = 0.6–0.8: achieved in practice (substantial, not perfect)

Eq 1: NLA objective and FVE metric
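One consequence of these definitions is worth spelling out. This is a short derivation from Eq 1, not a result quoted from the paper, and the symbol $\sigma_h^2$ is introduced here only as shorthand for the denominator of FVE. Because that denominator depends on the target model's activations alone, minimizing $L$ and maximizing FVE are the same optimization problem:

```latex
\sigma_h^2 := \mathbb{E}\big[\lVert h_l - \mu_h \rVert^2\big]
\qquad \text{(fixed by the target LLM; independent of AV and AR)}

\mathrm{FVE}
  = 1 - \frac{\mathbb{E}\big[\lVert h_l - \hat{h}_l \rVert^2\big]}{\sigma_h^2}
  = 1 - \frac{L}{\sigma_h^2}
\quad\Longleftrightarrow\quad
L = (1 - \mathrm{FVE})\,\sigma_h^2

\text{Example: } \mathrm{FVE} = 0.7 \;\Rightarrow\; L = 0.3\,\sigma_h^2
```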

Figure: FVE meter, illustrating what 0.6–0.8 means. Bands: 0.0–0.2 trivial; 0.2–0.5 weak; 0.5–0.7 informative; 0.7–0.8 NLA range; 0.9–1.0 near-perfect.

The FVE metric is non-trivial because the denominator is the variance of activations around their mean. An FVE of 0.7 means the explanation reduces residual variance by 70% relative to just predicting the average activation — a substantial amount of information captured in free-form text.

Crucially, FVE measures information preservation, not subjective explanation quality. An explanation can be fluent and plausible but have low FVE (confabulation). Conversely, a terse cryptic explanation might achieve high FVE if its tokens precisely encode the activation geometry.
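For readers who want to compute the metric themselves, here is a minimal NumPy sketch of FVE as defined in Eq 1. The function name and the toy arrays are assumptions made for illustration. The expectations are estimated over a batch of activations, one per row.

```python
import numpy as np


def fraction_of_variance_explained(h, h_hat):
    """FVE = 1 - E||h - h_hat||^2 / E||h - mean(h)||^2, estimated over rows."""
    h = np.asarray(h, dtype=float)
    h_hat = np.asarray(h_hat, dtype=float)
    mu = h.mean(axis=0, keepdims=True)        # the "predict the mean" baseline
    residual = np.sum((h - h_hat) ** 2)       # reconstruction error
    total = np.sum((h - mu) ** 2)             # variance around the mean
    return 1.0 - residual / total


# Toy batch of three 2-dimensional "activations" (made-up numbers).
h = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 1.0]])
mu = h.mean(axis=0)

print(fraction_of_variance_explained(h, h))                       # perfect reconstruction
print(fraction_of_variance_explained(h, np.tile(mu, (3, 1))))     # mean baseline
print(f"{fraction_of_variance_explained(h, mu + 0.5 * (h - mu)):.2f}")  # 0.75
```

The last case moves each reconstruction halfway from the mean to the true activation. That halves the error vector, quarters the squared error, and gives an FVE of 0.75.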

AV + AR Architecture

Two full LLMs, jointly trained, communicating through natural language

Both the Activation Verbalizer (AV) and Activation Reconstructor (AR) are initialized as full copies of the target LLM being interpreted. This is a deliberate design choice: the same model that produced the activations is also used to explain and reconstruct them — it speaks the right "language".

Figure: NLA pipeline.

Input context (tokens x_1:n, the original prompt) → Target LLM (forward pass; the model being studied) → h_l ∈ ℝ^d (layer activation; what we want to explain) → Activation Verbalizer (LLM copy, writes text) → Text explanation (natural language) → Activation Reconstructor (LLM copy, reads text) → ĥ_l (reconstructed activation; error measured as ||h_l − ĥ_l||²)

A critical design detail: the AV and AR are not connected by backpropagation. They are trained jointly but with independent gradient flows: the AV is updated via RL (reward = AR's reconstruction quality), while the AR is updated via standard regression. This sidesteps differentiating through text generation, where sampling discrete tokens provides no direct gradient.

Activation Verbalizer (AV)

Input: activation vector h_l (injected into residual stream)
Output: free-form text explanation z
Training: GRPO reinforcement learning
Reward: reconstruction quality from AR
Regularization: KL divergence toward initialization

Activation Reconstructor (AR)

Input: text explanation z
Output: reconstructed vector ĥ_l
Training: standard supervised regression
Loss: MSE between h_l and ĥ_l
Role: the "verifier" that validates AV explanations

Training Procedure

Four stages from cold start to convergence

Step 1: Supervised Warm-Start

Fine-tune the AV on a text-summarization proxy task using Claude-generated summaries of the context that produced each activation. This gives a warm start at FVE ≈ 0.3–0.4 before RL begins — preventing the cold-start problem where early RL rewards are too sparse.

Step 2: AR Update (Supervised Regression)

Sample a batch of explanations from the current AV. Train the AR to reconstruct the original activations from those explanations using standard MSE regression. The AR learns the "vocabulary" the AV is currently using to encode activations.

Step 3: AV Update (GRPO Reinforcement Learning)

Generate multiple explanation candidates from AV. Use the AR to evaluate reconstruction quality for each. Assign rewards based on FVE. Update the AV via GRPO (Group Relative Policy Optimization) with KL regularization toward the warm-start initialization to prevent reward hacking and maintain explanation fluency.

Step 4: Joint Iteration (Repeat Until Convergence)

The AR update (step 2) and the AV update (step 3) alternate, with no backpropagation coupling between the two models. Each cycle, the AR adapts to the AV's evolving explanation style while the AV improves its explanations to maximize the AR's reconstruction quality. FVE grows from roughly 0.3–0.4 at warm start to 0.6–0.8 over training. Held-out prediction tasks are evaluated throughout.
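The alternating scheme can be sketched at toy scale. Everything below is a hypothetical stand-in, not the paper's implementation: "activations" are 2-D points, an "explanation" is a single symbol from a 4-symbol vocabulary, the AV is a linear softmax policy, and the AR is a lookup table. The AV update is a plain policy gradient with group-normalized rewards, with no clipping and no KL term, and there is no warm start. The point is only to show the two updates taking turns with no gradient passing between the models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins, all hypothetical: 2-D "activations" drawn around 4 centers,
# and "explanations" that are one symbol out of a 4-symbol vocabulary.
V, D, N, G = 4, 2, 400, 8
centers = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 3.0], [0.0, -3.0]])
H = centers[rng.integers(0, 4, size=N)] + 0.3 * rng.normal(size=(N, D))

W = np.zeros((D, V))   # toy AV: softmax policy with logits = h @ W
E = np.zeros((V, D))   # toy AR: lookup table from symbol to vector


def policy(H, W):
    logits = H @ W
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


def sample(p):
    # Inverse-CDF sampling: one symbol per row of p.
    u = rng.random((p.shape[0], 1))
    return np.minimum((p.cumsum(axis=1) < u).sum(axis=1), p.shape[1] - 1)


def fve(H, H_hat):
    resid = ((H - H_hat) ** 2).sum()
    total = ((H - H.mean(axis=0)) ** 2).sum()
    return 1.0 - resid / total


history = []
for step in range(200):
    p = policy(H, W)

    # AR update: plain regression on explanations sampled from the current AV.
    z = sample(p)
    for v in range(V):
        if np.any(z == v):
            E[v] = H[z == v].mean(axis=0)   # least-squares optimum for a table
    history.append(fve(H, E[z]))

    # AV update: policy gradient with group-normalized rewards. The AR is only
    # a scorer here; no gradient passes through it. After normalization,
    # negative squared error and per-sample FVE give the same advantages.
    Z = np.stack([sample(p) for _ in range(G)], axis=1)           # (N, G)
    R = -((H[:, None, :] - E[Z]) ** 2).sum(axis=2)                 # (N, G)
    A = (R - R.mean(axis=1, keepdims=True)) / (R.std(axis=1, keepdims=True) + 1e-8)
    onehot = np.eye(V)[Z]                                          # (N, G, V)
    dlogits = (A[:, :, None] * (onehot - p[:, None, :])).mean(axis=1)
    W += 0.05 * (H.T @ dlogits) / N

print(f"FVE at first step: {history[0]:.2f}, at last step: {history[-1]:.2f}")
```

A lookup table and a one-symbol "explanation" strip out everything that makes the real problem hard, but the control flow is the same: fit the reconstructor to whatever the verbalizer currently says, then nudge the verbalizer toward messages the reconstructor decodes well.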

Why GRPO for the AV? The AV generates discrete text tokens, and you can't backpropagate through sampled text. GRPO gets around this by treating the AV as a policy: sample multiple explanations, compare their reconstruction quality, and update toward the better ones using a policy gradient. The KL penalty discourages the AV from gaming the AR with encoded gibberish (steganography), though it does not rule it out.

AV GRPO update:
$\pi_\theta(z \mid h_l)$: AV policy (generates explanation $z$ given activation $h_l$)
$\pi_{\mathrm{ref}}(z \mid h_l)$: reference policy (warm-start initialization)
$r(z) = \mathrm{FVE}(h_l, \mathrm{AR}(z))$: reconstruction quality as reward
$$J(\theta) = \mathbb{E}\big[r(z)\big] - \beta \cdot \mathrm{KL}\big[\pi_\theta \,\Vert\, \pi_{\mathrm{ref}}\big]$$

AR regression update:
$$L_{\mathrm{AR}} = \mathbb{E}\big[\lVert h_l - \mathrm{AR}_\varphi(z) \rVert^2\big], \qquad z \sim \pi_\theta(\cdot \mid h_l)$$

Eq 2: GRPO objective for AV and MSE loss for AR
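As a rough sketch of how the quantities in Eq 2 can be estimated from samples, the snippet below computes group-relative advantages and a Monte Carlo estimate of J(θ) for one group of explanations drawn for the same activation. The function names and the numbers are assumptions made for illustration. The KL term uses the standard sample estimate, the mean of log π_θ(z) − log π_ref(z) over explanations sampled from π_θ. Practical GRPO implementations add details, such as ratio clipping, that are omitted here.

```python
import numpy as np


def grpo_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of explanations for the same activation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def kl_regularized_objective(rewards, logp_theta, logp_ref, beta):
    """Estimate J = E[r(z)] - beta * KL[pi_theta || pi_ref] from samples z ~ pi_theta."""
    r = np.asarray(rewards, dtype=float)
    kl_estimate = np.mean(np.asarray(logp_theta, dtype=float)
                          - np.asarray(logp_ref, dtype=float))
    return r.mean() - beta * kl_estimate


# Made-up group of four explanations: FVE-style rewards and sequence log-probs.
rewards = [0.2, 0.4, 0.6, 0.8]
logp_theta = [-1.0, -1.0, -1.0, -1.0]
logp_ref = [-2.0, -2.0, -2.0, -2.0]

advantages = grpo_advantages(rewards)
objective = kl_regularized_objective(rewards, logp_theta, logp_ref, beta=0.1)
print(advantages)            # negative for below-average rewards, positive above
print(f"{objective:.2f}")    # 0.40 = mean reward 0.5 minus 0.1 * KL estimate 1.0
```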

Case Studies

What NLAs found inside Claude — three real discoveries from Anthropic

🎤 Poetry Planning: Opus 4.6 plans rhyme schemes

🌐 Language Switching: training data artifact discovered

🔧 Tool Misreporting: pre-deployment safety audit

Quantitative Evaluations

Five prediction tasks measuring whether NLA explanations are genuinely informative

To verify that NLA explanations grow more informative during training (not just fluent or plausible), the paper introduces five held-out prediction tasks. A downstream model attempts to predict properties of the context from the NLA-generated explanation alone. Better explanations → higher prediction accuracy.

Figure: Prediction task performance, NLA vs. baseline (Haiku 4.5)

Baseline = random/chance prediction. Higher bars = explanation contains more recoverable information about context.
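The comparison against chance can be sketched in a few lines. This is a hypothetical scoring helper, not the paper's evaluation code: it assumes a downstream model has already produced one predicted label per explanation, and it takes the chance baseline to be uniform guessing over `num_classes` labels. The labels below are invented.

```python
def accuracy(predictions, labels):
    """Fraction of explanations from which the downstream model recovered the true label."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and the same length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def gain_over_chance(predictions, labels, num_classes):
    """Accuracy minus the expected accuracy of uniform random guessing (1 / num_classes)."""
    return accuracy(predictions, labels) - 1.0 / num_classes


# Made-up topic labels predicted from four explanations.
true_topics = ["poetry", "code", "math", "math"]
predicted_topics = ["poetry", "code", "poetry", "math"]

print(accuracy(predicted_topics, true_topics))                         # 0.75
print(round(gain_over_chance(predicted_topics, true_topics, 3), 3))    # 0.417
```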

The paper evaluates NLA performance across three Claude models: Haiku 3.5, Haiku 4.5, and Opus 4.6. A consistent finding: NLA explanations become more informative throughout training across all three models, confirming the training procedure is genuinely improving explanation quality rather than just increasing fluency.

Model	Warm-Start FVE	Final FVE	Prediction Task Gain
Claude Haiku 3.5	~0.30	~0.62	Consistent improvement
Claude Haiku 4.5	~0.33	~0.70	Consistent improvement
Claude Opus 4.6	~0.38	~0.78	Best performance
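To read the table in terms of unexplained variance, the short script below applies 1 − FVE to the approximate values above. The arithmetic is exact, but the inputs are the rounded "~" figures from the table, so the outputs are equally approximate. The helper name `summarize` is made up for this example.

```python
# Approximate (warm-start FVE, final FVE) pairs copied from the table above.
fve_table = {
    "Claude Haiku 3.5": (0.30, 0.62),
    "Claude Haiku 4.5": (0.33, 0.70),
    "Claude Opus 4.6": (0.38, 0.78),
}


def summarize(warm, final):
    """FVE gain and the fraction of variance left unexplained before and after RL."""
    return {
        "fve_gain": final - warm,
        "residual_before": 1.0 - warm,
        "residual_after": 1.0 - final,
    }


for model, (warm, final) in fve_table.items():
    s = summarize(warm, final)
    print(f"{model}: gain {s['fve_gain']:.2f}, "
          f"unexplained {s['residual_before']:.2f} -> {s['residual_after']:.2f}")
```

On these figures the gain is about 0.32, 0.37, and 0.40 respectively, and the unexplained fraction for Opus 4.6 falls from about 0.62 to about 0.22.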

Limitations

Honest accounting of what NLAs cannot do — important for safe deployment

⚠️ Confabulation

Explanations can contain verifiably false claims about the context. The AV can generate plausible-sounding text that doesn't reflect actual activation content. Thematic fidelity often holds even when specific facts are wrong: the "vibe" is right but details are hallucinated. FVE measures information preservation, not factual accuracy.

🔍 No Mechanistic Grounding

NLAs generate holistic text explanations but cannot identify which components of the activation vector drove which parts of the explanation. Unlike sparse autoencoders (which decompose into interpretable features), NLAs treat the activation as a black box and describe its overall effect.

💬 Excessive Expressivity

Because AV is a full LLM, it can draw inferences beyond what is stored in the activation — using world knowledge from pretraining. An explanation might be accurate because the AV correctly inferred something from partial activation information, not because that information was actually encoded.

💻 Computational Cost

Training requires joint RL optimization of two full LLMs. Inference generates hundreds of tokens per activation vector. This makes NLAs expensive to run at scale — unsuitable for online monitoring of every activation in a deployed model.
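A back-of-envelope count shows why inference cost scales badly. The function and the example figures below are illustrative assumptions, not measurements from the paper: one explanation is generated per token position per inspected layer, and each explanation is taken to be 200 tokens long as a stand-in for "hundreds".

```python
def generated_tokens(num_positions, num_layers, tokens_per_explanation):
    """Tokens the AV must generate to explain every position at every inspected layer."""
    return num_positions * num_layers * tokens_per_explanation


# Illustrative numbers only: a 1,000-token context, one layer, 200-token explanations.
total = generated_tokens(num_positions=1_000, num_layers=1, tokens_per_explanation=200)
print(total)  # 200000 generated tokens to explain a 1,000-token context at one layer
```

Under these assumptions, explaining a context costs 200 times as many generated tokens as the context contains, before counting the AR pass or any additional layers.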

🃏 Degenerate Objectives

Two failure modes: (1) Verbatim copying — AV simply reproduces the input context, which trivially allows reconstruction but isn't an "explanation". (2) Steganography — AV encodes activation information in imperceptible patterns in the text, achieving high FVE while generating uninterpretable output. KL regularization helps but doesn't fully prevent these.
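Verbatim copying, at least, is easy to screen for. The check below is a hypothetical heuristic, not something the paper describes: it measures what fraction of an explanation's word trigrams also appear in the input context. It does nothing about steganography, which by construction leaves no such surface trace.

```python
def ngram_overlap(explanation: str, context: str, n: int = 3) -> float:
    """Fraction of the explanation's word n-grams that also occur in the context."""

    def ngrams(text):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    explanation_ngrams = ngrams(explanation)
    if not explanation_ngrams:          # explanation shorter than n words
        return 0.0
    return len(explanation_ngrams & ngrams(context)) / len(explanation_ngrams)


context = "the cat sat on the mat today"
copied = "the cat sat on the mat today"
paraphrase = "the model is thinking about a feline resting"

print(ngram_overlap(copied, context))      # 1.0, every trigram is copied
print(ngram_overlap(paraphrase, context))  # 0.0, no trigram is shared
```

A high overlap score flags an explanation for review; a low score says nothing about whether the explanation is faithful.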

Applications

Where NLAs are already being used at Anthropic

🔎 Pre-Deployment Auditing

NLAs surfaced cases where Claude models internally suspected they were being tested — "evaluation awareness" — without explicitly expressing this in their outputs. This hidden state would be undetectable by behavioral testing alone.

Used on Claude models

🛡️ Safety Diagnostics

Identified safety-relevant internal states — including the language-switching artifact (malformed training data) and tool call misreporting patterns — during pre-deployment safety reviews before they caused user-facing issues.

Pre-deployment pipeline

🤖 Automated Auditing Agents

NLA-equipped agents — able to read and query model activations — outperformed baselines on investigating misaligned model behavior, even without access to training data. Opens a path toward automated interpretability-based alignment verification.

Beats behavioral baselines

Broader vision: The paper sketches "general-purpose activation language models" — systems capable of flexibly reading and writing between activation space and natural language. Rather than just describing activations, such a system could edit them, compose them, or query them like a database. NLA is the first step toward this.

Key Takeaways

01 NLAs frame interpretability as an autoencoding problem: a good explanation is one that faithfully reconstructs the original activation. FVE = 0.6–0.8 in practice.

02 The AV and AR are full LLM copies of the target model, trained jointly but without backpropagation coupling: AV via GRPO RL, AR via supervised MSE regression.

03 Real case studies surfaced poetry planning (causal validation via steering), a language-switching training artifact, and hidden evaluation awareness in Claude models during pre-deployment audits.

04 Key limitations: confabulation, no mechanistic grounding, excessive AV expressivity, and high compute cost. Not a drop-in replacement for SAEs.

05 Already used in Anthropic's pre-deployment pipeline, surfacing hidden internal states that behavioral testing alone would miss.
