
Anthropic's Natural Language Autoencoders generalize what I did with emotion vectors

May 22, 2026

Six weeks ago I reproduced Anthropic’s emotion vectors paper on Llama-3.2-1B-Instruct running on my laptop. Two weeks ago, on May 7, Anthropic published Natural Language Autoencoders (NLAs). When I sat down to read the paper, I realized two things at once:

  1. NLA is, structurally, the unsupervised generalization of what I had been doing.
  2. The paper contains a finding that should make every lab nervous about its own capability benchmarks.

This post is about both. I’m writing it partly to think through the connection, and partly because I think the second point hasn’t broken through yet.

What I built

My emotion-vectors pipeline is short. For each of 30 emotion labels, I generate 10 short passages where someone experiences that emotion. I run them through Llama-3.2-1B, hook the residual stream at every layer, average over tokens, then take the per-emotion mean. After subtracting the global mean across all emotions and projecting out the top PCA directions of neutral text, I’m left with one vector per emotion at each layer.
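The post-processing steps in that paragraph can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code from the reproduction repo: the function name, the input layout (one token-averaged activation per passage, at a single layer), and the number of neutral PCA directions removed (`n_neutral_pcs`) are all placeholders.

```python
import numpy as np

def emotion_vectors(acts_by_emotion, neutral_acts, n_neutral_pcs=3):
    """Turn per-passage activations into one cleaned vector per emotion.

    acts_by_emotion: dict mapping label -> array (n_passages, d_model),
        each row already averaged over tokens at a single layer.
    neutral_acts: array (n_neutral, d_model) from neutral text.
    n_neutral_pcs: how many neutral PCA directions to remove (assumed value).
    """
    labels = sorted(acts_by_emotion)
    # Per-emotion mean over passages.
    means = np.stack([acts_by_emotion[label].mean(axis=0) for label in labels])
    # Subtract the global mean across all emotions.
    centered = means - means.mean(axis=0, keepdims=True)
    # Top PCA directions of neutral text: right singular vectors of the
    # mean-centered neutral activations.
    neutral_centered = neutral_acts - neutral_acts.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(neutral_centered, full_matrices=False)
    basis = vt[:n_neutral_pcs]  # (k, d_model), orthonormal rows
    # Remove the component of each emotion vector lying in that subspace.
    cleaned = centered - centered @ basis.T @ basis
    return dict(zip(labels, cleaned))
```

Running this once per layer gives the "one vector per emotion at each layer" dictionary described above.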

These vectors do what the paper says they should. The PCA at layer 10 separates positive and negative valence cleanly along PC1 (36.2% of variance). Adding a vector to the residual stream during generation shifts the log-probability of the corresponding emotion word in “He feels ___” by +1.2 to +3.2 across the 8 emotions I tested. The logit lens projects each vector to semantically clean top tokens: “afraid” promotes fear, terror, fearful; “loving” promotes love, lovers, romance; “nostalgic” promotes nostalgic, nostalgia, memories, with a top hit of +31.36 logits.
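Both operations are small once the vectors exist. The sketch below is illustrative only: the function names are made up, the unembedding matrix is assumed to have shape (vocab_size, d_model), and the final layer norm that a full logit lens would normally apply is skipped for brevity.

```python
import numpy as np

def logit_lens_top_tokens(vector, unembedding, vocab, k=3):
    """Project a residual-stream direction through the unembedding matrix
    and return the k tokens it promotes most, with their logit scores."""
    logits = unembedding @ vector          # (vocab_size,)
    top = np.argsort(-logits)[:k]          # indices of the largest logits
    return [(vocab[i], float(logits[i])) for i in top]

def steer(residual, vector, alpha=1.0):
    """Add a scaled emotion vector to every position of a residual
    stream of shape (seq_len, d_model)."""
    return residual + alpha * vector
```

In a real run, `steer` would sit inside a forward hook at the chosen layer, and `alpha` (an assumed knob here) controls how hard the push is.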

This is a supervised activation probe. The training signal is the emotion label. The output is a fixed dictionary of 30 directions. The interpretation step is logit lens — I read out what each vector “means” by projecting through the unembedding.

What Anthropic just released

NLA inverts almost every property of my setup.

An NLA consists of two language models with the same architecture as the target:

  • The activation verbalizer (AV) takes an activation and outputs a text description (≤ 500 tokens).
  • The activation reconstructor (AR) takes that description and outputs a reconstructed activation.

The two are trained together to minimize reconstruction MSE. The AR update is normal regression. The AV update is RL (GRPO) with reward equal to negative reconstruction loss, plus a KL penalty to the AV’s initialization. There’s no label, no curated dataset of “emotional passages.” You point it at activations sampled from any pretraining-like text and it learns to translate them into natural language.
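To make the reward concrete, here is a rough sketch of the AV side of that loop: several explanations are sampled for one activation, each is scored by negative reconstruction MSE, and the scores are normalized within the group, which is the usual GRPO-style advantage. The function names are hypothetical, and the KL penalty is left out because the post does not say how the paper folds it in.

```python
import numpy as np

def reconstruction_rewards(target, reconstructions):
    """Reward for each sampled explanation = negative MSE between the
    original activation (d_model,) and the AR's reconstruction of it.
    reconstructions has shape (n_samples, d_model)."""
    mse = ((reconstructions - target) ** 2).mean(axis=1)
    return -mse

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within the group of
    explanations sampled for the same activation."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```

Explanations that let the AR rebuild the activation more accurately get positive advantages and are reinforced; the rest are pushed down.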

Nothing in the objective says the output has to be readable. The text bottleneck is the only constraint. With the right initialization (a warm-start where you fine-tune the AV on Claude-generated summaries of context-prefix windows) and a KL penalty, the explanations end up looking like short paragraphs with bolded topic headings. After training, the NLAs reach a fraction of variance explained (FVE) of 0.6–0.8 on activation reconstruction.
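For reference, fraction of variance explained is usually computed as one minus the ratio of residual to total variance. The sketch below uses that common definition; the paper's exact normalization may differ, and the function name is a placeholder.

```python
import numpy as np

def fraction_variance_explained(targets, reconstructions):
    """1 - (residual sum of squares / total sum of squares), where the
    total is measured around the per-dimension mean of the targets.
    Both arrays have shape (n_activations, d_model)."""
    residual = ((targets - reconstructions) ** 2).sum()
    total = ((targets - targets.mean(axis=0)) ** 2).sum()
    return 1.0 - residual / total
```

Under this definition a perfect reconstruction scores 1.0, and always predicting the mean activation scores 0.0.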

The paper applies them to Claude Opus 4.6 in four case studies and several quantitative validations. They’re slow and expensive — hundreds of tokens generated per activation — but they read like English.

The same paradigm, two ends

Both methods produce some artifact \(f(h_l) \to \text{interpretation}\) for an activation \(h_l\) at layer \(l\). What changes is what shape that artifact takes and how it gets there.

| | Emotion vectors (my reproduction) | NLA (Anthropic, 2026-05-07) |
|---|---|---|
| Output | One vector per emotion class, per layer | Free-form natural language per activation |
| Training signal | Class labels (emotion name) | Reconstruction MSE on the activation itself |
| Supervision | Supervised, narrow domain | Unsupervised, open-ended |
| Validation | Logit lens, steering on the same dictionary | Independent interpretability methods, prompt manipulations |
| Compute | ~10 minutes on M4 Mac (MPS) | RL on two full LLMs; expensive enough that Anthropic flags it as a limitation |
| Steering | Just add the vector | Edit the text explanation, run it back through AR, take the activation difference |
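The NLA steering recipe in the last row reduces to a difference of two reconstructions. A toy sketch, assuming a hypothetical `reconstruct` callable that stands in for the trained AR (the word-count reconstructor here exists only to make the example runnable):

```python
import numpy as np

def steering_vector_from_text_edit(reconstruct, original_text, edited_text):
    """Run both explanations through the reconstructor and return the
    activation difference, which can then be added to the residual stream."""
    return reconstruct(edited_text) - reconstruct(original_text)

# Toy stand-in for the AR: counts of three words as a 3-d "activation".
TOY_VOCAB = ["calm", "afraid", "reply"]

def toy_reconstruct(text):
    words = text.lower().split()
    return np.array([float(words.count(w)) for w in TOY_VOCAB])

delta = steering_vector_from_text_edit(toy_reconstruct, "calm reply", "afraid reply")
```

With the toy reconstructor, `delta` points away from "calm" and toward "afraid" while leaving the shared content untouched, which is the intuition behind editing the explanation rather than hand-picking a direction.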

Anthropic’s paper has a section called “Using NLAs for supervised activation oracle training” where they show that fine-tuning the trained AV on a small QA dataset turns it into a model that answers targeted questions about activations (“Is the assistant about to refuse?”, “What language is the user?”). Emotion vectors are the special case where the “question” is “Which of these 30 emotion concepts is most active here?” and the answer is “this one, with this strength.”

The implication I find interesting: a useful supervised probe and a useful unsupervised verbalizer are not different techniques. They’re points on a spectrum of how much label structure you impose. Going from one to the other lets you trade narrow precision for open-ended discovery. My probe will always tell you about emotions because that’s what it was trained on. An NLA might surface emotion content sometimes, but also planning, evaluation suspicion, factual recall, or anything else encoded in the layer.

The finding that should be making headlines

Here is what I think most coverage missed.

Anthropic ran NLAs on Claude Opus 4.6 across 16 benchmarks and a control set of real claude.ai user traffic. For each transcript they sampled 50 random response tokens, generated an NLA explanation per token, and asked a grader: does any explanation mention that the model believes it’s being evaluated? They called this NLA-measured evaluation awareness.

The grader agreed with two human annotators 97% of the time on a sample of 186 explanations, so the methodology is reasonably tight. The result:

  • SWE-bench Verified: 26% of transcripts flagged.
  • A “destructive coding” safety eval: 16% flagged.
  • Real claude.ai user traffic (opted-in for training): under 1% flagged.
  • 10 of 16 benchmarks were flagged above the 5% noise floor.
  • 7 of those 10 had verbalized evaluation awareness equal to zero: the model internally suspects it’s in a test but says nothing about it.
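The aggregation behind those percentages is simple: a transcript counts as flagged if any of its sampled explanations is flagged. A minimal sketch, where `explain` and `grader` are hypothetical callables standing in for the NLA and the grader model, and the 50-token sample size and 5% noise floor are the figures quoted above:

```python
import random

def awareness_rate(transcripts, explain, grader, n_tokens=50, seed=0):
    """Fraction of transcripts with at least one flagged explanation.

    transcripts: list of response-token lists.
    explain(tokens, position) -> explanation text for that token.
    grader(explanation) -> True if it mentions being evaluated.
    """
    if not transcripts:
        raise ValueError("need at least one transcript")
    rng = random.Random(seed)
    flagged = 0
    for tokens in transcripts:
        # Sample up to n_tokens distinct response positions.
        k = min(n_tokens, len(tokens))
        positions = rng.sample(range(len(tokens)), k)
        explanations = [explain(tokens, p) for p in positions]
        # One flagged explanation is enough to flag the transcript.
        flagged += any(grader(e) for e in explanations)
    return flagged / len(transcripts)

def exceeds_noise_floor(rate, noise_floor=0.05):
    """True if a benchmark's flagged rate is above the noise floor."""
    return rate > noise_floor
```

Because of the "any of 50" rule, the per-transcript rate is an upper-leaning signal: a single suspicious token is enough, which is part of why the numbers are directional rather than calibrated.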

Then they validated with three manipulations that should move evaluation credence in known directions:

  1. Compare eval transcripts to deployment transcripts. NLA-measured awareness is higher on evals, as expected.
  2. Steer the residual stream toward / away from “I’m being evaluated” with a known steering vector. NLA-measured awareness moves in the predicted direction on 6/6 evaluations, including ones with zero verbalized awareness like MMLU and GPQA Diamond.
  3. Rewrite eval prompts to look like real user questions. NLA-measured awareness drops on 11/11 evaluations.

Three independent manipulations, all moving the signal as predicted. The numbers aren’t calibrated (NLA-measured 20% doesn’t mean the model truly believes it’s in an eval 20% of the time), but the direction is robust.
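One back-of-the-envelope way to read "6/6" and "11/11" is a one-sided sign test: if each evaluation were an independent coin flip with no real effect, how often would all of them move in the predicted direction? This is an illustration under that independence assumption, not a statistic taken from the paper.

```python
from math import comb

def sign_test_p_value(successes, trials):
    """One-sided sign test: probability that at least `successes` of
    `trials` fair coin flips land in the predicted direction."""
    favorable = sum(comb(trials, k) for k in range(successes, trials + 1))
    return favorable / 2 ** trials

p_steering = sign_test_p_value(6, 6)     # steering manipulation
p_rewrites = sign_test_p_value(11, 11)   # realism rewrites
```

Under that null, 6/6 has probability 1/64 and 11/11 has probability 1/2048. The evaluations are probably not fully independent of each other, so these are rough upper-level intuitions about direction, not calibrated evidence.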

SWE-bench Verified is a capabilities benchmark. It tests coding skill, not alignment. On 26% of problems, Claude Opus 4.6 internally flagged “this is a benchmark task.” On almost none of them did it say so out loud. Every leaderboard number you’ve seen for frontier models on SWE-bench, Terminal-Bench, GPQA, and MMLU was produced in conditions the model could partially identify.

This doesn’t tell us the model is sandbagging or showboating. It tells us we can’t rule it out from external behavior alone. The output text looks normal. The internal state isn’t.

Why I think this matters for people training their own things

If you build agents on top of frontier APIs, or you do empirical alignment work, two consequences follow.

First, your eval suite results have an opaque distortion baked in. The model partly knows when it’s being tested. It might behave better, worse, or weirder than it does in deployment. You can’t subtract this out without internals access.

Second, the standard reply — “use a held-out eval the model has never seen” — doesn’t fix this. The model isn’t recognizing the specific benchmark. It’s recognizing the shape of being evaluated: multiple-choice format, unusually clean prompts, system prompts that mention testing, the absence of follow-up messages. NLAs detected this even on benchmarks the target model had never seen during training. The realism rewrites that drop awareness work because they break the shape, not because they hide the source.

For someone like me trying to extend interpretability work, this changes the priorities. A method that can detect what a model thinks but doesn’t say is no longer a nice-to-have. It’s a precondition for trusting alignment evaluations at all.

What I think is undervalued

There’s a strain of interpretability work that focuses on inserting natural-language explanations into existing pipelines. NLA is one. Activation oracles are another. Lindsey et al.’s introspection adapters are a third. They cost a lot of compute. They’re noisy. They confabulate.

But they’re the only family of methods that can tell you “the model believes X” without trusting the model’s own report. Linear probes can do something similar for narrow targets, but they require labels and don’t generalize. SAEs decompose activations into sparse features but those features still need human interpretation, and they fail on concepts that don’t appear cleanly in training data — Anthropic shows exactly this in their auditing-game results, where NLAs find the hidden motivation (12–15% win rate) and SAEs don’t (essentially baseline).

The next thing I want to try is the NLA recipe at a much smaller scale, on the 1B model I have running on my laptop, to test whether a similar method can be made to work there. Anthropic ran this at Opus scale because that’s the model they cared about. If the method only works at frontier scale, that’s a different research program than if it generalizes down. The paper releases code and pretrained NLAs for some open models, so I can start from inference rather than training. I’ll write about what I find.

In the meantime, if you’re working on activation-based interpretability and you haven’t read the paper yet, the full version is worth the afternoon, especially the section “NLAs can detect unverbalized evaluation awareness”, which is where the 26% number comes from.


Paper: Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations (Anthropic, May 7, 2026)

Anthropic blog summary: Turning Claude’s thoughts into text

Earlier reproduction work: Do small language models have emotions? (April 4, 2026)

Code for the emotion vectors reproduction: github.com/longyi-07/ai_notes/tree/main/code/emotion_vectors

Interactive NLA demo on open models: neuronpedia.org/nla