A map of mechanistic interpretability: observe, intervene, validate

May 29, 2026

I spent a few weeks working through the standard mechanistic interpretability toolkit: contrast vectors and steering, linear probes, sparse autoencoders, Neuronpedia, Natural Language Autoencoders. They all felt intuitive. Almost suspiciously so. Same hooks on the residual stream, same layer sweeps, same question — what is this activation doing? — with different answers.

That intuition is correct. The methods are not independent inventions. They are different edges on the same causal graph: decode what is there → intervene to test causality → validate that the story survives scrutiny. Once you see that structure, the acronym soup stops being intimidating.

What also becomes clear is why it feels like “there are no clever new methods left.” The micro-invention phase — logit lens, induction heads, the first SAE papers — is mostly over. The frontier in 2026 is not probe #7. It is multi-method convergence, honest failure modes, and closing the loop to safety decisions. This article is a map of the tools as they exist today, organized by the question each one actually answers.

For how interpretability fits into oversight at deployment time, see three layers of AI oversight. For one concrete research thread on my site — supervised emotion vectors vs. Anthropic’s NLA line — see NLA and emotion vectors.


Start with the question, not the acronym

Most confusion comes from treating “interpretability” as one technique. It is at least three:

| Stage | Question | If you skip it… |
| --- | --- | --- |
| Observe / decode | Can we read an internal state? | You confuse correlation with mechanism |
| Intervene | Does changing that state cause the behavior to change? | You get pretty heatmaps that do not transfer |
| Validate | Does the explanation survive held-out tasks, ablations, and adversarial checks? | You get IOI-style anecdotes, not science |

Every method you have heard of sits in one of these buckets (sometimes two). Read ≠ steer ≠ patch ≠ feature. A concept can peak for linear decoding at layer 8, steer best at layer 14, and matter for a circuit only at an attention head you never probed. Collapsing those curves into “layer 12 = emotion” is how interpretability Twitter lies to you.


Observe: reading the residual stream

These methods take activations and produce an interpretation. None of them, alone, proves causality.

Contrast vectors and linear probes

The simplest read: collect activations on positive vs. negative examples of a concept, subtract means, get a direction. Representation Engineering (RepE) (Zou et al., 2023) formalized this as monitor + control. Linear Artificial Tomography (LAT) sweeps the same direction across layers.
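A minimal sketch of the difference-of-means read, assuming activations have already been extracted as plain lists of floats. The toy numbers and the helper names (`mean_vector`, `contrast_direction`, `project`) are illustrative assumptions, not taken from the RepE codebase.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def contrast_direction(pos_acts, neg_acts):
    """Difference of class means, normalized to unit length."""
    diff = [p - q for p, q in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def project(act, direction):
    """Scalar read-out: how far an activation lies along the direction."""
    return sum(a * d for a, d in zip(act, direction))

# Toy 3-dimensional "residual stream" activations (made-up numbers).
pos_acts = [[2.0, 0.1, 1.0], [1.8, -0.1, 1.2], [2.2, 0.0, 0.8]]
neg_acts = [[0.0, 0.1, 1.0], [0.2, -0.1, 1.2], [-0.2, 0.0, 0.8]]

direction = contrast_direction(pos_acts, neg_acts)
scores_pos = [project(a, direction) for a in pos_acts]
scores_neg = [project(a, direction) for a in neg_acts]
print(direction, scores_pos, scores_neg)
```

In this toy the two classes differ only along the first coordinate, so the direction lands on that axis and the projections separate cleanly. Real stimuli are rarely this tidy, which is why the surface-feature caveat matters.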

Linear probes and logistic regression on hidden states answer a narrower question: is concept C linearly decodable at layer L? Cheap, scalable, easy to fool yourself with if your stimuli share surface features.
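A hedged sketch of a layer sweep with a logistic-regression probe, written in plain Python so it runs anywhere. The "layers" are synthetic (the `make_layer` helper and its `signal` parameter are assumptions for illustration); on a real model the inputs would be hidden states cached at each layer.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(xs, ys, lr=0.5, epochs=200):
    """Fit logistic regression by plain batch gradient descent."""
    dim = len(xs[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(xs)
    return w, b

def accuracy(w, b, xs, ys):
    """Fraction of examples where the probe's thresholded output matches the label."""
    preds = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) > 0.5 for x in xs]
    return sum(p == bool(y) for p, y in zip(preds, ys)) / len(ys)

def make_layer(signal, n=200, dim=4, seed=0):
    """Synthetic 'layer': the concept label shifts dimension 0 by +/- signal."""
    rng = random.Random(seed)
    xs, ys = [], []
    for i in range(n):
        y = i % 2
        x = [rng.gauss(0, 1) for _ in range(dim)]
        x[0] += signal if y else -signal
        xs.append(x)
        ys.append(y)
    return xs, ys

# Sweep three fake layers with increasing concept signal; score on held-out data.
held_out_acc = {}
for layer, signal in {"layer_a": 0.0, "layer_b": 1.0, "layer_c": 3.0}.items():
    xs, ys = make_layer(signal)
    w, b = train_probe(xs[:150], ys[:150])
    held_out_acc[layer] = accuracy(w, b, xs[150:], ys[150:])
print(held_out_acc)
```

The held-out split is the part worth copying: probe accuracy on the training stimuli says little about whether the concept is decodable or the probe memorized surface features.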

This is the family I used when reproducing Anthropic’s emotion vectors on Llama 1B: supervised contrast pairs, layer-wise vectors, logit lens readout, steering at inference. It works. It also captures emotion categorization (label-shaped signal) more than affect reception, a dissociation that Whether, Not Which (2026) documents on keyword-free stimuli.

Logit lens and tuned lens

Project a hidden state through the unembedding matrix and inspect top promoted tokens. Fast sanity check: does this direction look like “fear” or “Paris”? Tuned lens variants learn a per-layer correction because the raw unembedding is a rough approximation. Useful, not causal.
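A toy version of the logit lens read, assuming a four-token vocabulary and a hand-written unembedding matrix (both made up for illustration). Real implementations usually apply the final layer norm before unembedding; that step is omitted here.

```python
def logit_lens(hidden, unembed, vocab, k=2):
    """Return the k (token, logit) pairs whose unembedding rows align best with `hidden`."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    ranked = sorted(zip(vocab, logits), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical vocabulary and unembedding rows (one row per token).
vocab = ["Paris", "London", "fear", "joy"]
unembed = [
    [1.0, 0.0, 0.2],
    [0.8, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, -0.9, 0.3],
]

# A hidden state (or concept direction) pointing mostly along dimension 1.
hidden = [0.1, 2.0, 0.0]
top = logit_lens(hidden, unembed, vocab)
print(top)  # "fear" ranks first for this toy direction
```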

Sparse autoencoders (SAEs) and crosscoders

Superposition (Elhage et al., 2022) explains why individual neurons are polysemantic: models pack more concepts than they have dimensions, so concepts end up sharing neurons. SAEs train a sparse dictionary to decompose activations into features, hopefully monosemantic units.
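To make the decomposition concrete, here is a sketch of one common SAE formulation: a ReLU encoder, a linear decoder, and a loss of reconstruction MSE plus an L1 sparsity penalty. The weights are hand-set rather than trained, and the sizes (2 activation dimensions, 3 features) are assumptions chosen so the dictionary is overcomplete.

```python
def matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode to non-negative feature activations, then reconstruct."""
    pre = [z + b for z, b in zip(matvec(w_enc, x), b_enc)]
    feats = [max(0.0, p) for p in pre]  # ReLU keeps features non-negative
    # Decoder: x_hat = b_dec + sum_i feats[i] * w_dec[i] (rows are feature directions).
    x_hat = list(b_dec)
    for f, feature_dir in zip(feats, w_dec):
        x_hat = [xh + f * d for xh, d in zip(x_hat, feature_dir)]
    return feats, x_hat

def sae_loss(x, feats, x_hat, l1_coeff=0.01):
    """Reconstruction MSE plus L1 penalty (feats >= 0, so the sum is the L1 norm)."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return mse + l1_coeff * sum(feats)

# Three feature directions living in a 2-dimensional activation space.
w_dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
b_dec = [0.0, 0.0]

x = [2.0, 0.0]
feats, x_hat = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, feats, x_hat)
print(feats, x_hat, loss)
```

Only one of the three features fires for this input, and reconstruction is exact. That is the good case; a perfect reconstruction here says nothing about whether the features match how the model computes.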

Anthropic’s Scaling Monosemanticity (2024) showed this at Claude 3 Sonnet scale: safety-relevant features (deception, sycophancy), multi-step reasoning features, and human-inspectable names. Crosscoders (2024) encode multiple layers into one shared feature set so the same concept can be tracked vertically without reinventing it per layer.

Neuronpedia is the community UI for browsing SAE and NLA features — not a method, but the inspection layer the ecosystem actually uses.

SAEs are the main unsupervised alternative to hand-built contrast vectors. The catch: dictionaries are not unique, low reconstruction error does not imply mechanistic faithfulness (Olah, 2025 toy model), and a feature that looks semantic may implement memorization.

Natural language autoencoders (NLAs)

Anthropic’s NLA paper (2026) compresses an activation into ≤500 tokens of natural language, then reconstructs the activation from that text. Training optimizes reconstruction MSE only; readability emerges from the bottleneck.

NLAs sit at the opposite end of supervision from contrast vectors: no labels, high compute, noisy, confabulation-prone. But they can surface what the model represents but does not say, including evaluation awareness on SWE-bench-style tasks that output monitoring misses. I wrote more about the connection to my emotion-vector work in NLA and emotion vectors.

Related 2025–2026 lines: activation oracles, introspection adapters (Lindsey et al.) — same bet, different architectures: insert human-readable explanations into the pipeline and accept the faithfulness tradeoff.

Persona vectors and similar supervised directions

Chen et al., 2025 extract persona vectors — directions associated with misaligned personas (sycophancy, hallucination pressure) — using contrast data and causal tests. Same observe→intervene loop as RepE, different target concepts and safety framing.


Intervene: testing whether the read is causal

Observation tells you something is there. Intervention asks whether it does work.

Activation steering (ActAdd, RepE control)

At inference, add \(\alpha v\) to the residual stream, where \(v\) is a concept direction and \(\alpha\) is a scalar strength. If behavior shifts predictably, you have weak causal evidence. Cheap on open models. Side effects are common; off-target damage is underreported.
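The arithmetic of steering is small enough to show directly. This sketch assumes a single residual vector at one layer and token position and a unit-norm direction. On a real model the same addition would run inside a forward hook, which is not shown.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def steer(residual, direction, alpha):
    """Return residual + alpha * direction for one position at one layer."""
    return [r + alpha * d for r, d in zip(residual, direction)]

direction = [0.6, 0.8, 0.0]   # unit norm: 0.36 + 0.64 = 1
residual = [1.0, -1.0, 0.5]   # made-up activation

before = dot(residual, direction)
steered = steer(residual, direction, alpha=4.0)
after = dot(steered, direction)
print(before, after)  # projection along the direction grows by alpha
```

Because the direction has unit norm, the projection moves by exactly `alpha`. Everything orthogonal to the direction is untouched at this layer, but later layers read the shifted vector, which is where off-target effects come from.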

My RFA jailbreak experiment on Qwen lives on the same geometry: Arditi et al. showed refusal in many chat models is mediated by a single direction; ablating it removes refusal. Interpretability did not create the vulnerability — it revealed how shallow the safety mechanism was.
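Directional ablation is the mirror image of steering: instead of adding a direction, project it out. A minimal sketch, assuming one residual vector and a made-up direction standing in for a refusal direction. This illustrates the projection only and is not code from Arditi et al.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ablate_direction(residual, direction):
    """Remove the component along `direction`: x - (x . r_hat) * r_hat."""
    norm = math.sqrt(dot(direction, direction))
    unit = [d / norm for d in direction]
    coeff = dot(residual, unit)
    return [x - coeff * u for x, u in zip(residual, unit)]

refusal_dir = [3.0, 4.0, 0.0]   # hypothetical direction, not unit norm
residual = [2.0, 1.0, 5.0]      # made-up activation

ablated = ablate_direction(residual, refusal_dir)
print(ablated, dot(ablated, refusal_dir))
```

After ablation the residual has no component along the direction, while the coordinate orthogonal to it (the third one here) is unchanged.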

Activation patching and attribution patching

Activation patching is the gold standard: run a “clean” input and a “corrupted” input, swap one component’s activation during the corrupted forward pass, measure whether the task metric recovers. Causal, brutally expensive.
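The clean/corrupted procedure can be sketched on a toy model with three named components. Everything here is an assumption for illustration: the component names, the arithmetic inside `run_model`, and the use of the final activation as the task metric.

```python
def run_model(x, patch=None):
    """Toy forward pass. `patch` maps a component name to a forced activation."""
    patch = patch or {}
    acts = {}
    acts["head_a"] = patch.get("head_a", 2.0 * x[0])
    acts["head_b"] = patch.get("head_b", 1.0 * x[1])
    acts["mlp"] = patch.get("mlp", max(0.0, acts["head_a"] + 0.1 * acts["head_b"]))
    metric = acts["mlp"]  # stand-in for a logit difference
    return metric, acts

clean_x, corrupt_x = [1.0, 1.0], [0.0, 0.0]
clean_metric, clean_acts = run_model(clean_x)
corrupt_metric, _ = run_model(corrupt_x)

# One extra forward pass per component: patch its clean activation into the
# corrupted run and measure how much of the clean metric comes back.
recovery = {}
for name in clean_acts:
    patched_metric, _ = run_model(corrupt_x, patch={name: clean_acts[name]})
    recovery[name] = (patched_metric - corrupt_metric) / (clean_metric - corrupt_metric)
print(recovery)
```

The loop is the expensive part: one forward pass per component, per prompt pair. With thousands of heads, neurons, or edges, that is the cost that motivates gradient-based approximations.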

Attribution patching (AtP) (Nanda) approximates patching with gradients: two forward passes, one backward pass, all components scored at once. AtP* (Kramár et al., 2024) fixes known failure modes, notably around attention softmax saturation. Still an approximation; top candidates should be verified with real patching.
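The approximation is a first-order Taylor expansion: estimated effect = (clean activation − corrupted activation) × gradient of the metric at the corrupted activation. The sketch below compares that estimate with real patching on a two-activation toy metric. The metric and its hand-written gradient are assumptions, and the `tanh` path stands in for any saturating nonlinearity.

```python
import math

def metric(acts):
    """Toy metric: one saturating path (tanh) and one linear path."""
    return math.tanh(acts[0]) + 0.5 * acts[1]

def metric_grad(acts):
    """Analytic gradient of `metric` with respect to each activation."""
    return [1.0 - math.tanh(acts[0]) ** 2, 0.5]

clean = [3.0, 1.0]
corrupt = [0.0, 0.0]

# Attribution patching: one gradient evaluation scores every component at once.
grad = metric_grad(corrupt)
atp_estimate = [(c - k) * g for c, k, g in zip(clean, corrupt, grad)]

# Real patching: one extra evaluation per component.
true_effect = []
for i in range(len(corrupt)):
    patched = list(corrupt)
    patched[i] = clean[i]
    true_effect.append(metric(patched) - metric(corrupt))

print(atp_estimate, true_effect)
```

On the linear path the estimate is exact. On the saturating path the linear extrapolation overshoots by roughly a factor of three, which is the kind of error that makes verification of top candidates necessary.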

Automated circuit discovery

A circuit is a subgraph of the model’s computation that is complete (ablate it → behavior breaks) and faithful (ablate outside it → behavior unchanged). The IOI circuit (Wang et al., 2022) — indirect object identification in GPT-2 small — is the canonical hand-drawn example.

ACDC (Conmy et al., 2023) automates edge pruning from the output backward. EAP (Syed et al., 2023) scores edges with attribution patching, and EAP-IG (Hanna et al., 2024) adds integrated gradients; Edge Pruning (Bhaskar et al., 2024) instead learns edge masks by optimization. Hanna et al.’s central warning: high overlap with a hand-drawn circuit ≠ faithfulness. You can share most nodes and still fail ablation tests.
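A much-simplified sketch of the ACDC-style loop: visit edges starting nearest the output, tentatively remove each one, and keep it removed if the metric moves by less than a threshold. The tiny linear graph, zero-ablation (ACDC itself patches in corrupted activations), and the absolute-difference metric (ACDC uses KL divergence by default) are all simplifying assumptions.

```python
NODE_ORDER = ["a", "b", "out"]    # topological order after "input"
VISIT_ORDER = ["out", "b", "a"]   # prune edges nearest the output first

def forward(edges, x=1.0):
    """Each node's value is the weighted sum of its incoming edges."""
    values = {"input": x}
    for node in NODE_ORDER:
        values[node] = sum(
            w * values[src] for (src, dst), w in edges.items() if dst == node
        )
    return values["out"]

def prune(edges, tau):
    """Greedily drop edges whose removal changes the output by less than tau."""
    kept = dict(edges)
    for edge in sorted(edges, key=lambda e: VISIT_ORDER.index(e[1])):
        baseline = forward(kept)
        trial = {e: w for e, w in kept.items() if e != edge}
        if abs(forward(trial) - baseline) < tau:
            kept = trial
    return kept

# Hypothetical graph: one strong path (input -> a -> out) and two weak ones.
edges = {
    ("input", "a"): 1.0,
    ("input", "b"): 1.0,
    ("a", "out"): 2.0,
    ("b", "out"): 0.01,
    ("input", "out"): 0.02,
}
circuit = prune(edges, tau=0.05)
print(sorted(circuit))
```

Note that each comparison is against the already-pruned graph, so small losses accumulate. The final circuit should still be scored against the full model, which is where faithfulness metrics come in.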

Circuit work is the granularity upgrade from “layer 14 matters” to “this head and this MLP edge matter.” Budget accordingly on anything above a few billion parameters.

Weight editing (ROME and descendants)

ROME (Meng et al., 2022) locates factual associations in mid-layer MLPs and applies rank-one weight edits. Activation methods ask what the model is doing now; ROME asks where a fact lives in weights. The ontologies differ, though both point at middle layers for their respective concept types.
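The core of a rank-one edit fits in a few lines. This sketch treats an MLP projection as a key-to-value map `W` and rewrites it so a chosen key `k` maps to a new value. It assumes an identity key covariance, which reduces ROME's update to W' = W + (v − Wk)kᵀ / (kᵀk); ROME itself weights the update by an estimated covariance of keys, and the matrices here are made up.

```python
def matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def rank_one_edit(W, k, v_target):
    """Return W + (v_target - W k) k^T / (k^T k), so that the new W maps k to v_target."""
    residual = [vt - wk for vt, wk in zip(v_target, matvec(W, k))]
    kk = sum(x * x for x in k)
    return [
        [w + r * kj / kk for w, kj in zip(row, k)]
        for row, r in zip(W, residual)
    ]

W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
k = [1.0, 0.0, 0.0]           # key for the fact being edited
v_target = [0.0, 5.0]         # desired new value for that key
other_key = [0.0, 1.0, 0.0]   # an unrelated key, orthogonal to k

W_edited = rank_one_edit(W, k, v_target)
print(matvec(W_edited, k), matvec(W_edited, other_key))
```

The edited key now returns the target value, and keys orthogonal to it are unaffected. Real keys are not orthogonal, so edits leak to related inputs.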


Validate: when is an explanation “true enough”?

The field’s maturing move is benchmarks and explicit faithfulness criteria — not prettier dashboards.

RAVEL and disentanglement

RAVEL (Huang et al., ACL 2024) tests whether methods can locate and disentangle entity attributes (city → continent, person → occupation) using causal interchange: swap representations and check whether the target attribute moves in isolation. Distributed methods beat single-neuron stories. The benchmark now feeds broader suites (SAEBench, MIB).

CausalGym and linguistic causality

CausalGym (Arora et al., 2024) extends the SyntaxGym idea: can interpretability methods causally affect linguistic behaviors on minimal pairs? DAS (Distributed Alignment Search) often wins — with overfitting caveats the authors discuss.

Faithfulness, completeness, open problems

Sharkey et al.’s Open Problems in Mechanistic Interpretability (2025) is the community checklist: decomposition theory, automation, validation protocols, monitoring, dual-use, governance. If you read one meta-document, read that.

The operational definitions from circuit work carry over everywhere:

  • Faithfulness: intervene on everything outside your explanation → behavior should not change
  • Completeness: intervene on everything inside → behavior should break

Most published “circuits” and “features” satisfy neither fully. Reporting which one you tested matters.
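The two criteria can disagree, and a toy model makes that visible. The sketch below assumes zero-ablation and a hypothetical backup component that only activates when `head_1` is removed, loosely in the spirit of backup behavior reported in circuit work. The component names and numbers are made up.

```python
def run_model(ablated):
    """Toy model: `backup` compensates only when head_1 has been ablated."""
    head_1 = 0.0 if "head_1" in ablated else 2.0
    head_2 = 0.0 if "head_2" in ablated else 1.5
    if "backup" in ablated:
        backup = 0.0
    else:
        backup = 1.5 if head_1 == 0.0 else 0.0
    return head_1 + head_2 + backup

all_components = {"head_1", "head_2", "backup"}
circuit = {"head_1", "head_2"}

full = run_model(set())

# Faithfulness: ablate everything outside the circuit; the metric should hold.
faithfulness = run_model(all_components - circuit) / full

# Completeness: ablate the circuit itself; the metric should collapse.
completeness = 1.0 - run_model(circuit) / full

print(faithfulness, completeness)
```

This circuit is fully faithful but only partly complete: ablating it leaves a large share of the behavior intact because the backup takes over. Reporting only the faithfulness number would hide that.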

Provable guarantees (early)

Zhang et al., 2026 connect circuit discovery to neural network verification (α-β-CROWN). Aspirational for frontier models today, but it sets the epistemic bar: what would it mean to prove an internal explanation, not just illustrate it?


What “SOTA” means in 2026

If you are waiting for a radically new primitive, you may be looking in the wrong place. The active frontier lines, as I read them:

  1. Feature dictionaries at scale — SAE → transcoder → crosscoder, aligned with steering and patching (Scaling Monosemanticity, crosscoders)
  2. Automated circuits with honest metrics — ACDC/EAP-IG plus faithfulness, not just overlap (Hanna et al., 2024)
  3. Unsupervised verbalization for monitoring — NLA, activation oracles, eval-awareness detection (NLA 2026)
  4. Concept phenomenology across depth — RepE LAT, emotion layers, read-vs-steer dissociations (Anthropic emotion concepts, Whether Not Which)
  5. Safety-grounded pipelines — locate refusal/harm features → break them → harden (Actionable MI survey; my RFA entanglement post)
  6. Measurement science — RAVEL, CausalGym, SAEBench, Sharkey open problems

The synthesis recipe I find most convincing — and the one I am trying to execute — is triangulation: for the same concept, ask whether contrast vectors, probes, (optional) SAE neighbors, steering effects, and (optional) NLA decodes agree qualitatively on depth and behavior. Disagreement is often more publishable than agreement.
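One cheap way to quantify agreement is to compare per-layer profiles from two methods: where each one peaks, and how well their layer rankings correlate. The profiles below are placeholder numbers invented for illustration, not results, and the Spearman helper assumes no tied values.

```python
def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def ranks(values):
    """Rank of each entry, 0 = smallest (assumes distinct values)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result

def spearman(a, b):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Placeholder per-layer profiles for one concept (layers 0..5).
probe_acc = [0.55, 0.70, 0.92, 0.88, 0.80, 0.75]     # read: held-out probe accuracy
steer_effect = [0.00, 0.05, 0.20, 0.45, 0.60, 0.30]  # steer: behavioral shift

read_peak = argmax(probe_acc)
steer_peak = argmax(steer_effect)
rho = spearman(probe_acc, steer_effect)
print(read_peak, steer_peak, rho)
```

In this made-up example the read profile peaks two layers before the steer profile and the rank correlation is moderate. A gap like that is the thing to report rather than average away.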


What this map leaves out

Deliberately incomplete:

  • Black-box interpretability (mechanistic vs. behavioral probing of capabilities) — different threat model
  • CoT / reasoning-token internals on o-style models — same tools, harder access, different geometry
  • Training dynamics (grokking, progress measures) — where mechanisms emerge, not just where they sit at convergence
  • Full tool inventory — TransformerLens, pyvene, nnsight, SAELens, etc. are infrastructure, not theory

Also: interpretability is dual-use. Methods that locate refusal or deception features help defenders and attackers on open weights. The Arditi refusal-direction result is the clean public example. The answer is not “hide the tools” — it is do not build safety as a single ablatable direction.


Where I am placing my own bets

I am not claiming a unified theory of layers. I am building multi-method concept maps on open models: emotion and related families with paired read and steer depth profiles, representational similarity analysis (RSA) checks against NLA decodes, and behavioral evals that are not just cosine similarity between vectors. If that sounds like recombining 2019–2024 primitives, it is. That recombination, with explicit limits, is what the field says it wants next.

If you are new to this: pick one observe method and one intervene method on a small open model. Run them on the same concept. Plot where they disagree. That single plot teaches more than learning a fourth acronym.


Sources and further reading

Research notes behind this map: notes/mech_interp_topics/00_INDEX.md (topic guides), readings/mech_interp_paper_inventory.md (local PDF index), notes/safety_mech_interp_literature.md (safety-specific pipeline).

Key papers linked inline above. Surveys: Sharkey et al., 2025; Actionable MI, 2026. Anthropic Transformer Circuits sequence: transformer-circuits.pub.

Related writing on this site: Emotion vectors on Llama 1B · NLA and emotion vectors · RFA jailbreak and entanglement on Qwen · AI oversight layers