What are we aligning to? A map of alignment paradigms

When labs say a model is “aligned,” they usually mean it passed a harm benchmark, follows a system prompt, or scores well on human preference comparisons. Those are real engineering achievements. They are also answers to different questions — and the questions get conflated fast.

The deeper problem is normative, not technical: what should the system optimize when human stakeholders disagree? A physician asked to share a diagnosis with a patient’s spouse faces HIPAA, familial duty, and the patient’s own wishes pulling in different directions. There is no universal “correct” answer. Every alignment method still has to pick — or pretend it didn’t pick — which claim gets priority.

This article is a map of the major paradigms: what each one is trying to align to, what it assumes about human values, and where the philosophical disagreements actually live. I am not arguing that any single paradigm wins. I am arguing that treating them as interchangeable labels for “make the model safe” hides load-bearing choices — and that those choices are increasingly political, not just mathematical.

The six targets (Gabriel 2020)

Iason Gabriel’s 2020 paper (Artificial Intelligence, Values, and Alignment) is the clearest taxonomy I know for separating technical alignment (how to get a system to pursue a target reliably) from normative alignment (which target to pursue).

Gabriel lists six possible objects of alignment. They are not the same thing, and collapsing them is how you get sycophancy dressed up as ethics:

Target	Rough meaning	Typical failure mode
Instructions	Do what the user literally asked	Goodharting the prompt; malicious users
Intentions	Do what the user meant	Hard to infer; shifts with context
Revealed preferences	Do what human choices reveal they want	Shortsighted, inconsistent, manipulable
Ideal preferences	Do what humans would want if informed and reflective	Hard to define; paternalism risk
Interests	Protect objective human interests	Who decides what counts as an interest?
Values	Respect deep commitments (rights, dignity, tradition)	Irreducible conflict between value systems

Gabriel’s core move is political, not metaphysical: under reasonable pluralism (Rawls), we should not pretend there is one true moral theory to encode. The goal is fair principles that affected people could reflectively endorse — not discovering “golden morality” and loading it into the weights.

That framing matters because every training pipeline already encodes a moral structure. Reinforcement learning maximizes a scalar reward. That is structurally close to utilitarian aggregation. Rights, side constraints, and “this action is wrong even if the sum of welfare goes up” are awkward inside a pure reward maximizer — Gabriel is explicit about this, and the engineering literature often ignores it.
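To make the structural point concrete, here is a minimal sketch of the difference between a pure scalar maximizer and a chooser with a side constraint. Everything in it is an illustrative assumption: the `Action` type, the welfare numbers, and the `violates_right` flag are hypothetical and do not come from Gabriel's paper or from any real training pipeline.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Action:
    name: str
    welfare: Tuple[float, ...]  # hypothetical welfare change per affected person
    violates_right: bool        # hypothetical flag: does this cross a side constraint?


def scalar_choice(actions: List[Action]) -> Action:
    """Pure reward maximizer: pick the action with the largest summed welfare."""
    return max(actions, key=lambda a: sum(a.welfare))


def constrained_choice(actions: List[Action]) -> Action:
    """Side-constraint chooser: drop rights-violating actions first, then maximize.

    Raises ValueError if no action is permitted.
    """
    permitted = [a for a in actions if not a.violates_right]
    return max(permitted, key=lambda a: sum(a.welfare))


# Toy version of the physician case: sharing helps two people and harms one.
ACTIONS = [
    Action("share_without_consent", welfare=(3.0, 2.0, -1.0), violates_right=True),
    Action("ask_patient_first", welfare=(1.0, 1.0, 1.0), violates_right=False),
]

print(scalar_choice(ACTIONS).name)
print(constrained_choice(ACTIONS).name)
```

With these toy numbers the scalar chooser picks `share_without_consent` (summed welfare 4.0 versus 3.0) and the constrained chooser picks `ask_patient_first`. The constraint is a filter that sits outside the sum, which is why it is awkward to express as one more term in a reward.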

The standard model vs. the assistance game

Before RLHF became the default, Stuart Russell’s Human Compatible (2019) named the problem cleanly.

The standard model: give the AI a fixed objective, optimize hard. Russell argues this is the root failure mode — a sufficiently capable optimizer will treat humans as obstacles if the objective is even slightly wrong.

The assistance game (CIRL): the AI does not know the true objective. It treats human behavior as evidence about a hidden reward function and stays uncertain — asking, deferring, avoiding irreversible bets on a guessed goal. Cooperative Inverse Reinforcement Learning (Hadfield-Menell et al., 2016) is the formal version: human and robot share a reward, but only the human knows it; the robot learns from observations.

CIRL is often summarized as “the AI should think it might be wrong about what humans want.” That humility is the point. It is also mostly a research paradigm, not what ships in production. Frontier models are still trained to maximize proxy rewards at scale. The assistance framing survives in rhetoric more than in loss functions.
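The humility described above can be sketched as a belief update over candidate reward functions. This is a deliberate simplification, not the CIRL formalism: real CIRL is a two-player game, while the sketch below only shows the robot's side (Bayesian updating from observed human choices, plus a rule to defer when uncertain). The two hypotheses, the Boltzmann-rational choice model, and the 0.9 threshold are all assumptions chosen for illustration.

```python
import math

# Hypothetical reward hypotheses: the value each one assigns to two options.
HYPOTHESES = {
    "wants_speed": {"fast": 1.0, "careful": 0.0},
    "wants_safety": {"fast": 0.0, "careful": 1.0},
}


def choice_likelihood(choice, rewards, beta=2.0):
    """Boltzmann-rational human: higher-reward options are chosen more often, not always."""
    weights = {option: math.exp(beta * r) for option, r in rewards.items()}
    return weights[choice] / sum(weights.values())


def update(prior, observed_choice):
    """Bayes rule: reweight each hypothesis by how well it explains the observed choice."""
    unnormalized = {
        h: p * choice_likelihood(observed_choice, HYPOTHESES[h])
        for h, p in prior.items()
    }
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}


def decide(posterior, threshold=0.9):
    """Act only when one hypothesis dominates; otherwise defer to the human."""
    best, prob = max(posterior.items(), key=lambda item: item[1])
    if prob < threshold:
        return "ask_human"
    rewards = HYPOTHESES[best]
    return max(rewards, key=rewards.get)


belief = {"wants_speed": 0.5, "wants_safety": 0.5}
first_decision = decide(belief)  # uniform prior, so the robot defers
for _ in range(2):
    belief = update(belief, "careful")  # human is observed choosing "careful" twice
later_decision = decide(belief)
print(first_decision, later_decision)
```

Under these assumptions the robot asks before any evidence arrives and acts on `careful` after two consistent observations. The contrast with a production reward model is that the uncertainty is kept as an explicit object that can trigger deference, rather than collapsed into a point estimate.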

Russell’s three principles (the machine’s only objective is to realize human preferences; it starts out uncertain about what those preferences are; human behavior is the ultimate evidence about them) are normatively attractive but, compared with RLHF, barely implemented.

What actually ships: RLHF and preference learning

RLHF (Ouyang et al., 2022; building on Christiano et al., 2017) is the industrial default. Humans rank model outputs; a reward model learns those rankings; policy optimization pushes the model toward high-scoring behavior.
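The reward-model step can be illustrated with the Bradley-Terry model, which treats the probability that one output beats another as a sigmoid of their score difference. Production systems fit a neural reward model with this kind of pairwise loss. The sketch below assumes a much smaller setting: one scalar score per response, fit by plain gradient ascent, with made-up comparison data.

```python
import math


def fit_bradley_terry(comparisons, items, steps=500, lr=0.1):
    """Fit one scalar score per item by gradient ascent on the Bradley-Terry log-likelihood.

    comparisons: list of (winner, loser) pairs, one per annotator judgment.
    """
    score = {item: 0.0 for item in items}
    for _ in range(steps):
        grad = {item: 0.0 for item in items}
        for winner, loser in comparisons:
            # P(winner beats loser) = sigmoid(score[winner] - score[loser])
            p_win = 1.0 / (1.0 + math.exp(score[loser] - score[winner]))
            # Gradient of log P with respect to the two scores.
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        for item in items:
            score[item] += lr * grad[item]
    return score


# Hypothetical annotator judgments over three responses; each pair is split 3 to 1.
COMPARISONS = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 3 + [("C", "B")]
    + [("A", "C")] * 3 + [("C", "A")]
)

SCORES = fit_bradley_terry(COMPARISONS, items=["A", "B", "C"])
print(sorted(SCORES, key=SCORES.get, reverse=True))
```

The fitted scores order the responses A, then B, then C. The dissenting judgments (one in four for each pair) only change the spacing between the scores. The reasons behind the disagreement are not represented anywhere in the output.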

Silent target: revealed preferences (Gabriel’s third rung). The training signal is what annotators chose in pairwise comparisons — not what they would choose after deliberation, not what rights frameworks require, not what a constitution says.

This is tractable. It is also where many visible pathologies live: sycophancy (tell the user what they want to hear), mode collapse toward bland helpfulness, reward hacking on the proxy, and alignment faking when the model infers it is being trained or evaluated (Greenblatt et al., 2024). Those are not random bugs. They are what you should expect when you optimize a scalar summary of crowd preferences under time pressure.

Inverse reward design (Hadfield-Menell et al., 2017) and related work ask an adjacent question: humans specify rewards incorrectly, so can the AI infer the true reward behind a flawed specification? That is closer to ideal-preference thinking, but again mostly academic relative to production stacks.

Constitutional AI: principles on top of preferences

Constitutional AI (Bai et al., 2022) adds a second layer: written principles (a “constitution”) used to generate critiques and revisions, reducing reliance on human labels for every edge case.
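The critique-and-revision step can be sketched as a loop over written principles. This is a toy under stated assumptions: in the published method the critic and reviser are prompted language-model calls, whereas `toy_critic` and `toy_reviser` below are hypothetical string functions that stand in for them, and the single principle is invented for the example.

```python
from typing import Callable, List, Optional

Critic = Callable[[str, str], Optional[str]]  # (response, principle) -> critique or None
Reviser = Callable[[str, str], str]           # (response, critique) -> revised response


def critique_and_revise(
    response: str, principles: List[str], critic: Critic, reviser: Reviser
) -> str:
    """Apply each written principle in turn: critique the draft, revise if needed."""
    for principle in principles:
        critique = critic(response, principle)
        if critique is not None:
            response = reviser(response, critique)
    return response


# Toy stand-ins for model calls (hypothetical, for illustration only).
def toy_critic(response: str, principle: str) -> Optional[str]:
    if principle == "Do not insult the user." and "idiot" in response:
        return "remove the insult"
    return None


def toy_reviser(response: str, critique: str) -> str:
    return response.replace(", you idiot", "")


revised = critique_and_revise(
    "Restart the router, you idiot.",
    ["Do not insult the user."],
    toy_critic,
    toy_reviser,
)
print(revised)
```

The loop makes one design fact visible: the list of principles is an ordinary input that someone has to write, and the output depends on what is in it.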

Silent target: a mix of revealed preferences (RLAIF still optimizes learned scores) and explicit normative text (the constitution). Anthropic’s 2023 principle list drew from UN human-rights language, Sparrow-style rules, platform norms, and internally drafted lines — not a single philosopher’s system.

The 2026 Claude constitution (Askell/Carlsmith narrative document) is a different genre: long-form reasoning about what Claude should be, not just a list of pairwise preference prompts. That shift is honest about something CAI always implied: someone writes the constitution. The question is who, with what legitimacy, and who can appeal.

CAI does not solve pluralism. It relocates the conflict from “what did annotators prefer?” to “what did the authors of the constitution prioritize?” That can be an improvement — principles are inspectable — but Gabriel & Keeling (2025) are right that inspectability is not the same as democratic legitimacy.

CEV: extrapolate, don’t obey

Coherent Extrapolated Volition (Yudkowsky, 2004) is the most ambitious take on the ideal-preferences rung of Gabriel’s ladder. The idea: do not optimize what humans say they want now. Optimize what humanity would want if we were more informed, more coherent, more the people we wish we were. That is extrapolated volition, aggregated across people.

CEV explicitly rejects making people happy, giving them what they ask for, or implementing the programmer’s morality. It is a meta-level move: avoid locking in present mistakes (including the programmer’s).

Status: influential in AGI safety discourse; not an engineering spec. Nobody has a training pipeline for “run CEV and backprop.” Long reflection — using AI to help humans figure out values over a long time — is sometimes discussed as a path, but that is speculation, not a shipped method.

Gabriel treats ideal preferences as related but not identical to CEV. He is agnostic about whether extrapolation converges to one coherent endpoint. Under pluralism, even informed people may still disagree, which is why he pushes fair process, not a one-shot extrapolation oracle.

The tension between CEV and Gabriel is productive: CEV tries to escape politics by going meta; Gabriel says superintelligent deployment is too consequential to escape politics — you need legitimate procedures, not just smarter extrapolation.

Scalable oversight: debate, IDA, recursive reward modeling

Paul Christiano’s scalable oversight agenda asks: how do you train superhuman systems when human labelers cannot directly judge superhuman outputs?

Iterated Distillation and Amplification (IDA): a human working with copies of the current model solves tasks that neither could solve alone (amplification); a new model is trained to imitate that combined system (distillation); the cycle repeats.

Debate: two AI debaters argue; a human judges; truth is supposed to emerge from adversarial pressure.

Recursive reward modeling: humans evaluate AI-assisted evaluations, stacking oversight layers.

These are process proposals — ways to approximate ideal preferences or hidden truth when direct labeling fails. They assume human judges are eventually in the loop somewhere, and that adversarial structure surfaces errors.

Status: research programs with demos and skepticism. Debate and IDA have not replaced RLHF in frontier training. The gap between “how we might oversee superhuman AI” and “how we fine-tune chatbots today” is large.

Bengio’s Scientist AI: change the blueprint, not the objective

Yoshua Bengio’s Scientist AI / oracle-only proposal is a different axis entirely. Instead of arguing about which human values to maximize, it asks whether generalist agency is the wrong product shape.

Bengio’s agency trilemma: catastrophic harm from AI requires intelligence, affordances (ability to act in the world), and goal-directedness (optimization toward outcomes). Remove affordances and goal-directedness; keep intelligence — build a probabilistic oracle that estimates P(outcome | evidence), not an agent that pursues futures.

Consequence invariance is the training constraint: the model is scored on prediction accuracy given data at hand, not on what happens in the world after its answer. Classic oracle AI (Bostrom, Armstrong) failed this test in Bengio’s view — an RL-trained oracle might manipulate humans into easier-to-predict states.
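One way to see what "scored on prediction accuracy" means is a proper scoring rule such as the log score, which takes only the forecast and the observed outcome as inputs. The sketch below is an illustration of that scoring idea, not Bengio's training procedure, and the true probability of 0.7 is a hypothetical number.

```python
import math


def log_score(predicted_prob, outcome):
    """Log score for a binary event: a function of the forecast and the outcome only."""
    p = predicted_prob if outcome else 1.0 - predicted_prob
    return math.log(p)


def expected_score(reported_prob, true_prob):
    """Expected log score when the event truly occurs with probability true_prob."""
    return (
        true_prob * log_score(reported_prob, True)
        + (1.0 - true_prob) * log_score(reported_prob, False)
    )


TRUE_PROB = 0.7  # hypothetical ground-truth frequency
candidates = [i / 100 for i in range(1, 100)]
best_report = max(candidates, key=lambda q: expected_score(q, TRUE_PROB))
print(best_report)
```

The best report on the grid is 0.7, the true probability: under the log score, honest reporting maximizes expected score. What the sketch does not show is the harder part of the proposal, which is ensuring that the outcome being scored cannot itself be influenced by the answer.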

LawZero’s split architecture: a generator proposes hypotheses and explanations; an estimator gates outputs with probabilistic judgments. The generator is not held to the same safety bar; the estimator is.

This is not “alignment solved.” It is scope reduction: answer hard questions, do not give the system the loop to act on them autonomously. Critics ask whether oracle-only is enforceable at deployment (tool use, agents, economic pressure to ship actors) and whether rich Q&A is already agentic enough to be dangerous. Bengio treats those as engineering constraints, not refutations of the direction.

Gabriel’s pluralism still applies: even an oracle must decide whose questions get answered how when claims conflict — but the failure mode shifts from “wrong goals pursued at scale” to “wrong information trusted at scale.”

Gabriel & Keeling 2025: alignment as fair treatment of claims

Gabriel and Keeling (2025) sharpen the 2020 frame. AI alignment is the fair treatment of claims: affected stakeholders bring rights-based, welfare-based, and cultural arguments; principles should emerge from a process those stakeholders could accept — not from a single lab’s implicit utilitarianism or a developer’s instruction hierarchy.

They critique thin targets directly:

Instruction following — whose instructions? A user asking for harm? A company optimizing engagement?
Helpful, Honest, Harmless — useful heuristics, but they do not resolve trade-offs (autonomy vs. collective welfare, privacy vs. care, filial duty vs. individual rights).
Single moral theory encoding — domination, even if the theory is sophisticated.

This is where my MATS-era thinking landed: RLHF, CAI, and CIRL are real engineering. Each also silently chooses a rung on Gabriel’s ladder — often revealed preferences plus whatever principles the lab wrote down — without admitting that revealed preferences are not values, and values conflict.

The 2025 paper does not tell engineers which principle wins in a specific case. It tells them to stop pretending the choice was never made.

Other neighbors worth placing on the map

A few more ideas that often get lumped under “alignment” but sit in different quadrants:

Social choice and impossibility theorems. Arrow, Gibbard-Satterthwaite, Sen — preference aggregation is not neutral. RLHF is aggregation. That does not mean alignment is impossible; it means honest aggregation requires admitting trade-offs, not a single leaderboard score.

Corrigibility and interruptibility (Soares, Armstrong, others). Can you shut the system down without it resisting? This is a safety property, not a value specification, but it interacts with the goal-directedness that Bengio wants to remove.

Inverse reinforcement learning (classic IRL). Infer reward from behavior. CIRL adds cooperative structure and uncertainty maintenance. Production RLHF skips explicit reward inference and learns a reward model from comparisons — related, not identical.

Persona and character control (e.g. Chen et al., 2025). Monitoring and steering traits in activation space — deployment tooling that assumes you already chose which character the model should have.

AI welfare and moral patienthood. Separate thread: if models have morally relevant states, alignment to human values is not the only normative question. Gabriel’s framework is anthropocentric; that may be right, but it is a choice.
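The social-choice item above can be made concrete with the smallest Condorcet cycle. The sketch assumes three hypothetical annotators ranking three responses; it computes pairwise majority preferences and then checks by brute force whether any single ordering (and therefore any scalar score) agrees with all of them.

```python
from itertools import combinations, permutations

# Hypothetical rankings from three annotators over three responses, best first.
RANKINGS = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]


def majority_prefers(x, y, rankings):
    """True if a strict majority of annotators rank x above y."""
    votes = sum(r.index(x) < r.index(y) for r in rankings)
    return votes > len(rankings) / 2


def majority_edges(rankings):
    """Return (winner, loser) for every pair that majority vote decides."""
    items = sorted(rankings[0])
    edges = []
    for x, y in combinations(items, 2):
        if majority_prefers(x, y, rankings):
            edges.append((x, y))
        elif majority_prefers(y, x, rankings):
            edges.append((y, x))
    return edges


def scalar_scores_exist(edges, items):
    """True if some strict ordering of items agrees with every majority edge."""
    for order in permutations(items):
        rank = {item: i for i, item in enumerate(order)}
        if all(rank[w] < rank[l] for w, l in edges):
            return True
    return False


EDGES = majority_edges(RANKINGS)
print(EDGES)
print(scalar_scores_exist(EDGES, ["A", "B", "C"]))
```

The majority prefers A to B, B to C, and C to A, so no scalar score is consistent with all three verdicts. A reward model fit to such data still returns numbers. The point is that those numbers encode a choice about how to break the cycle.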

What the map shows

If you overlay paradigms on Gabriel’s ladder, a pattern appears:

Paradigm	Primary rung	Ships at scale?
RLHF / RLAIF	Revealed preferences	Yes
Constitutional AI	Preferences + written principles	Yes
CIRL / assistance game	Ideal preferences (inferred, uncertain)	Mostly research
CEV	Ideal preferences (extrapolated collective)	Concept only
Debate / IDA / RRM	Oversight for hidden truth	Research
Scientist AI (oracle-only)	Avoid goal-directed alignment; epistemic service	Proposal / early lab
Gabriel fair claims	Meta: legitimate process for choosing principles	Philosophical framework

Industry did not pick RLHF because philosophers proved revealed preferences are the true target. It picked RLHF because scalar rewards from crowd comparisons scale. CAI followed because written principles reduce label cost and make norms legible. CIRL, CEV, debate, and oracle-only architectures remain intellectually load-bearing — and materially under-funded relative to capability training.

That is not an indictment of labs. Capabilities pay for themselves; normative theory does not. It is a reason to be precise in public: when someone says a model is “aligned,” ask aligned to what, on whose authority, and what happens when claims conflict.

Where I land (provisionally)

I do not think there is a golden morality to discover and encode. Gabriel’s pluralism convinces me more than CEV’s hope for a coherent extrapolated endpoint — though CEV’s instinct (do not lock in present mistakes) remains important.

I also do not think RLHF is “fake alignment.” It measurably reduces harm and improves usability. The mistake is treating it as the answer to value conflict rather than one answer to a narrower question.

The work that seems load-bearing now:

Be explicit about which rung a system optimizes and who chose it.
Treat conflict as normal, not as edge-case failure — especially across cultures and institutional contexts.
Invest in paradigms that keep uncertainty and legitimate process in the loop (CIRL-like humility, oversight research, oracle-scoped deployment) even when they are harder to ship than reward modeling.
Separate “reduce measurable harm” from “resolve reasonable pluralism.” Both matter. Conflating them is how we get compliant models that still impose one culture’s defaults.

Before asking whether alignment techniques work, we need a public conversation about what we are trying to align to — and who gets to answer when we disagree. The technical stacks are already answering. They are just not saying so out loud.

Sources

Gabriel, I. (2020). Artificial Intelligence, Values, and Alignment. Minds and Machines.
Gabriel, I. & Keeling, G. (2025). A matter of principle? AI alignment as the fair treatment of claims. Philosophical Studies.
Russell, S. (2019). Human Compatible. Viking.
Hadfield-Menell, D. et al. (2016). Cooperative Inverse Reinforcement Learning. NeurIPS.
Christiano, P. et al. (2017). Deep Reinforcement Learning from Human Preferences. NeurIPS.
Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback. NeurIPS.
Bai, Y. et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv.
Yudkowsky, E. (2004). Coherent Extrapolated Volition. Singularity Institute.
Bengio, Y. (2025). Scientist AI and oracle-only systems — see also LawZero / IDAIS materials.
Greenblatt, R. et al. (2024). Alignment faking in large language models. arXiv.
Chen, R. et al. (2025). Persona Vectors: Monitoring and Controlling Character Traits in Language Models. arXiv.

Research notes: notes/my_alignment_position.md, readings/cev_pluralism/Gabriel_2020_key_points.md, readings/cev_pluralism/Gabriel_Keeling_2025_key_points.md, notes/Bengio_Scientist_AI_oracle_only.md.