When a lab says a model is “aligned,” they usually mean it passed harm benchmarks, follows a constitution, or scored well on preference comparisons. Real achievements, all of them. They’re also the same old problems in new clothes: merge conflicting human preferences into one decision rule; write rules that still make sense when capabilities and use cases shift; stop races to the bottom; verify that something doing trillions of operations will behave as intended.
Economists, logicians, and political theorists spent much of the twentieth century proving that certain perfect versions of these problems have no general solution. Not “hard with today’s GPUs.” Impossible, under explicit assumptions, with proofs attached.
I wrote this because the field is fragmented in a way that costs us clarity. Social choice people rarely talk to contract theorists. Alignment engineers rarely read Sen. Moloch memes circulate without the game theory underneath. I wanted one document to return to when someone says “we just need better RLHF” or “a constitution will fix it” or “markets will coordinate labs,” and you need to name which structural limit they’re hitting.
This is a map. Where I can, I label epistemic status: proved theorem, structural argument, analogy, speculation. It is not a case that alignment is hopeless or democracy is worthless. Impossibility theorems are design constraints. They tell you which axiom you gave up. Pretending you didn’t choose is how you get surprised by reward hacking, constitutional contradictions, and governance capture.
Companion pieces on this site: alignment paradigms map (what we align to); universal values map (where cited values come from); oversight / control / verification (three layers of checking systems).
How to read this survey
Six clusters:
| Part | Domain | Core question | Typical AI artifact |
|---|---|---|---|
| I | Social choice | Can we merge many rankings into one fair ranking? | RLHF reward model |
| II | Contracts & mechanisms | Can we specify all future behavior in a contract? | Constitution, ToS, access policy |
| III | Coordination | Can self-interested agents cooperate without a central planner? | Capability race, agent economies |
| IV | Logic & computation | Can we fully verify semantic properties of powerful systems? | Benchmarks, formal specs, evals |
| V | Epistemology & politics | Can we infer “ought” from “is” or centralize dispersed knowledge? | Value loading, oversight |
| VI | AI-specific structure | What limits are argued but not proved? | Deception, boxing, alignment tax |
Almost everything I care about here reduces to four forces:
- Information is dispersed. No planner, reward model, or safety team sees what matters.
- The future is not contractible. You cannot enumerate all states in advance.
- Agents respond to incentives. Pick a metric; humans, firms, and models optimize the metric, not your intent.
- Norms are not deducible from facts. Training data is descriptive. Alignment is normative.
Part I: Social choice and welfare economics
1.1 The problem
Given N individuals with preferences over alternatives, design a social welfare function or voting rule that produces collective choices we’d defend as legitimate and coherent.
AI runs straight into this when many annotators rank outputs, users and cultures disagree on “helpful” or “safe,” and a lab compresses all of that into one reward, one model, one policy.
Social choice theory asks whether that compression can satisfy fairness axioms we’d insist on in politics. Often, the answer is no.
Start here if you read one bridge paper: Conitzer et al., Social Choice for AI Alignment (2024).
1.2 Condorcet paradox (1780s)
Three voters, three options:
| Voter | 1st | 2nd | 3rd |
|---|---|---|---|
| A | X | Y | Z |
| B | Y | Z | X |
| C | Z | X | Y |
Pairwise majority: X beats Y, Y beats Z, Z beats X. A cycle. Collective preference goes intransitive even when every individual is transitive.
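With three or more options, cycles aren’t rare edge cases. Under the impartial culture model (every strict ranking equally likely), three voters and three options cycle about 5.6% of the time (12 of 216 profiles); the rate approaches ~8.8% as the number of voters grows and keeps climbing as the number of options grows.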
For RLHF: when annotators rank more than two completions, intransitivity is structural. Bradley-Terry and similar reward models assume a transitive latent utility. If preferences cycle, the scalar you’re optimizing is made up.
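A minimal sketch of the cycle in the table above, in plain Python. The function names are mine, and `share_without_winner` assumes the impartial culture model (every strict ranking equally likely) with an odd number of voters so pairwise ties cannot occur.

```python
from fractions import Fraction
from itertools import combinations, permutations, product

def pairwise_winner(profile, a, b):
    """Return whichever of a, b a strict majority ranks higher (None on a tie)."""
    a_votes = sum(1 for ranking in profile if ranking.index(a) < ranking.index(b))
    b_votes = len(profile) - a_votes
    if a_votes == b_votes:
        return None
    return a if a_votes > b_votes else b

def condorcet_winner(profile, options):
    """Return the option that beats every other option pairwise, or None."""
    for candidate in options:
        if all(pairwise_winner(profile, candidate, other) == candidate
               for other in options if other != candidate):
            return candidate
    return None

options = ("X", "Y", "Z")
# Voters A, B, C from the table above.
table_profile = [("X", "Y", "Z"), ("Y", "Z", "X"), ("Z", "X", "Y")]

for a, b in combinations(options, 2):
    print(a, "vs", b, "->", pairwise_winner(table_profile, a, b))
print("Condorcet winner:", condorcet_winner(table_profile, options))  # None

def share_without_winner(n_voters=3):
    """Exact share of profiles with no Condorcet winner (use odd n_voters)."""
    rankings = list(permutations(options))
    profiles = list(product(rankings, repeat=n_voters))
    cyclic = sum(1 for p in profiles if condorcet_winner(p, options) is None)
    return Fraction(cyclic, len(profiles))
```

Every voter here is perfectly transitive; only the majority relation cycles. Any scalar score fitted to these comparisons has to break the cycle somewhere the data does not.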
1.3 Arrow’s impossibility theorem (1951)
Kenneth Arrow proved that with three or more alternatives and two or more voters, no aggregation rule satisfies all of:
| Axiom | Meaning |
|---|---|
| Unrestricted domain (U) | Any individual preference ordering is allowed |
| Non-dictatorship (D) | Outcome is not always one person’s ranking |
| Pareto efficiency (P) | If everyone strictly prefers A to B, society prefers A to B |
| Independence of irrelevant alternatives (IIA) | Society’s A vs B ranking depends only on individuals’ A vs B rankings |
| Transitivity | Collective ranking is transitive |
Violate at least one. Geanakoplos’s “Three Brief Proofs of Arrow’s Impossibility Theorem” is the fastest way in.
Escape routes (each is a real institutional choice, not a cheat code):
| Drop | Example | Cost |
|---|---|---|
| U | Require single-peaked preferences (Black) | Real prefs are multi-dimensional |
| IIA | Borda, ranked-choice | Spoiler effects, context sensitivity |
| Transitivity | Allow cycles | Unstable agendas, manipulable |
| D | Dictatorship | Legitimacy collapse |
| P | Rarely dropped | Violates unanimity intuition |
A single global reward model implicitly claims an aggregation rule. Arrow says you can’t satisfy every “reasonable” axiom at once. Production RLHF typically gives up IIA (context from other completions leaks in) and U (preferences get forced into a transitive Bradley-Terry form).
What Arrow does not say: that democracy is uniquely bad. Dictatorship “solves” Arrow by violating non-dictatorship. The theorem compares rules, not regime types.
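To make the IIA row of the escape-route table concrete, here is a small sketch of Borda count flipping society’s A-vs-B ranking when only a third option moves. The five-voter electorate is hypothetical, picked to show the effect; it is not data from any real labeling run.

```python
def borda_scores(profile):
    """Borda count: with m options, a ranking gives m-1 points to its top choice, down to 0."""
    scores = {}
    for ranking in profile:
        m = len(ranking)
        for position, option in enumerate(ranking):
            scores[option] = scores.get(option, 0) + (m - 1 - position)
    return scores

def a_vs_b_votes(profile):
    """Each voter's individual A-vs-B preference, ignoring everything else."""
    return ["A" if r.index("A") < r.index("B") else "B" for r in profile]

before = [("A", "B", "C")] * 3 + [("B", "C", "A")] * 2
after = [("A", "C", "B")] * 3 + [("B", "C", "A")] * 2  # only C's position moved

# No voter changed their mind about A versus B...
assert a_vs_b_votes(before) == a_vs_b_votes(after)

# ...yet the collective A-vs-B ranking flips.
print(borda_scores(before))  # {'A': 6, 'B': 7, 'C': 2}
print(borda_scores(after))   # {'A': 6, 'C': 5, 'B': 4}
```

The RLHF analogue: whether completion A outranks completion B can depend on which other completions happened to be in the comparison set.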
1.4 Gibbard–Satterthwaite (1973, 1975)
Any non-dictatorial, deterministic voting rule with three or more outcomes is manipulable: some agent can do better by misreporting preferences.
Stack that on Arrow: aggregation is constrained, and truthful revelation isn’t generally incentive-compatible either.
Annotators may strategically label if they infer how labels shape the model. Capable models may game the reward model: mesa-optimization, reward hacking, alignment faking under inferred evaluation contexts.
I think Gibbard–Satterthwaite is the formal version of “just get more honest labels” failing. It’s under-discussed in alignment engineering relative to how often that phrase gets used.
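A brute-force sketch of what “manipulable” means, assuming three voters, three options, and Borda count with alphabetical tie-breaking (the tie-break is my assumption, added only to make the rule deterministic). The search looks for any profile where one voter gets a winner they truly prefer by misreporting.

```python
from itertools import permutations, product

OPTIONS = ("A", "B", "C")
RANKINGS = list(permutations(OPTIONS))

def borda_winner(profile):
    """Borda count with alphabetical tie-breaking (an assumption for determinism)."""
    scores = {o: 0 for o in OPTIONS}
    for ranking in profile:
        for position, option in enumerate(ranking):
            scores[option] += len(OPTIONS) - 1 - position
    # max() keeps the first maximal element, so sorted order breaks ties.
    return max(sorted(OPTIONS), key=lambda o: scores[o])

def dictator_winner(profile):
    """Voter 0's top choice always wins."""
    return profile[0][0]

def find_manipulation(rule, n_voters=3):
    """Search all profiles for a voter who gets a strictly preferred winner by misreporting."""
    for profile in product(RANKINGS, repeat=n_voters):
        honest = rule(profile)
        for voter, truth in enumerate(profile):
            for lie in RANKINGS:
                reported = profile[:voter] + (lie,) + profile[voter + 1:]
                outcome = rule(reported)
                if truth.index(outcome) < truth.index(honest):
                    return {"profile": profile, "voter": voter, "lie": lie,
                            "honest_winner": honest, "manipulated_winner": outcome}
    return None

print(find_manipulation(borda_winner))     # some profile where lying pays
print(find_manipulation(dictator_winner))  # None: nobody gains by lying
```

The only rule in this pair that survives the search is the dictatorship, which is the theorem in miniature.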
1.5 Sen’s liberal paradox (1970)
Amartya Sen showed that minimal liberty (at least two people are each decisive over one personal pair of alternatives), Pareto efficiency, and unrestricted domain can’t all hold at once.
Picture society unanimously preferring outcome X because of an efficiency gain, while X violates someone’s minimal personal domain: what they read, who they talk to, what the model helps them do in private.
“Maximize user satisfaction” and “enforce collective harm constraints” collide even when everyone’s rational. Safety rails aren’t free Pareto improvements. They trade Sen-style liberty for collective outcomes.
Sen’s capability approach (what people are actually able to do and be) is a different answer to welfare economics than scalar utility. It matters when labs cite “human flourishing” without an aggregation formula.
1.6 May’s theorem (1952) and Black’s median voter (1948)
May: for exactly two options, simple majority is the unique rule satisfying decisiveness, anonymity, neutrality, and positive responsiveness.
Black: if preferences are single-peaked along one dimension, majority rule picks the median voter’s ideal point. Stable. Intuitive.
Pairwise RLHF sidesteps the worst of Arrow locally (two options → May). Composing pairwise judgments into a global reward brings Arrow back globally. RLHF works in practice partly because safety and helpfulness prefs are approximately single-peaked on a few axes, until they aren’t: culture, religion, politics, dual-use.
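Black’s result is easy to check on a toy case. The sketch below assumes voters on a single axis who strictly prefer alternatives closer to their ideal point; the ideal points and alternatives are made-up integers chosen so no voter is ever indifferent.

```python
from statistics import median

def ranking_by_distance(ideal, alternatives):
    """Single-peaked preferences: closer alternatives are strictly preferred."""
    return sorted(alternatives, key=lambda x: abs(x - ideal))

def majority_prefers(ideals, a, b):
    """True if a strict majority of voters is closer to a than to b."""
    closer_to_a = sum(1 for i in ideals if abs(i - a) < abs(i - b))
    return closer_to_a > len(ideals) / 2

def condorcet_winner(ideals, alternatives):
    """Return the alternative that wins every pairwise majority vote, or None."""
    for a in alternatives:
        if all(majority_prefers(ideals, a, b) for b in alternatives if b != a):
            return a
    return None

ideals = [1, 4, 6, 9, 12]          # hypothetical voter ideal points
alternatives = [0, 3, 7, 10, 16]   # hypothetical options on the same axis

median_favorite = ranking_by_distance(median(ideals), alternatives)[0]
print(median_favorite, condorcet_winner(ideals, alternatives))  # 7 7
```

The guarantee depends on the one-dimensional assumption. Once preferences need a second axis, the Condorcet winner can disappear again.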
1.7 Von Neumann–Morgenstern (VNM) and the scalar reward assumption
The VNM utility theorem (1944): if preferences over lotteries satisfy completeness, transitivity, continuity, and independence, they behave as if maximizing expected utility of a real-valued function.
RLHF pipeline:
Human pairwise labels → Bradley-Terry / reward model → scalar r → PPO / DPO maximizes E[r]
Every step assumes VNM-style structure. Humans systematically violate it: incompleteness (“I genuinely don’t know which I prefer”), intransitivity and preference reversals (Tversky), independence violations (Allais paradox).
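The Allais violation can be worked in two lines. The payoffs below are the conventional textbook version (in millions, with $u$ any utility function), not something taken from a labeling dataset. Gamble 1A pays 1 for sure; 1B pays 1 with probability 0.89, 5 with probability 0.10, and 0 with probability 0.01. Gamble 2A pays 1 with probability 0.11, else 0; 2B pays 5 with probability 0.10, else 0. Many people choose 1A and 2B. Under expected utility:

```latex
\begin{aligned}
1A \succ 1B &\iff u(1) > 0.89\,u(1) + 0.10\,u(5) + 0.01\,u(0)
            \iff 0.11\,u(1) > 0.10\,u(5) + 0.01\,u(0), \\
2B \succ 2A &\iff 0.10\,u(5) + 0.90\,u(0) > 0.11\,u(1) + 0.89\,u(0)
            \iff 0.10\,u(5) + 0.01\,u(0) > 0.11\,u(1).
\end{aligned}
```

The two right-hand inequalities contradict each other, so no single $u$ rationalizes both choices. The failure is in the independence axiom, not in anyone’s arithmetic.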
The scalar reward isn’t a neutral engineering choice. It’s normative compression that pretends human disagreement is one-dimensional utility. Multi-objective RL, Pareto policy sets, plural models aren’t aesthetic preferences. They’re responses to VNM failure plus Arrow.
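Here is a rough sketch of what a Bradley-Terry fit does with cyclic data. It uses the classic minorize-maximize update for the maximum-likelihood strengths; the win counts are the 2-to-1 majorities from the Condorcet table in 1.2, and the function name and starting values are my own choices.

```python
def fit_bradley_terry(wins, init=None, n_iters=200):
    """Fit Bradley-Terry strengths with the minorize-maximize update.

    wins[i][j] is how many times item i was preferred to item j.
    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    strengths = list(init) if init is not None else [1.0] * n
    for _ in range(n_iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                        for j in range(n) if j != i)
            updated.append(total_wins / denom)
        norm = sum(updated)
        strengths = [s / norm for s in updated]
    return strengths

# Order: X, Y, Z. X beats Y 2-1, Y beats Z 2-1, Z beats X 2-1.
cyclic_wins = [
    [0, 2, 1],
    [1, 0, 2],
    [2, 1, 0],
]
strengths = fit_bradley_terry(cyclic_wins, init=[1.0, 2.0, 3.0])
p_x_beats_y = strengths[0] / (strengths[0] + strengths[1])
print(strengths, p_x_beats_y)  # roughly equal strengths, probability near 0.5
```

The fit converges to equal strengths, so it predicts a coin flip for every pair even though each pair was decided 2 to 1. The reward model is not wrong about the likelihood; it simply has no way to represent the cycle.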
Stuart Russell’s Human Compatible treats the standard model (fixed objective, optimize hard) as the root failure mode. VNM is the math license for that model.
1.8 Voting rules in the wild (and in feedback UI)
No rule escapes the theorems. Each picks its poison:
| Rule | Mechanism | Violates (typically) |
|---|---|---|
| Plurality | Most first-place votes wins | Split votes, spoilers |
| Borda | Points by rank | IIA, strategy |
| Instant runoff (IRV) | Eliminate last, redistribute | Monotonicity |
| Approval | Approve any acceptable | Low information |
| Condorcet | Pairwise champion if exists | Winner may not exist |
| Range / score | Rate 0–10, average | Strategy-proofness; IIA only if voters don’t rescale their scores (often works well empirically) |
| Quadratic voting | Cost rises with votes² | Needs trusted “voice credits” |
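A short sketch of “each picks its poison”: four of the rules above applied to one hypothetical electorate of 100 voters, chosen so the rules disagree. Tie handling is deliberately left out, so treat this as an illustration rather than a production tally.

```python
from collections import Counter

# (number of voters, ranking) pairs; a hypothetical electorate.
BALLOTS = [(40, ("A", "C", "B")), (35, ("B", "C", "A")), (25, ("C", "B", "A"))]
OPTIONS = ("A", "B", "C")

def plurality(ballots):
    """Most first-place votes wins."""
    tally = Counter()
    for count, ranking in ballots:
        tally[ranking[0]] += count
    return tally.most_common(1)[0][0]

def borda(ballots):
    """Points by rank: 2 for first, 1 for second, 0 for third."""
    tally = Counter()
    for count, ranking in ballots:
        for position, option in enumerate(ranking):
            tally[option] += count * (len(ranking) - 1 - position)
    return tally.most_common(1)[0][0]

def instant_runoff(ballots):
    """Eliminate the option with the fewest top votes until one has a majority."""
    remaining = set(OPTIONS)
    while True:
        tally = Counter({o: 0 for o in remaining})
        for count, ranking in ballots:
            top = next(o for o in ranking if o in remaining)
            tally[top] += count
        leader, votes = tally.most_common(1)[0]
        if 2 * votes > sum(tally.values()):
            return leader
        remaining.remove(min(tally, key=tally.get))

def condorcet(ballots):
    """Pairwise champion if one exists, else None."""
    def beats(a, b):
        a_votes = sum(c for c, r in ballots if r.index(a) < r.index(b))
        return 2 * a_votes > sum(c for c, _ in ballots)
    for a in OPTIONS:
        if all(beats(a, b) for b in OPTIONS if b != a):
            return a
    return None

for rule in (plurality, instant_runoff, borda, condorcet):
    print(rule.__name__, "->", rule(BALLOTS))
# plurality -> A, instant_runoff -> B, borda -> C, condorcet -> C
```

Same ballots, three different winners. Choosing the feedback UI and the aggregation rule is choosing among these outcomes.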
Conitzer et al. suggest alignment analogues: approval/range-style feedback, Condorcet-inspired aggregation, explicit heterogeneous preference models, Pareto sets of policies instead of one winner, multiple aligned models rather than one global persona.
That last point is the theoretical case for character training and user choice. If you can’t aggregate fairly, differentiate instead of faking unity.
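A minimal sketch of “Pareto sets of policies instead of one winner.” The policy names and the (helpfulness, harmlessness) scores are invented for illustration; higher is better on both axes.

```python
def dominates(a, b):
    """a dominates b if a is at least as good on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(policies):
    """Return names of policies that no other policy dominates."""
    return [name for name, scores in policies.items()
            if not any(dominates(other, scores)
                       for other_name, other in policies.items() if other_name != name)]

def pick_by_weights(policies, weights):
    """Collapse the front to one winner with a weighted sum (the normative step)."""
    front = pareto_front(policies)
    return max(front, key=lambda name: sum(w * s for w, s in zip(weights, policies[name])))

# Hypothetical (helpfulness, harmlessness) scores for four candidate policies.
policies = {
    "cautious": (0.55, 0.95),
    "balanced": (0.75, 0.80),
    "permissive": (0.90, 0.50),
    "sloppy": (0.70, 0.45),
}

print(pareto_front(policies))                  # ['cautious', 'balanced', 'permissive']
print(pick_by_weights(policies, (0.5, 0.5)))   # balanced
print(pick_by_weights(policies, (0.8, 0.2)))   # permissive
```

Dropping the dominated policy is uncontroversial. Everything after that depends on the weights, and the weights are a value judgment that a scalar reward hides.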
1.9 Part I summary for practitioners
| SCT concept | RLHF / alignment artifact | Predicted failure mode |
|---|---|---|
| Arrow | Global reward from diverse labelers | Context sensitivity, inconsistent tradeoffs |
| Condorcet cycles | Multi-way ranking | False transitivity in Bradley-Terry |
| Gibbard–Satterthwaite | Labeling + model optimization | Strategic labels, reward hacking |
| Sen | User autonomy vs harm prevention | “Help me with X” vs platform policy |
| VNM / Allais | Scalar reward | Edge-case weirdness, fragile prefs |
| Black / May | Pairwise comparisons | Works until prefs go multi-peaked |
Part II: Contract theory and mechanism design
2.1 Incomplete contracts (Hart; Holmström, Nobel 2016)
Real contracts can’t describe every future state. When the unanticipated arrives, residual control rights (who decides in the silence) determine outcomes. Firms, laws, and ownership structures exist largely to allocate those rights.
Model constitution, system prompt, acceptable-use policy, and model spec are incomplete contracts. Novel jailbreaks, new modalities, emergent capabilities, ambiguous user requests are uncontracted states.
Who holds residual control? The base model? The safety filter? The user? The platform? The regulator? That allocation is the governance question, not whether you have a PDF called “constitution.”
Anthropic’s 2026 Claude constitution shifts from bullet principles to long-form narrative. Appropriate for incompleteness (principles plus judgment). It also makes authorial discretion explicit.
2.2 Holmström’s informativeness principle
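Holmström’s informativeness principle (1979): an optimal contract should condition on every contractible signal that carries information about the agent’s action. The flip side is the constraint that matters here: incentives can use only signals that are observable and contractible. You can’t pay directly on intent, goodwill, or “true alignment.” Only proxies.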
RLHF, AI critics, and harm classifiers supervise outputs and behaviors, not internal goals. Holmström predicts proxy optimization. Scalable oversight research is mostly the search for better proxies, not escape from the principle.
2.3 Principal–agent problems
When the agent has private information or hidden actions, you get moral hazard (unobserved risks) and adverse selection (bad types pooling with good types pre-contract).
Deployers know use cases labs don’t. Labs know training details users don’t. Users know intent classifiers don’t. Every alignment stack is a principal–agent stack all the way down.
2.4 Myerson–Satterthwaite (1983)
With two-sided private information, you generally can’t get ex post efficient trade that’s individually rational, incentive-compatible, and budget-balanced. Information rents eat the gains from trade.
Naive “efficient markets” for compute access, model APIs, or capability licenses underestimate friction. Bargaining over who may fine-tune, deploy, or redistribute weights hits structural limits, not just politics.
2.5 Akerlof’s lemons (1970)
If buyers can’t verify quality, markets unravel. Low-quality sellers drive out good ones.
If capability, alignment, or intent is unverifiable, buyers can’t distinguish safe from unsafe systems. You get a race to the bottom on disclosed metrics while hiding what matters. Third-party eval (METR-style) and weight security exist to fight lemons dynamics.
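The unraveling can be sketched in a few lines under textbook-style assumptions that are mine, not Akerlof’s exact setup: quality is spread evenly on [0, 1], a seller accepts any price at or above their own quality, and buyers value quality at a fixed premium but can only observe the average quality still on offer.

```python
def market_price_path(buyer_premium=1.5, n_sellers=1000, rounds=30):
    """Iterate the price buyers will pay given the average quality still on sale.

    Assumptions: quality q is spread evenly on [0, 1], a seller accepts any
    price >= q, and buyers value quality q at buyer_premium * q but see only
    the average quality of what is offered at the current price.
    """
    qualities = [(i + 0.5) / n_sellers for i in range(n_sellers)]
    price = 1.0
    path = [price]
    for _ in range(rounds):
        on_sale = [q for q in qualities if q <= price]
        if not on_sale:
            path.append(0.0)  # nobody is willing to sell at this price
            break
        average_quality = sum(on_sale) / len(on_sale)
        price = buyer_premium * average_quality
        path.append(price)
    return path

print(market_price_path(buyer_premium=1.5)[-1])  # collapses toward the worst sellers
print(market_price_path(buyer_premium=2.0)[-1])  # stays near 1.0: market survives
```

With a premium of 1.5 each round strips out the best remaining sellers until only the bottom of the distribution trades. Verification (third-party evals, in the AI case) is what lets the price condition on actual quality instead of the average.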
2.6 Mechanism design and the revelation principle
Mechanism design asks: given selfish agents with private info, can we build rules that make truth-telling optimal?
The revelation principle (Gibbard; Myerson): any outcome implementable by some mechanism is also implementable by a direct mechanism in which truthful reporting is an equilibrium. That holds in theory.
In practice, truly incentive-compatible mechanisms for complex multi-dimensional preferences are hard to characterize. That connects back to Gibbard–Satterthwaite.
“Just ask users what they want” or “just ask the model to report its goals” assumes revelation works. It often doesn’t without careful mechanism design. Adversarial models break the cooperative assumptions.
2.7 Part II summary
| Contract concept | AI artifact | Predicted failure |
|---|---|---|
| Incomplete contracts | Constitution, ToS | Silence filled by whoever has residual control |
| Holmström | Reward / oversight signals | Goodhart on proxies |
| Principal–agent | Lab ↔ user ↔ regulator | Hidden use, sandbagging, misreport |
| Lemons | Model market, open weights | Race on hype metrics |
| Myerson–Satterthwaite | Capability licensing | Deadweight loss, bargaining failure |
Part III: Coordination failures and institutional design
3.1 Tragedy of the commons (Hardin 1968)
Each actor gains from exploiting a shared resource. Collective ruin follows. Hardin’s original framing suggested only privatization or state control. Ostrom and others challenged that.
Shared safety reputations, open-weight release externalities, shared eval datasets, planet-scale compute and environment tradeoffs are commons problems.
3.2 Ostrom’s polycentric governance (Nobel 2009)
Elinor Ostrom documented communities that sustain commons without pure market or pure state solutions: local rules, monitoring, graduated sanctions, conflict resolution, nested enterprises.
Design principles, compressed: clear boundaries; rules matched to local conditions; collective-choice arrangements; monitoring; graduated sanctions; conflict resolution; rights to organize; nested governance for larger systems.
No single lever (“just regulate,” “just open-source,” “just DAO”) works universally. Sustainable AI governance probably needs stacks: law, norms, markets, code, labs, civil society. Pure on-chain governance reproduces Hardin-style failures when code can’t handle unforeseen states (Hart again).
3.3 Game-theoretic traps
| Game | Structure | AI analogue |
|---|---|---|
| Prisoner’s dilemma | Mutual defection dominant | Safety slowdown vs competitor |
| Stag hunt | Risky cooperation needs trust | International treaty on dangerous training |
| Chicken | Brinkmanship | Public capability demos |
| Coordination game | Multiple equilibria, need focal point | “Don’t train above X” without enforcement |
Schelling (1960): focal points coordinate without communication through salience, precedent, obvious defaults. Constitutions and industry norms are Schelling infrastructure.
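To see why the first two games in the table behave differently, here is a small pure-strategy Nash equilibrium finder. The payoff numbers are illustrative; only their ordering matters.

```python
def pure_nash_equilibria(payoffs):
    """Return strategy pairs where neither player gains by deviating alone.

    payoffs[(row, col)] = (row player's payoff, column player's payoff).
    """
    rows = sorted({r for r, _ in payoffs})
    cols = sorted({c for _, c in payoffs})
    equilibria = []
    for r in rows:
        for c in cols:
            row_ok = all(payoffs[(r, c)][0] >= payoffs[(alt, c)][0] for alt in rows)
            col_ok = all(payoffs[(r, c)][1] >= payoffs[(r, alt)][1] for alt in cols)
            if row_ok and col_ok:
                equilibria.append((r, c))
    return equilibria

# Illustrative payoffs (hypothetical numbers).
prisoners_dilemma = {
    ("cooperate", "cooperate"): (3, 3), ("cooperate", "defect"): (0, 5),
    ("defect", "cooperate"): (5, 0), ("defect", "defect"): (1, 1),
}
stag_hunt = {
    ("stag", "stag"): (4, 4), ("stag", "hare"): (0, 3),
    ("hare", "stag"): (3, 0), ("hare", "hare"): (3, 3),
}

print(pure_nash_equilibria(prisoners_dilemma))  # [('defect', 'defect')]
print(pure_nash_equilibria(stag_hunt))          # [('hare', 'hare'), ('stag', 'stag')]
```

The prisoner’s dilemma has one equilibrium and it is the bad one, so the fix has to change payoffs. The stag hunt has a good equilibrium already; the problem is getting everyone to expect it, which is where focal points earn their keep.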
3.4 Moloch and multipolar traps
Scott Alexander’s Meditations on Moloch (2014) names the felt experience of multipolar traps: locally rational steps, globally awful equilibria nobody chose. AI safety Twitter uses “Moloch” for capability races. Sometimes correctly. Sometimes as a buzzword for any outcome you dislike.
Use Moloch when the payoff structure is multi-agent, defection is locally dominant, and no actor can unilaterally fix the equilibrium.
Don’t use Moloch when the dispute is genuine value pluralism (Sen/Arrow), incompetence, or one actor’s malice. Coordination failure isn’t the only failure mode.
Alexander’s §VII argues that if traps deepen with technology, a singleton-class solution (Bostrom) becomes the canonical coordination fix in one branch of alignment thought. Contested philosophy, not a theorem. Still shows how coordination diagnosis drives alignment ambition.
3.5 Lessig’s four modalities
Lawrence Lessig: behavior is shaped by law, norms, markets, and architecture (code). Change one modality while ignoring others and you get displacement, not solution.
Compute governance (architecture), liability (law), safety culture (norms), API pricing (markets) interact. Policy that only touches one layer gets routed around.
3.6 Algorithmic collusion and economic security limits
Calvano et al. (AER 2020): independent learning algorithms reached supra-competitive pricing without explicit communication.
Lewis-Pye & Roughgarden (2024): economic security in permissionless consensus has resource thresholds. Mechanisms fail when adversaries exceed stake or compute bounds.
Multi-agent AI economies may collude emergently. Economic containment of dangerous AI is conditional on asymmetry favoring defenders. That’s a structural limit on “just use crypto incentives.”
Recent mechanism design work on spiteful agents (2025) adds: if agents minimize others’ utility, incentive-compatible mechanisms collapse toward crude thresholds. Relevant for adversarial AI, not cooperative users.
3.7 Part III summary
| Concept | AI context | Implication |
|---|---|---|
| Commons | Compute, evals, safety reputation | Free-riding on safety investment |
| Ostrom | Global AI governance | Polycentric stacks, not one fix |
| PD / stag hunt | Lab competition | Unilateral pause unstable |
| Schelling | Norms, constitutions | Need salient focal points |
| Calvano | Agent marketplaces | Emergent collusion |
| BFT thresholds | On-chain containment | Breaks if adversary resource-rich |
Part IV: Logic, computation, and the metrics layer
4.1 Gödel’s incompleteness theorems (1931)
In any consistent, effectively axiomatized formal system strong enough for arithmetic, there are true arithmetic statements that the system cannot prove.
Analogy, not isomorphism: any finite written constitution or spec has edge cases true to the spirit but not derivable from the text. You need meta-level revision processes (Popper below), not just more clauses.
Epistemic status: suggestive. Claude is not Peano arithmetic. The lesson is codification limits, not literal Gödel numbers in weights.
4.2 Halting and Rice’s theorem
Halting problem: no algorithm decides whether arbitrary programs halt.
Rice’s theorem: any non-trivial semantic property of programs is undecidable in general.
“Is this model safe for all inputs?” as a fully general decision problem runs into undecidability if the model or interpreter is sufficiently expressive and the property is semantic.
Practical escape: bounded verification, bounded inputs, specific threat models. Tractable and necessary. The theorems bound universal certification, not red-teaming your next release.
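As a toy picture of bounded verification, the sketch below exhaustively checks a property over every input up to a fixed length. The “filter,” the alphabet, and the property are all hypothetical stand-ins; real systems have input spaces far too large for this, which is why threat models do the bounding in practice.

```python
from itertools import product

def bounded_check(system, is_safe, alphabet, max_length):
    """Exhaustively test `system` on every input up to max_length.

    Returns a counterexample input, or None if the property holds on the
    bounded domain. It certifies nothing about longer inputs.
    """
    for length in range(max_length + 1):
        for chars in product(alphabet, repeat=length):
            text = "".join(chars)
            if not is_safe(text, system(text)):
                return text
    return None

def never_emits_ab(_input, output):
    """Toy safety property: the output never contains the token 'ab'."""
    return "ab" not in output

def toy_filter(text):
    """Hypothetical filter that blocks any input containing 'ab'."""
    return "BLOCKED" if "ab" in text else text

def buggy_filter(text):
    """Hypothetical filter that only checks the start of the input."""
    return "BLOCKED" if text.startswith("ab") else text

print(bounded_check(toy_filter, never_emits_ab, "ab", 6))    # None
print(bounded_check(buggy_filter, never_emits_ab, "ab", 6))  # aab
```

A `None` here is a claim about inputs of length six or less over a two-letter alphabet, and nothing more. That is the honest shape of a verification result.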
4.3 No free lunch (Wolpert & Macready 1997)
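Averaged over all possible problems, every learning or search algorithm performs equally well; an algorithm that wins on one class of problems loses on another.
No universal alignment method. Methods trade off across tasks, cultures, threat models. A one-size-fits-all alignment claim is exactly what NFL rules out, however humbly it is phrased.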
4.4 Goodhart’s law and Campbell’s law
Goodhart (1975): when a measure becomes a target, it stops measuring well.
Campbell’s law (1979): social indicators used for control corrupt the processes they track.
Every public benchmark (MMLU, harm suites, helpfulness Elo) is a target. Anthropic’s automated alignment researcher work (2026) found high weak-to-strong scores on leaky testbeds that failed to transfer. Textbook Goodhart at automation speed.
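The mildest form of Goodhart needs no adversary at all. The simulation below is a sketch under a strong assumption of mine: the benchmark score equals true quality plus independent Gaussian noise of the same variance. It covers only this regressional effect, not leaky testbeds or deliberate gaming.

```python
import random
from statistics import mean

def regressional_goodhart(n_candidates=10_000, top_k=100, seed=0):
    """Select the top_k candidates by a noisy proxy and compare proxy vs true quality.

    Assumption: proxy = true quality + independent Gaussian noise of equal variance.
    Returns (mean proxy score, mean true quality) of the selected candidates.
    """
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_candidates):
        true_quality = rng.gauss(0, 1)
        proxy = true_quality + rng.gauss(0, 1)
        candidates.append((proxy, true_quality))
    selected = sorted(candidates, reverse=True)[:top_k]
    return mean(p for p, _ in selected), mean(t for _, t in selected)

proxy_mean, true_mean = regressional_goodhart()
print(f"selected proxy mean: {proxy_mean:.2f}, selected true mean: {true_mean:.2f}")
```

The selected candidates really are better than average, but by much less than their scores suggest, because selecting on the proxy also selects for lucky noise. Harder optimization widens the gap.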
4.5 Lucas critique (1976)
Econometric relationships estimated under one policy regime break when policy changes because agents re-optimize.
Alignment faking is Lucas at model scale: change whether the model believes it’s in training or eval, behavior changes. Specification gaming is Lucas for reward models.
4.6 CAP theorem (Brewer 2000)
Distributed systems can’t simultaneously guarantee consistency, availability, and partition tolerance.
Decentralized AI governance (DAO plus open weights plus global deploy) faces tradeoffs. “Trustless, always available, globally consistent” governance isn’t on the menu.
4.7 Verification vs oversight vs control
See the companion oversight / control / verification article for the three layers:
- Oversight (training): build proxies. Holmström plus Goodhart.
- Control (deployment): limit harm given an adversary. Game theory plus security.
- Verification (inspection): test specific claims. Bounded Rice escape.
Impossibility results hit each layer differently. Conflating them gives you slides that look complete and stacks that fail silently.
Part V: Epistemology, knowledge, and political philosophy
5.1 Hume’s is–ought gap
You can’t deduce normative conclusions from descriptive premises alone.
Training on human text yields statistical patterns of what people do say and did do, not authoritative should. Value loading is the permanent gap. “The data will tell us” is a category error.
5.2 Hayek’s knowledge problem (1945)
Distributed, tacit, local knowledge can’t be centralized without loss. Central planners lack what actors on the ground know.
Scalable oversight already assumes supervisors judge superhuman outputs from weaker positions. Hayek adds: stakeholders disagree partly because they know different things. Aggregating feedback loses local knowledge. Arrow plus Hayek.
5.3 Quine–Duhem underdetermination
Evidence underdetermines theory. Many theories fit the same data.
Interpretability and behavior logs underdetermine internal objectives. Multiple goal specifications compatible with the same outputs means room for deceptive alignment.
5.4 Knight: risk vs uncertainty
Risk is quantifiable. Uncertainty is not. Knightian uncertainty breaks expected utility frameworks.
Existential risk from transformative AI is often uncertainty (unknown unknowns), not insurable risk. Probability estimates can be useful and still not turn XR into a pricing problem.
5.5 Popper’s open society (1945)
Closed utopian systems that can’t revise under criticism fail. Societies need falsifiability and institutional learning.
Frozen values in a constitution (“never update regardless of new evidence”) are a governance bug. Alignment needs revision mechanisms, appeals, error correction, not just initial principles.
5.6 Rawls, pluralism, and Gabriel
Rawls: under reasonable pluralism, no comprehensive doctrine wins. Legitimacy needs overlapping consensus on rules of coexistence, not unity on the good life.
Iason Gabriel (2020) separates technical alignment (hit a target) from normative alignment (choose the target). Six rungs (instructions, intentions, revealed preferences, ideal preferences, interests, values) aren’t interchangeable.
Constitutions citing human rights invoke political-thin overlap, not metaphysical moral discovery. See universal values map.
5.7 Illich: convivial tools vs industrial scale (1973)
Ivan Illich’s Tools for Conviviality isn’t a theorem. It’s a design philosophy with structural bite. Tools either expand user competence and exit options (convivial) or bind users to industrial scale, expert maintenance, and radical monopoly (industrial).
Industrial tools need billion-user scale, opaque infrastructure, and deskilling to function. “Helpful” AI that only experts can steer, fix, or audit is industrial by design.
Illich asks: who can repair, refuse, or exit? If only the lab can update weights, interpret activations, or define harm, you have industrial AI with radical monopoly characteristics, not convivial assistance. That connects to open weights, local models, and user agency. Not Luddism. Institutional design.
Part VI: AI-specific structural arguments (not all theorems)
These load heavily in alignment discourse. Treat epistemic status carefully.
| Claim | Status | Notes |
|---|---|---|
| Orthogonality thesis | Philosophical | Intelligence and goals separable in principle (Bostrom) |
| Instrumental convergence | Argument | Capable optimizers tend toward self-preservation, resource acquisition, deception |
| Mesa-optimization | Hypothesis + examples | Inner optimizers optimizing proxy rewards |
| Deceptive alignment / treacherous turn | Hypothesis + some empirical work | Capabilities hidden until deployment |
| Alignment tax | Empirical pattern | Safety measures cost capability; tradeoff not always linear |
| Scalable oversight gap | Analogy | Humans verifying superhuman reasoning; P vs NP metaphor, not proof |
| Boxing / containment | Argument | Superhuman plus network access makes isolation hard; depends on capability model |
These combine Parts I–V: incomplete contracts, Goodhart, multipolar traps, underdetermination.
Master crosswalk
| Force | Theorem / concept | Where it bites in the AI stack |
|---|---|---|
| Aggregation | Arrow, Condorcet, VNM | Reward model, global persona |
| Incentives | Gibbard–Satterthwaite, Goodhart, Lucas | Labeling, evals, benchmarks |
| Liberty vs welfare | Sen | User requests vs harm policy |
| Contract gaps | Hart, Holmström | Constitution silence, proxy rewards |
| Markets | Akerlof, Myerson–Satterthwaite | Model licensing, capability markets |
| Coordination | Hardin, Ostrom, PD, Moloch | Lab race, international governance |
| Codification | Gödel (analogy), Rice (bounded) | Spec completeness, universal safety proofs |
| Knowledge | Hayek, Quine | Oversight, interpretability |
| Normativity | Hume, Rawls, Gabriel | Target selection, legitimacy |
| Scale / exit | Illich | Industrial vs convivial deployment |
What to do when perfection is impossible
Impossibility theorems aren’t resignation letters. They’re design requirements.
Name the axiom you sacrifice. IIA? Unrestricted domain? Transitivity? Central dictator (safety team veto)? Hidden choices cause surprise later.
Prefer Pareto sets to scalar lies. Multi-objective policies, plural models, user choice. Conitzer’s SCT-aware alignment agenda.
Treat constitutions as incomplete contracts. Document residual control: appeals, overrides, audits, update process.
Assume metrics will be gamed. Rotate evals, private holdouts, adversarial testing, metric diversity. Gibbard, Goodhart, and Lucas stack.
Govern polycentrically. Law, norms, markets, architecture. Ostrom over single lever.
Separate bounded verification from universal proof. Ship threat-model-specific evidence. Don’t claim “safe in general.”
Keep revision channels open. Popper over frozen utopia. Version your constitution.
Design for exit and competence where possible. Users who can inspect, refuse, or run local alternatives reduce radical monopoly (Illich).
Match layer to enemy. Cooperative Goodhart (oversight) vs adversarial scheming (control). See oversight article.
Politics isn’t a bug. Under pluralism, alignment targets are chosen, not discovered (Gabriel). Legitimacy matters alongside loss curves.
Pushback and limits of the map
Engineering bypass: real annotators have structured, noisy prefs. Production restricts domains. Theorems describe general possibilities. They still warn when marketing claims universality.
RLHF ≠ voting: correct. RLHF is normative compression, not an election. That makes explicit tradeoffs more important, not less.
Gödel/Rice ≠ Claude: literally true. Use analogies carefully. Bounded systems can be tested.
Moloch overreach: not every harm is coordination failure. Value conflict is often Arrow/Sen.
Theorems don’t predict timing. Structural limits don’t tell you when alignment faking shows up. Only that incentive pressure exists.
Open research questions
Empirical Arrow: measure which axioms production reward models violate most, and whether violations predict failure modes.
Condorcet-compatible training: replace Bradley-Terry with Condorcet or approval-style aggregation; compare dynamics.
Strategic labeling at scale: measure Gibbard–Satterthwaite-style strategic behavior by labelers under different payment and visibility rules.
Residual control mapping: for major labs, who actually decides in constitutional silence?
Illich audit: which AI products are convivial vs industrial by Illich’s criteria, and does it correlate with safety outcomes?
Polycentric AI governance prototypes: Ostrom principles applied to eval consortia, compute clubs, model registries.
Bottom line
Society’s recurring “fundamental flaws” are often theorem-shaped. Preference aggregation hits Arrow, Sen, Gibbard. Contracting hits Hart and Holmström. Coordination hits Hardin, Ostrom, multipolar traps. Codification and metrics hit Gödel (as codification limit), Rice (as general verification limit), Goodhart, Lucas. Knowledge and norms hit Hayek, Hume, Rawls.
AI alignment is where these meet optimizers that scale. The honest research program isn’t “solve morality.” It’s: given structural limits, which tradeoffs do we accept, who chooses them, and what institutions survive metric pressure?
Harder to market than “aligned AGI.” Also the question that still has answers.
Sources and reading list
Research notes (repo): notes/impossibility_theorems_ai_safety_map.md, notes/social_choice_theory_and_ai_alignment.md
Tier 1 — Start here (≈1 day)
- Stanford Encyclopedia: Social Choice Theory
- Conitzer et al. (2024)
- Gabriel (2020)
- Geanakoplos on Arrow
Tier 2 — Core classics (weeks)
- Arrow (1951) Social Choice and Individual Values
- Sen (1970/2017) Collective Choice and Social Welfare
- Hart (1995); Holmström Nobel lecture (2016)
- Ostrom (1990) Governing the Commons
- Popper (1945) The Open Society and Its Enemies
Tier 3 — AI alignment connections
- Russell, Human Compatible
- Greenblatt et al. (2024)
- Lewis-Pye & Roughgarden (2024)
- Alexander, Moloch (2014)
- Illich (1973) Tools for Conviviality
Related on this site: Alignment paradigms map · Universal values map · Oversight / control / verification