When perfection is impossible: a survey of structural limits in society and AI alignment

When a lab says a model is “aligned,” they usually mean it passed harm benchmarks, follows a constitution, or scored well on preference comparisons. Real achievements, all of them. They’re also the same old problems in new clothes: merge conflicting human preferences into one decision rule; write rules that still make sense when capabilities and use cases shift; stop races to the bottom; verify that something doing trillions of operations will behave as intended.

Economists, logicians, and political theorists spent much of the twentieth century proving that certain perfect versions of these problems have no general solution. Not “hard with today’s GPUs.” Impossible, under explicit assumptions, with proofs attached.

I wrote this because the field is fragmented in a way that costs us clarity. Social choice people rarely talk to contract theorists. Alignment engineers rarely read Sen. Moloch memes circulate without the game theory underneath. I wanted one document to return to when someone says “we just need better RLHF” or “a constitution will fix it” or “markets will coordinate labs,” and you need to name which structural limit they’re hitting.

This is a map. Where I can, I label epistemic status: proved theorem, structural argument, analogy, speculation. It is not a case that alignment is hopeless or democracy is worthless. Impossibility theorems are design constraints. They tell you which axiom you gave up. Pretending you didn’t choose is how you get surprised by reward hacking, constitutional contradictions, and governance capture.

Companion pieces on this site: alignment paradigms map (what we align to); universal values map (where cited values come from); oversight / control / verification (three layers of checking systems).

How to read this survey

Six clusters:

Part	Domain	Core question	Typical AI artifact
I	Social choice	Can we merge many rankings into one fair ranking?	RLHF reward model
II	Contracts & mechanisms	Can we specify all future behavior in a contract?	Constitution, ToS, access policy
III	Coordination	Can self-interested agents cooperate without a central planner?	Capability race, agent economies
IV	Logic & computation	Can we fully verify semantic properties of powerful systems?	Benchmarks, formal specs, evals
V	Epistemology & politics	Can we infer “ought” from “is” or centralize dispersed knowledge?	Value loading, oversight
VI	AI-specific structure	What limits are argued but not proved?	Deception, boxing, alignment tax

Almost everything I care about here reduces to four forces:

Information is dispersed. No planner, reward model, or safety team sees everything that matters.
The future is not contractible. You cannot enumerate all states in advance.
Agents respond to incentives. Pick a metric; humans, firms, and models optimize the metric, not your intent.
Norms are not deducible from facts. Training data is descriptive. Alignment is normative.

Part I: Social choice theory

1.1 The problem

Given N individuals with preferences over alternatives, design a social welfare function or voting rule that produces collective choices we’d defend as legitimate and coherent.

AI runs straight into this when many annotators rank outputs, users and cultures disagree on “helpful” or “safe,” and a lab compresses all of that into one reward, one model, one policy.

Social choice theory asks whether that compression can satisfy fairness axioms we’d insist on in politics. Often, the answer is no.

Start here if you read one bridge paper: Conitzer et al., Social Choice for AI Alignment (2024).

1.2 Condorcet paradox (1780s)

Three voters, three options:

Voter	1st	2nd	3rd
A	X	Y	Z
B	Y	Z	X
C	Z	X	Y

Pairwise majority: X beats Y, Y beats Z, Z beats X. A cycle. Collective preference goes intransitive even when every individual is transitive.

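With three or more options, cycles aren’t rare edge cases. Under the impartial culture model (every ranking equally likely), the chance of a cycle is about 5.6% with three voters and three options and approaches about 8.8% as the number of voters grows. It climbs much faster as the number of options grows.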
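A minimal sketch of how to check these claims by brute force, assuming strict rankings and an odd number of voters so pairwise ties cannot occur. The function names are mine, not from any library. It confirms that the profile in the table has no Condorcet winner, then enumerates every possible profile to get the exact share that cycle.

```python
from itertools import permutations, product

OPTIONS = ("X", "Y", "Z")

def majority_prefers(profile, a, b):
    """True if a strict majority of rankings place a above b."""
    votes_for_a = sum(1 for ranking in profile if ranking.index(a) < ranking.index(b))
    return votes_for_a > len(profile) / 2

def condorcet_winner(profile):
    """Return the option that beats every other option pairwise, or None."""
    for a in OPTIONS:
        if all(majority_prefers(profile, a, b) for b in OPTIONS if b != a):
            return a
    return None

def cycle_probability(num_voters):
    """Exact share of strict-ranking profiles with no Condorcet winner.

    Assumes every ranking is equally likely for every voter and that
    num_voters is odd, so "no Condorcet winner" means a majority cycle.
    """
    rankings = list(permutations(OPTIONS))
    profiles = list(product(rankings, repeat=num_voters))
    cyclic = sum(1 for p in profiles if condorcet_winner(p) is None)
    return cyclic / len(profiles)

# Voters A, B, C from the table above.
table_profile = [("X", "Y", "Z"), ("Y", "Z", "X"), ("Z", "X", "Y")]
print(condorcet_winner(table_profile))   # None
print(round(cycle_probability(3), 4))    # 0.0556 (12 of 216 profiles)
print(cycle_probability(5) > cycle_probability(3))
```

Full enumeration grows as 6^N, so this only works for a handful of voters. Larger N needs sampling or the closed-form results in the social choice literature.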

For RLHF: when annotators rank more than two completions, intransitivity is structural. Bradley-Terry and similar reward models assume a transitive latent utility. If preferences cycle, the scalar you’re optimizing is made up.
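To see what "made up" means here, the sketch below fits Bradley-Terry strengths to the pairwise counts from the three-voter cycle, using the standard minorization-maximization (Zermelo-style) update. The data layout and function name are my own assumptions, not a production reward-model pipeline. Each option wins two of three comparisons against one rival and one of three against the other, and the fitted scalar has nowhere to put that structure.

```python
# wins[a][b] = number of voters ranking a above b in the three-voter cycle.
wins = {
    "X": {"Y": 2, "Z": 1},
    "Y": {"Z": 2, "X": 1},
    "Z": {"X": 2, "Y": 1},
}

def fit_bradley_terry(wins, iterations=200):
    """Maximum-likelihood Bradley-Terry strengths via the MM update."""
    items = list(wins)
    # Deliberately unequal starting point, normalized to sum to 1.
    raw = {item: 1.0 + i for i, item in enumerate(items)}
    total = sum(raw.values())
    strength = {item: value / total for item, value in raw.items()}
    for _ in range(iterations):
        updated = {}
        for i in items:
            total_wins = sum(wins[i].values())
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in items
                if j != i
            )
            updated[i] = total_wins / denom
        norm = sum(updated.values())
        strength = {i: s / norm for i, s in updated.items()}
    return strength

strength = fit_bradley_terry(wins)
p_x_beats_y = strength["X"] / (strength["X"] + strength["Y"])
print(strength)      # all three strengths converge to about 1/3
print(p_x_beats_y)   # about 0.5, although X beat Y in 2 of 3 comparisons
```

The model reports indifference. The cycle is still in the data; the transitive form just can't express it.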

1.3 Arrow’s impossibility theorem (1951)

Kenneth Arrow proved that with three or more alternatives and two or more voters, no aggregation rule satisfies all of:

Axiom	Meaning
Unrestricted domain (U)	Any individual preference ordering is allowed
Non-dictatorship (D)	Outcome is not always one person’s ranking
Pareto efficiency (P)	If everyone strictly prefers A to B, society prefers A to B
Independence of irrelevant alternatives (IIA)	Society’s A vs B ranking depends only on individuals’ A vs B rankings
Transitivity	Collective ranking is transitive

Violate at least one. Geanakoplos’s “Three Brief Proofs of Arrow’s Impossibility Theorem” (2005) is the fastest way in.

Escape routes (each is a real institutional choice, not a cheat code):

Drop	Example	Cost
U	Require single-peaked preferences (Black)	Real prefs are multi-dimensional
IIA	Borda, ranked-choice	Spoiler effects, context sensitivity
Transitivity	Allow cycles	Unstable agendas, manipulable
D	Dictatorship	Legitimacy collapse
P	Rarely dropped	Violates unanimity intuition

A single global reward model implicitly claims an aggregation rule. Arrow says you can’t satisfy every “reasonable” axiom at once. Production RLHF typically pays for IIA (context from other completions leaks in) and U (preferences get forced into a transitive Bradley-Terry form).
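The IIA row in the escape-route table is easier to see with numbers. A minimal sketch with two hypothetical five-voter profiles: every voter ranks A against B the same way in both, and only the position of C moves. Borda flips the collective A vs B ranking anyway.

```python
def borda_scores(profile):
    """Borda count: with m options, the option at 0-based rank r earns m - 1 - r points."""
    scores = {}
    for ranking in profile:
        m = len(ranking)
        for position, option in enumerate(ranking):
            scores[option] = scores.get(option, 0) + (m - 1 - position)
    return scores

def prefers_a_to_b(ranking):
    return ranking.index("A") < ranking.index("B")

# Hypothetical profiles. Only the last two voters change, and only in where they put C.
profile_1 = [("A", "B", "C")] * 3 + [("B", "C", "A")] * 2
profile_2 = [("A", "B", "C")] * 3 + [("B", "A", "C")] * 2

# Individual A-vs-B rankings are identical across the two profiles.
same_pairwise = [prefers_a_to_b(r) for r in profile_1] == [prefers_a_to_b(r) for r in profile_2]

scores_1 = borda_scores(profile_1)
scores_2 = borda_scores(profile_2)
print(same_pairwise)   # True
print(scores_1)        # {'A': 6, 'B': 7, 'C': 2}  -> society ranks B above A
print(scores_2)        # {'A': 8, 'B': 7, 'C': 0}  -> society ranks A above B
```

In RLHF terms, C is the third completion on the annotator’s screen. Changing it changes how A and B compare.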

What Arrow does not say: that democracy is uniquely bad. Dictatorship “solves” Arrow by violating non-dictatorship. The theorem compares rules, not regime types.

1.4 Gibbard–Satterthwaite (1973, 1975)

Any non-dictatorial, deterministic voting rule with three or more outcomes is manipulable: some agent can do better by misreporting preferences.

Stack that on Arrow: aggregation is constrained, and truthful revelation isn’t generally incentive-compatible either.
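A small brute-force sketch of what "manipulable" means, assuming plurality with alphabetical tie-breaking and a hypothetical seven-voter profile. The search function is my own illustration: it tries every possible misreport by every voter and returns the first one that leaves that voter better off by their true ranking.

```python
from itertools import permutations

OPTIONS = ("A", "B", "C")

def plurality(profile):
    """Most first-place votes wins; ties are broken alphabetically."""
    counts = {o: 0 for o in OPTIONS}
    for ballot in profile:
        counts[ballot[0]] += 1
    return min(OPTIONS, key=lambda o: (-counts[o], o))

def dictator(profile):
    """Voter 0's top choice always wins."""
    return profile[0][0]

def find_manipulation(true_profile, rule):
    """Return (voter, misreport, honest_winner, new_winner) or None if no voter can gain."""
    honest_winner = rule(true_profile)
    for voter, true_ranking in enumerate(true_profile):
        for fake in permutations(OPTIONS):
            reported = list(true_profile)
            reported[voter] = fake
            new_winner = rule(reported)
            # Judge the new outcome by the voter's TRUE preferences.
            if true_ranking.index(new_winner) < true_ranking.index(honest_winner):
                return voter, fake, honest_winner, new_winner
    return None

# Hypothetical true preferences.
true_profile = [("A", "B", "C")] * 3 + [("B", "C", "A")] * 3 + [("C", "B", "A")]

print(find_manipulation(true_profile, plurality))  # (6, ('B', 'A', 'C'), 'A', 'B')
print(find_manipulation(true_profile, dictator))   # None
```

The last voter abandons their real favorite to block their least favorite. The only rule here that survives the search is the dictatorship, which is the theorem’s point.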

Annotators may strategically label if they infer how labels shape the model. Capable models may game the reward model: mesa-optimization, reward hacking, alignment faking under inferred evaluation contexts.

I think Gibbard–Satterthwaite is the formal version of “just get more honest labels” failing. It’s under-discussed in alignment engineering relative to how often that phrase gets used.

1.5 Sen’s liberal paradox (1970)

Amartya Sen showed minimal liberty (each person decides at least one personal pair) plus Pareto efficiency plus unrestricted domain can’t all hold.

Picture society unanimously preferring outcome X because of an efficiency gain, while X violates someone’s minimal personal domain: what they read, who they talk to, what the model helps them do in private.
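A sketch of the paradox in its smallest form, in the style of Sen’s original two-person illustration. The names, option labels, and preference orderings below are hypothetical. Each person is decisive over one pair that concerns only them, Pareto handles unanimous pairs, and the resulting social preference leaves no option undominated.

```python
# Options: "a" = person 1 reads the book, "b" = person 2 reads it, "c" = nobody does.
prefs = {
    "person_1": ["c", "a", "b"],   # would rather nobody read it
    "person_2": ["a", "b", "c"],   # would most like person 1 to read it
}
# Minimal liberty: each person decides one personal pair.
decisive_pairs = {"person_1": ("a", "c"), "person_2": ("b", "c")}
options = ["a", "b", "c"]

def above(ranking, x, y):
    return ranking.index(x) < ranking.index(y)

social = set()  # (x, y) means society strictly prefers x to y

# Liberty: on a person's own pair, society follows that person.
for person, (x, y) in decisive_pairs.items():
    social.add((x, y) if above(prefs[person], x, y) else (y, x))

# Pareto: if everyone prefers x to y, so does society.
for x in options:
    for y in options:
        if x != y and all(above(ranking, x, y) for ranking in prefs.values()):
            social.add((x, y))

undominated = [x for x in options if not any((y, x) in social for y in options)]
print(sorted(social))   # [('a', 'b'), ('b', 'c'), ('c', 'a')]
print(undominated)      # []
```

Society prefers a to b, b to c, and c to a. Whatever gets picked, either someone’s personal domain or a unanimous preference was overridden.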

“Maximize user satisfaction” and “enforce collective harm constraints” collide even when everyone’s rational. Safety rails aren’t free Pareto improvements. They trade Sen-style liberty for collective outcomes.

Sen’s capability approach (what people are actually able to do and be) is a different answer to welfare economics than scalar utility. It matters when labs cite “human flourishing” without an aggregation formula.

1.6 May’s theorem (1952) and Black’s median voter (1948)

May: for exactly two options, simple majority is the unique rule satisfying decisiveness, anonymity, neutrality, and positive responsiveness.

Black: if preferences are single-peaked along one dimension, majority rule picks the median voter’s ideal point. Stable. Intuitive.
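A minimal check of Black’s result, assuming five hypothetical voters with ideal points on one integer axis (think "how cautious should the model be, 0 to 10") and preferences that fall off with distance from the ideal. The median ideal point should beat every other option in a pairwise majority vote.

```python
from statistics import median

# Hypothetical ideal points on a single axis; integers avoid floating-point ties.
ideal_points = [1, 3, 4, 8, 9]
options = list(range(11))   # candidate policies 0..10

def majority_prefers(a, b):
    """Single-peaked preferences: each voter backs whichever option is closer to their ideal."""
    for_a = sum(1 for x in ideal_points if abs(x - a) < abs(x - b))
    for_b = sum(1 for x in ideal_points if abs(x - b) < abs(x - a))
    return for_a > for_b

median_ideal = median(ideal_points)
beats_all = all(majority_prefers(median_ideal, o) for o in options if o != median_ideal)

print(median_ideal)               # 4
print(beats_all)                  # True
print(majority_prefers(8, 4))     # False: a non-median ideal loses to the median
```

Add a second axis and the guarantee is gone, which is the "until they aren’t" case.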

Pairwise RLHF sidesteps the worst of Arrow locally (two options → May). Composing pairwise judgments into a global reward brings Arrow back globally. RLHF works in practice partly because safety and helpfulness prefs are approximately single-peaked on a few axes, until they aren’t: culture, religion, politics, dual-use.

1.7 Von Neumann–Morgenstern (VNM) and the scalar reward assumption

The VNM utility theorem (1944): if preferences over lotteries satisfy completeness, transitivity, continuity, and independence, they behave as if maximizing expected utility of a real-valued function.

RLHF pipeline:

Human pairwise labels → Bradley-Terry / reward model → scalar r → PPO / DPO maximizes E[r]

Every step assumes VNM-style structure. Humans systematically violate it: incompleteness (“I genuinely don’t know which I prefer”), intransitivity and preference reversals (Tversky), independence violations (Allais paradox).
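The Allais violation can be checked mechanically. The sketch below uses the standard Allais lotteries (payoffs in millions) and random increasing utility values, which are my own test inputs. For any utility function, the expected-utility gap between 1A and 1B equals the gap between 2A and 2B, so the common human pattern (1A over 1B, but 2B over 2A) fits no expected-utility maximizer.

```python
import random

# Lotteries as {payoff in $M: probability}.
L1A = {1: 1.00}
L1B = {1: 0.89, 5: 0.10, 0: 0.01}
L2A = {1: 0.11, 0: 0.89}
L2B = {5: 0.10, 0: 0.90}

def expected_utility(lottery, u):
    return sum(p * u[x] for x, p in lottery.items())

def eu_gaps(u):
    """Return (EU(1A) - EU(1B), EU(2A) - EU(2B)) for utility table u."""
    gap_1 = expected_utility(L1A, u) - expected_utility(L1B, u)
    gap_2 = expected_utility(L2A, u) - expected_utility(L2B, u)
    return gap_1, gap_2

random.seed(0)
max_difference = 0.0
for _ in range(1000):
    # Hypothetical increasing utilities over the three payoffs.
    u0, u1, u5 = sorted(random.uniform(0, 100) for _ in range(3))
    gap_1, gap_2 = eu_gaps({0: u0, 1: u1, 5: u5})
    max_difference = max(max_difference, abs(gap_1 - gap_2))

print(max_difference < 1e-9)   # True: the two gaps always share a sign
```

Both gaps reduce to 0.11·u(1) − 0.10·u(5) − 0.01·u(0). A reward model trained on labels with this pattern is fitting a function that doesn’t exist.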

The scalar reward isn’t a neutral engineering choice. It’s normative compression that pretends human disagreement is one-dimensional utility. Multi-objective RL, Pareto policy sets, plural models aren’t aesthetic preferences. They’re responses to VNM failure plus Arrow.
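A sketch of the difference between a Pareto policy set and a scalar winner. The policy names and scores are invented for illustration. Three of the four policies are undominated, and which one "wins" under a scalar reward depends entirely on a weight someone had to choose.

```python
# Hypothetical policies scored as (helpfulness, harmlessness); higher is better.
policies = {
    "cautious": (0.55, 0.95),
    "balanced": (0.75, 0.80),
    "permissive": (0.90, 0.50),
    "sloppy": (0.70, 0.45),
}

def dominates(p, q):
    """p dominates q if it is at least as good on every objective and better on one."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

def pareto_set(policies):
    return sorted(
        name
        for name, score in policies.items()
        if not any(dominates(other, score) for other in policies.values())
    )

def scalar_winner(policies, weight):
    """Winner under reward = weight * helpfulness + (1 - weight) * harmlessness."""
    return max(
        policies,
        key=lambda name: weight * policies[name][0] + (1 - weight) * policies[name][1],
    )

print(pareto_set(policies))          # ['balanced', 'cautious', 'permissive']
print(scalar_winner(policies, 0.2))  # cautious
print(scalar_winner(policies, 0.5))  # balanced
print(scalar_winner(policies, 0.8))  # permissive
```

Reporting the set keeps the tradeoff visible. Reporting one winner hides the weight inside the training config.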

Stuart Russell’s Human Compatible treats the standard model (fixed objective, optimize hard) as the root failure mode. VNM is the math license for that model.

1.8 Voting rules in the wild (and in feedback UI)

No rule escapes the theorems. Each picks its poison:

Rule	Mechanism	Violates (typically)
Plurality	Most first-place votes wins	Split votes, spoilers
Borda	Points by rank	IIA, strategy
Instant runoff (IRV)	Eliminate last, redistribute	Monotonicity
Approval	Approve any acceptable	Low information
Condorcet	Pairwise champion if exists	Winner may not exist
Range / score	Rate 0–10, average	Strategic exaggeration; needs comparable scales (often works well empirically)
Quadratic voting	Cost rises with votes²	Needs trusted “voice credits”

Conitzer et al. suggest alignment analogues: approval/range-style feedback, Condorcet-inspired aggregation, explicit heterogeneous preference models, Pareto sets of policies instead of one winner, multiple aligned models rather than one global persona.

That last point is the theoretical case for character training and user choice. If you can’t aggregate fairly, differentiate instead of faking unity.

1.9 Part I summary for practitioners

SCT concept	RLHF / alignment artifact	Predicted failure mode
Arrow	Global reward from diverse labelers	Context sensitivity, inconsistent tradeoffs
Condorcet cycles	Multi-way ranking	False transitivity in Bradley-Terry
Gibbard–Satterthwaite	Labeling + model optimization	Strategic labels, reward hacking
Sen	User autonomy vs harm prevention	“Help me with X” vs platform policy
VNM / Allais	Scalar reward	Edge-case weirdness, fragile prefs
Black / May	Pairwise comparisons	Works until prefs go multi-peaked

Part II: Contract theory and mechanism design

2.1 Incomplete contracts (Hart; Holmström, Nobel 2016)

Real contracts can’t describe every future state. When the unanticipated arrives, residual control rights (who decides in the silence) determine outcomes. Firms, laws, and ownership structures exist largely to allocate those rights.

Model constitution, system prompt, acceptable-use policy, and model spec are incomplete contracts. Novel jailbreaks, new modalities, emergent capabilities, ambiguous user requests are uncontracted states.

Who holds residual control? The base model? The safety filter? The user? The platform? The regulator? That allocation is the governance question, not whether you have a PDF called “constitution.”

Anthropic’s 2026 Claude constitution shifts from bullet principles to long-form narrative. Appropriate for incompleteness (principles plus judgment). It also makes authorial discretion explicit.

2.2 Holmström’s informativeness principle

An optimal contract should condition on every signal that is informative about the agent’s action, but it can condition only on signals that are observable and contractible. You can’t pay directly on intent, goodwill, or “true alignment.” Only proxies.

RLHF, AI critics, and harm classifiers supervise outputs and behaviors, not internal goals. Holmström predicts proxy optimization. Scalable oversight research is mostly the search for better proxies, not escape from the principle.

2.3 Principal–agent problems

When the agent has private information or hidden actions, you get moral hazard (unobserved risks) and adverse selection (bad types pooling with good types pre-contract).

Deployers know use cases labs don’t. Labs know training details users don’t. Users know intent classifiers don’t. Every alignment stack is a principal–agent stack all the way down.

2.4 Myerson–Satterthwaite (1983)

With two-sided private information, you generally can’t get ex post efficient trade that’s individually rational, incentive-compatible, and budget-balanced. Information rents eat the gains from trade.

Naive “efficient markets” for compute access, model APIs, or capability licenses underestimate friction. Bargaining over who may fine-tune, deploy, or redistribute weights hits structural limits, not just politics.

2.5 Akerlof’s lemons (1970)

If buyers can’t verify quality, markets unravel. Low-quality sellers drive out good ones.

If capability, alignment, or intent is unverifiable, buyers can’t distinguish safe from unsafe systems. You get a race to the bottom on disclosed metrics while hiding what matters. Third-party eval (METR-style) and weight security exist to fight lemons dynamics.
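The unraveling is a short loop. A toy sketch under assumptions of my own: quality is uniform on [0, 1] and known only to sellers, sellers part with a unit only if the price covers its quality, and buyers value quality at a fixed premium but can only pay for the average of what is offered.

```python
def lemons_price_path(buyer_premium=1.5, start_price=1.0, rounds=60):
    """Price dynamics when quality q ~ Uniform(0, 1) is visible only to sellers.

    Sellers value a unit at q, so only units with q <= price are offered.
    Buyers value a unit at buyer_premium * q but pay for average offered quality.
    """
    price = start_price
    path = [price]
    for _ in range(rounds):
        avg_offered_quality = min(price, 1.0) / 2     # mean of Uniform(0, min(price, 1))
        price = buyer_premium * avg_offered_quality   # buyers' willingness to pay
        path.append(price)
    return path

path = lemons_price_path()
print([round(p, 3) for p in path[:5]])   # [1.0, 0.75, 0.562, 0.422, 0.316]
print(path[-1] < 1e-6)                   # True: the market unravels

# With a large enough premium (here 2.5) the whole market trades at a stable price.
print(lemons_price_path(buyer_premium=2.5)[-1])   # 1.25
```

Every round the best remaining sellers leave, which lowers average quality, which lowers the price again. Verification is what breaks the loop: it lets buyers pay for actual quality instead of the pool average.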

2.6 Mechanism design and the revelation principle

Mechanism design asks: given selfish agents with private info, can we build rules that make truth-telling optimal?

The revelation principle (Gibbard; Myerson): any outcome implementable in equilibrium by some mechanism is also implementable by a direct mechanism in which truthful reporting is an equilibrium. In theory.

In practice, truly incentive-compatible mechanisms for complex multi-dimensional preferences are hard to characterize. That connects back to Gibbard–Satterthwaite.

“Just ask users what they want” or “just ask the model to report its goals” assumes revelation works. It often doesn’t without careful mechanism design. Adversarial models break the cooperative assumptions.

2.7 Part II summary

Contract concept	AI artifact	Predicted failure
Incomplete contracts	Constitution, ToS	Silence filled by whoever has residual control
Holmström	Reward / oversight signals	Goodhart on proxies
Principal–agent	Lab ↔ user ↔ regulator	Hidden use, sandbagging, misreport
Lemons	Model market, open weights	Race on hype metrics
Myerson–Satterthwaite	Capability licensing	Deadweight loss, bargaining failure

Part III: Coordination failures and institutional design

3.1 Tragedy of the commons (Hardin 1968)

Each actor gains from exploiting a shared resource. Collective ruin follows. Hardin’s original framing suggested only privatization or state control. Ostrom and others challenged that.

Shared safety reputations, open-weight release externalities, shared eval datasets, planet-scale compute and environment tradeoffs are commons problems.

3.2 Ostrom’s polycentric governance (Nobel 2009)

Elinor Ostrom documented communities that sustain commons without pure market or pure state solutions: local rules, monitoring, graduated sanctions, conflict resolution, nested enterprises.

Design principles, compressed: clear boundaries; rules matched to local conditions; collective-choice arrangements; monitoring; graduated sanctions; conflict resolution; rights to organize; nested governance for larger systems.

No single lever (“just regulate,” “just open-source,” “just DAO”) works universally. Sustainable AI governance probably needs stacks: law, norms, markets, code, labs, civil society. Pure on-chain governance reproduces Hardin-style failures when code can’t handle unforeseen states (Hart again).

3.3 Game-theoretic traps

Game	Structure	AI analogue
Prisoner’s dilemma	Mutual defection dominant	Safety slowdown vs competitor
Stag hunt	Risky cooperation needs trust	International treaty on dangerous training
Chicken	Brinkmanship	Public capability demos
Coordination game	Multiple equilibria, need focal point	“Don’t train above X” without enforcement

Schelling (1960): focal points coordinate without communication through salience, precedent, obvious defaults. Constitutions and industry norms are Schelling infrastructure.
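The first row of the table, made checkable. A minimal sketch with hypothetical payoffs for two labs that follow the prisoner’s dilemma ordering. It enumerates every cell, keeps the ones where neither lab gains by deviating alone, and compares the result with mutual safety investment.

```python
from itertools import product

ACTIONS = ("invest_in_safety", "race")

# Hypothetical payoffs (lab_1, lab_2) with the prisoner's dilemma ordering.
PAYOFFS = {
    ("invest_in_safety", "invest_in_safety"): (3, 3),
    ("invest_in_safety", "race"): (0, 4),
    ("race", "invest_in_safety"): (4, 0),
    ("race", "race"): (1, 1),
}

def is_nash(a1, a2):
    """True if neither lab can gain by changing only its own action."""
    u1, u2 = PAYOFFS[(a1, a2)]
    lab_1_ok = all(PAYOFFS[(deviation, a2)][0] <= u1 for deviation in ACTIONS)
    lab_2_ok = all(PAYOFFS[(a1, deviation)][1] <= u2 for deviation in ACTIONS)
    return lab_1_ok and lab_2_ok

equilibria = [cell for cell in product(ACTIONS, repeat=2) if is_nash(*cell)]
cooperative = PAYOFFS[("invest_in_safety", "invest_in_safety")]

print(equilibria)                          # [('race', 'race')]
print(PAYOFFS[equilibria[0]], cooperative) # (1, 1) (3, 3)
```

The only stable cell is the one both labs like less than mutual investment. Changing the game means changing the payoffs (liability, verification, treaties), not asking the players to be nicer.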

3.4 Moloch and multipolar traps

Scott Alexander’s Meditations on Moloch (2014) names the felt experience of multipolar traps: locally rational steps, globally awful equilibria nobody chose. AI safety Twitter uses “Moloch” for capability races. Sometimes correctly. Sometimes as a buzzword for any outcome you dislike.

Use Moloch when the payoff structure is multi-agent, defection is locally dominant, and no actor can unilaterally fix the equilibrium.

Don’t use Moloch when the dispute is genuine value pluralism (Sen/Arrow), incompetence, or one actor’s malice. Coordination failure isn’t the only failure mode.

Alexander’s §VII argues that if traps deepen with technology, a singleton-class solution (Bostrom) becomes the canonical coordination fix in one branch of alignment thought. Contested philosophy, not a theorem. Still shows how coordination diagnosis drives alignment ambition.

3.5 Lessig’s four modalities

Lawrence Lessig: behavior is shaped by law, norms, markets, and architecture (code). Change one modality while ignoring others and you get displacement, not solution.

Compute governance (architecture), liability (law), safety culture (norms), API pricing (markets) interact. Policy that only touches one layer gets routed around.

3.6 Algorithmic collusion and economic security limits

Calvano et al. (AER 2020): independent learning algorithms reached supra-competitive pricing without explicit communication.

Lewis-Pye & Roughgarden (2024): economic security in permissionless consensus has resource thresholds. Mechanisms fail when adversaries exceed stake or compute bounds.

Multi-agent AI economies may collude emergently. Economic containment of dangerous AI is conditional on asymmetry favoring defenders. That’s a structural limit on “just use crypto incentives.”

Recent mechanism design work on spiteful agents (2025) adds: if agents minimize others’ utility, incentive-compatible mechanisms collapse toward crude thresholds. Relevant for adversarial AI, not cooperative users.

3.7 Part III summary

Concept	AI context	Implication
Commons	Compute, evals, safety reputation	Free-riding on safety investment
Ostrom	Global AI governance	Polycentric stacks, not one fix
PD / stag hunt	Lab competition	Unilateral pause unstable
Schelling	Norms, constitutions	Need salient focal points
Calvano	Agent marketplaces	Emergent collusion
Economic security thresholds	On-chain containment	Breaks if adversary resource-rich

Part IV: Logic, computation, and the metrics layer

4.1 Gödel’s incompleteness theorems (1931)

In any consistent, effectively axiomatized formal system strong enough for arithmetic, there are true arithmetical statements that cannot be proved within the system.

Analogy, not isomorphism: any finite written constitution or spec has edge cases true to the spirit but not derivable from the text. You need meta-level revision processes (Popper below), not just more clauses.

Epistemic status: suggestive. Claude is not Peano arithmetic. The lesson is codification limits, not literal Gödel numbers in weights.

4.2 Halting and Rice’s theorem

Halting problem: no algorithm decides whether arbitrary programs halt.

Rice’s theorem: any non-trivial semantic property of programs is undecidable in general.

“Is this model safe for all inputs?” as a fully general decision problem runs into undecidability if the model or interpreter is sufficiently expressive and the property is semantic.

Practical escape: bounded verification, bounded inputs, specific threat models. Tractable and necessary. The theorems bound universal certification, not red-teaming your next release.

4.3 No free lunch (Wolpert & Macready 1997)

No learning algorithm dominates all others on all problems.

By analogy: no universal alignment method. Methods trade off across tasks, cultures, threat models. A one-size-fits-all alignment claim asserts exactly the kind of universal dominance NFL rules out.

4.4 Goodhart’s law and Campbell’s law

Goodhart (1975): when a measure becomes a target, it stops measuring well.

Campbell’s law (1979): social indicators used for control corrupt the processes they track.

Every public benchmark (MMLU, harm suites, helpfulness Elo) is a target. Anthropic’s automated alignment researcher work (2026) found high weak-to-strong scores on leaky testbeds that failed to transfer. Textbook Goodhart at automation speed.
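A toy simulation of the mildest form of Goodhart, where nobody cheats and the benchmark is just true quality plus independent noise. The distributions and sample sizes are assumptions for illustration. The harder you select on the benchmark, the more the selected scores overstate true quality.

```python
import random
from statistics import mean

random.seed(0)

# Hypothetical candidates: benchmark = true quality + independent noise.
candidates = []
for _ in range(10_000):
    true_quality = random.gauss(0, 1)
    benchmark = true_quality + random.gauss(0, 1)
    candidates.append((benchmark, true_quality))

def selected_means(top_k):
    """Pick the top_k candidates by benchmark; return (mean benchmark, mean true quality)."""
    chosen = sorted(candidates, reverse=True)[:top_k]
    return mean(b for b, _ in chosen), mean(t for _, t in chosen)

for top_k in (10_000, 1_000, 100, 10):
    bench, true = selected_means(top_k)
    print(top_k, round(bench, 2), round(true, 2), round(bench - true, 2))
```

The last column is the gap between what the leaderboard says and what you got. It widens with selection pressure even in this cooperative setting. Leaky testbeds and models that optimize against the metric make it worse.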

4.5 Lucas critique (1976)

Econometric relationships estimated under one policy regime break when policy changes because agents re-optimize.

Alignment faking is Lucas at model scale: change whether the model believes it’s in training or eval, behavior changes. Specification gaming is Lucas for reward models.

4.6 CAP theorem (Brewer 2000)

Distributed systems can’t simultaneously guarantee consistency, availability, and partition tolerance.

Decentralized AI governance (DAO plus open weights plus global deploy) faces tradeoffs. “Trustless, always available, globally consistent” governance isn’t on the menu.

4.7 Verification vs oversight vs control

See the companion article on the three layers (oversight / control / verification):

Oversight (training): build proxies. Holmström plus Goodhart.
Control (deployment): limit harm given an adversary. Game theory plus security.
Verification (inspection): test specific claims. Bounded Rice escape.

Impossibility results hit each layer differently. Conflating them gives you slides that look complete and stacks that fail silently.

Part V: Epistemology, knowledge, and political philosophy

5.1 Hume’s is–ought gap

You can’t deduce normative conclusions from descriptive premises alone.

Training on human text yields statistical patterns of what people do say and did do, not authoritative should. Value loading is the permanent gap. “The data will tell us” is a category error.

5.2 Hayek’s knowledge problem (1945)

Distributed, tacit, local knowledge can’t be centralized without loss. Central planners lack what actors on the ground know.

Scalable oversight already assumes supervisors judge superhuman outputs from weaker positions. Hayek adds: stakeholders disagree partly because they know different things. Aggregating feedback loses local knowledge. Arrow plus Hayek.

5.3 Quine–Duhem underdetermination

Evidence underdetermines theory. Many theories fit the same data.

Interpretability and behavior logs underdetermine internal objectives. Multiple goal specifications compatible with the same outputs means room for deceptive alignment.

5.4 Knight: risk vs uncertainty

Risk is quantifiable. Uncertainty is not. Knightian uncertainty breaks expected utility frameworks.

Existential risk from transformative AI is often uncertainty (unknown unknowns), not insurable risk. Probability estimates can be useful and still not turn XR into a pricing problem.

5.5 Popper’s open society (1945)

Closed utopian systems that can’t revise under criticism fail. Societies need falsifiability and institutional learning.

Frozen values in a constitution (“never update regardless of new evidence”) are a governance bug. Alignment needs revision mechanisms, appeals, error correction, not just initial principles.

5.6 Rawls, pluralism, and Gabriel

Rawls: under reasonable pluralism, no comprehensive doctrine wins. Legitimacy needs overlapping consensus on rules of coexistence, not unity on the good life.

Iason Gabriel (2020) separates technical alignment (hit a target) from normative alignment (choose the target). Six rungs (instructions, intentions, revealed preferences, ideal preferences, interests, values) aren’t interchangeable.

Constitutions citing human rights invoke political-thin overlap, not metaphysical moral discovery. See universal values map.

5.7 Illich: convivial tools vs industrial scale (1973)

Ivan Illich’s Tools for Conviviality isn’t a theorem. It’s a design philosophy with structural bite. Tools either expand user competence and exit options (convivial) or bind users to industrial scale, expert maintenance, and radical monopoly (industrial).

Industrial tools need billion-user scale, opaque infrastructure, and deskilling to function. “Helpful” AI that only experts can steer, fix, or audit is industrial by design.

Illich asks: who can repair, refuse, or exit? If only the lab can update weights, interpret activations, or define harm, you have industrial AI with radical monopoly characteristics, not convivial assistance. That connects to open weights, local models, and user agency. Not Luddism. Institutional design.

Part VI: AI-specific structural arguments (not all theorems)

These load heavily in alignment discourse. Treat epistemic status carefully.

Claim	Status	Notes
Orthogonality thesis	Philosophical	Intelligence and goals separable in principle (Bostrom)
Instrumental convergence	Argument	Capable optimizers tend toward self-preservation, resource acquisition, deception
Mesa-optimization	Hypothesis + examples	Inner optimizers optimizing proxy rewards
Deceptive alignment / treacherous turn	Hypothesis + some empirical work	Capabilities hidden until deployment
Alignment tax	Empirical pattern	Safety measures cost capability; tradeoff not always linear
Scalable oversight gap	Analogy	Humans verifying superhuman reasoning; P vs NP metaphor, not proof
Boxing / containment	Argument	Superhuman plus network access makes isolation hard; depends on capability model

These combine Parts I–V: incomplete contracts, Goodhart, multipolar traps, underdetermination.

Master crosswalk

Force	Theorem / concept	Where it bites in the AI stack
Aggregation	Arrow, Condorcet, VNM	Reward model, global persona
Incentives	Gibbard–Satterthwaite, Goodhart, Lucas	Labeling, evals, benchmarks
Liberty vs welfare	Sen	User requests vs harm policy
Contract gaps	Hart, Holmström	Constitution silence, proxy rewards
Markets	Akerlof, Myerson–Satterthwaite	Model licensing, capability markets
Coordination	Hardin, Ostrom, PD, Moloch	Lab race, international governance
Codification	Gödel (analogy), Rice (bounded)	Spec completeness, universal safety proofs
Knowledge	Hayek, Quine	Oversight, interpretability
Normativity	Hume, Rawls, Gabriel	Target selection, legitimacy
Scale / exit	Illich	Industrial vs convivial deployment

What to do when perfection is impossible

Impossibility theorems aren’t resignation letters. They’re design requirements.

Name the axiom you sacrifice. IIA? Unrestricted domain? Transitivity? Central dictator (safety team veto)? Hidden choices cause surprise later.

Prefer Pareto sets to scalar lies. Multi-objective policies, plural models, user choice. Conitzer’s SCT-aware alignment agenda.

Treat constitutions as incomplete contracts. Document residual control: appeals, overrides, audits, update process.

Assume metrics will be gamed. Rotate evals, private holdouts, adversarial testing, metric diversity. Gibbard, Goodhart, and Lucas stack.

Govern polycentrically. Law, norms, markets, architecture. Ostrom over single lever.

Separate bounded verification from universal proof. Ship threat-model-specific evidence. Don’t claim “safe in general.”

Keep revision channels open. Popper over frozen utopia. Version your constitution.

Design for exit and competence where possible. Users who can inspect, refuse, or run local alternatives reduce radical monopoly (Illich).

Match layer to enemy. Cooperative Goodhart (oversight) vs adversarial scheming (control). See oversight article.

Politics isn’t a bug. Under pluralism, alignment targets are chosen, not discovered (Gabriel). Legitimacy matters alongside loss curves.

Pushback and limits of the map

Engineering bypass: real annotators have structured, noisy prefs. Production restricts domains. Theorems describe general possibilities. They still warn when marketing claims universality.

RLHF ≠ voting: correct. RLHF is normative compression, not an election. That makes explicit tradeoffs more important, not less.

Gödel/Rice ≠ Claude: literally true. Use analogies carefully. Bounded systems can be tested.

Moloch overreach: not every harm is coordination failure. Value conflict is often Arrow/Sen.

Theorems don’t predict timing. Structural limits don’t tell you when alignment faking shows up. Only that incentive pressure exists.

Open research questions

Empirical Arrow: measure which axioms production reward models violate most, and whether violations predict failure modes.

Condorcet-compatible training: replace Bradley-Terry with Condorcet or approval-style aggregation; compare dynamics.

Strategic labeling at scale: Gibbard-style labeler behavior under different payment and visibility rules.

Residual control mapping: for major labs, who actually decides in constitutional silence?

Illich audit: which AI products are convivial vs industrial by Illich’s criteria, and does it correlate with safety outcomes?

Polycentric AI governance prototypes: Ostrom principles applied to eval consortia, compute clubs, model registries.

Bottom line

Society’s recurring “fundamental flaws” are often theorem-shaped. Preference aggregation hits Arrow, Sen, Gibbard. Contracting hits Hart and Holmström. Coordination hits Hardin, Ostrom, multipolar traps. Codification and metrics hit Gödel (as codification limit), Rice (as general verification limit), Goodhart, Lucas. Knowledge and norms hit Hayek, Hume, Rawls.

AI alignment is where these meet optimizers that scale. The honest research program isn’t “solve morality.” It’s: given structural limits, which tradeoffs do we accept, who chooses them, and what institutions survive metric pressure?

Harder to market than “aligned AGI.” Also the question that still has answers.

Sources and reading list

Research notes (repo): notes/impossibility_theorems_ai_safety_map.md, notes/social_choice_theory_and_ai_alignment.md

Tier 1 — Start here (≈1 day)

Tier 2 — Core classics (weeks)

Arrow (1951) Social Choice and Individual Values
Sen (1970/2017) Collective Choice and Social Welfare
Hart (1995); Holmström Nobel lecture (2016)
Ostrom (1990) Governing the Commons
Popper (1945) The Open Society and Its Enemies

Tier 3 — AI alignment connections

Russell, Human Compatible
Greenblatt et al. (2024)
Lewis-Pye & Roughgarden (2024)
Alexander, Moloch (2014)
Illich (1973) Tools for Conviviality

Related on this site: Alignment paradigms map · Universal values map · Oversight / control / verification