
Discussion on human-AI interaction models

January 20, 2026

This is the article where several threads I’ve been following converge.

In a previous piece, I calculated that AI generates code 53,000x cheaper than a human — but after you add the cost of a human verifying the output and fixing the errors, the real advantage shrinks to 1-5x. That pattern holds across eight industries. The verification cost is the binding constraint.
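To make that compression concrete, here is a minimal sketch of the cost model. Only the 53,000x generation gap comes from that earlier piece; the verification and error-fixing figures below are illustrative assumptions, not measured data.

```python
# Minimal sketch of the real-advantage calculation.
# Only the 53,000x generation gap is from the earlier article;
# the verification and error-fixing figures are illustrative assumptions.

HUMAN_COST = 100.00                  # cost for a human to do the task outright
AI_GEN_COST = HUMAN_COST / 53_000    # raw AI generation: ~53,000x cheaper
VERIFY_COST = 35.00                  # assumed: human time to review the AI output
FIX_COST = 15.00                     # assumed: human time to fix what review missed

raw_advantage = HUMAN_COST / AI_GEN_COST
real_advantage = HUMAN_COST / (AI_GEN_COST + VERIFY_COST + FIX_COST)

print(f"raw: {raw_advantage:,.0f}x, real: {real_advantage:.1f}x")
# raw: 53,000x, real: 2.0x -- verification, not generation, sets the price
```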

That finding raises a design question: is the verification cost inherent to AI, or is it an artifact of how we hand AI to people? If a different interaction design could cut verification and error costs deeply enough, the real economic advantage could jump from 1-5x to 10-20x. That’s not just a UX improvement. It’s a business case that changes the math on every AI deployment.

At the same time, the evidence on how AI affects the humans using it is converging on an uncomfortable finding. Across coding, healthcare, writing, and education, the same interaction model — AI generates, human reviews — consistently degrades human skill, satisfaction, and judgment. Not because AI is bad at its job. Because reviewing someone else’s output is a fundamentally different cognitive activity than creating your own, and it’s worse on almost every dimension that matters for long-term human performance.

The question is whether there’s a design that solves both problems — better economics and better human outcomes. The research suggests there is, but with conditions that matter.


The same five failures, every domain

I’ve written about AI in coding, healthcare, SRE, writing, and education. Each article was about a different field. Each surfaced the same five patterns. Briefly, since each is covered in depth elsewhere:

The “worst of both worlds” trap. People don’t trust AI enough to follow it when it’s right, but they trust it enough to stop checking carefully when it’s wrong. Doctors don’t change their diagnoses based on AI input, but their unaided performance drops after using it. Developers use AI despite low trust (29% trust it), but 30-50% refuse to work without it.

The perception-reality gap. Users consistently believe AI is helping more than it is. The METR study found a 39-percentage-point gap between perceived and actual speed. DORA found 80% of developers believe AI increases their productivity even as delivery stability declines. BCG found that workers suffering AI “brain fry” don’t realize how impaired they are.

The deskilling spiral. AI handles routine tasks. Humans get less practice. When AI fails, humans can’t take over. This is Bainbridge’s Ironies of Automation from 1983. Endoscopists get worse at detecting cancer after using AI assistance. NASA found pilots’ cognitive and judgment skills degrade under automation while procedural skills survive — and the cognitive skills are exactly what’s needed when the system fails.

Complexity outpacing understanding. AI makes producing output cheaper, so we produce more. More code, more clinical notes, more configurations. But the people maintaining these systems don’t gain understanding at the same rate. MTTR has gotten worse every year since 2021: the share of organizations needing more than an hour to recover has risen from 47% to 82%.

The vigilance impossibility. The safety case for AI in every domain assumes humans will carefully review every output. Cognitive science has known for decades that sustained monitoring performance degrades after about 20 minutes. This assumption has failed in aviation, nuclear power, and medical monitoring. Building AI tools around it and expecting different results is not rational.


Why “AI generates, human reviews” conflicts with cognition

These five failures trace to the same root cause: the dominant interaction paradigm conflicts with how human cognition actually works. Four well-established principles explain why.

The generation effect. Actively producing information creates stronger memory and understanding than passively reviewing it. This has been replicated for decades. When AI writes and you review, you skip the neural encoding that comes from creating. An MIT EEG study found ChatGPT users had the lowest brain engagement across all 32 measured regions, and 83% couldn’t recall key points from their own AI-assisted essays.

Flow state disruption. Csikszentmihalyi’s flow requires a balance between challenge and skill. AI collapses that balance. Either the AI handles the challenge (too easy, you’re disengaged) or you review unfamiliar AI output (wrong kind of challenge, you’re anxious and context-switching). The prompt-wait-review cycle breaks the sustained focus that flow needs. BCG’s research described the shift as going from “carpentry” to “air traffic control” — from building things to monitoring streams.

Self-determination collapse. Ryan and Deci’s Self-Determination Theory identifies three basic needs: autonomy, competence, and relatedness. A 2026 study measured all three under different AI modes. When AI generates and humans edit (passive use), self-efficacy, ownership, and meaningfulness all declined — and the effects persisted after returning to manual work. The initial satisfaction boost reversed. People felt worse about their own abilities after using AI passively.

Vigilance limitation. Sustained monitoring of outputs you didn’t create is one of the weakest human cognitive capabilities. Performance degrades after about 20 minutes. Every safety argument for AI assumes indefinite vigilance. This assumption has been falsified in every industry that has tested it. In healthcare, 84% of AI-generated clinical notes get edited — but the NEJM pointed out that assuming every doctor catches every error every time is hope, not a safety model.


The economic argument for better design

These cognitive failures have a direct economic consequence: they inflate verification and error costs, which is what compresses AI’s raw advantage from 53,000x to 1-5x.

In the “AI generates, human reviews” model, the reviewer didn’t write the code. They lack the mental model that the writer would have. They’re doing cold review. GitClear found AI adoption increased code review time by 91%. CodeRabbit found AI pull requests wait 4.6x longer in review queues, with a 32.7% acceptance rate versus 84.4% for human code.

Consider the alternative: the human makes the design decisions and writes the logic. AI handles mechanical implementation. The human reviews their own logic implemented by AI, not a stranger’s logic written by a machine. Verification cost drops because the reviewer has context. Error cost drops because the judgment calls were made by the human.

A 2026 economics paper formalizes this: AI accuracy costs are convex — good performance is cheap, near-perfect is disproportionately expensive. Full automation hits diminishing returns fast. The cost-minimizing equilibrium is partial automation. The paper’s conclusion: “Partial automation emerges as the cost-minimizing equilibrium in most cases, not merely a transitional phase.”
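One common way to see the convexity (my illustration, not the paper’s functional form): model the cost of reaching accuracy a as proportional to 1/(1 − a), so each additional “nine” of reliability costs roughly ten times the last.

```python
# Illustrative convex accuracy cost: an assumed form, not the paper's model.
# cost(a) = base / (1 - a): each extra "nine" costs ~10x the previous one.

def ai_accuracy_cost(a: float, base: float = 1.0) -> float:
    return base / (1.0 - a)

for a in (0.90, 0.99, 0.999, 0.9999):
    print(f"accuracy {a}: cost {ai_accuracy_cost(a):>8,.0f}")
# accuracy 0.9:    cost       10
# accuracy 0.99:   cost      100
# accuracy 0.999:  cost    1,000
# accuracy 0.9999: cost   10,000
# At some accuracy level, paying a human to handle the residual errors
# is cheaper than buying the next increment of machine accuracy.
```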


New evidence: five studies that changed the picture

Beyond the cross-domain patterns, five recent studies provide direct evidence on how interaction design affects outcomes.

Wharton’s ChatGPT study (N=1,000, PNAS). Same AI, different design, opposite outcomes. Students with unrestricted access scored 17% worse on exams than no-AI students. Socratic mode — hints instead of answers — improved practice performance by 127% and matched controls on exams. The AI was identical. The interaction design determined whether it helped or harmed.

Scientific Reports passive vs active (N=269, RCT, 2026). Three conditions: no AI, passive (AI generates, human edits), active (human drafts, AI refines). Passive use reduced self-efficacy, ownership, and meaningfulness — effects persisting after returning to manual work. Active collaboration preserved all three. Same productivity. Opposite wellbeing outcomes.

Anthropic’s AI Fluency Index (millions of interactions). “Augmentative” users (enhance own work with AI) showed 2x the fluency behaviors and were 5.6x more likely to question AI reasoning than “delegative” users (hand work to AI). Generating outputs made users less critical. Iterative refinement made them more critical. The interaction mode shapes not just output quality but the user’s relationship with their own judgment.

DORA’s flow-value paradox (2024). AI tools increase flow state AND decrease perceived value of developers’ own work. More flow, less meaning. This is “shallow flow” — absorption without purpose. If AI handles the interesting parts and leaves verification, the remaining flow may be the wrong kind.

Wharton’s chess study. System-regulated assistance produced 64% gains versus 30% for on-demand. Users who chose when to use AI overused it. The system had to enforce friction users wouldn’t choose. This runs counter to every product instinct in the industry.


Do efficiency and happiness conflict?

This is the question I initially assumed had a simple answer. It doesn’t.

About 10 studies have now measured both productivity and satisfaction simultaneously with AI tools. The results split three ways:

Compatible (4-5 studies). Noy & Zhang (Science, N=444): writing 40% faster + quality up 18% + satisfaction up. Brynjolfsson (NBER, N=5,179): customer service +14% throughput + retention up. P&G field experiment (N=776): individuals with AI matched team-level innovation output + positive emotional responses. These all share a feature: AI was used as augmentation, not replacement. The human retained agency.

Conflicting (3-4 studies). Nature Scientific Reports (N=3,562): performance up immediately, but intrinsic motivation dropped when returning to solo work. MIT cognitive debt (N=54): users reported feeling sharper while EEG showed lowest neural engagement; 83% couldn’t recall key points from their own essays. Wharton PNAS (N=1,000): practice scores +48% (efficient!) but exam scores -17% (learning destroyed). The pattern: short-term efficiency gains masking long-term capability erosion.

Design-dependent (2 studies). Scientific Reports passive vs active (N=269): same productivity, opposite wellbeing outcomes depending on interaction mode. This is the most important finding: the conflict is not inherent to AI. It is a product of specific design choices.

The honest synthesis: AI can increase both efficiency and happiness — but only when the interaction design preserves human agency, cognitive engagement, and skill-building. Designs that maximize short-term efficiency by minimizing human effort systematically undermine long-term satisfaction, self-efficacy, and meaning.

This is not “they’re compatible, don’t worry.” It is: “there is a design space where they’re compatible, but the default design choice in every major AI product is not in that space.”

Three conditions moderate the relationship:

Time horizon. Short-term, they usually align — AI feels helpful and produces faster output. Long-term, tension emerges as motivation erodes and skills atrophy. The Nature Scientific Reports finding (N=3,562) — that AI boosts performance but damages motivation for subsequent independent work — is the clearest evidence of this temporal conflict.

Skill level. For novices, both dimensions improve together: AI acts as a mentor, learning itself is satisfying, capability grows (Brynjolfsson: +34% for novices). For experts, the tradeoff is harder: efficiency gains are small (METR: -19% for experienced developers), and AI can threaten the competence that defines expert identity.

Task type. Removing tedium aligns both dimensions (everyone is happier when boring work disappears). Removing judgment puts them in conflict (people derive meaning from making decisions, even hard ones).


What the evidence says works

Five interaction patterns consistently produce better outcomes than “AI generates, human reviews.” They share one property: the human does the cognitive work.

Hints, not answers. The Wharton PNAS study is the cleanest evidence. Socratic AI preserves learning while accelerating practice. Intelligent tutoring systems meta-analysis (144 studies): scaffolding effect d = 0.46.

Human first, AI second. The Scientific Reports study tested this directly. Human drafts first: self-efficacy preserved. AI generates first: self-efficacy damaged. In radiology, human-first workflows (85.0%) beat AI-first (80.8%). Form your judgment before seeing AI’s.

AI executes human judgment at scale. Meta’s DrP: 50,000 automated analyses/day, MTTR down 20-80%. Engineers write investigation logic; AI executes their thinking. Five years in production.

Built-in friction. Wharton chess: system-regulated assistance 64% vs on-demand 30%. Students know overuse is harmful but can’t self-regulate. The system has to enforce limits.

Social context. CHI 2026 triadic programming: having another human present reduced AI dependence more than any technical design. Social accountability prevents cognitive offloading.


The verification cost connection

Here is where the economic and human arguments converge.

In “AI generates, human reviews,” verification is expensive because the reviewer lacks context. Interactive reasoning interfaces (N=125) cut verification time 10.5% and improved error detection 12.1 percentage points. Letting reviewers interact with AI reasoning rather than just read it helps.

But the deeper fix is changing who does the reasoning. In “human judges, AI implements,” the human reviews the implementation of their own decisions. They have context. Verification cost drops because the cognitive architecture of the review changed.

Both problems — inflated verification cost and degraded human capability — stem from the same root cause: giving AI the judgment and leaving humans with the checking. Fix one and you fix both.

But — and this is where my earlier version of this article was too clean — “fixing both” is conditional. It works when the task involves judgment. It works when there’s time for the human to think. It works when the human has enough expertise to make the judgment. For mechanical tasks under time pressure, Generate mode is genuinely more efficient, and the human cost of using it is low because there’s nothing meaningful to lose. The answer is not “always Scaffold.” It is “know which mode fits the situation.”


Four modes

AI tools need four interaction modes, shifting based on context.

Generate. AI leads, human reviews. Routine, mechanical, low-stakes tasks where understanding doesn’t matter. Boilerplate, formatting, scheduling. Current default. Should be the minority. Risks: vigilance decrement, offloading, deskilling.

Scaffold. AI provides hints, structure, partial solutions. Human completes the work. For any task where understanding matters — debugging, learning, writing, diagnosing. The Wharton finding: preserved difficulty preserves learning. Show approaches, not solutions.

Challenge. AI acts as adversary, critic, stress-tester. High-stakes decisions, novel situations, creative work. “Here’s why your architecture might fail.” “This diagnosis might be wrong because…” Makes thinking better, not easier.

Step back. AI deliberately does nothing. Flow state, skill-building struggle, personal voice work. First stretch of a coding session. Critical debugging. Writing that needs to sound like you. Aviation mandates manual flying hours for a reason.


When to use which: the context-dependent framework

No single mode is always right. The correct mode depends on four variables:

Task type — Is this mechanical (boilerplate), creative (design decisions), judgment (debugging, diagnosing), or learning (building new skills)?

Skill level — Novice (needs teaching), intermediate (needs practice), or expert (needs challenge, not help)?

Time pressure — Can we afford to learn, or must we ship?

Consequence severity — If AI is wrong, is it a formatting error or a patient death?

Some concrete mappings:

| Situation | Mode | Why | Tradeoff |
| --- | --- | --- | --- |
| Expert + mechanical task + any pressure | Generate | Expert verifies fast, task is routine | Efficiency high, no meaningful skill loss |
| Novice + any task + low pressure | Scaffold | Best learning opportunity | Efficiency moderate, learning maximized |
| Novice + mechanical task + high pressure | Generate now, study later | Must ship, but schedule learning after | Accept short-term skill gap |
| Anyone + judgment task + high consequence | Challenge | AI stress-tests human reasoning | Efficiency moderate, error rate lowest |
| Anyone + learning task + any pressure | Scaffold or Step Back | Never Generate for learning (Wharton’s -17%) | Accept slower output for capability |
| Crisis + known problem with playbook | Generate (execute playbook) | Meta DrP: codified logic runs automatically | Only for pre-encoded known issues |
| Crisis + novel problem | Scaffold | AI gathers data fast, human makes all decisions | Judgment must stay with humans when stakes are highest |

The key insight: time pressure changes who does the mechanical work. It never changes who does the judgment. Even under a deadline, the human makes the decisions and AI accelerates the implementation. The only exception is pre-encoded problems with tested playbooks.

For novices vs experts: novices need more Scaffold (AI as teacher). Experts need more Challenge (AI as sparring partner). Giving experts beginner-style scaffolding wastes their time. Giving novices expert-style challenge frustrates them. A system that can’t distinguish between users treats everyone suboptimally.
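As a sketch, the four-variable mapping above could be encoded roughly like this. The category names, parameters, and precedence order are my assumptions for illustration, not a tested implementation:

```python
# Hypothetical encoding of the mode-selection framework above.
# Categories and precedence order are illustrative assumptions, not a spec.
from enum import Enum

class Mode(Enum):
    GENERATE = "generate"     # AI leads, human reviews
    SCAFFOLD = "scaffold"     # AI hints, human completes
    CHALLENGE = "challenge"   # AI stress-tests human reasoning
    STEP_BACK = "step_back"   # AI deliberately does nothing

def choose_mode(task: str, skill: str, high_pressure: bool,
                high_consequence: bool, has_playbook: bool = False) -> Mode:
    """task: 'mechanical' | 'creative' | 'judgment' | 'learning'
       skill: 'novice' | 'intermediate' | 'expert'"""
    if task == "learning":
        # Never Generate for learning (Wharton's -17% on exams); Step Back
        # is the other defensible choice here.
        return Mode.SCAFFOLD
    if high_pressure and high_consequence:
        # Crisis: Generate only for pre-encoded playbooks; otherwise the human
        # keeps all judgment and AI just accelerates data-gathering.
        return Mode.GENERATE if has_playbook else Mode.SCAFFOLD
    if task == "judgment" and high_consequence:
        return Mode.CHALLENGE             # AI as adversary for high stakes
    if task == "mechanical":
        # Routine work: experts verify fast; novices still learn from it.
        return Mode.GENERATE if skill == "expert" or high_pressure else Mode.SCAFFOLD
    return Mode.CHALLENGE if skill == "expert" else Mode.SCAFFOLD
```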


The efficiency-happiness tradeoff by mode

| Mode | Short-term efficiency | Long-term efficiency | Short-term happiness | Long-term happiness | Skill |
| --- | --- | --- | --- | --- | --- |
| Generate | Highest | Declining (skill erosion → worse verification → more errors) | Medium (novelty) → Low (boredom, meaninglessness) | Low (self-efficacy drops) | Declining |
| Scaffold | High | Stable or rising (skills maintained → fast verification → fewer errors) | High (achievement, ownership) | High (competence maintained) | Maintained/growing |
| Challenge | Medium | Rising (deeper thinking → capability growth) | Medium (challenging, sometimes frustrating) | High (growth, mastery) | Growing |
| Step Back | Lowest | Rising (independent capability preserved) | Varies (some people enjoy struggle, some don’t) | High (self-confidence, autonomy) | Highest growth |

If you optimize only for short-term efficiency, Generate wins. If you optimize for the product of efficiency × happiness × skill over time, Scaffold wins. The industry optimizes for the short-term efficiency column. The evidence suggests the long-term columns matter more.


The experiment that doesn’t exist yet

No single study has tested this complete framework. The closest are the Wharton PNAS study (scaffold vs generate for learning), the Scientific Reports study (passive vs active for self-efficacy), and METR (AI vs no-AI for experienced developers). None measured all the variables that matter simultaneously: generation cost, verification cost, error rate, skill retention, and satisfaction.

Four arms. (1) No AI. (2) AI generates, human reviews. (3) AI handles mechanical, human handles judgment. (4) AI challenges human’s reasoning, implements human’s decisions.

Participants. 60-100 professional developers, stratified by experience. Real tasks from boilerplate to debugging production race conditions.

Metrics measured simultaneously: generation cost, verification cost, error rate and severity, total economic cost, 2-week skill retention, self-efficacy, perceived vs actual productivity, satisfaction and meaning.

Duration. 4-6 weeks active, 2-week washout, 2-week skill retention.
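For concreteness, here is a hypothetical sketch of the per-arm bookkeeping. The field names, units, and cost decomposition are my assumptions; none of this is a published protocol:

```python
# Hypothetical per-arm record for the proposed experiment.
# Field names, units, and the cost decomposition are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ArmResult:
    generation_cost: float    # $ of human + AI time to produce the output
    verification_cost: float  # $ of review time
    error_cost: float         # $ of rework, weighted by error severity
    skill_retention: float    # 2-week retention score, 0..1
    self_efficacy: float      # survey score, 0..1
    satisfaction: float       # survey score, 0..1

def total_economic_cost(r: ArmResult) -> float:
    """The quantity the hypothesis predicts is lower for Arm 3 than Arm 2."""
    return r.generation_cost + r.verification_cost + r.error_cost
```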

If the hypothesis is correct, Arm 3 produces lower total cost than Arm 2 (because verification and error costs drop when humans retain judgment context) while also producing better skill retention and satisfaction. If it’s wrong, that’s equally important to know.

The current state — companies choosing interaction designs worth trillions of dollars based on intuition rather than measurement — is what needs to change regardless of which arm wins.


What we’re actually optimizing for

Current AI tools optimize for task completion speed. The implicit question: “Did the output get produced?”

A better question: “Is the human-AI system getting stronger over time?”

That means measuring not just output but understanding. Not just this sprint but the next five. It means treating human judgment, skill, and engagement as system resources that degrade under the wrong conditions and grow under the right ones.

The economic and human arguments converge — but with a condition I initially glossed over. They converge when the interaction design preserves human agency and cognitive engagement. They diverge when short-term efficiency is prioritized over long-term capability. They depend on task type, skill level, time pressure, and consequence severity.

This is not a simple “do X and both improve.” It is: there exists a design space where both improve, the evidence tells us roughly where that space is, the default design in every major AI product is not in it, and the right design depends on context in ways that a one-size-fits-all tool cannot capture.

Building AI tools that adapt to context — that Generate for boilerplate, Scaffold for judgment, Challenge for high-stakes decisions, and Step Back for learning — is harder than building tools that do one thing. It is also the only approach the evidence supports for producing both economic value and human value over time.


Sources

Research for this article is compiled in: