In 1984, educational psychologist Benjamin Bloom published a finding that has haunted education for four decades: students who received one-to-one tutoring performed two standard deviations higher than students in conventional classrooms — the average tutored student outperformed 98% of regular students. The problem was that one-to-one tutoring is too expensive for any society to deploy at scale. Bloom called this the “2 Sigma Problem”: can we find a method of group instruction as effective as one-to-one tutoring?
Forty years later, AI appeared to offer an answer. ChatGPT is available 24/7, infinitely patient, provides instant feedback, and personalizes responses. In theory, this is the perfect tutor Bloom dreamed of.
The data tells a different story.
Cognitive debt
In 2025, Hamsa Bastani’s team at Wharton published a randomized controlled trial in PNAS. They ran an experiment with nearly 1,000 high school math students in Turkey, setting up three groups:
- GPT Base: a standard GPT-4 chat interface with no restrictions
- GPT Tutor: the same GPT-4, but with teacher-designed guardrails — hints only, no complete answers
- Control: no AI, just textbooks and notes
The experiment had two phases. In the first, students worked on practice problems under their respective conditions. In the second, everyone took a closed-book exam — no AI, no help of any kind.
The practice-phase results were what you’d expect: GPT Base improved practice scores by 48% over the control group, and GPT Tutor by 127%. AI works.
Then came the exam.
The GPT Base group scored 17% worse than the control group — the students who never had AI access at all. Not tied, not slightly worse — significantly worse. Students with unrestricted AI learned less than students with nothing.
The researchers called this “cognitive debt” — borrowing the concept of technical debt. What you get quickly now, you repay with interest later.
The GPT Tutor group? Their exam scores were virtually identical to the control group (−0.4%, not statistically significant). The cognitive debt was completely eliminated.
Crutch vs. scaffold
The key to this result isn’t whether AI is useful. It’s how students use it.
The researchers analyzed every conversation between students and the AI. In the GPT Base group, the most common message was: “What is the answer?” Students then copied the response directly. In the GPT Tutor group, students sent significantly more messages, saying things like “I tried this approach but got stuck” and “Can you give me a hint?”
A more elegant analysis went deeper: GPT Base gave incorrect answers 49% of the time when directly solving math problems (42% logical errors, 8% arithmetic errors). If students were being misled by wrong answers, then GPT Base making more errors on a practice problem should predict worse performance on the corresponding exam problem. But the data showed no statistically significant relationship between error rate and exam performance.
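The shape of that check is worth seeing. Below is a minimal sketch in Python with statsmodels, run on fabricated placeholder data; the variable names, the linear-probability specification, and the data are all mine, not the paper’s:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical per-(student, problem) records: how often GPT Base erred
# on a practice problem, and whether the student solved the matched exam
# problem. Real values would come from the study's conversation logs.
n = 500
gpt_error_rate = rng.uniform(0, 1, n)   # AI error rate on the practice problem
exam_correct = rng.integers(0, 2, n)    # outcome on the matched exam problem

# Linear probability model: if wrong AI answers misled students, the
# coefficient on gpt_error_rate should be negative and significant.
# The paper reported no statistically significant relationship.
X = sm.add_constant(gpt_error_rate)
print(sm.OLS(exam_correct, X).fit().summary())
```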
This means students weren’t reading or understanding the AI’s responses at all. They were copy-pasting. The text passed through the screen but not through the brain.
This is consistent with the MIT Media Lab’s EEG study — ChatGPT users showed the lowest neural connectivity across all 32 EEG channels measured, and 83% couldn’t recall key arguments from their own essays. The essays carried their names, but the ideas weren’t theirs.
An even more sobering finding: the students didn’t know they hadn’t learned. The GPT Base group scored worse on the exam but reported believing they had learned more and performed better. This perception-reality gap mirrors what I described in my article on AI’s cognitive cost — METR found developers were 19% slower with AI but believed they were 20% faster.
Why hints beat answers
The GPT Tutor’s design was remarkably simple. It used the same model as GPT Base (GPT-4). The only difference was the system prompt. The paper published the full prompt; its core rules, sketched in code after this list, were:
- Never give the full solution. “You should in no circumstances provide the student with the full solution.”
- Require students to show their work first. No help until the student demonstrates what they’ve tried.
- Progressive support. Start with minimal information. Give more only if the student is still stuck.
- Teacher-authored solutions and common mistakes. The prompt included correct solutions and hints for common errors — mitigating GPT-4’s hallucination problem.
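As a concrete illustration, here is a minimal sketch of how such guardrails can be wired into a chat API call with the official openai Python package. Only the quoted sentence comes from the paper’s published prompt; the rest of the prompt text, the example problem, and the teacher material are placeholders of my own, not the study’s implementation:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Teacher-authored material for one problem -- illustrative placeholders.
problem = "Solve 2x + 6 = 14 for x."
teacher_solution = "Subtract 6 from both sides, then divide by 2: x = 4."
common_mistake = "Students often divide by 2 before subtracting 6."

# Guardrail prompt in the spirit of the GPT Tutor condition. Only the
# second line is verbatim from the paper's published prompt.
system_prompt = f"""You are a math tutor.
You should in no circumstances provide the student with the full solution.
Before helping, ask the student to show what they have tried so far.
Start with the smallest possible hint; give more only if they stay stuck.
Ground every hint in this teacher-provided solution: {teacher_solution}
Watch for this common mistake: {common_mistake}"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"{problem}\nWhat is the answer?"},
    ],
)
print(response.choices[0].message.content)  # should be a hint, not x = 4
```

The design choice worth noticing: all of the pedagogy lives in the system prompt and the teacher-supplied material; the model itself is unchanged.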
Behind this design is one of the most robust findings in cognitive psychology: desirable difficulties. UCLA’s Robert and Elizabeth Bjork have studied a counterintuitive phenomenon since the 1990s: conditions that make learning harder actually enhance long-term retention and understanding.
Their core principle: current performance is an unreliable index of learning. Methods that make you perform better during practice (like giving answers directly) may make you remember less later. Methods that make practice harder (like giving only hints) actually make you remember more.
This explains why GPT Tutor outperformed GPT Base in both practice (+127% vs +48%) and exams (−0.4% vs −17%, relative to control). GPT Tutor’s hints were accurate (because the prompt included teacher-provided solutions), and it forced students to do their own reasoning. GPT Base’s answers were often wrong (49%), and students weren’t thinking at all.
The generation effect provides further explanation. Decades of meta-analytic research in cognitive psychology show that actively generating information produces better memory than passively receiving it — an effect size of approximately 0.40 standard deviations. When you derive a formula yourself (even with hints), your brain encodes it more deeply than when you simply read the answer. GPT Tutor preserved the generation effect. GPT Base destroyed it.
This isn’t just about math
The same pattern repeats across domains.
Anthropic ran a similar experiment with 52 software engineers. Those who fully delegated to AI scored below 40% on comprehension tests. Those who used AI only for conceptual questions scored above 65%. Same tool, different usage patterns, completely different learning outcomes.
A 2026 study in Scientific Reports found the same pattern in writing. Passive AI use (copying AI-generated content) undermined self-efficacy, psychological ownership, and work meaningfulness — effects that persisted even after returning to manual work. Active collaboration (human drafts first, AI refines) preserved all three. The order determines the outcome: human first, then AI is fine. AI first, human edits is damaging.
RAND’s late-2025 survey found that AI homework use among middle and high school students rose from 48% to 62%. Simultaneously, 67% of students believed “the more students use AI for their schoolwork, the more it will harm their critical thinking skills.” They know something is wrong. They keep using it anyway.
An analysis of 6,875 student essays found a “Quality-Homogenization Tradeoff”: AI-assisted essays scored higher, but their structural variance dropped 70-78%. Everyone used the same argument patterns, the same transitions, the same “In conclusion” endings. They learned to use AI. They didn’t learn to think.
What actually works
If unrestricted AI is the worst model, what’s the best?
Khanmigo (Khan Academy) is probably the largest-scale experiment. It uses the Socratic method — guiding students through questions rather than giving answers. By late 2025, it had 5 million users across 110 countries and over 300 million student interactions. An RCT showed algebra scores improved by 0.34 standard deviations when used three times weekly for a semester. English Language Learners also saw substantial gains: 0.31 standard deviations.
But a critical finding: without active teacher involvement, student engagement dropped 60% after three weeks. Technology doesn’t replace teachers.
This leads to perhaps the smartest design: Stanford’s Tutor CoPilot. Instead of having AI tutor students directly, it helps tutors be better tutors. In an RCT with over 700 tutors and 1,000 students from underserved communities, students of tutors using Tutor CoPilot were 4 percentage points more likely to master math topics. The biggest gains came from the lowest-rated tutors: +9 percentage points. Cost: about $20 per tutor per year.
The logic: when AI tutors students directly, students easily treat it as a crutch. When AI assists a tutor, the human remains the one interacting with the student — AI just helps the tutor ask better questions and provide more precise hints. The human role is preserved.
Has Bloom’s dream been realized?
Back to Bloom’s 2 Sigma Problem. One-to-one tutoring produces a 2 standard deviation improvement. Khanmigo achieves 0.23-0.34 standard deviations. The gap remains enormous.
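To put that gap on a common scale: under the usual normal approximation (a back-of-envelope conversion of my own, not a figure from the papers), an effect size of d standard deviations places the average treated student at percentile Φ(d) of the control distribution:

```latex
% \Phi is the standard normal CDF.
\Phi(2.00) \approx 0.98 \quad \text{(Bloom's tutored students: the 98th percentile)}
\Phi(0.34) \approx 0.63 \quad \text{(Khanmigo's upper estimate: roughly the 63rd percentile)}
```

A jump from the 50th to the 63rd percentile is real progress; it is also nowhere near the 98th.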
But I don’t think the bottleneck is AI capability. GPT-4 outperforms average teachers on most math exams. The bottleneck is design — how we structure the interaction between AI and student.
The Wharton study’s greatest contribution isn’t discovering the problem. It’s proving the problem is solvable. Same underlying model, different prompt design, and the result swings from −17% (cognitive debt) to 0% (debt eliminated). The technology didn’t change. Only the interaction did.
This aligns with the framework I proposed in my earlier article: AI tools need to shift dynamically between four modes — generate, scaffold, challenge, and step back. In learning contexts (a code sketch follows the list):
- Generate mode for mechanical tasks (data formatting, code boilerplate) — understanding doesn’t matter
- Scaffold mode for skill building — provide hints, structure, partial answers; student completes the reasoning
- Challenge mode for deep understanding — “Are you sure this proof holds?” “Under what conditions would this solution fail?”
- Step-back mode for critical training — AI does nothing, student faces difficulty alone
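A minimal sketch of what that mode switching could look like in code. The four mode names come from the framework above; the prompts and the dispatch rules are illustrative placeholders, not any shipping product’s logic:

```python
from enum import Enum

class Mode(Enum):
    GENERATE = "generate"    # do the mechanical work outright
    SCAFFOLD = "scaffold"    # hints and structure, never the full solution
    CHALLENGE = "challenge"  # probe whether the understanding is real
    STEP_BACK = "step_back"  # stay out of the way entirely

# Illustrative system prompts per mode -- not from any published tool.
SYSTEM_PROMPTS = {
    Mode.GENERATE: "Complete the mechanical task directly and show the result.",
    Mode.SCAFFOLD: (
        "Never provide the full solution. Ask what the student has tried, "
        "then give the smallest hint that lets them take the next step."
    ),
    Mode.CHALLENGE: (
        "The student believes they are done. Ask under what conditions their "
        "solution would fail, and whether each step really holds."
    ),
    Mode.STEP_BACK: "Reply only: 'Try it on your own first, then come back.'",
}

def pick_mode(task_is_mechanical: bool, skill_is_target: bool,
              student_claims_done: bool) -> Mode:
    """Crude dispatch rule; the point is that generate is not the default."""
    if task_is_mechanical and not skill_is_target:
        return Mode.GENERATE
    if student_claims_done:
        return Mode.CHALLENGE
    if skill_is_target:
        return Mode.SCAFFOLD
    return Mode.STEP_BACK
```

The specifics are placeholders; the architectural point is that answer-on-demand is one mode among four, not the default behavior.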
Nearly every AI learning tool today is stuck in generate mode. Their default behavior: you ask, I answer. The Wharton study proved this is the worst mode. The best mode is having AI act like a good teacher — not the person who gives you answers, but the person who helps you find them yourself.
Difficulty is a feature, not a bug
Vygotsky proposed the “zone of proximal development” a century ago — the gap between what a student can do independently and what they can achieve with help. Good teaching provides scaffolding within this zone: enough support that the student doesn’t give up, not so much that the student doesn’t think.
AI’s problem is how easily it turns scaffolding into a crutch. A study on programming education used grounded theory to analyze this distinction: when is AI a tool, when is it a tutor, and when is it a crutch? When AI provides support within the zone of proximal development, it’s a tutor. When it replaces the student’s cognitive work, it’s a crutch. The dividing line: whether the student does the hard work of thinking.
Deslauriers et al. at Harvard published a PNAS paper in 2019 with a precise formulation: students in active learning settings felt it was harder and more painful, but actually learned more. Students in passive settings felt good but learned less. The feeling of learning and learning itself are inversely correlated.
This is why the fundamental challenge for every AI learning tool isn’t technical — it’s psychological. When AI gives you the answer, you feel good. You feel like you understand, it feels smooth, it feels efficient. That fluency is an illusion. The Wharton students thought they learned more. The METR developers thought they were faster. Both were wrong, and neither knew they were wrong.
AI tools that get this right will feel less comfortable. They’ll make you do more thinking, make more mistakes, take more time. They won’t give you the answer — they’ll give you a direction and make you walk there yourself. In the short term, this feels like wasting time. But forty years of cognitive science says: this is exactly how you learn.
Difficulty isn’t a bug in learning. Difficulty is a feature. Tools that eliminate difficulty don’t eliminate the boring part of learning — they eliminate learning itself.
Sources: All claims link to primary sources inline. Key studies: Wharton/PNAS cognitive debt (2025) · Anthropic skill formation (2026) · Scientific Reports passive vs active AI (2026) · Bjork & Bjork desirable difficulties (2011) · Bloom 2 Sigma Problem (1984) · Khanmigo overview (2026) · Stanford Tutor CoPilot (2024) · RAND student survey (2026) · MIT Media Lab EEG · Deslauriers perception vs actual learning (PNAS 2019) · Generation Effect meta-analysis