Before I started building AI for incident response, I was the one getting paged at 3 AM. As an engineer at a platform serving over 100 million daily active users, I learned that the hardest part of an incident is never finding the data. It’s figuring out what the data means while your phone is buzzing and a Slack channel is filling with questions. That experience is what led me to build an AI-powered incident management system — and what taught me where AI helps and where it doesn’t.
At QCon London this month, Alex Palcuie said the same thing from the other side. Palcuie is on Anthropic’s AI reliability engineering team. His job is keeping Claude online. He previously ran incident response at Google Cloud Platform. “It would be hypocritical to say that Claude fixes everything,” he said. “My team exists, we’re hiring for many positions. This should show you that no, it doesn’t work.”
The AI SRE market is valued at over $32 billion. Gartner predicts 85% of enterprises will adopt AI SRE tooling by 2029, up from under 5% today. The money is real. But the gap between what the industry promises and what happens in production is wide.
Excellent observer, poor diagnostician
Palcuie broke incident response into four phases: observe, orient, decide, act. AI is “fantastic” at observation. “It reads the logs at the speed of I/O, it doesn’t get bored.”
He recounted a New Year’s Eve incident where Claude identified fraud that a human would have filed as a bug — tracing HTTP 500 errors to 4,000 suspicious accounts created simultaneously. But he also described a recurring failure where Claude’s KV cache would break, monitoring showed a request spike, and Claude gave the same wrong answer every time: “Request volume increase. Capacity problem. Add more servers.” The actual cause was always a cache failure. Claude saw correlation and called it causation.
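The failure is instructive because the missing step is small and mechanical. Below is a minimal sketch, in Python, of the kind of check a human would encode to separate a genuine demand surge from a request spike that is downstream of a cache failure. The metric names, thresholds, and data shapes are hypothetical and not drawn from Anthropic's tooling; they only illustrate the causal question the model kept skipping.

```python
# Minimal, hypothetical sketch of a human-encoded diagnostic check.
# Metric names and thresholds are invented for illustration. The point is
# to ask the causal question explicitly: did demand cause the errors,
# or did a cache failure cause both the errors and the spike?

from dataclasses import dataclass

@dataclass
class WindowStats:
    requests_per_sec: float
    cache_hit_rate: float   # 0.0 - 1.0
    error_rate: float       # fraction of requests returning 5xx

def diagnose(baseline: WindowStats, current: WindowStats) -> str:
    """Compare the incident window against a pre-incident baseline."""
    spike = current.requests_per_sec > 1.5 * baseline.requests_per_sec
    cache_degraded = current.cache_hit_rate < 0.8 * baseline.cache_hit_rate
    errors_elevated = current.error_rate > 5 * max(baseline.error_rate, 0.001)

    if cache_degraded and errors_elevated:
        # A broken cache pushes misses (and client retries) onto the backend,
        # so the request spike is a symptom, not the cause.
        return "cache failure: spike is downstream of cache degradation"
    if spike and not cache_degraded:
        return "possible capacity problem: demand rose, cache is healthy"
    return "inconclusive: escalate to a human"

# Example: a request spike that coincides with the cache falling over.
baseline = WindowStats(requests_per_sec=1_000, cache_hit_rate=0.95, error_rate=0.001)
incident = WindowStats(requests_per_sec=2_400, cache_hit_rate=0.30, error_rate=0.12)
print(diagnose(baseline, incident))  # -> cache failure: spike is downstream ...
```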
A study across ChatGPT, Claude, and Gemini found a 23% hallucination rate for technical details when analyzing production incidents. In a coding workflow, you can iterate past mistakes. During a live incident, a confidently wrong diagnosis burns downtime that costs Global 2000 companies an estimated $9,000 per minute.
There is a structural reason for this. AI coding tools had billions of lines of public code to train on — GitHub, Stack Overflow, documentation. AI SRE tools have almost nothing. Incident reports, runbooks, and postmortems are private, siloed inside each company, and rarely standardized. General-purpose LLMs were never trained on SRE work at scale because the data doesn’t exist in the public domain. Meta fine-tuned Llama 2 on 5,000 internal investigations and got 42% root cause accuracy in the top five suggestions. A recent reinforcement learning approach trained a 14B model on failed diagnostic trajectories and matched Claude Sonnet 4.5 on the AIOpsLab benchmark. The direction is right, but the field is early.
AI does not eliminate toil. It redistributes it.
The Catchpoint SRE Report has tracked toil using the same methodology for eight years. Between 2020 and 2024, toil declined steadily. In 2025, it reversed. In 2026, median reported toil jumped from 20% to 34%.
Asked whether AI had reduced toil, 49% of respondents said yes, 35% reported no change, and 16% said it had increased. The report’s summary: “AI does not remove toil automatically. It redistributes it.”
The gap between management and individual contributors was striking. Directors see clearer incident reports and shorter meetings. Individual contributors are the ones checking AI output, recovering when an automated step acts on bad data, and explaining what the AI did. AI can reduce toil at the coordination level while adding it at the keyboard.
The Jevons paradox
Palcuie named this “the favorite paradox in the AI industry.” When technology makes a resource cheaper to use, consumption rises. AI makes writing code easier, so teams write more. More code means more complexity. More complexity means stranger failures. His summary: “All the improvements in the tooling will be cancelled by this ever-growing complexity.”
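A toy calculation makes the arithmetic concrete. The numbers below are invented purely for illustration: per-change effort falls, volume rises faster, and the total surface the on-call team has to operate grows anyway.

```python
# Toy illustration of the Jevons effect with made-up numbers: AI cuts the
# effort per change, teams respond by shipping more changes, and the total
# operational surface grows even though each change got cheaper.

effort_per_change_before = 1.0      # arbitrary units of engineering effort
effort_per_change_after = 0.6       # AI tooling makes each change 40% cheaper
changes_before = 100
changes_after = 220                 # cheaper changes -> far more of them shipped

total_effort_before = effort_per_change_before * changes_before   # 100.0
total_effort_after = effort_per_change_after * changes_after      # 132.0

# If incident exposure scales roughly with the amount of change in production,
# the failure surface more than doubled while the tooling "improved".
print(total_effort_before, total_effort_after)   # 100.0 132.0
print(changes_after / changes_before)            # 2.2x more change to operate
```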
Mean time to recovery (MTTR) has gotten worse every year since 2021. The share of organizations taking more than one hour to recover rose from 47% to 82%. This happened during a period of unprecedented investment in AIOps. Meanwhile, outage frequency is declining, but the outages that do happen are more severe. The 2024 CrowdStrike incident crashed 8.5 million systems and cost Fortune 500 companies over $5 billion.
AI is making the systems bigger faster than it is making the failures cheaper.
Other industries already solved this
In 1983, cognitive psychologist Lisanne Bainbridge published “Ironies of Automation,” now cited over 1,800 times. Her finding: when you automate most of the work, operators get less practice with the remaining tasks — exactly the tasks they need when automation fails.
Aviation proved her right. Air France 447 crashed in 2009, killing 228 people, when pilots couldn’t hand-fly an automated aircraft after the autopilot disconnected. A study of 30 airline pilots found that all of them performed below certification standards on basic manual maneuvers. The FAA and EASA now mandate recurrent manual flying practice. Google’s DiRT (Disaster Recovery Testing) program, running since 2006, applies the same principle: deliberately injecting failures so engineers get practice when it’s not an emergency.
But practice alone isn’t enough. The question is how to divide the work. A 2025 Stanford-CMU study found that human-led workflows augmented by AI outperformed fully autonomous agents by 68.7%. Full automation was faster and cheaper but achieved 32-50% lower success rates. In healthcare, AI-as-second-opinion workflows improved diagnostic accuracy from 75% to 82-85% while reducing alarm burden by 80%.
Meta’s DrP platform is the strongest example from our own industry. It runs 50,000 automated analyses per day across 300 teams. MTTR dropped 20-80%. But DrP is not autonomous. Engineers codify their investigation logic into analyzers. The machine executes at scale. The knowledge comes from humans. That’s the model that works: humans encode judgment, machines execute volume. We designed our own system around the same division — the AI surfaces signals, correlates alerts, and reconstructs timelines, but the engineer makes the diagnostic call.
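For a concrete picture of that division of labor, here is a small, hypothetical sketch of the analyzer pattern. It is not Meta's DrP API and not our product's code; it only shows humans registering their investigation logic as small functions, the platform running them at scale, and the output stopping at evidence rather than a verdict.

```python
# Hypothetical sketch of the "humans encode judgment, machines execute volume"
# pattern. Names and data shapes are invented for illustration.

from typing import Callable, Dict, List

ANALYZERS: Dict[str, Callable[[dict], List[str]]] = {}

def analyzer(name: str):
    """Register a human-written investigation step under a name."""
    def register(fn: Callable[[dict], List[str]]) -> Callable[[dict], List[str]]:
        ANALYZERS[name] = fn
        return fn
    return register

@analyzer("recent_deploys")
def recent_deploys(incident: dict) -> List[str]:
    # Judgment encoded by a human: deploys near incident start are suspects.
    return [f"deploy {d['sha']} at {d['time']}" for d in incident.get("deploys", [])]

@analyzer("alert_correlation")
def alert_correlation(incident: dict) -> List[str]:
    # Another human-encoded step: surface alerts firing in the same window.
    return [f"alert {a['name']} firing since {a['since']}" for a in incident.get("alerts", [])]

def investigate(incident: dict) -> str:
    """Run every analyzer and return a timeline for the on-call engineer."""
    lines = []
    for name, fn in ANALYZERS.items():
        for finding in fn(incident):
            lines.append(f"[{name}] {finding}")
    return "\n".join(lines) or "no findings; gather more data"

# Example incident payload (hypothetical shape). The engineer reads the
# assembled evidence and makes the diagnostic call.
incident = {
    "deploys": [{"sha": "a1b2c3", "time": "14:02"}],
    "alerts": [{"name": "checkout_5xx", "since": "14:05"}],
}
print(investigate(incident))
```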
There is one more lever. W. Ross Ashby’s Law of Requisite Variety states that a controller must have at least as much variety as the system it controls. The AI SRE industry focuses on amplifying the controller. But the law admits a second strategy: reduce the system’s complexity. Companies that consolidated microservices back into simpler architectures saw MTTR drop 45% and cloud costs drop 63-80%. Sometimes the best reliability investment isn’t a better AI. It’s a simpler system.
Reframing the question
I started building an AI SRE because I’d been on the other side of the pager and believed automation could fix it. After a year of building, testing, and studying what works in production, I believe something different: the value of AI in operations isn’t replacing the engineer’s judgment. It’s making sure the engineer spends that judgment on the right problems — the causal reasoning, the “this has never happened before” moments — instead of on data gathering a machine can do in seconds.
Bainbridge predicted in 1983 that automation would create operators who are simultaneously more needed and less prepared. We are building that exact scenario in software operations. The companies that get AI SRE right will treat it like aviation treats autopilot: a tool that demands more training, not less. The ones that get it wrong will learn what Air France learned — that when automation fails, the humans it was supposed to help won’t be ready.
The future of AI in operations isn’t autonomy. It’s leverage.