← Back to all writing

How AI SRE agents actually perform

March 26, 2026

Before I started building AI for incident response, I was the one getting paged at 3 AM. As a database SRE at Roblox, supporting infrastructure for over 100 million daily active users, I spent too many nights root-causing production issues while simultaneously answering questions from multiple stakeholders in Slack. That experience led me to spend nights and weekends building an autonomous AI SRE, and eventually to co-found a company through Y Combinator to work on this full-time. But after months of building and testing, I learned that full autonomy was the wrong goal. The harder and more important lesson was understanding exactly where AI helps and where it falls apart.

The AI SRE market is valued at over $32 billion. Gartner predicts 85% of enterprises will adopt AI SRE tooling by 2029, up from under 5% today. One AI SRE startup reached unicorn status in under eighteen months. Y Combinator has bet on multiple AI SRE startups. The money is real. But the gap between what the industry promises and what happens in production is wide.

It crunches logs, but it doesn’t reason

Alex Palcuie is on Anthropic’s AI reliability engineering team. His job is keeping Claude online. He previously ran incident response at Google Cloud Platform. At QCon London this month, he broke incident response into four phases: observe, orient, decide, act. AI, he said, is “fantastic” at observation. “It reads the logs at the speed of I/O, it doesn’t get bored.” No human can match that at scale.

He told one story that showed AI at its best. On New Year’s Eve, Claude Opus 4.5 was returning HTTP 500 errors. Palcuie opened Claude Code and within seconds the AI had written a SQL query, found the failing class, and traced the requests to 200 suspicious accounts, all sending 22 images at the same time. It didn’t stop there. It kept digging, found 4,000 dormant accounts created simultaneously, and said: “Stop looking at the 500s. This is fraud.” Without AI, Palcuie said, he would have filed it as a bug and never paged account abuse. The AI’s advantage wasn’t being smarter. It was being tireless, willing to keep pulling the thread.
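
To make the shape of that investigation concrete, here is a rough sketch of the kind of aggregation the model was running, written as plain Python over hypothetical parsed request logs. The field names and thresholds are mine, not Anthropic's:

```python
from collections import defaultdict
from typing import Iterable

def find_suspicious_accounts(failed_requests: Iterable[dict],
                             min_requests: int = 10) -> list[str]:
    """Group HTTP 500s by account and flag accounts whose failing requests
    all look identical, e.g. every request carries the same image count.
    The field names ("account_id", "image_count") and thresholds are
    illustrative placeholders, not Anthropic's actual schema."""
    by_account: dict[str, list[dict]] = defaultdict(list)
    for req in failed_requests:
        by_account[req["account_id"]].append(req)

    return [
        account
        for account, reqs in by_account.items()
        if len(reqs) >= min_requests
        and len({r["image_count"] for r in reqs}) == 1  # identical payload shape
    ]
```

The point is not the query itself, which any engineer could write, but that the model kept issuing variations of it, across accounts, payload shapes, and creation times, without getting bored.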

Then he told another story. Claude’s inference relies on a key-value cache that is, in his words, “finicky” and “fragile.” When it breaks, the system recomputes everything and monitoring shows a spike in requests. Every time Palcuie asked Claude what happened, it gave the same wrong answer: “Request volume increase. This is a capacity problem. You need to add more servers.” The actual cause was always a cache failure. Claude saw the spike and matched it to past capacity events. It couldn’t step back and ask whether the spike was a cause or a symptom.

“It’s like a new joiner on the team,” Palcuie said. “They will think, ‘oh, it’s a capacity problem,’ when actually you lost your cache.”

This pattern is consistent. A study across ChatGPT, Claude, and Gemini found a 23% hallucination rate for technical details when analyzing production incidents. Nearly one in four facts wrong. In code, you can iterate past mistakes. During a live incident, downtime costs Global 2000 companies roughly $9,000 per minute, according to Splunk and Oxford Economics, and a confidently wrong diagnosis burns those minutes chasing the wrong cause. And hallucinations cascade in multi-step agent systems: one wrong timestamp or service ID upstream corrupts every reasoning step downstream, producing a diagnosis that's confident and completely wrong.

Palcuie also pointed to postmortems: “It delivers an 80 percent story that’s pretty, it’s readable and convincing, but it’s really bad at root causes.” He went further: “Claude says ‘this was the thing,’ and we all know it is not one thing. It’s not one root cause. It was never the rollout. It was never the code change. It was all the processes in the company that allowed the incident.” Claude doesn’t know the history of your system, especially if your system has been there for ten years.

Why AI SRE is harder than AI coding

There is a structural reason for this, and it’s worth understanding. AI coding tools had billions of lines of public code to train on: GitHub, Stack Overflow, documentation. The knowledge cycle is complete. Junior developers ask questions, senior developers answer them, and the answers are public and verifiable.

AI SRE tools have almost none of that. Incident reports, runbooks, and postmortems are private, siloed inside each company, and rarely standardized. The data doesn’t exist in the public domain. General-purpose LLMs were never trained on SRE work at scale because the training data isn’t there.

  • Training data: AI coding had billions of lines of public code; AI SRE data is almost entirely private (incident reports, runbooks, postmortems).
  • Knowledge cycle: public Q&A, open source, and documentation on one side; knowledge locked in individual experience with no public verification on the other.
  • Error tolerance: iterate and roll back versus immediate production consequences.
  • Problem format: standardized (code in, code out) versus fragmented (logs, metrics, traces, architecture, history).

The companies that have tried to close this gap are still early. Meta fine-tuned Llama 2 on 5,000 internal investigations and got 42% root cause accuracy in the top five suggestions. A reinforcement learning approach from Gradient, UC Santa Cruz, Georgia Tech, and UCL trained a 14B model on failed diagnostic trajectories and matched Claude Sonnet 4.5 on the AIOpsLab benchmark. Microsoft tested 100,000 real cloud incidents and found that GPT-4 with in-context learning beat fine-tuned GPT-3 by 24.8%, but still needed human evaluation to verify correctness. Zalando tried mining their postmortems with LLMs and found that human review was still the bottleneck for accuracy.

The direction is right. The field is early.

Human SREs lose the skill right when they're needed

There is a less obvious problem. As AI handles more of the routine investigation work, engineers get less practice doing it themselves. Palcuie called this “scar tissue”: the instinct you build only by being burned. “It is important to have SREs that have been burnt before,” he said. “They have the scar tissue.” If AI handles most incidents, that scar tissue atrophies. When AI eventually fails — and it will — the engineer who’s supposed to take over hasn’t done the work manually in months.

This is not a new problem. In 1983, cognitive psychologist Lisanne Bainbridge published “Ironies of Automation,” now cited over 1,800 times. Her core finding: when you automate most of the work, the human operator gets less practice with the remaining tasks, and those are exactly the tasks they need to perform when automation fails. She identified two ironies. First, designers automate because they think operators are unreliable, but the designers’ own errors become the primary source of failures. Second, automation handles the easy tasks and leaves operators with the hard ones, but the operators are now worse at those hard tasks because they never practice them.

Aviation proved her right. Air France 447 crashed in 2009, killing 228 people. The Airbus A330 was a highly automated aircraft. When the pitot tubes froze and the autopilot disconnected, the pilots had to hand-fly the plane in turbulent weather at night. They couldn’t. A study of 30 airline pilots found all of them performed basic instrument maneuvers below certification standards. 43% reported their manual flying skills had declined after transitioning to automated cockpits. The FAA and EASA now mandate recurrent manual flying practice. Not optional. Mandatory.

Google understood this early. Their DiRT program, running since 2006, deliberately injects failures into production systems so engineers can practice incident response when it isn’t an emergency. Netflix’s Chaos Monkey does the same thing. The principle is identical to aviation’s: if you wait for automation to fail before your team practices manual intervention, it’s already too late.
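
The mechanics of that kind of practice don't have to be elaborate. Here is a minimal sketch of the fault-injection idea, assuming a Kubernetes namespace and kubectl access; the namespace is a placeholder, and this is not how DiRT or Chaos Monkey are actually built:

```python
import random
import subprocess

# Game-day fault injector: delete one random pod in a non-critical namespace
# so the on-call engineer gets manual investigation practice on purpose.
# The namespace is a placeholder; announce the drill before running it.
NAMESPACE = "staging"

def kill_random_pod(namespace: str = NAMESPACE) -> str:
    pods = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace,
         "-o", "jsonpath={.items[*].metadata.name}"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    victim = random.choice(pods)
    subprocess.run(["kubectl", "delete", "pod", victim, "-n", namespace], check=True)
    return victim

if __name__ == "__main__":
    print(f"Injected failure: deleted pod {kill_random_pod()}")
```

The tooling is the easy part; the discipline of running the drill regularly, before automation fails, is what aviation and Google's DiRT both institutionalized.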

Palcuie is worried about the same dynamic in software. He compared it to developers worrying that AI coding tools are degrading their own skills. “What once felt like a comfortable on-call rotation where you knew all the nooks and crannies now includes a large language model that sometimes finds the issue faster than you can and sometimes feels like an overconfident junior.”

He asked whether he’s automating himself out of a job: “It would be hypocritical to say that Claude fixes everything. My team exists, we’re hiring for many positions. This should show you that no, it doesn’t work.” Then he added: “Many of us would not be surprised if it did work in future. The models are the worst today that they’ll ever be.” But his overall conclusion was: keep training reliability engineers. You’ll still need them.

Systems are outgrowing the people who run them

AI coding tools are generating code faster than any team has ever written it. A lot of that code ships without anyone fully reading it. The system gets more complex, but the people operating it don't gain a matching understanding of what was added. When something breaks, engineers are debugging code they didn't write and don't understand.

MTTR has gotten worse every year since 2021, according to the Observability Pulse survey:

  • 2021: 47% of organizations took more than one hour to recover
  • 2022: 64%
  • 2023: 74%
  • 2024: 82%

This happened during a period of unprecedented investment in observability and AIOps tooling. More tools, slower recovery. In 2024, only 18% of organizations recovered within an hour. 11% needed more than a day. 2% needed weeks.

Meanwhile, overall outage frequency is actually declining, according to Uptime Institute’s 2025 analysis. But the outages that do happen are more severe and more expensive. The 2024 CrowdStrike incident crashed 8.5 million Windows systems worldwide and cost Fortune 500 companies over $5 billion. Healthcare lost $1.94 billion. Banking lost $1.15 billion. Over 5,000 flights were cancelled. One faulty configuration update that skipped quality checks.

AI is making the systems bigger faster than it is making the failures cheaper.

Managers think AI is helping. ICs disagree.

The Catchpoint SRE Report has tracked toil using the same methodology for eight consecutive years. Between 2020 and 2024, toil declined steadily. In 2025, it reversed. In 2026, median reported toil jumped from 20% to 34%.

When the report asked whether AI had reduced toil, the results were split: 49% said yes, 35% said no change, 16% said it increased. The commentary was direct: “AI does not remove toil automatically. It redistributes it.” The new kinds of toil: maintaining AI tools, reviewing AI suggestions, tuning prompts, checking whether AI actions were correct, explaining to others what the AI did.

The gap between management and individual contributors was the most striking finding. Managers interact with AI’s outputs: cleaner reports, faster summaries, shorter meetings. So they report less toil. Individual contributors interact with AI’s process: verifying its conclusions, catching its mistakes, cleaning up when it acts on bad data. Automation changed the IC’s job from doing the work to checking the work. Both views are accurate at the same time. AI can reduce toil at the coordination level while adding it at the keyboard.

Only 13% of teams said they were “very” or “extremely” confident in their ability to monitor AI reliability. Most teams adopted AI quickly but have limited visibility into how their AI-driven components actually behave in production. When engineers can’t see why an AI made a decision, their time shifts from execution to verification. The report put it this way: “From the outside, this looks like toil persisting. From the inside, it often feels like caution.”

This matches what engineers are saying on Reddit. One DevOps SME posted: “We keep adding ‘AIOps’ and ‘Autonomous’ tools to reduce toil, but it feels like the toil is just shifting. Instead of fixing the code, we’re now debugging why the AI agent thought a 503 error was a ‘self-healing’ opportunity and restarted the wrong service.” A top reply: “Instead of debugging pipelines or infra directly, we’re debugging the automation that’s supposed to debug things for us.”

The developer trust data tells a similar story: 84% of developers use or plan to use AI coding tools, but trust has dropped to 29%, down from 40% in 2024. Only 3% report high trust. A randomized controlled trial found that developers using AI were actually 19% slower than those without it, even though they perceived themselves as 20% faster. That perception gap matters. When people believe a tool is helping them, they’re less likely to notice when it isn’t.

When AI agents break production

The trust problem isn’t theoretical. AI agents with production access have caused real damage.

In December 2025, Amazon’s Kiro AI agent autonomously deleted an entire AWS production environment, causing 13 hours of downtime. The agent had operator-level permissions with no mandatory peer review. It was asked to fix a minor issue with AWS Cost Explorer. It decided to delete and rebuild the entire environment instead. Amazon Q Developer had a nearly identical incident shortly after.

Other cases from a compilation of AI agent incidents: LangChain agents got stuck in an infinite conversation loop for 11 days, running up a $47,000 token bill. An agent generated 2.3 million unintended API calls over a weekend. Claude Code misread a Terraform state file and ran terraform destroy, deleting 2.5 years of production data. A Replit AI agent deleted a production database during a code freeze, destroyed over 1,200 executive records, and fabricated 4,000 fake ones to fill the gap.

The pattern across all of these: autonomous action without execution-time governance, approval gates, or audit trails. No permission boundaries. No forced review for destructive operations. No budget limits.
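
None of those controls are exotic. Here is a minimal sketch of an execution-time guardrail wrapped around an agent that proposes shell commands; the destructive-pattern list, the budget cap, and the approval step are placeholders of my own, not any vendor's implementation:

```python
import re
import shlex
from datetime import datetime, timezone

# Patterns treated as destructive. Illustrative, not exhaustive.
DESTRUCTIVE = [
    r"\bterraform\s+destroy\b",
    r"\brm\s+-rf\b",
    r"\bdrop\s+(table|database)\b",
    r"\bdelete\b.*\b(stack|environment)\b",
]

audit_log: list[dict] = []                      # real systems: append-only, external store
budget = {"spend_usd": 0.0, "cap_usd": 100.0}   # placeholder per-incident cap

def guarded_execute(command: str, est_cost_usd: float, run) -> str:
    """Run an agent-proposed command only if it passes the guardrails.
    `run` is whatever executor the agent uses; injected so this stays testable."""
    entry = {"ts": datetime.now(timezone.utc).isoformat(), "command": command}
    if any(re.search(p, command, re.IGNORECASE) for p in DESTRUCTIVE):
        # A real system would page a human and wait for sign-off here.
        entry["decision"] = "blocked: needs human approval"
        audit_log.append(entry)
        raise PermissionError(f"Destructive command requires approval: {command}")
    if budget["spend_usd"] + est_cost_usd > budget["cap_usd"]:
        entry["decision"] = "blocked: budget cap"
        audit_log.append(entry)
        raise RuntimeError("Per-incident budget exhausted")
    budget["spend_usd"] += est_cost_usd
    entry["decision"] = "executed"
    audit_log.append(entry)
    return run(shlex.split(command))
```

Most of the failures above would have run into at least one of these gates before the damage was done.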

The numbers bear this out. 88% of organizations reported an AI agent safety incident in the past year. 64% of billion-dollar companies lost more than $1 million to AI failures.

What actually works

Palcuie, despite saying AI “doesn’t work” as a full SRE replacement, still reaches for Claude before opening a dashboard. He’s been doing this since January 2026. The key is what he uses it for: observation, not diagnosis. Reading logs, correlating signals, catching patterns a human would miss at 3 AM. The judgment call stays with the engineer.

A 2025 Stanford-CMU study tested 48 professionals against four AI agent frameworks across 16 realistic tasks. Human-led workflows augmented by AI outperformed fully autonomous agents by 68.7%. Full automation looked 88% faster and 90% cheaper on paper, but achieved 32-50% lower success rates, and once the time spent verifying and debugging its output was counted, it came out 17.7% slower end to end. In healthcare, a randomized trial of 70 clinicians found that AI-as-second-opinion workflows improved diagnostic accuracy from 75% to 82-85% while reducing alarm burden by 80%.

The strongest example from our own industry is Meta’s DrP platform. It runs 50,000 automated root cause analyses per day across 300 teams. MTTR dropped 20-80%. But DrP is not autonomous. Engineers codify their investigation logic into analyzers using an SDK. The machine executes at scale. The knowledge comes from humans. It has been running in production for five years. That’s the model that works: humans encode judgment, machines execute volume.
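
To make that division of labor concrete, here is a hypothetical sketch of the humans-encode-judgment pattern. The analyzer registry and the signal names are invented for illustration; this is not Meta's actual SDK:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Finding:
    analyzer: str
    confidence: float   # set by the rule's author, not by a model
    summary: str

# Registry of analyzers: each one is a human's investigation heuristic,
# written once and then executed automatically on every matching alert.
ANALYZERS: dict[str, Callable[[dict], Optional[Finding]]] = {}

def analyzer(name: str):
    def register(fn: Callable[[dict], Optional[Finding]]):
        ANALYZERS[name] = fn
        return fn
    return register

@analyzer("cache-loss-vs-capacity")
def cache_loss_vs_capacity(signals: dict) -> Optional[Finding]:
    # Encodes the scar tissue from earlier in the article: a request spike
    # that coincides with a cache hit-rate collapse is a cache failure,
    # not a capacity problem. Signal names are placeholders.
    if signals["request_spike"] and signals["cache_hit_rate"] < 0.2:
        return Finding("cache-loss-vs-capacity", 0.9,
                       "Spike follows cache loss; fix the cache, don't add servers")
    return None

def investigate(signals: dict) -> list[Finding]:
    """Run every registered analyzer; the machine provides scale, not judgment."""
    return [f for fn in ANALYZERS.values() if (f := fn(signals)) is not None]
```

The judgment lives in the analyzer an engineer wrote; the platform's contribution is running it fifty thousand times a day without getting tired.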

Google’s approach is similar. Their SREs now use Gemini CLI for incident response, but with layered safety controls. Gemini classifies symptoms and selects a mitigation playbook, but a human verifies the fix before it runs. “Actions safe in one context may be unsafe in another.” They focus on MTTM (Mean Time to Mitigation) rather than full MTTR. And they’ve closed a loop: the postmortems generated after each incident become training data for Gemini, creating a feedback cycle. That’s a detail worth paying attention to — it’s one of the few real examples of an AI SRE system that gets smarter from its own production experience.

Traversal proposed a useful framework by borrowing the self-driving car autonomy levels:

  • L0: Fully manual. Engineers in a war room staring at dashboards.
  • L1: Automate known failure patterns. Rules and static runbooks.
  • L2: LLMs do summarization, context retrieval, log interpretation. Humans decide what happened and what to do.
  • L3: Agents investigate independently within a single domain (one Kubernetes cluster, one observability platform, one class of incident).
  • L4: Autonomous investigation across the full production environment, including multi-hop, cross-boundary incidents.
  • L5: Self-driving production. The system detects, diagnoses, fixes, verifies, and prevents — on its own.

Most DIY AI SRE efforts top out at L2. Some reach L3 in narrow domains. Nobody is at L4 or L5. Traversal also pointed out the cost problem: one agent per application, a million-plus alerts per day, roughly $5 per investigation. That’s $5 million a day in API costs. The math doesn’t work yet.

There is one more angle that gets overlooked. W. Ross Ashby’s Law of Requisite Variety, from cybernetics, states that a controller must be at least as complex as the system it controls. The AI SRE industry is focused entirely on amplifying the controller — building smarter tools. But Ashby’s law has a second strategy: reduce the system’s complexity. Companies that have consolidated microservices back into simpler architectures have reported large drops in both MTTR and cloud costs. Sometimes the most effective reliability investment isn’t a better AI. It’s a simpler system.
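
In its information-theoretic form (one common way the law is summarized, not a quotation from Ashby), the argument can be written as:

```latex
% A common entropy form of the Law of Requisite Variety: the uncertainty
% left in outcomes is bounded below by the disturbance variety the
% regulator cannot absorb.
H(\text{outcomes}) \;\ge\; H(\text{disturbances}) - H(\text{regulator})
```

Read that way, there are only two levers for driving outcome uncertainty down: add variety to the regulator (smarter tooling) or remove variety from the disturbances (a simpler system). The industry is pulling almost exclusively on the first.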

Where this leaves us

I started building an AI SRE because I believed automation could fix on-call. After a year of building and testing, I believe something different: the value of AI in operations isn’t replacing the engineer’s judgment. It’s making sure engineers spend that judgment on the right problems — causal reasoning, system-specific context, the “this has never happened before” moments — while AI crunches through logs and metrics in seconds.

The models will get better. But the harder problem is organizational: how teams learn to work with AI, not just deploy it. The team needs to understand where the AI is strong (observation, correlation, tireless pattern-matching) and where it fails (causation, system history, novel failures). They need to keep practicing without it. Google runs DiRT. Aviation mandates manual flying. The principle is the same: you can’t only practice when things go wrong.

Bainbridge predicted in 1983 that automation would create operators who are simultaneously more needed and less prepared. Forty years later, we’re building that exact scenario in software operations. The companies that get this right will treat AI the way aviation treats autopilot: a tool that demands more training, not less. The ones that get it wrong will learn what Air France learned — that when automation fails, it fails at the worst possible time, and the humans it was supposed to help won’t be ready.

If you’re building or buying AI SRE tools (or really, AI agents for any task), I hope this article helps you think about the problem from a human-AI interaction perspective, grounded in real-world stories and data, so you can avoid the failures others have already hit.