
How AI actually works in healthcare

March 28, 2026

Our startup is pivoting into healthcare, so I went to the AI Healthcare Conference at Stanford the other day. A lot of the speakers were physicians, and they talked about AI being used across a wide range of clinical settings, from diagnostics and imaging to drug discovery and bedside monitoring. Genuinely interesting and impressive. I’m especially hopeful about what AI can do for shortening clinical trials and enabling precision medicine.

But I also started digging into the research on my own, and found some results that are unintuitive and, I think, critical to understand. This piece is my attempt to lay out the full picture, use case by use case: where AI is delivering, where the evidence is more complicated than the pitch, and what the systemic risks look like. I’m not trying to be a doomsayer. I just think the stakes in healthcare are too high for anything less than an honest accounting.

Healthcare AI attracted $7.8 billion in funding in 2025, with deal sizes averaging $112 million, up 211% from 2023. Eight new unicorns emerged last year, including Abridge ($5.3B), Hippocratic AI ($3.5B), and OpenEvidence ($6B). 70% of healthcare organizations now use AI in some form, and 1,451 AI/ML medical devices have received FDA authorization. The scale of deployment is real. The question is what that deployment is actually doing.


Where AI is delivering clear results

Drug discovery and molecular biology

This is where I’m most optimistic. Everyone in biotech is using AlphaFold now — over 3 million researchers in 190 countries. AlphaFold 3 predicts how proteins interact with DNA, RNA, and drug molecules, a 50-100% improvement over previous docking methods. Isomorphic Labs released a Drug Design Engine that more than doubles AlphaFold 3’s accuracy on those interaction predictions and can identify binding pockets on proteins previously considered “undruggable,” opening up targets for complex cancers and neurodegenerative diseases.

AI-designed drugs are reaching patients. The first to complete a Phase 2a trial, Insilico Medicine’s rentosertib for pulmonary fibrosis, showed positive results in Nature Medicine. Iambic’s AI-designed cancer drug showed 28% response rates in heavily pretreated patients, going from discovery to human dosing in under two years. An AI-designed Parkinson’s drug received FDA clearance for human trials in January. Insilico reports cutting early-stage discovery from 4.5 years to 12-18 months.

At the conference, Hoifung Poon from Microsoft Research gave a keynote on building “virtual patients” from multimodal data. His team’s GigaTIME paper in Cell generated 300,000 virtual tissue slides across 14,000 cancer patients — work that would have taken decades and cost billions traditionally. Serge Saxonov from 10x Genomics talked about building a Virtual Cell Atlas from over 300 million cells.

AI is also reshaping gene editing. OpenCRISPR-1 is the first AI-generated CRISPR-Cas protein to successfully edit human DNA, with reduced off-target effects. An AlphaFold3-powered system achieved 32x increased editing activity and demonstrated successful therapy for Duchenne muscular dystrophy in mice. MIT is designing proteins by their motion, not just their shape.

Self-driving labs — where AI designs experiments, robots run them, and the system iterates without human intervention — are operational in chemistry and materials science, expanding into biology. It’s early, but compressing years of lab work into weeks is no longer theoretical.

Medical imaging

AI in radiology is the most established clinical success. Aidoc’s triage system detects acute findings with 97% sensitivity and 98% specificity across 11 conditions. Eyonis LCS hits 93.3% sensitivity for CT lung cancer screening with a 99.9% negative predictive value. These tools catch things radiologists miss, and the output is verifiable against pathology — there’s a ground truth to check against.

Clinical trials

Trial enrollment is a huge bottleneck, and AI is making a dent. TrialMatchAI, published in Nature Communications, matches patients to trials with over 90% accuracy. A multimodal system reduced patient review time by 80%. A site that enrolled 1 patient in 1.5 years switched to AI-powered matching and enrolled 6 in 4 months.

Care navigation

This is one of the hotter VC categories right now. The idea: AI helps patients figure out where to go, what’s covered, and how to get the right care. Transcarent raised $126M at a $2.2B valuation. Hippocratic AI raised $126M at $3.5B, with over 115 million patient interactions across 50+ health systems. Sage Care emerged from stealth with $20M. Accolade, Included Health, and Sword Health are all building AI-powered navigation layers.

Clinical results look promising. MyEleanor, an AI navigator at Montefiore Einstein, nearly doubled colonoscopy completion rates in underserved populations (10% to 19%). In Portugal, an AI care navigator changed 59% of patients’ actual behavior after assessment, with the right-level-of-care rate going from 30% to 64%.

Robotic surgery

The da Vinci 5, launched in 2024, uses computer vision trained on 100,000+ surgical images to identify critical anatomy in real time, alerting surgeons before they approach at-risk structures. Force feedback reduces unnecessary tissue pressure by 30%. Early prostatectomy comparisons show shorter operative times versus the previous generation.

Rare disease diagnosis

50-80% of rare disease patients remain undiagnosed after whole genome sequencing, with diagnostic odysseys averaging over 5 years. DeepRare, published in Nature, uses multi-agent AI to process genetic data alongside clinical descriptions. RareCollab achieved 77% top-5 diagnostic accuracy by combining genomic and transcriptomic data, a 20% improvement over conventional methods.

Hospital operations

Less glamorous, but some of the most directly life-saving results. An AI sepsis detection system at Lausanne University Hospital reduced in-hospital and 90-day mortality. A model predicting sepsis deterioration trajectories gave a median 17.6 hours of warning, reducing ICU stay by 1.8 days and 28-day mortality by 5.7%.

Precision dosing

Early but interesting. CURATE.AI, in npj Precision Oncology, creates N-of-1 profiles to dynamically personalize chemotherapy doses for individual patients. Diadia Health launched in March 2026 with an AI causal reasoning platform for chronic disease, claiming 60% less trial-and-error in treatment selection. Reinforcement learning is being applied to real-time pediatric dose optimization. All early-stage, but the direction is toward treating each patient as a unique biological system rather than following population-level averages.

Elderly care

The global population aged 65+ is projected to more than double, from 727 million today to over 1.5 billion by 2050. AI is being applied to fall detection (computer vision, depth sensors, ambient pressure sensors), remote monitoring through wearable digital biomarkers, medication management, and cognitive assessment for early dementia signals. The most effective systems in 2026 are narrow and operational — they give human caregivers earlier visibility into mobility changes or medication non-adherence, rather than trying to replace the caregiving relationship. AI remote monitoring has already flagged 3,400 undiagnosed heart disease cases from 85,000 ECGs in elderly populations, cases that would have gone undetected until a cardiac event.

Epidemic surveillance

Post-COVID, there’s significant investment here. ARIES, a multi-agent framework, autonomously queries WHO, CDC, and journals to identify emerging threats in near real-time. AI systems are integrating epidemiological data with web data, climate data, and wastewater surveillance for earlier outbreak detection. Data quality and adoption in low-resource settings remain major barriers.


Where the evidence is more complicated

The use cases above share common traits: structured data, verifiable outputs, clear ground truth. Imaging has scans you can compare against pathology. Drug discovery has molecular simulations you can validate in the lab. Sepsis prediction has patient outcomes you can measure.

The pattern changes when AI enters messier territory: open-ended clinical reasoning, free-text documentation, real-time decision-making with ambiguous information.

Diagnostic decision support

AI is genuinely good at medical knowledge. GPT-4 scores above 80% on USMLE-style questions. Google’s Med-PaLM 2 hits 86.5%. DeepSeek recently hit 92.6%. So the natural assumption is: give doctors access to AI, and diagnostic accuracy should go up.

It doesn’t.

A 2024 randomized clinical trial in JAMA Network Open tested this directly. Fifty physicians, mostly in internal medicine, with some in emergency and family medicine, were given clinical vignettes based on real patients. Each case included history, physical exam findings, and lab results across a broad range of conditions. The doctors had to work through the full diagnostic process: differential diagnosis, supporting and opposing evidence, final diagnosis, next steps. Half got access to an LLM alongside their usual resources. Half got conventional resources only. The LLM group scored 76% on diagnostic reasoning. The control group scored 74%. Not statistically significant (P = .60).

The LLM on its own scored 16 percentage points higher than the conventional group (P = .03). The AI was capable. But giving it to doctors didn’t make them better.

A 2026 meta-analysis in npj Digital Medicine looked at 10 studies and found the same pattern. No significant improvement in diagnostic accuracy. No time savings. Factual error rates stayed at 26-36%.

An analysis of 52 clinical studies tested whether human-AI teams achieve “1 + 1 > 2.” Out of 87 experimental conditions, zero reached the theoretical ideal. Junior clinicians got some benefit. Senior clinicians got almost none. The people whose judgment you’d most want applied to checking AI output are the ones least helped by the tool.

Ambient clinical documentation

Ambient AI scribing is probably the most commercially successful AI use case in healthcare right now. The market hit $600 million in 2025 with 2.4x growth. 63% of Epic hospitals adopted ambient documentation by mid-2025, one of the fastest adoption curves in healthcare IT. Products like Nuance DAX Copilot, Abridge, and Ambience listen to encounters and generate notes. Two-thirds of physicians report saving 1-4+ hours daily. Burnout indicators improve. Doctors like it. Abridge and Ambience Healthcare raised a combined $793 million in 2025.

But when researchers looked at what actually happens to the notes, the picture got more complicated.

A 2026 study of 23,760 AI-generated notes found that clinicians edited 84.4% of them before signing. These were not style edits. They were clinical content changes: procedure orders (39.9%), symptoms (30.3%), medications (27.3%), diagnoses (25.9%). The Assessment and Plan section — which captures clinical reasoning and drives billing — accounted for 59% of modifications.

A validation study found hallucinations in 31% of AI-generated notes versus 20% for physician-authored ones. Hallucination here means a fabricated clinical detail — something that didn’t happen during the encounter, appearing in the medical record as if it did.

Thirty clinicians were interviewed about why they edit. Common reasons: transcription errors, the AI attributing a patient’s words to the doctor, overconfident diagnostic statements, missing details. Every one of these requires medical judgment to catch.

An NEJM AI perspective put it plainly: “Proofreading content that was neither written nor dictated by the user is difficult to do well.” The entire safety argument rests on the assumption that every physician will thoroughly review every output, every time, for every patient. That’s an assumption about sustained vigilance that doesn’t hold up in any other industry.

There are real time savings. A longitudinal study found 7-15% reduction in note-writing time over 150 days, and 18% less after-hours documentation. That matters to burned-out physicians. But it’s smaller than the pitch suggests, because time saved on writing is partially consumed by reading, verifying, and correcting.

Medical coding and revenue cycle

I was also at a revenue cycle management conference recently, where people were talking about AI for medical coding and voice AI. Spending on AI in RCM grew from $3.2 billion to $8.5 billion between 2023 and 2026. But only 15-18% of providers have deployed production-grade systems.

The accuracy numbers explain the slow adoption. The MedCode benchmark found the best AI model (Gemini 3.1 Pro Preview) hit 55% accuracy on ICD-10-CM tasks. A 2026 analysis of LLMs on HCC coding found provider rejection rates of about 70%. Incorrect billing codes aren’t just an efficiency issue — they affect reimbursement, compliance audits, and downstream clinical decisions that reference a patient’s coded history.

Meanwhile, the payer side has moved much faster. 84% of insurers now use AI to flag, route, and deny claims at scale. Medicare Advantage insurers using AI doubled their denial rates for elderly patients. About 75% of those denials were overturned on appeal, but fewer than 1% of patients ever appeal. The HHS OIG found 13% of MA denials were for services that actually met coverage rules.
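Those percentages compound in a way that’s easy to miss. Here’s the back-of-envelope arithmetic as a quick sketch; the denial pool is hypothetical, and the assumption that the overturn rate generalizes beyond self-selected appellants is mine:

```python
# Back-of-envelope arithmetic on the reported denial/appeal rates.
# The pool size is hypothetical; the rates come from the reporting above.
denials = 100_000        # hypothetical pool of AI-flagged denials
appeal_rate = 0.01       # "fewer than 1% of patients ever appeal" (upper bound)
overturn_rate = 0.75     # ~75% of appealed denials are overturned

appealed = denials * appeal_rate          # 1,000 appeals
overturned = appealed * overturn_rate     # 750 reversals
print(f"{overturned:.0f} of {denials:,} denials ever get reversed "
      f"({overturned / denials:.2%}); the rest stand.")

# Assumption (mine): if the 75% overturn rate held for denials that
# were never appealed, roughly 75,000 of these would be wrong, and
# only about 1% of those errors would ever be corrected.
```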

So you have an asymmetry: insurers automating denials at scale, while providers are still mostly responding manually. Physicians spend 14 hours per week on prior authorization. 41% of providers report denial rates above 10%, up from 30% in 2022. Prior authorization delays care for 94% of patients, and over 80% of physicians have watched patients abandon treatment because the process was too slow or confusing. CMS launched the WISeR Program in January 2026, piloting AI-based prior authorization screening across six states and 6.4 million beneficiaries. Whether that brings balance or just adds another layer remains to be seen.

Equity and algorithmic bias

At the Stanford conference, Maya Yiadom (Stanford Emergency Medicine) and Michele Samorani (Santa Clara University) gave a session on AI quality and equity that I didn’t fully absorb at the time. I went back and read their work afterward.

Yiadom’s research looks at AI-assisted heart attack screening in the ED. She analyzed nearly 280,000 visits and found that standard age-based screening systematically misses younger Black, Native American, and Pacific Islander patients, who develop acute coronary syndrome at earlier ages. Her AI model identified 11.1% more cases. But the important detail: it worked best when embedded within the physician’s decision-making process, not when used as a standalone second opinion.

Samorani’s research is about appointment scheduling. Hospitals use ML to predict which patients will no-show, then assign “high risk” patients to worse time slots. The problem: no-show probability correlates with socioeconomic status. His team’s study found Black patients waited 30% longer. The algorithm was optimizing for exactly what it was told to optimize — clinic throughput — and in doing so amplified the inequities that already make these patients distrust healthcare.
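To make that mechanism concrete, here’s a minimal sketch of a throughput-optimizing scheduler. It’s loosely in the spirit of the setup Samorani describes, but the population split, predicted probabilities, and wait penalty are all invented:

```python
# Toy population: the model never sees group membership, but its no-show
# predictions correlate with it through socioeconomic proxies. The
# probabilities and the 30-minute penalty are invented for illustration.
def make_patient(group):
    predicted_noshow = 0.30 if group == "B" else 0.10
    return {"group": group, "pred": predicted_noshow}

patients = [make_patient("A") for _ in range(500)] + \
           [make_patient("B") for _ in range(500)]

def schedule(patients, threshold=0.20):
    """Double-book predicted no-shows, since an empty slot costs money."""
    waits = []
    high = [p for p in patients if p["pred"] >= threshold]
    low = [p for p in patients if p["pred"] < threshold]
    for p in low:
        waits.append((p["group"], 0))        # own slot, no wait
    for i in range(0, len(high) - 1, 2):     # pairs share an overbooked slot
        first, second = high[i], high[i + 1]
        waits.append((first["group"], 0))
        waits.append((second["group"], 30))  # waits if both show up
    return waits

waits = schedule(patients)
for g in ("A", "B"):
    group_waits = [w for grp, w in waits if grp == g]
    print(f"group {g}: {sum(group_waits) / len(group_waits):.0f} min average wait")

# Every individual decision optimizes clinic throughput; the delay still
# lands entirely on the group the model flags as high-risk.
```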

Trust in physicians and hospitals fell from 72% to 40% between 2020 and 2024. An algorithm affecting an estimated 200 million Americans systematically underestimated how sick Black patients were by using medical expenses as a proxy for illness severity. AI deployment without adequate equity testing is compounding a trust crisis that was already severe.

Alert fatigue

Clinical decision support (CDS) systems generate alerts when AI detects a potential issue — drug interactions, abnormal lab values, dosing errors. In theory, these catch problems before they reach patients. In practice, clinicians override them almost reflexively.

In emergency departments, override rates reach 92.9%. In outpatient settings, 52.6% of medication-related alerts are overridden. The reason is instructive: in one ED study, only 7.3% of the alerts themselves were clinically appropriate. The system was generating so much noise that physicians learned to ignore it all.

This creates a specific failure mode. When 93% of alerts are irrelevant, physicians develop a habit of clicking through. When a genuinely dangerous alert fires, it looks identical to the noise. A 2010 Human Factors study on automation complacency showed this is a general principle: when automated systems are highly reliable, humans detect a smaller proportion of the errors that remain.
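It’s worth spelling out the arithmetic behind that failure mode. A minimal sketch, under my own worst-case assumption that habituated clinicians override without distinguishing signal from noise:

```python
# Base-rate arithmetic for alert fatigue, using the ED figures above.
# Worst-case assumption (mine): overrides are uncorrelated with alert
# validity, i.e. habituated clinicians click through indiscriminately.
alerts = 1_000
p_appropriate = 0.073    # share of alerts that were clinically valid
override_rate = 0.929    # observed ED override rate

valid = alerts * p_appropriate                 # ~73 valid alerts
valid_actioned = valid * (1 - override_rate)   # ~5 acted on

print(f"Of {valid:.0f} valid alerts per {alerts:,}, about "
      f"{valid_actioned:.0f} get acted on.")
```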


Bigger-picture concerns

Some issues cut across individual use cases.

The regulatory evidence gap

Here’s a number that stopped me: of the 1,451 FDA-cleared AI medical devices, a 2025 JAMA study found that only 6 of the devices it examined (1.6%) cited a randomized clinical trial, and only 3 (<1%) reported actual patient health outcomes. Nearly half (46.7%) of FDA decision summaries didn’t even describe the study design. Over half (53.3%) omitted the sample size.

This happens because 97% of AI devices enter through the 510(k) pathway, which requires showing “substantial equivalence” to an existing device rather than conducting new clinical trials. It’s a process designed for incremental hardware updates, now being used to approve AI software that makes clinical decisions. The FDA cleared 6 AI devices in 2015 and 295 in 2025. The pace of approval has outrun the pace of evidence.

The demographic data is even more concerning. The same study found that only 3.6% of approvals reported race or ethnicity of study subjects. 99.1% provided no socioeconomic data. 81.6% didn’t report age. So we don’t know whether these devices work equally well across the populations they’re being used on.

Deskilling

A multicenter study in The Lancet Gastroenterology & Hepatology found that endoscopists who used AI-assisted colonoscopy showed a 6.0 percentage point drop in adenoma detection rate when the AI was removed (28.4% to 22.4%, P = 0.0089). Routine AI exposure degraded their unaided performance. Missed adenomas translate directly to increased colorectal cancer risk.

A review in Artificial Intelligence Review introduced the term “second singularity” — the point where repeated delegation to AI leads to irreversible loss of professional expertise. The vulnerabilities they identified: physical examination, differential diagnosis, clinical judgment, and physician-patient communication.

Bainbridge predicted this in 1983: automate the routine work, and humans lose practice with the skills they need when automation fails. Aviation responded with mandatory manual flying after Air France 447 crashed in 2009, killing 228 people, because the pilots couldn’t fly manually when the autopilot disconnected. An npj Digital Medicine perspective argued medicine should do the same: require periodic unassisted practice and performance benchmarking.

Data contamination

A team from NUS, Harvard, Stanford, Google, and Mayo Clinic analyzed over 800,000 synthetic data points and found that when AI-generated clinical text feeds into training data for the next generation of AI, diagnostic reliability collapses. After four generations: vocabulary in radiology reports dropped 98.9%. Unique medical terms fell 66%. Rare findings (pneumothorax, effusions) disappeared entirely.

When life-threatening pathology was present, false reassurance (“no acute findings”) tripled from 13.3% to 40.3%. Model confidence stayed high. Physician evaluation confirmed the output was clinically useless after just two generations.

If ambient scribe notes with 31% hallucination rates are stored in EHRs and later used to train the next round of AI, the system feeds itself contaminated data. The researchers found maintaining at least 75% real data could preserve diversity. Scaling synthetic data alone accelerated collapse and worsened demographic bias.
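The dynamic is easy to reproduce in miniature. Here’s a toy sketch, my own illustration rather than the paper’s method: treat a report generator as nothing more than a term-frequency distribution, refit it each generation on text sampled from the previous one, and watch the rare-term tail die off.

```python
from collections import Counter
import random

random.seed(0)

# Toy model of recursive training: a "report generator" is just a
# term-frequency distribution. Each generation is fit (unsmoothed MLE)
# to a corpus sampled from the previous generation. All numbers are
# invented for illustration; the real paper's setup is far richer.
vocab = [f"term_{i}" for i in range(200)]
weights = [1 / (rank + 1) for rank in range(len(vocab))]  # Zipf-ish tail

def sample_corpus(weights, n_tokens=1_000):
    return random.choices(vocab, weights=weights, k=n_tokens)

def refit(corpus):
    counts = Counter(corpus)
    return [counts[t] for t in vocab]  # terms unseen once are gone forever

for gen in range(1, 5):
    weights = refit(sample_corpus(weights))
    alive = sum(1 for w in weights if w > 0)
    print(f"generation {gen}: {alive}/{len(vocab)} terms survive")

# Mixing a majority of real data back in at each generation (the paper's
# 75% threshold) is what keeps the rare-term tail from collapsing.
```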

When AI directly harms patients

A Reuters investigation in February 2026 found that J&J’s TruDi surgical navigation device saw adverse events jump from 7 reports before AI was added to over 100 after. At least 10 people were injured, including strokes from accidentally damaged arteries. Two lawsuits were filed. Researchers found 60 FDA-authorized AI devices linked to 182 product recalls, 43% within one year of approval.

Other documented incidents: a sepsis alert triggered inappropriate IV fluid for a dialysis patient (caught by a clinician). Kaiser Permanente therapists went on strike over an AI mental health screening system delaying care. ChatGPT Health failed to recommend emergency care in over half of serious cases.

An empirical analysis identified 295 health-related AI incidents from 2012-2025, likely an undercount.


What I’m left thinking about

I came out of the Stanford conference optimistic. I still am. AI is compressing drug discovery timelines from years to months. It’s catching cancers on scans that human eyes would miss. It’s matching patients to clinical trials they would never have found. These are not incremental improvements. They’re the kind of changes that will look obvious in retrospect.

But I keep coming back to a few things I can’t resolve.

Why does an AI that scores 92% on medical exams fail to improve physician accuracy when you put them in the same room? Why do doctors edit 84% of AI-generated notes, and why does nobody seem to be measuring the cognitive cost of that editing? Why have we cleared over 1,400 AI medical devices while fewer than 10 have been tested in a randomized trial? Why are endoscopists getting worse at finding polyps after using AI, and what does that mean for the next generation of doctors who train with AI from day one?

I don’t have neat answers. I think the people building these tools are, for the most part, genuinely trying to improve healthcare. And I think the gap between what we’re deploying and what we’ve rigorously tested is wider in medicine than in any field I’ve looked at. Both of these things are true at the same time.

What I keep reminding myself is that the stakes are different here, and not just because patients are more vulnerable than software users. In software, the person who builds the product, the person who uses it, and the person who pays for it are roughly aligned. Bad product, users leave, company loses money. The feedback loop is direct.

Healthcare doesn’t work that way. The startup building the AI scribe needs growth metrics to raise the next round. The VC needs a return. The doctor wants less paperwork and trusts the tool without fully understanding what it’s doing under the hood. The patient doesn’t know their note was AI-generated, doesn’t know their insurance denial was algorithmic, doesn’t know the surgical navigation system was cleared without a clinical trial. The insurer’s incentive is to deny more claims faster, which is directly opposed to the patient’s interest. And the regulator is using a 1990s device-clearance framework to approve 2026 AI software.

The person who makes the decision, the person who bears the consequence, the person who pays, and the person who regulates are four different parties. Their incentives don’t align. Their power isn’t equal either. Insurers have legal teams and automated denial systems; patients don’t even know how to appeal. Startups and tech companies control the algorithms and the data; doctors use the tools without fully seeing what’s inside; patients often can’t tell when AI touched their care at all. And the information asymmetry runs deep: when over half of FDA decision summaries don’t even disclose sample sizes, it’s hard for anyone outside the system to make an informed judgment about what’s safe.

The track record of the most powerful actors doesn’t inspire confidence. Insurers doubled denial rates with AI while 75% of the denials that patients managed to appeal turned out to be wrong. Regulators cleared over 1,400 AI devices, fewer than 10 of them backed by clinical trials. These aren’t edge cases. They’re the system working as designed, just not designed for the patient.

That’s what makes healthcare AI different from every other domain I’ve looked at. And it’s why I think the questions above aren’t just academic. They’re questions about who’s actually looking out for the patient when everyone else in the room has more power, more information, and a different reason to keep moving fast.


Sources: All claims link to primary sources inline. Key studies: JAMA RCT (2024) · npj Digital Medicine meta-analysis (2026) · npj AI complementarity study (2025) · Ambient scribe editing (2026) · NEJM AI review burden (2024) · He et al. data contamination (2026) · Endoscopy deskilling (2025) · Flight rules for clinical AI (2026) · Reuters surgical AI (2026) · Medicare Advantage AI denials (2026) · NVIDIA survey (2026) · FDA device evidence gaps (JAMA 2025) · Alert override rates