A map of AI for scientific discovery

In November 2025, Nature ran a news story on AI-Newton: a system from Peking University that, given noisy simulated data from dozens of classical mechanics experiments, rediscovered concepts like mass and kinetic energy and laws like Newton’s second law and energy conservation. The headline frame — AI rediscovers physics — is accurate enough. But it sits in a crowded field where the same phrase gets applied to AlphaFold, Sakana’s AI Scientist, Google’s AlphaEvolve, and a dozen symbolic-regression papers. Those systems do not do the same work.

This article is a map. Not a timeline of breakthroughs, not a prediction of when AI will win a Nobel. The question is narrower: when someone says AI “discovered” something in science, what kind of object did it actually produce — and how do you check it?

The answer matters because the verification burden differs by an order of magnitude across approaches. A cap-set construction that passes a combinatorial checker is not the same epistemic object as a workshop paper on nanoGPT variants, and neither is the same as rediscovering \(F = ma\) from position traces.

What are you trying to discover?

Start with outputs, not institutions.

Output type	Human-readable?	Falsifiable?	Examples
Structure or property prediction	Partially	Experimentally	AlphaFold, GNoME
Implicit dynamics model	No	Hard	PINN, Hamiltonian NN
Single explicit equation	Yes	Yes	SINDy, PySR, AI Feynman
Concepts + general laws	Yes	Yes	AI Physicist, AI-Newton
Equations + formal proof	Yes	Yes	AI-Descartes, AI-Hilbert
Programs or algorithms	Read the code	Run the evaluator	FunSearch, AlphaEvolve
Full research artifact	Yes	Peer review	AI Scientist, Kosmos

Most public confusion comes from treating row 3 (a single explicit equation) and row 7 (a full research artifact) as the same achievement. They are not.

Route 1: Symbolic regression — recover the formula

The problem: Given data \((x, y)\) or trajectories, find a short mathematical expression that fits.

Lineage: Schmidt & Lipson’s “distilling free-form natural laws” (2009) → SINDy (sparse regression on a fixed library) → AI Feynman (Udrescu & Tegmark, 2020) → modern PySR.

AI Feynman is the landmark. It combines neural smoothing, symmetry and dimensional analysis, recursive decomposition of hard expressions into subproblems, and symbolic regression on the pieces. On a benchmark of formulas drawn from the Feynman Lectures on Physics, it recovers 100 equations. AI Feynman 2.0 adds modular graph structure for messier multi-variable cases.
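The dimensional-analysis ingredient is easy to illustrate on a textbook case. The sketch below is not AI Feynman's code. It assumes a pendulum whose period T depends only on length, gravitational acceleration, and mass (an example chosen here, not taken from the paper) and solves a small linear system for the exponents that make the units match.

```python
import numpy as np

# Assumption: a hypothetical pendulum example with variables
# length l [m], gravity g [m s^-2], mass m [kg].
# Rows are base units (kg, m, s); columns are the variables l, g, m.
unit_matrix = np.array([
    [0.0, 0.0, 1.0],   # kg exponents of l, g, m
    [1.0, 1.0, 0.0],   # m  exponents of l, g, m
    [0.0, -2.0, 0.0],  # s  exponents of l, g, m
])

# Units of the target quantity, the period T [s].
target_units = np.array([0.0, 0.0, 1.0])

# Solve unit_matrix @ exponents = target_units, so that
# l**a * g**b * m**c carries the same units as T.
exponents = np.linalg.solve(unit_matrix, target_units)

for name, power in zip(["l", "g", "m"], exponents):
    print(f"{name} exponent: {power:.2f}")
```

The solution says T scales as the square root of l / g and does not depend on mass, which leaves only a dimensionless constant for the regression step to find. That is the sense in which dimensional analysis shrinks the search before any formula is proposed.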

What it actually demonstrates: Given clean enough data and a well-posed single system, you can search formula space efficiently enough to recover known physics. This is real. It is also per-problem: one dataset tends to yield one equation, not a reusable theory library.
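To make "one dataset, one equation" concrete, here is a minimal sketch of library-based sparse regression in the SINDy style. It assumes derivatives are observed directly and that the candidate library is four monomials. The data-generating equation, noise level, and threshold are made up for illustration, and `stlsq` is a hypothetical helper, not the API of any SINDy package.

```python
import numpy as np

def stlsq(theta, target, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares: fit, zero small terms, refit."""
    coef = np.linalg.lstsq(theta, target, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(coef) < threshold
        coef[small] = 0.0
        active = ~small
        if not active.any():
            break
        # Refit using only the library terms that survived thresholding.
        coef[active] = np.linalg.lstsq(theta[:, active], target, rcond=None)[0]
    return coef

# Assumption: synthetic 1D system dx/dt = 1.5*x - 0.5*x^3 with small noise.
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=200)
dxdt = 1.5 * x - 0.5 * x**3 + rng.normal(0.0, 0.01, size=x.shape)

# Fixed candidate library; the method can only return combinations of these.
names = ["1", "x", "x^2", "x^3"]
theta = np.column_stack([np.ones_like(x), x, x**2, x**3])

coef = stlsq(theta, dxdt)
recovered = {n: round(float(c), 2) for n, c in zip(names, coef) if c != 0.0}
print(recovered)
```

The output is a single sparse equation for this one dataset. Nothing in the procedure carries over to the next experiment, which is the per-problem limitation described above.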

LLM-SR (2024) pushes the same direction: an LLM proposes equation structures as programs, data feedback prunes bad candidates. Same output type, better search prior.

Limits: No concept layer (the system does not invent “mass” as an abstract quantity). No cross-experiment knowledge base. Search explodes when variables, noise, or system complexity grow.

Route 2: Structured neural models — bake physics into the architecture

PINNs (Karniadakis et al., Nature Reviews Physics 2021) penalize PDE residuals in the loss. Hamiltonian and Lagrangian neural networks (Greydanus et al., 2019; Cranmer et al., 2020) learn \(H(q,p)\) or \(L(q,\dot{q})\) and derive dynamics from structure.
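"Derive dynamics from structure" can be shown without any neural network. The sketch below assumes the Hamiltonian is a known closed-form function of a spring-mass system, where a Hamiltonian neural network would use a learned one, and it uses finite differences where such a network would use automatic differentiation. The function names and parameter values are illustrative.

```python
def hamiltonian(q, p, m=1.0, k=4.0):
    # Assumption: spring-mass system; an HNN would replace this with a network.
    return p * p / (2.0 * m) + 0.5 * k * q * q

def grad_h(h, q, p, eps=1e-6):
    """Central-difference partial derivatives (dH/dq, dH/dp)."""
    dh_dq = (h(q + eps, p) - h(q - eps, p)) / (2 * eps)
    dh_dp = (h(q, p + eps) - h(q, p - eps)) / (2 * eps)
    return dh_dq, dh_dp

def leapfrog(h, q, p, dt, steps):
    """Integrate Hamilton's equations dq/dt = dH/dp, dp/dt = -dH/dq.

    Kick-drift-kick scheme; assumes H splits into a p-only and a q-only part.
    """
    for _ in range(steps):
        dh_dq, _ = grad_h(h, q, p)
        p -= 0.5 * dt * dh_dq          # half kick
        _, dh_dp = grad_h(h, q, p)
        q += dt * dh_dp                # drift
        dh_dq, _ = grad_h(h, q, p)
        p -= 0.5 * dt * dh_dq          # half kick
    return q, p

q0, p0 = 1.0, 0.0
q1, p1 = leapfrog(hamiltonian, q0, p0, dt=0.01, steps=1000)
energy_drift = abs(hamiltonian(q1, p1) - hamiltonian(q0, p0))
print(q1, p1, energy_drift)
```

The trajectory comes entirely from one scalar function, so the energy stays nearly constant by construction. Note that the Hamiltonian form itself was supplied by hand.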

What they actually demonstrate: If you already know the form of the governing physics (a PDE family, Hamiltonian structure), neural nets can fit trajectories with better sample efficiency and some extrapolation than a plain MLP.

What they are not: Law discovery. You needed the physics upfront. The output is typically a high-dimensional function, not \(E = \frac{1}{2}mv^2 + \frac{1}{2}kx^2\) on a chalkboard.

These methods are engineering tools for simulation and inversion. Important, but a different problem from AI-Newton.

Route 3: Concepts and knowledge bases — compress many experiments into laws

This is the line AI-Newton belongs to, and it has a direct predecessor.

AI Physicist (Wu & Tegmark, arXiv:1810.10525, 2019) proposes a “theory hub”: divide environments, learn specialized theories, unify them, snap complex fits into simple symbolic formulas. It works on toy 2D physics worlds with mixed gravity, electromagnetism, and collisions.

AI-Newton (Fang et al., arXiv:2504.01538, 2025) scales the ambition. A Rust physical DSL encodes concepts and laws. A Python workflow runs four steps per trial: select experiment and concepts (with a recommendation engine), discover laws via symbolic regression and plausible reasoning (extend an existing general law with a new term when data demands it), simplify via differential algebra, and promote successful specific laws to general ones in a knowledge base. Input observables are things like ball positions — not mass, energy, or force labels.

On 46 classical mechanics experiments with added noise, the system recovers on the order of 90 concepts and 50 general laws, including energy conservation and Newton’s second law.
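The promotion step is the part that separates this route from per-problem regression, so a toy version may help. The sketch below is not AI-Newton's implementation. It assumes mass and spring constant are given as labeled parameters, which AI-Newton explicitly does not assume, and it uses two hand-written candidate expressions in place of a search. All names are hypothetical.

```python
import math

def spring_trajectory(m, k, amplitude, n=200, t_max=5.0):
    """Analytic (x, v) samples of an ideal spring-mass oscillator."""
    omega = math.sqrt(k / m)
    times = [t_max * i / (n - 1) for i in range(n)]
    return [(amplitude * math.cos(omega * t),
             -amplitude * omega * math.sin(omega * t)) for t in times]

def is_conserved(expr, trajectory, params, rel_tol=1e-6):
    """A candidate law holds if its value barely changes along the trajectory."""
    values = [expr(x, v, **params) for x, v in trajectory]
    spread = max(values) - min(values)
    scale = max(abs(val) for val in values) or 1.0
    return spread / scale < rel_tol

# Assumption: candidates are written by hand; a real system would search for them.
candidates = {
    "kinetic_only": lambda x, v, m, k: 0.5 * m * v * v,
    "kinetic_plus_spring": lambda x, v, m, k: 0.5 * m * v * v + 0.5 * k * x * x,
}

# Several experiments with different physical parameters.
experiments = [{"m": 1.0, "k": 4.0}, {"m": 2.0, "k": 1.0}, {"m": 0.5, "k": 9.0}]

# Promote a candidate to the knowledge base only if it holds in every experiment.
knowledge_base = [
    name for name, expr in candidates.items()
    if all(is_conserved(expr, spring_trajectory(amplitude=1.5, **e), e)
           for e in experiments)
]
print(knowledge_base)
```

The design point is the `all(...)` over experiments: a law earns "general" status by surviving across systems, not by fitting one of them well.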

What it actually demonstrates: The bottleneck in symbolic regression is not just search — it is representation. Human physicists do not fit each experiment in isolation. They build concepts, derive general laws, and apply those laws to new systems. AI-Newton is the first system I have seen that encodes that architecture explicitly and shows it working on a non-trivial suite.

Honest limits: Classical mechanics only. Simulated data. Requires commercial Maple for parts of the pipeline. No vector calculus yet. The “no prior physical knowledge” claim means no labeled mass or energy — not zero inductive bias (the DSL still encodes what counts as a lawful expression). These are PoC constraints, not refutations, but they bound how far you can generalize the headline.

Route 4: Data plus background theory — discover with proofs

AI-Descartes (Cornelio et al., Nature Communications 2023) combines experimental data with background knowledge to discover laws.

AI-Hilbert (Cory-Wright et al., Nature Communications 2024) goes further: given polynomial axioms and noisy data, it searches for new polynomial laws via mixed-integer and semidefinite optimization, and produces Positivstellensatz certificates, which are machine-checkable proofs that a discovered law is consistent with the background theory.
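What "machine-checkable" means can be shown on a toy case. The sketch below assumes the simplest setting, where every axiom is a polynomial equality and the certificate is a list of polynomial multipliers on those axioms. The full certificates in the paper also handle inequalities, which this example omits. The axioms, the law, and the multipliers are invented for illustration.

```python
import random
from fractions import Fraction

# Assumption: two toy axioms, each stating that a polynomial equals zero.
def axiom_kinetic(m, v, p, ke):
    return ke - Fraction(1, 2) * m * v * v      # ke = (1/2) m v^2

def axiom_momentum(m, v, p, ke):
    return p - m * v                            # p = m v

# Candidate law to certify: 2 m ke - p^2 = 0.
def law(m, v, p, ke):
    return 2 * m * ke - p * p

# Certificate: the law written as multipliers times the axioms.
def certificate(m, v, p, ke):
    return (2 * m * axiom_kinetic(m, v, p, ke)
            - (p + m * v) * axiom_momentum(m, v, p, ke))

def certificate_holds(candidate, trials=50, seed=0):
    """Check the polynomial identity exactly at random rational points."""
    rng = random.Random(seed)
    for _ in range(trials):
        point = [Fraction(rng.randint(-20, 20)) for _ in range(4)]
        if candidate(*point) != certificate(*point):
            return False
    return True

print(certificate_holds(law))
print(certificate_holds(lambda m, v, p, ke: m * ke - p * p))
```

Checking is cheap arithmetic even though finding the multipliers required search. A wrong law paired with the same certificate fails the check, whatever prose accompanies it.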

Contrast with AI-Newton: AI-Newton tries to grow concepts from raw observables. AI-Hilbert and AI-Descartes assume you already have a formalized body of theory and ask what new law is consistent with both theory and data. Less romantic, often more rigorous. The proof certificate is something an LLM-only pipeline cannot easily fake.

Limits: Polynomial setting. You need axioms worth having. Scaling to messy real-world experiments is open.

Route 5: Program evolution — search code space with an evaluator

DeepMind’s sequence here is instructive:

AlphaTensor (2022): matrix multiplication algorithms
AlphaDev (2023): CPU assembly optimizations
FunSearch (Romera-Paredes et al., Nature 2024): LLM proposes programs, an evaluator scores them, an island-based evolutionary loop keeps the good ones
AlphaEvolve (2025): same paradigm at codebase scale with Gemini; 4×4 complex-valued matrix multiply in 48 scalar multiplications, one fewer than the 49 obtained by applying Strassen’s algorithm recursively; datacenter scheduling gains

What FunSearch actually demonstrates: This is the clearest example of genuinely new scientific knowledge from an LLM pipeline I know of. FunSearch found cap-set constructions that improved the best-known asymptotic lower bound on cap-set size, and mathematicians verified the results independently. The LLM did not “know” the answer from training data in a retrievable form; the evaluator filtered millions of wrong programs.

What AlphaEvolve adds: Whole programs, not single functions. Infrastructure and algorithms, not physical laws.

The hard requirement: A fast, objective evaluator. No evaluator, no loop. This is the same structural fact as in software engineering: generation is cheap, verification is the bottleneck (I wrote about this for labor economics). Here it is load-bearing for the science, not just the economics.
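A toy version of the loop shows why the evaluator carries the weight. The sketch below is not FunSearch. It assumes random perturbation of a weight table as a stand-in for the LLM proposer, a single population instead of islands, and a small cap-set instance in three dimensions. All function names are hypothetical.

```python
import itertools
import random

def third_point(a, b):
    """The unique point completing a line through a and b in F_3^n."""
    return tuple((-(x + y)) % 3 for x, y in zip(a, b))

def is_cap_set(points):
    """Independent checker: no three distinct points lie on a line."""
    chosen = set(points)
    if len(chosen) != len(points):
        return False
    return all(third_point(a, b) not in chosen
               for a, b in itertools.combinations(points, 2))

def greedy_cap_set(weights, n):
    """Visit vectors by decreasing priority, keeping those that stay valid."""
    def priority(v):
        return sum(weights[i][v[i]] for i in range(n))
    chosen = []
    ordered = sorted(itertools.product(range(3), repeat=n),
                     key=priority, reverse=True)
    for v in ordered:
        if all(third_point(a, v) not in chosen for a in chosen):
            chosen.append(v)
    return chosen

def evaluate(points):
    """Score is the set size, but only if the checker accepts the set."""
    return len(points) if is_cap_set(points) else 0

def evolve(n=3, rounds=200, seed=0):
    rng = random.Random(seed)
    weights = [[0.0, 0.0, 0.0] for _ in range(n)]
    best_set = greedy_cap_set(weights, n)
    best_score = evaluate(best_set)
    for _ in range(rounds):
        # Assumption: Gaussian noise stands in for an LLM proposing a new program.
        proposal = [[w + rng.gauss(0.0, 1.0) for w in row] for row in weights]
        cap = greedy_cap_set(proposal, n)
        score = evaluate(cap)
        if score >= best_score:
            weights, best_set, best_score = proposal, cap, score
    return best_set, best_score

best_set, best_score = evolve()
print(best_score, best_set)
```

The proposer can be arbitrarily bad and the result is still trustworthy, because `is_cap_set` checks every returned set from scratch. Swap in a smarter proposer and only the search speed changes.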

FunSearch is partially open-sourced. AlphaEvolve’s agent is not; OpenEvolve is a community reimplementation of the paradigm.

Route 6: End-to-end research agents — automate the paper, not necessarily the insight

The AI Scientist (Lu et al., arXiv:2408.06292; Nature 2026) runs an ML research loop: ideation, literature search, experiment code, analysis, LaTeX writing, automated review. AI Scientist-v2 (arXiv:2504.08066) removes human templates, uses agentic tree search, and produced the first fully AI-written paper to pass peer review at an ICLR workshop.

Kosmos (Edison Scientific, 2025) is a different beast: closed, commercial, aimed at biology and chemistry. A typical run lasts ~12 hours, reads on the order of 1,500 papers, writes tens of thousands of lines of analysis code, and outputs a cited research report.

What they actually demonstrate: Autonomous execution of the research workflow — especially in silico ML where experiments are cheap scripts. That is a labor automation result. It does not, by itself, mean the system found a new law of nature. Workshop acceptance is a meaningful bar for process quality. It is not the same bar as a novel physical principle validated by independent experiment.

The paradigm is LLM orchestration: read, code, write. AI-Newton’s paradigm is symbolic knowledge accumulation. Comparing them on “who discovered more science” is a category error.

Route 7: Domain-specific prediction — where the largest real impact already lives

AlphaFold (Jumper et al., Nature 2021) predicts protein structure. GNoME (Google DeepMind, Nature 2023) searched for stable crystals and flagged hundreds of thousands of candidates, hundreds of which were later synthesized in the lab.

These systems do not output \(F = ma\). They output structures or material candidates that experimentalists can test. For many fields, this is what progress looks like: not rediscovering textbook equations, but narrowing an intractable search space.

The epistemic object is a prediction with an experimental follow-up path, not a law.

How the routes relate

flowchart LR
  DATA["Data / benchmarks / literature"]
  DATA --> SR["Symbolic regression<br/>AI Feynman, PySR"]
  DATA --> KB["Concept KB<br/>AI-Newton"]
  DATA --> PROOF["Theory + proof<br/>AI-Hilbert"]
  DATA --> CODE["Program search<br/>FunSearch → AlphaEvolve"]
  DATA --> AGENT["Research agents<br/>AI Scientist, Kosmos"]
  DATA --> DOMAIN["Domain models<br/>AlphaFold, GNoME"]

  SR -->|"single equation"| OUT1["Readable formula"]
  KB -->|"general → specific"| OUT2["Law library"]
  PROOF -->|"certificate"| OUT3["Proven formula"]
  CODE -->|"passes evaluator"| OUT4["Algorithm / construction"]
  AGENT -->|"peer review"| OUT5["Paper / report"]
  DOMAIN -->|"lab test"| OUT6["Structure / material"]

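Tegmark lineage: AI Feynman (single formula) → AI Physicist (theory hub), ordered here by scope rather than by date, since AI Physicist came first. The theory hub is conceptually adjacent to AI-Newton (formal KB), though AI-Newton was developed independently at PKU.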

DeepMind lineage: AlphaTensor → AlphaDev → FunSearch → AlphaEvolve. Same evaluator-driven search DNA; different domains.

Cornelio lineage: AI-Descartes → AI-Hilbert. Data plus logic plus optimization.

Sakana lineage: AI Scientist (papers) and Darwin Gödel Machine (self-modifying code) share an agentic coding stack but target different goals.

Which lines look most promising?

“Promising” depends on what you optimize for. There is no single winner.

If you care about verified new knowledge in the next few years

Program evolution with evaluators (FunSearch → AlphaEvolve) has the strongest existence proof. Cap-set was not a reproduction of training data. It was checked. The paradigm generalizes to any domain where you can write a scorer: combinatorics, algorithms, kernel optimization, parts of materials simulation. The ceiling is set by evaluator quality, not LLM fluency.

This line is less photogenic than “AI rediscovers Newton,” but it is the one I would bet on for repeatable, checkable discoveries at scale.

If you care about the deepest version of “understanding nature”

Concept and knowledge-base systems (AI Physicist → AI-Newton) are structurally closest to how physics actually works: invent quantities, state general principles, derive system-specific predictions. If this scales beyond classical mechanics toy worlds — to messy data, to fields where the DSL is not hand-designed — it is the route that produces the kind of knowledge textbooks are made of.

The risk is engineering hell: Maple dependencies, bespoke DSLs, era-control heuristics. The reward is compounding — each general law shrinks search for the next experiment. AI-Newton’s incremental progression (simple concepts before complex ones) is not a gimmick; it is how you keep combinatorial explosion manageable.

AI-Hilbert-style proof-carrying discovery is promising where formal background theory already exists — chemistry fragments, control theory, anything polynomializable. The proof certificate solves a real problem LLM agents have: confident wrong statements. Less general than AI-Newton’s ambition, more trustworthy where it applies.

If you care about societal impact on science as practiced today

Domain-specific foundation models (AlphaFold lineage, materials GNoME, protein and genomic LMs) are already changing how labs work. They do not solve “automated theory formation,” but they solve problems scientists actually lose sleep over.

Research agents (AI Scientist, Kosmos) are promising as compression of research labor — literature synthesis, analysis code, draft writing — especially in data-rich computational fields. I would not conflate that with theory discovery, but I would not dismiss it either. A 12-hour Kosmos run that saves a team six weeks of exploratory analysis is economically meaningful even if every hypothesis it generates is wrong.

What looks like a plateau

Pure symbolic regression (AI Feynman, PySR alone) is mature. It will remain a component inside larger systems — including AI-Newton’s law-discovery step — but “SR but bigger” is probably not the next leap. The action moved to what wraps the SR: concept libraries, evaluators, agents, proofs.

PINN / HNN remain useful for simulation. They are not on a trajectory toward autonomous theory formation.

My synthesis

Three bets, stated plainly.

Bet 1 (near-term, epistemic): Evaluator-grounded program search is the most reliable path to new results humans can verify without trusting the model’s prose. FunSearch proved it in mathematics. AlphaEvolve is pushing it into engineering. Expect this pattern in materials, chemistry, and algorithm design before it produces a new conservation law.

Bet 2 (medium-term, scientific): The AI-Newton architecture — concepts, general laws, plausible extension — is the right shape for physics-style discovery, even if the current implementation is a PoC. The open problems are scaling the DSL, handling real noise and real experiments, and integrating LLMs without giving up falsifiability. A hybrid seems likely: LLM proposes concept candidates, symbolic machinery verifies and stores them.

Bet 3 (practical impact): The science that changes daily lab work will keep coming from domain-specific models and agentic workflows, not from any single “discover laws from scratch” system. AlphaFold did more for biology than any symbolic-regression paper. Kosmos-style agents may do the same for exploratory analysis — if the outputs stay tethered to evidence.

What I would not bet on: end-to-end paper factories replacing the need for human judgment about what is worth testing. Workshop acceptance is a milestone for automation. It is not the end of the scientific method.

The field is not one race. It is several different races, with different finish lines and different referees. Pick the output type you care about, then pick the route. Everything else is naming.

Sources and further reading

Research notes: notes/AI_physics_discovery_methods_primer.md, notes/AI_Newton_2025_深度解读.md
Local PDF library: readings/ai_physics_discovery/ (18 papers)
Surveys: Agentic Science survey (arXiv:2508.14111); EXHYTE framework
Key papers: AI Feynman · AI-Newton · AI-Hilbert · FunSearch · AlphaEvolve · AI Scientist v2 · AI Scientist Nature 2026