Researchers from Boston Children's Hospital's Manton Center for Orphan Disease Research, Harvard, and OpenAI used the o3 Deep Research reasoning model to reanalyze 376 previously unsolved rare-disease cases. After expert review and clinical confirmation, they established 18 new diagnoses, an additional diagnostic yield of 4.8% on cases that had already passed through multiple commercial and institutional pipelines without a diagnosis. The study was published June 18 in NEJM AI. The framing is deliberately narrow: the model never diagnosed a patient. It acted as an explanation-first reasoning layer on top of existing genomic pipelines, producing evidence-linked hypotheses that tied clinical features, inheritance pattern, variant evidence, and the scientific literature into a justification a human reviewer could interrogate.
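The headline yield follows directly from the two counts reported above. A minimal check, using only the article's figures (the variable names are illustrative, not the study's):

```python
# Counts as reported in the article; names are illustrative only.
reanalyzed_cases = 376
new_diagnoses = 18

# Additional diagnostic yield = confirmed new diagnoses / cases reanalyzed.
additional_yield = new_diagnoses / reanalyzed_cases
yield_percent = round(additional_yield * 100, 1)

print(f"Additional diagnostic yield: {yield_percent}%")
```

The denominator is the full set of reanalyzed cases, all of which were unsolved going in, so the figure is incremental yield over prior pipelines rather than an overall diagnostic rate.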
The workflow fed the model one de-identified packet per case: standardized Human Phenotype Ontology terms, occasional clinician notes, metadata such as age and sex, and a filtered variant table carrying each variant's rarity, predicted protein effect, ClinVar classification, and signal quality across family members. Most cases included the child and both biological parents. Reviewers scored candidate explanations using the same ACMG and AMP framework that clinical labs use. At least two reviewers assessed each candidate, disagreements were resolved by consensus, and a finding counted as a diagnosis only after a CLIA-certified lab confirmed the variant. Before touching unsolved cases, the team calibrated on solved ones: the model recovered the correct gene and variant in 48 of 51 mixed rare-condition cases and 45 of 57 neuromuscular cases, and it named the correct gene in all 15 long-read cases. Self-reported confidence tracked correctness, with a mean minimum score of 85.6 for consistently correct calls versus 42.1 for incorrect or unknown ones, though the team stressed that these scores were not calibrated probabilities.
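The packet contents and the rule for counting a diagnosis can be sketched as plain data structures. This is an illustration under assumptions, not the study's schema: every class, field, and function name below is hypothetical, and only the listed inputs, the review rule, and the calibration counts come from the article.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Variant:
    # Hypothetical field names mirroring the variant-table columns described above.
    gene: str
    population_frequency: float             # rarity
    predicted_protein_effect: str
    clinvar_classification: Optional[str]
    quality_by_family_member: Dict[str, float]


@dataclass
class CasePacket:
    # De-identified inputs; clinician notes are only occasionally present.
    hpo_terms: List[str]
    age_years: Optional[float]
    sex: Optional[str]
    variants: List[Variant]
    clinician_notes: Optional[str] = None


@dataclass
class CandidateReview:
    reviewer_count: int
    consensus_reached: bool
    clia_confirmed: bool


def counts_as_diagnosis(review: CandidateReview) -> bool:
    """Article's stated rule: at least two reviewers, consensus, CLIA confirmation."""
    return (
        review.reviewer_count >= 2
        and review.consensus_reached
        and review.clia_confirmed
    )


# Calibration counts reported in the article: (correct, total) per solved cohort.
calibration = {
    "mixed rare-condition": (48, 51),
    "neuromuscular": (45, 57),
    "long-read (gene only)": (15, 15),
}
recovery_percent = {
    cohort: round(100 * correct / total, 1)
    for cohort, (correct, total) in calibration.items()
}

for cohort, pct in recovery_percent.items():
    print(f"{cohort}: {pct}%")
```

Keeping the confirmation step as an explicit boolean gate reflects the article's point that a model hypothesis, however well scored, is not a diagnosis until a certified lab confirms the variant.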
Yields varied by cohort, from 10% in neurodevelopmental cases and 13.3% in a small early-psychosis group down to 1% in sudden unexpected pediatric death. Seven of the 18 were rediscoveries, diagnoses that existed elsewhere but were missing from the record the team reviewed, which underscores that much of the problem is synthesizing fragmented evidence rather than novel reasoning. The model also showed flexibility: in one early-psychosis case it inferred a 22q11.2 deletion from a run of low-quality calls on chromosome 22 plus the child's cardiac, immune, and neurodevelopmental features, later confirmed by follow-up sequencing. It surfaced digenic explanations the prompt did not ask for, and proposed a testable mechanistic hypothesis linking an S1PR1 deletion to vitiligo.
The authors are careful about limits. The study was retrospective, cohorts were heterogeneous, reviewers were not blinded to model confidence, and the team did not measure time saved, cost, or false-positive burden. They call for prospective, multi-center comparisons against standard practice, with versioned prompts and audit logs. The Manton Center will lead the next stage through an OpenAI Foundation grant to build a platform-agnostic, low-cost genetics copilot. The result matters less as a capability headline than as a concrete template for treating AI-assisted reanalysis as a maintenance problem, since the same genome becomes newly interpretable every time the surrounding knowledge base moves.
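The article does not say how versioned prompts and audit logs would be implemented. One possible shape, offered purely as an assumption, is to derive a version ID from a hash of the prompt text and write one JSON record per model run. The function names and record fields below are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def prompt_version(prompt_text: str) -> str:
    """Derive a short, stable version ID from the prompt's exact content."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


def audit_record(case_id: str, prompt_text: str, model_name: str, output_text: str) -> str:
    """Build one JSON audit line for a single model run (hypothetical schema)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "case_id": case_id,
        "model": model_name,
        "prompt_version": prompt_version(prompt_text),
        # Hash rather than store the output, so the log itself holds no case detail.
        "output_sha256": hashlib.sha256(output_text.encode("utf-8")).hexdigest(),
    }
    return json.dumps(record, sort_keys=True)


line = audit_record("case-001", "Example prompt text", "example-model", "Example output")
print(line)
```

Because the version ID depends only on the prompt's content, any edit to the prompt produces a new ID, which lets a later reanalysis of the same genome be traced to the exact prompt that produced it.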