OpenAI, working with Molecule.one, connected GPT-5.4 to “Maria,” an agentic chemistry system wired into a high-throughput automated laboratory, and gave it an open-ended goal: improve a useful but stubborn reaction class. The model generated and ranked thousands of research proposals; human chemists selected four to test. The standout, internally labeled OAI-M1-03, targeted Chan–Lam coupling — a carbon–nitrogen bond-forming reaction — for the historically low-yielding case of coupling primary sulfonamides with boronic acids. GPT-5.4 independently identified primary sulfonamides as the high-value substrate class and proposed that mild oxidants, in particular TEMPO, could raise yields.
Across two experimental cycles the Maria lab ran 10,080 reactions, more than a chemist running three every working day would complete in a decade. Under the optimized conditions, measured yields improved for 88 percent of the boronic acids and 83 percent of the sulfonamides tested; mean yield rose from 16.6 percent to 25.2 percent, and the share of reactions clearing 30 percent yield rose from 15.6 percent to 37.5 percent. In a useful follow-up, the system found that TEMPO could be swapped for a much cheaper analog, 4-hydroxy-TEMPO, with little loss of performance. Crucially, the result survived the jump out of microliter screening: human chemists reproduced representative reactions at bench scale and saw higher yields for 11 of 14 substrate pairs, with most more than doubling. Four external chemists reviewed the preprint and judged the finding novel.
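The headline numbers in that paragraph are easier to weigh once the arithmetic is spelled out. The short Python sketch below recomputes them from the quoted figures only. The days-per-year values (365 calendar days, 250 working days) are illustrative assumptions that do not come from the source, and all variable and function names are hypothetical.

```python
# Figures quoted in the paragraph above.
REACTIONS_RUN = 10_080
MANUAL_RATE_PER_DAY = 3
MEAN_YIELD_BEFORE, MEAN_YIELD_AFTER = 16.6, 25.2    # percent
SHARE_OVER_30_BEFORE, SHARE_OVER_30_AFTER = 15.6, 37.5  # percent of reactions
BENCH_IMPROVED, BENCH_TOTAL = 11, 14                # substrate pairs

# Assumptions (NOT from the source): two readings of "a day".
CALENDAR_DAYS_PER_YEAR = 365
WORKING_DAYS_PER_YEAR = 250


def years_of_manual_work(reactions, per_day, days_per_year):
    """Years one chemist would need at a fixed daily rate."""
    return reactions / per_day / days_per_year


def relative_gain(before, after):
    """Change expressed as a fraction of the starting value."""
    return (after - before) / before


# Throughput comparison.
manual_days = REACTIONS_RUN / MANUAL_RATE_PER_DAY
years_calendar = years_of_manual_work(
    REACTIONS_RUN, MANUAL_RATE_PER_DAY, CALENDAR_DAYS_PER_YEAR)
years_working = years_of_manual_work(
    REACTIONS_RUN, MANUAL_RATE_PER_DAY, WORKING_DAYS_PER_YEAR)

# Yield improvements: absolute (percentage points) and relative.
mean_yield_points = MEAN_YIELD_AFTER - MEAN_YIELD_BEFORE
mean_yield_relative = relative_gain(MEAN_YIELD_BEFORE, MEAN_YIELD_AFTER)
share_ratio = SHARE_OVER_30_AFTER / SHARE_OVER_30_BEFORE

# Bench-scale validation.
bench_fraction = BENCH_IMPROVED / BENCH_TOTAL

print(f"manual days needed:      {manual_days:.0f}")
print(f"years (calendar days):   {years_calendar:.1f}")
print(f"years (working days):    {years_working:.1f}")
print(f"mean yield gain:         {mean_yield_points:.1f} points "
      f"({mean_yield_relative:.1%} relative)")
print(f"share over 30% yield:    {share_ratio:.1f}x")
print(f"bench pairs improved:    {bench_fraction:.1%}")
```

Under these assumptions the 10,080 reactions correspond to 3,360 days of manual work, or about 9.2 years at 365 days a year and about 13.4 years at 250 working days a year, so the decade comparison holds on a working-day basis. The mean-yield gain is 8.6 percentage points, roughly a 52 percent relative increase, and the share of reactions clearing 30 percent yield grows about 2.4-fold.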
OpenAI is careful to call this near-autonomous, not autonomous: humans wrote the steering and grading prompts, chose which proposals entered the lab, corrected experimental plans (the largest correction was avoiding DMSO as a solvent), handled consumables, and ran the bench validation. The whole effort took three months, from the first prompt on March 4 to sharing results on June 4. The work was scoped under OpenAI’s Preparedness Framework to a legitimate medicinal-chemistry problem, involved no toxins or weapons design, and used a model already evaluated with the UK AI Security Institute. The significance is the loop, not just the molecule — a frontier model proposed a surprising, specific, falsifiable hypothesis, designed and interpreted experiments, and arrived at a result human chemists could reproduce. The caveats are real and stated: a single reaction class, specialized infrastructure, and no proof yet that the method generalizes to other couplings or substrates.