OpenAI's o1 Outdiagnoses ER Physicians in Peer-Reviewed Harvard Study Published in Science


A study recently published in Science by researchers at Harvard Medical School, Beth Israel Deaconess Medical Center (BIDMC), and Stanford found that OpenAI's o1 model produced the correct or near-correct diagnosis in 67% of emergency room triage cases, compared to 55% and 50% for two internal medicine attending physicians evaluated under the same blinded conditions.

The study is the most rigorous head-to-head comparison of a large language model against practicing physicians published to date, and it's prompting its own authors to call for prospective clinical trials rather than deployment.

How the study was structured

Researchers selected 76 real patients admitted to the Beth Israel Deaconess emergency department in Boston. They fed the same electronic health record data available at the time of each patient encounter directly into OpenAI's o1 and GPT-4o models.

Two internal medicine attending physicians independently produced diagnoses for the same cases. Critically, the team did not preprocess or clean the data before providing it to the AI.

"We did not pre-process the data at all," Harvard said in its press release. The AI worked through the same incomplete, unstructured records that clinicians faced in real time.

A separate pair of attending physicians then graded all outputs in a blinded review, with no knowledge of which diagnoses came from humans and which came from machines. Evaluations were conducted at three clinical touchpoints: initial ER triage, first physician contact, and admission to the medical floor or ICU.
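The blinding itself is conceptually straightforward: strip the source labels and shuffle the answers before graders ever see them. The sketch below is a hypothetical illustration of that step, not the study's workflow.

```python
import random

# Hypothetical records: one candidate diagnosis per case, per source, per touchpoint.
outputs = [
    {"case_id": 1, "touchpoint": "triage", "source": "o1", "diagnosis": "Pulmonary embolism"},
    {"case_id": 1, "touchpoint": "triage", "source": "physician_A", "diagnosis": "Acute coronary syndrome"},
    # ...
]

random.shuffle(outputs)

# Graders see only the case, the touchpoint, and the diagnosis, never the source.
blinded = [
    {"case_id": o["case_id"], "touchpoint": o["touchpoint"], "diagnosis": o["diagnosis"]}
    for o in outputs
]
```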

Where the numbers land

At initial triage, when the least information is available and urgency is highest, o1 hit 67% accuracy against 55% and 50% for the two physicians. When richer clinical detail was available later in the encounter, o1 reached 82% accuracy, while the human range was 70% to 79%. The difference at that second threshold was not statistically significant.
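The sample size goes a long way toward explaining that last point: with only 76 cases, a gap of a few percentage points is hard to distinguish from chance. The check below is purely illustrative, using counts back-calculated from the reported percentages rather than the paper's actual statistical analysis, which may have used different methods and denominators.

```python
from scipy.stats import fisher_exact

# Illustrative counts only: roughly 82% and 79% of 76 cases, rounded to whole patients.
o1_correct, o1_wrong = 62, 14                 # ~82% of 76
physician_correct, physician_wrong = 60, 16   # ~79% of 76

odds_ratio, p_value = fisher_exact(
    [[o1_correct, o1_wrong], [physician_correct, physician_wrong]]
)
print(f"p = {p_value:.2f}")  # well above 0.05
```

A gap that small over 76 cases yields a p-value far above any conventional significance threshold, consistent with the authors' caveat.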

The study's Science abstract noted that the performance gap between o1 and the physicians was "especially pronounced at the first diagnostic touchpoint," where the information floor is lowest and the stakes are highest.

"We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines," said Arjun Manrai, an assistant professor of biomedical informatics at Harvard's Blavatnik Institute and a senior co-author of the study.

Beyond the ER cases

The 76-patient head-to-head comparison was one of several experiments in the paper.

Researchers also tested o1 against published case studies from Massachusetts General Hospital that have appeared in The New England Journal of Medicine, a set considered among the most diagnostically challenging in medicine. These cases "span many different areas of medicine" and are "full of either arcane or distracting matter," Manrai said during a press briefing.

Thomas Buckley, a Harvard PhD student who contributed to the study, said the results suggest o1 is approaching the ceiling of accuracy on these benchmark cases, which have been used to measure clinical reasoning in computational systems since 1959.

The study also evaluated management reasoning, covering tasks like antibiotic recommendations and goals-of-care conversations. On those tasks, o1 significantly outperformed prior AI models and also bested physicians who had access to conventional aids like Google search.

"Management reasoning is likely a more complex task than diagnostic reasoning," said Peter Brodeur, a clinical fellow at BIDMC and co-first author of the study. "It requires many considerations of not only the objective features of a case, but also subjective factors."

What the authors say it doesn't mean

The study's authors are explicit about the limitations of their findings. The model was tested only on text-based inputs, a domain where language models are strongest. Practicing physicians rely on imaging, EKGs, physiological signals, and direct patient interaction, none of which were part of this evaluation. The team acknowledged they're running parallel studies on non-text modalities.

The comparison also pitted a general reasoning model against internal medicine physicians, not emergency medicine specialists, a design choice the authors acknowledge explicitly in the paper.

Brodeur noted that diagnostic accuracy alone isn't the full picture. "A model might get the top diagnosis right but also suggest unnecessary testing that could expose a patient to harm," he said in a statement to eWeek.

Adam Rodman, an assistant professor at Harvard Medical School who leads the school's AI curriculum task force and directs the AI program at BIDMC's Shapiro Center, was direct about what he thinks the results should and shouldn't be used to justify.

Rodman said he didn't think the results supported cutting doctors out of the loop, adding, "What these results support is a robust and ambitious research agenda to try to figure out how we can use these technologies to make patients' lives better."

The paper calls explicitly for prospective trials before any clinical deployment, and the authors say they intend to conduct them.