A study published in Science has found that an artificial intelligence reasoning model outperformed physicians in identifying correct diagnoses across a range of clinical cases, including cases built from real-world patient data. The findings add to a growing body of evidence on the potential of large language models to support clinical decision-making, while also highlighting important limitations.
Researchers at Harvard University tested OpenAI’s o1-preview model on a series of medical cases, including classic training scenarios and data drawn directly from the charts of 76 patients seen at a Boston emergency department.
Across clinical reasoning tasks, the AI model included the correct diagnosis, or a close equivalent, in its responses approximately 80% of the time, surpassing both specialist diagnostic software and human clinicians assessed in the study.
One case highlighted by the authors involved a recently transplanted, immunosuppressed patient presenting with respiratory symptoms who was later found to have a necrotising soft tissue infection requiring surgery. The model raised suspicion of this diagnosis considerably earlier than the treating physician.
However, a separate study published in JAMA Network Open on 13 April, which tested 21 AI models across individual steps of the diagnostic process, identified a consistent weakness: AI models based on large language models tend to converge prematurely on a diagnosis when faced with uncertainty and competing possibilities.
The authors of that study concluded that such models are not yet ready for autonomous decision-making in clinical settings. Both research teams agree that further investigation is needed, with clinical trials proposed to assess how AI can be safely and effectively integrated into patient care. The consensus among researchers is that AI should function as a support tool for clinicians rather than a replacement.
Sources:
Brodeur P et al. Performance of a large language model on the reasoning tasks of a physician. Science (2026). DOI: 10.1126/science.adz4433.
Rao A et al. Large language model performance and clinical reasoning tasks. JAMA Network Open (2026). DOI: 10.1001/jamanetworkopen.2026.4003.