How Well Does ChatGPT Perform as a Medical Diagnostic Tool? A Closer Look
Artificial intelligence models like ChatGPT have captured imaginations with their ability to generate text, answer questions, and assist across many fields. But how reliable is ChatGPT at diagnosing medical conditions? A recent study evaluated ChatGPT 3.5's diagnostic accuracy on 150 clinical case challenges and found that while the AI can sometimes identify the correct diagnosis, it struggles often enough to be far from a dependable medical tool on its own.
Main Findings
The researchers tested ChatGPT using real clinical cases where the AI had to suggest diagnoses and explain its reasoning. The model’s answers were then independently reviewed by medical experts who assessed whether the correct diagnosis was among ChatGPT’s suggestions and evaluated the quality of the explanations.
- Accuracy and Diagnostic Performance: ChatGPT identified the correct diagnosis in about half of the cases, for an accuracy of roughly 50%. In other words, it failed to provide the correct diagnosis half the time, a serious limitation for clinical use.
- Evaluating False Positives and Negatives: The study tallied true positives, false positives, true negatives, and false negatives to get a full picture of the AI's diagnostic strengths and weaknesses. These counts yield key metrics such as precision, sensitivity, and specificity, all essential to understanding diagnostic reliability (see the first sketch after this list).
- Complexity and Cognitive Load: The explanations provided by ChatGPT varied in clarity. Some answers were straightforward and easy to understand (low cognitive load), while others were more complex and harder to follow, which could affect how useful the AI is in an educational or clinical setting.
- Limitations in Handling Lab Data: ChatGPT struggled to interpret complex laboratory values and to integrate them into its diagnostic reasoning, a skill central to medical diagnosis that the model currently lacks.
- ROC Curve and Overall Diagnostic Ability: Using statistical tools like Receiver Operating Characteristic (ROC) curves, the study quantified ChatGPT's ability to discriminate between correct and incorrect diagnoses, underscoring potential that is limited but not absent (see the second sketch below).
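To make the confusion-matrix metrics concrete, here is a minimal sketch of how precision, sensitivity, and specificity fall out of the four counts. The counts in the example are hypothetical placeholders chosen only to exercise the formulas, not figures reported in the study.

```python
# Minimal sketch: deriving standard diagnostic metrics from the four
# confusion-matrix counts. All counts here are hypothetical placeholders,
# not figures reported in the study.

def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, precision, sensitivity, and specificity."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),    # of suggested diagnoses, fraction correct
        "sensitivity": tp / (tp + fn),  # true positive rate (recall)
        "specificity": tn / (tn + fp),  # true negative rate
    }

# Hypothetical split of 150 cases, for illustration only.
print(diagnostic_metrics(tp=60, fp=30, tn=45, fn=15))
```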
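And for the ROC analysis, a minimal sketch of how such a curve and its area (AUC) could be computed, assuming each case carries a binary correct/incorrect label and a confidence score. Both arrays below are synthetic stand-ins, not the study's data.

```python
# Minimal ROC sketch. Assumes each case has a ground-truth label
# (1 = diagnosis correct) and a confidence score; both arrays are
# synthetic stand-ins, not the study's data.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(seed=42)
y_true = rng.integers(0, 2, size=150)              # synthetic correct/incorrect labels
noise = rng.normal(loc=0.5, scale=0.25, size=150)
y_score = np.clip(0.3 * y_true + noise, 0.0, 1.0)  # correct cases skew toward higher scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points along the ROC curve
print(f"AUC = {roc_auc_score(y_true, y_score):.2f}")  # 0.5 ~ chance, 1.0 = perfect
```

An AUC near 0.5 corresponds to chance-level discrimination; the study's "limited but not absent" framing suggests performance somewhere above that floor but well below perfect.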
Conclusion
While ChatGPT shows promise as an educational aid by providing disease background and diagnostic reasoning, it is not yet reliable enough to be used as a standalone diagnostic tool for medical learners or clinicians. The AI’s tendency to misdiagnose or provide inaccurate information highlights the need for continued improvement and cautious use in healthcare contexts.
This study serves as a helpful benchmark for understanding both the capabilities and current limitations of large language models like ChatGPT in medicine, emphasizing that such technology should complement, not replace, human judgment for now.
Authored by A.H., B.N., and E.T., this research was published in the journal PLOS ONE. The authors are affiliated with institutions dedicated to advancing medical education and AI evaluation in healthcare.