“Performance of ChatGPT on Free-Response, Clinical Reasoning Exams”, Eric Strong, Alicia DiGiammarino, Yingjie Weng, Preetha Basaviah, Poonam Hosamani, Andre Kumar, Andrew Nevins, John Kugler, Jason Hom, Jonathan H. Chen, 2023-03-29:

Importance: [Twitter] Studies show that ChatGPT, a general purpose large language model chatbot, could pass the multiple-choice US Medical Licensing Exams, but the model’s performance on open-ended clinical reasoning is unknown.

Objective: To determine if ChatGPT is capable of consistently meeting the passing threshold on free-response, case-based clinical reasoning assessments.

Design: 14 multi-part cases were selected from clinical reasoning exams administered to pre-clerkship medical students 2019–2022. For each case, the questions were run through ChatGPT twice and responses were recorded. Two clinician educators independently graded each run according to a standardized grading rubric. To further assess the degree of variation in ChatGPT’s performance, we repeated the analysis on a single high-complexity case 20×.

Setting: A single US medical school.

Participants: ChatGPT.

Main Outcomes & Measures: Passing rate of ChatGPT’s scored responses and the range in model performance across multiple run-throughs of a single case.

Results: 12⁄28 ChatGPT exam responses achieved a passing score (43%) with a mean score of 69% (95% CI: 65% to 73%) compared to the established passing threshold of 70%. When given the same case 20 separate times, ChatGPT’s performance on that case varied with scores ranging 35.9%–59.4%. [GPT-4 outperformed their students.]
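As a quick arithmetic check on the reported figures, the sketch below recomputes the passing rate from the 12-of-28 count and tests whether the 70% threshold falls inside the reported 95% CI for the mean score. The variable names are illustrative only, and the numbers are copied from the abstract rather than derived from the underlying data.

```python
# Figures as reported in the abstract (not recomputed from raw scores).
passed, total = 12, 28
ci_low, ci_high = 0.65, 0.73   # reported 95% CI for the mean score
threshold = 0.70               # established passing threshold

# Fraction of graded runs that reached a passing score.
pass_rate = passed / total

# The mean score is statistically indistinguishable from the threshold
# (at this level) when the threshold lies inside the reported interval.
threshold_in_ci = ci_low <= threshold <= ci_high

print(f"pass rate: {pass_rate:.0%}")
print(f"threshold inside reported 95% CI: {threshold_in_ci}")
```

Because the interval spans the threshold, the mean score alone does not show ChatGPT to be clearly above or below passing on these cases.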

Conclusion: ChatGPT’s ability to achieve a passing performance in nearly half of the cases analyzed demonstrates the need to revise clinical reasoning assessments and incorporate artificial intelligence (AI)-related topics into medical curricula and practice.