Testing the Ability and Limitations of ChatGPT to Generate Differential Diagnoses from Transcribed Radiologic Findings

Radiology. 2024 Oct;313(1):e232346. doi: 10.1148/radiol.232346.

Abstract

Background The burgeoning interest in ChatGPT as a potentially useful tool in medicine highlights the necessity for systematic evaluation of its capabilities and limitations.

Purpose To evaluate the accuracy, reliability, and repeatability of differential diagnoses produced by ChatGPT from transcribed radiologic findings.

Materials and Methods Cases selected from a radiology textbook series spanning a variety of imaging modalities, subspecialties, and anatomic pathologies were converted into standardized prompts that were entered into ChatGPT (GPT-3.5 and GPT-4 algorithms; April 3 to June 1, 2023). Responses were analyzed for accuracy via comparison with the final diagnosis and top 3 differential diagnoses provided in the textbook, which served as the ground truth. Reliability, defined on the basis of the frequency of algorithmic hallucination, was assessed through the identification of factually incorrect statements and fabricated references. Comparisons between the algorithms were made using the McNemar test and a generalized estimating equation model framework. Test-retest repeatability was measured by obtaining 10 independent responses from both algorithms for 10 cases in each subspecialty and calculating the average pairwise percent agreement and Krippendorff α.

Results A total of 339 cases were collected across multiple radiologic subspecialties. The overall accuracy for the final diagnosis was 53.7% (182 of 339) for GPT-3.5 and 66.1% (224 of 339) for GPT-4 (P < .001). The mean differential score (ie, the proportion of the top 3 diagnoses that matched the original literature differential diagnosis) was 0.50 for GPT-3.5 and 0.54 for GPT-4 (P = .06). Of the references provided in responses, 39.9% (401 of 1006) from GPT-3.5 and 14.3% (161 of 1124) from GPT-4 were fabricated (P < .001). GPT-3.5 and GPT-4 generated false statements in 16.2% (55 of 339) and 4.7% (16 of 339) of cases, respectively (P < .001). Across subspecialties, the average pairwise percent agreement ranged from 59% to 98% for the final diagnosis and from 23% to 49% for the top 3 differential diagnoses.

Conclusion ChatGPT achieved the best results when the most up-to-date model (GPT-4) was used and when it was prompted for a single diagnosis. Hallucination frequency was lower with GPT-4 than with GPT-3.5, but repeatability was an issue for both models.

© RSNA, 2024. Supplemental material is available for this article. See also the editorial by Chang in this issue.
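
Illustrative repeatability calculation (not from the article)

The abstract reports test-retest repeatability as the average pairwise percent agreement and Krippendorff α across 10 repeated responses per case, but does not include analysis code. The sketch below is a minimal, non-authoritative illustration of how those two metrics can be computed for single-diagnosis (nominal) responses; the function names and example diagnosis labels are hypothetical and are not taken from the study's code or data.

```python
from collections import Counter
from itertools import combinations


def pairwise_percent_agreement(responses_per_case):
    """Mean fraction of agreeing response pairs per case (test-retest agreement)."""
    per_case = []
    for responses in responses_per_case:
        pairs = list(combinations(responses, 2))
        agreements = sum(a == b for a, b in pairs)
        per_case.append(agreements / len(pairs))
    return sum(per_case) / len(per_case)


def krippendorff_alpha_nominal(responses_per_case):
    """Krippendorff's alpha for nominal labels with no missing responses."""
    # Coincidence matrix: every ordered pair of responses within a case,
    # weighted by 1 / (m - 1), where m is the number of responses for that case.
    coincidences = Counter()
    for responses in responses_per_case:
        m = len(responses)
        for i, a in enumerate(responses):
            for j, b in enumerate(responses):
                if i != j:
                    coincidences[(a, b)] += 1.0 / (m - 1)
    n = sum(coincidences.values())  # total number of pairable responses
    marginals = Counter()
    for (a, _), weight in coincidences.items():
        marginals[a] += weight
    observed_disagreement = sum(w for (a, b), w in coincidences.items() if a != b)
    expected_disagreement = (n * n - sum(v * v for v in marginals.values())) / (n - 1)
    return 1.0 - observed_disagreement / expected_disagreement


# Hypothetical repeatability data: each inner list holds the 10 final diagnoses
# one model returned for repeated runs of the same case prompt (labels invented).
runs = [
    ["osteosarcoma"] * 8 + ["Ewing sarcoma"] * 2,
    ["meningioma"] * 10,
    ["pneumonia"] * 7 + ["aspiration", "atelectasis", "pneumonia"],
]
print(pairwise_percent_agreement(runs))
print(krippendorff_alpha_nominal(runs))
```

Percent agreement ignores chance, whereas Krippendorff α discounts the agreement expected from the marginal label frequencies, which is why the two metrics can diverge for the same set of repeated responses.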

MeSH terms

  • Algorithms*
  • Diagnosis, Differential
  • Humans
  • Reproducibility of Results