The Treasure Trove Hidden in Plain Sight: The Utility of GPT-4 in Chest Radiograph Evaluation

Soroosh Tayebi Arasteh; Robert Siepmann; Marc Huppertz; Mahshad Lotfinia; Behrus Puladi; Christiane Kuhl; Daniel Truhn; Sven Nebelung

doi:10.1148/radiol.233441

Radiology. 2024 Nov;313(2):e233441. doi: 10.1148/radiol.233441.

Authors

Soroosh Tayebi Arasteh¹, Robert Siepmann¹, Marc Huppertz¹, Mahshad Lotfinia¹, Behrus Puladi¹, Christiane Kuhl¹, Daniel Truhn^#¹, Sven Nebelung^#¹

Affiliation

¹ From the Department of Diagnostic and Interventional Radiology (S.T.A., R.S., M.H., M.L., C.K., D.T., S.N.), Department of Oral and Maxillofacial Surgery (B.P.), and Institute of Medical Informatics (B.P.), University Hospital RWTH Aachen, Pauwelsstr 30, 52074 Aachen, Germany; Pattern Recognition Laboratory, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany (S.T.A.); and Institute of Heat and Mass Transfer, RWTH Aachen University, Aachen, Germany (M.L.).

^# Contributed equally.

PMID: 39530893

Abstract

Background: Limited statistical knowledge can slow critical engagement with and adoption of artificial intelligence (AI) tools for radiologists. Large language models (LLMs) such as OpenAI's GPT-4, and notably its Advanced Data Analysis (ADA) extension, may improve the adoption of AI in radiology.

Purpose: To validate GPT-4 ADA outputs when autonomously conducting analyses of varying complexity on a multisource clinical dataset.

Materials and Methods: In this retrospective study, unique itemized radiologic reports of bedside chest radiographs, associated demographic data, and laboratory markers of inflammation from patients in intensive care from January 2009 to December 2019 were evaluated. GPT-4 ADA, accessed between December 2023 and January 2024, was tasked with autonomously analyzing this dataset by plotting radiography usage rates, providing descriptive statistics measures, quantifying factors of pulmonary opacities, and setting up machine learning (ML) models to predict their presence. Three scientists with 6-10 years of ML experience validated the outputs by verifying the methodology, assessing coding quality, re-executing the provided code, and comparing ML models head-to-head with their human-developed counterparts (based on the area under the receiver operating characteristic curve [AUC], accuracy, sensitivity, and specificity). Statistical significance was evaluated using bootstrapping.

Results: A total of 43 788 radiograph reports, with their laboratory values, from University Hospital RWTH Aachen were evaluated from 43 788 patients (mean age, 66 years ± 15 [SD]; 26 804 male). While GPT-4 ADA provided largely appropriate visualizations, descriptive statistical measures, quantitative statistical associations based on logistic regression, and gradient boosting machines for the predictive task (AUC, 0.75), some statistical errors and inaccuracies were encountered. ML strategies were valid and based on consistent coding routines, resulting in valid outputs on par with human specialist-developed reference models (AUC, 0.80 [95% CI: 0.80, 0.81] vs 0.80 [95% CI: 0.80, 0.81]; P = .51) (accuracy, 79% [6910 of 8758 patients] vs 78% [6875 of 8758 patients], respectively; P = .27).

Conclusion: LLMs may facilitate data analysis in radiology, from basic statistics to advanced ML-based predictive modeling.

© RSNA, 2024

Supplemental material is available for this article.
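The abstract reports that two models were compared on AUC and that significance was evaluated using bootstrapping, but it does not describe the procedure. The following is a minimal sketch of one common approach, a paired percentile bootstrap on the AUC difference; it is not the authors' code. The function names (`auc`, `paired_bootstrap_auc`), the resample count, the two-sided P value rule, and the synthetic data are all assumptions made for illustration.

```python
import random


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair (Mann-Whitney U identity).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def paired_bootstrap_auc(labels, scores_a, scores_b, n_boot=500, seed=0):
    """Compare two models scored on the same cases (hypothetical helper).

    Returns the observed AUC difference, a 95% percentile interval for it,
    and a two-sided bootstrap P value (an assumed, commonly used rule).
    """
    rng = random.Random(seed)
    n = len(labels)
    observed = auc(labels, scores_a) - auc(labels, scores_b)
    diffs = []
    while len(diffs) < n_boot:
        # Resample cases with replacement; the same indices are used for
        # both models so the comparison stays paired.
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # skip resamples that lost one of the classes
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        diffs.append(auc(y, a) - auc(y, b))
    diffs.sort()
    lo = diffs[int(0.025 * n_boot)]
    hi = diffs[int(0.975 * n_boot) - 1]
    frac_le = sum(d <= 0 for d in diffs) / n_boot
    frac_ge = sum(d >= 0 for d in diffs) / n_boot
    p_value = min(1.0, 2 * min(frac_le, frac_ge))
    return observed, (lo, hi), p_value


# Synthetic stand-in data (NOT from the study): two equally noisy models.
_rng = random.Random(42)
labels = [1 if _rng.random() < 0.5 else 0 for _ in range(100)]
scores_a = [y + _rng.gauss(0, 1) for y in labels]
scores_b = [y + _rng.gauss(0, 1) for y in labels]

observed, ci, p_value = paired_bootstrap_auc(labels, scores_a, scores_b)
print(f"AUC difference {observed:.3f}, 95% CI {ci[0]:.3f} to {ci[1]:.3f}, P = {p_value:.2f}")
```

Resampling the same case indices for both models keeps the comparison paired, which matters when both models are evaluated on one shared test set, as the head-to-head comparison in the abstract suggests.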

MeSH terms

Aged
Artificial Intelligence
Female
Humans
Machine Learning
Male
Middle Aged
Radiography, Thoracic* / methods
Retrospective Studies