Enhancing early detection of cognitive decline in the elderly: a comparative study utilizing large language models in clinical notes

Xinsong Du; John Novoa-Laurentiev; Joseph M Plasek; Ya-Wen Chuang; Liqin Wang; Gad A Marshall; Stephanie K Mueller; Frank Chang; Surabhi Datta; Hunki Paek; Bin Lin; Qiang Wei; Xiaoyan Wang; Jingqi Wang; Hao Ding; Frank J Manion; Jingcheng Du; David W Bates; Li Zhou

doi:10.1016/j.ebiom.2024.105401

EBioMedicine. 2024 Oct 12:109:105401. doi: 10.1016/j.ebiom.2024.105401. Online ahead of print.

Authors

Xinsong Du¹, John Novoa-Laurentiev², Joseph M Plasek³, Ya-Wen Chuang⁴, Liqin Wang³, Gad A Marshall⁵, Stephanie K Mueller³, Frank Chang², Surabhi Datta⁶, Hunki Paek⁶, Bin Lin⁶, Qiang Wei⁶, Xiaoyan Wang⁶, Jingqi Wang⁶, Hao Ding⁶, Frank J Manion⁶, Jingcheng Du⁶, David W Bates³, Li Zhou³

Affiliations

¹ Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, 02115, USA; Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA. Electronic address: xidu1@bwh.harvard.edu.
² Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, 02115, USA.
³ Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, 02115, USA; Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA.
⁴ Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, 02115, USA; Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA; Division of Nephrology, Taichung Veterans General Hospital, Taichung, 407219, Taiwan; Department of Post-Baccalaureate Medicine, College of Medicine, National Chung Hsing University, Taichung, 402202, Taiwan; School of Medicine, College of Medicine, China Medical University, Taichung, 406040, Taiwan.
⁵ Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA; Department of Neurology, Brigham and Women's Hospital, Boston, MA, 02115, USA.
⁶ Intelligent Medical Objects, Rosemont, Illinois, 60018, USA.

PMID: 39396423

Abstract

Background: Large language models (LLMs) have shown promising performance in various healthcare domains, but their effectiveness in identifying specific clinical conditions in real medical records is less explored. This study evaluates LLMs for detecting signs of cognitive decline in real electronic health record (EHR) clinical notes, comparing their error profiles with traditional models. The insights gained will inform strategies for performance enhancement.

Methods: This study, conducted at Mass General Brigham in Boston, MA, analysed clinical notes from the four years prior to a 2019 diagnosis of mild cognitive impairment in patients aged 50 and older. We developed prompts for two LLMs, Llama 2 and GPT-4, on Health Insurance Portability and Accountability Act (HIPAA)-compliant cloud-computing platforms using multiple approaches (e.g., hard prompting, retrieval-augmented generation, and error analysis-based instructions) to select the optimal LLM-based method. Baseline models included a hierarchical attention-based neural network and XGBoost. Subsequently, we constructed a majority-vote ensemble of the three models (the selected LLM-based method and the two baseline models). Confusion-matrix-based scores were used for model evaluation.
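A majority vote over three binary classifiers, scored with confusion-matrix metrics, can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the function names, the 0/1 label encoding, and the toy predictions are all hypothetical, and the three prediction lists merely stand in for the LLM, the attention-based network, and XGBoost.

```python
def majority_vote(model_predictions):
    """Combine binary (0/1) predictions from an odd number of models.

    model_predictions: list of equal-length label lists, one per model.
    A sample is labelled 1 when more than half of the models vote 1.
    """
    n_models = len(model_predictions)
    return [1 if 2 * sum(votes) > n_models else 0
            for votes in zip(*model_predictions)]


def confusion_scores(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data (hypothetical): each model makes two errors, on different samples.
y_true = [1, 1, 0, 0, 1, 0]
llm_pred = [1, 0, 0, 1, 1, 0]
han_pred = [1, 1, 0, 0, 0, 0]
xgb_pred = [0, 1, 1, 0, 1, 0]

ensemble_pred = majority_vote([llm_pred, han_pred, xgb_pred])
print(confusion_scores(y_true, llm_pred))
print(confusion_scores(y_true, ensemble_pred))
```

In this toy case no two models are wrong on the same sample, so the vote corrects every individual error; real data would not be this clean.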

Findings: For model development, we used an annotated random sample of 4949 note sections from 1969 patients (women: 1046 [53.1%]; age: mean, 76.0 [SD, 13.3] years), filtered with keywords related to cognitive functions. For testing, an annotated random sample of 1996 note sections from 1161 patients (women: 619 [53.3%]; age: mean, 76.5 [SD, 10.2] years) without keyword filtering was utilised. GPT-4 demonstrated superior accuracy and efficiency compared to Llama 2, but did not outperform traditional models. The ensemble model outperformed the individual models in terms of all evaluation metrics with statistical significance (p < 0.01), achieving a precision of 90.2% [95% CI: 81.9%-96.8%], a recall of 94.2% [95% CI: 87.9%-98.7%], and an F1-score of 92.1% [95% CI: 86.8%-96.4%]. Notably, the ensemble significantly improved precision, which rose from a range of 70%-79% for the single models to above 90%. Error analysis revealed that 63 samples were incorrectly predicted by at least one model; however, only 2 cases (3.2%) were mutual errors across all models, indicating diverse error profiles among them.
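The error-overlap figure reported above (samples misclassified by at least one model versus by all models) can be computed along these lines. This is a hedged sketch: `error_overlap`, the model names, and the toy labels are hypothetical and not taken from the study; only the final arithmetic check uses the reported counts of 2 and 63.

```python
def error_overlap(y_true, predictions_by_model):
    """Count samples misclassified by at least one model and by every model.

    predictions_by_model: dict mapping a model name to its list of labels.
    Returns (wrong_by_any, wrong_by_all).
    """
    wrong_by_any = 0
    wrong_by_all = 0
    for i, truth in enumerate(y_true):
        wrong = [preds[i] != truth for preds in predictions_by_model.values()]
        if any(wrong):
            wrong_by_any += 1
        if all(wrong):
            wrong_by_all += 1
    return wrong_by_any, wrong_by_all


# Toy data (hypothetical): one shared error, two model-specific errors.
y_true = [1, 0, 1, 0, 1]
predictions = {
    "model_a": [0, 0, 1, 0, 1],
    "model_b": [0, 1, 1, 0, 1],
    "model_c": [0, 0, 0, 0, 1],
}

any_wrong, all_wrong = error_overlap(y_true, predictions)
print(any_wrong, all_wrong)

# Share of mutual errors using the counts reported in the abstract.
reported_share = round(100 * 2 / 63, 1)
print(reported_share)
```

A small mutual-error share means most mistakes are specific to one or two models, which is the condition under which a majority vote can correct them.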

Interpretation: LLMs and traditional machine learning models trained using local EHR data exhibited diverse error profiles. These models were found to be complementary, and their ensemble enhanced diagnostic performance. Future research should investigate integrating LLMs with smaller, localised models and incorporating medical data and domain knowledge to enhance performance on specific tasks.

Funding: This research was supported by the National Institute on Aging grants (R44AG081006, R01AG080429) and National Library of Medicine grant (R01LM014239).

Keywords: Alzheimer disease; Cognitive dysfunction; Dementia; Early diagnosis; Electronic health records; Natural language processing; Neurobehavioral manifestations.