Expert evaluation of large language models for clinical dialogue summarization

David Fraile Navarro; Enrico Coiera; Thomas W Hambly; Zoe Triplett; Nahyan Asif; Anindya Susanto; Anamika Chowdhury; Amaya Azcoaga Lorenzo; Mark Dras; Shlomo Berkovsky

doi:10.1038/s41598-024-84850-x

Sci Rep. 2025 Jan 7;15(1):1195. doi: 10.1038/s41598-024-84850-x.

Authors

David Fraile Navarro¹, Enrico Coiera², Thomas W Hambly³, Zoe Triplett⁴, Nahyan Asif⁵, Anindya Susanto²,⁶, Anamika Chowdhury⁷, Amaya Azcoaga Lorenzo⁸,⁹,¹⁰, Mark Dras¹¹, Shlomo Berkovsky²

Affiliations

¹ Centre for Health Informatics, Australian Institute of Health Innovation, Macquarie University, Level 6, 75 Talavera Road, North Ryde, Sydney, NSW, 2113, Australia. david.frailenavarro@mq.edu.au.
² Centre for Health Informatics, Australian Institute of Health Innovation, Macquarie University, Level 6, 75 Talavera Road, North Ryde, Sydney, NSW, 2113, Australia.
³ Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, Australia.
⁴ School of Medicine, Faculty of Human and Health Sciences, Macquarie University, Sydney, Australia.
⁵ Macquarie University Hospital, Sydney, Australia.
⁶ Faculty of Medicine, Universitas Indonesia, Jakarta, Indonesia.
⁷ Cowra District Hospital, Cowra, Australia.
⁸ Health Centre Los Pintores, Madrid Health Services, Madrid, Spain.
⁹ Health Research Institute, Fundación Jimenez Díaz, Madrid, Spain.
¹⁰ University of St Andrews, St Andrews, Scotland, UK.
¹¹ School of Computing, Macquarie University, Sydney, Australia.

Abstract

We assessed the performance of large language models in summarizing clinical dialogues using computational metrics and human evaluations, comparing automatically generated summaries against human-produced ones. We conducted an exploratory evaluation of five language models: one general summarization model, one fine-tuned for general dialogues, two fine-tuned with anonymized clinical dialogues, and one Large Language Model (ChatGPT). The models were assessed using ROUGE and UniEval metrics, and by expert human evaluation in which clinicians compared the generated summaries against a clinician-generated summary (gold standard). The fine-tuned transformer model scored the highest when evaluated with ROUGE, while ChatGPT scored the lowest overall. However, using UniEval, ChatGPT scored the highest across all the evaluated domains (coherence 0.957, consistency 0.7583, fluency 0.947, relevance 0.947, and overall score 0.9891). Similar results were obtained when the systems were evaluated by clinicians, with ChatGPT scoring the highest in four domains (coherence 0.573, consistency 0.908, fluency 0.96, and overall clinical use 0.862). Statistical analyses showed that ChatGPT and human summaries differed from all other models. These exploratory results indicate that ChatGPT's performance in summarizing clinical dialogues approached the quality of human summaries. The study also found that ROUGE metrics may not be reliable for evaluating clinical summary generation, whereas UniEval correlated well with human ratings. Large language models may provide a successful path for automating clinical dialogue summarization. Privacy concerns and the restricted nature of health records remain challenges for their integration. Further evaluations using diverse clinical dialogues and multiple initialization seeds are needed to verify the reliability and generalizability of automatically generated summaries.
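The abstract's caution about ROUGE can be illustrated with a simplified unigram-overlap score. The sketch below is not the authors' evaluation pipeline: `rouge1_f1` is a hypothetical helper that approximates ROUGE-1 F1 with whitespace tokenization and no stemming, and the example sentences are invented for illustration, not taken from the study data. It suggests why an abstractive paraphrase can score poorly on word overlap even when it conveys similar clinical content.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap, lowercase, whitespace tokens.

    Assumption: real ROUGE implementations add stemming and more careful
    tokenization, so their scores will differ from this sketch.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented sentences for illustration only.
reference = "patient reports chest pain for two days"
extractive = "patient reports chest pain for two days and cough"
paraphrase = "the man has had thoracic discomfort since the day before yesterday"

print(rouge1_f1(extractive, reference))  # high: copies the reference wording
print(rouge1_f1(paraphrase, reference))  # zero: no shared surface tokens
```

Under these assumptions, the near-copy scores far higher than the paraphrase, which shares no surface tokens with the reference. This is one plausible reading of why overlap metrics and clinician ratings can diverge.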

Keywords: Artificial intelligence; Electronic health records; Natural language processing; Primary care.

MeSH terms

Electronic Health Records
Humans
Language*
Natural Language Processing