In this study, we present MedS-Bench, a comprehensive benchmark for evaluating large language models (LLMs) in clinical contexts, spanning 11 high-level clinical tasks. We evaluated nine leading LLMs, including MEDITRON, Llama 3, Mistral, GPT-4, and Claude-3.5, and found that most models struggle with these complex tasks. To address these limitations, we developed MedS-Ins, a large-scale instruction-tuning dataset for medicine. MedS-Ins comprises 58 medically oriented language corpora, totaling 5M instances with 19K instructions across 122 tasks. To demonstrate the dataset's utility, we conducted a proof-of-concept experiment in which we performed instruction tuning on a lightweight, open-source medical language model. The resulting model, MMedIns-Llama 3, significantly outperformed existing models on various clinical tasks. To promote further advancements, we have made MedS-Ins fully accessible and invite the research community to contribute to its expansion. Additionally, we have launched a dynamic leaderboard for MedS-Bench to track the development progress of medical LLMs.