Objective: Large language models (LLMs) have shown impressive ability in biomedical question-answering, but they have not been adequately investigated for more specific biomedical applications. This study investigates the ChatGPT family of models (GPT-3.5, GPT-4) on biomedical tasks beyond question-answering.
Materials and methods: We evaluated model performance on 11 122 samples for two fundamental tasks in the biomedical domain: classification (n = 8676) and reasoning (n = 2446). The first task involves classifying health advice in scientific literature, while the second involves detecting causal relations in biomedical literature. We used 20% of the dataset for prompt development, covering zero-shot and few-shot settings with and without chain-of-thought (CoT) prompting. We then evaluated the best prompts from each setting on the remaining dataset, comparing them against models using simple features (bag-of-words, BoW, with logistic regression) and against fine-tuned BioBERT models.
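For context on the simple-feature baseline referenced above, the sketch below illustrates the kind of BoW-with-logistic-regression pipeline the study compares against. It is a minimal illustration assuming scikit-learn; the example sentences, labels, and split settings are hypothetical placeholders, not the study's actual health-advice dataset.

```python
# Minimal sketch of a bag-of-words (BoW) + logistic regression baseline.
# Assumes scikit-learn; the texts and labels below are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical sentences (1 = contains health advice, 0 = does not).
texts = [
    "Clinicians should recommend regular exercise to patients with hypertension.",
    "The study enrolled 250 participants across three sites.",
    "Patients are advised to reduce their sodium intake.",
    "Baseline characteristics were similar between the two groups.",
    "Pregnant women should avoid taking this medication.",
    "Outcomes were assessed at 12 months of follow-up.",
]
labels = [1, 0, 1, 0, 1, 0]

# Hold out a small test set, keeping both classes represented.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=1 / 3, random_state=42, stratify=labels
)

# BoW features fed into a logistic regression classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

print("F1:", f1_score(y_test, model.predict(X_test)))
```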
Results: Fine-tuning BioBERT produced the best classification (F1: 0.800-0.902) and reasoning (F1: 0.851) results. Among LLM approaches, few-shot CoT achieved the best classification (F1: 0.671-0.770) and reasoning (F1: 0.682) results, comparable to the BoW model (F1: 0.602-0.753 and 0.675 for classification and reasoning, respectively). It took 78 h to obtain the best LLM results, compared to 0.078 and 0.008 h for the top-performing BioBERT and BoW models, respectively.
Discussion: The simple BoW model performed comparably to the most complex LLM prompting strategy, while prompt engineering required a substantial investment of time and effort.
Conclusion: Despite the widespread excitement surrounding ChatGPT, fine-tuning remained the best strategy for these two fundamental biomedical natural language processing tasks.
Keywords: ChatGPT; biomedical research; classification; natural language processing; reasoning.