Benchmarking a Foundation Large Language Model on its Ability to Relabel Structure Names in Accordance With the American Association of Physicists in Medicine Task Group-263 Report

Jason Holmes; Lian Zhang; Yuzhen Ding; Hongying Feng; Zhengliang Liu; Tianming Liu; William W Wong; Sujay A Vora; Jonathan B Ashman; Wei Liu

doi:10.1016/j.prro.2024.04.017

Pract Radiat Oncol. 2024 Nov-Dec;14(6):e515-e521. doi: 10.1016/j.prro.2024.04.017. Epub 2024 Sep 5.

Affiliations

¹ Department of Radiation Oncology, Mayo Clinic, Phoenix, Arizona. Electronic address: holmes.jason@mayo.edu.
² Department of Radiation Oncology, Mayo Clinic, Phoenix, Arizona.
³ School of Computing, University of Georgia, Athens, Georgia.

PMID: 39243241

Abstract

Purpose: To introduce the concept of using large language models (LLMs) to relabel structure names in accordance with the American Association of Physicists in Medicine Task Group-263 standard and to establish a benchmark for future studies to reference.

Methods and materials: Generative Pretrained Transformer (GPT)-4 was implemented within a Digital Imaging and Communications in Medicine (DICOM) server. Upon receiving a structure-set DICOM file, the server prompted GPT-4 to relabel the structure names according to the American Association of Physicists in Medicine Task Group-263 report. The results were evaluated for 3 disease sites: prostate, head and neck, and thorax. For each disease site, 150 patients were randomly selected for manually tuning the instruction prompt (in batches of 50), and 50 patients were randomly selected for evaluation. Only structure names most likely to be relevant to studies that use structure contours from many patients were considered.
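
The relabeling step described above can be sketched roughly as follows. This is a minimal illustration, not the study's implementation: the prompt wording, the JSON reply format, and the names `build_prompt`, `parse_reply`, `relabel`, and `fake_llm` are all assumptions, and the language model is represented by a plain callable so that no particular client library is implied. Reading the structure names out of the DICOM file is left out.

```python
import json
from typing import Callable, Dict, List


def build_prompt(names: List[str], disease_site: str) -> str:
    """Build a relabeling request (hypothetical wording, not the study's prompt)."""
    return (
        "Relabel each radiotherapy structure name below according to the "
        "AAPM TG-263 nomenclature. "
        f"Disease site: {disease_site}. "
        "Reply with a JSON object mapping each original name to its new name.\n"
        + json.dumps(names)
    )


def parse_reply(reply: str, names: List[str]) -> Dict[str, str]:
    """Parse the model reply; keep the original name when no usable label comes back."""
    try:
        mapping = json.loads(reply)
    except json.JSONDecodeError:
        mapping = {}
    if not isinstance(mapping, dict):
        mapping = {}
    result: Dict[str, str] = {}
    for name in names:
        new_name = mapping.get(name)
        result[name] = new_name if isinstance(new_name, str) and new_name else name
    return result


def relabel(
    names: List[str], disease_site: str, ask_llm: Callable[[str], str]
) -> Dict[str, str]:
    """Send one prompt per structure set and return an original -> new name map."""
    return parse_reply(ask_llm(build_prompt(names, disease_site)), names)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a fixed, illustrative reply."""
    return json.dumps({"bladder": "Bladder", "Lt Fem Head": "Femur_Head_L"})


if __name__ == "__main__":
    print(relabel(["bladder", "Lt Fem Head", "zz_opt"], "prostate", fake_llm))
```

Falling back to the original name when the reply omits a structure is one possible design choice; it keeps unrecognized names visible for manual review rather than silently dropping them.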

Results: The per-patient accuracy was 97.2%, 98.3%, and 97.1% for prostate, head and neck, and thorax disease sites, respectively. On a per-structure basis, the clinical target volume was relabeled correctly in 100%, 95.3%, and 92.9% of cases, respectively.
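
The abstract does not define how the two accuracy figures were computed. As a hedged sketch, one plausible reading is that per-patient accuracy is the fraction of a patient's structures relabeled correctly, averaged over patients, and that per-structure accuracy is the fraction of cases in which a given reference structure received the correct label. The function names and the toy data below are assumptions for illustration only.

```python
from typing import List, Tuple

# Each patient is a list of (predicted_label, reference_label) pairs.
Patient = List[Tuple[str, str]]


def per_patient_accuracy(patients: List[Patient]) -> float:
    """Mean over patients of the fraction of structures labeled correctly."""
    scores = [
        sum(pred == ref for pred, ref in patient) / len(patient)
        for patient in patients
        if patient  # skip patients with no evaluated structures
    ]
    if not scores:
        raise ValueError("no structures to evaluate")
    return sum(scores) / len(scores)


def per_structure_accuracy(patients: List[Patient], reference_label: str) -> float:
    """Fraction of occurrences of one reference structure that were labeled correctly."""
    pairs = [
        (pred, ref)
        for patient in patients
        for pred, ref in patient
        if ref == reference_label
    ]
    if not pairs:
        raise ValueError(f"no occurrences of {reference_label!r}")
    return sum(pred == ref for pred, ref in pairs) / len(pairs)


# Toy data: two patients, one CTV mislabeled as PTV.
toy_patients: List[Patient] = [
    [("CTV", "CTV"), ("Bladder", "Bladder")],
    [("PTV", "CTV"), ("Rectum", "Rectum")],
]

if __name__ == "__main__":
    print(per_patient_accuracy(toy_patients))           # 0.75
    print(per_structure_accuracy(toy_patients, "CTV"))  # 0.5
```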

Conclusions: Given the accuracy of GPT-4 in relabeling structure names as presented in this work, LLMs are poised to become an important method for standardizing structure names in radiation oncology, especially considering the rapid advancements in LLM capabilities that are likely to continue.

MeSH terms

Benchmarking* / methods
Head and Neck Neoplasms / radiotherapy
Humans
Male
Prostatic Neoplasms / radiotherapy
Radiotherapy Planning, Computer-Assisted / methods
United States