Applying machine learning to identify pediatric patients with newly diagnosed acute lymphoblastic leukemia using administrative data

Lusha Cao; Yuan-Shung Huang; Kelly D Getz; Alix E Seif; Jenny Ruiz; Tamara P Miller; Brian T Fisher; Richard Aplenc; Yimei Li

doi:10.1002/pbc.30858

Pediatr Blood Cancer. 2024 Mar;71(3):e30858. doi: 10.1002/pbc.30858. Epub 2024 Jan 8.

Authors

Lusha Cao¹, Yuan-Shung Huang¹, Kelly D Getz² ³, Alix E Seif³ ⁴, Jenny Ruiz⁵ ⁶, Tamara P Miller⁷ ⁸, Brian T Fisher² ⁴ ⁹, Richard Aplenc² ³ ⁴, Yimei Li² ³ ⁴

Affiliations

¹ Department of Biomedical and Health Informatics, The Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, USA.
² Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA.
³ Division of Oncology, The Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, USA.
⁴ Department of Pediatrics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA.
⁵ Department of Pediatrics, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, USA.
⁶ Division of Hematology-Oncology, Children's Hospital of Pittsburgh, Pittsburgh, Pennsylvania, USA.
⁷ Department of Pediatrics, Emory University School of Medicine, Atlanta, Georgia, USA.
⁸ Aflac Cancer & Blood Disorders Center, Children's Healthcare of Atlanta, Atlanta, Georgia, USA.
⁹ Division of Infectious Diseases, The Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, USA.

PMID: 38189744
DOI: 10.1002/pbc.30858

Abstract

Case identification in administrative databases is challenging because diagnosis codes alone are not adequate for case ascertainment. We used machine learning (ML) to efficiently identify pediatric patients with newly diagnosed acute lymphoblastic leukemia. We tested nine ML models and validated the best model internally and externally. The optimal model had 97% positive predictive value (PPV) and 99% sensitivity in internal validation, and 94% PPV and 82% sensitivity in external validation. Our ML model identified a large cohort of 21,044 patients, demonstrating an efficient approach for cohort assembly and enhancing the usability of administrative data.

Keywords: acute lymphoblastic leukemia; administrative data; case identification; machine learning.
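
The two validation metrics reported in the abstract, PPV and sensitivity, follow from standard definitions: PPV is the share of model-flagged patients who are true cases, and sensitivity is the share of true cases that the model flags. The sketch below is illustrative only and is not the authors' code; the function name and the counts in the usage example are hypothetical and were chosen for easy arithmetic, not taken from the study.

```python
def ppv_and_sensitivity(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Return (PPV, sensitivity) from confusion-matrix counts.

    PPV         = TP / (TP + FP): flagged patients who are true cases.
    Sensitivity = TP / (TP + FN): true cases that were flagged.
    """
    flagged = true_pos + false_pos      # everyone the model labeled as a case
    actual = true_pos + false_neg       # everyone who truly is a case
    if flagged == 0 or actual == 0:
        raise ValueError("PPV or sensitivity is undefined when a denominator is zero")
    return true_pos / flagged, true_pos / actual


# Hypothetical counts for illustration only (not data from the study).
ppv, sensitivity = ppv_and_sensitivity(true_pos=90, false_pos=10, false_neg=30)
print(f"PPV = {ppv:.0%}, sensitivity = {sensitivity:.0%}")
```

A high PPV with lower sensitivity, as in the external validation figures above, means that most flagged patients are true cases but some true cases are missed.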

MeSH terms

Algorithms*
Child
Databases, Factual
Humans
Machine Learning
Precursor Cell Lymphoblastic Leukemia-Lymphoma* / diagnosis
Predictive Value of Tests