Comparison of machine-learning and logistic regression models for prediction of 30-day unplanned readmission in electronic health records: A development and validation study

Masao Iwagami; Ryota Inokuchi; Eiryo Kawakami; Tomohide Yamada; Atsushi Goto; Toshiki Kuno; Yohei Hashimoto; Nobuaki Michihata; Tadahiro Goto; Tomohiro Shinozaki; Yu Sun; Yuta Taniguchi; Jun Komiyama; Kazuaki Uda; Toshikazu Abe; Nanako Tamiya

doi:10.1371/journal.pdig.0000578

PLOS Digit Health. 2024 Aug 20;3(8):e0000578. doi: 10.1371/journal.pdig.0000578. eCollection 2024 Aug.

Authors

Masao Iwagami¹,²,³,⁴, Ryota Inokuchi¹,²,⁵,⁶, Eiryo Kawakami⁷,⁸, Tomohide Yamada⁹, Atsushi Goto¹⁰, Toshiki Kuno¹¹,¹², Yohei Hashimoto¹³,¹⁴, Nobuaki Michihata¹⁴,¹⁵, Tadahiro Goto¹⁴,¹⁶, Tomohiro Shinozaki¹⁷, Yu Sun¹,², Yuta Taniguchi¹, Jun Komiyama¹,², Kazuaki Uda¹,², Toshikazu Abe¹,¹⁸, Nanako Tamiya¹,²,³,¹⁹

Affiliations

¹ Department of Health Services Research, Institute of Medicine, University of Tsukuba, Tsukuba, Ibaraki, Japan.
² Health Services Research and Development Center, University of Tsukuba, Tsukuba, Ibaraki, Japan.
³ Digital Society Division, Cyber Medicine Research Center, University of Tsukuba, Tsukuba, Ibaraki, Japan.
⁴ Faculty of Epidemiology and Population Health, London School of Hygiene and Tropical Medicine, London, United Kingdom.
⁵ Department of Clinical Engineering, The University of Tokyo Hospital, Tokyo, Japan.
⁶ Department of Emergency and Critical Care Medicine, The University of Tokyo Hospital, Tokyo, Japan.
⁷ Department of Artificial Intelligence Medicine, Graduate School of Medicine, Chiba University, Chiba, Chiba, Japan.
⁸ Advanced Data Science Project (ADSP), RIKEN Information R&D and Strategy Headquarters, RIKEN, Yokohama, Kanagawa, Japan.
⁹ Department of Diabetes and Metabolic Diseases, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan.
¹⁰ Department of Public Health, School of Medicine, Yokohama City University, Yokohama, Kanagawa, Japan.
¹¹ Division of Cardiology, Montefiore Medical Center, Albert Einstein College of Medicine, NY, United States of America.
¹² Cardiology Division, Massachusetts General Hospital, Harvard Medical School, Boston, MA, United States of America.
¹³ Department of Ophthalmology, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan.
¹⁴ Department of Clinical Epidemiology and Health Economics, School of Public Health, The University of Tokyo, Tokyo, Japan.
¹⁵ Cancer Prevention Center, Chiba Cancer Center Research Institute, Chiba, Japan.
¹⁶ TXP Medical Co. Ltd, Tokyo, Japan.
¹⁷ Department of Information and Computer Technology, Faculty of Engineering, Tokyo University of Science, Tokyo, Japan.
¹⁸ Department of Emergency and Critical Care Medicine, Tsukuba Memorial Hospital, Tsukuba, Ibaraki, Japan.
¹⁹ Center for Artificial Intelligence Research, University of Tsukuba, Tsukuba, Ibaraki, Japan.

Abstract

Machine-learning models are expected to outperform regression models such as logistic regression (LR), especially as the number and types of predictor variables available in electronic health records (EHRs) increase, but whether they do so remains unknown. We aimed to compare the predictive performance of gradient-boosted decision tree (GBDT), random forest (RF), deep neural network (DNN), and LR with the least absolute shrinkage and selection operator (LR-LASSO) for unplanned readmission. We used EHRs of patients discharged alive from 38 hospitals in 2015-2017 for derivation and in 2018 for validation, including basic characteristics; diagnosis, surgery, procedure, and drug codes; and blood-test results. The outcome was 30-day unplanned readmission. We created six patterns of data tables by varying the set of binary variables included (those present in ≥5% of patients, ≥1% of patients, or ≥10 patients), each with and without blood-test results. For each pattern of data tables, we used the derivation data to establish the machine-learning and LR models, and used the validation data to evaluate the performance of each model. The incidence of the outcome was 6.8% (23,108/339,513 discharges) and 6.4% (7,507/118,074 discharges) in the derivation and validation datasets, respectively. For the first data table with the smallest number of variables (102 variables present in ≥5% of patients, without blood-test results), the c-statistic was highest for GBDT (0.740), followed by RF (0.734), LR-LASSO (0.720), and DNN (0.664). For the last data table with the largest number of variables (1,543 variables present in ≥10 patients, including blood-test results), the c-statistic was highest for GBDT (0.764), followed by LR-LASSO (0.755), RF (0.751), and DNN (0.720); the difference between GBDT and LR-LASSO was small, and their 95% confidence intervals overlapped.
In conclusion, GBDT generally outperformed LR-LASSO in predicting unplanned readmission, but the difference in c-statistic became smaller as the number of variables increased and blood-test results were used.
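
The two mechanical steps the abstract describes, keeping only binary variables that enough patients have and scoring a model by its c-statistic, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `select_binary_variables` and `c_statistic`, the column names, and the toy data are all hypothetical, and the c-statistic is computed here by direct pairwise comparison (equivalent to the area under the ROC curve for a binary outcome), which is only practical for small samples.

```python
from typing import Dict, List, Sequence


def select_binary_variables(
    table: Dict[str, Sequence[int]],
    min_fraction: float = 0.0,
    min_count: int = 0,
) -> List[str]:
    """Return names of 0/1 columns present in enough patients.

    Hypothetical helper: a column is kept when the number of patients
    with the variable meets both the fraction and the count threshold.
    """
    kept = []
    for name, values in table.items():
        n_with = sum(values)  # patients who have this variable
        if n_with >= min_count and n_with >= min_fraction * len(values):
            kept.append(name)
    return kept


def c_statistic(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Fraction of (event, non-event) pairs ranked correctly; ties count 1/2."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


# Toy data (assumed for illustration): 10 patients, 3 binary codes.
toy_table = {
    "code_a": [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],  # 3 of 10 patients
    "code_b": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],  # 1 of 10 patients
    "code_c": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # no patients
}

if __name__ == "__main__":
    print(select_binary_variables(toy_table, min_fraction=0.20))
    print(select_binary_variables(toy_table, min_count=1))
    print(c_statistic([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

In this sketch, raising `min_fraction` shrinks the variable set and a count-only threshold (`min_count`) widens it, mirroring the direction of the ≥5%, ≥1%, and ≥10-patient patterns described above. A c-statistic of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking.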

Copyright: © 2024 Iwagami et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Grants and funding

This study was supported by the Cyber Medicine Research Center, University of Tsukuba, Tsukuba, Ibaraki, Japan, and a Japan Society for the Promotion of Science (JSPS) KAKENHI Grant (No. 19K19430) from the Japanese Ministry of Education, Culture, Sports, Science, and Technology. The funders had no role in study design, data collection, data analysis, data interpretation, or writing.