The application of machine learning to predict high-cost patients: A performance-comparison of different models using healthcare claims data

Benedikt Langenberger; Timo Schulte; Oliver Groene

doi:10.1371/journal.pone.0279540

PLoS One. 2023 Jan 18;18(1):e0279540. doi: 10.1371/journal.pone.0279540. eCollection 2023.

Authors

Benedikt Langenberger¹, Timo Schulte²,³, Oliver Groene²,³

Affiliations

¹ Department of Health Care Management, Technische Universität Berlin, Berlin, Germany.
² OptiMedis, Hamburg, Germany.
³ Department of Management & Innovation in Healthcare, Faculty of Health, University of Witten/Herdecke, Witten, Germany.

Abstract

Our aim was to predict future high-cost patients with machine learning using healthcare claims data. We applied a random forest (RF), a gradient boosting machine (GBM), an artificial neural network (ANN) and a logistic regression (LR) to predict which patients would be high-cost in the following year. To this end, we used routinely collected sickness fund claims and cost data from the years 2016, 2017 and 2018. Various specifications of each algorithm were trained and cross-validated on training data (n = 20,984) with claims and cost data from 2016 and outcomes from 2017. The best-performing specification of each algorithm was selected based on validation dataset performance. For performance comparison, the selected models were applied to unseen data with features from 2017 and outcomes from 2018 (n = 21,146). The RF was the best-performing algorithm as measured by the area under the receiver operating characteristic curve (AUC), with a value of 0.883 (95% confidence interval (CI): 0.872-0.893) on test data, followed by the GBM (AUC = 0.878; 95% CI: 0.867-0.889). The ANN (AUC = 0.846; 95% CI: 0.834-0.857) and the LR (AUC = 0.839; 95% CI: 0.826-0.852) were significantly outperformed by the GBM and the RF. All ML algorithms and the LR achieved 'good' performance (i.e. 0.9 > AUC ≥ 0.8). We were able to develop machine learning models that predict high-cost patients with 'good' performance using routinely collected sickness fund claims and cost data. Tree-based models performed best and outperformed the ANN and LR.
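
The evaluation metric named in the abstract can be illustrated with a minimal sketch. It computes the AUC as the probability that a randomly chosen high-cost patient receives a higher predicted score than a randomly chosen non-high-cost patient, and checks a value against the 'good' band (0.9 > AUC ≥ 0.8) stated above. The percentile bootstrap used here for the confidence interval is an assumption; the abstract does not say how the reported intervals were derived. All function names and the toy data are hypothetical and are not taken from the study.

```python
import random


def auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (assumed method, see lead-in)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def is_good(auc_value):
    """'Good' band as defined in the abstract: 0.9 > AUC >= 0.8."""
    return 0.8 <= auc_value < 0.9


# Toy data (hypothetical): 1 = high-cost in the following year, 0 = not.
toy_labels = [0, 0, 1, 0, 1, 1]
toy_scores = [0.10, 0.30, 0.35, 0.40, 0.80, 0.90]

toy_auc = auc(toy_labels, toy_scores)
toy_ci = bootstrap_ci(toy_labels, toy_scores)
print(toy_auc, toy_ci, is_good(toy_auc))
```

The pairwise loop is quadratic in the number of patients, which is fine for illustration; for samples of the size reported here (about 21,000), a rank-based implementation from an established statistics library would be the more practical choice.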

Copyright: © 2023 Langenberger et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Delivery of Health Care
Humans
Machine Learning
Neural Networks, Computer*
Random Forest

Grants and funding

OptiMedis AG sponsored the data and publication fees. OG is employed by OptiMedis AG. The involvement of OptiMedis AG did not influence our analysis or the interpretation of our results. The funder had no role in study design, data analysis, decision to publish, or preparation of the manuscript.