Objective: To identify potential medical aid beneficiaries from individuals' demographic data and medical history, and to analyze the important features qualitatively.
Methods: This retrospective, national cohort, case-control study used data from the National Health Insurance Service (NHIS) in Korea collected between January 1, 2002 and December 31, 2019. Potential medical aid beneficiaries were classified using several machine learning models (linear models and tree-based models). Demographic data (age, sex, region, insurance type, and insurance fee) and medical history (diagnosis, operation, statement, visits, and costs) were collected. These data were transformed into a one-dimensional vector for each individual so that they could be used as input to the machine learning models. Feature importance was calculated as the average gain across all splits that used each feature.
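The two steps described in the Methods (flattening each individual's records into a one-dimensional vector, and scoring features by average gain) can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, field names, and all numbers below are hypothetical, and the gain values stand in for the split gains that a tree-based model would report. In XGBoost, this averaging corresponds to the "gain" importance type, assuming that is the setting that was used.

```python
from collections import defaultdict


def flatten_record(demographics, yearly_history, years, fields):
    """Concatenate demographic values and per-year history fields into one flat vector.

    Missing years or fields are filled with 0.0 (an assumption made for this sketch).
    """
    vector = [float(v) for v in demographics]
    for year in years:
        record = yearly_history.get(year, {})
        for field in fields:
            vector.append(float(record.get(field, 0.0)))
    return vector


def average_gain(splits):
    """Average the gain over all splits that use each feature.

    `splits` is an iterable of (feature_name, gain) pairs, one per split in the ensemble.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for feature, gain in splits:
        total[feature] += gain
        count[feature] += 1
    return {feature: total[feature] / count[feature] for feature in total}


# Hypothetical individual: [age, sex] plus two years of visit counts and costs.
vector = flatten_record(
    demographics=[54, 1],
    yearly_history={2018: {"visits": 3, "cost": 120.0}, 2019: {"visits": 5}},
    years=[2018, 2019],
    fields=["visits", "cost"],
)

# Hypothetical split gains, as a tree ensemble might report them.
splits = [
    ("insurance_fee", 12.0),
    ("age", 3.0),
    ("insurance_fee", 8.0),
    ("total_cost", 5.0),
]
importance = average_gain(splits)
ranked = sorted(importance, key=importance.get, reverse=True)
```

Because the gain is averaged rather than summed, a feature used in only a few highly informative splits can rank above one that is used often but contributes little each time.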
Results: A total of 274,635 individuals were included in the final study population, of whom 62,501 were classified as potential medical aid beneficiaries. XGBoost classified potential medical aid beneficiaries with an AUROC of approximately 0.891. When the prediction was made two years in advance, performance remained strong, with an AUROC of approximately 0.832. Economic variables, such as insurance fees and several costs, were the most important, but variables describing medical status, such as blood test results and history of chronic diseases, were also important.
Conclusion: Machine learning-based models successfully screened potential medical aid beneficiaries. The qualitative analysis of important features was consistent with prior knowledge in public health. These findings can contribute to the soundness of healthcare finance and the improvement of public health.
Keywords: Longitudinal study; Machine learning; Medical aid; National cohort; National health insurance.
Copyright © 2024 The Author(s). Published by Elsevier B.V. All rights reserved.