Stable feature selection for clinical prediction: exploiting ICD tree structure using Tree-Lasso

J Biomed Inform. 2015 Feb:53:277-90. doi: 10.1016/j.jbi.2014.11.013. Epub 2014 Dec 9.

Abstract

Modern healthcare is being reshaped by the growth of Electronic Medical Records (EMR). These records have recently been shown to be of great value for building clinical prediction models. In EMR data, patients' diseases and hospital interventions are captured through a set of diagnosis and procedure codes. These codes are usually organized in a tree (e.g. the ICD-10 tree), and the codes within a tree branch may be highly correlated. The codes can be used as features to build a prediction model, and appropriate feature selection can inform a clinician about important risk factors for a disease. Traditional feature selection methods (e.g. Information Gain, T-test) consider each variable independently and usually end up with a long feature list. Recently, Lasso and related l1-penalty based feature selection methods have become popular because they select features jointly. However, Lasso is known to select one of many correlated features at random. This prevents clinicians from arriving at a stable feature set, which is crucial for the clinical decision-making process. In this paper, we address this problem using the recently proposed Tree-Lasso model. Since the stability behavior of Tree-Lasso is not well understood, we study it and compare it with that of other feature selection methods. Using one synthetic and two real-world datasets (Cancer and Acute Myocardial Infarction), we show that Tree-Lasso based feature selection is significantly more stable than Lasso and comparable to other methods, e.g. Information Gain, ReliefF and T-test. We further show that, across different types of classifiers (logistic regression, naive Bayes, support vector machines, decision trees and Random Forest), the classification performance of Tree-Lasso is comparable to that of Lasso and better than that of the other methods. Our results have implications for identifying stable risk factors in many healthcare problems and can therefore potentially assist clinical decision making for accurate medical prognosis.
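
As a rough illustration of the two ideas the abstract relies on, the sketch below computes (a) a feature-stability score as the mean pairwise Jaccard similarity between the feature sets selected on different resamples, and (b) a tree-structured group penalty as a sum of weighted l2 norms over the nodes of a code tree. The abstract does not say which stability measure the paper uses, so Jaccard similarity is an assumption here; the unit node weights, the function names and the toy ICD-10-style selections are likewise hypothetical and are not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def jaccard(a, b):
    """Jaccard similarity of two feature sets (1.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(selections):
    """Average Jaccard similarity over all pairs of selected-feature sets.

    Assumed stability measure: 1.0 means every resample selected the
    same features; values near 0 mean the selections barely overlap.
    """
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two selections")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def tree_group_penalty(coefs, groups, weights=None):
    """Sum over tree nodes of weight * l2 norm of that node's coefficients.

    `groups` lists, for each tree node, the feature names under it
    (leaves are single-feature groups). Unit weights are an assumption.
    """
    if weights is None:
        weights = [1.0] * len(groups)
    return sum(
        w * sqrt(sum(coefs[name] ** 2 for name in group))
        for w, group in zip(weights, groups)
    )


# Hypothetical selections from three resamples. I21.0 and I21.1 are
# sibling codes, so a plain l1 penalty may pick either one arbitrarily.
lasso_runs = [{"I21.0", "C50.9"}, {"I21.1", "C50.9"}, {"I21.0", "C50.9"}]
tree_runs = [{"I21.0", "I21.1", "C50.9"}] * 3

lasso_stability = mean_pairwise_jaccard(lasso_runs)  # 5/9, about 0.56
tree_stability = mean_pairwise_jaccard(tree_runs)    # 1.0

# Toy tree: three leaves, one internal node for the I21 branch, one root.
coefs = {"I21.0": 3.0, "I21.1": 4.0, "C50.9": 0.0}
groups = [
    ("I21.0",), ("I21.1",), ("C50.9",),
    ("I21.0", "I21.1"),
    ("I21.0", "I21.1", "C50.9"),
]
penalty = tree_group_penalty(coefs, groups)  # 3 + 4 + 0 + 5 + 5 = 17.0
```

In this toy case the selections that alternate between the two sibling codes score about 0.56, while selections that keep the whole branch together score 1.0, which is the kind of contrast the abstract describes between Lasso and Tree-Lasso.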

Keywords: Classification; Feature selection; Feature stability; Lasso; Tree-Lasso.

MeSH terms

  • Algorithms
  • Bayes Theorem
  • Decision Making
  • Decision Trees
  • Electronic Health Records*
  • Humans
  • International Classification of Diseases*
  • Logistic Models
  • Medical Informatics*
  • Probability
  • Prognosis
  • Regression Analysis
  • Reproducibility of Results
  • Risk Factors
  • Support Vector Machine*