LLpowershap: logistic loss-based automated Shapley values feature selection method

BMC Med Res Methodol. 2024 Oct 24;24(1):247. doi: 10.1186/s12874-024-02370-8.

Abstract

Background: Shapley values are used extensively in machine learning, not only to explain black-box models but also, among other tasks, for model debugging, sensitivity and fairness analyses, and selecting important features for robust modelling and follow-up analyses. Shapley values satisfy axioms that promote fairness in distributing the contributions of features toward a prediction or toward reducing error, while accounting for non-linear relationships and interactions when complex machine learning models are employed. Recently, feature selection methods that use predictive Shapley values and p-values, including powershap, have been introduced.
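
For reference, the standard game-theoretic definition behind these statements can be written as follows. The notation is an assumption made here for illustration (N is the set of features, v(S) is the value obtained when only the features in S are known, for example a model prediction or a loss); the abstract itself does not fix a notation. The second line is the efficiency axiom, which is what makes the attributions add up to the total change in value.

```latex
\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}}
    \frac{|S|!\,(|N| - |S| - 1)!}{|N|!}
    \bigl[ v(S \cup \{i\}) - v(S) \bigr]

\sum_{i \in N} \phi_i(v) = v(N) - v(\emptyset)
```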

Methods: We present a novel feature selection method, LLpowershap, that builds on these recent advances by employing loss-based Shapley values to identify informative features with minimal noise among the selected sets of features. We also enhance the calculation of p-values and power to identify informative features and to estimate the number of iterations of model development and testing.
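
The general idea of loss-based Shapley feature selection can be illustrated with a minimal sketch. It is not the LLpowershap implementation: it assumes a small logistic regression fitted by gradient descent, computes Shapley values of the logistic loss by brute-force enumeration of feature subsets (rather than interventional TreeSHAP on tree models), and uses a simple empirical p-value against one appended random reference feature in place of the paper's p-value and power calculations. All function names, data, and parameters are hypothetical.

```python
import itertools
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clip to keep exp() and log() well behaved
    return 1.0 / (1.0 + math.exp(-z))

def logistic_loss(y, p):
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def predict(w, x):
    # w[0] is the intercept, w[1:] are the feature weights
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, y, steps=100, lr=0.5):
    # Plain batch gradient descent on the mean logistic loss
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def loss_shapley(w, x, y, background):
    # v[S]: mean loss when features in S keep the row's values and the
    # remaining features are replaced by values from background rows
    n = len(x)
    v = {}
    for k in range(n + 1):
        for S in itertools.combinations(range(n), k):
            losses = [
                logistic_loss(y, predict(w, [x[j] if j in S else b[j] for j in range(n)]))
                for b in background
            ]
            v[S] = sum(losses) / len(losses)
    phi = [0.0] * n
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        for k in range(n):
            weight = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
            for S in itertools.combinations(rest, k):
                phi[i] += weight * (v[tuple(sorted(S + (i,)))] - v[S])
    return phi

def make_data(rng, n_rows):
    # Hypothetical data: x0 drives the outcome, x1 is an uninformative candidate
    X, y = [], []
    for _ in range(n_rows):
        x0, x1 = rng.gauss(0, 1), rng.gauss(0, 1)
        X.append([x0, x1])
        y.append(1 if rng.random() < sigmoid(2.0 * x0) else 0)
    return X, y

def run(n_iter=10, seed=0):
    rng = random.Random(seed)
    scores = []  # per iteration: mean loss reduction for [x0, x1, random reference]
    for _ in range(n_iter):
        X, y = make_data(rng, 250)
        X = [row + [rng.gauss(0, 1)] for row in X]  # append a fresh random reference feature
        X_tr, y_tr, X_te, y_te = X[:150], y[:150], X[150:], y[150:]
        w = fit_logistic(X_tr, y_tr)
        background = X_tr[:15]
        total = [0.0] * 3
        for xi, yi in zip(X_te, y_te):
            for j, p in enumerate(loss_shapley(w, xi, yi, background)):
                total[j] -= p  # a negative contribution to the loss is a loss reduction
        scores.append([t / len(X_te) for t in total])
    p_values = []
    for j in range(2):
        # Smoothed share of iterations in which the random reference did at least as well
        worse = sum(1 for s in scores if s[2] >= s[j])
        p_values.append((worse + 1) / (n_iter + 1))
    return scores, p_values

scores, p_values = run()
print("p-values for x0 and x1:", p_values)
```

Because the Shapley values are computed on held-out rows and attribute loss rather than prediction magnitude, a feature that the model merely overfits tends to show little or no loss reduction there, which is the intuition for why a loss-based criterion can admit fewer noise features.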

Results: Our simulation results show that LLpowershap not only identifies a higher number of informative features but also outputs fewer noise features than other state-of-the-art feature selection methods. Benchmarking results on four real-world datasets demonstrate higher or comparable predictive performance of LLpowershap compared with other Shapley-based wrapper methods or filter methods. LLpowershap also achieves the best mean rank among the seven feature selection methods tested on the benchmark datasets.

Conclusion: Our results demonstrate that LLpowershap is a viable wrapper feature selection method that can be used for feature selection in large biomedical datasets and other settings.

Keywords: Benchmark; Feature selection; Interventional TreeSHAP; Logistic loss; Shapley values; Simulation; UK Biobank.

MeSH terms

  • Algorithms*
  • Computer Simulation
  • Humans
  • Logistic Models
  • Machine Learning*