A Bagged, Partially Linear, Tree-Based Regression Procedure for Prediction and Variable Selection

Hum Hered. 2015;79(3-4):182-93. doi: 10.1159/000380850. Epub 2015 Jul 28.

Abstract

Objectives: In genomics, variable selection and prediction accounting for the complex interrelationships between explanatory variables represent major challenges. Tree-based methods are powerful alternatives to classical regression models. We have recently proposed the generalized, partially linear, tree-based regression (GPLTR) procedure that integrates the advantages of generalized linear regression (allowing the incorporation of confounding variables) and of tree-based models. In this work, we use bagging to address a classical concern of tree-based methods: their instability.

Methods: We present a bagged GPLTR procedure and three scores for variable importance. The prediction accuracy and the performance of the scores are assessed by simulation. The use of this procedure is exemplified by the analysis of a lung cancer data set, where the aim is to predict epidermal growth factor receptor (EGFR) mutation status from gene expression measurements while accounting for ethnicity (a confounding variable), and to perform variable selection.

Results: The procedure performs well in terms of prediction accuracy. The scores differentiate predictive variables from noise variables. Based on a lung adenocarcinoma data set, the procedure achieves good predictive performance for EGFR mutation and selects relevant genes.

Conclusion: The proposed bagged GPLTR procedure performs well for prediction and variable selection.
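
The general idea of bagging a partially linear, tree-based model can be illustrated with a minimal sketch. It does not reproduce the authors' GPLTR procedure or any of their three importance scores, which the abstract does not specify. It assumes a continuous outcome, a single confounder x fitted by ordinary least squares, a depth-one tree (stump) fitted to the residuals on the candidate predictors Z, and a split-frequency importance score. All function names and the simulated data are hypothetical.

```python
import random
from collections import Counter

def fit_linear(x, y):
    """Least-squares intercept and slope for a single confounder x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx if sxx > 0 else 0.0
    return my - slope * mx, slope

def fit_stump(Z, r):
    """Best single split of residuals r; returns (variable, threshold, left mean, right mean)."""
    n = len(r)
    mean_r = sum(r) / n
    # Fallback when no split is possible: variable -1, everything goes left.
    best = (float("inf"), -1, float("inf"), mean_r, mean_r)
    for j in range(len(Z[0])):
        values = sorted(set(row[j] for row in Z))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r[i] for i in range(n) if Z[i][j] <= t]
            right = [r[i] for i in range(n) if Z[i][j] > t]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if sse < best[0]:
                best = (sse, j, t, ml, mr)
    return best[1:]

def bagged_fit(x, Z, y, n_bags=20, seed=0):
    """Fit one linear-plus-stump model per bootstrap sample."""
    rng = random.Random(seed)
    n = len(y)
    models = []
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        xb = [x[i] for i in idx]
        Zb = [Z[i] for i in idx]
        yb = [y[i] for i in idx]
        a, b = fit_linear(xb, yb)
        resid = [yb[i] - (a + b * xb[i]) for i in range(n)]
        models.append((a, b) + fit_stump(Zb, resid))
    return models

def predict(models, x, z):
    """Average the predictions of all bootstrap models."""
    total = 0.0
    for a, b, j, t, ml, mr in models:
        total += a + b * x + (ml if z[j] <= t else mr)
    return total / len(models)

def split_frequency(models, n_vars):
    """Assumed importance score: share of bootstrap stumps that split on each variable."""
    counts = Counter(m[2] for m in models)
    return [counts[j] / len(models) for j in range(n_vars)]

# Simulated data (hypothetical): only Z[0] is predictive, Z[1..3] are noise.
data_rng = random.Random(1)
n_obs = 120
x = [data_rng.gauss(0, 1) for _ in range(n_obs)]
Z = [[data_rng.random() for _ in range(4)] for _ in range(n_obs)]
y = [x[i] + 2.0 * (Z[i][0] > 0.5) + data_rng.gauss(0, 0.3) for i in range(n_obs)]

models = bagged_fit(x, Z, y)
freq = split_frequency(models, 4)
print("split frequency per variable:", freq)
```

Averaging over bootstrap samples is what addresses the instability of a single tree, and the frequency with which a variable is chosen across resampled trees gives one simple way to separate predictive variables from noise variables. A binary outcome such as EGFR mutation status would require a generalized linear model (for example logistic regression) in place of the least-squares step used here.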

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Adenocarcinoma / genetics
  • Adenocarcinoma of Lung
  • Computer Simulation
  • Databases as Topic
  • Genomics / methods*
  • Humans
  • Linear Models
  • Lung Neoplasms / genetics