[Random forest analysis of high dimensional case and control study of lung cancer]

Zhonghua Yu Fang Yi Xue Za Zhi. 2012 Sep;46(9):845-9.
[Article in Chinese]

Abstract

Objective: To investigate the performance of the random forest method as a SNP screening procedure for high-dimensional case-control data on lung cancer.

Methods: The study included 500 lung cancer patients and 517 controls. A 5 ml venous blood sample was collected from each participant. Genotyping was performed on the GoldenGate platform, and 399 SNPs were selected. The random forest method was first applied to reduce the dimensionality of the data; traditional logistic regression was then used to analyze the retained variables. The association between multiple SNPs and genetic susceptibility to lung cancer was assessed by the area under the receiver operating characteristic (ROC) curve (AUC).
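The two-step procedure described above (random forest screening, then logistic regression on the retained variables) can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes scikit-learn, uses purely synthetic genotypes, ranks SNPs by scikit-learn's impurity-based importance (the abstract does not say which importance measure was used), and omits the environmental covariates. The function name `screen_then_fit` and its parameters are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def screen_then_fit(X, y, n_keep=50, n_trees=500, seed=0):
    """Rank columns of X by random forest importance, keep the top n_keep,
    then fit a logistic regression on the kept columns only."""
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(X, y)
    # Column indices sorted by decreasing importance, truncated to n_keep
    keep = np.argsort(forest.feature_importances_)[::-1][:n_keep]
    model = LogisticRegression(max_iter=1000).fit(X[:, keep], y)
    # In-sample AUC; an honest estimate would need held-out data
    auc = roc_auc_score(y, model.predict_proba(X[:, keep])[:, 1])
    return keep, model, auc


# Synthetic stand-in data (assumption): 500 cases + 517 controls,
# 399 SNPs coded as 0/1/2 minor-allele counts, no real signal.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(1017, 399)).astype(float)
y = np.r_[np.ones(500, dtype=int), np.zeros(517, dtype=int)]

keep, model, auc = screen_then_fit(X, y, n_trees=100)
print(len(keep), round(auc, 3))
```

Because the screening step and the regression both see the same subjects, the AUC printed here is optimistic even on random data; the sketch only shows the order of operations.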

Results: The random forest method selected the 50 variables with the highest average importance scores and the lowest error rates. The environmental variables (smoking, age and gender) all ranked in the top 20, with importance scores of 4.05, 3.12 and 1.16, respectively. After adjustment for the 3 environmental variables and correction for the false discovery rate (FDR), 6 SNPs remained significantly associated with lung cancer (FDR-P < 0.05). In contrast, when traditional logistic regression was applied directly, no significant SNPs were found. The AUCs of the 2 ROC curves (one including only the environmental variables, the other including the environmental variables and the SNPs) were 0.6491 ± 0.0172 and 0.6811 ± 0.0166, respectively; the likelihood test showed a statistically significant association between the 6 SNPs and lung cancer (χ² = 43.82, P = 3.6×10⁻¹¹).
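Two of the quantities reported above can be illustrated with the Python standard library alone. The abstract does not name the FDR procedure, so the Benjamini-Hochberg adjustment below is an assumption, and the p-values passed to it are made up for illustration. The abstract also does not state the degrees of freedom of the χ² statistic; assuming 1 degree of freedom, the upper-tail probability of 43.82 comes out close to the reported P = 3.6×10⁻¹¹. The helper names are hypothetical.

```python
import math


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Illustrative p-values only, not taken from the study
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])

# Tail probability of the reported statistic, assuming 1 degree of freedom
p = chi2_sf_1df(43.82)
print(adj, p)
```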

Conclusion: Random forest analysis can first remove noise SNPs, after which the remaining variables are analyzed by logistic regression. This two-step approach improves the power of the test and performs significantly better than logistic regression alone.

Publication types

  • English Abstract
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Case-Control Studies
  • Data Interpretation, Statistical
  • Genetic Predisposition to Disease
  • Humans
  • Logistic Models
  • Lung Neoplasms / genetics*
  • Polymorphism, Single Nucleotide*
  • Risk Factors