Statistical characterization and classification of colon microarray gene expression data using multiple machine learning paradigms

Comput Methods Programs Biomed. 2019 Jul:176:173-193. doi: 10.1016/j.cmpb.2019.04.008. Epub 2019 Apr 10.

Abstract

Objective: Colon microarray data are a repository of thousands of gene expression values, with different strengths for each cancer cell. It is necessary to detect which genes are responsible for cancer growth. This study presents an exhaustive comparison of different machine learning (ML) systems, which serves two major purposes: (a) identification of high-risk differential genes using statistical tests and (b) development of an ML strategy for predicting cancer genes.

Methods: Four statistical tests, namely the Wilcoxon sign rank sum (WCSRS) test, t-test, Kruskal-Wallis (KW) test, and F-test, were adapted for cancerous gene identification using their p-values. The extracted gene set was used to classify cancer patients using ten classifiers, namely linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), naïve Bayes (NB), Gaussian process classification (GPC), support vector machine (SVM), artificial neural network (ANN), logistic regression (LR), decision tree (DT), AdaBoost (AB), and random forest (RF). Performance was then evaluated using cross-validation protocols and standardized metrics, viz. accuracy (ACC) and area under the curve (AUC).
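The two-stage design described in the Methods (p-value-based gene selection followed by classification under cross-validation) can be illustrated with a minimal sketch. Everything specific in it is an assumption and is not taken from the paper: the data are synthetic (only the 62 x 2000 shape mirrors the dataset), the two-sample Wilcoxon rank-sum test from SciPy stands in for the WCSRS test, and the 0.05 threshold, five stratified folds, and 200 trees are illustrative choices. The helper names `select_genes`, `cross_validate`, and `make_synthetic` are hypothetical.

```python
import numpy as np
from scipy.stats import ranksums
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold


def select_genes(X, y, alpha=0.05):
    """Return indices of genes whose rank-sum p-value is below alpha (assumed cutoff)."""
    pvals = np.array([
        ranksums(X[y == 1, j], X[y == 0, j]).pvalue
        for j in range(X.shape[1])
    ])
    return np.flatnonzero(pvals < alpha)


def cross_validate(X, y, n_splits=5, alpha=0.05, seed=0):
    """Mean ACC and AUC over stratified folds; genes are re-selected on each training fold."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    accs, aucs = [], []
    for train, test in cv.split(X, y):
        genes = select_genes(X[train], y[train], alpha)
        if genes.size == 0:
            # Fall back to all genes if none pass the threshold.
            genes = np.arange(X.shape[1])
        clf = RandomForestClassifier(n_estimators=200, random_state=seed)
        clf.fit(X[train][:, genes], y[train])
        # Column 1 of predict_proba is the probability of class 1 (cancer).
        prob = clf.predict_proba(X[test][:, genes])[:, 1]
        accs.append(accuracy_score(y[test], clf.predict(X[test][:, genes])))
        aucs.append(roc_auc_score(y[test], prob))
    return float(np.mean(accs)), float(np.mean(aucs))


def make_synthetic(n_cancer=40, n_control=22, n_genes=2000,
                   n_informative=20, seed=0):
    """Synthetic stand-in data: the first n_informative genes are shifted in the cancer group."""
    rng = np.random.default_rng(seed)
    y = np.array([1] * n_cancer + [0] * n_control)
    X = rng.normal(size=(y.size, n_genes))
    X[y == 1, :n_informative] += 1.5
    return X, y


X, y = make_synthetic()
acc, auc = cross_validate(X, y)
print(f"genes selected on the full data: {select_genes(X, y).size}")
print(f"mean ACC = {acc:.3f}, mean AUC = {auc:.3f}")
```

Selecting genes inside each training fold keeps the held-out samples out of the selection step. The abstract does not say whether the study selected genes before or within cross-validation, so this is a design choice of the sketch only. Any of the other nine classifiers could be substituted for the random forest without changing the surrounding loop.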

Results: The colon cancer dataset consists of 2000 genes from 62 patients (40 cancer vs. 22 control). The overall mean ACC of our ML system using all four statistical tests and all ten classifiers was 90.50%. The ML system showed an ACC of 99.81% using a combination of the WCSRS test and an RF-based classifier. This is an improvement of 8% over previously published values in the literature.

Conclusions: An RF-based model combined with statistical tests for detection of high-risk genes showed the best performance for accurate cancer classification in multi-center clinical trials.

Keywords: Colon cancer; Gene expression data; Machine learning; Performance; Prediction; Statistical test.

MeSH terms

  • Area Under Curve
  • Bayes Theorem
  • Colon / metabolism*
  • Colonic Neoplasms / metabolism*
  • Decision Trees
  • Discriminant Analysis
  • Gene Expression Profiling
  • Humans
  • Logistic Models
  • Machine Learning*
  • Models, Statistical
  • Neural Networks, Computer
  • Normal Distribution
  • Oncogenes
  • Regression Analysis
  • Risk
  • Sensitivity and Specificity
  • Support Vector Machine
  • Tissue Array Analysis / methods*