SecProMTB: Support Vector Machine-Based Classifier for Secretory Proteins Using Imbalanced Data Sets Applied to Mycobacterium tuberculosis

Proteomics. 2019 Sep;19(17):e1900007. doi: 10.1002/pmic.201900007. Epub 2019 Aug 8.

Abstract

Secretory proteins of Mycobacterium tuberculosis have drawn growing attention because of their dominant immunogenicity and their role in pathogenesis. Because traditional biochemical experiments are expensive and time-consuming, this study constructs a support vector machine model, named SecProMTB, that identifies these proteins by a bioinformatic approach. First, an improved pseudo-amino acid composition (PseAAC) algorithm is used to extract features from every protein in the data set. Second, a novel imbalanced-data strategy is proposed and used to divide the original data set into a training set and a test set. Third, to counter overfitting, feature-ranking algorithms are combined with incremental feature selection. Finally, the model is trained and optimized. The resulting model achieves an area under the curve of 0.862 and an average accuracy of 86% in the independent test. For the convenience of users, SecProMTB and related data are openly accessible at http://server.malab.cn/SecProMTB/index.jsp.
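
The incremental feature selection step named in the abstract can be illustrated with a minimal Python sketch. The abstract does not state which ranking algorithms or scoring criterion were used, so the function names, the feature labels, and the toy scorer below are hypothetical and are not the authors' implementation. In practice the scorer would typically be a cross-validated performance measure of the SVM on the training set.

```python
from typing import Callable, List, Sequence, Tuple


def incremental_feature_selection(
    ranked_features: Sequence[str],
    score_subset: Callable[[List[str]], float],
) -> Tuple[List[str], float, List[Tuple[int, float]]]:
    """Evaluate the top-k ranked features for k = 1..n and keep the best k.

    ranked_features: feature names ordered from most to least important
                     (the ordering comes from some feature-ranking method).
    score_subset:    callback returning a score for a feature subset;
                     higher is better (assumed, e.g. cross-validated accuracy).
    """
    best_subset: List[str] = []
    best_score = float("-inf")
    history: List[Tuple[int, float]] = []

    for k in range(1, len(ranked_features) + 1):
        subset = list(ranked_features[:k])
        score = score_subset(subset)
        history.append((k, score))
        # Strict ">" keeps the smallest subset when scores tie.
        if score > best_score:
            best_score = score
            best_subset = subset

    return best_subset, best_score, history


def toy_scorer(subset: List[str]) -> float:
    """Hypothetical stand-in for model evaluation.

    Rewards two 'informative' features and penalizes subset size,
    mimicking how extra noisy features can hurt generalization.
    """
    informative = {"f3", "f1"}
    return sum(1.0 for f in subset if f in informative) - 0.1 * len(subset)


if __name__ == "__main__":
    ranked = ["f3", "f1", "f7", "f2"]  # hypothetical ranking output
    subset, score, history = incremental_feature_selection(ranked, toy_scorer)
    print("selected features:", subset)
    print("score per subset size:", history)
```

With the toy scorer, the search stops improving once the uninformative features are appended, so the two leading features are retained. Passing the scorer as a callback keeps the selection loop independent of the classifier and of the metric used to judge it.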

Keywords: imbalanced-data strategy; improved PseAAC; secretory proteins of Mycobacterium tuberculosis; support vector machine.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Bacterial Proteins / classification*
  • Bacterial Proteins / metabolism*
  • Computational Biology / methods*
  • Databases, Protein
  • Mycobacterium tuberculosis / metabolism*
  • Support Vector Machine*

Substances

  • Bacterial Proteins