A novel model to predict O-glycosylation sites using a highly unbalanced dataset

Glycoconj J. 2012 Oct;29(7):551-64. doi: 10.1007/s10719-012-9434-x. Epub 2012 Aug 3.

Abstract

In silico approaches have become an alternative method to study O-glycosylation. In this paper, we developed a linear, interpretable model for O-glycosylation prediction based on an unbalanced dataset and analyzed the underlying biological knowledge of glycosylation. A training set of 4446 sites, comprising 468 positive sites and 3978 negative sites, was developed during this research. The sites were encoded using the amino acid index (AAindex), and a forward stepwise procedure was used for feature selection. Linear discriminant analysis with equal a priori probabilities (PP-LDA) was employed to develop the interpretable model. Performance of the model was verified using both internal leave-one-out cross-validation and external validation. Two non-linear algorithms, the supervised support vector machine and the unsupervised self-organizing competitive neural network, were used for comparison. The PP-LDA model exhibited improved classification results, with an accuracy of 82.1% for cross-validation and 80.3% for external prediction. Further analysis of this linear model indicated that the properties at position R(1) and the properties related to hydrophobicity contributed more to the glycosylation prediction. However, the alpha and turn propensities at the C-terminus, together with physicochemical properties at the N-terminus, are also related to glycosylation activity. This model is not only capable of predicting the possibility of glycosylation using an unbalanced dataset, but is also helpful for understanding the underlying biological mechanisms of glycosylation. To make our prediction model publicly accessible, a downloadable program is provided in our supplementary materials.
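The effect of equal a priori probabilities on an unbalanced training set can be illustrated with a minimal sketch. It is not the authors' program: it applies the textbook LDA decision rule to a single hypothetical feature, and the class means and pooled variance are invented for illustration. Only the class counts (468 positive, 3978 negative) come from the abstract.

```python
import math

def lda_threshold(mu_neg, mu_pos, pooled_var, prior_neg, prior_pos):
    """Decision threshold for one-feature LDA.

    A site is called positive when its feature value exceeds the
    returned threshold. Assumes mu_pos > mu_neg and a variance
    shared by both classes (the standard LDA assumption).
    """
    midpoint = (mu_neg + mu_pos) / 2.0
    # The prior term moves the threshold away from the more frequent class.
    shift = pooled_var / (mu_pos - mu_neg) * math.log(prior_neg / prior_pos)
    return midpoint + shift

# Class counts reported in the abstract.
n_pos, n_neg = 468, 3978
n_total = n_pos + n_neg

# Hypothetical feature statistics, chosen only for illustration.
mu_neg, mu_pos, pooled_var = 0.0, 1.0, 1.0

# Priors estimated from class frequencies (the usual LDA default).
empirical_thr = lda_threshold(mu_neg, mu_pos, pooled_var,
                              n_neg / n_total, n_pos / n_total)

# Equal priors: the log term vanishes and the threshold is the midpoint.
equal_thr = lda_threshold(mu_neg, mu_pos, pooled_var, 0.5, 0.5)

print(f"threshold with empirical priors: {empirical_thr:.3f}")
print(f"threshold with equal priors:     {equal_thr:.3f}")
```

With frequency-based priors the threshold moves toward the positive class mean and beyond it in this toy setting, so few sites would be called glycosylated. Setting the priors equal places the threshold midway between the class means, which is one plausible reading of why an equal-prior LDA suits a dataset with roughly 8.5 negatives per positive.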

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Databases, Protein*
  • Glycoproteins / genetics*
  • Glycoproteins / metabolism
  • Glycosylation
  • Models, Genetic*
  • Neural Networks, Computer*
  • Protein Structure, Tertiary
  • Sequence Analysis, Protein*
  • Software*

Substances

  • Glycoproteins