dForml(KNN)-PseAAC: Detecting formylation sites from protein sequences using K-nearest neighbor algorithm via Chou's 5-step rule and pseudo components

J Theor Biol. 2019 Jun 7:470:43-49. doi: 10.1016/j.jtbi.2019.03.011. Epub 2019 Mar 14.

Abstract

Formylation is a post-translational modification that can occur at lysine residues and plays an irreplaceable role in organisms. To better understand its mechanism, formylation sites in proteins must be identified accurately. Computational methods are popular because they are more convenient and faster than traditional experimental methods. However, no computational method has yet been proposed for predicting lysine formylation. In this study, we developed a predictor named LFPred that identifies lysine formylation sites using sequence features, namely amino acid composition (AAC), binary profile features (BPF), and amino acid index (AAI), combined with a K-nearest neighbor classifier. We chose a discrete window rather than a continuous window on the basis of information entropy. In addition, we took measures to select more reliable negative samples and to address the severe imbalance between positive and negative samples. Finally, in a jackknife test, LFPred achieved a specificity of 79.9% and a sensitivity of 81.4%, indicating that the method can be a useful tool for predicting lysine formylation sites.
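The pipeline described above can be illustrated with a minimal sketch. It is not the LFPred implementation: the function names, the Euclidean distance, the default k = 3, and the toy peptides used to exercise it are all assumptions made for illustration. AAI features are omitted because they require index values from an external database, and the rule LFPred uses to turn per-position entropy into a discrete window is not given in the abstract, so only the entropy calculation itself is shown.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(peptide):
    """Amino acid composition: relative frequency of each standard residue."""
    counts = Counter(peptide)
    return [counts[a] / len(peptide) for a in AMINO_ACIDS]


def bpf(peptide):
    """Binary profile features: a 20-dimensional one-hot vector per position."""
    return [1.0 if res == a else 0.0 for res in peptide for a in AMINO_ACIDS]


def position_entropy(peptides, pos):
    """Shannon entropy (in bits) of the residues observed at one window position."""
    counts = Counter(p[pos] for p in peptides)
    total = len(peptides)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training vectors closest to the query."""
    ranked = sorted(zip(train_x, train_y), key=lambda xy: math.dist(xy[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def jackknife(xs, ys, k=3):
    """Leave-one-out test; returns (sensitivity, specificity) for labels 1 and 0."""
    tp = tn = fp = fn = 0
    for i in range(len(xs)):
        # Hold out sample i and classify it from all remaining samples.
        pred = knn_predict(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], xs[i], k)
        if ys[i] == 1:
            tp, fn = tp + (pred == 1), fn + (pred != 1)
        else:
            tn, fp = tn + (pred == 0), fp + (pred != 0)
    return tp / (tp + fn), tn / (tn + fp)
```

Each peptide here is assumed to be a fixed-length window centred on a candidate lysine. Feature vectors from `aac` and `bpf` can be concatenated before being passed to `jackknife`, which reports sensitivity and specificity in the same sense as the figures quoted in the abstract.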

Keywords: Discrete window; Formylation; Information entropy; K-nearest neighbor algorithm; Sequence feature.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Lysine / genetics
  • Lysine / metabolism
  • Protein Processing, Post-Translational*
  • Proteins* / genetics
  • Proteins* / metabolism
  • Sequence Analysis, Protein*

Substances

  • Proteins
  • Lysine