Assessing predictors for new post-translational modification sites: A case study on hydroxylation

PLoS Comput Biol. 2020 Jun 22;16(6):e1007967. doi: 10.1371/journal.pcbi.1007967. eCollection 2020 Jun.

Abstract

Post-translational modification (PTM) sites have become popular targets for predictor development. However, with the exception of phosphorylation and a handful of other examples, PTMs suffer from a limited number of available training examples and from sparsity in protein sequences. Here, proline hydroxylation is taken as an example to compare different methods and evaluate their performance on new experimentally determined sites. To guide effective experimental design, predictors require both high specificity and high sensitivity. However, self-reported performance often may not be indicative of prediction quality, and detection of new sites is not guaranteed. We benchmarked seven published hydroxylation site predictors on two newly constructed independent datasets. The self-reported performance is found to widely overestimate the real accuracy measured on independent datasets. No predictor performs better than random on new examples, indicating that the refined models do not generalize sufficiently to detect new sites. The number of false positives is high and precision is low, in particular for non-collagen proteins whose motifs are not conserved. As hydroxylation site predictors do not generalize to new data, caution is advised when using PTM predictors in the absence of independent evaluations, in particular for highly specific sites involved in signalling.
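
The quantities named in the abstract (sensitivity, specificity, precision, false positives) can be illustrated with a minimal Python sketch of scoring a site predictor against an independent set of labelled residues. This is not the authors' benchmarking code. The function name `evaluate_site_predictions`, the 0/1 label encoding, and the inclusion of the Matthews correlation coefficient (a common summary for imbalanced data, not mentioned in the abstract) are all assumptions made for illustration.

```python
import math


def evaluate_site_predictions(y_true, y_pred):
    """Score binary site predictions against experimentally determined labels.

    Hypothetical helper: y_true and y_pred are equal-length sequences of
    0/1 values, where 1 marks a modified (e.g. hydroxylated) residue.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Confusion-matrix counts.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else math.nan

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "tp": tp,
        "tn": tn,
        "fp": fp,
        "fn": fn,
        "sensitivity": ratio(tp, tp + fn),  # fraction of true sites found
        "specificity": ratio(tn, tn + fp),  # fraction of non-sites rejected
        "precision": ratio(tp, tp + fp),    # fraction of predicted sites that are real
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


if __name__ == "__main__":
    # Made-up labels for eight candidate prolines; not data from the study.
    truth = [1, 1, 0, 0, 0, 0, 0, 0]
    predicted = [1, 0, 1, 1, 0, 0, 0, 0]
    for name, value in evaluate_site_predictions(truth, predicted).items():
        print(name, value)
```

Because modified residues are rare relative to unmodified ones, a handful of false positives can push precision down sharply even when specificity looks acceptable, which is why the sketch reports both.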

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • HeLa Cells
  • Humans
  • Hydroxylation
  • Protein Processing, Post-Translational*
  • Proteins / metabolism*
  • Signal Transduction

Substances

  • Proteins

Grants and funding

DP was supported by Fondazione Italiana per la Ricerca sul Cancro [16621]. AMM was funded by the research programme MSCA Seal of Excellence @UniPD [TOSA_MSCASOE18_01]. ST was supported by Associazione Italiana per la Ricerca sul Cancro [IG 17753, IG 23825]. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 778247. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.