Evaluating the impact of topological protein features on the negative examples selection

BMC Bioinformatics. 2018 Nov 20;19(Suppl 14):417. doi: 10.1186/s12859-018-2385-x.

Abstract

Background: Supervised machine learning methods when applied to the problem of automated protein-function prediction (AFP) require the availability of both positive examples (i.e., proteins which are known to possess a given protein function) and negative examples (corresponding to proteins not associated with that function). Unfortunately, publicly available proteome and genome data sources such as the Gene Ontology rarely store the functions not possessed by a protein. Thus the negative selection, consisting in identifying informative negative examples, is currently a central and challenging problem in AFP. Several heuristics have been proposed through the years to solve this problem; nevertheless, despite their effectiveness, to the best of our knowledge no previous existing work studied which protein features are more relevant to this task, that is, which protein features help more in discriminating reliable and unreliable negatives.

Results: The present work analyses the impact of several features on the selection of negative proteins for the Gene Ontology (GO) terms. The analysis is network-based: it exploits the fact that proteins can be naturally structured in a network, considering the pairwise relationships coming from several sources of data, such as protein-protein and genetic interactions. Overall, the proposed protein features, including local and global graph centrality measures and protein multifunctionality, can be term-aware (i.e., depending on the GO term) and term-unaware (i.e., invariant across the GO terms). We validated the informativeness of each feature utilizing a temporal holdout in three different experiments on yeast, mouse and human proteomes: (i) feature selection to detect which protein features are more helpful for the negative selection; (ii) protein function prediction to verify whether the features considered are also useful to predict GO terms; (iii) negative selection by applying two different negative selection algorithms on proteins represented through the proposed features.

Conclusions: Term-aware features (with some exceptions) resulted more informative for problem (i), together with node betweenness, which is the most relevant among term-unaware features. The node positive neighborhood instead is the most predictive feature for the AFP problem, while experiment (iii) showed that the proposed features allow negative selection algorithms to select effectively negative instances in the temporal holdout setting, with better results when nonlinear combinations of features are also exploited.

Keywords: Biological networks; Negative example selection; Protein features; Protein function prediction.

MeSH terms

  • Algorithms
  • Animals
  • Gene Ontology
  • Gene Regulatory Networks
  • Humans
  • Mice
  • Proteins / chemistry*
  • Proteome / metabolism
  • Saccharomyces cerevisiae / metabolism

Substances

  • Proteins
  • Proteome