Evaluating the effect of annotation size on measures of semantic similarity

Maxat Kulmanov; Robert Hoehndorf

doi:10.1186/s13326-017-0119-z

J Biomed Semantics. 2017 Feb 13;8(1):7. doi: 10.1186/s13326-017-0119-z.

Authors

Maxat Kulmanov¹ ², Robert Hoehndorf³ ⁴

Affiliations

¹ Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia.
² Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia.
³ Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia. robert.hoehndorf@kaust.edu.sa.
⁴ Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia. robert.hoehndorf@kaust.edu.sa.

Abstract

Background: Ontologies are widely used as metadata in biological and biomedical datasets. Measures of semantic similarity use ontologies to determine how similar two entities are when both are annotated with ontology classes. Semantic similarity is increasingly applied in areas ranging from the diagnosis of disease to the investigation of gene networks and the functions of gene products.

Results: Here, we analyze a large number of semantic similarity measures and the sensitivity of similarity values to the number of annotations of entities, the difference in annotation size, and the depth or specificity of annotation classes. We find that most similarity measures are sensitive to all three factors; well-studied and richly annotated entities will usually show higher similarity than entities with only a few annotations, even in the absence of any biological relation.
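The annotation-size effect described in the Results can be illustrated with a small simulation. The sketch below is not the authors' experimental setup: the toy ontology (a complete binary tree), the choice of Jaccard similarity over ancestor-closed annotation sets, and all names (`DEPTH`, `ancestors`, `annotation_closure`, `mean_random_similarity`) are assumptions made for illustration. Annotations are drawn uniformly at random, so any similarity between two entities arises by chance rather than from a biological relation.

```python
import random

# Hypothetical toy ontology: a complete binary tree whose classes are
# numbered 1 .. 2**(DEPTH + 1) - 1, where the parent of class n is n // 2.
DEPTH = 6
CLASSES = range(1, 2 ** (DEPTH + 1))


def ancestors(cls):
    """Return the class together with all of its superclasses up to the root."""
    closure = set()
    while cls >= 1:
        closure.add(cls)
        cls //= 2
    return closure


def annotation_closure(annotations):
    """Propagate annotations upward: union of the ancestor sets of all classes."""
    closed = set()
    for cls in annotations:
        closed |= ancestors(cls)
    return closed


def jaccard(a, b):
    """Jaccard index of two non-empty sets."""
    return len(a & b) / len(a | b)


def mean_random_similarity(size, trials=500, seed=0):
    """Average similarity of two entities with `size` random annotations each."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        a = annotation_closure(rng.sample(CLASSES, size))
        b = annotation_closure(rng.sample(CLASSES, size))
        total += jaccard(a, b)
    return total / trials


if __name__ == "__main__":
    for size in (1, 2, 5, 10, 20):
        print(size, round(mean_random_similarity(size), 3))
```

In this toy setting, the mean similarity between two unrelated entities grows with the number of annotations, because larger annotation sets cover more of the shared upper levels of the hierarchy once annotations are propagated to superclasses.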

Conclusions: Our findings may have a significant impact on the interpretation of results that rely on measures of semantic similarity, and we demonstrate how the sensitivity to annotation size can lead to a bias when using semantic similarity to predict protein-protein interactions.

Keywords: Gene ontology; Ontology; Semantic similarity.

MeSH terms

Disease / genetics
Gene Ontology*
Molecular Sequence Annotation*
Protein Interaction Mapping
Semantics*