Evaluation of the vector space representation in text-based gene clustering

Pac Symp Biocomput. 2003:391-402. doi: 10.1142/9789812776303_0037.

Abstract

Thanks to its increasing availability, electronic literature can now be a major source of information when developing complex statistical models for which data are scarce or noisy. This raises the question of how to deeply integrate information from domain literature with experimental data. Which statistical text representations can integrate literature knowledge into clustering remains an insufficiently explored topic. In this work we discuss how the bag-of-words representation can be used successfully to represent genetic annotation and free-text information coming from different databases. We demonstrate the effect of various weighting schemes and information sources in a functional clustering setup. As a quantitative evaluation, we contrast, for different parameter settings, the functional groupings obtained from text with those obtained from expert assessments, and link each of the results to a biological discussion.
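As an illustration of the kind of pipeline the abstract describes, the following minimal Python sketch builds bag-of-words vectors for a few genes, applies one possible weighting scheme, and scores a text-derived grouping against a reference grouping. Everything specific here is an assumption made for illustration and is not taken from the paper: the gene names and annotation strings are hypothetical placeholders, the weighting is a plain tf-idf with a logarithmic idf, similarity is the cosine, and agreement is measured with the Rand index. The abstract does not state which weighting schemes or agreement measures the authors actually used.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Map each gene to a sparse {term: weight} bag-of-words vector.

    docs: dict of gene name -> annotation text (hypothetical toy input).
    Weighting (assumed, not from the paper): term count * log(N / df).
    """
    counts = {gene: Counter(text.lower().split()) for gene, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())  # each term counted once per gene
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    return {
        gene: {term: tf * idf[term] for term, tf in term_counts.items()}
        for gene, term_counts in counts.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all-zero."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two groupings agree (same vs. different group)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical placeholder genes and annotation text, for illustration only.
docs = {
    "geneA": "ribosomal protein translation",
    "geneB": "ribosomal protein subunit",
    "geneC": "glucose transport membrane",
}
vectors = tfidf_vectors(docs)
sim_ab = cosine(vectors["geneA"], vectors["geneB"])
sim_ac = cosine(vectors["geneA"], vectors["geneC"])

# Toy comparison of a text-derived grouping with a reference ("expert") grouping.
text_groups = [0, 0, 1]
expert_groups = [1, 1, 0]
agreement = rand_index(text_groups, expert_groups)
```

In this toy input, geneA and geneB share annotation terms and therefore score a higher cosine similarity than geneA and geneC, which share none. Swapping the weighting function or the source of the annotation text while keeping the agreement measure fixed is one way to compare parameter settings in the manner the abstract outlines.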

Publication types

  • Evaluation Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Artificial Intelligence
  • Cluster Analysis
  • Computational Biology
  • Databases, Genetic
  • Gene Expression Profiling / statistics & numerical data
  • Genome, Fungal
  • Genomics / statistics & numerical data*
  • Models, Genetic*
  • Saccharomyces cerevisiae / genetics