Identifying relevant data for a biological database: handcrafted rules versus machine learning

Aditya Kumar Sehgal; Sanmay Das; Keith Noto; Milton H Saier Jr; Charles Elkan

doi:10.1109/TCBB.2009.83

IEEE/ACM Trans Comput Biol Bioinform. 2011 May-Jun;8(3):851-7. doi: 10.1109/TCBB.2009.83.

Authors

Aditya Kumar Sehgal¹, Sanmay Das, Keith Noto, Milton H Saier Jr, Charles Elkan

Affiliation

¹ Core Technologies Group, Parity Computing, San Diego, CA 92121, USA. a.sehgal@paritycomputing.com

Abstract

With well over 1,000 specialized biological databases in use today, the task of automatically identifying novel, relevant data for such databases is increasingly important. In this paper, we describe practical machine learning approaches for identifying MEDLINE documents and Swiss-Prot/TrEMBL protein records for incorporation into a specialized biological database of transport proteins named TCDB. We show that both learning approaches outperform rules created by hand by a human expert. As one of the first case studies involving two different approaches to updating a deployed database, both the methods compared and the results will be of interest to curators of many specialized databases.

Publication types

Comparative Study
Research Support, N.I.H., Extramural

MeSH terms

Algorithms*
Artificial Intelligence*
Carrier Proteins
Cluster Analysis
Data Mining / methods*
Databases, Genetic*
Genomics / methods*
Humans
MEDLINE
Proteins / classification
Proteins / genetics

Substances

Carrier Proteins
Proteins

Grants and funding

R01 GM077402/GM/NIGMS NIH HHS/United States