Creating hierarchical models of protein families based on Expressed Sequence Tags: the "Sprockets" analysis pipeline

Anal Chim Acta. 2006 Mar 30;564(1):123-32. doi: 10.1016/j.aca.2006.01.072. Epub 2006 Feb 28.

Abstract

We have created an analysis pipeline called Sprockets, which can be used to classify proteins into hierarchical "families" and to build searchable models of these families. The construction of these families is based on data from Expressed Sequence Tags (ESTs) and Coding DNA Sequences (CDSs), making Sprockets clusters especially suitable for studying gene families in organisms for which a completely sequenced genome does not (yet) exist. The pipeline consists of two main parts: pair-wise analysis and grouping of sequences with Z-score statistics, followed by hierarchical splitting of clusters into alignable protein families. The computational and statistical techniques applied in Sprockets allow it to act as a massive and selective multiple sequence alignment engine for combining individual sequence collections with related public sequences. The end result is a database of gene Hidden Markov Models, each related to the others by three levels of similarity: secondary structure, function and evolutionary origin. For a sample set of 20,000 ESTs from Lactuca spp., Sprockets provided a 9% improvement in mapping of function to unknown sequences over traditional pair-wise search methods and InterPro mapping.
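
The first stage described above, pair-wise analysis and grouping with Z-score statistics, can be illustrated with a minimal Python sketch. It is not the Sprockets implementation: the abstract does not say how the Z-scores are computed, so the sketch assumes a common Monte Carlo approach in which the observed alignment score is compared with scores obtained after shuffling one of the two sequences. The Smith-Waterman scoring values, the threshold of 8, the single-linkage grouping, and every function name (`local_alignment_score`, `z_score`, `group_by_z`) are assumptions chosen for illustration.

```python
import random
import statistics


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty.

    The scoring values are toy assumptions, not a real substitution matrix.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def z_score(a, b, n_shuffles=100, rng=None):
    """Z-score of the observed score against a shuffled-sequence null model."""
    if rng is None:
        rng = random.Random(0)
    observed = local_alignment_score(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps composition, destroys residue order
        null_scores.append(local_alignment_score(a, "".join(letters)))
    mean = statistics.mean(null_scores)
    sd = statistics.pstdev(null_scores)
    if sd == 0:
        # Degenerate null distribution: every shuffle gave the same score.
        return 0.0 if observed == mean else float("inf")
    return (observed - mean) / sd


def group_by_z(seqs, threshold=8.0, n_shuffles=100, seed=0):
    """Single-linkage grouping: join any pair whose Z-score >= threshold."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    rng = random.Random(seed)
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if z_score(seqs[i], seqs[j], n_shuffles, rng) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Calling `group_by_z` on a list of translated sequences returns lists of indices, one list per group. Normalising by a shuffled background is intended to reduce the influence of sequence length and composition on the raw alignment score, which is one plausible reason a pipeline would group on Z-scores rather than on raw scores. The second stage, hierarchical splitting into alignable families and HMM construction, is not sketched here.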