Background: The detection of relationships between a protein sequence of unknown function and a sequence whose function has been characterised enables the transfer of functional annotation. However in many cases these relationships can not be identified easily from direct comparison of the two sequences. Methods which compare sequence profiles have been shown to improve the detection of these remote sequence relationships. However, the best method for building a profile of a known set of sequences has not been established. Here we examine how the type of profile built affects its performance, both in detecting remote homologs and in the resulting alignment accuracy. In particular, we consider whether it is better to model a protein superfamily using a single structure-based alignment that is representative of all known cases of the superfamily, or to use multiple sequence-based profiles each representing an individual member of the superfamily.
Results: Using profile-profile methods for remote homolog detection we benchmark the performance of single structure-based superfamily models and multiple domain models. On average, over all superfamilies, using a truncated receiver operator characteristic (ROC5) we find that multiple domain models outperform single superfamily models, except at low error rates where the two models behave in a similar way. However there is a wide range of performance depending on the superfamily. For 12% of all superfamilies the ROC5 value for superfamily models is greater than 0.2 above the domain models and for 10% of superfamilies the domain models show a similar improvement in performance over the superfamily models.
Conclusion: Using a sensitive profile-profile method we have investigated the performance of single structure-based models and multiple sequence models (domain models) in detecting remote superfamily members. We find that overall, multiple models perform better in recognition although single structure-based models display better alignment accuracy.