Motivation: A major goal in structural genomics is to enrich the catalogue of proteins whose 3D structures are known. In an attempt to address this problem we mapped over 10 000 proteins with solved structures onto a graph of all Swissprot protein sequences (release 36, approximately 73 000 proteins) provided by ProtoMap, with the goal of sorting proteins according to their likelihood of belonging to new superfamilies. We hypothesized that proteins within neighbouring clusters tend to share common structural superfamilies or folds. If true, the likelihood of finding new superfamilies increases in clusters that are distal from other solved structures within the graph.
Results: We defined an order relation between unsolved proteins according to their 'distance' from solved structures in the graph, and sorted approximately 48 000 proteins. Our list can be partitioned into three groups: approximately 35 000 proteins sharing a cluster with at least one known structure; approximately 6500 proteins in clusters with no solved structure but with neighbouring clusters containing known structures; and a third group contains the rest of the proteins, approximately 6100 (in 1274 clusters). We tested the quality of the order relation using thousands of recently solved structures that were not included when the order was defined. The tests show that our order is significantly better (P-value approximately 10(5)) than a random order. More interestingly, the order within the union of the second and third groups, and the order within the third group alone, perform better than random (P-values: 0.0008 and 0.15, respectively) and are better than alternative orders created using PSI-BLAST. Herein, we present a method for selecting targets to be used in structural genomics projects.
Availability: List of proteins to be used for targets selection combined with a set of biological filters for narrowing down potential targets is in http://www.protarget.cs.huji.ac.il.