Experimental evidence for the human proteome has been defined in the Human Proteome Project and is publicly available in the neXtProt database. However, there are still human proteins for which reliable experimental evidence does not exist, and obtaining such evidence has become one of the overriding objectives of the chromosome-centric study of the human proteome. With this aim, and given the complexity of protein detection using shotgun and targeted proteomics, the research community has turned to integrating the transcriptomic and proteomic landscapes. Here, we describe an analytical pipeline that predicts the probability of a missing protein being expressed in a biological sample based on (1) gene sequence characteristics, (2) the probability of an expressed gene being a coding gene of a missing protein in a certain sample, and (3) the probability of a gene being expressed in a transcriptomic experiment. More than 3400 microarray experiments from three biological sources were analyzed: cell lines, normal tissues, and cancer samples. A gene classification based on gene expression profiles distinguished among ubiquitous genes, nonubiquitous genes, nonexpressed genes, and coding genes of missing proteins. In addition, a distinct tissue-specific expression pattern is reported for the coding genes of missing proteins. Our results underline the relevance of selecting an appropriate sample for the detection of missing proteins and provide a comprehensive method to score their expression probability. Testis, brain, and skeletal muscle are the most promising normal tissues for their detection.
Keywords: C-HPP; missing proteins; naive Bayes classifier; protein expression profiles; transcriptome profiling.
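
The scoring idea summarized in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that the three components are available as probabilities and can be treated as independent, so that they are combined by multiplication; the abstract does not state the actual combination rule. The names `Evidence`, `expression_score`, and `rank_samples` are hypothetical, and the numeric values are made up and are not results from the study.

```python
from dataclasses import dataclass, asdict
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Evidence:
    """Three per-gene, per-sample probabilities (hypothetical container)."""
    p_sequence: float                 # (1) score derived from gene sequence characteristics
    p_missing_given_expressed: float  # (2) P(expressed gene codes for a missing protein | sample)
    p_expressed: float                # (3) P(gene is expressed in a transcriptomic experiment)


def expression_score(ev: Evidence) -> float:
    """Combine the three components by multiplication.

    ASSUMPTION: independence of the components; this is an illustrative
    simplification, not the combination rule used by the authors.
    """
    score = 1.0
    for name, p in asdict(ev).items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {p}")
        score *= p
    return score


def rank_samples(evidence_by_sample: Dict[str, Evidence]) -> List[Tuple[str, float]]:
    """Return (sample, score) pairs sorted from most to least promising."""
    scored = [(name, expression_score(ev)) for name, ev in evidence_by_sample.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up probabilities for a single hypothetical gene in three tissues.
    example = {
        "liver": Evidence(0.6, 0.1, 0.3),
        "testis": Evidence(0.6, 0.5, 0.8),
        "brain": Evidence(0.6, 0.4, 0.7),
    }
    for sample, score in rank_samples(example):
        print(f"{sample}: {score:.3f}")
```

Ranking candidate samples by such a score reflects the point made in the abstract that sample selection matters for detecting a missing protein; a real implementation would estimate each probability from the expression data rather than supply it by hand.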