RNA 5-methylcytosine (m5C) is an important post-transcriptional modification that plays an indispensable role in biological processes. Accurately identifying m5C sites from primary RNA sequences is especially useful for understanding the mechanisms and functions of m5C. Because identifying m5C sites with wet-lab techniques is difficult and expensive, fast and accurate machine-learning-based prediction methods are urgently needed. In this study, we propose a new m5C site predictor, called M5C-HPCR, which combines a novel heuristic nucleotide physicochemical property reduction (HPCR) algorithm with a classifier ensemble. HPCR extracts multiple reducts of physicochemical properties for encoding discriminative features, and the classifier ensemble integrates multiple base predictors, each trained on a separate reduct of the physicochemical properties obtained from HPCR. Rigorous jackknife tests on two benchmark datasets demonstrate that M5C-HPCR outperforms state-of-the-art m5C site predictors, with the highest values of MCC (0.859) and AUC (0.962). We also implemented a webserver for M5C-HPCR, which is freely available at http://cslab.just.edu.cn:8080/M5C-HPCR/.
Keywords: Classifier ensemble; Heuristic properties reduction; Pseudo dinucleotide composition; RNA 5-methylcytosine.
Copyright © 2018 Elsevier Inc. All rights reserved.
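
The ensemble idea summarized in the abstract (one base predictor per reduct, with outputs integrated into a single score) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from M5C-HPCR: the nearest-centroid base predictor `CentroidPredictor`, the score-averaging rule in `ensemble_score`, the toy feature matrix `X`, and the index lists in `reducts` are all hypothetical stand-ins, and the sketch does not implement the HPCR reduction step itself. The `mcc` helper uses the standard definition of the Matthews correlation coefficient.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


class CentroidPredictor:
    """Hypothetical base predictor that sees only one feature subset (reduct)."""

    def __init__(self, reduct):
        self.reduct = list(reduct)
        self.centroids = {}

    def _project(self, x):
        # Keep only the features selected by this predictor's reduct.
        return [x[i] for i in self.reduct]

    def fit(self, X, y):
        # Store the mean projected vector of each class (0 = non-site, 1 = site).
        for label in (0, 1):
            rows = [self._project(x) for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_proba(self, x):
        # Score in [0, 1]: closer to the positive centroid gives a higher score.
        p = self._project(x)
        d0 = math.dist(p, self.centroids[0])
        d1 = math.dist(p, self.centroids[1])
        return 0.5 if d0 + d1 == 0 else d0 / (d0 + d1)


def ensemble_score(predictors, x):
    """Assumed integration rule: average the base predictors' scores."""
    return sum(m.predict_proba(x) for m in predictors) / len(predictors)


# Toy data: made-up feature values, not real physicochemical properties.
X = [[0.1, 0.2, 0.9], [0.2, 0.1, 0.8], [0.9, 0.8, 0.1], [0.8, 0.9, 0.2]]
y = [0, 0, 1, 1]
reducts = [[0, 1], [1, 2], [0, 2]]  # hypothetical reducts (feature index sets)

predictors = [CentroidPredictor(r).fit(X, y) for r in reducts]
preds = [int(ensemble_score(predictors, x) >= 0.5) for x in X]

# Confusion-matrix counts for the toy predictions.
tp = sum(1 for p, t in zip(preds, y) if p == 1 and t == 1)
tn = sum(1 for p, t in zip(preds, y) if p == 0 and t == 0)
fp = sum(1 for p, t in zip(preds, y) if p == 1 and t == 0)
fn = sum(1 for p, t in zip(preds, y) if p == 0 and t == 1)
print(preds, mcc(tp, tn, fp, fn))
```

Averaging is only one possible way to integrate base predictors; weighted voting or a meta-classifier would slot into `ensemble_score` without changing the rest of the sketch.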