A naïve Bayesian classifier for identifying plant microRNAs

Plant J. 2016 Jun;86(6):481-92. doi: 10.1111/tpj.13180. Epub 2016 Jun 20.

Abstract

MicroRNAs (miRNAs) are important regulatory molecules in eukaryotic organisms. Existing methods for the identification of mature miRNA sequences in plants rely extensively on the search for stem-loop structures, leading to high false negative rates. Here, we describe a probabilistic method for ranking putative plant miRNAs using a naïve Bayes classifier and its publicly available implementation. We use a number of properties to construct the classifier, including sequence length, number of observations, existence of detectable predicted miRNA* sequences, the distribution of nearby reads and mapping multiplicity. We apply the method to small RNA sequence data from soybean, peach, Arabidopsis and rice and provide experimental validation of several predictions in soybean. The approach performs well overall and strongly enriches for known miRNAs over other types of sequences. By utilizing a Bayesian approach to rank putative miRNAs, our method is able to score miRNAs that would be eliminated by other methods, such as those that have low counts or lack detectable miRNA* sequences. As a result, we are able to detect several soybean miRNA candidates, including some that are 24 nucleotides long, a class that is almost universally eliminated by other methods.
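The ranking idea in the abstract can be illustrated with a minimal naïve Bayes sketch. This is not the authors' implementation: the feature names (`length`, `has_star`, `unique_mapping`), the likelihood values and the prior below are hypothetical placeholders chosen only to show how independent per-feature evidence combines into a posterior score. It also shows why a candidate lacking a detectable miRNA* sequence receives a low score rather than being filtered out.

```python
import math

# Hypothetical likelihoods: feature value -> (P(value | miRNA), P(value | other)).
# These numbers are illustrative assumptions, not estimates from the paper.
LIKELIHOODS = {
    "length": {21: (0.60, 0.15), 22: (0.15, 0.10), 24: (0.05, 0.50)},
    "has_star": {True: (0.70, 0.05), False: (0.30, 0.95)},
    "unique_mapping": {True: (0.80, 0.40), False: (0.20, 0.60)},
}
PRIOR_MIRNA = 0.01  # assumed prior probability that a candidate is a miRNA


def posterior_mirna(candidate):
    """Return P(miRNA | features) under the naive independence assumption."""
    # Start from the prior odds and add one log-likelihood ratio per feature.
    log_odds = math.log(PRIOR_MIRNA / (1.0 - PRIOR_MIRNA))
    for feature, table in LIKELIHOODS.items():
        p_mirna, p_other = table[candidate[feature]]
        log_odds += math.log(p_mirna / p_other)
    # Convert log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


# Two made-up candidates: a typical 21-nt read with a detected miRNA*,
# and a 24-nt read with no detectable miRNA*.
candidates = {
    "cand_21nt_star": {"length": 21, "has_star": True, "unique_mapping": True},
    "cand_24nt_nostar": {"length": 24, "has_star": False, "unique_mapping": True},
}

# Rank candidates from most to least miRNA-like.
ranking = sorted(candidates, key=lambda name: posterior_mirna(candidates[name]), reverse=True)
for name in ranking:
    print(name, round(posterior_mirna(candidates[name]), 4))
```

Because every candidate receives a score, a downstream user can choose a cutoff or inspect the ranked list, instead of losing low-count or miRNA*-less sequences to a hard filter.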

Keywords: Bayesian statistics; classification; naïve Bayes classifier; plant miRNAs; small RNA sequencing.

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Base Sequence
  • Bayes Theorem*
  • Computational Biology / methods*
  • Gene Expression Regulation, Plant / genetics
  • MicroRNAs / classification
  • MicroRNAs / genetics*
  • RNA, Plant / classification
  • RNA, Plant / genetics*

Substances

  • MicroRNAs
  • RNA, Plant

Associated data

  • GENBANK/GSM848963