BlockPolish: accurate polishing of long-read assembly via block divide-and-conquer

Brief Bioinform. 2022 Jan 17;23(1):bbab405. doi: 10.1093/bib/bbab405.

Abstract

Long-read sequencing technology enables significant progress in de novo genome assembly. However, the high error rate and the wide error distribution of raw reads result in a large number of errors in the assembly. Polishing is a procedure to fix errors in the draft assembly and improve the reliability of genomic analysis. However, existing methods treat all the regions of the assembly equally while there are fundamental differences between the error distributions of these regions. How to achieve very high accuracy in genome assembly is still a challenging problem. Motivated by the uneven errors in different regions of the assembly, we propose a novel polishing workflow named BlockPolish. In this method, we divide contigs into blocks with low complexity and high complexity according to statistics of aligned nucleotide bases. Multiple sequence alignment is applied to realign raw reads in complex blocks and optimize the alignment result. Due to the different distributions of error rates in trivial and complex blocks, two multitask bidirectional Long short-term memory (LSTM) networks are proposed to predict the consensus sequences. In the whole-genome assemblies of NA12878 assembled by Wtdbg2 and Flye using Nanopore data, BlockPolish has a higher polishing accuracy than other state-of-the-arts including Racon, Medaka and MarginPolish & HELEN. In all assemblies, errors are predominantly indels and BlockPolish has a good performance in correcting them. In addition to the Nanopore assemblies, we further demonstrate that BlockPolish can also reduce the errors in the PacBio assemblies. The source code of BlockPolish is freely available on Github (https://github.com/huangnengCSU/BlockPolish).

Keywords: assembly polishing; block divide-and-conquer; long reads; neural network.

Publication types

  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • High-Throughput Nucleotide Sequencing* / methods
  • Reproducibility of Results
  • Sequence Alignment
  • Sequence Analysis, DNA / methods
  • Software*