Fec: a fast error correction method based on two-rounds overlapping and caching

Jun Zhang; Fan Nie; Neng Huang; Peng Ni; Feng Luo; Jianxin Wang

doi:10.1093/bioinformatics/btac565

Bioinformatics. 2022 Sep 30;38(19):4629-4632. doi: 10.1093/bioinformatics/btac565.

Authors

Jun Zhang¹,², Fan Nie¹,², Neng Huang¹,², Peng Ni¹,², Feng Luo³, Jianxin Wang¹,²

Affiliations

¹ School of Computer Science and Engineering, Central South University, Changsha 410083, China.
² Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha 410083, China.
³ School of Computing, Clemson University, Clemson, SC 29634, USA.

PMID: 35977383
DOI: 10.1093/bioinformatics/btac565

Abstract

Third-generation sequencing technology has advanced genome analysis with long reads, but these reads need error correction because of their high error rate. Error correction is time-consuming, especially when sequencing coverage is high. Generally, for a pair of overlapping reads A and B, existing error correction methods perform a base-level alignment from B to A when correcting read A, and another base-level alignment from A to B when correcting read B. However, based on our observation, the base-level alignment information can be reused. In this article, we present Fec, a fast error correction tool that uses two-round overlapping and caching. Fec can be used independently or as an error correction step in an assembly pipeline. In the first round, Fec uses a large window size (20) to quickly find enough overlaps to correct most of the reads. In the second round, a small window size (5) is used to find more overlaps for the reads with insufficient overlaps in the first round. When performing base-level alignment, Fec searches the cache first. If the alignment exists in the cache, Fec takes this alignment out and deduces the second alignment from it. Otherwise, Fec performs base-level alignment and stores the alignment in the cache. We test Fec on nine datasets, and the results show that Fec achieves a 1.24- to 38.56-fold speed-up over MECAT, CANU and MINICNS on five PacBio datasets and a 1.16- to 27.8-fold speed-up over NECAT and CANU on four nanopore datasets.
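The two ideas in the abstract, a second overlapping round only for under-covered reads and reuse of a cached alignment for the reverse direction, can be illustrated with a minimal Python sketch. This is not Fec's implementation. The names `AlignmentCache`, `invert_alignment`, `two_round_overlaps`, `toy_align` and the `min_overlaps` threshold are hypothetical, and the sketch assumes that both reads lie on the same strand, that an alignment is stored as CIGAR-style (operation, length) pairs, and that the reverse alignment can be deduced by swapping insertions and deletions. The abstract does not say whether second-round overlaps replace or extend the first-round ones; the sketch simply replaces them.

```python
from typing import Callable, Dict, Hashable, List, Tuple

# Assumed representation: CIGAR-style (op, length) pairs.
# "M" = aligned column, "I" = bases only in the query, "D" = bases only in the target.
Alignment = List[Tuple[str, int]]

_SWAP = {"M": "M", "I": "D", "D": "I"}


def invert_alignment(aln: Alignment) -> Alignment:
    """Deduce the target-to-query alignment from a query-to-target one."""
    return [(_SWAP[op], length) for op, length in aln]


class AlignmentCache:
    """Compute each pairwise alignment once and reuse it for the reverse direction."""

    def __init__(self, align_fn: Callable[[str, str], Alignment]) -> None:
        self._align_fn = align_fn
        self._cache: Dict[Tuple[Hashable, Hashable], Alignment] = {}
        self.computed = 0  # number of real base-level alignments performed

    def get(self, query_id: Hashable, target_id: Hashable,
            reads: Dict[Hashable, str]) -> Alignment:
        reverse_key = (target_id, query_id)
        if reverse_key in self._cache:
            # Cache hit: take the stored alignment out and deduce the second one.
            return invert_alignment(self._cache.pop(reverse_key))
        # Cache miss: align at base level and store the result.
        aln = self._align_fn(reads[query_id], reads[target_id])
        self.computed += 1
        self._cache[(query_id, target_id)] = aln
        return aln


def two_round_overlaps(read_ids: List[Hashable],
                       find_overlaps: Callable[[Hashable, int], list],
                       min_overlaps: int,
                       large_window: int = 20,
                       small_window: int = 5) -> Dict[Hashable, list]:
    """Round 1 uses the large window; round 2 revisits under-covered reads only."""
    overlaps = {rid: find_overlaps(rid, large_window) for rid in read_ids}
    for rid in read_ids:
        if len(overlaps[rid]) < min_overlaps:
            overlaps[rid] = find_overlaps(rid, small_window)
    return overlaps


def toy_align(query: str, target: str) -> Alignment:
    """Stand-in aligner for demonstration only: pairs up the common prefix length."""
    n = min(len(query), len(target))
    aln: Alignment = [("M", n)]
    if len(query) > n:
        aln.append(("I", len(query) - n))
    elif len(target) > n:
        aln.append(("D", len(target) - n))
    return aln
```

In this sketch, calling `cache.get("B", "A", reads)` while correcting read A runs the aligner once, and the later call `cache.get("A", "B", reads)` while correcting read B is served from the cache without a second base-level alignment. A real aligner would replace `toy_align`, and a real overlapper would be passed in as `find_overlaps`.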

Availability and implementation: Fec is available at https://github.com/zhangjuncsu/Fec.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Genome
High-Throughput Nucleotide Sequencing* / methods
Sequence Analysis, DNA / methods
Software*
