HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints

William H Press; John A Hawkins; Stephen K Jones Jr; Jeffrey M Schaub; Ilya J Finkelstein

doi:10.1073/pnas.2004821117

Proc Natl Acad Sci U S A. 2020 Aug 4;117(31):18489-18496. doi: 10.1073/pnas.2004821117. Epub 2020 Jul 16.

Authors

William H Press ¹ ²; John A Hawkins ³ ⁴ ⁵; Stephen K Jones Jr ⁴ ⁵; Jeffrey M Schaub ⁴ ⁵; Ilya J Finkelstein ⁴ ⁵

Affiliations

¹ Department of Computer Science, The University of Texas at Austin, Austin, TX 78712; wpress@cs.utexas.edu.
² Department of Integrative Biology, The University of Texas at Austin, Austin, TX 78712.
³ Oden Institute of Computational Engineering and Sciences, The University of Texas at Austin, Austin, TX 78712.
⁴ Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX 78712.
⁵ Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, TX 78712.

Abstract

Synthetic DNA is rapidly emerging as a durable, high-density information storage platform. A major challenge for DNA-based information encoding strategies is the high rate of errors that arise during DNA synthesis and sequencing. Here, we describe the HEDGES (Hash Encoded, Decoded by Greedy Exhaustive Search) error-correcting code that repairs all three basic types of DNA errors: insertions, deletions, and substitutions. HEDGES also converts unresolved or compound errors into substitutions, restoring synchronization for correction via a standard Reed-Solomon outer code that is interleaved across strands. Moreover, HEDGES can incorporate a broad class of user-defined sequence constraints, such as avoiding excess repeats, or too high or too low windowed guanine-cytosine (GC) content. We test our code both via in silico simulations and with synthesized DNA. From its measured performance, we develop a statistical model applicable to much larger datasets. Predicted performance indicates the possibility of error-free recovery of petabyte- and exabyte-scale data from DNA degraded with as much as 10% errors. As the cost of DNA synthesis and sequencing continues to drop, we anticipate that HEDGES will find applications in large-scale error-free information encoding.
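The sequence constraints named in the abstract (limits on repeats and on windowed GC content) can be illustrated with a small checker. The following is a minimal sketch only, not the HEDGES implementation: it tests a finished strand after the fact, and the abstract does not describe how HEDGES applies constraints during encoding. The function names, the window length, the GC bounds, and the maximum run length are all hypothetical values chosen for illustration.

```python
def max_homopolymer_run(seq: str) -> int:
    """Return the length of the longest run of one repeated base."""
    longest = 0
    run = 0
    prev = None
    for base in seq:
        run = run + 1 if base == prev else 1
        longest = max(longest, run)
        prev = base
    return longest


def windowed_gc_fractions(seq: str, window: int) -> list[float]:
    """Return the GC fraction of every full-length sliding window."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(seq) < window:
        return []
    is_gc = [1 if base in "GC" else 0 for base in seq]
    count = sum(is_gc[:window])
    fractions = [count / window]
    # Slide the window one base at a time, updating the count incrementally.
    for i in range(window, len(seq)):
        count += is_gc[i] - is_gc[i - window]
        fractions.append(count / window)
    return fractions


def satisfies_constraints(
    seq: str,
    window: int = 12,      # hypothetical window length
    gc_min: float = 0.35,  # hypothetical lower GC bound
    gc_max: float = 0.65,  # hypothetical upper GC bound
    max_run: int = 4,      # hypothetical homopolymer limit
) -> bool:
    """Check a strand against the illustrative run and GC constraints."""
    if max_homopolymer_run(seq) > max_run:
        return False
    return all(gc_min <= f <= gc_max for f in windowed_gc_fractions(seq, window))
```

For example, `satisfies_constraints("ACGTACGTACGTACGT")` passes under these assumed thresholds, while a strand containing six consecutive `A` bases fails the run-length check.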

Keywords: DNA; Reed–Solomon; error-correcting code; indel; information storage.

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

DNA / genetics*
DNA Replication
INDEL Mutation*
Information Storage and Retrieval
Models, Statistical

Substances

DNA

