CSTs for Terabyte-Sized Data

Marco Oliva; Davide Cenzato; Massimiliano Rossi; Zsuzsanna Lipták; Travis Gagie; Christina Boucher

doi:10.1109/dcc52660.2022.00017

Proc Data Compress Conf. 2022 Mar;2022:93-102. doi: 10.1109/dcc52660.2022.00017. Epub 2022 Jul 4.

Authors

Marco Oliva¹, Davide Cenzato², Massimiliano Rossi¹, Zsuzsanna Lipták², Travis Gagie³, Christina Boucher¹

Affiliations

¹ Dept of Comp and Info Sci and Eng, University of Florida, Gainesville, FL.
² Dept of Comp Sci, University of Verona, Verona, Italy.
³ Faculty of Comp Sci, Dalhousie University, Halifax, Canada.

Abstract

Pangenomic datasets are being generated more and more often, but there are still few tools able to handle them and even fewer accessible to non-specialists. Building compressed suffix trees (CSTs) for pangenomic datasets remains a major challenge but could be enormously beneficial to the community. In this paper, we present a scalable method for building CSTs, which we refer to as RePFP-CST. To accomplish this, we show how to build a CST directly from VCF files without decompressing them, and how to prune from the prefix-free parse (PFP) those phrase boundaries whose removal reduces the total size of the dictionary and the parse. We show that these improvements reduce the time and space required to construct the CST, as well as the memory footprint of the finished CST, enabling us to build a CST for a terabyte of DNA for the first time in the literature.
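
For readers unfamiliar with the prefix-free parse mentioned above, the following is a minimal sketch of plain PFP on a single string. It is not the RePFP-CST implementation: it does not read VCF files and does not perform the boundary pruning described in the abstract. The function names, the toy rolling hash, and the parameters w (window length) and p (modulus) are assumptions chosen for illustration, and the input is assumed not to contain the sentinel characters # and $.

```python
def toy_hash(window: str) -> int:
    """Deterministic toy hash (an assumption; real tools use a rolling hash)."""
    h = 0
    for ch in window:
        h = (h * 256 + ord(ch)) % 1_000_000_007
    return h


def prefix_free_parse(text: str, w: int, p: int):
    """Split '#' + text + '$' * w into phrases that overlap by w characters.

    A window of length w is a trigger if its hash is 0 modulo p; the final
    window of w '$' characters is always treated as a trigger.
    Returns (dictionary, parse): the sorted distinct phrases and, for each
    phrase in text order, its rank in the dictionary.
    """
    padded = "#" + text + "$" * w  # assumes '#' and '$' do not occur in text
    phrases = []
    start = 0
    for i in range(1, len(padded) - w + 1):
        window = padded[i:i + w]
        if window == "$" * w or toy_hash(window) % p == 0:
            # The phrase ends with this trigger; the next one starts with it.
            phrases.append(padded[start:i + w])
            start = i
    dictionary = sorted(set(phrases))
    rank = {phrase: r for r, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]
    return dictionary, parse


def reconstruct(dictionary, parse, w: int) -> str:
    """Rebuild the padded text by dropping the w-character overlaps."""
    phrases = [dictionary[r] for r in parse]
    return phrases[0] + "".join(phrase[w:] for phrase in phrases[1:])


if __name__ == "__main__":
    d, q = prefix_free_parse("GATTACAGATTACAT", w=2, p=3)
    print(len(d), "distinct phrases;", len(q), "phrases in the parse")
    print(reconstruct(d, q, 2))
```

In this sketch every trigger window becomes a phrase boundary. The pruning idea in the abstract would instead drop some of those boundaries when doing so shrinks the combined size of the dictionary and the parse; how the boundaries are chosen is not specified here and is left out.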
