GSearch: ultra-fast and scalable genome search by combining K-mer hashing with hierarchical navigable small world graphs

Jianshu Zhao; Jean Pierre Both; Luis M Rodriguez-R; Konstantinos T Konstantinidis

doi:10.1093/nar/gkae609

Nucleic Acids Res. 2024 Sep 9;52(16):e74. doi: 10.1093/nar/gkae609.

Authors

Jianshu Zhao¹ ², Jean Pierre Both³, Luis M Rodriguez-R⁴ ⁵ ⁶, Konstantinos T Konstantinidis¹ ² ⁴

Affiliations

¹ Center for Bioinformatics and Computational Genomics, Georgia Institute of Technology, Atlanta, GA, USA.
² School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA.
³ Université Paris-Saclay, CEA, List, Palaiseau, France.
⁴ School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, GA, USA.
⁵ Department of Microbiology, University of Innsbruck, Innsbruck, Austria.
⁶ Digital Science Center (DiSC), University of Innsbruck, Innsbruck, Austria.

Abstract

Genome search and/or classification typically involves finding the best-match database (reference) genomes and has become increasingly challenging due to the growing number of available database genomes and the fact that traditional methods do not scale well with large databases. By combining k-mer hashing-based probabilistic data structures (i.e. ProbMinHash, SuperMinHash, Densified MinHash and SetSketch) to estimate genomic distance, with a graph-based nearest neighbor search algorithm (Hierarchical Navigable Small World Graphs, or HNSW), we created a new data structure and developed an associated computer program, GSearch, that is orders of magnitude faster than alternative tools while maintaining high accuracy and low memory usage. For example, GSearch can search 8000 query genomes against all available microbial or viral genomes for their best matches (n = ∼318 000 or ∼3 000 000, respectively) within a few minutes on a personal laptop, using ∼6 GB of memory (2.5 GB via SetSketch). Notably, GSearch has an O(log(N)) time complexity and will scale well with billions of genomes based on a database splitting strategy. Further, GSearch implements a three-step search strategy depending on the degree of novelty of the query genomes to maximize specificity and sensitivity. Therefore, GSearch solves a major bottleneck of microbiome studies that require genome search and/or classification.
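
Illustrative sketch (not part of the published abstract, and not GSearch's code): the general idea of estimating genomic distance from hashed k-mers can be shown with a classic k-hash MinHash in Python. This is a deliberately simplified stand-in. It does not implement ProbMinHash, SuperMinHash, Densified MinHash or SetSketch, and it uses a brute-force linear scan where GSearch uses an HNSW graph. All function names, the k-mer size, the sketch size and the toy sequences are assumptions chosen for illustration.

```python
import hashlib
import random


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def seeded_hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer; the seed selects the hash function."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(seq, k=16, num_hashes=64):
    """Classic MinHash: keep the minimum hash value under each hash function."""
    kmer_set = kmers(seq, k)
    return [min(seeded_hash(km, s) for km in kmer_set) for s in range(num_hashes)]


def jaccard_estimate(sketch_a, sketch_b):
    """Fraction of matching sketch positions, which estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def best_match(query_sketch, db_sketches):
    """Brute-force nearest neighbor under distance 1 - J (a linear scan,
    used here only as a stand-in for a graph-based index such as HNSW)."""
    return min(
        db_sketches,
        key=lambda name: 1.0 - jaccard_estimate(query_sketch, db_sketches[name]),
    )


# Toy data (hypothetical): two unrelated random "genomes" and a query that
# is ref_a with five point substitutions.
rng = random.Random(42)
ref_a = "".join(rng.choice("ACGT") for _ in range(1000))
ref_b = "".join(rng.choice("ACGT") for _ in range(1000))
query = list(ref_a)
for pos in rng.sample(range(len(query)), 5):
    query[pos] = rng.choice("ACGT".replace(query[pos], ""))
query = "".join(query)

db = {"ref_a": minhash_sketch(ref_a), "ref_b": minhash_sketch(ref_b)}
query_sketch = minhash_sketch(query)
print(best_match(query_sketch, db))
```

The linear scan costs O(N) sketch comparisons per query. Replacing it with a graph-based index is what gives the O(log(N)) search behavior described in the abstract.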

MeSH terms

Algorithms*
Databases, Genetic
Genome, Viral
Genomics* / methods
Software*

Grants and funding

1759831/National Science Foundation