From components to communities: bringing network science to clustering for molecular epidemiology

Molly Liu; Connor Chato; Art F Y Poon

doi:10.1093/ve/vead026

Virus Evol. 2023 Apr 25;9(1):vead026. doi: 10.1093/ve/vead026. eCollection 2023.

Authors

Molly Liu¹, Connor Chato¹, Art F Y Poon¹,²,³

Affiliations

¹ Department of Pathology and Laboratory Medicine, Western University, Dental Sciences Building, Rm. 4044, London, ON N6A 5C1, Canada.
² Department of Microbiology and Immunology, Western University, 1151 Richmond Street, London, ON N6A 3K7, Canada.
³ Department of Computer Science, Western University, Room 355, Middlesex College, London, ON N6A 5B7, Canada.

Abstract

Defining clusters of epidemiologically related infections is a common problem in the surveillance of infectious disease. A popular method for generating clusters is pairwise distance clustering, which assigns pairs of sequences to the same cluster if their genetic distance falls below some threshold. The result is often represented as a network or graph of nodes. A connected component is a set of interconnected nodes in a graph that are not connected to any other node. The prevailing approach to pairwise clustering is to map clusters to the connected components of the graph on a one-to-one basis. We propose that this definition of clusters is unnecessarily rigid. For instance, two connected components can collapse into one cluster with the addition of a single sequence that bridges nodes in the respective components. Moreover, the distance thresholds typically used for viruses like HIV-1 tend to exclude a large proportion of new sequences, making it difficult to train models for predicting cluster growth. These issues may be resolved by revisiting how we define clusters from genetic distances. Community detection is a promising class of clustering methods from the field of network science. A community is a set of nodes that are more densely interconnected relative to the number of their connections to external nodes. Thus, a connected component may be partitioned into two or more communities. Here, we describe community detection methods in the context of genetic clustering for epidemiology, demonstrate how a popular method (Markov clustering) enables us to resolve variation in transmission rates within a giant connected component of HIV-1 sequences, and identify current challenges and directions for further work.
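The contrast between connected components and communities can be illustrated with a minimal Python sketch. It is not the authors' implementation. The distance matrix, the cutoff of 0.015, the inflation parameter of 2.0, and all function names are hypothetical choices made for illustration, and the Markov clustering routine is a bare-bones version written in plain Python so that the example is self-contained. Two tight groups of four sequences are joined by a single pair that falls just below the cutoff. The connected-component definition merges all eight sequences into one cluster, whereas Markov clustering (alternating expansion and inflation of a column-stochastic matrix) recovers the two groups.

```python
from collections import deque

def threshold_graph(dist, cutoff):
    """Adjacency sets: link i and j when their distance is below the cutoff."""
    n = len(dist)
    return [{j for j in range(n) if j != i and dist[i][j] < cutoff}
            for i in range(n)]

def connected_components(adj):
    """Breadth-first search; each component is one cluster in the usual sense."""
    seen, comps = set(), []
    for start in range(len(adj)):
        if start in seen:
            continue
        seen.add(start)
        comp, queue = [], deque([start])
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in adj[u] - seen:
                seen.add(v)
                queue.append(v)
        comps.append(sorted(comp))
    return comps

def normalize_columns(m):
    """Rescale every column to sum to one (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]

def markov_clustering(adj, inflation=2.0, max_iter=100, tol=1e-9):
    """Bare-bones Markov clustering on an unweighted graph (illustrative only)."""
    n = len(adj)
    # Adjacency matrix with self-loops, then column-normalized.
    m = normalize_columns([[1.0 if (i == j or j in adj[i]) else 0.0
                            for j in range(n)] for i in range(n)])
    for _ in range(max_iter):
        # Expansion: square the matrix (two-step random walk).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n))
                     for j in range(n)] for i in range(n)]
        # Inflation: element-wise power, then renormalize columns.
        inflated = normalize_columns([[x ** inflation for x in row]
                                      for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Simplified read-out (an assumption of this sketch): assign each node
    # (column) to the row that holds most of its flow, i.e. its attractor.
    groups = {}
    for j in range(n):
        attractor = max(range(n), key=lambda i: m[i][j])
        groups.setdefault(attractor, []).append(j)
    return sorted(groups.values())

# Hypothetical toy data: two groups of four sequences, within-group
# distance 0.01, between-group distance 0.05, and one bridging pair (3, 4).
n = 8
group = [0, 0, 0, 0, 1, 1, 1, 1]
dist = [[0.0 if i == j else (0.01 if group[i] == group[j] else 0.05)
         for j in range(n)] for i in range(n)]
dist[3][4] = dist[4][3] = 0.012

adj = threshold_graph(dist, cutoff=0.015)
components = connected_components(adj)
communities = markov_clustering(adj)
print(components)   # [[0, 1, 2, 3, 4, 5, 6, 7]]
print(communities)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

In this toy case the single bridging pair is enough to fuse both groups under the component definition, while the flow-based method keeps them apart. For real data sets, a sparse-matrix representation or an established Markov clustering implementation would be more appropriate than nested Python lists.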