k-Means NANI: An Improved Clustering Algorithm for Molecular Dynamics Simulations

Lexin Chen; Daniel R Roe; Matthew Kochert; Carlos Simmerling; Ramón Alain Miranda-Quintana

doi:10.1021/acs.jctc.4c00308

J Chem Theory Comput. 2024 Jul 9;20(13):5583-5597. doi: 10.1021/acs.jctc.4c00308. Epub 2024 Jun 21.

Authors

Lexin Chen^{1,2}, Daniel R Roe³, Matthew Kochert^{4,5}, Carlos Simmerling^{5,4,6}, Ramón Alain Miranda-Quintana^{1,2}

Affiliations

¹ Department of Chemistry, University of Florida, Gainesville, Florida 32611, United States.
² Quantum Theory Project, University of Florida, Gainesville, Florida 32611, United States.
³ Laboratory of Computational Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland 20892, United States.
⁴ Laufer Center for Physical & Quantitative Biology, Stony Brook University, Stony Brook, New York 11794, United States.
⁵ Department of Chemistry, Stony Brook University, Stony Brook, New York 11794, United States.
⁶ Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, New York 11794, United States.

Abstract

One of the key challenges of k-means clustering is the seed selection or the initial centroid estimation since the clustering result depends heavily on this choice. Alternatives such as k-means++ have mitigated this limitation by estimating the centroids using an empirical probability distribution. However, with high-dimensional and complex data sets such as those obtained from molecular simulation, k-means++ fails to partition the data in an optimal manner. Furthermore, stochastic elements in all flavors of k-means++ will lead to a lack of reproducibility. K-means N-Ary Natural Initiation (NANI) is presented as an alternative to tackle this challenge by using efficient n-ary comparisons to both identify high-density regions in the data and select a diverse set of initial conformations. Centroids generated from NANI are not only representative of the data and different from one another, helping k-means to partition the data accurately, but also deterministic, providing consistent cluster populations across replicates. From peptide and protein folding molecular simulations, NANI was able to create compact and well-separated clusters as well as accurately find the metastable states that agree with the literature. NANI can cluster diverse data sets and be used as a standalone tool or as part of our MDANCE clustering package.
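
To make the reproducibility argument concrete, here is a minimal sketch of k-means with a deterministic, density-then-diversity initialization. It is not the NANI algorithm and does not use the MDANCE API. The density proxy (mean distance to the nearest neighbors), the max-min diversity selection, and the names `deterministic_seeds`, `k_means`, `dense_fraction`, and `n_neighbors` are all assumptions introduced for illustration, standing in for the n-ary comparisons the abstract describes.

```python
import numpy as np

def deterministic_seeds(X, k, dense_fraction=0.2, n_neighbors=5):
    """Pick k initial centroids without random numbers (illustrative stand-in, not NANI)."""
    # Pairwise Euclidean distances; O(N^2) memory, fine for a small toy data set.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Density proxy (assumption): a small mean distance to the nearest
    # neighbours marks a high-density point. Column 0 is the self-distance.
    knn = np.sort(d, axis=1)[:, 1:n_neighbors + 1].mean(axis=1)
    n_dense = max(k, int(np.ceil(dense_fraction * len(X))))
    dense = np.argsort(knn, kind="stable")[:n_dense]
    # Diversity step: start from the densest point, then repeatedly add the
    # dense candidate that is farthest from every seed chosen so far.
    chosen = [dense[0]]
    while len(chosen) < k:
        min_d = d[np.ix_(dense, chosen)].min(axis=1)
        chosen.append(dense[int(np.argmax(min_d))])
    return X[chosen].copy()

def k_means(X, k, n_iter=100):
    """Plain Lloyd iterations started from the deterministic seeds."""
    centroids = deterministic_seeds(X, k)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dist.argmin(axis=1)
        # Keep the old centroid if a cluster happens to become empty.
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Toy data: three well-separated 2D blobs. The RNG seed fixes only the data;
# the clustering itself contains no stochastic step.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [20.0, 0.0], [0.0, 20.0]])
X = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])

labels_a, _ = k_means(X, 3)
labels_b, _ = k_means(X, 3)
print(np.array_equal(labels_a, labels_b))  # repeated runs give the same labels
```

Because no step draws random numbers, repeated runs on the same data return identical labels, which is the property the abstract contrasts with k-means++. For real trajectories the full pairwise distance matrix would be too large; avoiding it is one motivation for the n-ary comparisons mentioned above.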
