Bacterial genomes are chimeras of DNA of different ancestries. Deconstructing chimeric genomes is central to understanding the evolutionary trajectories of their disparate components and thus the organisms as a whole in the light of their evolutionary contexts. Of specific interest is to delineate and quantify native (vertically inherited) and alien (horizontally acquired) components of bacterial genomes and also specify genomic fractions that represent different donor sources. An agglomerative clustering procedure that prioritizes grouping of proximal similar genomic segments has previously been invoked for this purpose in conjunction with a recursive segmentation procedure. Surprisingly, however, the relative strengths and weaknesses of different clustering approaches to deciphering bacterial chimerism have not yet been investigated, despite the need to robustly interpret tens of thousands of completely sequenced bacterial genomes and nearly complete genome assemblies available in the public databases. To bridge this knowledge gap and develop more robust approaches, we assessed different clustering methods, including segment order based (proximal) clustering, hierarchical clustering, affinity propagation clustering, and a novel network clustering approach on chimeric genomes modeled after bacterial genomes representing a broad spectrum of compositional complexity. Although segment order-based clustering and network clustering compared favorably with the other approaches in discriminating between native and alien DNA at genome optimized settings, network clustering did consistently better than other methods at parametric settings optimized on all test genomes together. Segment order-based clustering and hierarchical clustering outperformed other methods in alien DNA identification while preserving donor identity in the genomes. Our study highlights the strengths and weaknesses of different approaches and suggests how this can be leveraged to achieve a more robust deconstruction of bacterial chimerism.
Keywords: bacteria; chimeric genomes; gene clustering; network.