Fixed Character States and the Optimization of Molecular Sequence Data

Cladistics. 1999 Dec;15(4):379-385. doi: 10.1111/j.1096-0031.1999.tb00274.x.

Abstract

A method is proposed to optimize molecular sequence data without multiple sequence alignment. The method treats each entire contiguous stretch of homologous sequence data as an individual character, so that the stretch itself is the homologous unit employed in phylogeny reconstruction. The specific sequences exhibited by the terminal taxa constitute the character states. The number of states is therefore at most the number of unique sequences (or homologous fragments) in the data. A matrix of transformation costs is created to relate the states to one another. Each cell of this matrix is defined as the minimum transformation cost between a pair of states, based on insertion-deletion and base substitution costs. The diagnosis of a topology then follows existing dynamic programming techniques, with the number of states greatly expanded. Because the sequences that can be reconstructed at nodes are limited to those exhibited by the terminals, cladograms constructed in this way may be longer than those of other methods, in that they require a greater number of weighted evolutionary events. Example data, the effects of missing data, restricted ancestors, and putative long-branch attraction are discussed.
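
The two steps described in the abstract, building a state-by-state cost matrix and then diagnosing a topology by dynamic programming, can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names (`edit_cost`, `cost_matrix`, `tree_cost`), the nested-tuple tree encoding, the linear (non-affine) gap cost, the default cost values, and the toy sequences are all assumptions made for the example. The down-pass is a Sankoff-style optimization, assumed here to be the kind of existing dynamic programming technique the abstract refers to.

```python
def edit_cost(a, b, sub=1, indel=1):
    """Minimum cost to transform sequence a into b.

    Assumes a flat substitution cost and a linear per-base indel cost.
    """
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * indel]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else sub)
            cur.append(min(diag, prev[j] + indel, cur[j - 1] + indel))
        prev = cur
    return prev[-1]


def cost_matrix(states, sub=1, indel=1):
    """Pairwise transformation costs between all character states."""
    return [[edit_cost(s, t, sub, indel) for t in states] for s in states]


def tree_cost(tree, tip_seqs, states, costs):
    """Sankoff-style down-pass over a nested-tuple tree of tip names.

    Node states are restricted to the sequences observed at the tips.
    Returns the minimum total weighted cost of the topology.
    """
    inf = float("inf")
    index = {s: k for k, s in enumerate(states)}
    n = len(states)

    def down(node):
        if isinstance(node, str):  # terminal taxon: only its own state is free
            vec = [inf] * n
            vec[index[tip_seqs[node]]] = 0
            return vec
        child_vecs = [down(child) for child in node]
        return [
            sum(min(costs[s][t] + cv[t] for t in range(n)) for cv in child_vecs)
            for s in range(n)
        ]

    return min(down(tree))


# Toy data (hypothetical): four taxa, three unique fragments.
tip_seqs = {"A": "ACGT", "B": "ACGT", "C": "AGT", "D": "AGGT"}
states = sorted(set(tip_seqs.values()))  # number of states <= unique sequences
costs = cost_matrix(states, sub=1, indel=2)
total = tree_cost((("A", "B"), ("C", "D")), tip_seqs, states, costs)
```

With these toy costs (substitution 1, indel 2), `total` is 3: one substitution between `ACGT` and `AGGT` plus one indel between `AGGT` and `AGT`. Because each internal node can only take one of the three observed sequences, the result is an upper bound on what a method free to reconstruct unobserved ancestral sequences could achieve, which is the sense in which the abstract says such cladograms may be longer.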