A comparison of cluster analysis methods using DNA methylation data

Bioinformatics. 2004 Aug 12;20(12):1896-904. doi: 10.1093/bioinformatics/bth176. Epub 2004 Mar 25.

Abstract

Motivation: Aberrant DNA methylation is common in cancer. DNA methylation profiles differ between tumor types and subtypes and provide a powerful diagnostic tool for identifying clusters of samples and/or genes. DNA methylation data obtained with the quantitative, highly sensitive MethyLight technology is not normally distributed; it frequently contains an excess of zeros. Established tools to analyze this type of data do not exist. Here, we evaluate a variety of methods for cluster analysis to determine which is most reliable.

Results: We introduce a Bernoulli-lognormal mixture model for clustering DNA methylation data obtained using MethyLight. The model describes each outcome with a two-part distribution that has a discrete component and a continuous component. We compare it with standard cluster analysis approaches for continuous data and for discrete data. In a simulation study, the two-part model has the lowest classification error rate for mixture outcome data among the approaches compared. The methods are illustrated using DNA methylation data from a study of lung cancer cell lines. Compared with competing hierarchical clustering methods, the mixture model approaches have the lowest cross-validation error for detecting lung cancer subtype (non-small cell versus small cell). The Bernoulli-lognormal mixture assigns observations to subgroups with the lowest uncertainty.
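
To make the idea of a two-part outcome concrete, the following is a minimal sketch rather than the authors' software. It assumes that the discrete part is a Bernoulli indicator for whether a measurement is zero or positive, that positive values are lognormal, and that markers are independent within a cluster. The names `two_part_logpdf`, `posterior_membership`, `p_pos`, `mu` and `sigma` are hypothetical and do not come from the paper.

```python
import math


def two_part_logpdf(y, p_pos, mu, sigma):
    """Log density of one two-part outcome (assumed parametrization).

    y is zero with probability 1 - p_pos; otherwise log(y) is taken to be
    normal with mean mu and standard deviation sigma.
    """
    if y == 0:
        return math.log(1.0 - p_pos)
    z = (math.log(y) - mu) / sigma
    return (math.log(p_pos)
            - math.log(y * sigma * math.sqrt(2.0 * math.pi))
            - 0.5 * z * z)


def posterior_membership(ys, weights, params):
    """Posterior cluster probabilities for one sample.

    ys      : measurements for one sample, one per marker
    weights : mixing proportion of each cluster
    params  : params[k][j] = (p_pos, mu, sigma) for cluster k, marker j
    Markers are treated as independent given the cluster (an assumption).
    """
    log_post = []
    for w, cluster in zip(weights, params):
        lp = math.log(w)
        for y, theta in zip(ys, cluster):
            lp += two_part_logpdf(y, *theta)
        log_post.append(lp)
    # Normalize on the log scale to avoid underflow.
    top = max(log_post)
    unnorm = [math.exp(lp - top) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical example: two clusters, two markers, a sample with two zeros.
weights = [0.5, 0.5]
params = [
    [(0.2, 0.0, 1.0), (0.2, 0.0, 1.0)],  # cluster 0: mostly zeros
    [(0.9, 2.0, 1.0), (0.9, 2.0, 1.0)],  # cluster 1: mostly positive values
]
post = posterior_membership([0, 0], weights, params)
```

In this sketch a sample whose measurements are all zero is assigned mainly to the cluster with the lower probability of a positive value. In a full mixture-model fit the parameters would be estimated from the data, for example with an EM algorithm, which is not shown here.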

Availability: Software is available upon request from the authors.

Supplementary information: http://www-rcf.usc.edu/~kims/SupplementaryInfo.html

Publication types

  • Comparative Study
  • Evaluation Study
  • Research Support, U.S. Gov't, P.H.S.
  • Validation Study

MeSH terms

  • Algorithms*
  • Cluster Analysis*
  • CpG Islands / genetics*
  • DNA Methylation*
  • DNA, Neoplasm / classification
  • DNA, Neoplasm / genetics
  • Genetic Testing / methods
  • Humans
  • Lung Neoplasms / classification*
  • Lung Neoplasms / diagnosis
  • Lung Neoplasms / genetics*
  • Reproducibility of Results
  • Sensitivity and Specificity
  • Sequence Alignment / methods
  • Sequence Analysis, DNA / methods*
  • Software

Substances

  • DNA, Neoplasm