DataLink record linkage software applied to the cancer registry of Murcia, Spain

Methods Inf Med. 2008;47(5):448-53. doi: 10.3414/me0529.

Abstract

Objectives: Record linkage between data sets is relatively simple when each data set contains unique, universal, permanent, and common identifying variables. This situation occurs infrequently, so probabilistic methods are needed to identify corresponding records. DataLink was tested to determine whether clustering techniques improve computational performance with minimal loss of accuracy.
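
To illustrate what a probabilistic method can look like, the sketch below scores a pair of records with agreement weights in the style of the Fellegi-Sunter model, a widely used formulation. The abstract does not state which model DataLink uses, so the model choice, the field names, and the m/u probabilities are all assumptions made for illustration.

```python
import math

# Hypothetical per-field probabilities (not taken from the paper):
#   m = P(field agrees | records refer to the same person)
#   u = P(field agrees | records refer to different people)
FIELD_PROBS = {
    "surname": (0.95, 0.01),
    "birth_year": (0.98, 0.02),
    "sex": (0.99, 0.50),
}


def match_weight(rec_a, rec_b):
    """Sum log2 likelihood ratios over fields (Fellegi-Sunter style)."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement raises the score
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total


a = {"surname": "GARCIA", "birth_year": 1950, "sex": "F"}
b = {"surname": "GARCIA", "birth_year": 1950, "sex": "F"}
c = {"surname": "LOPEZ", "birth_year": 1962, "sex": "M"}

print(match_weight(a, b))  # positive: all fields agree
print(match_weight(a, c))  # negative: all fields disagree
```

In practice a pair is classified as a match, a non-match, or a case for manual review by comparing its score against two thresholds chosen by the analyst.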

Methods: The study uses cancer registry data, including hospital discharge and pathology reports from two hospitals in the Murcia Region for the years 2002-2003. These data are standardized before running DataLink. The original version of DataLink compares all of the records one by one; two later versions apply clustering, which filters the comparisons on one or more variables. Computing time and the proportion of detected matches were measured for each version.
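
The difference between exhaustive comparison and clustering can be sketched as follows. This is a minimal illustration of filtering candidate pairs on one or more variables (often called blocking), not DataLink's actual implementation; the record fields and the function names `all_pairs` and `blocked_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs(records):
    """Candidate pairs when every record is compared with every other."""
    return list(combinations(range(len(records)), 2))


def blocked_pairs(records, keys):
    """Candidate pairs limited to records that share the same values for `keys`."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[tuple(rec[k] for k in keys)].append(idx)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


# Made-up records for illustration only.
records = [
    {"surname": "GARCIA", "birth_year": 1950, "sex": "F"},
    {"surname": "GARCIA", "birth_year": 1950, "sex": "F"},
    {"surname": "LOPEZ", "birth_year": 1962, "sex": "M"},
    {"surname": "LOPEZ", "birth_year": 1963, "sex": "M"},
    {"surname": "MARTINEZ", "birth_year": 1950, "sex": "F"},
]

print(len(all_pairs(records)))                                  # 10
print(len(blocked_pairs(records, ("birth_year",))))             # 3
print(len(blocked_pairs(records, ("surname", "birth_year"))))   # 1
```

Filtering on more variables shrinks the number of comparisons further, but any true duplicate whose filter variable was recorded differently (for example the two LOPEZ records, if they were the same patient) is never compared and is therefore missed.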

Results: The two clustering versions achieve 96.1% and 96.2% accuracy, respectively. Compared with the original version, they improve computational time by 97.3% and 98.6% and lose 0.36% and 1.07% of real duplicates, respectively.
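
To put these percentages in more familiar terms, the short calculation below converts them into speedup factors and retained-duplicate rates. It assumes that the reported improvement in computational time means a reduction in run time relative to the original version; under that reading the two clustering versions run roughly 37 and 71 times faster. The function names are illustrative.

```python
def speedup_factor(time_reduction_pct):
    """Speedup implied by a percentage reduction in run time."""
    remaining_fraction = 1.0 - time_reduction_pct / 100.0
    return 1.0 / remaining_fraction


def duplicates_retained_pct(lost_pct):
    """Percentage of real duplicates still detected."""
    return 100.0 - lost_pct


# Figures taken from the Results paragraph above.
for label, reduction, lost in [
    ("first clustering version", 97.3, 0.36),
    ("second clustering version", 98.6, 1.07),
]:
    print(
        f"{label}: about {speedup_factor(reduction):.0f}x faster, "
        f"{duplicates_retained_pct(lost):.2f}% of real duplicates retained"
    )
```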

Conclusions: DataLink implements deterministic and probabilistic record linkage to eliminate duplicates and to merge new information with existing cases. The standardization of variables to a common format has been adapted to the characteristics of Spanish language data. Clustering techniques greatly reduce computational time while largely preserving accuracy in the detection of corresponding records.
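
As an illustration of what standardizing Spanish-language name fields might involve, the sketch below upper-cases a name, removes diacritics, splits hyphenated surnames, and drops common particles. The abstract does not describe DataLink's actual rules, so every step here, including the particle list and the function name `standardize_name`, is an assumption.

```python
import unicodedata

# Hypothetical list of particles to ignore when comparing names.
PARTICLES = {"DE", "DEL", "LA", "LAS", "LOS", "Y"}


def standardize_name(raw):
    """Return an upper-case, accent-free name with particles removed."""
    # NFD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", raw.upper())
    # Dropping combining marks also folds N-with-tilde to plain N.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = stripped.replace("-", " ").split()
    return " ".join(t for t in tokens if t not in PARTICLES)


print(standardize_name("María del Carmen García-López"))  # MARIA CARMEN GARCIA LOPEZ
print(standardize_name("  muñoz  de la  Torre "))          # MUNOZ TORRE
```

Folding accents and particles makes differently typed versions of the same name compare as equal, at the cost of occasionally merging names that are genuinely distinct.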

MeSH terms

  • Child
  • Data Collection / methods*
  • Humans
  • Neoplasms*
  • Registries*
  • Software*
  • Spain