Purpose: To compare probabilistic and deterministic algorithms for linking mothers and infants within electronic health records (EHRs) to support pregnancy outcomes research.
Methods: The study population was women enrolled in Group Health (Washington State, USA) delivering a liveborn infant from 2001 through 2008 (N = 33,093 deliveries) and infant members born in these years. We linked women to infants by surname, address, and dates of birth and delivery using deterministic and probabilistic algorithms. In a subset previously linked using "gold standard" identifiers (N = 14,449), we assessed each approach's sensitivity and positive predictive value (PPV). For deliveries with no "gold standard" linkage (N = 18,644), we compared the algorithms' linkage proportions. We repeated our analyses in an independent test set of deliveries from 2009 through 2013. We reviewed medical records to validate a sample of pairs apparently linked by one algorithm but not the other (N = 51 or 1.4% of discordant pairs).
Results: In the 2001-2008 "gold standard" population, the probabilistic algorithm's sensitivity was 84.1% (95% CI, 83.5-84.7) and PPV 99.3% (99.1-99.4), while the deterministic algorithm had sensitivity 74.5% (73.8-75.2) and PPV 95.7% (95.4-96.0). In the test set, the probabilistic algorithm again had higher sensitivity and PPV. For deliveries in 2001-2008 with no "gold standard" linkage, the probabilistic algorithm found matched infants for 58.3% and the deterministic algorithm, 52.8%. On medical record review, 100% of linked pairs appeared valid.
Conclusions: A probabilistic algorithm improved linkage proportion and accuracy compared to a deterministic algorithm. Better linkage methods can increase the value of EHRs for pregnancy outcomes research.
Keywords: medical record linkage; pharmacoepidemiology; pregnancy outcome/epidemiology.
Copyright © 2014 John Wiley & Sons, Ltd.