Studying the risk of spontaneous abortion (SAB) or termination in healthcare utilization databases requires algorithms that estimate gestational age (GA). Using Medicaid data, we developed a hierarchical algorithm to classify pregnancy outcomes. We identified the subset of potential SAB and termination cases and abstracted the GA from linked electronic medical records (gold standard). We developed three approaches: (1) assign the median GA for SAB and termination cases in the US; (2) draw a random GA from the population distributions; (3) estimate GA based on regression models. Algorithm performance was assessed based on the proportion of pregnancies with estimated GA within 1-4 weeks of the gold standard, the mean squared error (MSE), and the R-squared. Approach 1 and Approach 3 had similar performance, though Approach 3 using random forest models with variables selected via the Boruta algorithm had better MSE and R-squared. For SAB, 58.0% of pregnancies were correctly classified within 2 weeks of the gold standard (MSE: 8.7; R-squared: 0.09). For termination, the proportion was 66.3% (MSE: 11.7; R-squared: 0.35). SABs and terminations can be studied in healthcare utilization data with careful implementation of validated algorithms, although a higher level of GA misclassification is expected than for live births.
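The two simplest approaches and the three performance measures named above can be illustrated with a minimal Python sketch. It is not the study's code. The function names, the toy GA values, and the definition of R-squared as 1 - SS_res/SS_tot are all assumptions made for illustration; the abstract does not state how R-squared was computed.

```python
import random
from statistics import mean, median


def assign_median(reference_gas, n):
    """Approach 1 (sketch): give all n pregnancies the median reference GA."""
    return [median(reference_gas)] * n


def draw_random(reference_gas, n, seed=0):
    """Approach 2 (sketch): draw each GA at random from the reference values."""
    rng = random.Random(seed)
    return [rng.choice(reference_gas) for _ in range(n)]


def proportion_within(estimated, gold, weeks):
    """Share of pregnancies whose estimated GA is within `weeks` of the gold standard."""
    hits = sum(1 for e, g in zip(estimated, gold) if abs(e - g) <= weeks)
    return hits / len(gold)


def mse(estimated, gold):
    """Mean squared error between estimated and gold-standard GA."""
    return mean((e - g) ** 2 for e, g in zip(estimated, gold))


def r_squared(estimated, gold):
    """R-squared as 1 - SS_res / SS_tot (an assumed definition)."""
    gold_mean = mean(gold)
    ss_res = sum((g - e) ** 2 for e, g in zip(estimated, gold))
    ss_tot = sum((g - gold_mean) ** 2 for g in gold)
    return 1 - ss_res / ss_tot


# Toy values in weeks, invented for illustration only (not study data).
gold = [6, 8, 10, 12, 9]
estimated = [7, 8, 13, 11, 9]

within_2 = proportion_within(estimated, gold, weeks=2)
error = mse(estimated, gold)
r2 = r_squared(estimated, gold)
median_based = assign_median(gold, len(gold))
random_based = draw_random(gold, len(gold))
```

Under this definition of R-squared, assigning a single constant GA to every pregnancy cannot score above zero, which is one reason MSE and the within-k-weeks proportion are useful to report alongside it.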
Keywords: Medicaid; gestational age; spontaneous abortion; termination; validation study.
© The Author(s) 2024. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.