Introduction: There is growing interest in using electronic health records (EHRs) for chronic disease surveillance. However, these data are convenience samples of in-care individuals and are not representative of the target populations for public health surveillance, which are generally defined as the populations resident in a city, state, or other jurisdiction during the relevant period. We focus on using EHR data to estimate diabetes prevalence among young adults in New York City, because the rising diabetes burden at younger ages calls for better surveillance capacity.
Methods: This article applies common methods for nonprobability samples, including raking, post-stratification, and multilevel regression with post-stratification, to real and simulated data for the cross-sectional estimation of diabetes prevalence among those aged 18-44 years. In the real data analyses, we externally validate city- and neighborhood-level EHR-based estimates against gold-standard estimates from a local health survey. In the data simulations, we probe the extent to which residual biases remain when selection into the EHR sample is non-ignorable.
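As an illustration of the simplest of these adjustments, the following minimal sketch shows post-stratification: stratum-specific prevalence in the sample is reweighted by each stratum's share of the resident population. The strata (age group by sex), the counts, and the function names are hypothetical and are not taken from the study data; raking and multilevel regression with post-stratification are not shown.

```python
# Hypothetical EHR sample: stratum -> (patients in sample, patients with diabetes).
# All counts are invented for illustration only.
ehr_sample = {
    ("18-29", "F"): (400, 8),
    ("18-29", "M"): (200, 4),
    ("30-44", "F"): (900, 72),
    ("30-44", "M"): (500, 50),
}

# Hypothetical resident population counts for the same strata
# (in practice these would come from census-type data).
population = {
    ("18-29", "F"): 30000,
    ("18-29", "M"): 30000,
    ("30-44", "F"): 20000,
    ("30-44", "M"): 20000,
}


def crude_prevalence(sample):
    """Unadjusted prevalence: all cases divided by all patients in the sample."""
    n_total = sum(n for n, _ in sample.values())
    cases_total = sum(cases for _, cases in sample.values())
    return cases_total / n_total


def poststratified_prevalence(sample, pop):
    """Weight each stratum's sample prevalence by its population share."""
    pop_total = sum(pop.values())
    return sum(
        (pop[stratum] / pop_total) * (cases / n)
        for stratum, (n, cases) in sample.items()
    )


crude = crude_prevalence(ehr_sample)
adjusted = poststratified_prevalence(ehr_sample, population)
print(f"crude: {crude:.3f}, post-stratified: {adjusted:.3f}")
```

In this invented example the sample over-represents the older strata, so the crude estimate is higher than the post-stratified one. The adjustment corrects only for imbalance in the stratifying variables.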
Results: In the real data analyses, these methods reduced the impact of selection biases on the citywide prevalence estimate relative to the gold standard. Residual biases remained at the neighborhood level, where prevalence tended to be overestimated, especially in neighborhoods where a higher proportion of residents were captured in the sample. Simulation results demonstrated that these methods may be sufficient, except when selection into the EHR is non-ignorable, that is, when it depends on unmeasured factors or on diabetes status.
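The role of non-ignorable selection can be illustrated with a small expected-value calculation. This is a sketch under stated assumptions, not the simulation design used in the article: the two strata, their sizes and prevalences, the selection probabilities, and the function name are all hypothetical, and the estimate is computed from expected counts (a large-sample approximation) rather than from random draws.

```python
# Hypothetical population: two strata with known size and true prevalence.
strata = {
    "A": {"size": 60000, "prev": 0.02},
    "B": {"size": 40000, "prev": 0.09},
}


def expected_poststratified(strata, p_select):
    """Large-sample post-stratified estimate, where p_select(stratum, has_diabetes)
    is a resident's probability of appearing in the EHR sample."""
    pop_total = sum(s["size"] for s in strata.values())
    estimate = 0.0
    for name, s in strata.items():
        # Expected numbers of sampled cases and non-cases in this stratum.
        cases = s["size"] * s["prev"] * p_select(name, True)
        noncases = s["size"] * (1 - s["prev"]) * p_select(name, False)
        estimate += (s["size"] / pop_total) * cases / (cases + noncases)
    return estimate


true_prev = sum(s["size"] * s["prev"] for s in strata.values()) / sum(
    s["size"] for s in strata.values()
)

# Ignorable selection: probability depends only on the (measured) stratum.
ignorable = expected_poststratified(
    strata, lambda g, d: 0.1 if g == "A" else 0.3
)

# Non-ignorable selection: people with diabetes are twice as likely to be sampled.
nonignorable = expected_poststratified(
    strata, lambda g, d: (0.1 if g == "A" else 0.3) * (2.0 if d else 1.0)
)

print(f"true: {true_prev:.3f}")
print(f"ignorable selection: {ignorable:.3f}")
print(f"non-ignorable selection: {nonignorable:.3f}")
```

When selection depends only on the stratifying variable, the post-stratified estimate matches the true prevalence. When selection also depends on diabetes status, the estimate is biased upward even after adjustment, because reweighting by measured strata cannot undo selection on the outcome itself.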
Conclusions: While EHRs offer potential to innovate on chronic disease surveillance, care is needed when estimating prevalence for small geographies or when selection is non-ignorable.
Keywords: diabetes mellitus; electronic health records; prevalence; selection bias; surveillance.