Validity of Privacy-Protecting Analytical Methods That Use Only Aggregate-Level Information to Conduct Multivariable-Adjusted Analysis in Distributed Data Networks

Xiaojuan Li; Bruce H Fireman; Jeffrey R Curtis; David E Arterburn; David P Fisher; Érick Moyneur; Mia Gallagher; Marsha A Raebel; W Benjamin Nowell; Lindsay Lagreid; Sengwee Toh

doi:10.1093/aje/kwy265

Am J Epidemiol. 2019 Apr 1;188(4):709-723. doi: 10.1093/aje/kwy265.

Affiliations

¹ Department of Population Medicine, Harvard Medical School and Harvard Pilgrim Health Care Institute, Boston, Massachusetts.
² Division of Research, Kaiser Permanente Northern California, Oakland, California.
³ Division of Clinical Immunology and Rheumatology, School of Medicine, University of Alabama at Birmingham, Birmingham, Alabama.
⁴ Kaiser Permanente Washington Health Research Institute, Seattle, Washington.
⁵ The Permanente Medical Group, Kaiser Permanente Northern California, Oakland, California.
⁶ StatLog Econometrics Inc., Montreal, Quebec, Canada.
⁷ Institute for Health Research, Kaiser Permanente Colorado, Denver, Colorado.
⁸ CreakyJoints, Global Healthy Living Foundation, Upper Nyack, New York.
⁹ Limeade, Bellevue, Washington.

Abstract

Distributed data networks enable large-scale epidemiologic studies, but protecting privacy while adequately adjusting for a large number of covariates continues to pose methodological challenges. Using 2 empirical examples within a 3-site distributed data network, we tested combinations of 3 aggregate-level data-sharing approaches (risk-set, summary-table, and effect-estimate), 4 confounding adjustment methods (matching, stratification, inverse probability weighting, and matching weighting), and 2 summary scores (propensity score and disease risk score) for binary and time-to-event outcomes. We assessed the performance of combinations of these data-sharing and adjustment methods by comparing their results with results from the corresponding pooled individual-level data analysis (reference analysis). For both types of outcomes, the method combinations examined yielded results identical or comparable to the reference results in most scenarios. Within each data-sharing approach, comparability between aggregate- and individual-level data analysis depended on adjustment method; for example, risk-set data-sharing with matched or stratified analysis of summary scores produced identical results, while weighted analysis showed some discrepancies. Across the adjustment methods examined, risk-set data-sharing generally performed better, while summary-table and effect-estimate data-sharing more often produced discrepancies in settings with rare outcomes and small sample sizes. Valid multivariable-adjusted analysis can be performed in distributed data networks without sharing of individual-level data.
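
As a rough illustration of the effect-estimate data-sharing approach named above, the sketch below shows how an analysis center might combine site-specific estimates without seeing individual-level data. It assumes each site shares only an adjusted log effect estimate (for example, a log hazard ratio) and its standard error, and that these are pooled by fixed-effect inverse-variance weighting. The abstract does not state the pooling method used in the study, so this weighting scheme, the function name `pool_fixed_effect`, and the three site values are all hypothetical.

```python
import math


def pool_fixed_effect(estimates):
    """Pool site-level estimates by inverse-variance weighting.

    estimates: list of (log_effect, standard_error) tuples, one per site.
    Returns (pooled_log_effect, pooled_standard_error).
    This is an assumed pooling scheme, not necessarily the one used in the study.
    """
    if not estimates:
        raise ValueError("at least one site estimate is required")
    # Each site's weight is the reciprocal of its variance.
    weights = [1.0 / (se ** 2) for _, se in estimates]
    total_weight = sum(weights)
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Hypothetical aggregate-level inputs from a 3-site network:
# (adjusted log hazard ratio, standard error) per site.
site_estimates = [(0.10, 0.20), (0.25, 0.15), (0.05, 0.30)]

log_hr, se = pool_fixed_effect(site_estimates)
hazard_ratio = math.exp(log_hr)
# Wald-type 95% confidence interval on the ratio scale.
ci_95 = (math.exp(log_hr - 1.96 * se), math.exp(log_hr + 1.96 * se))
print(f"pooled HR = {hazard_ratio:.3f}, 95% CI = ({ci_95[0]:.3f}, {ci_95[1]:.3f})")
```

Because only two numbers per site cross institutional boundaries, this is the least information-intensive of the three approaches. That is consistent with the abstract's observation that effect-estimate sharing more often diverged from the pooled individual-level analysis when outcomes were rare and samples small, settings in which site-level estimates and their standard errors are themselves unstable.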

Keywords: confounding control; data-sharing; disease risk score; distributed data networks; meta-analysis; multicenter studies; privacy protection; propensity score.

© The Author(s) 2019. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, P.H.S.

MeSH terms

Confidentiality / standards*
Data Aggregation*
Epidemiologic Research Design*
Humans
Information Dissemination / methods*
Information Services*
Multivariate Analysis
Privacy
Propensity Score
