Learning debiased graph representations from the OMOP common data model for synthetic data generation

BMC Med Res Methodol. 2024 Jun 22;24(1):136. doi: 10.1186/s12874-024-02257-8.

Abstract

Background: Generating synthetic patient data is crucial for medical research, but common approaches build on black-box models that do not allow for expert verification or intervention. We propose a highly available method that enables synthetic data generation from real patient records in a privacy-preserving and compliant fashion, is interpretable, and allows for expert intervention.

Methods: Our approach ties together two established tools in medical informatics, namely OMOP as a data standard for electronic health records and Synthea as a data synthesis method. For this study, data pipelines were built that extract data from OMOP, convert them into time series format, learn temporal rules using two statistical algorithms (Markov chain, TARM) and three causal discovery algorithms (DYNOTEARS, J-PCMCI+, LiNGAM), and map the outputs into Synthea graphs. The graphs are evaluated quantitatively by their individual and relative complexity and qualitatively by medical experts.
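To make the pipeline idea concrete, the simplest of the listed learners (a first-order Markov chain) can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the function names `learn_transitions` and `to_module`, the toy event sequences, and the simplified state layout (loosely modelled on the `distributed_transition` entries of Synthea's generic module JSON) are all assumptions introduced here.

```python
from collections import Counter, defaultdict


def learn_transitions(sequences):
    """Estimate first-order Markov transition probabilities.

    `sequences` is a list of per-patient event lists, ordered in time
    (hypothetical stand-in for time series extracted from OMOP).
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        # Count each observed (current event -> next event) pair.
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1

    probs = {}
    for state, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        probs[state] = {n: c / total for n, c in nxt_counts.items()}
    return probs


def to_module(probs, name="learned_module"):
    """Map transition probabilities to a simplified Synthea-style graph.

    Assumption: every event becomes a generic "Simple" state; events with
    no outgoing transitions become "Terminal" states. A real Synthea module
    would need typed states (e.g. condition onsets with codes).
    """
    all_states = set(probs) | {n for t in probs.values() for n in t}
    states = {}
    for s in sorted(all_states):
        if s in probs:
            states[s] = {
                "type": "Simple",
                "distributed_transition": [
                    {"transition": n, "distribution": p}
                    for n, p in sorted(probs[s].items())
                ],
            }
        else:
            states[s] = {"type": "Terminal"}
    return {"name": name, "states": states}


# Toy, made-up patient histories (not data from the study).
sequences = [
    ["hypertension", "diabetes", "kidney_disease"],
    ["hypertension", "diabetes"],
    ["hypertension", "kidney_disease"],
    ["diabetes", "kidney_disease"],
]
transitions = learn_transitions(sequences)
module = to_module(transitions)
```

Because every observed event pair becomes an edge, a graph built this way grows quickly with the number of distinct events, which is one reason a rule-pruning or causal discovery step may be preferable on real records.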

Results: The algorithms were found to learn qualitatively and quantitatively different graph representations. Whereas the Markov chain results in extremely large graphs, TARM, DYNOTEARS, and J-PCMCI+ were found to reduce the data dimension during learning. The MultiGroupDirect LiNGAM algorithm was found not to be applicable to the problem at hand.

Conclusion: Only TARM and DYNOTEARS are practical algorithms for real-world data in this use case. As causal discovery is a method to debias purely statistical relationships, the gradient-based causal discovery algorithm DYNOTEARS was found to be most suitable.

Keywords: Causal Discovery; Constraint-based Causal Discovery; DYNOTEARS; Discrete Time Series; Gradient-Based Causal Discovery; Graphical Models; Standardized Electronic Health Records; Structural Equation Models; Synthetic Data Generation; Temporal Association Rule Mining (TARM).

MeSH terms

  • Algorithms*
  • Electronic Health Records* / standards
  • Electronic Health Records* / statistics & numerical data
  • Humans
  • Markov Chains
  • Medical Informatics / methods
  • Medical Informatics / statistics & numerical data