Fidelity and Privacy of Synthetic Medical Data

01/18/2021
by Ofer Mendelevitch et al.

The digitization of medical records ushered in a new era of big data for clinical science, and with it the possibility that data could be shared to multiply insights beyond what investigators could extract from paper records. The need to share individual-level medical data to accelerate innovation in precision medicine continues to grow, and has never been more urgent as scientists grapple with the COVID-19 pandemic. However, enthusiasm for the use of big data has been tempered by a fully appropriate concern for patient autonomy and privacy. In practice, the risk that private or confidential information about an individual can be extracted makes data sharing difficult, since significant infrastructure and data governance must be established before data can be released. Although HIPAA provided de-identification as an approved mechanism for data sharing, linkage attacks were identified as a major vulnerability. A variety of mechanisms have been established to avoid leaking private information, such as suppressing or abstracting fields, strictly limiting the amount of information that can be shared, or employing mathematical techniques such as differential privacy. Another approach, which we focus on here, is creating synthetic data that mimics the underlying real data. For synthetic data to be a useful mechanism in support of medical innovation and a proxy for real-world evidence, one must demonstrate two properties of the synthetic dataset: (1) any analysis on the real data must be matched by analysis of the synthetic data (statistical fidelity), and (2) the synthetic data must preserve privacy, with minimal risk of re-identification (privacy guarantee). In this paper we propose a framework for quantifying the statistical fidelity and privacy-preservation properties of synthetic datasets, and demonstrate these metrics for synthetic data generated by Syntegra technology.
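To make the two properties concrete, the following is a minimal, hypothetical Python sketch of how a fidelity check and a re-identification check might look for a simple numeric table. It does not reproduce the paper's metrics or Syntegra's methodology: the column names, the per-column Wasserstein distance used as a fidelity measure, and the nearest-neighbor distance used as a rough privacy signal are all illustrative assumptions.

```python
# Illustrative sketch only; not the paper's actual fidelity or privacy metrics.
# It compares a real and a synthetic tabular dataset on (1) statistical fidelity,
# via per-column marginal distributions, and (2) a crude privacy signal, via
# nearest-neighbor distances from synthetic to real records.

import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def marginal_fidelity(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.Series:
    """Wasserstein distance between real and synthetic marginals, per numeric column.

    Smaller values indicate the synthetic marginals track the real ones more closely.
    """
    cols = real.select_dtypes(include=np.number).columns
    return pd.Series(
        {c: wasserstein_distance(real[c].dropna(), synthetic[c].dropna()) for c in cols}
    )


def nearest_record_distances(real: pd.DataFrame, synthetic: pd.DataFrame) -> np.ndarray:
    """Distance from each synthetic record to its closest real record.

    Many near-zero distances would suggest the generator is copying (or nearly
    copying) real individuals, which would be a potential privacy concern.
    """
    cols = real.select_dtypes(include=np.number).columns
    scaler = StandardScaler().fit(real[cols])
    nn = NearestNeighbors(n_neighbors=1).fit(scaler.transform(real[cols]))
    dists, _ = nn.kneighbors(scaler.transform(synthetic[cols]))
    return dists.ravel()


if __name__ == "__main__":
    # Toy placeholder data; in practice these would be real and synthetic EHR tables.
    rng = np.random.default_rng(0)
    real = pd.DataFrame({"age": rng.normal(55, 12, 1000), "sbp": rng.normal(130, 15, 1000)})
    synthetic = pd.DataFrame({"age": rng.normal(54, 13, 1000), "sbp": rng.normal(131, 16, 1000)})

    print(marginal_fidelity(real, synthetic))
    print("median nearest-record distance:", np.median(nearest_record_distances(real, synthetic)))
```

A real evaluation would also have to cover categorical fields, joint and conditional distributions, and explicit attack models such as linkage or membership-inference attacks, none of which this sketch attempts.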


