Generation of Differentially Private Heterogeneous Electronic Health Records

06/05/2020
by Kieran Chin-Cheong, et al.

Electronic Health Records (EHRs) are commonly used by the machine learning community for research on problems related to health care and medicine. EHRs have the advantages that they can be easily distributed and contain many features useful for, e.g., classification problems. What sets EHR data sets apart from typical machine learning data sets is that they are often very sparse, due to their high dimensionality, and often contain heterogeneous (mixed) data types. Furthermore, the data sets contain sensitive information, and the resulting privacy concerns limit the distribution of any models learned from them. For these reasons, using EHR data in practice presents a real challenge. In this work, we explore using Generative Adversarial Networks to generate synthetic, heterogeneous EHRs, with the goal of using these synthetic records in place of existing data sets for downstream classification tasks. We further explore applying differential privacy (DP) preserving optimization in order to produce DP synthetic EHR data sets, which provide rigorous privacy guarantees and are therefore shareable and usable in the real world. The performance (measured by AUROC, AUPRC and accuracy) of our model's synthetic, heterogeneous data is very close to that of the original data set (within 3-5% of the baseline) when tested in a binary classification task. Using strong (1, 10^-5) DP, our model still produces data useful for machine learning tasks, albeit incurring a roughly 17% performance penalty. We additionally perform a sub-population analysis and find that, in terms of classification performance, our model does not introduce any bias into the synthetic EHR data compared to the baseline, whether in the male/female populations or in the 0-18, 19-50 and 51+ age groups, for either the non-DP or the DP variant.
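
The abstract does not spell out how the DP-preserving optimization works. As a rough illustration only, one widely used approach (DP-SGD style) clips each example's gradient to a fixed L2 norm and adds Gaussian noise before averaging. The sketch below assumes that approach; the function name `dp_average_gradient` and the parameters `clip_norm` and `noise_multiplier` are hypothetical and are not taken from the paper.

```python
import math
import random


def dp_average_gradient(per_example_grads, clip_norm, noise_multiplier, rng=None):
    """Return a noisy average of per-example gradients.

    Hypothetical helper (not from the paper): each gradient is rescaled so
    its L2 norm is at most clip_norm, the clipped gradients are summed,
    Gaussian noise with standard deviation noise_multiplier * clip_norm is
    added to each coordinate, and the result is divided by the batch size.
    """
    rng = rng or random.Random()
    dim = len(per_example_grads[0])
    total = [0.0] * dim

    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Shrink only gradients whose norm exceeds the clipping threshold.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            total[i] += g * scale

    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_example_grads) for v in noisy]


if __name__ == "__main__":
    batch = [[3.0, 4.0], [0.0, 0.5]]
    print(dp_average_gradient(batch, clip_norm=1.0, noise_multiplier=1.1,
                              rng=random.Random(0)))
```

Clipping bounds how much any single record can move the model, and the noise scale is tied to that bound. Turning a noise multiplier and number of training steps into a concrete guarantee such as (1, 10^-5) requires a separate privacy accounting step, which this sketch does not cover.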
