cor-gan
:unlock: COR-GAN: Correlation-Capturing Convolutional Neural Networks for Generating Synthetic Healthcare Records
view repo
Deep learning models have demonstrated high-quality performance in areas such as image classification and speech processing. However, creating a deep learning model using electronic health record (EHR) data, requires addressing particular privacy challenges that are unique to researchers in this domain. This matter focuses attention on generating realistic synthetic data while ensuring privacy. In this paper, we propose a novel framework called correlation-capturing Generative Adversarial Network (corGAN), to generate synthetic healthcare records. In corGAN we utilize Convolutional Neural Networks to capture the correlations between adjacent medical features in the data representation space by combining Convolutional Generative Adversarial Networks and Convolutional Autoencoders. To demonstrate the model fidelity, we show that corGAN generates synthetic data with performance similar to that of real data in various Machine Learning settings such as classification and prediction. We also give a privacy assessment and report on statistical analysis regarding realistic characteristics of the synthetic data. The software of this work is open-source and is available at: https://github.com/astorfi/cor-gan.
READ FULL TEXT VIEW PDF:unlock: COR-GAN: Correlation-Capturing Convolutional Neural Networks for Generating Synthetic Healthcare Records
Adoption of Electronic Health Records (EHRs) by the healthcare community, along with the massive quantity of available data, has led to calls for employing promising data-driven methods inspired by Artificial Intelligence (AI). Data-powered tools alter how clinicians and healthcare bureaus approach and satisfy patients’ needs for care. However, extending EHR adoption to also support data access for research and development purposes, is far from being practical in the healthcare domain, due to privacy restrictions.
De-identification of EHR data is often employed for mitigation of privacy risks. However, questions and doubts have increased about the safety of prolonged use of de-identification methods regarding their vulnerability to information leakage [10]. Accordingly, more recent attention has focused on Synthetic Data Generation (SDG) which can satisfy reliably the needs for privacy.
We aim to create realistic synthetic EHR data by Generative Adversarial Networks (GANs), which have been successfully employed in applications such as image generation [27, 4, 17], video generation [33, 31], and image translation [15, 18, 8]. Contributions of this work include:
We propose an efficient architecture to generate synthetic healthcare records using Convolutional GANs and Convolutional Autoencoders (CAs) which we call “corGAN”. We demonstrate that corGAN can effectively generate both discrete and continuous synthetic records.
We demonstrate the effectiveness of utilizing Convolutional Neural Networks (CNNs) as opposed to Multilayer Perceptrons to capture inter-correlation between features.
We show that corGAN can generate realistic synthetic data that performs similarly to real data on classification tasks, according to our analysis and assessments.
We report on a privacy assessment of the model and demonstrate that corGAN provides an acceptable level of privacy, by varying the amount of synthetically generated data and amount of data known to an adversary.
Some distinguished efforts were conducted in a variety of domains about synthetic data generation [34, 5, 23, 25]. But some of these works are overly disease-specific, unrealistic, or have failed to provide any substantial measurements regarding privacy.
Highly relevant is “medGAN” [6], using GANs for synthetic discrete EHR data. But in contrast to medGAN, we consider the temporal nature of the data and local correlation between features. Instead of regular multi-layer perceptrons, we base our architecture on CNNs and provide empirical results to demonstrate the superior performance in capturing inter-correlations between data features.
Many discrete variables (e.g., diagnosis, procedure code) are available in the dataset. Let’s assume there are
discrete variables and the vector
(where indicates natural numbers including zero) is in a vector space. The dimension designates the number of incidents of the variable in a subject’s medical records. We can represent a patient’s visit (encounter event) by a binary vector , where the dimension shows whether the variable occurred in the patient record. We represent the input space as a matrix in which columns indicate discrete variables in the EHR record. Such representation extracts multiple patients’ records representing different points in time.A Generative Adversarial Network (GAN), introduced in [11], is a combination of two neural networks, a discriminator and a generator. The whole network is trained in an iterative process. First the generator network produces a fake sample. Then the discriminator network tries to determine whether this sample (ex.: an input image) is real or fake, i.e., whether it came from the real training data. The goal of the generator is to fool the discriminator so it believes the artificial (i.e., generated) samples synthesized by the generator are real.
The generator is to learn the distribution over data x. In that regard, represents the input noise variables distribution which generates random data shown by . The function is differentiable with parameters . The discriminator, , decides if its input data is real or fake. is trained to distinguish the training samples from by minimizing ). and perform the following min-max game with value function V(G, D):
(1) |
We use the architecture in Fig. 1. The discrete input represents the source EHR data; is the random distribution for the generator ; is the employed neural network architecture; refers to the decoding function which is used to transform the generator continuous output to their equivalent discrete values. The discriminator attempts to distinguish real input from the discrete synthetic output . For the generator and the discriminator, a 1-Dimensional Convolutional GAN architecture is utilized.
Consider the decoding function . GANs are known for generating continuous values and encountering trouble when dealing with discrete variables. Recently, researchers proposed solutions to the problem of generating discrete variables [13, 35, 19, 37]. Some approaches use the indirect method such that they create a separate model to transform continuous to discrete data [6]. Regarding EHR data generation, we are dealing with discrete data. Hence, our generative model should create discrete data directly, or there should be a function to transform the continuous data samples into discrete equivalents. We chose the second approach, and employed autoencoders.
Considering Fig. 1, the autoencoder digests (right part of the figure) discrete values and reconstructs the same discrete values as well. The autoencoder structure consists of two main elements: encoder and decoder. While encoding, the autoencoder transforms the discrete space into a corresponding (we call it equivalent as well) continuous space (the output of the hidden layer) and the decoder reverses the process. The Binary Cross-Entropy (BCE)loss function is used for training the autoencoder:
(2) | |||
(3) |
We used denoising autoencoders
[32] to create a more robust pretrained model as we do not expect our model to always generate perfect discrete samples. After training the autoencoder, we need to use its decoder to convert continuous values to their associated discrete values. The cost function to train our proposed architecture is similar to Eq. 1 with the exception of operating the decoder on top of the generator.(4) |
As we are dealing with 1D data, we chose the 1-Dimensional Convolutional Autoencoders (1D-CAEs) as a particular form of the regular CAEs. This approach enables us to capture the neighboring feature correlations. We call our proposed architecture corGAN.
It is worth noting that for our experiments with discrete variables, we round the values of to their nearest integers (the outcome is zero or one) to guarantee that we train and evaluate the discriminator on discrete values.
One of the primary crash forms for GAN is for the generator to collapse to a set of parameters and always generate the same sample. This phenomenon is called Mode Collapse. Some approaches have been proposed to handle the mode collapse issue such as minibatch discrimination [28] and unrolled GANs [24]
. We utilized minibatch discrimination due to its better stability. We also utilized batch normalization
[14] to improve the generator’s learning abilities. Furthermore, we used LeakyRelu activation unit [21]as it consistently demonstrated equal or better results over other common activation functions
[36].We utilize the Membership Inference (MI) attack as an approach to measure the privacy. Membership Inference (MI), proposed in [30], refers to determining whether a given record generated by a known machine learning model was used as part of the training data. The membership inference problem is basically the well-known problem of presence disclosure of an individual [2, 9, 29]. If the adversary has complete access to the records of a particular patient and can recognize their employment in the model training, that is an indication of information leakage, as it can jeopardize the whole dataset privacy or at least the particular patient’s private information. Here, we will assume the adversary has the synthetically generated data as well as a portion of the compromised real data.
We evaluated corGAN with two datasets. First, we explain the datasets and baseline models. Then, we provide the results regarding the evaluation of the synthetic data in terms of the realistic characteristics. Finally, we report on a privacy assessment of the model.
We used two publicly available datasets in this study. The first is the MIMIC-III dataset [16] consisting of the medical records of almost 46K patients. From MIMIC-III, we extracted ICD-9 codes only. We represent a patient record as a fixed-size vector with 1071 entries for each patient record. This dataset is used for experiments with binary discrete variables.
We conducted our experiments regarding continuous variables with the UCI Epileptic Seizure Recognition dataset [1]. This dataset characterizes brain activities. The core task is classification, regarding if a sample indicates a seizure activity. The number of features and samples are 179 and 11500, respectively. Almost 20% of the samples are categorized as seizure activity. So, we are dealing with an unbalanced dataset in a binary classification setting. The first 178 features are the values of the Electroencephalogram (EEG) recordings at different time points, and the last feature is the class. There are five values for the class label (). Except for , the rest of the classes indicate subjects who do not have an epileptic seizure. Dataset statistics are given in Table 1.
Dataset | UCI |
---|---|
# of patients | 500 |
Each patient’s data points | 4097 |
Each patient’s duration of recording | 23.5 seconds |
# data points chunks per patient | 23 |
# of data points per chunk | 178 |
Duration per chunk | 1 second |
Data type | Continuous EEG |
To show the effectiveness of our proposed architecture, we compare our results with different baseline methods as below:
Stacked Deep Boltzmann Machines (DBMs):
We trained a stacked Deep Boltzmann Machine (DBM)
[12]. After which, we used Gibbs sampling to generate synthetic binary samples. All hidden layers have 256 dimensions. We employed greedy contrastive divergence to create the model. We ran Gibbs sampling for 500 iterations per sample.
Variational Autoencoder (VAE): We used VAEs [20]
as one of our baseline models. For both the encoder and the decoder, we used 1D convolutional neural networks, each having two hidden layers. All hidden layers have the size of 128. We trained VAE with Adam optimizer for 500 epochs and for the batch size of 500.
In this section, we report our evaluation results regarding the quality of the synthetic data and the privacy risks. Here, we divide the dataset into a training and a test set , where is the feature size and is consistent for all sets. We use to train the models, then generate synthetic samples using the trained model. Noted that we usually use the same number of samples for and .
We use the following two metrics to evaluate our synthetically generated data.
Dimension-wise probability:
As a basic sanity check to see if our proposed models learned the distribution of the real data (for each dimension), we report the dimension-wise probability. This measurement refers to the Bernoulli success probability of each dimension (each dimension is a unique ICD-9 code).
Dimension-wise prediction: This approach measures how robust the model catches the inter-dimensional connections of the real data samples. Assume is used to generate . Then, one random fixed dimension () from each and are selected as and . We call it the testing dimension. The rest of the dimensions ( and
) are used to train a classifier, which aims to predict the value of the testing dimension of the test set
.Binary Classification: We use this metric for our experiments with continuous data. To empirically verify the quality of the synthetic data, we consider two different settings. (A) Train and test the predictive models on the real data. (B) train the predictive model on synthetic data and test it on the real data. If the model evaluated in setting (B), represents competitive results with the same model performed in setting (A), we can conclude the synthetic data has good predictive modeling similar to the real data.
For Dimension-wise probability and Dimension-wise prediction experiments, we used MIMIC-III dataset and for Binary Classification experiments we used UCI Epileptic Seizure Recognition dataset.
The results regarding the investigation of dimension-wise probability are depicted in Fig. 2. As can be seen, the corGAN is superior compared to other methods. An interesting observation is that the VAE is never generating any synthetic data for which the probability of occurrence of a diagnosis code is higher than its counterpart in the real data.
For dimension-wise prediction (Table 2
), we use the following classifiers the predictive model types: Logistic Regression, Random Forests
[3], Linear SVM [7], and Decision Tree
[26]. For our experiments, we conduct E=100 number of runs. In each run, we pick a random testing dimension from test set () and will train each predictive model on and . This results in having two models as and . The superscript refers to the kind of model that was used to be trained on both real and synthetic sets. We then report the performance over all predictive models and for all experiments using the F1-score variation. F1-score variation means the difference between the F-1 score obtained from training the model on synthetic and real datasets.For the MIMIC dataset experiments, although the temporal information is ignored, the medical health diagnosis are sorted in terms of similarity in the feature vector of the MIMIC data. Therefore, 1D CNNs capture the correlation between features rather than the temporal information.
Generative Model | F1-Score |
---|---|
DBM [12] | 0.12 0.052 |
VAE [20] | 0.069 0.043 |
medGAN [6] | 0.043 0.049 |
corGAN [ours] | 0.021 0.045 |
Comparison of different baseline architectures. The reported metric demonstrate the mean and standard deviation of the F-1 score differences. A better model has a closer score to zero. Due to the sparse nature of the data, we have a relatively large variance for all utilized method. It is worth noting that a variance which is larger than the mean, indicates that the model training on the synthetic data, performed even better that the real model in some cases!
For binary classification experiments, we used the same predictive models as for the dimension-wise predictions. We reported the averaged AUROC (averaged area under the ROC for all models) and AUPRC (averaged area under the PR curve for all models) for the models’ evaluations. The difference with our experiments here is that we are not dealing with binary variables. Hence, for the medGAN and corGAN methods we eliminate the autoencoder as depicted in Fig.
1. As can be observed in Table. 3, our proposed method outperforms the other methods. As the UCI Epileptic Seizure Recognition dataset features contain termporal information, our method is able to capture temporal data information more effectively due to the usage of 1D CNNs.Generative Model | AUROC | AUPRC |
---|---|---|
DBM | ||
VAE | ||
medGAN | ||
corGAN [ours] | 0.92 0.012 | 0.41 0.015 |
In this section, the experiments are conducted on the MIMIC-III dataset regarding the membership inference attack. For privacy assessment, we randomly take samples from each and and call them and . We assume the attacker has the complete knowledge of both and . Clearly, was used to train the generating model, but wasn’t. So we have records. Then, we compared each of these records with the synthetically generated data samples .
We compared each of the samples in the set of with each samples in the set of
and we calculate cosine similarity score. Cosine similarity is used since it provides a more meaningful correlation metric as opposed to distance metrics
[22] used in previous research efforts [6]. If the score is higher than a threshold, then it flags the match, otherwise, we call it a mismatch. For threshold, we randomly select 100 threshold values from a Gaussian distribution with a mean of 0.5 and a standard deviation of 0.01 (ignoring possible negative values), and we report the results which demonstrate the best adversary attack.
For evaluation, we use precision and recall metrics. We conduct two sets of experiments here: (1) investigating the effect of the number of records known by the attacker (Table. 4) and (2) examining the effect of synthetic data volume on the privacy risk (Fig. 3).
As can be seen in Table. 4, by increasing the number of the real patient records known to the adversary, the attack will be even less accurate. It also demonstrates the fact that higher precision is possible at lower recall rates when the number of known records is not high. However, as is evident, a higher amount of revealed data increases the privacy risk significantly.
Regarding the effect of number of generated synthetic data on the privacy risk, as can be observed in Fig. 3, the increasing number of synthetic records does not have a significant effect on the recall, but it causes a dramatic decrease in precision. Henceforth, by having a fixed amount of known records, a higher number of synthetic patient’s records can be very misleading for the adversary. This empirical observation indicates that the increasing the number of synthetic records, with a fixed number or revealed patient’s records to the attacker, does not necessarily raise privacy risk.
100 | 1k | 2k | 3k | 4k | 5k | |
---|---|---|---|---|---|---|
Precision | 0.60 | 0.51 | 0.41 | 0.40 | 0.40 | 0.39 |
Recall | 0.05 | 0.10 | 0.19 | 0.28 | 0.27 | 0.28 |
The precision and recall demonstrated as a function of the number of patients whose data is revealed to the attacker.
= # of Known Records to the attacker.In this work, we proposed corGAN, which utilizes the convolutional generative adversarial networks to learn the distribution of real patient records. Through precise evaluation using real and synthetic datasets, corGAN demonstrated decent results for both discrete and continuous records. We empirically proved the superiority of CNNs over MLPs to capture the correlated features. We believe our method can be effectively extended and employed to longitudinal records as well for which the goal is to capture the temporal characteristics of the data.
Unsupervised image-to-image translation with generative adversarial networks
. arXiv preprint arXiv:1701.02676. Cited by: Introduction.Image-to-image translation with conditional adversarial networks
. InProceedings of the IEEE conference on computer vision and pattern recognition
, pp. 1125–1134. Cited by: Introduction.
Comments
There are no comments yet.