Generation of Synthetic Electronic Health Records Using a Federated GAN

09/06/2021 ∙ by John Weldon, et al. ∙ Dublin City University 19

Sensitive medical data is often subject to strict usage constraints. In this paper, we trained a generative adversarial network (GAN) on real-world electronic health records (EHR). It was then used to create a data-set of "fake" patients through synthetic data generation (SDG) to circumvent usage constraints. This real-world data was tabular, binary, intensive care unit (ICU) patient diagnosis data. The entire data-set was split into separate data silos to mimic real-world scenarios where multiple ICU units across different hospitals may have similarly structured data-sets within their own organisations but do not have access to each other's data-sets. We implemented federated learning (FL) to train separate GANs locally at each organisation, using their unique data silo and then combining the GANs into a single central GAN, without any siloed data ever being exposed. This global, central GAN was then used to generate the synthetic patients data-set. We performed an evaluation of these synthetic patients with statistical measures and through a structured review by a group of medical professionals. It was shown that there was no significant reduction in the quality of the synthetic EHR when we moved between training a single central model and training on separate data silos with individual models before combining them into a central model. This was true for both the statistical evaluation (Root Mean Square Error (RMSE) of 0.0154 for single-source vs. RMSE of 0.0169 for dual-source federated) and also for the medical professionals' evaluation (no quality difference between EHR generated from a single source and EHR generated from multiple sources).



There are no comments yet.


page 7

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Real data sets often have strict usage constraints due to privacy rules and regulations. Generated synthetic versions of these data sets can replicate the statistical qualities of actual data without exposing the data, eliminating these constraints. This is critical when dealing with human subjects’ data, protecting these subjects’ privacy and confidentiality. There are other motivations behind SDG besides usage constraints. Typically when training machine learning algorithms, the more training data, the better the results and the better the trained models will perform. Often there is a lack of data, or the data is expensive to generate. For example, in the autonomous vehicle space, producing synthetic data is far more cost-effective and efficient than collecting real-world data

[Yurtsever_2020]. One can also use synthetic data to test existing models; instead of using costly real-world data to test.

FL was first introduced by Google in their 2016 paper [mcmahan2017communicationefficient]. While their vision involved mobile devices being the decentralised clients in which training occurs before the models are aggregated in a central location, their techniques are equally suited to larger entities such as financial institutions and hospitals. FL allows institutions to share the value of their private data with other similar institutions without sharing actual data and in return receive the value of these other institutions’ data. While it is possible to use a peer to peer structure that does not require a centralised model as seen in [roy2019braintorrent], we wanted to examine the centralised implementation.

1.1 PhysioNet’s MIMIC-III Electronic Health Records

The EHR that we used to create synthetic data was PhysioNet’s [gaghim00] MIMIC-III 1.4 [jpm16]. This data is an example of the strict usage constraints mentioned previously. We had to complete a human research ethics course and wait 5 weeks to gain access to it. It’s a sizeable relational database detailing patients admitted to ICU at Beth Israel Deaconess Medical in Boston, MA, USA. The data we wanted to synthesise was patient ICD-9 diagnoses. Figure 1 shows the structure with a sample list from the 1071 unique codes in the data set. f

Figure 1: Example of ICD9 Diagnosis Codes [p]

1.2 Generative Adversarial Networks

GANs typically consist of two separate neural networks, the generator (G) and the discriminator (D, sometimes called the critic). G takes in Gaussian random noise (Z) and attempts to generate synthetic data similar to the trained data. D attempts to determine whether or not the data produced by G is fake or if it’s data taken from the real data-set. G attempts to maximise how often D classifies the fake data as being real and D simultaneously attempting to maximise how often it correctly classifies it as fake. An often-used analogy is that of counterfeiting. G, in this analogy, is a criminal who is creating counterfeit banknotes, and D represents the bank teller attempting to detect if the notes are fraudulent. As the criminal becomes better, producing notes that are more and more realistic, the teller must also improve in ability to distinguish these counterfeit notes from the real thing. The reverse is also true; as the teller improves at detecting the counterfeit notes, the criminal must increase the quality if they want to succeed in fooling them. Figure 

2 below shows how the GAN is structured. [goodfellow2014generative].

Figure 2: GAN Architecture Overview

2 Related Work

Highly relevant is medGAN; Choi et al. demonstrated that it was possible to use a GAN they created to generate high-quality synthetic EHR [choi2018generating]. Rasouli et al. showed that it was possible to use FL techniques with GANs to provide decentralised training of both G and D [rasouli2020fedgan]. Torfi and Fox used medical professionals to evaluate the data quality of generated EHR though it is not clear how this evaluation was structured [corgan]. Finally, Brophy et al. demonstrated that a federated GAN could be successfully used to generate continuous time-series EHR. [brophy2021estimation].

As far as we are aware, the work in this paper is novel in that it is the first attempt to use a federated GAN for the generation of discrete, binary EHR. It also contains a novel evaluation methodology, providing a repeatable method for medical professionals to evaluate these synthetic EHR.

3 Implementation

3.1 Data Pre-processing

To process the MIMIC-III data-set into the required format, we followed the process documented in [choi2018generating]. This left us with the data structure in Table 1. The feature columns contain binary data, with 1 indicating that the patient received that ICD-9 diagnosis and 0 indicating they did not. Figure 3 shows what this looked like and how the patients were presented once the ICD-9 codes were mapped to descriptive diagnoses.

Data-set Version Info ICU Patients (Rows) ICD9 Diagnoses (Columns)
MIMIC-III 1.4 46,520 1,071
Table 1: Table showing structure of patient data
Figure 3:

Feature vectors and sample patient

3.2 GAN architecture

The GAN was created with Python 3.8, using Tensorflow-GPU 2.2. Both G and D had depths of 3, with the dimension sizes being the reverse of each other. As this was tabular data and not images, convolutional layers were unnecessary and dense layers were used in their place. The activation function for the first three layers in both G and D were Leaky Rectified Linear Units (Leaky RELU). The final layer in D had a sigmoid activation function with range [0,1] and midpoint at 0.5, and the last layer in G had a tanh activation function with range [-1,1] and centre at 0. Table 

2 shows the details of both G and D.

3.3 GAN training

The hyperparameters used within the GAN were based on choices made by Goodfellow et al.


. We experimented with many different combinations of epochs, batch size, learning rate, optimiser. We found that for our use case, the following hyperparameters gave us the best results.

Learning Rate: 0.0002, Batch Size: 1,500, Epochs: 20,000 total (10,000 per local model for 2 source, 4,000 per local model for 5 source etc.), Optimiser: ADAM.

For each weight update within G, we made two consecutive updates within D, a method that has been known to increase the quality of the synthetic data in some cases and proved to be true in our case also [ganhack].

GAN Component Layer Dimension (G)
Discriminator (D) Input 1,071
Discriminator (D) 1 - Leaky RELU Dense 512
Discriminator (D) 2 - Leaky RELU Dense 256
Discriminator (D) 3 - Leaky RELU Dense 128
Discriminator (D) Output Dense 1, Sigmoid activation
Generator (G) Input Gaussian Random Noise 128
Generator (G) 1 - Leaky RELU Dense 128
Generator (G) 2 - Leaky RELU Dense 256
Generator (G) 3 - Leaky RELU Dense 512
Generator (G) Output Dense 1,071, Tanh activation
Table 2: GAN Architecture

3.4 Federated Learning Structure

Algorithm 1 and Figure 4 show the structure for a single round of our FL implementation. For actual use cases, these rounds may occur at some periodic interval to ensure each client has the most up-to-date model or maybe a new round is initiated if one of the local clients has a significant increase in data. Thus, the process is flexible and could be adjusted to whatever is most suited to the specific use case. What’s important is that all local clients will have the latest version of the global model by the end of one full round.

Figure 4: Centralised Federated Learning Process
create global GAN;
create local node GANs;
create nodelist containing all local nodes;
for local node in nodelist do
       retrieve weights from global GAN;
       set local node GAN weights to global weights;
       train local node GAN on local data silo;
       set global GAN weights to local node GAN weights;
end for
use final global GAN to generate synthetic patients;
Algorithm 1 Federated Learning Process Pseudo-code

4 Evaluation

4.1 Synthetic Data Quality

4.1.1 Feature Probability Correlation - Visual Comparison:

The first step that we took when evaluating the quality of the synthetic patients was to visually compare how close they were to the actual patients regarding the probability of diagnoses. We calculated the probability that each of the 1,071 diagnosis features was present for any given patient in both the real and synthetic data and plotted these probabilities against each other. If they were perfectly matched, all 1,071 points would land perfectly on the X=Y line. We can see in Figure 

5 that for all models, the points were clustered along this line. Importantly, we did not see any dramatic decrease in quality as we moved from the single source model to the federated versions. Comparing the single-source model to the dual-source federated model showed no loss in quality at this high level. We seemed to start to lose some quality when we moved on to three sources and five sources. While this was a helpful introduction to the evaluation process, we needed to make more concrete numerical comparisons.

(a) Full 46,520
(b) Two silos of 23,260
(c) Three silos of 15,506
(d) Five silos of 9,304
Figure 5: Feature probability comparison between single and federated models

4.1.2 Feature Probability Correlation - Statistical Comparison:

We used two statistical measures to compare the synthetic data sets from the various models to the real data in diagnosis probability—coefficient of Determination (R Squared) and Root Mean Square Error (RMSE). RMSE indicates the absolute fit of our points to the ideal line, in other words, how close the synthetic diagnosis probabilities are to the actual diagnosis probabilities. Whereas R-squared is a relative measure of how closely correlated the two probability vectors are. Lower values of RMSE indicate a better fit, whereas higher values of R Squared indicate the two sets of probabilities are more highly correlated. We can see in Table 3 that while RMSE starts to increase and R Squared starts to decrease, as we split the data-set into more and more silos, the changes are negligible, and we did not deem them significant.

Model R Squared RMSE
Single 0.805 0.0154
Federated - 2 Silos 0.793 0.0161
Federated - 3 Silos 0.786 0.0169
Federated - 5 Silos 0.782 0.0174
Table 3: Feature Probability Correlation Comparison

4.1.3 Medical Professional Evaluations:

Many GAN implementations involve image data such as digits (MNIST)

[jain2020locally], or cat images [huang2018multimodal]. While it is straightforward to derive a feel for the quality of the images in these cases simply by glancing at them, it’s not so simple for someone who lacks the specific domain knowledge to do the same for EHR. While the previous metrics suggested our synthetic data was similar to the real data, we wanted to seek input from experts in this field. A group of Medical Professionals was tasked with evaluating the generated patients. This group consisted of the following individuals:

  • A Critical Care Doctor within the ICU unit in Beaumont Hospital, Dublin, Ireland,

  • A Critical Care Nurse within the ICU unit in the Mater Private Hospital, Dublin, Ireland,

  • An Anatomical Pathologist from the Ottawa Hospital, Ontario, Canada.

The evaluation contained 20 actual patients, 20 patients generated by the single source GAN and 20 patients generated by the dual-source, federated GAN. The patients were randomly sorted and unlabelled. The guidelines were as follows:

"Please rate these 60 patients in terms of how plausible you consider them to be as ICU patients that you might encounter on a given day in your ICU or any ICU. Please try not to take into account how likely or unlikely it may be to encounter one of these patients, as rarity should not be a factor.
* Please note that this list is a mixture of real patients (from an ICU in the United States) and generated synthetic patients."

They were asked to place each patient into one of six categories. We chose an even number so there was no middle category that would allow them to sit on the fence when evaluating any patient. These were the six categories:

  • [leftmargin=1.2in]

  • Highly Plausible,

  • Plausible,

  • Slightly Plausible,

  • Slightly Implausible,

  • Implausible,

  • Highly Implausible.

Figure 6 shows us that there was no significant difference between the three groups of patients regarding how plausible the medical professionals viewed them, confirming that the synthetic patients closely matched the real patients for both the single source GAN and the federated GAN with two separate sources. Unfortunately, it was not possible to include the three-source GAN and the five-source GAN as these professionals are extremely busy, and though they were all generous with their time, they could only dedicate so much.

Figure 6: Medical Professionals’ Evaluation

4.2 Data Privacy

We implemented two different checks for patient confidentiality, detailed below. It’s important to note that while these checks provide some reassurance that the generated data is private and confidential, neither check can sufficiently verify that an actual patient could not be identified via a well-orchestrated membership inference attack on the synthetic data by some malevolent actor [hu2021membership].

4.2.1 Checking for exact matches:

The first privacy measure to determine if any actual patient data was exposed was a duplicate check. This involved iterating across the 46,520 real patients and see if their exact diagnoses vector had a perfect match in the 46,520 synthetic patients. If so, it would mean that any potential attacker who knew a patient’s diagnoses could quickly tell if this patient was a part of the data set. We can see in Table 4 that there are no exact matches between the synthetic patients and our real patients for all four models. Though this is a reasonably simplistic measure, it serves as a quick sanity check and gives some confidence that privacy has been attained.

4.2.2 Similarity Threshold:

The second method was to find the cosine similarity between each real patient and all of the synthetic patients. While the synthetic data needs to be realistic, it was essential that none of the synthetic patients was so alike any real patient that it would be apparent they were part of the training data. Cosine similarity provides a more meaningful similarity metric between two diagnosis vectors than Euclidean distance does

[corgan]. This similarity is subtracted from 1 to derive distance so the lower the value, the more similar two patients are. Table 4

shows the mean and standard deviation for shortest cosine distance for all patient pairs. Figure 

7 shows that the distribution of shortest cosine distances is approximately normal for all four. Very few patients fall within 0.1 of any real patients, and so the vast majority of synthetic patient pairs are below 0.9 for similarity, which we set as our threshold value.

Model Duplicates Mean Min Cos. Distance Standard Dev.
Single 0 0.5194 0.1283
Federated - 2 Silos 0 0.4698 0.1289
Federated - 3 Silos 0 0.4736 0.1334
Federated - 5 Silos 0 0.5229 0.1438
Table 4: Data Privacy
(a) Full 46,520
(b) Two silos of 23,260
(c) Three silos of 15,506
(d) Five silos of 9,304
Figure 7: Distribution of Shortest Cosine Distances

5 Discussion

5.1 Mode Collapse

GANs sometimes show a lack of diversity amongst generated samples. This is known as Mode Collapse (MC) [che2017mode]. We return to the counterfeiting analogy. Suppose the counterfeiter manages to create banknotes that consistently fool the teller. They will see no need to alter these notes and continue producing similar samples, knowing they’re likely to pass as authentic. This represents a GAN suffering from MC. Suppose G stumbles upon a combination that consistently fools D. In that case, it will see no reason to diversify future samples and risk lowering the chance of fooling D. Though our GAN did produce diverse patients, there were signs of MC. In terms of the diagnosis combinations, certain patient types appeared often. One method for mitigating mode collapse is Wasserstein Loss (WGAN) [arjovsky2017wasserstein]; specifically, a Wasserstein GAN with Gradient Penalty (WGAN-GP) [gulrajani2017improved]. This enforces a penalty if G consistently produces highly similar samples, meaning that G cannot sit in this collapsed mode and must instead generate more diverse samples.

5.1.1 Wasserstein GAN - Gradient Penalty:

We attempted to eliminate MC in our model by using WGAN-GP. Figure 8 shows the quality of the WGAN-GP data. It did help mitigate MC, generating a more diverse range of patients, but these patients’ quality suffered. The trade-off between MC and data quality has been documented by [Adiga2018ONTT]. We decided that this quality drop was not acceptable and so we stuck with our previous model.

Figure 8: WGAN-GP Feature Probability Comparison

6 Conclusion

We propose that the federated GAN can adequately learn the distribution of real-world EHR and exhibit comparable performance to the single source GAN in generating realistic synthetic binary EHR. We analysed both the synthetic EHR data generated by the single source GAN and the synthetic EHR data generated by the federated GAN and compared their evaluation results with the actual EHR data. Based on our investigation, we conclude that there is no significant reduction in quality between the data generated by single-source GAN and the data generated by the federated GAN. A potential idea for future research could be to combine the federated GAN with more formal privacy-preserving mechanisms like differential privacy to provide stronger privacy guarantees. Another avenue to explore would be improving the WGAN-GP implementation to eliminate MC without significant data quality loss.