EMIXER: End-to-end Multimodal X-ray Generation via Self-supervision

07/10/2020 ∙ by Siddharth Biswal, et al. ∙ Georgia Institute of Technology ∙ University of Illinois at Urbana-Champaign

Deep generative models have enabled the automated synthesis of high-quality data for diverse applications. However, the most effective generative models are specialized to data from a single domain (e.g., images or text). Real-world applications such as healthcare require multi-modal data from multiple domains (e.g., both images and corresponding text), which are difficult to acquire due to limited availability and privacy concerns and are much harder to synthesize. To tackle this joint synthesis challenge, we propose an End-to-end MultImodal X-ray genERative model (EMIXER) for jointly synthesizing X-ray images and corresponding free-text reports, all conditional on diagnosis labels. EMIXER is a conditional generative adversarial model that works by 1) generating an image based on a label, 2) encoding the image into a hidden embedding, 3) producing the corresponding text via a hierarchical decoder from the image embedding, and 4) using a joint discriminator to assess both the image and the corresponding text. EMIXER also enables self-supervision to leverage vast amounts of unlabeled data. Extensive experiments with real X-ray report data illustrate how data augmentation using synthesized multimodal samples can improve the performance of a variety of supervised tasks, including COVID-19 X-ray classification with very limited samples. The quality of generated images and reports is also confirmed by radiologists. We quantitatively show that EMIXER-generated synthetic datasets can augment X-ray image classification and report generation models, achieving a 5.94% improvement over models trained on real data samples only. Taken together, our results highlight the promise of state-of-the-art generative models to advance clinical machine learning.




1 Introduction

While clinical applications of supervised machine learning algorithms continue to advance, their impact is stifled by the limited amount of available labeled clinical data. This issue is only made more dire by applications such as radiology report generation for medical images, which require paired data jointly across images, clinical notes, and diagnosis labels. Data sharing across healthcare organizations and institutions remains difficult, often due to legal and privacy concerns McGuire et al. (2008); Filkins et al. (2016). On the other hand, generative modeling has improved dramatically in the past few years. While early Generative Adversarial Networks (GANs) could only synthesize low-resolution grayscale images Goodfellow et al. (2014), state-of-the-art generative models can now synthesize diverse high-quality and high-resolution images Brock et al. (2018); Karras et al. (2017, 2019a, 2019b). GANs and related generative models have been applied to various domains such as computer vision Brock et al. (2018); Karras et al. (2017), natural language processing Dai et al. (2017b); Fedus et al. (2018), time-series synthesis Brophy et al. (2019), and semantic segmentation Dong et al. (2017); Luc et al. (2016), among others. This manuscript explores using generative models to address the challenge of limited data in machine learning for clinical applications. We explore a variety of applications, with a focus on using synthetic data to augment real datasets – increasing the amount of data and labels available Choi et al. (2017) and thereby improving downstream model performance.

We focus on X-rays, as they are a primary diagnostic tool in many clinical workflows, most importantly in radiology, and are used for detecting pneumonia, bone fractures, and cancer Rajpurkar et al. (2017); Gulshan et al. (2016). Recent research efforts have shown promise for lung cancer detection in radiology, prostate cancer in pathology, and differential diagnoses in dermatology Ardila et al. (2019); Fujisawa et al. (2019); Arvaniti et al. (2018); Mohamed et al. (2018). Most recently, X-rays have been employed for coronavirus diagnosis and prognosis Jacobi et al. (2020). Along with X-rays, the associated reports written by clinicians are the primary communication between patients and doctors Schwartz et al. (2011); Kahn Jr et al. (2009). Several deep learning based X-ray image-to-report writing methods have been proposed Jing et al. (2017, 2020); Li et al. (2018), and researchers have proposed generative models for clinical data Choi et al. (2017). However, existing methods are limited to a single modality – images or clinical reports only. Thus, current generative models are not able to produce high-quality multimodal synthetic datasets, which is the focus of this paper. This manuscript investigates an end-to-end approach for generating multimodal X-ray images and text reports, which are essential for radiology applications. To this end, our work addresses the following challenges.


  • Multimodal generation of images and corresponding reports: Multimodal generative models are difficult to train compared to single-modality generative models Liu and Tuzel (2016); Isola et al. (2017); Zhu et al. (2017b, a); Choi et al. (2018, 2020). In the past few years, there have been multiple attempts at developing models that can generate multiple modalities at the same time Pu et al. (2018). In particular, text synthesis using generative models has proven to be extremely challenging – most likely because discrete text tokens are not differentiable – making it more difficult to train GANs. We show that an end-to-end approach, combined with appropriate text embeddings, can overcome these issues.

  • Generative model training with limited labels: Generative models typically require large quantities of high-quality labeled data for training. However, labels are scarce in real-world applications such as the medical domain, which renders the training of high-quality generative models challenging. We present successful results with limited labeled X-ray data along with a large amount of unlabeled X-ray data, and conjecture about the properties of X-rays that make this feasible.

  • Difficulty of data augmentation with limited data: Training a generative model for classifier augmentation Huang et al. (2018); Antoniou et al. (2017) is particularly challenging in the case of rare diseases or new phenotypes, as the limited number of labels renders training of generative models difficult. For example, in the case of the COVID-19 pandemic, the amount of available X-ray data and labels is extremely low. Given the limited labels, training high-quality generative models to augment the original dataset is a challenge. Pretraining models on large and diverse augmented data can potentially provide robust embeddings for new phenotypes.

We propose EMIXER, an end-to-end multimodal generative model that can generate paired chest X-ray images and corresponding reports simultaneously, conditioned on diagnosis labels. Our primary contributions are summarized in the following.


  • Multimodal X-ray image and report generation. We show that EMIXER generates high-quality X-ray images and corresponding reports. Multiple radiologists scored synthetic data at 7.340/10 and real data at 7.825/10 on average for realism and quality. Furthermore, EMIXER-generated synthetic datasets used to augment X-ray image classification models lead to improved classification accuracy compared to models trained on real X-ray images only. Similarly, EMIXER-augmented paired X-ray image and report datasets improve X-ray report generation models, as measured by CIDEr scores.

  • Learning high-quality generative models from limited samples. EMIXER uses self-supervision to enable learning of high-quality generative models from limited labels. We show that even with a fraction of the original labels, EMIXER can outperform baselines trained with 100% labeled data on image classification and report generation tasks.

  • Improved classification of COVID-19 chest X-rays via data augmentation. We use a pre-trained EMIXER model to augment classification models applied to the automated diagnosis of COVID-19 from X-ray images. Our results show improved predictive accuracy over models that do not use the pre-trained EMIXER model.

2 Related Work

Generative models. In the past few years, there has been great progress in the area of generative modeling of complex imaging data. Since the introduction of Generative Adversarial Networks (GANs), many variants have been proposed, such as DCGAN, Progressive GAN, and self-supervised GANs Goodfellow et al. (2014); Karras et al. (2017); Radford et al. (2015); Dai et al. (2017a), among others. In addition to GANs, other types of generative models are also widely used, such as flow models, autoregressive models, and variational autoencoders Kingma and Welling (2013); Kingma and Dhariwal (2018); Dinh et al. (2014, 2016). Flow models apply a stack of invertible transformations to samples from a prior distribution and can thus compute the exact log-likelihood of observations. Autoregressive models factorize the distribution over observations into a sequence of conditional distributions (e.g., over pixels for images), then process each component in sequence Oord et al. (2016); Van den Oord et al. (2016). For image generation applications, GAN-based models produce among the most photo-realistic images. However, training GAN models can be quite challenging, with known issues such as mode collapse and instability in convergence Salimans et al. (2016). Many works have sought to improve upon these challenges, e.g., by changing the objective function Arjovsky et al. (2017). Other research efforts have focused on constraining the discriminator through gradient penalties or normalization Miyato and Koyama (2018). BigGAN Zhang et al. (2018); Brock et al. (2018) adds a self-attention block, and ProGAN trains a single model across a sequence of increasing resolutions Karras et al. (2017). While there is much effort in modeling single modalities, especially images, there is a dearth of research on multimodal image and text generation. This work addresses the challenge of multimodal joint generation of image and text.

Medical report generation. Deep learning based image classification has been successfully applied to many different types of medical image classification tasks, such as diabetic retinopathy classification, X-ray classification, cancer detection from cell images, and X-ray based bone classification Wang et al. (2018); Gulshan et al. (2016); Milletari et al. (2016), among other applications. Similarly, image segmentation algorithms have been very successfully applied to medical images to identify different organs and diseases. There has also been progress on automated report generation for medical images such as X-rays Liu et al. (2019). Existing applications of machine learning to clinical tasks must address a variety of challenges, such as the availability of large datasets.

3 Methods

3.1 Problem Definition

We begin by introducing notation. We denote a real chest X-ray image as $x_i \in \mathbb{R}^{H \times W}$, where $H \times W$ is the size of the image, the text X-ray report as $r_i$, and the label as $y_i$ for the $i$-th data sample. The X-ray report contains a sequence of sentences $r_i = (s_{i1}, \ldots, s_{iM_i})$, where the report length $M_i$ may vary. Sentence $s_{ij}$ consists of a sequence of words $s_{ij} = (w_{ij1}, \ldots, w_{ijN_{ij}})$, where the $k$-th word $w_{ijk}$ is represented as a one-hot vector. The dataset, denoted as $\mathcal{D} = \{(x_i, r_i, y_i)\}_{i=1}^{n}$, is a combination of images, reports, and labels. EMIXER generates a synthetic dataset that consists of synthetic X-ray images $\hat{x}$ and synthetic reports $\hat{r}$ conditioned on class labels. We train an end-to-end generative model which consists of an X-ray image generator $G$, an X-ray image discriminator $D_I$, an X-ray report discriminator $D_R$, and an X-ray image-to-report decoder $Dec$. Each of these components is a neural network, and they are trained jointly to produce paired X-ray images and clinical reports conditioned on diagnosis labels.

Figure 1: An overview of the EMIXER generator framework

3.2 The Emixer Model

We describe the primary components of EMIXER in this section. As illustrated in Fig. 1, EMIXER is composed of the following trainable networks: (a) Image generator: synthesizes X-ray images from a prior noise distribution conditioned on label information. (b) Image-to-report decoder: produces a text report from an X-ray image. (c) Image discriminator: discriminates between real and synthetic X-ray images. (d) Text discriminator: distinguishes between real and synthetic X-ray reports. (e) Joint discriminator: combines the embeddings of X-ray images and text to discriminate between real and synthetic embeddings.

3.2.1 X-ray Image Generator ($G$)

The X-ray image generator $G$ is a deep neural network that accepts two inputs: a noise vector $z$ and class information $y$ represented as a one-hot vector. First, we split the noise vector $z$ into chunks $z_0, z_1, \ldots, z_K$. Chunk $z_0$ is passed through a linear layer to obtain an initial feature $h_0$. We embed the class information via a linear layer to obtain a class embedding $e_y$. The feature concatenated with $e_y$ is passed through three $ResBlock_{up}$ layers, each of which applies batch normalization with a deconvolution operation He et al. (2016). The output is passed through a self-attention block, which applies a convolution operation with softmax to obtain intermediate feature vectors that are combined with the original input. Finally, this output is passed through another $ResBlock_{up}$ to obtain the synthetic image $\hat{x}$ as the output of the image generator. Taken together, the generator network can be abstracted as $\hat{x} = G(z, y)$. We provide implementation details of the $ResBlock$ and self-attention blocks in the supplement.

3.2.2 X-ray Report Generator ($Dec$)

The image is fed through an image encoder convolutional neural network ($CNN_{enc}$) to obtain a feature representation. These feature vectors are passed to a sentence decoder to recurrently generate topic vectors for each sentence. These topic vectors are then used by a word decoder to generate the words of each sentence.

X-ray image encoder: Given an image $x$, we first extract its features $v$ from an intermediate layer of a CNN, $v = CNN_{enc}(x)$. We use a pretrained DenseNet-121 as the encoder, trained on a different chest X-ray dataset Huang et al. (2017). Note that this is different from the CNN used in the image discriminator $D_I$. The report generator module is composed of a sentence decoder RNN and a word decoder RNN, which are described below.

Sentence decoder RNN: Given the X-ray image features extracted by the encoder, a sentence decoder is used to generate topic vectors. We employ a Long Short-Term Memory network (LSTM) to compute the hidden state $h_j$ at each step. We use the hidden states in two ways. First, we project the hidden state $h_j$ through a linear layer and a logistic layer to get a probability distribution over two states, CONTINUE = 0 and STOP = 1. Second, we also feed $h_j$ through a three-layer fully connected network to get a topic vector $t_j$ for the $j$-th sentence in the report.

Word decoder RNN: The words of each individual sentence are generated by a word decoder, which is a trainable three-layer LSTM. The sentence topic $t_j$ generated by the sentence decoder is combined with the <START> token as the first and second inputs to the word LSTM. In subsequent steps, the hidden state of the last LSTM layer is used to predict a distribution over the words in the vocabulary: $p(w) = \mathrm{softmax}(W h)$, where $W$ is a parameter matrix. Finally, after the word decoder generates the word sequences, we concatenate all generated sequences to obtain the final report.
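The two-level decoding loop described above can be sketched as plain control flow. The stub "networks" below just index canned outputs so the loop is runnable; they stand in for the sentence LSTM, stop classifier, and word LSTM, and all names and the canned sentences are illustrative.

```python
def decode_report(image_feat, sentence_rnn, stop_head, word_rnn,
                  max_sents=10, max_words=15):
    # Sentence loop: emit one topic per sentence until STOP;
    # word loop: expand each topic into words until <END>.
    report = []
    h = None
    for _ in range(max_sents):
        h, topic = sentence_rnn(image_feat, h)
        if stop_head(h) >= 0.5:          # STOP = 1 ends the report
            break
        sentence, w_state = [], None
        for _ in range(max_words):
            word, w_state = word_rnn(topic, w_state)
            if word == "<END>":
                break
            sentence.append(word)
        report.append(" ".join(sentence))
    return ". ".join(report)

# Tiny stubs with canned outputs (hypothetical example sentences).
canned = [["no", "acute", "findings", "<END>"],
          ["heart", "size", "normal", "<END>"]]

def sentence_rnn(feat, h):
    i = 0 if h is None else h + 1
    return i, i                           # (new state, topic index)

def stop_head(h):
    return 1.0 if h >= len(canned) else 0.0

def word_rnn(topic, state):
    j = 0 if state is None else state + 1
    return canned[topic][j], j

result = decode_report(None, sentence_rnn, stop_head, word_rnn)
# → "no acute findings. heart size normal"
```

In the actual model the stubs are replaced by the trained LSTMs and the topic vector conditions the word decoder's first two steps.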

3.2.3 Discriminators ($D_I$, $D_R$, $D_J$)

EMIXER uses three discriminators – an image discriminator, a report discriminator, and a joint embedding discriminator – to ensure image and report consistency of the synthetic data. The image discriminator measures whether the generated image matches the distribution of real X-ray images, and the report discriminator discriminates between real and synthetic X-ray reports.

X-ray image discriminator ($D_I$): We use a convolutional neural network discriminator for X-ray images, which is fed real and synthetic X-ray images for classification. The discriminator uses a ResNet architecture in which the input image is passed through multiple ResBlocks, each composed of convolutional layers He et al. (2016). The image discriminator can be represented as $D_I(x, y) = f(\phi(x)) + y^{\top} V \phi(x)$, where $V$ is a weight matrix applied to the image feature $\phi(x)$ and the one-hot encoded label $y$, and $f$ is a linear classifier tasked with detecting whether the provided sample is real or fake.
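This conditioning scheme is consistent with the projection discriminator of Miyato and Koyama (2018), which the paper cites. A minimal NumPy sketch, with illustrative shapes and parameter names:

```python
import numpy as np

def projection_discriminator(feat, y_onehot, V, w, b):
    # Unconditional real/fake logit from a linear head on the image
    # feature, plus a class-conditional "projection" term that scores
    # how well the feature matches the claimed class.
    uncond = feat @ w + b
    cond = (y_onehot @ V) @ feat   # inner product with the class row of V
    return uncond + cond

# Illustrative usage: 4-dim feature, 2 classes.
feat = np.ones(4)
y = np.eye(2)[1]
V = np.arange(8.0).reshape(2, 4)
score = projection_discriminator(feat, y, V, np.full(4, 0.5), 1.0)
# score = (2 + 1) + (4 + 5 + 6 + 7) = 25.0
```

The same feature extractor serves both terms, so the label only influences the score through the inner product with its embedding row.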

X-ray report discriminator ($D_R$): We use an X-ray report discriminator that classifies a given X-ray report as real or fake. X-ray reports generated by the decoder and real X-ray reports are passed as input to the discriminator. We employ an LSTM to extract a text embedding from a given X-ray report Cho et al. (2014). This report embedding is passed through multiple linear layers with a softmax layer to obtain the prediction. The report discriminator can thus be abstracted as $D_R(r)$, discriminating between real and fake report embeddings. We provide further details of the implementation in the supplementary section.

Joint discriminator for X-ray images and reports ($D_J$): Along with the image discriminator and report discriminator, we also use a joint embedding discriminator. We hypothesize that, since the X-ray images and reports depend on each other, a joint multimodal embedding discriminator provides further guidance to the generator network for generating higher-quality images and reports. This joint embedding discriminator is designed to discriminate real joint embeddings from fake joint embeddings. It first obtains image features from the X-ray images using a CNN, taken before the pooling layer. The text reports are provided as input to an LSTM, and the last hidden vector of the LSTM is passed through a linear layer to obtain the report embedding. The image feature vector and report embedding are concatenated together to form the joint embedding, which is passed through linear layers to obtain the probability that the embedding is real or fake.
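The concatenate-then-score step can be sketched as follows; the feature dimensions and the single linear head are illustrative simplifications of the multi-layer head described above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def joint_discriminator(img_feat, rep_embed, w, b):
    # Concatenate the image feature and the report embedding into one
    # joint embedding, then score it as real (toward 1) or fake
    # (toward 0) with a linear head.
    joint = np.concatenate([img_feat, rep_embed])
    return sigmoid(joint @ w + b)

# Illustrative usage: with zero inputs and zero bias the score is 0.5,
# i.e., maximally uncertain.
p = joint_discriminator(np.zeros(3), np.zeros(2), np.zeros(5), 0.0)
```

Because the score depends on the pair, a realistic image paired with a mismatched report can still be penalized, which is the point of the joint term.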

Learning: Previous works have shown that self-supervision guides the classifier to learn useful data representations by predicting auxiliary information such as rotation angles. When applied to image classification, images are typically rotated and the angle of rotation is used as an artificial label; the self-supervised task is to predict the angle of rotation of an image. We use four rotation angles. An image $x$ rotated by $\theta$ degrees is denoted $x^{\theta}$, and $Q(R \mid x^{\theta})$ is the predicted probability distribution over the rotation angles. The EMIXER framework corresponds to a constrained minimax game $\min_{G, Dec} \max_{D_I, D_R, D_J} V(G, Dec, D_I, D_R, D_J)$, where the value function $V$ combines the adversarial losses of the image, report, and joint discriminators with the self-supervised rotation loss, and $G$, $D_I$, $D_R$, $D_J$, $Dec$ are the image generator, image discriminator, report discriminator, joint discriminator, and image-to-report decoder, respectively. EMIXER can be trained by backpropagation with alternating gradient update steps. The details of the learning algorithm are given in the supplementary materials.

4 Experiments

In this section, we perform extensive evaluations to measure the effectiveness of EMIXER for paired chest X-ray image and report generation. We empirically show that (1) our proposed model can generate high-quality X-ray images and reports; (2) EMIXER with the self-supervised loss can match the generated sample quality of conditional models using only a fraction of the labels; and (3) EMIXER can be used to augment datasets in limited-label settings such as COVID-19 chest X-ray detection.

4.1 Datasets

We perform experiments on MIMIC-CXR, one of the largest X-ray datasets, containing 377,110 chest X-ray images and corresponding reports from 227,827 imaging studies sourced from the Beth Israel Deaconess Medical Center between 2011 and 2016 Johnson et al. (2019). The labels extracted from the reports cover 14 unique classes. We resize the images as done in previous work Miyato and Koyama (2018).

4.2 Evaluation Metrics

We perform quantitative and qualitative experiments: (a) For classification experiments, we use accuracy and AUC as metrics, and CIDEr and BLEU scores for image captioning experiments Vedantam et al. (2015); Papineni et al. (2002). (b) To evaluate X-ray image quality, we use Fréchet Inception Distance (FID) scores, computed with an Inception network pre-trained on chest X-ray images; further details are provided in the supplement. (c) We qualitatively evaluate the generated X-ray images and reports by presenting randomized pairs of real or synthetic X-ray images and reports to clinical experts, who do not know whether a presented sample is real or synthetic. The experts were asked to provide a numerical quality score between 1 and 10 (10 being the best) for each sample.
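FID measures the Fréchet distance between two Gaussians fit to real and generated feature activations: $\|\mu_1-\mu_2\|^2 + \mathrm{Tr}(C_1 + C_2 - 2(C_1 C_2)^{1/2})$. The sketch below uses a diagonal-covariance simplification so the matrix square root is elementwise; the standard FID uses full covariances (e.g., `scipy.linalg.sqrtm`) and Inception activations, here replaced by arbitrary feature arrays.

```python
import numpy as np

def fid_diagonal(feats_real, feats_fake):
    # Frechet distance between diagonal Gaussians fit to the two
    # feature sets: squared mean difference plus a variance term.
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    v1, v2 = feats_real.var(axis=0), feats_fake.var(axis=0)
    return float(((mu1 - mu2) ** 2).sum()
                 + (v1 + v2 - 2 * np.sqrt(v1 * v2)).sum())

feats = np.random.default_rng(1).normal(size=(100, 3))
same = fid_diagonal(feats, feats)          # identical sets -> 0
shifted = fid_diagonal(feats, feats + 1.0) # mean shift of 1 per dim -> 3
```

Lower is better: shifting every feature dimension by 1 yields a distance equal to the feature dimensionality, while identical sets score zero.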

4.3 Models

JointGAN: trains multiple generators and a single softmax-based critic, all jointly trained via adversarial learning to generate joint data distributions Pu et al. (2018). CoGAN: learns separate generators for two different domains with tied weights on the first few layers for shared latent representations Liu and Tuzel (2016). Single-modal image GAN with text decoder (SM-GAN): a GAN model generates X-ray images, which are passed to a text decoder that produces text reports corresponding to the synthetic chest X-rays. EMIXER: our method, a self-supervised generative model with a discriminator for each modality plus a joint embedding discriminator; the final discriminator loss combines the adversarial losses of both modalities and the joint embedding.

4.4 Experimental Results and Discussion

Our experiments aim to answer the following questions.


  • Can EMIXER generate high quality X-ray images?

  • Can EMIXER generate high quality pairs of X-ray images and reports?

  • Can EMIXER learn a high quality generative model from limited samples?

  • Can EMIXER be used to improve COVID X-ray classification?

4.4.1 Image quality evaluation: Is Emixer capable of generating high quality X-ray images?

One of the primary applications of generative models is data augmentation, which increases sample size and improves downstream model performance. We use the baselines and EMIXER to augment the real X-ray images and evaluate the quality of the augmented datasets by using them for X-ray image classification.

X-ray image classification setup: We train separate X-ray image classification models on real X-ray images and on augmented datasets. We hypothesize that good generative models should generate images that resemble real data and can be used to train a classification model. These classification models are evaluated on held-out real X-ray images. This setup evaluates the performance of the classification model on five different classes of diseases in the X-ray images. We report accuracy and AUC in Table 1, where we increase the dataset size by augmenting the real data with generated X-ray images: we use 100k real X-ray images and gradually increase the augmented dataset size by adding up to 600k synthetic X-ray images. We observe improved performance of these image classification models compared to models trained on real X-ray images only, as well as improvement over the best baseline. This highlights that EMIXER is able to generate synthetic X-ray images that augment the real dataset and improve classification performance.

Metrics: image classification (AUC, ACC) and report generation (CIDEr, BLEU-1 through BLEU-4). Configurations: real data only (R100k), and each of JointGAN, CoGAN, SMGAN, and EMIXER with R100k augmented by S50k, S100k, S300k, or S600k synthetic samples.
Table 1: Comparison of X-ray report generation model performance with real and augmented datasets; R indicates real data samples and S indicates synthetic data samples.

4.5 Joint Image and Text Evaluation: Can Emixer generate high quality pairs of image and reports?

One of the primary advantages of EMIXER is the ability to jointly generate paired X-ray images and reports. We performed two different experiments to understand the effectiveness of EMIXER towards generating paired images and reports.

Report generation task: X-ray report generation is one of the key tasks in the radiology clinical workflow Schwartz et al. (2011). We validate the effectiveness of augmented paired image and report datasets for the report generation task. In this setup, we train report generation models on real data and on combinations of real and synthetic data, varying the amount of synthetic data added to the real dataset. These trained models are evaluated on held-out real paired datasets. We present the results in Table 1, in terms of natural language processing metrics such as CIDEr and BLEU 1-4 Vedantam et al. (2015); Papineni et al. (2002). We show that EMIXER-augmented training improves over models trained only on real datasets, highlighting that EMIXER can be used to augment and improve report generation models.

Multimodal joint embeddings of X-ray images and reports: The learned multimodal embeddings can be used for classification tasks. We perform an experiment to evaluate the joint quality of images and generated text. In Table 2, we compare the results of varying combinations of real and synthetic data on the joint modeling task, in which we combine features from X-ray images and text reports for downstream classification of different disease phenotypes. We find that adding a synthetic dataset to the real dataset for this joint embedding significantly improves the performance of the classification model.

Configurations for phenotype classification (AUC, Acc): real data only (100k), and JointGAN, CoGAN, SMGAN, and EMIXER each with R100k + S300k; EMIXER (R100k + S300k) achieves the best AUC of .924.
Table 2: Comparative evaluation of phenotype classification via joint embedding with real and augmented data
Label fractions compared for EMIXER: 30%, 50%, and 100%, evaluated by Acc, BLEU-1, and FID.
Table 3: Comparison of generative models with limited labels

4.5.1 Limited label setup: Can we learn a high quality generative model from limited data?

Machine learning applications in clinical domains are often limited by the amount of available data and labels. Since generative models require large amounts of data and labels to train, learning high-quality generative models for clinical tasks is a challenge. We show in the following experiments that we can employ self-supervision to overcome the label limitations. We explore the limits of label usage by varying the percentage of labels used in the models: we use limited labels (30% and 50%) and compare with 100% label usage. We show that even with limited labels, EMIXER performs competitively. We compare existing baselines to our model, which uses self-supervision to generate images from limited labels. Table 3 shows that EMIXER outperforms the baselines in terms of image generation diversity as measured by FID.

4.5.2 Case Study: COVID-19 X-ray data augmentation experiment

We apply the generative models toward improving COVID-19 X-ray classification. In this task, we use EMIXER to augment chest X-ray images to improve COVID-19 detection. The COVID-19 X-ray classification task includes four classes: normal, bacterial pneumonia, viral pneumonia, and COVID-19. In this experiment, we evaluate whether EMIXER-generated synthetic data can augment chest X-ray image samples for the COVID-19 classification task. Specifically, we compare three different models: a model trained only on the COVID-19 dataset, a model pretrained on the CheXpert dataset, and a model pretrained on combined real and synthetic data Cohen et al. (2020); Irvin et al. (2019). The models pretrained on the real dataset and the combined dataset are then fine-tuned on the COVID-19 dataset. We show in Table 5 that augmenting real datasets with EMIXER-generated samples improves the overall performance.

For each training regime (COVID samples only; CheXpert real dataset; CheXpert real data + EMIXER with 250k synthetic samples), performance is reported per phenotype (Normal Lung, Bact. Pneumonia, Viral Pneumonia) in terms of AUC, Sensitivity, and PPV.
Table 5: Comparison of performance for COVID-19 classification

4.5.3 Evaluation by Radiologists

We perform a qualitative evaluation of the generated X-ray images and reports. In this task, we present randomized X-ray images and reports to expert doctors. Two radiologists provide a rating between 1 and 10 for each pair of images and reports. We present the results of this evaluation in Figure 2. The average scores for real and synthetic X-ray samples were 7.825/10 and 7.340/10, respectively. Inter-rater agreement was measured using Cohen's kappa. The comments provided by the doctors indicate that synthetic samples were similar to real samples, with some language incoherence in the X-ray reports.

Figure 2: Qualitative evaluation. (a) User study results (b) Comparison of real and synthetic samples

5 Conclusion

In this paper, we address the challenging multimodal paired X-ray image and report generation task by proposing a novel self-supervised multimodal generative model called EMIXER. EMIXER successfully learns to generate paired X-ray images and reports. We use self-supervision to guide EMIXER to learn from limited samples, which is particularly valuable in the medical domain, where the number of labels is often limited. We also use multiple discriminators to guide the processes of image generation and report decoding. We show via extensive experiments that EMIXER can augment real X-ray image datasets to improve downstream classification tasks. Finally, in a timely case study, we show that EMIXER can also improve COVID-19 X-ray classification.

Broader Impact

Our paper presents an end-to-end multimodal X-ray generation algorithm to produce synthetic but realistic X-ray images and the corresponding text reports.

Application and societal impact: Deep learning models have shown great promise in medical imaging applications such as automatic diagnosis of radiology images. However, large amounts of labeled training data are required to develop accurate models. Unfortunately, medical data are extremely difficult to share due to patient privacy concerns and legal constraints. In addition, many conditions and situations are intrinsically rare, which means limited data. Our proposed method EMIXER can alleviate these challenges by producing realistic but synthetic data to support model building and to augment limited existing data, as we demonstrated in the COVID-19 image classification task.

As sensing technology becomes cheap and ubiquitous (e.g., high-resolution cameras on smartphones), it is foreseeable that AI-supported telemedicine can efficiently support many people, especially those in rural communities, where our proposed algorithm can play an important role.

Caveat and potential weakness: Although synthetic data can potentially alleviate sensitive data sharing in healthcare, it is important to study and quantify the privacy implications of synthetic data generated by a model trained on real data. Although unlikely, some real data could potentially be memorized and resynthesized in the synthetic data, so there is a balance between data utility and privacy preservation in this line of research. Finally, a broader trend to consider is that AI technology has enabled automation and improved efficiency across many industries, such as the shift from traditional retail to e-commerce and automation in production plants. The traditional workforce can be negatively impacted, and it is important to consider the social impact of AI technology on existing industries. Since skilled experts are still in short supply in healthcare, however, AI-based medical technology will probably have limited negative impact on the existing workforce.


  • [1] A. Antoniou, A. Storkey, and H. Edwards (2017) Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340. Cited by: 3rd item.
  • [2] D. Ardila, A. P. Kiraly, S. Bharadwaj, B. Choi, J. J. Reicher, L. Peng, D. Tse, M. Etemadi, W. Ye, G. Corrado, et al. (2019) End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nature medicine 25 (6), pp. 954–961. Cited by: §1.
  • [3] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein gan. arXiv preprint arXiv:1701.07875. Cited by: §2.
  • [4] E. Arvaniti, K. S. Fricker, M. Moret, N. Rupp, T. Hermanns, C. Fankhauser, N. Wey, P. J. Wild, J. H. Rueschoff, and M. Claassen (2018) Automated gleason grading of prostate cancer tissue microarrays via deep learning. Scientific reports 8 (1), pp. 1–11. Cited by: §1.
  • [5] A. Brock, J. Donahue, and K. Simonyan (2018) Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096. Cited by: §1, §2.
  • [6] E. Brophy, Z. Wang, and T. E. Ward (2019) Quick and easy time series generation with established image-based gans. arXiv preprint arXiv:1902.05624. Cited by: §1.
  • [7] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §3.2.3.
  • [8] E. Choi, S. Biswal, B. Malin, J. Duke, W. F. Stewart, and J. Sun (2017) Generating multi-label discrete patient records using generative adversarial networks. arXiv preprint arXiv:1703.06490. Cited by: §1, §1.
  • [9] Y. Choi, M. Choi, M. Kim, et al. (2018) Stargan: unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). Cited by: 1st item.
  • [10] Y. Choi, Y. Uh, J. Yoo, et al. (2020) StarGAN v2: diverse image synthesis for multiple domains. CVPR. Cited by: 1st item.
  • [11] J. P. Cohen, P. Morrison, and L. Dao (2020) COVID-19 image data collection. arXiv 2003.11597. External Links: Link Cited by: §4.5.2.
  • [12] B. Dai, S. Fidler, R. Urtasun, and D. Lin (2017) Towards diverse and natural image descriptions via a conditional gan. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2970–2979. Cited by: §2.
  • [13] Z. Dai, Z. Yang, F. Yang, W. W. Cohen, and R. R. Salakhutdinov (2017) Good semi-supervised learning that requires a bad gan. In Advances in neural information processing systems, pp. 6510–6520. Cited by: §1.
  • [14] L. Dinh, D. Krueger, and Y. Bengio (2014) Nice: non-linear independent components estimation. arXiv preprint arXiv:1410.8516. Cited by: §2.
  • [15] L. Dinh, J. Sohl-Dickstein, and S. Bengio (2016) Density estimation using real nvp. arXiv preprint arXiv:1605.08803. Cited by: §2.
  • [16] H. Dong, S. Yu, C. Wu, and Y. Guo (2017) Semantic image synthesis via adversarial learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5706–5714. Cited by: §1.
  • [17] W. Fedus, I. Goodfellow, and A. M. Dai (2018) MaskGAN: better text generation via filling in the_. arXiv preprint arXiv:1801.07736. Cited by: §1.
  • [18] B. L. Filkins, J. Y. Kim, B. Roberts, W. Armstrong, M. A. Miller, M. L. Hultner, A. P. Castillo, J. Ducom, E. J. Topol, and S. R. Steinhubl (2016) Privacy and security in the era of digital health: what should translational researchers know and do about it?. American journal of translational research 8 (3), pp. 1560. Cited by: §1.
  • [19] Y. Fujisawa, Y. Otomo, Y. Ogata, Y. Nakamura, R. Fujita, Y. Ishitsuka, R. Watanabe, N. Okiyama, K. Ohara, and M. Fujimoto (2019) Deep-learning-based, computer-aided classifier developed with a small dataset of clinical images surpasses board-certified dermatologists in skin tumour diagnosis. British Journal of Dermatology 180 (2), pp. 373–381. Cited by: §1.
  • [20] S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728. Cited by: §6.3.2.
  • [21] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS, pp. 2672–2680. Cited by: §1, §2, §6.1.
  • [22] V. Gulshan, L. Peng, M. Coram, M. C. Stumpe, D. Wu, A. Narayanaswamy, S. Venugopalan, K. Widner, T. Madams, J. Cuadros, et al. (2016) Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. Jama 316 (22), pp. 2402–2410. Cited by: §1, §2.
  • [23] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.2.1, §3.2.3, §6.2.2.
  • [24] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pp. 6626–6637. Cited by: §6.3.3.
  • [25] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §3.2.2.
  • [26] S. Huang, C. Lin, S. Chen, Y. Wu, P. Hsu, and S. Lai (2018) Auggan: cross domain adaptation with gan-based data augmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 718–731. Cited by: 3rd item.
  • [27] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya, et al. (2019) Chexpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 590–597. Cited by: §4.5.2.
  • [28] P. Isola, J. Zhu, T. Zhou, et al. (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). Cited by: 1st item.
  • [29] A. Jacobi, M. Chung, A. Bernheim, and C. Eber (2020) Portable chest x-ray in coronavirus disease-19 (covid-19): a pictorial review. Clinical Imaging. Cited by: §1.
  • [30] B. Jing, Z. Wang, and E. Xing (2020) Show, describe and conclude: on exploiting the structure information of chest x-ray reports. arXiv preprint arXiv:2004.12274. Cited by: §1.
  • [31] B. Jing, P. Xie, and E. Xing (2017) On the automatic generation of medical imaging reports. arXiv preprint arXiv:1711.08195. Cited by: §1.
  • [32] A. E. Johnson, T. J. Pollard, S. Berkowitz, N. R. Greenbaum, M. P. Lungren, C. Deng, R. G. Mark, and S. Horng (2019) MIMIC-cxr: a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042 1 (2). Cited by: §4.1, §6.3.1.
  • [33] C. E. Kahn Jr, C. P. Langlotz, E. S. Burnside, J. A. Carrino, D. S. Channin, D. M. Hovsepian, and D. L. Rubin (2009) Toward best practices in radiology reporting. Radiology 252 (3), pp. 852–856. Cited by: §1.
  • [34] T. Karras, T. Aila, S. Laine, and J. Lehtinen (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. Cited by: §1, §2.
  • [35] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [36] T. Karras, S. Laine, M. Aittala, et al. (2019) Analyzing and improving the image quality of stylegan. arXiv preprint. Cited by: §1.
  • [37] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §6.3.2.
  • [38] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §2.
  • [39] D. P. Kingma and P. Dhariwal (2018) Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224. Cited by: §2.
  • [40] Y. Li, X. Liang, Z. Hu, and E. P. Xing (2018) Hybrid retrieval-generation reinforced agent for medical image report generation. In Advances in neural information processing systems, pp. 1530–1540. Cited by: §1.
  • [41] G. Liu, T. H. Hsu, M. McDermott, W. Boag, W. Weng, P. Szolovits, and M. Ghassemi (2019) Clinically accurate chest x-ray report generation. arXiv preprint arXiv:1904.02633. Cited by: §2.
  • [42] M. Liu and O. Tuzel (2016) Coupled generative adversarial networks. In Advances in neural information processing systems, pp. 469–477. Cited by: 1st item, §4.3.
  • [43] P. Luc, C. Couprie, S. Chintala, and J. Verbeek (2016) Semantic segmentation using adversarial networks. arXiv preprint arXiv:1611.08408. Cited by: §1.
  • [44] A. L. McGuire, R. Fisher, P. Cusenza, K. Hudson, M. A. Rothstein, D. McGraw, S. Matteson, J. Glaser, and D. E. Henley (2008) Confidentiality, privacy, and security of genetic and genomic test information in electronic health records: points to consider. Genetics in Medicine 10 (7), pp. 495–499. Cited by: §1.
  • [45] F. Milletari, N. Navab, and S. Ahmadi (2016) V-net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571. Cited by: §2.
  • [46] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §6.1.
  • [47] T. Miyato and M. Koyama (2018) CGANs with projection discriminator. arXiv preprint arXiv:1802.05637. Cited by: §2, §4.1.
  • [48] A. A. Mohamed, W. A. Berg, H. Peng, Y. Luo, R. C. Jankowitz, and S. Wu (2018) A deep learning method for classifying mammographic breast density categories. Medical physics 45 (1), pp. 314–321. Cited by: §1.
  • [49] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu (2016) Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759. Cited by: §2.
  • [50] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §4.2, §4.5.
  • [51] Y. Pu, S. Dai, Z. Gan, W. Wang, G. Wang, Y. Zhang, R. Henao, and L. Carin (2018) Jointgan: multi-domain joint distribution learning with generative adversarial nets. arXiv preprint arXiv:1806.02978. Cited by: 1st item, §4.3.
  • [52] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §2.
  • [53] P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz, K. Shpanskaya, et al. (2017) Chexnet: radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225. Cited by: §1.
  • [54] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training gans. In NIPS, pp. 2226–2234. Cited by: §2.
  • [55] L. H. Schwartz, D. M. Panicek, A. R. Berk, Y. Li, and H. Hricak (2011) Improving communication of diagnostic radiology findings through structured reporting. Radiology 260 (1), pp. 174–181. Cited by: §1, §4.5.
  • [56] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. (2016) Conditional image generation with pixelcnn decoders. In Advances in neural information processing systems, pp. 4790–4798. Cited by: §2.
  • [57] R. Vedantam, C. Lawrence Zitnick, and D. Parikh (2015) Cider: consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4566–4575. Cited by: §4.2, §4.5.
  • [58] X. Wang, Y. Peng, L. Lu, Z. Lu, and R. M. Summers (2018) Tienet: text-image embedding network for common thorax disease classification and reporting in chest x-rays. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9049–9058. Cited by: §2.
  • [59] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2018) Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318. Cited by: §2.
  • [60] J. Zhu, T. Park, P. Isola, et al. (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision (ICCV), Cited by: 1st item.
  • [61] J. Zhu, R. Zhang, D. Pathak, et al. (2017) Toward multimodal image-to-image translation. In Advances in neural information processing systems (NeurIPS), Cited by: 1st item.

6 Supplementary

6.1 Preliminaries: Generative Adversarial Networks

The Generative Adversarial Network (GAN) framework involves a generator (G) and a discriminator (D) network. The purpose of the generator G is to map random noise to samples, while the discriminator D classifies real and generated samples. The generator builds a mapping function from a prior noise distribution p_z(z) to the data space as G(z) so as to learn a generator distribution p_g, while the discriminator outputs a single scalar D(x) representing the probability that x came from the training data distribution p_data rather than from p_g. The basic GAN objective seeks a Nash equilibrium of the following two-player min-max problem [21]:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))],

where z is a latent variable drawn from a prior distribution p_z such as the unit Gaussian \mathcal{N}(0, I) or the uniform distribution \mathcal{U}(-1, 1). Generative adversarial networks can be extended to conditional versions if the generator and discriminator are conditioned on label information y [46]. The condition y and the noise z are combined in the joint representation of the generator, and the discriminator is provided with samples and labels as inputs. The objective function is then modified to

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z \mid y) \mid y))].
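For intuition, the value function can be evaluated directly on discriminator outputs. The sketch below uses made-up stand-in probabilities for D's outputs on real and generated batches; it is not the paper's training code.

```python
import numpy as np

d_real = np.array([0.9, 0.8, 0.95])   # D(x) on real samples (stand-in values)
d_fake = np.array([0.1, 0.2, 0.05])   # D(G(z)) on generated samples

# V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]
value = np.log(d_real).mean() + np.log1p(-d_fake).mean()
print(value)
```

A well-trained discriminator keeps both terms close to 0; a generator that fools the discriminator (D(G(z)) close to 1) drives the second term toward negative infinity, which is what the generator's updates exploit.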

6.2 Emixer: Architecture Details

6.2.1 Notations Table

We use the following notations to describe the different modules; they are summarized in Table 6.

Symbol Definition and description
X-ray images
Sentences in the X-ray report
Generated X-ray images
Generated X-ray reports
Dataset consisting of images, reports, and labels
Labels associated with images
Words in the sentences of the X-ray report
Noise vector for the generator
Generator neural network
X-ray image discriminator neural network
X-ray report discriminator neural network
Joint discriminator neural network
Report generator network
Table 6: Notations used in EMIXER

6.2.2 Emixer Model

In this section, we provide further descriptions of the different neural networks within EMIXER.

X-ray Image Generator: Figure 3 shows the architecture of the image generator. The generator accepts two inputs: (a) a noise vector and (b) class information represented as a one-hot vector. We embed the class information via a linear layer to obtain a class embedding vector. It has been shown that generators can use the latent space to influence features at different resolutions by providing direct connections from the noise vector to different layers of the generator. We therefore split the noise vector into smaller chunks (e.g., using torch.split, https://pytorch.org/docs/master/generated/torch.split.html). Each chunk is passed through a linear layer and concatenated with the class embedding, and the result is passed through three residual convolutional blocks [23]. We provide the details of this convolution block in Table 7, where height and width denote the input spatial dimensions and the remaining columns give the input and output channels of each block. The output from the previous layer and the concatenated chunk-plus-class vector are provided as inputs to each residual block. The final residual output is passed through a self-attention block, which applies a convolution operation with a softmax to obtain intermediate feature vectors that are combined with the original input. Finally, this output is passed through another convolutional block to obtain the generated image as the output of the generator.
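The noise-splitting scheme can be sketched as follows. This is a minimal NumPy sketch of the data flow only; the chunk sizes, embedding dimension, class count, and block count are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

noise_dim, n_blocks, class_dim = 120, 3, 128
z = rng.standard_normal(noise_dim)   # noise vector for the generator
y = np.zeros(14); y[3] = 1.0         # one-hot diagnosis label (14 classes assumed)

# Class embedding via a linear layer (weights are random stand-ins).
W_embed = rng.standard_normal((class_dim, y.size))
c = W_embed @ y

# Split z into one chunk per generator block, so each resolution
# receives a direct connection from the latent space.
chunks = np.split(z, n_blocks)       # three chunks of 40 dims each

# Each block consumes [chunk ; class embedding] alongside the previous feature map.
block_inputs = [np.concatenate([chunk, c]) for chunk in chunks]
print([b.shape for b in block_inputs])
```

Feeding a different slice of the latent code to each resolution is the same design motivation as in BigGAN-style generators: coarse layers control global structure while later layers control finer detail.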

Figure 3: Architectural layout of EMIXER image generator
Layer Kernel Output
Shortcut [1,1,1]
Table 7: Details of for generator

X-ray Image Discriminator: Figure 4 shows the architecture of the X-ray image discriminator, which is used to distinguish between real and fake X-ray images. The discriminator takes an X-ray image as input and passes it through multiple residual convolutional blocks. We provide the details of the convolution block in Table 8, where height and width denote the input spatial dimensions and the remaining columns give the input and output channels of each block. In each residual convolutional block the number of channels is doubled. The intermediate feature vector obtained from the residual blocks is passed through a pooling layer and a ReLU activation. Finally, we combine it with the projected condition vector and pass the result through a linear layer to obtain the final output.
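The projection-based conditioning in the last step can be sketched as below, in the style of the projection discriminator [47] cited earlier. The dimensions and random weights are illustrative stand-ins, not the trained model's values.

```python
import numpy as np

rng = np.random.default_rng(1)
feat_dim, n_classes = 256, 14

phi = rng.standard_normal(feat_dim)   # pooled feature vector from the residual blocks
y = 3                                 # index of the conditioning class label

psi = rng.standard_normal(feat_dim)             # weights of the final linear layer
V = rng.standard_normal((n_classes, feat_dim))  # learned class embedding matrix

# Projection discriminator output: an unconditional term plus the inner
# product between the class embedding and the image features.
logit = psi @ phi + V[y] @ phi
print(float(logit))
```

The inner-product term rewards features that align with the embedding of the claimed class, which is how the discriminator checks label consistency without concatenating the label to every layer.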

Layer kernel Output
Shortcut [1,1,1]
Conv [3,3,1]
Conv [3,3,1]
Table 8: Details of for discriminator.
Figure 4: Architectural layout of EMIXER image discriminator

X-ray Report Generator: We describe the architecture of the X-ray report generation module in Figure 5. The report generation component contains three sub-components: (a) an image encoder, (b) a sentence decoder, and (c) a word decoder. The image encoder CNN takes an X-ray image as input and produces feature vectors; this CNN is pre-trained on X-ray images using a DenseNet model. The sentence decoder produces topic vectors, which are used as inputs to the word decoder to produce the words of each sentence. After the word decoder produces all the words, they are combined to create the final report.
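The two-level decoding loop can be sketched structurally as follows. The decoders below are stand-ins (the real model uses learned recurrent networks), and the vocabulary, sentence count, and dimensions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
topic_dim, n_sentences, max_words = 64, 3, 5
vocab = ["the", "lungs", "are", "clear", "<eos>"]

image_features = rng.standard_normal(topic_dim)  # from the pre-trained image encoder

def sentence_decoder(features, t):
    """Stand-in: emit one topic vector per sentence step (an RNN in the real model)."""
    return features + 0.1 * t

def word_decoder(topic):
    """Stand-in: emit words for one sentence, terminated by <eos>."""
    idx = rng.integers(0, len(vocab) - 1, size=max_words)
    return [vocab[i] for i in idx] + ["<eos>"]

report = []
for t in range(n_sentences):                      # sentence-level loop over topics
    topic = sentence_decoder(image_features, t)
    report.append(" ".join(word_decoder(topic)))  # word-level loop per topic

print(report)
```

The key structural point is the nesting: one outer loop produces a topic per sentence, and an inner loop conditioned on that topic produces the words, which matches the hierarchical decoder described above.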

Figure 5: Architectural layout of EMIXER report generator

X-ray Report Discriminator: As shown in Figure 6, the X-ray report is passed as input to the report discriminator. Hierarchical LSTMs are used to represent paragraphs and sentences and to produce context vectors. We take the final representation obtained from the LSTM, pass it through a linear layer, and finally apply a softmax layer to obtain the probability of the report being real or fake.

Figure 6: Architectural layout of EMIXER text discriminator

Joint Discriminator: As shown in Figure 7, the X-ray report and image are used to create a joint embedding. The X-ray image is passed through a CNN to obtain an image feature vector, and the X-ray report is passed through an LSTM to obtain a report representation. The two feature vectors are concatenated to form a joint embedding, which is finally passed through a linear layer and a softmax layer to obtain the probability of the pair being real or fake.
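The joint discriminator head can be sketched as a concatenation followed by a linear layer and a softmax. The feature dimensions and random weights below are illustrative stand-ins for the learned encoders and classifier.

```python
import numpy as np

rng = np.random.default_rng(3)
img_dim, txt_dim = 256, 128

f_img = rng.standard_normal(img_dim)    # CNN feature vector of the X-ray image
f_txt = rng.standard_normal(txt_dim)    # LSTM representation of the report

joint = np.concatenate([f_img, f_txt])  # joint image-report embedding

W = rng.standard_normal((2, img_dim + txt_dim))  # linear layer: 2 logits (fake, real)
logits = W @ joint
probs = np.exp(logits - logits.max())
probs /= probs.sum()                    # softmax over {fake, real}
print(probs)
```

Because the two modalities are fused before classification, this head can penalize image-report pairs that are individually plausible but mutually inconsistent, which neither single-modality discriminator can do.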

Figure 7: Architectural layout of EMIXER joint discriminator

6.3 Appendix B Experimental Details

6.3.1 Dataset Details

We use the MIMIC-CXR dataset consisting of X-ray images and reports [32], collected at Beth Israel Deaconess Hospital. We apply pre-processing to remove duplicated samples from this dataset. The radiology reports typically contain an impression and a findings section; we extract the findings section from each report for training our models. We apply tokenization and keep only tokens with at least 6 occurrences in the corpus for training purposes.
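The frequency cutoff can be implemented with a simple counter. This is a sketch; the tokenizer, the `<unk>` convention, and the variable names are our assumptions rather than the paper's code.

```python
from collections import Counter

MIN_COUNT = 6  # keep tokens with at least 6 occurrences

corpus = [
    "lungs are clear".split(),
    "no pleural effusion".split(),
    # ... one token list per findings section
]

counts = Counter(tok for report in corpus for tok in report)
vocab = {tok for tok, c in counts.items() if c >= MIN_COUNT}

# Out-of-vocabulary tokens are typically mapped to a special <unk> symbol.
def encode(report):
    return [tok if tok in vocab else "<unk>" for tok in report]
```

With this toy two-report corpus every token occurs only once, so the whole vocabulary falls below the cutoff and `encode` maps everything to `<unk>`; on the full corpus the cutoff mainly removes rare misspellings and patient-specific strings.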

6.3.2 Architecture and hyperparameters

We use the Adam optimizer [37] to train EMIXER, with separate learning rates for the generative model and the discriminators. We stagger discriminator and generator steps in a 2:1 ratio, which led to 400k generator steps and 800k discriminator steps; this lets the discriminator update its parameters faster than the generator. We fix the batch size at 512 during training and use a noise vector of 120 dimensions as input to the generator. We also apply spectral normalization to the layers of the generator and discriminator during training. All the models generate X-ray images. We obtain partially labeled datasets for the self-supervised experiments by randomly selecting 30% of the samples from each class, and we rotate the images and use the rotation angles as labels for self-supervision [20].
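The rotation-prediction pretext task [20] can be sketched as follows. The four-angle setup follows the cited work; the batch and image sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
images = rng.standard_normal((8, 64, 64))  # a small batch of single-channel X-rays

def make_rotation_batch(batch):
    """Rotate each image by a random multiple of 90 degrees;
    the rotation index (0, 1, 2, 3) is a free self-supervision label."""
    labels = rng.integers(0, 4, size=len(batch))
    rotated = np.stack([np.rot90(img, k=int(lab)) for img, lab in zip(batch, labels)])
    return rotated, labels

x_rot, y_rot = make_rotation_batch(images)
print(x_rot.shape, y_rot.shape)  # (8, 64, 64) (8,)
```

Training a classifier to recover the rotation index forces the encoder to learn anatomical orientation cues without any diagnosis labels, which is what lets EMIXER exploit the 70% of samples left unlabeled in the self-supervised experiments.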

6.3.3 Evaluation Metrics

Fréchet Inception Distance (FID score): We first embed the real data and the generated samples using a specific layer of an Inception network pre-trained on chest X-ray images instead of ImageNet [24]. Then a multivariate Gaussian is fit to each set of embeddings and the distance is computed as

\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big),

where \mu and \Sigma denote the empirical mean and covariance, and the subscripts r and g denote the real and generated data respectively.
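The distance can be computed directly from the two Gaussian fits. In this NumPy sketch the matrix square-root trace is obtained from the eigenvalues of Σ_r Σ_g, which are real and non-negative for valid covariance matrices (standard implementations use scipy.linalg.sqrtm instead).

```python
import numpy as np

def fid(mu_r, sigma_r, mu_g, sigma_g):
    """Fréchet distance between two Gaussians fit to embeddings."""
    diff = mu_r - mu_g
    # Tr((Sigma_r Sigma_g)^{1/2}) via the eigenvalues of the product.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    sqrt_trace_prod = np.sqrt(np.clip(eigvals.real, 0, None)).sum()
    return diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2 * sqrt_trace_prod

# Identical distributions give a distance of zero.
mu = np.zeros(4); sigma = np.eye(4)
print(fid(mu, sigma, mu, sigma))  # 0.0
```

Lower is better: FID is zero only when the two fitted Gaussians coincide, and it grows with both mean shift and covariance mismatch between real and generated embeddings.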

6.4 Results

6.4.1 Phenotype Classification from X-ray Images with augmented data

We report the performance of EMIXER and the baseline models for detecting different phenotypes from chest X-ray images. The setup is similar to the earlier experiments: we train two models, one on real X-ray images and one on generated X-ray images, and evaluate both on held-out X-ray images. The performance on the test X-ray images is reported in Table 9.

Dataset Method Cardiomegaly Consolidation Pleural Effusion Pneumothorax Pulmonary Edema
MIMIC Real data [100k images] 0.812 0.847 0.753 0.735 0.732
CoGAN [100k images] 0.741 0.817 0.708 0.713 0.682
JointGAN [100k images] 0.732 0.785 0.724 0.681 0.713
EMIXER [100k images] 0.784 0.734 0.728 0.715 0.718
Table 9: Performance of X-ray image classification using synthetic X-ray

6.4.2 Performance comparison of augmented data to real data

We performed an experiment to evaluate augmented datasets in comparison to real datasets of similar size. In this setup, we keep the total dataset size constant at 100k and vary the ratio of real to synthetic images. We present the results of this experiment in Table 10. This experiment evaluates the performance of augmented datasets when the amount of real data is low. We show that even when we use fewer real images, augmented datasets show only a modest decrease in performance. This demonstrates that even with low data availability, synthetic data augmentation can perform competitively compared to models trained only on real X-ray images.

Method Data AUC Acc
Only Real R100k .824 .846
JointGAN R90k + S10k .796 .813
R80k + S20k .778 .801
R60k + S40k .745 .764
R20k + S80k .717 .732
CoGAN R90k + S10k .784 .808
R80k + S20k .771 .796
R60k + S40k .736 .757
R20k + S80k .712 .746
SMGAN R90k + S10k .794 .812
R80k + S20k .764 .783
R60k + S40k .742 .763
R20k + S80k .723 .742
EMIXER R90k + S10k .808 .828
R80k + S20k .792 .821
R60k+ S40k .773 .796
R20k + S80k .756 .774
Table 10: X-ray image classification performance comparison with EMIXER augmented data. Dataset size at 100k while reducing the amount of real images in the augmented dataset. In this table, R indicates Real data and S indicate Synthetic data.

6.4.3 Additional generated data samples

In Figures 8 and 9, we show additional generated X-ray image and report pairs in comparison to real X-ray image and report pairs. Figures 10 and 11 compare real X-ray images to synthetic X-ray images. Finally, Figure 12 shows more synthetic X-ray images.

Figure 8: Comparison of Real X-ray image and report pairs with generated X-ray images, reports pairs
Figure 9: Comparison of Real X-ray image and report pairs with generated X-ray images, reports pairs
Figure 10: Real X-ray images
Figure 11: Synthetic X-ray images

Figure 12: Samples of Synthetic X-ray images