Combining Noise-to-Image and Image-to-Image GANs: Brain MR Image Augmentation for Tumor Detection

05/31/2019 ∙ by Changhee Han, et al. ∙ The University of Tokyo

Convolutional Neural Networks (CNNs) can achieve excellent computer-assisted diagnosis performance, relying on sufficient annotated training data. Unfortunately, most medical imaging datasets, often collected from various scanners, are small and fragmented. In this context, as a Data Augmentation (DA) technique, Generative Adversarial Networks (GANs) can synthesize realistic/diverse additional training images to fill the data lack in the real image distribution; researchers have improved classification by augmenting images with noise-to-image (e.g., random noise samples to diverse pathological images) or image-to-image GANs (e.g., a benign image to a malignant one). Yet, no research has reported results combining (i) noise-to-image GANs and image-to-image GANs or (ii) GANs and other deep generative models, for further performance boost. Therefore, to maximize the DA effect with the GAN combinations, we propose a two-step GAN-based DA that generates and refines brain MR images with/without tumors separately: (i) Progressive Growing of GANs (PGGANs), multi-stage noise-to-image GAN for high-resolution image generation, first generates realistic/diverse 256 x 256 images--even a physician cannot accurately distinguish them from real ones via Visual Turing Test; (ii) UNsupervised Image-to-image Translation or SimGAN, image-to-image GAN combining GANs/Variational AutoEncoders or using a GAN loss for DA, further refines the texture/shape of the PGGAN-generated images similarly to the real ones. We thoroughly investigate CNN-based tumor classification results, also considering the influence of pre-training on ImageNet and discarding weird-looking GAN-generated images. The results show that, when combined with classic DA, our two-step GAN-based DA can significantly outperform the classic DA alone, in tumor detection (i.e., boosting sensitivity from 93.63 other tasks.






I Introduction

Convolutional Neural Networks (CNNs) are playing a key role in medical image analysis, updating the state-of-the-art in many tasks [1, 2, 3], when large-scale annotated training data are available. However, preparing such massive medical data is demanding; thus, for better diagnosis, researchers generally adopt classic Data Augmentation (DA) techniques, such as geometric/intensity transformations of original images [4, 5]. Those augmented images, however, intrinsically have a similar distribution to the original ones, resulting in limited performance improvement. In this sense, Generative Adversarial Network (GAN)-based DA can considerably increase the performance [6]; since the generated images are realistic but completely new samples, they can fill the real image distribution uncovered by the original dataset.

The main problem in computer-assisted diagnosis lies in small and fragmented medical imaging datasets from various scanners; thus, researchers have improved classification by augmenting images with noise-to-image GANs (e.g., random noise samples to diverse pathological images [7]) or image-to-image GANs (e.g., a benign image to a malignant one [8]). However, no research has reported results achieved by combining (i) noise-to-image GANs and image-to-image GANs or (ii) GANs and other common deep generative models, such as Variational AutoEncoders (VAEs) using a single objective [9], for further performance boost.

Fig. 1: Combining noise-to-image and image-to-image GAN-based DA for better tumor detection: PGGANs generates a number of realistic brain tumor/non-tumor MR images separately, UNIT/SimGAN refines them separately, and the binary classifier uses them as additional training data.

So, how can we maximize the DA effect under limited training images using the GAN combinations? Aiming to generate and refine brain MR images with/without tumors separately, we propose a two-step GAN-based DA approach: (i) Progressive Growing of GANs (PGGANs) [10], a low-to-high resolution noise-to-image GAN, first generates realistic and diverse 256 × 256 images; PGGANs is beneficial for DA since most CNN architectures adopt input sizes of around 256 × 256 (e.g., InceptionResNetV2 [11]: 299 × 299, ResNet-50 [12]: 224 × 224); (ii) UNsupervised Image-to-image Translation (UNIT) [13] or SimGAN [14], an image-to-image GAN combining GANs/VAEs or using a GAN loss for DA, further refines the texture/shape of the PGGAN-generated images to fit them into the real image distribution. We thoroughly investigate CNN-based tumor classification results, also considering the influence of pre-training on ImageNet [15] and of discarding weird-looking GAN-generated images. Moreover, we evaluate the synthetic images’ realism via a Visual Turing Test [16] by an expert physician, and visualize the data distribution of real/synthetic images via the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm [17]. When combined with classic DA, our two-step GAN-based DA approach remarkably outperforms the classic DA alone, boosting sensitivity from 93.63% to 97.53% (Table I). This paper remarkably improves our preliminary work [7], which aimed at investigating the potential of the PGGANs pre-trained on ImageNet, with minimal pre-processing and no refinement, for DA using a vanilla version of ResNet-50 (i.e., neither hyper-parameters nor settings were optimized).

Research Questions. We mainly address two questions:

  • GAN Selection: Which GAN architectures are well-suited for realistic/diverse medical image generation?

  • Medical DA: How to use GAN-generated images as additional training data for better CNN-based diagnosis?

Contributions. Our main contributions are as follows:

  • Whole Image Generation: This research shows that PGGANs can generate realistic/diverse whole medical images, and not only small pathological areas.

  • Two-step GAN-based DA: This novel two-step approach, combining for the first time noise-to-image and image-to-image GANs, remarkably boosts tumor detection performance.

  • Misdiagnosis Prevention: This study firstly analyzes how medical GAN-based DA is associated with pre-training on ImageNet and discarding weird-looking synthetic images to achieve high sensitivity with small/fragmented datasets from various scanners.

II Generative Adversarial Networks

VAEs often suffer from blurred samples despite easier training, due to the injected noise and imperfect reconstruction using a single objective function; meanwhile, GANs [6] have revolutionized image generation in terms of realism and diversity [18] based on a two-player objective function: a generator tries to generate realistic images to fool a discriminator while maintaining diversity; the discriminator attempts to distinguish between the real images and the generator’s synthetic images. However, the difficult GAN training arising from the two-player objective function is accompanied by artifacts and mode collapse [19] when generating high-resolution images [20]; to tackle this, multi-stage noise-to-image GANs have been proposed: AttnGAN [21] generates images from text using attention-based multi-stage refinement; PGGANs [10] generates realistic images using incremental multi-stage training from low resolution to high. In contrast, to obtain images with desired texture and shape, researchers have proposed image-to-image GANs: UNIT [13] translates images using both GANs and VAEs; SimGAN [14] translates images for DA using a self-regularization term and local adversarial loss.

Especially in medical imaging, to handle small and fragmented datasets from multiple scanners, researchers have exploited both noise-to-image and image-to-image GANs as DA techniques to improve classification: researchers used the noise-to-image GANs to augment liver lesion Computed Tomography (CT) [22] and chest cardiovascular abnormality X-ray images [23]; others used the image-to-image GANs to augment breast cancer mammography images [8] and bone lesion X-ray images [24], translating benign images to malignant ones and vice versa.

However, to the best of our knowledge, we are the first to combine noise-to-image and image-to-image GANs to maximize the DA performance. Moreover, this is the first medical GAN work generating whole images, instead of regions of interest (i.e., small pathological areas) alone, for robust classification. Along with classic image transformations, a novel approach—augmenting realistic and diverse whole medical images with the two-step GAN—may become a clinical breakthrough.

III Materials and Methods

III-A BRATS 2016 Training Dataset

We use a dataset of contrast-enhanced T1-weighted (T1c) brain axial MR images of High-Grade Glioma cases from the Multimodal Brain Tumor Image Segmentation Benchmark (BRATS) 2016 [25]. T1c is the most common sequence in tumor detection thanks to its high-contrast [26].

Fig. 2: Example real MR images used for PGGAN training.

III-B PGGAN-based Image Generation

Pre-processing For better GAN/ResNet-50 training, we select only the central slices among the whole slices to omit initial/final slices, which convey negligible useful information; also, since tumor/non-tumor annotation in the BRATS 2016 dataset, based on 3D volumes, is highly incorrect/ambiguous on 2D slices, we exclude (i) tumor images tagged as non-tumor, (ii) non-tumor images tagged as tumor, (iii) borderline images with unclear tumor/non-tumor appearance, and (iv) images with missing brain parts due to the skull-stripping procedure. (Although this discarding procedure could be automated, we conducted it manually for more reliability; this does not affect our conclusion.) For tumor detection, we divide the whole dataset into:

  • Training set
    ( patients/ tumor/ non-tumor images);

  • Validation set
    ( patients/ tumor/ non-tumor images);

  • Test set
    ( patients/ tumor/ non-tumor images).

During the GAN training, we only use the training set to be fair; for better GAN training, the training set images are zero-padded to reach a power of 2, 256 × 256 pixels from 240 × 240. Fig. 2 shows example real MR images.
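The zero-padding step can be sketched as follows. This is a minimal illustration only: it assumes the padding is centered (the text does not say where the zeros are placed), and the 240 × 240 input size and all function names are hypothetical choices for the example.

```python
def next_power_of_two(n: int) -> int:
    """Return the smallest power of two that is >= n (n must be positive)."""
    power = 1
    while power < n:
        power *= 2
    return power


def zero_pad_to_power_of_two(image):
    """Center a 2D image (list of equal-length rows) on a zero canvas
    whose side lengths are the next powers of two."""
    height, width = len(image), len(image[0])
    new_h, new_w = next_power_of_two(height), next_power_of_two(width)
    top = (new_h - height) // 2    # rows of zeros above the image
    left = (new_w - width) // 2    # columns of zeros left of the image
    padded = [[0] * new_w for _ in range(new_h)]
    for r, row in enumerate(image):
        padded[top + r][left:left + width] = row
    return padded


# Hypothetical 240 x 240 slice filled with ones (assumed input size).
slice_240 = [[1] * 240 for _ in range(240)]
padded_slice = zero_pad_to_power_of_two(slice_240)
```

Centering keeps the brain in the middle of the canvas; padding on one side only would be an equally valid reading of the text.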

PGGANs [10] is a GAN training method that progressively grows a generator and discriminator: starting from low resolution, new layers model details as training progresses. This study adopts the PGGANs to synthesize realistic and diverse brain MR images (Fig. 3); we train and generate tumor/non-tumor images separately.

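PGGAN Implementation Details The PGGAN architecture adopts the Wasserstein loss using gradient penalty [19]:

\[
\mathbb{E}_{\tilde{y} \sim \mathbb{P}_g}\big[D(\tilde{y})\big] - \mathbb{E}_{y \sim \mathbb{P}_r}\big[D(y)\big] + \lambda_{gp}\, \mathbb{E}_{\hat{y} \sim \mathbb{P}_{\hat{y}}}\Big[\big(\lVert \nabla_{\hat{y}} D(\hat{y}) \rVert_2 - 1\big)^2\Big],
\]

where the discriminator $D$ belongs to the set of 1-Lipschitz functions, $\mathbb{P}_r$ is the data distribution by the true data sample $y$, and $\mathbb{P}_g$ is the model distribution by the generated sample $\tilde{y}$. A gradient penalty is added for the random sample $\hat{y} \sim \mathbb{P}_{\hat{y}}$.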
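To make the structure of the Wasserstein loss with gradient penalty concrete, the sketch below evaluates it for a toy linear critic, whose input gradient is simply its weight vector. This is not the PGGAN implementation: the linear critic, the penalty weight of 10, and all names are assumptions made for illustration.

```python
import math
import random


def critic(weights, bias, x):
    """Toy linear critic D(x) = w . x + b (an assumption for illustration)."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def critic_input_gradient(weights, x):
    """For a linear critic, the gradient w.r.t. the input is w for any x."""
    return list(weights)


def wgan_gp_critic_loss(weights, bias, real_batch, fake_batch,
                        gp_weight=10.0, rng=None):
    """E[D(fake)] - E[D(real)] + gp_weight * E[(||grad D(interp)||_2 - 1)^2]."""
    rng = rng or random.Random(0)
    n = len(real_batch)
    fake_term = sum(critic(weights, bias, f) for f in fake_batch) / n
    real_term = sum(critic(weights, bias, r) for r in real_batch) / n
    penalty = 0.0
    for real, fake in zip(real_batch, fake_batch):
        eps = rng.random()  # random point on the line between real and fake
        interpolated = [eps * r + (1 - eps) * f for r, f in zip(real, fake)]
        grad = critic_input_gradient(weights, interpolated)
        grad_norm = math.sqrt(sum(g * g for g in grad))
        penalty += (grad_norm - 1.0) ** 2
    penalty /= n
    return fake_term - real_term + gp_weight * penalty


# Hypothetical two-dimensional "images".
example_real = [[1.0, 1.0], [2.0, 0.0]]
example_fake = [[0.0, 0.0], [0.0, 1.0]]
loss_unit_norm = wgan_gp_critic_loss([0.6, 0.8], 0.0, example_real, example_fake)
loss_large_norm = wgan_gp_critic_loss([3.0, 4.0], 0.0, example_real, example_fake)
```

With a unit-norm weight vector the penalty vanishes, whereas a larger weight norm is penalized quadratically; a real critic would need automatic differentiation to obtain the gradient at each interpolated sample.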

We train it for epochs with a batch size of 16 and learning rate for the Adam optimizer [27]. During training, we apply random cropping in 0–15 pixels as DA.

Fig. 3: PGGAN architecture for image generation.

III-C UNIT/SimGAN-based Image Refinement

Refinement We further refine the texture and shape of PGGAN-generated tumor/non-tumor images separately to fit them into the real image distribution using UNIT [13] or SimGAN [14]. SimGAN remarkably improved eye gaze estimation results after refining non-GAN-based synthetic images from the UnityEyes simulator via image-to-image translation [14]; thus, we also expect such performance improvement after refining synthetic images from a noise-to-image GAN (i.e., PGGANs) via an image-to-image GAN (i.e., UNIT/SimGAN) with considerably different GAN-based algorithms.

We randomly select real/ PGGAN-generated tumor images for tumor image training, and we performed the same for non-tumor image training. To find suitable refining steps for each architecture, we pick the UNIT/SimGAN models with the highest accuracy on tumor detection validation, when pre-trained and combined with classic DA, among // steps, respectively.

UNIT [13] is an image-to-image translation method based on both GANs and VAEs; it jointly learns image distributions in different domains using images from the marginal distributions in each domain with a shared-latent space.

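UNIT Implementation Details The UNIT architecture adopts the following loss:

\[
\min_{E_1, E_2, G_1, G_2}\; \max_{D_1, D_2}\; \mathcal{L}_{\mathrm{VAE}_1} + \mathcal{L}_{\mathrm{GAN}_1} + \mathcal{L}_{\mathrm{CC}_1} + \mathcal{L}_{\mathrm{VAE}_2} + \mathcal{L}_{\mathrm{GAN}_2} + \mathcal{L}_{\mathrm{CC}_2}.
\]

Using the multiple encoders $E_1$/$E_2$, generators $G_1$/$G_2$, discriminators $D_1$/$D_2$, and cycle-consistencies CC$_1$/CC$_2$, it jointly solves learning problems of the VAE$_1$/VAE$_2$ and GAN$_1$/GAN$_2$ for the image reconstruction streams, image translation streams, and cycle-reconstruction streams.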

We train it for steps with a batch size of 1 and learning rate for the Adam optimizer [27]. The learning rate is reduced by half every 20,000 steps. During training, we apply horizontal flipping as DA.
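The step-wise learning rate decay described for UNIT (halving every 20,000 steps) can be sketched as below. The initial learning rate used in the example is a hypothetical placeholder, not a value taken from the paper, and the function name is illustrative.

```python
def halved_learning_rate(initial_lr: float, step: int,
                         halve_every: int = 20_000) -> float:
    """Learning rate after `step` training steps when it is halved
    once every `halve_every` steps."""
    return initial_lr * 0.5 ** (step // halve_every)


# Hypothetical initial learning rate, chosen only for illustration.
assumed_initial_lr = 1e-4
schedule = {step: halved_learning_rate(assumed_initial_lr, step)
            for step in (0, 19_999, 20_000, 40_000)}
```

The integer division keeps the rate constant within each 20,000-step window, so the decay is a staircase rather than a smooth curve.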

SimGAN [14] is an image-to-image GAN designed for DA that adopts a self-regularization term/local adversarial loss; it updates a discriminator with a history of refined images.
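One plausible way to realize the history of refined images mentioned above is a fixed-capacity buffer that supplies half of each discriminator batch. The sketch below follows that idea; the class name, the capacity, and the half/half split are assumptions for illustration, not details confirmed by this paper.

```python
import random


class RefinedImageHistory:
    """Fixed-capacity buffer of previously refined images (illustrative)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.buffer = []
        self.rng = random.Random(seed)

    def mix(self, refined_batch):
        """Return a discriminator batch of the same size as `refined_batch`:
        half current refined images and half sampled from the history
        (once the history holds enough images), then update the history."""
        half = len(refined_batch) // 2
        if half > 0 and len(self.buffer) >= half:
            old = self.rng.sample(self.buffer, half)
            mixed = list(refined_batch[:len(refined_batch) - half]) + old
        else:
            mixed = list(refined_batch)
        # Store half of the current batch, overwriting random slots when full.
        for image in refined_batch[:half]:
            if len(self.buffer) < self.capacity:
                self.buffer.append(image)
            else:
                self.buffer[self.rng.randrange(self.capacity)] = image
        return mixed


history = RefinedImageHistory(capacity=4)
first = history.mix(["a1", "a2", "a3", "a4"])   # history still empty
second = history.mix(["b1", "b2", "b3", "b4"])  # half comes from history
third = history.mix(["c1", "c2", "c3", "c4"])   # history is full by now
```

Showing the discriminator older refinements is intended to stop it from forgetting artifacts that the refiner produced earlier in training.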

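SimGAN Implementation Details The SimGAN architecture adopts the following loss:

\[
\sum_i \mathcal{L}_{\mathrm{real}}(\theta; x_i, \mathcal{Y}) + \lambda\, \mathcal{L}_{\mathrm{reg}}(\theta; x_i),
\]

where $\theta$ denotes the function parameters, $x_i$ is the $i$-th PGGAN-generated training image, and $\mathcal{Y}$ is the set of the real images. The first part adds realism to the synthetic images, while the second part preserves the tumor/non-tumor features.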
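As a rough illustration of the two parts of this loss, the sketch below sums a realism term and a self-regularization term over a batch. It assumes the form used in the SimGAN paper [14], where the discriminator outputs the probability that a refined image is synthetic and the regularizer is a pixel-wise L1 distance; images are flattened lists and every name is hypothetical.

```python
import math


def refiner_loss(refined_images, synthetic_images, synthetic_probs,
                 reg_weight=1.0):
    """Sum over images of: -log(1 - p_i) + reg_weight * ||refined_i - x_i||_1.

    synthetic_probs[i] is the (assumed) discriminator probability that the
    refined version of image i is synthetic.
    """
    total = 0.0
    for refined, synthetic, prob in zip(refined_images, synthetic_images,
                                        synthetic_probs):
        realism = -math.log(1.0 - prob)              # first part: adds realism
        self_reg = sum(abs(r - s)                    # second part: stays close
                       for r, s in zip(refined, synthetic))
        total += realism + reg_weight * self_reg
    return total


# Hypothetical flattened two-pixel images.
example_refined = [[0.5, 0.5]]
example_synthetic = [[0.5, 0.0]]
loss_with_reg = refiner_loss(example_refined, example_synthetic, [0.5])
loss_without_reg = refiner_loss(example_refined, example_synthetic, [0.5],
                                reg_weight=0.0)
```

Raising `reg_weight` keeps the refined image closer to the PGGAN output, which is how the tumor/non-tumor content is preserved during refinement.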

We train it for steps with a batch size of 10 and learning rate for the Stochastic Gradient Descent (SGD) optimizer [28]. The learning rate is reduced by half at 15,000 steps. During training, we apply horizontal flipping as DA. We use batch normalization [29] layers.

III-D Tumor Detection Using ResNet-50

Pre-processing. As ResNet-50’s input size is 224 × 224 pixels, we resize the whole real images from 240 × 240 and the whole synthetic images from 256 × 256.

ResNet-50 [12] is a 50-layer residual learning-based CNN, and we adopt it to detect brain tumors in MR images (i.e., the binary classification of images with/without tumors). We chose ResNet-50 for comparing DA setups due to its outstanding performance in image classification tasks [30].

To confirm the effect of PGGAN-based DA and its refinement using UNIT/SimGAN, we compare the following DA setups under sufficient images both with/without ImageNet [15] pre-training/fine-tuning (i.e., 20 DA setups):

  1. 8,429 real images;

  2. + 200k classic DA;

  3. + 400k classic DA;

  4. + 200k PGGAN-based DA;

  5. + 200k PGGAN-based DA w/o clustering/discarding;

  6. + 200k classic DA & 200k PGGAN-based DA;

  7. + 200k UNIT-refined DA;

  8. + 200k classic DA & 200k UNIT-refined DA;

  9. + 200k SimGAN-refined DA;

  10. + 200k classic DA & 200k SimGAN-refined DA.

Whereas medical imaging researchers widely use the ImageNet initialization despite different textures of natural/medical images, a recent study found that such ImageNet-trained CNNs are biased towards recognizing textures rather than shapes [31]; thus, we aim to investigate how the medical GAN-based DA affects classification performance with/without the pre-training. As the classic DA, we adopt a random combination of horizontal/vertical flipping, rotation up to degrees, width/height shift up to , shearing up to , zooming up to , and constant filling of points outside the input boundaries (Fig. 4). For the PGGAN-based DA and its refinement, we only use success cases after discarding weird-looking synthetic images (Fig. 5); DenseNet-169 [32] extracts image features and k-means++ [33] clusters the features into groups, and then we manually discard each cluster containing similar weird-looking images. To verify its effect, we also conduct the PGGAN-based DA experiment without the discarding step.
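Once each synthetic image has a cluster label, the manual discarding step amounts to dropping every image whose cluster was flagged as weird-looking. The sketch below shows only that filtering stage; the feature extraction and clustering are assumed to have been run already, and the image identifiers, labels, and flagged clusters are hypothetical.

```python
def discard_flagged_clusters(image_ids, cluster_labels, flagged_clusters):
    """Keep only images whose cluster was NOT manually flagged as weird-looking."""
    flagged = set(flagged_clusters)
    return [image_id
            for image_id, label in zip(image_ids, cluster_labels)
            if label not in flagged]


# Hypothetical clustering result for six synthetic images.
example_ids = ["img0", "img1", "img2", "img3", "img4", "img5"]
example_labels = [0, 2, 1, 2, 0, 1]
# Suppose a human reviewer judged cluster 2 to contain artifacts.
kept_ids = discard_flagged_clusters(example_ids, example_labels,
                                    flagged_clusters=[2])
```

Reviewing clusters instead of single images is what makes the manual step tractable for hundreds of thousands of synthetic images.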

ResNet-50 Implementation Details The ResNet-50 architecture adopts the binary cross-entropy loss for binary classification both with/without ImageNet pre-training. For robust training, before the final sigmoid layer, we use a dropout [34], linear dense, and batch normalization [29] layers—training with GAN-based DA tends to be unstable especially without the batch normalization layer. We use a batch size of , learning rate for the SGD optimizer [28] with momentum, and early stopping of epochs. The learning rate was multiplied by every epochs for the training from scratch and by every epochs for the ImageNet pre-training.
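Early stopping, as mentioned above, can be sketched as follows. The text does not state which quantity is monitored, so this example assumes validation loss; the patience value and the loss curve are hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the (0-based) epoch at which training stops: the first epoch
    where the validation loss has not improved for `patience` epochs,
    or the last epoch if that never happens."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_losses) - 1


# Hypothetical validation losses per epoch.
example_val_losses = [1.0, 0.8, 0.9, 0.85, 0.95, 0.7]
stop_with_short_patience = early_stopping_epoch(example_val_losses, patience=2)
stop_with_long_patience = early_stopping_epoch(example_val_losses, patience=10)
```

A short patience stops before the late improvement at the final epoch, which illustrates why the patience setting interacts with the noisier training curves seen with GAN-based DA.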

Fig. 4: Example real MR image and its geometrically-transformed images.

Fig. 5: Example PGGAN-generated MR images: (a) success cases; (b) failed cases.


DA Setup                                           Accuracy (%)    Sensitivity (%)  Specificity (%)
8,429 real images                                  93.26 (86.38)   90.95 (88.94)    95.87 (83.62)
+ 200k classic DA                                  95.02 (92.21)   93.63 (90.21)    96.57 (95.11)
+ 400k classic DA                                  94.93 (93.24)   91.90 (90.91)    98.39 (95.97)
+ 200k PGGAN-based DA                              93.95 (86.25)   92.48 (87.25)    95.56 (84.78)
+ 200k PGGAN-based DA w/o clustering/discarding    94.80 (80.54)   91.82 (80.02)    98.39 (81.25)
+ 200k classic DA & 200k PGGAN-based DA            96.18 (95.63)   94.12 (94.24)    98.79 (97.28)
+ 200k UNIT-refined DA                             94.31 (83.68)   93.26 (87.75)    96.02 (78.48)
+ 200k classic DA & 200k UNIT-refined DA           96.70 (96.34)   95.48 (97.53)    98.29 (94.96)
+ 200k SimGAN-refined DA                           94.49 (77.66)   92.39 (82.03)    97.18 (71.98)
+ 200k classic DA & 200k SimGAN-refined DA         96.36 (95.04)   95.11 (95.07)    97.88 (94.96)


TABLE I: ResNet-50 tumor detection (i.e., binary classification) results with various DA, with (without) ImageNet pre-training.
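For reference, the three metrics reported in Table I follow from the binary confusion matrix, with tumor as the positive class. The sketch below uses hypothetical counts, not the paper's test-set counts.

```python
def detection_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Accuracy, sensitivity, and specificity in percent.

    tp/fn: tumor images classified as tumor/non-tumor;
    tn/fp: non-tumor images classified as non-tumor/tumor.
    """
    return {
        "accuracy": 100.0 * (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": 100.0 * tp / (tp + fn),   # share of tumors detected
        "specificity": 100.0 * tn / (tn + fp),   # share of non-tumors cleared
    }


# Hypothetical confusion-matrix counts for illustration only.
example_metrics = detection_metrics(tp=90, fn=10, tn=80, fp=20)
```

Sensitivity is the metric tied to overlooked tumors, since every false negative lowers it directly.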

III-E Clinical Validation Using Visual Turing Test

To quantitatively evaluate the (i) realism of the PGGAN-based synthetic images and (ii) clearness of their tumor/non-tumor features, we supply, in random order, to an expert physician a random selection of:

  • real tumor images;

  • real non-tumor images;

  • synthetic tumor images;

  • synthetic non-tumor images.

Then, the physician has to classify them as both (i) real/synthetic and (ii) tumor/non-tumor, without previously knowing which is real/synthetic and tumor/non-tumor. The so-called Visual Turing Test [16] can probe the human ability to identify attributes and relationships in images, also for visually evaluating GAN-generated images [14]; this also applies to medical images for clinical decision-making tasks [35, 36], wherein physicians’ expertise is critical.

III-F Visualization Using t-SNE

To visually analyze the distributions of geometrically-transformed images and of each type of GAN-based images (PGGANs/UNIT/SimGAN) against real images (i.e., 4 setups), we adopt t-SNE [17] on a random selection of:

  • real tumor images;

  • real non-tumor images;

  • geometrically-transformed or each GAN-based tumor images;

  • geometrically-transformed or each GAN-based non-tumor images.

We select only 300 images per each category for better visualization. The t-SNE method reduces the dimensionality to represent high-dimensional data in a lower-dimensional (2D/3D) space; it non-linearly balances between the input data’s local and global aspects using perplexity.

t-SNE Implementation Details The t-SNE uses a perplexity of for iterations to visually represent a 2D space.
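Perplexity can be read as the effective number of neighbors each point considers. The sketch below computes it as 2 raised to the Shannon entropy of a point's conditional neighbor distribution under a Gaussian kernel, following the usual t-SNE definition; the distances and bandwidths are hypothetical values for illustration.

```python
import math


def conditional_probabilities(squared_distances, sigma):
    """Gaussian-kernel neighbor probabilities p(j|i) for one point i,
    given its squared distances to the other points."""
    weights = [math.exp(-d / (2.0 * sigma ** 2)) for d in squared_distances]
    total = sum(weights)
    return [w / total for w in weights]


def perplexity(probabilities):
    """Perplexity = 2 ** H(P), with H the Shannon entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2.0 ** entropy


# Hypothetical squared distances from one image to three others.
example_sq_distances = [1.0, 4.0, 9.0]
narrow = perplexity(conditional_probabilities(example_sq_distances, sigma=0.1))
wide = perplexity(conditional_probabilities(example_sq_distances, sigma=10.0))
uniform_four = perplexity([0.25, 0.25, 0.25, 0.25])
```

A small bandwidth concentrates the probability on the nearest neighbor (perplexity near 1), while a large one spreads it over all neighbors; t-SNE searches for the bandwidth that matches the requested perplexity for each point.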

Fig. 6: Example PGGAN-generated MR images and their refined versions by UNIT/SimGAN.

IV Results

This section shows how PGGANs generates synthetic brain MR images and how UNIT and SimGAN refine them. The results include instances of synthetic images, their quantitative evaluation by an expert physician, their t-SNE visualization, and their influence on tumor detection.

IV-A MR Images Generated by PGGANs

Fig. 5 illustrates examples of synthetic MR images by PGGANs. We visually confirm that, for about of cases, it successfully captures the T1c-specific texture and tumor appearance, while maintaining the realism of the original brain MR images; but, for the rest, the generated images lack clear tumor/non-tumor features or contain unrealistic features (i.e., hyper-intensity, gray contours, and odd artifacts).

IV-B MR Images Refined by UNIT/SimGAN

UNIT and SimGAN refine PGGAN-generated images differently: they render the texture/contours while maintaining the overall shape (Fig. 6). Non-tumor images change more remarkably than tumor images for both UNIT/SimGAN; this probably derives from unsupervised image translation’s loss for consistency to avoid image collapse, resulting in conservative change for more complicated images.


[Table II: rows for Real/Synthetic Classification and Tumor/Non-tumor Classification]

TABLE II: Visual Turing Test results by an expert physician for classifying Real (R) vs PGGAN-based Synthetic (S) images and Tumor (T) vs Non-tumor (N) images.

Fig. 7: T-SNE plots with 300 tumor/non-tumor MR images per each category: Real images vs (a) Geometrically-transformed images; (b) PGGAN-generated images; (c) UNIT-refined images; (d) SimGAN-refined images.

IV-C Tumor Detection Results

Table I shows the brain tumor classification results with/without DA. ImageNet pre-training generally outperforms training from scratch despite different image domains (i.e., natural images to medical images). As expected, classic DA remarkably improves classification, while no clear difference exists between the 200,000/400,000 classic DA under sufficient geometrically-transformed training images. When pre-trained, each GAN-based DA (i.e., PGGANs/UNIT/SimGAN) alone helps classification due to the robustness from GAN-generated images; but, without pre-training, it harms classification due to the biased initialization from the GAN-overwhelming data distribution. Similarly, without pre-training, PGGAN-based DA without clustering/discarding causes poor classification due to the synthetic images with severe artifacts, unlike the PGGAN-based DA’s comparable results with/without the discarding step when pre-trained.

When combined with the classic DA, each GAN-based DA significantly outperforms the GAN-based DA or classic DA alone: the former fills the real image distribution uncovered by the original dataset, while the latter provides the robustness on training for most cases; here, both image-to-image GAN-based DA, especially UNIT, produce remarkably higher sensitivity than the PGGAN-based DA after refinement. Specificity is higher than sensitivity for every DA setup with pre-training, probably due to the training data imbalance; but interestingly, without pre-training, sensitivity is higher than specificity for both image-to-image GAN-based DA. Thus, when combined with the classic DA and trained without pre-training, the UNIT-based DA achieves the highest sensitivity (97.53% in Table I), which can significantly alleviate the risk of overlooking the tumor diagnosis.

IV-D Visual Turing Test Results

Table II indicates the confusion matrix for the Visual Turing Test. The expert physician classifies a few PGGAN-generated images as real despite their high resolution (i.e., 256 × 256 pixels). The synthetic images successfully capture tumor/non-tumor features; unlike the non-tumor images, the expert recognizes a considerable number of the mild/modest tumor images as non-tumor for both real/synthetic cases. This derives from clinical tumor diagnosis relying on a full 3D volume, instead of a single 2D slice.

IV-E t-SNE Results

As Fig. 7 shows, the real tumor/non-tumor image distributions largely overlap, while the non-tumor images are distributed more widely. The geometrically-transformed tumor/non-tumor image distributions also often overlap, and both are distributed more widely than the real ones. All GAN-based synthetic images by PGGANs/UNIT/SimGAN are distributed widely, while their tumor/non-tumor images overlap much less than the geometrically-transformed ones; the UNIT-refined images show a more similar distribution to the real ones than the PGGAN/SimGAN-based images, probably due to the UNIT’s loss function adopting both GANs/VAEs. Overall, the GAN-based images, especially the UNIT-refined images, fill the distribution uncovered by the real or geometrically-transformed ones with less tumor/non-tumor overlap.

V Conclusion

Visual Turing Test and t-SNE results show that PGGANs, multi-stage noise-to-image GAN, can generate realistic and diverse brain MR images with/without tumors separately. The generated images can improve tumor classification, when combined with classic DA—especially after refining them with UNIT or SimGAN, image-to-image GANs; thanks to an ensemble effect from those GANs’ different algorithms, the refined images can replace missing data points of the training dataset with less tumor/non-tumor overlap and regularize the model, and thus handle the data imbalance with improved generalization. Especially, UNIT outperforms SimGAN, probably due to the effect of combining both GANs and VAEs.

Regarding better medical GAN-based DA, ImageNet pre-training generally improves classification despite different textures of natural/medical images; but, without pre-training, the GAN-refined images may help achieve better sensitivity, which can alleviate the risk of overlooking the tumor diagnosis. GAN-generated images typically include odd artifacts; however, discarding them boosts DA performance only without pre-training.

Overall, by minimizing the number of annotated images required for medical imaging tasks, the two-step GAN-based DA can shed light not only on classification, but also on object detection [37] and segmentation [38]. Moreover, other potential medical applications exist: (i) a data anonymization tool to share patients’ data outside their institution for training without losing detection performance, a GAN-based application reported in [38]; (ii) a physician training tool to show random pathological images to medical students/radiology trainees despite infrastructural/legal constraints [39]. As future work, we plan to define a new GAN loss function that explicitly aims at optimizing the classification results, instead of visual realism, similarly to the three-player GAN proposed in [40].

VI Acknowledgment

This research was partially supported by Qdai-jump Research Program, JSPS KAKENHI Grant Number JP17K12752, and AMED Grant Number JP18lk1010028.