Along with classical methods [Rundo, rundo2016WIRN], Convolutional Neural Networks (CNNs) have dramatically improved medical image analysis [Bevilacqua, Brunetti], such as brain Magnetic Resonance Imaging (MRI) segmentation [Havaei, Kamnitas], primarily thanks to large-scale annotated training data. Unfortunately, obtaining such massive medical data is challenging; consequently, better training requires intensive Data Augmentation (DA) techniques, such as geometric/intensity transformations of the original images [Ronneberger, Milletari]. However, those transformed images intrinsically have a distribution similar to the original ones, leading to limited performance improvement; thus, generating realistic (i.e., similar to the real image distribution) but completely new samples is essential to fill the part of the real image distribution uncovered by the original dataset. In this context, Generative Adversarial Network (GAN)-based DA is promising, as it has shown excellent performance in computer vision, revealing good generalization ability. In particular, SimGAN outperformed the state of the art in eye-gaze estimation [Shrivastava].
Also in medical imaging, realistic retinal image and Computed Tomography (CT) image generation have been tackled using adversarial learning [Costa, Chuquicusma]; a very recent study reported performance improvement with synthetic training data in CNN-based liver lesion classification, using a small number of CT images for GAN training [Frid-Adar]. However, GAN-based image generation using MRI, the most effective modality for soft-tissue acquisition, has not yet been reported, due to the difficulties arising from low-contrast MR images, strong anatomical consistency, and intra-sequence variability; in our previous work [HAN], we generated MR images using conventional GANs, and even an expert physician failed to accurately distinguish between the real and synthetic images.
So, how can we generate highly-realistic and original-sized images, while maintaining clear tumor/non-tumor features, using GANs? Our aim is to generate GAN-based synthetic contrast-enhanced T1-weighted (T1c) brain MR images—the most commonly used sequence in tumor detection thanks to its high contrast [Militello, rundoCMPB2017]—for CNN-based tumor detection. This computer-assisted brain tumor MRI analysis task is clinically valuable for better diagnosis, prognosis, and treatment [Havaei, Kamnitas]. Generating such images is extremely challenging: (i) GAN training is unstable with high-resolution inputs, and severe artifacts appear due to the strong consistency in brain anatomy; (ii) brain tumors vary in size, location, shape, and contrast. However, it is beneficial, because most CNN architectures adopt fixed input sizes (e.g., Inception-ResNet-V2 [Szegedy], ResNet-50 [He]) and we can achieve better results with original-sized image augmentation—towards this, we use Progressive Growing of GANs (PGGANs), a multi-stage generative training method [Karras]. Moreover, an expert physician evaluates the generated images’ realism and tumor/non-tumor features via the Visual Turing Test [Salimans]. Using the synthetic images, our novel PGGAN-based DA approach achieves better performance in CNN-based tumor detection, when combined with classical DA (Fig. 1).
Our main contributions are as follows:
MR Image Generation: This research explains how to exploit MRI data to generate realistic and original-sized whole brain MR images using PGGANs, while maintaining clear tumor/non-tumor features.
MR Image Augmentation: This study shows encouraging results on PGGAN-based DA, when combined with classical DA, for better tumor detection and other medical imaging tasks.
2 Generative Adversarial Networks
Originally proposed by Goodfellow et al. in 2014 [Goodfellow], GANs have shown remarkable results in image generation [Zhu], relying on a two-player minimax game: a generator network aims at generating realistic images to fool a discriminator network that aims at distinguishing between real and synthetic images. However, the two-player objective leads to difficult training, accompanied by artifacts and mode collapse [Gulrajani], especially at high resolution. Deep Convolutional GAN (DCGAN) [Radford], the most standard GAN, achieves stable training, but only at limited resolution. In this context, several multi-stage generative training methods have been proposed: Composite GAN exploits multiple generators to separately generate different parts of an image [Kwak]; the PGGANs method adopts multiple training procedures from low resolution to high resolution to incrementally generate a realistic image [Karras].
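Formally, this two-player game optimizes the standard minimax objective from [Goodfellow]:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{\mathbf{x} \sim p_{\mathrm{data}}(\mathbf{x})}\left[\log D(\mathbf{x})\right]
  + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}\left[\log\left(1 - D(G(\mathbf{z}))\right)\right]
```

where the generator G maps a noise vector z to an image and the discriminator D outputs the probability that its input comes from the real data distribution.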
Recently, researchers have applied GANs to medical imaging, mainly for image-to-image translation, such as segmentation [Xue, Mahapatra] and cross-modality translation [Nie]. Since GANs allow for adding conditional dependency on input information (e.g., category, image, and text), such conditional GANs have been used to produce the desired corresponding images. However, GAN-based research on generating large-scale synthetic training images is limited, even though the biggest challenge in this field is handling small datasets.
Differently from a very recent DA work for CT liver lesion Region of Interest (ROI) classification [Frid-Adar], to the best of our knowledge, this is the first GAN-based whole MR image augmentation approach. This work is also the first to use PGGANs to generate medical images. Along with classical transformations of real images, a completely different approach—generating novel realistic images using PGGANs—may become a clinical breakthrough.
3 Materials and Methods
3.1 BRATS 2016 Training Dataset
This paper exploits a dataset of T1c brain axial MR images containing High-Grade Glioma cases to train PGGANs with sufficient data and image resolution. These MR images are extracted from the Multimodal Brain Tumor Image Segmentation Benchmark (BRATS) 2016 [Menze].
3.2 PGGAN-based Image Generation
3.2.1 Data Preparation
We select a central subset of the axial slices of each scan, omitting the initial/final slices, since they convey a negligible amount of useful information and negatively affect the training of both PGGANs and ResNet-50. For tumor detection, our whole dataset is divided into: (i) a training set; (ii) a validation set; (iii) a test set. Only the training set is used for the PGGAN training, to be fair. Since tumor/non-tumor annotations are based on 3D volumes, these labels are often incorrect/ambiguous on 2D slices; so, we discard (i) tumor images tagged as non-tumor, (ii) non-tumor images tagged as tumor, (iii) unclear boundary images, and (iv) too small/big images; after this selection, our datasets consist of:
Training set ( tumor/ non-tumor images);
Validation set ( tumor/ non-tumor images);
Test set ( tumor/ non-tumor images).
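The patient-level split described above can be sketched as follows (a minimal illustration; the function name and the split fractions are our assumptions, not the paper's exact protocol):

```python
import random

def split_by_patient(patient_ids, val_frac=0.1, test_frac=0.1, seed=42):
    """Split patient IDs (not individual slices) into train/val/test,
    so that all slices from one patient end up in exactly one subset.
    Splitting at the patient level avoids leaking near-identical
    neighboring slices across subsets."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test
```

Only the resulting training patients would then feed both the PGGAN training and the classifier training, as the section requires.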
3.2.2 PGGANs
PGGANs is a novel training method for GANs with a progressively growing generator and discriminator [Karras]: starting from low resolution, newly added layers model fine-grained details as training progresses. As Fig. 3 shows, we adopt PGGANs to generate highly-realistic and original-sized brain MR images; tumor/non-tumor images are separately trained and generated.
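The key PGGAN mechanic—smoothly fading in a newly added resolution level—can be sketched in NumPy (the function names and the nearest-neighbour upsampling are our simplifications of [Karras], not the paper's implementation):

```python
import numpy as np

def upsample_nn(img):
    """Nearest-neighbour 2x upsampling of a 2D image."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def fade_in(low_res_out, high_res_out, alpha):
    """Blend the upsampled output of the previous (lower-resolution)
    stage with the new stage's output; alpha ramps from 0 to 1 while
    the new layers train, so the network transitions smoothly."""
    return (1.0 - alpha) * upsample_nn(low_res_out) + alpha * high_res_out
```

Early in a stage (alpha near 0) the output is dominated by the already-trained low-resolution pathway; by the end (alpha = 1) the new high-resolution layers take over entirely.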
3.2.3 PGGAN Implementation Details
We use the PGGAN architecture with the Wasserstein loss using gradient penalty [Gulrajani]. Training uses a batch size of 16 and the Adam optimizer.
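The critic (discriminator) loss with gradient penalty [Gulrajani] takes the form:

```latex
L = \mathbb{E}_{\tilde{\mathbf{x}} \sim p_g}\left[D(\tilde{\mathbf{x}})\right]
  - \mathbb{E}_{\mathbf{x} \sim p_r}\left[D(\mathbf{x})\right]
  + \lambda\, \mathbb{E}_{\hat{\mathbf{x}} \sim p_{\hat{\mathbf{x}}}}\left[\left(\left\lVert \nabla_{\hat{\mathbf{x}}} D(\hat{\mathbf{x}}) \right\rVert_2 - 1\right)^2\right]
```

where p_r and p_g are the real and generated distributions, the penalty samples are taken uniformly along straight lines between pairs of real and generated images, and the weight of the penalty term is a hyperparameter.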
3.3 Tumor Detection Using ResNet-50
To fit ResNet-50’s input size, we center-crop the whole images.
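The cropping step can be sketched as follows (a minimal illustration; 224 is ResNet-50's standard input side, while the paper's exact source/target sizes are elided here):

```python
import numpy as np

def center_crop(img, size=224):
    """Center-crop a 2D slice to size x size pixels.
    Assumes the input is at least size pixels in each dimension."""
    h, w = img.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return img[top:top + size, left:left + size]
```

For example, `center_crop(np.zeros((240, 240)))` returns a 224 x 224 array taken from the center of the slice, preserving the brain while trimming background borders.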
ResNet-50 is a residual learning-based CNN with 50 layers [He]: unlike conventional CNNs that learn unreferenced functions, it reformulates the layers as learning residual functions, enabling sustainable and easy training. We adopt ResNet-50 to detect tumors in brain MR images, i.e., the binary classification of images with/without tumors.
To confirm the effect of PGGAN-based DA, the following classification results are compared: (i) without DA, (ii) with classical DA, (iii) with PGGAN-based DA, and (iv) with both classical DA and PGGAN-based DA; the classical DA adopts a random combination of horizontal/vertical flipping, rotation, width/height shifting, shearing, zooming, and constant filling of points outside the input boundaries (Fig. 4). For better DA, highly-unrealistic PGGAN-generated images are manually discarded.
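A minimal sketch of such classical augmentation, covering the flips and the constant-fill shift (the shift magnitude and fill value are illustrative assumptions; the paper additionally uses rotation, shearing, and zooming, whose ranges are elided):

```python
import numpy as np

def classical_augment(img, rng, max_shift=10, fill=0.0):
    """Apply one random classical augmentation to a 2D slice:
    optional horizontal/vertical flips plus a small width/height
    shift, filling points outside the input with a constant."""
    out = img.copy()
    if rng.random() < 0.5:
        out = np.fliplr(out)
    if rng.random() < 0.5:
        out = np.flipud(out)
    # Random shift with constant filling outside the boundaries.
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    h, w = out.shape
    shifted = np.full_like(out, fill)
    shifted[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        out[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return shifted
```

Sampling such transformed copies repeatedly from each training slice yields the enlarged classical-DA training set compared against the PGGAN-based one.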
3.3.3 ResNet-50 Implementation Details
3.4 Clinical Validation Using the Visual Turing Test
To quantitatively evaluate (i) how realistic the PGGAN-based synthetic images are, (ii) how obvious the synthetic images’ tumor/non-tumor features are, we supply, in a random order, to an expert physician a random selection of:
real tumor images;
real non-tumor images;
synthetic tumor images;
synthetic non-tumor images.
Then, the physician is asked to classify each of them as both (i) real/synthetic and (ii) tumor/non-tumor, without any prior training stage revealing which images are real/synthetic and tumor/non-tumor; here, we only show successful cases of synthetic images, as we can discard failed cases for better data augmentation. The so-called Visual Turing Test [Salimans] is used to probe the human ability to identify attributes and relationships in images, and also to evaluate the visual quality of GAN-generated images [Shrivastava]. Similarly, it applies to medical images in clinical environments [Chuquicusma, Frid-Adar], wherein physicians’ expertise is critical.
3.5 Visualization Using t-SNE
To visually analyze the distribution of both (i) real/synthetic and (ii) tumor/non-tumor images, we use t-Distributed Stochastic Neighbor Embedding (t-SNE) [Maaten] on a random selection of:
real non-tumor images;
geometrically-transformed non-tumor images;
PGGAN-generated non-tumor images;
real tumor images;
geometrically-transformed tumor images;
PGGAN-generated tumor images.
Only 300 images per category are selected for better visualization. t-SNE is a machine learning algorithm for dimensionality reduction, representing high-dimensional data in a lower-dimensional (2D/3D) space. It non-linearly adapts to the input data using a perplexity parameter, which balances the data’s local and global aspects.
3.5.1 t-SNE Implementation Details
We use t-SNE for 1,000 iterations to obtain a 2D visual representation.
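A minimal scikit-learn sketch of such a 2D embedding (the perplexity value and the random feature data are illustrative assumptions, since the paper's setting is elided):

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for flattened image features from the six categories;
# in practice each row would be one (real / transformed / generated) slice.
rng = np.random.default_rng(0)
features = rng.normal(size=(30, 64))

# Non-linear projection onto 2D; perplexity balances local vs. global structure.
embedding = TSNE(n_components=2, perplexity=5.0,
                 random_state=0).fit_transform(features)
# embedding has one 2D point per input sample, ready for a scatter plot.
```

Coloring the scatter plot by category (real/transformed/generated, tumor/non-tumor) then yields a figure like Fig. 6.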
4 Results
This section shows how PGGANs generate synthetic brain MR images. The results include instances of the synthetic images, their quantitative evaluation by an expert physician, and their influence on tumor detection.
4.1 MR Images Generated by PGGANs
Fig. 5 illustrates examples of synthetic tumor/non-tumor images generated by PGGANs. In our visual confirmation, in most cases PGGANs successfully captures the T1c-specific texture and tumor appearance while maintaining the realism of the original brain MR images; however, in the remaining cases, the generated images lack clear tumor/non-tumor features or contain unrealistic features, such as hyper-intensity, gray contours, and odd artifacts.
|ResNet-50 (w/o DA)|
|ResNet-50 (w/ 200k classical DA)|
|ResNet-50 (w/ 200k PGGAN-based DA)|
|ResNet-50 (w/ 200k classical DA + 200k PGGAN-based DA)|
4.2 Tumor Detection Results
Table 1 shows the classification results for detecting brain tumors with/without DA techniques. As expected, the test accuracy improves with the additional geometrically-transformed images for training. When only the PGGAN-based DA is applied, the test accuracy decreases drastically, with almost 100% sensitivity but low specificity, because the classifier recognizes the synthetic images’ prevailing unrealistic features as tumors, similarly to anomaly detection.
However, surprisingly, when it is combined with the classical DA, the accuracy increases with both higher sensitivity and specificity; this could occur because the PGGAN-based DA fills the part of the real image distribution uncovered by the original dataset, while the classical DA provides robustness in training for most cases.
|Real/Synthetic Classification||R as R||R as S||S as R||S as S|
|Tumor/Non-tumor Classification||T as T||T as N||N as T||N as N|
4.3 Visual Turing Test Results
Table 2 shows the confusion matrix for the Visual Turing Test. Differently from our previous work on GAN-based MR image generation [HAN], the expert physician easily recognizes synthetic images, while also tending to classify real images as synthetic; this can be attributed to the high resolution, associated with more difficult training and a detailed appearance that makes artifacts stand out, which is consistent with ResNet-50’s low tumor detection accuracy when only the PGGAN-based DA is applied. Generally, the physician’s tumor/non-tumor classification accuracy is high, and the synthetic images successfully capture tumor/non-tumor features. However, unlike for non-tumor images, the expert recognizes a considerable number of tumor images as non-tumor, especially among the synthetic images; this results from the ambiguous annotation of some real images, which is amplified in the synthetic images trained on them.
4.4 t-SNE Result
As presented in Fig. 6, the tumor/non-tumor images’ distribution shows a tendency for non-tumor images to locate from top left to bottom right and for tumor images to locate from top right to center, while the distinction is unclear, with partial overlaps. Classical DA covers a wide range, including zones without any real/GAN-generated images, but tumor/non-tumor images often overlap there. Meanwhile, PGGAN-generated images concentrate differently from real images, while showing more frequent overlaps than the real ones; this probably derives from those synthetic images with unsatisfactory realism and tumor/non-tumor features.
Our preliminary results show that PGGANs can generate original-sized realistic brain MR images and achieve higher performance in tumor detection, when combined with classical DA. This occurs because PGGANs’ multi-stage image generation obtains good generalization and synthesizes images filling the part of the real image distribution uncovered by the original dataset. However, considering the Visual Turing Test and t-SNE results, the yet unsatisfactory realism at high resolution strongly limits DA performance, so we plan to (i) generate only realistic images, and then (ii) refine the synthetic images to be more similar to the real image distribution.
For (i), we can map an input random vector onto each training image [Schlegl] and generate images with suitable vectors, to control the divergence of the generated images; virtual adversarial training could also be integrated to control the output distribution. Moreover, (ii) can be achieved by GAN/VAE-based image-to-image translation, such as Unsupervised Image-to-Image Translation Networks [Liu], considering SimGAN’s remarkable performance improvement after refinement [Shrivastava]. Furthermore, we should avoid real images with ambiguous/inaccurate annotation for better tumor detection.
Overall, our novel PGGAN-based DA approach sheds light on diagnostic and prognostic medical applications, not limited to tumor detection; future studies are needed to extend our encouraging results.
This work was partially supported by the Graduate Program for Social ICT Global Creative Leaders of The University of Tokyo by JSPS.