Adversarial Augmentation for Enhancing Classification of Mammography Images

02/20/2019 ∙ by Lukas Jendele, et al. ∙ ETH Zurich ∙ University Hospital Zurich

Supervised deep learning relies on the assumption that enough training data is available, which presents a problem for its application to several fields, such as medical imaging. Using a binary image classification task (breast cancer recognition) as an example, we show that pretraining a generative model for meaningful image augmentation helps enhance the performance of the resulting classifier. By augmenting the data, performance on downstream classification tasks could be improved even with a relatively small training set. We show that this "adversarial augmentation" yields promising results compared to classical image augmentation.


1 Introduction

Deep learning in computer vision has achieved great results in the past few years [Deng et al. (2009), Karras et al. (2017)]. Most of these have been enabled by more computational power and large amounts of data. Unfortunately, in many scientific fields such as medical imaging, there are usually several orders of magnitude fewer data samples to work with than in large-scale computer vision datasets. Leaving aside issues like anonymization and privacy, this poses several specific problems for anyone wishing to use medical imaging datasets:

  1. Scarcity — Data is hard to obtain; usually only a few samples are available per dataset.

  2. Bias — Medical and other small datasets usually contain many more negative (healthy) images than positive ones (with a valid and confirmed illness). The reason for that is that the data usually comes from a real-world diagnostic process, where data is obtained even at a low suspicion threshold, since the potential harms of the imaging procedure are far outweighed by the benefit of a prompt diagnosis. Furthermore, in screening settings a large population of completely symptom-free subjects is deliberately examined.

    Often present are also “confirmation” images – for a patient with a positive finding, many more images will be made to confirm the diagnosis and monitor the progress. This only increases the bias, as the dataset then has several positive images of the same patient. In other fields, variations on these processes also exist, all resulting in a similar bias.

  3. Noise — Introduced by capturing devices, errors made during data processing or storage, or from a naturally noisy population (e.g. synthetic implants, marker wires, or prior surgery related to the illness).

All three of these issues pose a significant challenge for training classification models. In this work, we aim to partially alleviate the first two problems in the context of binary image classification. Our contributions are the following:

  1. We train a generative model that has the ability to transform data from one class to the other and back with a CycleGAN architecture [Zhu et al. (2017)].

  2. We show that a classifier trained on the original data is partially fooled into thinking that the transformed images come from the respective real class-label distributions.

  3. We show that the performance of the classifier may improve when its training data is augmented with the transformed images, in comparison to classical image augmentation.

2 Related work

Generative Adversarial Networks (GANs), proposed by Goodfellow et al. (2014), have shown great potential for generating or modifying images. Many studies have focused on image augmentation using GANs Shrivastava et al. (2017); Mueller et al. (2018); Bousmalis et al. (2018). The application to the medical domain is logical, because it is generally difficult to obtain data there, and the datasets are naturally heavily imbalanced. Shin et al. (2018) focus on brain MRI augmentation using paired image-to-image translation similar to the pix2pix approach of Isola et al. (2017).

However, paired images (e.g. the same breast in the same view with and without cancer) are very hard to obtain. Thus, we focus on unpaired image augmentation. In their work on CycleGAN, Zhu et al. (2017) used a pair of GANs coupled with a cycle-consistency loss for unpaired image-to-image translation, and succeeded in converting images between two domains (e.g. horses to zebras). In our work, we adopt this idea to add cancerous features to mammography images or to remove them.

Parallel to our research, Sun et al. (2018) have applied the CycleGAN architecture to augment brain and liver MRI scans. Aligned with our work, they show that such augmentation boosts the classifier’s performance.

For the actual detection of cancerous lesions, there have been several studies utilizing deep neural networks on image patches, such as Lévy and Jain (2016). Smaller patches are used mostly for dimensionality reduction. There have also been attempts at detection by training on whole images Ribli et al. (2018); Shen (2017); Hussain et al. (2017). They all augment the dataset by translating, rotating, or flipping the images to improve the system's performance, which we also compare to in our experiments.

3 Model

Our approach consists of two models, trained separately. In the first step, we train a specific GAN architecture to learn a transformation from the domain of images of one class label to the domain of images of the other class label. In the second step, we use the generative model to augment the training data of a Faster R-CNN classifier Ren et al. (2015) to improve its performance.

3.1 Generative augmentation model

[Healthy → cancerous]  [Cancerous → healthy]

Figure 1: Given an image dataset with two classes (cancerous and healthy breast scans), the generative model learns to transform images from one class to the other.

The generative model is based on CycleGAN Zhu et al. (2017). Its goal is to perform unpaired translation of images from one domain to another, and back. It achieves this by training two generator–discriminator pairs and introducing a cycle-consistency loss. In our case, we apply it to generate and remove cancerous features from mammography images. Figure 1 shows the output of the generative model on two training samples.

More formally, CycleGAN transforms images from a domain $X$ to another domain $Y$. For that, it uses two independent mappings, $G : X \to Y$ and $F : Y \to X$. To train these mappings directly, one would need paired images, which are very hard to obtain (for example, the same patient's image with and without breast cancer, in the exact same orientation). Instead, CycleGAN uses a GAN-like loss by introducing a discriminator that attempts to differentiate images generated from the empirical domain $X$ from real images from $Y$ by learning a mapping $D_Y : Y \to [0, 1]$ (analogously $D_X$ for $F$).

Furthermore, it adds a cycle-consistency loss $\mathcal{L}_{cyc}$, which enforces the "identity" property $F(G(x)) \approx x$. All of this is done analogously for domain $Y$ as well. Figure 2 shows a simple diagram of the model.

The loss is composed of the following partial loss terms. The first is the classic adversarial GAN loss,

$\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{data}(x)}[\log(1 - D_Y(G(x)))],$

where $D_Y$ is the discriminator of the GAN on domain $Y$, and $G$ is the generator of samples in the domain $Y$ given a sample from $X$.

For training stability reasons, our implementation uses the alternative LSGAN Mao et al. (2017) loss function,

$\mathcal{L}_{LSGAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}[(D_Y(y) - b)^2] + \mathbb{E}_{x \sim p_{data}(x)}[(D_Y(G(x)) - a)^2],$

with parameters $a$ and $b$ as the targets for generated and real samples, respectively.

The second loss term corresponds to cycle-consistency losses for both directions:

$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\lVert F(G(x)) - x \rVert_1] + \mathbb{E}_{y \sim p_{data}(y)}[\lVert G(F(y)) - y \rVert_1].$

The loss of the final model sums all the partial loss terms with constant weights (regarded as hyperparameters):

$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda \, \mathcal{L}_{cyc}(G, F).$

The objective of training is summarized by the following optimization problem:

$G^*, F^* = \arg\min_{G, F} \max_{D_X, D_Y} \mathcal{L}(G, F, D_X, D_Y).$
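For concreteness, the following minimal sketch shows how these loss terms could be assembled in TensorFlow. The target values a = 0, b = 1 and the cycle weight lambda_cyc = 10 used as defaults below are common CycleGAN/LSGAN choices, not values taken from the text, and the function names are illustrative.

```python
import tensorflow as tf

def lsgan_d_loss(d_real, d_fake, a=0.0, b=1.0):
    """Least-squares discriminator loss (Mao et al., 2017): push real outputs
    towards b and generated outputs towards a."""
    return tf.reduce_mean((d_real - b) ** 2) + tf.reduce_mean((d_fake - a) ** 2)

def lsgan_g_loss(d_fake, b=1.0):
    """Least-squares generator loss: push discriminator outputs on generated
    samples towards the real target b."""
    return tf.reduce_mean((d_fake - b) ** 2)

def cycle_loss(x, x_rec, y, y_rec):
    """L1 cycle-consistency loss for both directions, F(G(x)) ~ x and G(F(y)) ~ y."""
    return tf.reduce_mean(tf.abs(x_rec - x)) + tf.reduce_mean(tf.abs(y_rec - y))

def generator_objective(d_y_fake, d_x_fake, x, x_rec, y, y_rec, lambda_cyc=10.0):
    """Total loss minimized by the generator pair (G, F)."""
    return (lsgan_g_loss(d_y_fake) + lsgan_g_loss(d_x_fake)
            + lambda_cyc * cycle_loss(x, x_rec, y, y_rec))
```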

3.1.1 Conditioning on regions of interest

To enhance the usefulness of our model, we add another input modality to our generative model that represents regions of interest in the picture. For example, for breast cancer imaging, this modality could contain a boolean mask indicating segmented regions with "suspicious" (potentially cancerous) tissue. This also allows for encoding of various invariants into the dataset. By varying the additional mask position spatially, we obtain several variants of the transformed image, which together encode a spatial equivariance of cancerous tissue that might not be represented in the original dataset due to the low number of samples. The datasets we use all contain masks (of varying quality) with highlighted lesions or benign masses, of the same dimensions as the image.

To model the additional data source, we append another channel to our input image and train the model with the mask as part of both the input and the output. The generator now receives a "two-channel" image and produces two channels instead of one. The final loss function is the sum of our loss function applied to each channel individually. The rest of the model remains the same. The changes in the formulation of the generators and discriminators are the following (shown for $G$, where $M$ denotes the domain of masks): $G : X \times M \to Y \times M$ and $D_Y : Y \times M \to [0, 1]$.
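A minimal sketch of this conditioning, assuming grayscale images and masks of the same spatial size; `generator` stands for any image-to-image network mapping a two-channel input to a two-channel output.

```python
import tensorflow as tf

def translate_with_mask(generator, image, mask):
    """Stack the image and its region-of-interest mask into one two-channel
    input, translate it, and split the result back into image and mask."""
    x = tf.concat([image, mask], axis=-1)                     # (..., H, W, 2)
    y = generator(x)                                          # (..., H, W, 2)
    translated_image, translated_mask = tf.split(y, num_or_size_splits=2, axis=-1)
    return translated_image, translated_mask
```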

Figure 2: CycleGAN model diagram with the cycle-consistency loss $\mathcal{L}_{cyc}$; only one translation direction is shown.

3.1.2 Removing checkerboard artifacts

Empirically, models with deconvolutional layers tend to exhibit "checkerboard" artifacts, especially when trained for longer amounts of time Odena et al. (2016). Therefore, in our experiments, we 1) substitute each deconvolution with nearest-neighbor upsampling followed by a convolution, and 2) initialize the kernel weights using ICNR Aitken et al. (2017). Generally, deconvolution preserves more detail and produces less blurry results than upsampling followed by a convolution. We also evaluated bilinear upsampling, but it empirically produced more artifacts than nearest-neighbor upsampling.
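As an illustration, the substitution can be written as follows in tf.keras; the 3x3 kernel and the upsampling factor of 2 are assumptions, and the ICNR initialization of the kernel weights is omitted for brevity.

```python
import tensorflow as tf

def upsample_conv(x, filters, kernel_size=3):
    """Replacement for a stride-2 deconvolution: nearest-neighbor upsampling
    followed by a stride-1 convolution, which tends to reduce checkerboard
    artifacts (Odena et al., 2016)."""
    x = tf.keras.layers.UpSampling2D(size=2, interpolation="nearest")(x)
    return tf.keras.layers.Conv2D(filters, kernel_size, strides=1, padding="same")(x)
```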

3.2 Neural classification model

The classifier model used for all experiments was an adaptation of Faster R-CNN Ren et al. (2015) that ranked second in the DREAM breast cancer detection challenge, proposed by Ribli et al. (2018). Faster R-CNN is a convolution-based network capable of classifying and localizing objects in an image. Pure classification networks (predicting a binary answer) are easier to train, and thus more commonly used for mammography images. However, we believe that localizing malignant tumors is important if the system were to be used in clinical routine, since it helps in verifying the decision. The network is based on ResNet-50, a 50-layer network with residual connections, pretrained on ImageNet Deng et al. (2009). Similarly to Ribli et al. (2018), we also changed the following parameters: we enabled the proposal network, and changed the proposal non-maximum suppression threshold to 0.5.

4 Experiments

To validate our ideas and claims, we propose several simple experiments in the domain of breast cancer recognition from 2D mammography images.

4.1 Model implementation

The generative augmentation models for all our experiments are based on the CycleGAN architecture Zhu et al. (2017) and are implemented in TensorFlow Abadi et al. (2016), starting from the TensorFlow research CycleGAN implementation: https://github.com/tensorflow/models. More details about the architectures and training procedures are provided in Appendix A.

4.2 Datasets

There are several datasets that relate to breast cancer diagnosis. In most of them one can observe the limitations that we outlined in the introduction. For our experiments, we used the following datasets: (1) BCDR Guevara Lopez et al. (2012), the Breast Cancer Digital Repository, several datasets from Portugal; (2) INbreast Moreira et al. (2012), the INbreast digital breast database, also from Portugal, where samples with a BiRads classification greater than 3 were considered positive (cancerous) and those lower than 3 negative (healthy); and (3) CBIS-DDSM Sawyer Lee et al. (2016), the Curated Breast Imaging Subset of DDSM (Digital Database for Screening Mammography), from the USA.

Dataset      Cancerous   Healthy
BCDR-1       55          199
BCDR-2       44          651
INbreast     100         270
CBIS         672         960

Split        Cancerous   Healthy
Training     655         1538
Evaluation   116         272
Testing      100         270

Table 1: Number of samples in the source datasets (top) and in our training/evaluation/testing split (bottom).

For the generative model, we use BCDR-1 and BCDR-2 (merged together) for training. For the classifier, we use both BCDR datasets along with CBIS with an 85% training and 15% evaluation split. Due to a high noise ratio in CBIS, we only used it for the classifier. We use the held-out INbreast dataset as a test dataset for both models. All images were downscaled to pixels due to hardware limitations. We also experimented with pixels, but the image quality was poorer. Table 1 shows the number of samples in the respective datasets.

4.3 Training a classifier

Our Faster R-CNN Ribli et al. (2018) based classifier was trained to localize malignant and benign lesions. We convert the pixel masks into a set of bounding boxes by applying Otsu threshold segmentation and taking the bounding box around every disconnected region. Images with no lesions, or with lesions whose bounding box area was smaller than pixels, were discarded, as R-CNN does not need to train on "negative" images. For each image, the model predicts a set of bounding boxes, corresponding scores, and classes. For evaluation, we treat an image as positive (cancerous) if the score of any bounding box with the malignant class is higher than a chosen, constant confidence threshold.
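The conversion from pixel masks to bounding boxes could look roughly as follows with scikit-image; `min_area` is a hypothetical placeholder, since the exact area threshold is not reproduced here.

```python
from skimage.filters import threshold_otsu
from skimage.measure import label, regionprops

def mask_to_boxes(mask, min_area=100):
    """Binarize a lesion mask with Otsu's threshold and return one bounding box
    (min_row, min_col, max_row, max_col) per disconnected region, discarding
    regions whose bounding box is too small."""
    if mask.max() == mask.min():          # empty mask: no lesions to extract
        return []
    binary = mask > threshold_otsu(mask)
    boxes = []
    for region in regionprops(label(binary)):
        min_row, min_col, max_row, max_col = region.bbox
        if (max_row - min_row) * (max_col - min_col) >= min_area:
            boxes.append((min_row, min_col, max_row, max_col))
    return boxes
```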

We train the classifier on different datasets for a maximum of 100,000 steps (batch size 8) and pick the best model based on ROC AUC Bradley (1997) on the evaluation set. Based on inspection of the evaluation set loss, we empirically chose the models trained for steps (for all model variants).
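For model selection, we assume the image-level score is the maximum malignant-box confidence per image, which matches the decision rule above; a sketch with scikit-learn:

```python
from sklearn.metrics import roc_auc_score

def image_level_auc(labels, per_image_box_scores):
    """labels: 1 for cancerous, 0 for healthy; per_image_box_scores: one list of
    malignant-class box scores per image (possibly empty). Each image is scored
    by its highest box confidence, then ROC AUC is computed over all images."""
    image_scores = [max(scores) if len(scores) > 0 else 0.0
                    for scores in per_image_box_scores]
    return roc_auc_score(labels, image_scores)
```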

4.4 “Fooling” a trained classifier

As a first step, we want to see if our classifier, trained only on original images, is "fooled" by the generated images. In other words, for correctly classified images, how often does the predicted label change after we run the images through the generative augmentation model? We evaluate this question on all of our test data (see Section 5).

Figure 3: Evaluation diagram showing whether the generator fooled the trained classifier into thinking that a generated image is from the target domain.
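The fooling rate can be measured as sketched below; `classify` and `translate` are placeholders for the trained classifier and the generative augmentation model, respectively.

```python
def fooled_rate(images, labels, classify, translate):
    """Among originally correctly classified images, return the fraction whose
    predicted label flips after passing them through the generative model."""
    fooled, correct = 0, 0
    for image, label in zip(images, labels):
        if classify(image) != label:
            continue                      # only count correctly classified originals
        correct += 1
        if classify(translate(image)) != label:
            fooled += 1
    return fooled / correct if correct else 0.0
```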

4.5 Improving the classifier

Secondly, we evaluate whether a classifier trained in the same way on a mixed dataset of original and augmented images performs better, both in terms of classification metrics and in terms of "being fooled". We also compare the model to standard augmentation techniques such as image translation, rotation, and horizontal flipping. We use the same training/evaluation/testing split, but balance the training dataset by converting all the healthy images to cancerous ones with the generative model and adding them to the dataset. The resulting dataset is then prepared in the same way as in Section 4.3.
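A sketch of this balancing step, with illustrative names; `healthy_to_cancer` stands for the trained generator that maps healthy images to cancerous ones.

```python
def balance_with_gan(healthy_images, cancerous_images, healthy_to_cancer):
    """Grow the minority (cancerous) class by translating every healthy image
    with the generator and adding the results to the cancerous set."""
    generated = [healthy_to_cancer(image) for image in healthy_images]
    return healthy_images, cancerous_images + generated
```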

5 Results

To visualize the results of our generative augmentation models, we show a uniformly random selection of images from the INbreast test dataset, augmented by our generative model, in Figures 4 and 5 (Appendix B).

Classifier training data   Correctly clf. %   Fooled %   ROC AUC %   F1 score %
Original
Classically augmented
GAN-augmented

Table 2: Fooling and improving the classifier, evaluated on the test dataset INbreast (a different patient population than the training set). GAN-augmented images are from the unconditioned GAN model because of its better image quality. Each run was repeated three times; shown are the average and the standard deviation of each value.

The first and second columns of Table 2 show that the classifier learns to be less fooled by our generative augmentation model if we augment the training set images using the same model, which confirms the intuition that this makes the classifier slightly more robust.

As shown in the first row of Table 2, the classifier performs reasonably well when trained on the original dataset and evaluated on a test split from that dataset (both in terms of ROC AUC and F1 score). The F1 score is computed using a custom bounding box proposal confidence threshold of , same as in Ribli et al. (2018).

When the training set images are augmented by our GAN (third row), the average ROC AUC goes down slightly, but the error margin is too large for a conclusive result. In line with what was previously shown by Becker et al. (2018b, a) and with our subjective assessment, this suggests that the GAN-generated data might be challenging for our classifier. The same conclusion applies to the experiment where we augment the training set images using traditional image augmentation techniques.

6 Discussion

Overall, our GAN training has been very prone to checkerboard and "S"-shaped artifacts, as can partially be seen in Figures 4 and 5 (Appendix B). We also experimented with both higher ( px) and lower ( px) image resolutions: the lower resolutions generally had fewer artifacts and faster training times, but a higher resolution is desirable when thinking about moving to full-field mammographic images in the future. Unfortunately, due to GPU memory limitations, the resolution could not be increased further. Our GAN models and RCNN-based classifiers train in less than 24 hours on an NVIDIA TITAN Xp GPU.

The classifier results are inconclusive, and it is not clear whether adding our augmented images helps the classifier achieve better performance. We hypothesize that this might be due to the noise in our data, as the results of Sun et al. (2018) suggest that the overall method is sound and can improve classifier performance if applied well.

7 Future work

Possible future improvements to our work include increasing the resolution without introducing artifacts, using approaches similar to Wang et al. (2018); stabilizing the training and results of the conditioned model; and leveraging that model fully to augment the images in pre-specified places. For more detailed images, we could explore approaches similar to the Self-Attention GAN Zhang et al. (2018), which attends to specific parts of the input image when generating the output. This would also help in interpreting the changes made by the GAN. Unfortunately, this approach is very memory-expensive.

Traditionally, Variational Autoencoders (VAEs) Kingma and Welling (2013) lack detail in the output images, while GANs lack "truthfulness" and may overgenerate parts of the image Sajjadi et al. (2018). As a more hybrid approach, we could combine a VAE with a GAN to model both the location and the image details jointly with one model, similarly to the approaches in Liu et al. (2017); Huang et al. (2018); Andermatt et al. (2018). To simplify the model, one could also try a StarGAN-like approach Choi et al. (2018), using only one generator/discriminator pair conditioned on the class label instead of two generators and discriminators.

8 Conclusion

In our work, we have shown that for binary image classification, there exists a simple way to potentially increase prediction accuracy through generative dataset augmentation. Leveraging the idea behind CycleGAN, we have designed a GAN that is able to translate images from one class label to the other, and we use that property to turn the training dataset of a classifier into a bigger, more balanced, and less sparse dataset. We have provided a proof-of-concept implementation and shown, on the challenging and noisy example of breast cancer recognition from mammography images, that we may be able to improve the performance of classifiers. This suggests that our generative augmentation model learns a meaningful approximation of the manifolds of our class labels.

We would like to thank the Computer Vision Lab at ETH Zürich for providing us with computational resources.

References

Appendix A Model implementation

We train all our GAN models for steps, using a learning rate of for the discriminators and for the generators. The optimization is performed using Adam Kingma and Ba (2015) and a batch size of 1. All code is available on GitHub: https://github.com/BreastGAN/augmentation.

The architectures of both discriminators are the same: 4 convolutional layers with reflection padding, with 64, 128, 256, and 512 filters and stride 2 for all layers except the last one, which has stride 1, each followed by a LeakyReLU activation function Hahnloser et al. (2000); Nair and Hinton (2010); Maas et al. (2013). All the convolutions have a kernel size of . The output is subsequently flattened to one channel using a stride-1 convolution with a sigmoid activation function.
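A sketch of this discriminator in tf.keras. The 4x4 kernel, the LeakyReLU slope of 0.2, the input resolution, and the use of "same" zero-padding in place of reflection padding are assumptions (common CycleGAN defaults), not values taken from the text.

```python
import tensorflow as tf

def build_discriminator(input_shape=(256, 256, 1)):
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    # Four convolutional layers with 64, 128, 256, 512 filters; stride 2 for all
    # layers except the last one, which uses stride 1.
    for filters, stride in [(64, 2), (128, 2), (256, 2), (512, 1)]:
        x = tf.keras.layers.Conv2D(filters, kernel_size=4, strides=stride,
                                   padding="same")(x)
        x = tf.keras.layers.LeakyReLU(0.2)(x)
    # Flatten to a single channel with a stride-1 convolution and a sigmoid.
    outputs = tf.keras.layers.Conv2D(1, kernel_size=4, strides=1, padding="same",
                                     activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs)
```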

Both generator networks consist of two convolutions with stride 2 to compress the dimensionality of the image, followed by 9 ResNet blocks (2 convolutional layers each). Lastly, the result is upsampled using two additional convolutional layers, as described in Section 3.1.2. All the generator layers use ReLU activation functions.
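A sketch of this generator in tf.keras. The kernel sizes, filter counts, input resolution, and the linear single-channel output layer are assumptions; only the overall structure (two stride-2 convolutions, 9 residual blocks, two upsampling stages) follows the description above.

```python
import tensorflow as tf

def residual_block(x, filters=256):
    """ResNet block with two convolutional layers and a skip connection."""
    shortcut = x
    y = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(y)
    return tf.keras.layers.Add()([shortcut, y])

def build_generator(input_shape=(256, 256, 1)):
    inputs = tf.keras.Input(shape=input_shape)
    # Two stride-2 convolutions compress the spatial dimensions of the image.
    x = tf.keras.layers.Conv2D(128, 3, strides=2, padding="same",
                               activation="relu")(inputs)
    x = tf.keras.layers.Conv2D(256, 3, strides=2, padding="same",
                               activation="relu")(x)
    # 9 ResNet blocks (2 convolutional layers each) with ReLU activations.
    for _ in range(9):
        x = residual_block(x, 256)
    # Two upsampling stages: nearest-neighbor upsampling followed by a
    # convolution (Section 3.1.2); the last one maps back to one image channel.
    x = tf.keras.layers.UpSampling2D(size=2, interpolation="nearest")(x)
    x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.UpSampling2D(size=2, interpolation="nearest")(x)
    outputs = tf.keras.layers.Conv2D(1, 3, padding="same")(x)
    return tf.keras.Model(inputs, outputs)
```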

Appendix B Random samples from our GAN augmentation models

[Healthy (top) to cancerous (bottom).]

[Cancerous (top) to healthy (bottom).]

Figure 4: Random samples of images from our trained GAN (without masks, px).

[Healthy (top) to cancerous (bottom), mask (middle).]

[Cancerous (top) to healthy (bottom), mask (middle).]

Figure 5: Random samples of images from our trained GAN (with masks, px).