Deep Neural Networks (DNNs) currently achieve state-of-the-art results across a variety of areas. In medical imaging, DNNs have had a significant impact by classifying many diseases with high accuracy, including skin cancer esteva2017dermatologist; fornaciali2016towards; perez2018data; valle2017data. One significant problem with adopting DNNs for skin cancer classification is the lack of labeled data, which leads to skewed class distributions: the class frequencies in existing medical image datasets are imbalanced. This imbalance hinders the generalization of trained DNN models and biases them towards the dominant classes. In the ISIC2018 challenge, there are 10015 skin lesion images in total, and the class distribution is heavily skewed among the seven types of skin lesions. Therefore, state-of-the-art approaches for skin lesion classification and segmentation rely on heavy data augmentation yu2016automated; matsunaga2017image.
Data augmentation alleviates the lack of labeled data by using existing data more effectively. It applies various transformations krizhevsky2012imagenet to the original dataset to increase both the amount and the diversity of data. These transformations include flips, rotations, random translations, and the addition of Gaussian noise. Data augmentation is a vital technique not only for imbalanced data but for datasets of any size, even the largest ones such as ImageNet deng2009imagenet, since it helps DNNs exploit invariances in the existing data, which leads to robust and well-generalizing models.
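The transformations listed above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `augment`, the parameters `max_shift` and `noise_std`, the restriction to rotations by multiples of 90 degrees, and the wrap-around translation via `np.roll` are choices made for this sketch and are not taken from any of the cited works.

```python
import numpy as np

def augment(image, rng, max_shift=4, noise_std=0.02):
    """Return one randomly transformed copy of a square H x W x C image in [0, 1].

    All parameter values are illustrative assumptions.
    """
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1]                      # horizontal flip
    if rng.random() < 0.5:
        out = out[::-1, :]                      # vertical flip
    out = np.rot90(out, k=int(rng.integers(0, 4)))  # rotate by 0, 90, 180 or 270 degrees
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    # Translation with wrap-around (a simplification of padding-based shifts).
    out = np.roll(out, shift=(int(dy), int(dx)), axis=(0, 1))
    out = out + rng.normal(0.0, noise_std, size=out.shape)  # additive Gaussian noise
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))                 # stand-in for a real image
batch = np.stack([augment(image, rng) for _ in range(8)])
```

Each call produces a different variant of the same labeled image, which is how augmentation increases the amount and diversity of training data without new annotations.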
Although data augmentation is an effective technique for improving the accuracy of DNNs, it has two potential issues. First, improved data augmentation policies have to be searched for manually, based on an understanding of the existing dataset. For example, the labels of a handwritten character dataset should be invariant to small shifts in location, small rotations or shears, and changes in intensity, stroke thickness, and size. All these transformations keep the generated samples recognizable as valid data samples that mix with samples of the same label in feature space. Second, even when augmentation improvements have been found for a particular dataset, they may not transfer to other datasets as effectively. For example, rotating images during training is an effective data augmentation technique on CIFAR-10 krizhevsky2009learning, but not on MNIST lecun1998mnist, since the classifier will be unable to distinguish properly between the handwritten digits “6” and “9”.
To bypass these data augmentation issues, we propose using Generative Adversarial Networks (GANs) goodfellow2014generative to automatically learn an improved invariance space in order to generate samples that preserve the class labels. The potential of GANs is huge and lies in their attempt to model the real image distribution by forcing the synthesized samples to be indistinguishable from real images. Based on these generative models, the first successful attempts at medical data augmentation using GANs were made in antoniou2017data; frid2018synthetic at the level of small patches. For skin cancer classification, skin images should have a high resolution in order to reveal the malignancy markers that distinguish a benign from a malignant skin lesion. Very few works have shown outstanding results for high-resolution image generation. For example, progressive growing of GANs (PGAN) karras2017progressive generates celebrity faces at up to 1024 × 1024 pixels. The underlying idea is to start by feeding the network low-resolution samples and then progressively increase the resolution of the generated images by gradually adding new layers to the generator and discriminator networks, which leads to more stable training behavior and very realistic synthetic images.
In this paper, we propose a novel enhancement to PGAN that uses a self-attention mechanism to generate high-definition, visually appealing, and clinically meaningful synthetic skin lesion images. To the best of our knowledge, this work is the first to successfully incorporate the self-attention mechanism into PGAN, increasing the perceptual quality of the images by modelling attention-driven long-range dependencies. Moreover, the Two Time-Scale Update Rule (TTUR), i.e., imbalanced learning rates for the generator and the discriminator, is used to improve network stability at high resolutions. The ISIC2018 challenge public dataset tschandl2018ham10000; codella2018skin is used to train PGAN, attention progressive growing of GANs (APGAN), and APGAN with TTUR (APGAN+TTUR). Samples generated by APGAN+TTUR are shown in Figure 1. To identify the best-performing GAN, we use the GAN-train and GAN-test measures of shmelkov2018good; the best-performing GAN is then used to augment the training data of the ISIC2018 dataset. Experiments show that our method improves the classification accuracy by 2.8% over a classifier trained on real images only.
Briefly, the key contributions of this paper are:
A novel enhancement of the progressive GAN based on self-attention.
A stabilized training procedure that increases the stability of training behavior.
The generation of high-definition, visually appealing, and clinically meaningful skin lesion images.
Improved classification accuracy over the corresponding real-only and standard augmentation counterparts.
The remainder of this paper is organized as follows. Section 2 briefly recapitulates the APGAN framework as well as the stabilization technique. Section 3 outlines our methodology, which builds upon previously published literature, and discusses the results of our experiments in detail. In addition, it discusses artifacts of the generated samples and the use of APGAN+TTUR as an augmenter to boost the classification accuracy. Finally, the Conclusion summarizes our work and gives an outlook on future work.
2 Proposed Approach
In order to tackle the class imbalance problem, we have to automatically find class-preserving transformations that generate valid and representative samples to further boost the classification accuracy. For skin cancer classification, however, the samples must have a high level of detail (high resolution) in order to reveal malignancy markers and the fine-grained details that distinguish a benign from a malignant skin lesion. To automatically find class-preserving transformations and generate high-resolution skin lesion images, we propose the APGAN+TTUR framework shown in Figure 2. The framework is based on three components: (i) progressive growing of GANs, (ii) self-attention, and (iii) the Two Time-Scale Update Rule. The overall process is illustrated in Figure 3.
2.1 Progressive Training of GANs
Research on GANs has recently led to a breakthrough in synthesizing images of ever-increasing resolution in the work of karras2017progressive. The underlying idea is to facilitate high-resolution image synthesis from noise at unprecedented levels of quality and realism. The output resolution of the generator and the input resolution of the discriminator are ramped up simultaneously by gradually adding new layers to both networks, which leads to very stable training behavior and very realistic synthetic images at resolutions up to 1024 × 1024 pixels. Progressive training also reduces training time, since most of the iterations are done at lower resolutions where the networks are small. The original work includes several further important contributions: a dynamic weight initialization method is proposed to equalize the learning rate between parameters at different depths, batch normalization is replaced with a variant of local response normalization in order to constrain signal magnitudes in the generator, and a new evaluation metric is proposed (sliced Wasserstein distance). Our APGAN and APGAN+TTUR frameworks build on the PGAN architecture, since it has shown outstanding results at generating high-resolution images with a minimum number of parameters.
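To make the normalization step concrete, the following NumPy sketch shows a pixelwise feature vector normalization, which is our reading of the "variant of local response normalization" used in karras2017progressive: each pixel's feature vector is rescaled to unit mean square across channels. The function name `pixel_norm`, the (C, H, W) array layout, and the value of `eps` are assumptions of this sketch, not details confirmed by the text above.

```python
import numpy as np

def pixel_norm(features, eps=1e-8):
    """Rescale the feature vector at every pixel to unit mean square.

    features: array of shape (C, H, W); eps is an assumed small constant
    that avoids division by zero.
    """
    mean_sq = np.mean(features ** 2, axis=0, keepdims=True)  # (1, H, W)
    return features / np.sqrt(mean_sq + eps)

rng = np.random.default_rng(0)
activations = 50.0 * rng.normal(size=(16, 8, 8))  # deliberately large magnitudes
normalized = pixel_norm(activations)
per_pixel_mean_sq = np.mean(normalized ** 2, axis=0)  # close to 1 at every pixel
```

Because the operation has no trainable parameters and acts per pixel, it bounds signal magnitudes in the generator regardless of the current resolution.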
2.2 Self-Attention Progressive GAN
Self-attention is widely used in various tasks, such as machine translation gehring2016convolutional; gehring2017convolutional; vaswani2017attention, graph embedding velivckovic2017graph, generative modeling zhang2018self, and visual recognition wang2017residual; hu2018relation; wang2018non; yuan2018ocnet. The basic building block of state-of-the-art architectures in computer vision is the convolution operation, which is stacked in multiple layers to learn a hierarchy of features. However, due to the design of convolutional filters, the information flow in convolutional neural networks is restricted to local neighborhood regions, which limits the overall understanding of complex scenes. This problem can be seen in radford2015unsupervised, where convolution layers are the main building block for image generation. Such experiments have one thing in common: convolution operations alone fail to capture geometrical shapes. For example, four-legged animals demand long-range dependencies in the generator because of their complex contours. Recently, Zhang et al. zhang2018self incorporated a self-attention mechanism that acts complementary to the convolution operation. Furthermore, Brock et al. brock2018large used a self-attention mechanism for high-fidelity natural image synthesis, improving the state-of-the-art Inception score (IS) from 52.52 to 166.5 and the Fréchet Inception distance (FID) from 18.65 to 7.4.
The non-local block can be regarded as a global context modeling block, which aggregates query-specific global context features (a weighted average over all positions via a query-specific attention map) to each query position. As attention maps are computed for each query position, the time and space complexity of the non-local block are both quadratic in the number of positions $N_p$. Mathematically, the non-local block can be expressed as
$$z_i = x_i + W_z \sum_{j=1}^{N_p} \frac{f(x_i, x_j)}{\mathcal{C}(x)} \, (W_v \cdot x_j), \tag{1}$$
where $i$ is the index of query positions and $j$ enumerates all possible positions. $f(x_i, x_j)$ denotes the relationship between positions $i$ and $j$ and has a normalization factor $\mathcal{C}(x)$. $W_z$ and $W_v$ denote linear transform matrices (e.g., 1x1 convolution). For simplification, we denote $\omega_{ij} = f(x_i, x_j) / \mathcal{C}(x)$ as the normalized pairwise relationship between positions $i$ and $j$. Based on the observation that the attention maps for different query positions are almost the same, cao2019gcnet simplify the non-local block by computing a global (query-independent) attention map and sharing it across all query positions. They omit $W_z$ in the simplified version. Hence, the simplified non-local block (SNL) is defined as
$$z_i = x_i + \sum_{j=1}^{N_p} \frac{\exp(W_k x_j)}{\sum_{m=1}^{N_p} \exp(W_k x_m)} \, (W_v \cdot x_j), \tag{2}$$
where $W_k$ and $W_v$ denote linear transformation matrices. They also reduce the computational cost of Equation 2 by applying the distributive law to move $W_v$ outside of the attention pooling, as
$$z_i = x_i + W_v \sum_{j=1}^{N_p} \frac{\exp(W_k x_j)}{\sum_{m=1}^{N_p} \exp(W_k x_m)} \, x_j. \tag{3}$$
Equation 3 is illustrated in Figure 4.
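The equivalence between Equations 2 and 3 can be checked numerically. The NumPy sketch below assumes the simplified non-local block of cao2019gcnet, i.e., a softmax over positions of the scores $W_k x_j$ followed by a weighted sum, with the transform $W_v$ applied either inside (Equation 2) or outside (Equation 3) the pooling. The function names, the (N_p, C) layout, and the random weights are illustrative assumptions, not the implementation used in our experiments.

```python
import numpy as np

def attention_weights(x, w_k):
    """Query-independent softmax attention over positions. x: (N_p, C), w_k: (C,)."""
    logits = x @ w_k
    logits = logits - logits.max()        # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def snl_equation_2(x, w_k, w_v):
    """Transform every position with W_v, then pool (N_p matrix-vector products)."""
    attn = attention_weights(x, w_k)
    transformed = x @ w_v.T               # row j holds W_v x_j
    return x + attn @ transformed         # pooled (C,) vector broadcast to all positions

def snl_equation_3(x, w_k, w_v):
    """Pool first, then apply W_v once (a single matrix-vector product)."""
    attn = attention_weights(x, w_k)
    context = attn @ x                    # global context vector, shape (C,)
    return x + w_v @ context

rng = np.random.default_rng(0)
n_positions, channels = 16, 4
x = rng.normal(size=(n_positions, channels))
w_k = rng.normal(size=channels)
w_v = rng.normal(size=(channels, channels))
z2 = snl_equation_2(x, w_k, w_v)
z3 = snl_equation_3(x, w_k, w_v)
```

Both variants add the same context vector to every position; the second one only applies the linear transform once, which is where the cost reduction comes from.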
Our approach uses the block of Equation 3 to introduce self-attention into the PGAN architecture. Incorporating self-attention teaches the PGAN to focus on target structures of varying shapes and sizes. In other words, the discriminator implicitly learns to suppress irrelevant regions in an input image while highlighting salient features useful for the task, which in turn leads the generator to produce high-quality images with fine-grained details. An overview of the setup is given in Figure 2. The self-attention block is inserted before the downsampling layer of the discriminator and after the upsampling layer of the generator.
2.3 Two Time-Scale Update Rule
Despite using progressive training, we still had to overcome notable stability issues caused by the high resolution. We therefore use the Two Time-Scale Update Rule (TTUR), a stabilization technique for GAN training that improves both quantitative and qualitative results, as shown in zhang2018self. Specifically, we set the learning rate of the discriminator to 5x that of the generator, while keeping the discriminator-to-generator step ratio at 1:1.
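The update scheme can be illustrated on a toy problem. The pure-Python sketch below runs simultaneous gradient descent-ascent on a hypothetical two-variable game, f(g, d) = 0.5 g² + g d − 0.5 d², where `g` plays the role of the generator (minimizing) and `d` the discriminator (maximizing). The game, the function name `ttur_train`, and the learning-rate values are assumptions chosen only to respect the 5x ratio and the 1:1 step ratio described above; this is not the GAN training loop itself.

```python
def ttur_train(steps, lr_g=0.01, lr_d=0.05):
    """Simultaneous updates with separate learning rates and a 1:1 step ratio.

    Toy objective (an assumption for illustration):
        f(g, d) = 0.5 * g**2 + g * d - 0.5 * d**2
    g descends on f, d ascends on f.
    """
    g, d = 1.0, 1.0
    g_steps = d_steps = 0
    for _ in range(steps):
        grad_g = g + d          # df/dg
        grad_d = g - d          # df/dd
        d += lr_d * grad_d      # one discriminator step with the larger rate
        d_steps += 1
        g -= lr_g * grad_g      # one generator step with the smaller rate
        g_steps += 1
    return g, d, g_steps, d_steps

g, d, g_steps, d_steps = ttur_train(steps=1000)
```

In a real GAN, the same idea amounts to creating two optimizers with different learning rates and calling each once per training iteration.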
3 Experiments and Results
In the first part of our experiments, we examine the proposed self-attention mechanism in Section 3.4. Next, the effectiveness of the TTUR for stabilizing GAN training is evaluated in Section 3.5. In the second part of our experiments, we examine the utility of GANs for data augmentation, i.e., for generating additional training samples, with the best-performing GAN model to boost classification accuracy.
We evaluate our method on the ISIC2018 dataset, which consists of 10015 skin lesion images from seven skin diseases: Melanoma (1113), Melanocytic nevus (6705), Basal cell carcinoma (514), Actinic keratosis (327), Benign keratosis (1099), Dermatofibroma (115), and Vascular (142). The megapixel dermoscopic images are center cropped to pixels and then downsampled to pixels. Figure 1 part (a) shows some of these training samples. We split the data into a training set (9514 images) and a validation set (501 images).
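The degree of class imbalance follows directly from the per-class counts above. The short Python sketch below computes the class fractions and the ratio between the largest and smallest class; the variable names are illustrative.

```python
# Per-class image counts of the ISIC2018 dataset as listed in the text.
counts = {
    "Melanoma": 1113,
    "Melanocytic nevus": 6705,
    "Basal cell carcinoma": 514,
    "Actinic keratosis": 327,
    "Benign keratosis": 1099,
    "Dermatofibroma": 115,
    "Vascular": 142,
}

total = sum(counts.values())                       # 10015 images in total
fractions = {name: n / total for name, n in counts.items()}
largest = max(counts, key=counts.get)              # dominant class
smallest = min(counts, key=counts.get)             # rarest class
imbalance_ratio = counts[largest] / counts[smallest]

for name, frac in sorted(fractions.items(), key=lambda kv: -kv[1]):
    print(f"{name:22s} {counts[name]:5d}  {100 * frac:5.1f}%")
```

The dominant class accounts for roughly two thirds of all images, while the rarest class is more than fifty times smaller, which is the skew that motivates synthetic augmentation.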
For training PGAN, APGAN, and APGAN+TTUR, we augment the training set to boost GAN performance. We use rotation (in the range of [, ]), horizontal and vertical flipping, scaling, and skewing. The Python package Augmentor bloice2017augmentor was used for the augmentation.
For training the classification network, we use the training set (9514 images) and the samples generated by the GAN. For validation, the validation set (501 images) is used. The detailed training process is described in Section 3.7.
3.2 Evaluation Metrics
A variety of methods have been proposed for evaluating the performance of GANs in capturing data distributions and for judging the quality of synthesized images. To evaluate visual fidelity, numerous works have used either crowdsourcing or expert ratings to distinguish between real and synthetic samples. There have also been efforts to develop quantitative measures of the realism and diversity of synthetic images. The two Inception-based measures, the Inception score (IS) salimans2016improved and the Fréchet Inception distance (FID) heusel2017gans, are useful for monitoring how training advances, but they guarantee no correlation with performance on real-world tasks, and they are insufficient to finely separate state-of-the-art GANs. In addition, we noticed that they do not provide meaningful scores for skin lesions, as the underlying GoogLeNet focuses on the properties of real objects and natural images. Shmelkov et al. shmelkov2018good proposed two measures based on image classification, GAN-train and GAN-test, to compare class-conditional GANs. GAN-train and GAN-test measure the diversity (quantitative) and the image quality (qualitative) of GAN samples, respectively. GAN-train is the accuracy of a network trained on GAN-generated images and evaluated on real-world images, whereas GAN-test is the accuracy of a network trained on real images and evaluated on generated images.
Intuitively, when GAN-train accuracy is close to validation accuracy, the GAN images are as diverse as the training set. Likewise, when GAN-test accuracy is close to validation accuracy, the GAN captures the target distribution well and the image quality is good. For illustration, see Figure 5.
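The two measures differ only in which data the classifier is trained on and which data it is evaluated on. The pure-Python sketch below makes this explicit with a toy nearest-centroid classifier on two-dimensional points. The classifier, the helper names, and the synthetic data (including the stand-in for GAN output) are assumptions for illustration; in our experiments the classifier is a ResNet-18 and the samples are images.

```python
import random

def fit_centroids(samples):
    """Train a nearest-centroid classifier. samples: list of (features, label)."""
    sums, counts = {}, {}
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, x):
    """Return the label of the closest centroid (squared Euclidean distance)."""
    return min(centroids, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

def accuracy(centroids, samples):
    return sum(predict(centroids, x) == y for x, y in samples) / len(samples)

def gan_train_score(generated, real_val):
    """Train on generated samples, evaluate on real validation samples."""
    return accuracy(fit_centroids(generated), real_val)

def gan_test_score(real_train, generated):
    """Train on real samples, evaluate on generated samples."""
    return accuracy(fit_centroids(real_train), generated)

def toy_samples(rng, n, spread):
    """Two well-separated toy classes centred at (0, 0) and (5, 5)."""
    data = []
    for label, centre in enumerate([(0.0, 0.0), (5.0, 5.0)]):
        for _ in range(n):
            data.append(([rng.gauss(c, spread) for c in centre], label))
    return data

rng = random.Random(0)
real_train = toy_samples(rng, 200, 1.0)
real_val = toy_samples(rng, 50, 1.0)
generated = toy_samples(rng, 200, 1.0)   # hypothetical stand-in for GAN output
gan_train = gan_train_score(generated, real_val)
gan_test = gan_test_score(real_train, generated)
```

A generator that collapsed to a few modes would lower `gan_train`, whereas a generator that produced unrealistic samples would lower `gan_test`.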
3.3 Experimental Setup
All of our experiments are performed on NVIDIA 1080Ti with 24GB main memory. For the GAN experiments, the conditional version of PGAN (official TensorFlow implementation) is modified with two state-of-the-art tweaks: (i) self-attention and (ii) a 5:1 learning rate ratio between discriminator and generator (TTUR). To determine the best attention placement, several experiments are conducted at resolution pixels as discussed in Section 2.2. To generate skin lesions of high resolution pixels, TTUR is used alongside the self-attention mechanism. APGAN is trained for hours, whereas APGAN+TTUR is trained for hours.
The classifier for evaluating GAN-train and GAN-test is ResNet-18 he2016deep, implemented in the PyTorch framework. It is initialized with weights pretrained on ImageNet. The model is trained for 50 epochs using a momentum optimizer with a learning rate of 0.001 and a batch size of 64. However, ResNet-18 takes images as input. To use ResNet-18 with images of size , AvgPool2d is replaced by AdaptiveAvgPool2d. The images are loaded into the range [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225].
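The preprocessing and the reason for switching to adaptive pooling can be sketched without any deep learning framework. The NumPy code below normalizes an image with the mean and std values given above and shows that a global average over the spatial dimensions, which is what an adaptive average pooling layer with a 1x1 output computes, yields a fixed-length vector for any input size. The function names, the (C, H, W) layout, and the assumption of a 1x1 pooling output are choices made for this sketch.

```python
import numpy as np

# Channel statistics as stated in the text (RGB order assumed).
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(3, 1, 1)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(3, 1, 1)

def preprocess(image_uint8):
    """Scale an (H, W, 3) uint8 image to [0, 1], move channels first, normalize."""
    x = image_uint8.astype(np.float32) / 255.0
    x = np.transpose(x, (2, 0, 1))          # (3, H, W)
    return (x - MEAN) / STD

def global_average_pool(feature_map):
    """Average a (C, H, W) feature map over H and W, giving a (C,) vector."""
    return feature_map.mean(axis=(1, 2))

white = np.full((32, 48, 3), 255, dtype=np.uint8)   # toy all-white image
normalized = preprocess(white)

rng = np.random.default_rng(0)
pooled_small = global_average_pool(rng.normal(size=(512, 7, 7)))
pooled_large = global_average_pool(rng.normal(size=(512, 16, 16)))
```

Since both pooled vectors have the same length, the final fully connected layer can stay unchanged when the input resolution differs from the one the network was designed for.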
3.4 Self-Attention mechanism
To inspect the effect of enhancing PGAN with self-attention, we build several APGAN models, at pixels, by incorporating the self-attention mechanism at different stages of the generator and discriminator. As shown in Table 1, the APGAN models with self-attention at the -to- level feature maps (e.g., stage 64 and stage 128) achieve better performance than the models with self-attention at the low-level feature maps (e.g., stage 32 and stage 64). For example, “APGAN, feat 64” improves the GAN-train of PGAN from 67.7 to 70.1 and its GAN-test from 60.8 to 62.2. A possible reason is that the attention coefficients computed from large feature maps learn to highlight salient image regions that are passed through the -to- level feature maps and to prune low-level feature responses, so that only the activations relevant to the task are kept. The attention mechanism gives more power to both generator and discriminator, leading the PGAN to generate quantitatively and qualitatively better synthetic samples. In addition, the comparison of our APGAN with PGAN (3rd column of Table 1) demonstrates the effectiveness of enhancing PGAN with self-attention.
Table 1. Column headers: Stage 32, Stage 64, Stage 128.
3.5 High Resolution Skin Lesions
To better classify the presence or absence of malignancy, the skin lesions are synthesized at pixels. To generate the skin lesions, PGAN, APGAN, and APGAN+TTUR are each presented with 8 million images, which is equivalent to over 3.2M iterations. We use a minibatch size of 256 for resolutions and then gradually decrease the size according to , , , , and . Despite progressive training and the incorporation of self-attention, the generated samples suffer from some artifacts due to the high resolution, as discussed in Section 3.6. To mitigate the unstable training behavior, imbalanced learning rates (TTUR) are used: we set the learning rate of the discriminator to 0.004 and that of the generator to 0.001. To identify the best-performing GAN, we use the GAN-train and GAN-test measures of shmelkov2018good. Training was resumed for an additional 3 network checkpoints. For each model, we generate 1000 synthetic images per class and measure GAN-train and GAN-test. For each model, the results of the best checkpoint are reported. TTUR greatly stabilizes APGAN training and improves both quantitative and qualitative results, as shown in Table 2. Clearly, the best quantitative and qualitative results are obtained with the APGAN+TTUR samples.
Due to unstable training behavior, several types of artifacts are observed in the generated samples, for example blur, high-frequency artifacts, and even mode collapse, as shown in Figure 6. The unstable training behavior was mitigated by applying TTUR. However, some APGAN+TTUR samples of the melanoma class suffer from a bright spot. This can be attributed to the training set, since real samples of the melanoma class exhibit the same bright spot. Examples of such images are shown in Figure 7.
3.7 GAN data augmentation
In this section, the training set (9514 images) is augmented using the best-performing GAN (APGAN+TTUR) and standard augmentation methods. We randomly pick 100 images from each class of the training set. The 100 images of each class are then extended with 1k images using either APGAN+TTUR or standard augmentation methods. For classification, ResNet-18 he2016deep is used, configured as described in Section 3.3. The results are shown in Figure 8. The model trained on real images only (100 images per class) achieves 67.3% accuracy on the validation set (501 images), while augmenting the real images using APGAN+TTUR and standard augmentation methods achieves 70.1% and 68.7%, respectively. Consequently, augmentation using APGAN+TTUR improves the accuracy over the real-only and standard augmentation counterparts by 2.8 and 1.4 percentage points, respectively. Based on these results, two conclusions can be drawn: (i) the generated samples have clinically meaningful features, since there is an information gain in the synthetic samples that improves the classification accuracy, and (ii) standard augmentation methods use a limited set of known invariances, whereas APGAN+TTUR automatically learns a much broader invariance space. To further demonstrate the superiority of APGAN+TTUR, the real images were also augmented with 1k samples using PGAN and APGAN; the resulting accuracies are 67.1% and 67.7%, respectively.
In this paper, we have proposed a novel enhancement of the progressive GAN framework based on self-attention for generating high-definition, visually appealing, and clinically meaningful synthetic skin lesion images. The proposed model combines (i) progressive growing of GANs, (ii) self-attention, and (iii) imbalanced learning rates (TTUR). Self-attention guides the discriminator to pay more attention to the presence of malignancy, which in turn forces the generator to produce samples with fine-grained details in order to fool the discriminator. Despite the use of self-attention, the generated samples suffer from some artifacts due to unstable training behavior; the imbalanced learning rates (TTUR) are used to tackle this issue. Finally, APGAN+TTUR was used to generate additional training samples to further boost the classification accuracy. Notably, there is an information gain in the synthetic samples, and consequently the classification accuracy improves. Moreover, data augmentation using APGAN+TTUR yields a higher information gain than standard data augmentation methods. In future work, we aim to use APGAN+TTUR in large-scale experiments on multiple datasets.