Anomaly Detection for Skin Disease Images Using Variational Autoencoder

07/03/2018 ∙ by Yuchen Lu, et al. ∙ 0

In this paper, we demonstrate the potential of applying Variational Autoencoder (VAE) [10] to anomaly detection in skin disease images. VAE is a class of deep generative models trained by maximizing the evidence lower bound of the data distribution [10]. When trained on only normal data, the resulting model is able to perform efficient inference and to determine whether a test image is normal or not. We perform experiments on the ISIC2018 Challenge Disease Classification dataset (Task 3) and compare different ways of using a VAE to detect anomalies. The model is able to detect all diseases with 0.779 AUCROC. If we focus on specific diseases, the model is able to detect melanoma with 0.864 AUCROC and actinic keratosis with 0.872 AUCROC, even though it only sees images of nevus during training. To the best of our knowledge, this is the first applied work of deep generative models for anomaly detection in dermatology.




1 Introduction

Automatic skin disease detection would be valuable for both patients and doctors, and deep supervised learning with CNNs has been applied successfully in the field of dermatology. These models have a large number of parameters and require large-scale labeled datasets covering different kinds of diseases. Nevertheless, human beings seem to be able to detect an abnormal skin lesion even if they are not trained, provided that they have enough experience with what a healthy mole looks like. Giving a machine this ability is interesting in itself, and it also provides practical advantages. By only observing normal skin image data, the algorithm is able to generalize to multiple diseases or even rare diseases, which saves time and money on data collection. Motivated by these aspects, we focus on the problem of unsupervised anomaly detection for skin disease. Unsupervised learning over the space of images is challenging because of the curse of dimensionality, but recent developments in deep generative models could address this issue.

There are two related models, Generative Adversarial Network (GAN) and Variational Autoencoder (VAE). Both VAE and GAN have been applied to anomaly detection [10]. [1] proposes using a direct “reconstruction probability” for detection and shows that VAE outperforms a PCA baseline on the MNIST dataset. [3] applies an adversarial autoencoder to the unsupervised detection of lesions in brain MRI and improves the detection AUC on the BRATS challenge dataset.

Our major contribution is not to propose any fundamentally new method, but to emphasize the potential usefulness of deep generative models in dermatology. We investigate VAE based methods instead of GAN for the following reasons: 1) Even with recent tricks like gradient penalty, GAN training is still unstable and highly dynamic. In contrast, VAE training is more stable and is therefore more suitable as a proof of concept. 2) Most GAN-based methods require training an additional network which maps from image space to noise space in order to get the reconstruction [10], but the theoretical justification of this additional network is unclear. By contrast, VAE has a well-defined mathematical framework and is therefore more interpretable.

2 Methods

We first give a brief introduction to the background of VAE and generative models. Then we propose different methods of using a trained VAE for anomaly detection.

2.1 Variational Autoencoder

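VAE can be viewed as a directed probabilistic graphical model with the joint distribution defined as $p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$, where $x$ is the data, $z$ is the latent variable and $p(z)$ is the prior. We choose the prior to be $\mathcal{N}(0, I)$ in this work. When the true posterior $p_\theta(z \mid x)$ is intractable, one can use a parametric distribution $q_\phi(z \mid x)$ to approximate the posterior. Then in order to perform MLE, it is sufficient to maximize the evidence lower bound:

$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right) \tag{1}$$

We choose $q_\phi(z \mid x)$ to be a Gaussian distribution with diagonal covariance, $\mathcal{N}\left(\mu_\phi(x), \mathrm{diag}(\sigma_\phi^2(x))\right)$, where $\mu_\phi(x)$ and $\sigma_\phi(x)$ are the output of a neural network. Then by the reparameterization trick, the evidence lower bound becomes

$$\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right) \tag{2}$$

where $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$. Eqn. (2) is differentiable w.r.t. both $\theta$ and $\phi$, and it can be trained end to end. In this paper we choose $p_\theta(x \mid z) = \mathcal{N}\left(f_\theta(z), \sigma^2 I\right)$ where $\sigma$ is pre-determined. Then maximizing Eqn. (2) is equivalent to minimizing

$$\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[\frac{1}{2\sigma^2}\left\|x - f_\theta(z)\right\|^2\right] + \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right) \tag{3}$$

One can observe that the function of $\sigma$ here is just adjusting the relative weight between the reconstruction term and the KL term. As a result, the final loss function to be minimized looks like

$$\mathcal{L}(x) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[\left\|x - f_\theta(z)\right\|^2\right] + \beta\, \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right) \tag{4}$$

The resulting training objective can be viewed as a specific case of $\beta$-VAE, but our derivation is not from an optimization perspective like in [8].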
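As a minimal sketch of the weighted objective, the plain-Python snippet below evaluates a squared-error reconstruction term plus a weighted KL term for a single example. It assumes a diagonal Gaussian posterior described by a mean and a log variance, and a standard normal prior, for which the KL term has a closed form. The function names and toy numbers are illustrative assumptions and are not taken from the paper's implementation.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def squared_error(x, x_hat):
    """Sum of squared differences between input and reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def weighted_vae_loss(x, x_hat, mu, log_var, beta=0.01):
    """Reconstruction error plus beta times the KL term (single-sample estimate)."""
    return squared_error(x, x_hat) + beta * kl_to_standard_normal(mu, log_var)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.3, -0.2], [-1.0, -1.0]   # hypothetical encoder output
    z = reparameterize(mu, log_var, rng)
    x = [0.5, 0.1]
    x_hat = [z[0], z[1]]                      # hypothetical identity decoder
    print(weighted_vae_loss(x, x_hat, mu, log_var, beta=0.01))
```

With a small `beta`, the loss is dominated by the reconstruction term, which is the regime the hyperparameters in Section 3.1 describe.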

2.2 Anomaly Score

The degree of anomaly can be characterized by the probability of observing $x$ under the distribution $p_\theta(x)$. Therefore computing the anomaly score is essentially estimating $p_\theta(x)$. Once we have a trained VAE, there are several ways to use it to generate an anomaly score for a new image $x$.

2.2.1 VAE Based Score

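One choice is to use the negative of Eqn. (1) as an anomaly score. That is

$$s_{\mathrm{vae}}(x) = -\frac{1}{L}\sum_{i=1}^{L} \log p_\theta(x \mid z_i) + \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right) \tag{5}$$

where $z_i \sim q_\phi(z \mid x)$ for $i = 1, \dots, L$. If $s_{\mathrm{vae}}(x)$ is larger, then $x$ has higher loss and thus is more likely to be an outlier. Since we can decompose the loss into a reconstruction term and a KL term, we can define the corresponding anomaly scores:

$$s_{\mathrm{reconst}}(x) = -\frac{1}{L}\sum_{i=1}^{L} \log p_\theta(x \mid z_i), \qquad s_{\mathrm{kl}}(x) = \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right) \tag{6}$$

The motivation for the decomposition is to investigate how useful each term in the VAE loss is for anomaly detection.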
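The reconstruction and KL scores described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: `toy_encode` and `toy_decode` are hypothetical stand-ins for a trained encoder and decoder, the reconstruction score uses squared error (which, for a Gaussian decoder with fixed variance, equals the negative log-likelihood up to a scale and an additive constant), and the combined score weights the KL term by `beta` to mirror the training loss.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def vae_scores(x, encode, decode, beta=0.01, num_samples=15, seed=0):
    """Monte Carlo reconstruction score, KL score and their weighted sum."""
    rng = random.Random(seed)
    mu, log_var = encode(x)
    total_error = 0.0
    for _ in range(num_samples):
        # Reparameterized sample from the approximate posterior.
        z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
             for m, lv in zip(mu, log_var)]
        x_hat = decode(z)
        total_error += sum((a - b) ** 2 for a, b in zip(x, x_hat))
    reconst = total_error / num_samples
    kl = kl_to_standard_normal(mu, log_var)
    return {"reconst": reconst, "kl": kl, "combined": reconst + beta * kl}


def toy_encode(x):
    """Hypothetical encoder: project 2-D input onto the line x0 == x1."""
    return [0.5 * (x[0] + x[1])], [-4.0]


def toy_decode(z):
    """Hypothetical decoder: map the 1-D latent back onto the line."""
    return [z[0], z[0]]


if __name__ == "__main__":
    print(vae_scores([1.0, 1.0], toy_encode, toy_decode))    # on the line
    print(vae_scores([1.0, -1.0], toy_encode, toy_decode))   # off the line
```

In this toy setting the off-line point gets a much larger reconstruction score, while its KL score is not larger than that of the on-line point, so the two terms can behave quite differently as detectors.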

2.2.2 Importance Weighted Autoencoder (IWAE) Based Score

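Importance Weighted Autoencoder [2] proposes a tighter lower bound on $\log p_\theta(x)$, which is

$$\log p_\theta(x) \ge \mathbb{E}_{z_1, \dots, z_L \sim q_\phi(z \mid x)}\left[\log \frac{1}{L}\sum_{i=1}^{L} \frac{p_\theta(x, z_i)}{q_\phi(z_i \mid x)}\right] \tag{7}$$

When $L = 1$, we recover the ELBO used by VAE. When $L$ becomes larger, it is proved in [2] that Eqn. (7) becomes a tighter bound than Eqn. (1), resulting in more accurate inference. Similarly, we can use the negative of Eqn. (7) to compute the anomaly score as

$$s_{\mathrm{iwae}}(x) = -\log \frac{1}{L}\sum_{i=1}^{L} \frac{p_\theta(x, z_i)}{q_\phi(z_i \mid x)} \tag{8}$$

where $z_i \sim q_\phi(z \mid x)$ for $i = 1, \dots, L$. The corresponding KL score and reconstruction score are

$$s_{\mathrm{iwae\text{-}kl}}(x) = -\log \frac{1}{L}\sum_{i=1}^{L} \frac{p(z_i)}{q_\phi(z_i \mid x)}, \qquad s_{\mathrm{iwae\text{-}reconst}}(x) = -\log \frac{1}{L}\sum_{i=1}^{L} p_\theta(x \mid z_i) \tag{9}$$

Although it is unclear whether a tighter lower bound estimate would help with outlier detection, we introduce these scores for the sake of comparison.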
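To make the difference between the two bounds concrete, the sketch below computes both scores from the same set of log importance weights, $\log p(x, z_i) - \log q(z_i \mid x)$. It assumes Gaussian densities throughout and uses a hypothetical toy encoder and decoder. The importance-weighted score averages the weights inside the logarithm (done here with a numerically stable log-mean-exp), while the ELBO-style score averages the log weights directly.

```python
import math
import random


def log_normal(v, mean, log_var):
    """Log density of a diagonal Gaussian, summed over dimensions."""
    return sum(-0.5 * (math.log(2.0 * math.pi) + lv + (a - m) ** 2 / math.exp(lv))
               for a, m, lv in zip(v, mean, log_var))


def log_mean_exp(values):
    """Stable log of the mean of exp(values)."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))


def log_importance_weights(x, encode, decode, x_log_var, num_samples, rng):
    """Return log p(x|z_i) + log p(z_i) - log q(z_i|x) for sampled z_i."""
    mu, log_var = encode(x)
    zeros = [0.0] * len(mu)
    weights = []
    for _ in range(num_samples):
        z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
             for m, lv in zip(mu, log_var)]
        log_lik = log_normal(x, decode(z), [x_log_var] * len(x))
        log_prior = log_normal(z, zeros, zeros)      # standard normal prior
        log_post = log_normal(z, mu, log_var)
        weights.append(log_lik + log_prior - log_post)
    return weights


def iwae_score(log_w):
    """Negative importance-weighted bound estimate."""
    return -log_mean_exp(log_w)


def elbo_score(log_w):
    """Negative ELBO estimate from the same samples."""
    return -sum(log_w) / len(log_w)


def toy_encode(x):
    """Hypothetical encoder for 2-D inputs with a 1-D latent."""
    return [0.5 * (x[0] + x[1])], [-1.0]


def toy_decode(z):
    """Hypothetical decoder mapping the latent onto the line x0 == x1."""
    return [z[0], z[0]]


if __name__ == "__main__":
    rng = random.Random(0)
    log_w = log_importance_weights([1.0, 0.8], toy_encode, toy_decode,
                                   x_log_var=-2.0, num_samples=15, rng=rng)
    print(iwae_score(log_w), elbo_score(log_w))
```

By Jensen's inequality the importance-weighted score is never larger than the ELBO-style score on the same samples, and the two coincide when only one sample is drawn.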

3 Experiment

3.1 Model Architecture

We use an architecture similar to DCGAN [11]. For the encoder, we avoid using a linear layer to produce the mean and log variance, and instead use two separate convolution layers. This architecture is fully convolutional, and the number of convolution blocks depends on the input image size. In our implementation, the image size is 128, so the encoder consists of 5 convolutional blocks and the decoder consists of 5 deconvolutional blocks. ADAM is used as the optimizer with its default settings. Hyperparameters are set as below.

  • $\beta$ (weight for KL term): 0.01

  • learning rate: 1e-4

  • $L$ (number of samples for calculating scores): 15

  • batch size: 32

  • training epochs: 40

  • latent dimension: 300
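The settings above can be collected in a small configuration object, and the relation between image size and block count can be checked with one helper. This is only a sketch: the field names are hypothetical, and the helper assumes that each block halves the spatial resolution and that the encoder stops at a 4x4 feature map, as is common in DCGAN-style networks. The paper only states that an image size of 128 gives 5 blocks, which is consistent with that assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters listed in Section 3.1 (field names are illustrative)."""
    image_size: int = 128
    latent_dim: int = 300
    kl_weight: float = 0.01
    learning_rate: float = 1e-4
    score_samples: int = 15
    batch_size: int = 32
    epochs: int = 40


def num_conv_blocks(image_size, final_size=4):
    """Number of stride-2 blocks needed to shrink image_size to final_size.

    Assumes every block halves the resolution (an assumption, see lead-in).
    """
    ratio = image_size // final_size
    # The ratio must be an exact power of two for repeated halving to work.
    if image_size % final_size != 0 or ratio & (ratio - 1) != 0:
        raise ValueError("image_size must be final_size times a power of two")
    return ratio.bit_length() - 1


if __name__ == "__main__":
    config = TrainConfig()
    print(config)
    print(num_conv_blocks(config.image_size))
```

Under these assumptions, changing `image_size` to 64 or 256 would change the depth of both encoder and decoder by one block.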

3.2 Dataset and Preprocessing

We use the ISIC2018 Challenge dataset (Task 3) [4, 13], which contains images from 7 diseases. Detailed dataset information can be found in Table 1. For training the VAE, we use 6369 nevus images as the training set and 336 as the validation set. For anomaly detection, we select 250 images from the validation set and 100 images from the rest of the diseases. We normalize the pixel values to a fixed range and resize each image to $128 \times 128$.

Disease    MEL    NV     BCC    AKIEC  BKL    DF     VASC
# Images   1113   6705   514    327    1099   115    142
Table 1: ISIC2018 Challenge Task 3 Dataset
AKIEC   BCC     BKL     DF      MEL     VASC    All
0.872   0.803   0.792   0.682   0.862   0.662   0.779
0.871   0.802   0.793   0.678   0.864   0.657   0.777
0.441   0.454   0.472   0.398   0.690   0.487   0.491
0.406   0.431   0.441   0.383   0.677   0.477   0.469
0.864   0.795   0.783   0.671   0.861   0.651   0.771
0.864   0.795   0.784   0.670   0.861   0.648   0.771
Table 2: The AUC ROC results of disease detection. Each row corresponds to one anomaly score, and each column shows the AUC when that column's disease is the abnormal class. The last column is tested against all diseases. The results are the average of 5 runs.

The AUC results are summarized in Table 2. Our best AUC result is obtained by the reconstruction scores, with an overall AUC of 0.77. In addition, the disease detection AUC for AKIEC and MEL reaches 0.87 and 0.86 respectively, even though the model has never seen a single image of these two diseases. We notice that the KL score is not very discriminative between normal and abnormal data. This is caused by using a small weight $\beta$ for the KL term, so the model essentially ignores the KL loss during training. We also tried a larger $\beta$, but it results in poorer AUC. We also tried an even smaller $\beta$, but it causes some numerical instability and the improvement is not significant. These results imply that the current prior is not expressive enough, so that forcing the approximate posterior to be close to the prior hurts the model’s expressiveness, which leads to worse AUC performance. We also find that the IWAE variants of the scores do not differ much from the VAE variants, which suggests that even if the bound is theoretically tighter [2], its practical impact on anomaly detection might be small. A sample of reconstructed images is shown in Figure 1.
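For readers who want to reproduce this kind of evaluation, the AUC ROC of an anomaly score can be computed directly from its definition: the probability that a randomly chosen abnormal image receives a higher score than a randomly chosen normal image, with ties counted as one half. The sketch below is a simple pairwise implementation with made-up scores, not the evaluation code used for Table 2.

```python
def auc_roc(normal_scores, anomaly_scores):
    """AUC ROC as the fraction of (anomaly, normal) pairs ranked correctly.

    Ties contribute one half. Runs in O(n * m), which is fine for a few
    hundred test images.
    """
    if not normal_scores or not anomaly_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for a in anomaly_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(anomaly_scores) * len(normal_scores))


if __name__ == "__main__":
    normal = [0.10, 0.40, 0.35]    # made-up scores for normal images
    abnormal = [0.80, 0.30]        # made-up scores for abnormal images
    print(auc_roc(normal, abnormal))
```

A score that carries no information about the class gives an AUC near 0.5, which is the level the KL scores in Table 2 sit around.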

We tried to compare our method with a traditional baseline like PCA or Kernel-PCA for anomaly detection, but our image size (3x128x128) is too large for these methods to be applied without resorting to feature engineering. This also demonstrates the advantage of using VAE to cope with the curse of dimensionality in anomaly detection.

Figure 1: A non-cherry-picked reconstruction result on the validation set. Left: original images. Right: reconstructed images.

4 Future Work

Based on our current experimental results, there are several future research directions worth pursuing.

4.1 Improve VAE

As shown above, our VAE faces a performance bottleneck because of the constraint to match the posterior with a simple prior. One potential improvement would be adding a more expressive decoder like PixelVAE [7]. PixelVAE uses an expressive autoregressive structure for the decoder, which separates the lower-level features from the higher-level semantics. When the latent variable is only left to model the higher-level features, the simple Gaussian prior might be enough. From the reconstruction result, we can see that the model is still outputting blurry images. This could be improved by using a more flexible posterior family or by doing hierarchical variational inference [12].

4.2 Improve Detection Methods

In this work we have not fully explored how to use VAE for anomaly detection, but have only used different outputs of the VAE to compute the scores. One could fit a probability distribution (e.g. a Gamma distribution) to the distribution of normal scores and use standard statistical tests for anomaly detection. The latents of the VAE can also be used for anomaly detection in several ways. One could train a one-class SVM using the latents as features. The latent space can also be used as a metric space, so that the distance between two images can be defined by their inner product in the latent space. This enables one to develop a method similar to metric learning based anomaly detection methods.
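As one possible reading of the distribution-fitting idea, the sketch below fits a Gamma distribution to the scores of normal images by the method of moments and turns a new score into an upper-tail p-value. All of this is an assumption made for illustration: the paper does not specify the fitting procedure or the test, and the scores in the example are made up. The Gamma CDF is evaluated with the standard power series for the regularized lower incomplete gamma function, which is adequate for moderate arguments.

```python
import math


def fit_gamma_moments(scores):
    """Method-of-moments Gamma fit; returns (shape, scale)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean * mean / var, var / mean


def gamma_cdf(x, shape, scale, max_terms=10000, tol=1e-12):
    """Gamma CDF via the series for the regularized lower incomplete gamma."""
    if x <= 0.0:
        return 0.0
    t = x / scale
    term = 1.0 / shape
    total = term
    for n in range(1, max_terms):
        term *= t / (shape + n)
        total += term
        if term < tol * total:
            break
    log_prefactor = -t + shape * math.log(t) - math.lgamma(shape)
    return min(1.0, total * math.exp(log_prefactor))


def gamma_p_value(score, shape, scale):
    """Probability that a normal image scores at least this high."""
    return 1.0 - gamma_cdf(score, shape, scale)


if __name__ == "__main__":
    normal_scores = [1.8, 2.1, 2.4, 1.9, 2.6, 2.2, 2.0, 2.3]   # made-up values
    shape, scale = fit_gamma_moments(normal_scores)
    for s in (2.2, 4.0):
        print(s, gamma_p_value(s, shape, scale))
```

A test image would then be flagged as abnormal when its p-value falls below a chosen significance level, which replaces a hand-tuned score threshold with a false-positive rate on normal data.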


5 Conclusion

In this paper we apply Variational Autoencoder (VAE) to the problem of anomaly detection in dermatology. The VAE based anomaly detection method has a solid theoretical framework and is able to cope with high-dimensional data, like raw image pixels. Our objective is a specific case of $\beta$-VAE but obtained from a different derivation. We experiment on the ISIC 2018 Challenge Task 3 dataset [4, 13]. By training only on normal data (nevus), the model is able to detect abnormal diseases with 0.77 AUC. In particular, the model is able to detect AKIEC and MEL with 0.87 and 0.86 AUC respectively. This is to our knowledge the first work applying Variational Autoencoder to dermatology. Although there have been successful applications of supervised learning and CNN based methods in dermatology, we argue that applying deep unsupervised learning in dermatology is a fruitful yet not fully explored research direction.