Unsupervised anomaly detection is a key technique that could allow us to overcome the data bottleneck that is ever-present, especially in the medical domain. Unsupervised models can directly learn the data distribution from a large cohort of unannotated subjects and then be used to detect out-of-distribution samples and thus ultimately identify diseased or suspicious cases. By decoupling abnormality detection from reference annotations, these approaches are independent of human annotation and can therefore in principle be applied to any medical condition or image modality.
Early approaches to unsupervised anomaly detection were built on explicit assumptions, often preventing generalization to other tasks. Juan-Albarracín et al. manually designed a set of image features for brain tumor detection. By mapping the original image into carefully chosen feature representations, they were able to separate tumorous from healthy tissue by clustering the voxels in feature space and combining this with an atlas-based approach. Erihov et al. instead relied on the natural symmetry of the brain (as well as some other organs) to identify regions that behave abnormally. Only the era of deep learning has made it possible to address the problem in a more principled fashion, with the aim of learning normal data distributions in order to detect abnormal samples. Ideally, anomaly detection should not build upon case-specific assumptions in the form of medical domain knowledge or specific annotated validation sets to optimize for; such assumptions should be interpreted as an unwanted form of supervision implicitly added by the design of the method.
Variational Auto-Encoders (VAEs) and their extensions are, alongside flow-based and auto-regressive models, a current de facto standard for density estimation and particularly for anomaly/out-of-distribution sample detection tasks [1, 3, 12, 14]. Here, the evidence lower bound (ELBO), by definition a combination of the reconstruction error with the Kullback-Leibler (KL) divergence, commonly serves as a proxy for the sample likelihood [3, 12]. Recent work has even demonstrated the ability of VAEs to localize and segment the parts of the image that are most suspicious, which is of particular importance in medical applications. Pawlowski et al. compare different auto-encoders for CT-based pixel-wise segmentation and Baur et al. propose a VAE with an adversarial loss on the reconstructions to improve performance. Chen et al. [6, 5] use a VAE with an additional adversarial latent loss. The localization part in these studies is currently based solely on the reconstruction error, thus outlining regions as suspicious if they cannot be adequately reconstructed by the model. You et al. include the KL-term for reconstructions closer to the data manifold.
In this work, we demonstrate that reconstruction-based anomaly detection is sub-optimal. One obvious deficiency is that the capability of a VAE to reconstruct anomalies is by design tightly coupled to the expressiveness (size and configuration) of the latent space. This, at the same time, also explains why reconstruction-based techniques are still able to achieve high performance scores on unsupervised tasks: their deficiencies can be compensated for to a certain extent by tuning the model architecture towards being optimally suited for a specific task (see also [6, 16, 9]), as is common practice when hyperparameters are optimized on annotated validation sets. Task-specific hyperparameter optimization, however, contradicts the principle of assumption-free anomaly detection. To investigate this, we analyze the robustness of reconstruction-based anomaly detection on a sample-wise and pixel-wise level and compare it to the ELBO and the KL-divergence. Inspired by the results, we propose to integrate the KL-divergence of a VAE into pixel-wise anomaly detection as well. This is in analogy to sample-wise anomaly detection, where the ELBO is also based on both the reconstruction error and the KL-divergence. The proposed approach outperforms reconstruction-based localization across a broad variety of model configurations, demonstrating its robustness with respect to hyperparameters. When task-specific fine-tuning is allowed, the model also outperforms previously reported deep-learning based results.
2.1 Variational Autoencoders for anomaly detection
However, VAEs have design choices and data-dependent parameters that can influence the performance, such as the network architecture, the number of latent variables, the standard deviation of the decoder distribution, and the data dimension. Given sufficiently powerful encoder and decoder networks and a large enough latent space, VAEs with Gaussian encoders and decoders can (and under some conditions will) approximate the true data distribution. Thus, after optimization, the ELBO is often used as a proxy for the likelihood of a data sample and consequently as an anomaly score.
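As a rough illustration of how the two ELBO terms enter a sample-wise anomaly score, the following minimal sketch assumes a diagonal Gaussian encoder, a standard-normal prior, and a Gaussian decoder with a fixed standard deviation `sigma`. The function names are illustrative and are not taken from the released code.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def gaussian_rec_term(x, x_hat, sigma):
    """Negative log-likelihood of x under N(x_hat, sigma^2 I)."""
    d = len(x)
    sq_err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return sq_err / (2.0 * sigma ** 2) + d * math.log(sigma) + 0.5 * d * math.log(2.0 * math.pi)


def negative_elbo(x, x_hat, mu, log_var, sigma=1.0):
    """Single-sample estimate of the negative ELBO; higher values are more anomalous.

    x_hat is assumed to be the decoder mean for one latent sample drawn from the encoder.
    """
    return gaussian_rec_term(x, x_hat, sigma) + kl_to_standard_normal(mu, log_var)
```

In this formulation a smaller `sigma` divides the squared error by a smaller number, which up-weights the reconstruction term relative to the KL-term, and a larger `sigma` down-weights it.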
2.2 Pixel-wise anomaly detection
For medical applications a pixel-wise localization, similar to a segmentation, is often more desirable than a sample-wise score. Related methods typically generate a segmentation map by thresholding the pixel-wise reconstruction error [4, 5, 6, 16, 19]. However, in contrast to anomaly detection in the common sample-wise setting, this discards the KL-term and potentially ignores useful information, since a low ELBO, and consequently a high anomaly score, can be caused by the reconstruction-term and/or the KL-term. To alleviate this problem we aim to include, in addition to the pixel-wise reconstruction error (“Rec-Error”), the KL-term in a pixel-wise anomaly score. Our experiments include different strategies for achieving this, as well as strategies for each term separately:
“ELBO-grad”: Building on the assumption that the ELBO $\mathcal{L}$ allows for a good enough approximation of the true data distribution, we propose to use the derivative of $\mathcal{L}$ with respect to the input $x$, yielding a pixel-wise vector pointing towards a data sample with a lower negative ELBO:
$$\frac{\partial \mathcal{L}(x)}{\partial x}$$
Given that $\mathcal{L}$ is locally convex, the magnitude of the pixel gradient should correspond to a pixel-wise anomaly score.
“KL-grad”: To get a pixel-wise score for only the KL-term of the ELBO, we differentiate the KL-term with respect to the input.
“Rec-Grad”: To get a pixel-wise score for only the reconstruction-term of the ELBO, we differentiate the reconstruction-term with respect to the input.
“Combi”: Instead of differentiating the reconstruction-term of the ELBO, we can directly use the reconstruction error and combine it with “KL-grad”. This should be less prone to noise artifacts. For this model, we combine the derivative of the KL-term with the reconstruction error by multiplication, since the terms differ by several orders of magnitude.
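The scoring strategies listed above can be sketched on a toy model. The example below is only an illustration under strong assumptions: the "VAE" is a linear encoder and decoder with unit encoder variance, the gradients are taken of the negative ELBO (constants dropped), and central finite differences stand in for backpropagation to the input. All function names are hypothetical.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def kl_term(x, W_enc):
    """KL to N(0, I) for a toy encoder with mean W_enc @ x and unit variance."""
    mu = matvec(W_enc, x)
    return 0.5 * sum(m * m for m in mu)


def rec_term(x, W_enc, W_dec):
    """Squared reconstruction error, using the encoder mean as latent code."""
    x_hat = matvec(W_dec, matvec(W_enc, x))
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_hat))


def input_gradient(f, x, eps=1e-5):
    """Central finite differences; a stand-in for backpropagation to the input."""
    grad = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(hi) - f(lo)) / (2.0 * eps))
    return grad


def pixel_scores(x, W_enc, W_dec):
    """Return the five pixel-wise anomaly maps for one flattened input x."""
    x_hat = matvec(W_dec, matvec(W_enc, x))
    rec_error = [abs(a - b) for a, b in zip(x, x_hat)]
    kl_grad = [abs(g) for g in input_gradient(lambda v: kl_term(v, W_enc), x)]
    rec_grad = [abs(g) for g in input_gradient(lambda v: rec_term(v, W_enc, W_dec), x)]
    elbo_grad = [abs(g) for g in input_gradient(
        lambda v: kl_term(v, W_enc) + rec_term(v, W_enc, W_dec), x)]
    # "Combi": element-wise product of reconstruction error and KL gradient magnitude.
    combi = [r * k for r, k in zip(rec_error, kl_grad)]
    return {"Rec-Error": rec_error, "ELBO-grad": elbo_grad, "KL-grad": kl_grad,
            "Rec-Grad": rec_grad, "Combi": combi}
```

With a real VAE the finite-difference helper would be replaced by automatic differentiation, and the multiplication in `combi` removes the need to rescale two terms of very different magnitude before combining them.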
2.3 Generalization and Robustness across different parameters
We compare the discriminative performance of the ELBO, the KL-term, and the reconstruction-term separately to inspect the information they contain about the data distribution and consequently about abnormality. We further analyze the robustness and generalizability across different parameter settings and different (medical and non-medical) datasets.
First, we use the FashionMNIST dataset, where we train and validate the model on 54000 images using 9 out of the 10 provided classes and then evaluate the performance by attempting to discriminate between the classes seen during training and the 10th unseen class. Following prior work, we used a model with a 3-layer fully connected encoder and decoder with 400 hidden units and ReLU non-linearities. To analyze the robustness, we vary the number of latent variables, the standard deviation of the decoder distribution (resulting in a down- or up-weighting of the reconstruction loss), the image size/scaling and the class left out during training. By default we use 20 latent variables and a scaling factor of 1.
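The leave-one-class-out protocol described above can be sketched as follows. This is a minimal illustration with hypothetical names; it assumes the labelled training split is only used to filter out the odd class and that the labels of the evaluation split are only used to build binary targets.

```python
def leave_one_class_out(train_samples, train_labels, test_labels, odd_class):
    """Build a one-class training set and binary anomaly targets for evaluation.

    Training keeps only samples whose label differs from odd_class.
    The evaluation targets are 1 for the unseen odd class and 0 otherwise.
    """
    normal_train = [s for s, y in zip(train_samples, train_labels) if y != odd_class]
    is_anomaly = [int(y == odd_class) for y in test_labels]
    return normal_train, is_anomaly
```

For FashionMNIST, with 6000 training images per class, removing one class leaves 9 x 6000 = 54000 images, which matches the number stated above.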
patients). While the HCP subjects are all young and healthy, the patients in the BraTS dataset all have brain tumors. Finding tumors in this setting is a particularly hard problem because, in addition to the challenging nature of the task itself, there is a considerable domain shift between the datasets. The BraTS dataset includes tumor annotations, which we use for model evaluation. We treat slices without annotations as healthy, whereas slices with at least 20 annotated tumor voxels are considered diseased. Our model consists of a 5-layer fully-convolutional encoder and decoder with feature-map sizes of 16-32-64-256. We use strided convolutions (stride 2) for downsampling and transposed convolutions for upsampling, each followed by a LeakyReLU non-linearity (inspired by DCGAN). To analyze the robustness and generalizability of the different methods, we vary the number of latent variables (default 256), the standard deviation of the decoder distribution (default 1) and the image size.
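The slice-level labelling rule stated above could be implemented along the following lines. The sketch uses a hypothetical function name and makes one assumption that the text does not state: slices with between 1 and 19 annotated voxels are treated as ambiguous and excluded from the sample-wise evaluation.

```python
def label_slice(tumor_mask, min_tumor_voxels=20):
    """Assign a sample-wise label to one 2D slice from its binary tumor mask.

    Returns "healthy" for slices without any annotation, "diseased" for slices
    with at least min_tumor_voxels annotated voxels, and None otherwise
    (assumption: such ambiguous slices are left out of the evaluation).
    """
    n_tumor = sum(1 for row in tumor_mask for voxel in row if voxel)
    if n_tumor == 0:
        return "healthy"
    if n_tumor >= min_tumor_voxels:
        return "diseased"
    return None
```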
The models are trained with Adam. Whenever the validation loss reaches a plateau, the learning rate is decreased by a constant multiplicative factor. The training is stopped once the validation loss does not decrease for more than 3 epochs. For each model we perform 5 runs and report the mean as well as the max/min performance. The code to replicate the results is available at https://github.com/MIC-DKFZ/vae-anomaly-experiments.
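The interplay of learning-rate decay on plateau and early stopping can be sketched by replaying a list of validation losses. This is an illustration only: the initial learning rate, the decay factor and the plateau patience are not given in this text, so they are required arguments here and any values passed to them are placeholders. Only the stopping patience of 3 epochs comes from the description above.

```python
def run_schedule(val_losses, initial_lr, decay_factor, plateau_patience, stop_patience=3):
    """Replay validation losses and return (learning rate per epoch, stop epoch).

    The learning rate is multiplied by decay_factor every plateau_patience
    epochs without improvement. Training stops once the loss has not improved
    for more than stop_patience epochs; stop epoch is None if that never happens.
    """
    lr = initial_lr
    best = float("inf")
    epochs_without_improvement = 0
    lrs = []
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement % plateau_patience == 0:
                lr *= decay_factor
        lrs.append(lr)
        if epochs_without_improvement > stop_patience:
            return lrs, epoch
    return lrs, None
```

In a PyTorch training loop the same behaviour is usually obtained with a plateau-based learning-rate scheduler plus a small early-stopping counter, rather than with a hand-written function like this one.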
3.1 Sample-wise performance
[Figure caption fragment: ... shows the fine-tuned performance with odd class 5.]
The sample-wise results across different parameter settings can be seen in Fig. 1. It is apparent that in most cases the reconstruction-term shows lower discriminative power than either the KL-term or the ELBO. Consequently, important information is lost when focusing only on the reconstruction error. Furthermore, in the cases where the reconstruction-term performs better, the model is severely constrained, for example by having a small latent dimension (which has been shown to hinder VAEs from approximating the data distribution and to lead to poor reconstructions). The robustness of the KL-term can thus perhaps be intuitively explained by the observation that for VAEs the ELBO best approximates the data distribution when having “perfect reconstructions using the fewest number of clean, low-noise latent dimensions”. So far, no hyperparameters were tuned to specifically improve one of the losses. However, we want to demonstrate that when an annotated validation set is used to tune the parameters, the reconstruction error as well as the KL-term can each give competitive performance individually. This is done by choosing the odd class with the largest gap between the reconstruction loss and the KL-loss (class 5) and applying a single hyperparameter adjustment. By doing so, the KL-loss reached an area under the receiver operating characteristic curve (AUROC) that now significantly outperformed the reconstruction loss. However, when no annotated dataset is available, our results indicate that in general including the KL-loss for anomaly scoring, not just on a sample-wise level but also on a pixel-wise level, could increase performance, as we analyze next.
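For reference, the AUROC used to compare the scores can be computed directly from its rank interpretation: the probability that a randomly chosen anomalous sample receives a higher score than a randomly chosen normal one. The following is a minimal sketch with a hypothetical function name, not the evaluation code of the experiments.

```python
def auroc(scores, is_anomaly):
    """Area under the ROC curve via pairwise comparison (ties count as one half)."""
    pos = [s for s, y in zip(scores, is_anomaly) if y]
    neg = [s for s, y in zip(scores, is_anomaly) if not y]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one normal sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch; a sorting-based implementation would be preferable for pixel-wise evaluation with millions of scores.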
3.2 Pixel-wise performance
The pixel-wise performance on the BraTS2017 dataset across different hyperparameter settings is summarized in Fig. 2. Here the model was trained on the healthy HCP subjects and then applied to BraTS2017 for anomaly detection. We used the same model and data setting as before, but in this case evaluate the ability to detect pixel-wise whole-tumor annotations. We compare the methods presented in Sec. 2.2: the pixel-wise reconstruction error (“Rec-Error”), the backpropagated ELBO (“ELBO-grad”), its backpropagated KL-term (“KL-grad”) and reconstruction-term (“Rec-Grad”) separately, as well as the “Combi” model. Similar to the previous cases, in most settings the reconstruction error alone is outperformed by other methods. Furthermore, for most choices “KL-grad” and the “Combi” model perform best, indicating a more robust performance at least for this particular dataset (similar observations can be made for the ISLES2015 dataset, as shown in the Suppl.).
3.3 Hyperparameter tuning
The top performing methods in the experiments shown above already exhibit a high AUROC. In particular, the often ignored KL-term shows robust performance across a variety of tested settings. The “Combi” approach exhibits a similar robustness. We were interested in investigating the top performance of the KL-term approach in a scenario where an annotated validation set can be employed for hyperparameter tuning, as is often done in the literature when presenting reconstruction-based approaches. Results are shown in Table 1 and Fig. 3. Dice scores are calculated by thresholding the anomaly values at a value that was determined on a held-out fraction of the test dataset. The reported Dice scores were then computed on the remaining part of the dataset.
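The thresholding procedure for the Dice scores can be sketched as follows. The helper names and the grid of candidate thresholds are assumptions made for illustration; the split sizes used in the experiments are not reproduced here.

```python
def dice(pred, target):
    """Dice coefficient between two flat binary masks (1.0 if both are empty)."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 1.0 if total == 0 else 2.0 * intersection / total


def best_threshold(scores, target, candidates):
    """Pick the candidate threshold with the highest Dice score on a tuning split."""
    return max(candidates, key=lambda th: dice([s >= th for s in scores], target))
```

A typical use is to call `best_threshold` on the pixels of the tuning split only and then to report `dice([s >= th for s in eval_scores], eval_target)` on the remaining data, so that the threshold is never selected on the pixels it is evaluated on.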
Table 1 (header row only): | -GAN | VAE-Rec | default | fine-tuned | GHMRF | X-Saliency | GMM |
4 Discussion & Conclusion
In this work we compared different approaches to detecting anomalies with VAEs over many different hyperparameter settings. In practice, however, hyperparameters are regularly chosen by task-specific optimization on an annotated validation set, which contradicts the principle of unsupervised anomaly detection. We showed that for pixel-wise anomaly detection the reconstruction error does not always have the best performance and can regularly be improved by combining it with the backpropagated KL-term. This is in analogy to common VAE-based anomaly detection methods that consider the ELBO, which by definition constitutes the combination of the KL-term with the reconstruction-term, as a proxy score. The proposed approaches show promising performance across a broad range of hyperparameters and thus effectively reduce the need for manual tuning towards a validation set, keeping the unsupervised property intact. If an annotated validation set is available, however, our approach can still be fine-tuned to achieve the same competitive performance as other methods. At first glance, our method is outperformed by non-deep-learning methods on the BraTS dataset. This, however, neglects the fact that these models incorporate specific domain knowledge via their algorithmic design. While this results in strong performance, it renders them unsuitable for application to different organs or modalities. Our proposed approach does not make such assumptions, is robust with respect to the exact choice of hyperparameters and could therefore effectively be transferred to new problems or datasets without requiring any modification.
We believe that our proposed method constitutes a step towards improving anomaly detection for medical imaging applications. In the future, anomaly detection algorithms have the potential to make use of the increasing amounts of available raw data, offering the perspective of effective radiological support tools that are not affected by the annotation bottleneck.
-  Abati, D., Cucchiara, R., et al: AND: Autoregressive Novelty Detectors (2018)
-  Alain, G., Bengio, Y.: What Regularized Auto-encoders Learn from the Data-generating Distribution. JMLR (2014)
-  Baur, C., Navab, N., et al: Deep Autoencoding Models for Unsupervised Anomaly Segmentation in Brain MR Images. CoRR (2018)
-  Chen, X., Konukoglu, E.: Unsupervised Detection of Lesions in Brain MRI using constrained adversarial auto-encoders. CoRR (2018)
-  Chen, X., Konukoglu, E., et al: Deep Generative Models in the Real-World: An Open Challenge from Medical Imaging. CoRR (2018)
-  Dai, B., Wipf, D.: Diagnosing and enhancing VAE models. In: ICLR (2019)
-  Erihov, M., Hashoul, S., et al: A cross saliency approach to asymmetry-based tumor detection. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer (2015)
-  Goldstein, M., Uchida, S.: A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data. PLoS ONE (2016)
-  Juan-Albarracín, J., García-Gómez, J.M., et al: Automated glioblastoma segmentation based on a multiparametric structured unsupervised classification. PLoS One (2015)
-  Kingma, D.P., Welling, M.: Auto-Encoding Variational Bayes. CoRR (2013)
-  Kiran, B., Parakkal, R., et al: An Overview of Deep Learning Based Methods for Unsupervised and Semi-Supervised Anomaly Detection in Videos. Journal of Imaging (2018)
-  Menze, B.H., Van Leemput, K., et al: The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS). IEEE Trans Med Imaging (2015)
-  Nalisnick, E., Lakshminarayanan, B., et al: Do Deep Generative Models Know What They Don’t Know? ICLR (2019)
-  Paszke, A., Lerer, A., et al: Automatic differentiation in PyTorch (2017)
-  Pawlowski, N., Glocker, B., et al: Unsupervised Lesion Detection in Brain CT using Bayesian Convolutional Autoencoders (2018)
-  Radford, A., Chintala, S., et al: Unsupervised representation learning with deep convolutional generative adversarial networks (2015)
-  Rezende, D.J., Wierstra, D., et al: Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In: ICML. JMLR.org (2014)
-  Schlegl, T., Langs, G., et al: Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery. In: IPMI. Springer (2017)
-  Van Essen, D.C., WU-Minn HCP Consortium, et al: The Human Connectome Project: a data acquisition perspective. Neuroimage (2012)
-  Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms (2017)
-  You, S., Konukoglu, E., et al: Unsupervised Lesion Detection via Image Restoration with a Normative Prior. In: International Conference on Medical Imaging with Deep Learning – Full Paper Track (2019)