1 Introduction
In recent years, several deep-learning-based methods have reported performance comparable to trained medical physicians Liu et al. (2017); V et al. (2016). One weakness of these approaches is that they still require large amounts of annotated data for each condition they are trained on. Due to the time-intensive work of annotating medical images and the combinatorial number of cases arising from different modalities, image qualities, hardware devices, and conditions, it is still infeasible to train an algorithm for each existing combination. Anomaly detection, while not determining the condition itself, can highlight and identify suspicious regions for closer inspection by a trained physician. By assigning each pixel an anomaly rating, it allows for an easy trade-off between specificity and sensitivity. While this may not outperform supervised algorithms, it offers a way to make use of unlabeled data and to aid physicians during diagnosis.
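The specificity/sensitivity trade-off mentioned above comes from thresholding the pixel-wise rating. As a minimal numpy sketch (ours, not part of the original method; the toy ratings and threshold are made up for illustration), one can compute sensitivity, specificity, and the ROC-AUC from per-pixel anomaly ratings:

```python
import numpy as np

def sensitivity_specificity(ratings, labels, threshold):
    """Sensitivity and specificity of thresholded pixel-wise anomaly ratings."""
    pred = ratings >= threshold
    tp = np.sum(pred & (labels == 1))
    tn = np.sum(~pred & (labels == 0))
    return tp / np.sum(labels == 1), tn / np.sum(labels == 0)

def roc_auc(ratings, labels):
    """ROC-AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen anomalous pixel is rated higher than a normal one.
    (Ties are not handled specially in this sketch.)"""
    order = np.argsort(ratings)
    ranks = np.empty(len(ratings), dtype=float)
    ranks[order] = np.arange(1, len(ratings) + 1)
    n_pos = np.sum(labels == 1)
    n_neg = np.sum(labels == 0)
    u = np.sum(ranks[labels == 1]) - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

# Toy example: anomalous pixels (label 1) tend to receive higher ratings.
labels = np.array([0, 0, 0, 0, 1, 1, 1])
ratings = np.array([0.1, 0.2, 0.3, 0.6, 0.5, 0.8, 0.9])
sens, spec = sensitivity_specificity(ratings, labels, threshold=0.5)
auc = roc_auc(ratings, labels)
```

Moving the threshold down increases sensitivity at the cost of specificity; the ROC-AUC summarizes the whole trade-off curve in one number, which is why it is used as the evaluation metric later in this paper.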
Previous unsupervised anomaly detection approaches in the medical field were primarily based on a reconstruction error. Van Leemput et al. (2001) use a statistical model to reconstruct the input tissue-wise, quantifying the discrepancies between the actual image and the model prediction to identify anomalies. Liu et al. (2014) decompose the image into low-rank components, which represent the normal parts of the image, and high-frequency parts, which represent anatomical and pathological variations, and are thus able to delineate suspicious areas. More recently, multiple deep-learning Autoencoder (AE) based methods have been proposed, all considering the reconstruction error. Chen and Konukoglu (2018); Chen et al. (2018) propose to use an adversarial latent loss in addition to a Variational Autoencoder (VAE) and compare it to different AE-based approaches. Baur et al. (2018) use a VAE with an adversarial loss on the reconstruction to obtain a more realistic reconstruction. Pawlowski et al. (2018) compare different AEs for CT-based pixel-wise segmentation.
All these approaches use the reconstruction error to identify suspicious regions, based on the idea that models cannot truthfully reproduce anomalies not seen during training. Despite showing good results, there are no formal guarantees for this assumption. In the next section, we describe how to use the score, defined as the derivative of the log-density with respect to the input Hyvärinen (2005), as an alternative anomaly rating.
2 Methods
Alain and Bengio (2014) have shown that for AE-based models with a denoising criterion, the reconstruction error approximates the score. It can thus be anticipated that most AE- and reconstruction-based models work due to an approximation of the score. Consequently, and based on the following assumptions, we hypothesize that the score can give a good approximation of an abnormality rating:

The score points in the direction of the normal data samples, which for medical data corresponds to transforming abnormal anatomies and pathologies into their healthy counterparts,

The magnitude of the score indicates how abnormal a pixel is.
In this work, we describe a way to directly estimate the score using VAEs, one of the best performing density-estimation models for images Chen et al. (2018); Kiran et al. (2018). The objective of VAEs is to learn a generative model of the data by maximizing the evidence lower bound (ELBO) for the given training data. The ELBO is defined as:

\log p(x) \geq \mathcal{L}(x) = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \| p(z)),    (1)

where q(z|x) is the inference model, p(z) is the prior for the latent variables, D_{KL} is the Kullback-Leibler divergence, and p(x|z) is the generative model. Thus, after training the VAE and maximizing the ELBO, an estimate of the log-probability \log p(x) of a data sample x can be calculated by evaluating the right-hand side of Eq. 1 for that sample. The approximate score can consequently be calculated by taking the derivative of the ELBO with respect to the data sample:

\nabla_x \log p(x) \approx \nabla_x \mathcal{L}(x).    (2)

Furthermore, the ELBO is fully differentiable Kingma and Welling (2013); Rezende et al. (2014) when training a VAE using Gaussian distributions for q(z|x) and p(x|z), a parameterization by neural networks, the reparameterization trick, and MC sampling to approximate the expectation. This allows training of the VAE and the evaluation of Eq. 2 using the backpropagation algorithm.
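As a toy numerical illustration of the identity behind Eq. 2 (ours, not the VAE itself: we use an isotropic Gaussian density, where the score is available in closed form, in place of the learned model), the derivative of the log-density can be checked against finite differences:

```python
import numpy as np

def log_density(x, mu, var):
    """Log-density of an isotropic Gaussian N(mu, var * I)."""
    d = x.size
    return -0.5 * (np.sum((x - mu) ** 2) / var + d * np.log(2 * np.pi * var))

def score(x, mu, var):
    """Closed-form score: derivative of the log-density w.r.t. the input.
    It points from x back towards the mode mu, and its magnitude grows
    with the distance from the 'normal' data."""
    return -(x - mu) / var

mu, var = np.zeros(3), 2.0
x = np.array([1.0, -0.5, 3.0])

# Central finite differences of the log-density reproduce the score.
eps = 1e-5
fd = np.array([
    (log_density(x + eps * e, mu, var) - log_density(x - eps * e, mu, var)) / (2 * eps)
    for e in np.eye(3)
])
```

The same two properties carry over to the hypothesized anomaly rating: the score points back towards the mode (the "healthy" data), and its per-dimension magnitude is largest for the entries furthest from it.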
We note that the above-mentioned assumptions can be violated in practice, especially for samples far away from the healthy data distribution. However, in the next section, we present empirical evidence that our model can outperform reconstruction-based methods on an anomaly detection task and describe its benefits.
3 Experiments & Results
To learn the healthy data distribution, we trained the VAE model on 1092 T2 MRI images of the Human Connectome Project (HCP) dataset Van Essen et al. (2012), with minor data augmentations such as multiplicative color augmentations, random mirroring, and rotations. We evaluate the anomaly detection in the context of finding and outlining tumors on the BraTS2017 dataset Bakas et al. (2017); Menze et al. (2015). To this end, we calculate a pixel-wise rating and report the ROC-AUC. Both datasets were normalized and slice-wise resampled to a resolution of 64x64 pixels. As encoder and decoder for the AE-based models, we used a 5-layer fully convolutional neural network with LeakyReLUs and a latent size of 1024. To backpropagate onto the image and approximate the score, we used the SmoothGrad algorithm Smilkov et al. (2017). Due to checkerboard artifacts caused by the convolutions, we apply Gaussian smoothing to the gradients. The model was trained for 60 epochs with a batch size of 64 and Adam as the optimizer.

To evaluate the benefits of the score, we compare the model to a Denoising Autoencoder (DAE) Vincent et al. (2010) with the same architecture, using the reconstruction error. Furthermore, we compare the score with the reconstruction error of the VAE, the smoothed reconstruction error, and the sampling deviations, determined as the standard deviation over multiple MC samples. We further inspect the score by dividing it into the reconstruction-loss gradient and the KL-loss gradient to gain insights into the benefits of including the KL-term in the anomaly detection. The results can be seen in Fig. 3a (and Appendix Table 1); samples and the corresponding pixel-wise ratings are presented in Fig. 1b (and Appendix Fig. 3 & 4).

The reconstruction error performs similarly for the VAE and the DAE, which was also reported in Chen et al. (2018); Pawlowski et al. (2018). Smoothing leads to slightly improved results, presumably by removing high-frequency detections, and performs on par with the usage of the sampling variances. The approximated score using the ELBO gradient (KL-loss + reconstruction-loss) performs best, with a pixel-wise ROC-AUC of 0.94 (see Appendix Fig. 2). Interestingly, the addition of the reconstruction-loss to the KL-loss shows little benefit over the KL-loss gradient alone. Furthermore, the reconstruction-loss gradient performs worse than the KL-loss gradient but outperforms the reconstruction error. In Fig. 1a, the reconstruction-loss gradient focuses on parts of poor reconstruction, and the combination of the KL-loss with the reconstruction-loss shows only marginal benefits over the KL-loss gradient. This might be an indication that, for this model, the KL-loss focuses primarily on the distance to the data distribution, while the reconstruction-loss focuses more on the actual reconstruction task.
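The gradient post-processing described above (SmoothGrad followed by Gaussian smoothing) can be sketched in a few lines of numpy. This is a minimal illustration under stated assumptions: the gradient function, noise level, kernel width, and toy image below are hypothetical stand-ins for the trained VAE's backpropagated ELBO gradient and the actual hyperparameters:

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """1D Gaussian kernel, normalized to sum to one."""
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def gaussian_smooth(img, sigma=1.0):
    """Separable Gaussian blur via two 1D convolutions; suppresses
    checkerboard-like high-frequency artifacts in the gradient map."""
    k = gaussian_kernel(sigma, radius=int(3 * sigma))
    img = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    img = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, img)
    return img

def smoothgrad(grad_fn, image, n_samples=16, noise_std=0.1, seed=None):
    """SmoothGrad: average the input gradient over noisy copies of the image."""
    rng = np.random.default_rng(seed)
    grads = [grad_fn(image + rng.normal(0.0, noise_std, image.shape))
             for _ in range(n_samples)]
    return np.mean(grads, axis=0)

# Hypothetical stand-in for the backpropagated ELBO gradient of a trained VAE.
toy_grad_fn = lambda img: img - img.mean()

image = np.zeros((64, 64))
image[20:30, 20:30] = 1.0  # toy "anomaly"
rating = gaussian_smooth(np.abs(smoothgrad(toy_grad_fn, image, seed=0)))
```

In the real pipeline, `grad_fn` would be the backpropagation of the ELBO (Eq. 1) onto the input image; the averaging over noisy copies and the subsequent blur only serve to stabilize and de-noise the resulting pixel-wise rating.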
3.1 Discussion & Conclusion
We have presented a way to estimate the score using VAE gradients to detect anomalies, evaluated on the BraTS2017 tumor segmentation dataset. The results show competitive unsupervised segmentation performance, slightly outperforming the previously best reported ROC-AUC of 0.92 Chen and Konukoglu (2018); Chen et al. (2018). The relative influence of the reconstruction loss can depend on the regularization of the latent variables. Using fewer latent variables or putting more weight on the KL-loss could, while potentially causing inferior overall performance, lead to a more competitive performance of the reconstruction error.
To the best of our knowledge, we are the first to use the gradients of a VAE, which approximate the score, to identify anomalies in images. The results suggest that the approximated score, including the often ignored KL-loss, can boost pixel-wise anomaly detection performance. We therefore want to stress that including the KL-loss, and thus the score of a model, can lead to an improvement in VAE-based methods for pixel-wise anomaly ratings.
References
Alain and Bengio (2014). What Regularized Autoencoders Learn from the Data-generating Distribution. J. Mach. Learn. Res. 15(1), pp. 3563–3593.
Bakas et al. (2017). Advancing The Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features. Sci Data 4, 170117.
Baur et al. (2018). Deep Autoencoding Models for Unsupervised Anomaly Segmentation in Brain MR Images. CoRR abs/1804.04488.
Chen and Konukoglu (2018). Unsupervised Detection of Lesions in Brain MRI using Constrained Adversarial Autoencoders. CoRR abs/1806.04972.
Chen et al. (2018). Deep Generative Models in the Real-World: An Open Challenge from Medical Imaging. CoRR abs/1806.05452.
Hyvärinen (2005). Estimation of Non-normalized Statistical Models by Score Matching. J. Mach. Learn. Res. 6, pp. 695–709.
Kingma and Dhariwal (2018). Glow: Generative Flow with Invertible 1x1 Convolutions. CoRR abs/1807.03039.
Kingma and Welling (2013). Auto-Encoding Variational Bayes. CoRR abs/1312.6114.
Kiran et al. (2018). An Overview of Deep Learning Based Methods for Unsupervised and Semi-Supervised Anomaly Detection in Videos. Journal of Imaging 4(2), 36.
Liu et al. (2014). Low-Rank to the Rescue – Atlas-based Analyses in the Presence of Pathologies. Med Image Comput Comput Assist Interv 17(Pt 3), pp. 97–104.
Liu et al. (2017). Detecting Cancer Metastases on Gigapixel Pathology Images. Technical report, arXiv.
Menze et al. (2015). The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS). IEEE Trans Med Imaging 34(10), pp. 1993–2024.
Pawlowski et al. (2018). Unsupervised Lesion Detection in Brain CT using Bayesian Convolutional Autoencoders. CoRR.
Rezende et al. (2014). Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In Proceedings of the 31st International Conference on Machine Learning (ICML'14), Beijing, China, pp. II-1278–II-1286.
Salimans et al. (2017). PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications. CoRR abs/1701.05517.
Smilkov et al. (2017). SmoothGrad: Removing Noise by Adding Noise. CoRR abs/1706.03825.
V et al. (2016). Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA 316(22), pp. 2402–2410.
Van Essen et al. (2012). The Human Connectome Project: A Data Acquisition Perspective. NeuroImage 62(4), pp. 2222–2231.
Van Leemput et al. (2001). Automated Segmentation of Multiple Sclerosis Lesions by Model Outlier Detection. IEEE Trans Med Imaging 20(8), pp. 677–688.
Vincent et al. (2010). Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. Journal of Machine Learning Research 11, pp. 3371–3408.
4 Appendix
4.1 Quantitative Results
Method                            ROC-AUC

DAE
Reconstruction Error
Smoothed Reconstruction Error
Sampling Variance
Reconstruction-Loss Gradient
KL-Loss Gradient
ELBO Gradient                     0.94
4.2 Qualitative Results