In recent years, several deep-learning-based methods have reported performance comparable to that of trained physicians Liu et al. (2017); V et al. (2016). One weakness of these approaches is that they still require a large amount of annotated data for each condition they are trained on. Because annotating medical images is time-intensive and the number of combinations of modalities, image qualities, hardware devices, and conditions is combinatorial, it is still infeasible to train an algorithm for each existing combination. Anomaly detection can, while not determining the condition, highlight and identify suspicious regions for closer inspection by a trained physician. Assigning each pixel an anomaly rating allows for an easy trade-off between specificity and sensitivity. While this may not outperform supervised algorithms, it offers a way to make use of unlabeled data and to aid physicians during diagnosis.
Previous unsupervised anomaly detection approaches in the medical field were primarily based on a reconstruction error. Van Leemput et al. (2001) use a statistical model to reconstruct the input tissue-wise, quantifying the discrepancies between the actual image and the model prediction to identify anomalies. Liu et al. (2014) decompose the model into low-rank components, which represent the normal parts of the image, and high-frequency parts, which represent anatomical and pathological variations, and are thus able to delineate suspicious areas. More recently, multiple deep-learning methods based on Autoencoders (AEs) have been proposed, all considering the reconstruction error. Chen and Konukoglu (2018); Chen et al. (2018) propose to use an adversarial latent loss in addition to a Variational Autoencoder (VAE) and compare it to different AE-based approaches. Baur et al. (2018) use a VAE with an adversarial loss on the reconstruction to obtain a more realistic reconstruction. Pawlowski et al. (2018) compare different AEs for CT-based pixel-wise segmentation.
All of these approaches use the reconstruction error to identify suspicious regions, based on the idea that models cannot faithfully reproduce anomalies not seen during training. Despite showing good results, there are no formal guarantees for that assumption. In the next section we describe how to use the score, defined as the derivative of the log-density with respect to the input Hyvärinen (2005), as an alternative anomaly rating.
Alain and Bengio (2014) have shown that for AE-based models with a denoising criterion the reconstruction error approximates the score. It can be anticipated that most AE- and reconstruction-based models work due to an approximation of the score. Consequently, and based on the following assumptions, we hypothesize that the score can give a good approximation for an abnormality rating:
- The score gives the direction towards the normal data samples, which for medical data is the data sample with abnormal anatomies and pathologies transformed into healthy parts.
- The magnitude of the score indicates how abnormal the pixel is.
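As a minimal illustration of both assumptions, consider an isotropic Gaussian as a stand-in for the healthy-data density. This choice is an assumption made purely for illustration; real image distributions are far more complex.

```latex
p(x) = \mathcal{N}(x;\, \mu,\, \sigma^2 I)
\quad\Longrightarrow\quad
\nabla_x \log p(x) = \frac{\mu - x}{\sigma^2},
\qquad
\lVert \nabla_x \log p(x) \rVert = \frac{\lVert x - \mu \rVert}{\sigma^2}.
```

In this toy case the score points from $x$ toward the mode $\mu$, and its magnitude grows with the distance from the mode, which matches the two assumptions.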
In this work, we describe a way to directly estimate the score using VAEs, one of the best performing density-estimation models for images Chen et al. (2018); Kiran et al. (2018). The objective of VAEs is to learn a generative model of the data by maximizing the evidence lower bound (ELBO) for the given training data. The ELBO is defined as:
$$\log p(x) \geq -D_{KL}\big(q(z|x)\,\|\,p(z)\big) + \mathbb{E}_{q(z|x)}\big[\log p(x|z)\big], \tag{1}$$

where $q(z|x)$ is the inference model, $p(z)$ is the prior for the latent variables, $D_{KL}$ is the Kullback-Leibler divergence, and $p(x|z)$ is the generative model. Thus, after training the VAE and maximizing the ELBO, an estimate of the log probability $\log p(x)$ of a data sample can be calculated by evaluating the rhs of Eq. 1 for the data sample $x$. The approximate score can consequently be calculated by taking the derivative of the ELBO with respect to the data sample:

$$\nabla_x \log p(x) \approx \nabla_x \Big( -D_{KL}\big(q(z|x)\,\|\,p(z)\big) + \mathbb{E}_{q(z|x)}\big[\log p(x|z)\big] \Big), \tag{2}$$

when training a VAE using Gaussian distributions, a parameterization by neural networks, the reparameterization trick, and MC sampling to approximate the expectation. This allows training of the VAE and the evaluation of Eq. 2 using the backpropagation algorithm.
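The computation in Eq. 2 can be sketched on a toy model. The snippet below is a minimal illustration, not the architecture used in this work: it assumes a linear encoder and decoder with random (hypothetical) weights, a unit-variance Gaussian likelihood, a standard-normal prior, and it replaces backpropagation with central finite differences so that it needs only NumPy. All names (`W_enc`, `W_dec`, `elbo`, `approx_score`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, S = 6, 2, 32                           # input dim, latent dim, MC samples (illustrative)
W_enc = rng.normal(scale=0.3, size=(K, D))   # hypothetical "trained" encoder weights
W_dec = rng.normal(scale=0.3, size=(D, K))   # hypothetical "trained" decoder weights
log_var = np.full(K, -1.0)                   # fixed encoder log-variance, for simplicity
eps = rng.normal(size=(S, K))                # noise samples, shared across evaluations


def elbo(x):
    """MC estimate of the rhs of Eq. 1 for a single input vector x."""
    mu = W_enc @ x                           # mean of q(z|x)
    std = np.exp(0.5 * log_var)
    z = mu + std * eps                       # reparameterization trick, shape (S, K)
    x_hat = z @ W_dec.T                      # decoder mean of p(x|z), shape (S, D)
    # log N(x; x_hat, I) for each MC sample
    log_lik = -0.5 * np.sum((x - x_hat) ** 2, axis=1) - 0.5 * D * np.log(2 * np.pi)
    # closed-form KL(q(z|x) || N(0, I))
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return log_lik.mean() - kl


def approx_score(x, h=1e-5):
    """Central finite-difference gradient of the ELBO w.r.t. x (stand-in for backprop)."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (elbo(x + step) - elbo(x - step)) / (2 * h)
    return grad


x = rng.normal(size=D)
score = approx_score(x)
anomaly_rating = np.abs(score)               # per-dimension ("pixel-wise") rating
```

Taking the absolute value of the gradient as the rating is one possible choice, made here as an assumption; with an automatic-differentiation framework the loop in `approx_score` would be replaced by a single backward pass.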
We note that the above-mentioned assumptions can be violated in practice, especially in cases far away from the healthy sample data distribution. However, in the next section, we present empirical evidence that our model can outperform reconstruction-based methods on an anomaly detection task and describe its benefits.
3 Experiments & Results
To learn the healthy data distribution, we trained the VAE model on 1092 T2 MRI images of the Human Connectome Project (HCP) dataset Van Essen et al. (2012), with minor data augmentations such as multiplicative color augmentations, random mirroring, and rotations. We evaluate the anomaly detection in the context of finding and outlining tumors on the BraTS-2017 dataset Bakas et al. (2017); Menze et al. (2015). To this end, we calculate a pixel-wise rating and then report the ROC-AUC. Both datasets were normalized and slice-wise resampled to a resolution of 64x64 pixels. As encoder and decoder for the AE-based models, we used a 5-layer fully convolutional neural network with LeakyReLUs and a latent size of 1024. To backpropagate onto the image and approximate the score, we used the SmoothGrad algorithm Smilkov et al. (2017). Due to checkerboard artifacts caused by the convolutions, we apply Gaussian smoothing to the gradients. The model was trained for 60 epochs with a batch size of 64 and Adam as the optimizer with a learning rate of.
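The two smoothing steps mentioned above (averaging gradients over noisy inputs, then Gaussian smoothing of the gradient map) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: `grad_fn` is a hypothetical callable standing in for the backpropagated ELBO gradient, and the number of samples, noise level, kernel width, and the use of the absolute gradient are illustrative choices rather than the settings used in the experiments.

```python
import numpy as np


def smoothgrad(grad_fn, x, n_samples=16, noise_std=0.1, seed=0):
    """Average grad_fn over noisy copies of x (SmoothGrad-style estimate)."""
    rng = np.random.default_rng(seed)
    grads = [grad_fn(x + rng.normal(scale=noise_std, size=x.shape))
             for _ in range(n_samples)]
    return np.mean(grads, axis=0)


def gaussian_kernel1d(sigma, radius):
    """Normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    return kernel / kernel.sum()


def gaussian_smooth2d(image, sigma=1.0):
    """Separable Gaussian blur with edge padding; output keeps the input shape."""
    radius = int(3 * sigma + 0.5)
    kernel = gaussian_kernel1d(sigma, radius)
    padded = np.pad(image, radius, mode="edge")
    # Blur along rows, then along columns ("valid" trims the padding again).
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")


# Toy stand-in for the ELBO gradient: the score of a Gaussian centred on `healthy`.
healthy = np.zeros((64, 64))
grad_fn = lambda x: healthy - x

image = np.zeros((64, 64))
image[20:30, 20:30] = 1.0                    # synthetic "anomaly"
rating = gaussian_smooth2d(np.abs(smoothgrad(grad_fn, image)))
```

In this toy setup the rating is high inside the synthetic square and close to zero elsewhere; with a trained VAE, `grad_fn` would instead return the gradient of the ELBO with respect to the input slice.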
To evaluate the benefits of the score, we compare the model to a Denoising Autoencoder (DAE) Vincent et al. (2010) with the same architecture using the reconstruction error. Furthermore, we compare the score with the reconstruction error of the VAE, the smoothed reconstruction error, and the sampling deviations, determined as the standard deviation of multiple MC samples. We further inspect the score, dividing it into the reconstruction-loss gradient and the KL-loss gradient to get insights into the benefits of including the KL-term in the anomaly detection. The results can be seen in Fig. 3a (and Appendix Table 1); samples and the corresponding pixel-wise ratings are presented in Fig. 1b (and Appendix Fig. 3 & 4).
Smoothing leads to slightly improved results, presumably by removing high-frequency detections, and performs on par with the usage of the sampling variances. The approximated score using the ELBO gradient (KL-loss + reconstruction-loss) performs best with a pixel-wise ROC-AUC of 0.94 (see Appendix Fig. 2). It is interesting to see that the addition of the reconstruction-loss to the KL-loss shows little benefit over the KL-loss gradient. Furthermore, the reconstruction-loss gradient performs worse than the KL-loss gradient but outperforms the reconstruction error.
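For reference, the pixel-wise ROC-AUC used above can be read as the probability that a randomly chosen anomalous pixel receives a higher rating than a randomly chosen normal pixel. The following is a minimal sketch of that definition, assuming a rating map and a binary ground-truth mask of the same shape; the function name `pixelwise_roc_auc` is hypothetical and the pairwise formulation is chosen for clarity, not efficiency.

```python
import numpy as np


def pixelwise_roc_auc(rating, mask):
    """ROC-AUC as the probability that an anomalous pixel outranks a normal one."""
    rating = np.asarray(rating, dtype=float).ravel()
    mask = np.asarray(mask, dtype=bool).ravel()
    pos, neg = rating[mask], rating[~mask]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("mask must contain both anomalous and normal pixels")
    # Pairwise comparison; ties count as half. Fine for small images,
    # a rank-based implementation scales better for full datasets.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)


auc = pixelwise_roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])   # 3 of 4 pairs ordered correctly
```

Because the metric depends only on the ordering of the ratings, it is independent of any particular threshold, which is why it suits a comparison of rating maps with different scales.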
In Fig. 1a, the reconstruction-loss gradient focuses on parts of poor reconstruction, and the combination of the KL-loss with the reconstruction-loss shows only marginal benefits over the KL-loss gradient. This might be an indication that for this model the KL-loss focuses primarily on the distance to the data distribution, while the reconstruction focuses more on the actual reconstruction task.
3.1 Discussion & Conclusion
We have presented a way to estimate the score using VAE gradients to detect anomalies on the BraTS-2017 tumor segmentation dataset. The results show competitive unsupervised segmentation performance, slightly outperforming the previously best reported ROC-AUC of 0.92 Chen and Konukoglu (2018); Chen et al. (2018). The relative influence of the reconstruction loss can depend on the regularization of the latent variables. Using fewer latent variables or putting more importance on the KL-loss could, while potentially causing inferior overall performance, lead to a more competitive performance of the reconstruction error.
To the best of our knowledge, we are the first to use the gradients of a VAE, which approximate the score, to identify anomalies in images. The results suggest that the approximated score, including the often ignored KL-loss, can give a boost to the pixel-wise anomaly detection performance. Furthermore, we want to stress that including the KL-loss in pixel-wise anomaly detection, and using the score of a model, can improve VAE-based methods for pixel-wise anomaly ratings.
References
- What Regularized Auto-encoders Learn from the Data-generating Distribution. J. Mach. Learn. Res. 15 (1), pp. 3563–3593.
- Advancing The Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features. Sci Data 4, pp. 170117.
- Deep Autoencoding Models for Unsupervised Anomaly Segmentation in Brain MR Images. CoRR abs/1804.04488.
- Unsupervised Detection of Lesions in Brain MRI using constrained adversarial auto-encoders. CoRR abs/1806.04972.
- Deep Generative Models in the Real-World: An Open Challenge from Medical Imaging. CoRR abs/1806.05452.
- Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res. 6, pp. 695–709.
- Glow: Generative Flow with Invertible 1x1 Convolutions. CoRR abs/1807.03039.
- Auto-Encoding Variational Bayes. CoRR abs/1312.6114.
- An Overview of Deep Learning Based Methods for Unsupervised and Semi-Supervised Anomaly Detection in Videos. Journal of Imaging 4 (2), pp. 36.
- Low-Rank to the Rescue – Atlas-based Analyses in the Presence of Pathologies. Med Image Comput Comput Assist Interv 17 (Pt 3), pp. 97–104.
- Detecting cancer metastases on gigapixel pathology images. Technical report arXiv.
- The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS). IEEE Trans Med Imaging 34 (10), pp. 1993–2024.
- Unsupervised Lesion Detection in Brain CT using Bayesian Convolutional Autoencoders. CoRR.
- Stochastic Backpropagation and Approximate Inference in Deep Generative Models. Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML’14, Beijing, China, pp. II–1278–II–1286.
- PixelCNN++: improving the pixelcnn with discretized logistic mixture likelihood and other modifications. CoRR abs/1701.05517.
- SmoothGrad: removing noise by adding noise. CoRR abs/1706.03825.
- Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316 (22), pp. 2402–2410.
- The Human Connectome Project: a data acquisition perspective. Neuroimage 62 (4), pp. 2222–2231.
- Automated segmentation of multiple sclerosis lesions by model outlier detection. IEEE Trans Med Imaging 20 (8), pp. 677–688.
- Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. Journal of Machine Learning Research 11 (Dec), pp. 3371–3408.
4.1 Quantitative Results
|Smoothed Reconstruction Error|