Automating anomaly detection in medical imaging with artificial intelligence has attracted growing interest in recent years. Indeed, analyzing images to localize potential abnormalities seems well suited to supervised computer vision algorithms. However, these solutions remain data hungry and require knowledge transfer from human to machine via image annotations. Furthermore, classification into a limited number of user-predefined categories, such as healthy or tumor, will not generalize well if a previously unseen anomaly appears. For visual inspection, a better-suited task is unsupervised anomaly detection, in which the abnormality must be localized using only prior knowledge of normal samples.
From a statistical point of view, an anomaly may be seen as an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism. In this setting, deep generative models such as Variational AutoEncoders (VAEs) and β-VAEs are especially interesting because they are capable of inferring possible sampling mechanisms for a given dataset. The VAE jointly learns an encoder, which compresses input samples into a low-dimensional space, and a decoder, which decompresses the low-dimensional samples back into the original input space. Training simultaneously minimizes the distance between the input of the encoder and the output of the decoder, and the distance between the latent distribution and a prior distribution (usually Gaussian). The decompressed output for a given input is often called the reconstruction, and is used as a kind of projection of the input onto the support of the normal data distribution, usually called the normal manifold. In most unsupervised anomaly localization methods based on VAEs, models are trained on normal data and anomaly localization is then performed using a distance metric between the input sample and its reconstruction [4, 3, 5]. The localization in those studies is based solely on the reconstruction error, and thus outlines regions as suspicious if they cannot be adequately reconstructed by the model.
One obvious deficiency is that the capability of a VAE to reconstruct anomalies is, by design, tightly coupled to the expressiveness (size and configuration) of the latent space. To further improve the localization performance, there are at least two branches to explore: 1) searching for other predictors that are not highly dependent on the modeling capacity of the VAE; 2) improving the projection quality in a principled way, so that normal regions in the projected image are identical to those in the input image. The loss of the VAE, derived from the evidence lower bound (ELBO), consists of two parts: a reconstruction loss and a Kullback-Leibler (KL) divergence loss. For the former branch, Zimmerer et al. [22, 23] found that the gradient of the KL loss with respect to the input is a robust predictor.
For the latter branch, instead of using the VAE reconstruction, Dehaene et al. proposed to iteratively project the abnormal data onto the normal manifold more accurately by optimizing a specific energy function. For natural images, they found that their reconstruction-error-based method significantly outperforms [22, 23].
In this paper, we argue that the performance of different predictors is highly dependent on the VAE settings, e.g. the size of the latent space and the weight of the KL loss (with which the VAE becomes a β-VAE). We also test the energy-minimization projection method in a medical imaging scenario (T2 brain MRI images) and find that it is not as powerful as on simple natural images.
2.1 VAE and β-VAE
In unsupervised anomaly detection, the only data available during training are samples from a normal dataset. In a generative setting, we assume the existence of a probability density function whose support is the whole input space and from which the dataset was sampled. The generative objective is to model an estimate of this density, from which one can obtain new samples close to the dataset.
Popular deep generative models are generative adversarial networks (GANs) and VAEs. The advantage of GANs is that they can generate sharp and realistic samples, as a discriminator is trained simultaneously to guide the generator. However, GANs are notoriously difficult to train and suffer from mode collapse, meaning that they tend to generate only a subset of the original dataset. This can be problematic for anomaly detection, in which we do not want some subset of the normal data to be considered anomalous. Recent works propose substantial upgrades; however, other works still support that GANs have more trouble than other generative models in covering the whole distribution support.
Another deep generative model is the VAE, which consists of an encoder and a decoder. The decoder, similar to a GAN generator, tries to approximate the dataset distribution conditioned on latent variables drawn from a simple prior. We would like to maximize the estimated likelihood of the dataset. To make the learning tractable, importance sampling is used by introducing density functions output by an encoder, and the variational evidence lower bound (ELBO) can be deduced as:
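In commonly used notation, which is assumed here rather than taken from the text (with $x$ an input, $z$ a latent variable, $p_\theta(x|z)$ the decoder, $q_\phi(z|x)$ the encoder and $p(z)$ the prior), the bound is usually written in the following form:

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]
- D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p(z)\big)
\;=\; -\mathcal{L}(x)
```

Here $\mathcal{L}(x)$ denotes the VAE loss for the sample $x$: the first term rewards accurate reconstruction and the second keeps the encoder output close to the prior.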
The negative of the ELBO is used as the loss function of the VAE during training. VAEs are known to produce blurry reconstructions and generations. Their advantages are that they are less likely to suffer from the mode collapse problem and that they can project a new input onto the training dataset manifold in a single forward pass, without the iterative optimization required when using GANs. β-VAEs share all the merits of VAEs, and their loss function is formulated as:
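A commonly used way of writing the β-VAE loss is given below. The notation is an assumption, since the symbols are not fixed in the surrounding text: $x$ is an input, $z$ a latent variable, $p_\theta(x|z)$ the decoder, $q_\phi(z|x)$ the encoder, $p(z)$ the prior and $\beta$ the weight of the KL term.

```latex
\mathcal{L}_\beta(x) \;=\;
-\,\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]
\;+\; \beta\, D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p(z)\big)
```

Setting $\beta = 1$ recovers the loss of the standard VAE.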
By putting more weight (β > 1) on the KL term, β-VAEs encourage the learning of disentangled factors in the latent space.
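As an illustration of how the weighted KL term enters the objective, the sketch below evaluates a β-VAE loss for a single sample. It assumes a diagonal Gaussian posterior, a standard normal prior and a squared-error reconstruction term; the function names are hypothetical, and the code is not the implementation used in the experiments.

```python
import math


def kl_diag_gaussian(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def squared_error(x, x_hat):
    """Sum of squared differences between input and reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def beta_vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Reconstruction term plus beta times the KL term (beta=1 gives the VAE loss)."""
    return squared_error(x, x_hat) + beta * kl_diag_gaussian(mu, log_var)
```

With a perfect reconstruction, the loss reduces to the weighted KL term, so increasing `beta` directly increases the penalty on posteriors that drift away from the prior.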
2.2 Predictors for Pixel-wise Anomaly Localization
We consider an anomaly to be a sample with low probability under our estimate of the normal dataset distribution. Since the negative VAE loss (the ELBO) is a lower bound on the log-density, the loss can serve as a proxy for classifying samples as normal or abnormal. To this end, a threshold can be defined on the loss function, such that samples whose loss exceeds the threshold are considered anomalous and samples whose loss falls below it are considered normal. However, according to Nalisnick et al., the likelihood of a data point under a deep generative model is not a reliable measure for detecting abnormal samples. Moreover, according to Matsubara et al., the regularization term has a negative influence on the computation of anomaly scores. They instead proposed an unregularized score, which is equivalent to the reconstruction loss of a standard autoencoder. Going from anomaly detection to anomaly localization, this reconstruction term becomes crucial to most existing solutions. Indeed, the inability of the model to reconstruct a given part of an image is used to segment the anomaly, by applying a pixel-wise threshold to the reconstruction error [4, 3, 5]. We call this a reconstruction-loss-based predictor. However, according to [23, 22], the magnitude of the gradient of the loss with respect to the input, computed for example from the KL term or from the full ELBO, is a useful and possibly more robust predictor. To clarify the notation, we list the predictors and their formulations below.
“Rec-Error”: the reconstruction term of the loss. We parameterize the approximate posterior as a diagonal Gaussian and, during inference, approximate the reconstruction term by the pixel-wise L2 distance between the input and its reconstruction.
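A minimal sketch of a pixel-wise reconstruction-error predictor is given below, assuming that images are stored as nested lists of floats and that a reconstruction is already available. The function names and the threshold value are hypothetical and only serve to illustrate the idea.

```python
def rec_error_map(x, x_hat):
    """Pixel-wise squared difference between an image and its reconstruction."""
    return [[(a - b) ** 2 for a, b in zip(row_x, row_r)]
            for row_x, row_r in zip(x, x_hat)]


def threshold_map(score_map, threshold):
    """Binary anomaly mask: 1 where the score exceeds the threshold, else 0."""
    return [[1 if s > threshold else 0 for s in row] for row in score_map]


# Toy 2x2 image whose lower-left pixel is not reproduced by the reconstruction.
image = [[0.0, 1.0], [1.0, 0.0]]
reconstruction = [[0.0, 1.0], [0.0, 0.0]]
mask = threshold_map(rec_error_map(image, reconstruction), threshold=0.5)
```

In this toy case only the pixel that the reconstruction fails to reproduce is marked in `mask`.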
2.3 Improving the Performance of Different Predictors
For the two classes of predictors, different strategies can be used to improve their respective anomaly localization performance. For the “Rec-Error” predictor, we apply an iterative projection method, similar to adversarial sample generation, to medical imaging and test its effectiveness. For the other predictors, we propose to use β-VAE to strike a different balance between latent space information and reconstruction accuracy for better anomaly localization.
Iterative projection for more accurate reconstruction error. For the “Rec-Error” predictor, the assumption is that the trained VAE is able to alter anomalous pixels while keeping normal pixels untouched during reconstruction. In other words, for this predictor, the ideal generative model satisfies the following property:
where the constant is some positive number. However, practical VAEs are not guaranteed to have this property, which makes the “Rec-Error” predictor sub-optimal. To make the projection more accurate with respect to Eqn. 2.3, Dehaene et al. propose to apply the idea of adversarial sample generation: starting from an input sample, iterate gradient descent steps over the input, constructing a sequence of samples, to minimize an energy defined as
Each iteration computes the next sample as
where the step size is the learning rate, and a weighting parameter trades off the inclusion of the current sample in the normal manifold, given by the reconstruction loss, against its proximity to the original input, ensured by the regularization term. This method enables the “Rec-Error” predictor to significantly outperform all gradient-based predictors on simple natural images. However, the effectiveness of this method on more challenging datasets, e.g. brain MRI images, was not clear before our work. For clarity, we call this method “Proj-Rec-Error”.
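The iteration can be illustrated with a toy sketch. Several assumptions are made here that are not taken from the text: the energy has the form $E(x_t) = L_r(x_t) + \lambda \lVert x_t - x_0 \rVert_1$, the update is a plain gradient step $x_{t+1} = x_t - \alpha \nabla_x E(x_t)$ rather than an Adam step, and an orthogonal projection onto a line stands in for the trained VAE so that the gradient can be written analytically. All names are hypothetical.

```python
def project(x, u):
    """Toy 'reconstruction': orthogonal projection of x onto the unit vector u."""
    dot = sum(a * b for a, b in zip(x, u))
    return [dot * b for b in u]


def sign(v):
    return (v > 0) - (v < 0)


def energy(x, x0, u, lam):
    """Assumed energy: squared reconstruction residual + lam * L1 distance to x0."""
    rec = project(x, u)
    l_r = sum((a - r) ** 2 for a, r in zip(x, rec))
    l1 = sum(abs(a - b) for a, b in zip(x, x0))
    return l_r + lam * l1


def iterative_projection(x0, u, lam=0.05, alpha=0.1, steps=50):
    """Gradient descent over the input, starting from x0."""
    x = list(x0)
    for _ in range(steps):
        rec = project(x, u)
        # For an orthogonal projection the gradient of the residual norm is 2 * (x - rec).
        grad = [2.0 * (a - r) + lam * sign(a - b) for a, r, b in zip(x, rec, x0)]
        x = [a - alpha * g for a, g in zip(x, grad)]
    return x


def proj_rec_error(x0, x_final):
    """Pixel-wise distance between the input and its iterative projection."""
    return [abs(a - b) for a, b in zip(x0, x_final)]


# The 'normal manifold' is the first axis; the second coordinate is off-manifold.
x0 = [1.0, 0.5]
u = [1.0, 0.0]
x_final = iterative_projection(x0, u)
scores = proj_rec_error(x0, x_final)
```

In this toy setting the coordinate that already lies on the manifold is left untouched while the off-manifold coordinate is pulled toward it, which is the behavior that the ideal property above asks for.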
β-VAE for better anomaly localization. As a variant of the VAE, β-VAE is designed for unsupervised discovery of interpretable factorized representations from raw image data. An adjustable hyperparameter β is introduced to balance the learning constraints (a limit on the capacity of the latent information and an emphasis on learning statistically independent latent factors) against reconstruction accuracy. Hoffman et al. introduced a reformulation of β-VAE for a specific range of β. They argued that, within this range, training a β-VAE is equivalent to optimizing an approximate log-marginal likelihood bound of a VAE under an implicit prior. All in all, different β values should induce a different balance between latent information (related to the KL term) and reconstruction accuracy (related to the reconstruction term), which in turn changes the performance of “Rec-Error”, “Rec-grad”, “KL-grad” and their combinations.
Intuitively, a larger β, which puts less weight on reconstruction accuracy, may make the reconstruction insensitive to changes in the input and therefore less accurate, which is similar to posterior collapse. It would thus appear that a larger β degrades the performance of “Rec-Error”. However, this argument only holds for normal data. On the other hand, a larger β encourages a more disentangled representation in the latent space, which may capture the normal data manifold better; the corresponding β-VAE may then have a superior ability to “inpaint” abnormal pixels with their normal counterparts, and thereby improve the performance of “Rec-Error”. Similar reasoning applies to the KL-related predictors. In this work, we investigate the effects of β experimentally.
We also investigate whether combining different predictors can improve localization accuracy. In fact, “Combi” can be seen as a heuristic ensemble of the “KL-grad” and “Rec-Error” predictors. However, it is non-trivial to do this systematically in the fully unsupervised setting. To leverage the full power of all predictors, we have to use a small portion of the dataset, including both normal and abnormal data, to find a reasonable way to combine them. We propose to use a logistic regression model, in which the values of the different predictors are treated as features, to ensemble the predictors for possible performance improvement.
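The ensembling idea can be sketched as follows, assuming that each pixel is described by a small feature vector holding the values of the individual predictors and by a binary label (1 for abnormal). The batch gradient descent fit, the hyperparameters and the toy data are illustrative assumptions, not the procedure or data used in the experiments.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_logistic(features, labels, lr=0.5, epochs=1000):
    """Fit logistic regression weights and bias by batch gradient descent."""
    n, n_feat = len(features), len(features[0])
    w, b = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feat, 0.0
        for f, y in zip(features, labels):
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)
            for j in range(n_feat):
                grad_w[j] += (p - y) * f[j]
            grad_b += p - y
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(features, w, b):
    """Combined anomaly score in (0, 1) for each feature vector."""
    return [sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) for f in features]


# Hypothetical per-pixel predictor values, e.g. [Rec-Error, KL-grad, Rec-grad].
pixel_features = [[0.1, 0.2, 0.1], [0.2, 0.1, 0.3], [0.9, 0.8, 0.7], [0.8, 0.9, 0.9]]
pixel_labels = [0, 0, 1, 1]
weights, bias = fit_logistic(pixel_features, pixel_labels)
combined_scores = predict(pixel_features, weights, bias)
```

The learned weights give a rough indication of how much each predictor contributes to the combined score, which is one way to inspect whether the predictors are complementary.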
In this section, we evaluate the effectiveness of the iterative projection method on a more challenging dataset, i.e. brain MRI images. We then test the proposed β-VAE-based anomaly localization method.
Training dataset: To learn the distribution of normal brain MRI images, we trained the VAE and β-VAE on 3T T2 MRI images from the Human Connectome Project (HCP) dataset, which come from 1113 healthy young adult participants (age 22-35). Data augmentation includes adding random noise, random rotation, and color augmentation. Test dataset: We evaluate the anomaly localization methods on the BraTS2018 dataset [15, 1, 2]. There are 285 cases in total, and only the T2 image of each case was used in our experiments. The resolution is 1x1x1 mm isotropic and all image volumes have a size of 240x240x155. We do not have access to the BraTS2017 dataset, as the resource link is out of date.
3.2 Pre-processing and Hyperparameters
Both training and test dataset were normalized to have zero mean and unit variance, and slice-wise resampled to have a size of 64x64 pixels. For training, all models were trained for 500 epochs with an Adam optimizer having an initial learning rate of. During inference, for the “Proj-Rec-Error” method, an Adam optimizer with a learning rate in Eqn. 5 was utilized.
3.3 VAE and β-VAE architecture
As VAE and β-VAE differ only in the loss function, for a fair comparison we give them the same architecture, following prior work: a 5-layer fully-convolutional encoder and decoder with feature-map sizes of 16-32-64-256. We used strided convolutions (stride 2) for downsampling and transposed convolutions for upsampling, each followed by a LeakyReLU non-linearity.
In this section, we aim to answer the following questions: 1) Which predictors are better? 2) Is “Proj-Rec-Error” effective in medical imaging? 3) Is a large β good or bad for anomaly localization? 4) Are these predictors complementary? The metric we use is the pixel-wise area under the receiver operating characteristic curve (AUROC), which is commonly used to evaluate unsupervised binary classification because it does not depend on a particular decision threshold.
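For illustration, the pixel-wise AUROC can be computed from flattened anomaly scores and binary ground-truth labels using the rank-based (Mann-Whitney) formulation, as in the sketch below. The function name is hypothetical, and in practice a library routine would typically be used instead.

```python
def auroc(scores, labels):
    """Probability that a random abnormal pixel scores higher than a random
    normal pixel, with ties counted as one half."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend the group over all pixels sharing the same score.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)
```

A value of 1.0 means that every abnormal pixel receives a higher score than every normal pixel, while 0.5 corresponds to a predictor that carries no ranking information.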
Which predictors are better?
As noted in prior work, the localization performance is highly dependent on two VAE settings: the image size and the latent space dimension. A good image size is a trade-off between the modeling difficulty of the VAE and the tumor localization accuracy achievable at a given resolution. It is well known that VAEs have difficulty with datasets of large image size. However, if the resolution is too low, the localization will be too coarse. Selecting the latent space dimension is also non-trivial: if it is too large, the learned latent space may not be well defined, e.g. the VAE can reconstruct abnormal input successfully, as demonstrated in Fig. 1(a); if it is too small, the VAE cannot retain all the generative factors of variation, which causes over-smoothed reconstructions and hinders accurate anomaly localization, as also demonstrated in Fig. 1(a).
The best settings reported in prior work on BraTS2017 are an image size of 64x64 and a latent space dimension of 256. In our experiments on BraTS2018, whose scenario is similar to that of BraTS2017, we fix the image size at 64x64 and explore the effect of the latent size on localization performance. The results are listed in Tab. 1. As can be seen, in contrast to those earlier observations, which favored “KL-grad” and “Combi”, our experiments indicate that “Rec-Error”, “ELBO-grad” and “Rec-grad” are the best predictors, and that the latter two perform comparably for all three latent space dimensions.
Effectiveness of “Proj-Rec-Error” in medical imaging. We applied “Proj-Rec-Error”, which is highly successful on simple natural images, to the more challenging brain tumor localization problem. The baseline is the VAE model with latent size 64. The AUROC of “Proj-Rec-Error” is 0.861, compared to 0.856 for “Rec-Error” and 0.900 for “Rec-grad”. The projection thus slightly improves the performance of “Rec-Error”, but it is still outperformed by the gradient-based method “Rec-grad”. The main cause, however, seems to be the VAE model itself, which may not model the normal brain distribution well enough. This is illustrated in Fig. 1(b), where the iterative projection corrects only part of the reconstruction in normal regions, which indicates that the normal image corresponding to the input is not modeled well by the VAE.
Does β-VAE help? Based on the indication from the previous part, we explored the modeling capacity of β-VAEs, a generalization of the VAE, and their performance on the anomaly localization task. The results for different β values are listed in Tab. 2. As can be seen, one particular β value appears to yield the best performance for the first four predictors; interestingly, however, the performance of “Combi” degrades consistently as β increases. Moreover, “Proj-Rec-Error” consistently outperforms “Rec-Error”. The reconstruction behavior of β-VAE is demonstrated in Fig. 2. One can notice that the reconstruction differs increasingly from the input as β increases. In some sense, when a β-VAE with a larger β is trained on the normal data distribution, the reconstruction of abnormal data is more biased toward the learned normal data distribution, which is good for abnormal regions but bad for normal regions with respect to Eqn. 2.3.
Complementarity of different predictors. To attempt a further boost in localization performance, 10% of the test data was used to train a logistic regression model with “Rec-Error”, “KL-grad” and “Rec-grad” as independent features. We leave out “ELBO-grad”, since it is just the sum of “KL-grad” and “Rec-grad”. We use the β-VAE (β = 10) as the backbone. The resulting weight parameters of the three predictors are in the scale of and and the AUROC is , which is a little bit worse than that of “Rec-grad” (). This indicates that the “Rec-grad” predictor may already contain almost all the information in the other predictors, and that they are far from complementary to each other.
In this paper, we applied the energy-based projection to a more challenging medical imaging scenario and found that it is not as useful as on natural images. Moreover, we observe that the robustness of the KL gradient predictor depends strongly on the settings of the VAE. We also explored the effect of the weight of the KL loss within β-VAE on anomaly localization. Ensembles of different predictors were also investigated.
[1] (2017) Advancing the Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features. Scientific Data 4, pp. 170117.
[2] Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BraTS challenge. arXiv preprint arXiv:1811.02629.
[3] (2018) Deep autoencoding models for unsupervised anomaly segmentation in brain MR images. In International MICCAI Brainlesion Workshop, pp. 161–169.
[4] MVTec AD – a comprehensive real-world dataset for unsupervised anomaly detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9592–9600.
[5] Unsupervised detection of lesions in brain MRI using constrained adversarial auto-encoders. International Conference on Medical Imaging with Deep Learning.
[6] (2020) Iterative energy-based projection on a normal data manifold for anomaly localization. In International Conference on Learning Representations.
[7] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
[8] (2016) NIPS 2016 tutorial: generative adversarial networks. arXiv preprint arXiv:1701.00160.
[9] Identification of outliers. Vol. 11, Springer.
[10] (2017) Beta-VAE: learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations.
[11] (2017) The β-VAE’s implicit prior. In Workshop on Bayesian Deep Learning, NIPS, pp. 1–5.
[12] (2014) Auto-encoding variational Bayes. In International Conference on Learning Representations.
[13] (2019) Don’t blame the ELBO! A linear VAE perspective on posterior collapse. In Advances in Neural Information Processing Systems, pp. 9403–9413.
[14] Anomaly machine component detection by deep generative model with unregularized score. 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
[15] (2014) The multimodal brain tumor image segmentation benchmark (BraTS). IEEE Transactions on Medical Imaging 34 (10), pp. 1993–2024.
[16] (2019) Do deep generative models know what they don’t know? In International Conference on Learning Representations.
[17] (2019) Classification accuracy score for conditional generative models. In Advances in Neural Information Processing Systems, pp. 12247–12258.
[18] (2019) Generating diverse high-fidelity images with VQ-VAE-2. In Advances in Neural Information Processing Systems, pp. 14837–14847.
[19] (2017) Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging, pp. 146–157.
[20] (2019) Improving generalization and stability of generative adversarial networks. In International Conference on Learning Representations.
[21] (2012) The Human Connectome Project: a data acquisition perspective. NeuroImage 62 (4), pp. 2222–2231.
[22] (2019) Unsupervised anomaly localization using variational auto-encoders. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 289–297.
[23] (2018) A case for the score: identifying image anomalies using variational autoencoder gradients. In 32nd Conference on Neural Information Processing Systems (NeurIPS 2018).