Autoencoders for Unsupervised Anomaly Segmentation in Brain MR Images: A Comparative Study

04/07/2020 ∙ by Christoph Baur, et al.

Deep unsupervised representation learning has recently led to new approaches in the field of Unsupervised Anomaly Detection (UAD) in brain MRI. The main principle behind these works is to learn a model of normal anatomy by learning to compress and recover healthy data. Abnormal structures can then be spotted from the erroneous recoveries of compressed, potentially anomalous samples. The concept is of great interest to the medical image analysis community as it i) removes the need for vast amounts of manually segmented training data (a necessity for and pitfall of current supervised Deep Learning) and ii) in theory allows the detection of arbitrary, even rare pathologies which supervised approaches might fail to find. To date, the experimental design of most works hinders a valid comparison, because they i) are evaluated against different datasets and different pathologies, ii) use different image resolutions and iii) use different model architectures with varying complexity. The intent of this work is to establish comparability among recent methods by utilizing a single architecture, a single resolution and the same dataset(s). Besides providing a ranking of the methods, we also try to answer questions such as i) how many healthy training subjects are needed to model normality and ii) whether the reviewed approaches are also sensitive to domain shift. Further, we identify open challenges and provide suggestions for future community efforts and research directions.




I Introduction

MR imaging of the brain is at the heart of diagnosis and treatment of neurological diseases. When sifting through MR scans, radiologists intuitively rely on a learned model of normal brain anatomy to detect pathologies. However, reading and interpreting MR scans is an intricate process: it is estimated that in 5-10% of scans, a relevant pathology is missed. Recent breakthroughs in machine learning have led to automated medical image analysis methods which achieve high levels of performance in the detection of tumors or lesions arising from neuro-degenerative diseases such as Alzheimer's or Multiple Sclerosis (MS). Despite their outstanding performance, these methods, which are mainly based on Supervised Deep Learning, carry some disadvantages: 1) their training calls for large and diverse annotated datasets, which are scarce and costly to obtain; 2) the resulting models are limited to the discovery of lesions which are similar to those in the training data. This is especially critical for rare diseases, for which collecting training data poses a great challenge. Lately, there have been some Deep Learning-driven attempts towards automatic brain pathology detection which tackle the problem from the perspective of so-called Unsupervised Anomaly Detection (UAD). These approaches are more similar to how radiologists read MR scans, do not require data with pixel-level annotations and have the potential to detect arbitrary anomalies without knowing their appearance a priori.

UAD has a long history in medical image analysis and in brain imaging in particular. Traditional methods are based on statistical modeling, content-based retrieval, clustering or outlier-detection; such classical approaches have been reviewed with a focus on brain CT imaging. Since the rise of Deep Learning, a plethora of new, data-driven approaches has appeared. Initially, Autoencoders (AEs), with their ability to learn non-linear transformations of data onto a low-dimensional manifold, were leveraged for cluster-based anomaly detection. Lately, a variety of works have used AEs and generative modeling not simply to detect, but to localize and segment anomalies directly in image-space from imperfect reconstructions of input images. These works are surveyed here in the context of brain MRI.

The underlying idea is to model the distribution of healthy anatomy of the human brain with the help of deep (generative) representation learning. Once a model is trained, anomalies can be detected as outliers from the modeled, normative distribution. AEs [3][2] and their generative siblings [3][6][23][24][14] have emerged as a popular framework to achieve this by essentially learning to compress and reconstruct MR data of healthy anatomy. The respective methods can be divided into two categories: 1) Reconstruction-based approaches compute a pixel-wise discrepancy between input samples and their feed-forward reconstructions to determine anomalous lesions directly in image-space; 2) Restoration-based methods [18][22] try to alter an input image by moving along the latent manifold until a normal counterpart to the input sample is found, which in turn is used to detect lesions from the pixel-wise discrepancy between the input data and its healthy restoration. To date, although all of these methods report promising performance, results can hardly be compared, and drawing general conclusions on their strengths and weaknesses is barely possible. Comparison is hindered by the following issues: most of the works i) rely on very different datasets with barely overlapping characteristics for their evaluation, ii) are evaluated against different pathologies, iii) operate on different resolutions and iv) utilize different model architectures with varying model complexity. The main intent of this work is to establish comparability among a broad selection of recent methods by utilizing, where applicable, a single network architecture, a single resolution and the same dataset(s).

Contribution—Here, we provide a comparative study of recent Deep Learning-based UAD approaches for brain MRI. We compare various reconstruction- as well as restoration-based methods against each other on a variety of different MR datasets with different pathologies. (Footnote 1: Code will be made publicly available after successful peer-review of the manuscript.) The models are tested on four different datasets for detecting two different pathologies. To evaluate the methods without having to make general assumptions about what constitutes a detection, we utilize pixel-wise segmentation measures as a tight proxy for UAD performance. For a fair comparison, we determined a single, unified architecture on which all the methods rely in this study. This ensures that model complexity is the same for all approaches, where applicable. The performances of the originally proposed networks are also presented. Further, we provide insights on the number of healthy training samples and their impact on model performance, and take a brief look at the generalization capabilities of AE models.

II Unsupervised Deep Representation Learning for Anomaly Detection

II-A Modeling Healthy Anatomy

Fig. 1: The concept of Autoencoder-based Anomaly Detection/Segmentation: A) Training a model from only healthy samples and B) anomaly segmentation from erroneous reconstructions of input samples, which might carry an anomaly.
Fig. 2: Autoencoder-based architectures for UAD at a glance. Panels: (a) AE, (b) VAE, (c) AAE, (e) Context AE, (f) GAN.

The core concept behind the reviewed methods is the modeling of healthy anatomy with unsupervised deep (generative) representation learning. To this end, the methods leverage a set of healthy MRI scans and learn to project it to and recover it from a lower dimensional distribution (see Fig. 1). In the following, we first shed light on how this normative distribution can be modeled, and then present different approaches to discovering anomalies using trained models.

Autoencoders—Early work in this field relied on classic AEs (Fig. 2(a)) to model the normative distribution: an encoder network $Enc_\theta$ with parameters $\theta$ is trained to project a healthy input sample $x$ to a lower dimensional manifold $z$, from which a decoder $Dec_\phi$ with parameters $\phi$ then tries to reconstruct the input as $\hat{x} = Dec_\phi(Enc_\theta(x))$. In other words, the model is trained to compress and reconstruct healthy anatomy by minimizing a reconstruction loss

$$\mathcal{L}_{Rec}(x, \hat{x}) = \lVert x - \hat{x} \rVert_1, \tag{1}$$

which in our case is the $\ell_1$-distance between input and reconstruction. The rationale behind this is the assumption that an AE trained on only healthy samples cannot properly reconstruct anomalies in pathological data. This approach has been successfully applied to anomaly segmentation in brain MRI [3][2] and in head CT [16]. A slightly different attempt was made in [24], where the reconstruction problem was turned into an inpainting task using a Context Autoencoder (Context AE) (Fig. 2(e)), in which the model is trained to recover missing sections in healthy training images. The natural choice for the shape of $z$, here also referred to as latent space, bottleneck or manifold, is a 1D vector. However, it has been shown that spatial AEs with a tensor-shaped bottleneck can be beneficial for high-resolution brain MRI as they preserve spatial context and can generate higher quality reconstructions.


Latent Variable Models—In classic AEs, there is no regularization on the manifold's structure. In contrast, latent variable models such as Variational Autoencoders (VAEs [10], Fig. 2(b)) constrain the latent space by leveraging the encoder and decoder networks of AEs to parameterize a latent distribution $q(z)$, using the following objective:

$$\mathcal{L}_{VAE} = \mathcal{L}_{Rec}(x, \hat{x}) + \lambda \, D_{KL}(q(z) \,\|\, p(z)),$$

where $\lambda$ is a Lagrangian multiplier which weights the reconstruction loss against the distribution-matching KL-Divergence $D_{KL}$. In practice, the VAE projects input data onto a learned mean $\mu$ and variance $\sigma^2$, from which a sample is drawn and then reconstructed (see Fig. 2(b)). While the VAE tries to match $q(z)$ to a prior $p(z)$ (typically a multivariate normal distribution) by minimizing the KL-Divergence, which has various shortcomings, the so-called Adversarial Autoencoder (AAE [13], Fig. 2(c)) leverages an adversarial network as a proxy metric to minimize this discrepancy between the learned distribution $q(z)$ and the prior $p(z)$. As opposed to the KL-Divergence, the optimization via an adversarial network does not favor modes of distributions and is always differentiable. Another extension to the VAE, the so-called Gaussian Mixture VAE (GMVAE [7]), replaces the mono-modal prior of the VAE with a Gaussian mixture, leading to higher expressive power. Due to their ability to model the underlying distribution of high dimensional data, these frameworks are naturally suited for modeling the desired normative distribution. Further, their probabilistic nature facilitates the development of principled density-based anomaly detection methods. Consequently, they have been widely employed for outlier-based anomaly detection: VAEs were used in brain MRI for MS lesion [3], tumor and stroke detection [24]. They have also been utilized for tumor detection in head CT [14] from aggregate means of Monte-Carlo reconstructions. In brain MRI, AAE-based [6] and GMVAE-based [22] approaches have also been successfully employed for tumor detection.
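To make the regularization term concrete, the following is a minimal NumPy sketch of the closed-form KL-Divergence between a diagonal Gaussian and a standard normal prior, together with a reparameterized latent draw. The function names, the log-variance parameterization, the placement of the weight `lam` on the KL term and the mean absolute error as reconstruction term are assumptions made for illustration, not details confirmed by the reviewed works.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)


def sample_latent(mu, log_var, rng):
    """Reparameterized draw z = mu + sigma * eps with eps ~ N(0, I)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.exp(0.5 * np.asarray(log_var, dtype=float))
    return mu + sigma * rng.standard_normal(mu.shape)


def vae_loss(x, x_hat, mu, log_var, lam=1.0):
    """Reconstruction term plus lam-weighted KL term (the weighting scheme is an assumption)."""
    rec = np.mean(np.abs(np.asarray(x, dtype=float) - np.asarray(x_hat, dtype=float)))
    return rec + lam * kl_to_standard_normal(mu, log_var)
```

A latent code that already matches the prior (zero mean, unit variance) incurs no KL penalty, so for a perfect reconstruction the total loss is zero.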

Generative Adversarial Networks—Pioneering work, even before AEs were successfully applied for UAD in medical imaging, leveraged Generative Adversarial Networks (GANs [8], Fig. 2(f)) to detect anomalies in OCT data. To this end, Schlegl et al. [18] modeled the distribution of healthy retinal patches with GANs and determined anomalies by computing the discrepancy between the retinal patch and a healthy counterpart restored by the GAN. Inspired by this work, Baur et al. [3] leveraged the VAEGAN [11], a combination of the GAN and VAE (Fig. 2(d)), to overcome the training instabilities of the GAN and to allow for faster feed-forward inference, which they successfully employed for anomaly segmentation in brain MRI. In recent follow-up work, Schlegl et al. [17] improved on their GAN and also introduced an efficient way to replace the costly iterative restoration method by a single forward pass through the network.

II-B Anomaly Segmentation

The trained models can be used for anomaly detection & segmentation in a variety of ways, which are summarized in the following. The interested reader is referred to the original papers for more detailed information.

Reconstruction Based Methods—Such approaches rely on pixel-wise residuals obtained from the difference

$$r = x - \hat{x} \tag{2}$$

of input samples and their reconstruction (see Fig. 1). The underlying idea is that anomalous structures, which have never been seen during training, cannot be properly reconstructed from the distribution encoded in the latent space, such that reconstruction errors will be high for anomalous structures.
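As an illustration of the reconstruction-based idea, a residual map can be computed with a few lines of NumPy. This is a sketch only: the function name and the `positive_only` switch (which mirrors the option of keeping only positive residuals for hyper-intense lesions) are hypothetical.

```python
import numpy as np


def residual_map(x, x_hat, positive_only=False):
    """Pixel-wise residual between an input slice x and its reconstruction x_hat.

    With positive_only=True only pixels that are brighter in the input than in
    the reconstruction are kept; otherwise the absolute difference is returned.
    """
    r = np.asarray(x, dtype=float) - np.asarray(x_hat, dtype=float)
    return np.clip(r, 0.0, None) if positive_only else np.abs(r)
```

Thresholding the returned map (for example `residual_map(x, x_hat) > t` for some operating point `t`) then yields a binary anomaly segmentation.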

(a) Bayesian AE
(b) Bayesian VAE
Fig. 3: Monte Carlo Reconstructions aggregate and average N reconstructions for a single sample.

Monte Carlo Methods—For non-deterministic generative models such as VAEs, multiple reconstructions can be obtained by Monte-Carlo (MC) sampling the latent space and an average consensus residual can be computed [14]:

$$r = \frac{1}{N} \sum_{n=1}^{N} (x - \hat{x}_n), \tag{3}$$

with $N$ being the number of MC samplings and $\hat{x}_n$ being a single MC reconstruction. For deterministic AEs, a similar effect can be achieved by applying dropout with rate $p$ to the latent space during inference time, which is also investigated in this work (see Fig. 3(a) and Fig. 3(b) for a visual explanation).
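The averaging of MC residuals can be sketched as follows. The callable `reconstruct` stands in for one stochastic forward pass through a trained model (latent sampling or dropout at inference time); `toy_reconstruct` is a purely hypothetical placeholder that returns the mean intensity plus noise so that the example runs without a trained network.

```python
import numpy as np


def mc_residual(x, reconstruct, n_samples=10, rng=None):
    """Average the residuals of n_samples stochastic reconstructions of x."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    recs = np.stack([reconstruct(x, rng) for _ in range(n_samples)])
    return np.mean(x[None] - recs, axis=0)


def toy_reconstruct(x, rng, noise=0.05):
    """Hypothetical stand-in for a stochastic decoder pass: mean intensity plus noise."""
    return np.full_like(x, x.mean()) + rng.normal(0.0, noise, size=x.shape)
```

Because the noise averages out over the samples, the consensus residual stays close to zero on the homogeneous background and remains high on a structure the stand-in model cannot reproduce.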

Gradient-Based Methods—The gradient-based method proposed in [24] relies solely on image gradients obtained from a single backpropagation step when virtually optimizing for the following objective,

$$\arg\min_{x} \; \mathcal{L}_{Rec}(x, \hat{x}) + \lambda \, D_{KL}(q(z) \,\|\, p(z)), \tag{4}$$

i.e. the pursuit of bringing the reconstruction and input of the model together while simultaneously moving the latent representation of an input sample closer to the prior (the normal distribution). The resulting pixel-wise gradients are used as a saliency map for anomalies, where it is assumed that stronger gradients indicate anomalies.

Restoration Based Methods—In contrast to reconstruction based methods, restoration based methods involve an optimization on the latent manifold. In the pioneering approach using GANs [18], the goal is to iteratively move along the GAN's input distribution until a healthy variant of a query image is reconstructed well. Similarly, the method in [22] tries to restore a healthy counterpart $\hat{x}$ of an input sample $x$, but by altering it until the ELBO of its latent representation is maximized. This can be achieved by initializing $\hat{x} = x$ and then iteratively optimizing for the objective in Eq. 4. Again, the anomalies can be detected in image space from residual maps (see Eq. 2).
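The iterative restoration principle can be illustrated on a deliberately simplified toy model. The sketch below assumes a linear encoder whose rows form an orthonormal basis of the "healthy" subspace, a decoder that is its transpose, a squared reconstruction error and a unit-variance latent code, so that the gradient is available in closed form. None of these choices come from the reviewed methods, which operate on deep networks via backpropagation.

```python
import numpy as np


def restore(x, basis, steps=500, lr=0.1, lam=1e-4):
    """Gradient descent on ||x_hat - P x_hat||^2 + (lam / 2) * ||E x_hat||^2.

    basis (E) has orthonormal rows spanning the toy healthy subspace and
    P = E^T E is the induced encode-decode operator. Starts at x_hat = x.
    """
    x_hat = np.array(x, dtype=float)
    for _ in range(steps):
        z = basis @ x_hat          # latent mean of the toy encoder
        proj = basis.T @ z         # toy decoder output P x_hat
        grad = 2.0 * (x_hat - proj) + lam * proj
        x_hat -= lr * grad
    return x_hat
```

The component of the input that lies outside the healthy subspace is removed by the optimization, so the residual between the input and its restoration concentrates on exactly that component.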

III Experiments

Fig. 4: The unified network architecture with a dense bottleneck. In the case of a spatial bottleneck, the flatten-, dense- and reshape-layers are replaced by a single set of 2D convolutional kernels.

In the following, we first introduce the datasets used in the experiments, together with their pre-processing, and then introduce the unified network architecture which is the foundation of all the subsequently investigated models. We further explain our post-processing pipeline and all the metrics used in our investigations, before we finally present and discuss the results from various perspectives.

III-A Datasets

Dataset Training Validation Testing
110 28 -
- 3 45
- - 28
- - 20
- - 30
TABLE I: Training, Validation, Testing subjects of the datasets used in this study

For this survey, we rely on three different datasets. Selection criteria for these datasets were i) the availability of corresponding T1, T2 and FLAIR scans per subject to be able to leverage a single shared preprocessing pipeline and ii) each dataset being produced with a different MR device.

Healthy, MS & GB—The primary dataset used in this comparative study is a homogeneous set of MR scans of both healthy and diseased subjects, produced with a single Philips Achieva 3T MR scanner. It comprises FLAIR, T2- and T1-weighted MR scans of 138 healthy subjects, 48 subjects with MS lesions and 26 subjects with Glioma. All scans have been carefully reviewed and annotated by expert neuroradiologists. Informed consent was waived by the local IRB.

MSLUB—The second MRI dataset [12] consists of co-registered T1, T2 and FLAIR scans of 30 different subjects with MS. Images have been acquired with a 3T Siemens Magnetom Trio MR system at the University Medical Center Ljubljana (UMCL). A gold standard segmentation was obtained from consensus segmentations of three expert raters.

MSSEG2015—The third MRI dataset in our experiments is the publicly available training set of the 2015 Longitudinal MS lesion segmentation challenge [5], which contains 21 scan sessions from 5 different subjects with T1, T2, PD and FLAIR images each. All data has been acquired with a 3.0 Tesla Philips MRI scanner. The exact device is not known, but the intensity distribution is different from our primary MS & GB datasets. Thus, in this study we utilize the data to test the generalization capabilities of the models and approaches.

Preprocessing and Split—All scans have been brought to the SRI24 ATLAS [15] space to ensure all data share the same volume size and orientation. Subsequently, the scans have been skull-stripped with ROBEX [9] and denoised with CurvatureFlow [19]. Prior to feeding the data to the networks, all volumes have been normalized into the range [0,1] by dividing each scan by its 98th percentile. All datasets have been randomly split (patient-wise) into training, validation and testing sets as listed in Table I. Training and testing are done on all axial slices of each volume for which the corresponding brain mask indicates the presence of brain pixels. We operate at a slice resolution of 128x128 px. This is in stark contrast to some other works, which restrict themselves anatomically to the axial midline [3] or to a lower resolution [14, 6].
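The intensity normalization and slice selection described above might look as follows in NumPy. This is a sketch under assumptions: clipping values above the percentile to 1, computing the percentile over the whole volume, and treating the first array axis as the axial axis are choices made here for illustration; registration, skull-stripping and denoising are not shown.

```python
import numpy as np


def normalize_volume(volume, percentile=98.0):
    """Divide a volume by its intensity percentile and clip into [0, 1] (clipping is an assumption)."""
    volume = np.asarray(volume, dtype=float)
    scale = np.percentile(volume, percentile)
    if scale <= 0:
        return np.zeros_like(volume)
    return np.clip(volume / scale, 0.0, 1.0)


def brain_slices(volume, brain_mask, axis=0):
    """Return indices and data of all slices along `axis` that contain brain voxels."""
    volume = np.asarray(volume)
    other_axes = tuple(i for i in range(volume.ndim) if i != axis)
    keep = np.flatnonzero(np.asarray(brain_mask, dtype=bool).any(axis=other_axes))
    return keep, np.take(volume, keep, axis=axis)
```

Only the slices returned by `brain_slices` would then be passed to the networks for training and testing.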

III-B Network Architecture and Models

The unified architecture depicted in Fig. 4 was empirically determined in a manual iterative architecture search. The goal was to achieve low reconstruction error on both the training and validation data from the healthy dataset. This unified architecture was then used to train a great variety of models coming from the different, previously introduced domains:

Autoencoders—As a baseline, we used the unified architecture to train a variety of non-generative AEs:

  1. AE (dense): an AE with a dense bottleneck

  2. AE (spatial) [3]: an AE with a spatial bottleneck

  3. Context AE [24]

  4. Constrained AE [6]

Latent Variable Models—Further, we trained various generative latent variable models using the same unified architecture and bottleneck configurations:

  1. VAE [3, 23]

  2. Context VAE [24]

  3. Constrained AAE [6]

  4. GMVAE (dense) [22]

  5. GMVAE (spatial) [22]

Generative Adversarial Networks—Finally, we also trained an AnoVAEGAN [3] and an f-AnoGAN [17], whose encoder-decoder networks implement the unified architecture, and the discriminator network is a replica of the encoder:

  1. AnoVAEGAN [3]

  2. f-AnoGAN [17]

Notably, both methods were optimized with the Wasserstein loss [1] to avoid GAN training instabilities and mode collapse.

All models were trained on the healthy training data until convergence using an automatic early stopping criterion, i.e. training was stopped if the reconstruction loss on the held-out healthy validation set did not improve by more than a threshold $\epsilon$ for 5 epochs. Subsequently, all the methods were used for reconstruction-based anomaly detection. The trained VAE and GMVAE were also used for the density-based image restoration [22], where each sample was restored in 500 iterations:

  1. VAE (restoration) [22]

  2. GMVAE (restoration) [22]

Both AE (dense) and VAE were also used for MC-reconstruction based anomaly detection:

  1. Bayesian AE [14]: Dropout rate 0.2

  2. Bayesian VAE [14]: N MC samples per input slice

and in the case of the Context VAE, we also tried the gradient-based approach proposed in [24]:

  1. Context VAE (gradient) [24]

Hyperparameters can be taken from Table II.

Param Value
learning rate 0.0001
dropout rate 0.2
TABLE II: Hyperparameters for the different models
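The early stopping criterion used during training can be sketched as a small helper class. The class name and the default `min_delta` value are hypothetical; only the patience of 5 epochs is taken from the description above.

```python
class EarlyStopping:
    """Stop when the validation loss has not improved by more than min_delta for `patience` epochs."""

    def __init__(self, patience=5, min_delta=1e-4):
        self.patience = patience
        self.min_delta = min_delta      # hypothetical threshold value
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if self.best - val_loss > self.min_delta:
            self.best = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience
```

In a training loop, `step` would be called once per epoch with the reconstruction loss on the held-out validation set, and training ends as soon as it returns `True`.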

III-C Postprocessing

The output of all models and approaches is subject to the same post-processing. Every residual image is first multiplied with a slightly eroded brain mask to remove prominent residuals occurring near sharp edges at brain-mask boundaries and at gyri and sulci (the latter are very diverse and hard to model). Further, for the MS lesion datasets we make use of prior knowledge and only keep positive residuals, as these lesions are known to be fully hyper-intense in FLAIR images. For each MR volume, the residual images of all slices are first aggregated into a corresponding 3D residual volume, which is then subject to 3D median filtering to remove small outliers and to obtain a more continuous signal. The latter is beneficial for the subsequent model assessment as it leads to smoother curves. As a final step, the continuous output is binarized and a 3D connected component analysis is performed on the resulting binary volumes to discard any small structures with a volume of less than 8 voxels.
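The post-processing chain can be sketched with `scipy.ndimage`. The number of erosion iterations, the median kernel size of 3 and the default connectivity of `ndimage.label` are assumptions made for this example, as is the function name; only the minimum component size of 8 voxels is taken from the text.

```python
import numpy as np
from scipy import ndimage


def postprocess(residual, brain_mask, threshold, positive_only=True,
                erosion_iterations=1, median_size=3, min_voxels=8):
    """Turn a 3D residual volume into a binary anomaly segmentation."""
    residual = np.asarray(residual, dtype=float)
    # 1) Suppress residuals near the brain boundary with a slightly eroded mask.
    mask = ndimage.binary_erosion(np.asarray(brain_mask, dtype=bool),
                                  iterations=erosion_iterations)
    residual = residual * mask
    # 2) Prior knowledge for hyper-intense lesions: keep positive residuals only.
    residual = np.clip(residual, 0.0, None) if positive_only else np.abs(residual)
    # 3) 3D median filtering removes small outliers (kernel size is an assumption).
    residual = ndimage.median_filter(residual, size=median_size)
    # 4) Binarize at the chosen operating point.
    binary = residual > threshold
    # 5) Discard connected components smaller than min_voxels.
    labels, n_components = ndimage.label(binary)
    if n_components == 0:
        return binary
    sizes = np.bincount(labels.ravel())
    keep = sizes >= min_voxels
    keep[0] = False  # label 0 is the background
    return keep[labels]
```

Isolated single-voxel responses are removed by the median filter, while compact blobs survive both the filtering and the component size check.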

III-D Metrics

We assess the anomaly segmentation performance at the level of single voxels, at which class imbalance needs careful consideration as anomalous voxels are usually less frequent than normal voxels. To do so, we generate dataset-specific Precision-Recall-Curves (PRC) and then compute the area under each curve (AUPRC). Notably, this allows judging a model's capabilities without choosing an Operating Point (OP). Further, for each model we provide an estimate of its theoretically best possible DICE-score on each dataset. To this end, for each testing dataset, we utilize the available ground-truth segmentation and perform a greedy search up to three decimals to determine the OP on the PRC which yields the best possible DICE score for that dataset. Additionally, to simulate the models' performance in more realistic settings, we utilize a held-out validation set to determine an OP at which we then compute patient-specific DICE-scores for every dataset. Some of the reviewed works originally utilize Receiver-Operating-Characteristics (ROC) to evaluate anomaly detection performance. We report the area under such ROC curves (AUROC) as well, but want to emphasize that it has to be used with care. Under heavy class imbalance, ROC curves can be misleading as they give much higher weight to the more frequent class and thus, in the case of a pixel-wise assessment, give very optimistic views on performance.
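For reference, the voxel-wise metrics can be sketched in NumPy as below. The AUPRC is computed here as average precision (a step-wise sum over the curve), and the best possible DICE is found by an exhaustive search over a threshold grid with three decimals in [0, 1]; both are illustrative choices rather than the exact procedure of this study, and the sketch assumes at least one anomalous voxel in the ground truth.

```python
import numpy as np


def precision_recall_curve(scores, labels):
    """Precision and recall at every distinct score threshold, in descending order."""
    scores = np.asarray(scores, dtype=float).ravel()
    labels = np.asarray(labels, dtype=bool).ravel()
    order = np.argsort(-scores, kind="stable")
    scores, labels = scores[order], labels[order]
    tp = np.cumsum(labels)
    fp = np.cumsum(~labels)
    # Keep only the last index of each run of tied scores.
    last = np.r_[np.flatnonzero(np.diff(scores)), scores.size - 1]
    tp, fp = tp[last], fp[last]
    return tp / (tp + fp), tp / labels.sum(), scores[last]


def auprc(scores, labels):
    """Area under the precision-recall curve, computed as average precision."""
    precision, recall, _ = precision_recall_curve(scores, labels)
    return float(np.sum(np.diff(np.r_[0.0, recall]) * precision))


def dice(pred, target):
    """DICE overlap of two binary masks."""
    pred, target = np.asarray(pred, dtype=bool), np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    return 1.0 if denom == 0 else float(2.0 * np.logical_and(pred, target).sum() / denom)


def best_dice(scores, labels, step=0.001):
    """Highest DICE over a threshold grid; returns (dice, threshold)."""
    scores = np.asarray(scores, dtype=float)
    thresholds = np.arange(0.0, 1.0 + step, step)
    values = [dice(scores >= t, labels) for t in thresholds]
    best = int(np.argmax(values))
    return values[best], float(thresholds[best])
```

Because `best_dice` uses the ground truth of the test data to pick the threshold, it is an upper bound rather than a realistic estimate, in line with the distinction made above between the best possible DICE and the DICE at an OP chosen on validation data.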

To gain deeper insights into what makes a model capable of segmenting anomalies better than others, we also report dataset-specific $\ell_1$-reconstruction errors on normal and on anomalous voxels, as well as the $\chi^2$-distance between the respective normal and anomalous residual histograms for every model.
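A possible way to compute these diagnostic quantities is sketched below. The symmetric form of the chi-squared distance, the number of bins and the value range are assumptions; the exact definition used for the reported numbers is not stated here.

```python
import numpy as np


def mean_residual(residual, mask):
    """Mean absolute residual over the voxels selected by a boolean mask."""
    return float(np.mean(np.abs(np.asarray(residual, dtype=float))[np.asarray(mask, dtype=bool)]))


def chi2_histogram_distance(normal_residuals, anomalous_residuals,
                            bins=50, value_range=(0.0, 1.0)):
    """Symmetric chi-squared distance 0.5 * sum((p - q)^2 / (p + q)) of two normalized histograms."""
    h1, _ = np.histogram(normal_residuals, bins=bins, range=value_range)
    h2, _ = np.histogram(anomalous_residuals, bins=bins, range=value_range)
    p = h1 / max(h1.sum(), 1)
    q = h2 / max(h2.sum(), 1)
    denom = p + q
    valid = denom > 0
    return 0.5 * float(np.sum((p[valid] - q[valid]) ** 2 / denom[valid]))
```

With this definition the distance is 0 for identical histograms and 1 for histograms without any overlap, so larger values indicate that residuals on normal and anomalous voxels are easier to separate.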

III-E Overview

TABLE III: Experimental results on the MS dataset. Columns: Approach, AUROC, AUPRC, best possible DICE, DICE, reconstruction error (normal), reconstruction error (anomalous), ChiSq distance.

TABLE IV: Experimental results on the GB dataset. Columns: Approach, AUROC, AUPRC, best possible DICE, DICE, reconstruction error (normal), reconstruction error (anomalous), ChiSq distance.

TABLE V: Experimental results on the MSLUB dataset. Columns: Approach, AUROC, AUPRC, best possible DICE, DICE, reconstruction error (normal), reconstruction error (anomalous), ChiSq distance.

TABLE VI: Experimental results on the MSSEG2015 dataset. Columns: Approach, AUROC, AUPRC, best possible DICE, DICE, reconstruction error (normal), reconstruction error (anomalous), ChiSq distance.
Fig. 5: AUPRC of all models and UAD approaches, using the unified architecture. Panel (d): MSSEG2015.

Detailed results of all models and UAD approaches on all datasets can be found in Tables III (MS), IV (GB), V (MSLUB) and VI (MSSEG2015). In the following, we analyze these data from different perspectives. We start by comparing different model types and bottleneck designs, followed by the different ways to detect anomalies directly in image-space. Then, we shed light on the number of training subjects and their impact on performance, and elaborate on domain shift.

Fig. 6: Visual examples of the different reviewed methods on different datasets, using the unified architecture. Panels show axial slices: (a) ventral, (b) midline, (c) dorsal, (d) ventral, (e) dorsal, (f) midline. Top row: reconstructions; Bottom row: raw residuals.

III-F Constraining & Regularization

Initially, we compare the classic AE (dense) to its VAE and Constrained AE counterparts to investigate the effect of constraining or regularizing the latent space of the models. Recall that VAEs regularize the latent space to follow a prior distribution, whereas the deterministic Constrained AE enforces that reconstructions and input lie close together on the manifold. We measure the models' performances in terms of the AUPRC as well as the DICE and glance at the reconstruction errors for normal and anomalous pixels (see Fig. 11 for residual histograms of normal and anomalous voxels). We see from Table III that explicitly modeling a distribution with a VAE leads to dramatic performance gains on the MS dataset over the standard AE, and introducing the matching constraint (Constrained AE) between the latent representations of input and reconstruction improves the performance even more. On all other datasets, the VAE is clearly the winner among the compared models, but the Constrained AE still outperforms the classic AE. From these results, we deduce that enforcing a structure on the manifold of AEs is indeed beneficial for UAD.

III-G Dense vs Spatial Bottleneck

Fig. 7: Reconstructions and postprocessed residuals using dense and spatial AEs

To determine if the design of the AE bottleneck can improve the performance of the models, we further compare dense models for which a spatial counterpart exists, i.e. AE (dense) vs AE (spatial) and GMVAE (dense) vs GMVAE (spatial). The spatial bottleneck allows the model to preserve spatial information and geometric features in its latent space, which positively affects the model's reconstruction capabilities. From Tables III-VI it can be seen that the dense models outperform the spatial variants; only the spatial AE performs slightly better than its dense counterpart on one dataset. We find that at our resolution of 128x128 px, the spatial models reconstruct their input too well (see Fig. 7), including the anomalies.

III-H Latent Variable Models

Next, we focus only on different latent variable model types, i.e. the VAE, GMVAE (dense) and Constrained AAE. On two of the MS datasets, the VAE is the best among the compared models. The Constrained AAE yields lower performance than the other models, and also lower than its non-generative sibling, the Constrained AE. However, on the Glioblastoma dataset, it is on par with the GMVAE, and both models significantly outperform the VAE in the detection of brain tumors. The performance of the GMVAE generally seems to depend heavily on the dataset rather than the pathology: on two of the datasets it behaves very similarly to the VAE, whereas on the other two its performance resembles that of the Constrained AAE.

III-I GAN-based models

GAN-based models are known to produce very realistic and crisp images, while AEs are known for their blurry reconstructions. Indeed, a qualitative comparison of the f-AnoGAN and the AnoVAEGAN to the AE and VAE shows that the GAN-based models promote sharpness. This is particularly evident near the boundaries of the brain (see Fig. 6). However, both the f-AnoGAN and AnoVAEGAN model the training distribution too well, such that reconstructions often differ anatomically from the actual input samples (see Fig. 6(b) for an axial midline slice). This is especially the case for the AnoVAEGAN, which produces the crispest reconstructions, but often does not preserve anatomical coherence at all. As a result, on the MS datasets its performance is only comparable to the VAE, but it works considerably better for Glioblastoma segmentation. The f-AnoGAN does not provide images that are as crisp, but preserves the shape of the input sample, and the difference between reconstruction residuals on normal and anomalous pixels is considerably higher across all datasets than for any of the other methods. This makes the UAD performance of the f-AnoGAN stand out. In total, both GAN-based approaches significantly outperform the standard AE (on average, by more than 9% for the AnoVAEGAN and more than 15% for the f-AnoGAN) and the f-AnoGAN also clearly outperforms the VAE (on average by more than 6%).

III-J Monte-Carlo Methods

Monte-Carlo methods applied to (variational) AEs provide an interesting means to aggregate a consensus reconstruction, in which only very likely image features should be emphasized. To investigate if anomalies are affected, we experiment with MC-reconstructions and, where necessary, an empirically chosen dropout rate to trade off reconstruction quality against chance. We find that, compared to one-shot reconstructions, the impact of MC-sampling is at most subtle, and not consistent across different models and datasets. A comparison of AE (dense) to the Bayesian AE shows that MC-dropout leads to slightly worse performance in almost all metrics across all datasets. On the other hand, the Bayesian VAE, which does not need dropout for MC sampling due to its probabilistic bottleneck, is equal to or slightly outperforms the VAE on one dataset, but not on two of the others. Overall, these numbers indicate that MC methods, albeit an interesting approach, do not provide significant gains in the way they are currently employed.

III-K Reconstruction vs Restoration

Previous comparisons focused on different model types and all relied on the reconstruction-based UAD concept. In the following, we rank reconstruction-based methods against gradient- and restoration-based UAD approaches. More precisely, we compare reconstruction against restoration on the VAE, GMVAE (dense) and GMVAE (spatial). We further rank the restoration-based methods against the top candidate, the f-AnoGAN. From Tables III to VI it is evident that restoration-based UAD is generally superior to its reconstruction-based counterparts (with gains ranging from 4-17% for the VAE, 4-10% for the dense GMVAE and 6-20% for the spatial GMVAE). Consistent with our previous results on dense versus spatial models, we nevertheless witness a dramatic drop in performance when using the spatial GMVAE. With the exception of one dataset, the dense restoration methods outperform the f-AnoGAN in all scenarios in terms of the AUPRC and DICE.

III-L Domain Shift

Deep Learning models trained on data coming from one domain generally have difficulties generalizing to other domains, and tackling such domain shift is still a highly active research area. Here, we want to determine to what extent AEs are prone to this effect and whether some methods generalize better than others. Subject to our investigations are the three MS datasets (the primary MS dataset, MSLUB and MSSEG2015), among which such shifts occur. Generally, UAD performance is best on the primary MS dataset, which matches the training data distribution, and on both MSLUB and MSSEG2015 the UAD performance drops significantly. However, the reasons for this drop can be manifold, and we want to emphasize that UAD performance as such is not a good indicator for domain shift, as the lesion size and count differ across datasets, and the contrast for the primary MS dataset is considerably better than for the other datasets. Instead, we suggest looking at the reconstruction error of normal pixels in these datasets. From Tables III, V and VI it can be seen that this error hardly degrades across all these datasets. This implies that generalization, measured in terms of the models' reconstruction capabilities, is not of primary concern. However, from the aforementioned tables it can be seen that the reconstruction error of anomalous pixels is significantly smaller on MSLUB and MSSEG2015, which is a clear indicator of weaker contrast between normal tissue and lesions in these datasets.

III-M Different Pathologies

On both Multiple Sclerosis () and Glioblastoma (), the restoration-based approaches with dense bottleneck constitute the top performers, delivering comparable results. Similarly, the lowest performance is seen for the spatial models, the gradient-based UAD approach and the standard AE. However, in contrast to , on there is a large performance gap between the top-performing restoration approaches and all other methods: the GAN-based methods f-AnoGAN and AnoVAEGAN drop by 10% and 4%, respectively, the performance of the VAE models degrades by at least 12%, and the Constrained AE even loses 20% in AUPRC. Interestingly, the Constrained AAE gains 10%. Multiple factors lead to the lower performance. First, in contrast to MS lesions, tumors do not purely appear hyper-intense in FLAIR MRI. Some compartments of the tumor also resemble normal tissue, and the investigated UAD approaches have difficulty delineating those properly. Second, tumors are often not only larger than MS lesions, but can also have a very complex shape (see Fig. (d)d). This is hard to segment with precision, and there is variation even among human annotators.

III-N How much healthy training data is enough?

(d) MSSEG2015
Fig. 8: AUPRC of selected models trained with different numbers of healthy training subjects (10, 50 and 100%, respectively).

In our previous experiment, we relied on 110 healthy training subjects. The question arises whether this is a sufficient amount, or whether fewer scans lead to comparable results. To give insights into the behavior of the examined models in this context, we compare the AUPRC of the conceptually most different models, all trained on varying numbers of healthy subjects, i.e. 10, 50 and 100% of the available training samples. Results on the four different datasets can be seen in Fig. 8. The GAN-based models, which model the healthy distribution the closest due to the Wasserstein loss, show consistent improvements in AUPRC with a growing training set. Only the AnoVAEGAN shows a slight drop at 50% of the training data on . The overall top performer, with one exception, is still the restoration method, here reported using the GMVAE (dense). Only on does this GMVAE show inconsistent behavior. Both the VAE and Context VAE, our selection from the family of VAEs with a dense bottleneck, show improved and similar performance with an increasing number of training subjects on any of the MS datasets. On , both models exhibit inconsistent behavior, and the VAE performs considerably better. Among all the methods, the dense AE yields the most unpredictable performance, varying greatly among different datasets and different numbers of healthy subjects.
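One way to set up such an experiment is sketched below. It assumes that subsets are drawn at subject level and are nested (the 10% subset is contained in the 50% subset); neither detail is stated in the text, and the helper name and subject identifiers are hypothetical.

```python
import random


def nested_training_subsets(subject_ids, fractions=(0.1, 0.5, 1.0), seed=0):
    """Draw nested subject-level subsets for the given fractions.

    Shuffling once with a fixed seed and taking prefixes guarantees that
    smaller subsets are contained in larger ones, so differences between
    runs are not caused by entirely different subjects.
    """
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    subsets = {}
    for fraction in fractions:
        count = max(1, round(fraction * len(shuffled)))
        subsets[fraction] = shuffled[:count]
    return subsets


# 110 healthy subjects, as in the study; the identifiers are made up.
subjects = [f"subject_{i:03d}" for i in range(110)]
subsets = nested_training_subsets(subjects)
print({fraction: len(ids) for fraction, ids in subsets.items()})
# {0.1: 11, 0.5: 55, 1.0: 110}
```

With 110 subjects, the three settings correspond to 11, 55 and 110 subjects, which makes clear how little data the smallest configuration sees.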

III-O Model Complexity

(d) MSSEG2015
Fig. 9: AUPRC of all models and UAD approaches, using the original, more complex architectures proposed in the respective papers.

To give some insight into the relation between model complexity and segmentation performance, we further rank some of the approaches against each other using the architectures originally proposed in the respective papers. A comparison is provided on all datasets in Fig. 9. Therein, we find the VAE and the restoration-based GMVAE methods to be stable candidates. Except for , the standard VAE approach as proposed in [24, 23, 3] shows reliable performance. Similarly, the GMVAE, especially in combination with restoration-based UAD, shows good performance across all datasets. Interestingly, the more complex VAE and Context VAE models in Fig. 9 perform only comparably to the less complex models following our unified architecture (Fig. (d)d). On , none of the more complex models beat the top-performing unified restoration approach. The gradient-based approach, proposed in combination with the original Context VAE, yields lower AUPRC than its unified counterpart. We relate this observation to the reconstruction capabilities of the models, which improve with an increasing number of model parameters. With increasing complexity, larger lesions such as Glioblastoma are reconstructed better as well, which is not desirable.

III-P Reconstruction Fidelity and UAD Performance

(d) MSSEG2015
Fig. 10: Correlation matrices among segmentation performance, reconstruction fidelity and overlap among residual histograms of normal and anomalous intensities.

From Fig. 6 it is clear that, apart from spatial models, none of the approaches can reconstruct input perfectly, i.e. none of these methods leave healthy regions intact and substitute anomalous regions with plausible healthy anatomy. Nonetheless, some methods perform better than others. We try to relate anomaly segmentation performance to the overlap between a model's residual histograms of normal and anomalous pixels and to general reconstruction fidelity. To this end, we correlate the AUPRC and DICE with the -distance of the aforementioned histograms, and further determine how the -distance correlates with reconstruction fidelity of normal and/or anomalous tissue. We do this for every dataset separately to find out whether the correlation differs across datasets and pathologies. Fig. 10 shows the correlation heatmaps of the aforementioned measures on all datasets.
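The analysis described above can be sketched in a few lines. The example assumes a symmetric chi-square distance between normalized residual histograms and Pearson's correlation coefficient; both are assumptions made for illustration, since the exact distance and correlation measure are not recoverable from the text here. All function names and numbers are hypothetical.

```python
import math


def normalized_histogram(values, bins=10, value_range=(0.0, 1.0)):
    """Histogram of values within value_range, normalized to sum to 1."""
    low, high = value_range
    counts = [0] * bins
    for value in values:
        if low <= value <= high:
            index = min(int((value - low) / (high - low) * bins), bins - 1)
            counts[index] += 1
    total = sum(counts)
    return [count / total for count in counts] if total else counts


def chi_square_distance(p, q):
    """Symmetric chi-square distance: 0 for identical, 1 for disjoint histograms."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Toy residuals (hypothetical): normal pixels low, anomalous pixels high.
normal_residuals = [0.05, 0.05, 0.15]
anomalous_residuals = [0.85, 0.95, 0.95]
distance = chi_square_distance(
    normalized_histogram(normal_residuals),
    normalized_histogram(anomalous_residuals),
)
print(round(distance, 3))  # 1.0, the histograms do not overlap

# Correlating one distance and one score per model (toy values, not study results).
toy_distances = [0.2, 0.4, 0.6, 0.8]
toy_auprc = [0.1, 0.2, 0.3, 0.4]
print(round(pearson(toy_distances, toy_auprc), 3))  # 1.0
```

A larger distance means that the residuals of normal and anomalous pixels are easier to separate with a single threshold, which is the property the correlation analysis relates to segmentation performance.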

On and , AUPRC and DICE show moderate to strong correlation with the reconstruction error on anomalous pixels -RE, but less so with the residuals of normal intensities -RE. Their correlation with the -distance among residual histograms is the strongest. There is also a strong correlation between and -RE; the correlation to -RE is less pronounced. From these results we deduce that actual reconstruction fidelity is less important for UAD than clearly distinguishable residual histograms of normal and anomalous intensities.

For , similar, but generally stronger correlations can be seen. Interestingly, there is also a moderate to strong positive relationship between segmentation performance and the magnitude of normal residuals. This indicates that with increasing reconstruction error on both normal and anomalous intensities, segmentation performance improves. We hypothesize that models which reconstruct data well also reconstruct tumors well. Models with generally poor reconstruction capabilities substitute tumors with poor reconstructions of healthy tissue, leading to better separability between anomalies and normal intensities.

On , the previously noticed correlations are hardly present. Instead, -RE and -RE are strongly correlated and seem to correlate similarly with all other metrics. This clearly reflects the poor contrast in the underlying MR images, which renders UAD unsuitable.

III-Q Discussion

Ranking: The clear winner of this comparative study is the restoration method applied to a VAE (VAE (restoration)), which achieves the best performance on and , i.e. works best on different pathologies, but also achieves the best performance on , i.e. under domain shift. However, there is a downside to the restoration method, namely runtime. A restoration of a single axial slice in 500 iterations takes multiple seconds, which for an entire MR volume quickly accumulates to multiple minutes. The feed-forward nature of purely reconstruction-based approaches allows for much faster inference. In this context, a very promising method is the reconstruction-based f-AnoGAN, which achieves the best performance on the very challenging MSSEG2015 dataset and is only slightly inferior to the winning restoration approach on all other datasets. We also find that latent variable models perform better in anomaly segmentation than classic AEs. Their reconstructions tend to be more blurry, but the gap between the reconstruction errors of normal and anomalous pixels is considerably larger, which allows for much better discrimination between anomalies and normal tissue. Among the latent variable models, we find the VAE to be the recommended choice, as it not only performs best, but is also the easiest to optimize. It involves fewer hyperparameters than the other approaches and does not require a discriminator network, which is a critical building block in GANs.

Open Problems: Despite all the recent successes of this paradigm, there are many questions yet to be answered. A key question is how to choose an Operating Point at which the continuous output i) can be binarized and a segmentation can be obtained or ii) an input sample can be considered anomalous. Most of the methods currently either rely on a held-out validation set to determine a threshold for binarization, or make use of heuristics on the intensity distribution. One such heuristic uses the 98th percentile of healthy data as a threshold, above which every value is considered an outlier [3]. It is necessary that more principled approaches for binarization are developed.
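The percentile heuristic mentioned above can be illustrated with a short sketch. It assumes that the threshold is taken over residuals of healthy data and uses linear interpolation between order statistics; the helper names and all numbers are hypothetical and are not taken from [3].

```python
def percentile(values, q):
    """q-th percentile (0 to 100) with linear interpolation between order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def dice(prediction, target):
    """DICE overlap between two binary sequences (1.0 if both are empty)."""
    intersection = sum(1 for p, t in zip(prediction, target) if p and t)
    size_sum = sum(1 for p in prediction if p) + sum(1 for t in target if t)
    return 2 * intersection / size_sum if size_sum else 1.0


# Hypothetical residuals of healthy validation pixels: 0.000, 0.001, ..., 0.100.
healthy_residuals = [i / 1000 for i in range(101)]
threshold = percentile(healthy_residuals, 98)

# Binarize the residuals of a hypothetical test slice and compare to ground truth.
test_residuals = [0.02, 0.05, 0.30, 0.12, 0.01, 0.099]
ground_truth = [0, 0, 1, 1, 0, 0]
segmentation = [int(residual > threshold) for residual in test_residuals]

print(round(threshold, 3))  # 0.098
print(segmentation)         # [0, 0, 1, 1, 0, 1]
print(round(dice(segmentation, ground_truth), 3))  # 0.8
```

The toy case also shows the weakness of the heuristic: by construction, roughly 2% of healthy pixels exceed the threshold, so a healthy pixel with a residual just above it (the last one here) becomes a false positive.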

Although reconstruction fidelity here is far from perfect, the reviewed methods do seem capable of segmenting different kinds of anomalies. Nonetheless, we believe that the community should still aim for higher levels of fidelity and for modeling MRI at higher resolution, to facilitate segmentation of particularly small brain lesions (e.g. MS lesions, which can become very small) and to enhance the precision of anomaly localization.

Another obvious downside of the reviewed methods is the necessity of a curated dataset of healthy data. It is debatable whether such methods can actually be called unsupervised or should be seen as weakly-supervised. The community should aim for methods which can be trained from all kinds of samples, even data potentially including anomalies, without the need for human ratings. You et al. [22] made an initial attempt in this direction by using a percentile-based heuristic on the training data to mask out potential outliers during training, and with so-called discriminative reconstruction autoencoders [21], an interesting concept has recently been proposed in the Computer Vision field. All in all, more research in this direction is strongly encouraged.

Generally, the field of Deep Learning based UAD for brain imaging is rapidly growing, and without the availability of a well-defined benchmark dataset the field becomes increasingly confusing. This confusion primarily arises from the different datasets used in these works, which come at different resolutions, with different lesion load and different pathologies. All of these properties make it hard to compare methods. Here, we try to give an overview of recent methods, bring them into a shared context and establish comparability among them by leveraging the same data for all approaches. Nonetheless, even the datasets used in this comparative study are limited, and many open questions remain unanswered. Since UAD methods aim to be general, they need to be evaluated on the most representative dataset possible. Ideally, a benchmark dataset for UAD in brain MRI should comprise a vast number of healthy subjects as well as different pathologies from different scanners, covering the genders and the entire age spectrum.

To date, different works do not only employ different datasets, but also report different metrics. In addition to the benchmark, a clear set of evaluation metrics needs to be defined to facilitate comparability among methods.

Last, the majority of approaches relies on 2D slices, but 3D offers greater opportunity and more context.

IV Conclusion

In summary, we presented a thorough comparison of autoencoder-based methods for anomaly segmentation in brain MRI, which rely on modeling healthy anatomy to detect abnormal structures. We find that none of the models can perfectly reconstruct or restore healthy counterparts of potentially pathological input samples, but different approaches show different discrepancies between reconstruction-error statistics of normal and abnormal tissue, which we identify as the best indicator for good UAD performance.

To facilitate comparability, we relied on a single unified architecture and a single image resolution. The entire code behind this comparative study, including the implementations of all methods, pre-processing and evaluation pipeline will be made publicly available and we encourage authors to contribute to it. Authors might benefit from a transparent ranking which they can report in their work without having to reinvent the wheel to run extensive comparisons against other approaches.

In our discussion, we also identify different research directions for future work. Comparing different model-complexities, their correlation with reconstruction quality and its effect on anomaly segmentation performance is another research direction orthogonal to our investigations. Determining the correlation between image resolution and UAD performance is also an open task. However, our main proposal is the creation of a benchmark dataset for UAD in brain MRI, which involves many challenges by itself, but would be very beneficial to the entire community.


The authors would like to thank their clinical partners at Klinikum rechts der Isar, Munich, for generously providing their data. S.A. was supported by the PRIME programme of the German Academic Exchange Service (DAAD) with funds from the German Federal Ministry of Education and Research (BMBF), and B.W. was supported by the DFG SFB-824 grant.


  • [1] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 214–223.
  • [2] H. E. Atlason, A. Love, S. Sigurdsson, V. Gudnason, and L. M. Ellingsen (2019) Unsupervised brain lesion segmentation from MRI using a convolutional autoencoder. In Medical Imaging 2019: Image Processing, Vol. 10949, pp. 109491H.
  • [3] C. Baur, B. Wiestler, S. Albarqouni, and N. Navab (2018) Deep autoencoding models for unsupervised anomaly segmentation in brain MR images. arXiv preprint arXiv:1804.04488.
  • [4] M. A. Bruno, E. A. Walker, and H. H. Abujudeh (2015) Understanding and confronting our mistakes: the epidemiology of error in radiology and strategies for error reduction. Radiographics 35 (6), pp. 1668–1676.
  • [5] A. Carass, S. Roy, A. Jog, J. L. Cuzzocreo, E. Magrath, A. Gherman, J. Button, J. Nguyen, F. Prados, C. H. Sudre, et al. (2017) Longitudinal multiple sclerosis lesion segmentation: resource and challenge. NeuroImage 148, pp. 77–102.
  • [6] X. Chen and E. Konukoglu (2018) Unsupervised detection of lesions in brain MRI using constrained adversarial auto-encoders. arXiv preprint arXiv:1806.04972.
  • [7] N. Dilokthanakul, P. A. Mediano, M. Garnelo, M. C. Lee, H. Salimbeni, K. Arulkumaran, and M. Shanahan (2016) Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648.
  • [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 2672–2680.
  • [9] J. E. Iglesias, C. Liu, P. M. Thompson, and Z. Tu (2011) Robust brain extraction across datasets and comparison with publicly available methods. IEEE Transactions on Medical Imaging 30 (9), pp. 1617–1634.
  • [10] D. P. Kingma and M. Welling (2014) Auto-encoding variational Bayes. In International Conference on Learning Representations.
  • [11] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther (2015) Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300.
  • [12] Ž. Lesjak, A. Galimzianova, A. Koren, M. Lukin, F. Pernuš, B. Likar, and Ž. Špiclin (2018) A novel public MR image dataset of multiple sclerosis patients with lesion segmentations based on multi-rater consensus. Neuroinformatics 16 (1), pp. 51–63.
  • [13] A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow (2016) Adversarial autoencoders. In International Conference on Learning Representations.
  • [14] N. Pawlowski, M. C. Lee, M. Rajchl, S. McDonagh, E. Ferrante, K. Kamnitsas, S. Cooke, S. Stevenson, A. Khetani, T. Newman, et al. (2018) Unsupervised lesion detection in brain CT using Bayesian convolutional autoencoders.
  • [15] T. Rohlfing, N. M. Zahr, E. V. Sullivan, and A. Pfefferbaum (2009) The SRI24 multichannel atlas of normal adult human brain structure. Human Brain Mapping 31 (5), pp. 798–819.
  • [16] D. Sato, S. Hanaoka, Y. Nomura, T. Takenaga, S. Miki, T. Yoshikawa, N. Hayashi, and O. Abe (2018) A primitive study on unsupervised anomaly detection with an autoencoder in emergency head CT volumes. In Medical Imaging 2018: Computer-Aided Diagnosis, Vol. 10575, pp. 105751P.
  • [17] T. Schlegl, P. Seeböck, S. M. Waldstein, G. Langs, and U. Schmidt-Erfurth (2019) f-AnoGAN: fast unsupervised anomaly detection with generative adversarial networks. Medical Image Analysis 54, pp. 30–44.
  • [18] T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs (2017) Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging, pp. 146–157.
  • [19] J. A. Sethian (1999) Level set methods and fast marching methods: evolving interfaces in computational geometry, fluid mechanics, computer vision, and materials science. Vol. 3, Cambridge University Press.
  • [20] A. Taboada-Crispi, H. Sahli, D. Hernandez-Pacheco, and A. Falcon-Ruiz (2009) Anomaly detection in medical image analysis. In Handbook of Research on Advanced Techniques in Diagnostic Imaging and Biomedical Applications, pp. 426–446.
  • [21] Y. Xia, X. Cao, F. Wen, G. Hua, and J. Sun (2015) Learning discriminative reconstructions for unsupervised outlier removal. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1511–1519.
  • [22] S. You, K. C. Tezcan, X. Chen, and E. Konukoglu (2019) Unsupervised lesion detection via image restoration with a normative prior. In Proceedings of The 2nd International Conference on Medical Imaging with Deep Learning, M. J. Cardoso, A. Feragen, B. Glocker, E. Konukoglu, I. Oguz, G. Unal, and T. Vercauteren (Eds.), Proceedings of Machine Learning Research, Vol. 102, London, United Kingdom, pp. 540–556.
  • [23] D. Zimmerer, F. Isensee, J. Petersen, S. Kohl, and K. Maier-Hein (2019) Unsupervised anomaly localization using variational auto-encoders. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 289–297.
  • [24] D. Zimmerer, S. A. Kohl, J. Petersen, F. Isensee, and K. H. Maier-Hein (2018) Context-encoding variational autoencoder for unsupervised anomaly detection. arXiv preprint arXiv:1812.05941.

Appendix A

Fig. 11: Normalized histograms of residuals of normal (blue) and anomalous (red) pixels in the intensity range (ignoring residuals which are completely 0)