1 Introduction
Detecting lesions plays a critical role in radiological assessment; it is often the first step in the diagnosis pipeline. Manual detection, the dominant current practice, relies on an exceptional understanding of the normal anatomy. Specifically, radiologists detect lesions as regions that deviate from the normal variation of the healthy anatomy, and then identify the disease based on the features of the lesion as well as other patient information. This detection process is unsupervised in the sense that the radiologist is not looking for a specific lesion but for any abnormal variation.
Developing algorithmic approaches for automatic unsupervised lesion detection is important for both algorithm development and clinical practice. For algorithm development, accurate unsupervised lesion detection would enable further algorithms that are robust to lesions unseen in the training set. For instance, a machine learning-based segmentation algorithm that is aware of the presence of an outlier can take this into account while segmenting the rest of the image, avoiding being affected by the abnormal feature responses of the outlier area. Such robustness would be extremely helpful for a variety of tasks researchers are tackling with machine learning methods, such as segmentation, localization, reconstruction and restoration. Good unsupervised lesion detection methods can also be applied to detecting artifacts and, furthermore, to developing methods robust to such artifacts. For clinical practice, on the other hand, a generic and accurate unsupervised lesion detection method would be a useful pre-screening tool to improve efficiency in radiological assessment by focusing expert effort directly on abnormal areas. This aspect is becoming particularly critical considering the increase in image resolution and in the number of modalities used in clinical routines, which on one hand has the potential to improve diagnostic accuracy, but on the other hand results in drastically larger image volumes and more images to be examined for each study, and thus more effort.
Machine learning-based approaches have attracted considerable attention for lesion detection in the last decade. The majority of the research effort in this direction focuses on developing supervised algorithms [1, 34, 9, 7, 23, 13, 16]. In this approach, algorithms are optimized during the training phase to detect lesions of pre-specified types based on examples in the training set. When tested on the same lesion types, these methods have yielded state-of-the-art detection and segmentation performance. However, being optimized to detect pre-specified lesion types limits their applicability to unseen lesions. In theory, adding examples of all possible types of lesions to the training set can address this issue, but this option, even when feasible, would be very expensive in manual effort and time.
Unsupervised lesion detection methods take another approach. They focus on learning prior knowledge of the appearance of healthy anatomy in order to detect lesions as areas that disagree with this prior knowledge, mimicking radiologists. In one of the first works, a probabilistic atlas of healthy anatomy was utilized as the prior knowledge, and tissue-specific mixture models with an outlier class were built to segment healthy tissue while identifying areas that do not fit the other tissue classes [30]. Similar approaches used an atlas with spatial features instead of intensities [20] and combined spatial and intensity atlases [24]. More recent works use prior models of image patches to include contextual information around a pixel in the detection process. Considering the curse of dimensionality in modeling high-dimensional distributions, such works adopt dimensionality reduction methods, such as principal component analysis
[32, 8], patch-based mixture models [3] and sparse representations [33]. Advances in deep learning (DL)-based unsupervised learning methods have provided strong alternatives for approximating distributions in high dimensions. Neural samplers, such as Generative Adversarial Networks (GAN) [10], can learn to generate realistic high-dimensional images similar to the ones seen in the training set starting from a lower-dimensional latent space, and DL-based nonlinear latent variable models, such as Variational Autoencoders (VAE) [14], approximate the distribution from the samples in a training set, using networks to map from images to the latent space and back, implementing the original idea [18] efficiently. DL-based approaches yield better approximations to high-dimensional distributions, as evidenced by the realistic samples they can generate. Recently proposed works [26, 2, 22, 4] already apply DL-based models to approximating normative distributions using only images showing healthy anatomy, and then detect lesions as outliers. These methods typically have three steps. First, a given image with a lesion is projected to the latent space of the model, i.e., the latent space representation of the image is found. Since the models have only seen healthy anatomy during training, the most likely latent representations correspond to images with healthy anatomy. Based on this assumption, the original image is then reconstructed from its latent representation; the lesion is reconstructed with large errors while the rest of the image is reconstructed faithfully. Hence, the difference between the reconstruction and the original image will highlight the lesion. While using this prior projection is a sound idea, it relies on the assumption that healthy areas in the image remain the same in the reconstruction, meaning that the latent space representations of the image with the lesion and its lesion-free version are very similar. Unfortunately, this assumption may not hold, since the intensities in the lesion area may greatly affect the projection step, causing large deviations between the two latent space representations. In contrast, a method that directly takes such possible deviations into account can lead to improved detection accuracy. In this work, we propose an approach that casts the unsupervised detection problem as an image restoration problem.
The proposed method also uses DL-based nonlinear latent variable models to approximate the normative distribution of healthy anatomy from example images. However, instead of using prior projection, we formulate the detection as a Maximum-A-Posteriori (MAP) image restoration problem with the normative image prior estimated by a DL-based model, in the same spirit as the method in [28], and a likelihood model that places minimal assumptions on the lesion. The MAP estimation is then solved using gradient ascent. The proposed image restoration method is agnostic to the choice of the prior and can therefore be applied with any DL-based latent variable model. To learn the image prior for restoration, we use the VAE as well as the GMVAE [6, 12], an extension of the VAE that uses Gaussian mixture modeling in the latent space. We evaluate and analyze the proposed method on two different Magnetic Resonance Imaging (MRI) lesion datasets, one showing brain tumors and the other showing lesions due to stroke. A preliminary version of this work [31] was presented at the International Conference on Medical Imaging with Deep Learning. In this extended version, we present an in-depth analysis of the model and evaluations on an additional dataset. We note that unsupervised lesion detection approaches, including the method proposed here, currently yield lower detection accuracy than supervised approaches. This is because unsupervised lesion detection is a much more challenging task. The method we present here improves the state-of-the-art in unsupervised lesion detection and closes the accuracy gap between supervised and unsupervised approaches a bit further, taking a step towards making unsupervised lesion detection a viable alternative to supervised approaches, with the important benefits described above.
2 Method
We start this section with a brief review of normative distribution estimation using latent variable models, particularly the VAE and GMVAE, and then give more details on MAP-based image restoration for unsupervised lesion detection. For an overview of the method, please refer to Figure 1.
2.1 Learning Prior Normative Distribution
Latent variable models have been a popular choice for approximating high-dimensional distributions from a set of examples. Let us denote an image with $d$ pixels by $x \in \mathbb{R}^d$; a latent variable model for the distribution of $x$ in its most generic form is given as

$$p(x) = \int p(x|z)\, p(z)\, dz,$$

where $z \in \mathbb{R}^l$ is the latent variable and $p(z)$ is the prior distribution in the latent space. $l$ is the dimension of the latent space, and often a dimension much smaller than $d$ is used, i.e. $l \ll d$, assuming images can be represented in a lower-dimensional subspace of $\mathbb{R}^d$. The conditional distribution $p(x|z)$ encodes the mapping between the latent space and the image space.
The simplest example of a latent variable model is the probabilistic principal component analysis proposed in [29], where $p(x|z)$ encodes a linear map between the latent and image spaces within a Gaussian distribution, i.e.

$$p(x|z) = \mathcal{N}\!\left(x;\, Wz + \mu,\, \sigma^2 I\right),$$

with $W \in \mathbb{R}^{d \times l}$, $\mu \in \mathbb{R}^d$, and $\sigma^2$ the variance of the assumed noise in the data. Being powerful already, this model has been used as a normative distribution in unsupervised lesion detection [8]. However, a linear mapping can be limiting, and a nonlinear latent variable model with the mapping parameterized by neural networks [18] has been proposed to address this limitation. The version of this model using Gaussian conditional distributions can be given as

$$p_\theta(x|z) = \mathcal{N}\!\left(x;\, \mu_\theta(z),\, \Sigma_\theta(z)\right),$$

where $\mu_\theta(\cdot)$ and $\Sigma_\theta(\cdot)$ are neural networks and $\theta$ represents the entire set of parameters. Given this model, learning is modeled as maximizing the evidence $\sum_n \log p_\theta(x_n)$, where the subscript $n$ goes over the training samples.
The optimization of $\sum_n \log p_\theta(x_n)$ requires solving the integral over $z$ either analytically or numerically. The former is not possible, and the latter becomes infeasible for even modestly large $l$. MacKay proposed using importance sampling to address this issue [18]. The VAE [14] takes this approach and defines a distribution $q_\phi(z|x)$, parameterized with another neural network, to approximate the true posterior $p_\theta(z|x)$, and maximizes the evidence lower bound (ELBO) rather than the evidence itself:

$$\log p_\theta(x) \geq \mathrm{ELBO}(x) = \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x|z)\right] - \mathrm{KL}\!\left[q_\phi(z|x)\,\|\,p(z)\right],$$

where $q_\phi(z|x)$ and $p_\theta(x|z)$ are modeled with two networks, the former parameterized with $\phi$ and the latter with $\theta$, and KL represents the Kullback-Leibler divergence. In the VAE, learning is defined as

$$\max_{\theta, \phi} \sum_n \mathrm{ELBO}(x_n),$$

and is implemented with stochastic sampling to compute the expectations in the ELBO. For further information regarding the optimization of the VAE objective function, we refer the reader to the original paper [14].
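To make the ELBO concrete, the sketch below evaluates it for a single image under the diagonal-Gaussian modeling choices above, with the KL term in its closed form for diagonal Gaussians. This is a minimal numpy illustration; the encoder/decoder outputs are stand-ins passed as arguments, not the paper's networks, and the expectation is approximated with a single sample.

```python
import numpy as np

def vae_elbo(x, mu_z, logvar_z, x_recon, sigma2=1.0):
    """ELBO for one image under a Gaussian VAE.

    mu_z, logvar_z : encoder outputs defining q(z|x) = N(mu_z, diag(exp(logvar_z)))
    x_recon        : decoder mean for a sampled z (single-sample estimate)
    sigma2         : fixed (homoscedastic) decoder variance
    """
    # E_q[log p(x|z)] up to constants: Gaussian log-likelihood of the reconstruction
    recon = -0.5 * np.sum((x - x_recon) ** 2) / sigma2
    # KL(q(z|x) || N(0, I)) in closed form for diagonal Gaussians
    kl = 0.5 * np.sum(np.exp(logvar_z) + mu_z ** 2 - 1.0 - logvar_z)
    return recon - kl
```

With a perfect reconstruction and the approximate posterior equal to the prior, both terms vanish; any mismatch in either term lowers the bound.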
Following the modeling choices of the original work, here we use the VAE model with

$$p(z) = \mathcal{N}(z;\, 0,\, I),$$
$$q_\phi(z|x) = \mathcal{N}\!\left(z;\, \mu_\phi(x),\, \mathrm{diag}(\sigma^2_\phi(x))\right), \text{ and}$$
$$p_\theta(x|z) = \mathcal{N}\!\left(x;\, \mu_\theta(z),\, \sigma^2 I\right),$$

where $\mathrm{diag}(v)$ denotes a diagonal matrix with the elements of $v$ on the diagonal, $I$ is the identity matrix in the appropriate dimensions, and $\mu_\phi$, $\sigma^2_\phi$ and $\mu_\theta$ are parameterized as neural networks, whose architectures are described in Section 3.

Remark: Note that with the Gaussian distribution models, $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ is computed as an $\ell_2$ loss between the image and its mean reconstruction. The loss is derived from a Gaussian distribution with mean $\mu_\theta(z)$ and homoscedastic variance $\sigma^2$, which is set to a constant value for each pixel. Approaches that use prior projection with the VAE, or other latent variable models, often first project the image to its latent representation $\hat{z} = \mu_\phi(x)$ and then reconstruct the image as $\hat{x} = \mu_\theta(\hat{z})$. The difference between $x$ and $\hat{x}$ is then assumed to be the lesion.
The original VAE model uses a unimodal Gaussian distribution as the prior in the latent space. A more expressive prior can potentially allow the latent variable model to fit more complicated distributions to the samples. With this motivation, the Gaussian Mixture VAE (GMVAE) [6] was proposed, which uses a Gaussian mixture model as the prior in the latent space.
To model the latent distribution as a Gaussian mixture, the authors introduce two additional latent variables in the GMVAE: $c$ for the mixture assignment and $w$ for the mixture model coefficients. The prior distribution in the latent space is given as

$$p(z|w, c) = \prod_{k=1}^{K} \mathcal{N}\!\left(z;\, \mu_k(w),\, \mathrm{diag}(\sigma^2_k(w))\right)^{c_k}, \qquad (1)$$

where $K$ is the pre-specified number of Gaussian mixture components, $c$ is a one-hot vector, and $w \sim \mathcal{N}(0, I)$. $\mu_k(\cdot)$ and $\sigma^2_k(\cdot)$ are functions of $w$ parameterized as neural networks. Similar to the VAE, an ELBO can be derived for the GMVAE [6], using multiple approximations for the posterior distributions. The mixture model in the latent space gives rise to two additional distributions, $q(w|x)$ and $p(c|z, w)$. We use a Gaussian distribution with diagonal covariance for the first one, $q(w|x)$, while the posterior for $c$ can be computed analytically for given $z$ and $w$. We use the same models for $q(z|x)$ and $p(x|z)$ as in the VAE model. Training of the GMVAE model follows a similar approach as the VAE, and we refer the interested reader to the original paper [6] for details.
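As a small illustration of why a mixture prior is more expressive than a unimodal one, the sketch below evaluates the log-density of a latent vector under a $K$-component diagonal-Gaussian mixture via log-sum-exp. This is a generic mixture density, not the GMVAE's exact parameterization (where the component means and variances are themselves functions of $w$); the argument names are illustrative.

```python
import numpy as np

def gmm_log_prob(z, means, logvars, weights):
    """log p(z) under a K-component diagonal-Gaussian mixture.

    z       : (l,) latent vector
    means   : (K, l) component means
    logvars : (K, l) component log-variances
    weights : (K,) mixture coefficients, summing to 1
    """
    # Per-component diagonal-Gaussian log-densities
    log_comp = -0.5 * np.sum(
        logvars + np.log(2 * np.pi) + (z - means) ** 2 / np.exp(logvars), axis=1
    )
    # Numerically stable log-sum-exp over the weighted components
    a = np.log(weights) + log_comp
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))
```

With $K = 1$, zero mean and unit variance, this reduces to the standard-normal log-density used in the VAE prior, so the VAE prior is recovered as a special case.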
We use convolutional neural networks to parameterize the necessary distributions in the VAE and GMVAE models, and train the models with MR images acquired from healthy individuals to obtain the normative distribution, which is used in the restoration framework as explained next. The details on the architectures and datasets are presented in Section 3. While the VAE and GMVAE models are used for demonstrations, the MAP-based outlier detection described next can be combined with any latent variable model, such as flow-based unsupervised learning approaches [11]. An illustrated reconstruction process can be found in Figure 2.

2.2 Detecting outliers with MAP-based restoration
The restoration framework assumes the image with the lesion is a “normal” image, i.e. one coming from the normative distribution, but with an additional “noise” component, the lesion. The goal is to restore the normal image from the “noisy” observation and, in doing so, detect the lesion as noise. This is fundamentally different from the prior projection model or from detecting outliers in the latent space with another metric. The approach here specifically aims to retain the normal anatomy in the image and to change only the outlier lesion part during the restoration. Next, we describe how we achieve this with a probabilistic model and MAP estimation.
Let us denote an image with a lesion by $y$. We assume that $y$ is a noisy version of $x$, i.e. $y = x + n$, modeling the lesion with $n$ and the noise-free, which corresponds to lesion-free, version of $y$ with $x$. In areas of the image where no lesion is present, $y = x$ should hold at those pixels. In cases with no lesion at all, $y = x$ should hold in the entire image, i.e. the model should find no lesion.
The usual MAP estimation maximizes the posterior distribution of $x$ given $y$:

$$\hat{x} = \arg\max_x \log p(x|y) = \arg\max_x \log p(y|x) + \log p(x),$$

where $p(y|x)$ is the likelihood term, which can be interpreted as the data consistency term. This term reflects the assumptions on $n$, which we detail in Section 2.2.1. $p(x)$ is the normative prior, i.e. the distribution of healthy images. These two terms form a balanced optimization problem. The first term aims to preserve the image by penalizing deviations between the observed image $y$ and its restored version $x$. Maximizing this term alone would not change the image. The second term, however, yields higher values for images that fit the normative distribution. An image with lesions has a lower probability under the normative distribution and consequently gives a lower $\log p(x)$. For such images, the optimum is to remove the outlier lesion that does not fit the normative distribution, assigning it to $n$, while keeping the normal anatomy fixed between $y$ and $x$. For an image without lesions, leaving the image unchanged is the optimal solution.

The MAP estimation optimizes $\log p(x)$, but this term is neither analytically tractable nor easy to compute via sampling for most nonlinear latent variable models. Instead, for the VAE and GMVAE, we propose to optimize the evidence lower bound, or in other words, to solve the approximated MAP estimation problem

$$\hat{x} = \arg\max_x \log p(y|x) + \mathrm{ELBO}(x). \qquad (3)$$

The difference between $\log p(x)$ and $\mathrm{ELBO}(x)$ is given by $\mathrm{KL}\!\left[q(z|x)\,\|\,p(z|x)\right]$. Therefore, as the approximation gets better, the KL divergence goes to zero and maximizing the ELBO gets closer to maximizing $\log p(x)$. An important advantage of using $\mathrm{ELBO}(x)$ is that it is composed of differentiable functions, allowing a gradient-based optimization scheme to solve the approximate MAP problem.
To optimize the objective formulated in Equation 3, we adopt a gradient ascent method and perform iterative optimization to obtain the restored image. Specifically, in each iteration, we approximate the gradient of $\log p(x|y)$ by taking the gradient of Equation (3):

$$x^{i+1} = x^{i} + \alpha_i \nabla_x \left[\log p(y|x) + \mathrm{ELBO}(x)\right]\Big|_{x = x^{i}}, \qquad (4)$$

$$x^{0} = y, \qquad (5)$$

where $\alpha_i$ is the step size at the $i$-th iteration. Assuming the optimization converges at the $N$-th iteration, we obtain the restored image as $\hat{x} = x^{N}$, which is an estimate of the lesion-free version of the observed image $y$. As we assumed an additive model, the estimated outlier lesion can be revealed as $\hat{n} = y - \hat{x}$, providing a pixelwise detection of the lesion. Taking both hypointense and hyperintense areas into account, the absolute value of the difference can also be used as the final detection, i.e. $|y - \hat{x}|$.
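The gradient-ascent restoration loop described above can be sketched as follows. This is a minimal numpy illustration: closed-form gradient callbacks stand in for the actual terms (in the real method, the ELBO gradient is obtained by backpropagation through the trained VAE/GMVAE), and a constant step size is used instead of the paper's schedule.

```python
import numpy as np

def map_restore(y, grad_log_prior, grad_log_lik, step=0.1, n_steps=200):
    """Iterative gradient-ascent MAP restoration.

    Maximizes log p(y|x) + log p(x), starting from the observed image,
    and returns the restored image together with the detected outlier.
    """
    x = y.copy()  # initialize the restoration at the observed image
    for _ in range(n_steps):
        x = x + step * (grad_log_lik(x, y) + grad_log_prior(x))
    return x, y - x  # restored image, pixelwise outlier estimate
```

As a toy check, with a standard-normal prior (gradient $-x$) and a unit-variance Gaussian likelihood (gradient $y - x$), the analytic optimum is $x^* = y/2$, which the loop converges to.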
2.2.1 Data consistency
The likelihood $p(y|x)$ is a critical component of the MAP estimation. As explained in the previous section, it reflects the modeling assumptions on $n$ through how it measures the deviation between $y$ and $x$. By penalizing certain types of deviations more than others, the likelihood term encodes a preference over $n$ and thus influences the entire MAP estimation. In the unsupervised detection setting, we aim to detect any type of lesion, in contrast to the supervised setting, where the goal is to detect specific types of lesions. As a result, $p(y|x)$ should not be based on lesion-specific information, such as intensity or specific shape features. Generic, mathematically driven likelihood terms, instead of data-driven ones, can provide the generality required for unsupervised detection.
In this work, we adopt a likelihood term that prefers $n$ to be composed of larger, continuous blocks rather than many isolated islands. The Total Variation (TV) norm [25] formulates this preference and can easily be used as the likelihood term in the MAP estimation. Using the TV norm, we define the final restoration problem as

$$\hat{x} = \arg\max_x\; -\lambda \left\| y - x \right\|_{TV} + \mathrm{ELBO}(x). \qquad (6)$$

The TV norm is weighted by $\lambda$, which balances the magnitudes of the gradients from the two terms during the gradient ascent optimization. An effective weight prevents large deviations between $y$ and $x$ and results in accurate detection performance. Determining the weight is not trivial in the most generic setting. Instead, we propose a heuristic method to tune the weight using images from healthy individuals.
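The TV norm of a difference map can be computed with finite differences, as in the sketch below. The anisotropic variant is shown for simplicity; whether the isotropic or anisotropic form is used in the actual implementation is not specified here.

```python
import numpy as np

def tv_norm(d):
    """Anisotropic total-variation norm of a 2D difference map d = y - x:
    the sum of absolute finite differences along both image axes."""
    return np.abs(np.diff(d, axis=0)).sum() + np.abs(np.diff(d, axis=1)).sum()
```

The TV norm encodes exactly the stated preference: a contiguous block scores lower than the same total intensity scattered over isolated pixels, so the likelihood term favors compact lesion estimates.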
2.2.2 Determining the weight parameter
Intuitively, images from healthy individuals are lesion-free and, ideally, the detection method should not detect lesions in them. Based on this intuition, we determine the $\lambda$ that minimizes the change caused by the MAP estimation on healthy images. We measure the change caused by the restoration using the distance between the restored images and the input images for a small validation set composed only of healthy images,

$$\mathcal{E}(\lambda) = \frac{1}{M} \sum_{m=1}^{M} \left\| y_m - \hat{x}_{\lambda, m} \right\|_2, \qquad (7)$$

where we denote the dependence of the restoration on $\lambda$ with the subscript and $M$ denotes the number of validation images used for the measurement. We perform a parameter search over a wide range of possible values to determine the smallest $\lambda$ that yields the smallest $\mathcal{E}(\lambda)$.
The described heuristic approach relies on the fact that even lesion-free images are changed during the restoration. This happens if the ELBO term yields nonzero gradients for images composed only of healthy anatomy. There are three reasons why this can happen: 1) the normative distribution as modeled with a latent variable model is not a perfect approximation, 2) the ELBO is an approximation to the evidence, and 3) the model may assign higher probability to certain anatomical formations and appearances. Most likely, all these reasons affect the restoration simultaneously, so even healthy images incur change during restoration. As a result, on lesion-free images, we expect very low $\lambda$ values to yield large changes in the images. Moreover, very large $\lambda$ values may also yield changes that have low TV norm and slightly higher ELBO. In our experiments, this is indeed what we observed empirically. Examples of the parameter search are plotted in Figures 9(a) and 10(a), showing an optimal value that minimizes the change on healthy images for different prior models and datasets. More details are given in Section 3.
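The parameter search sketched above amounts to a simple grid search over candidate weights; a minimal numpy sketch follows, where `restore` is a hypothetical stand-in for the full MAP restoration routine.

```python
import numpy as np

def select_weight(healthy_images, restore, weights):
    """Grid search for the TV weight: pick the candidate that minimally
    changes a validation set of healthy (lesion-free) images.

    restore(y, w) -> restored image; hypothetical restoration routine.
    """
    change = [
        np.mean([np.linalg.norm(y - restore(y, w)) for y in healthy_images])
        for w in weights
    ]
    # Smallest mean L2 change over the validation set wins
    return weights[int(np.argmin(change))]
```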
2.2.3 Binary detections by thresholding difference maps
The described MAP estimation yields a restored image $\hat{x}$ and the outlier area as the difference map $|y - \hat{x}|$, which is a pixelwise continuous map. To obtain a pixelwise lesion segmentation, which is a binary map, we need to determine a threshold $\tau$ and label pixels with differences larger than $\tau$ as lesion and the others as normal anatomy. Similar to the case described in the previous section, we need to find the threshold using only healthy images in the unsupervised setting. Once again, we assume that healthy images show only normal anatomy and that the detection method should not find any lesions in a validation set of healthy images. The naive approach of setting the threshold to the maximal pixelwise difference in the healthy images may yield a very conservative method with high specificity but low sensitivity. Instead, we adopt the approach in [15], which implements a compromise by setting a limit on the permitted False Positive Rate (FPR) in the detection results. Any detection in the lesion-free healthy images is a false positive. The method proposed in [15] uses detections in lesion-free healthy images to estimate the FPR for any threshold, and determines the smallest threshold that satisfies a user-defined FPR limit. This FPR-limit-dependent threshold can then be used to convert the difference maps into binary segmentations by finding the pixels with $|y - \hat{x}| > \tau$.
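Since every supra-threshold pixel on a lesion-free image is a false positive, the FPR-limit threshold reduces to a quantile of the pooled healthy difference values. A minimal sketch, assuming all pixels of the validation images are counted (in practice a brain mask may restrict which pixels enter the pool):

```python
import numpy as np

def fpr_threshold(healthy_diff_maps, fpr_limit):
    """Smallest threshold whose pixelwise FPR on lesion-free validation
    images stays within fpr_limit: the (1 - fpr_limit) quantile of the
    pooled healthy difference values."""
    diffs = np.concatenate([np.abs(d).ravel() for d in healthy_diff_maps])
    return np.quantile(diffs, 1.0 - fpr_limit)
```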
To test the effectiveness of this approach, we additionally experimented with determining the threshold using the ROC curve as a baseline. In this approach, the threshold is determined by finding the pixelwise difference value that maximizes the difference between the true positive and false positive rates, i.e. $\max_\tau (\mathrm{TPR} - \mathrm{FPR})$, on the ROC curve. Note that this selection method makes use of the ground-truth lesion annotations to obtain the ROC curve from the difference maps, and is therefore not unsupervised. Nonetheless, this baseline gives optimistic detection results to compare against the results obtained with FPR-based threshold selection.
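This ROC-based baseline can be sketched as an exhaustive scan over candidate thresholds; the function below assumes the mask contains both lesion and non-lesion pixels, and is illustrative rather than the paper's implementation.

```python
import numpy as np

def youden_threshold(diff_map, lesion_mask):
    """Baseline threshold maximizing TPR - FPR on the ROC curve.
    Requires ground-truth lesion masks, so it is NOT unsupervised."""
    d = np.abs(diff_map).ravel()
    gt = lesion_mask.ravel().astype(bool)  # assumes both classes present
    best_t, best_j = 0.0, -np.inf
    for t in np.unique(d):
        pred = d > t
        tpr = pred[gt].mean()    # true positive rate at threshold t
        fpr = pred[~gt].mean()   # false positive rate at threshold t
        if tpr - fpr > best_j:
            best_j, best_t = tpr - fpr, t
    return best_t
```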
3 Experiments
3.1 Datasets & Preprocessing
We used three different MRI datasets: one for training the prior normative distribution and determining the hyperparameters of the model, i.e. $\lambda$ and $\tau$, and the other two for evaluating the proposed approach for detecting lesions.
CamCAN (http://www.camcan.org/): The Cambridge Centre for Ageing and Neuroscience dataset, described in [27], contains T1-weighted (CamCAN-T1) and T2-weighted (CamCAN-T2) brain scans of 652 healthy subjects with ages ranging from 18 to 87. Among these, 600 subjects were randomly selected as the training data for the prior models VAE and GMVAE, and 52 as the validation data for determining the hyperparameters.
BRATS17 (https://www.med.upenn.edu/sbia/brats2018.html): The Multimodal Brain Tumor Image Segmentation Challenge dataset, described in [19], contains T1-weighted and T2-weighted brain scans of 285 subjects with brain tumors. Among them, 210 subjects show high-grade glioblastomas and 75 subjects show low-grade gliomas. We downloaded the version of the dataset published in 2017 and only used the T2-weighted images, in which the lesions appear as hyperintense areas. For all subjects in this dataset, ground truth segmentations of the tumors visible in the T2-weighted images are also available.
ATLAS (http://fcon_1000.projects.nitrc.org/indi/retro/atlas.html): The Anatomical Tracings of Lesions After Stroke (ATLAS) dataset, described in [17], contains 220 T1-weighted brain scans of stroke patients. In these images, the lesions appear as hypointense areas. Similar to the BRATS17 dataset, ground truth pixelwise segmentations are also available.
We trained two sets of prior models, one using T2-weighted and the other using T1-weighted images from the CamCAN dataset. The former is used for the evaluations on the BRATS17 dataset and the latter for the ATLAS dataset. All datasets were preprocessed before training. We performed skull stripping to compute brain masks and remove the skulls, registered the images to MNI space, performed histogram matching among the CamCAN subjects using the method proposed in [21], and normalized the pixel intensities for each subject by linear scaling, with the scaling parameters computed on a randomly selected subject and then fixed for the other subjects; the background pixel intensities were set to zero. For the histogram matching in particular, the choice of the reference subject that the other subjects are matched to does not significantly affect the detection results. To mitigate possible domain gaps caused by different acquisition protocols between CamCAN-T2 and BRATS17 and between CamCAN-T1 and ATLAS, we matched the histograms of the BRATS17 subjects to the same CamCAN subject's T2-weighted image as used for CamCAN-T2, and the ATLAS subjects to the same CamCAN subject's T1-weighted image as used for CamCAN-T1, in both cases after normalizing the CamCAN subjects among themselves for each modality. BRATS17 and ATLAS were normalized in the same way as CamCAN.
All computations were done in 2D using transversal slices. For computational efficiency, we excluded from the training data the slices at the beginning and end of each scan in the CamCAN dataset, where no brain structures are present. We also removed excessive background in all scans of the CamCAN, BRATS17 and ATLAS datasets to obtain transversal slices of a fixed size. During testing, detection was applied to all slices of a test subject independently, while the evaluation metrics were computed subject-wise.
3.2 Implementation Details
We trained two sets of normative prior models using the VAE and GMVAE, respectively. For each method, one prior model was trained on T1-weighted images and another on T2-weighted images from the CamCAN dataset. The BRATS17 and ATLAS images were not used during the training of the prior models.
Our VAE architecture consists of an encoder network and a decoder network. The encoder consists of 6 downsampling residual blocks (illustrated in Fig. 3) with 16, 32, 64, 128, 256 and 512 filters. Downsampling is done with a stride of 2, and upsampling with bilinear interpolation. The decoder network is symmetrical to the encoder network. The size of the latent variable is fixed across experiments (different latent dimensions are explored in Section 3.6.2). We implemented the GMVAE following the repository of [6] (https://github.com/NatD/GMVAE), except that the GMVAE shared the same network structures for the encoder and decoder as the VAE model to ensure a fair comparison between the two models. The latent space of the GMVAE was therefore of the same dimension as that of the VAE. To evaluate the model extensively, we also trained the GMVAE model with different numbers of Gaussian mixture components and different latent dimensions; see Section 3.6.2. Leaky-ReLU activation is applied to all hidden layers, while identity activation is applied only to the output layers and the layers that connect to the latent variables.
To provide a baseline achieved by supervised methods, a U-Net is implemented with the same encoder and decoder structure, with the encoder and decoder connected by skip connections. The cross-entropy loss is used to train the network. This choice of U-Net structure ensures consistency among experiments and may not match the state-of-the-art accuracy on the BRATS17 leaderboard.
We applied the proposed MAP-based detection method to the test images using the appropriate prior models. Based on the observed convergence of the gradient ascent optimization, we performed 500 restoration steps, changing the step size after the first 100 iterations. This strategy was observed to lead to higher MAP values at the end of the optimization; the evaluation metrics were not used to determine it. We refer to our proposed variants as VAE (TV) and GMVAE (TV) when presenting the results.
For model comparison, we implemented four different methods that use the prior projection approach explained in the introduction: VAE256, VAE128 and AAE128, which correspond to models trained with images resized to 256×256 and 128×128, as described in [5], and AnoGAN, proposed by [26], with all hyperparameters tuned on our datasets. We compared the detection accuracy of the proposed MAP-based restoration approach using the VAE and GMVAE priors to these prior works.
Lastly, we used different evaluation metrics to present a fuller picture of the performance of the proposed model and of the methods used for comparison. First, we computed ROC curves using the continuous-valued detection maps $|y - \hat{x}|$; the ROC curves were computed over the detections in the entire dataset. From the ROC curve, we extracted the Area Under the Curve (AUC), which was used as the first metric. Using the procedure explained in Section 2.2.3, we also computed thresholds at different FPR limits, namely 1%, 5% and 10%, for all algorithms. We thresholded the absolute-valued detection maps, e.g. $|y - \hat{x}|$ for our model, to construct binary detection maps and computed the Dice Similarity Coefficient between the ground truth segmentations and the detections, referred to as DSC1, DSC5 and DSC10, respectively. The DSC values were computed subject-wise, pooling all transversal slices of a subject together. To assess the efficacy of the threshold determination method, we also extracted from the ROC curve the threshold that yielded the biggest difference between the true positive and false positive rates, as explained previously. This threshold was also used to compute the DSC for all methods and forms a baseline, which we refer to as DSC_AUC when presenting the results.
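The subject-wise DSC computation can be sketched as follows; a minimal numpy implementation of the standard Dice formula applied to the pooled slices of one subject.

```python
import numpy as np

def dice(pred, gt):
    """Dice Similarity Coefficient between binary masks, with all
    transversal slices of one subject pooled together."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    # Convention: two empty masks agree perfectly
    return 2.0 * np.logical_and(pred, gt).sum() / denom if denom else 1.0
```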
Table 1: Detection performance on the BRATS17 dataset (DSC values are mean ± standard deviation).

Methods  AUC  DSC1  DSC5  DSC10  DSC_AUC
VAE (TV) (ours)  0.80  0.34±0.20  0.36±0.27  0.40±0.24  0.34±0.18
GMVAE (TV), c=3 (ours)  0.82  0.21±0.20  0.39±0.22  0.38±0.20  0.35±0.20
GMVAE (TV), c=6 (ours)  0.81  0.31±0.14  0.40±0.22  0.37±0.16  0.33±0.19
GMVAE (TV), c=9 (ours)  0.83  0.32±0.23  0.45±0.20  0.42±0.19  0.36±0.19
VAE256  0.67  0.06±0.06  0.18±0.13  0.25±0.20  0.20±0.14
VAE128  0.69  0.09±0.06  0.19±0.15  0.26±0.17  0.22±0.14
AAE128  0.70  0.03±0.03  0.18±0.14  0.23±0.15  0.23±0.13
AnoGAN  0.65  0.02±0.02  0.10±0.06  0.19  0.19±0.10
UNet (supervised)  /  /  /  /  0.85*
Table 2: Detection performance on the ATLAS dataset (DSC values are mean ± standard deviation).

Methods  AUC  DSC1  DSC5  DSC10  DSC_AUC
VAE (TV) (ours)  0.79  0.10±0.06  0.11±0.05  0.11±0.05  0.11±0.07
GMVAE (TV), c=3 (ours)  0.79  0.06±0.06  0.09±0.07  0.08±0.07  0.07±0.07
GMVAE (TV), c=6 (ours)  0.79  0.10±0.09  0.12±0.12  0.08±0.07  0.08±0.08
GMVAE (TV), c=9 (ours)  0.77  0.08±0.07  0.10±0.08  0.07±0.07  0.08±0.07
VAE256  0.66  0.00±0.00  0.01±0.01  0.02±0.02  0.02±0.02
VAE128  0.64  0.00±0.00  0.01±0.01  0.01±0.01  0.01±0.01
AAE128  0.63  0.00±0.00  0.01±0.01  0.01±0.01  0.01±0.01
AnoGAN  0.64  0.00±0.00  0.01±0.01  0.02±0.02  0.01±0.01
UNet (supervised)  /  /  /  /  0.50*
3.3 Results
3.3.1 Model Performance and Comparisons
We present the quantitative results in Tables 1 and 2 with summary metrics. The tables present a single AUC value for each method, as this metric was computed over the entire dataset, while the DSC values were computed subject-wise and are reported as mean and standard deviation. The two variants of the proposed MAP-based detection method, with the VAE and GMVAE as the prior models, are shown in the top rows. For the GMVAE, we show results for different numbers of clusters, i.e. c = 3, 6 and 9. The baseline methods are shown in the following rows.

In our experiments with the BRATS17 dataset, the baseline methods, namely VAE256, VAE128, AAE128 and AnoGAN, yielded AUC values of 0.70 or lower. In comparison, the proposed method improved the AUC score to 0.80 or higher for all prior models. Similar results were obtained in the experiments with the ATLAS dataset: the variants of the proposed MAP-based detection method yielded higher AUC for both prior terms. The GMVAE prior model yielded a modestly higher AUC than the VAE prior model on the BRATS17 dataset and essentially the same AUC values on the ATLAS dataset. The ROC curves for both datasets are presented in Figure 4.
The DSC values paint a similar picture to the AUC with respect to the baseline methods. In the experiments with the BRATS17 dataset, at the most conservative FPR limit, 1%, the baseline methods achieved DSC values lower than 0.10. The proposed method improved the mean DSC to 0.34 with the VAE prior and to more than 0.20 for all the GMVAE variations. When the FPR limit was relaxed to 5%, all methods yielded higher DSC than at the 1% limit. Despite the increase, the DSC of the baseline methods remained lower than 0.20, while the proposed method improved the mean DSC substantially over the best baseline for all the prior terms. In the least conservative setting, with the FPR limit at 10%, the DSC of the best baseline method (VAE128) increased to 0.26, which is comparable to the DSC of the proposed method with the GMVAE prior at the 1% FPR limit. The DSC values of the other baseline methods were also higher at this FPR limit.
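The thresholding scheme behind DSC1/DSC5/DSC10 can be sketched as follows. This is a minimal sketch under our own assumptions, not the authors' implementation: a threshold is chosen so that the false-positive rate over healthy pixels stays within the limit, and Dice is computed on the resulting binary map.

```python
# Hedged sketch (assumed, not the authors' code): set a threshold that
# respects an FPR limit, then compute the Dice score of the binary map.
import numpy as np

def threshold_at_fpr(scores, labels, fpr_limit):
    """Threshold whose false-positive rate over healthy pixels
    (labels == 0) does not exceed fpr_limit."""
    healthy = scores[labels == 0]
    # The (1 - fpr_limit) quantile keeps at most fpr_limit of the
    # healthy pixels above the threshold.
    return np.quantile(healthy, 1.0 - fpr_limit)

def dice(pred, labels):
    inter = np.logical_and(pred, labels).sum()
    denom = pred.sum() + labels.sum()
    return 2.0 * inter / denom if denom > 0 else 1.0

scores = np.array([0.1, 0.9, 0.2, 0.8, 0.3, 0.7])  # toy anomaly scores
labels = np.array([0, 1, 0, 1, 0, 1])              # toy ground truth
t = threshold_at_fpr(scores, labels, 0.05)         # 5% FPR limit
print(dice(scores >= t, labels))
```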
Similar trends were observed in the experiments with the ATLAS dataset: the DSC values of the proposed method were substantially higher than those yielded by the baseline methods for both prior terms. However, the DSC values of all methods were substantially lower on the ATLAS dataset than on BRATS17, reflecting the difficulty of the detection task in this dataset. As can be seen in the visual examples in Figure 6, the lesions can be smaller and have intensities similar to those of normal structures. These lower DSC values also demonstrate the current limitations of unsupervised lesion detection approaches and suggest substantial room for improvement.
In our experiments, there was no clear winner between the VAE and GMVAE priors in the MAP restoration approach. The GMVAE prior yielded higher AUC and higher mean DSC at the 5% and 10% FPR limits than the VAE prior on the BRATS17 dataset, but this improvement was not observed on the ATLAS dataset.
The DSC_AUC values are presented in the rightmost column of both tables. These scores were obtained with thresholds computed retrospectively with knowledge of the ROC curves; therefore, they cannot be used for evaluation on their own. However, they are useful for assessing the efficacy of the automatic threshold-identification method described in Section 2.2.3. We observe that the DSC values obtained with the automatic method were similar to those set using the ROC curves for all methods, suggesting that the automatic method is indeed appropriate for identifying thresholds to generate binary detection maps.
Lastly, we present visual results obtained by the proposed method with the GMVAE prior model in Figures 5 and 6. The figures show both the continuous-valued detection maps and the binary detection maps thresholded at the selected FPR limits. From the BRATS17 dataset, to give a better perspective of the performance, we selected images with large and small tumors to show how the detection performance was affected by tumor size. Cases with large tumors are shown in columns A to D and small tumors in columns E to G of Figure 5. In the restored images (row 2), the abnormal intensities have been substantially reduced toward the normal range, resulting in detectable intensity changes in the difference images (row 3). For large tumors, the thresholded detection results match well with the ground-truth segmentations. Smaller tumors were more difficult for the method to detect accurately. In particular, the method may include "abnormal-looking" healthy tissue in the detection in addition to the small tumors (column F), or produce no detections at all, overlooking the tumors. We also note that although the detection becomes less accurate for small tumors, the method does not flag large regions of healthy pixels as abnormal. Comparing binary detections at different FPR limits, we observe that as the FPR limit increases, the detections include more tumor pixels as well as more non-tumor pixels, as expected.
Visual results for the ATLAS dataset are shown in Figure 6. This problem is much harder due to the size and intensity characteristics of the lesions. Comparing the continuous-valued detection maps in the third row with the ground-truth segmentations in the last row shows that the model is able to highlight the lesions, but with additional areas covering normal structures. The binary detections in the fourth to sixth rows show this more clearly: while the lesion pixels are often captured in the outlier maps, multiple areas displaying normal anatomy are also captured, ultimately yielding lower DSC values.
3.4 Accuracy Analysis with Lesion Size
To observe the relation between detection accuracy and lesion size, we calculated the Dice scores for all test subjects in BRATS17 and ATLAS. Specifically, we measured the lesion size as the number of annotated lesion pixels. The relation is visualized as scatterplots in Figure 5 using the best-performing model for each dataset.
The scatterplots for the two datasets both indicate a tendency for higher Dice scores on subjects with larger lesions. The tendency is more obvious on the ATLAS dataset and weaker on BRATS17. As shown in Figure 6(a) and (b), the FPR for BRATS17 is mostly between 0.10 and 0.20 while the FPR for ATLAS is mostly lower than 0.10. Although the accuracy on BRATS17 is significantly higher than on ATLAS, the low FPR on ATLAS may be caused by the large number of true negatives. On the other hand, as shown in Figure 5(b), the highest Dice score achieved on ATLAS is for the subject with the largest lesion, and very small lesions mostly give Dice scores lower than the best average Dice of 0.12. However, as seen in Figure 5(a), this accuracy-size relation is less obvious on BRATS17 than on ATLAS. With sizes larger than approximately 25,000 pixels, Dice scores higher than the optimal average Dice of 0.45 can be achieved. As the lesion size increases, the Dice scores can vary from 0.1 to 0.8. The highest Dice score is also not obtained for the relatively larger lesions with more than 200,000 pixels. Besides lesion size, other characteristics of the lesion, such as intensity, location and shape, may also affect the accuracy in complicated ways.
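The per-subject quantities behind these scatterplots can be sketched as follows. This is an illustrative sketch with assumed data layouts; `size_dice_pairs` is a hypothetical helper, not part of the paper's code.

```python
# Hedged sketch of the size-vs-accuracy analysis (assumed data layout):
# lesion size is the number of annotated lesion pixels per subject,
# paired with that subject's Dice score for the scatterplot.
import numpy as np

def size_dice_pairs(gt_volumes, pred_volumes):
    pairs = []
    for gt, pred in zip(gt_volumes, pred_volumes):
        size = int(np.count_nonzero(gt))          # annotated lesion pixels
        inter = np.logical_and(gt, pred).sum()
        denom = gt.sum() + pred.sum()
        dsc = 2.0 * inter / denom if denom > 0 else 1.0
        pairs.append((size, dsc))
    return pairs

# Two toy "subjects" with flattened binary maps.
gt = [np.array([1, 1, 0, 0]), np.array([1, 0, 0, 0])]
pred = [np.array([1, 0, 0, 0]), np.array([1, 1, 0, 0])]
print(size_dice_pairs(gt, pred))
```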
3.5 3D Consistency in Detection
3.6 Model Analysis with GMVAE Prior
To further analyze the behavior of the method, we investigated the effects of the model hyperparameters on the performance and the convergence behavior of the optimization. Our analysis focuses on the proposed method used with the GMVAE prior, as this version led to slightly higher performance and has additional hyperparameters. We experimented with the data-consistency weight, the number of Gaussian mixture components c, and the dimension of the latent variable.

c  dim  AUC  DSC1  DSC5  DSC10  DSC_AUC
c=1  256  0.80  0.32 ± 0.22  0.34 ± 0.24  0.36 ± 0.24  0.32 ± 0.19
c=1  512  0.80  0.33 ± 0.24  0.36 ± 0.24  0.39 ± 0.23  0.35 ± 0.19
c=1  1024  0.79  0.30 ± 0.22  0.31 ± 0.23  0.33 ± 0.23  0.31 ± 0.18
c=3  256  0.82  0.20 ± 0.18  0.32 ± 0.23  0.30 ± 0.21  0.30 ± 0.18
c=3  512  0.82  0.21 ± 0.20  0.39 ± 0.22  0.38 ± 0.20  0.35 ± 0.20
c=3  1024  0.82  0.20 ± 0.19  0.34 ± 0.19  0.33 ± 0.19  0.32 ± 0.20
c=6  256  0.81  0.30 ± 0.21  0.40 ± 0.18  0.35 ± 0.20  0.34 ± 0.18
c=6  512  0.81  0.32 ± 0.23  0.44 ± 0.21  0.41 ± 0.20  0.35 ± 0.18
c=6  1024  0.81  0.29 ± 0.22  0.38 ± 0.23  0.35 ± 0.19  0.33 ± 0.19
c=9  256  0.82  0.32 ± 0.30  0.38 ± 0.19  0.36 ± 0.15  0.31 ± 0.15
c=9  512  0.83  0.31 ± 0.23  0.45 ± 0.20  0.42 ± 0.19  0.36 ± 0.19
c=9  1024  0.82  0.30 ± 0.14  0.38 ± 0.22  0.35 ± 0.20  0.34 ± 0.18
3.6.1 Data Consistency Weighted with Different Values
The weight parameter is selected automatically using the method described in Section 2.2.2. For this selection, the method relies on an error term computed using a set of validation images. Here we investigate how this error term changes with the weight. First, we plot the validation error against the weight in Figures 9(a) and 10(a) to show the behavior of this term. The plots are generated using GMVAE prior models trained with T2- and T1-weighted images from the CamCAN dataset, respectively. The error term was evaluated on the validation images for different weight values. We observe a dip in the curves, indicating optimal weight values that yielded the least amount of change on validation images consisting only of healthy anatomy. These values were used in the experiments on the BRATS17 and ATLAS datasets with the GMVAE prior model presented previously.
A natural question that arises is whether choosing the weight according to the validation error corresponds to choosing the best weight according to the detection accuracy on a specific lesion dataset. To answer this question and analyze the sensitivity of the detection results to this parameter, we repeated the detection experiments with multiple weight values in the [1.0, 9.0] range and evaluated the performance of these different models using AUC. Figures 9(b) and 10(b) plot the results of these experiments.
First, we observe that the weight value can influence the AUC: very low values yielded lower AUC in both datasets. Second, the values chosen using the validation error are not far from those that yielded the maximum AUC for all the prior terms in both datasets. Choosing the weight with the method described in Section 2.2.2 led to an AUC at most 0.01 below the maximum.
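The selection rule described above, pick the weight at the dip of the validation error curve, can be sketched as follows. This is a hedged sketch: the toy quadratic stands in for the paper's actual error term, and `select_weight` is a hypothetical helper.

```python
# Hedged sketch of the automatic weight selection: evaluate a validation
# error term for each candidate weight and keep the minimizer. The toy
# error function below is illustrative, not the paper's actual term.
import numpy as np

def select_weight(weights, validation_error):
    errors = np.array([validation_error(w) for w in weights])
    return weights[int(np.argmin(errors))]

weights = np.arange(1.0, 9.5, 0.5)          # candidate values in [1.0, 9.0]
toy_error = lambda w: (w - 4.0) ** 2 + 1.0  # curve with a dip at w = 4
print(select_weight(weights, toy_error))    # picks the dip of the curve
```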
3.6.2 Analysis of the Gaussian Mixture Prior
GMVAE uses a Gaussian mixture model as the prior distribution in the latent space to allow fitting more complex normative distributions. This model has two hyperparameters: the number of Gaussian mixture components and the dimension of the latent space. In this section, we present analyses showing the effect of these parameters on the detection performance on the BRATS17 dataset.
For a range of cluster numbers c and latent space dimensions, we trained multiple prior models and detected lesions on the BRATS17 dataset. Table 3 presents the detection results for the different settings. We notice that in terms of AUC, the parameters have minimal effect. For DSC, 512 latent dimensions seems to provide the highest scores for all FPR limits and cluster numbers. The numerical differences, however, are small compared to the standard deviations. The biggest differences are arguably in DSC1 and DSC5, where increasing the number of clusters seems to improve DSC. Additionally, GMVAE with c=1 is a special case of GMVAE that is similar to VAE. We empirically validate this by reporting the results of GMVAE with c=1 in Table 3 and Figures 10 and 11: GMVAE(TV) with c=1 gives results very similar to VAE(TV).
3.6.3 Convergence of Image Restoration
As the image restoration is performed iteratively, it is important to show that the restoration converges stably. We evaluated the AUC on the BRATS17 and ATLAS datasets every 50 steps during the MAP optimization and plot the evolution of AUC with respect to the gradient-ascent steps in Figures 9(c) and 10(c). For the restoration on BRATS17, the plot shows a sharp increase in the first 100 iterations, followed by a gradual rise to stable values of around AUC = 0.8 by 500 iterations for all the selected weight values. For the restoration on ATLAS, the plot shows a similar pattern, with the AUC increasing over the first 300 iterations and stabilizing around 0.80. The plots indicate that the iterative restoration converged and that the model performance improved stably with increasing restoration steps.
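The iterative restoration loop can be sketched in miniature. This is a minimal runnable sketch under our own assumptions: a toy Gaussian log-prior gradient replaces the trained GMVAE, and the quadratic data-consistency term is an illustrative stand-in for the paper's formulation.

```python
# Hedged sketch of iterative MAP restoration by gradient ascent
# (toy prior and data term; not the authors' implementation).
import numpy as np

def restore(y, log_prior_grad, lam, steps=500, lr=0.05):
    """Ascend log p(x) - lam * ||x - y||^2 starting from the input image y."""
    x = y.copy()
    for _ in range(steps):
        x += lr * (log_prior_grad(x) - 2.0 * lam * (x - y))
        # Detection quality (e.g. AUC) could be logged every 50 steps here
        # to monitor convergence, as done in the experiments above.
    return x

y = np.array([0.0, 2.0, 10.0])      # last entry mimics an abnormal intensity
prior_grad = lambda x: -(x - 1.0)   # toy Gaussian prior centered at 1
residual = np.abs(y - restore(y, prior_grad, lam=1.0))
print(residual.argmax())            # the outlier yields the largest residual
```

The residual between the input and its restoration plays the role of the continuous-valued detection map: pixels the prior cannot explain are pulled furthest, leaving the largest differences.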
4 Conclusions
We proposed an unsupervised detection method that restores images using a normative prior estimated for healthy images. Specifically, the image prior was learned with autoencoding-based methods, VAE and GMVAE, and the images were iteratively restored with MAP estimation. Detection experiments on brain lesion datasets showed that the proposed method achieved better AUC and Dice scores, significantly outperforming existing methods. Extended model analysis of GMVAE indicated that the prior learned by this model is robust to parameter selection and that the optimization converges stably. At the same time, the model has limited capability in detecting small and subtle lesions. Moreover, some lesions may cause large deformations in the surrounding healthy structures. Although the data-consistency term discourages the model from flagging such deformations as abnormal, there is limited guarantee that large deformations will not be detected as abnormalities. We have not observed this in our experiments; nevertheless, it remains a limitation. To improve performance, future work may impose noise assumptions that better describe the lesions and use adaptive threshold selection.
Acknowledgments
We thank the Swiss National Science Foundation (SNSF) and the Platform for Advanced Scientific Computing (PASC) for funding this project (project no. 205321_173016), as well as Nvidia for GPU donations.