
CVAD: A generic medical anomaly detector based on Cascade VAE

by Xiaoyuan Guo, et al.
Mayo Foundation for Medical Education and Research
Emory University

Detecting out-of-distribution (OOD) samples in medical imaging plays an important role in downstream medical diagnosis. However, existing OOD detectors are demonstrated on natural image benchmarks composed of distinct classes and have difficulty generalizing to medical images. The key issue is the granularity of OOD data in the medical domain, where intra-class OOD samples predominate. We focus on the generalizability of OOD detection for medical images and propose a self-supervised Cascade Variational autoencoder-based Anomaly Detector (CVAD). We use a cascade variational autoencoder architecture, which combines latent representations at multiple scales, before they are fed to a discriminator to distinguish OOD data from in-distribution (ID) data. Finally, both the reconstruction error and the OOD probability predicted by the binary discriminator are used to determine the anomalies. We compare the performance with state-of-the-art deep learning models to demonstrate our model's efficacy on various open-access medical imaging datasets for both intra- and inter-class OOD. Further extensive results on datasets including common natural datasets show our model's effectiveness and generalizability. The code is available at



1 Introduction

Figure 1: ID, Intra- and Inter-class OOD examples for medical images. Compared to natural images, medical OOD samples exhibit more subtle intra-class variations (e.g., normal vs pneumonia in the 1st row and benign vs malignant in the 2nd row).

Despite recent advances in deep learning that have contributed to solving various complex real-world problems [7], the safety and reliability of AI technologies remain a major concern in medical applications. Deep learning models for medical tasks are typically trained on data from known distributions; at inference time they can fail to identify out-of-distribution (OOD) inputs and may even assign high probabilities to anomalies, owing to their insensitivity to distribution shift. Medical anomalies, a.k.a. OOD data or outliers, can arise for various reasons, such as noise during data acquisition, changes in disease prevalence and incidence (e.g., the evolution of rare cancer types), or inappropriate inputs (e.g., modalities unseen during training) [9]. To ensure the reliability of deep models’ predictions, it is necessary to identify unknown types of data that differ from the training data distribution. A good anomaly detector should capture the variations between the in-distribution (ID) data used in training and OOD data from the open world, and thus identify the outliers. However, the core challenges for medical anomaly detection are: (1) OOD data is usually unavailable at model training time; (2) in theory, there are infinitely many variations of OOD data; and (3) different types of OOD data can be identified with varying difficulty. In general, OOD classifications [5] can be refined by variation difference into inter-class OOD data and intra-class OOD data. Inter-class OOD data exhibits larger variations from the ID data, whereas intra-class OOD data lies close to the ID data, as observed in Fig. 1. Thus, identifying intra-class OOD data is more difficult than inter-class OOD data, given its subtle differences from ID data.

To cope with the OOD unavailability and uncertainty challenges, we design our anomaly detector in an unsupervised way. To identify hard OOD cases accurately, we expect our model to learn both coarser and finer features to screen various dissimilar inputs. Inspired by [18, 4], we propose a generative anomaly detector – the Cascade Variational autoencoder based Anomaly Detector (CVAD) – built on top of a branch-cascaded VAE, pchVAE [40]. With the cascade VAE architecture modeling the in-distribution representations, CVAD gains superior reconstructions and learns good-quality features to threshold out OOD data. CVAD's ability to detect anomalies is further enhanced by training a binary discriminator that treats reconstructed data, with random perturbations on the aforementioned cascade VAE's latent parameters, as the OOD category. In this paper, our contributions are three-fold:

  • We propose a novel OOD detector – CVAD. By utilizing a cascade VAE to learn latent variables of in-distribution data, CVAD achieves good reconstruction of in-distribution inputs and gains discriminative ability for OOD data based on the reconstruction error.

  • We adopt a binary discriminator to further separate in-distribution data from OOD data by treating the reconstructed images as fake OOD samples. We add minor random disturbances to the VAE latent parameters during fake-data generation to enrich data variation. Thus, our model has better discriminative capability for inter-class as well as intra-class OOD cases.

  • We conduct extensive experiments on multiple public medical image datasets to demonstrate the generalization ability of our proposed model. We evaluate comprehensively against state-of-the-art anomaly detectors in detecting both intra-class and inter-class OOD data, showing improved performance.

2 Related Work

Although there has been extensive research on outlier detection [1, 7], effective medical image OOD detectors are still lacking due to complicated data types (e.g., various modalities and protocols, differences in acquisition devices) and user-defined application scenarios (e.g., disease types). Without OOD data available during training, unsupervised anomaly detection has become the mainstream research direction, to which CVAD also belongs. Recent unsupervised anomaly detection approaches can be roughly classified into two main categories – generative and objective.

2.1 Generative methods

Deep generative models appear promising for detecting OOD data since they can learn latent features of the training data and generate synthetic data with features similar to the known classes [20]. Thus, the compressed latent features can be used to distinguish OOD data from ID data. Two major families of deep generative models are Variational Autoencoders (VAEs) [15] and Generative Adversarial Networks (GANs) [10].

VAEs: Traditional AutoEncoders (AEs) [6] can reconstruct input images well, but risk degenerating into an identity mapping of deep image features. Comparatively, VAEs generate content by regularizing the latent feature distributions. With this trait, the VAE [15] and its modifications have been used widely to generate realistic synthetic images [13, 40, 34]. Although VAEs are theoretically elegant and easy to train, with nice manifold representations, they usually produce blurry images that lack detailed information [13, 4]. To improve reconstruction quality, pchVAE [40] adds a conditional hierarchical VAE branch to learn lower-level image components. The improved reconstructions of VAEs have been adopted for detecting OOD samples based on reconstruction quality [3]. Other approaches seek to enhance the reliability of VAE uncertainty estimation for better performance [26, 38, 27, 7, 33]. Reference [27] applies an improved noise contrastive prior (INCP) to acquire reliable uncertainty estimates for standard VAEs, whereas the Bayesian VAE [7] detects OOD by estimating a full posterior distribution over the decoder parameters using stochastic gradient Markov chain Monte Carlo. Nonetheless, most VAE-based OOD detectors are only evaluated on natural image datasets (MNIST [19], FashionMNIST [37], CIFAR10 [16], [23], etc.), which have small image sizes and clear intra- and inter-class variations.

GANs: Compared with VAEs, GANs usually generate much sharper images but face challenges in training stability and sampling diversity, especially when synthesizing high-resolution images [13]. Still, GANs remain popular in outlier detection; AnoGAN [31], ADGAN [8] and GANomaly [2] all detect OOD samples using GAN architectures. Besides the standard architectures, there are hybrid models that detect anomalies by combining an AE/VAE with a GAN [18, 4, 25]. To acquire competitive OOD discriminative ability, OCGAN [25] integrates four components – a denoising auto-encoder, two discriminators and a classifier – with a complicated training process. Generally, such hybrid networks are not competitive on image datasets with clear class variations, as reported in [18]. Their experiments are often done on small-sized images and may fail on large-sized medical images.

2.2 Objective methods

Objective anomaly detectors learn to identify OOD data via specific optimization functions and auxiliary transformations. Such OOD detection approaches include classifier-based and transformation-based methods [29, 35, 21].

Classifier-based method: ODIN [21] uses temperature scaling and adds small perturbations to the input data to separate the softmax score distributions of ID and OOD images. Similar separation via a multi-class classifier is also followed by [39]. However, the prerequisite of balanced multiple classes is not always met in medical applications. Comparatively, the one-vs-rest setup [22], which treats one class as in-distribution data and evaluates performance on the remaining data as OOD, is much more common and useful in medical OOD detection. Under this setting, anomaly detection reduces to a one-class classification (OCC) problem [14]; representative one-class classifiers are OCSVM [32] and DeepSVDD [29].

Transformation-based method: Most anomaly detectors are unsupervised, given the assumption that anomalies are unavailable during training. Hence, good detection performance largely depends on learning high-quality in-distribution features. Self-augmentation with transformations of the training data not only enriches training diversity but also introduces discriminative knowledge; for example, [35] proposes contrasting shifted instances for anomaly detection. Nevertheless, such augmentations cover a limited set of transformations and increase training time, since the additional generated data are fed in as fake OOD data. Our model CVAD uses no additional augmentations but still captures high-quality representations of in-distribution data.

In addition, many other approaches contribute to OOD detection, such as GradCon [17], generalized ODIN [11] and FSSD [12]; please refer to those papers for more details.

3 Methods

Figure 2: Proposed CVAD architecture - CVAE as the generator and a separate binary classifier (C) as the discriminator.

Anomaly detection includes both intra- and inter-class OOD identification, of which medical intra-class OOD data is much more challenging because of its minute dissimilarity from ID data. With no prior knowledge available and no sophisticated pre-processing, we utilize a variational autoencoder to learn the “normality” of in-distribution inputs via image reconstruction, and enhance the discriminative ability for both OOD classes via a binary discriminator. Both the reconstruction and the discrimination contribute to accurate intra- and inter-class OOD detection.

3.1 CVAD architecture

Fig. 2 shows the design of CVAD. Inspired by the GAN architecture, we adopt the VAE architecture as the “generator” to model ID representations, and a separate classifier as the “discriminator” to strengthen OOD discrimination.

A standard VAE module consists of two neural networks: an encoder and a decoder [15], with the encoder (parameterized by φ) mapping the visible variables x to the latent variables z and the decoder (parameterized by θ) sampling the visible variables given the latent variables [13]. Given a dataset of input vectors drawn from some underlying data distribution p(x), φ and θ are then learned by maximizing the variational lower bound (ELBO), which lower-bounds the marginal log-likelihood log p(x) [7]. However, a vanilla VAE exhibits limited potential for distinguishing unseen distributions due to its blurry reconstructions of large-size images. Thus, we adopt a modified VAE architecture – pchVAE [40] – for high-quality reconstruction and better latent representations; it improves the reconstruction by adding a branch VAE to the standard VAE pipeline and then cascading the two representations for the final output. For convenience, we use pchVAE and CVAE interchangeably.
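In practice, maximizing the ELBO amounts to minimizing a reconstruction term plus a KL term; a minimal PyTorch sketch, assuming an MSE reconstruction loss (the exact reconstruction loss is not fixed in the text):

```python
import torch
import torch.nn.functional as F

def elbo_terms(x, x_recon, mu, logvar):
    # Negative-ELBO pieces for a Gaussian-latent VAE: a reconstruction
    # term (MSE here is an assumed choice) plus the closed-form KL between
    # q(z|x) = N(mu, diag(exp(logvar))) and the prior N(0, I),
    # both averaged over the batch.
    recon = F.mse_loss(x_recon, x, reduction="sum") / x.size(0)
    kl = -0.5 * torch.mean(
        torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
    )
    return recon, kl
```

Both terms vanish for a perfect reconstruction with a posterior equal to the prior, which is the sanity check used below.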

Generator: Different from the standard VAE, the CVAE has two encoders and two decoders. To learn the high-level features, a deep, standard VAE path formulates the deep latent variables by sampling from the Gaussian parameters (mean and variance) produced by its encoder. Meanwhile, the low-level features are learnt by the branch VAE. Instead of using the original input, the branch VAE takes the concatenation of two intermediate feature maps, one from the primary encoder and one from the primary decoder. The branch encoder is simpler than the primary encoder, whereas the branch decoder has the same architecture as the primary decoder. The branch VAE formulates its own latent Gaussian distribution; after sampling, the two sets of latent variables are decoded to image context and finer details, respectively, and the final reconstruction is the combination of the two.
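The data flow just described can be sketched as follows; every module argument is a hypothetical stand-in for a block of Fig. 2, and the additive combination at the end is an assumption (the text only states that the two outputs are combined):

```python
import torch

def cvae_forward(enc_head, enc_tail, dec_head, dec_tail,
                 branch_enc, branch_dec, x):
    # Primary (deep) VAE path: high-level context.
    h = enc_head(x)                     # intermediate encoder features
    mu1, logvar1 = enc_tail(h)
    z1 = mu1 + torch.exp(0.5 * logvar1) * torch.randn_like(mu1)
    g = dec_head(z1)                    # intermediate decoder features
    coarse = dec_tail(g)                # coarse (context) reconstruction
    # Branch VAE path: fed the concatenated intermediate features
    # from the primary encoder and decoder.
    mu2, logvar2 = branch_enc(torch.cat([h, g], dim=1))
    z2 = mu2 + torch.exp(0.5 * logvar2) * torch.randn_like(mu2)
    detail = branch_dec(z2)             # finer-detail reconstruction
    # Final output combines context and details (additive, assumed).
    return coarse, detail, coarse + detail
```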

Discriminator: Since the CVAE itself has no mechanism for distinguishing outliers, we add a binary discriminator to distinguish the reconstructed image, and its counterpart generated with a minor disturbance on the latent parameters, from the original input image. As both share very similar features with the input after the first-stage training of the image generator, the discriminator becomes highly sensitive to minor deviations from the in-distribution data, enhancing the accuracy of identifying both intra-class and inter-class OOD data.
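The perturbed counterpart can be produced by disturbing the latent Gaussian parameters before sampling; a sketch, where `noise_scale` is an assumed magnitude (the paper only states that the disturbance is minor):

```python
import torch

def fake_ood_sample(decode, mu, logvar, noise_scale=0.1):
    # Add a minor random disturbance to the latent parameters before
    # sampling, so the decoded image is a near-ID "fake OOD" example
    # for discriminator training.
    mu_p = mu + noise_scale * torch.randn_like(mu)
    logvar_p = logvar + noise_scale * torch.randn_like(logvar)
    z = mu_p + torch.exp(0.5 * logvar_p) * torch.randn_like(mu_p)
    return decode(z)
```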

3.2 Network training

Instead of training CVAD adversarially, we train the generator and the discriminator in two stages. The reason is that training with adversarial losses often yields much sharper reconstructions but ignores the low-level information of ID data, incurring high reconstruction errors and potentially dangerous decisions in medical applications. Therefore, CVAD first trains the image generator and then the binary discriminator to detect OOD data. This non-adversarial training lets CVAD inherit the merits of VAEs [15] while avoiding the instability of GANs [10].

To optimize the CVAE, we minimize two objectives, one for the primary VAE part (Eqn. 1) and one for the branch VAE part (Eqn. 2), where KL refers to the Kullback-Leibler divergence:

$\mathcal{L}_1 = \mathbb{E}_{q_{\phi_1}(z_1|x)}[-\log p_{\theta_1}(x|z_1)] + \mathrm{KL}(q_{\phi_1}(z_1|x)\,\|\,p(z_1))$  (1)

$\mathcal{L}_2 = \mathbb{E}_{q_{\phi_2}(z_2|x_b)}[-\log p_{\theta_2}(x|z_2)] + \mathrm{KL}(q_{\phi_2}(z_2|x_b)\,\|\,p(z_2))$  (2)

where $x_b$ denotes the branch VAE's input. Therefore, the CVAE loss can be formulated as Eqn. 3, with weights $\lambda_1$ and $\lambda_2$ balancing the two individual terms:

$\mathcal{L}_{CVAE} = \lambda_1 \mathcal{L}_1 + \lambda_2 \mathcal{L}_2$  (3)


The binary discriminator is trained to distinguish true/fake images using binary cross entropy.
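A minimal sketch of that objective, assuming real ID inputs are labeled 0 and generated near-ID images are labeled 1 (the label convention is an assumption):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real, d_fake):
    # Binary cross entropy over the discriminator's sigmoid outputs:
    # real ID images get target 0, generated "fake OOD" images target 1.
    real_loss = F.binary_cross_entropy(d_real, torch.zeros_like(d_real))
    fake_loss = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    return real_loss + fake_loss
```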

Anomaly score: An anomaly score is defined in Eqn. 4 based on errors during inference, and includes two parts: the reconstruction error and the probability of belonging to the anomaly class. Instead of simply adding the two parts together, we first scale the CVAE reconstruction errors into [0,1] and then take the average of the two values to avoid assigning imbalanced weights between them:

$A(x) = \frac{1}{2}\big(\tilde{E}_{rec}(x) + p_{OOD}(x)\big)$  (4)

where $\tilde{E}_{rec}(x)$ is the scaled reconstruction error and $p_{OOD}(x)$ is the discriminator's predicted OOD probability.
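The scoring can be sketched as follows, assuming min-max scaling over the evaluation set (the text says only that the errors are scaled into [0,1]):

```python
import numpy as np

def anomaly_score(recon_errors, ood_probs):
    # Min-max scale the reconstruction errors into [0, 1] over the
    # evaluation set, then average with the discriminator's predicted
    # OOD probability so neither part dominates.
    e = np.asarray(recon_errors, dtype=float)
    p = np.asarray(ood_probs, dtype=float)
    e = (e - e.min()) / (e.max() - e.min() + 1e-12)
    return 0.5 * (e + p)
```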


3.3 Network Details

As illustrated in Fig. 2, the CVAE has a standard VAE part and a branch VAE composed of a shallow encoder and a decoder. The primary VAE is a symmetric network with five convolutions (stride 2, padding 1) followed by five transposed convolutions; the encoder's first three and last two convolution layers, and the decoder's first three and last two transposed convolution layers, form the blocks shown in Fig. 2. The input of the branch VAE is formed from the intermediate encoder features and the middle decoded features of the primary VAE. The branch encoder is a single convolution layer with the same kernel, stride 2 and padding 1, while the branch decoder shares the same decoder architecture as the standard VAE. All convolutions and transposed convolutions are followed by batch normalization and leaky ReLU (slope 0.2) operations. We used a base channel size of 16, doubled the number of channels with every encoder layer, and halved it with every decoder layer. The latent dimension of the primary VAE is set to 512, and the branch VAE's latent has 2048 dimensions.

The binary discriminator is composed of five convolution layers with the same settings as above and a final fully connected layer that makes a binary prediction; a sigmoid function yields the final ID/OOD class probability.
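A sketch of such a discriminator, assuming a 4x4 kernel (the exact kernel size is not stated here) and using `LazyLinear` so the input resolution need not be fixed in advance:

```python
import torch
import torch.nn as nn

def make_discriminator(in_ch=1, base=16):
    # Five stride-2 convolutions (channels doubling from a base of 16,
    # matching the encoder settings), each followed by batch norm and
    # leaky ReLU, then a fully connected layer and a sigmoid producing
    # the OOD probability. LazyLinear infers the flattened feature size
    # on the first forward pass.
    layers, ch = [], in_ch
    for i in range(5):
        out = base * (2 ** i)
        layers += [
            nn.Conv2d(ch, out, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(out),
            nn.LeakyReLU(0.2),
        ]
        ch = out
    layers += [nn.Flatten(), nn.LazyLinear(1), nn.Sigmoid()]
    return nn.Sequential(*layers)
```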

4 Experiments

4.1 Datasets

We conducted extensive experiments verifying the generalizability and effectiveness of our approach on multiple open-access medical image datasets for intra- and inter-class OOD detection. In total, we used four independent datasets: three medical image datasets – the RSNA Pneumonia dataset [36], inferior vena cava filters on radiographs [24] (IVC-Filter for short) and the SIIM-ISIC Melanoma dataset [28] (identifying melanoma in lesion images) – and one natural image dataset, Bird Species. Among the medical datasets, the RSNA and SIIM datasets have binary classes – normal and abnormal – whereas the IVC-Filter dataset has 14 distinct types (classes). Table 2 lists the class information and number of images for each dataset, with the corresponding usage in the Details column. The Bird dataset, which contains 270 bird species with 38,518 training images, was only used as inter-class OOD data for detection validation. To unify the OOD detection pipeline and facilitate evaluation, we resized both the medical images and the validation inter-class OOD images to a unified size; the IVC-Filter and RSNA images are grayscale (channel dimension 1), and the SIIM images are in RGB format (channel dimension 3).

4.2 Implementation

We implemented our model using PyTorch 1.5.0 and Python 3.6; the balancing weights in Eqn. 3 were set equal to 1. We ran the models on 4 NVIDIA Quadro RTX 6000 GPUs with 24 GB memory each. For training, we used the Adam optimizer with a learning rate of 0.001, and each network was trained for 100-350 epochs.

4.3 Evaluation Metrics

We evaluated our anomaly detection model in terms of standard statistical metrics: (i) area under the receiver operating characteristic curve (AUROC, AUC for short), a performance metric for "discrimination" between ID and OOD data (closer to 1 is better); (ii) true positive rate (TPR), the fraction of OOD samples correctly classified as OOD (higher is better); and (iii) false positive rate (FPR), the fraction of ID samples wrongly classified as OOD (lower is better). To classify ID and OOD, a threshold must be set on the anomaly scores. Notably, the AUC value is threshold-invariant, while the TPR and FPR depend on the chosen anomaly threshold. We adopted the Geometric Mean (G-Mean) method to determine an optimal threshold on the ROC curve by tuning the decision threshold, and report the resulting FPR and TPR values. We also report the corresponding DIFF, the difference between TPR and FPR under the optimal selection, i.e., DIFF = TPR - FPR (larger is better). To be fair and thorough, we ran all experiments on both intra-class and inter-class OOD to further analyze the detectors' performance on each specific type of OOD detection.
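The G-Mean threshold selection and the derived TPR/FPR/DIFF values can be sketched as:

```python
import numpy as np

def gmean_operating_point(scores, labels):
    # Sweep every anomaly score as a candidate threshold, compute TPR
    # and FPR (labels: 1 = OOD, 0 = ID), and keep the threshold that
    # maximizes the geometric mean sqrt(TPR * (1 - FPR)).
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_ood = max((labels == 1).sum(), 1)
    n_id = max((labels == 0).sum(), 1)
    best, best_g = None, -1.0
    for t in np.unique(scores):
        pred = scores >= t
        tpr = (pred & (labels == 1)).sum() / n_ood
        fpr = (pred & (labels == 0)).sum() / n_id
        g = np.sqrt(tpr * (1.0 - fpr))
        if g > best_g:
            best = {"threshold": t, "TPR": tpr, "FPR": fpr, "DIFF": tpr - fpr}
            best_g = g
    return best
```

On a perfectly separable toy set, the chosen threshold gives TPR = 1, FPR = 0, DIFF = 1.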

4.4 Quantitative Results

To demonstrate the model's effectiveness, we set the vanilla AE and VAE architectures as baselines and compared our CVAD model with three state-of-the-art models of varying architecture – pchVAE [40], the classifier-based DeepSVDD [29], and the GAN-based GANomaly [2]. Table 1 shows the models' performance for intra-class OOD detection, and Table 2 primarily presents the inter-class OOD performance. The selections of in-class data, intra-class OOD and inter-class OOD data are summarized in the Details of Table 2.

4.4.1 Results for Intra-class OOD Detection

Intra-class OOD images are the most challenging outliers to identify, since they often share similarities with the ID data while belonging to a different class with unique characteristics. This similarity makes identifying this type of OOD data difficult, especially for medical images: as illustrated in Fig. 1, the variations between benign and malignant skin cancer images are not as obvious as those between natural objects. Still, CVAD exhibits superiority in detecting intra-class OOD for medical images. On the RSNA dataset, CVAD achieves the best DIFF value of 0.322 and AUC score of 0.699 (+0.129 over DeepSVDD's AUC of 0.570, +0.123 over GANomaly's 0.576). On IVC-Filter, though GANomaly obtains the highest DIFF and AUC values, CVAD shows competitive performance with an AUC of 0.582. On the SIIM dataset, DeepSVDD has the largest DIFF value of 0.407, but CVAD reaches the second-best DIFF of 0.393; moreover, CVAD attains the best AUC score of 0.750. Overall, CVAD performs stably and effectively for intra-class OOD detection, except for the sub-optimal results on the IVC-Filter dataset. The reason is the training data size: with only 196 training images for IVC-Filter, CVAD may not be able to learn sufficient ID feature representations. Nevertheless, CVAD still outperforms DeepSVDD in AUC on the IVC-Filter dataset.

Methods         RSNA (FPR / TPR / DIFF / AUC)     IVC-Filter (FPR / TPR / DIFF / AUC)   SIIM (FPR / TPR / DIFF / AUC)
AE [30]         0.318 / 0.461 / 0.143 / 0.566     0.443 / 0.526 / 0.083 / 0.520         0.403 / 0.685 / 0.282 / 0.673
VAE [3]         0.381 / 0.611 / 0.230 / 0.614     0.426 / 0.525 / 0.099 / 0.524         0.442 / 0.740 / 0.298 / 0.676
pchVAE [40]     0.498 / 0.737 / 0.239 / 0.604     0.475 / 0.567 / 0.092 / 0.529         0.399 / 0.568 / 0.169 / 0.616
DeepSVDD [29]   0.399 / 0.509 / 0.110 / 0.570     0.545 / 0.713 / 0.168 / 0.522         0.276 / 0.683 / 0.407 / 0.740
GANomaly [2]    0.524 / 0.678 / 0.154 / 0.576     0.409 / 0.603 / 0.194 / 0.584         0.553 / 0.495 / -0.058 / 0.418
CVAD (ours)     0.321 / 0.643 / 0.322 / 0.699     0.541 / 0.706 / 0.165 / 0.582         0.381 / 0.774 / 0.393 / 0.750
Table 1: Intra-class OOD detection results (FPR, TPR, DIFF and AUC values) of various anomaly detectors trained on RSNA, IVC-Filter and SIIM datasets. Best results are highlighted.
RSNA. Details – In-class: normal (8,851); Intra-class: pneumonia (9,555), abnormal (11,821); InterClass1: BIRD (38,518); InterClass2: SIIM (33,125); InterClass3: IVC-Filter (1,258).
  Methods         InterClass1   InterClass2   InterClass3
  AE [30]         0.680         0.608         0.616
  VAE [3]         0.752         0.604         0.613
  pchVAE [40]     0.795         0.776         0.619
  DeepSVDD [29]   0.838         0.834         0.604
  GANomaly [2]    0.775         0.819         0.594
  CVAD (ours)     0.865         0.806         0.706

IVC-Filter. Details – In-class: type 11 (196); Intra-class: types 0-10, 12, 13 (1,062); InterClass1: BIRD (38,518); InterClass2: SIIM (33,125); InterClass3: RSNA (30,227).
  Methods         InterClass1   InterClass2   InterClass3
  AE [30]         0.372         0.353         0.237
  VAE [3]         0.666         0.400         0.706
  pchVAE [40]     0.775         0.321         0.846
  DeepSVDD [29]   0.864         0.979         0.889
  GANomaly [2]    0.829         0.525         0.740
  CVAD (ours)     0.916         0.705         0.844

SIIM. Details – In-class: benign (32,541); Intra-class: malignant (584); InterClass1: BIRD (38,518); InterClass2: IVC-Filter (1,258); InterClass3: RSNA (30,227).
  Methods         InterClass1   InterClass2   InterClass3
  AE [30]         0.572         0.013         0.752
  VAE [3]         0.712         0.759
  pchVAE [40]     0.943         0.992         0.684
  DeepSVDD [29]   0.986         0.992         0.804
  GANomaly [2]    0.686         0.989         0.442
  CVAD (ours)     0.993         0.993         0.831
Table 2: AUC scores predicted by OOD detectors for inter-class identification on the RSNA, IVC-Filter and SIIM datasets. The total number of samples for each class is reported in brackets in the Details. Bold indicates the best performance.
Figure 3: ROC curves of different models for intra- and inter-class OOD identification on the RSNA, IVC-Filter and SIIM datasets. Each model is shown in a different color, with the corresponding AUC score labeled in brackets.

4.4.2 Results for Inter-class OOD detection

To evaluate all the models fairly, we tested them on multiple inter-class OOD data types and present the corresponding AUC scores in Table 2. As the OOD image datasets may have image channels and sizes different from the ID training images, we adjusted the channels and resized the images to ensure a consistent input data format for evaluation (for example, to evaluate models trained on RSNA, we converted the BIRD and SIIM images to grayscale and resized them to the in-distribution image size). CVAD obtains the highest AUC values on the RSNA (except for InterClass2) and SIIM datasets across the three inter-class OOD detection evaluations. The inter-class OOD detection of CVAD on IVC-Filter is also satisfactory, with stable performance.

To further show the models' performance differences, we plotted the receiver operating characteristic (ROC) curves for all datasets and all models, evaluated on four OOD situations – intra-class, inter-class1, inter-class2 and inter-class3 OOD data. Fig. 3 shows the plots for the RSNA, IVC-Filter and SIIM datasets with the corresponding AUC scores included. Notably, the differing difficulties of detecting intra-class and inter-class OOD data are reflected in the AUC scores: most scores on inter-class OOD data are much higher than those on intra-class OOD samples, especially in the RSNA results of Fig. 3.

4.4.3 Ablation Study

Generally, CVAD exceeds the baselines' performance with clear improvements and shows competitive results for both intra- and inter-class OOD detection. Here we analyze the functionality of the “generator” and “discriminator” of CVAD. As CVAD utilizes pchVAE to learn the latent ID representation, we also report the performance of pchVAE itself on detecting intra- and inter-class OOD in Table 1 and Table 2, respectively.

For intra-class OOD detection, CVAD improves the DIFF value from pchVAE's 0.239 to 0.322 (+0.083) and the AUC score from pchVAE's 0.604 to 0.699 (+0.095) on the RSNA dataset; similarly, on the IVC-Filter dataset, CVAD raises the DIFF from 0.092 to 0.165 (+0.073) and the AUC from 0.529 to 0.582 (+0.053); on the SIIM dataset, CVAD increases the DIFF from 0.169 to 0.393 (+0.224) and the AUC from 0.616 to 0.750 (+0.134). The same observation holds for the inter-class OOD detection results in Table 2. This performance improvement can be attributed to the discriminator's exposure to generated OOD samples during training, which gives CVAD better discriminative ability than pchVAE alone.

Additionally, the standard AE and VAE are evaluated as baselines on the various OOD detection tasks. Although pchVAE reconstructs images with higher quality than the VAE, it fails to exceed the VAE in OOD detection. The autoencoder, which also produces good reconstructions, exhibits the weakest OOD detection accuracy according to the results reported in Table 1 and Table 2. In conclusion, good image reconstruction does not ensure strong OOD identification ability, and adding a discriminator is effective and contributes to discriminative learning.

4.5 Qualitative Results

Here we provide visualizations for anomaly detection of CVAD on different datasets and the reconstruction effects of CVAE.

Figure 4: Anomaly scores output by CVAD for different types of input data (experiments on the RSNA dataset). Columns from left to right: ID, intra-class OOD, inter-class OOD1, inter-class OOD2, inter-class OOD3.
Figure 5: Anomaly scores for the IVC-Filter dataset; from left to right: in-distribution data, intra-class OOD, inter-class OOD1, inter-class OOD2, inter-class OOD3.
Figure 6: CVAD prediction examples of SIIM dataset. From left to right, ID, intra-class OOD, inter-class OOD1, inter-class OOD2, inter-class OOD3 respectively. Anomaly scores are labeled on top of each case.

4.5.1 Anomaly Detection

Fig. 4 shows experimental results for the RSNA dataset. Each column represents a specific type of input data; from left to right, they are in-distribution data, intra-class OOD data, inter-class OOD1 data, inter-class OOD2 data and inter-class OOD3 data, respectively, with two examples for each type. The corresponding anomaly score predicted by CVAD is shown above each example. A high anomaly score indicates a high probability that the input belongs to the OOD category. As can be seen in Fig. 4, the two intra-class OOD samples closely resemble the in-distribution data, while the inter-class OOD examples look very different from it. Correspondingly, the anomaly scores of intra-class OOD samples are close to those of ID samples and difficult to separate, whereas the inter-class OOD cases, with their clear variations, are assigned higher anomaly scores and are easy to identify. This phenomenon further demonstrates the challenge of identifying intra-class OOD data. The predicted anomaly scores for the IVC-Filter and SIIM experiments are presented in Fig. 5 and Fig. 6, respectively.

Figure 7: Reconstruction details visualization of CVAE trained on RSNA dataset for different data types.

4.5.2 Visualization of reconstruction effects

CVAD gains good latent in-distribution features via its “generator” – the CVAE – which learns both low-level and high-level representations. To demonstrate this, we took the RSNA dataset as a representative and showcase the reconstruction details in Fig. 7, with the first column showing the branch VAE reconstruction, the second the standard VAE reconstruction, the third the final reconstruction, and the last the original input image (following the notation of Fig. 2). To further reveal the behavior of the CVAE on different OOD samples, we also present example images for ID (i.e., normal class, 1st row), intra-class OOD (i.e., pneumonia or with opacity, 2nd row), inter-class OOD1 (i.e., grayscale bird images, 3rd row), inter-class OOD2 (i.e., skin cancer images from the SIIM dataset, 4th row) and inter-class OOD3 (i.e., images from the IVC-Filter dataset, 5th row) in Fig. 7. Compared with intra-class medical OOD data, the reconstructions of inter-class OOD inputs are messier and more dissimilar to the original OOD data, which leads to larger reconstruction errors and makes them easier to distinguish. This observation confirms the varying difficulty of detecting different types of OOD data – intra-class OOD is much more challenging than inter-class OOD.

5 Conclusion

We propose CVAD, an effective medical anomaly detector that reconstructs coarse and fine image components by learning multi-scale latent representations. The high quality of the generated images enhances the binary discriminator's ability to identify unknown OOD data. We demonstrate OOD detection efficacy for both intra-class and inter-class OOD data on various medical and natural image datasets. Our model makes no prior assumptions about the input images or OOD application scenarios, and thus can be applied generically to detect OOD samples in multiple settings.


  • [1] D. Abati, A. Porrello, S. Calderara, and R. Cucchiara (2019) Latent space autoregression for novelty detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 481–490. Cited by: §2.
  • [2] S. Akcay, A. Atapour-Abarghouei, and T. P. Breckon (2018) Ganomaly: semi-supervised anomaly detection via adversarial training. In Asian conference on computer vision, pp. 622–637. Cited by: §2.1, §4.4, Table 1, Table 2.
  • [3] J. An and S. Cho (2015) Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE 2 (1), pp. 1–18. Cited by: §2.1, Table 1, Table 2.
  • [4] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua (2017) CVAE-gan: fine-grained image generation through asymmetric training. In Proceedings of the IEEE international conference on computer vision, pp. 2745–2754. Cited by: §1, §2.1, §2.1.
  • [5] T. Cao, C. Huang, D. Y. Hui, and J. P. Cohen (2020) A benchmark of medical out of distribution detection. arXiv preprint arXiv:2007.04250. Cited by: §1.
  • [6] R. J. W. David Rumelhart (1986) Parallel distributed processing explorations in the microstructure of cognition. Vol. 1, MIT press Cambridge, MA. Cited by: §2.1.
  • [7] E. Daxberger and J. M. Hernández-Lobato (2019) Bayesian variational autoencoders for unsupervised out-of-distribution detection. arXiv preprint arXiv:1912.05651. Cited by: §1, §2.1, §2, §3.1.
  • [8] L. Deecke, R. Vandermeulen, L. Ruff, S. Mandt, and M. Kloft (2018) Image anomaly detection with generative adversarial networks. In

    Joint european conference on machine learning and knowledge discovery in databases

    pp. 3–17. Cited by: §2.1.
  • [9] T. Fernando, H. Gammulle, S. Denman, S. Sridharan, and C. Fookes (2020) Deep learning for medical anomaly detection–a survey. arXiv preprint arXiv:2012.02364. Cited by: §1.
  • [10] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial networks. arXiv preprint arXiv:1406.2661. Cited by: §2.1, §3.2.
  • [11] Y. Hsu, Y. Shen, H. Jin, and Z. Kira (2020) Generalized odin: detecting out-of-distribution image without learning from out-of-distribution data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10951–10960. Cited by: §2.2.
  • [12] H. Huang, Z. Li, L. Wang, S. Chen, B. Dong, and X. Zhou (2020) Feature space singularity for out-of-distribution detection. arXiv preprint arXiv:2011.14654. Cited by: §2.2.
  • [13] H. Huang, Z. Li, R. He, Z. Sun, and T. Tan (2018) Introvae: introspective variational autoencoders for photographic image synthesis. arXiv preprint arXiv:1807.06358. Cited by: §2.1, §2.1, §3.1.
  • [14] S. S. Khan and M. G. Madden (2014) One-class classification: taxonomy of study and review of techniques.

    The Knowledge Engineering Review

    29 (3), pp. 345–374.
    Cited by: §2.2.
  • [15] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §2.1, §2.1, §3.1, §3.2.
  • [16] A. Krizhevsky, V. Nair, and G. Hinton (2010) Cifar-10 (canadian institute for advanced research). URL http://www. cs. toronto. edu/kriz/cifar. html 5. Cited by: §2.1.
  • [17] G. Kwon, M. Prabhushankar, D. Temel, and G. AlRegib (2020) Backpropagated gradient representations for anomaly detection. In European Conference on Computer Vision, pp. 206–226. Cited by: §2.2.
  • [18] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther (2016) Autoencoding beyond pixels using a learned similarity metric. In International conference on machine learning, pp. 1558–1566. Cited by: §1, §2.1.
  • [19] Y. LeCun and C. Cortes (2010) MNIST handwritten digit database. Cited by: §2.1.
  • [20] D. Li, D. Chen, J. Goh, and S. Ng (2018) Anomaly detection with generative adversarial networks for multivariate time series. arXiv preprint arXiv:1809.04758. Cited by: §2.1.
  • [21] S. Liang, Y. Li, and R. Srikant (2017) Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690. Cited by: §2.2, §2.2.
  • [22] P. Liznerski, L. Ruff, R. A. Vandermeulen, B. J. Franks, M. Kloft, and K. Müller (2020) Explainable deep one-class classification. arXiv preprint arXiv:2007.01760. Cited by: §2.2.
  • [23] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. Cited by: §2.1.
  • [24] J. C. Ni, K. Shpanskaya, M. Han, E. H. Lee, B. H. Do, W. T. Kuo, K. W. Yeom, and D. S. Wang (2020) Deep learning for automated classification of inferior vena cava filter types on radiographs. Journal of Vascular and Interventional Radiology 31 (1), pp. 66–73. Cited by: §4.1.
  • [25] P. Perera, R. Nallapati, and B. Xiang (2019) Ocgan: one-class novelty detection using gans with constrained latent representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2898–2906. Cited by: §2.1.
  • [26] A. A. Pol, V. Berger, C. Germain, G. Cerminara, and M. Pierini (2019) Anomaly detection with conditional variational autoencoders. In 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), pp. 1651–1657. Cited by: §2.1.
  • [27] X. Ran, M. Xu, L. Mei, Q. Xu, and Q. Liu (2020) Detecting out-of-distribution samples via variational auto-encoder with reliable uncertainty estimation. arXiv preprint arXiv:2007.08128. Cited by: §2.1.
  • [28] V. Rotemberg, N. Kurtansky, B. Betz-Stablein, L. Caffery, E. Chousakos, N. Codella, M. Combalia, S. Dusza, P. Guitera, D. Gutman, et al. (2021) A patient-centric dataset of images and metadata for identifying melanomas using clinical context. Scientific data 8 (1), pp. 1–8. Cited by: §4.1.
  • [29] L. Ruff, R. Vandermeulen, N. Goernitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft (2018) Deep one-class classification. In International conference on machine learning, pp. 4393–4402. Cited by: §2.2, §2.2, §4.4, Table 1, Table 2.
  • [30] M. Sakurada and T. Yairi (2014) Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, pp. 4–11. Cited by: Table 1, Table 2.
  • [31] T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs (2017) Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International conference on information processing in medical imaging, pp. 146–157. Cited by: §2.1.
  • [32] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson (2001) Estimating the support of a high-dimensional distribution. Neural computation 13 (7), pp. 1443–1471. Cited by: §2.2.
  • [33] G. Somepalli, Y. Wu, Y. Balaji, B. Vinzamuri, and S. Feizi (2020) Unsupervised anomaly detection with adversarial mirrored autoencoders. European Conference on Computer Vision. Cited by: §2.1.
  • [34] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther (2016) Ladder variational autoencoders. arXiv preprint arXiv:1602.02282. Cited by: §2.1.
  • [35] J. Tack, S. Mo, J. Jeong, and J. Shin (2020) Csi: novelty detection via contrastive learning on distributionally shifted instances. arXiv preprint arXiv:2007.08176. Cited by: §2.2, §2.2.
  • [36] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers (2017) Chestx-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2097–2106. Cited by: §4.1.
  • [37] H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §2.1.
  • [38] Z. Xiao, Q. Yan, and Y. Amit (2020) Likelihood regret: an out-of-distribution detection score for variational auto-encoder. arXiv preprint arXiv:2003.02977. Cited by: §2.1.
  • [39] Q. Yu and K. Aizawa (2019) Unsupervised out-of-distribution detection by maximum classifier discrepancy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9518–9526. Cited by: §2.2.
  • [40] D. Zimmerer, J. Petersen, and K. Maier-Hein (2019) High-and low-level image component decomposition using vaes for improved reconstruction and anomaly detection. arXiv preprint arXiv:1911.12161. Cited by: §1, §2.1, §3.1, §4.4, Table 1, Table 2.