Bayesian autoencoders with uncertainty quantification: Towards trustworthy anomaly detection

by Bang Xiang Yong, et al.
University of Cambridge

Despite numerous studies of deep autoencoders (AEs) for unsupervised anomaly detection, AEs still lack a way to express uncertainty in their predictions, crucial for ensuring safe and trustworthy machine learning systems in high-stake applications. Therefore, in this work, the formulation of Bayesian autoencoders (BAEs) is adopted to quantify the total anomaly uncertainty, comprising epistemic and aleatoric uncertainties. To evaluate the quality of uncertainty, we consider the task of classifying anomalies with the additional option of rejecting predictions of high uncertainty. In addition, we use the accuracy-rejection curve and propose the weighted average accuracy as a performance metric. Our experiments demonstrate the effectiveness of the BAE and total anomaly uncertainty on a set of benchmark datasets and two real datasets for manufacturing: one for condition monitoring, the other for quality inspection.


1 Introduction

Unsupervised neural networks (NNs) such as deep autoencoders (AEs) have consistently improved on traditional machine learning (ML) models for high-dimensional anomaly detection Chalapathy and Chawla (2019); Munir et al. (2019); Pang et al. (2021). Nonetheless, one missing ingredient is the measurement of predictive uncertainty. That is, in addition to reporting the prediction of an anomaly, can the AE tell how uncertain it is about the prediction?

The model’s prediction and its uncertainty are two different outputs: the prediction classifies a data point as inlier or anomaly by relying on an anomaly score, which typically measures its similarity or distance from a reference distribution of normality; the higher the anomaly score, the higher the chance of an abnormal observation Chandola et al. (2009). By contrast, the anomaly uncertainty measures the trustworthiness of the classification; the higher the uncertainty, the less reliable the prediction, implying a greater chance of a predictive error (see Fig. 1 for an illustration).

Figure 1: An uncertainty-aware anomaly detector quantifies predictive uncertainty in addition to predicting inliers and anomalies.

One popular method to quantify uncertainty for NNs is by adopting the Bayesian framework, resulting in the Bayesian autoencoder (BAE) formulation. Two types of uncertainty are of interest: epistemic uncertainty captures the uncertainty in the BAE parameters, while aleatoric uncertainty reflects the inherent randomness in the data Kiureghian and Ditlevsen (2009). However, recent works on BAEs Legrand et al. (2019); Yong et al. (2020b); Daxberger and Hernández-Lobato (2019); Tran et al. (2021); Baur et al. (2020) have only investigated the use of epistemic and aleatoric uncertainties as anomaly scores, leaving the notion of anomaly uncertainty understudied. Consequently, we are still incapable of knowing when to (dis)trust the BAEs’ predictions.

Knowing the uncertainty of each prediction is important for ensuring reliability, especially in high-stake applications, as we can anticipate whether there is an error with the prediction Gao et al. (2019). If we filter away cases of high uncertainty, we are left with trustworthy predictions of higher accuracy. In addition, these uncertain cases can be referred for further diagnosis, either by humans or equipment, thereby increasing the overall predictive performance. Therefore, in this work, we contribute in the following ways:

  1. Formulation of BAE to quantify predictive uncertainty of anomalies.

    We do so by converting negative log-likelihood (NLL) estimates from the BAE into the total anomaly uncertainty which is decomposable into epistemic and aleatoric uncertainties.

  2. Evaluation of uncertainty in anomaly detection.

    We consider the task of classifying anomalies with the option of rejecting uncertain predictions. While these evaluations have been explored in supervised learning Nadeem et al. (2009); Hendrickx et al. (2021); Leibig et al. (2017), we contribute by extending them to unsupervised anomaly detection.

  3. Validation of proposed BAE with real use cases that demonstrate its effectiveness and value. Our experiments demonstrate the added benefits of quantifying anomaly uncertainty on a set of benchmark datasets, as well as two real datasets for manufacturing applications: one for condition monitoring and the other for quality inspection.

2 Background and related works

Anomaly detection (also known as novelty detection, outlier detection, and more recently, out-of-distribution detection) is a binary classification task of identifying data points that differ significantly from a reference distribution of inliers Chandola et al. (2009). Anomaly detection has many real use cases in high-stake applications, e.g. cybersecurity Kwon et al. (2019), financial surveillance Ahmed et al. (2016), health and medical diagnosis Fernando et al. (2021), and manufacturing Kamat and Sugandhi (2020). Many traditional ML models have been adapted to perform anomaly detection, e.g. isolation forest, one-class support vector machine, and K-means clustering Agrawal and Agrawal (2015). A drawback of traditional ML models is their low complexity and flexibility, which limits their ability to scale to the high-dimensional data prevalent in modern databases Thudumu et al. (2020).

Overcoming the limitations of traditional ML models, deep NNs are expressive and scalable models. Many works have studied their use for anomaly detection, consistently achieving state-of-the-art results Pang et al. (2021); Fernando et al. (2021); Kamat and Sugandhi (2020); Kwon et al. (2019); Ahmed et al. (2016). A key property of NNs is their flexibility of modelling; each component (e.g. activation function, layers, optimiser) can be tuned and customised to adapt to the data at hand. Furthermore, specialised hardware has expedited their adoption and scalability by parallelising and optimising computations Li et al. (2019).

An emerging challenge with using NNs for anomaly detection is the quantification of uncertainty Thudumu et al. (2020). Uncertainty is a core component for promoting algorithmic transparency and hence the advancement of trustworthy ML Bhatt et al. (2021). While most works have focused on anomaly scores, less is known about anomaly uncertainty. To this end, Perini et al. (2020) introduced the example-wise confidence method for quantifying anomaly uncertainty; however, their method has only been evaluated with traditional ML models, leaving its applicability to deep AEs unclear.

The development of Bayesian neural networks (BNNs) Bishop (2007) for uncertainty quantification has led to the formulation of BAEs, a probabilistic version of the AE. Nonetheless, extant works have used uncertainty as an anomaly score, which is different from the notion of anomaly uncertainty. We list extant works as follows: Yong et al. (2020a) quantified epistemic and aleatoric uncertainties for detecting sensor drifts. Daxberger and Hernández-Lobato (2019) developed the Bayesian variational autoencoder (BVAE) to quantify uncertainty as a measure of deviation from the training distribution. Baur et al. (2020) developed the BAE with Monte Carlo dropout for anomaly detection in medical applications. Tran et al. (2021) observed that BAE-generated data have higher uncertainty than the reconstructed data. Chandra et al. (2021) studied the BAE with Markov Chain Monte Carlo (MCMC) sampling for dimensionality reduction and classification tasks.

Work | ML model | Anomaly uncertainty
Ours | BAE |
Perini et al. (2020) | Traditional |
Yong et al. (2020a) | BAE |
Daxberger and Hernández-Lobato (2019) | BVAE |
Baur et al. (2020) | BAE |
Tran et al. (2021) | BAE |
Chandra et al. (2021) | BAE |
Kriegel et al. (2011) | Traditional |
Kwon et al. (2020) | Supervised BNN |
Table 1: Comparisons between related works and ours on ML models for anomaly uncertainty.

Our work stands out as the first to provide a comprehensive study on the BAE’s predictive uncertainty of an anomaly. We extend the work of Kwon et al. (2020), which decomposed the classification uncertainty of supervised BNNs into epistemic and aleatoric uncertainties, to consider unsupervised anomaly detection with BAEs. Further, our work is inspired by Kriegel et al. (2011), who proposed converting an ensemble of anomaly scores into anomaly probabilities, and combining them via averaging. To improve the combination quality, they further proposed a customised scaling. We extend their study by (1) including BAEs, (2) quantifying anomaly uncertainty while maintaining coherence within a probabilistic framework, and (3) finding that their customised scaling improves the quality of uncertainty. Comparisons with related works are summarised in Table 1.

3 Methods

This section details the probabilistic formulation of BAEs for quantifying anomaly uncertainty, followed by ways to evaluate them.

3.1 Bayesian autoencoder

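Suppose we have a set of data $X = \{x_1, x_2, \dots, x_N\}$, $x_i \in \mathbb{R}^D$. An AE is an NN parameterised by $\theta$, and consists of two parts: an encoder $f_{\text{encoder}}$, for mapping input data $x$ to a latent embedding, and a decoder $f_{\text{decoder}}$ for mapping the latent embedding to a reconstructed signal of the input (i.e. $\hat{x} = f_{\text{decoder}}(f_{\text{encoder}}(x))$) Goodfellow et al. (2016).

Bayes’ rule can be applied to the parameters of the AE to create a BAE,

$$p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)} \tag{1}$$

where $p(X \mid \theta)$ is the likelihood and $p(\theta)$ is the prior distribution of the AE parameters. The log-likelihood for a diagonal Gaussian distribution, averaged over the $D$ features and up to an additive constant, is

$$\log p(x \mid \theta) = -\frac{1}{D}\sum_{d=1}^{D}\left(\frac{(x_d - \hat{x}_d)^2}{2\sigma_d^2} + \frac{1}{2}\log\sigma_d^2\right) \tag{2}$$

where $\sigma_d^2$ is the variance of the Gaussian distribution for the $d$-th feature. For simplicity, we use an isotropic Gaussian likelihood with $\sigma_d^2 = 1$ in this study, since the mean-squared error (MSE) function is then proportional to the NLL. For the prior distribution, we use an isotropic Gaussian prior which effectively leads to regularisation.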
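The proportionality between the MSE and the NLL under a unit-variance Gaussian likelihood can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and it assumes the NLL is averaged over features with the additive constant dropped.

```python
import math


def gaussian_nll(x, x_hat, sigma2=1.0):
    """NLL of a diagonal Gaussian with shared variance `sigma2`.

    Assumption: averaged over features, additive constant dropped.
    """
    per_feature = [
        (xd - xh) ** 2 / (2.0 * sigma2) + 0.5 * math.log(sigma2)
        for xd, xh in zip(x, x_hat)
    ]
    return sum(per_feature) / len(per_feature)


def mse(x, x_hat):
    """Mean-squared reconstruction error over features."""
    return sum((xd - xh) ** 2 for xd, xh in zip(x, x_hat)) / len(x)


x = [0.2, 0.5, 0.9]        # toy input (hypothetical values)
x_hat = [0.1, 0.5, 0.7]    # toy reconstruction

# With sigma2 = 1 the log-variance term vanishes and NLL = 0.5 * MSE.
assert math.isclose(gaussian_nll(x, x_hat), 0.5 * mse(x, x_hat))
```

Because the two quantities differ only by a constant factor, ranking data points by MSE or by NLL gives the same ordering of anomaly scores.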

Since Equation 1 is analytically intractable for a deep NN, various approximate methods have been developed to sample from the posterior distribution, such as Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) Chen et al. (2014), Monte Carlo Dropout (MCD) Gal and Ghahramani (2016), Bayes by Backprop (BBB) Blundell et al. (2015), and anchored ensembling Pearce et al. (2020). In contrast, a deterministic AE has its parameters estimated using maximum likelihood estimation (MLE), or maximum a posteriori (MAP) when regularisation is introduced. The variational autoencoder (VAE) Kingma and Welling (2013) and the BAE are AEs formulated differently within a probabilistic framework: in the VAE, only the latent embedding is stochastic while the encoder and decoder are deterministic, and the model is trained using variational inference. The BAE, on the other hand, has distributions over all parameters of the encoder and decoder (similar to a BNN). See Appendix A for descriptions of the posterior sampling methods used in our study.

In short, the training phase of the BAE entails using one of the sampling methods to obtain a set of $M$ approximate posterior samples $\{\hat{\theta}_m\}_{m=1}^{M}$. Then, during the prediction phase, we use the posterior samples to compute $M$ estimates of the NLL. For brevity, we denote the posterior NLL score $-\log p(x \mid \hat{\theta}_m)$ as $\mathrm{NLL}_m(x)$. Next, we look into using the NLL scores to quantify the anomaly uncertainty.

3.2 Quantifying anomaly uncertainties

Overview. The proposed method for quantifying anomaly uncertainty is illustrated in Fig. 2. Although we use the NLL as an anomaly score, it is not a true probability of an anomalous outcome. Hence, we first convert the NLL scores into anomaly probabilities via the cumulative distribution function (CDF). Next, using the anomaly probabilities, we compute the epistemic and aleatoric uncertainties. Capturing both uncertainties is crucial for forming a holistic quantification of predictive uncertainty; losing one of them leads to incomplete quantification and hence a lower quality of uncertainty estimate. Summing them forms the total anomaly uncertainty. Now, we formally describe our method for obtaining the total anomaly uncertainty with the BAE.

Figure 2: Workflow for quantifying the probability of an anomalous event and the predictive uncertainty using an ensemble of $M$ samples from the BAE posterior.

(i) Distribution of anomaly scores. Let $\psi_m$ be the vector of parameters of the distribution of NLL scores from the $m$-th BAE posterior sample $\hat{\theta}_m$,

$$\mathrm{NLL}_m \sim Q(\psi_m) \tag{3}$$

where Q is an arbitrary distribution. The optimal choice of distribution Q is dataset dependent, and in this study, we have experimented with setting Q as either a Gaussian, exponential or uniform distribution. For simplicity, we estimate $\hat{\psi}_m$ by applying maximum likelihood estimation on the anomaly scores computed over the training data, i.e. $\{\mathrm{NLL}_m(x_i)\}_{i=1}^{N}$, where $\mathrm{NLL}_m(x_i)$ is the NLL score of $x_i$ under the $m$-th posterior sample.

(ii) Quantification of anomaly probability. We define $y^*$ as the outcome of an anomaly for a new data point $x^*$, which has a Bernoulli distribution,

$$y^* \mid x^*, \hat{\theta}_m \sim \mathrm{Bernoulli}(p_m^*) \tag{4}$$

where the anomaly probability $p_m^*$ is modelled using the CDF,

$$p_m^* = P(y^* = 1 \mid x^*, \hat{\theta}_m) = F_Q\left(\mathrm{NLL}_m(x^*);\, \hat{\psi}_m\right) \tag{5}$$

Using the CDF to quantify anomaly probability is not uncommon; applying a min-max scaler Pedregosa et al. (2011) on the anomaly scores to rescale them into $[0,1]$ is the same as fitting a uniform distribution and using its CDF.
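The conversion from anomaly score to anomaly probability can be sketched for the three candidate distributions using only the Python standard library. This is a minimal illustration for a single posterior sample; the function names and the toy scores are hypothetical, and each fit uses the textbook maximum likelihood estimate for its distribution.

```python
import math
from statistics import mean, pstdev


def fit_gaussian(scores):
    # MLE for a Gaussian: sample mean and population standard deviation.
    return {"mu": mean(scores), "sigma": pstdev(scores)}


def gaussian_cdf(score, params):
    z = (score - params["mu"]) / (params["sigma"] * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def fit_exponential(scores):
    # MLE for the exponential scale parameter is the sample mean.
    return {"scale": mean(scores)}


def exponential_cdf(score, params):
    return 1.0 - math.exp(-score / params["scale"]) if score > 0 else 0.0


def fit_uniform(scores):
    # MLE for a uniform distribution: observed minimum and maximum.
    return {"low": min(scores), "high": max(scores)}


def uniform_cdf(score, params):
    # Identical to min-max scaling followed by clipping to [0, 1].
    frac = (score - params["low"]) / (params["high"] - params["low"])
    return min(1.0, max(0.0, frac))


train_nll = [0.10, 0.12, 0.15, 0.11, 0.20]  # toy training NLL scores
g = fit_gaussian(train_nll)
e = fit_exponential(train_nll)
u = fit_uniform(train_nll)

test_score = 0.18  # NLL of a new data point under the same posterior sample
probabilities = {
    "gaussian": gaussian_cdf(test_score, g),
    "exponential": exponential_cdf(test_score, e),
    "uniform": uniform_cdf(test_score, u),
}
```

In a BAE, the fit and the conversion would be repeated for each posterior sample, giving one anomaly probability per sample for every test point.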

Alternatively, the empirical cumulative distribution function (ECDF) can be used to estimate the anomaly probability. The ECDF essentially captures the fraction of anomaly scores of the training data with lower values than the observed test anomaly score. Moreover, the ECDF converges to the true CDF as the number of examples increases, according to the Glivenko-Cantelli theorem Tucker (1959). As such, the ECDF benefits from an increasing number of training data, and we can avoid having to specify a distribution Q for the anomaly scores by leveraging the ECDF.
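A minimal sketch of the ECDF-based conversion follows. The helper name `make_ecdf` and the toy scores are hypothetical; the function returns the fraction of training scores strictly below the query score, matching the description above.

```python
from bisect import bisect_left


def make_ecdf(train_scores):
    """Build an ECDF from training anomaly scores (no distribution assumed)."""
    sorted_scores = sorted(train_scores)
    n = len(sorted_scores)

    def ecdf(score):
        # bisect_left returns the number of training scores strictly below `score`.
        return bisect_left(sorted_scores, score) / n

    return ecdf


ecdf = make_ecdf([1.0, 2.0, 3.0, 4.0])  # toy training NLL scores
p_mid = ecdf(2.5)    # two of four training scores are lower
p_high = ecdf(10.0)  # all training scores are lower
```

Sorting once and using binary search keeps each query at logarithmic cost, which matters when the training set is large.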


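In addition to using the CDF or ECDF when converting anomaly score to probability, we may choose to apply a customised scaling method proposed by Kriegel et al. (2011),

$$\tilde{p}_m^* = \max\left(0,\ \frac{p_m^* - p_m^{\mu}}{1 - p_m^{\mu}}\right) \tag{7}$$

where $p_m^*$ is the anomaly probability of the observed score and $p_m^{\mu}$ is the CDF (or ECDF) evaluated at the average anomaly score of the training set, reflecting the beliefs that the anomaly probability is zero when the observed anomaly score is less than or equal to the average anomaly score of the training set (i.e. $\tilde{p}_m^* = 0$ for $p_m^* \le p_m^{\mu}$), and that the anomaly probability increases as the anomaly score gets higher than the average anomaly score.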
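The customised scaling can be sketched as below. The exact functional form is an assumption: it is one rescaling that satisfies the two stated beliefs (zero probability at or below the training mean score, increasing above it), and the names `scale_probability` and `p_ref` are hypothetical.

```python
def scale_probability(p, p_ref):
    """Rescale an anomaly probability relative to a reference probability.

    p     : CDF (or ECDF) value of the observed anomaly score
    p_ref : CDF (or ECDF) value of the mean training anomaly score
    Assumed form: max(0, (p - p_ref) / (1 - p_ref)).
    """
    if p_ref >= 1.0:
        return 0.0  # degenerate reference: nothing can exceed it
    return max(0.0, (p - p_ref) / (1.0 - p_ref))


# For a Gaussian fit, the CDF at the mean score is 0.5.
below_mean = scale_probability(0.30, 0.5)  # score below the training mean
at_mean = scale_probability(0.50, 0.5)     # score equal to the training mean
above_mean = scale_probability(0.75, 0.5)  # score above the training mean
```

Under this form, half of the unscaled probability range collapses to zero for a symmetric distribution, so only scores above the training average contribute to the anomaly probability.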

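By definition, the mean and variance of the Bernoulli distribution in Eq. 4 are:

$$\mathbb{E}[y^* \mid x^*, \hat{\theta}_m] = p_m^* \tag{8}$$

$$\mathrm{Var}[y^* \mid x^*, \hat{\theta}_m] = p_m^*\,(1 - p_m^*) \tag{9}$$

where $p_m^*$ is the anomaly probability obtained from the $m$-th posterior sample. See Fig. 3 for an illustrative example of Eq. 8 and Eq. 9.

Figure 3: Conversion of the NLL score from a single sample of the BAE into the mean (upper panel) and variance (lower panel) of the anomaly outcome. Distribution Q is chosen to be either the Gaussian or uniform distribution with an option of applying the customised scaling method. The customised scaling shifts the graph to the right by treating the average anomaly score of the training set as a reference.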

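Next, to predict an anomalous event, we summarise the anomaly probabilities from all BAE posterior samples via expectation,

$$\bar{p}^* = \mathbb{E}_{\theta}\left[p^*\right] \approx \frac{1}{M}\sum_{m=1}^{M} p_m^* \tag{10}$$

where $p_m^*$ is the anomaly probability from the $m$-th of $M$ posterior samples.
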
A reasonable threshold can be set for the mean anomaly probability at any point where to get a hard prediction of an anomalous event. Moreover, as a calibrated anomaly score, its value ranges in [0,1] and facilitates better interpretation of an anomalous event than the raw anomaly score, which takes on any real number Kriegel et al. (2011).

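(iii) Decomposition of anomaly uncertainty. Based on the law of total variance Weiss (2005), we decompose the total anomaly uncertainty, $U_{\text{total}}$, into its epistemic and aleatoric components,

$$U_{\text{total}} = \mathrm{Var}(y^* \mid x^*) = \underbrace{\mathrm{Var}_{\theta}\left(\mathbb{E}[y^* \mid x^*, \theta]\right)}_{U_{\text{epistemic}}} + \underbrace{\mathbb{E}_{\theta}\left(\mathrm{Var}[y^* \mid x^*, \theta]\right)}_{U_{\text{aleatoric}}} \tag{11}$$

Substituting with Eq. 8 and Eq. 9, the $U_{\text{epistemic}}$ and $U_{\text{aleatoric}}$ can be computed as

$$U_{\text{epistemic}} = \frac{1}{M}\sum_{m=1}^{M}\left(p_m^* - \bar{p}^*\right)^2 \tag{12}$$

$$U_{\text{aleatoric}} = \frac{1}{M}\sum_{m=1}^{M} p_m^*\,(1 - p_m^*) \tag{13}$$

where $p_m^*$ is the anomaly probability from the $m$-th of $M$ posterior samples and $\bar{p}^* = \frac{1}{M}\sum_{m=1}^{M} p_m^*$ is their mean.

Our method propagates the model uncertainty from samples of the BAE posterior to yield a holistic quantification of anomaly uncertainty. In contrast, when we have a deterministic AE (i.e. $M = 1$), the $U_{\text{epistemic}}$ term is zero and the remaining component is the $U_{\text{aleatoric}}$, signifying the inability of the deterministic AE to capture epistemic uncertainty.

Note on interpretability. Reading the $U_{\text{total}}$ can be unintuitive since the variance of a Bernoulli distribution ranges in [0,0.25]. Therefore, to enhance interpretability, we multiply the uncertainties by 4 so that they range between [0,1] in this study.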
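The decomposition can be illustrated with a short sketch that takes the per-sample anomaly probabilities and returns the epistemic, aleatoric and total uncertainties. The function name and the toy probabilities are hypothetical; the variance across samples is taken as a population variance (divided by the number of samples), which is an assumption that makes the total equal the Bernoulli variance of the mean probability.

```python
import math


def anomaly_uncertainty(probs, scale=4.0):
    """Decompose anomaly uncertainty from per-sample anomaly probabilities.

    probs : anomaly probabilities of one data point, one per posterior sample
    scale : multiplier mapping the Bernoulli variance range [0, 0.25] to [0, 1]
    """
    m = len(probs)
    p_bar = sum(probs) / m
    # Epistemic: spread of the anomaly probability across posterior samples.
    epistemic = sum((p - p_bar) ** 2 for p in probs) / m
    # Aleatoric: average Bernoulli variance within each posterior sample.
    aleatoric = sum(p * (1.0 - p) for p in probs) / m
    return {
        "mean_prob": p_bar,
        "epistemic": scale * epistemic,
        "aleatoric": scale * aleatoric,
        "total": scale * (epistemic + aleatoric),
    }


bae = anomaly_uncertainty([0.9, 0.6, 0.3])    # three posterior samples
deterministic = anomaly_uncertainty([0.7])    # a single sample (M = 1)

# Sanity check: the total equals the scaled Bernoulli variance of the mean.
assert math.isclose(bae["total"], 4.0 * bae["mean_prob"] * (1.0 - bae["mean_prob"]))
```

With a single sample the epistemic term vanishes and the total reduces to the aleatoric term, which mirrors the deterministic AE case described above.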

3.3 Alternative ways to quantify anomaly uncertainties

We investigate two other methods for quantifying anomaly uncertainty. Firstly, by applying the method proposed by Perini et al. (2020) on the BAE under the assumption that the training set is not polluted with anomalies, we arrive at the following uncertainty estimate,


where $N$ is the number of training examples. Like our method, this estimate relies on quantifying the anomaly probability. The difference, however, lies in the conversion from anomaly probability to anomaly uncertainty: we posit that the estimate becomes problematic in the regime of high $N$, as this leads to extreme values of the uncertainty estimate. In effect, it is prone to over- and underconfidence when training examples are abundant, limiting its scalability to larger datasets. In addition, it does not account for epistemic uncertainty, leading to a poorer uncertainty estimate.

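The second alternative way to measure anomaly uncertainty is by taking the variance across the samples of NLL scores from the BAE or VAE,

$$\mathrm{Var}(\mathrm{NLL}) = \frac{1}{M}\sum_{m=1}^{M}\left(\mathrm{NLL}_m(x^*) - \overline{\mathrm{NLL}}(x^*)\right)^2 \tag{15}$$

where $\mathrm{NLL}_m(x^*)$ is the NLL score of a new data point $x^*$ under the $m$-th of $M$ posterior samples and $\overline{\mathrm{NLL}}(x^*)$ is their mean. We shall refer to this method as $\mathrm{Var}(\mathrm{NLL})$. Note that $\mathrm{Var}(\mathrm{NLL})$ is not applicable with the deterministic AE since the number of sampled predictions is 1 ($\mathrm{Var}(\mathrm{NLL}) = 0$). See Fig. 4 for an illustration of the various methods for quantifying anomaly uncertainty.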

Figure 4: Visualisation of (a) as anomaly scores, and measures of uncertainties using (b) , (c) , (d) , (e) , and (f) using a BAE-Ensemble on a toy dataset with 2 features. Green dots are training examples while darker regions indicate higher values. and complement each other, forming reasonable regions of uncertainty with the . By contrast, is overconfident in regions surrounding the training examples.

3.4 Evaluating predictive uncertainty in anomaly detection

We evaluate the quality of uncertainty in the task of classifying anomalies with a reject option. Specifically, in addition to classifying a data point as either inlier or anomaly, the BAE has the option of withholding its prediction by not assigning any label. The rationale is that predictions with high uncertainty are likely to be erroneous; hence rather than accepting all predictions in a high-stake application, it is safer to accept only predictions of low uncertainty, which are more accurate Gao et al. (2019).

In this setup, we use the accuracy-rejection curve (ARC) proposed by Nadeem et al. (2009), which plots the accuracy of the retained predictions against the rejection rate evaluated at multiple anomaly uncertainty thresholds. In effect, the ARC illustrates the trade-off between accuracy and rejection rate. As a measure of accuracy, while any metric for binary classification can be used, we prefer the geometric mean of the sensitivity and specificity (GSS) for evaluating classifiers on an imbalanced dataset Kuncheva et al. (2019). Then, to summarise the model’s performance, we compute the weighted average accuracy


where is the number of uncertainty threshold settings and is the rejection rate in percentage. For the anomaly uncertainty to be skillful, the should be higher than the baseline , , where , (i.e. not using the anomaly uncertainty). Hence, an important metric is the difference between the and the baseline,


which measures the average improvement in accuracy when uncertainty is included. It is possible for the to be negative in value, indicating a drop in performance due to the poor utility of uncertainty in rejecting erroneous predictions.
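The accuracy-rejection curve can be sketched as follows. This is a minimal illustration, not the evaluation code of the study: the function names and toy data are hypothetical, and rejection is implemented by discarding a fixed fraction of the most uncertain predictions rather than by sweeping uncertainty thresholds, which is an assumption made for simplicity.

```python
import math


def gss(y_true, y_pred):
    """Geometric mean of sensitivity and specificity (1 = anomaly, 0 = inlier)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def accuracy_rejection_curve(y_true, y_pred, uncertainty, rejection_rates):
    """Return (rejection rate, GSS of retained predictions) pairs."""
    # Indices sorted from the most certain to the most uncertain prediction.
    order = sorted(range(len(y_true)), key=lambda i: uncertainty[i])
    curve = []
    for rate in rejection_rates:
        n_keep = max(1, round(len(order) * (1.0 - rate)))
        kept = order[:n_keep]
        score = gss([y_true[i] for i in kept], [y_pred[i] for i in kept])
        curve.append((rate, score))
    return curve


y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 1]                        # two errors
uncertainty = [0.1, 0.2, 0.1, 0.9, 0.2, 0.1, 0.8, 0.3]  # errors are most uncertain
curve = accuracy_rejection_curve(y_true, y_pred, uncertainty, [0.0, 0.25, 0.5])
```

In this toy case the two erroneous predictions carry the highest uncertainty, so rejecting the top 25% raises the GSS from 0.75 to 1.0; the point at zero rejection corresponds to not using the uncertainty at all.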

4 Experiment setup

The full code for reproducing the results will be released upon acceptance of this paper.

4.1 Datasets

We use a collection of benchmark datasets, Outlier Detection Datasets (ODDS) Rayana (2016), which are common in extant studies Campos et al. (2016); Chen et al. (2017). In addition, we use two publicly available datasets for industrial applications: ZeMA Schneider et al. (2018) for condition monitoring and STRATH Tachtatzis et al. (2019) for quality inspection. In the ZeMA dataset, there are 4 tasks; each task uses sensor measurements as inputs for detecting anomalies in hydraulic subsystems, namely, (i) the cooler, (ii) the valve, (iii) the pump, and (iv) the accumulator. In the STRATH dataset, the tasks are to detect defective parts manufactured by a radial forging facility. Different sensors are used for each task: (i) a position sensor (L-ACTpos), (ii) a speed sensor (A-ACTspd), (iii) a servo (Feedback-SPA), and (iv) all sensors from tasks (i-iii). See Helwig et al. (2015); Luo and Harris (2020) for detailed descriptions of the facility setup and data collection of the ZeMA and STRATH datasets.

For all datasets, we split the inliers into train-test sets with a ratio of 70:30 and include all anomalies in the test set. We use the min-max scaler Pedregosa et al. (2011) to transform the input features into [0,1]. The resulting dimensions of the datasets are tabulated in Table 2. Additional details of preprocessing are included in Appendix B.
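The split and scaling described above can be sketched with the standard library. The function names are hypothetical, and two details are assumptions not stated in the text: the inliers are shuffled before splitting, and the scaler is fitted on the train set only.

```python
import random


def split_inliers(inliers, anomalies, train_frac=0.7, seed=0):
    """Split inliers into train/test and put every anomaly in the test set."""
    rng = random.Random(seed)
    shuffled = list(inliers)
    rng.shuffle(shuffled)
    n_train = round(train_frac * len(shuffled))
    train = shuffled[:n_train]
    test = shuffled[n_train:] + list(anomalies)
    # Labels for the test set: 0 = inlier, 1 = anomaly.
    labels = [0] * (len(shuffled) - n_train) + [1] * len(anomalies)
    return train, test, labels


def fit_min_max(rows):
    """Per-feature minimum and maximum of the given rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def min_max_transform(rows, lows, highs):
    """Scale each feature with the fitted range; constant features map to 0."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


inliers = [[float(i), 10.0 + i] for i in range(10)]  # toy data, 2 features
anomalies = [[50.0, -5.0], [60.0, -6.0], [70.0, -7.0]]
train, test, labels = split_inliers(inliers, anomalies)
lows, highs = fit_min_max(train)
train_scaled = min_max_transform(train, lows, highs)
test_scaled = min_max_transform(test, lows, highs)
```

When the scaler is fitted on the train set only, test values (anomalies in particular) can fall outside [0,1]; clipping them is a separate design choice.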

Dataset | Task | Train (inliers) | Test (inliers) | Test (anomalies) | Features
ODDS | (i) Cardio | 1324 | 331 | 176 | 21
ODDS | (ii) Lympho | 113 | 29 | 6 | 18
ODDS | (iii) Optdigits | 4052 | 1014 | 150 | 64
ODDS | (iv) Pendigits | 5371 | 1343 | 156 | 16
ODDS | (v) Thyroid | 2943 | 736 | 93 | 6
ODDS | (vi) Ionosphere | 180 | 45 | 126 | 33
ODDS | (vii) Pima | 400 | 100 | 268 | 8
ODDS | (viii) Vowels | 1124 | 282 | 50 | 12
ZeMA | (i) Cooler | 518 | 223 | 242 | 60
ZeMA | (ii) Valve | 787 | 338 | 30 | 60
ZeMA | (iii) Pump | 854 | 367 | 22 | 60
ZeMA | (iv) Accumulator | 419 | 180 | 272 | 60
STRATH | (i) L-ACTpos | 51 | 23 | 7 | 562
STRATH | (ii) A-ACTspd | 51 | 23 | 7 | 562
STRATH | (iii) Feedback-SPA | 51 | 23 | 7 | 562
STRATH | (iv) All sensors | 51 | 23 | 7 | 1686
Table 2: Number of examples in train and test sets, and number of features for each task in ODDS, ZeMA and STRATH datasets.

4.2 Model hyperparameters

We fit the deterministic AE, VAE, and BAE models to the train set using the Adam optimiser Kingma and Ba (2014) for 100 epochs with a fixed learning rate of 0.001 for STRATH. For ODDS and ZeMA, we do not use a fixed learning rate but instead employ an automatic learning rate finder

Smith (2017). The number of posterior samples is set to for the VAE, BAE-MCD and BAE-BBB, while for the BAE-Ensemble. The prior scaling term (weight decay) for all models is . We specify the architecture of the encoder in Table 3. U-Net skip connections Ronneberger et al. (2015) are implemented in all models except for those in STRATH tasks (i) and (ii).

(a) ODDS
Layer | Output channels/nodes | Kernel | Strides
Linear | Input dimensions | - | -
Linear | Input dimensions | - | -
Linear | Latent dimensions | - | -

(b) ZeMA and STRATH
Layer | Output channels/nodes | Kernel | Strides
Conv1D | 10 | 8 | 2
Conv1D | 20 | 2 | 2
Reshape | - | - | -
Linear | 1000 | - | -
Linear | Latent dimensions | - | -
Table 3: Encoder architecture for ODDS, ZeMA and STRATH datasets. We set the latent dimensions to be half of the flattened input dimensions. The decoder is a reflection of the encoder, in which the Conv1D layers are replaced by Conv1D-Transpose layers. We apply layer normalisation after each Conv1D and linear layer for ODDS and STRATH. The leaky ReLU Maas et al. (2013) is used as the activation function with a slope of 0.01 while the sigmoid function is used at the decoder’s final layer.

We repeat the experiment 10 times with different random seeds. For each run, we vary the choice of anomaly score distribution Q to be either the Gaussian, exponential, or uniform distribution. Due to the large number of method combinations, we report results where Q is implicitly chosen (unless stated) to maximise the criterion.

5 Results and discussion

Incorrect predictions have higher uncertainty. From Fig. 5, we observe that inaccurate predictions are usually accompanied by higher uncertainty scores, demonstrating the use of anomaly uncertainty to anticipate both Type I and Type II predictive errors Banerjee et al. (2009) in the absence of true labels. Nonetheless, this is not always the case; for example, in Fig. 5(a)(iii) the VAE assigns similar or higher uncertainty scores to correct predictions. Likewise, the deterministic AE does the same for ZeMA task (iv). In such cases, the uncertainty quality is poor, highlighting the ongoing research challenge of obtaining reliable uncertainty estimates. Hence, a careful assessment of uncertainty quality is required before deploying in production.

(a) ODDS
(b) ZeMA
Figure 5: Box plots of anomaly uncertainty for correct predictions (TP+TN) and erroneous predictions (FP+FN) using various models. (TP = true positives, TN = true negatives, FP = false positives, FN = false negatives).

Rejecting uncertain predictions may yield higher predictive accuracy. Strikingly, the detection performance can rise to > 90% after rejecting 40% of uncertain predictions for most tasks (Fig. 6). Without rejection, however, most models score a much lower GSS, for instance, in Fig. 6(c)(iii). Note that not all models yield positive performance gains as we reject predictions of high uncertainty, as shown by a constant or deteriorating trend, e.g. BAE-MCD in Fig. 6(a)(ii) and Fig. 6(b)(iv).

(a) ODDS
(b) ZeMA
Figure 6: ARCs for comparisons of deterministic AE, VAE and BAEs with posterior approximated under various techniques. Mean and standard error of GSS are evaluated on (a) ODDS, (b) ZeMA and (c) STRATH datasets over 10 experiment runs. The anomaly uncertainty is used as the rejection criterion, with the Q distribution chosen based on the best score.

Should we use the total anomaly uncertainty or the method of Perini et al. (2020) to quantify anomaly uncertainty? Earlier, we mentioned that the estimate of Perini et al. (Eq. 14) does not scale well with the number of training examples. We notice it has a poor mean improvement of 0.1% on the ZeMA dataset (Table 4) while having a better score on the STRATH dataset, which has fewer training examples than ZeMA. Further inspection confirms that its performance is poor and does not scale well with the number of training examples (Fig. 7). This lack of scalability makes it less practical when there is an abundance of training data, or when data is incrementally collected. On the other hand, we do not observe such behaviour with the total anomaly uncertainty, as it performs well even as training examples increase in number.

Figure 7: Mean and standard error of against the number of training examples in ODDS, ZeMA and STRATH datasets. Statistics are calculated over 10 experiment runs and 5 types of AEs. Dotted line at is the baseline.

All in all, the total anomaly uncertainty outperforms the two alternative methods. Despite its implementation simplicity, the variance of the NLL scores is unreliable as a measure of anomaly uncertainty, as we find that it yields the lowest scores for both ODDS and STRATH datasets in Table 4. The total anomaly uncertainty also performs better than the epistemic and aleatoric uncertainties alone, showing the benefit of combining both types of uncertainties compared to using either one of them.

(a) ODDS
Model | Epistemic | Aleatoric | Total | Perini et al. (2020) | Variance of NLL
Deterministic AE | - | 84.6(+4.3) | 84.6(+4.3) | 80.2(0.0) | -
VAE | 81.3(+1.8) | 83.8(+4.1) | 84.3(+4.6) | 79.4(-0.3) | 78.4(+0.1)
BAE-MCD | 81.4(+1.7) | 83.8(+3.8) | 84.2(+4.2) | 79.9(0.0) | 77.9(-1.3)
BAE-BBB | 86.7(+4.0) | 87.8(+4.9) | 89.2(+6.3) | 83.2(0.0) | 80.5(-0.7)
BAE-Ensemble | 85.2(+1.5) | 86.4(+3.0) | 88.0(+4.6) | 84.0(+0.3) | 81.2(-0.8)
Mean | 83.6(+2.2) | 85.3(+4.0) | 86.1(+4.8) | 81.3(0.0) | 79.5(-0.7)

(b) ZeMA
Model | Epistemic | Aleatoric | Total | Perini et al. (2020) | Variance of NLL
Deterministic AE | - | 88.1(+2.9) | 88.1(+2.9) | 85.4(+0.4) | -
VAE | 89.4(+1.2) | 92.6(+5.0) | 92.9(+5.4) | 88.2(+0.2) | 89.4(+1.4)
BAE-MCD | 82.8(-0.1) | 86.1(+3.8) | 85.8(+3.6) | 83.2(+0.3) | 84.4(+1.8)
BAE-BBB | 89.8(+2.4) | 91.9(+3.5) | 92.6(+4.1) | 88.0(-0.4) | 86.5(-0.1)
BAE-Ensemble | 94.9(+1.3) | 95.1(+1.6) | 96.7(+3.2) | 93.5(0.0) | 94.0(+1.1)
Mean | 89.2(+1.2) | 90.8(+3.4) | 91.2(+3.8) | 87.7(+0.1) | 88.6(+1.0)

(c) STRATH
Model | Epistemic | Aleatoric | Total | Perini et al. (2020) | Variance of NLL
Deterministic AE | - | 83.9(+6.3) | 83.9(+6.3) | 81.4(+2.5) | -
VAE | 82.6(+3.7) | 84.1(+6.2) | 84.2(+6.7) | 81.2(+4.1) | 81.0(+1.6)
BAE-MCD | 83.7(+5.7) | 84.2(+6.2) | 84.5(+6.5) | 80.2(+4.2) | 79.0(+0.3)
BAE-BBB | 82.6(+5.5) | 82.4(+5.8) | 83.2(+6.7) | 80.4(+6.4) | 78.3(+0.1)
BAE-Ensemble | 81.9(+2.5) | 84.9(+6.7) | 84.7(+6.5) | 81.8(+4.9) | 80.0(0.0)
Mean | 82.7(+4.4) | 83.9(+6.2) | 84.1(+6.5) | 81.0(+4.4) | 79.6(+0.5)
Table 4: Mean weighted average accuracy, with the difference from the baseline in parentheses, evaluated on (a) ODDS, (b) ZeMA and (c) STRATH datasets for each uncertainty estimation method. Each dataset consists of multiple tasks and 10 experiment runs. Uncertainty estimation method with the highest mean score is bolded. Values are shown in percentage.

Deterministic AE vs stochastic AEs (VAE and BAE). We discuss differences between AE models when using the total anomaly uncertainty to estimate the predictive uncertainty. The deterministic AE, which quantifies only the aleatoric component, stands as a baseline against the stochastic AEs. For most models, the stochastic AEs have a higher weighted average accuracy than the deterministic AE, showing the benefits of accounting for model uncertainty in a probabilistic framework. However, this improvement does not hold for some models; one reason could be the lower quality of posterior sampling Yao et al. (2019); Pearce et al. (2020). On the other hand, the BAE-Ensemble outperforms the baseline AE and the VAE, scoring the highest on ZeMA and STRATH, and second highest on ODDS.

Besides the performance gain, the BAE-Ensemble is relatively easy to implement compared to other stochastic AEs. It does not require any changes to the architecture or the optimiser of a deterministic AE, allowing a seamless transition from a deterministic AE to a stochastic AE. While the BAE does require $M$ times more computations than the deterministic AE, where $M$ is the number of posterior samples, the computations are highly parallelisable since we can compute each ensemble member independently. Hence, the BAE can leverage the development of dedicated hardware for NN computations, e.g. edge computing devices for industrial environments Li et al. (2019).

Effect of anomaly probability conversion. Quantifying the anomaly uncertainty depends on the method of anomaly probability conversion (e.g. using either the CDF or ECDF, and whether to apply customised scaling); the switch from one method of conversion to the other strongly influences the quality of uncertainty. Notably, we find that customised scaling does improve the quality of uncertainty (see Table 5), as we notice large gains in the percentage of positive experiment runs after applying the customised scaling.

Conversion method | ODDS: % (Positives/Total runs) | ZeMA: % (Positives/Total runs) | STRATH: % (Positives/Total runs)
ECDF | 18.2% (73/400) | 41.5% (83/200) | 48.5% (97/200)
ECDF + Scale | 83.0% (332/400) | 94.5% (189/200) | 89.5% (179/200)
Gaussian | 37.8% (151/400) | 47.5% (95/200) | 50.5% (101/200)
Gaussian + Scale | 80.5% (322/400) | 69.0% (138/200) | 90.0% (180/200)
Exponen. | 16.2% (65/400) | 48.5% (97/200) | 41.0% (82/200)
Exponen. + Scale | 82.5% (330/400) | 89.0% (178/200) | 44.0% (88/200)
Uniform | 45.8% (183/400) | 22.0% (44/200) | 50.5% (101/200)
Uniform + Scale | 50.0% (200/400) | 25.5% (51/200) | 61.5% (123/200)
Table 5: Percentage of experiment runs with a positive improvement over the baseline when using the anomaly uncertainty while varying the method of anomaly probability conversion. Percentages greater than 80% are in bold.

In cases where the anomaly uncertainty did not perform as well, we suggest one reason is the improper choice of distribution of NLL scores for the dataset. For instance, although the exponential distribution worked reliably well for ODDS and ZeMA, it did not work well for the STRATH dataset. Instead of committing to a specific distribution, we can use the ECDF, a non-parametric method, which works reliably well for all datasets when combined with the customised scaling. We report additional results in Appendix C.

6 Conclusion

In this work, we have formulated the BAEs to estimate the predictive uncertainty for anomaly detection. Our method computes the epistemic and aleatoric uncertainties, and adding the two forms the total anomaly uncertainty.

Our experiments have evaluated the quality of uncertainty in the regime where the model is given the option to reject predictions of high uncertainty, and have demonstrated improvements using the proposed methods on multiple publicly available datasets. The best performing BAE consistently outperforms the deterministic AE, highlighting the benefit of the Bayesian formulation.

Several directions for future work are possible. There is a need to further understand the conditions in which the anomaly uncertainty may fail, for instance, in adversarial attacks. Comprehensive case studies are necessary to understand the practical challenges in deploying BAEs. Future work should also explore the use of anomaly uncertainty for incremental and active learning.


The work reported here was supported by the European Metrology Programme for Innovation and Research (EMPIR) under the project Metrology for the Factory of the Future (MET4FOF), project number 17IND12 and part-sponsored by Research England’s Connecting Capability Fund award CCF18-7157: Promoting the Internet of Things via Collaboration between HEIs and Industry (Pitch-In).


  • S. Agrawal and J. Agrawal (2015) Survey on anomaly detection using data mining techniques. Procedia Computer Science 60, pp. 708–713. External Links: ISSN 1877-0509, Document, Link Cited by: §2.
  • M. Ahmed, A. N. Mahmood, and Md. R. Islam (2016) A survey of anomaly detection techniques in financial domain. Future Generation Computer Systems 55, pp. 278–288. External Links: ISSN 0167-739X, Document, Link Cited by: §2, §2.
  • A. Banerjee, U. Chitnis, S. Jadhav, J. Bhawalkar, and S. Chaudhury (2009) Hypothesis testing, type I and type II errors. Industrial psychiatry journal 18 (2), pp. 127. Cited by: §5.
  • C. Baur, B. Wiestler, S. Albarqouni, and N. Navab (2020) Bayesian skip-autoencoders for unsupervised hyperintense anomaly detection in high resolution brain mri. In 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), pp. 1905–1909. Cited by: §1, Table 1, §2.
  • V. Bewick, L. Cheek, and J. Ball (2004) Statistics review 13: receiver operating characteristic curves. Critical care 8 (6), pp. 1–5. Cited by: Appendix C.
  • U. Bhatt, J. Antorán, Y. Zhang, Q. V. Liao, P. Sattigeri, R. Fogliato, G. Melançon, R. Krishnan, J. Stanley, O. Tickoo, et al. (2021) Uncertainty as a form of transparency: measuring, communicating, and using uncertainty. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pp. 401–413. Cited by: §2.
  • C. Bishop (2007) Pattern recognition and machine learning. Springer. Cited by: §2.
  • C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015) Weight uncertainty in neural network. In International Conference on Machine Learning, pp. 1613–1622. Cited by: §A.1, Appendix A, §3.1.
  • G. O. Campos, A. Zimek, J. Sander, R. J. Campello, B. Micenková, E. Schubert, I. Assent, and M. E. Houle (2016) On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data mining and knowledge discovery 30 (4), pp. 891–927. Cited by: §4.1.
  • R. Chalapathy and S. Chawla (2019) Deep learning for anomaly detection: a survey. arXiv preprint arXiv:1901.03407. Cited by: §1.
  • V. Chandola, A. Banerjee, and V. Kumar (2009) Anomaly detection: a survey. ACM computing surveys (CSUR) 41 (3), pp. 1–58. Cited by: §1, §2.
  • R. Chandra, M. Jain, M. Maharana, and P. N. Krivitsky (2021) Revisiting bayesian autoencoders with mcmc. arXiv preprint arXiv:2104.05915. Cited by: Table 1, §2.
  • J. Chen, S. Sathe, C. Aggarwal, and D. Turaga (2017) Outlier detection with autoencoder ensembles. In Proceedings of the 2017 SIAM international conference on data mining, pp. 90–98. Cited by: §4.1.
  • T. Chen, E. Fox, and C. Guestrin (2014) Stochastic gradient hamiltonian monte carlo. In International Conference on machine learning, pp. 1683–1691. Cited by: §3.1.
  • E. Daxberger and J. M. Hernández-Lobato (2019) Bayesian variational autoencoders for unsupervised out-of-distribution detection. arXiv preprint arXiv:1912.05651. Cited by: §1, Table 1, §2.
  • T. Fernando, H. Gammulle, S. Denman, S. Sridharan, and C. Fookes (2021) Deep learning for medical anomaly detection – a survey. ACM Comput. Surv. 54 (7). External Links: ISSN 0360-0300, Link, Document Cited by: §2, §2.
  • Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059. Cited by: Appendix A, §3.1.
  • J. Gao, J. Yao, and Y. Shao (2019) Towards reliable learning for high stakes applications. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3614–3621. Cited by: §1, §3.4.
  • I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press. Cited by: §3.1.
  • N. Helwig, E. Pignanelli, and A. Schütze (2015) Condition monitoring of a complex hydraulic system using multivariate statistics. In 2015 IEEE International Instrumentation and Measurement Technology Conference (I2MTC) Proceedings, pp. 210–215. External Links: Document, ISSN 1091-5281 Cited by: §4.1.
  • K. Hendrickx, L. Perini, D. Van der Plas, W. Meert, and J. Davis (2021) Machine learning with a reject option: a survey. arXiv preprint arXiv:2107.11277. Cited by: item 2.
  • P. Kamat and R. Sugandhi (2020) Anomaly detection for predictive maintenance in industry 4.0-a survey. In E3S Web of Conferences, Vol. 170, pp. 02007. Cited by: §2, §2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §3.1.
  • A. D. Kiureghian and O. Ditlevsen (2009) Aleatory or epistemic? does it matter?. Structural Safety 31 (2), pp. 105–112. Note: Risk Acceptance and Risk Communication External Links: ISSN 0167-4730 Cited by: §1.
  • H. Kriegel, P. Kroger, E. Schubert, and A. Zimek (2011) Interpreting and unifying outlier scores. In Proceedings of the 2011 SIAM International Conference on Data Mining, pp. 13–24. Cited by: Table 1, §2, §3.2, §3.2.
  • S. Kullback and R. A. Leibler (1951) On information and sufficiency. The annals of mathematical statistics 22 (1), pp. 79–86. Cited by: §A.1.
  • L. I. Kuncheva, Á. Arnaiz-González, J. Díez-Pastor, and I. A. Gunn (2019) Instance selection improves geometric mean accuracy: a study on imbalanced data classification. Progress in Artificial Intelligence 8 (2), pp. 215–228. Cited by: §3.4.
  • D. Kwon, H. Kim, J. Kim, S. C. Suh, I. Kim, and K. J. Kim (2019) A survey of deep learning-based network anomaly detection. Cluster Computing 22 (1), pp. 949–961. Cited by: §2, §2.
  • Y. Kwon, J. Won, B. J. Kim, and M. C. Paik (2020) Uncertainty quantification using bayesian neural networks in classification: application to biomedical image segmentation. Computational Statistics & Data Analysis 142, pp. 106816. Cited by: Table 1, §2.
  • A. Legrand, H. Trannois, and A. Cournier (2019) Use of uncertainty with autoencoder neural networks for anomaly detection. In 2019 IEEE Second International Conference on Artificial Intelligence and Knowledge Engineering (AIKE), pp. 32–35. Cited by: §1.
  • C. Leibig, V. Allken, M. S. Ayhan, P. Berens, and S. Wahl (2017) Leveraging uncertainty information from deep neural networks for disease detection. Scientific reports 7 (1), pp. 1–14. Cited by: item 2.
  • E. Li, L. Zeng, Z. Zhou, and X. Chen (2019) Edge ai: on-demand accelerating deep neural network inference via edge computing. IEEE Transactions on Wireless Communications 19 (1), pp. 447–457. Cited by: §2, §5.
  • Y. Luo and P. Harris (2020) Uncertainty in data analysis for strath testbed. In 2020 IEEE International Workshop on Metrology for Industry 4.0 & IoT, pp. 95–100. Cited by: §4.1.
  • A. L. Maas, A. Y. Hannun, and A. Y. Ng (2013) Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing, Cited by: Table 3.
  • M. Munir, M. A. Chattha, A. Dengel, and S. Ahmed (2019) A comparative analysis of traditional and deep learning-based anomaly detection methods for streaming data. In 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), pp. 561–566. Cited by: §1.
  • M. S. A. Nadeem, J. Zucker, and B. Hanczar (2009) Accuracy-rejection curves (arcs) for comparing classification methods with a reject option. In Machine Learning in Systems Biology, pp. 65–81. Cited by: item 2, §3.4.
  • G. Pang, C. Shen, L. Cao, and A. V. D. Hengel (2021) Deep learning for anomaly detection: a review. ACM Computing Surveys 54 (2). External Links: ISSN 0360-0300, Link, Document Cited by: §1, §2.
  • T. Pearce, F. Leibfried, and A. Brintrup (2020) Uncertainty in neural networks: approximately bayesian ensembling. In International conference on artificial intelligence and statistics, pp. 234–244. Cited by: Appendix A, §3.1, §5.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §3.2, §4.1.
  • L. Perini, V. Vercruyssen, and J. Davis (2020) Quantifying the confidence of anomaly detectors in their example-wise predictions. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 227–243. Cited by: Table 1, §2, §3.3.
  • S. Rayana (2016) ODDS library. Stony Brook University, Department of Computer Sciences. External Links: Link Cited by: §4.1.
  • O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi (Eds.), Cham, pp. 234–241. External Links: ISBN 978-3-319-24574-4 Cited by: §4.2.
  • T. Schneider, S. Klein, and M. Bastuck (2018) Cited by: §4.1.
  • L. N. Smith (2017) Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 464–472. Cited by: §4.2.
  • C. Tachtatzis, G. Gourlay, I. Andonovic, and O. Panni (2019) Cited by: §4.1.
  • S. Thudumu, P. Branch, J. Jin, and J. J. Singh (2020) A comprehensive survey of anomaly detection techniques for high dimensional big data. Journal of Big Data 7 (1), pp. 1–30. Cited by: §2, §2.
  • B. Tran, S. Rossi, D. Milios, P. Michiardi, E. V. Bonilla, and M. Filippone (2021) Model selection for bayesian autoencoders. arXiv preprint arXiv:2106.06245. Cited by: §1, Table 1, §2.
  • H. G. Tucker (1959) A generalization of the glivenko-cantelli theorem. The Annals of Mathematical Statistics 30 (3), pp. 828–830. Cited by: §3.2.
  • J. W. Tukey et al. (1977) Exploratory data analysis. Vol. 2, Reading, Mass.. Cited by: Appendix B.
  • N. A. Weiss (2005) A course in probability. Pearson Addison-Wesley. Cited by: §3.2.
  • J. Yao, W. Pan, S. Ghosh, and F. Doshi-Velez (2019) Quality of uncertainty quantification for bayesian neural network inference. arXiv preprint arXiv:1906.09686. Cited by: §5.
  • B. X. Yong, Y. Fathy, and A. Brintrup (2020a) Bayesian autoencoders for drift detection in industrial environments. In 2020 IEEE International Workshop on Metrology for Industry 4.0 & IoT, pp. 627–631. Cited by: Table 1, §2.
  • B. X. Yong, T. Pearce, and A. Brintrup (2020b) Bayesian autoencoders: analysing and fixing the bernoulli likelihood for out-of-distribution detection. In ICML Workshop on Uncertainty and Robustness in Deep Learning, Cited by: §1.

Appendix A BAE posterior sampling methods

We describe the following methods to sample from the BAE: BBB Blundell et al. (2015), MCD Gal and Ghahramani (2016) and anchored ensembling Pearce et al. (2020).

A.1 Variational inference

BBB and MCD are examples of variational inference. The idea of variational inference is to approximate the posterior distribution over parameters by introducing a variational distribution.


The variational distribution is governed by a vector of variational parameters. The objective of training is thus to minimise the Kullback-Leibler (KL) divergence Kullback and Leibler (1951), a measure of dissimilarity between two distributions, here the variational distribution and the true posterior. This yields an optimised variational distribution to be sampled from during prediction.


The KL divergence can be approximated with samples of AE parameters drawn from the variational distribution. Note that minimising the KL divergence is equivalent to maximising the evidence lower bound (ELBO).


Here, the weight decay scales the KL divergence of the prior, effectively controlling the strength of regularisation. The first term of the objective is the log-likelihood evaluated at a sample from the variational distribution.
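
The link between the KL divergence and the ELBO can be written out as a short worked derivation. The notation is an assumption of this sketch, since the original symbols are not available here: \theta denotes the AE parameters, \phi the variational parameters, X the training data, q(\theta \mid \phi) the variational distribution, \lambda the weight decay and M the number of samples.

```latex
\begin{aligned}
\mathrm{KL}\big[q(\theta\mid\phi)\,\|\,p(\theta\mid X)\big]
  &= \mathbb{E}_{q(\theta\mid\phi)}\big[\log q(\theta\mid\phi) - \log p(\theta\mid X)\big] \\
  &= \log p(X) - \underbrace{\Big(\mathbb{E}_{q(\theta\mid\phi)}\big[\log p(X\mid\theta)\big]
     - \mathrm{KL}\big[q(\theta\mid\phi)\,\|\,p(\theta)\big]\Big)}_{\mathrm{ELBO}(\phi)} \\[4pt]
\mathcal{L}(\phi)
  &\approx \frac{1}{M}\sum_{m=1}^{M} \log p(X\mid\theta_m)
     - \lambda\,\mathrm{KL}\big[q(\theta\mid\phi)\,\|\,p(\theta)\big],
  \qquad \theta_m \sim q(\theta\mid\phi)
\end{aligned}
```

Since \log p(X) does not depend on \phi, minimising the KL divergence to the posterior and maximising the ELBO select the same \phi. The last line is one plausible Monte Carlo form of the weighted objective, with \lambda = 1 recovering the standard ELBO; the exact weighting used in the experiments may differ.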

Now, we discuss popular models of the variational distribution. In BBB, we use a diagonal Gaussian distribution as the variational distribution, while the prior is a mixture of two diagonal Gaussian distributions. Sampling from the variational distribution introduces a noise term for implementing the reparameterisation trick Blundell et al. (2015), which is necessary for backpropagating the gradients during optimisation. The KL prior loss depends on three parameters of the prior, the mixing proportion and the standard deviations of the two Gaussian components, which we fix in our experiments as 0.5, 1.0 and 0.1, respectively.
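
A minimal sketch of these two ingredients, the reparameterised sample and the mixture prior, is given below. It follows the general recipe of Blundell et al. (2015) rather than the code used in the experiments. The names `mu`, `rho`, `pi`, `sigma1` and `sigma2`, the softplus parameterisation of the standard deviation, and the mapping of the values 0.5, 1.0 and 0.1 onto the mixing proportion and the two standard deviations are assumptions of this sketch.

```python
import math
import random


def softplus(x):
    """Map an unconstrained value to a positive standard deviation."""
    return math.log1p(math.exp(x))


def sample_weights(mu, rho, rng):
    """Reparameterisation trick: theta = mu + softplus(rho) * eps, eps ~ N(0, 1)."""
    return [m + softplus(r) * rng.gauss(0.0, 1.0) for m, r in zip(mu, rho)]


def gaussian_log_pdf(x, mean, std):
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (x - mean) ** 2 / (2.0 * std ** 2)


def log_variational(theta, mu, rho):
    """log q(theta) for a diagonal Gaussian variational distribution."""
    return sum(gaussian_log_pdf(t, m, softplus(r)) for t, m, r in zip(theta, mu, rho))


def log_mixture_prior(theta, pi=0.5, sigma1=1.0, sigma2=0.1):
    """log p(theta) for a zero-mean mixture of two diagonal Gaussians."""
    total = 0.0
    for t in theta:
        wide = pi * math.exp(gaussian_log_pdf(t, 0.0, sigma1))
        narrow = (1.0 - pi) * math.exp(gaussian_log_pdf(t, 0.0, sigma2))
        total += math.log(wide + narrow)
    return total


def kl_prior_estimate(mu, rho, n_samples=100, seed=0):
    """Monte Carlo estimate of KL[q || p] = E_q[log q(theta) - log p(theta)]."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n_samples):
        theta = sample_weights(mu, rho, rng)
        acc += log_variational(theta, mu, rho) - log_mixture_prior(theta)
    return acc / n_samples


# Hypothetical variational parameters for three AE weights.
mu = [0.2, -0.1, 0.05]
rho = [-3.0, -3.0, -3.0]
print(kl_prior_estimate(mu, rho))
```

Because the noise is drawn independently of `mu` and `rho`, the sampled weights are a deterministic function of the variational parameters, which is what allows gradients to flow through the sampling step in an automatic differentiation framework.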

On the other hand, MCD, which is also a type of variational inference, is implemented by adding a dropout layer after each linear or convolutional layer. The corresponding variational distribution is a mixture of Gaussian and Bernoulli distributions, parameterised by the dropout probability and by the AE parameters when they are not dropped out. Further, the dropout probability is a hyperparameter that is fixed in our experiments and is not updated during training. Lastly, the KL prior term is not computed exactly but approximated.
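
Sampling from an MCD model amounts to keeping dropout active at prediction time and repeating the forward pass. The sketch below shows this for a toy two-layer linear autoencoder. The network, its weights, the helper names and the dropout probabilities are all hypothetical, and dropout is applied only after the encoder layer to keep the example small.

```python
import random
import statistics


def dropout(values, p, rng):
    """Zero each unit with probability p and rescale the survivors (inverted dropout)."""
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def linear(x, weights):
    """Dense layer without bias: weights holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def mc_dropout_reconstructions(x, enc_w, dec_w, p=0.2, n_samples=50, seed=0):
    """Each stochastic forward pass acts as one sample from the approximate posterior.

    The default p is a placeholder, not the value used in the experiments.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        hidden = dropout(linear(x, enc_w), p, rng)
        samples.append(linear(hidden, dec_w))
    return samples


# Hypothetical weights: 3 inputs -> 2 hidden units -> 3 outputs.
enc_w = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
dec_w = [[1.0, 0.2], [-0.3, 0.9], [0.4, -0.6]]
x = [1.0, 0.5, -1.0]

samples = mc_dropout_reconstructions(x, enc_w, dec_w, p=0.5, n_samples=200)
mean = [statistics.fmean(col) for col in zip(*samples)]
var = [statistics.pvariance(col) for col in zip(*samples)]
print(mean, var)
```

The spread of the reconstructions across passes is what carries the model uncertainty; with the dropout probability set to zero all passes coincide and the model reduces to a deterministic AE.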


A.2 Anchored ensembling

In anchored ensembling, posteriors are approximated by Bayesian inference under the family of methods called randomised maximum a posteriori (MAP) sampling, where model parameters are regularised by values drawn from a distribution (so-called anchor distribution), which can be set equal to the prior.

Assume our ensemble consists of multiple independent autoencoders, each containing its own set of parameters. In anchored ensembling for approximating the posterior distribution, the ‘anchored weights’ for each autoencoder are unique; they are sampled during initialisation from a prior distribution and remain fixed throughout the training procedure.

The autoencoders are trained by minimising the loss function, which is the negative sum of the log-likelihood (based on the i.i.d. assumption) and the log-prior, where both are assumed to be Gaussian. For each member of the ensemble, the loss to be optimised therefore consists of a reconstruction term and a regulariser term arising from the prior, with a hyperparameter scaling the regulariser term.
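
A minimal sketch of a per-member anchored loss is shown below, assuming a Gaussian likelihood and prior so that both terms reduce to squared errors up to constants. The function names, the single scaling hyperparameter `lam` and the toy values are assumptions; Pearce et al. (2020) additionally normalise by the data noise and prior variances, which is omitted here.

```python
import random


def sample_anchor(n_params, prior_std, rng):
    """Draw anchor weights once from a zero-mean Gaussian prior; they stay fixed."""
    return [rng.gauss(0.0, prior_std) for _ in range(n_params)]


def anchored_loss(x, x_hat, theta, anchor, lam):
    """Squared reconstruction error plus a penalty pulling theta towards its anchor."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    reg = sum((t - a) ** 2 for t, a in zip(theta, anchor))
    return recon + lam * reg


rng = random.Random(0)
n_members, n_params = 5, 4
anchors = [sample_anchor(n_params, prior_std=1.0, rng=rng) for _ in range(n_members)]

# For illustration only: place each member exactly at its anchor so the penalty is zero.
thetas = [list(a) for a in anchors]
x, x_hat = [0.2, 0.4, 0.9], [0.25, 0.35, 1.0]
losses = [anchored_loss(x, x_hat, t, a, lam=0.1) for t, a in zip(thetas, anchors)]
print(losses)
```

Unlike ordinary weight decay, which shrinks every member towards zero, the anchor differs per member, so the ensemble stays diverse even where the data are uninformative.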

Appendix B Data preprocessing

ODDS. For each dataset in the ODDS collection, we do not apply any preprocessing steps aside from min-max scaling and data splitting.

For ZeMA and STRATH, the inputs to the AE are three-dimensional, described by the batch size, the sequence length, and the number of sensors.

ZeMA. The labels of each hydraulic subsystem’s condition are available in the dataset; we label the best working conditions as inliers and the remaining states as anomalies. We have chosen the following sensor-subsystem pairs in our analysis: temperature sensor (TS4) for the cooler, valve and accumulator, and pressure sensor (PS6) for the pump. We downsample the pressure sensor data to 1 Hz; no resampling is applied to the temperature data. These choices determine the sequence length and the number of sensors.

STRATH. The radial forging process is divided into the heating and forging phases. We segment the sensor data to consider only the forging phase. We downsample the data 10-fold, which reduces the sequence length for all tasks; the number of sensors is specified separately for tasks (i-iii) and for task (iv). For each forged part, measurements of its geometric dimensions are available as quality indicators; we focus our analysis on the 38 diameter@200 dimension. To flag parts as anomalies, we first obtain the absolute difference between the measured and nominal dimensions, and subsequently apply Tukey’s fences method Tukey et al. (1977). The remaining parts are labelled as inliers. The data preprocessing steps are outlined in Fig. 8.
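
The labelling rule can be sketched as follows. Because the deviations are absolute values, only the upper fence is applied in this sketch. The multiplier k = 1.5 is the conventional choice for Tukey's fences and, like the measurements and nominal value below, is an assumption rather than a setting taken from the experiments.

```python
import statistics


def tukey_upper_fence(values, k=1.5):
    """Upper Tukey fence: Q3 + k * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


def label_anomalies(measured, nominal, k=1.5):
    """Flag parts whose absolute deviation from nominal lies above the upper fence."""
    deviations = [abs(m - nominal) for m in measured]
    fence = tukey_upper_fence(deviations, k)
    return [d > fence for d in deviations]


# Hypothetical measurements of one geometric dimension (arbitrary units).
measured = [38.02, 37.99, 38.01, 38.00, 37.98, 38.03, 38.40]
labels = label_anomalies(measured, nominal=38.0)
print(labels)
```

Parts flagged `True` are treated as anomalies and all others as inliers, matching the description above.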

Figure 8: Data preprocessing pipeline for STRATH data set

As mentioned, we apply min-max scaling for all datasets. For ODDS, we apply the min-max scaling to each feature independently. By contrast, for ZeMA and STRATH, we obtain the min-max values from the train set for each sensor independently, instead of each feature, to retain the shape of the signal. Note that we prevent train-test bias by fitting the scaler to the train set only instead of the entire dataset.
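
A minimal sketch of per-sensor scaling fitted on the train set only is given below. The nested-list layout `data[sample][time_step][sensor]` and the function names are assumptions made for illustration; an array library would normally be used in practice.

```python
def fit_minmax_per_sensor(train):
    """Return one (min, max) pair per sensor, pooled over all samples and time steps."""
    n_sensors = len(train[0][0])
    stats = []
    for s in range(n_sensors):
        values = [step[s] for sample in train for step in sample]
        stats.append((min(values), max(values)))
    return stats


def transform(data, stats):
    """Scale every sensor with train-set statistics; unseen data may fall outside [0, 1]."""
    return [
        [
            [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, (lo, hi) in zip(step, stats)]
            for step in sample
        ]
        for sample in data
    ]


# Hypothetical data: 2 training samples, 2 time steps, 2 sensors.
train = [[[0.0, 10.0], [2.0, 30.0]], [[4.0, 20.0], [1.0, 50.0]]]
test = [[[2.0, 60.0]]]

stats = fit_minmax_per_sensor(train)
print(transform(train, stats))
print(transform(test, stats))
```

Pooling over time steps keeps the relative shape of each signal intact, whereas scaling every time step as its own feature would distort it. Fitting on the train set alone means the test set never influences the scaling constants.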

Appendix C Additional results

Results of using the area under the receiver operating characteristic curve (AUROC) Bewick et al. (2004) as a measure of classifier performance for ARCs are reported in Fig. 9 and Table 6. In addition, we report the ARCs to compare the choices of distribution Q (see Fig. 10).
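
An accuracy-rejection curve with AUROC as the performance measure can be sketched as follows: predictions are ranked by uncertainty, the most uncertain fraction is rejected, and the AUROC is computed on the remainder. The function names, the pairwise AUROC computation and the toy data are assumptions of this sketch, not the evaluation code used for Fig. 9 and Table 6.

```python
def auroc(labels, scores):
    """Probability that a random anomaly scores above a random inlier (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return float("nan")  # undefined when one class is missing
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rejection_curve(labels, scores, uncertainty, rates):
    """Reject the most uncertain fraction of predictions and score the rest."""
    order = sorted(range(len(labels)), key=lambda i: uncertainty[i])
    curve = []
    for rate in rates:
        n_keep = len(labels) - int(round(rate * len(labels)))
        kept = order[:n_keep]
        curve.append((rate, auroc([labels[i] for i in kept], [scores[i] for i in kept])))
    return curve


# Hypothetical labels (1 = anomaly), anomaly scores and uncertainties.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.6, 0.3, 0.9, 0.8, 0.4, 0.7]
uncertainty = [0.05, 0.1, 0.9, 0.2, 0.05, 0.1, 0.8, 0.15]

curve = rejection_curve(labels, scores, uncertainty, rates=[0.0, 0.25])
print(curve)
```

In this toy example the two most uncertain predictions are also the two that are ranked incorrectly, so rejecting them raises the AUROC; an uninformative uncertainty would leave the curve flat on average.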

Figure 9: ARCs for comparisons of deterministic AE, VAE and BAEs with posterior approximated under various techniques. Mean and standard error of AUROC are evaluated on (a) ODDS, (b) ZeMA and (c) STRATH datasets over 10 experiment runs. The anomaly uncertainty is used as the rejection criterion, with the Q distribution chosen based on the best score.
(a) ODDS
Deterministic AE   -            91.6(+1.8)   91.6(+1.8)   90.1(+0.2)   -
VAE                89.0(+1.2)   89.8(+2.0)   90.0(+2.2)   88.0(+0.2)   86.4(-1.3)
BAE-MCD            91.5(+0.8)   92.6(+1.9)   92.6(+2.0)   90.5(-0.2)   88.3(-2.4)
BAE-BBB            92.8(+1.4)   93.5(+2.1)   93.9(+2.5)   91.6(+0.3)   89.7(-1.7)
BAE-Ensemble       93.2(+0.9)   93.8(+1.5)   94.3(+2.0)   92.4(+0.2)   90.2(-2.1)
Mean               91.6(+1.1)   92.3(+1.9)   92.5(+2.1)   90.5(+0.1)   88.6(-1.9)

(b) ZeMA
Deterministic AE   -            92.7(+2.6)   92.7(+2.6)   90.8(+0.6)   -
VAE                94.7(+1.3)   95.7(+2.3)   95.8(+2.4)   94.0(+0.5)   94.9(+1.4)
BAE-MCD            91.7(+1.8)   92.8(+3.0)   92.9(+3.0)   90.4(+0.6)   92.2(+2.4)
BAE-BBB            96.0(+1.4)   96.3(+1.6)   96.5(+1.8)   95.3(+0.6)   95.1(+0.4)
BAE-Ensemble       97.5(+1.5)   97.5(+1.6)   97.8(+1.8)   96.2(+0.3)   97.0(+1.0)
Mean               95.0(+1.5)   95.0(+2.2)   95.1(+2.3)   93.3(+0.5)   94.8(+1.3)

(c) STRATH
Deterministic AE   -            89.4(+3.1)   89.4(+3.1)   88.5(+2.2)   -
VAE                89.4(+2.9)   89.6(+3.1)   89.6(+3.1)   88.7(+2.2)   88.2(+1.7)
BAE-MCD            89.3(+3.0)   89.5(+3.2)   89.4(+3.1)   88.2(+1.9)   86.9(+0.6)
BAE-BBB            88.9(+3.5)   88.6(+3.2)   88.4(+3.1)   87.6(+2.2)   86.8(+1.5)
BAE-Ensemble       89.1(+2.3)   89.8(+3.0)   89.8(+3.0)   89.0(+2.2)   87.7(+0.9)
Mean               89.2(+2.9)   89.4(+3.1)   89.3(+3.1)   88.4(+2.1)   87.4(+1.2)

Table 6: Mean scores, with the accompanying signed values in parentheses, evaluated on (a) ODDS, (b) ZeMA and (c) STRATH datasets. Each dataset consists of multiple tasks and 10 experiment runs. Uncertainty estimation method with the highest mean score is bolded. Values are shown in percentage.

Figure 10: ARCs for comparisons of anomaly probability conversions. Mean and standard error of GSS are evaluated on (a) ODDS, (b) ZeMA and (c) STRATH datasets over 10 experiment runs. Results are shown for the BAE-Ensemble model using the anomaly uncertainty as the rejection criterion.