1 Introduction
The use of unsupervised neural networks (NNs) such as deep autoencoders (AEs) has consistently achieved improvements over traditional machine learning (ML) models for high-dimensional anomaly detection Chalapathy and Chawla (2019); Munir et al. (2019); Pang et al. (2021). Nonetheless, one missing ingredient is the measurement of predictive uncertainty. That is, in addition to reporting the prediction of an anomaly, can the AE tell how uncertain it is about that prediction? The model's prediction and its uncertainty are two different outputs: the prediction classifies a data point as inlier or anomaly by relying on an anomaly score, which typically measures its similarity or distance to a reference distribution of normality; the higher the anomaly score, the higher the chance of an abnormal observation Chandola et al. (2009). By contrast, the anomaly uncertainty measures the trustworthiness of the classification; the higher the uncertainty, the less reliable the prediction, implying a greater chance of a predictive error (see Fig. 1 for an illustration).
One popular method to quantify uncertainty for NNs is by adopting the Bayesian framework, resulting in the Bayesian autoencoder (BAE) formulation. Two types of uncertainty are of interest: epistemic uncertainty captures the uncertainty in the BAE parameters, while aleatoric uncertainty reflects the inherent randomness in the data Kiureghian and Ditlevsen (2009). However, recent works on BAEs Legrand et al. (2019); Yong et al. (2020b); Daxberger and Hernández-Lobato (2019); Tran et al. (2021); Baur et al. (2020) have only investigated the use of epistemic and aleatoric uncertainties as anomaly scores, leaving the notion of anomaly uncertainty understudied. Consequently, we are still incapable of knowing when to (dis)trust the BAEs' predictions.
Knowing the uncertainty of each prediction is important for ensuring reliability, especially in high-stakes applications, as we can anticipate whether a prediction is likely to be erroneous Gao et al. (2019). If we filter away cases of high uncertainty, we are left with trustworthy predictions of higher accuracy. In addition, the uncertain cases can be referred for further diagnosis, either by humans or by equipment, thereby increasing the overall predictive performance. Therefore, in this work, we contribute in the following ways:

Formulation of BAE to quantify predictive uncertainty of anomalies.
We do so by converting negative log-likelihood (NLL) estimates from the BAE into the total anomaly uncertainty, which is decomposable into epistemic and aleatoric uncertainties.

Evaluation of uncertainty in anomaly detection.
We consider the task of classifying anomalies with the option of rejecting uncertain predictions. While such evaluations have been explored in supervised learning Nadeem et al. (2009); Hendrickx et al. (2021); Leibig et al. (2017), we contribute by extending them to unsupervised anomaly detection.
Validation of proposed BAE with real use cases that demonstrate its effectiveness and value. Our experiments demonstrate the added benefits of quantifying anomaly uncertainty on a set of benchmark datasets, as well as two real datasets for manufacturing applications: one for condition monitoring and the other for quality inspection.
2 Background and related works
Anomaly detection*
*Also known as novelty detection, outlier detection, and more recently, out-of-distribution detection.
is a binary classification task of identifying data points that differ significantly from a reference distribution of inliers Chandola et al. (2009). Anomaly detection has many real use cases in high-stakes applications, e.g. cybersecurity Kwon et al. (2019), financial surveillance Ahmed et al. (2016), health and medical diagnosis Fernando et al. (2021), and manufacturing Kamat and Sugandhi (2020). Many traditional ML models have been adapted to perform anomaly detection, e.g. isolation forest, one-class support vector machine, and K-means clustering Agrawal and Agrawal (2015). A drawback of traditional ML models is their low complexity and flexibility, limiting their scalability to the high-dimensional data prevalent in modern databases Thudumu et al. (2020).

Overcoming these limitations, deep NNs are expressive and scalable models. Many works have studied their use for anomaly detection, consistently achieving state-of-the-art results Pang et al. (2021); Fernando et al. (2021); Kamat and Sugandhi (2020); Kwon et al. (2019); Ahmed et al. (2016). A key property of NNs is their modelling flexibility; each component (e.g. activation function, layers, optimiser) can be tuned and customised to the data at hand. Furthermore, specialised hardware has expedited their adoption and scalability by parallelising and optimising computations Li et al. (2019).

An emerging challenge with using NNs for anomaly detection is the quantification of uncertainty Thudumu et al. (2020). Uncertainty is a core component for promoting algorithmic transparency and hence the advancement of trustworthy ML Bhatt et al. (2021). While most works have focused on anomaly scores, less is known about anomaly uncertainty. To this end, Perini et al. Perini et al. (2020) introduced the example-wise confidence method for quantifying anomaly uncertainty; however, their method has only been tested with traditional ML models, leaving its applicability to deep AEs unclear.
The development of Bayesian neural networks (BNNs) Bishop (2007) for uncertainty quantification has led to the formulation of BAEs, a probabilistic version of the AE. Nonetheless, extant works have used uncertainty as an anomaly score, which is different from the notion of anomaly uncertainty. We list the extant works as follows: Yong et al. Yong et al. (2020a) quantified epistemic and aleatoric uncertainties for detecting sensor drifts. Daxberger & Hernández-Lobato Daxberger and Hernández-Lobato (2019) developed the Bayesian variational autoencoder (BVAE) to quantify uncertainty as a measure of deviation from the training distribution. Baur et al. Baur et al. (2020) developed a BAE with Monte Carlo dropout for anomaly detection in medical applications. Tran et al. Tran et al. (2021) observed that BAE-generated data have higher uncertainty than the reconstructed data. Chandra et al. Chandra et al. (2021) studied the BAE with Markov chain Monte Carlo (MCMC) sampling for dimensionality reduction and classification tasks.
Table 1: Comparison with related works.

Work  ML model  Anomaly uncertainty
Ours  BAE  ✓
Perini et al. Perini et al. (2020)  Traditional  ✓
Yong et al. Yong et al. (2020a)  BAE  ✗
Daxberger & Hernández-Lobato Daxberger and Hernández-Lobato (2019)  BVAE  ✗
Baur et al. Baur et al. (2020)  BAE  ✗
Tran et al. Tran et al. (2021)  BAE  ✗
Chandra et al. Chandra et al. (2021)  BAE  ✗
Kriegel et al. Kriegel et al. (2011)  Traditional  ✗
Kwon et al. Kwon et al. (2020)  Supervised BNN  ✗
Our work stands out as the first to provide a comprehensive study of the BAE's predictive uncertainty of an anomaly. We extend the work of Kwon et al. Kwon et al. (2020), which decomposed the classification uncertainty of supervised BNNs into epistemic and aleatoric uncertainties, to unsupervised anomaly detection with BAEs. Further, our work is inspired by Kriegel et al. Kriegel et al. (2011), who proposed converting an ensemble of anomaly scores into anomaly probabilities and combining them via averaging; to improve the combination quality, they further proposed a customised scaling. We extend their study by (1) including BAEs, (2) quantifying anomaly uncertainty while maintaining coherence within a probabilistic framework, and (3) finding that their customised scaling improves the quality of uncertainty. Comparisons with related works are summarised in Table 1.

3 Methods
This section details the probabilistic formulation of BAEs for quantifying anomaly uncertainty, followed by ways to evaluate them.
3.1 Bayesian autoencoder
Suppose we have a set of data $X = \{x_m\}_{m=1}^{M}$, $x_m \in \mathbb{R}^D$. An AE is an NN parameterised by $\theta$, and consists of two parts: an encoder $f_{enc}$ for mapping input data $x$ to a latent embedding, and a decoder $f_{dec}$ for mapping the latent embedding to a reconstructed signal of the input (i.e. $\hat{x} = f_{dec}(f_{enc}(x))$) Goodfellow et al. (2016).
Bayes' rule can be applied to the parameters of the AE to create a BAE,

$p(\theta|X) = \frac{p(X|\theta)\,p(\theta)}{p(X)} \propto p(X|\theta)\,p(\theta)$  (1)

where $p(X|\theta)$ is the likelihood and $p(\theta)$ is the prior distribution of the AE parameters. The log-likelihood for a diagonal Gaussian distribution is,
$\log p(x|\theta) = -\sum_{i=1}^{D} \left[ \frac{(x_i - \hat{x}_i)^2}{2\sigma_i^2} + \frac{1}{2}\log(2\pi\sigma_i^2) \right]$  (2)

where $\sigma_i^2$ is the variance of the Gaussian distribution. For simplicity, we use an isotropic Gaussian likelihood with $\sigma_i^2 = 1$ in this study, since the mean-squared error (MSE) function is then proportional to the NLL. For the prior distribution, we use an isotropic Gaussian prior, which effectively leads to L2 regularisation. Since Equation 1 is analytically intractable for a deep NN, various approximate methods have been developed to sample from the posterior distribution, such as stochastic gradient Hamiltonian Monte Carlo (SGHMC) Chen et al. (2014), Monte Carlo dropout (MCD) Gal and Ghahramani (2016), Bayes by Backprop (BBB) Blundell et al. (2015), and anchored ensembling Pearce et al. (2020). In contrast, a deterministic AE has its parameters estimated using maximum likelihood estimation (MLE), or maximum a posteriori (MAP) estimation when regularisation is introduced. The variational autoencoder (VAE) Kingma and Welling (2013) and the BAE are AEs formulated differently within a probabilistic framework: in the VAE, only the latent embedding is stochastic while the encoder and decoder are deterministic, and the model is trained using variational inference; the BAE (similar to a BNN), on the other hand, places distributions over all parameters of the encoder and decoder. See Appendix A for descriptions of the posterior sampling methods used in our study.
In short, the training phase of the BAE entails using one of the sampling methods to obtain a set of approximate posterior samples $\{\hat{\theta}_i\}_{i=1}^{T}$. Then, during the prediction phase, we use the posterior samples to compute estimates of the NLL. For brevity, we denote the NLL score of a data point $x$ under the $i$-th posterior sample as $s_i(x)$. Next, we look into using the NLL scores to quantify the anomaly uncertainty.
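To make this training/prediction loop concrete, the ensemble variant can be sketched as below. This is an illustrative sketch only: we stand in for the BAE with scikit-learn's MLPRegressor trained to reconstruct its input, and the data, ensemble size T, and architecture are toy placeholders rather than the paper's setup.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 8))  # toy inlier data

# Ensemble approximation of the BAE posterior: train T autoencoders
# (input -> bottleneck -> input) from different random initialisations.
T = 5
posterior_samples = []
for seed in range(T):
    ae = MLPRegressor(hidden_layer_sizes=(4,), activation="relu",
                      alpha=1e-4,       # weight decay ~ isotropic Gaussian prior
                      max_iter=300, random_state=seed)
    ae.fit(X_train, X_train)            # reconstruct the input
    posterior_samples.append(ae)

def nll_scores(models, X):
    """Per-sample anomaly score s_i(x): mean-squared reconstruction
    error, proportional to the NLL under a unit-variance Gaussian."""
    return np.stack([((m.predict(X) - X) ** 2).mean(axis=1) for m in models])

scores = nll_scores(posterior_samples, X_train)  # shape (T, n_examples)
```

Each row of `scores` plays the role of the NLL scores from one posterior sample in the derivations that follow.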
3.2 Quantifying anomaly uncertainties
Overview. The proposed method for quantifying anomaly uncertainty is illustrated in Fig. 2. Although we use the NLL as an anomaly score, it is not a true probability of an anomalous outcome. Hence, we first convert the NLL scores into anomaly probabilities via the cumulative distribution function (CDF). Next, using the anomaly probabilities, we compute the epistemic and aleatoric uncertainties. Capturing both uncertainties is crucial for forming a holistic quantification of predictive uncertainty; omitting either leads to an incomplete quantification and hence a lower-quality uncertainty estimate. Summing them forms the total anomaly uncertainty. We now formally describe our method for obtaining the total anomaly uncertainty with the BAE.
(i) Distribution of anomaly scores. Let $\psi_i$ be the vector of parameters of the distribution of NLL scores from the $i$-th BAE posterior sample,

$s_i \sim Q(\psi_i)$  (3)

where Q is an arbitrary distribution. The optimal choice of Q is dataset dependent; in this study, we experiment with setting Q to either a Gaussian, an exponential, or a uniform distribution. For simplicity, we estimate $\psi_i$ by applying maximum likelihood estimation to the anomaly scores computed over the training data, i.e. $\{s_i(x_m)\}_{m=1}^{M}$.

(ii) Quantification of anomaly probability. We define $y_i$ as the outcome of an anomaly for a new data point $x^*$, which has a Bernoulli distribution,
$y_i \sim \text{Bernoulli}(p_i(x^*))$  (4)

where the anomaly probability $p_i(x^*)$ is modelled using the CDF,

$p_i(x^*) = \text{CDF}(s_i(x^*);\, \psi_i)$  (5)
Using the CDF to quantify anomaly probability is not uncommon; applying a min-max scaler Pedregosa et al. (2011) to rescale the anomaly scores into [0,1] is the same as fitting them to a uniform distribution and using its CDF.
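This equivalence is easy to verify numerically; a small check using scikit-learn's MinMaxScaler and SciPy's uniform CDF on toy scores:

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import MinMaxScaler

scores = np.array([[2.0], [4.0], [6.0], [10.0]])  # toy anomaly scores
minmax = MinMaxScaler().fit(scores).transform(scores).ravel()

# CDF of a uniform distribution supported on [min, max] of the scores.
u = stats.uniform(loc=scores.min(), scale=scores.max() - scores.min())
same = np.allclose(minmax, u.cdf(scores.ravel()))
print(same)  # True
```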
Alternatively, the empirical cumulative distribution function (ECDF) can be used to estimate $p_i(x^*)$. The ECDF essentially captures the fraction of training anomaly scores with lower values than the observed test anomaly score,

$p_i(x^*) = \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}\left[ s_i(x_m) \le s_i(x^*) \right]$  (6)

Moreover, the ECDF converges to the true CDF as the number of examples increases, according to the Glivenko–Cantelli theorem Tucker (1959). As such, the ECDF benefits from an increasing number of training data, and by leveraging it we avoid having to specify a distribution Q for the anomaly scores.
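Both conversion routes can be sketched in a few lines; this is an illustrative sketch with synthetic exponential scores, not the paper's implementation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
train_scores = rng.exponential(scale=1.0, size=1000)  # NLL scores on inliers

# Parametric route (Eq. 5): fit the chosen distribution Q by maximum
# likelihood on the training scores, then evaluate its CDF.
loc, scale = stats.expon.fit(train_scores)
p_cdf = stats.expon.cdf(3.0, loc=loc, scale=scale)

# Non-parametric route (Eq. 6): fraction of training scores at or
# below the observed test score.
def ecdf(train, s):
    return np.mean(train <= s)

p_ecdf = ecdf(train_scores, 3.0)
print(round(p_cdf, 3), round(p_ecdf, 3))  # both close to 0.95 here
```

With a well-chosen Q the two routes agree closely; the ECDF simply trades the parametric assumption for a dependence on the amount of training data.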
In addition to using the CDF or ECDF when converting an anomaly score into a probability, we may choose to apply a customised scaling method proposed by Kriegel et al. Kriegel et al. (2011),

$p_i(x^*) = \max\left\{0,\ \text{erf}\left(\frac{s_i(x^*) - \mu_i}{\sigma_i\sqrt{2}}\right)\right\}$  (7)

where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the training anomaly scores. This reflects the beliefs that the anomaly probability is zero when the observed anomaly score is less than or equal to the average anomaly score of the training set (i.e. $p_i(x^*) = 0$ for $s_i(x^*) \le \mu_i$), and that the anomaly probability increases as the anomaly score gets higher than the average anomaly score.
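A sketch of this scaling, assuming the Gaussian-scaling form (erf of the standardised score, clipped at zero) from Kriegel et al.; `customised_scaling` is our own helper name:

```python
import numpy as np
from math import erf, sqrt

def customised_scaling(score, train_scores):
    """Anomaly probability is 0 at or below the training mean and
    grows towards 1 as the score rises above it."""
    mu = float(np.mean(train_scores))
    sigma = float(np.std(train_scores))
    return max(0.0, erf((score - mu) / (sigma * sqrt(2))))

train_scores = np.array([0.8, 1.0, 1.1, 0.9, 1.2])  # mean = 1.0
p_low = customised_scaling(0.9, train_scores)   # below the mean -> 0.0
p_high = customised_scaling(2.0, train_scores)  # far above the mean
print(p_low, round(p_high, 3))
```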
By definition, the mean and variance of the Bernoulli distribution in Eq. 4 are:

$\mathbb{E}[y_i] = p_i(x^*)$  (8)

$\text{Var}[y_i] = p_i(x^*)\big(1 - p_i(x^*)\big)$  (9)
Next, to predict an anomalous event, we summarise the anomaly probabilities from all BAE posterior samples via the expectation,

$\bar{p}(x^*) = \frac{1}{T} \sum_{i=1}^{T} p_i(x^*)$  (10)

A reasonable threshold can then be set on the mean anomaly probability (e.g. classifying as an anomaly wherever $\bar{p}(x^*) \ge 0.5$) to get a hard prediction of an anomalous event. Moreover, as a calibrated anomaly score, its value ranges in [0,1] and facilitates better interpretation of an anomalous event than the raw anomaly score, which can take on any real number Kriegel et al. (2011).
(iii) Decomposition of anomaly uncertainty. Based on the law of total variance Weiss (2005), we decompose the total anomaly uncertainty $U_{total}$ into its epistemic and aleatoric components,

$U_{total} = \underbrace{\text{Var}_i\big[\mathbb{E}[y_i]\big]}_{U_{epistemic}} + \underbrace{\mathbb{E}_i\big[\text{Var}[y_i]\big]}_{U_{aleatoric}}$  (11)

Substituting with Eq. 8 and Eq. 9, the $U_{epistemic}$ and $U_{aleatoric}$ can be computed as

$U_{epistemic} = \frac{1}{T} \sum_{i=1}^{T} \big(p_i(x^*) - \bar{p}(x^*)\big)^2$  (12)

$U_{aleatoric} = \frac{1}{T} \sum_{i=1}^{T} p_i(x^*)\big(1 - p_i(x^*)\big)$  (13)
Our method propagates the model uncertainty from samples of the BAE posterior to yield a holistic quantification of anomaly uncertainty. In contrast, when we have a deterministic AE (i.e. $T = 1$), the $U_{epistemic}$ term is zero and the only remaining component is $U_{aleatoric}$, signifying the inability of the deterministic AE to capture epistemic uncertainty.
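Given a (T × n) matrix of anomaly probabilities, Eqs. 10-13 reduce to a few array operations; a minimal sketch with random probabilities as placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)
p = rng.uniform(size=(5, 4))  # (T posterior samples, n test points)

p_mean = p.mean(axis=0)                     # Eq. 10: mean anomaly probability
u_epistemic = p.var(axis=0)                 # Eq. 12: variance of p_i across samples
u_aleatoric = (p * (1.0 - p)).mean(axis=0)  # Eq. 13: mean Bernoulli variance
u_total = u_epistemic + u_aleatoric         # Eq. 11: law of total variance

# Sanity check: the two components recombine into the Bernoulli
# variance of the mean probability, p_mean * (1 - p_mean).
print(np.allclose(u_total, p_mean * (1.0 - p_mean)))  # True

# With a deterministic AE (T = 1) the epistemic term vanishes.
print(p[:1].var(axis=0).max())  # 0.0
```

Multiplying `u_total` by 4 rescales it to [0,1], matching the interpretability adjustment described in the text.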
Note on interpretability. Reading $U_{total}$ directly can be unintuitive, since the variance of a Bernoulli variable ranges in [0, 0.25]. Therefore, to enhance interpretability, we multiply $U_{total}$ by 4 in this study so that it ranges between [0,1].
3.3 Alternative ways to quantify anomaly uncertainties
We investigate two other methods for quantifying anomaly uncertainty. Firstly, by applying the method proposed by Perini et al. Perini et al. (2020) to the BAE, under the assumption that the training set is not polluted with anomalies, we arrive at the following uncertainty estimate,
(14) 
where $M$ is the number of training examples. Like our method, this confidence-based estimate relies on quantifying the anomaly probability. The difference, however, is in the conversion from anomaly probability to anomaly uncertainty; we posit that a problem arises in the regime of large $M$, as this leads to extreme values of the uncertainty estimate. In effect, it is prone to over- and under-confidence when training examples are abundant, limiting its scalability to larger datasets. In addition, it does not account for epistemic uncertainty, leading to a poorer uncertainty estimate.
The second alternative is to take the variance across the $T$ samples of NLL scores from the BAE or VAE,

$U_{\text{var-NLL}}(x^*) = \frac{1}{T} \sum_{i=1}^{T} \big(s_i(x^*) - \bar{s}(x^*)\big)^2$  (15)

where $\bar{s}(x^*)$ is the mean NLL score over the posterior samples. We refer to this method as $U_{\text{var-NLL}}$. Note that it is not applicable to the deterministic AE, since the number of sampled predictions is 1 (the variance is 0). See Fig. 4 for an illustration of the various methods for quantifying anomaly uncertainty.
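For comparison, this baseline is a one-liner over the same (T × n) score matrix; the numbers below are toy values for illustration:

```python
import numpy as np

# (T, n) posterior NLL scores for n test points.
nll = np.array([[1.2, 0.40],
                [1.0, 0.55],
                [1.4, 0.45]])

u_var_nll = nll.var(axis=0)  # Eq. 15: variance across posterior samples
print(u_var_nll)

# A deterministic AE yields a single row, so this variance is always 0.
print(np.array([nll[0]]).var(axis=0))
```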
3.4 Evaluating predictive uncertainty in anomaly detection
We evaluate the quality of uncertainty in the task of classifying anomalies with a reject option. Specifically, in addition to classifying a data point as either inlier or anomaly, the BAE has the option of withholding its prediction by not assigning any label. The rationale is that predictions with high uncertainty are likely to be erroneous; hence, rather than accepting all predictions in a high-stakes application, it is safer to accept only predictions of low uncertainty, which are more accurate Gao et al. (2019).
In this setup, we use the accuracy-rejection curve (ARC) proposed by Nadeem et al. Nadeem et al. (2009), which plots the accuracy of the retained predictions against the rejection rate, evaluated at multiple anomaly uncertainty thresholds. In effect, the ARC illustrates the trade-off between accuracy and rejection rate. As a measure of accuracy, while any metric for binary classification can be used, we prefer the geometric mean of the sensitivity and specificity (GSS) for evaluating classifiers on imbalanced datasets Kuncheva et al. (2019). Then, to summarise the model's performance, we compute the weighted average accuracy

$\text{GSS}_{weighted} = \frac{\sum_{k=1}^{N} (100 - r_k)\,\text{GSS}_k}{\sum_{k=1}^{N} (100 - r_k)}$  (16)
where $N$ is the number of uncertainty threshold settings and $r_k$ is the rejection rate in percentage. For the anomaly uncertainty to be skillful, the $\text{GSS}_{weighted}$ should be higher than the baseline $\text{GSS}_{baseline}$ obtained without rejection (i.e. not using the anomaly uncertainty). Hence, an important metric is the difference between the two,

$\Delta\text{GSS} = \text{GSS}_{weighted} - \text{GSS}_{baseline}$  (17)

which measures the average improvement in accuracy when uncertainty is included. It is possible for $\Delta\text{GSS}$ to be negative, indicating a drop in performance due to the poor utility of the uncertainty in rejecting erroneous predictions.
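The ARC and its weighted summary can be sketched as follows. This is an illustrative implementation under stated assumptions: `gmean_ss`, `arc`, and `weighted_accuracy` are our own helper names, and the (100 − r_k) weighting is one plausible reading of Eq. 16, not necessarily the authors' exact definition.

```python
import numpy as np

def gmean_ss(y_true, y_pred):
    """Geometric mean of sensitivity and specificity (GSS)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    sens = tp / max(tp + fn, 1)
    spec = tn / max(tn + fp, 1)
    return np.sqrt(sens * spec)

def arc(y_true, y_pred, uncertainty, rates):
    """Accuracy-rejection curve: GSS of the retained predictions after
    rejecting the most uncertain fraction at each rejection rate (%)."""
    order = np.argsort(uncertainty)  # most certain first
    accs = []
    for r in rates:
        keep = order[: max(1, int(round(len(y_true) * (1 - r / 100))))]
        accs.append(gmean_ss(y_true[keep], y_pred[keep]))
    return np.array(accs)

def weighted_accuracy(accs, rates):
    # Weight each GSS by the retained percentage (100 - r_k).
    w = 100.0 - np.asarray(rates, dtype=float)
    return np.sum(w * accs) / np.sum(w)

# Toy example: the two erroneous predictions carry high uncertainty,
# so the ARC rises once they are rejected.
y_true = np.array([0] * 8 + [1] * 8)
y_pred = y_true.copy()
y_pred[0], y_pred[8] = 1, 0          # one false positive, one false negative
uncertainty = np.full(16, 0.1)
uncertainty[[0, 8]] = 0.9
rates = (0, 10, 20, 30, 40)
accs = arc(y_true, y_pred, uncertainty, rates)
print(accs, weighted_accuracy(accs, rates))
```

In this toy case the GSS at 0% rejection is 0.875 and reaches 1.0 once the two uncertain errors are rejected, so the weighted accuracy exceeds the no-rejection baseline.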
4 Experiment setup
The full code for reproducing the results will be released upon acceptance of this paper.
4.1 Datasets
We use a collection of benchmark datasets, Outlier Detection Datasets (ODDS) Rayana (2016), which are common in extant studies Campos et al. (2016); Chen et al. (2017). In addition, we use two publicly available datasets for industrial applications: ZeMA Schneider et al. (2018) for condition monitoring and STRATH Tachtatzis et al. (2019) for quality inspection. The ZeMA dataset contains 4 tasks; each task uses sensor measurements as inputs for detecting anomalies in hydraulic subsystems, namely, (i) the cooler, (ii) the valve, (iii) the pump, and (iv) the accumulator. In the STRATH dataset, the tasks are to detect defective parts manufactured in a radial forging facility. Different sensors are used for each task: (i) a position sensor (LACTpos), (ii) a speed sensor (AACTspd), (iii) a servo (FeedbackSPA), and (iv) all sensors from tasks (i-iii). See Helwig et al. (2015); Luo and Harris (2020) for detailed descriptions of the facility setup and data collection of the ZeMA and STRATH datasets. For all datasets, we split the inliers into train-test sets with a ratio of 70:30 and include all anomalies in the test set. We use the min-max scaler Pedregosa et al. (2011) to transform the input features into [0,1]. The resulting dimensions of the datasets are tabulated in Table 2. Additional details of preprocessing are included in Appendix B.
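The split-and-scale protocol above can be sketched with scikit-learn; the 70:30 ratio and inlier-only training follow the text, while the toy data and dimensions are placeholders:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
inliers = rng.normal(size=(100, 6))          # placeholder inlier data
anomalies = rng.normal(loc=4.0, size=(10, 6))  # placeholder anomalies

# 70:30 split of the inliers; all anomalies go to the test set.
X_train, X_test_in = train_test_split(inliers, test_size=0.3, random_state=0)
X_test = np.vstack([X_test_in, anomalies])
y_test = np.concatenate([np.zeros(len(X_test_in)), np.ones(len(anomalies))])

# Min-max scaler fitted on the training inliers only, to avoid leakage.
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
print(X_train.shape, X_test.shape)  # (70, 6) (40, 6)
```

Note that scaled test features (especially anomalies) may fall outside [0,1], since the scaler is fitted on the training inliers only.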
Datasets & tasks  Train (inliers)  Test (inliers)  Test (anomalies)  Features 

ODDS  
(i) Cardio  1324  331  176  21 
(ii) Lympho  113  29  6  18 
(iii) Optdigits  4052  1014  150  64 
(iv) Pendigits  5371  1343  156  16 
(v) Thyroid  2943  736  93  6 
(vi) Ionosphere  180  45  126  33 
(vii) Pima  400  100  268  8 
(viii) Vowels  1124  282  50  12 
ZeMA  
(i) Cooler  518  223  242  60 
(ii) Valve  787  338  30  60 
(iii) Pump  854  367  22  60 
(iv) Accumulator  419  180  272  60 
STRATH  
(i) LACTpos  51  23  7  562 
(ii) AACTspd  51  23  7  562 
(iii) FeedbackSPA  51  23  7  562 
(iv) All sensors  51  23  7  1686 
4.2 Model hyperparameters
We fit the deterministic AE, VAE, and BAE models to the train set using the Adam optimiser Kingma and Ba (2014) for 100 epochs, with a fixed learning rate of 0.001 for STRATH. For ODDS and ZeMA, we do not use a fixed learning rate but instead employ an automatic learning rate finder Smith (2017). The number of posterior samples is set identically for the VAE, BAE-MCD and BAE-BBB, and separately for the BAE-Ensemble; the same prior scaling term (weight decay) is used for all models. We specify the architecture of the encoder in Table 3. U-Net skip connections Ronneberger et al. (2015) are implemented in all models except for those in STRATH tasks (i) and (ii).


Table 3: Encoder architecture for the ODDS, ZeMA and STRATH datasets. We set the latent dimension to half of the flattened input dimension. The decoder is a reflection of the encoder, in which the Conv1D layers are replaced by Conv1DTranspose layers. We apply layer normalisation after each Conv1D and linear layer for ODDS and STRATH. The leaky ReLU Maas et al. (2013) with a slope of 0.01 is used as the activation function, while the sigmoid function is used at the decoder's final layer.
We repeat the experiment 10 times with different random seeds. For each run, we vary the choice of anomaly score distribution Q to be either the Gaussian, exponential, or uniform distribution. Due to the large combinations of methods, we report results where Q is implicitly chosen (unless stated) to maximise the criterion.
5 Results and discussion
Incorrect predictions have higher uncertainty. From Fig. 5, we observe that inaccurate predictions are usually accompanied by higher $U_{total}$ scores, demonstrating the use of $U_{total}$ to anticipate both Type I and Type II predictive errors Banerjee et al. (2009) in the absence of true labels. Nonetheless, this is not always the case; for example, in Fig. 5(a)(iii) the VAE assigns similar or higher scores to correct predictions. Likewise, the deterministic AE does the same for ZeMA task (iv). In such cases, the uncertainty quality is poor, highlighting the ongoing research challenge of obtaining reliable uncertainty estimates. Hence, a careful assessment of uncertainty quality is required before deployment in production.
Rejecting uncertain predictions may yield higher predictive accuracy. Strikingly, the detection performance can rise above 90% after rejecting 40% of uncertain predictions for most tasks (Fig. 6). Without rejection, however, most models score much lower, for instance in Fig. 6(c)(iii). Note that not all models yield positive performance gains as we reject predictions of high uncertainty, as shown by a constant or deteriorating trend, e.g. the BAE-MCD in Fig. 6(a)(ii) and Fig. 6(b)(iv).
ARCs for comparison of the deterministic AE, VAE and BAEs with posteriors approximated under various techniques. Mean and standard error of GSS are evaluated on the (a) ODDS, (b) ZeMA and (c) STRATH datasets over 10 experiment runs. $U_{total}$ is used as the rejection criterion, with the Q distribution chosen based on the best score.

Should we use $U_{total}$ or the example-wise confidence method to quantify anomaly uncertainty? Earlier, we mentioned that the confidence estimate of Eq. 14 does not scale well with the number of training examples. We notice it has a poor mean $\Delta$GSS of 0.1% on the ZeMA dataset (Table 4) while having a better score on the STRATH dataset, which has fewer training examples than ZeMA. Further inspection (Fig. 7) confirms that its performance is poor and does not scale well with the number of training examples. This lack of scalability makes it less practical when there is an abundance of training data, or when data is collected incrementally. On the other hand, we do not observe such behaviour with $U_{total}$, which performs well even as the training examples increase in number.
All in all, $U_{total}$ outperforms the alternatives. Despite its implementation simplicity, $U_{\text{var-NLL}}$ is unreliable as a measure of anomaly uncertainty, as it yields the lowest scores for both the ZeMA and STRATH datasets in Table 4. The $U_{total}$ also performs better than $U_{epistemic}$ and $U_{aleatoric}$ alone, demonstrating the benefit of combining both types of uncertainty over using either one of them.



Deterministic AE vs stochastic AEs (VAE and BAE). We discuss differences between AE models when using $U_{total}$ to estimate the predictive uncertainty. The deterministic AE, which quantifies only the aleatoric component, stands as a baseline against the stochastic AEs. On most tasks, the stochastic AEs have a higher $\Delta$GSS than the deterministic AE, showing the benefits of accounting for model uncertainty in a probabilistic framework. However, this improvement does not hold for some models; one reason could be the lower quality of posterior sampling Yao et al. (2019); Pearce et al. (2020). On the other hand, the BAE-Ensemble outperforms the baseline AE and the VAE, scoring the highest on ZeMA and STRATH, and second highest on ODDS.
Besides the performance gain, the BAE-Ensemble is relatively easy to implement compared to other stochastic AEs. It does not require any changes to the architecture or the optimiser of a deterministic AE, allowing a seamless transition from a deterministic to a stochastic AE. While the BAE does require $T$ times more computation than the deterministic AE, the computations are highly parallelisable, since each ensemble member can be computed independently. Hence, the BAE can leverage the development of dedicated hardware for NN computations, e.g. edge computing devices for industrial environments Li et al. (2019).
Effect of anomaly probability conversion. Quantifying $U_{total}$ depends on the method of anomaly probability conversion (e.g. using either the CDF or ECDF, and whether to apply the customised scaling); switching from one method of conversion to another strongly influences the quality of uncertainty. Notably, we find that the customised scaling does improve the quality of uncertainty (see Table 5), as we notice large gains in the percentage of experiment runs with positive $\Delta$GSS after applying it.
Table 5: Percentage of experiment runs with positive $\Delta$GSS (positives/total runs) for each conversion method.

Conversion method  ODDS  ZeMA  STRATH
ECDF  18.2% (73/400)  41.5% (83/200)  48.5% (97/200) 
ECDF + Scale  83.0% (332/400)  94.5% (189/200)  89.5% (179/200) 
Gaussian  37.8% (151/400)  47.5% (95/200)  50.5% (101/200) 
Gaussian + Scale  80.5% (322/400)  69.0% (138/200)  90.0% (180/200) 
Exponen.  16.2% (65/400)  48.5% (97/200)  41.0% (82/200) 
Exponen. + Scale  82.5% (330/400)  89.0% (178/200)  44.0% (88/200) 
Uniform  45.8% (183/400)  22.0% (44/200)  50.5% (101/200) 
Uniform + Scale  50.0% (200/400)  25.5% (51/200)  61.5% (123/200) 
In cases where $U_{total}$ did not perform as well, we suggest one reason is an improper choice of the distribution of NLL scores for the dataset. For instance, although the exponential distribution worked reliably well for ODDS and ZeMA, it did not work well for the STRATH dataset. Instead of committing to a specific distribution, we can use the ECDF, a non-parametric method, which works reliably well across all datasets. We report additional results in Appendix C.

6 Conclusion
In this work, we have formulated BAEs to estimate the predictive uncertainty of anomaly detection. Our method computes the epistemic and aleatoric uncertainties, and adding the two forms the total anomaly uncertainty.
Our experiments evaluated the quality of uncertainty in the regime where the model is given the option to reject predictions of high uncertainty, and demonstrated improvements from the proposed methods on multiple publicly available datasets. The best-performing BAE consistently outperforms the deterministic AE, highlighting the benefit of the Bayesian formulation.
Several directions for future work are possible. There is a need to further understand the conditions in which the anomaly uncertainty may fail, for instance, in adversarial attacks. Comprehensive case studies are necessary to understand the practical challenges in deploying BAEs. Future work should also explore the use of anomaly uncertainty for incremental and active learning.
Acknowledgments
The work reported here was supported by the European Metrology Programme for Innovation and Research (EMPIR) under the project Metrology for the Factory of the Future (MET4FOF), project number 17IND12, and part-sponsored by Research England's Connecting Capability Fund award CCF187157: Promoting the Internet of Things via Collaboration between HEIs and Industry (PitchIn).
References
 Survey on anomaly detection using data mining techniques. Procedia Computer Science 60, pp. 708–713. External Links: ISSN 18770509, Document, Link Cited by: §2.
 A survey of anomaly detection techniques in financial domain. Future Generation Computer Systems 55, pp. 278–288. External Links: ISSN 0167739X, Document, Link Cited by: §2, §2.

Hypothesis testing, type i and type ii errors
. Industrial psychiatry journal 18 (2), pp. 127. Cited by: §5.  Bayesian skipautoencoders for unsupervised hyperintense anomaly detection in high resolution brain mri. In 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), pp. 1905–1909. Cited by: §1, Table 1, §2.

Statistics review 13: receiver operating characteristic curves
. Critical care 8 (6), pp. 1–5. Cited by: Appendix C.  Uncertainty as a form of transparency: measuring, communicating, and using uncertainty. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pp. 401–413. Cited by: §2.
 Pattern recognition and machine learning. Springer. Cited by: §2.
 Weight uncertainty in neural network. In International Conference on Machine Learning, pp. 1613–1622. Cited by: §A.1, Appendix A, §3.1.
 On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data mining and knowledge discovery 30 (4), pp. 891–927. Cited by: §4.1.
 Deep learning for anomaly detection: a survey. arXiv preprint arXiv:1901.03407. Cited by: §1.
 Anomaly detection: a survey. ACM computing surveys (CSUR) 41 (3), pp. 1–58. Cited by: §1, §2.
 Revisiting bayesian autoencoders with mcmc. arXiv preprint arXiv:2104.05915. Cited by: Table 1, §2.
 Outlier detection with autoencoder ensembles. In Proceedings of the 2017 SIAM international conference on data mining, pp. 90–98. Cited by: §4.1.
 Stochastic gradient hamiltonian monte carlo. In International Conference on machine learning, pp. 1683–1691. Cited by: §3.1.
 Bayesian variational autoencoders for unsupervised outofdistribution detection. arXiv preprint arXiv:1912.05651. Cited by: §1, Table 1, §2.
 Deep learning for medical anomaly detection – a survey. ACM Comput. Surv. 54 (7). External Links: ISSN 03600300, Link, Document Cited by: §2, §2.
 Dropout as a bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059. Cited by: Appendix A, §3.1.

Towards reliable learning for high stakes applications.
In
Proceedings of the AAAI Conference on Artificial Intelligence
, Vol. 33, pp. 3614–3621. Cited by: §1, §3.4.  Deep learning. MIT Press. Cited by: §3.1.
 Condition monitoring of a complex hydraulic system using multivariate statistics. In 2015 IEEE International Instrumentation and Measurement Technology Conference (I2MTC) Proceedings, Vol. , pp. 210–215. External Links: Document, ISSN 10915281 Cited by: §4.1.
 Machine learning with a reject option: a survey. arXiv preprint arXiv:2107.11277. Cited by: item 2.
 Anomaly detection for predictive maintenance in industry 4.0a survey. In E3S Web of Conferences, Vol. 170, pp. 02007. Cited by: §2, §2.
 Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
 Autoencoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §3.1.
 Aleatory or epistemic? does it matter?. Structural Safety 31 (2), pp. 105–112. Note: Risk Acceptance and Risk Communication External Links: ISSN 01674730 Cited by: §1.
 Interpreting and unifying outlier scores. In Proceedings of the 2011 SIAM International Conference on Data Mining, pp. 13–24. Cited by: Table 1, §2, §3.2, §3.2.
 On information and sufficiency. The annals of mathematical statistics 22 (1), pp. 79–86. Cited by: §A.1.
 Instance selection improves geometric mean accuracy: a study on imbalanced data classification. Progress in Artificial Intelligence 8 (2), pp. 215–228. Cited by: §3.4.
 A survey of deep learningbased network anomaly detection. Cluster Computing 22 (1), pp. 949–961. Cited by: §2, §2.
 Uncertainty quantification using bayesian neural networks in classification: application to biomedical image segmentation. Computational Statistics & Data Analysis 142, pp. 106816. Cited by: Table 1, §2.

Use of uncertainty with autoencoder neural networks for anomaly detection. In 2019 IEEE Second International Conference on Artificial Intelligence and Knowledge Engineering (AIKE), pp. 32–35. Cited by: §1.
Leveraging uncertainty information from deep neural networks for disease detection. Scientific Reports 7 (1), pp. 1–14. Cited by: item 2.
 Edge AI: on-demand accelerating deep neural network inference via edge computing. IEEE Transactions on Wireless Communications 19 (1), pp. 447–457. Cited by: §2, §5.
 Uncertainty in data analysis for the STRATH testbed. In 2020 IEEE International Workshop on Metrology for Industry 4.0 & IoT, pp. 95–100. Cited by: §4.1.
 Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing, Cited by: Table 3.
 A comparative analysis of traditional and deep learning-based anomaly detection methods for streaming data. In 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), pp. 561–566. Cited by: §1.
 Accuracy-rejection curves (ARCs) for comparing classification methods with a reject option. In Machine Learning in Systems Biology, pp. 65–81. Cited by: item 2, §3.4.
 Deep learning for anomaly detection: a review. ACM Computing Surveys 54 (2). External Links: ISSN 0360-0300, Link, Document Cited by: §1, §2.
 Uncertainty in neural networks: approximately bayesian ensembling. In International conference on artificial intelligence and statistics, pp. 234–244. Cited by: Appendix A, §3.1, §5.
 Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §3.2, §4.1.
 Quantifying the confidence of anomaly detectors in their example-wise predictions. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 227–243. Cited by: Table 1, §2, §3.3.
 ODDS library. Stony Brook University, Department of Computer Sciences. External Links: Link Cited by: §4.1.
 U-net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi (Eds.), Cham, pp. 234–241. External Links: ISBN 978-3-319-24574-4 Cited by: §4.2.

Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 464–472. Cited by: §4.2.
 A comprehensive survey of anomaly detection techniques for high dimensional big data. Journal of Big Data 7 (1), pp. 1–30. Cited by: §2, §2.
 Model selection for bayesian autoencoders. arXiv preprint arXiv:2106.06245. Cited by: §1, Table 1, §2.
 A generalization of the Glivenko-Cantelli theorem. The Annals of Mathematical Statistics 30 (3), pp. 828–830. Cited by: §3.2.
 Exploratory data analysis. Vol. 2, Reading, Mass. Cited by: Appendix B.
 A course in probability. Pearson Addison-Wesley. Cited by: §3.2.
 Quality of uncertainty quantification for bayesian neural network inference. arXiv preprint arXiv:1906.09686. Cited by: §5.
 Bayesian autoencoders for drift detection in industrial environments. In 2020 IEEE International Workshop on Metrology for Industry 4.0 & IoT, pp. 627–631. Cited by: Table 1, §2.
 Bayesian autoencoders: analysing and fixing the Bernoulli likelihood for out-of-distribution detection. In ICML Workshop on Uncertainty and Robustness in Deep Learning, Cited by: §1.
Appendix A BAE posterior sampling methods
We describe the following methods to sample from the BAE posterior: BBB Blundell et al. (2015), MCD Gal and Ghahramani (2016), and anchored ensembling Pearce et al. (2020).
A.1 Variational inference
BBB and MCD are examples of variational inference. The idea of variational inference is to approximate the posterior distribution over parameters by introducing a variational distribution,

q_\phi(\theta) \approx p(\theta \mid D), (18)

where \phi is the vector of parameters of the variational distribution q_\phi. The objective of training is thus to minimise the Kullback-Leibler (KL) divergence Kullback and Leibler (1951), which is a measure of similarity between two distributions: the variational distribution and the true posterior. This yields an optimised variational distribution to be sampled from during prediction,

\phi^{*} = \arg\min_{\phi} \mathrm{KL}\big[ q_\phi(\theta) \,\|\, p(\theta \mid D) \big]. (19)

The KL divergence can be approximated with samples of AE parameters \{\theta^{(m)}\}_{m=1}^{M} from the variational distribution q_\phi. Note that minimising the KL divergence is the same as maximising the log evidence lower bound (ELBO),

\mathcal{L}(\phi) = \frac{1}{M} \sum_{m=1}^{M} \log p(D \mid \theta^{(m)}) - \lambda\, \mathrm{KL}\big[ q_\phi(\theta) \,\|\, p(\theta) \big], (20)

where the weight decay \lambda scales the KL divergence of the prior, effectively controlling the strength of regularisation. Note that the first term of the sum is the log-likelihood of a sample from the variational distribution.
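To make the training objective concrete, the Monte-Carlo ELBO estimate in Eq. (20) can be sketched for a toy one-parameter model. This is a minimal illustration, not the paper's implementation: the function `elbo_estimate`, the toy reconstruction `theta * x`, and the standard-normal prior are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kl(mu_q, sig_q, mu_p, sig_p):
    """KL divergence between two diagonal Gaussians, summed over dimensions."""
    return np.sum(np.log(sig_p / sig_q)
                  + (sig_q**2 + (mu_q - mu_p)**2) / (2 * sig_p**2) - 0.5)

def elbo_estimate(x, mu, rho, n_samples=50, weight_decay=1e-2):
    """Monte-Carlo ELBO for a toy one-parameter 'autoencoder' x_hat = theta * x.

    mu and rho parameterise the variational Gaussian q(theta) = N(mu, softplus(rho)^2).
    """
    sigma = np.log1p(np.exp(rho))           # softplus keeps sigma > 0
    total = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal()          # reparameterisation trick
        theta = mu + sigma * eps             # theta ~ q(theta)
        x_hat = theta * x                    # toy reconstruction
        total += -0.5 * np.sum((x - x_hat) ** 2)  # Gaussian log-likelihood (unit variance)
    kl = gaussian_kl(mu, sigma, 0.0, 1.0)    # KL to a standard-normal prior
    return total / n_samples - weight_decay * kl
```

For inlier-like data, a variational mean that reconstructs the input well yields a higher ELBO than one that does not, mirroring the trade-off between the likelihood and KL terms of Eq. (20).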
Now, we discuss popular models of the variational distribution. In BBB, we use a diagonal Gaussian distribution as the variational distribution, while the prior is a mixture of two diagonal Gaussian distributions. A sample is drawn as

\theta = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I),

where \epsilon is the noise term introduced for implementing the reparameterisation trick Blundell et al. (2015), necessary for backpropagating the gradients during optimisation. The KL prior loss is

\mathrm{KL}\big[ q_\phi(\theta) \,\|\, p(\theta) \big] \approx \sum_{m=1}^{M} \big[ \log q_\phi(\theta^{(m)}) - \log p(\theta^{(m)}) \big], \quad p(\theta) = \prod_{i} \big[ \pi\, \mathcal{N}(\theta_i; 0, \sigma_1^{2}) + (1-\pi)\, \mathcal{N}(\theta_i; 0, \sigma_2^{2}) \big], (21)

where \pi, \sigma_1 and \sigma_2 are parameters of the prior, which we fix in our experiments as 0.5, 1.0 and 0.1, respectively.
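The scale-mixture prior above can be sketched as follows. The helper name `scale_mixture_log_prior` is hypothetical; only the parameter values 0.5, 1.0 and 0.1 come from the text.

```python
import numpy as np

def scale_mixture_log_prior(theta, pi=0.5, sigma1=1.0, sigma2=0.1):
    """Log-density of a BBB-style scale-mixture prior: each parameter follows
    pi * N(0, sigma1^2) + (1 - pi) * N(0, sigma2^2).
    Defaults match the values stated in the text (0.5, 1.0, 0.1).
    """
    def normal_pdf(x, sigma):
        return np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    mix = pi * normal_pdf(theta, sigma1) + (1 - pi) * normal_pdf(theta, sigma2)
    return np.sum(np.log(mix))
```

The narrow second component (sigma2 = 0.1) concentrates mass near zero, so the mixture penalises large weights more sharply than a single broad Gaussian would.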
On the other hand, MCD, which is also a type of variational inference, is implemented by adding a dropout layer after each linear or convolutional layer. The corresponding variational distribution is a mixture of Gaussian and Bernoulli distributions,

\theta_i = M_i \cdot z_i, \quad z_i \sim \mathrm{Bernoulli}(1-p),

where p is the dropout probability and M refers to the AE parameters when they are not dropped out. Further, p is a hyperparameter and is not updated during training; we fix its value in our experiments. Lastly, the KL prior term can be approximated as

\mathrm{KL}\big[ q(\theta) \,\|\, p(\theta) \big] \approx \sum_{i} \frac{l^{2}(1-p)}{2} \|M_i\|^{2} + \text{const}, (22)

where l is the prior length-scale.
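A minimal sketch of MC-dropout prediction follows, assuming a toy linear autoencoder with inverted dropout on the hidden code; the weight matrices `W_enc`/`W_dec` and the function name are illustrative, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_scores(x, W_enc, W_dec, p=0.2, n_samples=100):
    """MC-dropout for a toy linear autoencoder: sample Bernoulli dropout masks
    at prediction time and return the mean and variance of the per-sample
    reconstruction error (anomaly score and its epistemic spread).
    """
    errors = []
    for _ in range(n_samples):
        mask = rng.random(W_enc.shape[1]) > p        # Bernoulli keep-mask on hidden units
        z = (x @ W_enc) * mask / (1.0 - p)           # dropped-out encoding (inverted dropout)
        x_hat = z @ W_dec                            # reconstruction
        errors.append(np.mean((x - x_hat) ** 2))
    errors = np.asarray(errors)
    return errors.mean(), errors.var()
```

Keeping dropout active at test time is what turns an ordinary AE into an approximate BAE here: the spread of the reconstruction errors across masks plays the role of the model's uncertainty.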
A.2 Anchored ensembling
In anchored ensembling, the posterior is approximated under a family of methods called randomised maximum a posteriori (MAP) sampling, where model parameters are regularised towards values drawn from a distribution (the so-called anchor distribution), which can be set equal to the prior.
Assume our ensemble consists of M independent autoencoders, where the j-th autoencoder contains a set of parameters \theta_j, with j \in \{1, \dots, M\}. In anchored ensembling, the 'anchored weights' \tilde{\theta}_j of each autoencoder are unique: they are sampled during initialisation from a prior distribution and remain fixed throughout the training procedure.
The autoencoders are trained by minimising a loss function given by the negative sum of the log-likelihood (based on the i.i.d. assumption) and the log-prior, where both are assumed to be Gaussian. For each member of the ensemble, the loss to be optimised is

\mathcal{L}_j = \sum_{n} \| x_n - f_{\theta_j}(x_n) \|^{2} + \lambda\, \| \theta_j - \tilde{\theta}_j \|^{2}, (23)

where \lambda is a hyperparameter for scaling the regulariser term arising from the prior.
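A sketch of the per-member anchored loss, assuming mean squared error as the Gaussian negative log-likelihood; the function and variable names are hypothetical.

```python
import numpy as np

def anchored_loss(x, x_hat, theta, theta_anchor, lam=1e-2):
    """Anchored-ensemble loss for one ensemble member: a Gaussian negative
    log-likelihood term (here, mean squared reconstruction error) plus a
    regulariser pulling the parameters towards their fixed anchor draw.
    """
    nll = np.mean((x - x_hat) ** 2)                      # reconstruction term
    anchor = lam * np.sum((theta - theta_anchor) ** 2)   # prior/anchor term
    return nll + anchor
```

Because each member is anchored to a different prior draw, the trained members disagree on out-of-distribution inputs, which is what produces the ensemble's epistemic uncertainty.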
Appendix B Data preprocessing
ODDS. For each dataset in the ODDS collection, we do not apply any preprocessing steps aside from min-max scaling and data splitting.
For ZeMA and STRATH, the inputs to the AE have dimensions (B \times L \times S), where B is the batch size, L is the sequence length, and S is the number of sensors.
ZeMA. The labels of each hydraulic subsystem's condition are available in the dataset; we label the best working conditions as inliers and the remaining states as anomalies. We have chosen the following sensor-subsystem pairs in our analysis: temperature sensor (TS4) for the cooler, valve and accumulator, and pressure sensor (PS6) for the pump. We downsample the pressure sensor to 1 Hz; no resampling is applied to the temperature data.
STRATH. The radial forging process is divided into the heating and forging phases. We segment the sensor data to consider only the forging phase. We downsample the data 10-fold, reducing the sequence length for all tasks; the resulting length differs between tasks (i–iii) and task (iv). For each forged part, measurements of its geometric dimensions are available as quality indicators; we focus our analysis on the 38 diameter@200 dimension. To flag parts as anomalies, we first obtain the absolute difference between the measured and nominal dimensions, and subsequently, we apply Tukey's fences method Tukey and others (1977). The remaining parts are labelled as inliers. The data preprocessing steps are outlined in Fig. 8.
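The Tukey's fences labelling step can be sketched as follows; the function name and the default multiplier k = 1.5 are assumptions, as the text does not state the multiplier used.

```python
import numpy as np

def tukey_fence_labels(deviations, k=1.5):
    """Label parts via Tukey's fences: a deviation outside
    [Q1 - k*IQR, Q3 + k*IQR] is flagged (1 = anomaly, 0 = inlier).
    """
    q1, q3 = np.percentile(deviations, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return ((deviations < lower) | (deviations > upper)).astype(int)
```

Since the deviations here are absolute differences from the nominal dimension, in practice only the upper fence flags parts.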
As mentioned, we apply min-max scaling for all datasets. For ODDS, we apply the min-max scaling to each feature independently. By contrast, for ZeMA and STRATH, we obtain the min-max values from the train set for each sensor independently, instead of each feature, to retain the shape of the signal. Note that we prevent train-test bias by fitting the scaler to the train set only instead of the entire dataset.
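The per-sensor scaling described above might be sketched as follows; the function names are illustrative, and the small eps guard against zero-range sensors is an added assumption.

```python
import numpy as np

def fit_sensor_minmax(train):
    """Fit per-sensor min/max on the TRAIN set only, for arrays shaped
    (batch, sequence_length, n_sensors). Scaling per sensor channel
    (rather than per time step) preserves the shape of each signal.
    """
    mins = train.min(axis=(0, 1))   # one min per sensor channel
    maxs = train.max(axis=(0, 1))   # one max per sensor channel
    return mins, maxs

def apply_minmax(x, mins, maxs, eps=1e-12):
    """Apply the train-set statistics to any split, avoiding train-test bias."""
    return (x - mins) / (maxs - mins + eps)
```

Fitting on the train split and reusing the same statistics on the test split is what prevents the train-test leakage noted in the text.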
Appendix C Additional results
Results of using the area under the receiver operating characteristic curve (AUROC) Bewick et al. (2004) as a measure of classifier performance for ARCs are reported in Fig. 9 and Table 6. In addition, we report the ARCs to compare the choices of distribution Q (see Fig. 10).
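For reference, the rejection curves compared here can be sketched as follows; this is an error-rejection formulation (reject the most uncertain fraction and measure the error rate on the rest), and the names and 0/1 error encoding are assumptions.

```python
import numpy as np

def rejection_curve(errors, uncertainties, fractions):
    """Sketch of an accuracy/error-rejection curve (ARC): reject the most
    uncertain fraction of predictions and measure the error rate on the
    retained ones. `errors` is a 0/1 array (1 = misclassified).
    """
    order = np.argsort(-uncertainties)      # most uncertain first
    n = len(errors)
    rates = []
    for f in fractions:
        keep = order[int(f * n):]           # drop the top-f fraction
        rates.append(errors[keep].mean())
    return np.asarray(rates)
```

A well-calibrated uncertainty measure yields a curve that falls as the rejection fraction grows, since the rejected cases are the ones most likely to be mispredicted.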


