Autoencoder Attractors for Uncertainty Estimation

by Steve Dias Da Cruz, et al.

The reliability assessment of a machine learning model's prediction is an important quantity for deployment in safety-critical applications. Not only can it be used to detect novel sceneries, either as out-of-distribution or anomaly samples, but it also helps to determine deficiencies in the training data distribution. Many promising research directions have either proposed traditional methods like Gaussian processes or extended deep learning based approaches, for example by interpreting them from a Bayesian point of view. In this work we propose a novel approach for uncertainty estimation based on autoencoder models: the recursive application of a previously trained autoencoder model can be interpreted as a dynamical system storing training examples as attractors. While input images close to known samples converge to the same or similar attractors, input samples containing unknown features are unstable and converge to different training samples by potentially removing or changing characteristic features. The use of dropout during training and inference leads to a family of similar dynamical systems, each one being robust on samples close to the training distribution but unstable on new features. Either the model reliably removes these features or the resulting instability can be exploited to detect problematic input samples. We evaluate our approach on several dataset combinations as well as on an industrial application for occupant classification in the vehicle interior, for which we additionally release a new synthetic dataset.



I Introduction

Assessing the reliability of machine learning models' predictions is an important challenge for the deployment and applicability of statistical methods. This additional information makes it possible to detect novel and exotic sceneries during the lifetime of a deployed model and to determine how trustworthy the model's predictions are on them. This knowledge also gives hints as to whether the collected training data needs to be extended or modified, e.g. in the case of active learning [gal2017deep] and continuous learning [kading2016fine]. Recent activities investigated the possibility of estimating the uncertainty of deep learning based methods [damianou2013deep, kupinski2003ideal, louizos2017multiplicative, amini2020deep]. Monte Carlo (MC) dropout, i.e. using dropout during training and keeping it enabled during inference for multiple runs, has been shown to produce good uncertainty quantification [gal2016dropout] on several tasks while limiting the additional overhead during training and inference.

Fig. 1: Multiple recursive reconstructions (from left to right, Iter 1 to Iter 7) of identical samples (first column) by our novel model: (a) a test sample from D_in (GTSRB) and (b) an OOD sample from D_out. Notice the evolution in the reconstructions over each iterative step for the OOD sample.

It has been shown that the recursive application of autoencoders trained under the standard training regime can be viewed as a dynamical system [radhakrishnan2020overparameterized]. In mathematics [strogatz2000], the analysis of fixed points, attractors and their basins of attraction is an important tool to analyze and understand dynamical systems and their behavior. This iterative process can be viewed as associative memory [radhakrishnan2020overparameterized] to retrieve perturbed training samples, but the models need to be trained long enough to ensure that the training samples become attractors. To the best of our knowledge, the recursive application of autoencoders and their attractors has not been investigated in view of generalization and uncertainty estimation.

Our contribution consists of extending the recursive application of autoencoder models, and thus dynamical systems and attractors, in view of generalization capacities. We combine this strategy with MC dropout and exploit characteristics of both design choices to determine whether new input samples are close to or far from the training distribution by analyzing the behavior of multiple inferences, as shown in Fig. 1: the test sample converges to a similar attractor, while the out-of-distribution (OOD) sample converges to attractors of different classes. We show that uncertainty estimation is improved compared to vanilla MC dropout and deep ensemble models across three metrics and in view of the entropy distribution. Our ablation study shows that the recursive application is key to the success of our approach. Our analysis is performed on several commonly used OOD dataset combinations as well as on an industrial application: we consider occupant classification in the vehicle interior and highlight some additional challenges. To this end we release a synthetic dataset for uncertainty estimation which extends the existing SVIRO [DiasDaCruz2020SVIRO] dataset for occupant classification.

II Related Work

Attractors: There are several types of models achieving associative memory, e.g. discrete and continuous Hopfield networks [ramsauer2020hopfield, krotov2016dense, hopfield1982neural] and predictive coding [salvatori2021associative]. The former need an energy function to be defined, while the latter is biologically inspired. However, we focus on associative memory achieved by the recursive application of autoencoder models [radhakrishnan2020overparameterized], previously trained with gradient descent, due to their elegant simplicity and their analogy to dynamical systems, which have been investigated extensively in mathematics and physics [strogatz2000]. While a few works investigate properties of this model design [jiang2020associative, radhakrishnan2020overparameterized, radhakrishnan2019memorization], only one [hadjahmadi2019robust] considers attractors for classification and uncertainty estimation. However, the latter adopts this only for speech recognition with respect to noise robustness and combines it with a hidden Markov model. We, on the contrary, apply this methodology to computer vision and assess the robustness against novel classes and unseen samples from either new datasets or the test distribution.

Uncertainty estimation: A lot of research [abdar2021review] focuses on estimating the uncertainty of a model's prediction regarding OOD or anomaly detection, both of which are tightly related. However, only a few works consider the use of autoencoder models for assessing uncertainty: autoencoders can be combined with normalizing flows [bohm2020probabilistic], refactor ideas from compressed sensing [grover2019uncertainty] or use properties of variational autoencoders [ran2021detecting, xiao2020likelihood]. More commonly, autoencoders are used for non-image datasets [vartouni2018anomaly, xu2018unsupervised, oh2018residual]. Other deep learning approaches are based on evidential learning [NEURIPS2018_a981f2b7, amini2020deep], Bayesian methods [mackay1995probable], variational Bayes [blundell2015weight] or Hamiltonian Monte Carlo [chen2014stochastic]. Non-deep-learning approaches have also shown significant success, but are less scalable, for example Gaussian processes [rasmussen2003gaussian] or approaches based on support vector machines [noori2015uncertainty]. Since our approach borrows ideas from MC dropout [gal2016dropout], we limit our comparison to the latter and to the commonly used deep learning gold standard of using an ensemble of trained models [lakshminarayanan2017simple, vyas2018out].

III Method

We start by introducing both underlying approaches: dynamical systems based on autoencoders and their attractors, and uncertainty estimation by MC dropout. We then introduce our method, which we call Monte Carlo Attractor Autoencoder (MCA-AE) and which combines both of the aforementioned design choices.

III-A Preliminaries - Attractors

A good overview of the basic analysis of autoencoders, associative memory and attractors is provided in [radhakrishnan2020overparameterized]. Let f be an autoencoder trained under the standard training regime, i.e. minimizing the reconstruction loss between input x and target f(x), i.e. L(x, f(x)), where L is a reconstruction loss of choice. Consider an input sample x, an index set {1, …, n} for some n and the sequence (f^k(x))_k, where f^k = f ∘ ⋯ ∘ f (k times) denotes k compositions of f applied to x. A point x* is a fixed point of f if f(x*) = x*, where we allow the equality to be weakened, i.e. ‖f(x*) − x*‖ < ε for some small ε > 0, because the reconstruction will never be perfect. The sequence (f^k(x*))_k then converges to x*. A fixed point x* is an attractor of f if there exists an open neighborhood U around x* such that for all x ∈ U the sequence (f^k(x))_k converges to x*. The set of all such points is called the basin of attraction of x* for f. Even disturbed training samples converge to the initial training sample [radhakrishnan2020overparameterized]. We show that this property can be used to generalize to test samples when they are close enough to the training distribution. If the latter is violated, the sample might not be stable in its convergence, which will be exploited by our next design choice.
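The recursive application and the weakened fixed-point test above can be sketched numerically. The snippet below is a minimal toy illustration, not the paper's model: a contractive affine map stands in for a trained autoencoder f, so that its unique fixed point plays the role of an attractor.

```python
import numpy as np

# Toy stand-in for a trained autoencoder: a contractive affine map.
# Its unique fixed point x* = (I - W)^{-1} b acts as the attractor.
rng = np.random.default_rng(0)
W = 0.5 * np.eye(4)                      # contractive: spectral norm 0.5 < 1
b = np.array([1.0, 0.0, -1.0, 0.5])

def f(x):
    return W @ x + b

def iterate(f, x, k):
    """Return the sequence x, f(x), ..., f^k(x)."""
    seq = [x]
    for _ in range(k):
        seq.append(f(seq[-1]))
    return seq

def is_weak_fixed_point(f, x, eps=1e-6):
    """Weakened fixed-point test ||f(x) - x|| < eps."""
    return np.linalg.norm(f(x) - x) < eps

x0 = rng.normal(size=4)                  # arbitrary starting point
attractor = iterate(f, x0, 60)[-1]       # recursion converges to x* = 2b
print(is_weak_fixed_point(f, attractor))  # True
```

For a trained autoencoder the map is nonlinear and has many attractors (ideally one per training sample), but the convergence test is the same.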

III-B Preliminaries - MC Dropout

The use of dropout during training and inference, called Monte Carlo (MC) dropout, has been introduced [gal2016dropout] to model uncertainty in neural networks without sacrificing complexity or test accuracy for several machine learning tasks. For standard classification or regression models, an individual binary mask is sampled for each layer (except the last layer) for each new training and test sample. Consequently, neurons are dropped randomly such that during inference we sample a function f_t from a family, or distribution, of functions F, i.e. f_t ∼ F. Uncertainty and reliability can then be assessed by performing multiple runs for the same input sample x, i.e. retrieving f_t(x) for t = 1, …, T for some T. The model's predictive distribution for an input sample x can then be assessed by computing p(y | x) = (1/T) Σ_{t=1}^{T} f_t(x). Uncertainty can be summarized by computing the normalized entropy [laves2020calibration] of the probability vector p(y | x), i.e. H(p) = −(1/log C) Σ_{c=1}^{C} p_c log p_c, where C is the number of classes. We use the latter in all our experiments to compute the uncertainty of the prediction and decide, based on its value, whether a sample is rejected or accepted for prediction or whether the sample is in- or out-of-distribution.
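The predictive distribution and normalized entropy described above can be computed as follows. This is a small sketch assuming the T stochastic forward passes have already produced per-run probability vectors; the numeric values below are made up for illustration.

```python
import numpy as np

def normalized_entropy(p, eps=1e-12):
    """H(p) = -(1 / log C) * sum_c p_c log p_c, bounded in [0, 1]."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    C = p.shape[-1]
    return float(-(p * np.log(p)).sum(-1) / np.log(C))

# Toy probability vectors from T = 3 stochastic forward passes (made-up values).
runs = np.array([[0.90, 0.05, 0.05],
                 [0.80, 0.10, 0.10],
                 [0.85, 0.10, 0.05]])
p_mean = runs.mean(axis=0)            # predictive distribution p(y | x)

print(normalized_entropy(p_mean))                  # < 1: runs agree on class 0
print(normalized_entropy(np.full(3, 1.0 / 3.0)))   # 1.0: maximal uncertainty
```

A sample is then accepted or rejected by comparing this value against a chosen threshold.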

III-C MCA-AE

Our introduced method combines both previously detailed model designs. Instead of training the autoencoder model under a standard training regime, as done by related works thus far, we train the model using dropout and enable dropout during inference as well. This yields an interesting model property: if we repeat the recursive application of the trained autoencoder several times for the same input sample x, then each run uses a different function f_t from the same distribution of functions F. Hence, we obtain different, but similar, dynamical systems for inference, which should behave similarly on training and test samples, but not consistently on novel feature variations in the input. Each run can hence converge to a different attractor, potentially of a different class. The latter is useful to detect inconsistencies and hence uncertainty: if the model converges to attractors of the same class we can assume a trustworthy prediction; if it converges to attractors of different classes, the convergence is unreliable.

MCA-AE: Let x be an input sample and F be the family of functions consisting of autoencoders learned by using dropout during training and enabling it during inference as well. We repeat the recursion T times, sampling a new f_t ∼ F for each recursion t = 1, …, T. This results in a predictive distribution over the T runs, where K is the number of compositions performed for each recursion. As a reminder, for a fixed t the dropout mask is the same for each recursive step k = 1, …, K. This implies that the dropout mask needs to be implemented manually such that it can be kept fixed over multiple inferences. Since we are adopting this strategy for autoencoders, we refrain from using dropout in the latent space. Classification of the resulting iteratively reconstructed sample is performed in the latent space of the K-th iteration. For the latter we use an MLP classifier with a single hidden layer of the same size as the latent dimension. To summarize this heuristic:

1:  Train autoencoder model using dropout to get F
2:  Enable dropout for inference
3:  Define the number of recursions K
4:  Train classifier g in latent space after K recursions
5:  Define the number of inferences T per sample
6:  Define uncertainty threshold τ
7:  for each input sample x do
8:     for t = 1 to T do
9:        for k = 1 to K do
10:           if k = 1 then
11:              Sample a new dropout mask and keep it fixed
12:              This yields f_t ∼ F, where f_t = D_t ∘ E_t
13:           end if
14:           z ← E_t(x)  {encoding}
15:           x ← D_t(z)  {decoding}
16:        end for
17:        p_t ← g(z)  {probability distribution of classification}
18:     end for
19:     p ← (1/T) Σ_t p_t
20:     Compute the normalized entropy H(p)
21:     if H(p) < τ then
22:        Predict argmax_c p_c
23:     else
24:        Reject sample
25:     end if
26:  end for

For training samples to become attractors it is necessary to train the autoencoder for a large number of epochs; we used 25000. Many of the hyperparameters concern the inference process instead of the training process: the number of recursions K and the number of runs T are independent of the training, and the classifier can be chosen after the autoencoder training. The uncertainty threshold τ needs to be adapted to the use case and is a tradeoff between the required sensitivity and precision.
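The inference heuristic above can be sketched end to end. The following is a toy NumPy illustration of the MCA-AE inference loop only: the layer sizes, the random (untrained) encoder/decoder weights, the hypothetical classify() stub standing in for the latent-space MLP, and the values of T and τ are all made up for illustration (K = 2 matches the setting reported later). The point is the control flow: one dropout mask per run t, kept fixed across the K recursive steps, and rejection via the normalized entropy threshold.

```python
import numpy as np

rng = np.random.default_rng(1)
D, H = 8, 16                      # toy input and hidden dims, not from the paper
W_enc = rng.normal(scale=0.3, size=(H, D))   # untrained stand-in weights
W_dec = rng.normal(scale=0.3, size=(D, H))
p_drop = 0.33

def encode(x, mask):
    # Fixed dropout mask applied to the hidden layer (no dropout in latent space)
    return np.maximum(W_enc @ x, 0.0) * mask

def decode(z):
    return W_dec @ z

def mca_ae_predict(x, classify, n_classes, K=2, T=10, tau=0.5):
    votes = np.zeros(n_classes)
    for _ in range(T):                        # one dynamical system per run t
        mask = (rng.random(H) > p_drop).astype(float) / (1.0 - p_drop)
        xi, z = x, None
        for _ in range(K):                    # recursion with the SAME mask
            z = encode(xi, mask)
            xi = decode(z)
        votes[classify(z)] += 1               # classify in latent space of step K
    p = votes / T
    q = np.clip(p, 1e-12, 1.0)
    entropy = float(-(q * np.log(q)).sum() / np.log(n_classes))
    if entropy < tau:
        return int(p.argmax()), entropy       # accept: consistent attractors
    return None, entropy                      # reject: unreliable convergence

# Hypothetical latent classifier stub: always predicts class 0.
label, entropy = mca_ae_predict(np.ones(D), classify=lambda z: 0, n_classes=3)
```

In a real implementation the random weights would be the trained autoencoder and classify the trained latent MLP; note that the mask is sampled once per outer run, not per recursive step.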

IV Experiments

We evaluate our method on two scenarios. First, we want to assess the predictive uncertainty: the model should report a high uncertainty in case it wrongly classifies a sample. This is made more difficult in the case of the vehicle interior, where unseen objects should be classified as empty seats, i.e. the model should only identify known classes and neglect everything else. Our results will show that this is a challenging task. Second, the model should differentiate between in- and out-of-distribution (OOD) samples. In the case of training on MNIST and evaluating on Fashion-MNIST, the model cannot perform a correct prediction and should detect the OOD samples as such. This is also the case when images from a new vehicle interior are provided as input to the model. All training and evaluation scripts can be found in our implementation (link).

Iv-a Evaluation metrics

Following the standard evaluation criteria adopted in related works, we evaluate our models using the Area Under the Receiver Operating Characteristic curve (AUROC), the Area Under the Precision-Recall curve (AUPR) and the false positive rate at 95% true positive rate (FPR95%). For OOD evaluation we use approximately 50% of the samples from the D_in test set and 50% from the D_out test set. For further details and interpretations of the metrics we refer to [hendrycks17baseline, hendrycks2018deep, davis2006relationship, manning1999foundations, liu2018open].
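For reference, two of these metrics (AUROC and FPR95%) can be computed directly from per-sample uncertainty scores. The sketch below is our own minimal implementation, not the paper's evaluation code, and assumes a higher score means "more likely OOD" (e.g. the normalized entropy).

```python
import numpy as np

def auroc(scores_in, scores_out):
    """P(random OOD score > random in-distribution score), ties counted half."""
    s_in, s_out = np.asarray(scores_in), np.asarray(scores_out)
    greater = (s_out[:, None] > s_in[None, :]).mean()
    ties = (s_out[:, None] == s_in[None, :]).mean()
    return float(greater + 0.5 * ties)

def fpr_at_95_tpr(scores_in, scores_out):
    """Pick the threshold that detects 95% of OOD samples, then measure the
    fraction of in-distribution samples wrongly flagged at that threshold."""
    threshold = np.quantile(np.asarray(scores_out), 0.05)
    return float((np.asarray(scores_in) >= threshold).mean())

# Perfectly separated toy scores: ideal detector
print(auroc([0.0, 0.1, 0.2], [0.8, 0.9, 1.0]))          # 1.0
print(fpr_at_95_tpr([0.0, 0.1, 0.2], [0.8, 0.9, 1.0]))  # 0.0
```

The pairwise formulation of AUROC is quadratic in the sample count; for large test sets a rank-based implementation is preferable, but the result is the same.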

IV-B Datasets

We use several commonly used computer vision datasets for training and use the corresponding test data as in-distribution sets D_in: MNIST [lecun1998gradient], Fashion-MNIST [xiao2017fashion], SVHN [37648] and GTSRB [Houben-IJCNN-2013] (which we reduce to 10 classes only). As out-of-distribution sets D_out we use a subset of all datasets not coming from the training distribution and the test datasets from Omniglot [lake2015human], CIFAR10 [krizhevsky2009learning], LSUN [yu2015lsun] (for which we use the train split) and Places365 [zhou2017places]. We use approximately the same number of samples from D_in and D_out by sampling each class uniformly. An overview is provided in Table I.

Dataset              Classes   OOD     Uncertainty
MNIST                10        2500    10000
Fashion              10        2500    26032
SVHN                 10        2500    10000
GTSRB                10        2006    3208
CIFAR10              10        2500    -
Omniglot             660       2636    -
LSUN                 10        2500    -
Places365            365       2555    -
SVIRO-U Adults (A)   7         1337    2617
SVIRO-U Seats (S)    8         -       490
SVIRO-U Objects (O)  8         -       1622
SVIRO-U A,S          26        -       896
SVIRO-U A,O          7         -       1421
SVIRO-U A,S,O        30        -       1676
SVIRO Tesla          21        -       2000
TABLE I: Overview of the number of classes and samples for OOD or uncertainty estimation for the different datasets used.

In addition to these commonly used datasets, we release an extension of SVIRO [DiasDaCruz2020SVIRO] called SVIRO-Uncertainty. For each of the three seat positions on the rear bench of the vehicle interior, the model should classify which object occupies it, with empty being one possible choice. We created two training datasets for the Sharan vehicle, one using adult passengers only (4384 sceneries and 8 classes) and one using adults, child seats and infant seats (3515 samples and 64 classes - not used for training in this work). We created fine-grained test sets to assess the reliability on several difficulty levels: 1) only unseen adults, 2) only unseen child and infant seats, 3) unseen adults and unseen child and infant seats, 4) unknown random everyday objects (e.g. dog, plants, bags, washing machine, instruments, tv, skateboard, paintings, …), 5) unseen adults and unknown everyday objects and 6) unseen adults, unseen child and infant seats and unknown everyday objects. The dataset can be downloaded (link). Besides the uncertainty estimation within the same vehicle interior, one can use images from unseen vehicle interiors from SVIRO to further test the model's reliability on the same task but in novel environments, i.e. other vehicle interiors. Example images are provided in Fig. 2.

Fig. 2: Examples from the SVIRO-Uncertainty dataset. The first row shows training samples of adults only. The second row shows test samples of unseen adults, but also child seats and everyday objects, which should be classified as empty.
              MCA-AE (Ours)   MC Dropout    Ensemble of 10 models

D_in = MNIST
  MNIST       90.1 ±0.6       99.6 ±0.1     28.0 ±2.4
  CIFAR10     91.5 ±1.1       92.4 ±0.9     34.0 ±4.7
  Fashion     90.0 ±1.6       91.1 ±1.3     37.0 ±3.2
  Omniglot    95.5 ±1.0       96.0 ±0.8     22.0 ±6.0
  SVHN        94.9 ±0.9       95.4 ±0.7     22.4 ±5.1

D_in = Fashion
  Fashion     82.5 ±0.4       96.4 ±0.1     64.4 ±4.3
  CIFAR10     93.9 ±1.8       95.8 ±1.1     34.3 ±3.1
  MNIST       90.2 ±0.5       90.6 ±0.5     35.7 ±2.4
  Omniglot    97.9 ±0.4       98.1 ±0.3     9.3 ±2.3
  SVHN        95.6 ±1.3       94.8 ±0.5     23.0 ±2.6

D_in = SVHN
  SVHN        84.0 ±0.6       93.1 ±0.4     68.7 ±2.0
  CIFAR10     77.6 ±0.7       80.5 ±0.6     83.3 ±1.4
  GTSRB       75.4 ±2.2       80.7 ±5.4     81.2 ±0.7
  LSUN        79.2 ±0.7       81.9 ±0.7     80.1 ±1.9
  Places365   79.2 ±0.5       81.5 ±0.4     79.5 ±1.9

D_in = GTSRB
  GTSRB       89.3 ±2.4       98.8 ±0.3     50.9 ±6.0
  CIFAR10     91.4 ±0.6       90.3 ±0.8     42.0 ±3.3
  LSUN        93.0 ±0.7       92.2 ±0.7     36.5 ±4.4
  Places365   92.3 ±0.7       91.3 ±0.7     38.8 ±3.4
  SVHN        91.3 ±0.7       90.7 ±0.8     44.5 ±3.7

D_in = SVIRO-U
  CIFAR10     95.4 ±0.6       93.3 ±1.0     26.9 ±3.4
  GTSRB       95.8 ±1.0       94.9 ±1.1     25.1 ±6.9
  LSUN        94.8 ±0.5       92.7 ±0.7     31.5 ±2.7
  Places365   95.4 ±0.5       93.3 ±0.7     27.3 ±2.8
  SVHN        92.4 ±1.6       88.6 ±2.3     40.1 ±7.6
  Adults (A)  95.2 ±1.7       99.9 ±0.1     8.9 ±3.1
  Seats (S)   54.0 ±7.5       8.9 ±4.2      88.8 ±10.8
  Objects (O) 68.9 ±3.1       84.1 ±5.5     85.3 ±3.3
  A,S         58.8 ±2.6       48.6 ±6.4     93.2 ±1.1
  A,O         78.8 ±1.5       76.1 ±3.4     93.5 ±0.9
  A,S,O       62.2 ±2.0       56.4 ±4.7     88.7 ±3.1
  Tesla (OOD) 88.6 ±2.0       97.4 ±0.5     44.4 ±44.4

TABLE II: Comparison (in percent) of our method against MC dropout and an ensemble of models. We repeated the experiments for 10 runs and report the mean values together with their standard deviation. If D_in = D_out, then we report the result on the test set of D_in only. Arrows indicate whether larger or smaller is better. Best results are highlighted in grey. The last block is a comparison on the fine-grained splits of the newly released SVIRO-Uncertainty. All but adults should be classified as empty.

IV-C Training and evaluation details

We compare our method against MC dropout and an ensemble of models using the same architecture as the autoencoder's encoder part, but with an additional classification head. We trained our MCA-AE models for 25000 epochs, but fewer epochs might produce good results as well. We did not perform an ablation study with respect to the number of epochs needed. Further, we did not check whether the training samples are truly fixed points and attractors because of the computational overhead: this could be done by computing the largest eigenvalue of the Jacobian matrix for each training sample and checking whether it is greater than 1. The autoencoder model was trained as a denoiser [xie2012image] (blur, random noise, brightness and contrast augmentations were used) to facilitate and robustify the recursive autoencoder application. Consequently, to have a fair benchmark, the MC dropout and ensemble models used the same augmented images during training. The latter were trained for 1000 epochs. All methods used Adam, a learning rate of and a batch size of 64. For training on MNIST and Fashion-MNIST we used a latent space of 10, while for all others we used a latent space of 64. We used SSIM [bergmann2018improving] to compute the reconstruction loss. We used 250 samples per class for training and treat all datasets as grayscale images. All images were centre cropped and resized to 64 pixels. We used a dropout rate of 0.33 for all methods. Model and training details can be found in the implementation.

For MCA-AE and MC dropout we used T inferences per sample, and we used an ensemble of 10 models to assess uncertainty and OOD estimation. We repeated each training for 10 runs for MCA-AE and MC dropout and for 100 runs to get the ensembles of models. We used K = 2 recursions for MCA-AE, but this value depends on the dataset used and is subject to a hyperparameter search. In our case, the models converged fast for test samples and slowly for OOD samples, see Fig. 3. Hence, more iterations did not provide an improvement.

IV-D Uncertainty estimation and out-of-distribution detection


Fig. 3: Multiple recursive reconstructions (from left to right, Iter 1 to Iter 4) of identical samples (first column) from D_in and D_out by our novel MCA-AE model, for (a) D_in: Fashion-MNIST, (b) D_in: MNIST, (c) D_in: SVHN, (d) D_in: GTSRB and (e) D_in: SVIRO-U. Notice the evolution in the reconstruction results over each iterative step for the OOD samples. D_in reconstructions converge more robustly than D_out reconstructions.
Fig. 4: Comparison of entropy histograms between D_in (GTSRB, filled blue bars) and several D_out (unfilled bars, colored according to the dataset used) for the different models: (a) MCA-AE (Ours), (b) MC Dropout and (c) Ensemble of 10 models. MCA-AE provides the best separation between D_in and D_out across all datasets.

We report the summary of our results for uncertainty and OOD detection in Table II. An interesting observation is that our approach performs significantly better when the visual complexity increases (GTSRB, SVIRO), while the performance of MC dropout and the ensemble of models decreases in those setups. On the other hand, on visually much simpler datasets (MNIST, Fashion-MNIST, SVHN), MC dropout and the ensemble of models perform best. Thus, our method seems to be more beneficial for higher visual complexity, but this behavior should be investigated in detail in future work. Another interesting observation is that our approach provides better OOD estimates for the unseen Tesla vehicle from SVIRO. It can also be observed that the different SVIRO-Uncertainty splits are much more challenging and cause a large performance drop for all methods.

We computed the histograms of the entropies for each D_in and D_out and report the results in Fig. 4 when trained on GTSRB. The results show that the entropy distributions of D_in and the several D_out are best separated by our approach. Moreover, the distributions of the different D_out are more similar to one another than for the other models. To quantify this, we computed the sum of the Wasserstein distances between D_in and all D_out (TD, larger is better, as we want them to be different) and the sum of the distances between CIFAR10 and all other D_out (OD, smaller is better, as we want them to be similar). We then computed the mean and standard deviation across 10 runs. The results are reported in Table III and show that our method best separates the uncertainty between D_in and D_out. Further, all D_out are most similar to each other.

MCA-AE (Ours)   MC Dropout   Ensemble
TABLE III: Sum of the Wasserstein distances between D_in and all D_out (TD, larger is better) and sum of the distances between CIFAR10 and all other D_out (OD, smaller is better), computed over 10 runs. We report mean and standard deviation.
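The TD and OD summary statistics can be reproduced from per-sample entropies with a simple 1-D Wasserstein distance. The sketch below uses a quantile-based approximation and made-up entropy arrays; in practice these would be the per-sample entropies underlying Fig. 4.

```python
import numpy as np

def w1(a, b, n_quantiles=512):
    """Approximate 1-D Wasserstein distance via averaged quantile differences."""
    q = np.linspace(0.0, 1.0, n_quantiles)
    return float(np.abs(np.quantile(np.asarray(a), q)
                        - np.quantile(np.asarray(b), q)).mean())

# Made-up entropy samples: D_in confident (near 0), all D_out uncertain (near 1).
ent_in = np.zeros(100)
ent_out = {"CIFAR10": np.ones(100), "LSUN": np.ones(100), "Places365": np.ones(100)}

# TD: separation between D_in and each D_out (larger is better)
TD = sum(w1(ent_in, e) for e in ent_out.values())
# OD: agreement among the D_out distributions themselves (smaller is better)
OD = sum(w1(ent_out["CIFAR10"], e) for name, e in ent_out.items()
         if name != "CIFAR10")
print(TD, OD)  # 3.0 0.0
```

This quantile formulation agrees with the closed-form 1-D Wasserstein distance and avoids any dependency beyond NumPy.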

IV-E Ablation study

We want to highlight that the performance of our method improves due to the recursive application of the previously trained autoencoder. To this end, we provide additional results comparing the performance when no recursion is applied. We repeat the evaluation from the previous section and report the performance in Table IV. Comparing the results against Table II, it becomes apparent that the recursive application significantly improves uncertainty and OOD estimation.

D_in      D_out
MNIST     Fashion
MNIST     Omniglot
Fashion   Fashion
Fashion   CIFAR10
Fashion   MNIST        88.2 ±3.3   89.3 ±3.0
Fashion   Omniglot
Fashion   SVHN
SVHN      Places365
GTSRB     GTSRB        85.7 ±1.3   95.9 ±0.6   67.3 ±2.9
GTSRB     Places365
SVIRO-U   Places365
SVIRO-U   Adults (A)
SVIRO-U   Seats (S)    74.6 ±37.6
SVIRO-U   Objects (O)
TABLE IV: OOD and uncertainty estimation when no recursion is applied. In most cases the results are worse compared to 2 recursions - see Table II. In cases where they are better, we mark them in grey.

In Fig. 3 we report the reconstructions after 1, 2, 3 and 4 iterative steps. We repeat this for models trained on different D_in and show that D_out reconstructions also converge over time (though much more slowly) to training samples. We hence believe that considering the trajectory of the latent space representation over several steps can be an additional indicator of whether an input sample is in- or out-of-distribution. It also becomes visible that the reconstruction converges robustly to similar classes for D_in samples, but to different classes for D_out samples.

V Discussion and Limitations

From a mathematical point of view, dynamical systems are defined by the natural phenomena or mechanical systems one wants to investigate and understand. Hence, designing or influencing the dynamical system of interest is usually not an option. Interestingly, this is not the case for the recursive application of an autoencoder interpreted as a dynamical system: since we train the autoencoder in a first step, the resulting dynamical behavior and its attractors can be influenced by the autoencoder training procedure we define. We believe that analyzing this interrelationship is an interesting direction for future work. Further, the effect of the number of training epochs needed to obtain good results should be investigated. The basins of attraction can be studied after the autoencoder model is trained, so this information could potentially be used to further improve robustness, interpretability and uncertainty estimation. We believe that the trajectory of the latent space representation over several iterations can give hints about the model's robustness. Finally, while we fix the dropout mask for one recursion and each iterative step (but use a different one for each new recursion), it would also be possible to sample a new function for each iterative step within a recursion.

VI Conclusion

Our results on several datasets show that the recursive application of autoencoder models, viewed as dynamical systems, together with an MC dropout approach provides good uncertainty and out-of-distribution estimates. Our model design choices improve the performance, particularly for computer vision datasets of higher visual complexity. Our ablation study highlights that the success is mainly due to the recursion, and the entropy histograms underline the improved separability compared to MC dropout and an ensemble of models.


The first author is supported by the Luxembourg National Research Fund (FNR) under grant number 13043281. The second author is supported by DECODE (01IW21001).