icpr2022autoencoderattractors
Assessing the reliability of a machine learning model's prediction is important for deployment in safety-critical applications. Not only can it be used to detect novel sceneries, either as out-of-distribution or anomaly samples, but it also helps to determine deficiencies in the training data distribution. Many promising research directions have either proposed traditional methods like Gaussian processes or extended deep learning based approaches, for example by interpreting them from a Bayesian point of view. In this work we propose a novel approach for uncertainty estimation based on autoencoder models: the recursive application of a previously trained autoencoder model can be interpreted as a dynamical system storing training examples as attractors. While input images close to known samples will converge to the same or a similar attractor, input samples containing unknown features are unstable and converge to different training samples, potentially removing or changing characteristic features. The use of dropout during training and inference leads to a family of similar dynamical systems, each one robust on samples close to the training distribution but unstable on new features. Either the model reliably removes these features, or the resulting instability can be exploited to detect problematic input samples. We evaluate our approach on several dataset combinations as well as on an industrial application for occupant classification in the vehicle interior, for which we additionally release a new synthetic dataset.
Assessing the reliability of machine learning models' predictions is an important challenge for the deployment and applicability of statistical methods. This additional information makes it possible to detect novel and exotic sceneries during the lifetime of a deployed model and to judge how trustworthy the model's predictions are in such cases. This knowledge also hints at whether the collected training data needs to be extended or modified, e.g. in the case of active learning
[gal2017deep] and continual learning [kading2016fine]. Recent work has investigated the possibility of estimating uncertainty for deep learning based methods [damianou2013deep, kupinski2003ideal, louizos2017multiplicative, amini2020deep]. Monte Carlo (MC) dropout, i.e. using dropout during training and keeping it enabled during inference for multiple runs, has been shown to produce good uncertainty quantification [gal2016dropout] on several tasks while limiting the additional overhead during training and inference. It has been shown that recursive applications of autoencoders, trained under the standard training regime, can be viewed as a dynamical system [radhakrishnan2020overparameterized]. In mathematics [strogatz2000], fixed points, attractors and their basins of attraction are important tools to analyze and understand dynamical systems and their behavior. This iterative process can be viewed as an associative memory [radhakrishnan2020overparameterized] that retrieves perturbed training samples, but the models need to be trained long enough to ensure that the training samples become attractors. To the best of our knowledge, the recursive application of autoencoders and their attractors has not been investigated in view of generalization and uncertainty estimation.
Our contribution consists of extending the recursive application of autoencoder models, and thus dynamical systems and attractors, in view of generalization capacities. We combine this strategy with MC dropout and exploit characteristics of both design choices to determine whether new input samples are close to or far from the training distribution by analyzing the behavior of multiple inferences, as shown in Fig. 1: the test sample converges to a similar attractor, while the out-of-distribution (OOD) sample converges to attractors of different classes. We show that uncertainty estimation is improved compared to vanilla MC dropout and deep ensemble models across three metrics and in view of the entropy distribution. Our ablation study shows that the recursive application is key to the success of our approach. Our analysis is performed on several commonly used OOD dataset combinations as well as on an industrial application: we consider occupant classification in the vehicle interior and highlight its additional challenges. To this end we release a synthetic dataset for uncertainty estimation which extends the existing SVIRO [DiasDaCruz2020SVIRO] dataset for occupant classification.
Attractors: There are several types of models achieving associative memory, e.g. discrete and continuous Hopfield networks [ramsauer2020hopfield, krotov2016dense, hopfield1982neural] and predictive coding [salvatori2021associative]. The former need an energy function to be defined, while the latter is biologically inspired. We focus instead on associative memory achieved by the recursive application of autoencoder models [radhakrishnan2020overparameterized], previously trained with gradient descent, due to their elegant simplicity and their analogy to dynamical systems, which have been investigated extensively in mathematics and physics [strogatz2000]. While a few works investigate properties of this model design [jiang2020associative, radhakrishnan2020overparameterized, radhakrishnan2019memorization], only one [hadjahmadi2019robust] considers attractors for classification and uncertainty estimation. However, that work adopts attractors only for speech recognition with respect to noise robustness and combines them with a hidden Markov model. We, on the contrary, apply this methodology to computer vision and assess robustness against novel classes and unseen samples from either new datasets or the test distribution.
Uncertainty estimation: A lot of research [abdar2021review]
is focusing on estimating the uncertainty of a model's prediction for OOD or anomaly detection, two tightly related tasks. However, only a few works consider autoencoder models for assessing uncertainty: autoencoders can be combined with normalizing flows [bohm2020probabilistic], refactor ideas from compressed sensing [grover2019uncertainty] or use properties of variational autoencoders [ran2021detecting, xiao2020likelihood]. More commonly, autoencoders are used on non-image datasets [vartouni2018anomaly, xu2018unsupervised, oh2018residual]. Other deep learning approaches are based on evidential learning [NEURIPS2018_a981f2b7, amini2020deep], Bayesian methods [mackay1995probable], variational Bayes [blundell2015weight] or Hamiltonian Monte Carlo [chen2014stochastic]. Non-deep-learning approaches have also shown significant success, but are less scalable, for example Gaussian processes [rasmussen2003gaussian] or approaches based on support vector machines [noori2015uncertainty]. Since our approach borrows ideas from MC dropout [gal2016dropout], we limit our comparison to the latter and to the commonly used deep learning gold standard of an ensemble of trained models [lakshminarayanan2017simple, vyas2018out].
We start by introducing both approaches: dynamical systems based on autoencoders and their attractors, and uncertainty estimation by MC dropout. Next we introduce our method, which we call Monte Carlo Attractor Autoencoder (MC-AAE), combining both of the aforementioned design choices.
A good overview of the basic analysis of autoencoders, associative memory and attractors is provided in [radhakrishnan2020overparameterized]. Let $f$ be an autoencoder trained under the standard training regime, i.e. minimizing the reconstruction loss between input $x$ and target $f(x)$, i.e. $L(x, f(x))$, where $L$ is a reconstruction loss of choice. Consider an input sample $x$, an index set $K = \{1, \dots, k\}$ for some $k \in \mathbb{N}$ and the sequence $(f^j(x))_{j \in K}$, where $f^j = f \circ \dots \circ f$ ($j$ times) denotes $j$ compositions of $f$ applied to $x$. A point $x^*$ is a fixed point of $f$ if $f(x^*) = x^*$, where we allow the equality to be weakened, i.e. $\lVert f(x^*) - x^* \rVert < \varepsilon$ for some small $\varepsilon > 0$, because the reconstruction will never be perfect. The sequence then converges to $x^*$. A fixed point $x^*$ is an attractor of $f$ if there exists an open neighborhood $U$ around $x^*$ such that for all $x \in U$ the sequence $(f^j(x))$ converges to $x^*$ as $j \to \infty$. The set of all such points is called the basin of attraction of $x^*$ for $f$. Even disturbed training samples converge to the initial training sample [radhakrishnan2020overparameterized]. We show that this property can be used to generalize to test samples when they are close enough to the training distribution. If the latter is violated, the sample might not be stable in its convergence, which is exploited by our next design choice.
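As an illustration of the recursive application and the weakened fixed-point test above, the following sketch replaces a trained autoencoder with a toy one-dimensional contraction; `iterate`, `is_fixed_point` and the map `f` are illustrative stand-ins, not part of the paper's implementation.

```python
def iterate(f, x, k):
    """Compose f with itself k times: returns f^k(x)."""
    for _ in range(k):
        x = f(x)
    return x

def is_fixed_point(f, x, eps=1e-6):
    """Weakened fixed-point test |f(x) - x| < eps (reconstruction is never exact)."""
    return abs(f(x) - x) < eps

# Toy 1-D 'autoencoder': a contraction toward the stored training sample a = 2.0,
# so a is an attractor and its basin of attraction is the whole real line.
a = 2.0
f = lambda x: a + 0.5 * (x - a)

x_k = iterate(f, 10.0, 50)     # orbit of a perturbed sample
print(is_fixed_point(f, x_k))  # True: the orbit has converged to the attractor
```

With a real autoencoder, `f` would be the decoder-after-encoder map and the norm would be taken over image pixels; the convergence logic is the same.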
The use of dropout during training and inference, called Monte Carlo (MC) dropout, has been introduced [gal2016dropout] to model uncertainty in neural networks without sacrificing complexity or test accuracy for several machine learning tasks. For standard classification or regression models, an individual binary mask is sampled for each layer (except the last one) for each new training and test sample. Consequently, neurons are dropped randomly such that during inference we sample a function $f_i$ from a family, or distribution, of functions $\mathcal{F}$, i.e. $f_i \sim \mathcal{F}$. Uncertainty and reliability can then be assessed by performing multiple runs for the same input sample $x$, i.e. retrieving $f_i(x)$ for $i = 1, \dots, T$ for some $T \in \mathbb{N}$. The model's predictive distribution for an input sample $x$ can then be assessed by computing $p = \frac{1}{T} \sum_{i=1}^{T} \operatorname{softmax}(f_i(x))$. Uncertainty can be summarized by computing the normalized entropy [laves2020calibration] of the probability vector $p$, i.e. $H(p) = -\frac{1}{\log C} \sum_{c=1}^{C} p_c \log p_c$, where $C$ is the number of classes. We use the latter in all our experiments to compute the uncertainty of the prediction and to decide, based on its value, whether a sample is rejected or accepted for prediction, or whether the sample is in- or out-of-distribution.
Our method combines both previously detailed model designs. Instead of training the autoencoder model under a standard training regime, as done by related works thus far, we train the model using dropout and enable dropout during inference as well. This yields an interesting model property: if we repeat the recursive application of the trained autoencoder several times for the same input sample $x$, then each repetition uses a different function $f_i$ from the same distribution of functions $\mathcal{F}$. Hence, we obtain different, but similar, dynamical systems for inference, which should behave similarly on training and test samples, but not consistently on novel feature variations in the input. Each repetition can hence converge to a different attractor, potentially of a different class. The latter is useful for detecting inconsistencies and hence uncertainty: if the model converges to attractors of the same class, we can assume a trustworthy prediction; if it converges to attractors of different classes, the convergence is unreliable.
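The predictive distribution and normalized entropy used throughout our experiments can be sketched in a few lines; this is a minimal pure-Python illustration with made-up logits, not the paper's code:

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def predictive_distribution(runs):
    """p = (1/T) * sum_i softmax(f_i(x)) over T stochastic forward passes."""
    T, C = len(runs), len(runs[0])
    probs = [softmax(r) for r in runs]
    return [sum(p[c] for p in probs) / T for c in range(C)]

def normalized_entropy(p):
    """H(p) = -(1/log C) * sum_c p_c log p_c, normalized to [0, 1]."""
    C = len(p)
    return -sum(pc * math.log(pc) for pc in p if pc > 0) / math.log(C)

# T dropout runs that agree -> low entropy; runs that disagree -> high entropy.
consistent = [[5.0, 0.0, 0.0]] * 10
disagreeing = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]] * 3 + [[5.0, 0.0, 0.0]]
print(normalized_entropy(predictive_distribution(consistent)))   # close to 0
print(normalized_entropy(predictive_distribution(disagreeing)))  # close to 1
```

Thresholding this entropy value then yields the accept/reject decision described in the text.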
MC-AAE: Let $x$ be an input sample and $\mathcal{F}$ be the family of functions consisting of autoencoders learned by using dropout during training and enabling it during inference as well. We repeat the recursion $T$ times, sampling a new $f_i \sim \mathcal{F}$ for each recursion $i$. This results in a predictive distribution $p = \frac{1}{T} \sum_{i=1}^{T} \operatorname{softmax}(g(z_i^{(k)}))$, where $k$ is the number of compositions performed for each recursion and $z_i^{(k)}$ is the latent representation after the $k$-th application of $f_i$, on which the classifier $g$ operates. As a reminder, for a fixed $i$ the dropout mask is the same for each of the recursive steps $1, \dots, k$; this implies that the dropout mask needs to be implemented manually such that it can be fixed across multiple inferences. Since we are adopting this strategy for autoencoders, we refrain from using dropout in the latent space. Classification of the resulting iteratively reconstructed sample is performed in the latent space of the $k$-th iteration. For the latter we use an MLP classifier with a single hidden layer of the same size as the latent dimension.
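To summarize the heuristic, here is a sketch of the MC-AAE inference loop. The helper names (`mc_aae_predict`, `toy_autoencode`, `toy_classify`) and the toy two-attractor dynamics are illustrative assumptions; for brevity it aggregates hard class votes rather than averaged softmax outputs, and a real implementation would compose a dropout autoencoder under a fixed mask and classify the latent code:

```python
import random
from collections import Counter

def mc_aae_predict(x, autoencode, classify, T=10, k=5, seed=0):
    """MC-AAE inference sketch: for each of T recursions, sample one dropout
    configuration, keep it FIXED while composing the autoencoder k times,
    then classify the final state. Disagreement across the T predictions
    signals an unreliable (e.g. out-of-distribution) input."""
    rng = random.Random(seed)
    votes = []
    for _ in range(T):
        mask_seed = rng.random()          # stands in for one fixed dropout mask
        z = x
        for _ in range(k):                # k compositions with the same mask
            z = autoencode(z, mask_seed)
        votes.append(classify(z))
    label, count = Counter(votes).most_common(1)[0]
    return label, count / T               # predicted class and agreement ratio

# Toy dynamics with two attractors (-1 and +1); the sampled mask slightly
# perturbs which attractor an ambiguous input falls into.
def toy_autoencode(z, mask_seed):
    target = 1.0 if z + 0.2 * (mask_seed - 0.5) > 0 else -1.0
    return z + 0.5 * (target - z)

toy_classify = lambda z: int(z > 0)

print(mc_aae_predict(0.9, toy_autoencode, toy_classify))  # full agreement
print(mc_aae_predict(0.0, toy_autoencode, toy_classify))  # split votes: uncertain
```

A sample close to a stored attractor yields the same class under every sampled mask, whereas an ambiguous input is pulled toward different attractors depending on the mask.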
For training samples to become attractors, it is necessary to train the autoencoders for a large number of epochs; we used 25000. Many hyperparameters concern the inference process rather than the training process: the number of recursions and the number of different runs are independent of the training, and the classifier can be chosen after the autoencoder training. The uncertainty threshold needs to be adapted to the use case and is a trade-off between the required sensitivity and precision.
We evaluate our method in two scenarios: First, we assess the predictive uncertainty, where the model should report high uncertainty whenever it misclassifies a sample. This is made more difficult in the case of the vehicle interior: unseen objects should be classified as empty seats, i.e. the model should only identify known classes and neglect everything else. Our results will show that this is a challenging task. Second, the model should differentiate between in-distribution and out-of-distribution (OOD) samples. When training on MNIST and evaluating on FashionMNIST, the model cannot make a correct prediction and should detect the OOD samples as such. The same holds when images from a new vehicle interior are provided as input. All training and evaluation scripts can be found in
our implementation (link). Following the standard evaluation criteria adopted in related works, we evaluate our models using the Area Under the Receiver Operating Characteristic curve (AUROC), the Area Under the Precision-Recall curve (AUPR) and the false positive rate at 95% true positive rate (FPR95%). For OOD evaluation we use approximately 50% of the samples from the test set of $D_{in}$ and 50% from the test set of $D_{out}$. For further details and interpretations of the metrics we refer to [hendrycks17baseline, hendrycks2018deep, davis2006relationship, manning1999foundations, liu2018open]. We use several commonly used computer vision datasets for training and use the corresponding test data as in-distribution sets $D_{in}$: MNIST [lecun1998gradient], FashionMNIST [xiao2017fashion], SVHN [37648] and GTSRB [HoubenIJCNN2013] (which we reduce to 10 classes only). As out-of-distribution sets $D_{out}$ we use a subset of all datasets not coming from the training distribution as well as the test sets of Omniglot [lake2015human], CIFAR10 [krizhevsky2009learning], LSUN [yu2015lsun] (for which we use the train split) and Places365 [zhou2017places]. We use approximately the same number of samples from $D_{in}$ and $D_{out}$ by sampling each class uniformly. An overview is provided in Table I.
Dataset  Classes  $D_{in}$ and $D_{out}$  Uncertainty
MNIST  10  2500  10000 
Fashion  10  2500  26032 
SVHN  10  2500  10000 
GTSRB  10  2006  3208 
CIFAR10  10  2500   
Omniglot  660  2636   
LSUN  10  2500   
Places365  365  2555   
SVIRO-U Adults (A)  7  1337  2617 
SVIRO-U Seats (S)  8    490 
SVIRO-U Objects (O)  8    1622 
SVIRO-U A,S  26    896 
SVIRO-U A,O  7    1421 
SVIRO-U A,S,O  30    1676 
SVIRO Tesla  21    2000 
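The evaluation metrics above are available in standard libraries; as a self-contained illustration, here is a minimal sketch of AUROC (via its rank formulation) and FPR at 95% TPR, assuming higher scores mean "more likely OOD". The function names and score values are made up for illustration:

```python
def auroc(scores_in, scores_out):
    """AUROC via the Mann-Whitney rank formulation: the probability that an
    OOD sample receives a higher uncertainty score than an in-distribution one."""
    wins = 0.0
    for so in scores_out:
        for si in scores_in:
            wins += 1.0 if so > si else (0.5 if so == si else 0.0)
    return wins / (len(scores_in) * len(scores_out))

def fpr_at_95_tpr(scores_in, scores_out):
    """False positive rate when the threshold is chosen so that ~95% of the
    OOD (positive) samples are flagged, i.e. the 5th percentile of OOD scores."""
    s = sorted(scores_out)
    threshold = s[int(0.05 * len(s))]
    return sum(1 for si in scores_in if si >= threshold) / len(scores_in)

# Well-separated uncertainty scores give perfect metrics.
in_scores = [0.05, 0.10, 0.15, 0.20]
out_scores = [0.70, 0.80, 0.90, 0.95]
print(auroc(in_scores, out_scores))          # 1.0
print(fpr_at_95_tpr(in_scores, out_scores))  # 0.0
```

In practice one would use a library implementation (e.g. scikit-learn) on the entropy scores of the mixed $D_{in}$/$D_{out}$ evaluation set.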
In addition to these commonly used datasets, we release an extension for SVIRO [DiasDaCruz2020SVIRO] called SVIRO-Uncertainty. For each of the three seat positions on the vehicle's rear bench, the model should classify which object occupies it, with empty being one possible choice. We created two training datasets for the Sharan vehicle: one using adult passengers only (4384 sceneries and 8 classes) and one using adults, child seats and infant seats (3515 samples and 64 classes, not used for training in this work). We created fine-grained test sets to assess the reliability at several difficulty levels: 1) only unseen adults, 2) only unseen child and infant seats, 3) unseen adults and unseen child and infant seats, 4) unknown random everyday objects (e.g. dog, plants, bags, washing machine, instruments, TV, skateboard, paintings, …), 5) unseen adults and unknown everyday objects and 6) unseen adults, unseen child and infant seats and unknown everyday objects. The dataset can be downloaded (link). Besides uncertainty estimation within the same vehicle interior, one can use images from unseen vehicle interiors from SVIRO to further test the model's reliability on the same task but in novel environments, i.e. other vehicle interiors. Example images are provided in Fig. 2.
MC-AAE (Ours)  MC Dropout  Ensemble of 10 models  
AUROC  AUPR  FPR  AUROC  AUPR  FPR  AUROC  AUPR  FPR  
MNIST  
MNIST  90.1 ±0.6  99.6 ±0.1  28.0 ±2.4  
CIFAR10  91.5 ±1.1  92.4 ±0.9  34.0 ±4.7  
Fashion  90.0 ±1.6  91.1 ±1.3  37.0 ±3.2  
Omniglot  95.5 ±1.0  96.0 ±0.8  22.0 ±6.0  
SVHN  94.9 ±0.9  95.4 ±0.7  22.4 ±5.1  
Fashion  
Fashion  82.5 ±0.4  96.4 ±0.1  64.4 ±4.3  
CIFAR10  93.9 ±1.8  95.8 ±1.1  34.3 ±3.1  
MNIST  90.2 ±0.5  90.6 ±0.5  35.7 ±2.4  
Omniglot  97.9 ±0.4  98.1 ±0.3  9.3 ±2.3  
SVHN  95.6 ±1.3  94.8 ±0.5  23.0 ±2.6  
SVHN  
SVHN  84.0 ±0.6  93.1 ±0.4  68.7 ±2.0  
CIFAR10  77.6 ±0.7  80.5 ±0.6  83.3 ±1.4  
GTSRB  75.4 ±2.2  80.7 ±5.4  81.2 ±0.7  
LSUN  79.2 ±0.7  81.9 ±0.7  80.1 ±1.9  
Places365  79.2 ±0.5  81.5 ±0.4  79.5 ±1.9  
GTSRB  
GTSRB  89.3 ±2.4  98.8 ±0.3  50.9 ±6.0  
CIFAR10  91.4 ±0.6  90.3 ±0.8  42.0 ±3.3  
LSUN  93.0 ±0.7  92.2 ±0.7  36.5 ±4.4  
Places365  92.3 ±0.7  91.3 ±0.7  38.8 ±3.4  
SVHN  91.3 ±0.7  90.7 ±0.8  44.5 ±3.7  
SVIROU  
CIFAR10  95.4 ±0.6  93.3 ±1.0  26.9 ±3.4  
GTSRB  95.8 ±1.0  94.9 ±1.1  25.1 ±6.9  
LSUN  94.8 ±0.5  92.7 ±0.7  31.5 ±2.7  
Places365  95.4 ±0.5  93.3 ±0.7  27.3 ±2.8  
SVHN  92.4 ±1.6  88.6 ±2.3  40.1 ±7.6  
Adults (A)  95.2 ±1.7  99.9 ±0.1  8.9 ±3.1  
Seats (S)  54.0 ±7.5  8.9 ±4.2  88.8 ±10.8  
Objects (O)  68.9 ±3.1  84.1 ±5.5  85.3 ±3.3  
A,S  58.8 ±2.6  48.6 ±6.4  93.2 ±1.1  
A,O  78.8 ±1.5  76.1 ±3.4  93.5 ±0.9  
A,S,O  62.2 ±2.0  56.4 ±4.7  88.7 ±3.1  
Tesla (OOD)  88.6 ±2.0  97.4 ±0.5  44.4 ±44.4 
Comparison (in percent) of our method against MC dropout and an ensemble of models. We repeated the experiments for 10 runs and report the mean values together with their standard deviation. If $D_{in} = D_{out}$, we report the result on the test set of $D_{in}$ only. Arrows indicate whether larger or smaller is better. Best results are highlighted in grey. The last block is a comparison on the fine-grained splits of the newly released SVIRO-Uncertainty. All but adults should be classified as empty.
We compare our method against MC dropout and an ensemble of models using the same architecture as the autoencoder's encoder part, but with an additional classification head. We trained our MC-AAE models for 25000 epochs, but fewer epochs might produce good results as well; we did not perform an ablation study with respect to the number of epochs needed. Further, we did not check whether the training samples are truly fixed points and attractors because of the computational overhead: this could be done by computing the largest eigenvalue of the Jacobian matrix at each training sample and checking whether its absolute value is smaller than 1. The autoencoder model was trained as a denoiser
[xie2012image] (blur, random noise, brightness and contrast augmentation were used) to facilitate and robustify the recursive autoencoder application. Consequently, to have a fair benchmark, the MC dropout and ensemble models used the same augmented images during training; they were trained for 1000 epochs. All methods used Adam, the same learning rate (see the implementation) and a batch size of 64. For training on MNIST and FashionMNIST we used a latent space of size 10, while for all others we used a latent space of size 64. We used SSIM [bergmann2018improving] to compute the reconstruction loss. We used 250 samples per class for training and treated all datasets as grayscale images. All images were centre-cropped and resized to 64 pixels. We used a dropout rate of 0.33 for all methods. Model and training details can be found in the implementation. For MC-AAE and MC dropout we performed multiple inferences per sample, and we used an ensemble of 10 models to assess uncertainty and OOD estimation. We repeated each training for 10 runs for MC-AAE and MC dropout, and for 100 runs to obtain the ensembles of models. The number of recursions used for MC-AAE depends on the dataset and is subject to a hyperparameter search. In our case, the models converged fast for test samples and slowly for OOD samples, see Fig. 3. Hence, more iterations did not provide an improvement.
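The attractor check mentioned above (largest Jacobian eigenvalue at a training sample) could be approximated with a finite-difference power iteration; the function name and the toy linear map below are illustrative, and this only estimates the spectral radius under the usual power-iteration assumptions:

```python
import random

def jacobian_spectral_radius(f, x, iters=50, h=1e-5, seed=0):
    """Estimate the largest |eigenvalue| of the Jacobian of f at x by power
    iteration, using finite differences for the Jacobian-vector products.
    A value below 1 is the contraction condition making x an attractor."""
    rng = random.Random(seed)
    n = len(x)
    v = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(iters):
        fx = f(x)
        fxv = f([xi + h * vi for xi, vi in zip(x, v)])
        jv = [(a - b) / h for a, b in zip(fxv, fx)]  # Jv ≈ (f(x + h v) - f(x)) / h
        norm = sum(c * c for c in jv) ** 0.5
        if norm == 0:
            return 0.0
        v = [c / norm for c in jv]
    # One last Jacobian-vector product to read off the magnitude.
    fx = f(x)
    fxv = f([xi + h * vi for xi, vi in zip(x, v)])
    jv = [(a - b) / h for a, b in zip(fxv, fx)]
    return sum(c * c for c in jv) ** 0.5

# Toy map contracting toward the origin with factor 0.5: radius ≈ 0.5 < 1.
f = lambda x: [0.5 * xi for xi in x]
print(jacobian_spectral_radius(f, [0.0, 0.0]))  # ≈ 0.5, i.e. an attractor
```

For a full autoencoder the Jacobian-vector products would be computed with automatic differentiation rather than finite differences, which is exactly the overhead the paper avoids.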
We report the summary of our results for uncertainty and OOD detection in Table II. An interesting observation is that our approach performs significantly better when the visual complexity increases (GTSRB, SVIRO), while the performance of MC dropout and the ensemble of models decreases on those setups. Conversely, on visually much simpler datasets (MNIST, FashionMNIST, SVHN), MC dropout and the ensemble of models perform best. Our method thus seems to be more beneficial at higher visual complexity, but this behavior should be investigated in detail in future work. Another interesting observation is that our approach provides better OOD estimates for the unseen Tesla vehicle from SVIRO. The different SVIRO-Uncertainty splits are much more challenging and cause a large performance drop for all methods.
We computed the histograms of the entropies for $D_{in}$ and each $D_{out}$ and report the results in Fig. 4 when trained on GTSRB. The results show that the entropy distributions of $D_{in}$ and the several $D_{out}$ are best separated by our approach, and the distributions of the different $D_{out}$ are more similar to each other than for the other models. To quantify this, we computed the sum of the Wasserstein distances between $D_{in}$ and all $D_{out}$ (TD, larger is better, as we want them to be different) and, separately, the sum of the distances between CIFAR10 and all other $D_{out}$ (OD, smaller is better, as we want them to be similar). We then computed the mean and standard deviation across 10 runs. The results are reported in Table III and show that our method best separates the uncertainty between $D_{in}$ and $D_{out}$. Further, all $D_{out}$ are most similar to each other.
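The TD and OD scores can be sketched as follows, using the closed form of the 1-D Wasserstein-1 distance for equal-size empirical samples; the entropy values and function names are illustrative:

```python
def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples:
    the mean absolute difference of the sorted values."""
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def separation_scores(entropy_in, entropy_ref_out, entropy_other_outs):
    """TD: sum of distances between the in-distribution entropies and every
    OOD set (larger = better separated). OD: sum of distances between a
    reference OOD set (CIFAR10 in the paper) and the remaining OOD sets
    (smaller = the OOD sets behave more consistently)."""
    outs = [entropy_ref_out] + entropy_other_outs
    td = sum(wasserstein_1d(entropy_in, e) for e in outs)
    od = sum(wasserstein_1d(entropy_ref_out, e) for e in entropy_other_outs)
    return td, od

# Toy entropies: in-distribution low, OOD sets high and mutually similar.
e_in = [0.05, 0.10, 0.05, 0.10]
e_cifar = [0.90, 0.95, 0.85, 0.92]
e_lsun = [0.88, 0.93, 0.86, 0.91]
td, od = separation_scores(e_in, e_cifar, [e_lsun])
print(td, od)  # large TD, small OD: well separated and consistent OOD behavior
```

On real data these lists would hold the per-sample normalized entropies of each dataset, and a library routine (e.g. SciPy's `wasserstein_distance`) would replace the hand-rolled helper.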
MC-AAE (Ours)  MC Dropout  Ensemble  
OD  
TD 
We want to highlight that the performance of our method is improved due to the recursive application of the previously trained autoencoder. To this end we provide additional results where we compare the performance if no recursion is applied. We repeat the evaluation from the previous section and report the performance in Table IV. By comparing the results against Table II, it becomes apparent that the recursive application significantly improves uncertainty and OOD estimation.
AUROC  AUPR  FPR  
MNIST MNIST  
MNIST CIFAR10  
MNIST Fashion  
MNIST Omniglot  
MNIST SVHN  
Fashion Fashion  
Fashion CIFAR10  
Fashion MNIST  88.2 ±3.3  89.3 ±3.0  
Fashion Omniglot  
Fashion SVHN  
SVHN SVHN  
SVHN CIFAR10  
SVHN GTSRB  
SVHN LSUN  
SVHN Places365  
GTSRB GTSRB  85.7 ±1.3  95.9 ±0.6  67.3 ±2.9 
GTSRB CIFAR10  
GTSRB LSUN  
GTSRB Places365  
GTSRB SVHN  
SVIROU CIFAR10  
SVIROU GTSRB  
SVIROU LSUN  
SVIROU Places365  
SVIROU SVHN  
SVIROU Adults (A)  
SVIROU Seats (S)  74.6 ±37.6  
SVIROU Objects (O)  
SVIROU A,S  
SVIROU A,O  
SVIROU A,S,O  
SVIROU Tesla (OOD) 
In Fig. 3 we report the reconstructions after 1, 2, 3 and 4 iterative steps. We repeat this for models trained on different $D_{in}$ and show that reconstructions of OOD samples also converge to training samples, but much more slowly. We hence believe that the trajectory of the latent space representation over several steps can be an additional indicator of whether an input sample is in- or out-of-distribution. It also becomes visible that the reconstruction converges robustly to similar classes for $D_{in}$ samples, but to different classes for $D_{out}$.
From a mathematical point of view, dynamical systems are defined by the natural phenomena or mechanical systems one wants to investigate and understand; designing or influencing the dynamical system of interest is usually not possible. Interestingly, this is not the case for the recursive application of an autoencoder interpreted as a dynamical system: since we train the autoencoder in a first step, the resulting dynamical behavior and its attractors can be influenced by the autoencoder training procedure. We believe that analyzing this interrelationship is an interesting direction for future work. Further, the effect of the number of epochs needed to obtain good results should be investigated. The basins of attraction can be studied after the autoencoder model is trained, so this information could potentially be used to further improve robustness, interpretability and uncertainty estimation. We believe that the trajectory of the latent space representation over several iterations can give hints about the model's robustness. Finally, while we fix the dropout mask for one recursion across all of its iterative steps (using a different one for each new recursion), it would also be possible to sample a new function for each iterative step within a recursion.
Our results on several datasets show that the recursive application of autoencoder models, viewed as dynamical systems, together with an MC dropout approach provides good uncertainty and out-of-distribution estimates. Our model design choices improve the performance, particularly for computer vision datasets of higher visual complexity. Our ablation study highlights that the success is mainly due to the recursion, and the entropy histograms underline the improved separability compared to MC dropout and an ensemble of models.
The first author is supported by the Luxembourg National Research Fund (FNR) under grant number 13043281. The second author is supported by DECODE (01IW21001).