UQGAN: A Unified Model for Uncertainty Quantification of Deep Classifiers trained via Conditional GANs

by Philipp Oberdiek, et al.

We present an approach to quantifying both aleatoric and epistemic uncertainty for deep neural networks in image classification, based on generative adversarial networks (GANs). While most works in the literature that use GANs to generate out-of-distribution (OoD) examples only focus on the evaluation of OoD detection, we present a GAN-based approach to learn a classifier that exhibits proper uncertainties for OoD examples as well as for false positives (FPs). Instead of shielding the entire in-distribution data with GAN-generated OoD examples, as is state-of-the-art, we shield each class separately with out-of-class examples generated by a conditional GAN and complement this with a one-vs-all image classifier. In our experiments, in particular on CIFAR10, we improve over the OoD detection and FP detection performance of state-of-the-art GAN-training based classifiers. Furthermore, we also find that the generated GAN examples do not significantly affect the calibration error of our classifier and result in a significant gain in model accuracy.



1 Introduction

Deep learning has shown outstanding performance in image classification tasks KrizhevskySH12; SimonyanZ14a; HeZRS16. However, due to their enormous capacity and their inability to reject examples, deep neural networks (DNNs) are not capable of expressing their uncertainties appropriately. DNNs have demonstrated a tendency to overfit training data SrivastavaHKSS14 and to be easily fooled into wrong class predictions with high confidence GoodfellowSS14; GuoPSW17. Importantly, ReLU networks have also been proven to show high confidence far away from the training data HeinAB19. This is contrary to the behavior that humans would naturally expect, namely to be uncertain when confronted with new data examples that have not been observed during training.

Several approaches to predictive uncertainty quantification have been introduced in recent years, considering uncertainty in a Bayesian sense BlundellCKW15; GalG15a; KendallG17 as well as from a frequentist point of view HendrycksG17; DeVriesT18; PadhyNRLSL20. A common evaluation protocol is to discriminate between true and false positives (FPs) by means of a given uncertainty quantification. For an introduction to uncertainty in machine learning, we refer to HullermeierW21; for a survey on uncertainty quantification methods for DNNs, see GawlikowskiTALH21.

By design of ordinary DNNs for image classification, their uncertainty is often studied on in-distribution examples AshukhaLMV20. The task of out-of-distribution (OoD) detection (or novelty detection) is oftentimes considered separately from uncertainty quantification LiangLS18; SnoekOFLNSDRN19; MundtPMR19. Thus, OoD detection in deep learning has spawned its own line of research and method development. Among others, changes in architecture DeVriesT18, loss function SensoyKK18; AmersfoortSTG20 and the incorporation of data serving as an OoD proxy HendrycksMD19 have been considered in the literature. Generative adversarial networks (GANs) have been used to replace that proxy by artificially generated data examples. In LeeLLS18, examples of OoD data are created such that they shield the in-distribution regime from the OoD regime. Note that this often requires pre-training of the generator; e.g., in the aforementioned work, the authors constructed OoD data in their 2D example to pre-train the generator. While showing promising results, such GAN-based approaches mostly predict a single score for OoD detection and do not yield a principled approach to uncertainty quantification that distinguishes in-distribution uncertainty between classes from out-of-distribution uncertainty.

In this work, we propose to use GANs to shield each class separately from the out-of-class (OoC) regime, instead of shielding all classes at once (also cf. Figure 1). Instead of maximizing an uncertainty measure such as softmax entropy, we combine this with a one-vs-all classifier in the final DNN layer, which is learned jointly with a class-conditional generator for out-of-class data in an adversarial framework. The resulting classifiers are used to model (class-conditional) likelihoods, and via Bayes' rule we define posterior class probabilities in a principled way. Our work thus makes the following novel contributions:

  1. We introduce a GAN-based model yielding a classifier with complete uncertainty quantification.

  2. Our model allows us to distinguish uncertainty between classes (in the large sample limit, if properly learned, approaching aleatoric uncertainty) from OoD uncertainty (approaching epistemic uncertainty).

  3. By a conditional GAN trained with a Wasserstein-based loss function, we achieve class shielding in low dimensions without any pre-training of the generator.

  4. In higher dimensions, we use a class-conditional autoencoder and train the GAN on its latent space. This is coherent with the conditional GAN, allows us to use less complex generators and critics, and reduces the influence of adversarial directions.

  5. We improve over the OoD detection and FP detection performance of state-of-the-art GAN-training based classifiers.

We present in-depth numerical experiments with our method on MNIST and CIFAR10, accompanied by various OoD datasets. We outperform other approaches, including GAN-based ones, in terms of OoD detection and FP detection performance on CIFAR10. On MNIST, too, we achieve superior OoD detection performance. Notably, on the more challenging CIFAR10 dataset, we achieve significantly stronger model accuracy compared to other approaches based on the same network architecture.

2 Related Work

The works by XiaCWHS15; HendrycksG17; DeVriesT18 can be considered early baseline methods for the task of OoD detection. They are frequentist approaches relying on confidence scores gathered from model outputs. A common problem is the usage of the softmax activation function, which leads to overconfident predictions, in particular far away from the training data, or the decoupling of the confidence score from the original classification model at test time. Our proposed method does not make use of the softmax activation function and produces unified uncertainty estimates at test time without requiring auxiliary confidence scores.

Many methods use perturbed training examples LeeLLS18Unified; RenLFSPDDL19; LiangLS18 or auxiliary outlier datasets HendrycksMD19; KongR21. LeeLLS18Unified use them for their confidence score based on class-conditional Gaussian distributions, while LiangLS18 and RenLFSPDDL19 utilize them to increase the separability between in- and out-of-distribution examples. HendrycksMD19 use a hand-picked auxiliary outlier dataset during model training, and KongR21 use one for selecting a suitable GAN discriminator during training, which then serves for OoD detection. The common problem with using perturbed examples is the sensitivity to hyperparameter selection, which might render the resulting examples uninformative. Additionally, auxiliary outlier datasets cannot always be considered readily available and pose the problem of covering only a small proportion of the real world. In contrast, our method is able to produce OoC examples that are very close to the in-distribution but still distinguishable from it; thus, we require neither explicit data perturbation nor any auxiliary outlier datasets.

Bayesian approaches GalG15a; BlundellCKW15; LakshminarayananPB17 provide a strong theoretical foundation for uncertainty quantification and OoD detection. LakshminarayananPB17 propose deep ensembles that approximate a distribution over models by averaging predictions of multiple independently trained models. GalG15a utilize Monte-Carlo (MC) sampling with dropout applied to each layer. In BlundellCKW15, a variational learning algorithm for approximating the intractable posterior distribution over network weights has been proposed. While the theoretical foundation is strong, these methods often require changing the architecture, restricting the model space and/or increased computational cost. While making use of Bayes' rule, we stay in a frequentist setting and do not depend on sampling or ensemble techniques. This reduces computational cost and enables our model to produce high-quality aleatoric and epistemic uncertainty estimates with a single forward pass. Also, our proposed framework does not change the network architecture, except for the output layer activation function, which makes it compatible with previously published techniques.

One-vs-all methods in the context of OoD detection have recently been studied by FranchiBADB20; PadhyNRLSL20; SaitoS21. In the work by FranchiBADB20, an ensemble of binary neural networks is trained to perform one-vs-all classification on the in-distribution data, whose outputs are then weighted by a standard softmax classifier. PadhyNRLSL20 use a DNN with a single sigmoid binary output for every class and explore the possibility of training the one-vs-all network with a distance-based loss function instead of the binary cross entropy. Domain adaptation is considered in SaitoS21, where a one-vs-all classifier is utilized for a first OoD detection step before classifying into known classes with a second model; their training objective is also accompanied by hard negative mining on the in-distribution data. All these methods use the maximum predicted probability as a score for OoD detection and/or classification and do not aggregate the other probabilities into a single score as our proposed method does. They also do not distinguish between different kinds of uncertainty as in our work. Lastly, their training objectives are based only on in-distribution data; generated OoD data as used in the present work is not considered.

More recently, generative-model-based methods SchleglSWSL17; LeeLLS18; SricharanS18; SunZL19; VernekarGDPASC19 have shown strong performance on the task of out-of-distribution detection by supplying classification models with synthesized out-of-distribution examples. SchleglSWSL17 utilize the latent space of a GAN by gradient-based reconstruction of an input example. In the work by LeeLLS18, a GAN architecture with an additional classification model is built; the classification model is trained to output a uniform distribution over classes on GAN examples close to the in-distribution. This approach is further improved by SricharanS18, who show improvements on the task of out-of-distribution detection. The generalization to distant regions in the sample space and the quality of generated boundary examples is, however, questionable VernekarGDPASC19. A similar approach using a normalizing flow and randomly sampled latent vectors is proposed by GrcicBS21. The high-level idea of the architectures proposed in the previously mentioned works is similar to ours. However, these works are not able to approximate the boundary of data distributions with multiple modes, as shown by VernekarGDPASC19. Because our GAN is class-conditional and trained on a low-dimensional latent space, we are able to follow multiple distribution modes resulting from different classes. We improve the in-distribution shielding by using a low-dimensional regularizer and have an additional advantage in terms of computational cost, as our cGAN model architecture can be chosen considerably smaller due to it being trained in the latent space. Furthermore, these methods do not yield separate uncertainty scores for FP and OoD examples.

Generating OoD data based on lower-dimensional latent representations has been explored in VernekarGADSC19a; SensoyKCS20. VernekarGADSC19a utilize a variational autoencoder (vAE) to produce examples that are inside the encoded manifold (type I) as well as outside of it (type II). SensoyKCS20 also use a vAE and train a GAN in the resulting latent space, assigning generated examples to the OoD domain in order to estimate a Dirichlet distribution on the class predictions. Utilizing a vAE has the advantage that one can make assumptions on the distribution of the latent space. The downside, however, is that vAEs are harder to train due to a vanishing Kullback-Leibler divergence (KL-divergence), which is commonly applied to keep the posterior close to the prior distribution FuLLGCC19. While the work of SensoyKCS20 is the most similar to our method, we improve on several shortcomings. First of all, we employ class-conditional models to improve diversity and class shielding. Additionally, we do not rely on a vAE, which makes our approach more stable. Finally, we are able to distinguish aleatoric and epistemic uncertainty, while the method by SensoyKCS20 is not: it assigns the same type of uncertainty to OoD and FP examples.

3 Method

In this section we introduce our one-vs-all classifier and the GAN architecture including its losses and regularizers.

3.1 One-vs-All Classification

We start by formulating our classification model as a one-vs-all classifier. Let $\hat{p}(\text{out} \mid y, x)$ model the probability that, for a given class $y$, an example with features $x$ is OoC. Analogously, $1 - \hat{p}(\text{out} \mid y, x)$ models for a given class $y$ the probability of $x$ being in-class. We can consider $1 - \hat{p}(\text{out} \mid y, x)$ as a likelihood proxy for $p(x \mid y)$ and predict class labels by applying Bayes' rule:

$$\hat{p}(y \mid x, \text{in}) = \frac{\left(1 - \hat{p}(\text{out} \mid y, x)\right) \hat{p}(y)}{\sum_{\bar{y}} \left(1 - \hat{p}(\text{out} \mid \bar{y}, x)\right) \hat{p}(\bar{y})} \qquad (1)$$

with $\hat{p}(y)$ being the estimated relative class frequency (see Appendix A for a theoretic argument on this choice of likelihood proxy). We denote the right-hand side of Equation 1 by $\hat{p}(y \mid x, \text{in})$. Using $1 - \hat{p}(\text{out} \mid y, x)$, we estimate the probability of an example being in-distribution by defining

$$\hat{p}(\text{in} \mid x) = \max_{y} \left(1 - \hat{p}(\text{out} \mid y, x)\right) \qquad (2)$$

which yields a quantification of epistemic uncertainty via $1 - \hat{p}(\text{in} \mid x)$. For aleatoric uncertainty estimation, we consider the Shannon entropy of the predicted class probabilities

$$H\left(\hat{p}(\cdot \mid x, \text{in})\right) = -\sum_{y} \hat{p}(y \mid x, \text{in}) \log \hat{p}(y \mid x, \text{in}) \qquad (3)$$
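The computations behind Equations 1-3 can be sketched as follows. This is a minimal NumPy sketch, assuming per-class OoC probabilities `p_out` from the sigmoid outputs and estimated class priors `prior`; all names are illustrative and not taken from the reference implementation:

```python
import numpy as np

def uncertainties(p_out, prior):
    """Turn per-class out-of-class probabilities into the class posterior
    plus epistemic and aleatoric uncertainty scores (sketch of Eqs. 1-3).

    p_out[y]  -- predicted probability that x is out-of-class for class y
    prior[y]  -- estimated relative class frequency
    """
    likelihood = 1.0 - p_out                 # likelihood proxy per class
    posterior = likelihood * prior
    posterior = posterior / posterior.sum()  # Bayes' rule, Eq. 1
    p_in = likelihood.max()                  # in-distribution score, Eq. 2
    epistemic = 1.0 - p_in
    # Shannon entropy of the posterior, Eq. 3
    aleatoric = -(posterior * np.log(posterior + 1e-12)).sum()
    return posterior, epistemic, aleatoric

posterior, epi, alea = uncertainties(np.array([0.1, 0.8, 0.9]),
                                     np.array([0.3, 0.3, 0.4]))
```

Here a low `p_out` for exactly one class yields a confident posterior and low epistemic uncertainty, while high `p_out` for all classes drives the epistemic score towards one.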
To model $\hat{p}(\text{out} \mid y, x)$, we use a DNN with one output neuron per class, equipped with sigmoid activations. For each class output $y$, the data corresponding to class $y$ serves as in-class data and all other data as OoC data. Hence, a basic variant of our training objective is given by a weighted empirical binary cross entropy

$$\mathcal{L}_{\text{cls}} = -\sum_{(x, y)} \Big[ \log\left(1 - \hat{p}(\text{out} \mid y, x)\right) + \lambda \sum_{\bar{y} \neq y} w_{\bar{y}} \log \hat{p}(\text{out} \mid \bar{y}, x) \Big] \qquad (4)$$

Therein, $\hat{p}(y)$ is estimated from the in-class training set, $w_{\bar{y}}$ weights the OoC loss to counter potential class imbalance and $\lambda$ weighs the OoC loss compared to the in-class loss.
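The weighted one-vs-all binary cross entropy can be sketched for a single training example as below. The inverse-class-frequency weighting shown here is one plausible instantiation of the balancing described above, not necessarily the exact scheme of the paper:

```python
import numpy as np

def one_vs_all_bce(p_out, y, class_freq, lam=1.0):
    """Weighted one-vs-all binary cross entropy for one example (sketch of Eq. 4).

    p_out      -- sigmoid outputs, p_out[c] = predicted P(out-of-class | c, x)
    y          -- ground-truth class index of the example
    class_freq -- estimated relative class frequencies (for balancing weights)
    lam        -- weight of the OoC terms relative to the in-class term
    """
    eps = 1e-12
    # x is in-class for its own class y
    in_class = -np.log(1.0 - p_out[y] + eps)
    mask = np.arange(len(p_out)) != y
    # weight the OoC terms inversely to class frequency to counter imbalance
    w = (1.0 / class_freq)[mask]
    w = w / w.sum()
    # x is out-of-class for every other class
    ooc = -(w * np.log(p_out[mask] + eps)).sum()
    return in_class + lam * ooc
```

A prediction matching the label (low `p_out[y]`, high `p_out` elsewhere) yields a small loss; the reverse pattern is penalized heavily.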

In addition, we generate for each class OoC examples with a conditional GAN. A joint training of classifier and GAN is introduced in the upcoming Section 3.2.

For comparison, both model architectures, with and without MC-Dropout, are discussed in the numerical experiments. Note that this demonstrates the compatibility of our model with existing methods.

3.2 GAN Architecture

Similar to LeeLLS18, we combine our classification model with a GAN and train them alternatingly. The Wasserstein GAN with gradient penalty proposed by GulrajaniAADC17 serves as the basis for our conditional GAN (cGAN). Additionally, we condition the generator as well as the critic on the class labels to generate class-specific OoC examples. Inspired by SensoyKCS20; VernekarGADSC19a, who utilized the latent space of a variational autoencoder (vAE), we also train our GAN model on the latent space of an autoencoder. Thus, we do not produce adversarial noise in our examples and can use less complex generators and critics. In contrast to those works, we use a conditional vanilla autoencoder (cAE) instead of a vAE, as we observed more stable training. Prior to the cGAN training, we train the cAE on in-distribution data and then freeze its weights during the cGAN and classifier training. The optimization objective of the cAE is given as a pixel-wise binary cross-entropy


$$\mathcal{L}_{\text{cAE}} = -\frac{1}{P} \sum_{i=1}^{P} \left[ x^{(i)} \log \hat{x}^{(i)} + \left(1 - x^{(i)}\right) \log\left(1 - \hat{x}^{(i)}\right) \right] \qquad (5)$$

with $\hat{x} = De(En(x, y), y)$ the decoded latent variable, $En(x, y)$ being the encoded example, $x^{(i)}$ the $i$-th pixel of example $x$ and $P$ the number of pixels belonging to $x$. Therein, the pixel values $x^{(i)}$ are assumed to be in the interval $[0, 1]$, while $\hat{x}^{(i)} \in (0, 1)$.

Figure 2: Overview of the proposed architecture. Before training the GAN objective together with the classifier, the cAE is pretrained on the in-distribution training dataset. After that, the weights of the encoder $En$ and decoder $De$ are frozen.

The cGAN is trained using the objective function


$$\mathcal{L}_{\text{cGAN}} = \mathbb{E}_{\xi, y}\left[ C(G(\xi, y), y) \right] - \mathbb{E}_{(x, y)}\left[ C(En(x, y), y) \right] + \lambda_{\text{gp}}\, \mathcal{L}_{\text{gp}} \qquad (6)$$

with $C$ the conditional critic, $\tilde{z} = G(\xi, y)$ the latent embedding produced by the conditional generator $G$, $\xi$ noise from a uniform distribution, $y$ a class label and $\mathcal{L}_{\text{gp}}$ the gradient penalty from GulrajaniAADC17. Integrating the classification objective into the cGAN objective, we alternate between the critic objective (Equation 7) and the combined generator and classifier objective (Equation 8), with $\alpha$ being an interpolation factor between real out-of-class examples and generated OoC examples and $\mathcal{L}_{\text{reg}}$ being an additional regularization loss for the generated latent codes with hyperparameter $\lambda_{\text{reg}}$, which we introduce in Section 3.3. The latent embeddings produced by the cGAN are decoded with the pretrained cAE, thus $\tilde{x} = De(G(\xi, y), y)$. That is, the cGAN is trained on the latent space while the classification model is trained on the original feature space. Our entire GAN architecture is visualized in Figure 2.
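The gradient penalty term of GulrajaniAADC17 can be illustrated on a hypothetical toy critic with an analytic gradient. The one-layer tanh critic below is purely illustrative (the actual critic is a conditional neural network trained with automatic differentiation); the sketch only shows the penalty itself, which pushes the critic's gradient norm towards 1 on interpolates between real and generated latent codes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "critic" on a 4-dimensional latent space; w plays the role of its weights.
w = rng.normal(size=4)
critic = lambda z: np.tanh(w @ z)
# Analytic gradient of the critic w.r.t. its input z
critic_grad = lambda z: (1.0 - np.tanh(w @ z) ** 2) * w

def gradient_penalty(z_real, z_fake, eps):
    """One-sample WGAN-GP penalty on an interpolate between a real
    and a generated latent code (eps ~ U[0, 1])."""
    z_hat = eps * z_real + (1.0 - eps) * z_fake
    grad_norm = np.linalg.norm(critic_grad(z_hat))
    return (grad_norm - 1.0) ** 2

gp = gradient_penalty(rng.normal(size=4), rng.normal(size=4), 0.3)
```

In the full objective this penalty is averaged over a batch of interpolates and added to the critic loss with weight $\lambda_{\text{gp}}$.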

3.3 Low-Dimensional Regularizer

In low-dimensional latent spaces we found it advantageous to apply an additional regularizer to the generated latent embeddings to improve class shielding. Let $z = En(x, y)$ be the latent embedding of an example $x$ with its corresponding class label $y$ and $\tilde{z}_1, \ldots, \tilde{z}_K$ all generated latent codes with the same class label $y$, normalized to origin $z$. We encourage the generator to produce latent codes that more uniformly shield the class by maximizing the average angular distance between all $\tilde{z}_i$, which corresponds to minimizing

$$\mathcal{L}_{\text{ang}}(x, y) = -\frac{1}{K(K-1)} \sum_{i \neq j} \log\left( \frac{1}{\pi} \arccos \frac{\langle \tilde{z}_i, \tilde{z}_j \rangle}{\lVert \tilde{z}_i \rVert \, \lVert \tilde{z}_j \rVert} \right) \qquad (9)$$

with $\langle \cdot, \cdot \rangle$ being the dot-product and $\lVert \cdot \rVert$ the Euclidean norm. The logarithm introduces an exponential scaling for very small angular distances, encouraging a more evenly spread distribution of the generated latent codes. This loss is then averaged over all class labels and training data examples

$$\mathcal{L}_{\text{reg}} = \frac{1}{C} \sum_{y=1}^{C} \frac{1}{N_y} \sum_{(x, y)} \mathcal{L}_{\text{ang}}(x, y) \qquad (10)$$

with $C$ the number of classes and $N_y$ the number of examples with class label $y$.
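The per-example angular loss can be sketched as follows (function and variable names are illustrative; `z_tilde` holds the generated latent codes for one example and `z` its latent embedding):

```python
import numpy as np

def angular_shielding_loss(z_tilde, z):
    """Sketch of Eq. 9: generated latent codes are normalized to origin z;
    small pairwise angles are penalized logarithmically, encouraging the
    codes to spread evenly around the class."""
    v = z_tilde - z                                  # normalize to origin z
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    cos = np.clip(v @ v.T, -1.0, 1.0)
    ang = np.arccos(cos) / np.pi                     # angular distance in [0, 1]
    iu = np.triu_indices(len(v), k=1)                # all unordered pairs i != j
    return -np.log(ang[iu] + 1e-12).mean()
```

Codes spread evenly around the origin yield a small loss, while nearly collinear codes are penalized heavily by the logarithm.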

During our experiments we studied different regularizer losses such as the Manhattan/Euclidean distance, the infinity norm and the standard cosine similarity. We found that, for our purpose, Equation 10 performed best. We argue that this can be attributed to its independence from the value range of the latent space, which can have a large impact on the $p$-norm distance metrics, and to the exponential scaling for very small angular distances.

4 Experiments

We compare our method with the following related works:

  • Maximum softmax probability baseline by HendrycksG17

  • Entropy of the predictive class distribution

  • Bayes-by-Backprop BlundellCKW15

  • MC-Dropout by GalG15a

  • Deep-Ensembles proposed by LakshminarayananPB17

  • Confident Classifier proposed by LeeLLS18

  • GEN by SensoyKCS20

While the first two methods are simple baselines, the next three are Bayesian and the final two are GAN-based (cf. Section 2). Following the publications SensoyKCS20; VernekarGADSC19a; RenLFSPDDL19; HendrycksG17, we consider two setups using the MNIST and CIFAR10 datasets as in-distribution, respectively. Similar to SensoyKCS20; VernekarGADSC19a and others, we split the datasets class-wise into two non-overlapping sets, i.e., MNIST 0-4 / 5-9 and CIFAR10 0-4 / 5-9. While the first half serves as in-distribution data, the second half constitutes OoD cases close to the training data and therefore difficult to detect. For the MNIST 0-4 dataset, we consider the MNIST 5-9, EMNIST-Letters, Fashion-MNIST, Omniglot, SVHN and CIFAR10 datasets as OoD examples. For the CIFAR10 0-4 dataset, we use CIFAR10 5-9, LSUN, SVHN, Fashion-MNIST and MNIST as OoD examples. These selections yield compositions of training and OoD examples with strongly varying difficulty for state-of-the-art OoD detection. Besides that, we examine our method's behavior on a 2D toy example with two overlapping Gaussians (having trivial covariance structure), see Figure 1. Additionally, we split the official training sets into 80% / 20% training / validation sets, where the latter are used for hyperparameter tuning and model selection.

Like related works, we utilize the LeNet-5 architecture on MNIST and a ResNet-18 on CIFAR10 as classification models. To ensure fair conditions, we re-implemented all aforementioned methods, following the authors' recommendations for hyperparameters and their reference implementations. For methods involving more complex architectures, e.g. a GAN or a vAE as in LeeLLS18; SensoyKCS20, we used the proposed architectures for those components, while for the sake of comparability sticking to our choice of classifier models. All implementations are based on PyTorch PaszkeGMLB19 and will be published after the review phase. For each method, we selected the network checkpoint with maximal validation accuracy during training. For a more detailed overview of the used hyperparameters, we refer to Appendix C.

For evaluation we use the following well established metrics:

  • Classification accuracy on the in-distribution datasets.

  • Area under the Receiver Operating Characteristic Curve (AUROC). We apply the AUROC to the binary classification tasks in-/out-of-distribution via Equation 2 and TP/FP (Success/Failure) via Equation 3.

  • Expected Calibration Error (ECE) NaeiniCH15 applied to the estimated class probabilities of in-distribution examples, computed on a fixed number of bins.

  • Area under the Precision Recall Curve (AUPR) w.r.t. the binary in-/out-of-distribution decision in Equation 2. As the AUPR is sensitive to the choice of the positive class of that binary classification problem, we further distinguish between AUPR-In and AUPR-Out. For AUPR-In the in-distribution class is the positive one, while for AUPR-Out the out-of-distribution class is the positive one.

  • FPR @ 95% TPR computes the False Positive Rate (FPR) at the decision threshold on the OoD score from Equation 2 ensuring a True Positive Rate (TPR) of 95%.
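The detection metrics above can be computed with scikit-learn, e.g. as in this sketch, where `score_in` and `score_out` denote the in-distribution scores from Equation 2 evaluated on in-distribution and OoD test examples (names are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, roc_curve

def ood_metrics(score_in, score_out):
    """AUROC, AUPR-In, AUPR-Out and FPR @ 95% TPR for the binary
    in-/out-of-distribution decision."""
    y = np.concatenate([np.ones_like(score_in), np.zeros_like(score_out)])
    s = np.concatenate([score_in, score_out])
    auroc = roc_auc_score(y, s)
    aupr_in = average_precision_score(y, s)        # in-distribution positive
    aupr_out = average_precision_score(1 - y, -s)  # out-of-distribution positive
    fpr, tpr, _ = roc_curve(y, s)
    # first operating point whose TPR reaches 95%
    fpr_at_95 = fpr[np.searchsorted(tpr, 0.95)]
    return auroc, aupr_in, aupr_out, fpr_at_95
```

For a perfectly separating score this yields AUROC and AUPR of 1 and an FPR @ 95% TPR of 0.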

Figure 3: Generated OoD examples by our approach. a) MNIST examples for digit classes 0-4 (top to bottom). b) CIFAR10 examples for classes airplane, automobile, bird, cat, deer (top to bottom).
Method In-Distribution Out-of-Distribution
Ours with MC-Dropout
One-vs-All Baseline
Max. Softmax HendrycksG17
Bayes-by-Backprop BlundellCKW15
MC-Dropout GalG15a
Deep-Ensembles LakshminarayananPB17
Confident Classifier LeeLLS18
GEN SensoyKCS20
Entropy Oracle
One-vs-All Oracle
Table 1: Results for MNIST (0-4) as in-distribution vs {MNIST (5-9), EMNIST-Letters, Omniglot, Fashion-MNIST, SVHN, CIFAR10} as out-of-distribution datasets

Before discussing our results on MNIST and CIFAR10, we briefly discuss our findings on the 2D example. As can be seen in Figure 1, the generated OoC examples nicely shield the respective in-distribution classes. OoC examples of one class can be in-distribution examples of other classes; this is an intended feature, and to this end, the loss term for the synthesized OoC examples in Equation 8 is class-conditional. This feature is supposed to make our likelihood proxy signal low density in the OoC regime. One can also observe that the estimated epistemic and aleatoric uncertainties are complementary, with high aleatoric but low epistemic uncertainty in the overlapping region of the Gaussians. This is one of the main advantages that sets our approach apart from related methods. Results for a slightly more challenging 2D toy example on the two moons dataset are presented in Figure 4. We now demonstrate that this result generalizes to higher-dimensional problems.

For MNIST and CIFAR10, Figure 3 shows the OoC examples produced at the end of the generator training. Due to using a conditional GAN and a cAE, we are able to generate OoC examples (instead of only out-of-distribution examples as in related works) at test time. It can be seen that the resulting examples share many semantic similarities with the original class while still being distinguishable from it.

All presented results are computed on the respective (official) test sets of the datasets. We also conducted an extensive parameter study on the validation sets, which is summarized in Appendix D. The conclusion of this parameter study is that the performance of our framework is in general stable w.r.t. the choice of the hyperparameters. Increasing positively impacts the model performance up to a certain maximum. The best performance is obtained by choosing latent dimensions such that the cAE is able to compute reconstructions of good visual quality. Across all datasets, choosing achieves the best detection scores, also indicating a positive influence of the generated OoC examples on the model’s classification accuracy.

Method In-Distribution Out-of-Distribution
Ours with MC-Dropout
One-vs-All Baseline
Max. Softmax HendrycksG17
Bayes-by-Backprop BlundellCKW15
MC-Dropout GalG15a
Deep-Ensembles LakshminarayananPB17
Confident Classifier LeeLLS18
GEN SensoyKCS20
Entropy Oracle
One-vs-All Oracle
Table 2: Results for CIFAR10 (0-4) as in-distribution vs {CIFAR10 (5-9), LSUN, SVHN, Fashion-MNIST, MNIST} as out-of-distribution datasets
Method Accuracy TP Accuracy FP Accuracy OoD
Ours + MC-Dropout
One-vs-All Baseline
Max. Softmax
Confident Classifier
Entropy Oracle
One-vs-All Oracle
Table 3:

Results obtained from aggregating predicted uncertainty estimates in a gradient boosting model which was then trained on a validation set.

Tables 1 and 2 present the results of our numerical experiments. The scores in the columns of the section In-Distribution are computed solely on the respective in-distribution dataset, i.e., MNIST 0-4 and CIFAR10 0-4, respectively. The scores in the columns of the section Out-of-Distribution display the OoD detection performance when presenting examples from the respective in-distribution dataset as well as from the entirety of all assigned OoD datasets. Note that we did not apply any balancing of the OoD datasets but included the respective test sets as is (see Appendix C for the sizes of the test sets). As an upper bound on the OoD detection performance, we also supply results for two oracle models, supplied with the real OoD training datasets they are evaluated on. One of them is trained with the standard softmax and binary cross-entropy and the other one with our proposed loss function, cf. Equation 8.

We first discuss the in-distribution performance of our method. W.r.t. MNIST, the results given in the left section of Table 1 show that we are on par with state-of-the-art GAN-based approaches while still having a similar ECE, only being surpassed by MC-Dropout and the other baselines by a fairly small margin. However, considering the respective CIFAR10 results in the left section of Table 2, we clearly outperform state-of-the-art GAN-based methods as well as all other baseline methods by a large margin. Noteworthily, we achieve an accuracy of which is to percent points (pp.) above the other classifiers and an AUROC S/F of which is to pp. higher than for the other methods. This corresponds to a relative improvement of in accuracy and in AUROC S/F compared to the second best method. This shows that our model can indeed utilize the GAN-generated OoC examples to better localize aleatoric uncertainty, thereby better separating success and failure while at the same time improving the classifier's generalization. Furthermore, we observe that our calibration errors yield mid-tier results compared to the other methods. This signals that, although the generated OoC examples we incorporate mix into the distribution of training data presented to the classifier, empirically there is no evidence that this harms the learned classifier, neither w.r.t. calibration nor w.r.t. separation.

Considering the OoD results from the right-hand sections of Tables 1 and 2, the superiority of our method over the other ones is now consistent across both in-distribution datasets. On the MNIST dataset we outperform previously published works, especially considering the AUPR-In and FPR @ 95% TPR metrics, with a and relative improvement over the second best method, respectively. This is consistent with the results for CIFAR10 as in-distribution dataset, where we achieve for both AUROC and AUPR-In a relative improvement of over the second best method.

Comparing our results with those of the oracles, two observations become apparent. Firstly, in some OoD experiments the GAN-generated OoC examples achieve results fairly close to those of the oracles, while in others there is still room for improvement (in particular w.r.t. FPR @ 95% TPR). Secondly, GAN-generated OoC examples can help improve generalization (in terms of classification accuracy), while real OoD data might be too far away from the in-distribution data.

As a final experiment, we perform CIFAR10-based OoD and FP detection in the wild, i.e., we perform both tasks jointly while presenting in-distribution and OoD data to the classifier in the same mix as in Table 2. To this end, we applied gradient boosting to the uncertainty scores provided by the respective methods to predict TP, FP and OoD. For the Bayesian approaches we considered the sample mean's entropy and variance. The corresponding results are given in Table 3 in terms of class-wise accuracy. The main observations are that our method outperforms the other GAN-based methods and that our method including dropout achieves the overall best performance. It can be observed that the Entropy Oracle performs very strongly while using only a single uncertainty score. At second glance, this is not surprising, since an FP mostly involves the confusion of two classes, while training the DNN to output maximal entropy on OoD examples is likely to result in the confusion of up to five classes, therefore yielding different entropy levels. We present a more detailed version of this experiment in Appendix E, where we also take the estimated class probabilities into account.
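The aggregation step can be sketched as follows, with synthetic uncertainty scores standing in for the methods' predictions on the validation set; the class means and feature layout are illustrative only:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Each row holds the two uncertainty scores of one prediction
# (epistemic, aleatoric); labels are 0 = TP, 1 = FP, 2 = OoD.
def sample(n, epi, alea):
    return np.column_stack([rng.normal(epi, 0.05, n),
                            rng.normal(alea, 0.05, n)])

X = np.vstack([sample(200, 0.1, 0.1),   # TP:  low epistemic, low aleatoric
               sample(200, 0.1, 0.8),   # FP:  low epistemic, high aleatoric
               sample(200, 0.9, 0.5)])  # OoD: high epistemic
y = np.repeat([0, 1, 2], 200)

# Gradient boosting aggregates the scores into a TP / FP / OoD decision
clf = GradientBoostingClassifier(random_state=0).fit(X, y)
acc = clf.score(X, y)
```

If the epistemic and aleatoric scores are complementary, as in our method, the three classes become nearly separable in this two-dimensional score space.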

An OoD-dataset-wise breakdown of the results in Tables 1 and 2 is provided in Appendix F. For MNIST, this breakdown reveals that our method performs particularly well in the difficult task of separating MNIST 0-4 and MNIST 5-9. On the other MNIST-related tasks we achieve mid-tier results, slightly behind the other GAN-based methods. However, on CIFAR10 we consistently outperform the other methods by large margins.

5 Conclusion

In this work, we introduced a GAN-based model yielding a one-vs-all classifier with complete uncertainty quantification. Our model distinguishes uncertainty between classes (in the large sample limit approaching aleatoric uncertainty) from OoD uncertainty (approaching epistemic uncertainty). We have demonstrated in numerical experiments that our model sets a new state-of-the-art w.r.t. OoD as well as FP detection. The generated OoC examples do not harm the training success in terms of calibration, but even improve it in terms of accuracy. We have also seen that incorporating MC dropout to account for model uncertainty can further improve the results.


M.R. acknowledges useful discussions with Hanno Gottschalk.


Appendix A Theoretical Consideration of the One-vs-All Classifier

In the in-distribution regime, i.e., for , it might be more convenient to view our one-vs-all classifier as an ensemble of binary classifiers


Therein, denotes “not ”, meaning any other class than , and . Each classifier corresponding to class learns to indicate whether it is more likely that a given (where denotes the marginal distribution w.r.t. the input feature space) belongs to class or not to class . Under the assumption that

has the universal approximation property and that an empirical risk minimizer can be found (the former is true while the latter is NP-hard; however, these are common assumptions in statistical learning theory bookUML, vanHandel), in the large sample limit would converge to


if the training data for the classifiers was sampled i.i.d. according to and we trained with the binary cross entropy loss (Equation 4 without the weight factors and the class priors). However, since we balance the classes in Equation 4, our classifier receives data sampled from where is constant for all . Hence, the classifier, if properly learned, converges to


This justifies the selection of our one-vs-all classifier as an approximation to the likelihood. However, this approximating property immediately disappears once the generated examples from the GAN are also used as training data for the classifier. In Section 4, we demonstrate empirically that the GAN training does not reduce the overall classification accuracy. It neither worsens the capability of the classifier to separate TPs and FPs via aleatoric uncertainty, nor does it affect the calibration error, cf. Tables 1 and 2. In our numerical experiments on CIFAR10 in the main part of this work, we even find a significant improvement in terms of classification accuracy and separation of TPs and FPs compared to the one-vs-all baseline as well as the one-vs-all oracle. This reveals the positive effects GAN examples can have not only on OoD detection but also on FP detection and classification accuracy.
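The one-vs-all view above can be made concrete with a small numpy sketch: each of the K binary classifiers emits a sigmoid score for "class k vs. not class k", and the scores are combined into class probabilities by normalization. The normalization and the OoD score below are illustrative assumptions for this sketch, not the paper's exact Equation 1.

```python
# Minimal sketch of a one-vs-all ensemble of binary classifiers.
# The combination rules here are assumptions for illustration only.
import numpy as np

def ova_posteriors(scores, eps=1e-8):
    """Normalize per-class one-vs-all sigmoid scores into class
    probabilities (an assumed stand-in for the paper's Equation 1)."""
    scores = np.asarray(scores, dtype=float)
    return (scores + eps) / (scores + eps).sum()

def ood_score(scores):
    """If every binary classifier rejects (all scores near 0),
    the input is treated as out-of-distribution."""
    return 1.0 - float(np.max(scores))

# In-distribution example: one classifier fires strongly.
print(ova_posteriors([0.95, 0.03, 0.02]))  # dominated by class 0
print(ood_score([0.95, 0.03, 0.02]))       # low -> in-distribution
# OoD example: all classifiers reject.
print(ood_score([0.01, 0.02, 0.01]))       # high -> OoD
```

This also makes the role of class balancing visible: with balanced classes each binary classifier sees equally many "class k" and "not class k" examples, so its score is not skewed by the class priors, matching the convergence statement above.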

Appendix B Two Moons - Toy Example

Figure 4: Two toy examples of the two moons dataset with different variance. From left to right: 1. OoD heatmap with orange indicating a high probability of being OoD and white for in-distribution; 2. Aleatoric uncertainty (entropy over Equation 1) with orange indicating high and white low uncertainty; 3. Gaussian kernel density estimate of the GAN examples. Triangles indicate GAN OoC examples and crosses correspond to the in-distribution data. The data underlying the bottom row has higher variance than the data underlying the top row.

As a more challenging 2D example, we also present results on the two moons dataset. Figure 4 shows the results of an experiment with two separable classes. In the top row of the figure, the training data is class-wise separable. In that case, we observe that the decision boundary also belongs to the OoD regime (top left panel), which is correct since no in-distribution data is present there. Our model is able to learn this since the classes are shielded tightly enough that the generated OoC examples are in part also located in the vicinity of the decision boundary. To better visualize the distribution of generated data, we depict estimated densities of the generated OoC data in the top right panel. For the aleatoric uncertainty in the top center panel, we observe that, due to numerical issues, aleatoric uncertainty increases further away from the in-distribution data. However, this can be accounted for by first considering epistemic uncertainty and then the aleatoric one. By this procedure, most of the examples close to the decision boundary would be correctly classified as OoD, which is appropriate since, at this moderate sample size, the data reflects only a minor amount of aleatoric uncertainty in this example.

In the bottom row, an experiment analogous to the top row but with a noisier version of the data is presented. The bottom left panel shows that the epistemic uncertainty on the decision boundary between the two classes clearly decreases in comparison to the top left panel. At the same time, the bottom center panel shows the gain in aleatoric uncertainty compared to the top center panel.

Note that for data points far away from the in-distribution regime, all in take values close to zero. In practice this requires the inclusion of a small in the denominator to circumvent numerical problems in the logarithmic loss terms. In our experiments this also results in a high aleatoric uncertainty far away from the in-distribution data, as all estimated probabilities uniformly take the lower bound's value . However, a joint consideration of aleatoric and epistemic uncertainty disentangles this, since a high estimated probability of being OoD means that the estimate of aleatoric uncertainty can be neglected. This also becomes evident in the center panels, where one can observe high aleatoric uncertainty outside the in-distribution regime, which can however be masked out by the OoD probability .
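The masking argument above can be sketched in a few lines: the entropy of the class probabilities is only trusted where the estimated OoD probability is low. The hard threshold used for gating is an assumption for illustration; the paper only requires that high OoD probability lets the aleatoric estimate be neglected.

```python
# Sketch of the joint consideration of aleatoric and epistemic
# uncertainty: aleatoric uncertainty is masked out wherever the
# OoD probability is high. The gating rule is an assumption.
import numpy as np

def entropy(p, eps=1e-8):
    """Entropy of a (possibly unnormalized, eps-clipped) distribution."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    p = p / p.sum()
    return float(-(p * np.log(p)).sum())

def masked_aleatoric(class_probs, p_ood, threshold=0.5):
    """Report aleatoric uncertainty only in the in-distribution regime."""
    return 0.0 if p_ood > threshold else entropy(class_probs)

# Far from the data, all probabilities sit at the eps lower bound, so the
# raw entropy is spuriously high; the OoD probability masks it out.
print(masked_aleatoric([1e-8, 1e-8], p_ood=0.99))  # 0.0 (masked)
# Near the decision boundary, the entropy is genuine.
print(masked_aleatoric([0.5, 0.5], p_ood=0.05))    # log(2) ~ 0.693
```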

Appendix C Hyperparameter Settings for Experiments

For the Bayes-by-Backprop implementation we use spike-and-slab priors in combination with diagonal Gaussian posterior distributions as described in BlundellCKW15. MC-Dropout uses a 50% dropout probability on all weight layers. Both methods average their predictions over 50 forward passes. The deep ensembles were built by averaging 5 networks. Implementations of Confident Classifier and GEN use the architectures and hyperparameters recommended by the authors, and we followed their reference code where possible. Parameter studies showed that our method is mostly stable w.r.t. the hyperparameter selection. We used as proposed in GulrajaniAADC17, for MNIST and for CIFAR10, , for MNIST and for CIFAR10. The latent dimension for MNIST was set to while using dimensions for CIFAR10. We used batch stochastic gradient descent with the ADAM optimizer adam and a batch size of . The learning rate was initialized to for the classification model and for the GAN, while linearly decaying both to over the course of all training iterations. Training on the MNIST dataset required generator iterations, while taking iterations for the CIFAR10 dataset (one iteration is considered to be one batch). As recommended in GulrajaniAADC17, we use batch normalization only in the generator, while the critic as well as the classifier do not use any type of layer normalization. We also adopt the alternating training scheme from the aforementioned work. For each generator iteration, the critic as well as the classifier perform optimization steps on the same batch. We apply some mild data augmentation by using random horizontal flipping where appropriate. The test set sizes used for computing the numerical results can be found in Table 4.
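The linear learning-rate decay mentioned above can be sketched as follows. The initial rate and the iteration count are placeholders, not the paper's settings, and we assume the decay target is zero.

```python
# Sketch of a linear learning-rate decay schedule: the rate starts at
# its initial value and decays linearly to 0 over all training
# iterations. lr0 and total_steps below are placeholder values.
def linear_lr(initial_lr, step, total_steps):
    """Linearly decay initial_lr to 0 over total_steps iterations."""
    return initial_lr * max(0.0, 1.0 - step / total_steps)

lr0 = 1e-4  # placeholder initial learning rate
print(linear_lr(lr0, 0, 10000))      # 1e-4 at the start
print(linear_lr(lr0, 5000, 10000))   # 5e-5 half-way through
print(linear_lr(lr0, 10000, 10000))  # 0.0 at the end
```

In the paper's setup, the classifier and the GAN use separate initial rates but the same linear schedule over all training iterations.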

Dataset Test-Set Size
CIFAR10 0-4
CIFAR10 5-9
Table 4: Test set sizes used for computing the numerical results.

Appendix D Parameter Study

In order to examine the influence of hyperparameter selection on our framework, we conducted an extensive experimental study. We display all evaluation metrics from Section 4 while varying , , and the chosen latent dimension for the cAE and cGAN. For in Equation 7, which controls the influence of the classifier predictions on the generated OoC examples, Figure 5 clearly shows that for MNIST and for CIFAR10 are locally optimal values w.r.t. maximum performance. While a larger tends to increase the in-distribution accuracy slightly, it greatly decreases all other evaluation metrics. A very interesting observation in Figure 6 regarding the influence of from Equation 8 is that both extremes () greatly degrade the results. This clearly shows that the generated OoC examples have a positive effect on the OoD detection performance and the in-distribution separability. In terms of the dimensionality of the latent space, Figure 7 shows that and dimensions are the optimal values for MNIST and CIFAR10, respectively. This is coherent with the visual quality of the examples decoded by the cAE, which does not improve much with higher dimensions. Analysing the influence of in Figure 8, one can observe a positive effect on the MNIST results when increasing the parameter up to , where we reach a local maximum. It is also very apparent that for the results are comparatively bad, emphasizing the role of the low-dimensional regularizer in our model. For CIFAR10 the effect is not as clear as for MNIST, but Figure 8 also shows a locally optimal setting of . We believe that the higher latent dimension required for the CIFAR10 dataset, and thus the curse of dimensionality, is the main factor behind this finding.

Figure 5: Parameter study over in Equation 7. For MNIST hyperparameters were fixed at , , and for CIFAR10 at , , . All seeds were also the same for all experiments. All metrics were computed on the validation sets of MNIST 0-4 / CIFAR10 0-4 as in-distribution datasets and the entirety of all assigned OoD datasets as defined in Section 4.
Figure 6: Parameter study over in Equation 8. For MNIST, hyperparameters were fixed at , , and for CIFAR10 at , , . All seeds were also the same for all experiments. All metrics were computed on the validation sets of MNIST 0-4 / CIFAR10 0-4 as in-distribution datasets and the entirety of all assigned OoD datasets as defined in Section 4.
Figure 7: Parameter study over latent dimensions of in Equation 7. For MNIST, hyperparameters were fixed at , , and for CIFAR10 at , , . All seeds were also the same for all experiments. All metrics were computed on the validation sets of MNIST 0-4 / CIFAR10 0-4 as in-distribution datasets and the entirety of all assigned OoD datasets as defined in Section 4.
Figure 8: Parameter study over in Equation 7. For MNIST, hyperparameters were fixed at , , and for CIFAR10 at , , . All seeds were also the same for all experiments. All metrics were computed on the validation sets of MNIST 0-4 / CIFAR10 0-4 as in-distribution datasets and the entirety of all assigned OoD datasets as defined in Section 4.

Appendix E Joint Detection of OoD and FP

Method Uncertainty Scores and Predicted Probabilities Uncertainty Scores Only
Accuracy TP Accuracy FP Accuracy OoD Accuracy TP Accuracy FP Accuracy OoD
Ours with MC-Dropout
One-vs-All Baseline
Max. Softmax HendrycksG17
Bayes-by-Backprop BlundellCKW15
MC-Dropout GalG15a
Deep-Ensembles LakshminarayananPB17
Confident Classifier LeeLLS18
GEN SensoyKCS20
Entropy Oracle
One-vs-All Oracle
Table 5: Results obtained from aggregating predicted uncertainty estimates in a gradient boosting model which was then trained on a validation set.

Table 5 presents an extended version of Table 3, where the latter constitutes the right-hand half of the former. While the right-hand half presents results for gradient boosting applied to the uncertainty scores of each method, aiming to predict TP, FP and OoD, the left-hand half shows analogous results while additionally using the estimated class probabilities as inputs for gradient boosting. We do this to account for other possible transformations of that are not explicitly constructed. In more detail, we use the following uncertainty scores:

  • Ours: OoD uncertainty and entropy of the estimated class probabilities

  • Ours with MC dropout: , and the standard deviations of summed over all under MC dropout for forward passes.

  • One-vs-All Baseline: Same as “Ours”.

  • Max softmax: Maximum softmax probability.

  • Entropy: Entropy over estimated class probabilities.

  • Bayes-by-Backprop: For samples from the posterior we compute (aleatoric uncertainty) and (epistemic uncertainty) as in KendallG17.

  • MC-Dropout: Entropy of estimated class probabilities averaged over forward passes, standard deviation of the class probabilities summed over all .

  • Deep-Ensembles: Entropy of estimated class probabilities averaged over ensemble members, standard deviation of the class probabilities summed over all .

  • Confident Classifier: Entropy over estimated class probabilities.

  • GEN: Entropy over estimated class probabilities resulting from the estimated evidence of the Dirichlet distribution.
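The aleatoric/epistemic split used for the Bayesian baselines above can be sketched with the entropy-based decomposition in the spirit of Kendall & Gal (KendallG17): the total predictive entropy splits into an aleatoric part (the mean per-sample entropy) and an epistemic part (the mutual information between prediction and posterior). The example posterior samples below are synthetic.

```python
# Sketch of the entropy-based uncertainty decomposition: total
# predictive entropy = aleatoric part + epistemic part (mutual
# information). The posterior samples are synthetic placeholders.
import numpy as np

def entropy(p, eps=1e-12, axis=-1):
    """Shannon entropy along the given axis, with eps-clipping."""
    p = np.clip(p, eps, 1.0)
    return -(p * np.log(p)).sum(axis=axis)

def decompose(probs):
    """probs: (num_posterior_samples, num_classes) class probabilities.
    Returns (total, aleatoric, epistemic) uncertainty."""
    total = float(entropy(probs.mean(axis=0)))  # entropy of the mean
    aleatoric = float(entropy(probs).mean())    # mean of the entropies
    epistemic = total - aleatoric               # mutual information >= 0
    return total, aleatoric, epistemic

# Posterior samples that disagree -> large epistemic uncertainty.
probs = np.array([[0.9, 0.1], [0.1, 0.9]])
total, alea, epis = decompose(probs)
print(round(total, 3), round(alea, 3), round(epis, 3))
```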

In both the left-hand and the right-hand part of the table, our method including dropout is fairly close to the best oracle, which is the entropy oracle. The good performance of the entropy oracle can be explained by the fact that FPs typically only involve the confusion of two classes, while training on all OoD data for high entropy enforces confusion among five classes, thus resulting in a higher entropy level. Apart from the oracles, in both studies, including and excluding the estimated class probabilities , our method including MC-dropout outperforms all other methods. However, reviewing the results in an absolute sense, there still remains plenty of room for improvement.

Appendix F Detailed Results on Individual OoD Datasets

MNIST 0-4 vs. MNIST 5-9 MNIST 0-4 vs. EMNIST-Letters
Ours + Dropout
One-vs-All Baseline
Max. Softmax
Confident Classifier
Entropy Oracle
One-vs-All Oracle
MNIST 0-4 vs. Omniglot MNIST 0-4 vs. Fashion-MNIST
Ours + Dropout
One-vs-All Baseline
Max. Softmax
Confident Classifier
Entropy Oracle
One-vs-All Oracle
MNIST 0-4 vs. SVHN MNIST 0-4 vs. CIFAR10
Ours + Dropout
One-vs-All Baseline
Max. Softmax
Confident Classifier