1 Introduction
Deep learning has shown outstanding performance in image classification tasks KrizhevskySH12; SimonyanZ14a; HeZRS16. However, due to their enormous capacity and their inability to reject examples, deep neural networks (DNNs) are not capable of expressing their uncertainty appropriately. DNNs have demonstrated a tendency to overfit training data SrivastavaHKSS14 and to be easily fooled into wrong class predictions with high confidence GoodfellowSS14; GuoPSW17. Importantly, ReLU networks have been proven to show high confidence far away from the training data HeinAB19. This is contrary to the behavior one would naturally expect, namely to be uncertain when confronted with new examples that have not been observed during training.

Several approaches to predictive uncertainty quantification have been introduced in recent years, considering uncertainty in a Bayesian sense BlundellCKW15; GalG15a; KendallG17 as well as from a frequentist point of view HendrycksG17; DeVriesT18; PadhyNRLSL20. A common evaluation protocol is to discriminate between true positives and false positives (FPs) by means of a given uncertainty quantification. For an introduction to uncertainty in machine learning, we refer to HullermeierW21; for a survey on uncertainty quantification methods for DNNs, see GawlikowskiTALH21.

By design of ordinary DNNs for image classification, their uncertainty is often studied on in-distribution examples AshukhaLMV20. The task of out-of-distribution (OoD) detection (or novelty detection) is oftentimes considered separately from uncertainty quantification LiangLS18; SnoekOFLNSDRN19; MundtPMR19. Thus, OoD detection in deep learning has spawned its own line of research and method development. Among others, changes in architecture DeVriesT18; SensoyKK18; AmersfoortSTG20 and the incorporation of data serving as an OoD proxy HendrycksMD19 have been considered in the literature. Generative adversarial networks (GANs) have been used to replace that proxy by artificially generated data examples. In LeeLLS18, examples of OoD data are created such that they shield the in-distribution regime from the OoD regime. Note that this often requires pre-training of the generator; e.g., in the aforementioned work, the authors constructed OoD data in their 2D example to pre-train the generator. While showing promising results, such GAN-based approaches mostly predict a single score for OoD detection and do not yield a principled approach to uncertainty quantification that distinguishes in-distribution uncertainty between classes from out-of-distribution uncertainty.

In this work, we propose to use GANs to shield each class separately from the out-of-class (OoC) regime instead of shielding all classes at once (also cf. Figure 1). Instead of maximizing an uncertainty measure such as the softmax entropy, we combine this with a one-vs-all classifier in the final DNN layer, which is learned jointly with a class-conditional generator for out-of-class data in an adversarial framework. The resulting classifiers are used to model (class-conditional) likelihoods. Via Bayes' rule, we define posterior class probabilities in a principled way. Our work thus makes the following novel contributions:

We introduce a GAN-based model yielding a classifier with complete uncertainty quantification.

Our model allows distinguishing uncertainty between classes (in the large-sample limit, if properly learned, approaching aleatoric uncertainty) from OoD uncertainty (approaching epistemic uncertainty).

With a conditional GAN trained with a Wasserstein-based loss function, we achieve class shielding in low dimensions without any pre-training of the generator.

In higher dimensions, we use a class-conditional autoencoder and train the GAN on its latent space. This is coherent with the conditional GAN, allows us to use less complex generators and critics, and reduces the influence of adversarial directions.

We improve over the OoD detection and FP detection performance of state-of-the-art GAN-training-based classifiers.
We present in-depth numerical experiments with our method on MNIST and CIFAR10, accompanied by various OoD datasets. We outperform other approaches, also GAN-based ones, in terms of OoD detection and FP detection performance on CIFAR10. Also on MNIST, we achieve superior OoD detection performance. Notably, on the more challenging CIFAR10 dataset, we achieve significantly stronger model accuracy compared to other approaches based on the same network architecture.
2 Related Work
The works by XiaCWHS15; HendrycksG17; DeVriesT18 can be considered early baseline methods for the task of OoD detection. They are frequentist approaches relying on confidence scores gathered from model outputs. A frequent problem is the use of the softmax activation function, which leads to overconfident predictions, in particular far away from the training data, or the decoupling of the confidence score from the original classification model during test time. Our proposed method does not make use of the softmax activation function and produces unified uncertainty estimates during test time without requiring auxiliary confidence scores.
Many methods use perturbed training examples LeeLLS18Unified; RenLFSPDDL19; LiangLS18 or auxiliary outlier datasets HendrycksMD19; KongR21. LeeLLS18Unified use them for their confidence score based on class-conditional Gaussian distributions, while LiangLS18 and RenLFSPDDL19 utilize them to increase the separability between in- and out-of-distribution examples. HendrycksMD19 use a hand-picked auxiliary outlier dataset during model training, and KongR21 use it for selecting a suitable GAN discriminator during training, which then serves for OoD detection. The common problem with using perturbed examples is the sensitivity to hyperparameter selection, which might render the resulting examples uninformative. Additionally, auxiliary outlier datasets cannot always be considered readily available and pose the problem of covering only a small proportion of the real world. In contrast, our method is able to produce OoC examples that are very close to the in-distribution but still distinguishable from it; thus, we require neither explicit data perturbation nor any auxiliary outlier datasets.
Bayesian approaches GalG15a; BlundellCKW15; LakshminarayananPB17 provide a strong theoretical foundation for uncertainty quantification and OoD detection. LakshminarayananPB17 propose deep ensembles that approximate a distribution over models by averaging the predictions of multiple independently trained models. GalG15a utilize Monte-Carlo (MC) sampling with dropout applied to each layer. In BlundellCKW15, a variational learning algorithm for approximating the intractable posterior distribution over network weights has been proposed. While the theoretical foundation is strong, these methods often require changing the architecture, restricting the model space, and/or increased computational cost. While making use of Bayes' rule, we stay in a frequentist setting and do not depend on sampling or ensemble techniques. This reduces computational cost and enables our model to produce high-quality aleatoric and epistemic uncertainty estimates with a single forward pass. Also, our proposed framework does not change the network architecture, except for the output layer activation function, and is thus compatible with previously published techniques.
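To illustrate the MC sampling idea mentioned above, the following is a minimal numpy sketch of Monte-Carlo dropout prediction averaging; the `logit_fn` stand-in network and the inverted-dropout scaling are illustrative assumptions, not the cited authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_predict(logit_fn, x, n_samples=50, p_drop=0.5):
    """Average softmax predictions over stochastic forward passes;
    the spread of the samples indicates model uncertainty.
    `logit_fn(x)` is a hypothetical network returning class logits."""
    probs = []
    for _ in range(n_samples):
        mask = rng.random(x.shape) > p_drop       # Bernoulli dropout mask
        z = logit_fn(x * mask / (1.0 - p_drop))   # inverted dropout scaling
        e = np.exp(z - z.max())                   # stable softmax
        probs.append(e / e.sum())
    probs = np.stack(probs)
    return probs.mean(axis=0), probs.var(axis=0)  # predictive mean / variance
```

A single deterministic forward pass corresponds to `n_samples=1` with `p_drop=0`; increasing `n_samples` trades compute for a smoother predictive distribution, which is exactly the cost our single-pass method avoids.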
One-vs-all methods in the context of OoD detection have recently been studied by FranchiBADB20; PadhyNRLSL20; SaitoS21. In the work by FranchiBADB20, an ensemble of binary neural networks is trained to perform one-vs-all classification on the in-distribution data, whose outputs are then weighted by a standard softmax classifier. PadhyNRLSL20 use a DNN with a single sigmoid binary output for every class and explore the possibility of training the one-vs-all network with a distance-based loss function instead of the binary cross entropy. Domain adaptation is considered in SaitoS21, where a one-vs-all classifier is utilized for a first OoD detection step before classifying into known classes with a second model. Their training objective is also accompanied by hard negative mining on the in-distribution data. All these methods use the maximum predicted probability as a score for OoD detection and/or classification and do not aggregate the other probabilities into a single score like the method proposed by us. They also do not distinguish between different kinds of uncertainties as in our work. Lastly, their training objectives are only based on in-distribution data; generated OoD data as used in the present work is not considered.
More recently, generative-model-based methods SchleglSWSL17; LeeLLS18; SricharanS18; SunZL19; VernekarGDPASC19 have shown strong performance on the task of out-of-distribution detection by supplying classification models with synthesized out-of-distribution examples. SchleglSWSL17 utilize the latent space of a GAN by gradient-based reconstruction of an input example. In the work by LeeLLS18, a GAN architecture with an additional classification model is built. The classification model is trained to output a uniform distribution over classes on GAN examples close to the in-distribution. This approach is further improved by SricharanS18, who show improvements on the task of out-of-distribution detection. The generalization to distant regions in the sample space and the quality of generated boundary examples is, however, questionable VernekarGDPASC19. A similar approach using a normalizing flow and randomly sampled latent vectors is proposed by GrcicBS21. The high-level idea of the architectures proposed in the previously mentioned works is similar to ours. However, other works are not able to approximate the boundary of data distributions with multiple modes, as shown by VernekarGDPASC19. Since our GAN is class-conditional and trained on a low-dimensional latent space, we are able to follow multiple distribution modes resulting from different classes. We improve the in-distribution shielding by using a low-dimensional regularizer and have an additional advantage in terms of computational cost, as our cGAN model architecture can be chosen considerably smaller due to it being trained in the latent space. Furthermore, these methods do not yield separate uncertainty scores for FP and OoD examples.

Generating OoD data based on lower-dimensional latent representations has been explored in VernekarGADSC19a; SensoyKCS20. VernekarGADSC19a utilize a variational autoencoder (vAE) to produce examples that are inside the encoded manifold (type I) as well as outside of it (type II). SensoyKCS20 also use a vAE and train a GAN in the resulting latent space, assigning generated examples to the OoD domain in order to estimate a Dirichlet distribution on the class predictions. Utilizing a vAE has the advantage that one can make assumptions on the distribution of the latent space. The downside, however, is that vAEs are harder to train due to a vanishing Kullback-Leibler divergence (KL divergence), which is commonly applied to keep the posterior close to the prior distribution FuLLGCC19. While the work of SensoyKCS20 is the most similar to our method, we improve on several shortcomings. First of all, we employ class-conditional models to improve diversity and class shielding. Additionally, we do not rely on a vAE, which makes our approach more stable. Finally, we are able to distinguish aleatoric and epistemic uncertainty while the method by SensoyKCS20 is not; it assigns the same type of uncertainty to OoD and FP examples.

3 Method
In this section, we introduce our one-vs-all classifier and the GAN architecture including its losses and regularizers.
3.1 One-vs-All Classification
We start by formulating our classification model as a one-vs-all classifier. Let $q_k(x)$ model the probability that, for a given class $k$, an example with features $x$ is OoC. Analogously, $1 - q_k(x)$ models, for a given class $k$, the probability of being in-class. We can consider $1 - q_k(x)$ as a likelihood proxy for $p(x \mid k)$ and predict class labels by applying Bayes' rule:

$\hat{p}(k \mid x) = \frac{(1 - q_k(x))\,\hat{p}(k)}{\sum_{j=1}^{K} (1 - q_j(x))\,\hat{p}(j)}$  (1)

with $\hat{p}(k)$ being the estimated relative class frequency (see Appendix A for a theoretical argument on this choice of likelihood proxy). We denote the posterior estimate on the right-hand side of Equation 1 by $\hat{p}(k \mid x)$. Using it, we estimate the probability $\hat{p}(\mathrm{in} \mid x)$ of an example being in-distribution by defining

(2)

which yields a quantification of epistemic uncertainty via $1 - \hat{p}(\mathrm{in} \mid x)$. For aleatoric uncertainty estimation, we consider the Shannon entropy of the predicted class probabilities

$H(x) = -\sum_{k=1}^{K} \hat{p}(k \mid x) \log \hat{p}(k \mid x)$.  (3)

To model the $q_k$, we use a DNN with $K$ output neurons equipped with sigmoid activations. For each class output $k$, the data corresponding to class $k$ serves as in-class data and all other data as OoC data. Hence, a basic variant of our training objective is given by a weighted empirical binary cross entropy

(4)

Therein, the class prior $\hat{p}(k)$ is estimated from the in-class training set, a class-dependent weight on the OoC loss counters potential class imbalance, and an additional factor weighs the OoC loss against the in-class loss.
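A minimal sketch of such a weighted one-vs-all binary cross entropy follows; since the exact form of Equation 4 is not reproduced here, the per-class balance weight and the global `ooc_weight` factor are assumptions for illustration:

```python
import numpy as np

def one_vs_all_bce(q, labels, ooc_weight=1.0):
    """Weighted empirical binary cross entropy for a one-vs-all head (sketch).

    q:          array (N, K), q[n, k] = predicted OoC probability of
                example n for class k (sigmoid outputs).
    labels:     array (N,) of class indices in {0, ..., K-1}.
    ooc_weight: factor weighing the OoC loss against the in-class loss
                (assumed form of the weighting).
    """
    n, k = q.shape
    eps = 1e-12
    loss = 0.0
    for c in range(k):
        in_mask = labels == c
        out_mask = ~in_mask
        # in-class examples: push 1 - q[., c] toward 1
        in_loss = -np.mean(np.log(np.clip(1.0 - q[in_mask, c], eps, 1.0)))
        # out-of-class examples: push q[., c] toward 1, balanced by
        # the in-class / out-of-class frequency ratio (assumption)
        w = in_mask.sum() / max(out_mask.sum(), 1)
        out_loss = -w * np.mean(np.log(np.clip(q[out_mask, c], eps, 1.0)))
        loss += in_loss + ooc_weight * out_loss
    return loss / k
```

A perfect one-vs-all head (zero OoC probability on the own class, one elsewhere) drives this loss to zero, while confusing the classes inflates both terms.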
In addition, we generate OoC examples for each class with a conditional GAN. A joint training of classifier and GAN is introduced in the upcoming Section 3.2.
For comparison, both model architectures, with and without MC-Dropout, are discussed in the numerical experiments. Note that this demonstrates the compatibility of our model with existing methods.
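The inference side of this section (posterior via Bayes' rule, an in-distribution score, and the entropy of the class probabilities) can be sketched as follows; aggregating the in-distribution score as the maximum in-class probability is an assumption for illustration and may differ from Equation 2:

```python
import numpy as np

def posterior_and_uncertainties(q, prior):
    """One-vs-all posterior and uncertainty scores (sketch).

    q:     array (K,), q[k] = estimated probability that the example
           is out-of-class for class k (sigmoid outputs).
    prior: array (K,), estimated relative class frequencies.
    """
    in_class = 1.0 - q                       # likelihood proxies (Eq. 1)
    unnorm = in_class * prior                # Bayes' rule numerator
    p_class = unnorm / unnorm.sum()          # posterior over classes
    p_in = in_class.max()                    # in-distribution score (assumed max form)
    epistemic = 1.0 - p_in                   # OoD / epistemic uncertainty
    aleatoric = -(p_class * np.log(p_class + 1e-12)).sum()  # Shannon entropy (Eq. 3)
    return p_class, epistemic, aleatoric
```

All three quantities come from the same forward pass, which is the single-pass property contrasted with sampling-based approaches in Section 2.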
3.2 GAN Architecture
Similar to LeeLLS18, we combine our classification model with a GAN and train them alternatingly. The Wasserstein GAN with gradient penalty proposed by GulrajaniAADC17 serves as the basis for our conditional GAN (cGAN). Additionally, we condition the generator as well as the critic on the class labels to generate class-specific OoC examples. Inspired by SensoyKCS20; VernekarGADSC19a, who utilized the latent space of a variational autoencoder (vAE), we also train our GAN model on the latent space of an autoencoder. Thus, we do not produce adversarial noise in our examples and can use less complex generators and critics. In contrast to those works, we use a conditional vanilla autoencoder (cAE) instead of the vAE, as we observed more stable training. Prior to the cGAN training, we train the cAE on in-distribution data and then freeze its weights during the cGAN and classifier training. The optimization objective of the cAE is given as the pixel-wise binary cross entropy
(5) 
with $\hat{x}$ the decoded latent variable, $z$ being the encoded example, $x_i$ the $i$-th pixel of example $x$, and the number of pixels belonging to $x$. Therein, the pixel values are assumed to be in the interval $[0, 1]$, while $\hat{x}_i \in (0, 1)$.
The cGAN is trained using the objective function
(6) 
with the conditional critic, the latent embedding produced by the conditional generator, noise from a uniform distribution, a class label, and the gradient penalty from GulrajaniAADC17. Integrating the classification objective into the cGAN objective, we alternate between
(7) 
and
(8) 
with an interpolation factor between real out-of-class examples and generated OoC examples and an additional regularization loss for the generated latent codes, weighted by a hyperparameter, which we introduce in Section 3.3. The latent embeddings produced by the cGAN are decoded with the pre-trained cAE before being passed to the classifier. That is, the cGAN is trained on the latent space while the classification model is trained on the original feature space. Our entire GAN architecture is visualized in Figure 2.

3.3 Low-Dimensional Regularizer
In low-dimensional latent spaces, we found it advantageous to apply an additional regularizer to the generated latent embeddings to improve class shielding. Let $z$ be the latent embedding of an example with corresponding class label $y$, and consider all generated latent codes with the same class label, normalized to origin $z$. We encourage the generator to produce latent codes that more uniformly shield the class by maximizing the average angular distance between all of them, which corresponds to minimizing
(9) 
with $\langle \cdot, \cdot \rangle$ being the dot product. The logarithm introduces an exponential scaling for very small angular distances, encouraging a more evenly spread distribution of the generated latent codes. This loss is then averaged over all class labels and training data examples
(10) 
with $n_y$ the number of examples with class label $y$.
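The angular shielding idea behind Equations 9 and 10 can be sketched as below; the pairwise form and the normalization are assumptions, as only the log-scaled angular distance is specified above:

```python
import numpy as np

def angular_shield_loss(codes, center):
    """Sketch of the low-dimensional shielding regularizer: encourage
    generated latent codes of one class to spread evenly around the
    class center by penalizing small pairwise angles.  The exact
    normalization is an assumption; Equation 9 may differ."""
    v = codes - center
    v = v / np.linalg.norm(v, axis=1, keepdims=True)    # unit directions
    n = len(v)
    loss, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            cos = np.clip(v[i] @ v[j], -1.0, 1.0)
            angle = np.arccos(cos)                      # angular distance
            loss += -np.log(angle + 1e-12)              # log-scaled penalty
            pairs += 1
    return loss / pairs
```

Codes spread to opposite sides of the class center incur a much smaller penalty than codes clustered in one direction, which is the spreading behavior the regularizer rewards.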
During our experiments, we studied different regularizer losses such as the Manhattan/Euclidean distance, the infinity norm, and the standard cosine similarity. We found that for our purpose, Equation 10 performed best. We argue that this can be attributed to the independence of the latent space value range, which can have a large impact on the norm-based distance metrics, and to the exponential scaling for very small angular distances.

4 Experiments
We compare our method with the following related works:

Maximum softmax probability baseline by HendrycksG17

Entropy of the predictive class distribution

Bayes-by-Backprop BlundellCKW15

MC-Dropout by GalG15a

Deep Ensembles proposed by LakshminarayananPB17

Confident Classifier proposed by LeeLLS18

GEN by SensoyKCS20
While the first two methods are simple baselines, the next three are Bayesian, and the final two are GAN-based (cf. Section 2). Following the publications SensoyKCS20; VernekarGADSC19a; RenLFSPDDL19; HendrycksG17, we consider two setups using the MNIST and CIFAR10 datasets as in-distribution, respectively. Similar to SensoyKCS20; VernekarGADSC19a and others, we split the datasets class-wise into two non-overlapping sets, i.e., MNIST 0-4 / 5-9 and CIFAR10 0-4 / 5-9. While the first half serves as in-distribution data, the second half constitutes OoD cases close to the training data and therefore difficult to detect. For the MNIST 0-4 dataset, we consider the MNIST 5-9, EMNIST-Letters, FashionMNIST, Omniglot, SVHN and CIFAR10 datasets as OoD examples. For the CIFAR10 0-4 dataset, we use CIFAR10 5-9, LSUN, SVHN, FashionMNIST and MNIST as OoD examples. These selections yield compositions of training and OoD examples with strongly varying difficulty for state-of-the-art OoD detection. Besides that, we examine our method's behavior on a 2D toy example with two overlapping Gaussians (having trivial covariance structure), see Figure 1. Additionally, we split the official training sets into 80% / 20% training / validation sets, where the latter are used for hyperparameter tuning and model selection.

Like related works, we utilize the LeNet-5 architecture on MNIST and a ResNet-18 on CIFAR10 as classification models. To ensure fair conditions, we reimplemented all aforementioned methods while following the authors' recommendations for hyperparameters and their reference implementations. For methods involving more complex architectures, e.g., a GAN or a vAE as in LeeLLS18; SensoyKCS20, we used the proposed architectures for those components, while for the sake of comparability sticking to our choice of classifier models. All implementations are based on PyTorch PaszkeGMLB19 and will be published after the review phase. For each method, we selected the network checkpoint with maximal validation accuracy during training. For a more detailed overview of the used hyperparameters, we refer to Appendix C.

For evaluation, we use the following well-established metrics:

Classification accuracy on the indistribution datasets.

Area under the Receiver Operating Characteristic Curve (AUROC). We apply the AUROC to the binary classification tasks in-/out-of-distribution via Equation 2 and TP/FP (success/failure) via Equation 3.

Expected Calibration Error (ECE) NaeiniCH15 applied to the estimated class probabilities for in-distribution examples, computed on bins.

Area under the Precision Recall Curve (AUPR) w.r.t. the binary in-/out-of-distribution decision in Equation 2. As the AUPR is sensitive to the choice of the positive class of that binary classification problem, we further distinguish between AUPR-In and AUPR-Out. For AUPR-In, the in-distribution class is the positive one, while for AUPR-Out the out-of-distribution class is the positive one.

FPR @ 95% TPR computes the False Positive Rate (FPR) at the decision threshold on the OoD score from Equation 2 that ensures a True Positive Rate (TPR) of 95%.
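For concreteness, FPR @ 95% TPR can be computed from raw scores as follows (a minimal numpy sketch; the threshold interpolation details may differ from our evaluation code):

```python
import numpy as np

def fpr_at_tpr(scores_in, scores_out, tpr_target=0.95):
    """FPR at the threshold on the in-distribution score (Equation 2)
    that keeps a true-positive rate of `tpr_target` on in-distribution
    examples; in-distribution is treated as the positive class."""
    # threshold so that tpr_target of in-distribution scores lie above it
    thresh = np.quantile(scores_in, 1.0 - tpr_target)
    # fraction of OoD examples wrongly accepted as in-distribution
    return float(np.mean(scores_out >= thresh))
```

Lower values are better: a perfect detector separates the two score distributions so that no OoD example exceeds the 95%-TPR threshold.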
Method  InDistribution  OutofDistribution  
Accuracy  AUROC S/F  ECE  AUROC  AUPR-In  AUPR-Out  FPR @ 95% TPR  

Ours  
Ours with MCDropout  
OnevsAll Baseline  
Max. Softmax HendrycksG17  
Entropy  
BayesbyBackprop BlundellCKW15  
MCDropout GalG15a  
DeepEnsembles LakshminarayananPB17  
Confident Classifier LeeLLS18  
GEN SensoyKCS20  
Entropy Oracle  
OnevsAll Oracle 
Before discussing our results on MNIST and CIFAR10, we briefly discuss our findings on the 2D example. As can be seen in Figure 1, the generated OoC examples nicely shield the respective in-distribution classes. OoC examples of one class can be in-distribution examples of other classes. This is an intended feature, and to this end, the loss term for the synthesized OoC examples in Equation 8 is class-conditional. This feature is supposed to make our likelihood proxy signal low density in the OoC regime. One can also observe that the estimated epistemic and aleatoric uncertainties are complementary, resulting in high aleatoric uncertainty in the overlapping region of the Gaussians while having low epistemic uncertainty there. This is one of the main advantages that sets our approach apart from related methods. Results for a slightly more challenging 2D toy example on the two moons dataset are presented in Figure 4. We now demonstrate that this result generalizes to higher-dimensional problems.
For MNIST and CIFAR10, Figure 3 shows the OoC examples produced at the end of the generator training. Due to using a conditional GAN and an AE, we are able to generate OoC examples (instead of only out-of-distribution examples as in related works) during test time. It can be seen that the resulting examples share many semantic similarities with the original class while still being distinguishable from it.
All presented results are computed on the respective (official) test sets of the datasets. We also conducted an extensive parameter study on the validation sets, which is summarized in Appendix D. The conclusion of this parameter study is that the performance of our framework is in general stable w.r.t. the choice of the hyperparameters. Increasing positively impacts the model performance up to a certain maximum. The best performance is obtained by choosing latent dimensions such that the cAE is able to compute reconstructions of good visual quality. Across all datasets, choosing achieves the best detection scores, also indicating a positive influence of the generated OoC examples on the model’s classification accuracy.
Method  InDistribution  OutofDistribution  
Accuracy  AUROC S/F  ECE  AUROC  AUPR-In  AUPR-Out  FPR @ 95% TPR  

Ours  
Ours with MCDropout  
OnevsAll Baseline  
Max. Softmax HendrycksG17  
Entropy  
BayesbyBackprop BlundellCKW15  
MCDropout GalG15a  
DeepEnsembles LakshminarayananPB17  
Confident Classifier LeeLLS18  
GEN SensoyKCS20  
Entropy Oracle  
OnevsAll Oracle 
Method  Accuracy TP  Accuracy FP  Accuracy OoD 

Ours  
Ours + MCDropout  
OnevsAll Baseline  
Max. Softmax  
Entropy  
BayesbyBackprop  
MCDropout  
DeepEnsembles  
Confident Classifier  
GEN  
Entropy Oracle  
OnevsAll Oracle 
Results obtained by aggregating predicted uncertainty estimates in a gradient boosting model, which was trained on a validation set.
Tables 1 and 2 present the results of our numerical experiments. The scores in the columns of the section In-Distribution are solely computed on the respective in-distribution dataset, i.e., MNIST 0-4 and CIFAR10 0-4, respectively. The scores in the columns of the section Out-of-Distribution display the OoD detection performance when presenting examples from the respective in-distribution dataset as well as from the entirety of all assigned OoD datasets. Note that we did not apply any balancing in the OoD datasets but included the respective test sets as is (see Appendix C for the sizes of the test sets). As an upper bound on the OoD detection performance, we also supply results for two oracle models, supplied with the real OoD training datasets they are evaluated on. One of them is trained with the standard softmax and binary cross entropy and the other one with our proposed loss function, cf. Equation 8.
We first discuss the in-distribution performance of our method. W.r.t. MNIST, the results given in the left section of Table 1 show that we are on par with state-of-the-art GAN-based approaches while still having a similar ECE, only being surpassed by MC-Dropout and the other baselines by a fairly small margin. However, considering the respective CIFAR10 results in the left section of Table 2, we clearly outperform state-of-the-art GAN-based methods as well as all other baseline methods by a large margin. Notably, we achieve an accuracy of , which is to percent points (pp.) above the other classifiers, and an AUROC S/F of , which is to pp. higher than for the other methods. This corresponds to a relative improvement of in accuracy and in AUROC S/F compared to the second-best method. This shows that our model can indeed utilize the GAN-generated OoC examples to better localize aleatoric uncertainty, therefore better separating success and failure, while at the same time improving the classifier's generalization. Furthermore, we observe that the calibration errors yield mid-tier results compared to the other methods. This signals that, although we incorporate generated OoC examples, contaminating the distribution of training data presented to the classifier, empirically there is no evidence that this harms the learned classifier, neither w.r.t. calibration nor w.r.t. separation.
Considering the OoD results from the right-hand sections of Tables 1 and 2, the superiority of our method over the others is now consistent across both in-distribution datasets. On the MNIST dataset, we outperform previously published works, especially considering the AUPR-In and FPR @ 95% TPR metrics, with a and relative improvement over the second-best method, respectively. This is consistent with the results for CIFAR10 as in-distribution dataset, where we achieve for both AUROC and AUPR-In a relative improvement of – over the second-best method.
Comparing our results with those of the oracles, two observations become apparent. Firstly, in some OoD experiments the GAN-generated OoC examples achieve results fairly close to those of the oracles, while in others there is still room for improvement (in particular w.r.t. FPR @ 95% TPR). Secondly, GAN-generated OoC examples can help improve generalization (in terms of classification accuracy), while real OoD data might be too far away from the in-distribution data.
As a final experiment, we perform CIFAR10-based OoD and FP detection in the wild, i.e., we perform both tasks jointly while presenting in-distribution and OoD data to the classifier in the same mix as in Table 2. To this end, we applied gradient boosting to the uncertainty scores provided by the respective methods to predict TP, FP and OoD. For the Bayesian approaches, we considered the sample mean's entropy and variance. The corresponding results are given in Table 3 in terms of class-wise accuracy. The main observations are that our method outperforms the other GAN-based methods and that our method including dropout achieves the overall best performance. It can be observed that the Entropy Oracle performs very strongly while using only a single uncertainty score. At a second glance, this is not surprising, since an FP mostly involves the confusion of two classes, while training the DNN to output maximal entropy on OoD examples is likely to result in the confusion of up to five classes, therefore yielding different entropy levels. We present a more detailed version of this experiment in Appendix E, where we also take the estimated class probabilities into account.

An OoD-dataset-wise breakdown of the results in Tables 1 and 2 is provided in Appendix F. For MNIST, this breakdown reveals that our method performs particularly well in the difficult task of separating MNIST 0-4 and MNIST 5-9. On the other MNIST-related tasks, we achieve mid-tier results, being slightly behind the other GAN-based methods. However, with regard to CIFAR10, we consistently outperform the other methods by large margins.
5 Conclusion
In this work, we introduced a GAN-based model yielding a one-vs-all classifier with complete uncertainty quantification. Our model distinguishes uncertainty between classes (in the large-sample limit approaching aleatoric uncertainty) from OoD uncertainty (approaching epistemic uncertainty). We have demonstrated in numerical experiments that our model sets a new state of the art w.r.t. OoD as well as FP detection. The generated OoC examples do not harm the training success in terms of calibration, but even improve it in terms of accuracy. We have seen that incorporating MC dropout to account for model uncertainty can further improve the results.
Acknowledgment
M.R. acknowledges useful discussions with Hanno Gottschalk.
References
Appendix A Theoretical Consideration of the One-vs-All Classifier
In the in-distribution regime, it might be more convenient to view our one-vs-all classifier as an ensemble of binary classifiers
(11) 
Therein, $\neg k$ denotes "not $k$", meaning any other class than $k$. Each classifier corresponding to class $k$ learns to indicate whether it is more likely that a given example (drawn from the marginal distribution w.r.t. the input feature space) belongs to class $k$ or not to class $k$. Under the assumption that the model class has the universal approximation property and that an empirical risk minimizer can be found (the former is true while the latter is NP-hard; however, these are common assumptions in statistical learning theory bookUML; vanHandel), in the large sample limit the classifier would converge to

(12)

if the training data for the classifiers were sampled i.i.d. from the data distribution and we trained with the binary cross entropy loss (Equation 4 without the weight factors and the class priors). However, since we balance the classes in Equation 4, our classifier receives data sampled from a reweighted distribution in which each class and its complement receive constant, equal probability. Hence, the classifier, if properly learned, converges to
(13) 
This justifies the selection of our one-vs-all classifier as an approximation to the likelihood. However, this approximating property immediately disappears once the generated examples from the GAN are also used as training data for the classifier. In Section 4, we demonstrate empirically that the GAN training does not reduce the overall classification accuracy. It also neither worsens the capability of the classifier to separate TPs and FPs via aleatoric uncertainty, nor does it affect the calibration error, cf. Tables 1 and 2. In our numerical experiments on CIFAR10 in the main part of this work, we even find a significant improvement in terms of classification accuracy and separation of TPs and FPs compared to the one-vs-all baseline as well as the one-vs-all oracle. This reveals the positive effects GAN examples can have not only on OoD detection but also on FP detection as well as classification accuracy.
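The balanced-sampling step can be made explicit. The following derivation is a sketch under the stated balanced-sampling assumption, with notation ($q_k$, $p(x \mid k)$) introduced here for illustration:

```latex
% Balanced sampling: each binary problem sees class k and its
% complement with equal probability 1/2.
\tilde{p}(x, k) = \tfrac{1}{2}\, p(x \mid k), \qquad
\tilde{p}(x, \neg k) = \tfrac{1}{2}\, p(x \mid \neg k).

% Bayes' rule under this balanced sampling distribution:
1 - q_k(x) \;\longrightarrow\; \tilde{p}(k \mid x)
  = \frac{\tfrac{1}{2}\, p(x \mid k)}
         {\tfrac{1}{2}\, p(x \mid k) + \tfrac{1}{2}\, p(x \mid \neg k)}
  = \frac{p(x \mid k)}{p(x \mid k) + p(x \mid \neg k)}.
```

Since the limit is a strictly monotone function of $p(x \mid k)$ wherever $p(x \mid \neg k) > 0$, using $1 - q_k$ as a likelihood proxy preserves the ordering induced by the true class-conditional likelihood.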
Appendix B Two Moons Toy Example
Figure 4: uncertainty maps with orange indicating high and white low uncertainty; Gaussian kernel density estimate of the GAN examples. Triangles indicate GAN OoC examples and crosses correspond to the in-distribution data. The data underlying the bottom row has higher variance than the data underlying the top row.
As a more challenging 2D example, we also present results on the two moons dataset. In Figure 4, the results of an experiment with two separable classes are shown. In the top row of the figure, the training data is class-wise separable. In that case, we observe that the decision boundary also belongs to the OoD regime (top right panel), which is correct as there is no in-distribution data present there. Our model is able to learn this since the classes are shielded tightly enough that the generated OoC examples are in part also located in the vicinity of the decision boundary. To better visualize the distribution of generated data, we depict estimated densities of the generated OoC data in the top right panel. For the aleatoric uncertainty in the top center panel, we observe that, due to numerical issues, aleatoric uncertainty increases further away from the in-distribution data. However, this can be accounted for by first considering epistemic uncertainty and then the aleatoric one. By this procedure, most of the examples close to the decision boundary would be correctly classified as OoD, which is appropriate since this example involves only a minor amount of aleatoric uncertainty due to the moderate sample size.
In the bottom row example, an experiment analogous to the top row but with a noisier version of the data is presented. The bottom left panel shows that the epistemic uncertainty on the decision boundary between the two classes clearly decreases in comparison to the top right panel. At the same time the bottom center panel shows that the gain in aleatoric uncertainty compared to the top center panel.
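For concreteness, the toy data underlying the two rows can be generated as in the following sketch, a hand-rolled variant of scikit-learn's `make_moons`; the sample sizes and the two noise levels are assumptions for illustration:

```python
import numpy as np

def make_two_moons(n=1000, noise=0.1, seed=0):
    """Generate two interleaving half circles ("two moons") with
    additive Gaussian noise; the noise level distinguishes the
    separable from the noisier variant."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, np.pi, size=n)
    upper = np.stack([np.cos(t), np.sin(t)], axis=1)
    lower = np.stack([1.0 - np.cos(t), 0.5 - np.sin(t)], axis=1)
    X = np.concatenate([upper, lower])
    X += rng.normal(scale=noise, size=X.shape)
    y = np.concatenate([np.zeros(n, int), np.ones(n, int)])
    return X, y

X_sep, y_sep = make_two_moons(noise=0.05)      # class-wise separable (top row)
X_noisy, y_noisy = make_two_moons(noise=0.25)  # noisier variant (bottom row)
```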
Note that for data points far away from the in-distribution regime, all estimated likelihoods take values close to zero. In practice, this requires the inclusion of a small constant in the denominator to circumvent numerical problems in the logarithmic loss terms. In our experiments, this also results in a high aleatoric uncertainty far away from the in-distribution regime, as all estimated probabilities uniformly take the lower bound's value. However, a joint consideration of aleatoric and epistemic uncertainty disentangles this, since a high estimated probability of being OoD means that the estimate of aleatoric uncertainty can be neglected. This also becomes evident in the center panels, where one can observe high aleatoric uncertainty outside the in-distribution regime, which can, however, be masked out by the OoD probability.
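The stabilization and masking described above can be sketched as follows; the constant `EPS` and the multiplicative gating by the OoD probability are illustrative choices, not the exact formulas of the implementation:

```python
import numpy as np

EPS = 1e-12  # small lower bound to avoid log(0) (assumed value)

def aleatoric_entropy(probs, eps=EPS):
    """Entropy of the estimated class probabilities. Far from the
    in-distribution regime all probabilities hit the eps floor, so the
    entropy is spuriously high there."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return -(probs * np.log(probs)).sum(axis=-1)

def masked_aleatoric(probs, p_ood, eps=EPS):
    """Joint treatment: discount the aleatoric estimate wherever the
    estimated OoD probability is high, as described in the text."""
    return (1.0 - p_ood) * aleatoric_entropy(probs, eps)
```

With `p_ood` close to one, the spuriously high entropy far away from the data is suppressed, while in-distribution points (`p_ood` near zero) keep their aleatoric score.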
Appendix C Hyperparameter Settings for Experiments
For the Bayes-by-Backprop implementation, we use spike-and-slab priors in combination with diagonal Gaussian posterior distributions as described in BlundellCKW15. MC-Dropout uses a 50% dropout probability on all weight layers. Both of these methods average their predictions over 50 forward passes. The deep ensembles were built by averaging 5 networks. Implementations of Confident Classifier and GEN use the architectures and hyperparameters recommended by the authors, and we followed their reference code where possible. Parameter studies showed that our method is mostly stable w.r.t. the hyperparameter selection. We used as proposed in GulrajaniAADC17, for MNIST and for CIFAR-10, and for MNIST and for CIFAR-10. The latent dimension for MNIST was set to , while using dimensions for CIFAR-10. We used mini-batch stochastic gradient descent with the Adam adam optimizer and a batch size of . The learning rate was initialized to for the classification model and to for the GAN, while linearly decaying both to over the course of all training iterations. Training on the MNIST dataset required generator iterations, while taking iterations for the CIFAR-10 dataset (one iteration corresponds to one batch). As recommended in GulrajaniAADC17, we use batch normalization only in the generator, while neither the critic nor the classifier uses any type of layer normalization. We also adopt the alternating training scheme from that work: for each generator iteration, the critic as well as the classifier perform optimization steps on the same batch. We apply mild data augmentation by using random horizontal flipping where appropriate. The test set sizes used for computing the numerical results can be found in Table 4.
Dataset  Test-Set Size
MNIST 0-4
MNIST 5-9
CIFAR-10 0-4
CIFAR-10 5-9
EMNIST-Letters
Fashion-MNIST
SVHN
Omniglot
LSUN
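The linear learning-rate decay and the alternating critic/classifier-versus-generator update scheme described above can be sketched as follows; the step counts, initial rate, and number of critic steps are placeholders, since the concrete values are not reproduced here:

```python
def linear_decay(step, total_steps, lr_init):
    """Linearly decay the learning rate to zero over all training iterations."""
    return lr_init * max(0.0, 1.0 - step / total_steps)

def train(total_steps, critic_steps=5, lr_init=1e-4):
    """Alternating scheme in the style of WGAN-GP training: several
    critic/classifier updates on the same batch per generator update."""
    schedule = []
    for step in range(total_steps):
        lr = linear_decay(step, total_steps, lr_init)
        for _ in range(critic_steps):
            pass  # critic and classifier optimization step on the same batch
        pass      # one generator optimization step
        schedule.append(lr)
    return schedule
```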
Appendix D Parameter Study
In order to examine the influence of the hyperparameter selection on our framework, we conducted an extensive experimental study. We display all evaluation metrics from Section 4 while varying , , and the chosen latent dimension for the cAE and cGAN. For in Equation 7, which controls the influence of the classifier predictions on the generated OoC examples, Figure 5 clearly shows that for MNIST and for CIFAR-10 are locally optimal values w.r.t. maximum performance. While larger values tend to increase the in-distribution accuracy slightly, they greatly decrease all other evaluation metrics. A very interesting observation in Figure 6 about the influence of from Equation 8 is that both extremes () greatly degrade the results. This clearly shows that the generated OoC examples have a positive effect on the OoD detection performance and the in-distribution separability. In terms of the dimensionality of the latent space, Figure 7 shows that and dimensions are the optimal values for MNIST and CIFAR-10, respectively. This is coherent with the visual quality of the examples decoded by the cAE, which does not improve much with higher dimensions. Analysing the influence of in Figure 8, one can observe a positive effect on the MNIST results when increasing the value of the parameter up to , where a local maximum is reached. It is also very apparent that for the results are comparatively bad, emphasizing the role of the low-dimensional regularizer in our model. For CIFAR-10 the effect is not as clear as for MNIST, but Figure 8 also shows a locally optimal setting. We believe that the higher latent dimension required for the CIFAR-10 dataset, and thus the curse of dimensionality, is the main factor behind this finding.
Appendix E Joint Detection of OoD and FP
Method  Uncertainty Scores and Predicted Probabilities  Uncertainty Scores Only
  Accuracy TP  Accuracy FP  Accuracy OoD  Accuracy TP  Accuracy FP  Accuracy OoD
Ours
Ours with MC-Dropout
One-vs-All Baseline
Max. Softmax HendrycksG17
Entropy
Bayes-by-Backprop BlundellCKW15
MC-Dropout GalG15a
Deep Ensembles LakshminarayananPB17
Confident Classifier LeeLLS18
GEN SensoyKCS20
Entropy Oracle
One-vs-All Oracle
Table 5 presents an extended version of Table 3, where the latter constitutes the right-hand half of the former. While the right-hand half presents results for gradient boosting applied to the uncertainty scores of each method, aiming to predict TP, FP and OoD, the left half of the table shows analogous results while additionally using the estimated class probabilities as inputs for gradient boosting. We do so for the sake of accounting for other possible transformations of the estimated class probabilities that we did not explicitly construct. In more detail, we use the following uncertainty scores:

- Ours: OoD uncertainty and entropy of the estimated class probabilities.
- Ours with MC-Dropout: , , and the standard deviations of the class probabilities, summed over all classes, under MC dropout with forward passes.
- One-vs-All Baseline: same as "Ours".
- Max. Softmax: maximum softmax probability.
- Entropy: entropy over the estimated class probabilities.
- Bayes-by-Backprop: for samples from the posterior, we compute (aleatoric uncertainty) and (epistemic uncertainty) as in KendallG17.
- MC-Dropout: entropy of the estimated class probabilities averaged over forward passes; standard deviation of the class probabilities summed over all classes.
- Deep Ensembles: entropy of the estimated class probabilities averaged over the ensemble members; standard deviation of the class probabilities summed over all classes.
- Confident Classifier: entropy over the estimated class probabilities.
- GEN: entropy over the estimated class probabilities resulting from the estimated evidence of the Dirichlet distribution.
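For the sample-based scores above (Bayes-by-Backprop, MC-Dropout, Deep Ensembles), a common entropy-based decomposition of the predictions into aleatoric and epistemic parts, in the spirit of KendallG17, can be sketched as:

```python
import numpy as np

def decompose_uncertainty(prob_samples, eps=1e-12):
    """prob_samples: (S, C) class probabilities from S posterior samples
    (or dropout passes / ensemble members). Returns (aleatoric, epistemic)
    via the entropy decomposition: epistemic = H(E[p]) - E[H(p)]."""
    p = np.clip(np.asarray(prob_samples, dtype=float), eps, 1.0)
    mean_p = p.mean(axis=0)
    total = -(mean_p * np.log(mean_p)).sum()        # entropy of mean prediction
    aleatoric = -(p * np.log(p)).sum(-1).mean()     # mean per-sample entropy
    return aleatoric, total - aleatoric

# Agreeing samples yield low epistemic uncertainty;
# confident but disagreeing samples yield high epistemic uncertainty.
a1, e1 = decompose_uncertainty([[0.9, 0.1], [0.9, 0.1]])
a2, e2 = decompose_uncertainty([[0.99, 0.01], [0.01, 0.99]])
```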
In the left part as well as the right part of the table, our method including dropout is fairly close to the best oracle, which is the entropy oracle. The good performance of the entropy oracle can be explained by the fact that FPs typically only involve the confusion of two classes, while training on all OoD data for high entropy enforces confusion among all five classes, thus resulting in a higher entropy level. Apart from the oracles, in both studies, i.e., including and excluding the estimated class probabilities, our method including MC-Dropout outperforms all other methods. However, reviewing the results in an absolute sense, there still remains plenty of room for improvement.
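The meta-classification protocol described above stacks the per-example uncertainty scores (and, for the left half of the table, additionally the class probabilities) into a feature matrix on which a gradient-boosting classifier with labels TP/FP/OoD is trained. The feature layout in the following sketch is an illustrative assumption:

```python
import numpy as np

def stack_meta_features(ood_score, class_probs, include_probs=True):
    """Build the meta-classifier input for joint TP/FP/OoD prediction:
    uncertainty scores only (right half of Table 5), or scores plus the
    estimated class probabilities (left half)."""
    probs = np.asarray(class_probs, dtype=float)
    clipped = np.clip(probs, 1e-12, 1.0)
    entropy = -(clipped * np.log(clipped)).sum(axis=-1)
    feats = [np.asarray(ood_score, dtype=float), entropy]
    if include_probs:
        feats.extend(probs.T)  # one feature column per class probability
    return np.stack(feats, axis=1)

# Shape: (n_examples, n_features); feed to any multi-class classifier,
# e.g. gradient boosting, with TP/FP/OoD labels as targets.
X = stack_meta_features([0.1, 0.9], [[0.7, 0.3], [0.5, 0.5]])
```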
Appendix F Detailed Results on Individual OoD Datasets
Method  AUROC  AUPR-In  AUPR-Out  FPR @ 95% TPR  AUROC  AUPR-In  AUPR-Out  FPR @ 95% TPR
MNIST 0-4 vs. MNIST 5-9  MNIST 0-4 vs. EMNIST-Letters
Ours
Ours + Dropout
One-vs-All Baseline
Max. Softmax
Entropy
Bayes-by-Backprop
MC-Dropout
Deep Ensembles
Confident Classifier
GEN
Entropy Oracle
One-vs-All Oracle
MNIST 0-4 vs. Omniglot  MNIST 0-4 vs. Fashion-MNIST
Ours
Ours + Dropout
One-vs-All Baseline
Max. Softmax
Entropy
Bayes-by-Backprop
MC-Dropout
Deep Ensembles
Confident Classifier
GEN
Entropy Oracle
One-vs-All Oracle
MNIST 0-4 vs. SVHN  MNIST 0-4 vs. CIFAR-10
Ours
Ours + Dropout
One-vs-All Baseline
Max. Softmax
Entropy
Bayes-by-Backprop
MC-Dropout
Deep Ensembles
Confident Classifier
GEN