1 Introduction
Deep neural networks are the state-of-the-art in many application areas. Nevertheless, it remains a major concern to use deep learning in safety-critical systems, e.g. medical diagnosis or self-driving cars, since deep learning classifiers have been shown to suffer from a number of unexpected failure modes, such as low robustness to natural perturbations
(geirhos2018generalisation; hendrycks2019benchmarking), overconfident predictions (NguYosClu2015; GuoEtAl2017; HenGim2017; HeiAndBit2019) and adversarial vulnerabilities (SzeEtAl2014). For safety-critical applications, empirical checks are not sufficient to trust a deep learning system with a high-stakes decision. Thus, provable guarantees on the behavior of a deep learning system are needed. One property that one expects from a robust classifier is that it should not
make highly confident predictions on data that is very different from the training data. However, ReLU networks have been shown to be provably overconfident far away from the training data
(HeiAndBit2019). This is a big problem, as (guaranteed) low confidence of a classifier when it operates outside of its training domain can be used to trigger human intervention or to let the system try to reach a safe state when it “detects” that it is applied outside of its specification. Several approaches to the out-of-distribution (OOD) detection task have been studied (HenGim2017; liang2017enhancing; LeeEtAl2018; lee2018simple; HeiAndBit2019). The current state-of-the-art OOD detection performance in image classification is achieved by enforcing low confidence on a large training set of natural images that is considered as out-distribution (HenMazDie2019; meinke2020towards).

Deep neural networks are also notoriously susceptible to small adversarial perturbations of the input (SzeEtAl2014; CarWag2016) which change the decision of a classifier. Research so far has concentrated on adversarial robustness on the in-distribution. Several empirical defenses have been proposed, but many could be broken again (CroHei2020; CarWag2017; AthCarWag2018). Adversarial training and its variations (MadEtAl2018; ZhaEtAl2019) perform well empirically, but typically no robustness guarantees can be given. Certified adversarial robustness has been achieved by explicit computation of robustness certificates (HeiAnd2017; WonKol2018; RagSteLia2018; MirGehVec2018; gowal2018effectiveness) and randomized smoothing (cohen2019certified).
Adversarial changes that generate high-confidence predictions on the out-distribution have received much less attention, although it has been shown early on that they can be used to fool a classifier (NguYosClu2015; SchEtAl2018; sehwag2019better). Thus, even if a classifier consistently manages to identify samples as not belonging to the in-distribution, it might still assign very high confidence to only marginally perturbed samples from the out-distribution, see Figure 1. A first empirical defense using a type of adversarial training for OOD detection has been proposed in (HeiAndBit2019). However, to the best of our knowledge, in the area of certified out-of-distribution detection the only robustness guarantees for OOD were given in (meinke2020towards)
, where a density estimator for in- and out-distribution is integrated into the predictive uncertainty of the neural network, which allows them to guarantee that far away from the training data the confidence of the neural network becomes uniform over the classes. Moreover, they can provide worst-case guarantees on the confidence in some balls around uniform noise. However, they are not able to provide meaningful guarantees around points which are similar or even close to the in-distribution data and, as we will show, provide only weak guarantees against
adversaries.

In this work we aim to provide worst-case OOD guarantees not only for noise but also for images from related but different image classification tasks. For this purpose we use techniques from interval bound propagation (IBP) (gowal2018effectiveness) to derive a provable upper bound on the maximal confidence of the classifier in an $\ell_\infty$-ball of radius $\epsilon$ around a given point. By minimizing this bound on the out-distribution using our training scheme GOOD (Guaranteed Out-Of-distribution Detection) we arrive at the first models which have guaranteed low confidence even on image classification tasks related to the original one; e.g., we get state-of-the-art results on separating letters of EMNIST from digits of MNIST even though the digit classifier has never seen any images of letters at training time. In particular, the guarantees for the training out-distribution generalize to other out-distribution datasets. In contrast to classifiers which have certified adversarial robustness on the in-distribution, GOOD has the desirable property of achieving provable guarantees for OOD detection with almost no loss in accuracy on the in-distribution task, even on datasets like CIFAR10.
2 Out-of-distribution detection: setup and baselines
Let $f : \mathbb{R}^d \to \mathbb{R}^K$ be a feedforward neural network (DNN) with a last linear layer, where $d$ is the input dimension and $K$ the number of classes. In all experiments below we use the ReLU activation function. The logits $f(x)$ of $f$ for an input $x$ are transformed via the softmax function into a probability distribution $\hat p(x)$ over the $K$ classes with
$$\hat p_k(x) = \frac{e^{f_k(x)}}{\sum_{l=1}^{K} e^{f_l(x)}}, \qquad k = 1, \ldots, K. \quad (1)$$
By $\mathrm{Conf}(x) = \max_{k=1,\ldots,K} \hat p_k(x)$ we define the confidence of the classifier in its prediction at $x$.
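As a concrete illustration, the softmax confidence can be computed in a few lines of numpy (a minimal sketch; the function names are ours):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis (Eq. 1)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def confidence(logits):
    """Confidence of the classifier: the maximal softmax probability."""
    return softmax(logits).max(axis=-1)
```

For equal logits the prediction is uniform and the confidence equals $1/K$.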
The general goal of OOD detection is to have low-confidence predictions for all inputs which clearly do not belong to the in-distribution task, especially for all inputs lying in a region which has zero probability under the in-distribution. One typical criterion for measuring OOD detection performance is to use $\mathrm{Conf}(x)$ as a feature and compute the AUC of in- versus out-distribution (how well the confidences of in- and out-distribution are separated). We discuss a proper conservative measurement of the AUC in the case of indistinguishable confidence values, e.g. due to limited numerical precision, in Appendix C.
As baselines and motivation for our provable approach we use the OOD detection methods Outlier Exposure (OE)
(HenMazDie2019) and Confidence Enhancing Data Augmentation (CEDA) (HeiAndBit2019), which use as training objective
$$\frac{1}{N} \sum_{i=1}^{N} L_{\mathrm{CE}}\big(f(x_i), y_i\big) \;+\; \frac{\kappa}{M} \sum_{j=1}^{M} L_{\mathrm{out}}\big(z_j\big), \quad (2)$$
where $(x_i, y_i)_{i=1}^{N}$ is the in-distribution training set, $(z_j)_{j=1}^{M}$ the out-distribution training set, and $L_{\mathrm{CE}}$ the cross-entropy loss. The hyperparameter $\kappa$ determines the relative magnitude of the two loss terms and is most of the time chosen to be one. OE and CEDA differ in the choice of the loss $L_{\mathrm{out}}$ for the out-distribution: OE uses the cross-entropy loss between $\hat p(z)$ and the uniform distribution, and CEDA uses the log-confidence $\max_k \log \hat p_k(z)$. Note that both the CEDA and the OE loss attain their global minimum when $\hat p(z)$ is the uniform distribution. Their difference is typically minor in practice. An important question is the choice of the out-distribution. For general image classification, it makes sense to use an out-distribution which encompasses basically any possible image one could ever see at test time, and thus the set of all natural images is a good out-distribution; following (HenMazDie2019) we use the 80 Million Tiny Images dataset (torralba200880) as a proxy for it.

While OE and CEDA yield state-of-the-art OOD detection performance for image classification tasks when used together with the 80M Tiny Images dataset as out-distribution, they are, similarly to normal classifiers, vulnerable to adversarial manipulation of the out-distribution images, where the attack tries to maximize the confidence (meinke2020towards). Thus, (HeiAndBit2019) proposed Adversarial Confidence Enhanced Training (ACET), which replaces the CEDA loss with $\max_{\|z' - z\|_\infty \le \epsilon} L_{\mathrm{out}}(z')$ and can be seen as adversarial training on the out-distribution for an $\ell_\infty$ threat model. However, similar to adversarial training on the in-distribution (MadEtAl2018), this does not yield any guarantees for out-distribution detection. In the next section we discuss how to use interval bound propagation (IBP) to get guaranteed OOD detection performance in a neighborhood of every out-distribution input.
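The two out-distribution losses can be sketched in numpy as follows (a minimal illustration of the losses as described above; the exact CEDA form follows HeiAndBit2019, and the helper names are ours):

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def oe_loss(logits):
    """OE: cross-entropy between the prediction and the uniform
    distribution, i.e. -(1/K) * sum_k log p_k."""
    return -log_softmax(logits).mean(axis=-1)

def ceda_loss(logits):
    """CEDA: maximal log-probability over the classes (the log-confidence)."""
    return log_softmax(logits).max(axis=-1)
```

Both losses are minimized exactly when the predictive distribution is uniform: for equal logits, `oe_loss` equals $\log K$ and `ceda_loss` equals $-\log K$.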
3 Provable guarantees for out-of-distribution detection
Our goal is to minimize the confidence of the classifier not only on the out-distribution images themselves but in a whole neighborhood around them. For this purpose, we first derive bounds on the maximal confidence on an $\ell_\infty$-ball around a given point. In certified adversarial robustness, IBP (gowal2018effectiveness) currently leads to the best guarantees for deterministic classifiers under the $\ell_\infty$ threat model. While other methods for deriving guarantees yield tighter bounds (WonKol2018; MirGehVec2018), they are not easily scalable, and, when optimized, the bounds given by IBP have been shown to be very tight (gowal2018effectiveness).
IBP.
Interval bound propagation (gowal2018effectiveness) provides entrywise lower and upper bounds $\underline{z}^{(k)}$ resp. $\overline{z}^{(k)}$ for the output of the $k$-th layer of a neural network, given that the input is varied in the $\ell_\infty$-ball of radius $\epsilon$ around $x$. With $\underline{z}^{(0)} = \max(x - \epsilon \mathbf{1}, 0)$ and $\overline{z}^{(0)} = \min(x + \epsilon \mathbf{1}, 1)$ ($\mathbf{1}$ is the vector of all ones) and $W^{(k)}, b^{(k)}$ being the weights of the $k$-th layer (fully connected, convolutional, residual etc.), one gets upper and lower bounds of the next layers via forward propagation:
$$\overline{z}^{(k)} = \max\big(W^{(k)}, 0\big)\, \overline{z}^{(k-1)} + \min\big(W^{(k)}, 0\big)\, \underline{z}^{(k-1)} + b^{(k)}, \qquad \underline{z}^{(k)} = \max\big(W^{(k)}, 0\big)\, \underline{z}^{(k-1)} + \min\big(W^{(k)}, 0\big)\, \overline{z}^{(k-1)} + b^{(k)}, \quad (3)$$
where the $\max$/$\min$ expressions are taken componentwise. The activation function (e.g. ReLU) is applied directly to the bounds. The forward propagation of the bounds is similar in nature to a standard forward pass, and backpropagation w.r.t. the weights is relatively straightforward.
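The propagation of Equation (3) can be sketched for fully connected layers as follows (a minimal numpy version with our own function names; convolutions work analogously since they are affine):

```python
import numpy as np

def ibp_affine(lb, ub, W, b):
    """Propagate box bounds through an affine layer z -> W z + b.
    Positive weights couple to like bounds, negative weights swap them."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    new_lb = W_pos @ lb + W_neg @ ub + b
    new_ub = W_pos @ ub + W_neg @ lb + b
    return new_lb, new_ub

def ibp_relu(lb, ub):
    """Monotone activations are applied directly to the bounds."""
    return np.maximum(lb, 0.0), np.maximum(ub, 0.0)

def ibp_forward(x, eps, layers):
    """Bounds for the l_inf-ball of radius eps around x, clipped to the
    image domain [0, 1]; `layers` is a list of (W, b) pairs, each followed
    by a ReLU in this sketch."""
    lb, ub = np.maximum(x - eps, 0.0), np.minimum(x + eps, 1.0)
    for W, b in layers:
        lb, ub = ibp_affine(lb, ub, W, b)
        lb, ub = ibp_relu(lb, ub)
    return lb, ub
```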
Upper bound on the confidence in terms of the logits.
The log-confidence of the model at $x$ can be written as
$$\log \mathrm{Conf}(x) = \max_k \Big( -\log \sum_{l=1}^{K} e^{\,f_l(x) - f_k(x)} \Big). \quad (4)$$
We assume that the last layer is affine: $f(x) = W^{(L)} z^{(L-1)} + b^{(L)}$, where $L$ is the number of layers of the network. We calculate upper bounds $\overline{f_{kl}}$ on all logit differences $f_k - f_l$ over the box $[\underline{z}^{(L-1)}, \overline{z}^{(L-1)}]$ as
$$\overline{f_{kl}} = \sum_{j} \Big[ \max\!\big( (W^{(L)}_k - W^{(L)}_l)_j, 0 \big)\, \overline{z}^{(L-1)}_j + \min\!\big( (W^{(L)}_k - W^{(L)}_l)_j, 0 \big)\, \underline{z}^{(L-1)}_j \Big] + b^{(L)}_k - b^{(L)}_l, \quad (5)$$
where $W^{(L)}_k$ denotes the $k$-th row of $W^{(L)}$ and $b^{(L)}_k$ is the $k$-th component of $b^{(L)}$. Note that this upper bound of the logit difference can be negative and is zero for $k = l$. Using this upper bound on the logit differences in Equation (4), we obtain an upper bound on the log-confidence:
$$\max_{\|x' - x\|_\infty \le \epsilon} \log \mathrm{Conf}(x') \;\le\; \max_k \Big( -\log \sum_{l=1}^{K} e^{-\overline{f_{kl}}} \Big). \quad (6)$$
We use the bound in (6) to evaluate the guarantees on the confidences for given out-distribution datasets. However, minimizing it directly during training leads to numerical problems, especially at the beginning of training, when the upper bounds $\overline{f_{kl}}$ for $k \neq l$ are very large, which makes training numerically infeasible. Instead, we rather upper bound the log-confidence again by bounding the sum inside the negative log from below with $K$ times its lowest term:
$$\max_{\|x' - x\|_\infty \le \epsilon} \log \mathrm{Conf}(x') \;\le\; \max_{k,l} \overline{f_{kl}} - \log K. \quad (7)$$
While this bound can differ considerably from the potentially tighter bound of Equation (6), it is often quite close, as one term in the sum dominates the others. Moreover, both bounds attain the same global minimum when all logits are equal over the ball. We omit the constant $\log K$ in the following, as it does not matter for training.
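Combining the IBP bounds on the penultimate layer with the last affine layer, the two confidence bounds can be sketched as follows (numpy, our own function names; `V` and `c` stand for the last-layer weight matrix and bias):

```python
import numpy as np

def logit_diff_upper_bounds(lb, ub, V, c):
    """Upper bounds f_bar[k, l] on f_k - f_l over the box [lb, ub] entering
    the last affine layer f(z) = V z + c (Eq. 5). The diagonal is zero."""
    K = V.shape[0]
    f_bar = np.zeros((K, K))
    for k in range(K):
        for l in range(K):
            w = V[k] - V[l]
            f_bar[k, l] = (np.maximum(w, 0) @ ub + np.minimum(w, 0) @ lb
                           + c[k] - c[l])
    return f_bar

def log_conf_upper_bound(f_bar):
    """Tighter bound of Eq. (6) on the worst-case log-confidence."""
    return np.max(-np.log(np.exp(-f_bar).sum(axis=1)))

def log_conf_upper_bound_loose(f_bar):
    """Looser bound of Eq. (7): maximal logit-difference bound minus log K."""
    return f_bar.max() - np.log(f_bar.shape[0])
```

When `lb == ub` (radius zero), the bound of Eq. (6) reduces exactly to the clean log-confidence, which makes for a simple sanity check.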
The direct minimization of the upper bound in (7) is still difficult, in particular for more challenging in-distribution datasets like SVHN and CIFAR10, as the bound can be several orders of magnitude larger than the in-distribution loss. Therefore, we use the logarithm of this quantity. However, we also want a more fine-grained optimization when the upper bound becomes small in the later stages of training. Thus we define the Confidence Upper Bound loss for an OOD input $z$ as
$$L_{\mathrm{CUB}}(z, \epsilon) = \log\Big( 1 + \max_{k,l} \overline{f_{kl}}(z, \epsilon) \Big). \quad (8)$$
Note that $\log(1 + u) \approx u$ for small $u$, and thus we achieve the more fine-grained optimization with an $\ell_1$-type loss in the later stages of training. The overall objective of fully applied Guaranteed OOD Detection training (GOOD_{100}) is the minimization of
$$\frac{1}{N} \sum_{i=1}^{N} L_{\mathrm{CE}}\big(f(x_i), y_i\big) \;+\; \frac{\kappa}{M} \sum_{j=1}^{M} L_{\mathrm{CUB}}(z_j, \epsilon), \quad (9)$$
where $(x_i, y_i)_{i=1}^{N}$ is the in-distribution training set and $(z_j)_{j=1}^{M}$ the out-distribution training set. The hyperparameter $\kappa$ determines the relative magnitude of the two loss terms. During training we slowly increase $\kappa$ and $\epsilon$ in order to further stabilize training with GOOD.
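A sketch of the loss of Eq. (8) and the objective of Eq. (9) in numpy (illustrative only; `f_bar` is the matrix of logit-difference upper bounds, and the exact damping form should be taken from the equations above):

```python
import numpy as np

def cub_loss(f_bar):
    """Confidence Upper Bound loss of Eq. (8): a log-damped version of the
    worst-case logit-difference bound. `f_bar` is the K x K matrix of upper
    bounds on f_k - f_l over the l_inf-ball; its diagonal is zero, so the
    max is non-negative and log1p is well defined."""
    return np.log1p(np.maximum(f_bar.max(), 0.0))

def good_objective(ce_losses, cub_losses, kappa):
    """Eq. (9): mean in-distribution cross-entropy plus kappa times the
    mean CUB loss over the out-distribution batch."""
    return np.mean(ce_losses) + kappa * np.mean(cub_losses)
```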
Quantile-GOOD: trade-off between clean and guaranteed AUC.
Training models by minimizing (9) means that the classifier is severely punished if any training OOD input receives a high confidence upper bound. If there exist OOD inputs to which the classifier already assigns high confidence without even considering the worst case, e.g. because these inputs share features with the in-distribution, it makes little sense to enforce low-confidence guarantees on them. In the experiments we show that this can happen for difficult tasks like CIFAR10. In such cases the normal AUC for OOD detection deteriorates, as the high loss on the out-distribution part effectively leads to low confidence on a significant part of the in-distribution, which is clearly undesirable.
Hence, for OOD inputs which are not clearly distinguishable from the in-distribution, it is preferable to just have the “normal” loss $L_{\mathrm{CUB}}(z, 0)$ without considering the worst case. We realize this by enforcing the loss with the guaranteed upper bounds on the confidence only on some quantile of the easier OOD inputs, namely the ones with the lowest guaranteed out-distribution loss $L_{\mathrm{CUB}}(z, \epsilon)$. We first order the OOD training set by the potential loss of each sample in ascending order via a permutation $\pi$, that is $L_{\mathrm{CUB}}(z_{\pi(1)}, \epsilon) \le \ldots \le L_{\mathrm{CUB}}(z_{\pi(M)}, \epsilon)$. We then apply the loss $L_{\mathrm{CUB}}(\cdot, \epsilon)$ to the lower quantile $q$ of the points (the ones with the smallest loss) and take $L_{\mathrm{CUB}}(\cdot, 0)$ for the remaining samples, which means that no worst-case guarantees on the confidence are enforced for them:
$$L_Q\big(z_{\pi(j)}\big) = \begin{cases} L_{\mathrm{CUB}}\big(z_{\pi(j)}, \epsilon\big) & \text{if } j \le \lceil qM \rceil, \\[2pt] L_{\mathrm{CUB}}\big(z_{\pi(j)}, 0\big) & \text{otherwise.} \end{cases} \quad (10)$$
During training we do this ordering on the part of each batch that consists of out-distribution images. On CIFAR10, where the out-distribution dataset 80M Tiny Images is closer to the in-distribution, the quantile GOOD loss allows us to choose the trade-off between clean and guaranteed AUC for OOD detection, similar to the trade-off between clean and robust accuracy in adversarial robustness.
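The per-batch quantile split of Eq. (10) can be sketched as follows (numpy; argument names are ours, and the two loss vectors are assumed to be computed per sample beforehand):

```python
import numpy as np

def quantile_good_loss(cub_eps, cub_clean, q):
    """Quantile-GOOD loss over one batch of OOD samples (Eq. 10).

    cub_eps   -- per-sample CUB losses with the worst-case bound (radius eps)
    cub_clean -- per-sample losses on the unperturbed inputs (radius 0)
    q         -- fraction of (easy) samples on which the guarantee is enforced
    """
    order = np.argsort(cub_eps)               # ascending: easiest samples first
    m = int(np.ceil(q * len(cub_eps)))
    losses = np.concatenate([cub_eps[order[:m]],     # enforce worst-case bound
                             cub_clean[order[m:]]])  # plain low-confidence loss
    return losses.mean()
```

Setting `q = 1` recovers GOOD_{100}; `q = 0` applies only the clean loss, i.e. a CEDA/OE-like variant.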
4 Experiments
We provide experimental results for image recognition tasks with MNIST (mnist), SVHN (SVHN) and CIFAR10 (krizhevsky2009learning)
as in-distribution datasets. We first discuss the training details, hyperparameters and evaluation before we present the results of GOOD and competing methods. Code is available under
https://gitlab.com/Bitterwolf/GOOD.

4.1 Model architectures, training procedure and evaluation
Model architectures and data augmentation.
For all experiments, we use deep convolutional neural networks consisting of convolutional, affine and ReLU layers. For MNIST, we use the large architecture from
(gowal2018effectiveness), and for SVHN and CIFAR10 a similar but deeper and wider model. The layer structure is laid out in Table 2 in the appendix. Data augmentation is applied to both in- and out-distribution images during training. For MNIST we use random crops with padding 4, and for SVHN and CIFAR10 random crops with padding 4 as well as the quite aggressive augmentation AutoAugment
(cubuk2019autoaugment). Additionally, we apply random horizontal flips for CIFAR10.

GOOD training procedure.
As is the case for IBP training (gowal2018effectiveness) for certified adversarial robustness, we have observed that the inclusion of IBP bounds can make the training unstable or cause it to fail completely. This can happen for GOOD training despite the logarithmic damping in the loss (8). Thus, in order to further stabilize the training, similar to (gowal2018effectiveness), we use linear ramp-up schedules for $\kappa$ and $\epsilon$, which are detailed in Appendix D. As radii for the $\ell_\infty$ perturbation model on the out-distribution we use $\epsilon = 0.3$ for MNIST and $\epsilon = 0.01$ for SVHN and CIFAR10 (note that $0.01 \approx 2.55/255$). The chosen $\epsilon$ for SVHN/CIFAR10 is so small that the changes are hardly visible (see Figure 1). The parameter $\kappa$ for the trade-off between the cross-entropy loss and the GOOD regularizer in (9) and (10) is set differently for MNIST than for SVHN and CIFAR10.
In order to explore the potential trade-off between the separation of in- and out-distribution for clean and perturbed out-distribution inputs (clean AUCs vs guaranteed AUCs, see below), we train GOOD models for different quantiles $q$ in (10), which we denote as GOOD_{Q} in the following. Here, $Q$ is the percentage of out-distribution training samples for which we minimize the guaranteed upper bound on the confidence of the neural network in the $\ell_\infty$-ball of radius $\epsilon$ around the out-distribution point during training. Note that GOOD_{100} corresponds to (9), where we minimize the guaranteed upper bound on the worst-case confidence for all out-distribution samples, whereas GOOD_{0} can be seen as a variant of OE or CEDA. A training batch consists of 128 in- and 128 out-distribution samples. Examples of OOD training batches with the employed augmentation and their quantile splits for a GOOD_{60} model are shown in Table 3 in the appendix.
For the training out-distribution, we use 80 Million Tiny Images (80M) (torralba200880), which is a large collection of natural images associated with nouns in WordNet (fellbaum2012wordnet). All methods get the same out-distribution for training, and we neither train nor adapt hyperparameters for each OOD dataset separately, as has been done in some previous work. Since CIFAR10 and CIFAR100 are subsets of 80M, we follow (HenMazDie2019) and filter them out. Even after this filtering process we have observed that the remaining dataset still contains images from the CIFAR10 and CIFAR100 classes. Thus we have further excluded all samples for which a CIFAR10 CEDA model has confidence above 11%, altogether removing 4.25M images. As can be seen in the example batches in Table 3, even this reduced dataset still contains images from CIFAR10 classes, which explains why our quantile-based loss is essential for good performance on CIFAR10. We take a subset of 50 million images as OOD training set. Since the size of the training set of the in-distribution datasets (MNIST: 60,000; SVHN: 73,257; CIFAR10: 50,000) is small compared to 50 million, an OOD image typically appears only once during training.
Evaluation.
For each method, we compute the test accuracy on the in-distribution task, and for various out-distribution datasets (not seen during training) we report the area under the receiver operating characteristic curve (AUC) as a measure for the separation of in- from out-distribution samples, based on the predicted confidences on the test sets. As OOD evaluation sets we use FashionMNIST
(XiaoEtAl2017), the letters of EMNIST (CohEtAl2017), grayscale CIFAR10 and uniform noise for MNIST, and CIFAR100 (krizhevsky2009learning), CIFAR10/SVHN, LSUN Classroom (lsun) and uniform noise for SVHN/CIFAR10. Further evaluations on other OOD datasets can be found in Appendix H.

We are particularly interested in the worst-case OOD detection performance of all methods under the $\ell_\infty$ perturbation model for the out-distribution. For this purpose, we compute the adversarial AUC (AAUC) and the guaranteed AUC (GAUC). These AUCs are based on the maximal confidence in the $\ell_\infty$-ball of radius $\epsilon$ around each out-distribution image. For the adversarial AUC, we compute a lower bound on the maximal confidence in the ball by using AutoPGD (CroHei2020) to maximize the confidence of the classifier inside the intersection of the $\ell_\infty$-ball and the image domain $[0,1]^d$. AutoPGD uses an automatic step-size selection scheme and has been shown to outperform PGD. We use an adaptation to our setting (described in Appendix A) with 500 steps and 5 restarts on 1000 points from each test set. On MNIST, gradient masking poses a significant challenge, so we use an additional attack discussed in Appendix A and report the worst case. For the guaranteed AUC, we compute an upper bound on the confidence in the intersection of the $\ell_\infty$-ball with the image domain via IBP using (6) for the full test set. These worst-case/guaranteed confidences on the out-distributions are then used for the AUC computation.
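The AUC computation from (worst-case) confidences, with ties counted half as described in Appendix C, can be sketched as follows (a minimal numpy version; for the GAUC one passes the certified upper bounds as the OOD confidences, for the AAUC the attack's lower bounds):

```python
import numpy as np

def auc(conf_in, conf_out):
    """AUC: probability that a random in-distribution sample receives a
    strictly higher confidence than a random out-distribution sample,
    plus one half times the probability of a tie."""
    diff = conf_in[:, None] - conf_out[None, :]   # all pairwise differences
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()
```

This pairwise formulation is quadratic in the number of samples but makes the tie handling explicit; for large test sets a rank-based implementation is equivalent.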
Competitors. We compare against a normally trained model (Plain), the state-of-the-art OOD detection method Outlier Exposure (OE) (HenMazDie2019), CEDA (HeiAndBit2019) and Adversarial Confidence Enhanced Training (ACET) (HeiAndBit2019), which we adjusted to the given task as described in the appendix. As CEDA performs very similarly to OE, we omit it in the figures for better readability. The radii of the $\ell_\infty$-balls are the same for ACET and GOOD. So far the only method which could provide robustness guarantees for OOD detection is Certified Certain Uncertainty (CCU) (meinke2020towards) with a data-dependent Mahalanobis-type threat model. We use their publicly available code to train a CCU model with our architecture and we evaluate their guarantees for our threat model. In Appendix B, we provide details and explain why their guarantees turn out to be vacuous in our setting.
4.2 Results
In Table 1 we present the results on all datasets.
GOOD is provably better than OE/CEDA with regard to worst-case OOD detection. We note that GOOD achieves non-trivial GAUCs for almost all OOD datasets. Thus the guarantees generalize from the training out-distribution 80M to the test OOD datasets. For the easier in-distributions MNIST and SVHN, which are more clearly separated from the out-distribution, the best results are achieved by GOOD_{100}, whereas for CIFAR10 the best guarantees are given by GOOD_{90} and GOOD_{95}. However, taking the clean AUCs into account, arguably the best trade-off is achieved by GOOD_{80}. Note that the guaranteed AUC (GAUC) of these models is always better than the adversarial AUC (AAUC) of OE/CEDA (except for EMNIST). Thus it is fair to say that the worst-case OOD detection performance of GOOD is provably better than that of OE/CEDA. As expected, ACET yields good AAUCs but comes with no guarantees. The failure of CCU regarding guarantees is discussed in Appendix B. It is notable that GOOD_{100} has basically perfect guaranteed OOD detection performance for MNIST on CIFAR10/uniform noise and for SVHN on all out-distribution datasets. In Appendix I we show that the guarantees of GOOD partially hold even at larger radii than the one used during training.
GOOD achieves certified OOD performance with almost no loss in accuracy. While there is a small drop in clean accuracy for MNIST, on SVHN GOOD_{100} has, at 96.6%, a better clean accuracy than all competing methods. On CIFAR10, GOOD_{80} achieves an accuracy of 90.0%, which is better than ACET and only slightly worse than Plain and OE. This is remarkable, as we are not aware of any model with certified adversarial robustness on the in-distribution which gets even close to this range; e.g., IBP (gowal2018effectiveness) achieves an accuracy of 85.2% on SVHN with $\epsilon = 0.01$ (we have 96.6%), and on CIFAR10 with a comparable $\epsilon$ they get 71.2% (we have 90.0%). Previous certified methods had even worse clean accuracy. Since a significant loss in prediction performance is usually not acceptable, certified methods have not yet had much practical impact. Thus we think it is an encouraging and interesting observation that properties different from adversarial robustness, like worst-case out-of-distribution detection, can be certified without suffering much in accuracy. In particular, it is quite surprising that certified methods can be trained effectively with aggressive data augmentation like AutoAugment.
Trade-off between clean and guaranteed AUC via Quantile-GOOD. As discussed above, even after filtering, our training out-distribution contains in-distribution images from CIFAR10 classes. This seems to be the reason why GOOD_{100} suffers a significant drop in clean and guaranteed AUC: if in- and out-distribution are partially indistinguishable, the only way to ensure a small loss $L_{\mathrm{CUB}}$ is to also reduce the confidence on the in-distribution. This conflict is resolved by GOOD_{80} and GOOD_{90}, which both have better clean and guaranteed AUCs. It is an interesting open question whether similar trade-offs are potentially also useful for certified adversarial robustness.
EMNIST: distinguishing letters from digits without ever having seen letters. GOOD_{100} achieves an excellent AUC of 98.9% for the letters of EMNIST, which is, to the best of our knowledge, state-of-the-art. Indeed, an AUC of 100% should not be expected, as even for humans some letters like “i” and “l” are indistinguishable from digits. This result is quite remarkable, as GOOD_{100} has never seen letters during training. Moreover, as the AUC only measures the separation of in- and out-distribution based on the confidence, we provide the mean confidence on all datasets in Table 4 in the Appendix, and in Figure 2 (see also Figure 3 in the Appendix) we show samples from EMNIST together with their predictions/confidences for all models. GOOD_{100} has a high mean confidence on MNIST but a much lower one on EMNIST, in contrast to ACET, OE and Plain, which all remain highly confident on EMNIST. This shows that while the AUCs of ACET and OE are good for EMNIST, these methods are still highly overconfident on EMNIST. Only GOOD_{100} produces meaningfully higher confidences on EMNIST exactly when the letter has clear features of the corresponding digit.
5 Conclusion
We propose GOOD, a novel training method that achieves guaranteed OOD detection in a worst-case setting. GOOD provably outperforms OE, the state-of-the-art in OOD detection, in worst-case OOD detection and has state-of-the-art performance on EMNIST, which is a particularly challenging out-distribution dataset. As the test accuracy of GOOD is comparable to that of normal training, this shows that certified methods have the potential to be useful in practice even for more complex tasks. In future work it will be interesting to explore how close certified methods can get to state-of-the-art test performance.
Broader Impact
In order to use machine learning in safety-critical systems, it is required that the machine learning system correctly flags its uncertainty. As neural networks have been shown to be overconfident far away from the training data, this work aims at overcoming this issue by not only enforcing low confidence on out-distribution images but even guaranteeing low confidence in a neighborhood around them. As a neural network should not flag that it knows when it does not know, we see only positive implications of this work for our society.
Acknowledgements
The authors acknowledge support from the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039A) and from the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy (EXC number 2064/1, Project number 390727645). The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Alexander Meinke.
References
Appendix
Appendix A Adversarial attacks on OOD detection
It has been demonstrated [biggio2013evasion, SzeEtAl2014, CarWag2016, Croce_2019_ICCV] that without strong countermeasures, DNNs are very susceptible to adversarial attacks that change the classification result. The goal of adversarial attacks in our setting is to fool the OOD detection, which is based on the confidence of the prediction. Thus the attacker aims at maximizing the confidence in a neighborhood of a given out-distribution input, so that the adversarially modified image is wrongly assigned to the in-distribution. In this paper, we regard as threat model/neighborhood an $\ell_\infty$-ball of a given radius $\epsilon$, that is $B_\infty(z, \epsilon) = \{z' : \|z' - z\|_\infty \le \epsilon\}$; note that in our case the perturbed inputs have to be valid images, hence the additional constraint $z' \in [0,1]^d$.
For the evaluation, we use AutoPGD [CroHei2020], which is a state-of-the-art implementation of PGD (projected gradient descent) with backtracking, adaptive step sizes and random restarts. Since AutoPGD has been designed for finding adversarial samples around the in-distribution, we change the objective of AutoPGD to the confidence of the classifier. We use AutoPGD with 500 steps and 5 random restarts, which is a quite strong attack. By default, the random initialization is drawn uniformly from the $\ell_\infty$-ball. However, we found that for MNIST the attack very often got stuck for our GOOD models, because a large random perturbation of size 0.3 would move the sample directly into a region of the input space where the model is completely flat and thus no gradients are available (in this sense, adversarial attacks on OOD inputs are more difficult than usual adversarial attacks on the in-distribution). We instead use a modified version of the attack for MNIST which starts within a short distance of the original point, i.e. we use as initialization a random perturbation of small $\ell_\infty$-norm (note that for our evaluation on SVHN and CIFAR10, this choice coincides with the default settings).
Nevertheless, for MNIST most out-distribution points lie in regions where the predictions of our GOOD models are flat, i.e. the gradients are exactly zero. Because of this, AutoPGD is unable to effectively explore the search space around those points. Thus, for MNIST we created an adaptive attack which partially circumvents these issues. First, we use an initialization scheme that mitigates the lack of gradients by increasing the contrast as much as the threat model allows: bright pixel values are increased by $\epsilon$ and dark ones decreased by $\epsilon$, clipped to the image domain. In our experience these points are more likely to yield gradients, so we use them as initialization for a 200-step PGD attack with backtracking, adaptive step-size selection and momentum. Whenever a PGD step does not increase the confidence, we backtrack and halve the step size; after every successful gradient step, we slightly increase the step size again. Using backtracking and an adaptive step size is necessary because otherwise one easily steps into regions where gradient information is no longer available. Additionally, to further mitigate the problem of gradient masking at initialization, we use the adversarial images that AutoPGD finds for models without significant gradient masking (Plain, OE, GOOD) as initialization for the same monotone PGD attack on models which show significant gradient masking (CEDA, ACET).
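The monotone PGD attack with backtracking described above can be sketched as follows (numpy; the step-size constants and the momentum value are illustrative, not the paper's exact settings, and `conf_and_grad` is a user-supplied oracle returning the confidence and its gradient):

```python
import numpy as np

def pgd_confidence_attack(x, eps, conf_and_grad, steps=200, step0=None, momentum=0.9):
    """Monotone PGD on the confidence: signed-gradient steps projected onto
    B_inf(x, eps) intersected with [0, 1]^d, accepting a step only if the
    confidence increases (backtracking halves the step on failure)."""
    step = eps / 10.0 if step0 is None else step0
    z, g_acc = x.copy(), np.zeros_like(x)
    best_conf, _ = conf_and_grad(z)
    for _ in range(steps):
        _, grad = conf_and_grad(z)
        g_acc = momentum * g_acc + grad                # gradient momentum
        cand = np.clip(z + step * np.sign(g_acc), 0.0, 1.0)
        cand = np.clip(cand, x - eps, x + eps)         # project onto the ball
        cand_conf, _ = conf_and_grad(cand)
        if cand_conf > best_conf:                      # accept, grow the step
            z, best_conf, step = cand, cand_conf, step * 1.1
        else:                                          # backtrack: halve it
            step *= 0.5
    return z, best_conf
```

Because a step is only accepted when the confidence improves, the returned confidence is monotonically non-decreasing over the iterations.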
Appendix B A review of robust OOD detection
ACET
A method that was proposed to achieve adversarially robust low confidence on OOD data is Adversarial Confidence Enhancing Training (ACET) [HeiAndBit2019], which is based on adversarial training on the out-distribution. However, similar to adversarial training on the in-distribution, this typically does not lead to any guarantees, whereas our goal is to get guarantees on the confidences of worst-case out-distribution inputs. ACET has the following objective:
$$\frac{1}{N} \sum_{i=1}^{N} L_{\mathrm{CE}}\big(f(x_i), y_i\big) \;+\; \frac{\kappa}{M} \sum_{j=1}^{M} \max_{\|z' - z_j\|_\infty \le \epsilon} L_{\mathrm{out}}\big(z'\big). \quad (11)$$
They use smoothed uniform noise, i.e. noise with low-frequency components, as their training out-distribution. We found, firstly, that training an ACET model with 80M as out-distribution yields much better results than the smoothed uniform noise used in [HeiAndBit2019], and secondly, that using the cross-entropy loss with respect to the uniform prediction instead of the maximal log-confidence also leads to improvements. For training ACET models, we employ a standard PGD attack with 40 steps, initialized at the target input, for maximizing the loss around $z$. As usual for an $\ell_\infty$ attack, we use the sign of the gradient as direction and project onto the intersection of the image domain and the $\ell_\infty$-ball of radius $\epsilon$ around the target. Finally, the attack returns the image with the highest confidence found during the iterations. For the attack at training time we use no backtracking or adaptive step sizes. ACET does not provide any guaranteed confidence bounds.
CCU
Certified Certain Uncertainty (CCU) [meinke2020towards] gives low-confidence guarantees around certain OOD data that is far away from the training set in a specific metric. Those bounds do hold on such far-away datasets, but do not generalize to inputs relatively close to the in-distribution, like for example CIFAR10 vs. CIFAR100. Moreover, even in the regime where CCU yields meaningful guarantees, they are given in terms of a data-dependent Mahalanobis distance rather than the $\ell_\infty$ distance. However, due to norm equivalences, one can still extract $\ell_\infty$ guarantees from CCU, and we evaluated the CCU guarantees as follows. We use Corollary 3.1 from [meinke2020towards], which states that for a CCU model whose predictive distribution is written as
$$\hat p(y \mid x) = \frac{p_{\mathrm{in}}(x)\, \hat p_\theta(y \mid x) + \frac{1}{K}\, p_{\mathrm{out}}(x)}{p_{\mathrm{in}}(x) + p_{\mathrm{out}}(x)}, \quad (12)$$
with $\hat p_\theta$ being the softmax output of a neural network and $p_{\mathrm{in}}$ and $p_{\mathrm{out}}$ Gaussian mixture models for in- and out-distribution, one can bound the confidence in a certain neighborhood around any point via
(13) 
Here $g$ is a positive function that increases monotonically in the radius and depends on the parameters of the Gaussian mixture models (details in [meinke2020towards]). The metric that they used for their CCU model is given as
$$d(x, y) \;=\; \sqrt{(x - y)^{\top} \Sigma^{-1} (x - y)} \qquad (14)$$
where $\Sigma$ is a regularized version of the covariance matrix, calculated on the augmented in-distribution data. Note that this Mahalanobis metric is strongly equivalent to the metric induced by the $\ell_2$ norm and consequently to the metric induced by the $\ell_\infty$ norm. By computing the equivalence constants between these metrics we can extract the $\ell_\infty$ guarantees that are implicit in the CCU model. Geometrically speaking, we compute the size of an ellipsoid (its shape determined by the eigenvalues of $\Sigma^{-1}$) that is large enough to fit a cube inside it with the radius given by our respective threat models. Via norm equivalences one has
$$d(z, x) \;\le\; \sqrt{\lambda_{\max}}\, \lVert z - x \rVert_2 \;\le\; \sqrt{D\, \lambda_{\max}}\, \lVert z - x \rVert_\infty \qquad (15)$$
where $\lambda_{\max}$ is the largest eigenvalue of $\Sigma^{-1}$ and $D$ the input dimension. This means that the confidence upper bounds from (13) on a Mahalanobis ball of radius $\sqrt{D\,\lambda_{\max}}\,\epsilon$ automatically apply to an $\ell_\infty$-ball of radius $\epsilon$. However, the covariance matrix is highly ill-conditioned, which means that $\lambda_{\max}$ is fairly high. On top of that, in high dimensions $\sqrt{D}$ is big as well, so that in practice the required Mahalanobis radius becomes too large for CCU to certify meaningful guarantees. Even on uniform noise, the upper bounds were larger than the highest confidence on the in-distribution test set, with the consequence that there are no lower bounds on the AAUC. However, we want to stress that, at least for uniform noise, the lack of guarantees of CCU is due to the incompatibility of the threat models used in our paper and in [meinke2020towards].
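The extraction of an $\ell_\infty$ guarantee radius can be illustrated with a small numpy sketch; the covariance matrix here is synthetic and low-dimensional, and only the norm-equivalence computation mirrors the procedure above.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8
# Hypothetical regularized covariance matrix of the in-distribution data.
A = rng.normal(size=(d, d))
Sigma = A @ A.T + 0.1 * np.eye(d)
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis(x, y, Sigma_inv):
    diff = x - y
    return float(np.sqrt(diff @ Sigma_inv @ diff))

eps = 0.01                                   # l_inf radius of the threat model
lam_max = np.linalg.eigvalsh(Sigma_inv)[-1]  # largest eigenvalue of Sigma^{-1}
r = eps * np.sqrt(d * lam_max)               # Mahalanobis radius covering the cube

# Sanity check: every corner of the eps-cube lies inside the Mahalanobis r-ball,
# so a bound that holds on the r-ball holds on the whole cube.
x = rng.uniform(size=d)
corners = x + eps * np.array(np.meshgrid(*([[-1, 1]] * d))).reshape(d, -1).T
assert all(mahalanobis(c, x, Sigma_inv) <= r + 1e-9 for c in corners)
```

With an ill-conditioned $\Sigma$ and large input dimension, the factor $\sqrt{d\,\lambda_{\max}}$ blows up, which is exactly why the extracted radii become too large to certify anything meaningful.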
Another type of guarantee, which certifies a detection rate for OOD samples by applying probably approximately correct (PAC) learning considerations, has been proposed in [liu18e]. Their problem setting and the nature of their guarantees are not directly comparable to ours, since their guarantees concern behaviour on whole distributions while ours are given for individual data points.
Appendix C AUC and Conservative AUC
As a measure for the separation of in- vs. out-distribution data we use the Area Under the Receiver Operating Characteristic curve (AUROC or AUC), using the confidence of the classifier as the feature. The AUC is equal to the empirical probability that a random in-distribution sample is assigned a higher confidence than a random out-distribution sample, plus one half times the probability of the confidences being equal. Thus, the standard way (as e.g. implemented in scikit-learn [scikitlearn]) to calculate the AUC from given confidence values $c$ on sets of in- and out-distribution samples $I$ and $O$ is
$$\mathrm{AUC} \;=\; \frac{\bigl|\{(x, z) \in I \times O : c(x) > c(z)\}\bigr| \;+\; \tfrac{1}{2}\,\bigl|\{(x, z) \in I \times O : c(x) = c(z)\}\bigr|}{|I| \cdot |O|} \qquad (16)$$
where for a set $S$, $|S|$ indicates the number of its elements. The half-weighted equality term gives this definition certain symmetry properties. However, it assigns a positive score to some completely uninformed confidence functions $c$. For example, a constant classifier that predicts the uniform distribution on every input receives an AUC value of 50%. Similarly, a classifier that assigns 100% confidence to most in-distribution inputs would have positive AUC and even GAUC statistics, even if it fails to have confidence below 100% on any OOD inputs. In order to count only example pairs where the distributions are positively distinguished, we define the conservative AUC by dropping the equality term:
$$\mathrm{AUC}_{\mathrm{conservative}} \;=\; \frac{\bigl|\{(x, z) \in I \times O : c(x) > c(z)\}\bigr|}{|I| \cdot |O|} \qquad (17)$$
While in general the conservative AUC is less than or equal to the AUC, the confidences of all models presented in the paper are differentiated enough that the two coincide for all shown numbers. However, we have experienced that one can have models whose confidences (uniform or one-hot predictions) cannot be distinguished due to limited numerical precision. In these cases the normal AUC definition would indicate a certain discrimination where it is actually impossible to discriminate the confidences.
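Both definitions can be computed directly from the two arrays of confidence values; the following is a small numpy sketch, vectorized over all pairs and therefore only suitable for moderate set sizes.

```python
import numpy as np

def auc(conf_in, conf_out, conservative=False):
    """Empirical AUC from confidence values on in- and out-distribution sets.

    Ties count with weight 1/2 in the standard AUC (16); the conservative
    variant (17) drops the equality term entirely."""
    conf_in = np.asarray(conf_in, dtype=float)[:, None]
    conf_out = np.asarray(conf_out, dtype=float)[None, :]
    greater = np.mean(conf_in > conf_out)   # fraction of strictly ordered pairs
    if conservative:
        return greater
    return greater + 0.5 * np.mean(conf_in == conf_out)

# A constant classifier gets 50% standard AUC but 0% conservative AUC.
const_in, const_out = [0.1] * 5, [0.1] * 7
assert auc(const_in, const_out) == 0.5
assert auc(const_in, const_out, conservative=True) == 0.0
```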
Appendix D Experimental details
The layer compositions of the architectures used for all GOOD and baseline models are laid out in Table 2. No normalization of inputs or activations is used. Weight decay is applied, with different coefficients for MNIST and for SVHN/CIFAR10. For all runs, we use a batch size of 128 samples from both the in- and the out-distribution (where applicable). The exact implementation can be found at https://gitlab.com/Bitterwolf/GOOD.
L                  | XL
-------------------|-------------------
Conv2d(64)         | Conv2d(128)
Conv2d(64)         | Conv2d(128)
Conv2d(128)_{s=2}  | Conv2d(256)_{s=2}
Conv2d(128)        | Conv2d(256)
Conv2d(128)        | Conv2d(256)
Linear(512)        | Linear(512)
Linear(10)         | Linear(512)
                   | Linear(10)
Table 2: Model architectures used for the MNIST (L), SVHN (XL) and CIFAR10 (XL) experiments. Each convolutional and non-final affine layer is followed by a ReLU activation. All convolutions use a kernel size of 3, a padding of 1 and a stride of 1, except for the third convolution, which has a stride of 2.
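For illustration, the L column of the table could be written as the following PyTorch sketch, assuming 28x28 single-channel inputs as in MNIST; the flattened feature size follows from the single stride-2 convolution (28 -> 14).

```python
import torch.nn as nn

def make_L(in_channels=1, n_classes=10):
    """Sketch of the L architecture from Table 2: five 3x3 convolutions
    (third one with stride 2), then two affine layers; ReLU after every
    layer except the final one."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
        nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(128 * 14 * 14, 512), nn.ReLU(),  # 14 = ceil(28 / 2)
        nn.Linear(512, n_classes),
    )
```

The XL variant doubles the channel widths and inserts one extra Linear(512) layer, following the right column of the table.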
For the MNIST experiments, we use as optimizer SGD with Nesterov momentum of 0.9 and an initial learning rate that is divided by 5 after 50, 100, 200, 300 and 350 epochs, with a total number of 420 training epochs. For the GOOD, CEDA and OE runs, the first two epochs use only the in-distribution loss; over the next 100 epochs, the weight of the out-distribution loss is ramped up linearly from zero to its final value (which differs between GOOD/OE and CEDA), where it stays for the remaining 318 epochs. The $\epsilon$ in the loss for GOOD is also increased linearly, starting at epoch 10 and reaching its final value at epoch 130. CCU is trained using the publicly available code from [meinke2020towards], where we modify the architecture, learning rate schedule and data augmentation to be the same as for OE. The learning rate for the Gaussian mixture models is dropped at the same epochs as the neural network learning rate. Our more aggressive data augmentation implies that our underlying Mahalanobis metric is not the same as the one used in [meinke2020towards]. The ACET model for MNIST is warmed up with two epochs on the in-distribution only, followed by four epochs with intermediate attack settings, and is trained with the full ACET loss for the remaining epochs. The reason why we chose a smaller value for the MNIST GOOD runs is that, considering the large $\epsilon$ for which guarantees are enforced, training with higher values makes training unstable without improving any validation results.

For the SVHN and CIFAR10 baseline models, we used the Adam optimizer [KinEtAl2014] with an initial learning rate (different for SVHN and CIFAR10) that was divided by 5 after 30 and 100 epochs, with a total number of 420 training epochs.
For OE, the out-distribution loss weight is increased linearly from zero to one between epochs 60 and 360. The same holds for CCU, which again uses the same hyperparameters as OE.
Again, ACET is warmed up with two in-distribution-only epochs and four OE epochs. Then it is trained with the full ACET loss, with a shorter training time of 100 epochs (the same number as used in [HeiAndBit2019]).
In line with the experiences reported in [gowal2018effectiveness] and [zhang2020towards], longer training schedules with a slower ramp-up of the loss are necessary for GOOD training on SVHN and CIFAR10, as adding the out-distribution loss defined in Equation (8) to the training objective all at once overwhelms the in-distribution cross-entropy loss and causes the model to collapse to uniform predictions for all inputs, without recovery.
In order to reduce warmup time, we use a pretrained CEDA model for initialization and train for 900 epochs.
The learning rate is 10^{-4} in the beginning and is divided by 5 after epochs 450, 750 and 850.
Due to the pretraining, we begin training with a small out-distribution loss weight and already use a nonzero $\epsilon$ after epoch 4. The weight is then increased linearly to its final value, which is reached at epoch 204. Simultaneously, $\epsilon$ is increased linearly, with a virtual starting point at epoch 2, to its final value at epoch 298.
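All of these schedules follow the same linear ramp pattern; a hypothetical helper (names ours, not taken from the released code) could look like this:

```python
def linear_ramp(epoch, start, end, final_value, initial_value=0.0):
    """Linearly ramp a loss coefficient (or radius) between two epochs,
    clamping to the endpoint values outside the ramp window."""
    if epoch <= start:
        return initial_value
    if epoch >= end:
        return final_value
    t = (epoch - start) / (end - start)
    return initial_value + t * (final_value - initial_value)

# Example: a coefficient ramped from 0 to its final value between epochs 2 and 102.
assert linear_ramp(1, 2, 102, 1.0) == 0.0
assert linear_ramp(52, 2, 102, 1.0) == 0.5
assert linear_ramp(150, 2, 102, 1.0) == 1.0
```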
Due to the tendency of IBP-based training towards instabilities, the selection of hyperparameters was based on finding settings where training is reliably stable while guaranteed bounds over meaningful radii are still possible.
For the accuracy, AUC and GAUC evaluations in Table 1 the test splits of each (nonnoise) dataset were used, with the following numbers of samples: 10,000 for MNIST, FashionMNIST, CIFAR10, CIFAR100 and Uniform Noise; 20,800 for EMNIST Letters; 26,032 for SVHN; 300 for LSUN Classroom. Due to the computational cost of the employed attacks, the AAUC values are based on subsets of 1000 samples for each dataset.
All experiments were run on Nvidia Tesla P100 and V100 GPUs, with GPU memory requirement below 16GB.
Appendix E Depiction of GOOD Quantile Loss
In Quantile-GOOD training, the out-distribution part of each batch is split into "harder" and "easier" parts, since trying to enforce low-confidence guarantees on inputs that are very close to the in-distribution leads to low confidences in general, even on the in-distribution. In Table 3, we show example batches of GOOD_{60} models with MNIST, SVHN and CIFAR10 as in-distribution near the end of training (from epochs 410, 890 and 890, respectively). Even though many CIFAR-like images were filtered out, some are still present. For the CIFAR10 model, such samples (among others) get sorted above the quantile. For MNIST, lower-brightness images appear to be more difficult, while for SVHN, images with fewer objects seem to be the hardest to distinguish from the house numbers of the in-distribution.
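The split itself can be sketched in a few lines; this is a hypothetical helper illustrating the quantile idea, with names and the exact rounding being ours rather than the paper's implementation.

```python
import numpy as np

def quantile_split(ood_losses, q=0.6):
    """Split an out-distribution batch by per-sample loss: the easier
    q-fraction is kept for the objective, the hardest (1 - q)-fraction
    (samples too close to the in-distribution) is dropped."""
    losses = np.asarray(ood_losses)
    k = int(np.floor(q * len(losses)))   # number of samples kept
    order = np.argsort(losses)           # ascending: easiest first
    return order[:k], order[k:]          # (kept, dropped) index arrays

kept, dropped = quantile_split([0.9, 0.1, 0.5, 2.0, 0.3], q=0.6)
assert sorted(kept) == [1, 2, 4]   # the three smallest losses are kept
```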
Appendix F Confidences on EMNIST
Figure 3 shows samples of the letters “k” to “z” together with the predictions and confidences of the GOOD_{100} MNIST model and four baseline models, complementing Figure 2. Also on these samples we see that GOOD_{100} only produces high confidences for letters when they show digitspecific features (“l”, “q”, “s”). All other methods including ACET also produce high confidences for letters which are quite distinct from digits (“m”, “n”, “p”, “y”).
The mean confidence values of the same selection of MNIST models for each letter of the EMNIST alphabet are plotted in Figure 4. We observe that the mean confidence is mostly aligned with the intuitive likeness of a letter to some digit: GOOD_{100} has the highest mean confidence on the letter inputs "i" and "l", which in many cases do look like the digit "1". Curiously, the confidence of GOOD_{100} on the letter "o", which even humans often cannot distinguish from a digit "0", is generally low.
Appendix G Distributions of confidences and confidence upper bounds
Table 4 shows the mean confidences of all models on the indistribution as well as the mean confidences and the mean guaranteed upper bounds on the worstcase confidences on the evaluated outdistributions. As discussed, GOOD_{100} training can reduce the confidence on the indistribution, with a particularly strong effect for CIFAR10. By adjusting the loss quantile, this effect can be significantly reduced while maintaining nontrivial guarantees.
The histograms of mean confidences on the indistribution and mean guaranteed upper bounds on the worstcase confidences on the samples from the evaluated outdistribution test sets for seven models are shown in Tables 5 (MNIST), 6 (SVHN) and 7 (CIFAR10). A higher GOOD loss quantile generally shifts the distribution of the upper bounds on the worstcase confidence towards smaller values, but in some cases, especially for GOOD_{100} on CIFAR10, strongly lowers confidences in indistribution predictions as well.
Appendix H Evaluation on additional datasets
Extending the evaluation results presented in Table 1, we provide AUC, AAUC and GAUC values for additional outdistribution datasets in Table 8. These datasets are:

80M Tiny Images, the out-distribution that was used during training. While it comes from the same distribution as the training out-distribution, the test set consists of 30,000 samples that are not part of the training set.

Omniglot (Lake, Salakhutdinov, and Tenenbaum, 2015) is a dataset of hand-drawn characters. We use the evaluation split consisting of 13,180 characters from 20 different alphabets.

notMNIST is a dataset of the letters A to J taken from different publicly available fonts. The dataset was retrieved from https://yaroslavvb.blogspot.com/2011/09/notmnistdataset.html. We evaluate on the hand-cleaned subset of 18,724 images.

ImageNet− [HeiAndBit2019], a subset of ImageNet [imagenet_cvpr09] without images labelled with classes equal or similar to the CIFAR10 classes.

Smooth Noise is generated as described by [HeiAndBit2019]. First, a uniform noise image is generated. Then, a Gaussian filter with standard deviation $\sigma$ drawn uniformly at random between 1.0 and 2.5 is applied. Finally, the image is rescaled such that the minimal pixel value is 0.0 and the maximal one is 1.0. We evaluate AUC and GAUC on 30,000 samples.
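The smooth noise procedure can be reproduced with a few lines of numpy/scipy; this is a sketch of the description above, with any filter settings beyond the random sigma assumed to be the library defaults.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smooth_noise(shape, rng):
    """Low-frequency noise: uniform noise, Gaussian-blurred with a random
    sigma in [1.0, 2.5], then rescaled to span exactly [0, 1]."""
    img = rng.uniform(size=shape)
    sigma = rng.uniform(1.0, 2.5)
    img = gaussian_filter(img, sigma=sigma)
    return (img - img.min()) / (img.max() - img.min())

rng = np.random.default_rng(0)
sample = smooth_noise((32, 32), rng)
assert sample.min() == 0.0 and sample.max() == 1.0
```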
For MNIST, GOOD_{100} has an excellent GAUC for the training out-distribution 80M Tiny Images as well as for notMNIST. For Omniglot, GOOD_{100} is again better than OE/CEDA in terms of clean AUCs (similar to EMNIST), but here ACET is slightly better. However, it is again very difficult to provide any guarantees for this dataset, even though nontrivial adversarial AUCs are possible.
For SVHN, the detection of smooth noise turns out to be the most difficult of the evaluated tasks. There, the clean AUCs of all non-plain methods are lower than the perfect scores we see on other out-distributions, but still very high, and only GOOD_{100} can give some guarantees. An explanation might be that the image features of SVHN house numbers and of this kind of synthetic noise are similarly smooth. For 80M Tiny Images and ImageNet, on the other hand, the high-quantile SVHN GOOD models, particularly GOOD_{100}, are able to provide almost perfect guaranteed AUCs.
For CIFAR10, on all three outdistributions we again observe the tradeoff between clean and guaranteed AUC that comes with the choice of the loss quantile. Overall, the GOOD_{80} model again retains reasonable AUC values for the clean data while also providing useful guaranteed AUCs.
Appendix I Generalization of provable confidence bounds to a larger radius
In Table 9, we evaluate how the empirical worst-case confidences and the guaranteed upper bounds generalize to a larger $\ell_\infty$-ball around OOD samples than the one the models were trained for.
As expected, the adversarial AUCs (AAUC) degrade for the larger radius. However, we suspect that the seemingly stronger robustness of CEDA compared to OE could be partially due to the lack of gradients at the initialization points. As mentioned above, attacking all OOD models is generally difficult and requires adaptive and transfer attacks to be successful. That said, the relative differences of the AAUC should still be meaningful. This underlines even more that provable guarantees are particularly needed for worst-case OOD detection.
On MNIST, GOOD_{100} not only still has a perfect GAUC for uniform noise at an $\epsilon$ of 0.4, but it also retains substantial guarantees on FashionMNIST and CIFAR10. Moreover, GOOD_{100} has, with the exception of FashionMNIST, a better AAUC than ACET.
For SVHN, the excellent guarantees of GOOD_{100} at the training radius do not generalize well to the significantly larger evaluation radius (note that it is more than three times as large as the radius used at training time). This is in particular the case for uniform noise, where there are basically no guarantees anymore. Nevertheless, the AAUC is still very high and better than that of ACET.
In contrast, for CIFAR10 the generalization of the bounds of GOOD_{80} to the larger radius is surprisingly good: for all out-distributions, we only see an at most moderate drop of the GAUC value compared to Table 1. The same holds for the AAUC, which is now significantly better than that of ACET, whereas at the training radius ACET had the better AAUC.
In summary, GOOD in most cases still achieves reasonable guarantees for the larger threat model at test time. Interestingly, the AAUC of the GOOD models is, with the exception of FashionMNIST, always better than that of ACET; thus our guaranteed IBP training shows, in this regard, better generalization to larger evaluation radii than adversarial training on the out-distribution.