Deep neural networks are the state of the art in many application areas. Nevertheless, it is still a major concern to use deep learning in safety-critical systems, e.g. medical diagnosis or self-driving cars, since deep learning classifiers have been shown to suffer from a number of unexpected failure modes, such as low robustness to natural perturbations (geirhos2018generalisation; hendrycks2019benchmarking), overconfident predictions (NguYosClu2015; GuoEtAl2017; HenGim2017; HeiAndBit2019) as well as adversarial vulnerability (SzeEtAl2014). For safety-critical applications, empirical checks are not sufficient to trust a deep learning system with a high-stakes decision. Thus provable guarantees on the behavior of a deep learning system are needed.
One property that one expects from a robust classifier is that it should not
make highly confident predictions on data that is very different from the training data. However, ReLU networks have been shown to be provably overconfident far away from the training data (HeiAndBit2019). This is a serious problem, as (guaranteed) low confidence of a classifier operating out of its training domain can be used to trigger human intervention, or to let the system try to reach a safe state when it “detects” that it is applied outside of its specification. Several approaches to the out-of-distribution (OOD) detection task have been studied (HenGim2017; liang2017enhancing; LeeEtAl2018; lee2018simple; HeiAndBit2019). The current state-of-the-art performance of OOD detection in image classification is achieved by enforcing low confidence on a large training set of natural images that is considered as out-distribution (HenMazDie2019; meinke2020towards).
Deep neural networks are also notoriously susceptible to small adversarial perturbations in the input (SzeEtAl2014; CarWag2016) which change the decision of a classifier. Research so far has concentrated on adversarial robustness around the in-distribution. Several empirical defenses have been proposed but many could be broken again (CroHei2020; CarWag2017; AthCarWag2018). Adversarial training and variations (MadEtAl2018; ZhaEtAl2019) perform well empirically, but typically no robustness guarantees can be given. Certified adversarial robustness has been achieved by explicit computation of robustness certificates (HeiAnd2017; WonKol2018; RagSteLia2018; MirGehVec2018; gowal2018effectiveness) and randomized smoothing (cohen2019certified).
Adversarial changes that generate high-confidence predictions on the out-distribution have received much less attention, although it was shown early on that they can be used to fool a classifier (NguYosClu2015; SchEtAl2018; sehwag2019better). Thus, even if a classifier consistently identifies samples as not belonging to the in-distribution, it might still assign very high confidence to only marginally perturbed samples from the out-distribution, see Figure 1. A first empirical defense using a type of adversarial training for OOD detection was proposed in (HeiAndBit2019). However, to our knowledge, the only robustness guarantees for OOD detection so far were given in (meinke2020towards), where a density estimator for the in- and out-distribution is integrated into the predictive uncertainty of the neural network, which guarantees that far away from the training data the confidence of the neural network becomes uniform over the classes. Moreover, they can provide worst-case guarantees on the confidence on some balls around uniform noise. However, they are not able to provide meaningful guarantees around points which are similar or even close to the in-distribution data and, as we will show, provide only weak guarantees against $\ell_\infty$-adversaries.
In this work we aim to provide worst-case OOD guarantees not only for noise but also for images from related but different image classification tasks. For this purpose we use techniques from interval bound propagation (IBP) (gowal2018effectiveness) to derive a provable upper bound on the maximal confidence of the classifier in an $\ell_\infty$-ball of radius $\epsilon$ around a given point. By minimizing this bound on the out-distribution using our training scheme GOOD (Guaranteed Out-Of-distribution Detection), we arrive at the first models which have guaranteed low confidence even on image classification tasks related to the original one; e.g., we get state-of-the-art results in separating letters from EMNIST from digits in MNIST, even though the digit classifier has never seen any images of letters at training time. In particular, the guarantees for the training out-distribution generalize to other out-distribution datasets. In contrast to classifiers with certified adversarial robustness on the in-distribution, GOOD has the desirable property of achieving provable guarantees for OOD detection with almost no loss in accuracy on the in-distribution task, even on datasets like CIFAR-10.
2 Out-of-distribution detection: setup and baselines
Let $f: \mathbb{R}^d \to \mathbb{R}^K$ be a feedforward neural network (DNN) with a last linear layer, where $d$ is the input dimension and $K$ the number of classes. The logits $f_k(x)$ for $k = 1, \ldots, K$ are transformed via the softmax function into a probability distribution $\hat{p}(k|x)$ over the classes with:
$$\hat{p}(k|x) = \frac{e^{f_k(x)}}{\sum_{l=1}^K e^{f_l(x)}}.$$
By $C(x) = \max_k \hat{p}(k|x)$ we define the confidence of the classifier in the prediction at $x$.
The general goal of OOD detection is to have low-confidence predictions for all inputs which clearly do not belong to the in-distribution task, especially for all inputs lying in a region which has zero probability under the in-distribution. One typical criterion to measure OOD detection performance is to use the confidence $C(x)$ as a feature and compute the AUC of in- versus out-distribution (how well the confidences of in- and out-distribution are separated). We discuss a proper conservative measurement of the AUC in case of indistinguishable confidence values, e.g. due to numerical precision, in Appendix C.
As baselines and motivation for our provable approach we use the OOD detection methods Outlier Exposure (OE) (HenMazDie2019) and Confidence Enhancing Data Augmentation (CEDA) (HeiAndBit2019), which use as training objective
$$\frac{1}{N_{\mathrm{in}}} \sum_{(x_i, y_i) \in D_{\mathrm{in}}} L_{\mathrm{CE}}\bigl(f(x_i), y_i\bigr) \;+\; \frac{\lambda}{N_{\mathrm{out}}} \sum_{z_j \in D_{\mathrm{out}}} L_{\mathrm{out}}\bigl(f(z_j)\bigr),$$
where $D_{\mathrm{in}}$ is the in-distribution training set, $D_{\mathrm{out}}$ the out-distribution training set, and $L_{\mathrm{CE}}$ the cross-entropy loss. The hyper-parameter $\lambda$ determines the relative magnitude of the two loss terms and is most of the time chosen to be one. OE and CEDA differ in the choice of the loss $L_{\mathrm{out}}$ for the out-distribution: OE uses the cross-entropy loss between $\hat{p}(\cdot|z)$ and the uniform distribution, and CEDA uses the maximal log-confidence $\max_k \log \hat{p}(k|z)$. Note that both the CEDA and the OE loss attain their global minimum when $\hat{p}(\cdot|z)$ is the uniform distribution. Their difference is typically minor in practice. An important question is the choice of the out-distribution. For general image classification, it makes sense to use an out-distribution which encompasses essentially any possible image one could ever see at test time, and thus the set of all natural images is a good out-distribution; following (HenMazDie2019) we use the 80 Million Tiny Images dataset (torralba200880) as a proxy for it.
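To make the difference between the two out-distribution losses concrete, here is a minimal NumPy sketch; the exact form of the CEDA loss, the maximal log-confidence, is our reading of (HeiAndBit2019), and the function names are ours:

```python
import numpy as np

def log_softmax(logits):
    # numerically stable log-softmax over the class axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def oe_loss(logits_out):
    # OE: cross-entropy between the uniform distribution and p(.|z),
    # i.e. -1/K * sum_k log p(k|z), averaged over the batch
    return float(-log_softmax(logits_out).mean(axis=-1).mean())

def ceda_loss(logits_out):
    # CEDA (assumed form): maximal log-confidence max_k log p(k|z),
    # averaged over the batch
    return float(log_softmax(logits_out).max(axis=-1).mean())
```

Both losses are minimized exactly when the predicted distribution is uniform, where the OE loss equals $\log K$ and the CEDA loss $-\log K$.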
While OE and CEDA yield state-of-the-art OOD detection performance for image classification tasks when used together with the 80M Tiny Images dataset as out-distribution, they are, similarly to normal classifiers, vulnerable to adversarial manipulation of the out-distribution images, where the attack tries to maximize the confidence (meinke2020towards). Thus (HeiAndBit2019) proposed Adversarial Confidence Enhanced Training (ACET), which replaces the CEDA loss with its worst case over the threat model, $\max_{\|u - z\|_\infty \le \epsilon} \max_k \log \hat{p}(k|u)$, and can be seen as adversarial training on the out-distribution for an $\ell_\infty$-threat model. However, similar to adversarial training on the in-distribution (MadEtAl2018), this does not yield any guarantees for out-distribution detection. In the next section we discuss how to use interval bound propagation (IBP) to get guaranteed OOD detection performance in an $\ell_\infty$-neighborhood of every out-distribution input.
3 Provable guarantees for out-of-distribution detection
Our goal is to minimize the confidence of the classifier not only on the out-distribution images themselves but in a whole neighborhood around them. For this purpose, we first derive bounds on the maximal confidence in an $\ell_\infty$-ball of radius $\epsilon$ around a given point. In certified adversarial robustness, IBP (gowal2018effectiveness) currently leads to the best guarantees for deterministic classifiers under the $\ell_\infty$-threat model. While other methods for deriving guarantees yield tighter bounds (WonKol2018; MirGehVec2018), they are not easily scalable, and when trained against, the bounds given by IBP have been shown to be very tight (gowal2018effectiveness).
Interval bound propagation (gowal2018effectiveness) provides entrywise lower and upper bounds $\underline{z}^{(i)}$ and $\overline{z}^{(i)}$, respectively, for the output of the $i$-th layer of a neural network, given that the input is varied in the $\ell_\infty$-ball of radius $\epsilon$. With $\underline{z}^{(0)} = x - \epsilon \mathbf{1}$ and $\overline{z}^{(0)} = x + \epsilon \mathbf{1}$ ($\mathbf{1}$ is the vector of all ones) and $W^{(i)}, b^{(i)}$ being the weights of the $i$-th layer (fully connected, convolutional, residual etc.), one gets upper and lower bounds of the next layers via forward propagation:
$$\overline{z}^{(i)} = \max\bigl(W^{(i)}, 0\bigr)\, \overline{z}^{(i-1)} + \min\bigl(W^{(i)}, 0\bigr)\, \underline{z}^{(i-1)} + b^{(i)},$$
$$\underline{z}^{(i)} = \max\bigl(W^{(i)}, 0\bigr)\, \underline{z}^{(i-1)} + \min\bigl(W^{(i)}, 0\bigr)\, \overline{z}^{(i-1)} + b^{(i)},$$
where the $\max$/$\min$ expressions are taken componentwise. The monotone activation function (e.g. ReLU) is directly applied to the bounds. The forward propagation of the bounds is of similar nature as a standard forward pass, and back-propagation w.r.t. the weights is relatively straightforward.
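For fully connected layers, the propagation rules above can be sketched as follows; this is a simplified illustration with dense weight matrices, not the paper's implementation (convolutions follow the same pattern since they are affine):

```python
import numpy as np

def ibp_affine(lo, hi, W, b):
    # propagate the interval [lo, hi] through an affine layer z -> W z + b:
    # positive weights pick up the same-side bound, negative weights the
    # opposite one (max/min are taken componentwise on W)
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    new_hi = W_pos @ hi + W_neg @ lo + b
    new_lo = W_pos @ lo + W_neg @ hi + b
    return new_lo, new_hi

def ibp_forward(x, eps, layers):
    # layers: list of (W, b) pairs; ReLU is applied between affine layers
    # and, being monotone, acts directly on the bounds
    lo, hi = x - eps, x + eps
    for i, (W, b) in enumerate(layers):
        lo, hi = ibp_affine(lo, hi, W, b)
        if i < len(layers) - 1:
            lo, hi = np.maximum(lo, 0.0), np.maximum(hi, 0.0)
    return lo, hi
```

The returned interval contains every output the network can produce on the $\ell_\infty$-ball, which is exactly what the confidence bound below consumes.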
Upper bound on the confidence in terms of the logits.
The log confidence of the model at $x$ can be written as
$$\log C(x) = \max_k \log \hat{p}(k|x) = \max_k \, -\log\Bigl(\sum_{l=1}^K e^{f_l(x) - f_k(x)}\Bigr). \qquad (4)$$
We assume that the last layer is affine: $f(z) = W^{(L)} z^{(L-1)} + b^{(L)}$, where $L$ is the number of layers of the network. We calculate upper bounds $\overline{d}_{kl}$ on all logit differences $f_k(u) - f_l(u)$ for $\|u - x\|_\infty \le \epsilon$ as:
$$\overline{d}_{kl} = \max\bigl(W_k^{(L)} - W_l^{(L)}, 0\bigr)\, \overline{z}^{(L-1)} + \min\bigl(W_k^{(L)} - W_l^{(L)}, 0\bigr)\, \underline{z}^{(L-1)} + b_k^{(L)} - b_l^{(L)},$$
where $W_k^{(L)}$ denotes the $k$-th row of $W^{(L)}$ and $b_k^{(L)}$ is the $k$-th component of $b^{(L)}$. Note that this upper bound on the logit difference can be negative and is zero for $k = l$. Using this upper bound on the logit differences in Equation (4), we obtain an upper bound on the log confidence:
$$\max_{\|u - x\|_\infty \le \epsilon} \log C(u) \;\le\; \max_k \, -\log\Bigl(\sum_{l=1}^K e^{-\overline{d}_{kl}}\Bigr). \qquad (6)$$
We use the bound in (6) to evaluate the guarantees on the confidences for given out-distribution datasets. However, minimizing it directly during training leads to numerical problems, especially at the beginning of training when the upper bounds $\overline{d}_{kl}$ are very large, which makes training numerically infeasible. Instead, we upper bound the log confidence again by bounding the sum inside the negative log from below with $K$ times its lowest term:
$$\max_k \, -\log\Bigl(\sum_{l=1}^K e^{-\overline{d}_{kl}}\Bigr) \;\le\; \max_k \, -\log\Bigl(K e^{-\max_l \overline{d}_{kl}}\Bigr) \;=\; \max_{k,l} \overline{d}_{kl} - \log K. \qquad (7)$$
While this bound can differ considerably from the tighter bound of Equation (6), it is often quite close, as one term in the sum dominates the others. Moreover, both bounds attain the same global minimum when all logits are equal over the $\ell_\infty$-ball. We omit the constant $\log K$ in the following as it does not matter for training.
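Combining the IBP bounds on the penultimate layer with the last affine layer, the guaranteed confidence bound of Equation (6) can be sketched as follows (variable names are ours):

```python
import numpy as np

def confidence_upper_bound(lo, hi, W, b):
    # lo, hi: entrywise bounds on the penultimate activations z^(L-1);
    # W, b: last affine layer.  d_bar[k, l] upper-bounds f_k(u) - f_l(u)
    # over the input ball, so p(k|u) = 1 / sum_l exp(f_l - f_k)
    # <= 1 / sum_l exp(-d_bar[k, l]).
    K = W.shape[0]
    d_bar = np.zeros((K, K))
    for k in range(K):
        for l in range(K):
            v = W[k] - W[l]                      # row difference
            d_bar[k, l] = (np.maximum(v, 0) @ hi
                           + np.minimum(v, 0) @ lo
                           + b[k] - b[l])
    # bound (6): max over k of -log sum_l exp(-d_bar[k, l])
    log_conf_bound = max(-np.log(np.exp(-d_bar[k]).sum()) for k in range(K))
    return float(np.exp(log_conf_bound)), d_bar
```

For a degenerate interval (`lo == hi`) the bound reduces to the exact confidence at that point, and widening the interval can only increase it.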
The direct minimization of the upper bound in (7) is still difficult, in particular for more challenging in-distribution datasets like SVHN and CIFAR-10, as the bound can be several orders of magnitude larger than the in-distribution loss. Therefore, we use the logarithm of this quantity. However, we also want a more fine-grained optimization when the upper bound becomes small in the later stages of training. Thus we define the Confidence Upper Bound loss for an OOD input $z$ as
$$L_{\mathrm{CUB}}(z) = \log\Bigl(1 + \max_{k,l} \overline{d}_{kl}(z)\Bigr). \qquad (8)$$
Note that $\log(1 + x) \approx x$ for small $x$, and thus we achieve the more fine-grained optimization with an $\ell_1$-type of loss in the later stages of training. The overall objective of fully applied Guaranteed OOD Detection training (GOOD100) is the minimization of
$$\frac{1}{N_{\mathrm{in}}} \sum_{(x_i, y_i) \in D_{\mathrm{in}}} L_{\mathrm{CE}}\bigl(f(x_i), y_i\bigr) \;+\; \frac{\kappa}{N_{\mathrm{out}}} \sum_{z_j \in D_{\mathrm{out}}} L_{\mathrm{CUB}}(z_j), \qquad (9)$$
where $D_{\mathrm{in}}$ is the in-distribution training set and $D_{\mathrm{out}}$ the out-distribution. The hyper-parameter $\kappa$ determines the relative magnitude of the two loss terms. During training we slowly increase this value as well as the radius $\epsilon$ in order to further stabilize the training with GOOD.
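Assuming the CUB loss takes the damped form $\log(1 + \max_{k,l} \overline{d}_{kl})$ described in the text (our reconstruction of (8)), it is a one-liner on top of the bound matrix; note that $\overline{d}_{kk} = 0$, so the argument of the logarithm is always at least one:

```python
import numpy as np

def cub_loss(d_bar):
    # d_bar[k, l]: guaranteed upper bound on f_k - f_l over the eps-ball.
    # The diagonal is zero, so d_bar.max() >= 0 and the loss is >= 0.
    # log(1 + x) damps huge bounds early in training, while for small x
    # it behaves like x itself, giving the fine-grained l1-type loss later.
    return float(np.log1p(d_bar.max()))
```

The loss vanishes exactly when all logit-difference bounds are non-positive, i.e. when the logits are guaranteed to be equal over the ball.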
Quantile-GOOD: trade-off between clean and guaranteed AUC.
Training models by minimizing (9) means that the classifier gets severely punished if any training OOD input receives a high confidence upper bound. If OOD inputs exist to which the classifier already assigns high confidence without even considering the worst case, e.g. as these inputs share features with the in-distribution, it makes little sense to enforce low confidence guarantees. Later in the experiments we show that for difficult tasks like CIFAR-10 this can happen. In such cases the normal AUC for OOD detection gets worse as the high loss of the out-distribution part effectively leads to low confidence on a significant part of the in-distribution which is clearly undesirable.
Hence, for OOD inputs which are not clearly distinguishable from the in-distribution, it is preferable to just use the “normal” out-distribution loss $L_{\mathrm{out}}$ without considering the worst case. We realize this by enforcing the loss with the guaranteed upper bounds on the confidence only on some quantile of the easier OOD inputs, namely the ones with the lowest guaranteed out-distribution loss. We first order the OOD training set by the loss of each sample in ascending order, i.e. we choose a permutation $\pi$ such that $L_{\mathrm{CUB}}(z_{\pi(1)}) \le \ldots \le L_{\mathrm{CUB}}(z_{\pi(N_{\mathrm{out}})})$. We then apply the loss $L_{\mathrm{CUB}}$ to the lower quantile $q$ of the points (the ones with the smallest loss) and take $L_{\mathrm{out}}$ for the remaining samples, which means no worst-case guarantees on the confidence are enforced:
$$\frac{1}{N_{\mathrm{in}}} \sum_{(x_i, y_i) \in D_{\mathrm{in}}} L_{\mathrm{CE}}\bigl(f(x_i), y_i\bigr) + \frac{\kappa}{N_{\mathrm{out}}} \Bigl( \sum_{j \le q N_{\mathrm{out}}} L_{\mathrm{CUB}}\bigl(z_{\pi(j)}\bigr) + \sum_{j > q N_{\mathrm{out}}} L_{\mathrm{out}}\bigl(z_{\pi(j)}\bigr) \Bigr). \qquad (10)$$
During training we do this ordering on the part of each batch consisting of out-distribution images. On CIFAR-10, where the out-distribution dataset 80M Tiny Images is closer to the in-distribution, the quantile GOOD-loss allows us to choose the trade-off between clean and guaranteed AUC for OOD detection, similar to the trade-off between clean and robust accuracy in adversarial robustness.
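Per batch, the quantile split of Equation (10) can be sketched as follows (the averaging convention over the batch is our assumption):

```python
import numpy as np

def quantile_good_loss(cub_losses, plain_losses, q):
    # cub_losses: guaranteed worst-case losses L_CUB per OOD sample;
    # plain_losses: the corresponding clean losses L_out; q in [0, 1].
    # The fraction q of samples with the SMALLEST guaranteed loss gets
    # the certified loss, the hard remainder only the clean loss.
    n = len(cub_losses)
    k = int(np.floor(q * n))
    order = np.argsort(cub_losses)       # ascending guaranteed loss
    easy, hard = order[:k], order[k:]
    return float((cub_losses[easy].sum() + plain_losses[hard].sum()) / n)
```

With q = 1 this recovers the full GOOD100 out-distribution loss, with q = 0 a CEDA/OE-style loss without worst-case guarantees.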
4 Experiments

We provide experimental results for image recognition tasks with MNIST (mnist), SVHN (SVHN) and CIFAR-10 (krizhevsky2009learning) as in-distribution datasets. We first discuss the training details, hyperparameters and evaluation before we present the results of GOOD and competing methods. Code is available at https://gitlab.com/Bitterwolf/GOOD.
4.1 Model architectures, training procedure and evaluation
Model architectures and data augmentation.
For all experiments, we use deep convolutional neural networks consisting of convolutional, affine and ReLU layers. For MNIST, we use the large architecture from (gowal2018effectiveness), and for SVHN and CIFAR-10 a similar but deeper and wider model. The layer structure is laid out in Table 2 in the appendix. Data augmentation is applied to both in- and out-distribution images during training. For MNIST we use random crops to size $28 \times 28$ with padding 4, and for SVHN and CIFAR-10 random crops with padding 4 as well as the quite aggressive augmentation AutoAugment (cubuk2019autoaugment). Additionally, we apply random horizontal flips for CIFAR-10.
GOOD training procedure. As is the case with IBP training (gowal2018effectiveness) for certified adversarial robustness, we have observed that the inclusion of IBP bounds can make the training unstable or cause it to fail completely. This can happen for our GOOD training despite the logarithmic damping in the loss in (8). Thus, in order to further stabilize the training, similar to (gowal2018effectiveness) we use linear ramp-up schedules for the radius $\epsilon$ and the weight $\kappa$, which are detailed in Appendix D. As radii for the $\ell_\infty$-perturbation model on the out-distribution we use $\epsilon = 0.3$ for MNIST and $\epsilon = 0.01$ for SVHN and CIFAR-10 (note that $0.01 = 2.55/255$). The chosen $\epsilon$ for SVHN/CIFAR-10 is so small that the changes are hardly visible (see Figure 1). The parameter $\kappa$ for the trade-off between cross-entropy loss and the GOOD regularizer in (9) and (10) is set separately for MNIST and for SVHN and CIFAR-10.
In order to explore the potential trade-off between the separation of in- and out-distribution for clean and perturbed out-distribution inputs (clean AUCs vs. guaranteed AUCs, see below), we train GOOD models for different quantiles $q$ in (10), which we denote as GOOD$Q$ in the following. Here, $Q$ is the percentage of out-distribution training samples for which we minimize the guaranteed upper bounds on the confidence of the neural network in the $\ell_\infty$-ball of radius $\epsilon$ around the out-distribution point during training. Note that GOOD100 corresponds to (9), where we minimize the guaranteed upper bound on the worst-case confidence for all out-distribution samples, whereas GOOD0 can be seen as a variant of OE or CEDA. A training batch consists of 128 in- and 128 out-distribution samples. Examples of OOD training batches with the employed augmentation and their quantile splits for a GOOD60 model are shown in Table 3 in the appendix.
For the training out-distribution, we use 80 Million Tiny Images (80M) (torralba200880), which is a large collection of natural images associated with nouns in wordnet (fellbaum2012wordnet). All methods get the same out-distribution for training, and we neither train nor adapt hyperparameters for each OOD dataset separately, as was done in some previous work. Since CIFAR-10 and CIFAR-100 are subsets of 80M, we follow (HenMazDie2019) and filter them out. Even after the filtering process we have observed that the remaining dataset still contains images from the CIFAR-10 and CIFAR-100 classes. Thus we have further excluded all samples for which a CIFAR-10 CEDA model has confidence above 11%, altogether removing 4.25M images. As can be seen in the example batches in Table 3, even this reduced dataset still contains images from CIFAR-10 classes, which explains why our quantile-based loss is essential to get good performance on CIFAR-10. We take a subset of 50 million images as OOD training set. Since the size of the training set of the in-distribution datasets (MNIST: 60,000; SVHN: 73,257; CIFAR-10: 50,000) is small compared to 50 million, typically an OOD image appears only once during training.
For each method, we compute the test accuracy on the in-distribution task, and for various out-distribution datasets (not seen during training) we report the area under the receiver operating characteristic curve (AUC) as a measure for the separation of in- from out-distribution samples, based on the predicted confidences on the test sets. As OOD evaluation sets we use FashionMNIST (XiaoEtAl2017), the Letters of EMNIST (CohEtAl2017), grayscale CIFAR-10, and Uniform Noise for MNIST, and CIFAR-100 (krizhevsky2009learning), CIFAR-10/SVHN, LSUN Classroom (lsun), and Uniform Noise for SVHN/CIFAR-10. Further evaluation on other OOD datasets can be found in Appendix H.
We are particularly interested in the worst-case OOD detection performance of all methods under the $\ell_\infty$-perturbation model for the out-distribution. For this purpose, we compute the adversarial AUC (AAUC) and the guaranteed AUC (GAUC). These AUCs are based on the maximal confidence in the $\ell_\infty$-ball of radius $\epsilon$ around each out-distribution image. For the adversarial AUC, we compute a lower bound on the maximal confidence in the $\ell_\infty$-ball by using Auto-PGD (CroHei2020) to maximize the confidence of the classifier inside the intersection of the $\ell_\infty$-ball and the image domain $[0,1]^d$. Auto-PGD uses an automatic stepsize selection scheme and has been shown to outperform PGD. We use an adaptation to our setting (described in Appendix A) with 500 steps and 5 restarts on 1000 points from each test set. On MNIST, gradient masking poses a significant challenge, so we use an additional attack discussed in Appendix A and report the worst case. For the guaranteed AUC, we compute an upper bound on the confidence in the intersection of the $\ell_\infty$-ball with the image domain via IBP using (6) for the full test set. These worst-case/guaranteed confidences for the out-distributions are then used for the AUC computation.
Competitors. We compare against a normally trained model (Plain), the state-of-the-art OOD detection method Outlier Exposure (OE) (HenMazDie2019), CEDA (HeiAndBit2019) and Adversarial Confidence Enhanced Training (ACET) (HeiAndBit2019), which we adjusted to the given task as described in the appendix. As CEDA performs very similarly to OE, we omit it in the figures for better readability. The radii $\epsilon$ of the $\ell_\infty$-balls are the same for ACET and GOOD. So far, the only method which could provide robustness guarantees for OOD detection is Certified Certain Uncertainty (CCU) (meinke2020towards), with a data-dependent Mahalanobis-type threat model. We use their publicly available code to train a CCU model with our architecture and evaluate their guarantees for our threat model. In Appendix B, we provide details and explain why their guarantees turn out to be vacuous in our setting.
In Table 1 we present the results on all datasets.
GOOD is provably better than OE/CEDA with regard to worst-case OOD detection. We note that for almost all OOD datasets GOOD achieves non-trivial GAUCs. Thus the guarantees generalize from the training out-distribution 80M to the test OOD datasets. For the easier in-distributions MNIST and SVHN, which are more clearly separated from the out-distribution, the best results are achieved by GOOD100, whereas for CIFAR-10 the best guarantees are given by GOOD90 or GOOD95. However, when also taking clean AUCs into account, arguably the best trade-off is achieved by GOOD80. Note that the guaranteed AUC (GAUC) of these models is always better than the adversarial AUC (AAUC) of OE/CEDA (except for EMNIST). Thus it is fair to say that the worst-case OOD detection performance of GOOD is provably better than that of OE/CEDA. As expected, ACET yields good AAUCs but has no guarantees. The failure of CCU regarding guarantees is discussed in Appendix B. It is notable that GOOD100 has basically perfect guaranteed OOD detection performance for MNIST on CIFAR-10/uniform noise and for SVHN on all out-distribution datasets. In Appendix I we show that the guarantees of GOOD partially hold even at larger radii than used during training.
GOOD achieves certified OOD performance with almost no loss in accuracy. While there is a small drop in clean accuracy for MNIST, on SVHN GOOD100 has, with 96.6%, a better clean accuracy than all competing methods. On CIFAR-10, GOOD80 achieves an accuracy of 90.0%, which is better than ACET and only slightly worse than Plain and OE. This is remarkable, as we are not aware of any model with certified adversarial robustness on the in-distribution which gets even close to this range; e.g. IBP (gowal2018effectiveness) achieves an accuracy of 85.2% on SVHN (we have 96.6%) and 71.2% on CIFAR-10 (we have 90.0%). Previous certified methods had even worse clean accuracy. Since a significant loss in prediction performance is usually not acceptable, certified methods have not yet had much practical impact. Thus we think it is an encouraging and interesting observation that properties different from adversarial robustness, like worst-case out-of-distribution detection, can be certified without suffering much in accuracy. In particular, it is quite surprising that certified methods can be trained effectively with aggressive data augmentation like AutoAugment.
Trade-off between clean and guaranteed AUC via Quantile-GOOD. As discussed above, even after filtering, our training out-distribution contains in-distribution images from CIFAR-10 classes. This seems to be the reason why GOOD100 suffers from a significant drop in clean and guaranteed AUC: if in- and out-distribution can partially not be distinguished, the only way to ensure a small out-distribution loss is to also reduce the confidence on the in-distribution. This conflict is resolved via GOOD80 and GOOD90, which both have better clean and guaranteed AUCs. It is an interesting open question whether similar trade-offs are also useful for certified adversarial robustness.
EMNIST: distinguishing letters from digits without ever having seen letters. GOOD100 achieves an excellent AUC of 98.9% for the letters of EMNIST, which is, to our knowledge, state of the art. Indeed, an AUC of 100% should not be expected, as even for humans some letters like i and l are indistinguishable from digits. This result is quite remarkable, as GOOD100 has never seen letters during training. Moreover, as the AUC only measures the separation of in- and out-distribution based on the confidence, we provide the mean confidence on all datasets in Table 4 in the Appendix, and in Figure 2 (see also Figure 3 in the Appendix) we show samples from EMNIST together with their predictions/confidences for all models. GOOD100 has a high mean confidence on MNIST but a far lower one on EMNIST, in contrast to ACET, OE and Plain (exact values in Table 4). This shows that while the AUCs of ACET and OE are good for EMNIST, these methods are still highly overconfident on EMNIST. Only GOOD100 produces meaningful confidences on EMNIST, which are higher only when the letter has clear features of the corresponding digit.
We propose GOOD, a novel training method that achieves guaranteed OOD detection in a worst-case setting. GOOD provably outperforms OE, the state of the art in OOD detection, with respect to worst-case OOD detection, and achieves state-of-the-art performance on EMNIST, a particularly challenging out-distribution dataset. As the test accuracy of GOOD is comparable to that of normal training, this shows that certified methods have the potential to be useful in practice even for more complex tasks. In future work it will be interesting to explore how close certified methods can get to state-of-the-art test performance.
In order to use machine learning in safety-critical systems, it is required that the machine learning system correctly flags its uncertainty. As neural networks have been shown to be overconfident far away from the training data, this work aims at overcoming this issue by not only enforcing low confidence on out-distribution images but even guaranteeing low confidence in a neighborhood around them. As a neural network should not claim to know when it does not know, we see only positive implications of this work for our society.
The authors acknowledge support from the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039A) and from the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy (EXC number 2064/1, Project number 390727645). The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Alexander Meinke.
Appendix A Adversarial attacks on OOD detection
It has been demonstrated [biggio2013evasion, SzeEtAl2014, CarWag2016, Croce_2019_ICCV] that without strong countermeasures, DNNs are very susceptible to adversarial attacks that change the classification result. The goal of adversarial attacks in our setting is to fool OOD detection, which is based on the confidence in the prediction. Thus the attacker aims at maximizing the confidence in a neighborhood of a given out-distribution input $z$ so that the adversarially modified image will be wrongly assigned to the in-distribution. In this paper, we regard as threat model/neighborhood the $\ell_\infty$-ball of a given radius $\epsilon$, that is $\{u \in [0,1]^d : \|u - z\|_\infty \le \epsilon\}$; note that in our case the disturbed inputs have to be valid images, hence the additional constraint $u \in [0,1]^d$.
For evaluation, we use Auto-PGD [CroHei2020], a state-of-the-art implementation of PGD (projected gradient descent) with backtracking, adaptive step sizes and random restarts. Since Auto-PGD has been designed for finding adversarial samples around the in-distribution, we change its objective to the confidence of the classifier. We use Auto-PGD with 500 steps and 5 random restarts, which is a quite strong attack. By default, the random initialization is drawn uniformly from the $\ell_\infty$-ball. However, we found that for MNIST the attack very often got stuck for our GOOD models, because a large random perturbation of size 0.3 would move the sample directly into a region of the input space where the model is completely flat and thus no gradients are available (in this sense adversarial attacks on OOD inputs are more difficult than usual adversarial attacks on the in-distribution). We instead use a modified version of the attack for MNIST which starts within a short distance of the original point, i.e. we initialize with a small random perturbation (for our evaluation on SVHN and CIFAR10, this choice coincides with the default settings).
Nevertheless, for MNIST most out-distribution points lie in regions where the predictions of our GOOD models are flat, i.e. the gradients are exactly zero. Because of this, Auto-PGD is unable to effectively explore the search space around those points. Thus, for MNIST we created an adaptive attack which partially circumvents these issues. First, we use an initialization scheme that mitigates the lack of gradients by increasing the contrast as much as the threat model allows: bright pixel values get moved up by $\epsilon$ (clipped at 1) and dark pixel values get moved down by $\epsilon$ (clipped at 0). In our experience these points are more likely to yield gradients, so we use them as initialization for a 200-step PGD attack with backtracking, adaptive step size selection and momentum. Whenever a PGD step does not increase the confidence, we backtrack and halve the step size; after every successful gradient step we increase the step size again. Using backtracking and an adaptive step size is necessary because otherwise one can easily step into regions where gradient information is no longer available. Additionally, to further mitigate the problem of gradient masking at initialization, we use the adversarial images that Auto-PGD finds for models without significant gradient masking (Plain, OE, GOOD) as initialization for the same monotone PGD for models which show significant gradient masking (CEDA, ACET).
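The contrast-increasing initialization can be sketched as follows (the 0.5 decision threshold is our assumption; the text only states that pixels are pushed as far toward the extremes as the threat model allows):

```python
import numpy as np

def contrast_init(x, eps):
    # move every pixel by the full budget eps toward the nearer extreme:
    # bright pixels toward 1, dark pixels toward 0, clipped to the image
    # domain [0, 1], so the result stays inside the l_inf ball around x
    up = np.clip(x + eps, 0.0, 1.0)
    down = np.clip(x - eps, 0.0, 1.0)
    return np.where(x > 0.5, up, down)
```

By construction the initialization is a valid image within the threat model, and it avoids the flat regions where the attack would otherwise receive no gradient signal.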
Appendix B A review of robust OOD detection
A method that was proposed in order to achieve adversarially robust low confidence on OOD data is Adversarial Confidence Enhanced Training (ACET) [HeiAndBit2019], which is based on adversarial training on the out-distribution. However, similar to adversarial training on the in-distribution, this typically does not lead to any guarantees, whereas our goal is to get guarantees on the confidences of worst-case out-distribution inputs. ACET has the following objective:
$$\frac{1}{N_{\mathrm{in}}} \sum_{(x_i, y_i) \in D_{\mathrm{in}}} L_{\mathrm{CE}}\bigl(f(x_i), y_i\bigr) \;+\; \frac{\lambda}{N_{\mathrm{out}}} \sum_{z_j \in D_{\mathrm{out}}} \max_{\|u - z_j\|_\infty \le \epsilon} L_{\mathrm{out}}\bigl(f(u)\bigr).$$
They use uniform noise smoothed to contain low frequencies as their training out-distribution. We found firstly that training an ACET model with 80M as out-distribution yields much better results than the smoothed uniform noise used in [HeiAndBit2019], and secondly that using the cross-entropy loss with respect to the uniform prediction instead of the maximal log-confidence also leads to improvements. For training ACET models, we employ a standard PGD attack with 40 steps, initialized at the target input, for maximizing the loss around it. As usual for an $\ell_\infty$-attack, we use the sign of the gradient as direction and project onto the intersection of the image domain and the $\ell_\infty$-ball of radius $\epsilon$ around the target. Finally, the attack returns the image with the highest confidence found during the iterations. For the attack at training time we use no backtracking or adaptive step sizes. ACET does not provide any guaranteed confidence bounds.
Certified Certain Uncertainty (CCU) [meinke2020towards] gives low-confidence guarantees around certain OOD data that is far away from the training dataset in a specific metric. Those bounds do hold on such far-away datasets, but do not generalize to inputs relatively close to the in-distribution, like for example CIFAR-10 vs. CIFAR-100. Moreover, even in the regime where CCU yields meaningful guarantees, they are given in terms of a data-dependent Mahalanobis distance rather than the $\ell_\infty$-distance. However, due to norm equivalences, one can still extract $\ell_\infty$-guarantees from CCU, and we evaluated the CCU guarantees as follows. We use Corollary 3.1 from [meinke2020towards], which states that for a CCU model whose predictive distribution is written as
$$\hat{p}(k|x) = \frac{p(k|x)\, p_{\mathrm{in}}(x) + \frac{1}{K}\, p_{\mathrm{out}}(x)}{p_{\mathrm{in}}(x) + p_{\mathrm{out}}(x)},$$
with $p(\cdot|x)$ being the softmax output of a neural network and $p_{\mathrm{in}}, p_{\mathrm{out}}$ Gaussian mixture models for the in- and out-distribution, one can bound the confidence in a certain neighborhood around any point via
$$\max_k \hat{p}(k|u) \;\le\; g(r) \quad \text{for all } u \text{ with } \|u - x\|_\Sigma \le r. \qquad (13)$$
Here $g$ is a positive function that increases monotonically in the radius $r$ and depends on the parameters of the Gaussian mixture models (details in [meinke2020towards]). The metric that they use for their CCU model is given as
$$\|x - y\|_\Sigma = \sqrt{(x - y)^\top \Sigma^{-1} (x - y)},$$
where $\Sigma$ is a regularized version of the covariance matrix, calculated on the augmented in-distribution data. Note that this Mahalanobis metric is strongly equivalent to the metric induced by the $\ell_2$-norm and consequently to the metric induced by the $\ell_\infty$-norm. By computing the equivalence constants between these metrics we can extract the $\ell_\infty$-guarantees that are implicit in the CCU model. Geometrically speaking, we compute the size of an ellipsoid (its shape determined by the eigenvalues of $\Sigma^{-1}$) that is large enough to fit inside it a cube with the radius $\epsilon$ given by our threat model. Via norm equivalences one has
$$\|v\|_{\Sigma^{-1}} \;\le\; \sqrt{\lambda_{\max}}\,\|v\|_2 \;\le\; \sqrt{d\,\lambda_{\max}}\,\|v\|_\infty,$$
where $\lambda_{\max}$ is the largest eigenvalue of $\Sigma^{-1}$ and $d$ the input dimension. This means that the confidence upper bounds from (13) on a Mahalanobis ball of radius $\epsilon\sqrt{d\,\lambda_{\max}}$ automatically apply to an $\ell_\infty$-ball of radius $\epsilon$. However, the covariance matrix is highly ill-conditioned, which means that $\lambda_{\max}$ is fairly high. On top of that, in high dimensions $\sqrt{d}$ is big as well, so that in practice the required Mahalanobis radius becomes too large for CCU to certify meaningful guarantees. Even on uniform noise, the upper bounds were larger than the highest confidence on the in-distribution test set, with the consequence that there are no lower bounds on the AAUC. However, we want to stress that, at least for uniform noise, the lack of guarantees for CCU is due to the incompatibility of the threat models used in our paper and in [meinke2020towards].
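The extraction of an $\ell_\infty$-certificate radius can be sketched numerically as follows; here `Sigma` stands for the regularized covariance matrix, whose exact regularization is not reproduced:

```python
import numpy as np

def linf_to_mahalanobis_radius(Sigma, eps):
    """Smallest Mahalanobis radius whose ball contains the l_inf ball of radius eps.

    Uses ||v||_{Sigma^{-1}} <= sqrt(lam_max) * ||v||_2 <= sqrt(d * lam_max) * ||v||_inf,
    where lam_max is the largest eigenvalue of Sigma^{-1}, i.e. the reciprocal of
    the smallest eigenvalue of Sigma. A Mahalanobis confidence bound at this
    radius then applies to the whole l_inf ball of radius eps.
    """
    d = Sigma.shape[0]
    lam_max = 1.0 / np.linalg.eigvalsh(Sigma).min()  # largest eigenvalue of Sigma^{-1}
    return eps * np.sqrt(d * lam_max)
```

An ill-conditioned `Sigma` (small smallest eigenvalue) and a large input dimension both inflate the required Mahalanobis radius, which is exactly the effect discussed above.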
Another type of guarantee, which certifies a detection rate for OOD samples by applying probably approximately correct (PAC) learning considerations, has been proposed in [liu18e]. Their problem setting and the nature of their guarantees are not directly comparable to ours, since their guarantees concern behaviour on whole distributions, while ours are given for individual data points.
Appendix C AUC and Conservative AUC
As a measure for the separation of in- vs. out-distribution data we use the Area Under the Receiver Operating Characteristic curve (AUROC or AUC), using the confidence of the classifier as the feature. The AUC is equal to the empirical probability of a random in-sample being assigned a higher confidence than a random out-sample, plus one half times the probability of the confidences being equal. Thus, the standard way (as e.g. implemented in scikit-learn [scikit-learn]) to calculate the AUC from given confidence values $c(\cdot)$ on sets of in- and out-distribution samples $X_{\mathrm{in}}$ and $X_{\mathrm{out}}$ is
$$\mathrm{AUC} \;=\; \frac{\big|\{(x,z)\in X_{\mathrm{in}}\times X_{\mathrm{out}} : c(x) > c(z)\}\big| \;+\; \tfrac{1}{2}\,\big|\{(x,z)\in X_{\mathrm{in}}\times X_{\mathrm{out}} : c(x) = c(z)\}\big|}{|X_{\mathrm{in}}|\cdot|X_{\mathrm{out}}|},$$
where for a set $S$, $|S|$ indicates the number of its elements. The half-weighted equality term gives this definition certain symmetry properties. However, it assigns a positive score to some completely uninformed confidence functions $c$. For example, a constant uniform classifier with $c \equiv \frac{1}{K}$ receives an AUC value of 50%. Similarly, a classifier that assigns 100% confidence to most in-distribution inputs would have positive AUC and even GAUC statistics, even if it fails to have confidence below 100% on any OOD inputs. In order to count only example pairs where the distributions are positively distinguished, we define the conservative AUC (CAUC) by dropping the equality term:
$$\mathrm{CAUC} \;=\; \frac{\big|\{(x,z)\in X_{\mathrm{in}}\times X_{\mathrm{out}} : c(x) > c(z)\}\big|}{|X_{\mathrm{in}}|\cdot|X_{\mathrm{out}}|}.$$
While in general $\mathrm{CAUC} \le \mathrm{AUC}$, the confidences of all models presented in the paper are differentiated enough that for all shown numbers actually $\mathrm{CAUC} = \mathrm{AUC}$. However, we have experienced that one can have models whose confidences (uniform or one-hot predictions) cannot be distinguished due to limited numerical precision. In these cases the normal AUC definition would indicate a certain discrimination ability where it is actually impossible to discriminate the confidences.
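Both statistics can be computed directly from the two sets of confidence values; this is a minimal pure-Python sketch (the function name is ours):

```python
def auc_and_conservative_auc(c_in, c_out):
    """AUC with half-weighted ties, and the conservative AUC without them.

    c_in and c_out are iterables of confidence values on in- and
    out-distribution samples. The conservative variant counts only pairs
    where the in-sample confidence is strictly higher.
    """
    greater = sum(1 for a in c_in for b in c_out if a > b)
    equal = sum(1 for a in c_in for b in c_out if a == b)
    n = len(c_in) * len(c_out)
    return (greater + 0.5 * equal) / n, greater / n
```

As described above, a constant classifier gets AUC 0.5 under the standard definition, but a conservative AUC of 0.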
Appendix D Experimental details
The layer compositions of the architectures used for all GOOD and baseline models are laid out in Table 2. No normalization of inputs or activations is used. Weight decay is used, with one coefficient for MNIST and another for SVHN and CIFAR-10. For all runs, we use a batch size of 128 samples from both the in- and the out-distribution (where applicable). The exact implementation can be found at https://gitlab.com/Bitterwolf/GOOD.
Model architectures used for MNIST (L), SVHN (XL) and CIFAR-10 (XL) experiments. Each convolutional and non-final affine layer is followed by a ReLU activation. All convolutions use a kernel size of 3, a padding of 1, and stride of 1, except for the third convolution which has stride=2.
For the MNIST experiments, we use as optimizer SGD with Nesterov momentum 0.9 and an initial learning rate that is divided by 5 after 50, 100, 200, 300 and 350 epochs, with a total number of 420 training epochs. For the GOOD, CEDA and OE runs, the first two epochs only use in-distribution data; over the next 100 epochs, the weight of the out-distribution loss is ramped up linearly from zero to its final value (a higher one for GOOD/OE than for CEDA), where it stays for the remaining 318 epochs. The radius $\epsilon$ in the loss for GOOD is also increased linearly, starting at epoch 10 and reaching its final value at epoch 130. CCU is trained using the publicly available code from [meinke2020towards], where we modify the architecture, learning rate schedule and data augmentation to be the same as for OE. The initial learning rate for the Gaussian mixture models gets dropped at the same epochs as the neural network learning rate. Our more aggressive data augmentation implies that our underlying Mahalanobis metric is not the same as the one used in [meinke2020towards]. The ACET model for MNIST is warmed up with two epochs on the in-distribution only, then four epochs with intermediate parameters, and the full ACET loss for the remaining epochs. The reason why we chose a smaller loss weight for the MNIST GOOD runs is that, considering the large radius for which guarantees are enforced, training with higher values becomes unstable without improving any validation results.
For the SVHN and CIFAR-10 baseline models, we used the Adam optimizer [KinEtAl2014] with separate initial learning rates for SVHN and for CIFAR-10 that were divided by 5 after 30 and 100 epochs, with a total number of 420 training epochs.
For OE, the out-distribution loss weight is increased linearly from zero to one between epochs 60 and 360. The same holds for CCU, which again uses the same hyperparameters as OE.
Again, ACET is warmed up with two in-distribution-only epochs and four OE epochs. Then it is trained with the full ACET loss, with a shorter training time of 100 epochs (the same number as used in [HeiAndBit2019]).
In line with the experiences reported in [gowal2018effectiveness] and [zhang2020towards], for GOOD training on SVHN and CIFAR-10, longer training schedules with a slower ramp-up of the loss are necessary, as adding the out-distribution loss defined in Equation (8) to the training objective all at once will overwhelm the in-distribution cross-entropy loss and cause the model to collapse to uniform predictions for all inputs, without recovery. In order to reduce warm-up time, we use a pre-trained CEDA model for initialization and train for 900 epochs. The learning rate is $10^{-4}$ in the beginning and is divided by 5 after epochs 450, 750 and 850. Due to the pre-training, we begin training with a small $\epsilon$ and already start with a non-zero out-distribution loss weight after epoch 4. Then, the loss weight is increased linearly to its final value, which is reached at epoch 204. Simultaneously, $\epsilon$ is increased linearly, with a virtual starting point at epoch -2, to its final value at epoch 298.
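The linear ramp-up schedules used above can all be expressed by one helper function; the epoch boundaries and final value are arguments, and the function is only an illustration of the schedule shape, not the training code itself:

```python
def linear_ramp(epoch, start_epoch, end_epoch, final_value):
    """Linearly increase a coefficient from 0 to final_value between
    start_epoch and end_epoch; constant outside that window.

    A negative start_epoch implements a 'virtual' starting point before
    training begins, so the schedule is already non-zero at epoch 0.
    """
    if epoch <= start_epoch:
        return 0.0
    if epoch >= end_epoch:
        return final_value
    return final_value * (epoch - start_epoch) / (end_epoch - start_epoch)
```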
Due to the tendency of IBP based training towards instabilities, the selection of hyper-parameters was based on finding settings where training is reliably stable while guaranteed bounds over meaningful radii are possible.
For the accuracy, AUC and GAUC evaluations in Table 1 the test splits of each (non-noise) dataset were used, with the following numbers of samples: 10,000 for MNIST, FashionMNIST, CIFAR-10, CIFAR-100 and Uniform Noise; 20,800 for EMNIST Letters; 26,032 for SVHN; 300 for LSUN Classroom. Due to the computational cost of the employed attacks, the AAUC values are based on subsets of 1000 samples for each dataset.
All experiments were run on Nvidia Tesla P100 and V100 GPUs, with GPU memory requirement below 16GB.
Appendix E Depiction of GOOD Quantile-loss
In Quantile-GOOD training, the out-distribution part of each batch is split up into “harder” and “easier” parts, since trying to enforce low confidence guarantees on inputs that are very close to the in-distribution leads to low confidences in general, even on the in-distribution. In Table 3, we show example batches of GOOD60 models with MNIST, SVHN and CIFAR-10 as in-distribution near the end of training (from epochs 410, 890 and 890, respectively). Even though many CIFAR-like images were filtered out, some are still present. For the CIFAR-10 model, such samples (among others) get sorted above the quantile. For MNIST, lower brightness images appear to be more difficult, while for SVHN images with fewer objects seem to be comparably hardest to distinguish from the house numbers of the in-distribution.
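One plausible sketch of the split described above, assuming the out-distribution loss is averaged over the easiest fraction of the batch (the exact GOOD quantile loss is defined in the main paper; this helper and its name are ours):

```python
import numpy as np

def quantile_ood_loss(per_sample_losses, q):
    """Average the per-sample out-distribution losses over the easiest
    q-fraction of the batch, dropping the hardest samples, i.e. those
    that are too close to the in-distribution to enforce low-confidence
    guarantees on without hurting the in-distribution confidences.
    """
    losses = np.sort(np.asarray(per_sample_losses, dtype=float))
    k = max(1, int(round(q * len(losses))))   # number of "easier" samples kept
    return float(losses[:k].mean())
```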
Appendix F Confidences on EMNIST
Figure 3 shows samples of the letters “k” to “z” together with the predictions and confidences of the GOOD100 MNIST model and four baseline models, complementing Figure 2. Also on these samples we see that GOOD100 only produces high confidences for letters when they show digit-specific features (“l”, “q”, “s”). All other methods including ACET also produce high confidences for letters which are quite distinct from digits (“m”, “n”, “p”, “y”).
The mean confidence values of the same selection of MNIST models for each letter of the EMNIST alphabet are plotted in Figure 4. We observe that the mean confidence is mostly aligned with the intuitive likeness of a letter to some digit: GOOD100 has the highest mean confidence on the letter inputs “i” and “l”, which in many cases do look like the digit “1”. Curiously, the confidence of GOOD100 on the letter “o”, which even humans often cannot distinguish from a digit “0”, is generally low.
Appendix G Distributions of confidences and confidence upper bounds
Table 4 shows the mean confidences of all models on the in-distribution as well as the mean confidences and the mean guaranteed upper bounds on the worst-case confidences on the evaluated out-distributions. As discussed, GOOD100 training can reduce the confidence on the in-distribution, with a particularly strong effect for CIFAR-10. By adjusting the loss quantile, this effect can be significantly reduced while maintaining non-trivial guarantees.
The histograms of mean confidences on the in-distribution and mean guaranteed upper bounds on the worst-case confidences on the samples from the evaluated out-distribution test sets for seven models are shown in Tables 5 (MNIST), 6 (SVHN) and 7 (CIFAR-10). A higher GOOD loss quantile generally shifts the distribution of the upper bounds on the worst-case confidence towards smaller values, but in some cases, especially for GOOD100 on CIFAR-10, strongly lowers confidences in in-distribution predictions as well.
Appendix H Evaluation on additional datasets
80M Tiny Images, the out-distribution that was used during training. While it is the same distribution as seen during training, the test set consists of 30,000 samples that are not part of the training set.
Omniglot (Lake et al., 2015) is a dataset of hand-drawn characters. We use the evaluation split consisting of 13,180 characters from 20 different alphabets.
notMNIST is a dataset of the letters A to J taken from different publicly available fonts. The dataset was retrieved from https://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html. We evaluate on the hand-cleaned subset of 18,724 images.
ImageNet- [HeiAndBit2019] is a subset of ImageNet [imagenet_cvpr09] without images labelled as classes equal or similar to CIFAR-10 classes.
Smooth Noise is generated as described by [HeiAndBit2019]. First, a uniform noise image is generated. Then, a Gaussian filter with standard deviation $\sigma$ drawn uniformly at random between 1.0 and 2.5 is applied. Finally, the image is rescaled such that the minimal pixel value is 0.0 and the maximal one is 1.0. We evaluate AUC and GAUC on 30,000 samples.
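This generation procedure can be sketched with `scipy.ndimage.gaussian_filter`; the image shape and the random generator are placeholders:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smooth_noise(shape, rng):
    """Low-frequency noise image: uniform noise, smoothed with a Gaussian
    filter whose sigma is drawn uniformly from [1.0, 2.5], then rescaled
    so the minimal pixel value is 0.0 and the maximal one is 1.0."""
    img = rng.uniform(0.0, 1.0, size=shape)
    sigma = rng.uniform(1.0, 2.5)
    img = gaussian_filter(img, sigma=sigma)
    return (img - img.min()) / (img.max() - img.min())
```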
For MNIST, GOOD100 has an excellent GAUC for the training out-distribution 80M Tiny Images as well as for notMNIST. For Omniglot, GOOD100 is again better than OE/CEDA in terms of clean AUCs (similar to EMNIST), but here ACET is slightly better. However, it is again very difficult to provide any guarantees for this dataset, even though non-trivial adversarial AUCs are possible.
For SVHN, the detection of smooth noise turns out to be the most difficult of the evaluated tasks. There, the clean AUCs of all non-plain methods are lower than the perfect scores we see on other out-distributions, but still very high, and only GOOD100 can give some guarantees. An explanation might be that the image features of SVHN house numbers and of this kind of synthetic noise are similarly smooth. For 80M Tiny Images and ImageNet-, on the other hand, the high-quantile SVHN GOOD models, particularly GOOD100, are able to provide almost perfect guaranteed AUCs.
For CIFAR-10, on all three out-distributions we again observe the trade-off between clean and guaranteed AUC that comes with the choice of the loss quantile. Overall, the GOOD80 model again retains reasonable AUC values for the clean data while also providing useful guaranteed AUCs.
Appendix I Generalization of provable confidence bounds to a larger radius
In Table 9, we evaluate how the empirical worst-case confidences and the guaranteed upper bounds on the confidence generalize to a larger $\ell_\infty$-ball around OOD samples than the one the model was trained for.
As expected, the adversarial AUCs (AAUC) degrade for the larger radius. However, we suspect that the seemingly stronger robustness of CEDA compared to OE could be partially due to the lack of gradients at the initialization points. As mentioned above, attacking OOD models is in general difficult and requires adaptive and transfer attacks to be successful. That said, the relative differences in AAUC should still be meaningful. This underlines even more that for worst-case OOD detection, provable guarantees are particularly needed.
On MNIST, GOOD100 not only retains a perfect GAUC for uniform noise at an $\epsilon$ of 0.4, but even on FashionMNIST and CIFAR-10 it still has substantial guarantees. Moreover, with the exception of FashionMNIST, GOOD100 has a better AAUC than ACET.
For SVHN, the excellent guarantees of the GOOD100 models for the training radius do not generalize well to the significantly larger evaluation radius (note that it is more than three times as large as at training time). This is in particular the case for uniform noise, where there are basically no guarantees anymore. Nevertheless, the AAUC is still very high and better than that of ACET.
In contrast, for CIFAR-10 the bounds of GOOD80 generalize surprisingly well to the larger radius: for all out-distributions, we only see an at most moderate drop of the GAUC value compared to Table 1. The same is true for the AAUC, which is now significantly better than that of ACET, whereas for the training radius ACET had the better AAUC.
In summary, GOOD in most cases still achieves reasonable guarantees under the larger threat model at test time. Interestingly, the AAUC of the GOOD models is, with the exception of FashionMNIST, always better than that of ACET; thus our guaranteed IBP training shows in this regard better generalization to larger evaluation radii than adversarial training on the out-distribution.