Breaking Down Out-of-Distribution Detection
Recognizing out-of-distribution (OOD) inputs, i.e. inputs unrelated to the in-distribution task, is an important problem in trustworthy machine learning. Many out-of-distribution detection methods have been suggested in recent years. The goal of this paper is to recognize common objectives as well as to identify the implicit scoring functions of different OOD detection methods. We focus on the sub-class of methods that use surrogate OOD data during training in order to learn an OOD detection score that generalizes to new, unseen out-distributions at test time. We show that binary discrimination between the in-distribution and (different) out-distributions is equivalent to several distinct formulations of the OOD detection problem. When trained in a shared fashion with a standard classifier, this binary discriminator reaches an OOD detection performance similar to that of Outlier Exposure. Moreover, we show that the confidence loss used by Outlier Exposure has an implicit scoring function which differs in a non-trivial fashion from the theoretically optimal scoring function in the case where training and test out-distribution are the same; the latter, optimal scoring function is (up to equivalence) the one used when training an Energy-Based OOD detector or when adding a background class. In practice, when trained in exactly the same way, all these methods perform similarly.
While deep learning has significantly improved performance in many application domains, there are serious concerns about using deep neural networks in safety-critical applications. One major problem are adversarial samples (SzeEtAl2014; MadEtAl2018), which are small, imperceptible modifications of the image that change the decision of the classifier; another major problem are overconfident predictions (NguYosClu2015; HenGim2017; HeiAndBit2019) for images not belonging to the classes of the actual task. Here, one distinguishes between far out-of-distribution data, e.g. different forms of noise or completely unrelated tasks like CIFAR-10 vs. SVHN, and close out-of-distribution data, which can for example occur in related image classification tasks where the semantic structure is very similar, e.g. CIFAR-10 vs. CIFAR-100. Both have to be distinguished from the in-distribution, but it is conceivable that close out-of-distribution data is the more difficult problem, with potentially fatal consequences: in an automated diagnosis system we want the system to recognize that it “does not know” when a new, unseen disease comes in, rather than assigning high confidence to a known class, leading to fatal treatment decisions. Thus out-of-distribution awareness is a key property of trustworthy AI systems.
In this paper, we focus on the setting of OOD detection where, during training time, no information is available on the distribution of OOD inputs that might appear when the model is used for inference. It is often reasonable to assume access to a surrogate out-distribution during training. One can, however, not assume that this surrogate is related to the OOD inputs that will be encountered at test time. A large number of different approaches to OOD detection, based on combinations of density estimation, classifier confidence, logit space energy, feature space geometry, behaviour on auxiliary tasks, and other principles, have been proposed to tackle this problem. We give a detailed overview of existing OOD detection methods in Appendix
D. However, most OOD detection papers focus on establishing superior empirical detection performance and provide little theoretical background on differences, but also similarities, to existing methods. In this paper we want to take a different path, as we believe that a solid theoretical basis is needed to make further progress in this field. Our goal is to identify, at least for a particular subclass of techniques, whether the differences are indeed due to a different underlying theoretical principle or whether they are due to the efficiency of different estimation techniques for the same underlying detection criterion, called “scoring function”. In some cases, we will see that one can even disentangle the estimation procedure from the scoring function, so that one can simulate several different scoring functions from one model’s estimated quantities. Our main contributions are the following:
- We show that several OOD detection approaches which optimize an objective that includes predictions on surrogate OOD data are equivalent to the binary discriminator between in- and out-distribution when analyzing the rankings induced by the Bayes optimal classifier/density.
- We derive the implicit scoring functions for the confidence loss (LeeEtAl2018) used by Outlier Exposure (HenMazDie2019), for Energy-Based OOD Detection (liu2020energy), and for an extra background class for the out-distribution (thulasidasan2021effective). The confidence scoring function turns out not to be equivalent to the “optimal” scoring function of the binary discriminator when training and test out-distributions are the same.
- We show that the combination of a binary discriminator between in- and out-distribution with a standard classifier on the in-distribution, when trained in a shared fashion, yields OOD detection performance competitive with state-of-the-art methods based on surrogate OOD data.
- We show that density estimation is equivalent to discrimination between the in-distribution and uniform noise, which indicates why standard density estimates are not suitable for OOD detection, as has frequently been observed.
Even though we identify that a simple baseline is competitive with the state of the art, the main aim of this paper is to gain a better understanding of the key components of different OOD detection methods and to identify the key properties which lead to SOTA OOD detection performance. All of our findings are supported by extensive experiments on CIFAR-10 and CIFAR-100 with evaluation on various challenging out-of-distribution test datasets.
We first characterize the set of transformations of a scoring function which leave OOD detection criteria like AUC or FPR invariant. This is important for the analysis later on, since the scoring functions of different methods are in many cases not identical as functions but yield the same OOD detection performance by those criteria. Like most work in the literature, we consider OOD detection on a compact input domain $X$, with the most important example being image classification, where $X = [0,1]^d$. The most popular approach to OOD detection is the construction of an in-distribution-scoring function $s: X \to \mathbb{R}$ such that $s(x)$ tends to be smaller if $x$ is drawn from an out-distribution $p_{\text{out}}$, short $x \sim p_{\text{out}}$, than if it is drawn from the in-distribution $p_{\text{in}}$, short $x \sim p_{\text{in}}$. There is a variety of different performance metrics for this task, a very common one being the area under the receiver-operator characteristic curve (AUC). The AUC for a scoring function $s$ distinguishing between an in-distribution $p_{\text{in}}$ and an out-distribution $p_{\text{out}}$ is given by

$$\mathrm{AUC}_{p_{\text{in}}, p_{\text{out}}}(s) = \mathbb{E}_{x \sim p_{\text{in}},\, z \sim p_{\text{out}}}\Big[\mathbb{1}_{s(x) > s(z)} + \tfrac{1}{2}\,\mathbb{1}_{s(x) = s(z)}\Big].$$

We define an equivalence of scoring functions based on their AUCs and will show that this equivalence implies equality of other employed performance metrics as well.
Definition 1. Two scoring functions $s_1$ and $s_2$ are equivalent, and we write $s_1 \approx s_2$, if

$$\mathrm{AUC}_{p_{\text{in}}, p_{\text{out}}}(s_1) = \mathrm{AUC}_{p_{\text{in}}, p_{\text{out}}}(s_2)$$

for all potential distributions $p_{\text{in}}$ and $p_{\text{out}}$.
As the AUC does not depend on the actual values of $s$ but only on the ranking induced by $s$, one obtains the following characterization of the equivalence of two scoring functions.
Theorem 1. Two scoring functions $s_1, s_2$ are equivalent if and only if there exists a strictly monotonically increasing function $\varphi$ such that $s_2 = \varphi \circ s_1$.
The equivalence between scoring functions in Def. 1 is an equivalence relation.
Another metric is the false positive rate at a fixed true positive rate q, denoted as FPR@qTPR. A commonly used value for the TPR is 95%. The smaller the FPR@qTPR, the better the OOD discrimination performance.
Proposition 1. Two equivalent scoring functions have the same FPR@qTPR for any pair of in- and out-distributions and for any chosen TPR q.
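These invariances are easy to verify numerically. The following sketch (our own construction, not from the paper; all function names are ours) computes the AUC with tie correction and the FPR@95%TPR for hypothetical Gaussian score distributions and checks that a strictly increasing transform leaves both unchanged:

```python
import numpy as np

def auc(s_in, s_out):
    # AUC(s) = P(s(x) > s(z)) + 0.5 * P(s(x) = s(z)) for x ~ p_in, z ~ p_out
    diff = s_in[:, None] - s_out[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

def fpr_at_tpr(s_in, s_out, q=0.95):
    # fraction of out-distribution scores above the threshold that keeps
    # a fraction q of the in-distribution scores
    thresh = np.quantile(s_in, 1 - q, method="lower")
    return (s_out >= thresh).mean()

rng = np.random.default_rng(0)
s_in = rng.normal(1.0, 1.0, 2000)    # scores on in-distribution samples
s_out = rng.normal(-1.0, 1.0, 2000)  # scores on out-distribution samples

# any strictly monotonically increasing transform leaves both metrics invariant
phi = lambda s: np.exp(0.5 * s) + s ** 3
assert np.isclose(auc(s_in, s_out), auc(phi(s_in), phi(s_out)))
assert np.isclose(fpr_at_tpr(s_in, s_out), fpr_at_tpr(phi(s_in), phi(s_out)))
```

Using `method="lower"` in `np.quantile` makes the threshold an actual sample score, so it commutes exactly with monotone transforms.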
In the next section, we use the previous results to show that the Bayes optimal scoring functions of several proposed methods for out-of-distribution detection are equivalent to those of simple binary discriminators.
In the following we will show that the Bayes optimal solutions of several existing approaches to OOD detection for unlabeled data are equivalent to a binary discriminator between in- and (training) out-distribution, whereas different solutions arise for methods that involve labeled data. As the equivalences are based on the Bayes optimal solution, these are asymptotic statements, and thus it has to be noted that convergence to the Bayes optimal solution can be arbitrarily slow and that the methods can have different implicit inductive biases. This is why we additionally support our findings with extensive experiments.
We first provide a formal definition of OOD detection before we show the equivalence of density estimators resp. likelihood to a binary discriminator.
The OOD problem
In order to make rigorous statements about the OOD detection problem, we first have to provide the mathematical basis for doing so. We assume that we are given an in-distribution $p_{\text{in}}$ and potentially also a training out-distribution $p_{\text{out}}$. At this particular point no labeled data is involved, so both of them are just distributions over $X$. For simplicity we assume in the following that they both have a density wrt. the Lebesgue measure on $X$. We assume that in practice we get samples from the mixture distribution

$$p(x) = \lambda\, p_{\text{in}}(x) + (1 - \lambda)\, p_{\text{out}}(x),$$

where $\lambda \in (0,1)$ is the probability with which we expect to see in-distribution samples in total. In order to make the decision between in- and out-distribution for a given point $x \in X$, it is thereby optimal to estimate

$$p(\text{in}\,|\,x) = \frac{\lambda\, p_{\text{in}}(x)}{\lambda\, p_{\text{in}}(x) + (1 - \lambda)\, p_{\text{out}}(x)},$$

which is defined for all $x$ with $p(x) > 0$ (assuming $p_{\text{in}}$ and $p_{\text{out}}$ can be written as densities).
If the training out-distribution $p_{\text{out}}$ is also the test out-distribution, then this is already optimal, but we would like the approach to generalize to other, unseen test out-distributions, and thus an important choice is the training out-distribution $p_{\text{out}}$. Note that as $p(\text{in}\,|\,x)$ is only well-defined for all $x$ with $p(x) > 0$, it is reasonable to choose for $p_{\text{out}}$ a distribution whose support is all of $X$. In this case we ensure that the criterion with which we perform OOD detection is defined for any possible input $x \in X$. This is desirable, as OOD detection should work for any possible input.
Optimal prediction of a binary discriminator between in- and out-distribution
We consider a binary discriminator $\hat{p}(\text{in}\,|\,x; \theta)$ with model parameters $\theta$ between in- and (training) out-distribution, where $\hat{p}(\text{in}\,|\,x; \theta)$ is the predicted probability for the in-distribution. Under the assumption that $\lambda$ is the probability for in-distribution samples and using cross-entropy (which in this case is the logistic loss up to a constant global factor), the expected loss becomes:

$$L(\theta) = -\lambda\, \mathbb{E}_{x \sim p_{\text{in}}}\big[\log \hat{p}(\text{in}\,|\,x; \theta)\big] - (1 - \lambda)\, \mathbb{E}_{z \sim p_{\text{out}}}\big[\log\big(1 - \hat{p}(\text{in}\,|\,z; \theta)\big)\big].$$

One can derive that the Bayes optimal classifier minimizing the expected loss has the predictive distribution:

$$\hat{p}(\text{in}\,|\,x) = \frac{\lambda\, p_{\text{in}}(x)}{\lambda\, p_{\text{in}}(x) + (1 - \lambda)\, p_{\text{out}}(x)} = p(\text{in}\,|\,x).$$
Thus at least for the training out-distribution, a binary classifier based on samples from in- and (training) out-distribution would suffice to solve the OOD detection problem perfectly.
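The fact that the minimizer of the logistic loss recovers $p(\text{in}\,|\,x)$ can be illustrated on a toy problem. The sketch below (our own setup, not from the paper; all names are ours) fits a logistic model by gradient descent on samples from two 1-D Gaussians, for which the true posterior is exactly a sigmoid of a linear function:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.5                            # prior probability of in-distribution
n = 20000
x_in = rng.normal(+1.0, 1.0, n)      # samples from p_in = N(+1, 1)
x_out = rng.normal(-1.0, 1.0, n)     # samples from p_out = N(-1, 1)

x = np.concatenate([x_in, x_out])
y = np.concatenate([np.ones(n), np.zeros(n)])   # 1 = in, 0 = out

# logistic model sigmoid(w*x + b); for equal-variance Gaussians the true
# posterior p(in|x) has exactly this form, with w = 2 and b = 0 here
w, b = 0.0, 0.0
for _ in range(3000):                # plain gradient descent on the BCE loss
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))
    g = p - y                        # derivative of BCE w.r.t. the logit
    w -= 0.5 * np.mean(g * x)
    b -= 0.5 * np.mean(g)

def true_posterior(t):
    # lam * p_in / (lam * p_in + (1 - lam) * p_out); normalizations cancel
    pin = np.exp(-(t - 1.0) ** 2 / 2)
    pout = np.exp(-(t + 1.0) ** 2 / 2)
    return lam * pin / (lam * pin + (1 - lam) * pout)

grid = np.linspace(-3.0, 3.0, 7)
learned = 1.0 / (1.0 + np.exp(-(w * grid + b)))
assert np.max(np.abs(learned - true_posterior(grid))) < 0.03
```

For unequal class proportions, the prior ratio only shifts the bias term, so the same model class still contains the Bayes optimal discriminator.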
Equivalence of density estimation and binary discrimination for OOD detection
In this section we further analyze the relationship of common OOD detection approaches with the binary discriminator between in- and out-distribution. We start with density estimators sourced from generative models. A basic approach that is known to yield relatively weak OOD performance (NalEtAl2018; ren2019likelihood; XiaoLikelihoodRegret) is directly utilizing a model’s estimate of the density $p_{\text{in}}(x)$ at a sample input $x$.
An improved density based approach which uses perturbed in-distribution samples as a surrogate training out-distribution is the Likelihood Ratios method (ren2019likelihood), which proposes to fit a generative model for both the in- and out-distribution and to use the ratio between the likelihoods output by the two models as a discriminative feature.
We show that, with respect to the scoring function, the correct density is equivalent to the Bayes optimal prediction of a binary discriminator between the in-distribution and uniform noise. Furthermore, the density ratio is equivalent to the prediction of a binary discriminator between the two distributions on which the respective models used for density estimation have been trained. Because of this equivalence, we argue that a binary discriminator is a simple alternative to these methods, due to its easier training procedure. While this equivalence is an asymptotic statement, the experimental comparisons in the appendix show that the methods perform similarly poorly compared to the methods using labeled data.
We first prove the more general case of arbitrary likelihood ratios. In the following, we write $p(\text{in}\,|\,\cdot)$ for the Bayes optimal binary discriminator to save space and make the statements more concise.
Lemma 2. Assume $p_{\text{in}}$ and $p_{\text{out}}$ can be represented by densities and the support of $p_{\text{out}}$ covers the whole input domain $X$. Then $\frac{p_{\text{in}}}{p_{\text{out}}} \approx p(\text{in}\,|\,\cdot)$ for any $\lambda \in (0,1)$.
This means that the likelihood ratio score of two optimal density estimators is equivalent to the in-distribution probability predicted by a binary discriminator, and this is true for any possible ratio $\lambda$ of in- to out-distribution samples. In the experiments below, we show that using such a discriminator has similar performance to the likelihood ratios of the different trained generative models.
For the approaches that try to directly use the likelihood of a generative model as a discriminative feature, this means that their objective is equivalent to training a binary discriminator against uniform noise, whose density is $1/\mathrm{vol}(X)$ at any $x \in X$.
Corollary 3. Assume that $p_{\text{in}}$ can be represented by a density. Then $p_{\text{in}} \approx p(\text{in}\,|\,\cdot)$ with $p_{\text{out}} = \mathrm{Unif}(X)$, for any $\lambda \in (0,1)$.
This provides additional evidence why a purely density based approach for many applications proves to be insufficient as an OOD detection score on the complex image domain: it is not reasonable to assume that a binary discriminator between certain classes of natural images on the one hand and uniform noise on the other hand provides much useful information about images from other classes or even about other nonsensical inputs.
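A minimal numerical illustration of this equivalence (our own construction, with a 1-D stand-in for the image domain): scoring by the in-distribution density and scoring by the Bayes optimal discriminator against uniform noise induce the same ranking, because the posterior is a strictly increasing function of the density:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-4.0, 4.0, 1000)                 # query points in the domain [-4, 4]
p_in = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)  # density of p_in = N(0, 1)
c = 1.0 / 8.0                                    # density of uniform noise on [-4, 4]
lam = 0.3
# Bayes optimal discriminator between p_in and uniform noise
post = lam * p_in / (lam * p_in + (1 - lam) * c)

# p(in|x) is a strictly increasing function of p_in(x), so both scores induce
# the same ranking and hence the same AUC and FPR@TPR values
assert np.array_equal(np.argsort(p_in), np.argsort(post))
```

The equivalence holds for any mixture weight, since changing `lam` only changes the monotone map between the two scores.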
We first discuss how one can formulate the OOD problem when one has access to labeled data for the in-distribution, and we identify the target distribution of OOD detection using a background/reject class. Then we derive the Bayes optimal classifier of the confidence loss (LeeEtAl2018) as used by the most successful variant of Outlier Exposure (HenMazDie2019) and discuss the implicit scoring function. In most cases the scoring functions turn out to be non-equivalent to $p(\text{in}\,|\,x)$ (which is optimal if training and test out-distribution agree), as they integrate additional information from the classification task. Given a joint in-distribution $p_{\text{in}}(x, y)$ with $y \in \{1, \ldots, K\}$ (given that we have $K$ labels) for the labeled in-distribution, there are different ways how to come up with a joint distribution for in- and out-distribution. Interestingly, the different encodings used e.g. in training with a background class
(thulasidasan2021effective) vs. training a classifier with confidence loss (LeeEtAl2018), together with variants of the employed scoring function, lead to methods which unexpectedly can have quite different behavior.

Background class: In this case we just put all out-of-distribution samples into a $(K{+}1)$-st class which is typically called background/reject class (thulasidasan2021effective). The joint distribution then becomes

$$p(x, y = k) = \lambda\, p_{\text{in}}(x, y = k) \ \text{ for } k \leq K, \qquad p(x, y = K{+}1) = (1 - \lambda)\, p_{\text{out}}(x).$$

We denote by $p_{\text{in}}(x)$ the marginal in-distribution and note that the marginal distribution of the joint distribution of in- and out-distribution is again $p(x) = \lambda\, p_{\text{in}}(x) + (1 - \lambda)\, p_{\text{out}}(x)$. Thus we get the conditional distribution

$$p(y = k\,|\,x) = p(\text{in}\,|\,x)\, p_{\text{in}}(y = k\,|\,x) \ \text{ for } k \leq K, \qquad p(y = K{+}1\,|\,x) = 1 - p(\text{in}\,|\,x).$$
The Bayes optimal solution of training with a background class using any calibrated loss function, e.g. the cross-entropy loss (LapEtAl2016), then yields a Bayes optimal classifier whose predictive distribution is $\hat{p}(y\,|\,x) = p(y\,|\,x)$. There are two potential scoring functions that come to mind. The first one, $1 - \hat{p}(y = K{+}1\,|\,x)$, used in chen2020informative-outlier-matters; thulasidasan2021effective, is motivated by the fact that $\hat{p}(y = K{+}1\,|\,x)$ is directly the predicted probability that the point is from the out-distribution, as indeed it holds that $1 - \hat{p}(y = K{+}1\,|\,x) = p(\text{in}\,|\,x)$, which is the optimal scoring function if training and test out-distribution are equal. On the other hand, the maximal predicted probability over the in-distribution classes, which is often employed as a scoring function (HenGim2017), becomes for the Bayes optimal classifier

$$\max_{k \leq K} \hat{p}(y = k\,|\,x) = p(\text{in}\,|\,x)\, \max_{k \leq K} p_{\text{in}}(y = k\,|\,x),$$

which is a product of $p(\text{in}\,|\,x)$ and the maximal conditional probability of some class of the in-distribution; note that this product is well defined, as $p(\text{in}\,|\,x)$ is defined if $p_{\text{out}}$ has support everywhere in $X$, and if $p_{\text{in}}(x) = 0$ then also $p(\text{in}\,|\,x) = 0$. Thus this scoring function integrates, in addition to $p(\text{in}\,|\,x)$, also class-specific information and is therefore less dependent on the chosen training out-distribution. In fact, one can see that it only ranks points high if both the binary discriminator and the classifier rank the corresponding point high. However, in the case where training and test out-distribution are identical, this scoring function is not equivalent to $p(\text{in}\,|\,x)$ and thus introduces a bias in the estimation.
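The scoring functions of a background-class model can be read off directly from its softmax output. The following sketch (helper names and the exact set of variants are ours, for illustration) computes the discriminator part, the maximum-probability score, and the classifier part, and checks that the maximum-probability score factorizes into the other two:

```python
import numpy as np

def bgc_scores(logits):
    """Scoring functions computable from a classifier with K+1 outputs, where
    the last output is the background/reject class (helper names are ours)."""
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()          # softmax over all K+1 classes
    s_in = 1.0 - p[-1]                       # estimate of p(in|x)
    s_max = p[:-1].max()                     # = p(in|x) * max_k p_in(k|x)
    s_cls = p[:-1].max() / (1.0 - p[-1])     # classifier part: max_k p_in(k|x)
    return s_in, s_max, s_cls

# confident in-class prediction with low background-class probability
s_in, s_max, s_cls = bgc_scores(np.array([4.0, 0.0, 0.0, -2.0]))
assert s_in > 0.99 and s_max > 0.9
# the maximum-probability score factorizes into discriminator and classifier parts
assert np.isclose(s_max, s_in * s_cls)
```

The factorization mirrors the Bayes optimal identity above: the maximum-probability score is high only if both the discriminator part and the classifier part are high.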
Outlier Exposure (HenMazDie2019) with confidence loss (LeeEtAl2018): We analyze the Bayes optimal solution for the confidence loss (LeeEtAl2018) that is used by Outlier Exposure (OE) and show that the associated scoring function can be written, similarly to the scoring function for training with a background class, as a function of $p(\text{in}\,|\,x)$ and $\max_{k \leq K} p_{\text{in}}(y = k\,|\,x)$.
The training objective with the confidence loss is in expectation given by

$$L(\theta) = \lambda\, \mathbb{E}_{(x,y) \sim p_{\text{in}}}\big[-\log \hat{p}(y\,|\,x; \theta)\big] + (1 - \lambda)\, \mathbb{E}_{z \sim p_{\text{out}}}\Big[-\tfrac{1}{K} \sum_{k=1}^{K} \log \hat{p}(y = k\,|\,z; \theta)\Big],$$

where $\theta$ are the model parameters, the model output as logits defines $\hat{p}(\cdot\,|\,x; \theta)$ via the softmax, and the second term is the cross-entropy to the uniform distribution over the $K$ classes of the in-distribution classification task. In the following theorem we derive the Bayes optimal predictive distribution for this training objective.
Theorem 2. The predictive distribution of the Bayes optimal classifier minimizing the expected confidence loss is given for $p(x) > 0$ as

$$\hat{p}(y = k\,|\,x) = p(\text{in}\,|\,x)\, p_{\text{in}}(y = k\,|\,x) + \big(1 - p(\text{in}\,|\,x)\big)\, \tfrac{1}{K}.$$

Thus the effective scoring function of using the probability of the predicted class, as suggested in HenGim2017; LeeEtAl2018; HenMazDie2019, is

$$\max_{k \leq K} \hat{p}(y = k\,|\,x) = p(\text{in}\,|\,x)\Big(\max_{k \leq K} p_{\text{in}}(y = k\,|\,x) - \tfrac{1}{K}\Big) + \tfrac{1}{K}.$$

Please note that the term inside the brackets is non-negative, as $\max_{k \leq K} p_{\text{in}}(y = k\,|\,x) \geq \tfrac{1}{K}$. Interestingly, this scoring function and the product $p(\text{in}\,|\,x) \max_{k \leq K} p_{\text{in}}(y = k\,|\,x)$ from the background class approach are not equivalent, even though they look quite similar. In particular, due to the subtraction of $\tfrac{1}{K}$, the confidence scoring function puts more emphasis on the classifier than the product does. In Appendix F we additionally analyze Energy-Based OOD Detection (liu2020energy) and show that its Bayes optimal decision is equivalent to using the scoring function $p(\text{in}\,|\,x)$.
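The non-equivalence of the two scoring functions can be demonstrated with a concrete rank reversal. In the following sketch (the two hypothetical inputs and the function names are ours), the product score and the confidence score order the same pair of inputs differently:

```python
K = 10  # number of in-distribution classes

def s_prod(p_in_post, max_cond):
    # background-class style score: p(in|x) * max_k p_in(k|x)
    return p_in_post * max_cond

def s_conf(p_in_post, max_cond):
    # implicit OE/confidence score: p(in|x) * (max_k p_in(k|x) - 1/K) + 1/K
    return p_in_post * (max_cond - 1.0 / K) + 1.0 / K

# two hypothetical inputs: A is more likely in-distribution but classified
# uncertainly, B is less likely in-distribution but classified confidently
a = (0.9, 0.49)   # (p(in|x), max_k p_in(k|x))
b = (0.4, 1.00)
assert s_prod(*a) > s_prod(*b)   # the product score ranks A above B ...
assert s_conf(*a) < s_conf(*b)   # ... while the confidence score ranks B above A
```

The reversal occurs because the confidence score equals the product score plus $(1 - p(\text{in}\,|\,x))/K$, which partially compensates a low discriminator value when the classifier is confident.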
Energy-Based OOD Detection (liu2020energy): This method uses the energy of a model’s logit prediction, defined as $E(x; \theta) = -\log \sum_{k=1}^{K} e^{f_k(x; \theta)}$, as an OOD detection score. The model’s energy is fine-tuned to be low on the in-distribution and high on surrogate OOD data, within certain margin parameters. We analyze the method in detail in Appendix F, where we prove the following:
The Bayes optimal logit output of the Energy-Based OOD detection model minimizing the expected loss on an input $x$ yields class probabilities that are optimal for a standard classifier with cross-entropy loss, and simultaneously the negative energy $-E(x)$ is a strictly increasing function of $p(\text{in}\,|\,x)$.
The Bayes optimal solution of the Energy-based OOD detection criterion is equivalent to the Bayes optimal solution of a binary discriminator between the training in- and out-distributions.
So far we have derived that, at least from the point of view of the ranking induced by the Bayes optimal solution, OOD detection based on generative methods, likelihood ratios, logit energy, and the background class formulation with the scoring function $1 - \hat{p}(y = K{+}1\,|\,x)$ is equivalent to a binary classification problem between in- and out-distribution in order to estimate $p(\text{in}\,|\,x)$. The differences arise mainly in the choice of the training out-distribution $p_{\text{out}}$: i) uniform noise for generative resp. density based methods, ii) a quite specific out-distribution for likelihood ratios (ren2019likelihood), and iii) a proxy of the distribution of all natural images (HenMazDie2019; thulasidasan2021effective). On the other hand, when labeled data is involved, we can additionally train a classifier on the in-distribution in order to estimate $p_{\text{in}}(y\,|\,x)$. We will then combine the estimates of $p(\text{in}\,|\,x)$ and $p_{\text{in}}(y\,|\,x)$ according to the three scoring functions derived in the previous section and check if the novel OOD detection methods constructed in this way perform similarly to the OOD methods from which we derived the corresponding scoring function: i) OOD detection with a background class (thulasidasan2021effective) or ii) using Outlier Exposure (HenMazDie2019). This will allow us to differentiate between differences of the employed scoring functions for OOD detection and the estimators for the involved quantities. In this way we foster a more systematic approach to OOD detection.
In the unlabeled case we simply train the binary classifier using the logistic/cross-entropy loss in a class-balanced fashion,

$$L(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \log \hat{p}(\text{in}\,|\,x_i; \theta) - \frac{1}{m} \sum_{j=1}^{m} \log\big(1 - \hat{p}(\text{in}\,|\,z_j; \theta)\big),$$

where $(x_i)_{i=1}^{n}$ are samples from the in-distribution and $(z_j)_{j=1}^{m}$ are samples from the out-distribution.
In the case where we have labeled data, we can additionally solve the classification problem. The obvious approach is to train the binary classifier for estimating $p(\text{in}\,|\,x)$ and the classifier for estimating $p_{\text{in}}(y\,|\,x)$ completely independently. Not surprisingly, we show in Section 4 that this approach works less well; in fact, both tasks benefit from each other. Moreover, when training a neural network using a background class or with Outlier Exposure (HenMazDie2019), we are implicitly using a shared representation for both tasks, which improves the results.
Thus we propose to train the binary discriminator of in- versus out-distribution together with the classifier on the in-distribution jointly. Concretely, we use a neural network with $K + 1$ outputs, where the first $K$ outputs represent the classifier and the last output is the logit of the binary discriminator. The resulting shared problem can then be written as

$$L(\theta) = \mathbb{E}_{(x,y) \sim p_{\text{in}}}\big[-\log \hat{p}(y\,|\,x; \theta)\big] - \lambda\, \mathbb{E}_{x \sim p_{\text{in}}}\big[\log \hat{p}(\text{in}\,|\,x; \theta)\big] - (1 - \lambda)\, \mathbb{E}_{z \sim p_{\text{out}}}\big[\log\big(1 - \hat{p}(\text{in}\,|\,z; \theta)\big)\big],$$

where $\lambda$ is typically set to $\tfrac{1}{2}$ during training in order to get a class-balanced problem. Note that the in-distribution samples used to estimate $p(\text{in}\,|\,x)$ can be a super-set of the labeled examples used to train the classifier, so that one can potentially integrate unlabeled data; this is an advantage compared to OOD detection with a background class or Outlier Exposure, where this is not directly possible. An example of such a situation is given in Appendix K, where we observe that shared training of a classifier and a binary discriminator works better than OE when only 10% of the in-distribution training samples have labels. We stress that the loss functions of the classifier and the discriminator act on independent outputs; the functions modelling the two tasks only interact with each other due to the shared network weights up to the final layer. Nevertheless, we see in Section 4 that training with a shared representation boosts both the classifier and the binary discriminator.
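A forward-pass sketch of this joint objective (our own simplified numpy version; the paper trains a Wide ResNet, and `kappa` and the function names are ours):

```python
import numpy as np

def shared_loss(logits, y, is_in, kappa=1.0):
    """Sketch of the joint objective for a network with K+1 outputs: the first
    K entries are class logits, the last is the logit of the binary in-vs-out
    discriminator. `kappa` (our name) weights the discriminator term."""
    class_logits, disc_logit = logits[:, :-1], logits[:, -1]
    # classification cross-entropy only on in-distribution samples (labeled)
    z = class_logits - class_logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_p[np.arange(len(y)), y][is_in].mean()
    # class-balanced logistic loss for the discriminator, on all samples
    log_sig = -np.log1p(np.exp(-disc_logit))        # log sigmoid(logit)
    log_one_minus = -np.log1p(np.exp(disc_logit))   # log(1 - sigmoid(logit))
    bce = -(log_sig[is_in].mean() + log_one_minus[~is_in].mean()) / 2.0
    return ce + kappa * bce

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 6))            # batch of 8, K=5 classes + 1 disc
y = rng.integers(0, 5, size=8)              # labels (ignored for OOD samples)
is_in = np.array([True] * 4 + [False] * 4)
loss = shared_loss(logits, y, is_in)
assert np.isfinite(loss) and loss > 0.0
```

In a real training loop the gradients of this loss would be taken with respect to the network parameters producing `logits`; here only the loss computation is shown.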
We use the CIFAR-10 and CIFAR-100 (krizhevsky2009learning) datasets as in-distribution and the OpenImages dataset (OpenImages2) as training out-distribution. The 80 Million Tiny Images (80M) dataset (torralba200880) is the de facto standard for training out-distribution aware models that has been adopted by most prior works, but this dataset has been withdrawn by the authors after (Birhane_2021_WACV) pointed out the presence of offensive images. To be able to compare with other state-of-the-art methods without introducing a potential bias due to dataset selection, we include the evaluation with 80M as training out-distribution in Appendix H. Moreover, in the appendix we show results for the binary discriminator trained with different training out-distributions, compared to likelihoods resp. likelihood ratios (ren2019likelihood) as OOD methods.
We use as OOD detection metric the false positive rate at 95% true positive rate, FPR@95%TPR; evaluations with AUC are in Appendix G. We evaluate the OOD detection performance on the following datasets: SVHN (SVHN), resized LSUN Classroom (lsun), Uniform Noise, Smooth Noise generated as described by (HeiAndBit2019), the respective other CIFAR dataset, 80M, and CelebA (CelebA). We highlight that none of the listed methods has access to those test distributions during training or for fine-tuning, as we try to assess the ability of an OOD aware model to generalize to unseen distributions. The FPR for the OpenImages test set is not part of the reported mean values, since OpenImages has been used during training.
The binary discriminators (BinDisc), the classifiers with background class (BGC), and the shared binary discriminator+classifier (Shared) are trained on the 40-2 Wide Residual Network (ZagKom2016) architecture with the same training schedule as used in (HenMazDie2019) for training their Outlier Exposure (OE) models. This includes averaging the loss over batches that are twice as large for the out-distribution. This way we ensure that the differences do not arise due to differences in the training schedules or other important details, but only due to the employed objectives. In addition to their standard augmentation and normalization, we apply AutoAugment (cubuk2019autoaugment) without Cutout, and we use the class-balanced choice of $\lambda$ where applicable, which is a sound choice as we observe in an ablation on $\lambda$ in Appendix L. For the Energy OOD detector, we fine-tune the plain model for 10 epochs with OpenImages as out-distribution. For the Mahalanobis OOD detector (LeeEtAl2018b), we use the models and code published by the authors and use OpenImages for the fine-tuning of input noise and layer weighting regression. Our code is available at https://github.com/j-cb/Breaking_Down_OOD_Detection and we describe the exact details of the training settings and the used dataset splits in Appendix C.

In Table 1 we compare multiple OOD methods trained with training out-distribution OpenImages and CIFAR-10/100 as in-distribution: confidence of standard training (Plain) and OE, Mahalanobis detection, Energy-Based OOD detection, classifier with background class (BGC), and the combination of a plain classifier and a binary in-vs-out-distribution classifier with shared representation (Shared Combi). As described in Section 2, both BGC and Shared Combi can be used in combination with different scoring functions. For BGC, we evaluate all three scoring functions (the background-class-based score, the maximum probability over the in-distribution classes, and their combination), while for Shared Combi we only use the latter two, as the background-class-based score is equivalent to $p(\text{in}\,|\,x)$, which is the output of Shared BinDisc. Additionally, we evaluate OOD detection based on the confidence of the shared classifier (Shared Classi) trained together with Shared BinDisc.
| Model | FPR | Model | FPR |
|---|---|---|---|
| Plain Classi | 60.83 | Shared BinDisc | 18.89 |
| OE | 19.53 | Shared Classi | 34.93 |
| BGC | 19.29 | Shared Combi | 18.35 |
| BGC | 19.58 | Shared Combi | 18.38 |
| BGC | 19.62 | | |
For CIFAR-10, a first interesting observation is that Shared Classi has remarkably good OOD performance, significantly better than a normal classifier (Plain), even though it is just trained using the normal cross-entropy loss; its OOD performance is thus only due to the regularization enforced by the shared representation with Shared BinDisc. In fact, Shared BinDisc alone already has good OOD performance with a mean FPR@95%TPR of 19.56, which is improved by the combined scoring function of Shared BinDisc and Shared Classi, yielding very good classification accuracy and mean FPR/AUC. Moreover, the results of the classifier with background class (BGC), the method recently advocated in (thulasidasan2021effective), are interesting: it works very well, but the performance depends on the chosen scoring function. Whereas the output of the background class is a usable scoring function (mean FPR: 18.83), the maximum probability over the other classes (mean FPR: 16.52) or the combination of discriminator and classifier information (mean FPR: 16.63) performs better. In total, with the scoring function integrating classifier and discriminative information, BGC reaches similar performance to OE (which implicitly also uses such a combined scoring function). In general, the differences between the methods are relatively minor both in terms of OOD detection and classification accuracy, where the latter is better for all OOD methods compared to the plain classifier; this is most likely explained by better learned representations, see also HenMazDie2019; RATIO for similar observations. The results for CIFAR-100 are similar to those for CIFAR-10, with some reversals of the overall rankings of the compared methods. OE achieves OOD results comparable to BGC and Shared Combi. For this in-distribution, our BGC and Shared BinDisc perform best in terms of OOD performance. Classification test accuracy is slightly higher for BGC and Shared, but the differences are minor.
The above observations are confirmed by further experiments with 80M as training out-distribution in Appendix H, as well as with Restricted ImageNet (tsipras2018robustness) as in-distribution and the remaining ImageNet classes as training out-distribution in Appendix
I. For the reader’s convenience, we summarize the mean FPR over the 5 models for each method and over all datasets in Table 2. As discussed, the differences between the different methods using surrogate OOD data are relatively minor, with none of the methods being strictly better or worse than another over all 5 settings. Overall, as suggested by the theoretical results on the equivalence of the Bayes optimal classifier of OE with the combined scoring functions of BGC and Shared Combi, we observe that even though these methods are derived and in particular trained with quite different objectives, they behave very similarly in our experiments. In total, we think that this provides a much better understanding of where the differences between OOD methods come from. Regarding the question of which method and scoring function should be used for a given application, the experimental results across datasets and different out-distributions, see Appendix H, suggest that their differences are minor and there is no clear best choice. However, in Appendix B, we describe a potential situation where the combined score, and in consequence OE, is not powerful enough to distinguish in- and out-of-distribution inputs. On the other hand, in cases where the discriminator score $p(\text{in}\,|\,x)$ is not very informative because training and test out-distributions largely differ, combining it with the classifier confidences is beneficial; this can be observed in experiments with SVHN as training out-distribution, which we show in Appendix J. This is why, for an unknown situation, we recommend BGC or Shared Combi with the combined scoring function as the safest option. However, it is an open question whether there are also situations where this combined scoring function is fundamentally inferior to $p(\text{in}\,|\,x)$.
As highlighted above, the shared training of Shared Classi and Shared BinDisc and their combination Shared Combi yields strong OOD detection and test accuracy among all methods. Here, we evaluate the importance of training the binary discriminator and the plain classifier with a shared representation, in comparison to training two entirely separate models, Plain Classi and Separate BinDisc, and their combination Separate Combi with the same combined scoring function. The results for CIFAR-10 and CIFAR-100 can be found in Table 3. In total, we see that separate training, in particular for CIFAR-100, leads to worse results compared to shared training, as expected, since the binary discriminator and the classifier cannot benefit from each other. An interesting curiosity is that the combination of the separate classifier with the binary discriminator trained in a shared fashion (Plain Sha Disc) yields almost the same OOD results as Shared Combi, even though the classifier is significantly worse. Overall, Shared Combi performs significantly better when also considering the better classification accuracy which it inherits from Shared Classi.
In this paper we have analyzed different OOD detection methods and have shown that the simple baseline of a binary discriminator between in- and out-distribution is a powerful OOD detection method if trained in a shared fashion with a classifier. Moreover, we have revealed the inner mechanism of Outlier Exposure and of training with a background class, which unexpectedly use a scoring function that integrates information from both the binary discriminator $p(\text{in}\,|\,x)$ and the classifier $p_{\text{in}}(y\,|\,x)$. We think that these findings will make it possible to build novel OOD methods in a more principled fashion.
The authors acknowledge support from the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039A) and from the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy (EXC number 2064/1, Project number 390727645), as well as from the DFG TRR 248 (Project number 389792660). The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Alexander Meinke.
See 1
Assume that such a function exists. Then for any pair we have the logical equivalences and . This directly implies that the AUCs are the same, regardless of the distributions.
Assume . For each , choose some . For any pair , by regarding the Dirac distributions and that are each concentrated on one of the points, we can infer that and similarly . The latter ensures that the function defined as
(1)
is independent of the choice of and that , and the former confirms that is strictly monotonically increasing. ∎
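The invariance of the AUC under strictly increasing transformations of the scoring function, on which this equivalence rests, is easy to check numerically. The following self-contained sketch uses a rank-based AUC implementation and synthetic Gaussian scores (all names illustrative):

```python
import numpy as np

def auc(pos_scores, neg_scores):
    """Rank-based AUC: probability that a random positive outscores a
    random negative, with ties counted as 1/2."""
    pos = np.asarray(pos_scores)[:, None]
    neg = np.asarray(neg_scores)[None, :]
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()

rng = np.random.default_rng(0)
s_in = rng.normal(1.0, 1.0, 500)    # stand-in scores on the in-distribution
s_out = rng.normal(-1.0, 1.0, 500)  # stand-in scores on an out-distribution

base = auc(s_in, s_out)
# Any strictly increasing phi preserves all pairwise comparisons, and hence
# the AUC -- here phi(s) = exp(s) and phi(s) = 3*s - 7.
assert np.isclose(base, auc(np.exp(s_in), np.exp(s_out)))
assert np.isclose(base, auc(3 * s_in - 7, 3 * s_out - 7))
```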
See 1
We know that a function as in Theorem 1 exists. Then for any pair , we have the logical equivalences
(2)
and
(3)
This directly implies that the FPR@qTPR-values are the same, for any and q. ∎
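The FPR@qTPR metric used in this corollary can be computed in a few lines of numpy. The helper name `fpr_at_tpr` is illustrative, and the convention that larger scores mean "more in-distribution" is an assumption:

```python
import numpy as np

def fpr_at_tpr(s_in, s_out, q=0.95):
    """False positive rate on the out-distribution at the threshold that
    keeps a fraction q of in-distribution inputs accepted (TPR >= q).
    Higher scores are assumed to indicate 'in-distribution'."""
    thresh = np.quantile(s_in, 1.0 - q)   # accept inputs with score >= thresh
    return float(np.mean(np.asarray(s_out) >= thresh))
```

Applying the same strictly increasing transformation to both score arrays leaves the result unchanged, as the corollary states.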
See 2
The function defined by (setting ) fulfills the criterion from Theorem 1 of being strictly monotonically increasing. With
(4)
for and , the equivalence follows. ∎
See 3
This is a special case of Lemma 2, by setting . ∎
See 2
Minimizing the loss of Outlier Exposure
(5)
means solving the optimization problem
(6)
where is the model’s -dimensional prediction. For or , the optimalities of the respective terms are easy to show (applying the common conventions for ), so we assume that those are non-zero. The Lagrange function of the optimization problem is
(7)
with and . Its first derivative with respect to for any is
(8)
where we set . The second derivative is a positive diagonal matrix for any point of its domain, therefore we find the unique minimum by setting (8) to zero, i.e. at
(9)
The dual problem is hence the maximization (with ) of
here, only appears in which has a positive factor , so maximizes the expression. Noting , what remains is , which is maximized by . This means that the dual optimal pair is . Slater’s condition (BoydVandenberghe) holds since the feasible set of the original problem is the probability simplex. Thus, is indeed primal optimal. ∎
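Since the symbols of this proof were lost in extraction, the following numerical sketch only illustrates the standard structure of the result: for a fixed input, the minimizer of the expected Outlier Exposure loss over the probability simplex is the normalized mixture of the in-distribution class posterior and the uniform distribution. All names (`p`, `pin`, `K`) are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical quantities for one fixed input x: the in-distribution class
# posterior over K classes, and the probability pin that x was drawn from
# the in-distribution at all.
p = np.array([0.7, 0.2, 0.1])
pin, K = 0.6, 3
target = pin * p + (1 - pin) / K   # the claimed Bayes optimal prediction

def expected_loss(logits):
    """Expected OE-style loss at x: cross-entropy against p with weight pin,
    plus cross-entropy against the uniform distribution with weight 1-pin."""
    q = softmax(logits)
    return -pin * np.sum(p * np.log(q)) - (1 - pin) * np.mean(np.log(q))

# Minimize by plain gradient descent on the logits; the gradient of the
# expected loss with respect to the logits is softmax(z) - target.
z = np.zeros(K)
for _ in range(5000):
    z -= 0.5 * (softmax(z) - target)

# The numerical optimum matches the closed-form mixture.
assert np.allclose(softmax(z), target, atol=1e-4)
```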
Seeing that for both in-distribution accuracy and OOD detection, OE models trained with the confidence loss, models with a background class, and shared classifier/discriminator combinations behave very similarly, the question arises whether the training methods themselves lead to equivalent models. One idea might be that the effect of the confidence loss on the degree of freedom from logit translation invariance could be unfolded to obtain a
-dimensional output that contains the same information as a classifier with background class or with an additional binary discriminator output (the latter two are indeed equivalent). This is not the case, as the following example shows: in certain situations, the and scores of background class models and classifier/discriminator combinations are able to separate in- and out-distribution, while the score and the equivalent confidence loss/OE models cannot. As an example where the mentioned non-equivalence would occur, we hypothetically regard the task of classifying photos of 1-Euro coins by the issuing country. Each €1 coin features a common side that is the same for each country and a national side that pictures a unique motif per country. We assume that one side is visible on each photo, and that the training dataset of size is balanced, consisting of coin photos with label for each country , where photos show the common side and the other photos show the informative national side for each country .
It is easy to see that the Bayes optimal classifier trained with cross-entropy loss on this dataset predicts for the respective country when shown a photo of the national side of a €1 coin, and predicts for each country when shown the common side of a €1 coin.
Now we compare the behaviour of the different methods given a training out-distribution of poker chips images which are clearly recognizable as not being €1 coins.
A -class model trained with the confidence loss (LeeEtAl2018) will not distinguish between common-side coin images and poker chip images; in the Bayes optimal case, it will predict the uniform class distribution in both cases. This does not only hold for the prediction of a hypothetical Bayes optimal model: assuming full-batch gradient descent and identical sets of common-side training photos for each class, the loss for a common-side input is the same as the loss for a poker chip.
On the other hand, a binary discriminator will easily distinguish between poker chips and €1 coins, no matter which side of the coin is shown. The same holds for a model with background class: the score of the class will be close to for chips and close to for €1 coins.
We conclude that in the described situation, models trained with confidence loss/outlier exposure are not able to sufficiently distinguish in- and out-distribution, while the scoring function of a classifier with background class or a binary discriminator is suitable for this task.
With the scoring function, the background class model gives us , and thus , which means that if is sufficiently large for in-distribution inputs and sufficiently small for out-of-distribution inputs, is able to distinguish them independent of inconclusiveness in the first classes. Similarly, applied to a binary discriminator with a classifier (shared trained or not) will be able to distinguish common sides of coins and poker chips.
With , on the other hand, common sides of coins and poker chips can no longer be separated. For a classifier/discriminator pair, as defined above, . If on the common side of a coin the classifier predicts uniform , we have no matter what the discriminator predicts. On poker chips with discriminator prediction , we also get . For background class models, also yields for a common side, where the prediction over the in-distribution classes is uniform, and for a poker chip, where . The fact that in this coin scenario, when scored with , background class and classifier/discriminator combinations have the same problem as confidence loss/OE is not surprising considering their equivalence shown in Theorem 2.
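The coin scenario can be condensed into a tiny numerical example. The class count and probability values below are invented for illustration; they only encode that a background-class model is inconclusive over the in-distribution classes for both common coin sides and poker chips, while placing background mass only on the chips:

```python
import numpy as np

K = 4  # number of in-distribution classes (countries); illustrative only

# Toy (K+1)-way outputs of a background-class model; index K is background.
common_side = np.array([0.24, 0.24, 0.24, 0.24, 0.04])
poker_chip  = np.array([0.02, 0.02, 0.02, 0.02, 0.92])

def conf_score(probs):
    """Maximum of the class probabilities renormalized over the K
    in-distribution classes -- the confidence-style score."""
    p_in = probs[:K] / probs[:K].sum()
    return p_in.max()

def bg_score(probs):
    """One minus the background-class probability -- the
    discriminator-style score."""
    return 1.0 - probs[K]

# The confidence-style score is 0.25 for both inputs and cannot separate
# them, while the background-class score separates them cleanly.
assert np.isclose(conf_score(common_side), conf_score(poker_chip))
assert bg_score(common_side) > 0.9 and bg_score(poker_chip) < 0.1
```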
For training our models, we build upon the code of HenMazDie2019, which is available at https://github.com/hendrycks/outlier-exposure, and borrow their general architecture and training settings. Concretely, we use 40-2 Wide Residual Network (ZagKom2016) models with normalization based on the CIFAR datasets and a dropout rate of 0.3. They are trained for 100 epochs with an initial learning rate of
that decreases following a cosine annealing schedule. Unless mentioned otherwise, each training step uses a batch of size 128 for the in-distribution and a batch of size 256 for the training out-distribution. The optimizer uses stochastic gradient descent with a Nesterov momentum of 0.9. Weight decay is set to
. The deep learning framework we use is PyTorch
(PyTorch), and for evaluating we use the scikit-learn (scikit-learn) implementation of the AUC. Our code is available at https://github.com/j-cb/Breaking_Down_OOD_Detection. For evaluating the Mahalanobis detector, we use the code by the authors of LeeEtAl2018b, provided at https://github.com/pokaxpoka/deep_Mahalanobis_detector. The input noise levels and regression parameters are chosen on the available out-distribution OpenImages and are 0.0014 for CIFAR-10 and 0.002 for CIFAR-100.
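The cosine annealing schedule mentioned above can be written in a few lines of plain Python; since the initial learning-rate value was elided in this version of the text, `lr_init` is left as a parameter:

```python
import math

def cosine_annealed_lr(step, total_steps, lr_init):
    """Cosine annealing as used for the 100-epoch runs: the learning rate
    decays from lr_init to zero following half a cosine wave.  lr_init is
    a placeholder, as the actual value was elided above."""
    return lr_init * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
```

The schedule starts at `lr_init`, passes through half that value at the midpoint, and reaches zero at the final step.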
All experiments were run on Nvidia V100 GPUs of an internal cluster of our institution, using up to 4 GB GPU memory (batch sizes in:128/out:256), with no noticeable difference between ours and the compared OE (HenMazDie2019) runs.
We train our models with the train splits of CIFAR-10 and CIFAR-100 (krizhevsky2009learning) (MIT license) which each consist of 50,000 labeled images, and evaluate on their test splits of 10,000 samples. As training out-distribution we use OpenImages v4 (OpenImages2) (images have a CC BY 2.0 license); the training split that we employ here consists of 8,945,291 images of different sizes, which get resized to pixels, and we test on 10,000 from the official validation split. For training with 80 Million Tiny Images (80M) (torralba200880) in Appendix H (no license, see links in Appendix H), we use data from the beginning of the sequentialized dataset, and evaluate on a test set of 30,080 images starting at index 50,000,000. A subset of CIFAR images contained in 80M is excluded for training and evaluation. Further image datasets used for evaluation are SVHN (SVHN) (free for non-commercial use) with 26,032 samples, LSUN (lsun) Classroom (no license) with 300 samples, and CelebA (CelebA) (available for non-commercial research purposes only) with 19,962 test samples. Uniform and Smooth Noise (HeiAndBit2019) are sampled, the latter by generating uniform noise and smoothing it using a Gaussian filter with a width that is drawn uniformly at random in . Each datapoint is then shifted and scaled linearly such that the minimal and maximal pixel values are 0 and 1, respectively. For both noises, we evaluate 30,080 inputs.
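The smooth-noise construction described above can be sketched as follows. The spatial resolution and the sigma interval are placeholder assumptions (the actual interval was elided), and the Gaussian blur is implemented with a hand-rolled separable convolution rather than any particular library routine:

```python
import numpy as np

def smooth_noise(shape=(32, 32, 3), sigma_range=(1.0, 3.0), rng=None):
    """Smooth noise in the spirit of HeiAndBit2019: uniform noise, blurred
    with a Gaussian filter of random width, then rescaled so the minimal
    and maximal pixel values are exactly 0 and 1."""
    if rng is None:
        rng = np.random.default_rng(0)
    img = rng.uniform(size=shape)
    sigma = rng.uniform(*sigma_range)
    # Build a 1-D Gaussian kernel and apply it along both spatial axes.
    radius = int(3 * sigma) + 1
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x**2 / (2 * sigma**2))
    kernel /= kernel.sum()
    for axis in (0, 1):
        img = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="same"), axis, img)
    # Linear shift and scale to the range [0, 1].
    return (img - img.min()) / (img.max() - img.min())
```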
Out-of-distribution detection has been an important research area in recent years, and several approaches that are fitted towards different training and inference scenarios have been proposed.
One seemingly obvious line of thought is to use generative models for density estimation to differentiate between in- and out-distribution (bishop1994; NalEtAl2018; ren2019likelihood; nalisnick19b; XiaoLikelihoodRegret). Recent methods overcome, to a certain extent, the problem observed by NalEtAl2018 that generative models can assign higher likelihood to distributions on which they have not been trained. Another line of work comprises score-based methods using an underlying classifier or the internal features of such a classifier, potentially combined with a generative model (HenGim2017; liang2017enhancing; lee2018simple; HenMazDie2019; HeiAndBit2019). One of the most effective methods up to now is Outlier Exposure (HenMazDie2019) and work building upon it (chen2020informative-outlier-matters; meinke2020towards; Mohseni2020SelfSupervisedLF; RATIO; OECC; thulasidasan2021effective), where a classifier is trained on the in-distribution task and low confidence is enforced, as proposed by LeeEtAl2018, during training on a large and diverse set of out-of-distribution images (HenMazDie2019) which can be seen as a proxy of all natural images. This approach generalizes well to other out-distributions. Recently, NTOM (chen2020informative-outlier-matters) has achieved excellent results for detecting far out-of-distribution data by adding a background class to the classifier, which is trained on samples from the surrogate out-distribution that are mined such that they show a desired hardness for the model. At test time, the output probability for that class is used to decide whether an input is to be flagged as OOD. Their ATOM method does the same while also adding adversarial perturbations to the OOD inputs during training. Even though it has been claimed that newer approaches outperform HenMazDie2019, to our knowledge this has not been shown consistently across different and challenging test out-of-distribution datasets (including both close and far out-of-distribution datasets).
Below, we discuss some other recently proposed approaches that build upon different premises on the data available during training.
hendrycks2019selfsupervised do not use any OOD data during training and instead teach the model to predict whether an input has been rotated by 0°, 90°, 180° or 270°. For inference, they use the loss of this predictor as an OOD score and add it to the classifier output entropy, which behaves very similarly to classifier confidence. Similar to our methods, they also use shared representations and the combination of the in-distribution classifier with a dedicated OOD detection score. If one interprets their rotation predictor loss as being an estimator of for some implicit out-distribution, their scoring function coincides with our scores.
golan2018deep learn a similar transformation detector (with Dirichlet statistics collected on the in-distribution replacing ground truth labels) and use it directly to detect OOD samples without using in-distribution class information.
winkens2020contrastive fit for each class a normalized Mahalanobis detector on the activations of a model trained with SimCLR and a classification head on only the in-distribution with smoothed labels. They describe their method as applying class-wise density estimation in the feature space, where the normalized Mahalanobis distance is equivalent to a Gaussian density for each class.
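A minimal sketch of such a class-wise Mahalanobis score with a single shared covariance might look as follows; the feature dimension and the aggregation by minimum distance over classes are illustrative choices, not the exact recipe of the cited work:

```python
import numpy as np

def mahalanobis_scores(feats, class_means, cov):
    """Class-wise Mahalanobis distances in a model's feature space with a
    shared covariance.  Returns, per input, the minimum squared distance
    over all class means; a large value suggests an OOD input."""
    prec = np.linalg.inv(cov)
    dists = []
    for mu in class_means:
        d = feats - mu                          # (n, dim)
        dists.append(np.einsum("nd,de,ne->n", d, prec, d))
    return np.min(np.stack(dists, axis=0), axis=0)

# Two toy class means in a 2-D feature space; the third, far-away input
# receives by far the largest (most OOD-like) score.
feats = np.array([[0.1, 0.0], [5.0, 4.9], [20.0, 20.0]])
scores = mahalanobis_scores(feats, [np.zeros(2), np.full(2, 5.0)], np.eye(2))
```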
DermaOOD treat an interesting application of flagging unseen skin diseases, making use of class labels that are also available for their training OOD data, which contains diseases different from both the in-distribution diseases and the unseen diseases. This allows them to perform fine-grained OOD detection by regarding the sum over all OOD classes, which for their dataset shows a large improvement over methods that treat the training out-distribution as one class. They gain additional slight improvements by combining this with a coarse-grained binary loss that treats the sum over all in-distribution class probabilities as and the sum over all OOD classes as . They show that this method can be combined with various representation learning approaches in order to improve their detection of unknown diseases.
tack2020csi introduce distribution shifting transformations into SimCLR training. Those are transformations that are so strong that the resulting samples can be considered as OOD and as negatives w.r.t. the original in the SimCLR loss. Similarly to hendrycks2019selfsupervised and golan2018deep, they also train a head that classifies the applied transformation. In a version extended to using in-distribution labels, they consider samples from the same class and with the same transformation as positives, and samples where either is different as negatives. With this method, they obtain OOD detection results that significantly improve over standard classifier confidence, without using any training OOD dataset.
(liu2020hybrid)
derive a contrastive loss from Joint Energy-Based Model training
(Grathwohl2020Your) and train it together with cross-entropy on the in-distribution in order to obtain classifiers whose logit values can be transformed into energies that, as OOD scoring functions, are equivalent to the in-distribution density. They show that using these energies yields some improvements over previous density estimation approaches, and that the classifier confidences also show moderately improved OOD detection when compared to standard training. liu2020energy propose another Energy-Based method which incorporates surrogate OOD data during training. We analyse this method in detail in Appendix F, where we show that their Bayes optimal OOD detector is equivalent to the binary discriminator between in- and out-distribution.
(li2020background) use the same training method as OE and show that careful resampling of the training out-distribution, resembling hard negative mining, can reduce its size and therefore lead to a more resource-efficient training OOD dataset, while the resulting models reach similarly good OOD detection performance.
We want to answer the following questions: How does estimating the in-distribution density compare to simply employing a binary discriminator between the in-distribution and uniform noise for the task of out-of-distribution detection? Can other density-based models be replaced by potentially easier-to-handle binary discriminators against a suitable (semi-)synthetic out-distribution? As generative models, we use a standard likelihood VAE and a likelihood PixelCNN++, and additionally compare with a Likelihood Regret VAE (XiaoLikelihoodRegret). The binary classifier is trained to separate real data from uniform noise; thus, none of the methods presented in this section makes use of 80M or any other surrogate distribution. The results for OOD detection in terms of AUC for all methods are presented in Table 4.
Comparing the OOD detection performance of the binary discriminator trained against uniform noise with both VAE models, we assess that none of these models is suitable for reliably detecting inputs from unknown out-distributions.
Following the theoretical analysis from the previous sections, the likelihood models and our binary classifier are able to perfectly separate the in-distribution data from uniform noise. This is expected, as those methods are trained on that particular task of separating CIFAR-10 from uniform noise, whereas the LH Regret VAE with its modified training objective has worse performance on uniform noise. The training objective of the binary classifier appears to be too easy, as the training and validation loss converge to almost zero within the first few epochs of training. However, the ability to separate uniform noise from real images does not generalize to other test distributions, as both methods fail to achieve good out-of-distribution detection performance there. We note that while the score features from the likelihood models and the binary classifier are in expectation equivalent, both methods behave quite differently on the test datasets (except for uniform noise). This is not surprising, as the probability of drawing real images from the uniform distribution is so small that neither training method properly regularizes the model’s behaviour on those particular image manifolds. Thus, the results are artifacts of random fluctuation, and no method clearly outperforms the other; for example, the binary classifier is better at separating CIFAR-10 from SVHN, whereas the likelihood VAE significantly outperforms the binary classifier on LSUN. Similar fluctuations also exist between the two variational autoencoders and PixelCNN++; in conclusion, none of these methods is able to generalize to more specialized unseen distributions.
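The in-expectation equivalence mentioned here is easy to see numerically: a Bayes optimal discriminator between a density p and a constant uniform density u outputs p(x)/(p(x)+u), a strictly increasing function of p(x), so both scores induce the same ranking of inputs. The density values below are invented purely for illustration:

```python
import numpy as np

# Illustrative in-distribution density values at five inputs, and a
# constant uniform density over the (hypothetical) image domain.
p_vals = np.array([1e-9, 1e-4, 0.3, 2.0, 7.5])
u = 1.0 / 256.0

# Bayes optimal discriminator posterior for class "in-distribution"
# (equal class priors assumed): d(x) = p(x) / (p(x) + u).
d_vals = p_vals / (p_vals + u)

# The discriminator output is strictly increasing in p, so it ranks
# inputs exactly as the likelihood itself does.
assert np.array_equal(np.argsort(p_vals), np.argsort(d_vals))
```

Since AUC and FPR@qTPR only depend on this ranking, the two scores are equivalent in expectation, even though finite training leads to the fluctuations reported above.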
In Section 3.1, we discussed that for the Bayes optimal solutions of their training objectives, the ratio of the likelihoods of two density estimators for different distributions is, as an OOD detection scoring function, equivalent to the prediction of a binary discriminator between the two distributions. In order to find out which role this equivalence plays in practice, we train a binary discriminator between CIFAR-10 as in-distribution and the background distribution obtained by mutating 10% of the pixels of in-distribution images, as described in ren2019likelihood. In Table 5, we compare the OOD detection performance of this discriminator with likelihood ratios estimated with PixelCNN++ (Salimans2017PixeCNN), as trained with the code of ren2019likelihood with the regularization set to 10, and with the numbers taken from XiaoLikelihoodRegret
given for their VAE models. Even though we use the code and hyperparameter settings of
ren2019likelihood, the AUC we obtain for SVHN as out-distribution differs significantly from their reported 93.1%. We observe that all three methods struggle with detecting inputs from several out-distributions, and we therefore do not consider them reliable out-of-distribution detection methods. With Energy-Based OOD Detection (liu2020energy), we exhibit a surprising case of the equivalence of binary discriminators to a further OOD detection method which is based on ideas quite different from those of the other extensively discussed methods. This method is based on the premise that the logits of a classifier can be used to define an energy (here we ignore a potential temperature factor, which, as liu2020energy find, can in good conscience be set to one)
\[ E(x; f) \;=\; -\log \sum_{j=1}^{K} e^{f_j(x)} \tag{10} \]
which the model assigns to the input, as has been proposed by lecun2006tutorial; Grathwohl2020Your. Ideally, this energy would be equivalent to the probability density of the in-distribution via
\[ p(x) \;=\; \frac{e^{-E(x;f)}}{\int e^{-E(x';f)} \, dx'} \tag{11} \]
However, since the integral over the whole image domain is intractable, it is not possible to effectively decrease on the in-distribution directly while also controlling . Naïvely training the classifier to have high energy on random inputs, i.e. uniform noise, is of course not a solution, since the model easily distinguishes the noise, and it is very unlikely to encounter any even vaguely realistic images within finite training time. Thus, liu2020energy instead use a surrogate training out-distribution of natural images on which they increase the energy during training; for their experiments, they take 80 Million Tiny Images, for which we compare their models with several other methods in Appendix H. Simultaneously, they minimize the standard classifier cross-entropy on the in-distribution. In order to avoid infinitely small potential losses, their training objective uses two margin hyper-parameters and reads
\[ \mathcal{L} \;=\; \mathbb{E}_{(x,y)\sim D_{\mathrm{in}}}\!\left[-\log F_y(x)\right] \;+\; \lambda \cdot \mathcal{L}_{\mathrm{energy}} \tag{12} \]
with
\[ \mathcal{L}_{\mathrm{energy}} \;=\; \mathbb{E}_{x\sim D_{\mathrm{in}}}\!\left[\max\bigl(0,\, E(x;f) - m_{\mathrm{in}}\bigr)^2\right] \;+\; \mathbb{E}_{x\sim D_{\mathrm{out}}}\!\left[\max\bigl(0,\, m_{\mathrm{out}} - E(x;f)\bigr)^2\right] \tag{13} \]
Note that their does not balance between in- and out-distribution by depending on a prior on , but rather balances the energy regularization against the classifier cross-entropy loss.
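A minimal numpy sketch of the energy score and the squared-hinge margin regularizer described above; the margin values in the usage example are illustrative, not those used by liu2020energy:

```python
import numpy as np

def energy(logits):
    """Energy score with the temperature fixed to one:
    E(x) = -log sum_j exp(f_j(x)), computed stably via the max trick."""
    m = logits.max()
    return -(m + np.log(np.exp(logits - m).sum()))

def energy_margin_loss(e_in, e_out, m_in, m_out):
    """Squared-hinge regularizer with two margin hyper-parameters:
    penalize in-distribution energies above m_in and out-of-distribution
    energies below m_out."""
    return max(0.0, e_in - m_in) ** 2 + max(0.0, m_out - e_out) ** 2

# The regularizer vanishes once both energies are past their margins...
assert energy_margin_loss(-30.0, -1.0, m_in=-25.0, m_out=-7.0) == 0.0
# ...and grows quadratically otherwise.
assert energy_margin_loss(-20.0, -10.0, m_in=-25.0, m_out=-7.0) == 34.0
```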
See 3
Our goal is to find for a given input the model output that minimizes the expected loss, assuming that we know the probabilities and . This expected loss is
(14)
(15)
First, we note that if a certain minimizes the expected energy loss to
(16)
and if some minimizes the expected CE loss (note that its minimization is independent of the positive factor ) to
(17)
with corresponding , then the logit output with