TensorFlow 2 implementation of the paper Generalized ODIN: Detecting Out-of-distribution Image without Learning from Out-of-distribution Data (https://arxiv.org/abs/2002.11297).
Deep neural networks have attained remarkable performance when applied to data that comes from the same distribution as that of the training set, but can significantly degrade otherwise. Therefore, detecting whether an example is out-of-distribution (OoD) is crucial to enable a system that can reject such samples or alert users. Recent works have made significant progress on OoD benchmarks consisting of small image datasets. However, many recent methods based on neural networks rely on training or tuning with both in-distribution and out-of-distribution data. The latter is generally hard to define a-priori, and its selection can easily bias the learning. We base our work on a popular method ODIN, proposing two strategies for freeing it from the needs of tuning with OoD data, while improving its OoD detection performance. We specifically propose to decompose confidence scoring as well as a modified input pre-processing method. We show that both of these significantly help in detection performance. Our further analysis on a larger scale image dataset shows that the two types of distribution shifts, specifically semantic shift and non-semantic shift, present a significant difference in the difficulty of the problem, providing an analysis of when ODIN-like strategies do or do not work.
State-of-the-art machine learning models, specifically deep neural networks, are generally designed for a static and closed world. The models are trained under the assumption that the input distribution at test time will be the same as the training distribution. In the real world, however, data distributions shift over time in a complex, dynamic manner. Even worse, new concepts (e.g. new categories of objects) can be presented to the model at any time. Such within-class distribution shift and unseen concepts both may lead to catastrophic failures since the model still attempts to make predictions based on its closed-world assumption. These failures are therefore often silent in that they do not result in explicit errors in the model.
The above issue has been formulated as the problem of detecting whether an input is from in-distribution (i.e. the training distribution) or out-of-distribution (i.e. a distribution different from the training distribution). This problem has been studied for many years and has been discussed from several views such as rejection [8, 5], open set recognition, and uncertainty estimation [22, 23, 24]. In recent years, a popular neural network-based baseline is to use the max value of class posterior probabilities output from a softmax classifier, which can in some cases be a good indicator for distinguishing in-distribution and out-of-distribution inputs.
ODIN, based on a trained neural network classifier, provides two strategies, temperature scaling and input preprocessing, to make the max class probability a more effective score for detecting OoD data. Its performance has been further confirmed by a later study in which 15 OoD detection methods are compared with a less biased evaluation protocol. ODIN outperforms popular strategies such as MC-Dropout, DeepEnsemble, PixelCNN++, and OpenMax.
Despite its effectiveness, ODIN needs OoD data to tune the hyperparameters of both its strategies, leading to a concern that hyperparameters tuned with one out-of-distribution dataset might not generalize to others. In fact, other neural network-based methods [20, 38], which follow the same problem setting, have a similar requirement. Other works [6, 14] push the idea of utilizing OoD data further by using a carefully chosen OoD dataset to regularize the learning of class posteriors so that OoD data have much lower confidence than in-distribution data. Lastly, another approach uses a generative model to generate out-of-distribution data around the boundary of the in-distribution for learning.
Although the above works show that learning with OoD data is effective, the space of OoD data (e.g. image pixel space) is usually too large to be covered, potentially causing a selection bias for the learning. Some previous works have made similar attempts to learn without OoD data: one uses word embeddings for extra supervision, and another applies metric learning criteria. However, both works report performance similar to ODIN, showing that learning without OoD data is a challenging setting.
In this work, we closely follow the setting of ODIN, proposing two corresponding strategies for the problem of learning without OoD data. First, we provide a new probabilistic perspective for decomposing confidence of predicted class probabilities. We specifically add a variable for explicitly adopting the closed world assumption, representing whether the data is in-distribution or not, and discuss its role in a decomposed conditional probability. Inspired by the probabilistic view, we use a dividend/divisor structure for a classifier, which encourages neural networks to behave similarly to the decomposed confidence effect. The concept is illustrated in Figure 1, and we note the dividend/divisor structure is closely related to temperature scaling except that the scale depends on the input instead of a tuned hyperparameter. Second, we build on the input preprocessing method from ODIN  and develop an effective strategy to tune its perturbation magnitude (which is a hyperparameter of the preprocessing method) with only in-distribution data.
We then perform extensive evaluations on benchmark image datasets such as CIFAR10/100, TinyImageNet, LSUN, SVHN, as well as a larger scale dataset DomainNet, for investigating the conditions under which the proposed strategies do or do not work. The results show that the two strategies can significantly improve upon ODIN, achieving a performance close to, and in some cases surpassing, state-of-the-art methods which use out-of-distribution data for tuning. Lastly, our systematic evaluation with DomainNet reveals the relative difficulties between two types of distribution shift: semantic shift and non-semantic shift, which are defined by whether a shift is related to the inclusion of new semantic categories.
In summary, the contribution of this paper is three-fold:
A new perspective of decomposed confidence for motivating a set of classifier designs that consider the closed-world assumption.
A modified input preprocessing method without tuning on OoD data.
Comprehensive analysis with experiments under the setting of learning without OoD data.
This work considers the OoD detection setting in classification problems. We begin with a dataset $D_{in} = \{(x_i, y_i)\}_{i=1}^{N}$, denoting in-distribution data $x_i$ and categorical label $y_i \in \{1, \dots, C\}$ for $C$ classes. $D_{in}$ is generated by sampling from a distribution $p_{in}(x, y)$. We then have a discriminative model $f_\theta(x)$ with parameters $\theta$ learned with the in-domain dataset $D_{in}$, predicting the class posterior probability $p(y|x)$.
When the learned classifier $f_\theta(x)$ is deployed in the open world, it may encounter data drawn from a different distribution $p_{out}$ such that $p_{out} \neq p_{in}$. Sampling from all possible distributions $p_{out}$ that might be encountered is generally intractable, especially when the dimension is large, such as in the case of image data. Note also that we can conceptually categorize the type of differences into non-semantic shift and semantic shift. Data with non-semantic shift is drawn from a distribution whose inputs look different while the label set stays the same. Examples with this shift come from the same object class but are presented in different forms, such as cartoon or sketch images. Such shift is also a scenario widely discussed in the problem of domain adaptation [30, 31]. In the case of semantic shift, the data is drawn from a distribution whose label set $\bar{y}$ satisfies $\bar{y} \cap y = \emptyset$. In other words, the data is from a class not seen in the training set $D_{in}$. Figure 2 has an illustration.
The above separation leads to two natural questions that must be answered for a model to work in an open world: How can the model avoid making a prediction when encountering an input with semantic shift, or reject a low confidence prediction when the input has non-semantic shift? In this work, we propose to introduce an explicit binary domain variable $d \in \{d_{in}, d_{out}\}$ in order to represent this decision, with $d_{in}$ meaning that the input is $x \sim p_{in}$ while $d_{out}$ means $x \nsim p_{in}$ (or equivalently $x \sim p_{out}$). Note that while generally the model cannot distinguish between the two cases we defined, we can still show that both of the questions above can be answered by estimating this single variable $d$.
The ultimate goal, then, is to find a scoring function $S(x)$ which correlates to the domain posterior probability $p(d_{in}|x)$, in that a higher score $s$ from $S(x)$ indicates a higher probability $p(d_{in}|x)$. The binary decision can now be made by applying a threshold on $s$. Selecting such a threshold is subject to the application requirement or the performance metric calculation protocol. With the above notation, we can view the baseline method as a special case with a specific scoring function $S_{Base}(x) = \max_y p(y|x)$, where $p(y|x)$ is obtained from a standard neural network classifier $f_\theta$ trained with cross-entropy loss. However, $S(x)$ can become a learnable parameterized function, and different OoD methods can then be categorized by specific parameterizations and learning procedures. A key differentiator between methods is whether the parameters are learned with or without OoD data.
This section describes the two methods that are the most related to our work: ODIN and Mahalanobis. These two methods will serve as strong baselines in our evaluation, especially since Mahalanobis has further been shown to have significant advantages over ODIN. Note that both ODIN and Mahalanobis start from a vanilla classifier $f_\theta$ trained on $D_{in}$, then have a scoring function $S(x)$ which has extra parameters to be tuned. In their original work, those parameters are specifically tuned for each OoD dataset. Here we will describe methods to use them without tuning on OoD data.
ODIN comprises two strategies: temperature scaling and input preprocessing. The temperature scaling is applied to its scoring function, which has $f_i(x)$ for the logit of the $i$-th class:

$$S_{ODIN}(x) = \max_i \frac{\exp(f_i(x)/T)}{\sum_{j=1}^{C} \exp(f_j(x)/T)}$$

Although ODIN originally involved tuning the hyperparameter $T$ with out-of-distribution data, it was also shown that a large $T$ value can generally be preferred, suggesting that the gain is saturated after 1000. We follow this guidance and fix $T = 1000$ in our experiments.
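The temperature-scaled score described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `max_softmax_score` and the toy logits are hypothetical and are not taken from the paper's code. It returns the maximum softmax probability after dividing the logits by a temperature, so a temperature of 1 gives the baseline score and a temperature of 1000 gives the ODIN-style score without input preprocessing.

```python
import numpy as np

def max_softmax_score(logits, temperature=1.0):
    """Maximum temperature-scaled softmax probability for each row of logits."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return p.max(axis=-1)

# Toy logits (hypothetical): one peaked row and one nearly flat row.
logits = np.array([[8.0, 1.0, 0.5],
                   [2.0, 1.9, 1.8]])
baseline = max_softmax_score(logits, temperature=1.0)   # baseline score
odin = max_softmax_score(logits, temperature=1000.0)    # temperature-scaled score
```

With a very large temperature all scores move close to 1/C, so what matters for detection is the ranking of the scores rather than their absolute values.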
Mahalanobis comprises two parts as well: Mahalanobis distance calculation and input preprocessing. The score is calculated with Mahalanobis distance as follows:

$$S_{Maha}(x) = \sum_{\ell} \alpha_\ell \max_i \, -\left(f^{\ell}(x) - \mu_i^{\ell}\right)^{\top} \Sigma_\ell^{-1} \left(f^{\ell}(x) - \mu_i^{\ell}\right)$$

The $f^{\ell}(x)$ represents the output features at the $\ell$-th layer of the neural network, while $\mu_i^{\ell}$ and $\Sigma_\ell$ are the class mean representation and the covariance matrix, correspondingly. The hyperparameter is $\alpha_\ell$. In the original method, $\alpha_\ell$ is regressed with a small validation set containing both in-distribution and out-of-distribution data. Therefore they have a set of $\alpha_\ell$ tuned for each OoD dataset. As a result, for the baseline that does not tune on OoD data we use uniform weighting, with every $\alpha_\ell$ set to the same value.
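A minimal sketch of the Mahalanobis score with uniform layer weights follows. It assumes the class means and a shared covariance matrix per layer have already been estimated from in-distribution data; the function names and the toy numbers are hypothetical, and input preprocessing is left out.

```python
import numpy as np

def mahalanobis_layer_score(feats, class_means, cov):
    """Largest negative squared Mahalanobis distance to any class mean.

    feats: (n, d) features, class_means: (C, d), cov: (d, d) shared covariance.
    """
    prec = np.linalg.inv(cov)
    diff = feats[:, None, :] - class_means[None, :, :]    # (n, C, d)
    dist = np.einsum("ncd,de,nce->nc", diff, prec, diff)  # (n, C) squared distances
    return (-dist).max(axis=1)

def mahalanobis_score(layer_feats, layer_means, layer_covs, alphas=None):
    """Weighted sum of per-layer scores; uniform weights when alphas is None."""
    if alphas is None:
        alphas = np.ones(len(layer_feats))  # uniform weighting, no OoD data needed
    per_layer = [mahalanobis_layer_score(f, m, c)
                 for f, m, c in zip(layer_feats, layer_means, layer_covs)]
    return sum(a * s for a, s in zip(alphas, per_layer))

# Toy single-layer example (hypothetical numbers).
means = np.array([[0.0, 0.0], [4.0, 4.0]])
cov = np.eye(2)
feats = np.array([[0.1, 0.0],      # close to the first class mean
                  [10.0, -10.0]])  # far from both class means
scores = mahalanobis_score([feats], [means], [cov])
```

A point close to a class mean receives a score near zero, while a point far from every class mean receives a strongly negative score.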
Note that both methods use the input preprocessing strategy, which has a hyperparameter $\epsilon$ to be tuned. In their original works, this hyperparameter is tuned for each OoD dataset as well. Therefore we develop a version that does not require tuning with out-of-distribution data.
Prior work observed that the softmax classifier tends to output a highly confident prediction, reporting that “random Gaussian noise fed into an MNIST image classifier gives a predicted class probability of 91%”. They attribute this to the use of the softmax function, which is a smooth approximation of an indicator function, hence tending to give a spiky distribution instead of a uniform distribution over classes. We acknowledge this view and further consider it as a limitation in the design of the softmax classifier. To address this limitation, our inspiration starts from reconsidering its outputs, the class posterior probability $p(y|x)$, which does not consider the domain $d$ at all. In other words, current methods condition on domain $d_{in}$ based on the implicit closed world assumption. Thus, we use our explicit variable $d_{in}$ in the classifier, rewriting it as the quotient of the joint class-domain probability and the domain probability using the rule of conditional probability:

$$p(y|d_{in}, x) = \frac{p(y, d_{in}|x)}{p(d_{in}|x)} \qquad (4)$$
Equation 4 provides a probabilistic view of why classifiers tend to be overconfident. Consider an example $x \sim p_{out}$: it is natural to expect that the joint probability $p(y, d_{in}|x)$ is low for its maximum value among $C$ classes. One would also expect its domain probability $p(d_{in}|x)$ to be low. Therefore, calculating $p(y|d_{in}, x)$ with Equation 4 divides one small value by another and can give a high probability, demonstrating how overconfidence can result. Based on the form of Equation 4, we call $p(y, d_{in}|x)$ and $p(d_{in}|x)$ the decomposed confidence scores.
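The overconfidence argument can be checked with a short arithmetic example. The two input values below are hypothetical and chosen only for illustration: a small joint probability divided by a small domain probability yields a large conditional class probability.

```python
# Hypothetical values for an out-of-distribution input x.
joint_max = 0.09  # assumed max over classes of p(y, d_in | x)
domain = 0.10     # assumed p(d_in | x)

# Quotient of the two scores: p(y | d_in, x) = p(y, d_in | x) / p(d_in | x)
conditional = joint_max / domain
print(f"p(y | d_in, x) = {conditional:.2f}")
```

Even though both decomposed scores signal that the input is unlikely to be in-distribution, the quotient is close to 0.9, which is the kind of confident prediction a standard softmax classifier reports.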
One straightforward solution for the above issue is to learn a classifier to predict the joint probability $p(y, d_{in}|x)$ by having supervision on both class $y$ and domain $d$. Learning to predict $p(y, d_{in}|x)$ is preferred over $p(y|d_{in}, x)$ because it can serve both purposes: predicting a class by $\arg\max_y p(y, d_{in}|x)$ and rejecting a prediction by thresholding. This idea relates to prior work that adds an extra loss term to penalize a predicted non-uniform class probability when out-of-distribution data is given to the classifier. However, this strategy requires out-of-distribution data for regularizing the training.
Without having supervision on domain $d$ (i.e. without out-of-distribution data), there is no principled way to learn $p(y, d_{in}|x)$ and $p(d_{in}|x)$. This situation is similar to unsupervised learning (or self-supervised learning) in that we need to insert assumptions or prior knowledge about the task for learning. In our case, we use the dividend/divisor structure in Equation 4 as the prior knowledge to design the structure of classifiers, providing classifiers a capacity to decompose the confidence of class probability.
In the dividend/divisor structure for classifiers, we define the logit $f_i(x)$ for class $i$, which is the division between two functions $h_i(x)$ and $g(x)$:

$$f_i(x) = \frac{h_i(x)}{g(x)} \qquad (5)$$

The quotient of the two scores is then normalized by the exponential function (i.e. softmax) for outputting a class probability $p(y=i|x)$, which is subject to cross-entropy loss.
With the exponential normalization effect of softmax, the cross-entropy loss can be minimized in two ways: increasing $h_i(x)$ or decreasing $g(x)$. In other words, when the data is not in the high-density region of in-distribution, $h_i(x)$ may tend towards smaller values. In such a case, $g(x)$ is encouraged to be small so that the resulting logits $f_i(x)$ can further minimize the cross-entropy loss. In the other case, when the data is in the high-density region, $h_i(x)$ can generally reach a higher value more easily, thus its corresponding $g(x)$ value is less encouraged to go small. The discussed interaction between $h_i(x)$ and $g(x)$ is the major driving force to encourage $h_i(x)$ to behave similarly to $p(y, d_{in}|x)$ and $g(x)$ to behave similarly to $p(d_{in}|x)$.
Although the dividend/divisor structure provides a tendency, it does not necessarily guarantee the decomposed confidence effect to happen. The characteristics of $h_i(x)$ and $g(x)$ can largely affect how likely the decomposition is to happen. Therefore we discuss a set of simple design choices to investigate whether such decomposition is generally obtainable.
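Specifically we have $g(x) = \sigma(BN(w_g f^p(x) + b_g))$, which passes features $f^p(x)$ from the penultimate layer of the neural network sequentially through another linear layer, batch normalization ($BN$, optional for a faster convergence), and a sigmoid function $\sigma$. The $w_g$ and $b_g$ represent the learnable weights. For $h_i(x)$, we investigate three similarity measures, including inner-product (I), negative Euclidean distance (E), and cosine similarity (C) for $h_i^I(x)$, $h_i^E(x)$, and $h_i^C(x)$, correspondingly:

$$h_i^I(x) = w_i^{\top} f^p(x) + b_i$$

$$h_i^E(x) = -\left\| f^p(x) - w_i \right\|^2$$

$$h_i^C(x) = \frac{w_i^{\top} f^p(x)}{\left\| w_i \right\| \left\| f^p(x) \right\|}$$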
The overall neural network model therefore has two branches ($h_i(x)$ and $g(x)$) after its penultimate layer (see Figure 1). At training time, the model calculates the logit $f_i(x)$ followed by the softmax function with cross-entropy loss on top of it. At testing time, the class prediction can be made by either calculating $\arg\max_i f_i(x)$ or $\arg\max_i h_i(x)$ (both will give the same predictions). For out-of-distribution detection, we use the scoring function $S_{DeConf}(x) = \max_i h_i(x)$ or $g(x)$.
Note that when $g(x) = 1$ and $h_i(x) = h_i^I(x)$, this method reduces to the baseline. We call the three variants of our method DeConf-I, DeConf-E, and DeConf-C. For simplicity, the above names represent using $h_i(x)$ for the scores. The use of $g(x)$ will be indicated specifically.
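The two-branch head can be sketched as a NumPy forward pass. This is a simplified illustration, not the repository's TensorFlow code: the function name `deconf_logits` and all parameter names are hypothetical, batch normalization in the divisor branch is omitted, and the Euclidean variant is assumed to use the squared distance.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deconf_logits(feat, w_h, b_h, w_g, b_g, variant="C"):
    """Dividend/divisor head on penultimate features.

    feat: (n, d), w_h: (C, d) class weights, b_h: (C,), w_g: (d,), b_g: scalar.
    Returns the logits h / g together with h and g.
    """
    if variant == "I":    # inner product
        h = feat @ w_h.T + b_h
    elif variant == "E":  # negative squared Euclidean distance (assumed squared)
        h = -((feat[:, None, :] - w_h[None, :, :]) ** 2).sum(axis=-1)
    elif variant == "C":  # cosine similarity
        feat_n = feat / np.linalg.norm(feat, axis=1, keepdims=True)
        w_n = w_h / np.linalg.norm(w_h, axis=1, keepdims=True)
        h = feat_n @ w_n.T
    else:
        raise ValueError(f"unknown variant: {variant}")
    g = sigmoid(feat @ w_g + b_g)[:, None]  # (n, 1); batch norm omitted in this sketch
    return h / g, h, g

# Random toy parameters (hypothetical), 4 samples, 8 features, 3 classes.
rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 8))
w_h, b_h = rng.normal(size=(3, 8)), np.zeros(3)
w_g, b_g = rng.normal(size=8), 0.0
logits, h, g = deconf_logits(feat, w_h, b_h, w_g, b_g, variant="C")
ood_score_h = h.max(axis=1)  # score based on the dividend branch
ood_score_g = g[:, 0]        # score based on the divisor branch
```

Because the divisor is a positive scalar per input, dividing by it does not change which class has the largest value, which is why predictions from the logits and from the dividend branch agree.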
The $g(x)$ in Equation 5 can be immediately viewed as a learned temperature scaling function, as discussed in prior work and a concurrent report. However, our experiment results strongly suggest that $g(x)$ is more than a scale. The $g(x)$ achieves an OoD detection performance significantly better than baselines in many experiments, indicating its potential in estimating $p(d_{in}|x)$. More importantly, temperature scaling is generally used as a numerical trick for learning a better embedding, softening the prediction, or calibrating the confidence. Our work provides a probabilistic view for its effect, indicating that such a temperature might relate to how strongly a classifier assumes a closed world as a prior.
This section describes a modified version of the input preprocessing method proposed in ODIN. The primary purpose of the modification is to make the search of the perturbation magnitude independent of out-of-distribution data. The perturbation of input is given by:

$$\hat{x} = x - \epsilon \, \mathrm{sign}\left(-\nabla_x S(x)\right)$$
In the original method the best value of $\epsilon$ is searched with a half-half mixed validation dataset of $D_{in}^{val}$ and $D_{out}^{val}$ over a list of 21 values. The perturbed images are fed into the classification model for calculating the score $S(\hat{x})$. The performance of each magnitude is evaluated with the benchmark metric (TNR@TPR95, described later) and the best one is selected. This process repeats for each out-of-distribution dataset, and therefore the original method results in a number of $\epsilon$ values equal to the number of out-of-distribution datasets in the benchmark.
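In our method, we search for the $\epsilon^*$ which maximizes the score $S(\hat{x})$ with only the in-distribution validation dataset $D_{in}^{val}$:

$$\epsilon^* = \arg\max_{\epsilon} \sum_{x \in D_{in}^{val}} S(\hat{x})$$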
Our search criterion is still based on the observation made in ODIN: in-distribution images tend to have their score increased more than out-of-distribution images when the input perturbation is applied. We therefore use Eq. 10 since we argue that an $\epsilon$ which makes a large score increase for in-distribution data should be sufficient to create a distinction in score. Our method also does not require class labels, although they are available in $D_{in}^{val}$. More importantly, our method selects only one $\epsilon$ based on $D_{in}^{val}$ without access to the benchmark performance metric (e.g. TNR@TPR95), which largely prevents the hyperparameter from fitting to a specific benchmark score. Lastly, we perform the search of $\epsilon^*$ on a much coarser grid, which has only 6 values. Therefore, our search is much faster. Although overshooting is possible (e.g. the maximum value is at the middle of two scales in the grid) due to the coarser grid, it can be mitigated by reducing the found magnitude by one scale (i.e. dividing it by two). This simple strategy consistently gains or maintains the performance on varied scoring functions, such as $S_{Base}$, $S_{ODIN}$, $S_{Maha}$, and $S_{DeConf}$.
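The search over the perturbation magnitude can be sketched with a toy model whose gradient is available in closed form. Everything below is an assumption made for illustration: the linear softmax classifier, the function names, and the six-value doubling grid are hypothetical, and a real implementation would obtain the input gradient from the framework's automatic differentiation instead.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def score_and_grad(x, W):
    """Max softmax score of a toy linear classifier and its gradient w.r.t. x."""
    p = softmax(x @ W.T)                # (n, C)
    k = p.argmax(axis=1)
    s = p[np.arange(len(x)), k]
    grad = s[:, None] * (W[k] - p @ W)  # d p_k / dx = p_k (W_k - sum_j p_j W_j)
    return s, grad

def search_epsilon(x_val, W, grid):
    """Pick the magnitude that maximizes the summed score on in-distribution data."""
    _, grad = score_and_grad(x_val, W)
    totals = []
    for eps in grid:
        x_hat = x_val - eps * np.sign(-grad)  # perturb inputs to raise the score
        s_hat, _ = score_and_grad(x_hat, W)
        totals.append(float(s_hat.sum()))
    return grid[int(np.argmax(totals))], totals

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))                      # toy classifier weights
x_val = rng.normal(size=(32, 5))                 # stand-in for in-distribution validation data
grid = [0.0025, 0.005, 0.01, 0.02, 0.04, 0.08]   # hypothetical doubling grid
best_eps, totals = search_epsilon(x_val, W, grid)
final_eps = best_eps / 2                         # optional one-scale reduction against overshooting
```

No out-of-distribution samples, class labels, or benchmark metrics enter the search; only the scores of perturbed in-distribution inputs are compared.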
The method in this section is orthogonal to all the methods evaluated in this work. For convenience, we will add a * after the name of other methods to indicate a combination, for example, Baseline* and DeConf-C*.
Table 1: OoD detection performance of methods trained without OoD data. Each cell lists Baseline / ODIN* / Mahalanobis* / DeConf-C*.
|ID||OoD||AUROC||TNR@TPR95|
|CIFAR-100||Imagenet(c)||79.0 / 90.5 / 92.4 / 97.6||25.3 / 56.0 / 63.5 / 87.8|
|CIFAR-100||Imagenet(r)||76.4 / 91.1 / 96.4 / 98.6||22.3 / 59.4 / 82.0 / 93.3|
|CIFAR-100||LSUN(c)||78.6 / 89.9 / 81.2 / 95.3||23.0 / 53.0 / 31.6 / 75.0|
|CIFAR-100||LSUN(r)||78.2 / 93.0 / 96.6 / 98.7||23.7 / 64.0 / 82.6 / 93.8|
|CIFAR-100||iSUN||76.8 / 91.6 / 96.5 / 98.4||21.5 / 58.4 / 81.2 / 92.5|
|CIFAR-100||SVHN||78.1 / 85.6 / 89.9 / 95.9||18.9 / 35.3 / 43.3 / 77.0|
|CIFAR-100||Uniform||65.0 / 91.4 / 100. / 99.9||2.95 / 66.1 / 100. / 100.|
|CIFAR-100||Gaussian||48.0 / 62.0 / 100. / 99.9||0.06 / 33.3 / 100. / 100.|
|CIFAR-10||Imagenet(c)||92.1 / 88.2 / 96.3 / 98.7||50.0 / 47.8 / 81.2 / 93.4|
|CIFAR-10||Imagenet(r)||91.5 / 90.1 / 98.2 / 99.1||47.4 / 51.9 / 90.9 / 95.8|
|CIFAR-10||LSUN(c)||93.0 / 91.3 / 92.2 / 98.3||51.8 / 63.5 / 64.2 / 91.5|
|CIFAR-10||LSUN(r)||93.9 / 92.9 / 98.2 / 99.4||56.3 / 59.2 / 91.7 / 97.6|
|CIFAR-10||iSUN||93.0 / 92.2 / 98.2 / 99.4||52.3 / 57.2 / 90.6 / 97.5|
|CIFAR-10||SVHN||88.1 / 89.6 / 98.0 / 98.8||40.5 / 48.7 / 90.6 / 94.0|
|CIFAR-10||Uniform||95.4 / 98.9 / 99.9 / 99.9||59.9 / 98.1 / 100. / 100.|
|CIFAR-10||Gaussian||94.0 / 98.6 / 100. / 99.9||48.8 / 92.1 / 100. / 100.|
Overall procedure: In all experiments, we first train a classifier $f_\theta$ on an in-distribution training set, then tune the hyperparameters (e.g. the perturbation magnitude $\epsilon$) on an in-distribution validation set without using its class labels. At testing time, the OoD detection scoring function $S(x)$ calculates the scores from the outputs of $f_\theta$. The scores are calculated for both the in-distribution validation set $D_{in}^{val}$ and the out-of-distribution dataset $D_{out}$. The scores are then sent to a performance metric calculation function. The above procedure is the same as related works in this line of research [21, 20, 14, 34, 38, 19], except that we do not use OoD data for tuning the hyperparameters in the scoring function $S(x)$.
In-distribution Datasets: We use SVHN, CIFAR-10, and CIFAR-100 images for the classification task. Detecting OoD with a CIFAR-100 classifier is generally harder than with CIFAR-10 and SVHN, since a larger number of classes usually involves a wider range of variance, and thus the classifier has a higher tendency to treat random data (e.g. Gaussian noise) as in-distribution. For that reason, we use CIFAR-100 in our ablation and robustness study.
Out-of-distribution Datasets: We include all the OoD datasets used in ODIN , which are TinyImageNet(crop), TinyImageNet(resize), LSUN(crop), LSUN(resize), iSUN, Uniform random images, and Gaussian random images. We further add SVHN, a colored street numbers image dataset, to serve as a difficult OoD dataset. The selection is inspired by the finding in the line of works that uses a generative model for OoD detection [32, 26, 4]. Those works report that a generative model of CIFAR-10 assigns higher likelihood to SVHN images, indicating a hard case for OoD detection.
Networks and training details: We use DenseNet, ResNet, and WideResNet for the classifier backbone. DenseNet has 100 layers with a growth rate of 12. It is trained with batch size 64 for 300 epochs with weight decay 0.0001. The ResNet and WideResNet-28-10 are trained with batch size 128 for 200 epochs with weight decay 0.0005. In both training setups, the optimizer is SGD with momentum 0.9, and the learning rate starts at 0.1 and decreases by a factor of 0.1 at 50% and 75% of the training epochs. Note that we do not apply weight decay to the weights in the $h_i(x)$ function of the DeConf classifier since they work as the centroids for classes, and those weights are initialized with He-initialization. In the robustness analysis, the model may be indicated to have an extra regularization. In such a case, we additionally apply a dropout rate of 0.7 at the inputs of the dividend/divisor structure.
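The learning rate schedule described above can be written as a small helper. This is a sketch under stated assumptions: the function name `step_lr` is hypothetical, epochs are counted from zero, and each decay is assumed to take effect at the start of the milestone epoch.

```python
def step_lr(epoch, total_epochs, base_lr=0.1, factor=0.1):
    """Learning rate that is multiplied by `factor` at 50% and 75% of training."""
    lr = base_lr
    if epoch >= 0.5 * total_epochs:
        lr *= factor
    if epoch >= 0.75 * total_epochs:
        lr *= factor
    return lr

# Schedule for a 200-epoch run: 0.1, then 0.01 from epoch 100, then 0.001 from epoch 150.
schedule = [step_lr(e, total_epochs=200) for e in range(200)]
```

A function of this shape can be passed to a per-epoch scheduler callback in most training frameworks.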
We use the two most widely adopted metrics in the OoD detection literature. The first one is the area under the receiver operating characteristic curve (AUROC), which plots the true positive rate (TPR) of in-distribution data against the false positive rate (FPR) of OoD data by varying a threshold. Thus it can be regarded as an averaged score. The second one is true negative rate at 95% true positive rate (TNR@TPR95), which simulates an application requirement that the recall of in-distribution data should be 95%. Having a high TNR under a high TPR is much more challenging than having a high AUROC score; thus TNR@TPR95 can discern between high-performing OoD detectors better.
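Both metrics can be computed directly from two arrays of scores. The sketch below is illustrative only: the function names are hypothetical, AUROC is computed through its pairwise-ranking interpretation rather than by tracing the curve, and the TPR95 threshold is taken as the 5th percentile of in-distribution scores, which is one of several reasonable conventions.

```python
import numpy as np

def auroc(scores_in, scores_out):
    """Probability that an in-distribution score exceeds an OoD score (ties count half)."""
    s_in = np.asarray(scores_in, dtype=np.float64)[:, None]
    s_out = np.asarray(scores_out, dtype=np.float64)[None, :]
    return float((s_in > s_out).mean() + 0.5 * (s_in == s_out).mean())

def tnr_at_tpr95(scores_in, scores_out):
    """Fraction of OoD samples rejected when about 95% of in-distribution samples are accepted."""
    threshold = np.percentile(scores_in, 5)  # about 95% of in-distribution scores lie at or above it
    return float((np.asarray(scores_out) < threshold).mean())

# Toy scores (hypothetical): perfectly separated in-distribution and OoD sets.
s_in = np.linspace(0.5, 1.0, 101)
s_out = np.linspace(0.0, 0.4, 50)
auc = auroc(s_in, s_out)
tnr = tnr_at_tpr95(s_in, s_out)
```

The pairwise form costs memory proportional to the product of the two set sizes, so for large evaluation sets a sort-based implementation is preferable.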
Table 2: Comparison with the original ODIN and Mahalanobis methods tuned on each OoD dataset. Each cell lists ODIN / Maha / ODIN* / Maha* / DeConf-C*.
|ID||OoD||AUROC||TNR@TPR95|
|CIFAR-100||Imagenet(r)||85.2 / 97.4 / 91.1 / 96.4 / 98.6||42.6 / 86.6 / 59.4 / 82.0 / 93.3|
|CIFAR-100||LSUN(r)||85.5 / 98.0 / 93.0 / 96.6 / 98.7||41.2 / 91.4 / 64.0 / 82.6 / 93.8|
|CIFAR-100||SVHN||93.8 / 97.2 / 85.6 / 89.9 / 95.9||70.6 / 82.5 / 35.3 / 43.3 / 77.0|
|CIFAR-10||Imagenet(r)||98.5 / 98.8 / 90.1 / 98.2 / 99.1||92.4 / 95.0 / 51.9 / 90.9 / 95.8|
|CIFAR-10||LSUN(r)||99.2 / 99.3 / 92.9 / 98.2 / 99.4||96.2 / 97.2 / 59.2 / 91.7 / 97.6|
|CIFAR-10||SVHN||95.5 / 98.1 / 89.6 / 98.0 / 98.8||86.2 / 90.8 / 48.7 / 90.6 / 94.0|
OoD benchmark performance: We show an overall comparison for methods that train without OoD data in Table 1 with 8 OoD benchmark datasets. The ODIN* and Mahalanobis* are significantly better than the baseline, while DeConf-C* still outperforms them with a significant margin. These results clearly show that learning OoD detection without OoD data is feasible, and the two methods we proposed in Sections 3.1 and 3.2 combined are very effective for this purpose.
In Table 2 we further compare our results with the original ODIN and Mahalanobis methods, which are tuned on each OoD dataset. We refer to the results of both original methods as reported in prior work that uses the same backbone network, OoD datasets, and metrics to evaluate OoD detection performance. In the comparison, we find our ODIN* and Mahalanobis* perform worse than the original ODIN and Mahalanobis in a major fraction of the cases. The result is not surprising because the original methods gain an advantage from using OoD data. However, our DeConf-C* still outperforms the two original methods in many of the cases. The cross-setting comparison further supports the effectiveness of the proposed strategies.
Ablation Study: We study the effect of applying DeConf and our modified input preprocessing (IPP) strategy separately. Figure 3 shows that both $h(x)$ and $g(x)$ from all three variants (I, E, C) of the DeConf strategy help OoD detection performance with the CIFAR-10 and SVHN classifiers, showing that the concept of DeConf is generally effective. However, the failure of DeConf-I and $g(x)$ with the CIFAR-100 classifier in Figure 4(a) may indicate that these functions have different robustness and scalability, which we will investigate in the next section. One downside of using the DeConf strategy is that the accuracy of the classifier may slightly reduce in the case of CIFAR-100 compared to a vanilla classifier. This could be a natural consequence of having an alternative term, i.e. $g(x)$, in the model to fit the loss function. This may cause the lack of a high score for $h(x)$, instead of assigning a lower score for the data away from the high-density region of in-distribution data. We see this effect is reduced when the extra regularization (dropout rate 0.7) is applied. Please see Supplementary for the comparison of in-distribution classification accuracy.
In Figure 5, the results show that tuning the perturbation magnitude with only in-distribution data is an effective strategy, allowing us to reduce the required supervision for learning. The supervision here means the binary label for in/out-of-distribution.
Robustness Study: This study investigates when the OoD detection method will or will not work. Figure 6 shows that the number of in-distribution training samples can largely affect the performance of the OoD detector. Mahalanobis has the lowest data requirement, but the DeConf methods generally reach a higher performance in the high data regime. In Figure 6, we also examine scalability by varying the number of classes in the in-distribution data. In this test, DeConf-E* and DeConf-C* show the best scalability. Overall, DeConf-C* is more robust than the other two DeConf variants. Lastly, Figure 7 shows that high performing methods such as DeConf-E*, DeConf-C*, and Mahalanobis* are not sensitive to the type and depth of neural networks. Therefore, the number of in-distribution samples and classes are the main factors that affect OoD detection performance.
Enhancing the Robustness: The overfitting issue may be the cause of low OoD detection performance for some of the DeConf variants and $g(x)$. In Figure 4(b), the OoD detection performance is significantly increased with DeConf-I and all $g(x)$ when extra regularization (dropout rate 0.7) is applied. Figure 8 provides further analysis for DeConf-I and its $g(x)$ by varying the number of samples and classes in the training data. The performance with extra regularization is significantly better than the cases without it. Besides, the performance is also very similar between regularized $h(x)$ and $g(x)$, indicating that overfitting is an important issue. Lastly, we note that DeConf-E and DeConf-C have a reduced performance with extra regularization in Figure 4(b). It is an expected outcome since dropout generally harms the distance calculation between centroids and data because part of the feature is masked. The results indicate that the design of $h_i(x)$ (I, E, C) might not be optimal for the problem, leaving room for future work to find a robust pair of $h_i(x)$ and $g(x)$ for the OoD detection problem.
One interesting aspect of out-of-distribution data that has not been explored is the separation of semantic and non-semantic shift. We therefore use a larger scale image dataset, DomainNet, to repeat an evaluation similar to Table 1. DomainNet has high-resolution images in 345 classes from six different domains. Four domains in the dataset had class labels available when the experiments were conducted. They are real, sketch, infograph, and quickdraw, resulting in different types of distribution shifts.
To create subsets with semantic shift, we separate the classes into two splits. Split A has class indices from 0 to 172, while split B has 173 to 344. Our experiment uses real-A for in-distribution and has the other subsets for out-of-distribution. With the definition given in Section 2, real-B has a semantic shift from real-A, while sketch-A has a non-semantic shift. Sketch-B therefore has both types of distribution shift. Figure 2 illustrates the setup. The classifier learned on real-A uses a Resnet-34 backbone. Its training setting is described in Section 4.1 except that the networks are trained for 100 epochs with initial learning rate of 0.01, and the images are center-cropped and resized to 224x224 in this experiment.
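The split construction can be made concrete with a short sketch. The names below (`SPLIT_A`, `SPLIT_B`, `shift_type`) are hypothetical helpers, not part of DomainNet or of the paper's code; they only encode the class-index ranges and the two shift definitions stated above, with real-A as the in-distribution subset.

```python
NUM_CLASSES = 345
SPLIT_A = range(0, 173)    # class indices 0..172, used for the in-distribution split
SPLIT_B = range(173, 345)  # class indices 173..344, unseen classes

def shift_type(domain, class_index, id_domain="real"):
    """Return (semantic_shift, non_semantic_shift) relative to the real-A subset."""
    semantic = class_index in SPLIT_B  # class not seen during training
    non_semantic = domain != id_domain # same classes may appear in a different visual form
    return semantic, non_semantic

# Examples: real-B, sketch-A, and sketch-B samples.
print(shift_type("real", 200))    # semantic shift only
print(shift_type("sketch", 10))   # non-semantic shift only
print(shift_type("sketch", 300))  # both types of shift
```

A sample from real-A itself returns no shift of either type, which corresponds to in-distribution data.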
Table 3: OoD detection performance on DomainNet with real-A as in-distribution. S marks semantic shift and NS marks non-semantic shift. Each cell lists Baseline / ODIN* / Maha* / DeConf-C*.
|OoD||S||NS||AUROC||TNR@TPR95|
|real-B||✓|| ||75.1 / 69.9 / 53.6 / 69.8||15.3 / 15.4 / 5.09 / 14.0|
|sketch-A|| ||✓||75.5 / 80.7 / 59.5 / 84.5||20.1 / 31.2 / 7.30 / 37.5|
|sketch-B||✓||✓||81.8 / 85.7 / 60.4 / 89.1||25.2 / 36.8 / 7.55 / 44.1|
|infograph-A|| ||✓||79.6 / 82.7 / 81.5 / 89.0||23.5 / 27.8 / 21.6 / 45.4|
|infograph-B||✓||✓||82.1 / 85.3 / 80.9 / 90.9||24.8 / 31.7 / 21.9 / 49.6|
|quickdraw-A|| ||✓||78.8 / 96.4 / 67.4 / 96.9||21.1 / 79.9 / 3.38 / 83.1|
|quickdraw-B||✓||✓||80.5 / 96.9 / 66.1 / 97.4||22.1 / 83.6 / 2.38 / 86.6|
|Uniform||✓||✓||54.7 / 75.6 / 99.8 / 99.3||1.65 / 5.37 / 100. / 100.|
|Gaussian||✓||✓||71.3 / 95.5 / 99.9 / 99.4||0.64 / 46.9 / 100. / 100.|
The results in Table 3 reveal two interesting trends. The first one is that the OoD datasets with both types of distribution shifts are the easiest to detect, followed by those with non-semantic shift. The semantic shift turns out to be the hardest one to detect. The second observation is the failure of Mahalanobis*. In most cases it is even worse than Baseline, except when detecting random noise. In contrast, ODIN* has a performance gain in most of the cases, but has less gain with random noise. Our DeConf-C* still performs the best, showing that its robustness and scalability are capable of handling a more realistic problem setting, although there is still large room for improvement.
In this paper, we propose two strategies, the decomposed confidence and the modified input preprocessing. These two simple modifications to ODIN lead to a significant change in the paradigm, which does not need OoD data for tuning the method. Our comprehensive analysis shows that our strategies are effective and even better in several cases than the methods tuned for each OoD dataset. Our further analysis using a larger scale image dataset shows that the data with only semantic shift is harder to detect, pointing out a challenge for future works to address.
Dropout as a bayesian approximation: representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059.
Deep anomaly detection with outlier exposure. International Conference on Learning Representations (ICLR).
|CIFAR-100||32x32||100||DenseNet||Table 1,2; Figure 7||77.0±0.2||75.8±0.4||76.4±0.1||75.9±0.1|
|Baseline / ODIN* / Maha* / DeConf-I* / DeConf-E* / DeConf-C*|
|CIFAR-100||Imagenet(c)||79.0(2.2) /90.5(1.1) /92.4(0.3) /84.4(2.3) /95.1(0.5) /97.6(0.2)|
|Imagenet(r)||76.4(3.2) /91.1(1.3) /96.4(0.2) /81.2(3.6) /97.4(0.3) /98.6(0.2)|
|LSUN(c)||78.6(1.1) /89.9(0.5) /81.2(0.6) /91.7(0.3) /90.1(0.3) /95.3(0.4)|
|LSUN(r)||78.2(2.4) /93.0(0.8) /96.6(0.2) /84.1(2.1) /97.8(0.2) /98.7(0.0)|
|iSUN||76.8(2.7) /91.6(1.1) /96.5(0.2) /82.1(2.9) /97.4(0.2) /98.4(0.0)|
|SVHN||78.1(3.5) /85.6(0.0) /89.9(0.2) /89.7(0.4) /94.0(0.6) /95.9(0.7)|
|Uniform||65.0(22.) /91.4(10.) /100.(0.0) /48.5(16.) /99.9(0.0) /99.9(0.0)|
|Gaussian||48.0(28.) /62.0(38.) /100.(0.0) /6.79(4.9) /99.9(0.0) /99.9(0.0)|
|CIFAR-10||Imagenet(c)||92.1(1.0) /88.2(4.2) /96.3(0.1) /98.2(0.0) /98.0(0.2) /98.7(0.1)|
|Imagenet(r)||91.5(1.4) /90.1(4.1) /98.2(0.0) /98.4(0.0) /98.2(0.2) /99.1(0.1)|
|LSUN(c)||93.0(0.5) /91.3(2.0) /92.2(0.4) /98.4(0.0) /98.6(0.2) /98.3(0.2)|
|LSUN(r)||93.9(0.4) /92.9(2.9) /98.2(0.0) /98.6(0.0) /98.8(0.0) /99.4(0.1)|
|iSUN||93.0(0.7) /92.2(3.4) /98.2(0.0) /98.6(0.0) /98.8(0.0) /99.4(0.0)|
|SVHN||88.1(4.8) /89.6(0.3) /98.0(0.3) /98.2(0.2) /98.4(0.6) /98.8(0.1)|
|Uniform||95.4(0.7) /98.9(0.7) /99.9(0.0) /99.2(0.5) /99.9(0.0) /99.9(0.0)|
|Gaussian||94.0(2.9) /98.6(1.7) /100.(0.0) /99.1(0.3) /99.9(0.0) /99.9(0.0)|
|Baseline / ODIN* / Maha* / DeConf-I* / DeConf-E* / DeConf-C*|
|CIFAR-100||Imagenet(c)||25.3(2.8) /56.0(3.1) /63.5(2.1) /31.0(3.4) /74.6(2.8) /87.8(1.7)|
|Imagenet(r)||22.3(3.1) /59.4(3.7) /82.0(1.6) /21.4(4.0) /87.6(1.7) /93.3(1.2)|
|LSUN(c)||23.0(2.2) /53.0(1.0) /31.6(1.3) /59.6(1.9) /51.0(1.0) /75.0(1.9)|
|LSUN(r)||23.7(2.5) /64.0(3.0) /82.6(1.8) /21.1(3.3) /89.8(1.5) /93.8(0.3)|
|iSUN||21.5(2.8) /58.4(4.1) /81.2(1.4) /17.6(3.3) /87.3(1.2) /92.5(0.2)|
|SVHN||18.9(4.9) /35.3(2.9) /43.3(2.7) /52.0(0.6) /67.1(3.4) /77.0(5.0)|
|Uniform||2.95(4.1) /66.1(46.) /100.(0.0) /0.0(0.0) /100.(0.0) /100.(0.0)|
|Gaussian||0.06(0.0) /33.3(47.) /100.(0.0) /0.0(0.0) /100.(0.0) /100.(0.0)|
|CIFAR-10||Imagenet(c)||50.0(2.8) /47.8(15.) /81.2(0.8) /92.0(0.2) /90.1(1.5) /93.4(1.2)|
|Imagenet(r)||47.4(4.4) /51.9(16.) /90.9(0.5) /93.6(0.2) /91.7(1.6) /95.8(0.9)|
|LSUN(c)||51.8(3.1) /63.5(7.8) /64.2(0.6) /92.5(0.4) /93.3(1.5) /91.5(1.2)|
|LSUN(r)||56.3(3.6) /59.2(18.) /91.7(0.3) /94.9(0.2) /95.7(0.1) /97.6(0.5)|
|iSUN||52.3(3.6) /57.2(18.) /90.6(0.7) /94.6(0.3) /95.4(0.2) /97.5(0.3)|
|SVHN||40.5(6.9) /48.7(3.2) /90.6(1.7) /91.4(1.1) /92.1(3.4) /94.0(0.6)|
|Uniform||59.9(12.) /98.1(2.6) /100.(0.0) /99.9(0.0) /100.(0.0) /100.(0.0)|
|Gaussian||48.8(26.) /92.1(11.) /100.(0.0) /99.9(0.0) /100.(0.0) /100.(0.0)|
Performance of six OoD detection methods on eight benchmark datasets. This is the full version of Table 1 in the main paper, which uses DenseNet as the backbone network. The values in parentheses are standard deviations.
|Baseline / ODIN* / Maha* / DeConf-I* / DeConf-E* / DeConf-C*|
|CIFAR-100||Imagenet(c)||78.9(0.1) /84.8(0.6) /93.4(0.3) /88.2(0.6) /95.2(0.6) /95.3(0.6)|
|Imagenet(r)||75.1(0.8) /85.7(0.2) /96.3(0.1) /84.6(1.0) /97.0(0.4) /95.9(0.7)|
|LSUN(c)||78.8(0.6) /80.3(1.3) /79.8(0.3) /93.8(0.3) /92.6(0.2) /93.8(0.3)|
|LSUN(r)||76.2(1.4) /86.6(0.8) /96.3(0.2) /85.9(1.8) /97.0(0.7) /96.1(0.5)|
|iSUN||75.2(1.4) /85.9(0.8) /95.8(0.2) /84.7(1.4) /96.6(0.6) /95.7(0.5)|
|SVHN||75.1(2.5) /80.2(2.0) /80.9(1.1) /89.2(2.6) /93.8(0.8) /93.2(1.1)|
|Uniform||69.0(13.) /96.7(2.5) /100.(0.0) /79.3(8.3) /99.9(0.0) /99.9(0.0)|
|Gaussian||51.5(1.8) /93.7(1.7) /99.9(0.0) /60.8(23.) /99.9(0.0) /99.9(0.0)|
|CIFAR-10||Imagenet(c)||90.0(0.9) /81.2(2.4) /94.2(0.1) /98.2(0.2) /98.2(0.1) /96.0(0.2)|
|Imagenet(r)||87.3(1.3) /81.1(2.9) /96.5(0.1) /98.1(0.3) /98.1(0.3) /96.1(0.5)|
|LSUN(c)||92.0(1.7) /77.9(4.6) /87.7(0.2) /98.8(0.1) /98.5(0.0) /97.2(0.1)|
|LSUN(r)||91.6(1.2) /88.5(2.0) /97.2(0.1) /98.9(0.2) /99.0(0.1) /98.0(0.1)|
|iSUN||90.1(1.4) /86.1(2.5) /96.5(0.2) /98.8(0.2) /98.9(0.1) /97.6(0.1)|
|SVHN||87.7(2.4) /63.9(4.3) /87.8(1.6) /96.8(0.4) /96.1(1.4) /97.8(0.3)|
|Uniform||85.9(10.) /93.3(4.5) /99.9(0.0) /99.6(0.1) /99.9(0.0) /99.9(0.0)|
|Gaussian||89.9(10.) /97.1(2.0) /99.9(0.0) /99.7(0.0) /99.9(0.0) /99.9(0.0)|
|Baseline / ODIN* / Maha* / DeConf-I* / DeConf-E* / DeConf-C*|
|CIFAR-100||Imagenet(c)||24.1(0.6) /44.0(2.2) /68.2(1.4) /42.6(2.7) /73.4(3.7) /72.6(3.7)|
|Imagenet(r)||19.4(0.1) /45.5(1.4) /82.6(0.8) /30.4(3.0) /84.3(2.7) /76.5(3.8)|
|LSUN(c)||21.9(0.4) /34.8(2.4) /27.7(1.4) /66.1(2.2) /59.7(0.7) /65.7(2.3)|
|LSUN(r)||19.8(1.6) /48.2(3.0) /81.8(1.4) /29.4(5.2) /84.6(4.0) /76.8(3.3)|
|iSUN||17.7(0.5) /45.3(2.8) /80.4(0.8) /27.1(4.3) /83.0(3.1) /75.3(3.3)|
|SVHN||16.6(1.5) /27.5(5.0) /25.7(2.6) /43.7(10.) /60.8(5.3) /55.1(7.1)|
|Uniform||5.63(7.0) /76.4(27.) /100.(0.0) /4.11(5.8) /100.(0.0) /100.(0.0)|
|Gaussian||0.0(0.0) /46.6(20.) /100.(0.0) /0.06(0.0) /100.(0.0) /100.(0.0)|
|CIFAR-10||Imagenet(c)||54.6(2.6) /53.7(3.1) /74.6(0.6) /90.8(1.5) /91.1(0.9) /81.1(1.7)|
|Imagenet(r)||48.3(3.2) /53.1(4.3) /85.1(0.6) /90.5(1.8) /90.8(1.8) /81.4(2.4)|
|LSUN(c)||59.9(4.7) /50.9(6.1) /53.6(1.0) /93.9(0.5) /92.4(0.5) /87.3(1.0)|
|LSUN(r)||57.5(4.4) /68.1(4.2) /87.4(0.8) /95.8(1.0) /96.0(0.7) /90.9(0.9)|
|iSUN||53.7(3.8) /62.8(5.0) /84.6(0.9) /95.1(1.0) /95.3(0.5) /88.8(1.1)|
|SVHN||44.5(8.1) /29.7(6.2) /46.2(4.8) /84.5(2.5) /78.8(7.6) /89.5(2.1)|
|Uniform||27.9(20.) /74.5(20.) /100.(0.0) /100.(0.0) /100.(0.0) /100.(0.0)|
|Gaussian||52.7(40.) /87.1(9.3) /100.(0.0) /100.(0.0) /100.(0.0) /100.(0.0)|