Generalized ODIN: Detecting Out-of-distribution Image without Learning from Out-of-distribution Data

Deep neural networks have attained remarkable performance when applied to data that comes from the same distribution as the training set, but can significantly degrade otherwise. Therefore, detecting whether an example is out-of-distribution (OoD) is crucial to enable a system that can reject such samples or alert users. Recent works have made significant progress on OoD benchmarks consisting of small image datasets. However, many recent methods based on neural networks rely on training or tuning with both in-distribution and out-of-distribution data. The latter is generally hard to define a priori, and its selection can easily bias the learning. We base our work on a popular method, ODIN, proposing two strategies that free it from the need to tune with OoD data while improving its OoD detection performance. We specifically propose a decomposed confidence scoring as well as a modified input pre-processing method, and show that both significantly help detection performance. Our further analysis on a larger-scale image dataset shows that the two types of distribution shift, semantic shift and non-semantic shift, present a significant difference in the difficulty of the problem, providing an analysis of when ODIN-like strategies do or do not work.




1 Introduction

State-of-the-art machine learning models, specifically deep neural networks, are generally designed for a static and closed world. The models are trained under the assumption that the input distribution at test time will be the same as the training distribution. In the real world, however, data distributions shift over time in a complex, dynamic manner. Even worse, new concepts (e.g. new categories of objects) can be presented to the model at any time. Such within-class distribution shift and unseen concepts may both lead to catastrophic failures, since the model still attempts to make predictions based on its closed-world assumption. These failures are often silent in that they do not result in explicit errors from the model.

The above issue has been formulated as the problem of detecting whether an input is from in-distribution (i.e. the training distribution) or out-of-distribution (i.e. a distribution different from the training distribution) [13]. This problem has been studied for many years [12] and has been discussed from several views, such as rejection [8, 5], anomaly detection [1], open set recognition [3], and uncertainty estimation [22, 23, 24]. In recent years, a popular neural network-based baseline is to use the max value of the class posterior probabilities output by a softmax classifier, which can in some cases be a good indicator for distinguishing in-distribution and out-of-distribution inputs [13].


ODIN [21], based on a trained neural network classifier, provides two strategies, temperature scaling and input preprocessing, to make the max class probability a more effective score for detecting OoD data. Its performance has been further confirmed by [34], where 15 OoD detection methods are compared with a less biased evaluation protocol. ODIN outperforms popular strategies such as MC-Dropout [7], DeepEnsemble [18], PixelCNN++ [33], and OpenMax [2].

Despite its effectiveness, ODIN requires OoD data to tune the hyperparameters of both its strategies, leading to the concern, discussed in [34], that hyperparameters tuned with one out-of-distribution dataset might not generalize to others. In fact, other neural network-based methods [20, 38] that follow the same problem setting have a similar requirement. [6, 14] push the idea of utilizing OoD data further by using a carefully chosen OoD dataset to regularize the learning of class posteriors so that OoD data receive much lower confidence than in-distribution data. Lastly, [19] uses a generative model to generate out-of-distribution data around the boundary of the in-distribution for learning.

Although the above works show that learning with OoD data is effective, the space of OoD data (e.g. the image pixel space) is usually too large to be covered, potentially causing a selection bias in the learning. Some previous works have made similar attempts to learn without OoD data, such as [35], which uses word embeddings for extra supervision, and [25], which applies metric learning criteria. However, both works report performance similar to ODIN, showing that learning without OoD data is a challenging setting.

Figure 1: The concept of detecting out-of-distribution images by encouraging neural networks to output scores, h(x) and g(x), that behave like the decomposed factors in the conditional probability when the closed-world assumption is explicitly considered. It is elucidated in Section 3.1. A small overlap between the green and red histograms means the score on the x-axis is a good scoring function for distinguishing OoD data from in-distribution data.

In this work, we closely follow the setting of ODIN, proposing two corresponding strategies for the problem of learning without OoD data. First, we provide a new probabilistic perspective for decomposing the confidence of predicted class probabilities. We specifically add a variable that explicitly adopts the closed-world assumption, representing whether the data is in-distribution or not, and discuss its role in a decomposed conditional probability. Inspired by this probabilistic view, we use a dividend/divisor structure for the classifier, which encourages the neural network outputs to behave like the decomposed confidence factors. The concept is illustrated in Figure 1, and we note that the dividend/divisor structure is closely related to temperature scaling, except that the scale depends on the input instead of being a tuned hyperparameter. Second, we build on the input preprocessing method from ODIN [21] and develop an effective strategy to tune its perturbation magnitude (a hyperparameter of the preprocessing method) with only in-distribution data.

We then perform extensive evaluations on benchmark image datasets such as CIFAR10/100, TinyImageNet, LSUN, and SVHN, as well as a larger-scale dataset, DomainNet, to investigate the conditions under which the proposed strategies do or do not work. The results show that the two strategies can significantly improve upon ODIN, achieving performance close to, and in some cases surpassing, that of state-of-the-art methods [20] which use out-of-distribution data for tuning. Lastly, our systematic evaluation with DomainNet reveals the relative difficulties between two types of distribution shift: semantic shift and non-semantic shift, which are defined by whether a shift is related to the inclusion of new semantic categories.

In summary, the contribution of this paper is three-fold:

  • A new perspective of decomposed confidence for motivating a set of classifier designs that consider the closed-world assumption.

  • A modified input preprocessing method without tuning on OoD data.

  • Comprehensive analysis with experiments under the setting of learning without OoD data.

2 Background

Figure 2: An example scheme of semantic shift and non-semantic shift. It is illustrated with DomainNet [31] images. The setting with two splits (A and B) will be used in our experiments, where only real-A is the in-distribution data.

This work considers the OoD detection setting in classification problems. We begin with a dataset D_in = {(x_i, y_i)}, denoting in-distribution data x_i with categorical labels y_i ∈ {1, ..., C} for C classes. D_in is generated by sampling from a distribution p_in(x, y). We then have a discriminative model f with parameters θ, learned on the in-domain dataset D_in, predicting the class posterior probability p(y|x).

When the learned classifier is deployed in the open world, it may encounter data x drawn from a different distribution p_out such that p_out(x, y) ≠ p_in(x, y). Sampling from all possible distributions that might be encountered is generally intractable, especially when the dimension of x is large, as in the case of image data. Note also that we can conceptually categorize the types of difference into non-semantic shift and semantic shift. Data with non-semantic shift is drawn from a distribution whose labels still belong to the training classes: examples with this shift come from the same object classes but are presented in different forms, such as cartoon or sketch images. Such shifts are also the scenario widely discussed in the problem of domain adaptation [30, 31]. In the case of semantic shift, the data is drawn from a distribution p_out(x, y) whose label y is not among the C classes seen in the training set D_in. Figure 2 has an illustration.

The above separation leads to two natural questions that must be answered for a model to work in an open world: How can the model avoid making a prediction when encountering an input x drawn from p_out, and how can it reject a low-confidence prediction when x is drawn from p_in? In this work, we propose to introduce an explicit binary domain variable d to represent this decision, with d = d_in meaning that the input is drawn from p_in and d = d_out meaning that it is not. Note that while generally the model cannot distinguish between the two types of shift we defined, we can still show that both of the questions above can be answered by estimating this single variable d.

The ultimate goal, then, is to find a scoring function S(x) that correlates with the domain posterior probability p(d = d_in | x), in that a higher score from S(x) indicates a higher probability of d = d_in. The binary decision can now be made by applying a threshold on S(x); selecting such a threshold is subject to the application requirement or the performance metric calculation protocol. With the above notation, we can view the baseline method [13] as a special case with the specific scoring function S(x) = max_y p(y|x), where p(y|x) is obtained from a standard neural network classifier trained with cross-entropy loss. However, S can become a learnable parameterized function, and different OoD methods can then be categorized by their specific parameterizations and learning procedures. A key differentiator between methods is whether the parameters are learned with or without OoD data.
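As a minimal sketch of this baseline decision rule (the function names and the threshold value below are illustrative, not from the paper):

```python
import math

def max_softmax_score(logits):
    """Baseline scoring function S(x) = max_y p(y|x): the maximum softmax
    probability from a standard classifier."""
    m = max(logits)                              # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    return max(exps) / sum(exps)

def accept_as_in_distribution(logits, threshold):
    """Binary domain decision: declare d = d_in when the score clears an
    application-chosen threshold."""
    return max_softmax_score(logits) >= threshold
```

In practice the threshold is chosen to meet an application requirement, such as a target true positive rate on in-distribution data.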

2.1 Related Methods

This section describes the two methods most related to our work: ODIN [21] and Mahalanobis [20]. These two methods serve as strong baselines in our evaluation, especially since Mahalanobis has been shown to have significant advantages over ODIN. Note that both ODIN and Mahalanobis start from a vanilla classifier trained on D_in, then apply a scoring function with extra parameters to be tuned. In the original works, those parameters are tuned specifically for each OoD dataset. Here we describe how to use them without tuning on OoD data.

ODIN comprises two strategies: temperature scaling and input preprocessing. The temperature scaling is applied to its scoring function,

    S_ODIN(x) = max_i exp(f_i(x)/T) / Σ_j exp(f_j(x)/T),

where f_i(x) is the logit of class i and T is the temperature.

Although ODIN originally involved tuning the hyperparameter T with out-of-distribution data, it was also shown that a large value is generally preferable, the gain saturating after T = 1000 [21]. We follow this guidance and fix T = 1000 in our experiments.
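The temperature-scaled score can be sketched as follows, assuming raw logits from a trained classifier (a minimal illustration, not the reference implementation):

```python
import math

def odin_score(logits, temperature=1000.0):
    """ODIN-style score: maximum softmax probability after dividing the
    logits f_i(x) by a fixed large temperature T."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                              # shift for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    return max(exps) / sum(exps)
```

With a large T the scores are compressed toward uniform, but their ordering across inputs still separates in- from out-of-distribution data better than the raw max probability.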

Mahalanobis comprises two parts as well: Mahalanobis distance calculation and input preprocessing. The score is calculated with the Mahalanobis distance as follows:

    S_Maha(x) = Σ_ℓ α_ℓ M_ℓ(x),   with   M_ℓ(x) = max_c -(f_ℓ(x) - μ_ℓ,c)ᵀ Σ_ℓ⁻¹ (f_ℓ(x) - μ_ℓ,c).

Here f_ℓ(x) represents the output features at the ℓ-th layer of the neural network, while μ_ℓ,c and Σ_ℓ are the class mean representations and the covariance matrix, correspondingly. The hyperparameters are the layer weights α_ℓ. In the original method, α_ℓ is regressed with a small validation set containing both in-distribution and out-of-distribution data, so a set of α_ℓ is tuned for each OoD dataset. As a result, for the baseline that does not tune on OoD data we use uniform weighting over the layers.
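A minimal sketch of the uniformly weighted score; for simplicity it assumes a diagonal per-layer covariance, whereas the actual method estimates a full tied covariance matrix per layer:

```python
def layer_score(feature, class_means, var):
    """Per-layer confidence M_l(x): the largest negative squared distance
    from the layer feature to any class mean. A diagonal covariance `var`
    is assumed here to keep the sketch dependency-free."""
    def neg_dist(mu):
        return -sum((f - m) ** 2 / v for f, m, v in zip(feature, mu, var))
    return max(neg_dist(mu) for mu in class_means)

def mahalanobis_score(layer_features, layer_means, layer_vars):
    """Combine per-layer scores with uniform weights; averaging is
    equivalent to uniform weighting up to a constant rescaling."""
    scores = [layer_score(f, mus, v)
              for f, mus, v in zip(layer_features, layer_means, layer_vars)]
    return sum(scores) / len(scores)
```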

Note that both methods use the input preprocessing strategy, which has a perturbation-magnitude hyperparameter ε to be tuned. In the original works, this hyperparameter is also tuned for each OoD dataset. Therefore we develop a version that does not require tuning with out-of-distribution data.

3 Approach

3.1 The Decomposed Confidence

[36, 29, 13] observed that the softmax classifier tends to output a highly confident prediction, reporting that "random Gaussian noise fed into an MNIST image classifier gives a predicted class probability of 91%". They attribute this to the use of the softmax function, which is a smooth approximation of an indicator function and hence tends to give a spiky distribution instead of a uniform distribution over classes [13]. We acknowledge this view and further consider it a limitation in the design of the softmax classifier. To address this limitation, our inspiration starts from reconsidering its output, the class posterior probability, which does not consider the domain d at all: current methods condition on d = d_in based on the implicit closed-world assumption. Thus, we use our explicit variable d in the classifier, rewriting its output as p(y | d_in, x) and expressing it as the quotient of the joint class-domain probability and the domain probability using the rule of conditional probability:

    p(y | d_in, x) = p(y, d_in | x) / p(d_in | x).    (4)
Equation 4 provides a probabilistic view of why classifiers tend to be overconfident. Consider an out-of-distribution example x: it is natural to expect that the maximum of the joint probability p(y, d_in | x) among the C classes is low (say, 0.09), and one would also expect its domain probability p(d_in | x) to be low (say, 0.1). Calculating the class posterior with Equation 4 then gives a high probability (0.9), demonstrating how overconfidence can result. Based on the form of Equation 4, we call p(y, d_in | x) and p(d_in | x) the decomposed confidence scores.

One straightforward solution to the above issue is to learn a classifier that predicts the joint probability p(y, d_in | x) by having supervision on both the class y and the domain d. Learning to predict the joint probability is preferred because it can serve both purposes: predicting a class by taking the arg max over y, and rejecting a prediction by thresholding. This idea relates to the work of [14], which adds an extra loss term to penalize a predicted non-uniform class probability when out-of-distribution data is given to the classifier. However, this strategy requires out-of-distribution data for regularizing the training.

Without supervision on the domain d (i.e. without out-of-distribution data), there is no principled way to learn p(y, d_in | x) and p(d_in | x). This situation is similar to unsupervised learning (or self-supervised learning) in that we need to insert assumptions or prior knowledge about the task into the learning. In our case, we use the dividend/divisor structure in Equation 4 as the prior knowledge to design the structure of classifiers, providing them the capacity to decompose the confidence of the class probability.

In the dividend/divisor structure for classifiers, we define the logit f_i(x) for class i as the division between two functions, h_i(x) and g(x):

    f_i(x) = h_i(x) / g(x).    (5)

The quotient is then normalized by the exponential function (i.e. softmax) to output a class probability p(y = i | x) = exp(f_i(x)) / Σ_j exp(f_j(x)), which is subject to the cross-entropy loss.

With the exponential normalization effect of softmax, the cross-entropy loss can be minimized in two ways: increasing h_i(x) or decreasing g(x). In other words, when the data is not in the high-density region of the in-distribution, h_i(x) may tend towards smaller values; in such cases, g(x) is encouraged to be small so that the resulting logits can further minimize the cross-entropy loss. In the other case, when the data is in the high-density region, h_i(x) can generally reach a higher value relatively easily, so its corresponding g(x) is less encouraged to become small. This interaction between h_i(x) and g(x) is the major driving force encouraging h_i(x) to behave like p(y, d_in | x) and g(x) to behave like p(d_in | x).

3.1.1 Design Choices

Although the dividend/divisor structure provides a tendency, it does not guarantee that the decomposed confidence effect happens. The characteristics of h_i(x) and g(x) can largely affect how likely the decomposition is to happen. We therefore discuss a set of simple design choices to investigate whether such decomposition is generally obtainable.

Specifically, we have g(x) = σ(BN(wᵀ f(x) + b)), which passes the features f(x) from the penultimate layer of the neural network sequentially through another linear layer, batch normalization (BN, optional, for faster convergence), and a sigmoid function σ. Here w and b represent the learnable weights. For h_i(x), we investigate three similarity measures, namely inner product (I), negative Euclidean distance (E), and cosine similarity (C), correspondingly:

    h_i(x) = w_iᵀ f(x)                               (I)
    h_i(x) = -||f(x) - w_i||²                        (E)
    h_i(x) = w_iᵀ f(x) / (||w_i|| ||f(x)||)          (C)
The overall neural network model therefore has two branches, h and g, after its penultimate layer (see Figure 1). At training time, the model calculates the logit f_i(x), followed by the softmax function with the cross-entropy loss on top of it. At testing time, the class prediction can be made by taking the arg max of either f_i(x) or h_i(x) (both give the same predictions). For out-of-distribution detection, we use the scoring function S(x) = max_i h_i(x) or S(x) = g(x).

Note that when h_i(x) is a plain linear layer and g(x) = 1, this method reduces to the baseline [13]. We call the three variants of our method DeConf-I, DeConf-E, and DeConf-C. For simplicity, these names denote using max_i h_i(x) for the score; the use of g(x) will be indicated specifically.
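The head described above can be sketched as follows; this is an illustrative implementation assuming plain Python lists for features and weights, with the optional batch normalization inside g omitted:

```python
import math

def deconf_logits(feat, W, w_g, b_g, variant="C"):
    """Sketch of the DeConf head: logits f_i(x) = h_i(x) / g(x).

    `feat` is the penultimate-layer feature f(x), `W` holds one weight
    vector w_i per class, and (`w_g`, `b_g`) parameterize g(x)."""
    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))

    def h(w):
        if variant == "I":                       # inner product
            return dot(w, feat)
        if variant == "E":                       # negative Euclidean distance
            return -sum((p - q) ** 2 for p, q in zip(w, feat))
        # cosine similarity (variant "C")
        return dot(w, feat) / (math.sqrt(dot(w, w)) * math.sqrt(dot(feat, feat)))

    g = 1.0 / (1.0 + math.exp(-(dot(w_g, feat) + b_g)))  # sigmoid, in (0, 1)
    hs = [h(w) for w in W]
    return [hi / g for hi in hs], hs, g
```

Since g(x) > 0 is shared across classes, the arg max over the logits and over h_i(x) always agree; at test time, max_i h_i(x) or g(x) itself serves as the OoD score.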

3.1.2 Temperature Scaling

The g(x) in Equation 5 can immediately be viewed as a learned temperature scaling function, as discussed in [28] and a concurrent report [37]. However, our experimental results strongly suggest that g(x) is more than a scale: g(x) achieves OoD detection performance significantly better than the baselines in many experiments, indicating its potential in estimating p(d_in | x). More importantly, temperature scaling is generally used as a numerical trick for learning a better embedding [40], softening the prediction [15], or calibrating the confidence [9]. Our work provides a probabilistic view of its effect, indicating that such a temperature might relate to how strongly a classifier assumes a closed world as a prior.

3.2 A Modified Input Preprocessing Strategy

This section describes a modified version of the input preprocessing method proposed in ODIN [21]. The primary purpose of the modification is to make the search for the perturbation magnitude ε not rely on out-of-distribution data. The perturbation of an input x is given by:

    x̂ = x - ε · sign(-∇_x S(x)).

In the original method [21], the best value of ε is searched over a list of 21 values with a half-half mixed validation dataset of in-distribution and out-of-distribution data. The perturbed images x̂ are fed into the classification model to calculate the score S(x̂); the performance of each magnitude is evaluated with the benchmark metric (TNR@TPR95, described later) and the best one is selected. This process is repeated for each out-of-distribution dataset, so the original method results in a number of ε values equal to the number of out-of-distribution datasets in the benchmark.

In our method, we search for the ε* that maximizes the total score on only the in-distribution validation dataset D_val_in:

    ε* = arg max_ε Σ_{x ∈ D_val_in} S(x̂).    (10)

Our search criterion is still based on the same observation made by [21]: in-distribution images tend to have their score increased more than out-of-distribution images when the input perturbation is applied. We therefore use Eq. 10, arguing that an ε which produces a large score increase for in-distribution data should be sufficient to create a distinction in score. Our method does not even require class labels, although they are available in D_val_in. More importantly, our method selects only one ε based on D_val_in, without access to the benchmark performance metric (e.g. TNR@TPR95), largely preventing the hyperparameter from overfitting to a specific benchmark score. Lastly, we perform the search for ε on a much coarser grid of only 6 values, so our search is much faster. Although overshooting is possible due to the coarser grid (e.g. when the maximum lies between two scales of the grid), it can be mitigated by reducing the found magnitude by one scale (i.e. dividing it by two). This simple strategy consistently gains or maintains performance across the varied scoring functions considered in this work.
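The search reduces to a small grid search over in-distribution data only; a minimal sketch, where `perturb_fn` is assumed to implement the gradient-based perturbation of the equation above in whatever framework the classifier uses:

```python
def search_epsilon(val_inputs, score_fn, perturb_fn, grid):
    """Choose the perturbation magnitude with in-distribution data only:
    pick the epsilon whose perturbed inputs receive the largest total
    score (Eq. 10), then halve it to guard against overshooting on a
    coarse grid."""
    best = max(grid, key=lambda eps: sum(score_fn(perturb_fn(x, eps))
                                         for x in val_inputs))
    return best / 2.0
```

A toy check: with a score peaking at a perturbed value of 1.0, the grid point 1.0 wins and the returned magnitude is 0.5.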

The method in this section is orthogonal to all the methods evaluated in this work. For convenience, we will add a * after the name of other methods to indicate a combination, for example, Baseline* and DeConf-C*.

4 Experiments

ID         OoD          AUROC (Baseline/ODIN*/Mahalanobis*/DeConf-C*)   TNR@TPR95 (Baseline/ODIN*/Mahalanobis*/DeConf-C*)
CIFAR-100  Imagenet(c)  79.0 / 90.5 / 92.4 / 97.6                       25.3 / 56.0 / 63.5 / 87.8
           Imagenet(r)  76.4 / 91.1 / 96.4 / 98.6                       22.3 / 59.4 / 82.0 / 93.3
           LSUN(c)      78.6 / 89.9 / 81.2 / 95.3                       23.0 / 53.0 / 31.6 / 75.0
           LSUN(r)      78.2 / 93.0 / 96.6 / 98.7                       23.7 / 64.0 / 82.6 / 93.8
           iSUN         76.8 / 91.6 / 96.5 / 98.4                       21.5 / 58.4 / 81.2 / 92.5
           SVHN         78.1 / 85.6 / 89.9 / 95.9                       18.9 / 35.3 / 43.3 / 77.0
           Uniform      65.0 / 91.4 / 100. / 99.9                       2.95 / 66.1 / 100. / 100.
           Gaussian     48.0 / 62.0 / 100. / 99.9                       0.06 / 33.3 / 100. / 100.
CIFAR-10   Imagenet(c)  92.1 / 88.2 / 96.3 / 98.7                       50.0 / 47.8 / 81.2 / 93.4
           Imagenet(r)  91.5 / 90.1 / 98.2 / 99.1                       47.4 / 51.9 / 90.9 / 95.8
           LSUN(c)      93.0 / 91.3 / 92.2 / 98.3                       51.8 / 63.5 / 64.2 / 91.5
           LSUN(r)      93.9 / 92.9 / 98.2 / 99.4                       56.3 / 59.2 / 91.7 / 97.6
           iSUN         93.0 / 92.2 / 98.2 / 99.4                       52.3 / 57.2 / 90.6 / 97.5
           SVHN         88.1 / 89.6 / 98.0 / 98.8                       40.5 / 48.7 / 90.6 / 94.0
           Uniform      95.4 / 98.9 / 99.9 / 99.9                       59.9 / 98.1 / 100. / 100.
           Gaussian     94.0 / 98.6 / 100. / 99.9                       48.8 / 92.1 / 100. / 100.
Table 1: Performance of four OoD detection methods. All methods in the table have no access to OoD data during training and validation. ODIN* and Mahalanobis* are modified versions that do not need any OoD data for tuning (see Section 2.1). The base network is DenseNet trained with CIFAR-10/100 as the in-distribution (ID) data. All values are percentages averaged over three runs, and the best results are indicated in bold. Note that we only show the most common settings used in the literature. DeConf-C is selected since it shows the best robustness in our analysis, but it does not necessarily perform the best among all DeConf variants; please see Figure 3 and Figure 4 for the summary. A more comprehensive version of the table is available in the Supplementary.

4.1 Experimental Settings

Overall procedure: In all experiments, we first train a classifier on an in-distribution training set, then tune the hyperparameters (e.g. the perturbation magnitude ε) on an in-distribution validation set without using its class labels. At testing time, the OoD detection scoring function S(x) calculates scores from the outputs of the classifier. The score is calculated for both the in-distribution validation set and the out-of-distribution dataset, and the scores are then passed to a performance metric calculation function. The above procedure is the same as in related works in this line of research [21, 20, 14, 34, 38, 19], except that we do not use OoD data for tuning the hyperparameters of the scoring function S(x).

In-distribution Datasets: We use SVHN [27] and CIFAR-10/100 [17] images of size 32x32 for the classification task. Detecting OoD with a CIFAR-100 classifier is generally harder than with CIFAR-10 or SVHN, since a larger number of classes usually involves a wider range of variance, giving a higher tendency to treat random data (e.g. Gaussian noise) as in-distribution. For that reason, we use CIFAR-100 in our ablation and robustness studies.

Out-of-distribution Datasets: We include all the OoD datasets used in ODIN [21]: TinyImageNet(crop), TinyImageNet(resize), LSUN(crop), LSUN(resize), iSUN, Uniform random images, and Gaussian random images. We further add SVHN, a colored street-number image dataset, to serve as a difficult OoD dataset. This selection is inspired by findings in the line of work that uses generative models for OoD detection [32, 26, 4]: those works report that a generative model of CIFAR-10 assigns higher likelihood to SVHN images, indicating a hard case for OoD detection.

Networks and Training Details: We use DenseNet [16], ResNet [11], and WideResNet [39] as the classifier backbones. DenseNet has 100 layers with a growth rate of 12; it is trained with batch size 64 for 300 epochs with weight decay 0.0001. The ResNet and WideResNet-28-10 are trained with batch size 128 for 200 epochs with weight decay 0.0005. In both cases, the optimizer is SGD with momentum 0.9, and the learning rate starts at 0.1 and decreases by a factor of 0.1 at 50% and 75% of the training epochs. Note that we do not apply weight decay to the weights w_i in the h_i function of the DeConf classifier, since they work as the centroids for classes; those weights are initialized with He initialization [10]. In the robustness analysis, a model may be indicated to have extra regularization; in such cases, we additionally apply a dropout rate of 0.7 at the inputs of the dividend/divisor structure.

Evaluation Metrics:

We use the two most widely adopted metrics in the OoD detection literature. The first is the area under the receiver operating characteristic curve (AUROC), which plots the true positive rate (TPR) of in-distribution data against the false positive rate (FPR) of OoD data while varying a threshold; it can thus be regarded as a threshold-averaged score. The second is the true negative rate at 95% true positive rate (TNR@TPR95), which simulates an application requirement that the recall of in-distribution data be 95%. Having a high TNR under a high TPR is much more challenging than having a high AUROC score; thus TNR@TPR95 better discerns between high-performing OoD detectors.
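Both metrics can be computed directly from the two populations of detection scores; a minimal sketch (exact percentile and tie conventions vary between implementations):

```python
def tnr_at_tpr95(in_scores, out_scores):
    """TNR@TPR95: set the threshold at the 5th percentile of the
    in-distribution scores (so roughly 95% of in-distribution inputs are
    accepted), then report the fraction of OoD inputs rejected."""
    threshold = sorted(in_scores)[int(0.05 * len(in_scores))]
    return sum(1 for o in out_scores if o < threshold) / len(out_scores)

def auroc(in_scores, out_scores):
    """AUROC by pairwise comparison: the probability that a random
    in-distribution score outranks a random OoD score (ties count half)."""
    wins = sum(1.0 if i > o else 0.5 if i == o else 0.0
               for i in in_scores for o in out_scores)
    return wins / (len(in_scores) * len(out_scores))
```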

4.2 Results and Discussion

ID         OoD          AUROC (ODIN/Maha/ODIN*/Maha*/DeConf-C*)   TNR@TPR95 (ODIN/Maha/ODIN*/Maha*/DeConf-C*)
CIFAR-100  Imagenet(r)  85.2 / 97.4 / 91.1 / 96.4 / 98.6          42.6 / 86.6 / 59.4 / 82.0 / 93.3
           LSUN(r)      85.5 / 98.0 / 93.0 / 96.6 / 98.7          41.2 / 91.4 / 64.0 / 82.6 / 93.8
           SVHN         93.8 / 97.2 / 85.6 / 89.9 / 95.9          70.6 / 82.5 / 35.3 / 43.3 / 77.0
CIFAR-10   Imagenet(r)  98.5 / 98.8 / 90.1 / 98.2 / 99.1          92.4 / 95.0 / 51.9 / 90.9 / 95.8
           LSUN(r)      99.2 / 99.3 / 92.9 / 98.2 / 99.4          96.2 / 97.2 / 59.2 / 91.7 / 97.6
           SVHN         95.5 / 98.1 / 89.6 / 98.0 / 98.8          86.2 / 90.8 / 48.7 / 90.6 / 94.0
Table 2: OoD detection with OoD data versus without OoD data, with CIFAR-10/100 as the in-distribution (ID) data. The values for ODIN and Maha (abbreviation of Mahalanobis) are copied from the Mahalanobis paper [20] and are tuned with OoD data. The values for ODIN*, Maha*, and DeConf-C* are copied from Table 1 of our paper and do not have any access to OoD data. All methods in this table use the same DenseNet backbone. Note that the performance with a different network backbone may differ mildly; for example, Maha performs slightly better than DeConf-C* with ResNet-34.

OoD benchmark performance: Table 1 shows an overall comparison of methods that train without OoD data on 8 OoD benchmark datasets. ODIN* and Mahalanobis* are significantly better than the baseline, while DeConf-C* outperforms both by a significant margin. These results clearly show that learning OoD detection without OoD data is feasible, and that the two methods proposed in Sections 3.1 and 3.2, combined, are very effective for this purpose.

In Table 2 we further compare our results with the original ODIN [21] and Mahalanobis [20] methods, which are tuned on each OoD dataset. We refer to the results of both original methods as reported by [20], since it uses the same backbone network, OoD datasets, and metrics to evaluate OoD detection performance. In this comparison, we find that ODIN* and Mahalanobis* perform worse than the original ODIN and Mahalanobis in a major fraction of the cases. This result is not surprising, because the original methods gain an advantage from using OoD data. However, our DeConf-C* still outperforms the two original methods in many of the cases. This cross-setting comparison further supports the effectiveness of the proposed strategies.

(a) CIFAR-10 classifier
(b) SVHN classifier
Figure 3: An ablation study with the three variants of our DeConf method (Section 3.1). Plain means g(x) = 1, so that the dividend/divisor structure is turned off. Each bar in the figure is averaged over 24 experiments (the 8 OoD datasets listed in Table 1 with 3 repeats; note that we use CIFAR-10 as OoD in place of SVHN in the case of the SVHN classifier). The backbone network is ResNet-34. The plain setting with inner product is equivalent to a vanilla ResNet for classification. Overall, both scores, from h(x) and g(x), are significantly higher than random (AUROC = 0.5) and the corresponding plain baselines.
(a) CIFAR-100 classifier
(b) CIFAR-100 classifier with extra regularization (dropout 0.7)
Figure 4: An ablation study similar to Figure 3. This figure shows that the performance of DeConf-I and of all g(x) scores is improved by adding extra regularization.
Figure 5: The OoD detection performance of our input preprocessing (IPP) strategy, which selects the perturbation magnitude ε with only in-distribution data. The setting plain means the IPP is turned off. The in-distribution data is CIFAR-100, and the backbone network is ResNet-34. Each value is averaged over the results on the 8 OoD datasets listed in Table 1. Each method has its own scoring function (see Sections 2.1 and 3), so IPP yields varied levels of performance gain.

Ablation Study: We study the effects of applying DeConf and our modified input preprocessing (IPP) strategy separately. Figure 3 shows that both h(x) and g(x) from all three variants (I, E, C) of the DeConf strategy help OoD detection performance with the CIFAR-10 and SVHN classifiers, showing that the concept of DeConf is generally effective. However, the failure of DeConf-I and of g(x) with the CIFAR-100 classifier in Figure 4(a) may indicate that these functions have different robustness and scalability, which we investigate in the next section. One downside of using the DeConf strategy is that the accuracy of the classifier may slightly decrease in the case of CIFAR-100 (a drop compared to a vanilla classifier). This could be a natural consequence of having an alternative term, i.e. g(x), in the model for fitting the loss function, which may reduce the pressure to produce a high score for in-distribution data rather than assigning a lower score to data away from the high-density region of the in-distribution. We see this effect reduced, with only a small accuracy drop, when the extra regularization (dropout rate 0.7) is applied. Please see the Supplementary for the comparison of in-distribution classification accuracy.

In Figure 5, the results show that tuning the perturbation magnitude ε with only in-distribution data is an effective strategy, allowing us to reduce the supervision required for learning; the supervision here means the binary in-/out-of-distribution label.

Robustness Study: This study investigates when the OoD detection method will or will not work. Figure 6 shows that the number of in-distribution training samples can largely affect the performance of the OoD detector. Mahalanobis has the lowest data requirement, but the DeConf methods generally reach higher performance in the high-data regime. In Figure 6, we also examine scalability by varying the number of classes in the in-distribution data; in this test, DeConf-E* and DeConf-C* show the best scalability. Overall, DeConf-C* is more robust than the other two DeConf variants. Lastly, Figure 7 shows that high-performing methods such as DeConf-E*, DeConf-C*, and Mahalanobis* are not sensitive to the type and depth of the neural network. Therefore, the numbers of in-distribution samples and classes are the main factors that affect OoD detection performance.

Enhancing the Robustness: Overfitting may be the cause of low OoD detection performance for some of the DeConf variants and their h and g functions. In Figure 4(b), the OoD detection performance increases significantly for both h and g of DeConf-I when extra regularization (dropout rate 0.7) is applied. Figure 8 provides further analysis for DeConf-I's h and g by varying the number of samples and classes in the training data. The performance with extra regularization is significantly better than without it. Besides, the performance of regularized h and g is very similar, indicating that overfitting is an important issue. Lastly, we note that DeConf-E and DeConf-C show reduced performance with extra regularization in Figure 4(b). This is an expected outcome, since dropout generally harms the distance calculation between centroids and data because part of the feature vector is masked. The results indicate that the current designs of h (I, E, C) might not be optimal for the problem, leaving room for future work to find a robust pair of h and g for the OoD detection problem.

Figure 6: Robustness analysis of 6 OoD detection methods. The left figure has classifiers trained on a varied number of samples in CIFAR-10. The right figure has classifiers trained on a varied number of classes in CIFAR-100. Each point in the line is an average of the results on 8 OoD datasets. The backbone network is Resnet-34. Please see Section 4.2 for a detailed discussion.
Figure 7: Robustness analysis using different neural network backbones. The in-distribution data is CIFAR-100. Each bar is averaged with the results on 8 OoD datasets.
Figure 8: Robustness analysis for h and g from DeConf-I. The + sign represents the model trained with extra regularization (dropout rate 0.7).

4.3 Semantic Shift versus Non-semantic Shift

One interesting aspect of out-of-distribution data that has not been explored is the separation of semantic and non-semantic shift. We therefore use a larger scale image dataset, DomainNet [31], to repeat an evaluation similar to Table 1. DomainNet has high-resolution images in 345 classes from six different domains. Four domains in the dataset had class labels available at the time the experiments were conducted: real, sketch, infograph, and quickdraw, resulting in different types of distribution shift.

To create subsets with semantic shift, we separate the classes into two splits. Split A has class indices from 0 to 172, while split B has 173 to 344. Our experiment uses real-A as in-distribution and the other subsets as out-of-distribution. By the definition given in Section 2, real-B has a semantic shift from real-A, while sketch-A has a non-semantic shift; sketch-B therefore has both types of distribution shift. Figure 2 illustrates the setup. The classifier learned on real-A uses a Resnet-34 backbone. Its training setting is described in Section 4.1, except that the networks are trained for 100 epochs with an initial learning rate of 0.01, and the images are center-cropped and resized to 224x224 in this experiment.
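The split construction above can be sketched as follows. The (path, class index, domain) record layout is an assumption for illustration; only the class cut at index 172 comes from the text:

```python
def make_splits(samples, n_classes=345):
    # samples: iterable of (path, class_idx, domain) records.
    # Split A holds classes 0-172; split B holds classes 173-344.
    cut = n_classes // 2  # 172 is the last class index in split A
    splits = {}
    for path, cls, dom in samples:
        key = (dom, "A" if cls <= cut else "B")   # e.g. ("real", "A")
        splits.setdefault(key, []).append((path, cls))
    return splits
```

With real-A as in-distribution, every other (domain, split) bucket serves as an OoD test set.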

OoD set       Shift   AUROC: Baseline / ODIN* / Maha* / DeConf-C*   TNR@TPR95: Baseline / ODIN* / Maha* / DeConf-C*
real-B        S       75.1 / 69.9 / 53.6 / 69.8                     15.3 / 15.4 / 5.09 / 14.0
sketch-A      NS      75.5 / 80.7 / 59.5 / 84.5                     20.1 / 31.2 / 7.30 / 37.5
sketch-B      S+NS    81.8 / 85.7 / 60.4 / 89.1                     25.2 / 36.8 / 7.55 / 44.1
infograph-A   NS      79.6 / 82.7 / 81.5 / 89.0                     23.5 / 27.8 / 21.6 / 45.4
infograph-B   S+NS    82.1 / 85.3 / 80.9 / 90.9                     24.8 / 31.7 / 21.9 / 49.6
quickdraw-A   NS      78.8 / 96.4 / 67.4 / 96.9                     21.1 / 79.9 / 3.38 / 83.1
quickdraw-B   S+NS    80.5 / 96.9 / 66.1 / 97.4                     22.1 / 83.6 / 2.38 / 86.6
Uniform       –       54.7 / 75.6 / 99.8 / 99.3                     1.65 / 5.37 / 100. / 100.
Gaussian      –       71.3 / 95.5 / 99.9 / 99.4                     0.64 / 46.9 / 100. / 100.
Table 3: Performance of four OoD detection methods using DomainNet. The in-distribution is the real-A subset. Each value is averaged over three runs. The type of distribution shift presents a trend of difficulty for the OoD detection problem: semantic shift (S) < non-semantic shift (NS) < semantic + non-semantic shift (S+NS).

The results in Table 3 reveal two interesting trends. First, the OoD datasets with both types of distribution shift are easiest to detect, followed by those with non-semantic shift; semantic shift turns out to be the hardest to detect. Second, Mahalanobis* fails: in most cases it is even worse than the Baseline, except when detecting random noise. In contrast, ODIN* shows gains in most cases, but smaller gains with random noise. Our DeConf-C* still performs best, showing that its robustness and scalability can handle a more realistic problem setting, although there is still large room for improvement.
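For reference, the two metrics reported throughout this section, AUROC and (we assume, as is standard in this literature) TNR at 95% TPR, can be computed from raw detector scores as follows, taking higher scores to mean in-distribution:

```python
import numpy as np

def tnr_at_tpr95(in_scores, out_scores):
    # threshold that keeps 95% of in-distribution samples above it,
    # then report the fraction of OoD samples correctly rejected
    thr = np.percentile(in_scores, 5)
    return float(np.mean(np.asarray(out_scores) < thr))

def auroc(in_scores, out_scores):
    # rank-based (Mann-Whitney) AUROC: probability that a random
    # in-distribution score exceeds a random OoD score (ties ignored)
    s = np.concatenate([in_scores, out_scores])
    order = np.argsort(s, kind="mergesort")
    ranks = np.empty(len(s))
    ranks[order] = np.arange(1, len(s) + 1)
    n_in, n_out = len(in_scores), len(out_scores)
    r_in = ranks[:n_in].sum()
    return float((r_in - n_in * (n_in + 1) / 2) / (n_in * n_out))
```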

5 Conclusion

In this paper, we propose two strategies, the decomposed confidence and the modified input preprocessing. These two simple modifications to ODIN lead to a significant change in the paradigm, removing the need for OoD data when tuning the method. Our comprehensive analysis shows that our strategies are effective, and in several cases even better than methods tuned for each OoD dataset. Our further analysis on a larger scale image dataset shows that data with only semantic shift is harder to detect, pointing out a challenge for future work to address.


  • [1] J. Andrews, T. Tanay, E. J. Morton, and L. D. Griffin (2016) Transfer representation-learning for anomaly detection. In JMLR, Cited by: §1.
  • [2] A. Bendale and T. E. Boult (2016) Towards open set deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1563–1572. Cited by: §1.
  • [3] A. Bendale and T. Boult (2015) Towards open world recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1893–1902. Cited by: §1.
  • [4] H. Choi and E. Jang (2018) Generative ensembles for robust anomaly detection. arXiv preprint arXiv:1810.01392. Cited by: §4.1.
  • [5] C. Cortes, G. DeSalvo, and M. Mohri (2016) Learning with rejection. In International Conference on Algorithmic Learning Theory, pp. 67–82. Cited by: §1.
  • [6] A. R. Dhamija, M. Günther, and T. Boult (2018) Reducing network agnostophobia. Advances in Neural Information Processing Systems. Cited by: §1.
  • [7] Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059. Cited by: §1.
  • [8] Y. Geifman and R. El-Yaniv (2017) Selective classification for deep neural networks. In Advances in neural information processing systems, pp. 4878–4887. Cited by: §1.
  • [9] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1321–1330. Cited by: §3.1.2.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. Cited by: §4.1.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Cited by: §4.1.
  • [12] M. E. Hellman (1970) The nearest neighbor classification rule with a reject option. IEEE Transactions on Systems Science and Cybernetics 6 (3), pp. 179–185. Cited by: §1.
  • [13] D. Hendrycks and K. Gimpel (2017) A baseline for detecting misclassified and out-of-distribution examples in neural networks. International Conference on Learning Representations (ICLR). Cited by: §1, §2, §3.1.1, §3.1.
  • [14] D. Hendrycks, M. Mazeika, and T. G. Dietterich (2019) Deep anomaly detection with outlier exposure. International Conference on Learning Representations (ICLR). Cited by: §1, §3.1, §4.1.
  • [15] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §3.1.2.
  • [16] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §4.1.
  • [17] A. Krizhevsky et al. (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §4.1.
  • [18] B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413. Cited by: §1.
  • [19] K. Lee, H. Lee, K. Lee, and J. Shin (2018) Training confidence-calibrated classifiers for detecting out-of-distribution samples. International Conference on Learning Representations (ICLR). Cited by: §1, §4.1.
  • [20] K. Lee, K. Lee, H. Lee, and J. Shin (2018) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pp. 7167–7177. Cited by: §1, §1, §2.1, §4.1, §4.2, Table 2.
  • [21] S. Liang, Y. Li, and R. Srikant (2017) Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690. Cited by: Generalized ODIN: Detecting Out-of-distribution Image without Learning from Out-of-distribution Data, §1, §1, §2.1, §2.1, §3.2, §3.2, §3.2, §4.1, §4.1, §4.2, footnote 1.
  • [22] A. Malinin and M. Gales (2018) Predictive uncertainty estimation via prior networks. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 7047–7058. External Links: Link Cited by: §1.
  • [23] A. Malinin and M. Gales (2019) Reverse kl-divergence training of prior networks: improved uncertainty and adversarial robustness. In Advances in Neural Information Processing Systems, pp. 14520–14531. Cited by: §1.
  • [24] A. Malinin, B. Mlodozeniec, and M. Gales (2019) Ensemble distribution distillation. In International Conference on Learning Representations, Cited by: §1.
  • [25] M. Masana, I. Ruiz, J. Serrat, J. van de Weijer, and A. M. Lopez (2018) Metric learning for novelty and anomaly detection. arXiv preprint arXiv:1808.05492. Cited by: §1.
  • [26] E. Nalisnick, A. Matsukawa, Y. W. Teh, D. Gorur, and B. Lakshminarayanan (2018) Do deep generative models know what they don’t know?. arXiv preprint arXiv:1810.09136. Cited by: §4.1.
  • [27] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, Cited by: §4.1.
  • [28] L. Neumann, A. Zisserman, and A. Vedaldi (2018) Relaxed softmax: efficient confidence auto-calibration for safe pedestrian detection. NIPS MLITS Workshop. Cited by: §3.1.2.
  • [29] A. Nguyen, J. Yosinski, and J. Clune (2015) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 427–436. Cited by: §3.1.
  • [30] V. M. Patel, R. Gopalan, R. Li, and R. Chellappa (2015) Visual domain adaptation: a survey of recent advances. IEEE signal processing magazine 32 (3), pp. 53–69. Cited by: §2.
  • [31] X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang (2018) Moment matching for multi-source domain adaptation. arXiv preprint arXiv:1812.01754. Cited by: Figure 2, §2, §4.3.
  • [32] J. Ren, P. J. Liu, E. Fertig, J. Snoek, R. Poplin, M. A. DePristo, J. V. Dillon, and B. Lakshminarayanan (2019) Likelihood ratios for out-of-distribution detection. arXiv preprint arXiv:1906.02845. Cited by: §4.1.
  • [33] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma (2017) Pixelcnn++: improving the pixelcnn with discretized logistic mixture likelihood and other modifications. International Conference on Learning Representations (ICLR). Cited by: §1.
  • [34] A. Shafaei, M. Schmidt, and J. Little (2019) A less biased evaluation of ood sample detectors. In Proceedings of the British Machine Vision Conference (BMVC), Cited by: §1, §1, §4.1.
  • [35] G. Shalev, Y. Adi, and J. Keshet (2018) Out-of-distribution detection using multiple semantic label representations. In Advances in Neural Information Processing Systems, pp. 7375–7385. Cited by: §1.
  • [36] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §3.1.
  • [37] E. Techapanurak and T. Okatani (2019) Hyperparameter-free out-of-distribution detection using softmax of scaled cosine similarity. CoRR abs/1905.10628. External Links: Link Cited by: §3.1.2.
  • [38] A. Vyas, N. Jammalamadaka, X. Zhu, D. Das, B. Kaul, and T. L. Willke (2018) Out-of-distribution detection using an ensemble of self supervised leave-out classifiers. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 550–564. Cited by: §1, §4.1.
  • [39] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: §4.1.
  • [40] X. Zhang, R. Zhao, Y. Qiao, X. Wang, and H. Li (2019) AdaCos: adaptively scaling cosine logits for effectively learning deep face representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10823–10832. Cited by: §3.1.2.


Classifier   Image size       #class  Model     Experiment            Baseline  DeConf-I  DeConf-E  DeConf-C
CIFAR-10     32x32            10      DenseNet  Table 1,2             95.2±0.1  94.9±0.1  95.0±0.1  95.0±0.1
CIFAR-10     32x32            10      ResNet34  Figure 3              95.2±0.1  95.0±0.1  94.9±0.1  95.1±0.1
SVHN         32x32            10      ResNet34  Figure 3              96.9±0.1  96.8±0.1  96.5±0.1  96.7±0.1
CIFAR-100    32x32            100     DenseNet  Table 1,2; Figure 7   77.0±0.2  75.8±0.4  76.4±0.1  75.9±0.1
CIFAR-100    32x32            100     WRN       Figure 7              80.8±0.1  78.3±0.1  78.4±0.1  78.4±0.1
CIFAR-100    32x32            100     ResNet50  Figure 7              78.8±0.3  76.4±0.1  76.5±0.3  76.2±0.2
CIFAR-100    32x32            100     ResNet34  Figure 4,5,7          78.5±0.2  76.0±0.1  76.2±0.1  75.8±0.2
CIFAR-100    32x32            100     ResNet18  Figure 7              77.3±0.1  75.2±0.2  75.8±0.1  75.1±0.1
CIFAR-100    32x32            100     ResNet10  Figure 7              75.0±0.1  73.4±0.1  74.2±0.1  73.5±0.1
CIFAR-100+   32x32            100     ResNet34  Figure 4              78.2±0.1  77.4±0.3  77.2±0.3  77.2±0.1
real-A       180x180 to …     173     ResNet34  Table 3               73.6±0.1  73.0±0.1  73.4±1.5  72.2±0.5
Table 4: Summary of the classifiers analyzed in the experiment section. Their in-distribution classification accuracy (%) is given in the right four columns. The "+" means that the classifier is trained with extra regularization (dropout rate 0.7).
Baseline / ODIN* / Maha* / DeConf-I* / DeConf-E* / DeConf-C*
CIFAR-100 Imagenet(c) 79.0(2.2) /90.5(1.1) /92.4(0.3) /84.4(2.3) /95.1(0.5) /97.6(0.2)
Imagenet(r) 76.4(3.2) /91.1(1.3) /96.4(0.2) /81.2(3.6) /97.4(0.3) /98.6(0.2)
LSUN(c) 78.6(1.1) /89.9(0.5) /81.2(0.6) /91.7(0.3) /90.1(0.3) /95.3(0.4)
LSUN(r) 78.2(2.4) /93.0(0.8) /96.6(0.2) /84.1(2.1) /97.8(0.2) /98.7(0.0)
iSUN 76.8(2.7) /91.6(1.1) /96.5(0.2) /82.1(2.9) /97.4(0.2) /98.4(0.0)
SVHN 78.1(3.5) /85.6(0.0) /89.9(0.2) /89.7(0.4) /94.0(0.6) /95.9(0.7)
Uniform 65.0(22.) /91.4(10.) /100.(0.0) /48.5(16.) /99.9(0.0) /99.9(0.0)
Gaussian 48.0(28.) /62.0(38.) /100.(0.0) /6.79(4.9) /99.9(0.0) /99.9(0.0)
CIFAR-10 Imagenet(c) 92.1(1.0) /88.2(4.2) /96.3(0.1) /98.2(0.0) /98.0(0.2) /98.7(0.1)
Imagenet(r) 91.5(1.4) /90.1(4.1) /98.2(0.0) /98.4(0.0) /98.2(0.2) /99.1(0.1)
LSUN(c) 93.0(0.5) /91.3(2.0) /92.2(0.4) /98.4(0.0) /98.6(0.2) /98.3(0.2)
LSUN(r) 93.9(0.4) /92.9(2.9) /98.2(0.0) /98.6(0.0) /98.8(0.0) /99.4(0.1)
iSUN 93.0(0.7) /92.2(3.4) /98.2(0.0) /98.6(0.0) /98.8(0.0) /99.4(0.0)
SVHN 88.1(4.8) /89.6(0.3) /98.0(0.3) /98.2(0.2) /98.4(0.6) /98.8(0.1)
Uniform 95.4(0.7) /98.9(0.7) /99.9(0.0) /99.2(0.5) /99.9(0.0) /99.9(0.0)
Gaussian 94.0(2.9) /98.6(1.7) /100.(0.0) /99.1(0.3) /99.9(0.0) /99.9(0.0)
Baseline / ODIN* / Maha* / DeConf-I* / DeConf-E* / DeConf-C*
CIFAR-100 Imagenet(c) 25.3(2.8) /56.0(3.1) /63.5(2.1) /31.0(3.4) /74.6(2.8) /87.8(1.7)
Imagenet(r) 22.3(3.1) /59.4(3.7) /82.0(1.6) /21.4(4.0) /87.6(1.7) /93.3(1.2)
LSUN(c) 23.0(2.2) /53.0(1.0) /31.6(1.3) /59.6(1.9) /51.0(1.0) /75.0(1.9)
LSUN(r) 23.7(2.5) /64.0(3.0) /82.6(1.8) /21.1(3.3) /89.8(1.5) /93.8(0.3)
iSUN 21.5(2.8) /58.4(4.1) /81.2(1.4) /17.6(3.3) /87.3(1.2) /92.5(0.2)
SVHN 18.9(4.9) /35.3(2.9) /43.3(2.7) /52.0(0.6) /67.1(3.4) /77.0(5.0)
Uniform 2.95(4.1) /66.1(46.) /100.(0.0) /0.0(0.0) /100.(0.0) /100.(0.0)
Gaussian 0.06(0.0) /33.3(47.) /100.(0.0) /0.0(0.0) /100.(0.0) /100.(0.0)
CIFAR-10 Imagenet(c) 50.0(2.8) /47.8(15.) /81.2(0.8) /92.0(0.2) /90.1(1.5) /93.4(1.2)
Imagenet(r) 47.4(4.4) /51.9(16.) /90.9(0.5) /93.6(0.2) /91.7(1.6) /95.8(0.9)
LSUN(c) 51.8(3.1) /63.5(7.8) /64.2(0.6) /92.5(0.4) /93.3(1.5) /91.5(1.2)
LSUN(r) 56.3(3.6) /59.2(18.) /91.7(0.3) /94.9(0.2) /95.7(0.1) /97.6(0.5)
iSUN 52.3(3.6) /57.2(18.) /90.6(0.7) /94.6(0.3) /95.4(0.2) /97.5(0.3)
SVHN 40.5(6.9) /48.7(3.2) /90.6(1.7) /91.4(1.1) /92.1(3.4) /94.0(0.6)
Uniform 59.9(12.) /98.1(2.6) /100.(0.0) /99.9(0.0) /100.(0.0) /100.(0.0)
Gaussian 48.8(26.) /92.1(11.) /100.(0.0) /99.9(0.0) /100.(0.0) /100.(0.0)
Table 5: Performance of six OoD detection methods on 8 benchmark datasets. This is the full version of Table 1 in the main paper, using DenseNet as the backbone network. The value in parentheses is the standard deviation.

Baseline / ODIN* / Maha* / DeConf-I* / DeConf-E* / DeConf-C*
CIFAR-100 Imagenet(c) 78.9(0.1) /84.8(0.6) /93.4(0.3) /88.2(0.6) /95.2(0.6) /95.3(0.6)
Imagenet(r) 75.1(0.8) /85.7(0.2) /96.3(0.1) /84.6(1.0) /97.0(0.4) /95.9(0.7)
LSUN(c) 78.8(0.6) /80.3(1.3) /79.8(0.3) /93.8(0.3) /92.6(0.2) /93.8(0.3)
LSUN(r) 76.2(1.4) /86.6(0.8) /96.3(0.2) /85.9(1.8) /97.0(0.7) /96.1(0.5)
iSUN 75.2(1.4) /85.9(0.8) /95.8(0.2) /84.7(1.4) /96.6(0.6) /95.7(0.5)
SVHN 75.1(2.5) /80.2(2.0) /80.9(1.1) /89.2(2.6) /93.8(0.8) /93.2(1.1)
Uniform 69.0(13.) /96.7(2.5) /100.(0.0) /79.3(8.3) /99.9(0.0) /99.9(0.0)
Gaussian 51.5(1.8) /93.7(1.7) /99.9(0.0) /60.8(23.) /99.9(0.0) /99.9(0.0)
CIFAR-10 Imagenet(c) 90.0(0.9) /81.2(2.4) /94.2(0.1) /98.2(0.2) /98.2(0.1) /96.0(0.2)
Imagenet(r) 87.3(1.3) /81.1(2.9) /96.5(0.1) /98.1(0.3) /98.1(0.3) /96.1(0.5)
LSUN(c) 92.0(1.7) /77.9(4.6) /87.7(0.2) /98.8(0.1) /98.5(0.0) /97.2(0.1)
LSUN(r) 91.6(1.2) /88.5(2.0) /97.2(0.1) /98.9(0.2) /99.0(0.1) /98.0(0.1)
iSUN 90.1(1.4) /86.1(2.5) /96.5(0.2) /98.8(0.2) /98.9(0.1) /97.6(0.1)
SVHN 87.7(2.4) /63.9(4.3) /87.8(1.6) /96.8(0.4) /96.1(1.4) /97.8(0.3)
Uniform 85.9(10.) /93.3(4.5) /99.9(0.0) /99.6(0.1) /99.9(0.0) /99.9(0.0)
Gaussian 89.9(10.) /97.1(2.0) /99.9(0.0) /99.7(0.0) /99.9(0.0) /99.9(0.0)
Baseline / ODIN* / Maha* / DeConf-I* / DeConf-E* / DeConf-C*
CIFAR-100 Imagenet(c) 24.1(0.6) /44.0(2.2) /68.2(1.4) /42.6(2.7) /73.4(3.7) /72.6(3.7)
Imagenet(r) 19.4(0.1) /45.5(1.4) /82.6(0.8) /30.4(3.0) /84.3(2.7) /76.5(3.8)
LSUN(c) 21.9(0.4) /34.8(2.4) /27.7(1.4) /66.1(2.2) /59.7(0.7) /65.7(2.3)
LSUN(r) 19.8(1.6) /48.2(3.0) /81.8(1.4) /29.4(5.2) /84.6(4.0) /76.8(3.3)
iSUN 17.7(0.5) /45.3(2.8) /80.4(0.8) /27.1(4.3) /83.0(3.1) /75.3(3.3)
SVHN 16.6(1.5) /27.5(5.0) /25.7(2.6) /43.7(10.) /60.8(5.3) /55.1(7.1)
Uniform 5.63(7.0) /76.4(27.) /100.(0.0) /4.11(5.8) /100.(0.0) /100.(0.0)
Gaussian 0.0(0.0) /46.6(20.) /100.(0.0) /0.06(0.0) /100.(0.0) /100.(0.0)
CIFAR-10 Imagenet(c) 54.6(2.6) /53.7(3.1) /74.6(0.6) /90.8(1.5) /91.1(0.9) /81.1(1.7)
Imagenet(r) 48.3(3.2) /53.1(4.3) /85.1(0.6) /90.5(1.8) /90.8(1.8) /81.4(2.4)
LSUN(c) 59.9(4.7) /50.9(6.1) /53.6(1.0) /93.9(0.5) /92.4(0.5) /87.3(1.0)
LSUN(r) 57.5(4.4) /68.1(4.2) /87.4(0.8) /95.8(1.0) /96.0(0.7) /90.9(0.9)
iSUN 53.7(3.8) /62.8(5.0) /84.6(0.9) /95.1(1.0) /95.3(0.5) /88.8(1.1)
SVHN 44.5(8.1) /29.7(6.2) /46.2(4.8) /84.5(2.5) /78.8(7.6) /89.5(2.1)
Uniform 27.9(20.) /74.5(20.) /100.(0.0) /100.(0.0) /100.(0.0) /100.(0.0)
Gaussian 52.7(40.) /87.1(9.3) /100.(0.0) /100.(0.0) /100.(0.0) /100.(0.0)
Table 6: Performance of six OoD detection methods on 8 benchmark datasets. The experiment is the same as in Table 1 but uses ResNet-34 as the backbone network. The value in parentheses is the standard deviation.