The Official Implementation of the ICCV-2021 Paper: Semantically Coherent Out-of-Distribution Detection.
Current out-of-distribution (OOD) detection benchmarks are commonly built by defining one dataset as in-distribution (ID) and all others as OOD. However, these benchmarks unfortunately introduce some unwanted and impractical goals, e.g., to perfectly distinguish CIFAR dogs from ImageNet dogs, even though they have the same semantics and negligible covariate shifts. These unrealistic goals will result in an extremely narrow range of model capabilities, greatly limiting their use in real applications. To overcome these drawbacks, we re-design the benchmarks and propose the semantically coherent out-of-distribution detection (SC-OOD). On the SC-OOD benchmarks, existing methods suffer from large performance degradation, suggesting that they are extremely sensitive to low-level discrepancy between data sources while ignoring their inherent semantics. To develop an effective SC-OOD detection approach, we leverage an external unlabeled set and design a concise framework featured by unsupervised dual grouping (UDG) for the joint modeling of ID and OOD data. The proposed UDG can not only enrich the semantic knowledge of the model by exploiting unlabeled data in an unsupervised manner, but also distinguish ID/OOD samples to enhance ID classification and OOD detection tasks simultaneously. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on SC-OOD benchmarks. Code and benchmarks are provided on our project page: https://jingkang50.github.io/projects/scood.
Deep learning models are still notorious for the following two drawbacks: 1) their performance drops significantly when the test data distribution has a large covariate shift from the training distribution; 2) they tend to recklessly classify a test image into a certain training class, even though it has a semantic shift from training, that is, it may not belong to any training class. These defects seriously reduce model trustworthiness and hinder deployment in real, especially high-risk, applications [13, 7]. To solve the problem, out-of-distribution (OOD) detection aims to distinguish and reject test samples with either covariate shifts or semantic shifts or both, so as to prevent models trained on in-distribution (ID) data from producing unreliable predictions.
Existing OOD detection methods mostly focus on calibrating the distribution of the softmax layer through temperature scaling, generative models [16, 26], or ensemble methods [3, 28]. Other solutions collect enormous OOD samples to learn the ID/OOD discrepancy during training [5, 12, 30]. Appealing experimental results are achieved by existing methods. For example, MCD reports near-perfect scores across the classic benchmarks. The OOD detection problem seems completely solved.
However, by scrutinizing the commonly used OOD detection benchmarks [11, 18, 12], we discover some irrationality in their OOD splitting. Under the assumption that different datasets represent different data distributions, current benchmarks are commonly built by defining one dataset as ID and all others as OOD. Figure 1-a shows one popular benchmark that uses the entire CIFAR-10 test data as ID and the entire Tiny-ImageNet test data as OOD. However, we observe that around 15% of Tiny-ImageNet test samples actually share the same semantics with CIFAR-10's ID categories (see Section 4). For example, Tiny-ImageNet contains six dog breeds (e.g., golden retriever, Chihuahua) that match CIFAR-10's dog class, while their covariate shifts are negligible. In this case, perfect performance on the above dataset-dependent OOD (DD-OOD) benchmarks may indicate that models are overfitting the low-level discrepancy of the negligible covariate shifts between data sources while ignoring inherent semantics. This fails to meet the requirements of realistic model deployment.
To overcome the drawbacks of the DD-OOD benchmarks, in this work, we re-design semantically coherent out-of-distribution detection (SC-OOD) benchmarks, which re-organize the ID/OOD sets based on semantics and only focus on real images where the covariate shift can be ignored, as depicted in Figure 1-b. In this case, the ID set becomes semantically coherent and different from OOD. Existing methods suffer a large performance degradation on the revised SC-OOD benchmarks as shown in Table 1, indicating that the OOD detection problem is still unresolved. (Footnote 1: In Table 1, both the DD-OOD and SC-OOD benchmarks consider CIFAR-10 as ID and Tiny-ImageNet as OOD. All methods are tested using their released DenseNet models. AUPR corresponds to AUPR-Out in Section 4.)
For an effective SC-OOD approach, we leverage an external unlabeled set as in OE. Different from OE, whose unlabeled set is purely OOD, our unlabeled set is contaminated by a portion of ID samples. We believe this is a more realistic setting, as a powerful image crawler can easily collect millions of unlabeled images but will inevitably introduce ID samples that are expensive to filter out.
With a realistic unlabeled set for SC-OOD, we design an elegant framework featuring unsupervised dual grouping (UDG) for the joint modeling of labeled and unlabeled data. The proposed UDG enhances the semantic expression ability of the model by exploring unlabeled data with an unsupervised deep clustering task, and the grouping information generated by the auxiliary task can also dynamically separate the ID and OOD samples in the unlabeled set. ID samples separated from the unlabeled set join the other given ID samples for classifier training, and the rest are forced to produce a uniform posterior distribution, as in other OE methods. In this way, ID classification and OOD detection performances are simultaneously improved.
To summarize, the contributions of our paper are: 1) We highlight the problem of current OOD detection benchmarks and re-design them to address semantic coherency in out-of-distribution detection. 2) A concise framework using realistic unlabeled data is proposed, featuring the unsupervised dual grouping which not only enriches the semantic knowledge of the model in an unsupervised manner, but also distinguishes ID/OOD samples to enhance ID classification and OOD detection tasks simultaneously. 3) Extensive experiments demonstrate our approach achieves state-of-the-art performance on SC-OOD benchmarks.
Out-of-Distribution Detection. OOD detection aims to distinguish test images that come from a distribution different from that of the training samples. The simplest baseline uses the maximum softmax probability (MSP) to identify OOD samples, based on the observation that DNNs tend to produce lower prediction probabilities for misclassified and OOD inputs. Follow-up work uses various techniques to improve MSP. ODIN applies temperature scaling on the softmax layer to increase the separation between ID and OOD probabilities. Small perturbations are also introduced into the input space for further improvement. Some probabilistic methods attempt to model the distribution of the training samples and use the likelihood, or density, to identify OOD samples [16, 26, 24]. Besides, ensemble methods can also be used to robustify the models [3, 28]. Recently, energy scores from energy-based models were found to be theoretically consistent with the probability density and suitable for OOD detection. All the above methods rely only on ID samples for OOD detection.
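As an illustration of the MSP baseline and of temperature scaling, the following is a minimal sketch rather than any method's released implementation. The function names, the example logits, and the temperature value are hypothetical, and ODIN's input perturbation step is deliberately omitted.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def msp_score(logits, temperature=1.0):
    """Maximum softmax probability; lower values suggest an OOD input."""
    return max(softmax(logits, temperature))

# Hypothetical logits: one peaked (ID-like) and one flat (OOD-like) prediction.
confident = [8.0, 1.0, 0.5]
uncertain = [1.2, 1.0, 0.9]

print("MSP, peaked logits:", msp_score(confident))
print("MSP, flat logits:  ", msp_score(uncertain))
# A large temperature flattens the softmax and lowers the score.
print("MSP, peaked logits, T=1000:", msp_score(confident, temperature=1000.0))
```

A detector built on this score would reject inputs whose score falls below a chosen threshold.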
Another group of OOD detection methods utilizes a set of external OOD data, from which the discrepancy between ID and OOD data is learned. The baseline work for this branch is OE. Building on the MSP baseline, a large-scale selected OOD set is introduced as outlier exposure (OE) and an additional objective is included, expecting DNNs to produce uniform softmax scores for the extra samples. Afterwards, MCD proposes a network with two classifiers, which are forced to produce maximum entropy discrepancy for extra OOD samples. Some works explore optimal strategies for external OOD data sampling. However, we find two problems with previous methods that use external OOD data: 1) in a realistic setting, a purified OOD set is difficult to obtain, as ID samples are inevitably introduced and expensive to filter out; 2) current methods only regard the OOD set holistically, neglecting the abundant semantic information within the set. In this paper, we take a realistic unlabeled set that is a natural ID/OOD mixture and aim to explore the knowledge within it.
Deep clustering is an unsupervised learning method that trains DNNs using the cluster assignments of the resulting features. In this paper, we integrate it into our proposed unsupervised dual grouping (UDG), aiming not only to learn visual representations through self-supervised training on both labeled and unlabeled sets, but also to perform cluster-wise OOD probability estimation to filter out ID samples from the unlabeled set, as described in Section 3.4.
In this section, we introduce our proposed end-to-end pipeline with unsupervised dual grouping (UDG) in detail.
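Suppose that we have a training set $\mathcal{D}$ and a testing set $\mathcal{T}$. Under the closed-world assumption, both training and testing data come from the in-distribution $\mathcal{I}$, i.e., $\mathcal{D} = \mathcal{D}_L \subset \mathcal{I}$ and $\mathcal{T} = \mathcal{T}_I \subset \mathcal{I}$. The subscript $L$ means that $\mathcal{D}_L$ is fully labeled. Samples in $\mathcal{D}_L$ ($\mathcal{T}_I$) only belong to the known classes $\mathcal{C}$ that are provided by the labels in $\mathcal{D}_L$. However, a more realistic setting suggests that $\mathcal{T}$ also contains unknown classes that come from the out-of-distribution $\mathcal{O}$, i.e., $\mathcal{T}$ is composed of $\mathcal{T}_I$ and $\mathcal{T}_O$. A model trained on $\mathcal{D}$ is required to not only correctly classify samples from $\mathcal{T}_I$ into $\mathcal{C}$, but also recognize OOD samples from $\mathcal{T}_O$. To this end, an unlabeled set $\mathcal{D}_U$ is introduced to assist the training process, leading to $\mathcal{D} = \mathcal{D}_L \cup \mathcal{D}_U$. Ideally, the unlabeled set should be purely out-of-distribution, i.e., $\mathcal{D}_U \subset \mathcal{O}$. However, in practice, $\mathcal{D}_U$ is a mixture of ID samples $\mathcal{D}_U^I$ and OOD samples $\mathcal{D}_U^O$ with unknown separation. Note also that $\mathcal{D}_U^O$ does not necessarily cover $\mathcal{T}_O$. In summary, our goal is to train an image classifier from the training set $\mathcal{D}_L \cup \mathcal{D}_U$, so that the model can reject samples from $\mathcal{T}_O$ in addition to classifying samples from $\mathcal{T}_I$ correctly.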
To empower the classifier with OOD detection ability using both the labeled ID set and the unlabeled set, our pipeline design starts from a classic OE architecture that trains the network to classify ID samples correctly and forces high-entropy predictions on unlabeled samples. This is encapsulated in the classifier branch introduced in Section 3.3. Then, unsupervised dual grouping (UDG) is proposed to group the labeled and unlabeled sets together. Based on the grouping, ID samples from the unlabeled set can be filtered out and redirected to the labeled set to enhance the performance of the classification branch, which is introduced in Section 3.4. With the grouping information produced by UDG, an auxiliary deep clustering branch is attached, aiming to discover the valuable but understudied knowledge in the unlabeled set. Finally, the entire training and testing procedure is summarized in Section 3.5. Figure 2 illustrates the proposed pipeline.
We first focus on the ID classification ability of the proposed model. A classifier is built, consisting of a backbone encoder and a classification head, each with its own learnable parameters. A standard cross-entropy loss is used to train the classifier on the data-label pairs of the labeled set, as formulated in Equation 1.
Now, we use the unlabeled set to help the network gain OOD detection capabilities, following the classic architecture of outlier exposure. Ideally, all samples from the unlabeled set are OOD. In this case, since they do not belong to any of the known classes, the network is forced to produce a uniform posterior distribution over all known classes for unlabeled samples. Therefore, an entropy loss is introduced in Equation 2 to flatten the model prediction on unlabeled samples.
However, in practice, the unlabeled set might be mixed with ID samples, making the assumption of a purely OOD unlabeled set inaccurate. This problem is addressed by the proposed unsupervised dual grouping in Section 3.4.
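The two objectives of the classifier branch can be sketched as follows. This is a minimal illustration under an assumption: the entropy loss is written here as the cross-entropy between a uniform target and the predicted softmax, which is one common form of the outlier exposure objective and may differ in detail from Equation 2. All function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_total for z in logits]

def cross_entropy(logits, label):
    """Classification loss for one labeled sample."""
    return -log_softmax(logits)[label]

def uniform_cross_entropy(logits):
    """Assumed OE-style loss for one unlabeled sample: cross-entropy
    against the uniform distribution over the known classes."""
    log_probs = log_softmax(logits)
    return -sum(log_probs) / len(log_probs)

flat = [0.0, 0.0, 0.0]     # already uniform prediction
peaked = [5.0, 0.0, 0.0]   # confident prediction

print("OE loss, flat prediction:  ", uniform_cross_entropy(flat))
print("OE loss, peaked prediction:", uniform_cross_entropy(peaked))
print("CE loss, peaked prediction, label 0:", cross_entropy(peaked, 0))
```

The uniform-target loss attains its minimum, the logarithm of the number of classes, exactly when the prediction is uniform, so minimizing it on unlabeled samples flattens their predictions. This also shows why ID samples hidden in the unlabeled set are harmful: they are pushed toward a flat prediction as well.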
In this section, we first introduce the basic operation of UDG and then focus on how UDG solves the above problem of the entropy loss when ID samples are mixed into the unlabeled set. Ideally, the ID samples in the unlabeled set should be removed from it and returned to the labeled set with their corresponding labels, leaving only OOD samples for entropy loss minimization. Fortunately, UDG can complete this task with the In-Distribution Filtering (IDF) operation. Besides, the grouping information provided by UDG can be used by an auxiliary deep clustering branch to explore the knowledge in the unlabeled data and in turn boost the semantic capability of the model.
Basic Operation. We divide the entire training set, labeled and unlabeled together, into $K$ groups at every epoch. At the $t$-th epoch, the encoder extracts the feature of every training sample to form a feature set for the subsequent grouping process. Any clustering method can be used for grouping; in this work, we use the classic $k$-means algorithm as in Equation 3, whose output stores the group index of every sample at epoch $t$.
After grouping, every sample belongs to exactly one of the $K$ groups. Formally, all samples that belong to the $k$-th group at the $t$-th epoch form the set defined in Equation 4.
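The grouping step can be sketched with a small, pure-Python version of Lloyd's k-means algorithm. This is only an illustration on toy 2-D features: the function names, the toy data, and the choice of initial centers are hypothetical, and a real pipeline would cluster encoder features with an optimized library.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(features, centers, iterations=10):
    """Lloyd's algorithm; returns (group index per sample, final centers)."""
    centers = [list(c) for c in centers]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each sample joins its nearest center.
        assignment = [
            min(range(len(centers)), key=lambda k: squared_distance(f, centers[k]))
            for f in features
        ]
        # Update step: each center moves to the mean of its members.
        for k in range(len(centers)):
            members = [f for f, g in zip(features, assignment) if g == k]
            if members:  # keep the old center if a group becomes empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

def groups_from_assignment(assignment, num_groups):
    """Collect the sample indices that fall into each group."""
    groups = [[] for _ in range(num_groups)]
    for index, g in enumerate(assignment):
        groups[g].append(index)
    return groups

# Toy features standing in for encoder outputs of labeled + unlabeled samples.
features = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
assignment, centers = kmeans(features, centers=[features[0], features[3]])
groups = groups_from_assignment(assignment, num_groups=2)
print(assignment)
print(groups)
```

The per-sample group indices play the role of the output of the grouping equation, and the per-group index lists correspond to the group sets used in the filtering step.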
In-Distribution Filtering (IDF) with UDG. In this part, we introduce how we filter out ID samples at the level of groups. Empirically, for a group that is dominated by a labeled class, the unlabeled data in the group is more likely to belong to the corresponding category. Following the rule that the ID property of a sample can be estimated from the group it belongs to, the In-Distribution Filtering (IDF) operation is proposed to filter out ID samples from the unlabeled set. For each group, we define its group purity with respect to a class as the proportion of samples in the group that belong to that class at the current epoch, as in Equation 5, which counts the labeled samples of that class.
With group purity prepared, the IDF operator returns all unlabeled samples in a group whose group purity exceeds a threshold back to the labeled set, with their labels set to the group majority, forming an updated labeled set according to Equation 6.
Complementary to the updated labeled set, the unlabeled set is also updated by removing the returned samples. Using the updated sets, both the classification loss and the entropy loss are modified as in Equation 7 and Equation 8.
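The filtering rule can be sketched as below. This is a hedged illustration, not the released implementation: it assumes group purity is the number of labeled samples of the majority class divided by the total group size (labeled plus unlabeled), and the function name, data layout, and threshold value are hypothetical.

```python
from collections import Counter

def in_distribution_filtering(groups, labels, threshold):
    """Return {sample index: pseudo label} for unlabeled samples judged ID.

    groups: list of groups, each a list of sample indices.
    labels: dict mapping labeled sample indices to class ids; unlabeled
            samples are simply absent from the dict.
    """
    promoted = {}
    for members in groups:
        counts = Counter(labels[i] for i in members if i in labels)
        if not counts:
            continue  # no labeled sample in this group, nothing to compare to
        majority_class, majority_count = counts.most_common(1)[0]
        purity = majority_count / len(members)  # assumed purity definition
        if purity > threshold:
            for i in members:
                if i not in labels:
                    promoted[i] = majority_class
    return promoted

# Group 0 is dominated by labeled dogs; group 1 is mixed.
groups = [[0, 1, 2, 3, 4], [5, 6, 7, 8]]
labels = {0: "dog", 1: "dog", 2: "dog", 3: "dog", 5: "cat", 6: "ship"}
promoted = in_distribution_filtering(groups, labels, threshold=0.7)
remaining_unlabeled = [i for g in groups for i in g
                       if i not in labels and i not in promoted]
print(promoted)
print(remaining_unlabeled)
```

In this toy case sample 4 joins the labeled set as a dog, while samples 7 and 8 stay in the unlabeled set and would keep contributing to the entropy loss.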
Auxiliary Task for UDG. The motivation for designing the auxiliary branch is to fully leverage the knowledge contained in the unlabeled set, expecting that the learned semantics can further benefit the model performance, especially on ID classification. Fortunately, the groups that UDG provides are perfectly compatible with a deep clustering process, which is used as the unsupervised auxiliary task for knowledge exploration.
The intuition of deep clustering is that samples lying in the same group are supposed to be in the same category. As the group index of every sample is provided by Equation 3, a fully-connected auxiliary head with its own learnable parameters is trained to classify each sample into its corresponding group, using the auxiliary loss in Equation 9.
Finally, with the modified classification loss, the modified entropy loss, and the auxiliary loss, the final loss is calculated by Equation 10 with two weighting hyperparameters. An end-to-end training process is performed to simultaneously optimize the parameters of the encoder, the classification head, and the auxiliary head.
During testing, only the classification head along with the backbone encoder is used. The model makes an in-distribution prediction only if the maximum prediction value exceeds a pre-defined threshold. Otherwise, the sample is considered out-of-distribution. The testing process is formalized by Equation 11.
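The test-time rule can be sketched as a thresholded prediction. The function names, the example logits, and the threshold value below are hypothetical; the sketch assumes the maximum softmax probability is the quantity compared against the threshold.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_or_reject(logits, threshold):
    """Return the predicted class index, or None to flag the sample as OOD."""
    probs = softmax(logits)
    confidence = max(probs)
    if confidence > threshold:
        return probs.index(confidence)
    return None

print(predict_or_reject([4.0, 0.5, 0.1], threshold=0.9))  # confident: class index
print(predict_or_reject([1.0, 0.9, 1.1], threshold=0.9))  # flat: rejected
```

The threshold trades off ID recall against OOD rejection, which is why threshold-free metrics are used for evaluation.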
In this section, we introduce two benchmarks that reflect semantically coherent OOD detection. The two benchmarks take the well-known CIFAR-10 and CIFAR-100 datasets as in-distribution, respectively. Five other datasets, namely Texture, SVHN, Tiny-ImageNet, LSUN, and Places365, are prepared as OOD datasets. We re-split the ID and OOD test sets according to the semantics of samples for the SC-OOD benchmarks. There are two steps for re-splitting: 1) we first pick out the ID classes from the OOD datasets and mark all the images within the selected classes as ID samples. Table 2 shows exemplar Tiny-ImageNet ID classes corresponding to two CIFAR-10 classes. 2) we then conduct fine-grained filtering, since many images from irrelevant OOD categories also contain ID semantics. Figure 3 shows exemplar Tiny-ImageNet ID images that are hidden by irrelevant labels. Eventually, we obtain the CIFAR-10/100 SC-OOD benchmarks, described in detail as follows.
CIFAR-10 is a natural object image dataset with 50,000 training samples and 10,000 testing samples from 10 object classes. The selected test datasets include 1) the CIFAR-10 test set, with all 10,000 images as ID; 2) the entire Texture set with 5,640 textural images, all as OOD; 3) the SVHN test set with 26,032 images of real-world street numbers, all as OOD; 4) the CIFAR-100 test set with 10,000 object images whose classes are disjoint from the CIFAR-10 classes, therefore all as OOD; 5) the Tiny-ImageNet test set containing 10,000 images from 200 object classes, with 1,207 images as ID and 8,793 images as OOD; 6) the LSUN test set containing 10,000 images for scene recognition, with 2 images as ID and 9,998 images as OOD; 7) the Places365 test set containing 36,500 scene images, with 1,305 images as ID and 35,195 images as OOD.
CIFAR-100 is a dataset of 100 fine-grained classes with 50,000 training samples and 10,000 testing samples. The classes of CIFAR-10 and CIFAR-100 are disjoint. The selected test datasets include 1) the CIFAR-100 test set, with all 10,000 images as ID; 2) the entire Texture set with 5,640 textural images, all as OOD; 3) the SVHN test set with 26,032 images of real-world street numbers, all as OOD; 4) the CIFAR-10 test set with 10,000 object images whose classes are disjoint from the CIFAR-100 classes, therefore all as OOD; 5) the Tiny-ImageNet test set, with 2,502 images as ID and 7,498 images as OOD; 6) the LSUN test set, with 2,429 images as ID and 7,571 images as OOD; 7) Places365, with 2,727 images as ID and 33,773 images as OOD.
We use four kinds of metrics to evaluate the performance on both ID classification and OOD detection.
FPR95 is short for the false positive rate (FPR) at 95% true positive rate (TPR). It measures the portion of falsely recognized OOD samples when most ID samples are recalled.
AUROC computes the area under the receiver operating characteristic curve, evaluating the OOD detection performance. Samples from are considered positive.
AUPR measures the area under the precision-recall curve. Depending on which set is treated as positive, AUPR contains AUPR-In, which regards ID samples as positive, as well as AUPR-Out, where OOD samples are considered positive. In Table 1 and Table 3, we use AUPR to represent the value of AUPR-Out due to its complementary polarity with AUROC.
CCR@FPR shows Correct Classification Rate (CCR) at the point when FPR reaches a value . The metric evaluates ID classification and OOD detection simultaneously and is formalized by Equation 3 in .
Among all the mentioned metrics, only FPR95 is expected to have a lower value on a better model. Higher values on any other metrics indicate better performance.
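Two of these metrics can be sketched in a few lines. The sketch assumes that ID samples are the positive class and that a higher score means "more ID"; the function names and the toy scores are hypothetical, and library implementations handle ties and large inputs more carefully.

```python
import math

def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """FPR on OOD samples at the threshold that keeps `tpr` of ID samples."""
    ranked = sorted(id_scores, reverse=True)
    needed = math.ceil(tpr * len(ranked))  # ID samples that must be accepted
    threshold = ranked[needed - 1]
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)

def auroc(id_scores, ood_scores):
    """Probability that a random ID sample scores above a random OOD sample."""
    wins = 0.0
    for p in id_scores:
        for n in ood_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))

# Toy confidence scores (e.g. maximum softmax probabilities).
id_scores = [0.99, 0.97, 0.95, 0.90, 0.60]
ood_scores = [0.98, 0.50, 0.40, 0.30]

print("FPR95:", fpr_at_tpr(id_scores, ood_scores))
print("AUROC:", auroc(id_scores, ood_scores))
```

With these toy scores, one overconfident OOD sample out of four passes the threshold needed to keep all five ID samples, and 16 of the 20 ID/OOD pairs are ranked correctly.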
In this section, after describing the implementation details, the effect of each component is analyzed in the ablation study. Then we compare our method with previous state-of-the-art methods. Finally, a more in-depth exploration of our method is discussed.
Experimental Settings. Two sets of experiments are performed, with CIFAR-10 and CIFAR-100 as the labeled ID training set, respectively. Both training sets contain 50,000 images. The Tiny-ImageNet training set is used as the unlabeled set in both experiments. Testing is conducted according to Section 4. The main paper only reports the average metric values on all 6 datasets in each benchmark. All ablation and analytical experiments are performed on the CIFAR-10 benchmark.
Implementation Details. All experiments are performed with a standard ResNet-18 , trained by an SGD optimizer with a weight decay of and a momentum of . Two data-loaders are prepared with batch size of for and for . A cosine learning rate scheduler is used with an initial learning rate of , taking totally epochs. For hyperparameters of UDG, we set and for all experiments. Group number for CIFAR-10/100 is / with IDF threshold .
In this section, we analyze the effect of every major component, including the classification task, the OE loss, the auxiliary deep clustering task, and the in-distribution filtering (IDF) operator, using Table 3. In summary, the OE loss and the IDF operator are shown to be most effective for OOD detection, while the auxiliary deep clustering task can further improve the performance. Notably, we also report the basic classification accuracy on the CIFAR-10 test set, denoted as ACC. Detailed explanations follow.
Effectiveness of Unlabeled Data. Table 3 is divided into two major blocks according to the usage of unlabeled Tiny-ImageNet. In this part, we discuss the differences brought by the introduction of unlabeled data. Exp operates the standard classification where only the labeled set is used, and Exp is the standard OE method using an additional unlabeled set. The result shows that the OOD detection ability is enhanced by a large margin, as FPR95 gets a significant 7.74% improvement. Figure 4-a compares the histogram of maximum prediction scores between Exp and . Overconfidence is largely reduced thanks to the unlabeled samples, and the ID/OOD discrepancy is also enlarged. However, it is also worth noting that the ID classification accuracy is reduced from 94.94% to 91.87%. Since we use a realistic unlabeled set mixing both ID and OOD for the OE loss, the unlabeled ID data wrongly contributes to the OE loss and therefore harms the classification performance.
Analysis of Unsupervised Dual Grouping. The contribution of UDG is twofold: 1) enabling the IDF operation; 2) creating an auxiliary loss. In this part, we especially focus on the analysis of the auxiliary loss. Exp operates a standard deep clustering on CIFAR-10 (). The fully-connected layer is finetuned on the training set at the end of the training. A distressing result is obtained on all metrics. Even worse is Exp, which shows that expanding the training set with unlabeled data dominated by OOD further destroys the OOD detection ability of fully unsupervised methods. A possible explanation is that, without knowing which samples are ID or OOD, deep clustering may easily group ID and OOD samples into one cluster, thus giving up the OOD detection capability. Therefore, although Exp and show that on the standard CIFAR training set the auxiliary loss can benefit all metrics, the comparison between Exp and Exp unfortunately shows that once OOD-mixed unlabeled data is involved, a simple combination of the classification task and the unsupervised deep clustering task damages the OOD detection ability. Fortunately, after introducing the OOD discrepancy loss, the comparison between Exp and Exp shows that the disadvantage of the auxiliary loss can be largely reduced, rekindling our hope of regaining its value.
Table 4: Comparison with previous state-of-the-art methods on the two SC-OOD benchmarks, averaged over the 6 OOD datasets of each benchmark. The last four columns report CCR@FPR at four FPR levels.
|Method||FPR95||AUROC||AUPR (In / Out)||CCR@FPR||CCR@FPR||CCR@FPR||CCR@FPR|
|CIFAR-10 benchmark|
|ODIN||52.00||82.00||73.13 / 85.12||0.36||1.29||6.92||39.37|
|EBO||50.03||83.83||77.15 / 85.11||0.49||1.93||9.12||46.48|
|OE||50.53||88.93||87.55 / 87.83||13.41||20.25||33.91||68.20|
|MCD||73.02||83.89||83.39 / 80.53||5.41||12.3||28.02||62.02|
|UDG (ours)||36.22||93.78||93.61 / 92.61||13.87||34.48||59.97||82.14|
|CIFAR-100 benchmark|
|ODIN||81.89||77.98||78.54 / 72.56||1.84||5.65||17.77||46.73|
|EBO||81.66||79.31||80.54 / 72.82||2.43||7.26||21.41||49.39|
|OE||80.06||78.46||80.22 / 71.83||2.74||8.37||22.18||46.75|
|MCD||85.14||74.82||75.93 / 69.14||1.06||4.60||16.73||41.83|
|UDG (ours)||75.45||79.63||80.69 / 74.10||3.85||8.66||20.57||44.47|
Effectiveness of In-Distribution Filtering. The contribution of IDF is twofold: 1) improving ID classification by collecting ID samples from the unlabeled set for a better classification loss, and 2) purifying the unlabeled ID/OOD mixture into a clean OOD set for a better OE loss. The comparison between Exp and (UDG but ) illustrates that IDF achieves the above goals with a significant 10.4% benefit on FPR95 and an improvement in classification accuracy. By using cleaner ID and OOD sets, the auxiliary task finally becomes beneficial in the completed version of Exp.
Table 4 compares our proposed approach with previous state-of-the-art OOD detection methods. Here we only report the average metric values on all 6 OOD datasets for each benchmark due to limited space. Full results are shown in Appendix. Results show that our proposed UDG achieves better results on both SC-OOD benchmarks.
ODIN and the Energy-based OOD detector (EBO) are two representative post-processing OOD methods. We report their best results after hyperparameter searching. Their performance is usually inferior to the OE method.
Maximum Classifier Discrepancy (MCD) enlarges the entropy discrepancy between two branches to detect OOD samples. However, we find that it significantly overfits the training OOD samples and is difficult to generalize to other OOD domains, leading to disappointing results.
UDG achieves state-of-the-art results on all metrics of OOD detection. In particular, FPR95 is significantly reduced on both benchmarks. The full table in the appendix shows that UDG not only has an advantage on revised SC-OOD datasets such as CIFAR-Places365, but also benefits classic OOD detection test dataset pairs such as CIFAR-Texture, which do not have semantic conflicts.
Figure 4-b shows the influence of the pre-defined number of clusters. Generally, increasing the number of clusters helps converge to the optimal result. When the number of clusters is small, the larger group size prevents any group from belonging completely to one class, making it difficult for IDF to filter out any ID samples. Also, large groups inevitably include both ID and OOD samples, leading the deep clustering task to obscure the ID/OOD discrepancy. Therefore, our proposed method requires a sufficiently large number of clusters. Fortunately, experiments show that the results are insensitive to the number of clusters when it is large, reflecting the practicality of our method.
Apart from the IDF proposed in Section 3.4, there are two straightforward alternatives for ID sample filtering. One solution (denoted as "THRESH") filters all unlabeled samples whose maximum softmax scores exceed a predefined threshold. Another solution (called "SORT") sorts all unlabeled samples by their maximum softmax scores and selects a fixed number of top-ranked samples as new ID samples. We also denote our proposed group-based IDF strategy as "UDG", which takes all unlabeled samples in a group with ID purity over a threshold as ID samples. Figure 4-c shows the comparison between them. Generally, our proposed UDG obtains the best FPR95 compared to the other IDF strategies. SORT achieves the worst performance, since it directly includes a certain number of unlabeled images as ID from the beginning of the end-to-end training. The wrongly introduced samples join the classification task, so the error can accumulate, preventing the model from properly acquiring the OOD detection ability. A smaller setting lets SORT include more unlabeled images without any guarantee of filtering accuracy. THRESH performs better due to its better control over the inclusion of unlabeled samples. However, it is still not comparable to UDG, which takes ID samples in a more conservative manner by exploiting a grouping mechanism to reduce the overconfidence of deep models. The results indicate that group-based ID filtering performs more stably than sample-based methods.
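The two sample-based alternatives can be sketched as follows, to contrast them with group-based filtering. The function names, the scores, and the parameter values are hypothetical and only illustrate the selection rules described above.

```python
def thresh_filter(max_scores, threshold):
    """THRESH: take every unlabeled sample whose score exceeds the threshold."""
    return [i for i, s in enumerate(max_scores) if s > threshold]

def sort_filter(max_scores, top_n):
    """SORT: take the top_n unlabeled samples ranked by score."""
    order = sorted(range(len(max_scores)), key=lambda i: max_scores[i], reverse=True)
    return sorted(order[:top_n])

# Maximum softmax scores of five unlabeled samples.
scores = [0.95, 0.40, 0.99, 0.70, 0.85]
print(thresh_filter(scores, threshold=0.9))
print(sort_filter(scores, top_n=3))
```

SORT always admits its quota regardless of how confident the model actually is, while THRESH admits whatever passes the bar. Both decide per sample and therefore inherit the overconfidence of the individual softmax scores.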
In this paper, we highlight a problem of current OOD benchmarks, which split ID/OOD according to the data source rather than the semantic meaning, and therefore re-design realistic and challenging SC-OOD benchmarks. An elegant pipeline named UDG is proposed and achieves state-of-the-art results on the SC-OOD benchmarks using a realistic unlabeled set. We hope that the more realistic and challenging SC-OOD setting provides new research opportunities for the OOD community and draws researchers' attention to the importance of data review.
This work was supported by Innovation and Technology Commission of the Hong Kong Special Administrative Region, China (Enterprise Support Scheme under the Innovation and Technology Fund B/E030/18), NTU NAP, and RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).
WAIC, but why? Generative ensembles for robust anomaly detection. arXiv preprint arXiv:1810.01392.
A baseline for detecting misclassified and out-of-distribution examples in neural networks. In ICLR.
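Table A1: Performance change from DD-OOD to SC-OOD on CIFAR-10 + Tiny-ImageNet (TIN). Each cell lists the four states in order: TIN test set with nearest interpolation - TIN val set with nearest - TIN val set with bi-linear - TIN val set with bi-linear and re-splitting (SC-OOD).
|Method||FPR95||AUROC|
|ODIN||0.46 - 14.3 - 49.9 - 55.0||99.8 - 97.3 - 88.3 - 88.8|
|EBO||1.56 - 22.8 - 45.6 - 50.6||99.5 - 95.9 - 90.2 - 90.4|
|MCD||0.01 - 59.1 - 61.5 - 68.6||99.9 - 93.3 - 89.3 - 88.9|
|UDG||12.3 - 18.3 - 43.7 - 48.3||97.9 - 96.7 - 91.0 - 91.1|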
In this section, we explain in more detail the SC-OOD benchmark formation mentioned in Section 4. We first comprehensively describe the difference between the proposed SC-OOD benchmark and the DD-OOD benchmark. We scrutinized the existing well-known OOD detection benchmarks (referred to as DD-OOD) and found that they actually use nearest interpolation when resizing OOD images to the ID image size. As shown in Figure A1, DD-OOD images look coarser and grainier than ID images, resulting in a detectable sensory difference between 'smooth' ID images and 'coarse' OOD images. In this case, OOD detection methods targeting the DD-OOD benchmark could impractically focus on low-level covariate shifts and ignore the high-level semantic differences for the final decision. Therefore, we propose a more challenging SC-OOD task that actually focuses on semantics. In the SC-OOD benchmarks, we use the alternative bi-linear interpolation method for resizing, which yields smoother images that are more similar to ID images. We believe this encourages models to focus more on semantics for OOD detection, reflecting the purpose of the SC-OOD benchmark. Afterward, we redirect the ID samples from the OOD datasets, as explained in Section 4.
In sum, there are two steps from DD-OOD to SC-OOD: 1) using bi-linear interpolation instead of nearest interpolation for resizing; 2) re-splitting the ID and OOD test sets according to semantics.
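The difference between the two interpolation methods can be illustrated on a tiny grayscale image. This is a simplified sketch: the function name is hypothetical, and the corner-aligned coordinate mapping is an assumption that differs from the defaults of common image libraries, so it only conveys why bi-linear resizing looks smoother.

```python
def resize(image, out_h, out_w, mode):
    """Resize a 2-D list of gray values with nearest or bilinear sampling."""
    in_h, in_w = len(image), len(image[0])
    out = []
    for r in range(out_h):
        # Map output coordinates onto the input grid (corner-aligned).
        y = r * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        row = []
        for c in range(out_w):
            x = c * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            if mode == "nearest":
                # Copy the closest source pixel: blocky output.
                row.append(image[int(y + 0.5)][int(x + 0.5)])
            elif mode == "bilinear":
                # Blend the four surrounding source pixels: smooth output.
                y0, x0 = int(y), int(x)
                y1, x1 = min(y0 + 1, in_h - 1), min(x0 + 1, in_w - 1)
                wy, wx = y - y0, x - x0
                top = image[y0][x0] * (1 - wx) + image[y0][x1] * wx
                bottom = image[y1][x0] * (1 - wx) + image[y1][x1] * wx
                row.append(top * (1 - wy) + bottom * wy)
            else:
                raise ValueError("mode must be 'nearest' or 'bilinear'")
        out.append(row)
    return out

image = [[0, 10],
         [20, 30]]
nearest = resize(image, 3, 3, "nearest")
bilinear = resize(image, 3, 3, "bilinear")
print(nearest)
print(bilinear)
```

Nearest interpolation repeats source pixels and produces hard steps, whereas bi-linear interpolation produces intermediate values. A detector can exploit such low-level texture differences without looking at semantics, which is the shortcut the SC-OOD benchmark tries to remove.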
Table A1 shows the performance changes from DD-OOD to SC-OOD on CIFAR-10 + Tiny-ImageNet (TIN). Four states are recorded as OOD TIN set gradually changes: 1) TIN test set, nearest (interpolation), 2) TIN val set, nearest, 3) TIN val set, bi-linear, 4) TIN val set, bi-linear, with re-splitting as SC-OOD eventually. We use TIN val set because it contains ground-truth labels for easier re-splitting. The result shows that even changing test set into validation set will break the perfect performance of some existing methods. Bigger drop exists when interpolation methods change. This drop is understandable since the same interpolation will eliminate all major covariate shifts, but ID and OOD are not yet separated by semantics. However, semantic re-splitting continues to destroy model performance, but UDG gets minimal decrease and better overall scores on both metrics, showing a better understanding of semantics.
In this section, we visualize the heatmap activated by the previous method ODIN  and our proposed UDG on their prediction. We found that the semantic capabilities of UDG are significantly stronger than ODIN, since our model can focus more on the semantic area of the image, while ODIN usually distracts, sometimes even focuses on some irrelevant area (fourth image in Figure A2).
In this section, we visualize the in-distribution filtering (IDF) process that is described in Section 3.4. According to Figure A3, we find that the major unlabeled data that falls in the high-purity ID group is actually ID samples that belong to the corresponding category. We also notice that this method can also include these images with a relatively low confidence score, for example, for the cluster of birds, the last two images only have confidence around 0.3, which might be difficult to be filtered as ID if we only consider the confidence score. However, our method would be able to include them. According to the second image of cars, even though the network provides an incorrect pseudo label of truck for the car, our IDF strategy can correct the mistake. Even though there are also few mistakes introduced (scorpion in the ship’s group), it will be corrected when re-grouping in the next epoch. In addition, the overconfidence property of neural networks might give a high confidence score for wrong images, while the filtering strategy of UDG can also help prevent this mistake. In sum, the detailed visualization shows the reliability of the proposed IDF method.
Table A2 and Table A3 expand the average values reported in Table 4. We also run experiments on another network architecture, WideResNet-28. The results generally follow the same trend as with the ResNet-18 architecture. The proposed UDG has advantages on almost all metrics, showing that our method enhances both ID classification and OOD detection. Notably, its advantages on Tiny-ImageNet, LSUN, and Places365 contribute most to the good mean performance across all OOD detection metrics. We consider these datasets the difficult part of the benchmark, since many of their objects have similar but distinct semantics. Good results are also achieved on the easier Texture and SVHN datasets.
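The Mean rows in the expanded tables appear to be plain arithmetic means over the six OOD datasets. The short sketch below checks this for the first two numeric columns of the first block of rows listed at the end of this section. The values are copied from those rows; the variable and function names are hypothetical.

```python
# First two numeric columns of the first block of rows (copied from the table).
first_block = {
    "Texture": (52.27, 90.81),
    "SVHN": (50.25, 92.65),
    "CIFAR-100": (61.19, 87.40),
    "Tiny-ImageNet": (65.32, 87.32),
    "LSUN": (58.62, 89.34),
    "Places365": (61.99, 87.96),
}
reported_mean = (58.27, 89.25)  # the block's Mean row, same two columns


def column_means(rows):
    """Arithmetic mean of each column, rounded to two decimals."""
    columns = list(zip(*rows.values()))
    return tuple(round(sum(col) / len(col), 2) for col in columns)


computed_mean = column_means(first_block)
```

The computed values match the reported Mean row for these two columns, which is consistent with an unweighted average in which every OOD dataset counts equally regardless of its size.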
Here we post an answer that we highlighted during the rebuttal period to help readers better understand our paper.
[On Motivation of SC-OOD] Classic OOD detection aims to train a ‘conservative’ model that distinguishes samples with either a covariate shift on the source distribution or a semantic shift on the label distribution. However, we notice an impractical goal of classic OOD detection: to perfectly distinguish CIFAR cars from ImageNet cars, even though their covariate shift is negligible. This unrealistic goal unfortunately results in an extremely narrow range of capabilities for deployed models, greatly limiting their use in real applications such as autonomous cars. To address this problem, we form a new, realistic, and challenging SC-OOD task that is juxtaposed to classic OOD detection. SC-OOD re-defines the ‘distribution’ as the label distribution only, rather than the classic definition that also covers the source distribution. Under the SC-OOD setting, models are required to: 1) reliably detect images from different label distributions, and 2) correctly classify images within the same label distribution despite negligible source distribution shifts, which is consistent with a popular research topic called robustness of deep learning.
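The difference between the two ground-truth definitions can be made concrete with a small sketch. The sample records, the function names, and the assumption that Tiny-ImageNet labels have already been mapped to coarse class names comparable to the CIFAR-10 classes are all hypothetical and serve only to illustrate the idea.

```python
# The ten CIFAR-10 class names, used here as the ID label set.
ID_CLASSES = {
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
}


def is_ood_classic(sample, id_source="CIFAR-10"):
    """Dataset-dependent target: anything from another source is OOD."""
    return sample["source"] != id_source


def is_ood_sc(sample, id_classes=ID_CLASSES):
    """Semantically coherent target: only the label decides."""
    return sample["label"] not in id_classes


# Hypothetical samples; labels are assumed to be mapped to coarse names.
samples = [
    {"source": "CIFAR-10", "label": "dog"},
    {"source": "Tiny-ImageNet", "label": "dog"},
    {"source": "Tiny-ImageNet", "label": "scorpion"},
]

classic_targets = [is_ood_classic(s) for s in samples]
sc_targets = [is_ood_sc(s) for s in samples]
```

Under the classic definition, the Tiny-ImageNet dog is an OOD sample that the model must reject. Under SC-OOD, it is an ID sample that the model should classify as a dog, and only the scorpion remains OOD.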
Here we list our current shortcomings. Although UDG largely alleviates the decline in classification performance caused by the OE method, it cannot yet exceed the standard ID classification performance. More exploration is needed on how to better use unlabeled data to achieve stronger ID classification while retaining OOD detection capabilities. We will also attempt to analyze UDG on larger datasets such as ImageNet, with high-resolution images and complex semantics.
|Texture||52.27||90.81||94.07 / 82.32||0.10||1.32||20.84||79.77||95.02|
|SVHN||50.25||92.65||87.54 / 95.84||2.47||10.73||48.22||83.96||95.02|
|CIFAR-100||61.19||87.40||86.30 / 85.35||0.07||1.72||12.30||69.56||95.02|
|Tiny-ImageNet||65.32||87.32||89.41 / 81.17||0.40||2.44||14.16||71.86||92.54|
|LSUN||58.62||89.34||89.30 / 86.99||0.88||3.53||19.31||76.46||95.02|
|Places365||61.99||87.96||72.61 / 94.64||0.74||2.86||15.63||72.72||93.87|
|Mean||58.27||89.25||86.54 / 87.72||0.78||3.77||21.74||75.72||94.42|
|Texture||42.52||84.06||86.01 / 80.73||0.02||0.18||3.71||40.14||95.02|
|SVHN||52.27||83.26||63.76 / 92.60||1.01||4.00||11.82||44.85||95.02|
|CIFAR-100||56.34||78.40||73.21 / 80.99||0.10||0.38||4.43||30.11||95.02|
|Tiny-ImageNet||59.09||79.69||79.34 / 77.52||0.36||0.63||4.49||34.52||92.54|
|LSUN||47.85||84.56||81.56 / 85.58||0.21||0.85||9.92||46.95||95.02|
|Places365||53.94||82.01||54.92 / 93.30||0.47||1.68||7.13||39.63||93.87|
|Mean||52.00||82.00||73.13 / 85.12||0.36||1.29||6.92||39.37||94.42|
|Texture||52.11||80.70||83.34 / 75.20||0.01||0.13||2.79||31.96||95.02|
|SVHN||30.56||92.08||80.95 / 96.28||1.85||5.74||21.44||75.81||95.02|
|CIFAR-100||56.98||79.65||75.09 / 81.23||0.10||0.69||4.74||34.28||95.02|
|Tiny-ImageNet||57.81||81.65||81.80 / 78.75||0.33||0.95||6.01||40.40||92.54|
|LSUN||50.56||85.04||82.80 / 85.29||0.24||1.96||11.35||50.43||95.02|
|Places365||52.16||83.86||58.96 / 93.90||0.39||2.11||8.38||46.00||93.87|
|Mean||50.03||83.83||77.15 / 85.11||0.49||1.93||9.12||46.48||94.42|
|Texture||83.92||81.59||90.20 / 63.27||4.97||10.51||29.52||62.10||90.56|
|SVHN||60.27||89.78||85.33 / 94.25||20.05||38.23||55.43||74.01||90.56|
|CIFAR-100||74.00||82.78||83.97 / 79.16||0.80||4.99||18.88||58.18||90.56|
|Tiny-ImageNet||78.89||80.98||85.63 / 72.48||1.62||4.15||19.37||56.08||87.33|
|LSUN||68.96||84.71||85.74 / 81.50||1.75||7.93||21.88||61.54||90.56|
|Places365||72.08||83.51||69.44 / 92.52||3.29||7.97||23.07||60.22||88.51|
|Mean||73.02||83.89||83.39 / 80.53||5.41||12.30||28.02||62.02||89.68|
|Texture||51.17||89.56||93.79 / 81.88||6.58||11.80||27.99||71.13||91.87|
|SVHN||20.88||96.43||93.62 / 98.32||32.72||47.33||67.20||86.75||91.87|
|CIFAR-100||58.54||86.22||86.17 / 84.88||3.64||6.55||19.04||61.11||91.87|
|Tiny-ImageNet||58.98||87.65||90.90 / 82.16||14.37||18.84||33.65||66.03||89.27|
|LSUN||57.97||86.75||87.69 / 85.07||11.80||19.62||29.22||61.95||91.87|
|Places365||55.64||87.00||73.11 / 94.67||11.36||17.36||26.33||62.23||90.99|
|Mean||50.53||88.93||87.55 / 87.83||13.41||20.25||33.91||68.20||91.29|
|Texture||20.43||96.44||98.12 / 92.91||19.90||43.33||69.19||87.71||92.94|
|SVHN||13.26||97.49||95.66 / 98.69||36.64||56.81||76.77||89.54||92.94|
|CIFAR-100||47.20||90.98||91.74 / 89.36||1.50||10.94||40.34||75.89||92.94|
|Tiny-ImageNet||50.18||91.91||94.43 / 86.99||0.32||23.15||53.96||78.36||90.22|
|LSUN||42.05||93.21||94.53 / 91.03||14.26||37.59||60.62||81.69||92.94|
|Places365||44.22||92.64||87.17 / 96.66||10.62||35.05||58.96||79.63||91.68|
|Mean||36.22||93.78||93.61 / 92.61||13.87||34.48||59.97||82.14||92.28|
|Texture||84.04||75.85||85.72 / 58.63||0.41||3.67||16.26||45.84||76.65|
|SVHN||80.12||80.01||70.84 / 88.52||9.90||17.77||31.00||52.94||76.65|
|CIFAR-10||80.64||78.33||80.69 / 74.04||0.00||5.94||21.09||49.10||76.65|
|Tiny-ImageNet||83.32||77.85||86.97 / 61.73||2.43||7.55||24.69||48.29||69.56|
|LSUN||83.03||77.31||86.31 / 1.45||3.38||6.73||21.49||47.88||76.10|
|Places365||77.57||79.99||67.55 / 89.21||1.11||6.02||22.72||51.69||77.56|
|Mean||81.45||78.22||79.68 / 72.26||2.87||7.95||22.88||49.29||75.53|
|Texture||79.47||77.92||86.69 / 62.97||2.66||4.66||15.09||45.82||76.65|
|SVHN||90.33||75.59||65.25 / 84.49||4.98||12.02||23.79||46.61||76.65|
|CIFAR-10||81.82||77.90||79.93 / 73.39||0.09||3.69||15.39||47.20||76.65|
|Tiny-ImageNet||82.74||77.58||86.26 / 61.38||0.20||3.78||15.99||45.56||69.56|
|LSUN||80.57||78.22||86.34 / 63.44||1.68||5.59||17.37||45.56||76.10|
|Places365||76.42||80.66||66.77 / 89.66||1.45||4.16||18.98||49.60||77.56|
|Mean||81.89||77.98||78.54 / 72.56||1.84||5.65||17.77||46.73||75.53|
|Texture||84.29||76.32||85.87 / 59.12||0.82||3.89||14.37||44.60||76.65|
|SVHN||78.23||83.57||75.61 / 90.24||9.67||17.27||33.70||57.26||76.65|
|CIFAR-10||81.25||78.95||80.01 / 74.44||0.05||4.63||18.03||48.67||76.65|
|Tiny-ImageNet||83.32||78.34||87.08 / 62.13||1.04||6.37||21.44||47.92||69.56|
|LSUN||84.51||77.66||86.42 / 61.40||1.59||6.44||19.58||46.66||76.10|
|Places365||78.37||80.99||68.22 / 89.60||1.40||4.94||21.32||51.21||77.56|
|Mean||81.66||79.31||80.54 / 72.82||2.43||7.26||21.41||49.39||75.53|
|Texture||83.97||73.46||83.11 / 56.79||0.07||1.03||9.29||38.09||68.80|
|SVHN||85.82||76.61||65.50 / 85.52||3.03||8.66||23.15||45.44||68.80|
|CIFAR-10||87.74||73.15||76.51 / 67.24||0.35||3.26||16.18||41.41||68.80|
|Tiny-ImageNet||84.46||75.32||85.11 / 59.49||0.24||6.14||19.66||41.44||62.21|
|LSUN||86.08||74.05||84.21 / 58.62||1.57||5.16||18.05||41.25||67.51|
|Places365||82.74||76.30||61.15 / 87.19||1.08||3.35||14.04||43.37||70.47|
|Mean||85.14||74.82||75.93 / 69.14||1.06||4.60||16.73||41.83||67.77|
|Texture||86.56||73.89||84.48 / 54.84||0.66||2.86||12.86||41.81||70.49|
|SVHN||68.87||84.23||75.11 / 91.41||7.33||14.07||31.53||54.62||70.49|
|CIFAR-10||79.72||78.92||81.95 / 74.28||2.82||9.53||23.90||48.21||70.49|
|Tiny-ImageNet||83.41||76.99||86.36 / 60.56||0.22||8.50||21.95||43.98||63.69|
|LSUN||83.53||77.10||86.28 / 60.97||1.72||7.91||22.61||44.19||69.89|
|Places365||78.24||79.62||67.13 / 88.89||3.69||7.35||20.22||47.68||72.02|
|Mean||80.06||78.46||80.22 / 71.83||2.74||8.37||22.18||46.75||69.51|
|Texture||75.04||79.53||87.63 / 65.49||1.97||4.36||9.49||33.84||68.51|
|SVHN||60.00||88.25||81.46 / 93.63||14.90||25.50||38.79||56.46||68.51|
|CIFAR-10||83.35||76.18||78.92 / 71.15||1.99||5.58||17.27||42.11||68.51|
|Tiny-ImageNet||81.73||77.18||86.00 / 61.67||0.67||4.82||17.80||41.72||61.80|
|LSUN||78.70||76.79||84.74 / 63.05||1.59||5.34||18.04||44.70||67.10|
|Places365||73.86||79.87||65.36 / 89.60||1.96||6.33||22.03||47.97||69.83|
|Mean||75.45||79.63||80.69 / 74.10||3.85||8.66||20.57||44.47||67.38|
|Texture||50.16||89.68||92.45 / 81.81||0.00||0.04||12.16||76.32||96.08|
|SVHN||30.54||95.44||92.81 / 97.49||8.75||25.94||72.94||89.16||96.08|
|CIFAR-100||51.38||89.15||87.42 / 87.99||0.02||0.77||11.15||75.25||96.08|
|Tiny-ImageNet||56.98||88.96||90.14 / 84.19||0.03||0.71||13.85||75.72||93.69|
|LSUN||47.05||90.54||88.99 / 89.44||0.20||0.80||11.97||79.25||96.08|
|Places365||53.44||89.18||70.65 / 95.54||0.04||0.74||9.22||75.86||95.02|
|Mean||48.26||90.49||87.08 / 89.41||1.51||4.83||21.88||78.59||95.51|
|Texture||47.50||81.23||82.94 / 78.25||0.00||0.00||1.81||32.69||96.08|
|SVHN||51.17||85.36||68.02 / 93.53||1.10||3.54||13.08||53.04||96.08|
|CIFAR-100||52.92||79.47||73.57 / 82.59||0.00||0.36||3.97||30.55||96.08|
|Tiny-ImageNet||54.86||80.39||78.82 / 79.48||0.01||0.36||3.12||33.69||93.69|
|LSUN||46.53||81.86||75.70 / 85.03||0.25||0.68||3.91||33.49||96.08|
|Places365||49.03||81.49||49.84 / 93.60||0.04||0.55||3.72||33.14||95.02|
|Mean||50.33||81.63||71.48 / 85.41||0.23||0.91||4.94||36.10||95.51|
|Texture||40.44||89.55||91.16 / 84.41||0.00||0.00||5.41||71.35||96.08|
|SVHN||16.13||96.90||93.77 / 98.47||2.93||18.26||68.48||91.28||96.08|
|CIFAR-100||42.41||88.97||85.73 / 89.42||0.01||0.72||8.77||67.94||96.08|
|Tiny-ImageNet||45.81||89.55||89.55 / 86.72||0.03||0.61||9.93||73.79||93.69|
|LSUN||37.14||90.58||87.47 / 91.07||0.29||0.83||8.51||76.21||96.08|
|Places365||39.84||89.86||68.32 / 96.33||0.04||0.68||7.15||73.24||95.02|
|Mean||36.96||90.90||86.00 / 91.07||0.55||3.52||18.04||75.64||95.51|
|Texture||93.19||70.58||82.49 / 49.12||0.00||0.15||7.65||44.96||87.85|
|SVHN||88.68||81.37||74.43 / 86.75||3.28||8.65||28.28||66.86||87.85|
|CIFAR-100||83.29||76.58||77.17 / 72.50||0.03||0.72||10.47||45.36||87.85|
|Tiny-ImageNet||86.60||74.83||80.53 / 64.30||0.04||2.48||12.88||44.47||85.58|
|LSUN||93.06||70.14||72.62 / 63.38||0.55||2.81||10.51||36.16||87.85|
|Places365||93.13||70.42||49.04 / 84.32||0.10||2.39||9.65||36.37||86.48|
|Mean||89.66||73.99||72.71 / 70.06||0.67||2.87||13.24||45.70||87.24|
|Texture||35.14||92.44||95.27 / 87.17||5.27||8.94||31.17||79.23||94.95|
|SVHN||22.94||96.23||94.14 / 97.78||37.34||52.79||73.87||88.74||94.95|
|CIFAR-100||52.99||87.17||86.80 / 86.09||1.72||6.83||21.22||63.16||94.95|
|Tiny-ImageNet||55.53||87.43||90.20 / 82.58||4.58||13.91||28.61||64.92||92.72|
|LSUN||59.69||85.56||86.18 / 83.67||5.18||11.55||26.09||58.88||94.95|
|Places365||55.30||85.75||69.15 / 94.25||4.50||10.31||22.42||56.79||94.24|
|Mean||46.93||89.10||86.96 / 88.59||9.76||17.39||33.90||68.62||94.46|
|Texture||22.59||95.86||97.49 / 92.59||0.87||8.92||58.06||87.56||94.50|
|SVHN||17.23||97.23||95.43 / 98.64||45.32||60.75||78.46||89.84||94.50|
|CIFAR-100||43.36||91.53||92.08 / 90.21||5.19||12.28||37.79||77.03||94.50|
|Tiny-ImageNet||39.33||93.90||95.90 / 90.01||4.86||27.52||64.17||82.97||92.07|
|LSUN||30.17||95.25||96.06 / 94.05||13.28||36.98||66.03||86.35||94.50|
|Places365||35.24||94.31||89.24 / 97.55||8.39||27.67||61.10||83.75||93.33|
|Mean||31.32||94.68||94.36 / 93.84||12.98||29.02||60.93||84.58||93.90|
|Texture||84.24||76.10||85.25 / 58.36||0.24||2.19||9.78||46.20||80.25|
|SVHN||79.63||78.95||65.45 / 88.22||1.42||4.26||17.14||51.39||80.25|
|CIFAR-10||77.07||80.81||83.16 / 76.76||0.49||9.19||25.03||53.94||80.25|
|Tiny-ImageNet||81.25||79.12||87.75 / 63.33||0.31||5.34||24.75||51.64||72.92|
|LSUN||81.32||78.51||86.81 / 62.95||0.51||2.57||20.03||50.74||78.54|
|Places365||75.28||80.84||67.81 / 89.76||1.49||4.63||20.12||53.24||80.03|
|Mean||79.80||79.05||79.37 / 73.23||0.74||4.70||19.48||51.19||78.71|
|Texture||78.88||76.46||84.68 / 62.45||0.15||1.52||10.21||41.44||80.25|
|SVHN||92.26||68.41||49.07 / 81.28||1.73||2.93||8.02||28.93||80.25|
|CIFAR-10||78.22||80.14||81.43 / 76.26||0.06||3.09||15.78||50.75||80.25|
|Tiny-ImageNet||80.54||77.88||85.89 / 62.67||0.24||2.25||13.97||45.53||72.92|
|LSUN||78.11||78.66||85.57 / 65.68||0.19||1.26||11.69||45.32||78.54|
|Places365||73.62||80.57||63.79 / 90.13||0.86||2.79||13.03||47.47||80.03|
|Mean||80.27||77.02||75.07 / 73.08||0.54||2.31||12.12||43.24||78.71|
|Texture||84.22||76.13||85.08 / 58.51||0.08||1.55||10.04||44.24||80.25|
|SVHN||80.05||79.88||65.44 / 88.37||0.97||3.88||14.93||50.85||80.25|
|CIFAR-10||76.18||81.50||83.34 / 77.36||0.45||6.11||21.03||53.73||80.25|
|Tiny-ImageNet||80.78||79.94||88.02 / 64.18||0.06||4.92||22.31||51.82||72.92|
|LSUN||82.59||78.74||86.71 / 62.94||0.64||1.55||17.71||49.76||78.54|
|Places365||74.54||81.63||67.67 / 90.18||1.13||3.69||17.55||52.47||80.03|
|Mean||79.73||79.64||79.38 / 73.59||0.55||3.62||17.26||50.48||78.71|
|Texture||91.33||69.03||79.60 / 49.66||0.00||0.29||4.49||32.61||68.80|
|SVHN||87.03||73.48||52.89 / 84.73||1.74||2.90||6.68||33.88||68.80|
|CIFAR-10||86.89||73.79||76.15 / 68.38||0.26||2.88||13.40||39.94||68.80|
|Tiny-ImageNet||85.16||74.59||84.19 / 58.36||1.01||2.58||13.71||40.31||62.22|
|LSUN||88.67||72.04||83.06 / 54.33||1.13||3.58||15.95||39.58||67.29|
|Places365||86.83||74.05||59.58 / 85.28||1.24||3.66||14.85||41.07||69.77|
|Mean||87.65||72.83||72.58 / 66.79||0.90||2.65||11.51||37.90||67.61|
|Texture||93.07||67.00||78.92 / 46.52||0.02||0.52||5.50||32.16||74.01|
|SVHN||88.74||76.14||66.07 / 85.17||7.06||12.91||24.82||47.43||74.01|
|CIFAR-10||78.82||79.36||81.29 / 75.27||1.08||7.63||17.49||48.84||74.01|
|Tiny-ImageNet||83.34||78.35||87.34 / 61.78||1.06||8.84||24.40||47.64||66.49|
|LSUN||84.96||78.11||87.26 / 60.76||5.80||10.40||25.75||48.27||71.47|
|Places365||80.30||79.87||67.23 / 88.65||1.78||6.29||19.78||49.84||74.39|
|Mean||84.87||76.47||78.02 / 69.69||2.80||7.76||19.63||45.70||72.40|
|Texture||73.62||79.01||85.53 / 67.08||0.00||0.00||6.74||46.09||75.77|
|SVHN||66.76||85.29||76.14 / 92.33||8.00||15.83||32.57||58.05||75.77|
|CIFAR-10||82.35||76.67||78.52 / 72.63||0.51||3.90||15.29||44.79||75.77|
|Tiny-ImageNet||78.91||79.04||87.00 / 65.06||0.12||2.86||19.13||47.50||68.57|
|LSUN||77.04||79.79||87.49 / 66.93||2.51||6.01||22.33||49.14||73.93|
|Places365||72.25||81.49||66.72 / 90.65||1.19||3.28||17.59||50.82||76.10|
|Mean||75.16||80.21||80.23 / 75.78||2.05||5.31||18.94||49.40||74.32|