Since deep learning models have been deployed in many commercial applications, it is important to detect out-of-distribution (OOD) inputs correctly to maintain the performance of the models, ensure the quality of the collected data, and prevent the applications from being used for other-than-intended purposes. In this work, we propose a two-head deep convolutional neural network (CNN) and maximize the discrepancy between its two classifiers to detect OOD inputs. The two-head CNN consists of one common feature extractor and two classifiers that have different decision boundaries but can classify in-distribution (ID) samples correctly. Unlike previous methods, we also utilize unlabeled data for unsupervised training: maximizing the discrepancy between the two decision boundaries on these unlabeled data pushes OOD samples outside the manifold of the ID samples, which enables us to detect OOD samples that are far from the support of the ID samples. Overall, our approach significantly outperforms other state-of-the-art methods on several OOD detection benchmarks and in two cases of real-world simulation.
After several breakthroughs of deep learning methods, deep neural networks (DNNs) have achieved impressive results and even outperformed humans in fields such as image classification [17, 5]. Meanwhile, more and more commercial applications have implemented DNNs in their systems to solve different tasks with high accuracy and improve the performance of their products.
To achieve stable recognition performance, the inputs of these models should be drawn from the same distribution as the training data that was used to train the model. However, in the real world, the inputs are uploaded by users, so the application can be used in unusual environments or for other-than-intended purposes. This means that input samples can be drawn from different distributions and lead DNNs to produce wrong predictions. Therefore, for these applications, it is important to accurately detect out-of-distribution (OOD) samples.
In this work, we propose a new setting for unsupervised out-of-distribution detection. While previous studies [9, 14, 16, 25, 26] only use labeled ID data to train the neural network under supervision, we also utilize unlabeled data in the training process. Fig. 1 shows our experimental settings of OOD detection for food recognition. Though we do not know the semantic class of the unlabeled sample or whether the unlabeled sample is ID or OOD, we find that these data are helpful for improving the performance of OOD detection and this kind of unlabeled data can be obtained easily in real-world applications.
To utilize these unlabeled data, we also propose a novel out-of-distribution detection method for DNNs. Many OOD detection algorithms attempt to detect OOD samples using the confidence of the classifier [9, 14, 16, 26]. For each input, a confidence score is evaluated based on a pre-trained classifier; the score is then compared to a threshold, and a label is assigned to the input according to whether the confidence score is greater than the threshold. Samples with lower confidence scores (which means they are closer to the decision boundary) are classified as OOD samples, as shown in the upper part of Fig. 2. Previous works used CIFAR-10/CIFAR-100 as ID datasets and TinyImageNet/LSUN/iSUN as OOD datasets. Though there is a small overlap of classes between ID and OOD, we follow the same setting for comparison.
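The confidence-thresholding scheme described above can be sketched as follows. This is a minimal illustration of the general idea, not the authors' method; the function name and threshold value are our own.

```python
import torch
import torch.nn.functional as F

def is_ood_max_softmax(logits: torch.Tensor, threshold: float) -> torch.Tensor:
    """Flag inputs whose maximum softmax probability falls below `threshold`.

    Returns a boolean mask where True marks a (suspected) OOD input.
    """
    confidence = F.softmax(logits, dim=1).max(dim=1).values
    return confidence < threshold

# Example: a confident and an unconfident prediction over 3 classes.
logits = torch.tensor([[8.0, 0.1, 0.2],    # sharply peaked -> high confidence
                       [0.5, 0.4, 0.45]])  # nearly uniform -> low confidence
print(is_ood_max_softmax(logits, threshold=0.9))  # only the second is flagged
```

In practice the threshold is chosen on a held-out validation set, e.g. to fix the true positive rate at 95%.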
Although the existing methods are effective on some datasets, they still perform poorly when the ID dataset has a large number of classes, for example, when CIFAR-100 (a natural image dataset with 100 classes) is the in-distribution (ID) dataset and TinyImageNet (another natural image dataset with 200 classes) is the OOD dataset. This indicates that current methods cannot separate the confidence scores of ID samples and OOD samples well enough.
To overcome this problem, we introduce a two-head deep convolutional neural network (CNN) with one common feature extractor and two separate classifiers, as shown in Fig. 3. Since OOD samples do not clearly belong to any of the ID classes, or lie far from the distribution of ID samples, the two classifiers, having different parameters, will be confused and output different results. Consequently, as shown in the lower left part of Fig. 2, OOD samples will exist in the gap between the two decision boundaries, which makes it easier to detect them. To achieve better performance, we further fine-tune the neural network to correctly classify labeled ID samples and, simultaneously, to maximize the discrepancy between the decision boundaries of the two classifiers. Please note that we do not use labeled OOD samples for training in our method.
We evaluate our method on a diverse set of in- and out-of-distribution dataset pairs. In many settings, our method outperforms other methods by a large margin. The contributions of our paper are summarized as follows:
We propose a novel experimental setting and a novel training methodology for out-of-distribution detection in neural networks. Our method does not require labeled OOD samples for training and can be easily implemented on any modern neural architecture.
We propose utilizing the discrepancy between two classifiers to separate in-distribution samples and OOD samples.
| Method | Input pre-processing | Model change | Test complexity | Training data | AUROC |
| Hendrycks & Gimpel | No | No | 1 | Labeled ID data | 71.6 |
| ODIN | Yes | No | 3 | Labeled ID data | 90.7 |
| ELOC | Yes | Ensemble | 15 | Labeled ID data | 96.3 |
| Proposed | No | Fine tune | 2 | Labeled ID data & unlabeled data | 99.6 |
Currently, there are several different methods for out-of-distribution detection. A summary of the key methods described is shown in Table 1.
As the simplest method, Hendrycks & Gimpel attempted to detect OOD samples based on the predicted softmax class probability, following the observation that the prediction probability of incorrect and OOD samples tends to be lower than that of correct samples. However, they also found that some OOD samples can still be classified overconfidently by pre-trained neural networks, which limits the detection performance.
To improve the effectiveness of Hendrycks & Gimpel’s method , Lee et al.  used modified generative adversarial networks , which involves training a generator and a classifier simultaneously. They trained the generator to generate ‘boundary’ OOD samples that appear to be at the boundary of the given in-distribution data manifold, while the classifier is encouraged to assign these OOD samples uniform class probabilities in order to generate less confident predictions on them.
Liang et al. also proposed an improved solution, called Out-of-DIstribution detector for Neural networks (ODIN), which applies temperature scaling and input pre-processing. They found that scaling the unnormalized outputs (logits) before the final softmax layer by a large constant (temperature scaling) makes the difference between the largest logit and the remaining logits larger for ID samples than for OOD samples, increasing the separation of the softmax scores between ID and OOD samples. In addition, if small perturbations that increase the maximum predicted softmax score are added to the input through the loss gradient, the increases on ID samples are greater than those on OOD samples. Based on these observations, the authors first scaled the logits with a high temperature value to calibrate the softmax scores, and then pre-processed the input by perturbing it with the loss gradient to further increase the difference between the maximum softmax scores of ID and OOD samples. Their approach outperforms the baseline method.
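The two ingredients of ODIN can be sketched as below. The hyperparameter values (`temperature=1000`, `epsilon=0.0014`) are in the range commonly reported for ODIN, but should be treated as tunable assumptions rather than fixed constants.

```python
import torch
import torch.nn.functional as F

def odin_score(model, x, temperature=1000.0, epsilon=0.0014):
    """Sketch of the ODIN score: temperature scaling plus a small
    gradient-based input perturbation before re-scoring."""
    x = x.clone().requires_grad_(True)
    logits = model(x) / temperature
    # Perturb the input in the direction that raises the max softmax score;
    # the rise is larger for ID samples than for OOD samples.
    loss = F.cross_entropy(logits, logits.argmax(dim=1))
    loss.backward()
    x_perturbed = x - epsilon * x.grad.sign()
    with torch.no_grad():
        calibrated = F.softmax(model(x_perturbed) / temperature, dim=1)
    return calibrated.max(dim=1).values  # higher = more likely ID
```

A sample is declared OOD when this score falls below a threshold chosen on validation data.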
Other works extracted low- and upper-level features from DNNs to calculate a confidence score for detecting OOD samples. However, these methods require 1,000 labeled OOD samples to train a logistic regression detector to achieve stable performance.
Some studies [11, 12, 27] utilized the hierarchical relations between labels and trained two classifiers to have different generality (a general classifier and a specific classifier) by using different levels of the label hierarchy. OOD samples can be detected by incongruence between the general classifier and the specific classifier, but the requirement of label hierarchy limits the application of these methods.
There are some other studies on open-set classification [1, 2, 6, 21, 23, 24, 30], which involve tasks very similar to OOD detection. Bendale & Boult proposed a new layer, called OpenMax, that can calculate a score for an unknown class by taking the weighted average of all other classes obtained from a Weibull distribution.
The current state-of-the-art method for OOD detection is the ensemble of self-supervised leave-out classifiers proposed by Vyas et al. They divided the training ID data into partitions, assigning one partition as OOD and the remaining partitions as ID, and trained the classifiers with a novel loss function, called margin entropy loss, which increases the prediction confidence of ID samples and decreases that of OOD samples. At test time, they used an ensemble of these classifiers to detect OOD samples, in addition to the temperature scaling and input pre-processing proposed in ODIN.
Compared to previous studies, our method fine-tunes the neural network by utilizing unlabeled data for unsupervised learning. The unlabeled data consist of all or a part of the test data.
In this section, we present our proposed method for OOD detection. First, we describe the problem statements in Section 3.1. Second, we illustrate the overall concept of our method in Section 3.2. Then, our loss function is explained in Section 3.3 and we detail the actual training procedure of our method in Section 3.4. Finally, we introduce the method used to detect OOD samples at inference time in Section 3.5.
We suppose that an ID image-label pair, $(\mathbf{x}_s, y_s)$, drawn from a set of labeled ID images $\mathcal{D}_{ID}$, is accessible, as well as an unlabeled image, $\mathbf{x}_u$, drawn from a set of unlabeled images $\mathcal{D}_{UL}$. The ID samples can be classified into $K$ classes, which means $y_s \in \{1, \dots, K\}$. Please note that $\mathbf{x}_u$ can be either an ID image or an OOD image and carries no label, so we do not know whether this image is from in- or out-of-distribution. Unlike previous methods, we use $\mathcal{D}_{UL}$ for unsupervised training, which is realistic for real-world applications.
The goal of our method is to distinguish whether the image is from in-distribution or not. For this objective, we have to train the network to predict different softmax class probabilities for ID samples and OOD samples.
Hendrycks & Gimpel  showed that the prediction probability of OOD samples tends to be lower than the prediction probability of ID samples; thus, OOD samples are closer to the class boundaries and more likely to be misclassified or classified with low confidence by the classifier learned from ID samples (the upper part of Fig. 2).
Based on their findings, we further propose a two-head CNN inspired by maximum classifier discrepancy for domain adaptation, consisting of a feature extractor network, $E$, which takes inputs $\mathbf{x}_s$ or $\mathbf{x}_u$, and two classifier networks, $F_1$ and $F_2$, which take features from $E$ and classify them into $K$ classes. The classifier networks $F_1$ and $F_2$ each output a $K$-dimensional vector of logits; the class probabilities are then calculated by applying the softmax function to this vector. The notations $p_1(\mathbf{x})$ and $p_2(\mathbf{x})$ are used to denote the $K$-dimensional softmax class probabilities for input $\mathbf{x}$ obtained by $F_1$ and $F_2$, respectively. Differing from the original work, which aligns the distributions of two datasets for domain adaptation, we train the network with different loss functions and a different training procedure to detect the difference between the distributions of two datasets.
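The two-head architecture can be sketched as below. The convolutional backbone here is a small stand-in for illustration; the paper uses DenseNet / Wide ResNet trunks as the extractor.

```python
import torch
import torch.nn as nn

class TwoHeadCNN(nn.Module):
    """One shared feature extractor E and two classifier heads F1, F2
    over the same K classes (backbone simplified for illustration)."""

    def __init__(self, num_classes: int, feat_dim: int = 64):
        super().__init__()
        self.extractor = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # Two independently initialized heads -> different decision boundaries.
        self.f1 = nn.Linear(feat_dim, num_classes)
        self.f2 = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        feats = self.extractor(x)
        return self.f1(feats), self.f2(feats)
```

Because the two heads start from different random parameters, they learn different decision boundaries even when trained on the same labeled ID data.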
We found that when the two classifiers ($F_1$ and $F_2$) are initialized with random parameters and then trained on ID samples in a supervised manner, they acquire different characteristics and classify OOD samples differently (the lower part of Fig. 2). Fig. 4 shows the disagreement (L1 distance) between the two classifiers' outputs on unlabeled ID (CIFAR-10) and OOD (TinyImageNet-resized and LSUN-resized) samples after training the network on labeled ID samples in a supervised manner. We can confirm in Fig. 4 that most OOD samples have a larger discrepancy than ID samples.
By utilizing this characteristic, if we can measure the disagreement between the two classifiers and train the network to maximize it, the network will push OOD samples outside the manifold of ID samples. A discrepancy, $d(p_1(\mathbf{x}), p_2(\mathbf{x}))$, is introduced to measure the divergence between the two softmax class probabilities for an input. Consequently, we can separate OOD samples and ID samples according to the discrepancy between the two classifiers' outputs.
We define the discrepancy loss as the following equation:

$$d(p_1(\mathbf{x}), p_2(\mathbf{x})) = H(p_1(\mathbf{x})) - H(p_2(\mathbf{x})), \quad (1)$$

where $H(\cdot)$ is the entropy over the softmax distribution.
When the network is trained to maximize this discrepancy term, it maximizes the entropy of $F_1$'s output, which encourages $F_1$ to predict equal probabilities for all classes, and simultaneously minimizes the entropy of $F_2$'s output, which encourages $F_2$ to predict a high probability for one class. Since OOD samples are outside the support of the ID samples, the discrepancy between the two classifiers' outputs will be larger for OOD samples. This is demonstrated empirically in Section 4.
As discussed in Section 3.2, we need to train our network to classify ID samples correctly and, at the same time, maximize the discrepancy $d$. To achieve this, we propose a training procedure consisting of one pre-training step and two repeating fine-tuning steps. The pre-training step uses labeled ID samples $\mathcal{D}_{ID}$ to train the classifiers. Then, both $\mathcal{D}_{ID}$ and unlabeled samples $\mathcal{D}_{UL}$ are used in the fine-tuning steps to train the network to separate ID and OOD samples while keeping the classification of ID samples correct. In principle, we use the test data as the unlabeled data; the unlabeled data can also be only a part of the test data. In the ablation studies in Section 4.1.7, we conduct experiments with varying sizes and types of unlabeled data.
Pre-training: First, we train the network to learn discriminative features and classify the ID samples correctly under the supervision of labeled ID samples. The network is trained to minimize the cross entropy with the following loss:

$$\mathcal{L}_{sup} = -\mathbb{E}_{(\mathbf{x}_s, y_s) \sim \mathcal{D}_{ID}} \sum_{k=1}^{K} \mathbb{1}_{[k = y_s]} \log p(y = k \mid \mathbf{x}_s). \quad (2)$$
Fine-tuning: Once the network converges, we start to fine-tune the network to detect OOD samples by repeating the following two steps at the mini-batch level.

Step A First, we train the network on labeled ID samples to maintain correct classification, minimizing the supervised classification loss (Step A in Fig. 3).
Step B Then, we train the network to increase the discrepancy in an unsupervised manner, in order to make the network detect OOD samples that lie outside the support of the ID samples (Step B in Fig. 3). In this step, we also use labeled ID samples to reshape the support, adding a classification loss on the labeled ID samples. The same mini-batch size of labeled and unlabeled samples is used to update the model in this step. As a result, we train the network to minimize the following loss:

$$\mathcal{L} = \mathcal{L}_{sup} + \max\left(m - \mathbb{E}_{\mathbf{x}_u \sim \mathcal{D}_{UL}}\left[d(p_1(\mathbf{x}_u), p_2(\mathbf{x}_u))\right],\ 0\right), \quad (3)$$

where $\mathcal{L}_{sup}$ is the cross-entropy classification loss on the labeled ID samples and $m$ is a margin.
If the average discrepancy of the unlabeled samples is greater than the margin $m$, the unsupervised loss takes its minimum value, zero; thus, the margin is helpful for preventing overfitting.
At inference time, in order to distinguish between in- and out-of-distribution samples, a straightforward solution would be to use the discrepancy defined in Section 3.3, but this term does not account for the discrepancy of each class. Instead, we consider the L1 distance between the two classifiers' outputs. When the distance is above a detection threshold $\delta$, we assign the sample to out-of-distribution:

$$\left\| p_1(\mathbf{x}) - p_2(\mathbf{x}) \right\|_1 > \delta.$$
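The inference rule is a simple threshold on the per-class disagreement, sketched below (function name and threshold are our own):

```python
import torch
import torch.nn.functional as F

def detect_ood(logits1, logits2, delta: float) -> torch.Tensor:
    """Flag a sample as OOD when the L1 distance between the two heads'
    softmax outputs exceeds the threshold `delta`."""
    p1, p2 = F.softmax(logits1, dim=1), F.softmax(logits2, dim=1)
    return (p1 - p2).abs().sum(dim=1) > delta

# Two heads that agree (ID-like) vs. disagree (OOD-like).
agree = detect_ood(torch.tensor([[5.0, 0.0]]),
                   torch.tensor([[5.0, 0.0]]), delta=0.5)
disagree = detect_ood(torch.tensor([[5.0, 0.0]]),
                      torch.tensor([[0.0, 5.0]]), delta=0.5)
print(agree.item(), disagree.item())  # False True
```

Unlike the scalar entropy difference, the L1 distance accumulates disagreement over every class, which is why it is preferred at test time.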
In this section, we discuss our experimental settings and results. We describe a diverse set of in- and out-of-distribution dataset pairs, neural network architectures, and evaluation metrics. We also demonstrate the effectiveness of our method by comparing it against the current state-of-the-art methods, showing that our method significantly outperforms them. We ran all experiments using PyTorch 0.4.1.
As the benchmarks of OOD detection, ODIN  and Ensemble of Leave-Out Classifiers (ELOC)  introduced several benchmark datasets and evaluation metrics to evaluate the performance of OOD detectors.
Following [16, 26], we implemented our network based on two state-of-the-art neural network architectures, DenseNet and Wide ResNet (WRN). We used the modules of DenseNet/Wide ResNet up to the average-pooling layer just before the last fully-connected layer as the extractor, and one fully-connected layer as each classifier.
In the pre-training step proposed in Section 3.4, we used stochastic gradient descent (SGD) to train DenseNet-BC for 300 epochs and Wide ResNet for 200 epochs. The learning rate started at 0.1 and dropped by a factor of 10 at 50% and 75% of the training progress, respectively. After the pre-training step, we further fine-tuned the network with the fine-tuning steps proposed in Section 3.4 for 10 epochs, with learning rate 0.1 and margin $m$, to detect OOD samples.
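The pre-training schedule maps directly onto a standard PyTorch milestone scheduler. The backbone below is a stand-in; momentum is an assumption (the paper does not state it here).

```python
import torch

# Pre-training schedule sketch: SGD starting at lr=0.1, dropped by 10x
# at 50% and 75% of training (300 epochs for DenseNet-BC in the paper).
model = torch.nn.Linear(10, 10)  # stand-in for the actual backbone
epochs = 300
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
sched = torch.optim.lr_scheduler.MultiStepLR(
    opt, milestones=[int(0.5 * epochs), int(0.75 * epochs)], gamma=0.1)

for epoch in range(epochs):
    # ... one epoch of training would go here ...
    opt.step()
    sched.step()  # lr: 0.1 -> 0.01 at epoch 150 -> 0.001 at epoch 225
```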
CIFAR-10 (contains 10 classes) and CIFAR-100 (contains 100 classes) were used as in-distribution datasets to train deep neural networks for image classification. They both consist of 50,000 images for training and 10,000 images for testing, with an image size of 32×32. The images in the train split were used as $\mathcal{D}_{ID}$ in our experiment.
The Tiny ImageNet dataset contains 10,000 test images from 200 different classes, which are drawn from the original 1,000 classes of ImageNet. TinyImageNet-crop (TINc) and TinyImageNet-resize (TINr) are constructed by either randomly cropping or downsampling each image to a size of 32×32.
For each in-distribution dataset (test split) and each out-of-distribution dataset, 1,000 images (labeled as ID or OOD) were randomly held out for validation, such as parameter tuning and early stopping, while the remaining test images, containing unlabeled ID or OOD samples, were used as $\mathcal{D}_{UL}$ for unsupervised training and evaluation. These datasets are provided as part of the ODIN code release (github.com/facebookresearch/odin).
We followed the same metrics used by [16, 26] to measure the effectiveness of our method in distinguishing between in- and out-of-distribution samples. TP, TN, FP, FN are used to denote true positives, true negatives, false positives, and false negatives, respectively.
FPR at 95% TPR shows the false positive rate (FPR) at 95% true positive rate (TPR). True positive rate can be computed by TPR = TP / (TP+FN), while the false positive rate (FPR) can be computed by FPR = FP / (FP+TN).
Detection Error measures the minimum misclassification probability, which is calculated by the minimum average of false positive rate (FPR) and false negative rate (FNR) over all possible score thresholds.
AUROC is the Area Under the Receiver Operating Characteristic curve, which can be calculated as the area under the TPR-versus-FPR curve.
AUPR In is the Area Under the Precision-Recall curve and can be calculated by the area under the precision = TP / (TP+FP) against the recall = TP / (TP+FN) curve. For AUPR In, in-distribution images are specified as positive.
AUPR Out is similar to the metric AUPR-In. The difference is that out-of-distribution images are specified as positive in AUPR Out.
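Two of these metrics can be sketched directly over raw detection scores. We assume here that higher scores mean "more OOD" and that OOD is the positive class; some benchmarks instead treat ID as positive, which mirrors the computation.

```python
import numpy as np

def fpr_at_95_tpr(id_scores, ood_scores):
    """FPR when the threshold is set so that 95% of OOD samples
    are (correctly) detected as OOD."""
    thresh = np.percentile(ood_scores, 5)  # keeps 95% of OOD above it
    return float(np.mean(np.asarray(id_scores) >= thresh))

def detection_error(id_scores, ood_scores):
    """Minimum average of FPR and FNR over all candidate thresholds."""
    thresholds = np.concatenate([id_scores, ood_scores])
    best = 1.0
    for t in thresholds:
        fpr = np.mean(np.asarray(id_scores) >= t)   # ID wrongly flagged
        fnr = np.mean(np.asarray(ood_scores) < t)   # OOD missed
        best = min(best, 0.5 * (fpr + fnr))
    return float(best)
```

On perfectly separated scores, both metrics are zero; in practice they are reported as percentages.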
The results are summarized in Table 2, which compares our method with ODIN and the Ensemble of Leave-Out Classifiers (ELOC) on various benchmarks. ELOC does not report results for iSUN as an OOD dataset because it uses the whole of iSUN as a validation dataset. We implemented two ensembled fully-connected layers in ODIN and ELOC, and their performance is almost the same as that of the single classifier (one fully-connected layer) in their original papers.
Table 2 clearly shows that our approach significantly outperforms other existing methods, including ODIN  and ELOC  (which is the ensemble of five models), across all neural network architectures on all of the dataset pairs.
TinyImageNet-resize, LSUN-resize and iSUN, which contain the images with full objects as opposed to the cropped parts of objects, are considered more difficult to detect. Our proposal shows highly accurate results on these more challenging datasets.
Noticeably, our method is very close to distinguishing between in- and out-of-distribution samples perfectly on most dataset pairs. As shown in Fig. 5, we compared our OOD detector based on DenseNet-BC with ELOC  when in-distribution data is CIFAR-100 and out-of-distribution data is TinyImageNet-resize. These figures show that the proposed method has much less overlap between OOD samples and ID samples on all the dataset pairs compared to ELOC , indicating that our method separates ID and OOD samples very well.
Another merit of our method is that we can use a simple threshold to separate ID and OOD samples as shown in Fig. 5. On the other hand, it is very difficult to decide an interpretable threshold of that value in ELOC.
We also plot the histograms of the two classifiers' maximum softmax scores in Fig. 5, which shows that the distributions of $F_1$'s and $F_2$'s scores differ significantly according to whether the sample is ID (CIFAR-100) or not (TINr) after the fine-tuning. For convenience, we use $p_1^{(k)}$ and $p_2^{(k)}$ to denote the probability outputs of $F_1$ and $F_2$ for class $k$, respectively. The discrepancy loss makes OOD samples' $\max_k p_1^{(k)}$ close to $1/K$ and $\max_k p_2^{(k)}$ close to 1. On the other hand, ID samples' $\max_k p_1^{(k)}$ and $\max_k p_2^{(k)}$ remain almost the same due to the support we added to ID samples in Step A and Eq. (3) in Step B.
Since our method fine-tunes the classifier to detect OOD samples, which changes the decision boundary, we observed a 5% drop in classification accuracy compared to the original classifier before fine-tuning. This problem could be solved by using the original classifier to classify ID samples at the cost of some extra runtime, which is still much more acceptable than ELOC's ensemble of five models, which needs more runtime and computing resources.
| Detection Error (%) | 0.7 | 0.5 | 0.9 | 1.5 |
| ID Discrepancy Loss | 0.05 | 0.06 | 0.08 | 0.05 |
| OOD Discrepancy Loss | 3.08 | 3.05 | 2.84 | 2.68 |

| Detection Error (%) | 0.3 | 0.4 | 1.2 | 3.8 |
| ID Discrepancy Loss | 0.62 | 0.26 | 1.05 | 1.08 |
| OOD Discrepancy Loss | 4.03 | 3.90 | 3.93 | 3.40 |
| OOD for testing | TINc+LSUNc | LSUNc | TINc |
| Detection Error (%) | 0.2 | 0.5 | 0.7 |
| ID Discrepancy Loss | 0.03 | 0.05 | 0.04 |
| OOD Discrepancy Loss | 3.53 | 3.20 | 2.39 |
Since our approach accesses unlabeled data $\mathcal{D}_{UL}$, we further analyzed the effects of the following factors:
The size and the data balance of $\mathcal{D}_{UL}$. We used CIFAR-100 as ID and TinyImageNet-crop as OOD, and we changed the number of ID and OOD samples in $\mathcal{D}_{UL}$ for unsupervised training. The results are summarized in Table 3, which shows that our proposed method works under various settings. Even when 9,000 ID samples and only 500 OOD samples are included in $\mathcal{D}_{UL}$, our method still performs better than [16, 26], which means it is robust to the size of $\mathcal{D}_{UL}$ and to the percentage of OOD data in it. Please note that we used all 9,000 ID samples and 9,000 OOD samples for testing, which means totally unseen samples were included during evaluation.
The selection of OOD data in $\mathcal{D}_{UL}$. To show the effectiveness of our method, we also tried various pairs of OOD datasets for unsupervised training and evaluation. Table 4 shows that our method still works when multiple datasets are used as OOD, or even when the OOD dataset used for unsupervised training is different from the OOD dataset used for evaluation.
The relationship between the discrepancy loss and the detection error. The mean discrepancy loss in Eq. (1) of ID and OOD samples in the test dataset is shown in Table 3 and Table 4, respectively. These results show that the discrepancy loss of ID samples is smaller than that of OOD samples in all settings. The detection error is lower when the difference between the discrepancy losses of ID and OOD samples is larger, which means ID and OOD samples can be separated by the divergence between the two classifiers' outputs.
Since our goal is to benefit applications in the real world, we also evaluated our method in two cases of real-world simulation to demonstrate its effectiveness.
Considering domain-specific applications, we evaluated our method on two simulations of food and fashion applications, because there are services focusing on these domains.
For food recognition, we used FOOD-101 , which is a real-world food dataset containing the 101 most popular and consistently named dishes collected from foodspotting.com. FOOD-101 consists of 750 images per class for training and 250 images per class for testing. The training images of FOOD-101  are not cleaned and contain some amount of noise. We evaluated our method on FOOD-101  as ID and TinyImageNet-crop (TINc)/LSUN-crop (LSUNc) as OOD.
For fashion recognition, we used DeepFashion , a large-scale clothes dataset. We used the Category and Attribute Prediction Benchmark dataset of DeepFashion , which consists of 289,222 images of clothes and 50 clothing classes. We used DeepFashion  as ID and TinyImageNet-crop (TINc)/LSUN-crop (LSUNc) as OOD.
We resized the FOOD-101 and DeepFashion images to a fixed input size. For FOOD-101, the original train split was used as $\mathcal{D}_{ID}$; 1,000 images from the original test split were used for validation, and the remaining test images were used as $\mathcal{D}_{UL}$. For DeepFashion, the original train split was used as $\mathcal{D}_{ID}$; 1,000 images from the original validation split were used for validation, and the original test images were used as $\mathcal{D}_{UL}$ for unsupervised training and evaluation.
Table 5 shows the comparison of our method, ODIN, and ELOC on real-world simulation datasets. These results clearly show that our method outperforms the other existing methods, ODIN and ELOC, by a considerable margin on all datasets. Furthermore, our method detects non-food and non-fashion images nearly perfectly.
In this paper, we proposed a novel approach for detecting out-of-distribution data samples in neural networks, which utilizes two classifiers to detect OOD samples that are far from the support of the ID samples. Our method does not require labeled OOD samples to train the neural network. We extensively evaluated our method not only on OOD detection benchmarks, but also on real-world simulation datasets. Our method significantly outperformed the current state-of-the-art methods on different DNN architectures across various in- and out-of-distribution dataset pairs.
This work was partially supported by JST CREST JPMJCR1686 and JSPS KAKENHI 18H03254, Japan.
Food-101 – Mining discriminative components with random forests. In ECCV, 2014.
Detecting out-of-distribution samples using low-order deep features statistics. In Submitted to ICLR, 2019.
Beyond novelty detection: Incongruent events, when general and specific classifiers disagree. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.