
A Unified Benchmark for the Unknown Detection Capability of Deep Neural Networks

Deep neural networks have achieved outstanding performance on various tasks, but they have a critical issue: over-confident predictions even for completely unknown samples. Many methods have been proposed to filter out these unknown samples, but they only consider narrow and specific tasks, referred to as misclassification detection, open-set recognition, or out-of-distribution detection. In this work, we argue that these tasks should be treated as fundamentally the same problem because an ideal model should possess detection capability for all of them. We therefore introduce the unknown detection task, an integration of the previous individual tasks, for a rigorous examination of the detection capability of deep neural networks on a wide spectrum of unknown samples. To this end, unified benchmark datasets on different scales were constructed, and the unknown detection capabilities of existing popular methods were compared. We found that Deep Ensemble consistently outperforms the other approaches in detecting unknowns; however, all methods are only successful for a specific type of unknown. The reproducible code and benchmark datasets are available at .




1 Introduction

Deep neural networks have achieved significant performance improvements on various tasks such as image classification Krizhevsky et al. (2012); Rawat and Wang (2017), object detection Ren et al. (2015); Zhao et al. (2019), speech recognition Hinton et al. (2012); Nassif et al. (2019), and natural language processing Mikolov et al. (2010); Devlin et al. (2019). Despite these remarkable achievements, current deep neural networks have a critical deficiency that should be seriously considered before these models are deployed in real-world applications: they tend to make predictions with over-confidence even if the predictions are incorrect or the inputs are irrelevant to the target task. This issue is a major concern in applications where models can cause fatal safety problems, such as medical diagnosis or autonomous driving Mehrtash et al. (2020); Mohseni et al. (2020). To establish reliable and secure systems with deep neural networks, our models should possess an important property of predictive confidence, often described as: "the model should know what it does not know." Specifically, models should produce high confidence when encountering inputs likely to be predicted correctly (i.e., known samples), whereas they should produce low confidence if predictions are likely to be incorrect or if inputs are semantically far from what was learned (i.e., unknown samples). Hence, it is necessary to have models that produce well-ranked confidence values according to how confident they are about their predictions.
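The confidence notion above is usually made concrete with the maximum softmax probability, the standard baseline confidence score. A minimal sketch (the 0.5 threshold and the function names are illustrative choices, not prescribed by the paper):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def confidence_scores(logits):
    """Maximum softmax probability (MSP) as a confidence score."""
    return softmax(logits).max(axis=-1)

def flag_unknown(logits, threshold=0.5):
    """Flag inputs whose confidence falls below a threshold, e.g., to
    request user intervention in a deployed system."""
    return confidence_scores(logits) < threshold
```

With well-ranked confidence values, a single threshold on this score separates known from unknown inputs; how well any threshold can do so is exactly what the benchmarks below measure.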

Figure 1: The overall concept of unknown detection. A model deployed in a real-world application may take not only inputs from the training distribution (in-distribution, orange boxes) but also inputs from distributions irrelevant to the target task (near-OoD, violet boxes; far-OoD, green boxes). For the model to be reliable, it should be capable of distinguishing certainly known inputs (i.e., correct predictions) from unknown ones (i.e., incorrect predictions and OoD samples) based on the confidence score of each prediction.

Many works have sought to devise models possessing such a property, from two individual perspectives. First, previous research on selective classification El-Yaniv and Wiener (2010); Geifman and El-Yaniv (2017) or misclassification detection (MD) (i.e., failure prediction) Jiang et al. (2018); Corbière et al. (2019) set the objective of providing a well-aligned ordinal ranking of confidence values to clarify the separation between correct and incorrect predictions on samples from a single data distribution (i.e., the in-distribution). Another line of research considered the problem of open-set recognition (OsR) or out-of-distribution detection (OoDD), which aims to estimate well-ranked confidence values between samples from the in-distribution and those from other distributions, referred to as out-of-distributions Bendale and Boult (2016); Liang et al. (2018); Lee et al. (2018b). In other words, the former focuses on the ranking of confidence values within the in-distribution, whereas the latter considers the ranking between the in-distribution and out-of-distributions. We argue that these two separate views should be integrated to rigorously evaluate the true detection capability of deep neural networks on unknown inputs, as unknowns should include both incorrectly predicted in-distribution samples and unpredictable out-of-distribution samples.

These different perspectives have led to individual benchmark settings, although previous works have attempted to solve a fundamentally similar issue Moon et al. (2020). MD assumes that all inputs at test time come from the in-distribution; in other words, previous works on MD do not consider inputs lying outside of the training data distribution. Therefore, a single dataset such as CIFAR-10/100 Krizhevsky et al. (2009) is sufficient to evaluate the MD performance of a model Geifman et al. (2019); Corbière et al. (2021). OsR and OoDD can be categorized as the same problem; however, their benchmark settings differ slightly in the literature. Typically, OoDD treats OoD as a completely different distribution from the training distribution Lee et al. (2018a), and therefore focuses on detecting inputs that are semantically far from the in-distribution dataset. For instance, SVHN Netzer et al. (2011), LSUN Yu et al. (2015), or synthetic data such as Gaussian noise are frequently used as OoD datasets when CIFAR-10 is the in-distribution dataset Liang et al. (2018); Lee et al. (2018b). Some previous works on OsR used settings similar to those of OoDD Oza and Patel (2019b); Perera et al. (2020), but a few set OoD as a semantically close distribution Jang and Kim (2020); Oza and Patel (2019a), e.g., images containing the same objects but drawn from other datasets Perera et al. (2020). These individually different benchmark settings make it difficult to compare existing methods proposed for fundamentally similar problems, i.e., MD, OsR, and OoDD.

We thus introduce a more fundamental task, termed unknown detection, which integrates the problems of MD, OsR, and OoDD, previously studied independently in the literature. The goal of unknown detection is to make a neural network model produce well-aligned confidence scores between inputs likely to be predicted correctly and those regarded as unknowns, including inputs likely to be predicted incorrectly and inputs irrelevant to the model's task. If a model produces such well-aligned confidence scores, it can be successfully deployed in safety-critical applications, as the system can request user intervention when the confidence score of an input falls below a predefined threshold.
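The unknown detection objective can be phrased as a binary ranking problem: correct in-distribution predictions form the known class, while misclassified in-distribution samples and all OoD samples form the unknown class, and a threshold-free metric such as AUROC measures how well confidence separates the two. A minimal sketch under this framing (function names are ours; tied scores are assumed absent for brevity):

```python
import numpy as np

def auroc(scores_known, scores_unknown):
    """AUROC via the rank-sum (Mann-Whitney U) formulation.
    Assumes no tied scores."""
    scores = np.concatenate([scores_known, scores_unknown])
    order = scores.argsort(kind="mergesort")
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_k, n_u = len(scores_known), len(scores_unknown)
    u = ranks[:n_k].sum() - n_k * (n_k + 1) / 2
    return u / (n_k * n_u)

def unknown_detection_auroc(conf_in, correct, conf_ood):
    """Known = correctly predicted in-distribution samples;
    unknown = misclassified in-distribution samples plus all OoD samples."""
    known = conf_in[correct]
    unknown = np.concatenate([conf_in[~correct], conf_ood])
    return auroc(known, unknown)
```

A score of 1.0 means every correct prediction received higher confidence than every misclassification and OoD sample; 0.5 corresponds to random guessing.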

Figure 1 depicts the overall concept of our unknown detection problem. Suppose that a practitioner wants to deploy a tiger-vs.-lion classifier in a real-world application. In this scenario, the classifier may take inputs containing various objects, which fall into three categories: i) target objects, e.g., a lion or a tiger; ii) objects semantically similar to the target objects (i.e., near-OoD), such as a leopard; and iii) objects completely different from the target objects (i.e., far-OoD), e.g., an automobile or an airplane. If the model produces predictions with well-aligned confidence values, as depicted on the right side of Figure 1, it can distinguish correctly predicted inputs (red boxes) from the others, including incorrect predictions and near- and far-OoD inputs (blue boxes), when the confidence threshold is properly set. Therefore, it can be utilized with a high degree of reliability in safety-critical systems.

To examine the unknown detection capability of deep neural networks, we propose new benchmark settings based on popular datasets, such as CIFAR-100 and ImageNet Deng et al. (2009), which cover the complete spectrum of the unknown detection problem. Specifically, our benchmark setting consists of three categories of datasets: in-distribution, near-OoD, and far-OoD. Near- and far-OoD are distinguished according to how much the distribution semantically overlaps with the in-distribution; i.e., near-OoD shares some degree of semantic concepts with the in-distribution dataset, whereas far-OoD contains completely different semantics. In this work, the concept of the superclass is utilized to distinguish near- and far-OoD classes given the in-distribution classes. For example, the given superclass structure of CIFAR-100 was used as the criterion to choose near-OoD classes for the proposed CIFAR-100-based benchmark. In the ImageNet-based benchmark, the WordNet Miller (1995) tree structure was used, as all classes in ImageNet inherit the hierarchical structure of WordNet. The distance between class keywords on WordNet was computed to determine whether classes are semantically similar (i.e., near-OoD) or very different (i.e., far-OoD).

With the proposed benchmarks, we evaluate the unknown detection capability of several popular methods proposed for MD, OsR, and OoDD. Our experimental results reveal that Deep Ensemble Lakshminarayanan et al. (2017) is the most competitive in terms of unknown detection, although there is still room for improvement in terms of OoDD. Surprisingly, we found that existing methods for a particular task do not work well on the other tasks. For example, effective methods for MD do not show good performance on OoDD; conversely, popular methods for OoDD can even harm MD.
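Deep Ensemble, as evaluated here, averages the softmax outputs of several independently trained networks and uses the resulting probabilities for both prediction and confidence. A minimal sketch of that scoring step (training of the ensemble members is omitted):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(member_logits):
    """member_logits: list of (n, k) logit arrays, one per independently
    trained ensemble member. Returns the predicted class and the
    confidence taken from the averaged softmax distribution."""
    mean_probs = np.mean([softmax(l) for l in member_logits], axis=0)
    return mean_probs.argmax(axis=-1), mean_probs.max(axis=-1)
```

When the members disagree, the averaged distribution flattens and the confidence drops, which is the mechanism behind the strong unknown detection results reported for Deep Ensemble.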

Our contributions are summarized as follows:

  • We suggest a new task, unknown detection, which integrates MD, OsR, and OoDD, to evaluate the true capability of detecting unknown inputs. Its goal is to distinguish correctly predicted inputs (i.e., known inputs) from incorrectly predicted inputs as well as task-irrelevant inputs (i.e., unknown inputs).

  • We propose unified benchmarks for evaluating the unknown detection capability of a neural network model; these consist of in-distribution, near-OoD, and far-OoD datasets.

  • We evaluate the unknown detection capabilities of popular methods for MD, OsR, and OoDD on the proposed benchmarks. Although Deep Ensemble beats the other approaches in terms of unknown detection, we found that it still has room for improvement when it comes to OoDD.

2 Related Work

A variety of methods for each task (MD, OsR, and OoDD) have been actively proposed to detect test inputs that are incorrectly predicted (i.e., MD) or irrelevant to a target task (i.e., OsR or OoDD) in the context of image classification. Previous methods were developed under specific problem settings, each an incomplete approach from the perspective of unknown detection, and were accordingly evaluated on individual task-specific benchmarks, as summarized in Table 1.

Misclassification detection (MD). MD aims to align the ordinal ranking of confidence values so as to clarify the separation between correct and incorrect predictions on unseen samples from the training distribution. That is, MD assumes that test inputs are i.i.d. samples from the training distribution. Essentially, the class probabilities predicted via a softmax function can be used to detect samples likely to be classified incorrectly. Hendrycks and Gimpel (2017) suggested exploiting the softmax outputs of conventional deep classifiers to separate correctly and incorrectly classified samples, demonstrating that this simple approach achieves relatively good performance on MD. A selective classifier abstains from predicting on samples whose confidence values fall below a threshold, such that a user-specified error rate is achieved with optimal coverage Geifman and El-Yaniv (2017). Geifman et al. (2019) observed that, as training proceeds, neural networks produce overconfident predictions, especially for easy samples (i.e., those classified correctly at an early stage of training). To mitigate this issue, they proposed using earlier snapshots of the trained model and showed that doing so improves the quality of the confidence values in terms of the area under the risk-coverage curve (AURC). Corbière et al. (2019) and Moon et al. (2020) proposed directly learning the true class probability and the ordinal ranking of confidence values, respectively.
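The AURC metric mentioned above averages the selective risk (the error rate among the samples retained so far) as coverage grows from the most confident sample to the full test set. A small sketch of how it could be computed (an unweighted mean over the empirical coverage levels; not necessarily the authors' exact implementation):

```python
import numpy as np

def aurc(confidence, correct):
    """Area under the risk-coverage curve: sort samples by descending
    confidence, then average the selective risk (error rate among the
    covered samples) over all empirical coverage levels. Lower is better."""
    order = np.argsort(-confidence)
    errors = (~correct[order]).astype(float)
    cum_err = np.cumsum(errors)
    selective_risk = cum_err / np.arange(1, len(errors) + 1)
    return selective_risk.mean()
```

A well-ranked model pushes its errors toward the low-confidence end, keeping the selective risk near zero at small coverage and thus shrinking the area.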

Method                      Benchmark setting                           Task
Corbière et al. (2019)      MNIST, SVHN, CIFAR-10/-100                  MD
Geifman et al. (2019)       SVHN, CIFAR-10/-100, ImageNet               MD
Sensoy et al. (2018)        MNIST, CIFAR-10; CIFAR-10 vs CIFAR-10       MD
Moon et al. (2020)          SVHN, CIFAR-10, CIFAR-100, Tiny ImageNet    MD
Lee et al. (2018b)          CIFAR-10/-100 vs Tiny ImageNet              OoDD
Liang et al. (2018)         CIFAR-10/-100 vs Tiny ImageNet              OoDD
Hendrycks et al. (2019)     Tiny ImageNet                               OoDD
Oza and Patel (2019b, a)    Tiny ImageNet vs Tiny ImageNet              OsR
Perera et al. (2020)        CIFAR-10 vs CIFAR-100                       OsR
Sun et al. (2020)           Tiny ImageNet vs Tiny ImageNet              OsR
Jang and Kim (2020)         CIFAR-100 vs CIFAR-100                      OsR
Table 1: Benchmark settings and target problems in the literature on MD, OsR, or OoDD.

Out-of-distribution detection (OoDD) (a.k.a. open-set recognition (OsR)). (Hereafter, we use only the term OoDD, as the problem of OsR is fundamentally identical to that of OoDD.) The goal of OoDD is to ensure that a model provides well-aligned confidence values that clearly distinguish in-distribution inputs from OoD inputs. Hendrycks and Gimpel (2017) demonstrated that deep classifiers with softmax outputs are also effective for OoDD along with MD, as noted earlier. It was also found that post-processing techniques such as temperature scaling or input perturbation can effectively enhance the OoDD performance of standard classifiers Liang et al. (2018); Lee et al. (2018b). To produce low confidence values for OoD inputs, several studies have suggested modifying the model architecture, such as by adding a confidence estimation branch DeVries and Taylor (2018). Intuitively, explicitly exploiting OoD samples during training can boost the OoDD performance of a model. In the literature, such OoD samples for training (also called external samples) consist of natural images Hendrycks et al. (2019) or generated synthetic images Lee et al. (2018a); Vernekar et al. (2019). Synthetic images can be generated by a generative adversarial network (GAN) Lee et al. (2018a); Neal et al. (2018); Ge et al. (2017) or a conditional variational auto-encoder Vernekar et al. (2019). Extreme value theory can also be employed to update softmax scores for detecting OoD inputs. Bendale and Boult (2016) introduced the OpenMax layer to estimate the softmax values of inputs from an unknown distribution using extreme value theory. Several studies Oza and Patel (2019b, a) have utilized auto-encoder architectures with extreme value theory.
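Of the post-processing techniques above, temperature scaling is the simplest: dividing the logits by a large temperature before the softmax tends to widen the gap between in-distribution and OoD confidence scores. A sketch of that step alone (the input-perturbation half of ODIN is omitted; T=1000 follows common practice in the cited papers but is a tunable hyperparameter):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_confidence(logits, T=1000.0):
    """Temperature-scaled maximum softmax probability: the logits are
    divided by T before the softmax, flattening low-margin (often OoD)
    predictions more than high-margin in-distribution ones."""
    return softmax(logits / T).max(axis=-1)
```

Because scaling preserves the argmax, predictions are unchanged; only the confidence ranking used for detection is recalibrated.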

Other studies have attempted to solve OoDD in an unsupervised learning setting. The likelihood provided by a generative model (e.g., PixelCNN++ Salimans et al. (2017), Glow Kingma and Dhariwal (2018)) or a hybrid model Nalisnick et al. (2019b) can be used as a criterion to distinguish in-distribution inputs from OoD inputs Ren et al. (2019). However, such likelihood-based approaches are known to assign high likelihoods to clearly OoD inputs Choi et al. (2018); Nalisnick et al. (2019a). Contrastive learning Chen et al. (2020), a recently emerging self-supervised learning method, also improves OoDD performance Winkens et al. (2020).

With regard to benchmark settings, it is most common to use datasets that are completely different from the in-distribution dataset as the OoD datasets Liang et al. (2018); Lee et al. (2018b); Ge et al. (2017). However, not all studies share such settings, even when they focus on the same objective of OoDD. A few studies Oza and Patel (2019b); Perera et al. (2020); Neal et al. (2018) construct the in-distribution and OoD datasets by selecting only specific classes from candidate datasets so as to account for semantic similarity, for example, non-animal classes from CIFAR-10 vs. animal classes from CIFAR-100 Oza and Patel (2019b), or vehicle classes from CIFAR-10 vs. vehicle classes from CIFAR-100 Perera et al. (2020). Other studies Jang and Kim (2020); Sun et al. (2020) create the in-distribution and OoD datasets by intra-dataset division. Roady et al. (2020) also pointed out the problem of individual benchmark settings in the OoDD literature and compared several OoDD methods under the same benchmark setting. However, their comparison is limited to in-distribution vs. far-OoD, which covers only a narrow scope of our unknown detection problem.

Overall, previous works on OoDD have assumed that unknown inputs at test time come from entirely different distributions relative to the in-distribution. However, this assumption does not consider the distinguishability between correct predictions and incorrect predictions of the in-distribution samples. Because good detection capability of OoD inputs does not guarantee good detection capability of incorrect predictions as confirmed in our experiments, one cannot ensure that a model produces well-ranked confidence values in terms of unknown detection.

3 Unknown Detection Benchmark Datasets

In this section, we present the detailed procedure for constructing benchmark datasets for the unknown detection task. Specifically, we design two benchmark datasets, the CIFAR-based and ImageNet-based datasets, to cover different data scales in terms of the number of samples and the image resolution.

As summarized in Table 1, existing studies have focused on only a subset of the unknown detection problem, e.g., MD or OoDD tasks, and thus have been individually evaluated on their own benchmark settings. However, all detection targets, including incorrect predictions in MD and near-/far-OoD inputs in OoDD, should be taken into consideration when we define the unknown samples for a model. Thus, our unknown detection benchmark datasets basically consist of the following categories: in-distribution, near-OoD, and far-OoD datasets.

The in-distribution dataset is used to train a model. Given that a model rarely achieves perfect accuracy, test samples from this dataset can be classified into correct (i.e., known) and incorrect (i.e., unknown) predictions, as in the typical setting of the MD problem. The OoD dataset in our benchmark consists of near- and far-OoD datasets, which are used to evaluate the detection capability on inputs irrelevant to a given task. Here, we employ the semantic distance between classes as the notion of distance. More concretely, a specific class is regarded as a near-OoD class if there is at least one class in the in-distribution dataset with which it shares a superclass. For example, if two classes A and B belong to a class C (i.e., C is a superclass of A and B), then B is considered a near-OoD class of A. Otherwise, a class is considered far-OoD if it comes from a totally different source (e.g., digits vs. CIFAR classes).
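The superclass rule above is a simple set computation: a class counts as near-OoD exactly when it is not in-distribution itself but some in-distribution class shares its superclass. A toy sketch (the class and superclass names are illustrative):

```python
def split_near_ood(superclass_to_classes, in_dist_classes):
    """Return the near-OoD classes: members of any superclass that is
    represented in the in-distribution, minus the in-distribution classes
    themselves. Classes under unrepresented superclasses would be far-OoD."""
    in_dist = set(in_dist_classes)
    near_ood = set()
    for classes in superclass_to_classes.values():
        if in_dist & set(classes):  # superclass represented in-distribution
            near_ood |= set(classes) - in_dist
    return near_ood
```

For example, with the superclass flowers containing orchid, rose, and tulip, choosing orchid as in-distribution makes rose and tulip near-OoD, while classes of an unrepresented superclass stay outside this set.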

Dataset                                       Train     Validation   Test

CIFAR-based benchmark
  In-distribution   CIFAR-40                  18,000    2,000        4,000
  External          Tiny ImageNet158-FIX      79,000    2,000        -
  Near-OoD          CIFAR-60                  -         -            6,000
  Far-OoD           LSUN-FIX                  -         -            4,000
                    SVHN                      -         -            4,000
                    DTD                       -         -            5,640
                    Gaussian                  -         -            10,000

ImageNet-based benchmark
  In-distribution   ImageNet-200              250,000   10,000       10,000
  External          External ImageNet-394     492,500   19,700       -
  Near-OoD          Near ImageNet-200         -         -            10,000
  Far-OoD           Food-32                   -         -            10,000
                    Caltech-45                -         -            4,792
                    Places-82                 -         -            10,000

Table 2: Summary of the proposed benchmark datasets.

One more dataset is constructed to compare the existing methods in a fair setting: the external dataset. This dataset is used for hyperparameter tuning, or for training methods that require an auxiliary dataset, such as OE Hendrycks et al. (2019). Note that some existing works select optimal hyperparameters using data randomly sampled from the target OoD; for these methods, the external dataset is used for hyperparameter tuning instead, as that scenario is practically infeasible. Importantly, samples from the in-distribution dataset should not semantically overlap with those from the far-OoD or external datasets, so that the unknown detection capability can be evaluated rigorously. For example, it is difficult to evaluate the performance accurately, or to find optimal hyperparameters, if OoD or external-dataset classes are semantically similar to the in-distribution classes. To this end, we employ the WordNet tree to approximately measure the semantic distance between classes. Following this approach, we eliminated several classes from the far-OoD and external datasets to minimize semantic overlap with the in-distribution classes.

The following sections describe the construction processes of the proposed benchmark datasets based on CIFAR-100 and ImageNet for the unknown detection task. Table 2 summarizes the composition of the constructed unified benchmark datasets.

Superclass                        Classes
aquatic mammals                   beaver, dolphin, otter, seal, whale
fish                              aquarium fish, flatfish, ray, shark, trout
flowers                           orchids, poppies, roses, sunflowers, tulips
food containers                   bottles, bowls, cans, cups, plates
fruit and vegetables              apples, mushrooms, oranges, pears, sweet peppers
household electrical devices      clock, computer keyboard, lamp, telephone, television
household furniture               bed, chair, couch, table, wardrobe
insects                           bee, beetle, butterfly, caterpillar, cockroach
large carnivores                  bear, leopard, lion, tiger, wolf
large man-made outdoor things     bridge, castle, house, road, skyscraper
large natural outdoor scenes      cloud, forest, mountain, plain, sea
large omnivores and herbivores    camel, cattle, chimpanzee, elephant, kangaroo
medium-sized mammals              fox, porcupine, possum, raccoon, skunk
non-insect invertebrates          crab, lobster, snail, spider, worm
people                            baby, boy, girl, man, woman
reptiles                          crocodile, dinosaur, lizard, snake, turtle
small mammals                     hamster, mouse, rabbit, shrew, squirrel
trees                             maple, oak, palm, pine, willow
vehicles 1                        bicycle, bus, motorcycle, pickup truck, train
vehicles 2                        lawn-mower, rocket, streetcar, tank, tractor
Table 3: All superclasses and classes belonging to CIFAR-100. Classes shown in green and yellow compose CIFAR-40 and CIFAR-60, respectively.

3.1 CIFAR-based Benchmark

The first benchmark dataset for the unknown detection task is based on CIFAR-100, a relatively small-scale dataset with low-resolution images. As shown in Table 1, CIFAR-100 has been frequently used for evaluations of both MD and OoDD tasks. In this work, CIFAR-100 is used to construct the in-distribution and near-OoD datasets; the far-OoD datasets consist of four others: resized LSUN, SVHN, the Describable Textures Dataset (DTD) Cimpoi et al. (2014), and Gaussian noise. Note that some works utilize the 80 Million Tiny Images dataset Torralba et al. (2008) when their methods require an external dataset, e.g., OE. However, the creators withdrew that dataset and requested that it no longer be used, owing to offensive and prejudicial content introduced during the automatic image-collection process Birhane and Prabhu (2021). Hence, we instead use Tiny ImageNet, a small-scale version of ImageNet, as the external dataset when necessary.

In-distribution & Near-OoD datasets. CIFAR-100 is a dataset for 100-class classification, and its classes have a hierarchical structure. There are 20 superclasses, each having five subclasses, as shown in Table 3. For example, the orchids, poppies, roses, sunflowers, and tulips classes in CIFAR-100 belong to the superclass flowers. Based on this hierarchical structure, we can easily construct the in-distribution and near-OoD datasets, as it can be assumed that classes belonging to a specific superclass share some high-level semantics. Specifically, for each superclass, two classes are randomly selected as in-distribution classes, while the remaining three classes are considered near-OoD. As a result, 40 classes form the in-distribution dataset, termed CIFAR-40, and 60 classes constitute the near-OoD dataset, referred to as CIFAR-60, as shown in Table 3.
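The 2-in/3-out split per superclass can be expressed as a seeded random sample. A sketch with a toy class hierarchy (the actual CIFAR-40/CIFAR-60 assignment in Table 3 came from the authors' own draw, not from this code):

```python
import random

def split_cifar(superclass_to_classes, n_in=2, seed=0):
    """For each superclass, randomly pick `n_in` in-distribution classes
    (CIFAR-40 analogue); the remaining classes become near-OoD
    (CIFAR-60 analogue). Sorted iteration keeps the split reproducible."""
    rng = random.Random(seed)
    in_dist, near_ood = [], []
    for _, classes in sorted(superclass_to_classes.items()):
        picked = rng.sample(classes, n_in)
        in_dist += picked
        near_ood += [c for c in classes if c not in picked]
    return in_dist, near_ood
```

With 20 superclasses of five classes each, this yields the 40/60 split described above.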

Far-OoD datasets. For the far-OoD datasets, we use datasets commonly employed as OoD datasets in the literature: resized LSUN, SVHN, DTD, and Gaussian Noise. To ensure that the far-OoD is semantically far from the in-distribution, we compose the far-OoD classes such that they do not share semantic concepts with in-distribution classes.

Resized LSUN, with ten scene categories, has been widely used for the OoDD task Liang et al. (2018); Lee et al. (2018b). However, as presented in the top row of Figure 2, images in resized LSUN contain artificial noise caused by inappropriate image processing, causing OoDD performance to be significantly overestimated. Tack et al. (2020) reported this problem by showing that resized LSUN is easily detectable with a simple smoothness score based on the total variation distance. Hence, we reconstruct the dataset, referred to as LSUN-FIX, by employing a fixed resize operation following the procedure in Tack et al. (2020). Samples from LSUN-FIX are shown in the bottom row of Figure 2.
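The smoothness statistic referenced from Tack et al. (2020) can be approximated by a total-variation style score: the mean absolute difference between neighbouring pixels, which the high-frequency artifacts of naive resizing inflate relative to properly resized images. A sketch (not the exact score used in that paper):

```python
import numpy as np

def smoothness_score(img):
    """Total-variation style statistic on a 2-D grayscale image: the mean
    absolute difference between vertically and horizontally adjacent
    pixels. Higher values indicate more high-frequency content."""
    dh = np.abs(np.diff(img, axis=0)).mean()  # vertical neighbours
    dw = np.abs(np.diff(img, axis=1)).mean()  # horizontal neighbours
    return dh + dw
```

A detector thresholding only on such a score can separate artifact-laden OoD images from natural ones without using any semantics, which is why the resized LSUN results in the literature are overestimated.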

We also empirically observed that the detection performance on LSUN reported in the literature is greatly overestimated, as shown in Section 4. For example, the Mahalanobis detector Lee et al. (2018b), a typical approach for OoDD, achieved a high area under the receiver operating characteristic curve (AUROC) in the literature's experimental setting of CIFAR-100 vs. LSUN. However, it showed a near-chance AUROC in our experimental setting of detecting LSUN-FIX against CIFAR-40, indicating that the model virtually performs random guessing on the OoDD problem (for this reason, we excluded the Mahalanobis detector from the comparison).

It is important to note that classes in the far-OoD datasets should not be similar to those in the in-distribution dataset CIFAR-40. To examine the similarity between classes in these datasets, we utilize the WordNet tree, which encodes a hierarchical structure of concepts through super-subordinate relationships between words. First, the subtree whose nodes correspond to the CIFAR-100 classes is extracted from the overall WordNet tree. One part of this subtree is illustrated in Figure 3(a). Then, a particular class in a far-OoD dataset is removed if its class name matches at least one node in the subtree. Obviously, the classes in the SVHN, DTD, and Gaussian noise datasets do not need to be examined. For LSUN-FIX, we confirmed that none of its classes overlap with those in CIFAR-100; therefore, we did not remove any classes from it. Hence, it is guaranteed that our far-OoD datasets (SVHN, DTD, Gaussian noise, and LSUN-FIX) do not semantically overlap with the in-distribution dataset CIFAR-40.

Figure 2: Examples of original images (red box; top row) and reconstructed images (green box; bottom row). The original images have artificial noise, leading to easy detection without considering the semantic information.

External dataset. Tiny ImageNet consists of 200 classes with low-resolution images from the original ImageNet dataset Le and Yang (2015). Its downsampled images also contain artificial noise, leading to the same problem found in the resized LSUN dataset, as shown in Figure 2. Thus, we reconstructed the Tiny ImageNet dataset by resizing the corresponding ImageNet images with the same fixed resize operation used to construct LSUN-FIX (hereafter, "Tiny ImageNet-FIX" refers to this reconstructed Tiny ImageNet dataset).

As mentioned earlier, the external dataset should not overlap with the in-distribution or near-OoD datasets. Images in CIFAR-100 and Tiny ImageNet were collected based on certain words in WordNet. Each word in WordNet has its own WordNet ID (wnid), which can therefore be used to examine overlap between CIFAR-100 and Tiny ImageNet classes on the WordNet tree. However, the classes in CIFAR-100 do not have explicit wnids, while those in Tiny ImageNet do. Accordingly, we manually assigned the corresponding wnids to the CIFAR-100 classes. For example, the CIFAR-100 classes camel and mouse are assigned wnids n02437136 and n02330245, respectively.

(a) CIFAR-100 subtree
(b) Tiny ImageNet subtree
Figure 3: Part of (a) the CIFAR-100 subtree and (b) the Tiny ImageNet subtree. The nodes with wnid are classes of each dataset. The orange nodes are removed from Tiny ImageNet because they are descendant nodes of the CIFAR-100 classes.

The procedure used to examine the similarity between CIFAR-100 classes and Tiny ImageNet classes is conducted on a subtree of the WordNet tree. Here, the subtree consists of nodes corresponding to the classes in Tiny ImageNet, whereas the subtree used to construct the far-OoD datasets consists of those in CIFAR-100. A particular class in Tiny ImageNet is removed if its wnid is identical to a specific wnid of CIFAR-100 classes or is a descendant of nodes corresponding to CIFAR-100 classes. Figure 3(b) presents part of this subtree. The leaf nodes with wnid and the red nodes indicate classes in Tiny ImageNet and CIFAR-100, respectively. Orange nodes such as bullet train or school bus have CIFAR-100 classes (i.e., train and bus) as their ancestors, and they can be regarded as classes semantically overlapping the CIFAR-100 classes. According to this procedure, we removed such classes (i.e., orange nodes) from the Tiny ImageNet dataset. As a result, 42 classes in total were removed from Tiny ImageNet; therefore, the remaining 158 classes comprise Tiny ImageNet158-FIX, which is the external dataset of our CIFAR-based benchmark.
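The wnid-based filtering can be sketched as an ancestor walk in a parent-pointer map: a candidate class is dropped if any node on its path to the root is an in-distribution class. A toy example (the parent map and identifiers here are illustrative stand-ins for the real WordNet tree and wnids):

```python
def filter_external_classes(parent, external_wnids, in_dist_wnids):
    """Keep an external class only if neither it nor any of its ancestors
    in the (toy) WordNet tree `parent` is an in-distribution class."""
    banned = set(in_dist_wnids)

    def overlaps(wnid):
        node = wnid
        while node is not None:       # walk toward the root
            if node in banned:
                return True
            node = parent.get(node)   # None once the root is passed
        return False

    return [w for w in external_wnids if not overlaps(w)]
```

In the example from Figure 3(b), school bus would be removed because its ancestor bus is a CIFAR-100 class, while an unrelated class survives.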

3.2 ImageNet-based Benchmark

The well-known ImageNet is the most popular large-scale dataset typically used for image classification. To evaluate the unknown detection performance with a large-scale dataset, we also developed an ImageNet-based benchmark following similar procedures employed for our CIFAR-based benchmark. The WordNet tree fits perfectly when used to determine the semantic distance between ImageNet classes, as all classes of ImageNet come from WordNet Deng et al. (2009). To construct our ImageNet-based benchmark, ImageNet classes were divided into three categories based on a hierarchical structure of the WordNet tree: the in-distribution, the near-OoD, and the external datasets. As the far-OoD datasets, we considered Food-101 Bossard et al. (2014), Caltech-256 Griffin et al. (2007), and Places-365 Zhou et al. (2017) after removing classes that overlapped with the ImageNet classes.

(a) near-OoD/external classes
(b) far-OoD classes
Figure 4: Construction process of the near-/far-OoD, and the external dataset in ImageNet-based benchmark: (a) shows how the near-OoD classes and the external classes are determined. The classes in the red box are near-OoD class candidates with the in-distribution class plunger, and (b) explains how the far-OoD classes are examined. The blue box contains nodes to be examined to determine if a class having the keyword knife should be removed from the far-OoD datasets.

In-distribution dataset. We constructed the in-distribution dataset ImageNet-200 with 200 classes identical to those in Tiny ImageNet. Note that the images in this benchmark are not downsampled to preserve the original image resolutions. Specifically, all ImageNet training images from the 200 classes were used to construct the training and validation datasets for our ImageNet-based benchmark.

Near-OoD & External datasets. To categorize ImageNet classes according to their degree of semantic-level similarity, we utilize the WordNet tree, as in the procedure used to examine the far-OoD and external classes in the CIFAR-based benchmark. The subtree of WordNet consists of nodes corresponding to the 1000 ImageNet classes. Figure 4 shows a part of this subtree. We regard classes that are descendants of the parent nodes of in-distribution classes (i.e., ImageNet-200) as near-OoD class candidates. For example, in Figure 4(a), can opener is one of the near-OoD class candidates because it is a descendant of hand tool, which is the parent node of the in-distribution class plunger. On the other hand, garden tool, the parent node of the in-distribution class lawn mower, does not have any other descendants; accordingly, there are no near-OoD class candidates under garden tool.

Only 200 classes are randomly selected among all near-OoD candidates, while the remaining candidates are discarded. The external dataset consists of the remaining 394 ImageNet classes, excluding 200 in-distribution classes, 200 near-OoD classes, and the discarded classes. Consequently, the selected 200 classes formed a near-OoD dataset referred to as Near ImageNet-200, and the remaining 394 classes formed the external dataset External ImageNet-394. Note that Near ImageNet-200 consists of ImageNet validation images corresponding to 200 near-OoD classes, whereas External ImageNet-394 is constructed using ImageNet training images from 394 external-OoD classes.
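The candidate-selection step above can be sketched as collecting all descendants of the parents of in-distribution classes. The parent map below is a hypothetical toy tree, not the real WordNet subtree:

```python
# Hypothetical parent map for a WordNet-style subtree (illustrative only).
PARENT = {
    "plunger": "hand_tool",
    "can_opener": "hand_tool",
    "corkscrew": "can_opener",
    "lawn_mower": "garden_tool",
}

def children(node, parent=PARENT):
    return [c for c, p in parent.items() if p == node]

def descendants(node, parent=PARENT):
    """All nodes below `node`, found by a simple stack-based traversal."""
    out, stack = [], children(node, parent)
    while stack:
        c = stack.pop()
        out.append(c)
        stack.extend(children(c, parent))
    return out

def near_ood_candidates(in_dist, parent=PARENT):
    """Classes descending from a parent of an in-distribution class,
    excluding the in-distribution classes themselves."""
    cands = set()
    for cls in in_dist:
        cands.update(descendants(parent[cls], parent))
    return cands - set(in_dist)

print(sorted(near_ood_candidates(["plunger", "lawn_mower"])))
# can_opener (and its descendant corkscrew) qualify; garden_tool adds none.
```

In the real benchmark, 200 classes are then sampled from this candidate set to form Near ImageNet-200.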

Far-OoD dataset. For the far-OoD datasets, we adopt other high-resolution datasets, Food-101, Caltech-256 and Places-365. To make the far-OoD dataset clearly semantically far from the in-distribution dataset, we eliminate classes similar to the in-distribution classes from them. The keywords in the names of the far-OoD classes are used as the starting point for the similarity examination process.

We utilize the same subtree used to examine the near-OoD classes in the ImageNet-based benchmark. First, we inspect whether any nodes match a keyword in the names of the far-OoD classes. A far-OoD class is removed if (i) the matching node is one of the ImageNet classes, (ii) the sibling nodes of the matching node contain at least one of the ImageNet classes, or (iii) the ancestor or descendant nodes of the matching node contain at least one of the ImageNet classes.

Figure 4(b) gives an example of the examination process when the name of a certain far-OoD class includes knife. The subtree has a node including knife in its name; therefore, we check all of its sibling, descendant, and ancestor nodes (see the blue box in the figure). Far-OoD classes having knife in their names are removed because they meet the removal criteria (ii) or (iii).
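A minimal sketch of the three removal criteria, again over a hypothetical parent map rather than the real WordNet subtree (all names here are illustrative):

```python
# Toy WordNet-style fragment (illustrative only).
PARENT = {
    "carving_knife": "knife",
    "knife": "edge_tool",
    "hatchet": "edge_tool",
    "edge_tool": "tool",
}
IMAGENET_CLASSES = {"hatchet", "carving_knife"}

def ancestors(node):
    while node in PARENT:
        node = PARENT[node]
        yield node

def descendants(node):
    out, stack = [], [c for c, p in PARENT.items() if p == node]
    while stack:
        c = stack.pop()
        out.append(c)
        stack.extend(ch for ch, p in PARENT.items() if p == c)
    return out

def siblings(node):
    p = PARENT.get(node)
    return [c for c, q in PARENT.items() if q == p and c != node] if p else []

def remove_far_ood(keyword_node):
    """(i) the node is an ImageNet class, (ii) a sibling is one, or
    (iii) an ancestor or descendant is one."""
    return (
        keyword_node in IMAGENET_CLASSES
        or any(s in IMAGENET_CLASSES for s in siblings(keyword_node))
        or any(a in IMAGENET_CLASSES for a in ancestors(keyword_node))
        or any(d in IMAGENET_CLASSES for d in descendants(keyword_node))
    )
```

Here `remove_far_ood("knife")` is true via criteria (ii) and (iii): its sibling hatchet and its descendant carving_knife are ImageNet classes, mirroring the Figure 4(b) example.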

Variation      Examples
Plural         cup cakes → cup cake; donuts → donut; oysters → oyster; billiards → billiard; chopsticks → chopstick; head phones → head phone; badlands → badland; artists loft → artist loft; butchers shop → butcher shop; childs room → child room
Spacing        cup cakes → cupcakes; hot dog → hotdog; chess board → chessboard; dumb bell → dumbbell; head phones → headphones; barndoor → barn door; fishpond → fish pond; shopfront → shop front
Abbreviation   chimp → chimpanzee
Apostrophe     artist loft → artist's loft; butchers shop → butcher's shop; child room → child's room

Table 4: Examples of grammatical variations of the keywords in each far-OoD dataset (Food-101, Caltech-256, and Places-365). The left side of the arrow is the original class name and the right side is its possible variation.

Each dataset has its own naming rule and we therefore consider grammatical variations of the keywords, such as plurals, abbreviations, and different spacings, among others. For example, for cup cakes in Food-101, cup, cups, cake, cakes, cupcake, and cupcakes are considered keywords. In a case such as chimp in Caltech-256, we check images in the class to clarify whether it is a short form of chimpanzee, and consider chimpanzee as a keyword for that class. Table 4 provides several examples of grammatical variations of the types considered in this work.
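The variation generation described above can be sketched as follows. The rules and the helper name are our own illustration; the paper's actual keyword lists were curated manually with image inspection for cases like chimp:

```python
def keyword_variations(name):
    """Generate plural/singular, spacing, and apostrophe variations of a
    class name, e.g. 'cup cakes' -> cup, cups, cake, cakes, cupcakes, ..."""
    words = name.lower().replace("_", " ").split()
    forms = set(words)                 # individual words as keywords
    forms.add(" ".join(words))         # original spacing
    forms.add("".join(words))          # joined spelling, e.g. "cupcakes"
    for w in words:                    # toggle plural/singular per word
        if w.endswith("s"):
            forms.add(w[:-1])
        else:
            forms.add(w + "s")
    # apostrophe variation, e.g. "butchers shop" -> "butcher's shop"
    if len(words) > 1 and words[0].endswith("s"):
        forms.add(words[0][:-1] + "'s " + " ".join(words[1:]))
    return forms
```

For example, `keyword_variations("cup cakes")` contains cup, cups, cake, cakes, and cupcakes, matching the keyword set listed in the text.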

After the examination process described above, Food-101, Caltech-256, and Places-365 have 32, 45, and 82 remaining classes, and they were renamed Food-32, Caltech-45, and Places-82, respectively. Consequently, the far-OoD datasets consist of images from these remaining classes.

4 Experimental Results

In this section, we present the comparison methods, experimental settings, and evaluation metrics for the proposed unknown detection task. The subsequent subsections provide the unknown detection performance of the comparison methods on the proposed CIFAR- and ImageNet-based benchmarks.

Comparison methods. We considered a total of eight methods from the two categories relevant to the unknown detection task: MD and OoDD. We mostly adopted the hyperparameters reported in the literature and manually searched for them when the optimal values were not clearly specified or when the experimental settings (e.g., datasets, architectures) differed from ours.

Among the existing methods developed for MD, we chose two: CRL Moon et al. (2020) and EDL Sensoy et al. (2018). CRL utilizes a ranking loss based on the frequency of correct predictions for each sample to obtain better confidence estimates; as a result, a model trained with CRL produces high confidence scores on correctly predicted samples. The method has a single hyperparameter that controls the balance between the cross-entropy and ranking losses, which we set following Moon et al. (2020). EDL treats the softmax output as a categorical distribution rather than a point estimate by introducing a Dirichlet density. It has two hyperparameters corresponding to the type of objective function and the substitute layer for the softmax layer. We empirically found their optimal choices with our in-distribution validation set: the Type II maximum likelihood objective and an exponential layer, respectively.

For the OoDD methods, we chose three popular methods: ODIN Liang et al. (2018), Outlier Exposure (OE) Hendrycks et al. (2019), and OpenMax Bendale and Boult (2016). The hyperparameters of the OoDD methods were selected based on OoDD performance in terms of AUROC on our validation datasets (refer to Table 2), as they were designed for that purpose. Note that the external dataset does not overlap with the near-/far-OoD datasets; therefore, we can ensure that a model cannot directly access the test datasets (i.e., the constructed OoD datasets). ODIN improves OoDD performance through temperature scaling and by adding small perturbations to the input samples. Using the gradients of a model's output with respect to the input pixels, input perturbations are added to widen the gap between the confidence values of the in-distribution and OoD datasets. The temperature and the perturbation magnitude were searched for each experiment within the ranges used in Lee et al. (2018b). OE is a training method that leverages an external dataset of outliers; by explicitly exposing a model to such outliers, OE substantially improves OoDD performance. It has a hyperparameter that controls the effect of learning on outliers; because OE sets it to the same value across different experimental settings, we used the value suggested in the literature. OpenMax utilizes extreme value theory to estimate, with a Weibull distribution, how likely inputs are to come from outside the in-distribution. It hence has hyperparameters related to the Weibull distribution, such as the number of inputs that deviate most from the mean vector (used to fit the extreme distribution) and the number of top activated classes used to estimate the unknown-class probability. We conducted a grid search following Bendale and Boult (2016) to find the optimal values for each experiment.

We also evaluate several previous methods that regularize a model to produce high-quality confidence estimates: AugMix Hendrycks et al. (2020), Monte Carlo dropout (MCdropout) Gal and Ghahramani (2016), and Deep Ensemble. AugMix is a data augmentation technique that mixes several randomly chosen augmentations. It focuses on improving both the robustness and the uncertainty estimates of a model under distribution shift, such as corrupted inputs. Although AugMix has several hyperparameters, Hendrycks et al. Hendrycks et al. (2020) empirically confirmed that a model's performance is not sensitive to their values; accordingly, we used the hyperparameter values reported in the literature. MCdropout approximates the posterior distribution by sampling several stochastic predictions with dropout enabled during the test phase, and confidence values are estimated by averaging these stochastic predictions Kendall and Gal (2017). The ResNet architecture for MCdropout used in our experiments stems from Zhang et al. (2019): dropout layers are applied to all convolutional layers, and an additional dropout layer is added before the last fully connected layer. For Deep Ensemble, confidence values are estimated by averaging the predictions of five baseline models.
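Both MCdropout and Deep Ensemble estimate confidence by averaging softmax predictions, over stochastic forward passes in one case and over ensemble members in the other. A minimal numpy sketch (the function name is ours):

```python
import numpy as np

def mean_softmax_confidence(prob_stack):
    """Average a stack of softmax predictions (T runs x N samples x K classes)
    and return the maximum class probability per sample as the confidence."""
    mean_probs = prob_stack.mean(axis=0)   # (N, K)
    return mean_probs.max(axis=1)          # (N,)

# Two 'members' disagreeing on one sample: averaging lowers its confidence.
p1 = np.array([[0.9, 0.1], [0.8, 0.2]])
p2 = np.array([[0.1, 0.9], [0.7, 0.3]])
conf = mean_softmax_confidence(np.stack([p1, p2]))
# The disputed first sample ends up with the lowest confidence.
```

Disagreement between members (or between dropout samples) thus translates directly into lower maximum class probability, which is what makes these methods useful for unknown detection.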

Experimental settings. We evaluate all comparison methods under the same experimental settings on each benchmark. For the CIFAR-based benchmark, we employed the ResNet18 He et al. (2016) architecture. All models were trained using SGD with momentum and weight decay, and the learning rate was reduced in steps during training. We also employed a standard data augmentation scheme, i.e., random horizontal flipping and random cropping after padding each side.

For the ImageNet-based benchmark, we adopted standard ImageNet training settings: every model was trained using SGD with momentum and weight decay, with the learning rate reduced in steps during training. The data augmentation scheme consisted of random resized cropping and random horizontal flipping.

OE was trained from scratch, and the mini-batch size of the external dataset (i.e., Tiny ImageNet158-FIX or External ImageNet-394 for the CIFAR- and ImageNet-based benchmarks, respectively) was set to twice that of the in-distribution dataset, following the original training recipe Hendrycks et al. (2019).

Evaluation metrics.

The most commonly used performance metric for MD and OoDD is the area under the receiver operating characteristic curve (AUROC). Because these tasks essentially aim to distinguish correct predictions (or in-distribution samples) from incorrect ones (or OoD samples), measuring the ranking performance of confidence values is a natural choice. However, evaluating unknown detection performance with AUROC can be problematic. As Ashukha et al. Ashukha et al. (2020) pointed out, AUROC cannot be used directly to compare MD performance across different methods, because each model has its own correct and incorrect predictions: each model induces an individual classification problem with its own positives (i.e., correct predictions) and negatives (i.e., incorrect predictions). Like the MD problem, the unknown detection task treats incorrect predictions as unknown samples; accordingly, AUROC is not suitable for precisely comparing unknown detection performance between different methods.

We thus adopt the area under the risk-coverage curve (AURC) Geifman et al. (2019) as the primary measure of unknown detection performance. AURC measures the area under the curve of risk as a function of sample coverage, where the sample coverage is defined as the ratio of samples whose confidence values exceed a certain confidence threshold. Let D = {(x_i, y_i)}_{i=1}^{N} be a dataset consisting of N labeled samples from a joint distribution over X × Y, where X is the input space and Y is the label set for the K classes. Given a softmax classifier f with a model parameter set w, we can obtain the predicted class probabilities f(x) for each x and a confidence score κ(x), e.g., the maximum class probability or entropy, associated with the prediction. Then, a subset D_δ of D, consisting of samples whose confidence scores are above a predefined threshold δ, can be constructed. Therefore, the sample coverage at the confidence threshold δ is computed as

cov(δ) = |D_δ| / |D|,

where |·| denotes the cardinality of a set. The risk represents the error rate of the samples in D_δ. Therefore, the risk at δ can be computed as

risk(δ) = (1 / |D_δ|) Σ_{(x, y) ∈ D_δ} 1[ŷ(x) ≠ y],

where 1[·] denotes an indicator function and ŷ(x) represents the predicted class, i.e., ŷ(x) = argmax_k f(x)_k. Note that OoD samples from both near- and far-OoD should be considered as errors in our unknown detection task since, by definition, their true classes do not belong to any of the in-distribution classes. Consequently, a low AURC value implies that f assigns higher confidence values to correctly predicted in-distribution samples than to incorrect predictions and OoD samples. In other words, a classifier with a lower AURC is more reliable in that it shows strong confidence only for what it knows. In this work, we consider the maximum class probability as the confidence estimator, i.e., κ(x) = max_k f(x)_k.

Although our primary goal is to examine the unknown detection performance of the comparison targets based on AURC, we also evaluate their MD and OoDD performances separately to gain insight into the behavior of each method. For example, it has yet to be revealed whether methods showing superior OoDD performance are effective at MD as well, and vice versa. To this end, we additionally consider two performance metrics commonly used for OoDD: AUROC and the false positive rate at the 95% true positive rate (FPR-95%-TPR). Note that MD performance is measured by AURC, following Geifman et al. (2019); Moon et al. (2020).
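Computing AURC reduces to a few lines once confidence scores and per-sample error indicators are available. A minimal sketch (the function name is ours), where OoD samples are marked as errors beforehand, per their definition in the unknown detection task:

```python
import numpy as np

def aurc(confidence, is_error):
    """Area under the risk-coverage curve: sort by descending confidence,
    accumulate the error rate at every coverage level, and average."""
    order = np.argsort(-np.asarray(confidence))
    errors = np.asarray(is_error, dtype=float)[order]
    n = len(errors)
    risks = np.cumsum(errors) / np.arange(1, n + 1)  # risk at coverage k/n
    return risks.mean()

conf = np.array([0.99, 0.95, 0.60, 0.40])
err = np.array([0, 0, 1, 1])  # e.g., a misclassification and an OoD input
print(aurc(conf, err))
```

A model that ranks all its errors (including OoD inputs) below its correct predictions, as in this toy example, attains the lowest possible AURC for that error count.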

Method      ACC (↑)       AURC (↓)        Gaussian Noise (↓)
Baseline    74.43±0.51    735.50±3.92     75.25±2.68
CRL         75.15±0.42    727.22±9.11     68.53±2.83
EDL         74.40±0.26    724.81±11.52    73.39±1.39
MCdropout   76.00±0.48    735.58±4.04     66.16±1.61
Ensemble    79.45         693.49          47.27
Augmix      75.77±0.22    718.16±6.24     65.94±0.89
ODIN        74.43±0.51    743.53±6.27     89.14±4.02
OE          72.85±0.61    697.67±3.64     93.63±6.05
OpenMax     74.01±0.34    734.56±9.37     162.61±12.19

Table 5: Comparison of unknown detection performance with other methods on the CIFAR-based benchmark (the Gaussian Noise column reports far-OoD detection). The means and standard deviations over five runs are reported. ↓ and ↑ indicate that lower and higher values are better, respectively. Red represents the best performance among all methods, whereas blue represents the best performance among the other methods except Outlier Exposure. The AURC values are multiplied by 10³, and the other values are percentages.

4.1 CIFAR-based Benchmark

The performance comparison results on the CIFAR-based benchmark are summarized in Table 5. Overall, Deep Ensemble achieves the best unknown detection performance. It detects unknowns better than OE despite the fact that OE exploits outliers during the training phase. This stems from the fact that Deep Ensemble achieves the highest classification accuracy on the in-distribution dataset, as expected, while OE loses classification accuracy compared to the baseline.

The other methods for detecting OoD inputs, such as ODIN or OpenMax, show similar or even worse unknown detection performance relative to that of the baseline. As the MD performance of ODIN and OpenMax shows, this occurs because their approaches distort the softmax outputs of in-distribution inputs to create a separation between the confidence scores of in-distribution inputs and those of OoD inputs. Similarly, we infer that the accuracy loss and poor MD performance of OE are side effects of learning uniformly distributed outputs for the auxiliary inputs.

On the other hand, CRL and EDL, methods developed for MD, show better unknown detection performance than the baseline. Additionally, they generally outperform the baseline in the in-distribution vs. near-/far-OoD settings. Although the improvement is not significant, this observation demonstrates the possibility that improving the confidence estimates of in-distribution inputs positively affects the confidence estimates of OoD inputs. AugMix shows competitive performance overall, which implies that training with appropriate regularization methods can improve the confidence estimates of both in-distribution and OoD inputs. MCdropout improves the MD performance of the baseline, but its performance on the other tasks is generally worse than that of the baseline.

Figure 5(a) summarizes the improvement ratio in unknown detection performance of the comparison methods over the baseline on the CIFAR-based benchmark. Deep Ensemble, AugMix, and OE show improved unknown detection performance compared to the baseline, but other methods, such as ODIN, perform worse than even the baseline. These results clearly support the aim of our study: focusing only on a specific problem setting, MD or OoDD, hinders any effective and valid comparison of the unknown detection performance of deep neural networks.

It is worth noting that none of the comparison methods achieve a significant improvement on near-OoD detection, indicating that a close semantic distance between in-distribution and near-OoD causes the misalignment of the ordinal ranking between the confidence scores of the in-distribution inputs and near-OoD inputs, as discussed later in Section 4.3.

(a) CIFAR-based benchmark
(b) ImageNet-based benchmark
Figure 5: Performance improvement ratio of each comparison method in unknown detection. The green dashed center line indicates the baseline performance.

4.2 ImageNet-based Benchmark

We also examine the unknown detection performance of the comparison methods on the ImageNet-based benchmark, as summarized in Table 6. In this benchmark, Deep Ensemble still outperforms the other comparison methods with the best classification accuracy and the best unknown detection performance. Unlike the results with the CIFAR-based benchmark, OE does not degrade the classification accuracy on the in-distribution dataset, and it increases the MD performance. This observation implies that the performance of OE is fairly sensitive to the dataset used as outliers during training, which limits the practical applicability of OE given that manual refinement of large-scale outliers is infeasible.

Method      ACC (↑)       AURC (↓)        Near ImageNet-200 (↓)
Baseline    81.09±0.65    585.74±7.13     42.12±4.79
CRL         81.89±0.36    578.20±1.36     36.54±1.10
EDL         77.97±1.48    585.15±5.15     46.32±4.95
MCdropout   81.11±0.50    585.00±6.43     39.10±1.80
Ensemble    84.30         569.98          29.63
Augmix      80.89±0.92    581.04±3.70     39.72±3.06
ODIN        81.09±0.65    590.13±6.36     45.33±4.62
OE          83.21±0.43    575.96±6.89     34.37±4.47
OpenMax     81.32±0.77    621.28±11.29    59.05±2.85

Table 6: Comparison of unknown detection performance over the comparison methods on the ImageNet-based benchmark (the Near ImageNet-200 column reports near-OoD detection). The means and standard deviations over five runs are reported. ↓ and ↑ indicate that lower and higher values are better, respectively. Red represents the best performance among all methods, whereas blue represents the best performance among the other methods except Outlier Exposure. The AURC values are multiplied by 10³, and the other values are percentages.

ODIN and OpenMax perform well in near-/far-OoD detection, as they target the detection of OoD inputs. However, their MD and unknown detection performances are worse than those of the baseline, as similarly observed in the CIFAR-based benchmark results. This observation confirms again that increasing the gap between the confidence values of in-distribution inputs and those of OoD inputs by directly manipulating the softmax distribution is not a desirable approach in terms of unknown detection.

CRL shows competitive unknown detection performance. There is no noticeable performance improvement in near-/far-OoD detection, but CRL produces good confidence estimates for in-distribution inputs, which positively affects its unknown detection performance. AugMix improves on the baseline in both unknown detection and MD, but this improvement was not found for the detection of OoD inputs; this likely stems from the interpolated inputs created using diverse classes. Similar to the CIFAR-based benchmark results, MCdropout produces much better confidence estimates for in-distribution inputs than the baseline, but its performance on the other tasks is not much different from that of the baseline. EDL struggles to classify the numerous classes of this benchmark compared to the other methods.

As in the results on the CIFAR-based benchmark, near-OoD is the most difficult task when seeking an improvement from the baseline. Although the ImageNet-based benchmark has far more diverse classes and far more images than the CIFAR-based benchmark, the models still have difficulty distinguishing the near-OoD inputs from the in-distribution inputs according to their confidence values.

Figure 5(b) illustrates the unknown detection improvement ratio of each comparison method over the baseline on the ImageNet-based benchmark. It shows a tendency similar to that on the CIFAR-based benchmark overall. Deep Ensemble, AugMix, and OE improve over the baseline, and ODIN shows low performance, as in the CIFAR-based benchmark results. Notably, the performance of OpenMax is significantly degraded on this benchmark, although it shows competitive far-OoD detection performance on Caltech-45 and Places-82.

(a) Baseline
(b) Deep Ensemble
Figure 6: Density plot of the maximum class probability for each category on the ImageNet-based benchmark.
(a) CIFAR-60: Baseline
(b) CIFAR-60: Deep Ensemble
(c) CIFAR-60: Outlier Exposure
(d) SVHN: Baseline
(e) SVHN: Deep Ensemble
(f) SVHN: Outlier Exposure
(g) DTD: Baseline
(h) DTD: Deep Ensemble
(i) DTD: Outlier Exposure
Figure 7: Visualization of the softmax outputs of the baseline, Deep Ensemble, and Outlier Exposure on the CIFAR-based benchmark. The softmax values for a specific OoD class are averaged over all images in that class. The columns in each heatmap consist of 20 classes from CIFAR-40 (in-distribution) sampled from each superclass. The classes on each row are sampled from the corresponding OoD dataset. For the near-OoD cases, (a), (b), and (c), the classes in the i-th column and the i-th row belong to the same superclass.

4.3 Discussion

It is intuitively understandable that far-OoD inputs should have the lowest prediction confidence among the input categories, meaning that known inputs and far-OoD inputs should be the most distinguishable. It is, however, not obvious which of the two remaining categories, incorrectly predicted inputs or near-OoD inputs (belonging to the in-distribution and OoD, respectively), should be more distinguishable from known inputs.

Figure 6 shows a density plot of the maximum class probabilities (MCP) produced by the baseline (Figure 6(a)) and by Deep Ensemble (Figure 6(b)) on the ImageNet-based benchmark. Near-OoD inputs generally have higher MCP than far-OoD inputs. This observation supports the validity of the benchmark datasets: our near-OoD dataset is semantically closer to the in-distribution dataset than the far-OoD datasets. Deep Ensemble is more capable than the baseline of detecting far-OoD inputs with MCP, but, like the baseline, it still assigns slightly higher MCP to near-OoD inputs than to incorrectly predicted inputs. Specifically, on the ImageNet-based benchmark, the AURC of the baseline in the known inputs vs. near-OoD setting (averaged over five runs) is much worse than its AURC in the known inputs vs. incorrectly predicted inputs setting. This tendency is also observed for the other comparison methods. We infer that this occurs because the near-OoD inputs in our benchmarks share high-level semantics with the in-distribution classes.

Figure 7 visualizes the distribution of the softmax values from the baseline, Deep Ensemble, and OE for the near-/far-OoD classes in the CIFAR-based benchmark. The first row shows the softmax distributions for near-OoD classes (i.e., classes in CIFAR-60). All three models produce relatively high softmax values for the semantically close classes. On the other hand, all models produce widespread softmax probabilities for far-OoD classes, as demonstrated in the second and third rows of Figure 7. Interestingly, the softmax distributions for far-OoD classes from DTD (the third row in the figure) show similar patterns: the probability mass is concentrated on a specific in-distribution class, forest. This implies that conventional deep softmax classifiers cannot be guaranteed to produce low predictive confidence for inputs far from the in-distribution. The third column of Figure 7 shows that OE generally produces uniformly distributed outputs for OoD inputs, as intended.

One noticeable observation is that the softmax outputs of Deep Ensemble for near-OoD classes do not differ much from those of the baseline, even though this method outperforms the baseline on the near-OoD detection task, as shown in Table 5. This occurs because Deep Ensemble widens the gap in confidence values between in-distribution and OoD inputs by assigning higher softmax values to confident in-distribution predictions, rather than by reducing the confidence values of the near-OoD inputs.

As described in Figure 7 and as shown by our experiment results, it is challenging to separate in-distribution inputs and near-OoD inputs based on confidence estimates from deep neural networks trained with a limited dataset. This is a task-specific problem, which means that not every task requires the detection of near-OoD inputs. For example, it is not necessary to detect new dog breeds as unknown classes if the target task is to classify dogs vs. cats. However, a model should consider new dog breeds as unknown classes if the task is to classify dog breeds. The detection capability of near-OoD inputs should be a great concern, especially for safety-critical applications such as autonomous driving. An autonomous car should identify a new traffic sign as unknown to prevent the occurrence of fatalities. This discussion leads us to important research topics: how can we train deep classifiers that have powerful detectability on a wide range of unknowns including near-OoD inputs? and is it possible to build a universal classifier whose confidence outputs can be adjusted by conditioning on a target task?

5 Conclusion

In this paper, we highlight that measuring only specific detection capabilities is not enough to evaluate the quality of a deep neural network's confidence estimates; hence, we define the unknown detection task by integrating MD and OoDD to measure the true detection capability. Unknown detection aims to distinguish the confidence scores of unknown inputs, including incorrectly predicted inputs and OoD inputs, from those of known inputs that are correctly predicted. To evaluate unknown detection performance, we propose unified benchmarks consisting of three categories: in-distribution, near-OoD, and far-OoD. With the proposed benchmark datasets, we examine the unknown detection performance of popular methods proposed for MD and OoDD. Our experimental results demonstrate that the existing methods proposed for a specific task show lower performance than even the baseline on the other tasks. Although Deep Ensemble shows the most competitive performance for unknown detection, there remain areas requiring improvement in OoDD, especially on near-OoD inputs. We believe that the proposed unknown detection task and the reconfigured benchmark datasets are a valuable starting point for investigating the trustworthiness of deep neural networks.


This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2021R1C1C1011907).


  • A. Ashukha, A. Lyzhov, D. Molchanov, and D. Vetrov (2020) Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. In Proceedings of International Conference on Learning Representations. Cited by: §4.
  • A. Bendale and T. E. Boult (2016) Towards open set deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1563–1572. Cited by: §1, §2, §4.
  • A. Birhane and V. U. Prabhu (2021) Large image datasets: a pyrrhic win for computer vision?. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 1536–1546. Cited by: §3.1.
  • L. Bossard, M. Guillaumin, and L. Van Gool (2014) Food-101 – mining discriminative components with random forests. In Proceedings of European Conference on Computer Vision, pp. 446–461. Cited by: §3.2.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In Proceedings of International Conference on Machine Learning, Vol. 119, pp. 1597–1607. Cited by: §2.
  • H. Choi, E. Jang, and A. A. Alemi (2018) WAIC, but why? Generative ensembles for robust anomaly detection. arXiv preprint arXiv:1802.04865. Cited by: §2.
  • M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi (2014) Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3606–3613. Cited by: §3.1.
  • C. Corbière, N. Thome, A. Bar-Hen, M. Cord, and P. Pérez (2019) Addressing failure prediction by learning model confidence. In Proceedings of Advances in Neural Information Processing Systems, Vol. 32, pp. 2902–2913. Cited by: §1, Table 1, §2.
  • C. Corbière, N. Thome, A. Saporta, T. Vu, M. Cord, and P. Pérez (2021) Confidence estimation via auxiliary models. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §1, §3.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186. Cited by: §1.
  • T. DeVries and G. W. Taylor (2018) Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865. Cited by: §2.
  • R. El-Yaniv and Y. Wiener (2010) On the foundations of noise-free selective classification. Journal of Machine Learning Research 11 (53), pp. 1605–1641. Cited by: §1.
  • Y. Gal and Z. Ghahramani (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proceedings of International Conference on Machine Learning, Vol. 48, pp. 1050–1059. Cited by: §4.
  • Z. Ge, S. Demyanov, Z. Chen, and R. Garnavi (2017) Generative OpenMax for multi-class open set classification. In Proceedings of British Machine Vision Conference. Cited by: §2.
  • Y. Geifman and R. El-Yaniv (2017) Selective classification for deep neural networks. In Proceedings of Advances in Neural Information Processing Systems, Vol. 30, pp. 4878–4887. Cited by: §1, §2.
  • Y. Geifman, G. Uziel, and R. El-Yaniv (2019) Bias-reduced uncertainty estimation for deep neural classifiers. In Proceedings of International Conference on Learning Representations. Cited by: §1, Table 1, §2, §4.
  • G. Griffin, A. Holub, and P. Perona (2007) Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology. Cited by: §3.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In Proceedings of European Conference on Computer Vision, pp. 630–645. Cited by: §4.
  • D. Hendrycks and K. Gimpel (2017) A baseline for detecting misclassified and out-of-distribution examples in neural networks. In Proceedings of International Conference on Learning Representations. Cited by: §2.
  • D. Hendrycks, M. Mazeika, and T. Dietterich (2019) Deep anomaly detection with outlier exposure. In Proceedings of International Conference on Learning Representations. Cited by: Table 1, §2, §3, §4.
  • D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan (2020) AugMix: a simple method to improve robustness and uncertainty under data shift. In Proceedings of International Conference on Learning Representations. Cited by: §4.
  • G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Processing Magazine 29 (6), pp. 82–97. Cited by: §1.
  • J. Jang and C. O. Kim (2020) One-vs-rest network-based deep probability model for open set recognition. arXiv preprint arXiv:2004.08067. Cited by: §1, Table 1, §2.
  • H. Jiang, B. Kim, M. Y. Guan, and M. R. Gupta (2018) To trust or not to trust a classifier. In Proceedings of Advances in Neural Information Processing Systems, pp. 5546–5557. Cited by: §1.
  • A. Kendall and Y. Gal (2017) What uncertainties do we need in Bayesian deep learning for computer vision?. In Proceedings of Advances in Neural Information Processing Systems, pp. 5580–5590. Cited by: §4.
  • D. P. Kingma and P. Dhariwal (2018) Glow: generative flow with invertible 1x1 convolutions. In Proceedings of Advances in Neural Information Processing Systems, Vol. 31, pp. 10215–10224. Cited by: §2.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical Report University of Toronto. Cited by: §1.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Proceedings of Advances in Neural Information Processing Systems, Vol. 25, pp. 1097–1105. Cited by: §1.
  • B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Proceedings of Advances in Neural Information Processing Systems, Vol. 30. Cited by: §1.
  • Y. Le and X. Yang (2015) Tiny ImageNet visual recognition challenge. CS 231N. Cited by: §3.1.
  • K. Lee, H. Lee, K. Lee, and J. Shin (2018a) Training confidence-calibrated classifiers for detecting out-of-distribution samples. In Proceedings of International Conference on Learning Representations, Cited by: §1, §2.
  • K. Lee, K. Lee, H. Lee, and J. Shin (2018b) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Proceedings of Advances in Neural Information Processing Systems, Vol. 31. Cited by: §1, §1, Table 1, §2, §2, §3.1, §3.1, §4.
  • S. Liang, Y. Li, and R. Srikant (2018) Enhancing the reliability of out-of-distribution image detection in neural networks. In Proceedings of International Conference on Learning Representations, Cited by: §1, §1, Table 1, §2, §2, §3.1, §4.
  • A. Mehrtash, W. M. Wells, C. M. Tempany, P. Abolmaesumi, and T. Kapur (2020) Confidence calibration and predictive uncertainty estimation for deep medical image segmentation. IEEE Transactions on Medical Imaging 39 (12), pp. 3868–3878. Cited by: §1.
  • T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, and S. Khudanpur (2010) Recurrent neural network based language model. In Proceedings of Annual Conference of the International Speech Communication Association, Vol. 2, pp. 1045–1048. Cited by: §1.
  • G. A. Miller (1995) WordNet: a lexical database for English. Communications of the ACM 38 (11), pp. 39–41. Cited by: §1.
  • S. Mohseni, M. Pitale, V. Singh, and Z. Wang (2020) Practical solutions for machine learning safety in autonomous vehicles. In Proceedings of the Association for the Advancement of Artificial Intelligence Workshop on Artificial Intelligence Safety, Cited by: §1.
  • J. Moon, J. Kim, Y. Shin, and S. Hwang (2020) Confidence-aware learning for deep neural networks. In Proceedings of International Conference on Machine Learning, Vol. 119, pp. 7034–7044. Cited by: §1, Table 1, §2, §4, §4.
  • E. Nalisnick, A. Matsukawa, Y. W. Teh, D. Gorur, and B. Lakshminarayanan (2019a) Do deep generative models know what they don’t know?. In Proceedings of International Conference on Learning Representations, Cited by: §2.
  • E. Nalisnick, A. Matsukawa, Y. W. Teh, D. Gorur, and B. Lakshminarayanan (2019b) Hybrid models with deep and invertible features. In Proceedings of International Conference on Machine Learning, Vol. 97, pp. 4723–4732. Cited by: §2.
  • A. B. Nassif, I. Shahin, I. Attili, M. Azzeh, and K. Shaalan (2019) Speech recognition using deep neural networks: a systematic review. IEEE Access 7, pp. 19143–19165. Cited by: §1.
  • L. Neal, M. Olson, X. Fern, W. Wong, and F. Li (2018) Open set learning with counterfactual images. In Proceedings of European Conference on Computer Vision, pp. 613–628. Cited by: §2, §2.
  • Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. In Proceedings of Advances in Neural Information Processing Systems Workshop on Deep Learning and Unsupervised Feature Learning, Cited by: §1.
  • P. Oza and V. M. Patel (2019a) C2AE: class conditioned auto-encoder for open-set recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2307–2316. Cited by: §1, Table 1, §2.
  • P. Oza and V. M. Patel (2019b) Deep CNN-based multi-task learning for open-set recognition. arXiv preprint arXiv:1903.03161. Cited by: §1, Table 1, §2, §2.
  • P. Perera, V. I. Morariu, R. Jain, V. Manjunatha, C. Wigington, V. Ordonez, and V. M. Patel (2020) Generative-discriminative feature representations for open-set recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11814–11823. Cited by: §1, Table 1, §2.
  • W. Rawat and Z. Wang (2017) Deep convolutional neural networks for image classification: a comprehensive review. Neural Computation 29 (9), pp. 2352–2449. Cited by: §1.
  • J. Ren, P. J. Liu, E. Fertig, J. Snoek, R. Poplin, M. Depristo, J. Dillon, and B. Lakshminarayanan (2019) Likelihood ratios for out-of-distribution detection. In Proceedings of Advances in Neural Information Processing Systems, Vol. 32, pp. 14707–14718. Cited by: §2.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Proceedings of Advances in Neural Information Processing Systems, Vol. 28, pp. 91–99. Cited by: §1.
  • R. Roady, T. L. Hayes, R. Kemker, A. Gonzales, and C. Kanan (2020) Are open set classification methods effective on large-scale datasets?. PLOS ONE 15 (9), pp. 1–18. Cited by: §2.
  • T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma (2017) PixelCNN++: improving the PixelCNN with discretized logistic mixture likelihood and other modifications. In Proceedings of International Conference on Learning Representations, Cited by: §2.
  • M. Sensoy, L. Kaplan, and M. Kandemir (2018) Evidential deep learning to quantify classification uncertainty. In Proceedings of Advances in Neural Information Processing Systems, Vol. 31, pp. 3179–3189. Cited by: Table 1, §4.
  • X. Sun, Z. Yang, C. Zhang, K. Ling, and G. Peng (2020) Conditional Gaussian distribution learning for open set recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13480–13489. Cited by: Table 1, §2.
  • J. Tack, S. Mo, J. Jeong, and J. Shin (2020) CSI: novelty detection via contrastive learning on distributionally shifted instances. In Proceedings of Advances in Neural Information Processing Systems, Vol. 33, pp. 11839–11852. Cited by: §3.1.
  • A. Torralba, R. Fergus, and W. T. Freeman (2008) 80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (11), pp. 1958–1970. Cited by: §3.1.
  • S. Vernekar, A. Gaurav, V. Abdelzad, T. Denouden, R. Salay, and K. Czarnecki (2019) Out-of-distribution detection in classifiers via generation. arXiv preprint arXiv:1910.04241. Cited by: §2.
  • J. Winkens, R. Bunel, A. G. Roy, R. Stanforth, V. Natarajan, J. R. Ledsam, P. MacWilliams, P. Kohli, A. Karthikesalingam, S. Kohl, T. Cemgil, S. M. A. Eslami, and O. Ronneberger (2020) Contrastive training for improved out-of-distribution detection. arXiv preprint arXiv:2007.05566. Cited by: §2.
  • F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao (2015) LSUN: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365. Cited by: §1.
  • Z. Zhang, A. V. Dalca, and M. R. Sabuncu (2019) Confidence calibration for convolutional neural networks using structured dropout. arXiv preprint arXiv:1906.09551. Cited by: §4.
  • Z. Zhao, P. Zheng, S. Xu, and X. Wu (2019) Object detection with deep learning: a review. IEEE Transactions on Neural Networks and Learning Systems 30 (11), pp. 3212–3232. Cited by: §1.
  • B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba (2017) Places: a 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (6), pp. 1452–1464. Cited by: §3.2.