One Versus all for deep Neural Network Incertitude (OVNNI) quantification

06/01/2020 ∙ by Gianni Franchi, et al.

Deep neural networks (DNNs) are powerful learning models, yet their results are not always reliable. This is because modern DNNs are usually uncalibrated and their epistemic uncertainty cannot be characterized. In this work, we propose a new technique to easily quantify the epistemic uncertainty of data. The method consists of mixing the predictions of an ensemble of DNNs trained to classify One class vs All the other classes (OVA) with predictions from a standard DNN trained to perform All vs All (AVA) classification. On the one hand, the adjustment provided by the AVA DNN to the score of the base classifiers allows for a more fine-grained inter-class separation. On the other hand, the two types of classifiers mutually reinforce their detection of out-of-distribution (OOD) samples, circumventing entirely the requirement of using such samples during training. Our method achieves state-of-the-art performance in quantifying OOD data across multiple datasets and architectures while requiring little hyper-parameter tuning.




1 Introduction

Deep neural networks (DNNs) have reached state-of-the-art performance on machine learning [32, 16] and computer vision tasks [53, 28]. This significant progress has led to their adoption in a wide range of decision-making systems, including safety-critical ones. Yet one of the main weaknesses of these techniques is that they tend to be overconfident [17] in their decisions. This issue is difficult to tackle, as the high inner complexity of DNNs results in poor output explainability.

Figure 1: Distribution of classification scores. Panels (a), (b) and (c) show the histograms of confidence scores of ResNet50 [18] trained on the CIFAR10 [27] training set and tested on SVHN [48] and the CIFAR10 test set, using Maximum Class Probability (MCP) [20], Deep Ensembles [30], and OVNNI, respectively. Our proposed algorithm OVNNI outperforms Deep Ensembles (state of the art) and MCP (baseline) at detecting OOD data, since it assigns a low score to more of the OOD data.

In order to address this crucial issue, we propose to rely on a finer quantification of the uncertainty of DNNs. In contrast to most Bayesian DNN techniques [3, 25, 15, 42, 14], or to frequentist techniques such as Deep Ensembles [30], our approach relies on One vs All (OVA) training. In the statistical learning community, ensembles of OVA or One vs One (OVO) base classifiers for multi-class prediction have been particularly popular in association with Support Vector Machines (SVMs), because an SVM is essentially a binary classifier and because the aggregation rules are simple and supported by fundamental theoretical results [26, 29, 58]. The most popular rule for OVA ensembles, winner-takes-all (WTA), assigns the test sample to the class with the highest membership score. For a binary output, the WTA rule creates multiple unclassifiable regions in the input space, for which the class assignment is not unique, and the standard solution is to rely on continuous membership scores. In contrast to SVM-based learning, the OVA approach has nowadays been mostly discarded when training deep classifiers, in favor of All vs All (AVA) learning.
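To make the winner-takes-all rule concrete, here is a minimal NumPy sketch. The function names (`wta_binary`, `wta_continuous`), the 0.5 decision threshold and the toy scores are illustrative assumptions, not part of the original work. With hard binary votes, a sample claimed by zero or by several base classifiers has no unique class; with continuous membership scores the argmax is unique except for exact ties.

```python
import numpy as np

def wta_binary(votes):
    """Winner-takes-all on hard 0/1 votes of K one-vs-all classifiers.

    votes: array of shape (n_samples, K) with entries in {0, 1}.
    Returns the class index when exactly one classifier fires, else -1
    (the sample lies in an "unclassifiable" region).
    """
    votes = np.asarray(votes)
    n_fired = votes.sum(axis=1)
    winner = votes.argmax(axis=1)
    return np.where(n_fired == 1, winner, -1)

def wta_continuous(scores):
    """Winner-takes-all on continuous membership scores: plain argmax."""
    return np.asarray(scores).argmax(axis=1)

# Toy membership scores for 3 samples and 3 classes (hypothetical values).
scores = np.array([
    [0.9, 0.2, 0.1],   # only class 0 fires
    [0.7, 0.8, 0.1],   # classes 0 and 1 both fire
    [0.3, 0.2, 0.4],   # no classifier fires
])
votes = (scores > 0.5).astype(int)   # assumed hard-decision threshold

binary_labels = wta_binary(votes)
continuous_labels = wta_continuous(scores)
print(binary_labels.tolist())      # [0, -1, -1]
print(continuous_labels.tolist())  # [0, 1, 2]
```

The last two samples receive a label from the continuous rule, but nothing in the argmax alone says whether that label is trustworthy, which is the gap the confidence adjustment discussed next is meant to address.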

In this paper, we propose to use OVA learning in order to improve the quantification of the epistemic uncertainty of the DNN. The underlying idea of our approach is that the score of a base classifier should be adjusted by a factor which approximates its local reliability in the region of the input space from which the test sample originated. Originally, for SVM learning, this reliability was linked to the average value of the local objective function [40], which is approximated using the closest training samples belonging to the respective class. In our algorithm, we propose to adjust the OVA scores by the score provided by an AVA DNN, which thus plays the role of approximating the local class-specific objective function. This strategy allows for a particularly effective detection of out-of-distribution (OOD) samples in the testing data, as we can discriminate between samples belonging to unclassifiable regions equally close to some classifiable regions, and samples belonging to unclassifiable regions far from all classifiable regions.

Fig. 1 presents the distribution of the scores provided by the baseline, Deep Ensembles (the current state of the art) and our method, respectively. The baseline is the single AVA classifier, for which the class assignment is performed based on the Maximum Class Probability (MCP). The baseline is unable to discriminate between in- and out-of-distribution samples, illustrated in blue/yellow and orange in the histograms, respectively. Deep Ensembles lowers the OOD scores, but the in-distribution membership is still overestimated. Finally, OVNNI successfully assigns low scores to the OOD samples, while at the same time keeping the in-distribution scores high.

Our main contributions are the following. First, we propose an efficient non-Bayesian technique for uncertainty quantification in OOD data classification that reaches state-of-the-art results on calibration and on OOD data detection on a variety of datasets and on all typical metrics. Second, we perform an extensive study in which we compare, for different applications, with the most significant approaches tackling uncertainty estimation, including Deep Ensembles. Lastly, our conclusions are in line with those of other works which defend the interest of One versus All classifier aggregation [54], and our results rehabilitate this approach in the novel context of uncertainty estimation for DNNs.

2 Related work

OOD detection is not a novel problem and was studied before the deep learning revival in various branches of machine learning under slightly different tasks: anomaly, outlier, or novelty detection [56]. In the last few years, this task has seen increased attention from different communities and has been addressed with predictive uncertainty estimation, ensemble methods, image reconstruction, etc. In the following, we briefly review some of the methods related to our approach.

Classification with a background class. In multiple computer vision tasks, e.g., object detection [53, 39], it is common to use a background class in addition to the known classes. This leads to a better separation of the classification space and a more discriminative classifier. While this seems to be a reasonable and straightforward approach for OOD detection, it is likely to suffer from negative dataset bias [57] and thus not generalize to other background objects not seen during training. In our approach, we also use a part of the classes as background when training the individual classifiers; however, the overlap of their decision boundaries, coupled with the AVA model, better distinguishes in-distribution from out-of-distribution samples.

Anomaly detection by reconstruction. Anomalies can be detected by training an autoencoder [10, 1] or a generative model [55, 37] on in-distribution data and using the quality of the reconstruction as a proxy for OOD detection, as the autoencoder is unlikely to accurately decode patterns not seen during training. Training such models for accurate and robust reconstruction requires large amounts of data.

Bayesian approaches. Bayesian Neural Networks (BNNs) [47] are elegant, intuitive models that are easy to reason about and that can capture epistemic uncertainty by exploiting the distributions of their weights. In spite of recent progress that makes them more tractable [3], they are still limited to small or medium-size networks, while most DNNs usually contain millions of parameters. Gal and Ghahramani [15] aimed for a method to imitate BNNs. To this end, they proposed Monte Carlo Dropout (MC Dropout), which estimates the posterior predictive distribution of the network by sampling different subsets of neurons at each forward pass at test time and aggregating their predictions. In computer vision, MC Dropout is the most popular instance of BNNs due to its speed and simplicity. It has been extended to other tasks, e.g., semantic segmentation [23] and pose estimation [24]. However, the benefits of Dropout are more limited for convolutional layers, where specific architectural design choices must be made [23, 46]. Recent OOD benchmarks for semantic segmentation [19, 37] show that MC Dropout still admits many false positives.
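As a rough illustration of the MC Dropout procedure described above, the following NumPy sketch keeps dropout active at test time and averages the softmax outputs of several stochastic forward passes. The two-layer network, its random (untrained) weights, the dropout rate and the number of passes are all hypothetical choices made only so the example runs on its own.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mc_dropout_predict(x, w1, w2, p_drop, n_passes, rng):
    """Average softmax outputs over stochastic forward passes.

    Dropout stays active at test time: each pass samples a new mask over
    the hidden units (inverted-dropout scaling keeps the expected
    activation unchanged).
    """
    probs = []
    for _ in range(n_passes):
        h = np.maximum(x @ w1, 0.0)               # ReLU hidden layer
        mask = rng.random(h.shape) >= p_drop      # keep with prob. 1 - p_drop
        h = h * mask / (1.0 - p_drop)
        probs.append(softmax(h @ w2))
    probs = np.stack(probs)                       # (n_passes, n_samples, n_classes)
    return probs.mean(axis=0), probs.std(axis=0)

rng = np.random.default_rng(0)
w1 = rng.normal(size=(4, 16))    # hypothetical, untrained weights
w2 = rng.normal(size=(16, 3))
x = rng.normal(size=(5, 4))      # 5 toy inputs with 4 features each

mean_probs, std_probs = mc_dropout_predict(x, w1, w2, p_drop=0.5, n_passes=20, rng=rng)
confidence = mean_probs.max(axis=1)   # maximum class probability of the average
```

The spread across passes (`std_probs`) is one simple way to read off the dispersion that the sampled sub-networks introduce.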

Ensembles. Ensemble methods are prominent techniques for measuring epistemic uncertainty. They have the potential to encapsulate true diversity in the weights of the composing models, in contrast to the dispersion introduced by MC Dropout [13], which ultimately focuses on a single mode. Lakshminarayanan et al. [30] propose training an ensemble of DNNs with different initialization seeds. Vyas et al. [59] train an ensemble of classifiers in a self-supervised way on different subsets of the training data, using the left-out data as OOD. Izmailov et al. [22] collect weight checkpoints from local minima and average them, or fit a distribution over them and sample networks [42]. Franchi et al. [14] track weight trajectories across training and compute their distributions, which are then used for sampling an ensemble of networks. Our approach also exploits ensembles; however, each network is specialized on a different classification task. We exploit the complementarity in this ensemble for better OOD predictions.

Traditional use of OVA/OVO ensembles. These aggregation techniques are popular for performing multi-class classification based on an ensemble of binary base classifiers. For OVO, instead of the baseline max-voting aggregation strategy, pairwise coupling [61] or ECOC [12] have been widely used, but the quadratically increasing number of base classifiers may significantly limit the applicability of OVO in the case of large label sets. In contrast, OVA fusion uses a linearly increasing number of base classifiers, and relies in most works on a Winner-Takes-All class assignment based on the maximum class response. To the best of our knowledge, these ensembling methods have not been used for estimating the epistemic uncertainty of DNNs.

Deep OOD detection. A recent line of approaches addresses OOD detection through DNN-specific heuristics. Hendrycks and Gimpel [20] established a standard baseline for OOD detection relying on the Maximum Class Probability from the softmax. The authors of [11] attach a confidence branch to a classification network and train it to predict OOD samples, while ODIN [36] learns a temperature scaling for softmax values and an adversarial perturbation to better distinguish OOD data. Lee et al. [35] fit class-conditional Gaussian distributions on the network features, which they tune on a dataset containing both OOD and in-distribution data. Lambert et al. [31] attenuate uncertainty by training on a large composite dataset, leading to a more robust DNN. Zendel et al. [62] propose a semantic segmentation dataset for checking the confidence score of DNNs. The authors of [2] train a DNN to predict an OOD confidence score. Lee et al. [34] train a GAN along with the classifier to produce near-distribution examples and enforce lower classifier confidence on GAN samples. Malinin and Gales [43] use Dirichlet networks to build a distribution over the prediction distributions for OOD detection. Most of these methods rely on an OOD dataset during training and are likely to specialize on specific anomalies from this data [21]. In contrast, our approach does not require OOD examples during training, as we leverage the multiple one-versus-all classifiers.

3 One Versus all for deep Neural Network Incertitude (OVNNI)

This section focuses first on the necessary details on the traditional AVA training of a DNN. Then we describe our approach based on additional OVA training.

3.1 Notations

  • The training/testing sets are denoted respectively by , , where and with or represent respectively the observed sample and the corresponding label, with and the size of the training and testing sets. are input vectors and are class labels. Unless otherwise specified, and , , will refer to training data.

  • is the random variable associated with observed samples and the one associated with classes.

  • The DNN is a function of the observed data with or and vector that contains the trainable weights. We call the output of the DNN associated with the weights on the data .

  • is the loss function used to measure the dissimilarity between the output of the DNN and the expected output . Different loss functions can be considered according to the type of task. Here we will focus on the cross entropy that will be introduced in the next section.

3.2 All Versus All training of Deep Neural Networks

For image classification, the goal of a DNN is to map the input data to a probabilistic prediction that we denote with a class label. During training, an optimization algorithm improves the weights in order to fit the output as closely as possible to the ground-truth vector of class labels. The loss is expected to measure the dissimilarity between and . Classically, we use the cross entropy, defined on a batch of size by:
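In one common notation, which is an assumption here rather than the authors' own symbols, write $B$ for the batch, $(x_i, y_i)$ for its samples and labels, $\omega$ for the trainable weights, and $P(Y = c \mid x_i, \omega)$ for the softmax output of the DNN for class $c$. The batch cross entropy then takes the standard form:

```latex
\mathcal{L}(\omega) = -\frac{1}{|B|} \sum_{(x_i, y_i) \in B} \log P(Y = y_i \mid x_i, \omega)
```

Only the predicted probability of the true class enters each term, so the loss is zero exactly when every sample in the batch receives probability 1 on its ground-truth label.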


The minimization of this loss function is usually based on gradient methods. Computing the optimal value of each parameter involves a bin-to-bin measure of similarity, which may lead to overfitting issues.

A solution might be to use One Versus All training.

3.3 From One Versus All (OVA) to OVNNI

The current state of the art on uncertainty estimation is Deep Ensembles [30]. This technique relies on ensembling multiple DNN models trained in parallel to optimize the same loss. In contrast to random forests [5] or bagging [4], the diversity arises from the fact that different instances of the same model converge towards different local optima during training. In our approach, by contrast, the diversity is provided by the one-versus-all (OVA) models constructed using different labelings of the training set.

The OVA strategy is conceptually simple, since at its core it involves training a binary classification DNN. One classifier is trained for each class, and prediction is then performed by running the obtained binary classifiers on the test sample and choosing the prediction with the highest confidence score. Yet, the multiple classifiers involved will learn multiple probabilistic predictions, denoted by with a binary random variable for each class . We add a superscript on , to indicate that 1) the weights are different from the ones trained to perform the AVA classification that we denote , and 2) they are also different from the weights of the classes other than .
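The different labelings used by the OVA classifiers can be sketched as follows. This is a minimal NumPy illustration; the helper name `make_ova_labels` and the toy label vector are assumptions introduced for the example.

```python
import numpy as np

def make_ova_labels(labels, n_classes):
    """Build one binary labeling per class.

    Returns an array of shape (n_classes, n_samples) whose row c is 1
    where the sample belongs to class c and 0 for "all the rest".
    """
    labels = np.asarray(labels)
    classes = np.arange(n_classes)[:, None]
    return (labels[None, :] == classes).astype(np.int64)

y = np.array([0, 2, 1, 2, 0, 1, 1])          # toy multi-class labels
ova_targets = make_ova_labels(y, n_classes=3)

# Row c is the training target of the binary DNN dedicated to class c.
for c, row in enumerate(ova_targets):
    print(c, row.tolist())
```

Each binary network sees the same inputs but a different target row, which is where the diversity of the OVA ensemble comes from.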

By training one class versus all the other classes, the DNN learns, in some sense, the out-of-distribution classes, with the significant advantage of not relying on explicitly provided OOD data, in contrast to other strategies [51, 44]. In addition to the OVA base classifiers, we also perform an All vs All training, whose output we aggregate with the probabilities of the OVA models as shown in Figure 2.

Let us denote by the discrete random variable that takes its value in the list of all classes, and let us denote by a binary random variable that takes values or , with meaning that the data belongs to class . Hence the OVA DNN of the class provides , while the AVA DNN provides for all in . We consider that the final confidence score for a data to belong to class is:
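Written out with assumed symbols ($Y$ for the multi-class label, $Y_c$ for the binary membership variable of class $c$, $\omega^{\mathrm{AVA}}$ for the AVA weights and $\omega^{c}$ for the weights of the OVA network of class $c$), the product of the AVA and OVA predictions reads:

```latex
\mathrm{score}_c(x) = P(Y = c \mid x, \omega^{\mathrm{AVA}}) \times P(Y_c = 1 \mid x, \omega^{c})
```

Because the OVA probabilities of different classes come from separate networks, these scores are not constrained to sum to one over $c$.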


This score is high if both AVA and OVA are confident, and low otherwise. Multiplying OVA and AVA scores also helps to increase the accuracy since AVA has lower accuracy than OVA.
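A minimal NumPy sketch of this aggregation follows. The function name `ovnni_scores` and the probabilities are hypothetical, and taking the maximum score over classes as the per-sample confidence is an assumption made for the illustration.

```python
import numpy as np

def ovnni_scores(p_ava, p_ova):
    """Element-wise product of AVA and OVA class probabilities.

    p_ava: (n_samples, K) softmax outputs of the single AVA network.
    p_ova: (n_samples, K) where column c is the "belongs to class c"
           probability from the OVA network dedicated to class c.
    """
    return np.asarray(p_ava) * np.asarray(p_ova)

# Hypothetical probabilities for two samples and K = 3 classes.
p_ava = np.array([[0.90, 0.05, 0.05],    # AVA is confident on sample 0
                  [0.80, 0.10, 0.10]])   # and also on sample 1
p_ova = np.array([[0.95, 0.02, 0.01],    # OVA agrees on sample 0
                  [0.10, 0.05, 0.08]])   # OVA rejects every class on sample 1

scores = ovnni_scores(p_ava, p_ova)
predicted_class = scores.argmax(axis=1)
confidence = scores.max(axis=1)          # assumed per-sample confidence

print(predicted_class.tolist())          # [0, 0]
print(confidence)
```

Both samples get the same predicted class, but the second one ends up with a confidence of about 0.08 instead of the 0.80 that the AVA network alone would report, since no OVA network claims it.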

3.4 Uncertainty with OVNNI

We consider that a measure of confidence must satisfy the following properties: (1) be bounded, (2) exhibit low values for OOD data, (3) have a confidence value that is aligned with the accuracy of the algorithm, (4) become more confident if additional training samples are provided. The first point ensures that the maximum and minimum confidence are known. The second point ensures that OOD data are detected, which is crucial since it provides information on the reliability of the DNN on a given data point. The third point is linked to calibration [17], which is crucial in order to rely on the model predictions. The last point concerns the fact that we want to reduce the uncertainty when the dataset grows.

We use as a measure of confidence for OVNNI the probability . This measure is bounded by 0 and 1. In the experimental section, we show that it has a state of the art calibration and OOD results.

Figure 2: From AVA and OVA to OVNNI process in the case we deal with a database composed of just three classes.

3.5 Visualizing OVA and AVA embedding

In this subsection, we perform two experiments to determine the behavior of the representations learned by the DNN with the different techniques. For both experiments, we train a simple DNN composed of 3 hidden layers followed by a batch normalization on the MNIST dataset.


In the first experiment, we considered as training data only the images of the digits '0', '1' and '2' (the first 3 classes). We then perform inference on the official test set, composed of images of these classes and of OOD images from the other classes. We represent in Figure 3 the softmax of a classical AVA training, of a deep ensemble training and of the OVNNI training. One can see that, in contrast to other techniques, OVNNI results do not necessarily belong to the 2-dimensional simplex. In addition, OVNNI brings the OOD data far away from the simplex vertices, which highlights its potential to detect OOD data.

In the second experiment, we performed a classical AVA training and also the OVA training. For the OVA training, we therefore have 10 DNNs (since the dataset has 10 classes, the 10 digits). The OOD class is composed of images from the NotMNIST dataset [49]. We apply the DNNs to this test dataset. In the AVA case, we collect for each data point the features of the DNN just before the classification layer. In the OVA case, we collect the same features but for the DNN of the predicted class. We reduce the dimension of each of these feature spaces using t-SNE [41] and Principal Component Analysis (PCA) [60], and we plot the results in Figure 4. In the AVA case the OOD data lie in the center of Figure 4, mixed with the other classes, whereas in the OVA case they are closer to the border, whatever the dimensionality reduction algorithm we use. This is crucial because it shows that OVA learns a more interesting descriptor than AVA.

Figure 3: Results on the MNIST 3-class experiment. We represent in these figures the softmax prediction outputs obtained by the baselines (a) MCP, (b) Deep Ensemble, and (c) by OVNNI, respectively.

Figure 4: Results of the MNIST / NotMNIST experiment. We represent the projection on a 2D space of the feature space of the baseline MCP in figures (a) and (c), and of OVNNI in figures (b) and (d). We use PCA [60] and t-SNE [41] as dimensionality reduction algorithms.

4 Experiments

We continue by illustrating the performance of OVNNI for detecting OOD data by conducting five experiments. In the rest of this section we will describe the experimental protocol, followed by the five experiments.

4.1 Experimental protocol

The detection of OOD data can be done either by techniques that measure the uncertainty, or by techniques that detect OOD data. We first compared OVNNI to three other uncertainty estimation techniques: MC Dropout [15], Deep Ensembles [30], and TRADI [14]. The major interest of these techniques comes from the fact that, since they estimate uncertainty, they also estimate the epistemic uncertainty and therefore the OOD data. We also compared our approach to two other techniques, ODIN [36] and ConfidNET [9], which serve as references among unsupervised techniques for detecting OOD data. As a baseline algorithm, we use the maximum class probability (MCP) with an AVA-trained DNN. We denote this approach as MCP. As an additional baseline we consider the one-class Support Vector Machine [45, 50], a classic method for outlier detection. We train it on the AVA logits.

Note that we have not compared OVNNI to techniques trained to learn OOD data such as [51, 44], since in these cases the OOD data are in the training set, so that these techniques can only detect the kind of OOD data they were trained on. To balance OVA training, which typically has more samples available for the "All" class, we use a weighted cross-entropy to train for each class, with weights for a given class based on , where is the proportion of data samples of this class in the training set. In addition, for a fair comparison, in all experiments we use the same number of models for ensemble and Bayesian methods. We conducted several experiments in two target applications: image classification (2 experiments) and semantic pixel segmentation (3 experiments). We considered 7 metrics, in addition to accuracy. Details and results are given below.
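The effect of such a class-weighted loss can be illustrated with a short NumPy sketch for a single OVA classifier. The sketch assumes inverse class-proportion weights, which is one common choice and not necessarily the expression used in this work; the function name and the toy data are likewise hypothetical.

```python
import numpy as np

def weighted_binary_cross_entropy(p_pos, targets, w_pos, w_neg, eps=1e-12):
    """Binary cross-entropy with separate weights for the "one" and "all" sides.

    p_pos:   predicted probability that each sample belongs to the class.
    targets: 1 for the class, 0 for all other classes.
    """
    p_pos = np.clip(np.asarray(p_pos, dtype=float), eps, 1.0 - eps)
    targets = np.asarray(targets, dtype=float)
    per_sample = -(w_pos * targets * np.log(p_pos)
                   + w_neg * (1.0 - targets) * np.log(1.0 - p_pos))
    return per_sample.mean()

targets = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])   # 1 positive out of 10
p_pos = np.full(10, 0.5)                              # uninformative prediction

prop_pos = targets.mean()                  # proportion of the "one" class
# ASSUMPTION: inverse-proportion weights, chosen only for illustration.
w_pos, w_neg = 1.0 / prop_pos, 1.0 / (1.0 - prop_pos)

loss = weighted_binary_cross_entropy(p_pos, targets, w_pos, w_neg)
pos_share = w_pos * targets.sum()          # total weight carried by positives
neg_share = w_neg * (1 - targets).sum()    # total weight carried by negatives
```

With these weights the single positive sample carries the same total weight as the nine negatives, so the binary classifier is not rewarded for always answering "All".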

Metrics. The metrics should focus on several points. The first one is the error/success in predicting whether the DNN model has some knowledge about specific data. This involves detecting whether the data is OOD or not. For that, we use three solutions proposed in [20]. We first used only the confidence scores on the OOD data and on the in-distribution test data. Based on these confidence scores, and as in [20, 19], we evaluated the AUC, the AUPR and the FPR-95%-TPR, which are indicators of the accuracy of detecting OOD data.
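As a minimal sketch of two of these detection metrics, the NumPy code below computes the AUC and the FPR-95%-TPR from confidence scores. It assumes that in-distribution samples are the positive class and that a sample is accepted when its confidence is at or above the threshold; the function names and toy scores are illustrative only.

```python
import numpy as np

def auroc(conf_in, conf_out):
    """Probability that an in-distribution sample gets a higher confidence
    than an OOD sample (ties count one half). O(n*m), fine for a sketch."""
    conf_in = np.asarray(conf_in, dtype=float)[:, None]
    conf_out = np.asarray(conf_out, dtype=float)[None, :]
    return float((conf_in > conf_out).mean() + 0.5 * (conf_in == conf_out).mean())

def fpr_at_95_tpr(conf_in, conf_out):
    """Fraction of OOD samples accepted when the threshold is set so that
    at least 95% of in-distribution samples are accepted."""
    conf_in = np.sort(np.asarray(conf_in, dtype=float))
    conf_out = np.asarray(conf_out, dtype=float)
    # Largest threshold that still keeps >= 95% of in-distribution samples.
    k = int(np.floor(0.05 * len(conf_in)))
    threshold = conf_in[k]
    return float((conf_out >= threshold).mean())

# Hypothetical confidence scores.
conf_in = np.array([0.9, 0.8, 0.95, 0.7, 0.85, 0.99, 0.6, 0.92, 0.88, 0.97])
conf_out = np.array([0.3, 0.65, 0.5, 0.2, 0.75])

auc_value = auroc(conf_in, conf_out)
fpr_value = fpr_at_95_tpr(conf_in, conf_out)
print(auc_value, fpr_value)
```

A higher AUC and a lower FPR-95%-TPR both indicate that the confidence score separates in-distribution from OOD data more cleanly.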

However, these measures give no information about the number of good predictions (that should be high) and of bad predictions (that should be low). This information is crucial since, although it is important to have a low score with the OOD data, the DNN should also reach a high confidence score for well-classified data, and low confidence scores elsewhere. In case the DNN does not reach this point then it might be unusable.

For that, the authors of [9] propose to use metrics similar to the ones used by Hendrycks et al. [20], but rather than classifying into the classes "OOD" or "in distribution", they classify as "correctly classified" or "not correctly classified" (the latter class contains both bad predictions and predictions on OOD data, see [9] for more details).

We also used the Expected Calibration Error (ECE) [17], which uses the -bin histograms of confidence scores and accuracy. The ECE performs a bin-to-bin difference between the two histograms, then an average over the bins. Similarly to [17] we set . This metric, by measuring the difference between the expected accuracy and confidence, is an indicator of the quality of the confidence, and should be close to 0.
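The bin-to-bin computation can be sketched in NumPy as follows. The equal-width binning, the weighting of each bin by its share of the samples, the number of bins and the toy data are assumptions of this illustration, following the usual definition of the ECE.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins):
    """ECE: bin-size-weighted average of |accuracy - mean confidence|.

    confidences: predicted confidence in [0, 1] for each sample.
    correct:     1 if the prediction was right, else 0.
    n_bins:      number of equal-width confidence bins.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Map each confidence to a bin index in [0, n_bins - 1]; bins are (lo, hi].
    bin_ids = np.clip(np.digitize(confidences, edges[1:-1], right=True), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap   # weight by the fraction of samples in the bin
    return float(ece)

# Hypothetical predictions: confidence and whether each one was correct.
confidences = np.array([0.95, 0.9, 0.85, 0.6, 0.55, 0.3])
correct = np.array([1, 1, 0, 1, 0, 0])

ece = expected_calibration_error(confidences, correct, n_bins=5)
print(round(ece, 4))
```

In this toy case the top bin has mean confidence 0.9 but accuracy 2/3, which is the kind of overconfidence the metric penalizes.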

OOD classification with MNIST [33]. For classification, we used in a first experiment MNIST [33], a dataset composed of digit images, as the training dataset, and NotMNIST [49], which contains letter images, as the OOD dataset. We first trained a classifier to recognize the images of digits, then tested it on the test sets of MNIST and NotMNIST, hoping that the classifier would distinguish digits from letters. The DNN used for this experiment is fully connected and composed of 3 layers as in [30, 14]. Results are shown in Tables 1 and 2 (MNIST rows).

Dataset (architecture) / OOD technique   Accuracy/mIoU   AUC    AUPR Error   AUPR Success   ECE (%)

MNIST/NotMNIST (3 hidden layers)
  Baseline (MCP)                         98.8            92.7   96.1         81.4           0.305
  MCP + One class SVM                    98.8            97.4   98.4         95.9           0.072
  MC Dropout                             98.2            88.1   89.8         81.7           0.494
  Deep Ensemble                          98.9            97.7   98.4         95.8           0.462
  TRADI                                  98.6            97.1   98.4         94.6           0.407
  ODIN                                   98.8            94.2   96.8         85.6           0.500
  ConfidNET                              98.2            97.4   98.8         94.1           0.461
  Ensemble OVA (ours)                    97.2            99.0   99.5         97.3           0.179
  OVNNI (ours)                           98.8            99.1   99.6         97.9           0.066

CIFAR10 (ResNet50)
  Baseline (MCP)                         93.1            83.9   92.9         67.5           0.606
  MCP + One class SVM                    93.1            79.7   90.9         63.5           0.203
  MC Dropout                             93.1            83.9   92.9         67.5           0.606
  Deep Ensemble                          95.0            95.8   97.7         92.1           0.422
  ODIN                                   93.1            83.9   93.3         67.2           0.606
  ConfidNET                              93.1            85.1   94.6         61.2           0.706
  Ensemble OVA (ours)                    89.3            91.8   95.8         87.1           0.468
  OVNNI (ours)                           93.3            94.3   97.3         91.1           0.187

Camvid (ENet)
  Baseline (MCP)                         85.8/52.9       79.7   52.1         92.6           0.146
  MC Dropout                             80.3/48.6       80.2   56.1         89.3           0.168
  Deep Ensemble                          88.0/58.2       83.2   54.3         94.0           0.112
  TRADI                                  83.4/51.4       83.2   55.9         93.8           0.110
  ConfidNET                              83.4/52.8       81.3   58.3         92.6           0.121
  Ensemble OVA (ours)                    87.9/52.8       91.7   69.6         98.4           0.060
  OVNNI (ours)                           93.1/66.1       94.0   75.7         99.0           0.025

StreetHazards (PSPNet, ResNet50)
  Baseline (MCP)                         90.0/54.6       91.6   50.8         98.9           0.055
  MC Dropout                             88.0/47.9       88.8   51.8         97.8           0.092
  Deep Ensemble                          90.2/55.0       92.2   52.0         99.0           0.051
  TRADI                                  90.2/54.6       92.1   51.4         99.1           0.049
  ConfidNET                              90.0/54.6       88.9   37.0         97.9           0.10
  Ensemble OVA (ours)                    89.7/54.0       92.4   52.3         99.1           0.048
  OVNNI (ours)                           90.0/54.6       93.0   53.4         99.2           0.048

BDD Anomaly (PSPNet, ResNet50)
  Baseline (MCP)                         89.9/52.8       81.4   62.5         91.5           0.159
  MC Dropout                             88.7/49.5       76.0   55.7         88.2           0.181
  Deep Ensemble                          91.0/57.6       85.5   67.3         93.9           0.170
  TRADI                                  89.9/52.1       81.9   63.2         91.8           0.157
  ConfidNET                              89.9/52.8       78.3   56.4         91.2           0.232
  Ensemble OVA (ours)                    89.9/52.8       91.2   86.2         95.7           0.072
  OVNNI (ours)                           90.7/55.4       91.9   86.6         95.9           0.081

Table 1: Comparative results obtained on the Calibration task.

Dataset (architecture) / OOD technique   AUC    AUPR   FPR-95%-TPR

MNIST/NotMNIST (3 hidden layers)
  Baseline (MCP)                         94.0   96.0   24.6
  MCP + One class SVM                    96.9   98.0   12.5
  MC Dropout                             91.8   94.9   35.6
  Deep Ensemble                          97.2   98.0   9.2
  TRADI                                  96.7   97.6   11.0
  ODIN                                   94.9   96.7   17.5
  ConfidNET                              97.9   99.0   12.7
  Ensemble OVA (ours)                    98.9   99.4   5.9
  OVNNI (ours)                           99.3   99.6   3.5

CIFAR10 (ResNet50)
  Baseline (MCP)                         80.4   89.7   61.5
  MCP + One class SVM                    78.8   89.6   61.5
  MC Dropout                             80.4   89.7   62.6
  Deep Ensemble                          93.0   96.2   19.3
  ODIN                                   80.3   89.9   61.3
  ConfidNET                              84.8   94.0   68.3
  Ensemble OVA (ours)                    88.5   93.0   30.9
  OVNNI (ours)                           92.2   95.8   23.3

Camvid (ENet)
  Baseline (MCP)                         75.4   10.0   65.1
  MC Dropout                             75.4   10.7   63.2
  Deep Ensemble                          79.7   13.0   55.3
  TRADI                                  79.3   12.8   57.7
  ConfidNET                              81.9   13.8   55.8
  Ensemble OVA (ours)                    97.1   71.1   13.5
  OVNNI (ours)                           96.1   61.2   16.5

StreetHazards (PSPNet, ResNet50)
  Baseline (MCP)                         88.7   6.9    26.9
  MC Dropout                             69.9   6.0    32.0
  Deep Ensemble                          90.0   7.2    25.4
  TRADI                                  89.2   7.2    25.3
  ConfidNET                              83.6   2.3    26.2
  Ensemble OVA (ours)                    91.6   12.7   21.9
  OVNNI (ours)                           91.2   12.6   22.2

BDD Anomaly (PSPNet, ResNet50)
  Baseline (MCP)                         86.0   5.4    27.7
  MC Dropout                             85.2   5.0    29.3
  Deep Ensemble                          87.0   6.0    25.0
  TRADI                                  86.1   5.6    26.9
  ConfidNET                              85.4   5.1    29.1
  Ensemble OVA (ours)                    87.0   4.9    29.0
  OVNNI (ours)                           87.2   6.7    25.0

Table 2: Comparative results obtained on the OOD task.

OOD classification with CIFAR10 [27]. We also trained a network on CIFAR10, which is composed of the classes airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships and trucks. We considered the SVHN dataset [48] as OOD data. Many papers [42] train on CIFAR10 and test on the CIFAR10 test set with added noise or on STL-10 [8]. It turns out that the first test aims more at measuring random uncertainty, and the second one the capability to adapt to the domain. We instead preferred to use SVHN, a color image dataset of digits, as the OOD dataset, which guarantees that the OOD data really comes from a distribution different from that of CIFAR10. The DNN we used in this experiment is ResNet50 [18], which has the advantage of being popular in the community. Results are shown in Tables 1 and 2 (CIFAR10 rows).

OOD Segmentation with Camvid [7]. We used Camvid, a dataset conventionally used in works dealing with segmentation or with uncertainty in deep learning [23, 9, 14]. This dataset is an "easy" one, but it allows results to be validated quickly. To test the ability of OVNNI to detect OOD pixels, we trained on all Camvid classes except 3 classes (pedestrian, bicycle, and car), which we removed by marking the corresponding pixels as unlabeled. These three classes correspond to the OOD classes. This experimental protocol, proposed by [14], thus makes it possible to validate that the trained DNN detects the pixels on which it has not been trained as OOD. The DNN for this experiment is ENet [52]. Results are shown in Tables 1 and 2 (Camvid rows).
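The relabeling step of this protocol can be sketched with NumPy as below. The ignore index 255, the class ids 8, 9 and 10 and the helper name `hold_out_classes` are hypothetical; the actual ids depend on how the dataset is loaded.

```python
import numpy as np

IGNORE_INDEX = 255  # hypothetical "unlabeled" id, excluded from the training loss

def hold_out_classes(label_map, ood_class_ids, ignore_index=IGNORE_INDEX):
    """Return a training label map where held-out classes are unlabeled,
    plus the boolean mask of the pixels that were held out."""
    label_map = np.asarray(label_map)
    ood_mask = np.isin(label_map, ood_class_ids)
    train_labels = np.where(ood_mask, ignore_index, label_map)
    return train_labels, ood_mask

# Toy 3x4 label map; ids 8, 9, 10 stand in for pedestrian, bicycle and car
# (hypothetical ids chosen for the example).
labels = np.array([[0, 0, 8, 1],
                   [2, 9, 9, 1],
                   [2, 2, 10, 3]])
train_labels, ood_mask = hold_out_classes(labels, ood_class_ids=[8, 9, 10])
print(train_labels)
```

At test time, `ood_mask` plays the role of the ground truth for OOD detection: a good confidence score should be low exactly on those pixels.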

OOD Segmentation with StreetHazards [19]. StreetHazards is a large-scale dataset that contains different sets of synthetic images of street scenes. More precisely, this dataset is composed of 5125 images for training and 1500 test images. The training dataset contains 13 classes and the test dataset is composed of the 13 training classes and 250 OOD classes, making it possible to test the robustness of the algorithms in a wide range of scenarios. For this experiment we used PSPNet [63] with the experimental protocol in [19]. The architecture used for the PSPNet is ResNet50. Results are shown in Tables 1 and 2 (StreetHazards rows).

OOD Segmentation with BDD Anomaly [19]. The BDD Anomaly dataset is a subset of the BDD dataset, composed of 6688 street scenes for the training set and 361 for the testing set. The training set contains 17 classes, and the test dataset is composed of the 17 training classes and 2 OOD classes. For this experiment we used PSPNet [63] with the experimental protocol in [19]. The architecture used for the PSPNet is ResNet50. Results are shown in Tables 1 and 2 (BDD Anomaly rows).

4.2 Discussions

On MNIST, Tables 1 and 2 show that OVNNI has competitive results for detecting OOD data; more specifically, its calibration score (ECE) is the best. With respect to the metrics proposed by Hendrycks et al., OVNNI is the most effective at detecting OOD images, improving the best AUC by 1.4%, the best AUPR by 0.6% and the best FPR-95%-TPR by 63.2%. Concerning the metrics proposed by Corbière et al., OVNNI improves the AUC by 1.41%, the AUPR Error by 0.80%, the AUPR Success by 2.14% and the ECE by 70.6%.

On CIFAR10, Deep Ensembles achieve good results on all measurements except the ECE, where OVNNI is better calibrated. This can also be seen in the histogram in Figure 1. The difference between OVNNI and Deep Ensembles on the other metrics is small, and a crucial requirement for a DNN is good calibration; hence, having a good calibration is more important than having a good AUC or AUPR. We have also represented the accuracy vs confidence curves in Figure 5. These curves are defined in [30] and are constructed by evaluating the accuracy over all data for which the confidence of the DNN reaches a given threshold. These curves show the performance of the OVNNI confidence index on CIFAR10. Finally, we have illustrated the OVNNI calibration on CIFAR10 with the calibration curve in Figure 5. The calibration plot is defined in [17] and is constructed by grouping the data into bins based on their confidence score. On each bin, we then evaluate the accuracy, which should ideally be comparable to the confidence score. These curves show once again the good performance of OVNNI in terms of calibration.
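The construction of an accuracy vs confidence curve can be sketched as follows with NumPy. The function name, the thresholds and the toy data are assumptions of this example; in an OOD setting, predictions on OOD samples would simply be marked as incorrect.

```python
import numpy as np

def accuracy_vs_confidence(confidences, correct, thresholds):
    """For each threshold t, accuracy over the samples with confidence >= t.

    Returns NaN where no sample reaches the threshold.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    curve = []
    for t in thresholds:
        kept = confidences >= t
        curve.append(correct[kept].mean() if kept.any() else np.nan)
    return np.array(curve)

# Hypothetical predictions: confidence and whether each one was correct.
confidences = np.array([0.95, 0.9, 0.85, 0.6, 0.55, 0.3])
correct = np.array([1, 1, 0, 1, 0, 0])
thresholds = np.array([0.0, 0.5, 0.8, 0.9])

curve = accuracy_vs_confidence(confidences, correct, thresholds)
print(curve)
```

For a useful confidence score the curve should rise with the threshold, as it does here (0.5, 0.6, 2/3, 1.0): restricting to confident predictions should leave mostly correct ones.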

On Camvid, we note that OVNNI improves the results of the state of the art by up to 77% with regard to the metrics proposed by Corbière et al. [9], and by up to 77% for calibration as well. Concerning the metrics proposed by Hendrycks et al., OVNNI improves the measurements by a maximum of 22%.

On StreetHazards, we show in Table 2 that OVNNI has better results than the state of the art, improving the best results by up to 42.8%. In Table 1, OVNNI improves the result by at least 2.6% and improves the state-of-the-art ECE by 2%. These results show the interest of using OVNNI for semantic segmentation.

Finally, on BDD Anomaly, OVNNI improves the calibration by at least 48%, which is highly relevant given the importance of this metric. Concerning the other metrics, OVNNI improves the results by at least 22%. In addition, in Figure 5 we have illustrated the accuracy vs confidence curves of several algorithms. These curves underline again that OVNNI reaches the best performance in terms of calibration.

Overall, these results show that OVNNI improves the calibration of networks by bringing their confidence more in line with their expected accuracy. Making DNN models more reliable is crucial, especially in areas where the model should not be overconfident. In [17] the authors show that the good accuracy of DNNs comes at a price, namely their reliability. In this work, we propose a solution that increases accuracy in most cases, while at the same time improving the calibration and the OOD detection performance. The conceptual simplicity of this solution is a significant asset for its adoption, and the results also convey the message that one-vs-all training can still be of interest for a finer understanding of epistemic uncertainty in DNNs.

Figure 5: (a) and (c) Accuracy vs confidence plots on the CIFAR10/SVHN and BDD Anomaly experiments, respectively. (b) Calibration plot on the CIFAR10/SVHN experiment.
Figure 6: Results of OVNNI on BDD Anomaly. The first column is the input image, the second is the ground truth, the third is the prediction and the fifth is the confidence score of OVNNI. For comparison, we add the MCP confidence score in the fourth column. We can see that OVNNI has a low score on the motorcycle in the first three rows and on the train in the last row, which correspond to the OOD classes.
Figure 7: Results of OVNNI on StreetHazards. The first column is the input image, the second is the ground truth, the third is the prediction and the last is the confidence score of OVNNI. For comparison, we add the MCP confidence score in the fourth column. We can see that OVNNI has a low score on the chair, the seat, the rocket and the spider, which correspond to the OOD classes.

5 Conclusions

In this work, we presented an approach that combines one-versus-all training with a modern deep learning classifier. We show that the combination of these approaches reaches state-of-the-art performance on all segmentation experiments. Regarding classification tasks, OVNNI exhibits the best calibration performance. Competing approaches are poorly calibrated on most datasets, hence the scores that they provide are overconfident, potentially leading to dangerous scenarios in critical applications. In addition to the reported performance, our approach needs little hyperparameter tuning and is easy to implement.
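
To make the combination concrete, the following sketch fuses the scores of one AVA network with those of C one-versus-all classifiers. This section does not restate the fusion rule, so the sketch assumes an element-wise product between the AVA softmax probabilities and the OVA sigmoid scores; this assumption, as well as the function names and array shapes, is chosen for illustration and should not be read as the exact implementation.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def sigmoid(logits):
    return 1.0 / (1.0 + np.exp(-logits))


def fuse_ova_ava(ava_logits, ova_logits):
    """Fuse AVA and OVA outputs (assumed rule: element-wise product).

    ava_logits: (N, C) logits of the single all-vs-all network.
    ova_logits: (N, C) logits, column c coming from the binary
                "class c vs all the others" classifier.
    Returns the predicted class and the fused confidence per sample.
    """
    ava_logits = np.asarray(ava_logits, dtype=float)
    ova_logits = np.asarray(ova_logits, dtype=float)
    fused = softmax(ava_logits) * sigmoid(ova_logits)
    return fused.argmax(axis=1), fused.max(axis=1)


if __name__ == "__main__":
    # Both samples get the same confident AVA output. For the second one,
    # every OVA classifier answers "not my class", as expected for OOD data.
    ava = [[4.0, 0.0, 0.0], [4.0, 0.0, 0.0]]
    ova = [[5.0, -5.0, -5.0], [-5.0, -5.0, -5.0]]
    print(fuse_ova_ava(ava, ova))
```

Under this assumed rule the fused score stays high only when the AVA network and the matching OVA classifier agree, so a sample rejected by every OVA classifier receives a low confidence even if the AVA softmax is peaked.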

Future work involves extending this strategy to new tasks such as medical image analysis. This framework could also be used for active learning, since active learning algorithms require techniques that can detect OOD data.


  • [1] C. Baur, B. Wiestler, S. Albarqouni, and N. Navab (2018) Deep autoencoding models for unsupervised anomaly segmentation in brain mr images. In International MICCAI Brainlesion Workshop, pp. 161–169. Cited by: §2.
  • [2] P. Bevandić, I. Krešo, M. Oršić, and S. Šegvić (2019) Simultaneous semantic segmentation and outlier detection in presence of domain shift. In German Conference on Pattern Recognition, pp. 33–47. Cited by: §2.
  • [3] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015) Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424. Cited by: §1, §2.
  • [4] L. Breiman (1996) Bagging predictors. Machine learning 24 (2), pp. 123–140. Cited by: §3.3.
  • [5] L. Breiman (2001) Random forests. Machine learning 45 (1), pp. 5–32. Cited by: §3.3.
  • [6] M. M. Breunig, H. Kriegel, R. T. Ng, and J. Sander (2000) LOF: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pp. 93–104. Cited by: §2.
  • [7] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla (2008) Segmentation and recognition using structure from motion point clouds. In European conference on computer vision, pp. 44–57. Cited by: §4.1.
  • [8] A. Coates, A. Ng, and H. Lee (2011) An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 215–223. Cited by: §4.1.
  • [9] C. Corbière, N. Thome, A. Bar-Hen, M. Cord, and P. Pérez (2019) Addressing failure prediction by learning model confidence. In Advances in Neural Information Processing Systems, pp. 2898–2909. Cited by: §4.1, §4.1, §4.1, §4.2.
  • [10] C. Creusot and A. Munawar (2015) Real-time small obstacle detection on highways using compressive rbm road reconstruction. In 2015 IEEE Intelligent Vehicles Symposium (IV), pp. 162–167. Cited by: §2.
  • [11] T. DeVries and G. W. Taylor (2018) Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865. Cited by: §2.
  • [12] T. G. Dietterich and G. Bakiri (1994) Solving multiclass learning problems via error-correcting output codes. Journal of artificial intelligence research 2, pp. 263–286. Cited by: §2.
  • [13] S. Fort, H. Hu, and B. Lakshminarayanan (2019) Deep ensembles: a loss landscape perspective. arXiv preprint arXiv:1912.02757. Cited by: §2.
  • [14] G. Franchi, A. Bursuc, E. Aldea, S. Dubuisson, and I. Bloch (2019) TRADI: tracking deep neural network weight distributions. arXiv preprint arXiv:1912.11316. Cited by: §1, §2, §4.1, §4.1, §4.1.
  • [15] Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059. Cited by: §1, §2, §4.1.
  • [16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1.
  • [17] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1321–1330. Cited by: §1, §3.4, §4.1, §4.2, §4.2.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: Figure 1, §4.1.
  • [19] D. Hendrycks, S. Basart, M. Mazeika, M. Mostajabi, J. Steinhardt, and D. Song (2019) A benchmark for anomaly segmentation. arXiv preprint arXiv:1911.11132. Cited by: §2, §4.1, §4.1, §4.1.
  • [20] D. Hendrycks and K. Gimpel (2016) A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136. Cited by: Figure 1, §2, §4.1, §4.1.
  • [21] D. Hendrycks, M. Mazeika, and T. Dietterich (2018) Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606. Cited by: §2.
  • [22] P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson (2018) Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407. Cited by: §2.
  • [23] A. Kendall, V. Badrinarayanan, and R. Cipolla (2015) Bayesian segnet: model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680. Cited by: §2, §4.1.
  • [24] A. Kendall and R. Cipolla (2016) Modelling uncertainty in deep learning for camera relocalization. In Proceedings of the International Conference on Robotics and Automation (ICRA), Cited by: §2.
  • [25] A. Kendall and Y. Gal (2017) What uncertainties do we need in bayesian deep learning for computer vision?. In Advances in neural information processing systems, pp. 5574–5584. Cited by: §1.
  • [26] J. Kittler, M. Hatef, R. P. Duin, and J. Matas (1998) On combining classifiers. IEEE transactions on pattern analysis and machine intelligence 20 (3), pp. 226–239. Cited by: §1.
  • [27] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: Figure 1, §4.1.
  • [28] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
  • [29] L. I. Kuncheva (2002) A theoretical study on six classifier fusion strategies. IEEE Transactions on pattern analysis and machine intelligence 24 (2), pp. 281–286. Cited by: §1.
  • [30] B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413. Cited by: Figure 1, §1, §2, §3.3, §4.1, §4.1, §4.2.
  • [31] J. Lambert, Z. Liu, O. Sener, J. Hays, and V. Koltun MSeg: a composite dataset for multi-domain semantic segmentation. Cited by: §2.
  • [32] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. nature 521 (7553), pp. 436–444. Cited by: §1.
  • [33] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §3.5, §4.1.
  • [34] K. Lee, H. Lee, K. Lee, and J. Shin (2017) Training confidence-calibrated classifiers for detecting out-of-distribution samples. arXiv preprint arXiv:1711.09325. Cited by: §2.
  • [35] K. Lee, K. Lee, H. Lee, and J. Shin (2018) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pp. 7167–7177. Cited by: §2.
  • [36] S. Liang, Y. Li, and R. Srikant (2017) Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690. Cited by: §2, §4.1.
  • [37] K. Lis, K. Nakka, P. Fua, and M. Salzmann (2019) Detecting the unexpected via image resynthesis. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2152–2161. Cited by: §2, §2.
  • [38] F. T. Liu, K. M. Ting, and Z. Zhou (2008) Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pp. 413–422. Cited by: §2.
  • [39] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §2.
  • [40] Y. Liu and Y. F. Zheng (2005) One-against-all multi-class svm classification using reliability measures. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., Vol. 2, pp. 849–854. Cited by: §1.
  • [41] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: Figure 4, §3.5.
  • [42] W. Maddox, T. Garipov, P. Izmailov, D. Vetrov, and A. G. Wilson (2019) A simple baseline for bayesian uncertainty in deep learning. arXiv preprint arXiv:1902.02476. Cited by: §1, §2, §4.1.
  • [43] A. Malinin and M. Gales (2018) Predictive uncertainty estimation via prior networks. In Advances in Neural Information Processing Systems, pp. 7047–7058. Cited by: §2.
  • [44] M. Masana, I. Ruiz, J. Serrat, J. van de Weijer, and A. M. Lopez (2018) Metric learning for novelty and anomaly detection. arXiv preprint arXiv:1808.05492. Cited by: §3.3, §4.1.
  • [45] M. M. Moya and D. R. Hush (1996) Network constraints and multi-objective optimization for one-class classification. Neural Networks 9 (3), pp. 463–474. Cited by: §4.1.
  • [46] J. Mukhoti and Y. Gal (2018) Evaluating bayesian deep learning methods for semantic segmentation. arXiv preprint arXiv:1811.12709. Cited by: §2.
  • [47] R. M. Neal (1996) Bayesian learning for neural networks. Springer-Verlag, Berlin, Heidelberg. ISBN 0387947248. Cited by: §2.
  • [48] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. Cited by: Figure 1, §4.1.
  • [49] NotMnist dataset. Note: Cited by: §3.5, §4.1.
  • [50] P. Oliveri (2017) Class-modelling in food analytical chemistry: development, sampling, optimisation and validation issues–a tutorial. Analytica chimica acta 982, pp. 9–19. Cited by: §4.1.
  • [51] A. Papadopoulos, M. R. Rajati, N. Shaikh, and J. Wang (2019) Outlier exposure with confidence control for out-of-distribution detection. arXiv preprint arXiv:1906.03509. Cited by: §3.3, §4.1.
  • [52] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello (2016) Enet: a deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147. Cited by: §4.1.
  • [53] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1, §2.
  • [54] R. Rifkin and A. Klautau (2004) In defense of one-vs-all classification. Journal of machine learning research 5 (Jan), pp. 101–141. Cited by: §1.
  • [55] T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs (2017) Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International conference on information processing in medical imaging, pp. 146–157. Cited by: §2.
  • [56] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt (2000) Support vector method for novelty detection. In Advances in neural information processing systems, pp. 582–588. Cited by: §2.
  • [57] T. Tommasi, N. Patricia, B. Caputo, and T. Tuytelaars (2017) A deeper look at dataset bias. In Domain adaptation in computer vision applications, pp. 37–55. Cited by: §2.
  • [58] K. Tumer and J. Ghosh (1996) Error correlation and error reduction in ensemble classifiers. Connection science 8 (3-4), pp. 385–404. Cited by: §1.
  • [59] A. Vyas, N. Jammalamadaka, X. Zhu, D. Das, B. Kaul, and T. L. Willke (2018) Out-of-distribution detection using an ensemble of self supervised leave-out classifiers. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 550–564. Cited by: §2.
  • [60] S. Wold, K. Esbensen, and P. Geladi (1987) Principal component analysis. Chemometrics and intelligent laboratory systems 2 (1-3), pp. 37–52. Cited by: Figure 4, §3.5.
  • [61] T. Wu, C. Lin, and R. C. Weng (2004) Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research 5 (Aug), pp. 975–1005. Cited by: §2.
  • [62] O. Zendel, K. Honauer, M. Murschitz, D. Steininger, and G. Fernandez Dominguez (2018) Wilddash-creating hazard-aware benchmarks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 402–416. Cited by: §2.
  • [63] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890. Cited by: §4.1, §4.1.