1 Introduction
The most popular application of deep learning involves a single model trained to convergence by a stochastic optimization method on a supervised dataset. It is hard to deny that this approach has led to impressive wins in a variety of industries. Deep models are successful mainly because of their predictive power: their predictions are usually right, i.e. the models are accurate. The latter is a statement about averages, and, unfortunately, at the level of individual datapoints it is often difficult to know how confident the model is in its own prediction. Accordingly, deep systems are currently deployed in scenarios where making mistakes is cheap. However, before machine learning widens its adoption to fields with critical use cases, we need to develop systems that are able to say “I don’t know” when their prediction is likely to be wrong.
More concretely, deep models are now applied to diverse fields such as physics [ABCG15, RWR18, HRH18], biology [AIP15], healthcare [LAA17, NPAA18, LKN18], or autonomous driving [KG17, MKG18], to name a few. In these cases, quantifying and processing model uncertainty is of crucial importance [KA13, AOS16], as the main goal is to automate decision making while providing strong risk guarantees. Properly calibrated confidence functions should enable the identification of inputs for which predictions are likely to be erroneous, so that they can, for instance, be flagged for human intervention.
The softmax probabilities output by deep classifiers can be erroneously interpreted as prediction confidence. Unfortunately, high-confidence predictions can be woefully incorrect and fail to indicate when they are likely mistaken; see [GSS15] and the references therein. Figure 1 shows an example of this: a seal picture is wrongly classified as a worm, despite a high softmax value. As a consequence, uncertainty quantification for deep learning is an active area of research. The Bayesian framework offers a principled approach to probabilistic inference [HvC93, Nea96, BB98]; however, at the scale of modern deep neural networks, even approximate Bayesian and frequentist methods face serious computational issues [GG16, LPB17].
In this paper, we propose a family of simple classification algorithms that provide uncertainty quantification at a modest additional computational cost. The basic idea is as follows: after training a deep classifier end-to-end on input-output pairs to obtain an accurate task-dependent representation of the data, we then fit an ensemble of models on the dataset of last-layer features. The simplicity of this new dataset allows us to compute explicit uncertainty estimates. In particular, we explore four concrete instances of uncertainty algorithms, based on Stochastic Gradient Descent [MHB17], Stochastic Gradient Langevin Dynamics [WT11], the Bootstrap (see Section 8.4 of [FHT01]), and Monte Carlo Dropout [GG16]. The core idea has some connections with transfer learning [YCBL14, RASC14, DJV14]. By sequentially tackling the two tasks (representation learning and uncertainty quantification), these algorithms, applied to the last layer of the neural network, reduce the computational cost of approximate inference compared to their full-network versions. Our experiments suggest that there is limited value in adding multiple uncertainty layers to high-level representations in deep classifiers.¹

¹ The code, implemented in Keras [C15] and TensorFlow [AAB15], is available at https://github.com/nbrosse/uncertainties.

As expected, in terms of selective classification, last-layer algorithms outperform a point-estimate network baseline trained with SGD on datasets like ImageNet.

2 Related Work
Uncertainty estimation has a rich history, and we describe here the work most closely related to ours. A review of Bayesian neural networks is provided in [Gal16]. In particular, [GG16] proposes Monte-Carlo dropout, a Bayesian technique for estimating uncertainties in neural networks by applying dropout at test time.
Frequentist approaches mainly focus on selective classification, calibration, and out-of-distribution detection. These concepts are introduced and detailed in the following sections. In particular, selective classification and out-of-distribution detection rely on a confidence function which outputs a confidence score in addition to the predicted class. In selective classification, uncertain inputs are rejected, i.e. left unclassified by the classifier, which enables drawing a risk-coverage curve. In out-of-distribution detection, uncertain inputs are flagged as out-of-distribution samples. A simple confidence function for selective classification is the softmax value of the chosen class, which has been shown to outperform MC-Dropout on ImageNet [GEY17]. A general technique on top of an uncertainty estimate is developed in [GUE18], and a novel loss is introduced in [TBB19] to train neural networks to abstain from predicting. A direct optimization of the ROC curve (for the binary abstention/classification decision) is presented in [ASK18]. We address calibration in Section 3.1 and out-of-distribution detection in Appendix B.
The ensemble technique that we adopt in this paper has shown satisfactory results both for classical metrics (such as accuracy) and uncertainty-related ones [LPC15, SHF16, LPB17, GIP18, HLP17].
In the context of decision making, the idea of using the last layer of a pretrained regression neural network to compute uncertainty estimates has been explored in several related fields: Bayesian optimization [SRS15, ZLA18], and as an uncertainty source for exploration in reinforcement learning [RTS18, ABA18]. Combining neural networks with Gaussian processes has also been suggested as a way to decouple representation and uncertainty [CPRD16, IG17].

3 Problem Description
In this work, we study classification tasks. Let $\mathcal{X}$ be a feature space, and $\mathcal{Y} = \{1, \ldots, K\}$ a finite label set with $K$ classes. We assume access to a training dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ of $N$ points independently distributed according to a pair of random variables $(X, Y)$. We define the test set $\mathcal{D}_{\mathrm{test}}$ analogously. For classification tasks, the standard output of a neural network provides a probability distribution over the $K$ classes, by applying the softmax function to the final logits. Let us denote by $\theta$ the set of real-valued parameters of the network (weights and biases). The network is usually trained using variants of stochastic gradient descent with the cross-entropy loss (the negative log-likelihood of the multinomial logistic regression model): $\ell(\theta) = -\sum_{i=1}^{N} \log p(y_i \mid x_i, \theta)$, where $p(\cdot \mid x, \theta)$ is the output probability distribution over $\mathcal{Y}$ predicted by the network. The classifier is generally obtained just by taking the argmax, $\hat{y}(x) = \arg\max_{y \in \mathcal{Y}} p(y \mid x, \theta)$ for $x \in \mathcal{X}$. This rule corresponds to the optimal decision when the misclassification cost is independent of the classes, and it can be easily generalized to heterogeneous costs, see e.g. Section 1.5.1 of [Bis06]. The performance of the classifier can be measured by the accuracy; however, to take advantage of uncertainty estimates associated with it, other metrics need to be defined.
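As a toy illustration of the cost-sensitive generalization mentioned above (a minimal sketch with our own naming, not code from this paper), the optimal decision picks the class minimizing the expected misclassification cost; with the 0-1 cost it reduces to the usual argmax rule:

```python
import numpy as np

def bayes_decision(probs, cost):
    """Pick the class with minimal expected misclassification cost.

    probs: length-K vector of predicted class probabilities.
    cost[j, k]: cost of predicting class k when the true class is j.
    With the 0-1 cost matrix this reduces to the argmax of probs."""
    expected = probs @ cost  # expected cost of predicting each class k
    return int(expected.argmin())

p = np.array([0.6, 0.4])
zero_one = np.array([[0.0, 1.0],
                     [1.0, 0.0]])
asym = np.array([[0.0, 1.0],
                 [10.0, 0.0]])  # missing class 1 is ten times as costly
```

With the symmetric cost the rule returns the most probable class, while the asymmetric cost flips the decision to class 1 even though it is less probable.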
3.1 Uncertainty Metrics
Neural networks for classification tasks output a probability distribution over the classes. The notion of calibration is thus relevant: a model is calibrated if, on average over input points, the predicted distribution matches the true underlying distribution over the classes (note that most works focus on matching the probability of the predicted class only). When calibrated, the output provides an appropriate measure of uncertainty associated with the decision. However, despite strong accuracies, modern neural networks are often miscalibrated. Fortunately, remarkably simple methods exist to alleviate this issue, such as temperature scaling [GPSW17]. Calibrated neural networks are important for model interpretability; however, they do not offer a systematic and automated way to either improve accuracy or detect out-of-distribution samples.
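To make the temperature-scaling remedy concrete, here is a minimal sketch: pick the temperature that minimizes the negative log-likelihood on held-out logits. [GPSW17] fits the temperature by gradient descent; the grid search and all names below are our own simplification.

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax of logits z (rows = examples) at temperature T."""
    z = z / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Return the temperature with lowest NLL on a held-out set."""
    nll = [-np.log(softmax(logits, T)[np.arange(len(labels)), labels]
                   + 1e-12).mean() for T in grid]
    return float(grid[int(np.argmin(nll))])
```

An overconfident model (large logits, some of them wrong) typically yields a fitted temperature above 1, which softens the predicted probabilities without changing the argmax decision.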
Selective classification is a key setting for measuring the quality of uncertainty estimates. It is also sometimes referred to as abstention, and the concept is not restricted to deep learning [BW08, CDM16, GECd18]. A selective classifier is a pair $(f, g)$ where $f$ is a classifier and $g : \mathcal{X} \to \{0, 1\}$ is a selection function which serves as a binary qualifier for $f$, see e.g. [GEY17, GUE18]. The selective classifier abstains from prediction at a point $x$ if $g(x) = 0$, and outputs $f(x)$ when $g(x) = 1$. The performance of a selective classifier can be quantified using the notions of coverage and selective risk. The coverage is defined as $\phi(g) = \mathbb{E}[g(X)]$, whereas the selective risk is given by
$$R(f, g) = \frac{\mathbb{E}[\ell(f(X), Y)\, g(X)]}{\mathbb{E}[g(X)]},$$
where $\ell$ is the 0-1 loss. Their empirical estimations over the test set $\mathcal{D}_{\mathrm{test}} = \{(x_i, y_i)\}_{i=1}^{n}$ are
$$\hat{\phi}(g) = \frac{1}{n} \sum_{i=1}^{n} g(x_i), \qquad \hat{R}(f, g) = \frac{\sum_{i=1}^{n} \ell(f(x_i), y_i)\, g(x_i)}{\sum_{i=1}^{n} g(x_i)}.$$
A natural way to define a selection function is by means of a confidence function $\kappa : \mathcal{X} \to \mathbb{R}$ which quantifies how much we trust the prediction $f(x)$ for input $x$. The selection function is then constructed by thresholding $\kappa$: given $\tau \in \mathbb{R}$, for all $x \in \mathcal{X}$, we set $g_\tau(x) = \mathbb{1}\{\kappa(x) \geq \tau\}$. We only classify $x$ if its confidence is at least $\tau$. Let $\Theta_\kappa$ be the set of confidence values $\kappa(x)$ for the inputs $x$ in the test dataset; if there are duplicate values, they are replicated so that $|\Theta_\kappa| = n$. The performance of the confidence function $\kappa$ can be measured using the Area Under the Risk-Coverage curve (AURC) computed over $\Theta_\kappa$:
$$\mathrm{AURC}(\kappa) = \frac{1}{n} \sum_{\tau \in \Theta_\kappa} \hat{R}(f, g_\tau).$$
Better confidence functions lead to a faster decrease of the associated risk when we decrease coverage, which results in a lower AURC. They are able to improve accuracy by choosing not to classify points where uncertainty is highest and errors are likely.
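The risk-coverage machinery above is straightforward to compute from a set of test-time confidence scores; the following is a minimal numpy sketch in our own notation, not the paper's code:

```python
import numpy as np

def risk_coverage(confidence, correct):
    """Empirical risk-coverage curve and AURC.

    Sorts test points by decreasing confidence; at coverage j/n the
    selective risk is the error rate among the j most confident points.
    The AURC averages that risk over all n coverage levels, one per
    confidence threshold occurring in the test set."""
    order = np.argsort(-np.asarray(confidence, dtype=float))
    errors = ~np.asarray(correct, dtype=bool)[order]
    n = errors.size
    kept = np.arange(1, n + 1)
    coverage = kept / n
    risk = np.cumsum(errors) / kept
    return coverage, risk, float(risk.mean())
```

A confidence function that ranks every correctly classified point above every misclassified one attains the lowest possible AURC for a given accuracy.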
Concerning out-of-distribution detection, we present several standard metrics in Appendix B.
3.2 Confidence Functions
Selective classification relies on a confidence function $\kappa$, which quantifies the confidence in the class prediction for a given input. We now present several ways to define $\kappa$; note they are linked to the algorithms we present in Section 4. First, we introduce some required background concepts.
In the Bayesian framework, a major obstacle often encountered in practice is sampling from the posterior distribution $p(\theta \mid \mathcal{D})$, where $\theta$ denotes the parameters of either the full or the last-layer network. Closed-form updates are usually not available, leading to an intractable problem (except for conjugate distributions). Posteriors can be approximated using workarounds such as variational inference [WJ08], or Markov Chain Monte Carlo algorithms, see e.g. Chapter 11 of [GSC13]. The predictive posterior distribution is defined for $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ by
$$p(y \mid x, \mathcal{D}) = \int_{\Theta} p(y \mid x, \theta)\, p(\theta \mid \mathcal{D})\, \mathrm{d}\theta, \qquad (1)$$
where $p(y \mid x, \theta)$ is the likelihood function (the softmax output of the network), and $\Theta$ the parameter space. We estimate this quantity in practice by
$$\hat{p}(y \mid x, \mathcal{D}) = \frac{1}{S} \sum_{s=1}^{S} p(y \mid x, \theta_s), \qquad (2)$$
where $\theta_1, \ldots, \theta_S$ are approximately drawn according to the posterior distribution. In Section 4, we propose four algorithms from which we can sample $\theta_1, \ldots, \theta_S$. The three confidence functions considered are introduced below.
Softmax Response.
The first confidence function we examine is the softmax response (SR) [GEY17], also known as (one minus) the variation ratio, pp. 40-43 of [Fre65]. It is defined for $x \in \mathcal{X}$ by $\mathrm{SR}(x) = \max_{y \in \mathcal{Y}} p(y \mid x, \mathcal{D})$, where $p(\cdot \mid x, \mathcal{D})$ is the predictive posterior distribution given in (1). It is estimated by
$$\widehat{\mathrm{SR}}(x) = \max_{y \in \mathcal{Y}} \hat{p}(y \mid x, \mathcal{D}), \qquad (3)$$
where $\hat{p}$ is defined in (2). The associated classifier is then determined by the optimal decision rule $f(x) = \arg\max_{y \in \mathcal{Y}} p(y \mid x, \mathcal{D})$. Its empirical version is
$$\hat{f}(x) = \arg\max_{y \in \mathcal{Y}} \hat{p}(y \mid x, \mathcal{D}). \qquad (4)$$
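In code, the SR confidence of (3) and the empirical classifier of (4) are one line each once the ensemble's softmax outputs are stacked; a sketch under our own naming:

```python
import numpy as np

def softmax_response(probs):
    """SR confidence and predicted class from an ensemble.

    probs: array of shape (S, K), the softmax outputs of S approximate
    posterior samples for a single input. The row-wise mean is the
    Monte Carlo predictive posterior of eq. (2); SR is its maximum,
    eq. (3), and the predicted class its argmax, eq. (4)."""
    p_hat = probs.mean(axis=0)
    return float(p_hat.max()), int(p_hat.argmax())
```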
Standard deviation of the posterior distribution.
We keep the classifier fixed as above: $\hat{f}(x) = \arg\max_{y \in \mathcal{Y}} \hat{p}(y \mid x, \mathcal{D})$. The second confidence function we investigate is the standard deviation, under the posterior, of the probability of the predicted class at $x$:
$$\mathrm{STD}(x) = \sqrt{\mathrm{Var}_{\theta \sim p(\cdot \mid \mathcal{D})}\!\left[ p(\hat{f}(x) \mid x, \theta) \right]}.$$
We estimate it by
$$\widehat{\mathrm{STD}}(x) = \sqrt{\frac{1}{S} \sum_{s=1}^{S} \left( p(\hat{f}(x) \mid x, \theta_s) - \hat{p}(\hat{f}(x) \mid x, \mathcal{D}) \right)^2}, \qquad (5)$$
where $\hat{f}$ is defined in (4), and $\theta_1, \ldots, \theta_S$ are approximately drawn according to the posterior distribution. The actual confidence function is defined as $\kappa(x) = -\widehat{\mathrm{STD}}(x)$.
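A small numpy sketch of the STD confidence of eq. (5), again under our own naming:

```python
import numpy as np

def std_confidence(probs):
    """Negative posterior standard deviation of the predicted class probability.

    probs: array of shape (S, K) of ensemble softmax outputs for one input.
    The predicted class is the argmax of the averaged predictions, eq. (4);
    the confidence is minus the spread of its probability across samples."""
    p_hat = probs.mean(axis=0)
    y_hat = int(p_hat.argmax())
    std = np.sqrt(np.mean((probs[:, y_hat] - p_hat[y_hat]) ** 2))
    return -float(std)
```

An ensemble whose members agree exactly gets the maximal confidence of 0; disagreement across samples pushes the confidence down.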
Entropy of $q$.
Finally, the last confidence measure we study is based on a probability distribution $q(\cdot \mid x)$ over the $K$ classes, defined as
$$q(y \mid x) = \int_{\Theta} \mathbb{1}\!\left\{ y = \arg\max_{k \in \mathcal{Y}} p(k \mid x, \theta) \right\} p(\theta \mid \mathcal{D})\, \mathrm{d}\theta,$$
where $\mathbb{1}$ denotes the indicator function. The idea is to measure the amount of posterior mass under which each class is selected. The empirical estimator is given by
$$\hat{q}(y \mid x) = \frac{1}{S} \sum_{s=1}^{S} \mathbb{1}\!\left\{ y = \arg\max_{k \in \mathcal{Y}} p(k \mid x, \theta_s) \right\}. \qquad (6)$$
The confidence is then based on the entropy of $q$ (or $\hat{q}$, in practice): $\kappa(x) = -H(\hat{q}(\cdot \mid x))$.
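The empirical estimator of eq. (6) simply counts ensemble votes per class, and its negative Shannon entropy serves as the confidence; a sketch with our own naming:

```python
import numpy as np

def vote_entropy_confidence(probs):
    """Confidence based on the entropy of the vote distribution of eq. (6).

    probs: array of shape (S, K) of ensemble softmax outputs for one input.
    Each sample casts a vote for its argmax class; the empirical vote
    frequencies estimate the posterior mass under which each class wins."""
    S, K = probs.shape
    votes = probs.argmax(axis=1)
    q = np.bincount(votes, minlength=K) / S
    nz = q[q > 0]
    entropy = -np.sum(nz * np.log(nz))
    return -float(entropy)  # unanimous ensembles get the maximal confidence, 0
```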
4 Algorithms
In this Section, we describe a number of algorithms which allow us to approximately draw samples from the posterior distribution. The core idea, common to all of them, consists in explicitly disentangling representation learning and uncertainty estimation.
We start by describing the high-level idea behind all the algorithms. Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ be a classification training dataset. We first train a standard deep neural network to convergence using the cross-entropy loss and a classical optimizer such as Adam [KB15]. We denote by $\mathcal{Z}$ the vector space containing the input to the last layer of the trained neural network. The cornerstone of our method, coming from transfer learning [YCBL14, RASC14, DJV14], consists first in computing the last-layer features $z_i \in \mathcal{Z}$ from the inputs $x_i$ by making a forward pass through the trained network. We do this for all points in $\mathcal{D}$, and produce a new training dataset $\mathcal{D}_L = \{(z_i, y_i)\}_{i=1}^{N}$ which should provide a simpler representation of the data for the classification task. Finally, uncertainty estimation is carried out on $\mathcal{D}_L$ via any algorithm that computes confidence estimates. In our case, the latter are applied to the last layer of the network, which is a dense layer with a softmax activation. We suggest and describe below four algorithms to perform uncertainty estimation: Stochastic Gradient Descent (SGD), Stochastic Gradient Langevin Dynamics (SGLD), Monte-Carlo Dropout (MC-Dropout), and Bootstrap. They all compute an ensemble of models. The last-layer approach is not restricted to these algorithms, and it can be implemented in combination with any algorithm computing uncertainty estimates from $\mathcal{D}_L$. Running the algorithms on the last layer considerably reduces the computational cost required to find uncertainty estimates. Note that this two-stage procedure may make the Bayesian theory (which motivates the suggested last-layer algorithms) not hold exactly.
In Sections 4.1, 4.2 and 4.3, we describe the last-layer versions of SGD, SGLD, MC-Dropout, and Bootstrap. Recall that for them the training dataset is $\mathcal{D}_L$, instead of $\mathcal{D}$. We also apply the four algorithmic ideas to the full network: the adaptation is simple, replacing $\mathcal{D}_L$ by $\mathcal{D}$. For the four algorithms, the last layer or full neural network is always initialized at $\theta^*$, the parameters of the trained network after convergence.
4.1 Stochastic Gradient Langevin Dynamics and Stochastic Gradient Descent
Stochastic Gradient Langevin Dynamics (SGLD) is a Markov Chain Monte Carlo (MCMC) algorithm [WT11], adapted from the Langevin algorithm [RT96] to large-scale datasets by taking a single minibatch of data to estimate the gradient at each update. More precisely, by Bayes' rule, the posterior distribution is proportional to $p(\theta) \prod_{i=1}^{N} p(y_i \mid x_i, \theta)$, where $p(\theta)$ is a prior distribution on $\theta$. In practice, we choose a standard Gaussian prior. The update equation of SGLD is then given for $t \geq 0$ by
$$\theta_{t+1} = \theta_t + \frac{\gamma}{2} \left( \nabla \log p(\theta_t) + \frac{N}{|B_t|} \sum_{i \in B_t} \nabla \log p(y_i \mid x_i, \theta_t) \right) + \sqrt{\gamma}\, \eta_t, \qquad (7)$$
where $\gamma$ is a constant learning rate, $B_t$ a minibatch from the training dataset, and $(\eta_t)_{t \geq 0}$ an i.i.d. sequence of standard Gaussian random variables of the dimension of $\theta$. Following [ABW12, CFG14], we apply SGLD with a constant learning rate. However, a decreasing learning rate is also a valid approach. We do not apply a burn-in period because the last layer or full neural network is always initialized at $\theta^*$, a local minimum.
The update equation (7) of SGLD equals that of Stochastic Gradient Descent (SGD), apart from the added Gaussian noise term. In the same vein, [MHB17] shows that, under certain assumptions, SGD with a carefully chosen constant step size can be seen as approximately sampling from a posterior distribution with an appropriate prior. Therefore, we also consider SGD as an MCMC algorithm to approximately sample from the posterior distribution.
We apply thinning to reduce the memory cost: given a thinning interval $T$ and a number of samples $S$, we run the Markov chain for $S \cdot T$ steps and save the current parameters of the (last-layer or full) neural network every $T$ iterations. The procedure is summarized in Algorithm 1, where the update is given by equation (7) for SGLD, and by (7) without the Gaussian noise for SGD.
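A compact sketch of the constant-rate SGLD sampler with thinning (in the spirit of Algorithm 1, under our own naming): `grad_log_post` stands in for the stochastic estimate of the log-posterior gradient in (7), and dropping the noise term recovers the SGD variant.

```python
import numpy as np

def sgld_samples(theta0, grad_log_post, n_samples, thin, lr, seed=0):
    """Constant learning-rate SGLD, eq. (7), started at the trained optimum.

    Runs n_samples * thin update steps and keeps every thin-th iterate,
    so memory grows with n_samples rather than with the chain length."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    kept = []
    for t in range(n_samples * thin):
        noise = rng.standard_normal(theta.shape)
        theta = theta + 0.5 * lr * grad_log_post(theta) + np.sqrt(lr) * noise
        if (t + 1) % thin == 0:
            kept.append(theta.copy())
    return np.stack(kept)
```

On a toy strongly log-concave target such as a standard Gaussian (log-density gradient: `-theta`), the kept iterates approximately follow the target distribution.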
4.2 MonteCarlo Dropout
Dropout provides a popular method for computing empirical uncertainty estimates, and it was initially developed to avoid overfitting in deep learning models [SHK14]. It approximately samples from the posterior distribution when applied at test time [GG16]. This technique, often named Monte-Carlo Dropout (MC-Dropout), is widely used in practical applications [ZL17, LAA17, NPAA18] due to its simplicity and good performance.
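A minimal numpy sketch of MC-Dropout on a trained last layer (our own naming, using the standard inverted-dropout scaling; the paper's implementation relies on Keras dropout layers):

```python
import numpy as np

def mc_dropout_predict(W, b, z, rate=0.5, n_samples=100, seed=0):
    """Average the softmax over stochastic forward passes through the last layer.

    W: (d, K) weights and b: (K,) biases of the trained dense softmax layer;
    z: (d,) last-layer features of one input. Each pass zeroes a random
    fraction `rate` of the features and rescales survivors by 1/(1-rate)."""
    rng = np.random.default_rng(seed)
    probs = np.zeros(b.shape)
    for _ in range(n_samples):
        mask = rng.random(z.shape) >= rate
        logits = (z * mask / (1.0 - rate)) @ W + b
        e = np.exp(logits - logits.max())  # numerically stable softmax
        probs += e / e.sum()
    return probs / n_samples
```

Each stochastic pass plays the role of one posterior sample; the averaged output is the Monte Carlo predictive distribution on which the confidence functions of Section 3.2 operate.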
Dropout randomly sets a fraction of the input units to 0 at each update during training, or at each forward pass at test time. For the full-network version, we interleave a dropout layer after each max-pooling layer in the VGG-type neural network and before each dense layer. The method is described in Algorithm 2.

4.3 Bootstrap
At the crossroads of the Bayesian and frequentist approaches, the Bootstrap algorithm may provide a simple way to approximate the sampling distribution of an estimator, see e.g. [Efr12a, Efr12b]. We first sample with replacement $N$ data points from the training dataset, thus generating a new bootstrapped dataset. After this, either the last layer alone (multinomial logistic regression) or a full neural network is trained on the bootstrapped data until convergence, and the parameters of the network are saved. We repeat this as many times as the number of models we want, and then compute their ensemble. The procedure is detailed in Algorithm 3.
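The resampling loop of Algorithm 3 can be sketched as follows (the `fit` routine, which retrains the last layer to convergence, is left abstract here and is an assumption of ours):

```python
import numpy as np

def bootstrap_ensemble(Z, y, fit, n_models=10, seed=0):
    """Fit n_models last-layer models, each on a dataset resampled
    with replacement from the last-layer features Z and labels y."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # n draws with replacement
        models.append(fit(Z[idx], y[idx]))
    return models
```

Any estimator can stand in for `fit`; the spread of the resulting ensemble's predictions then approximates the sampling variability of that estimator.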
5 Experimental Results and Discussion
We evaluate the quality of the uncertainty estimates produced by the last-layer algorithms on four image classification tasks of increasing complexity. The MNIST dataset [LBBH98] consists of 28x28 handwritten digits, divided into a training set with 60000 examples and a test set with 10000 images. The CIFAR10 (resp. CIFAR100) dataset [Kri09] consists of 32x32x3 colour images, each corresponding to one of 10 (resp. 100) classes. The dataset is split into 50000 training images and 10000 test ones; therefore, there are 6000 (resp. 600) images per class. Finally, the ImageNet dataset [DDS09] has 1281167 training images and 50000 test ones, divided into 1000 classes. We randomly crop the colour images to a 331x331x3 size.
For MNIST, we consider a fully-connected feedforward neural network with 2 hidden layers of 512 and 20 neurons respectively. For CIFAR10 and CIFAR100, we use a pretrained VGG16 neural network² with 512 neurons in the last hidden layer. For ImageNet, the 4032-dimensional last-layer features are computed through a pretrained NASNet neural network³. The trained networks achieve high standard accuracies on MNIST, CIFAR10, CIFAR100, and ImageNet (top-1 accuracy); see Table 1 in the appendix for the test accuracies of all algorithms and datasets. In addition to the four algorithms MC-Dropout, Bootstrap, SGD and SGLD, we evaluate the SGD-Point Estimate (SGD-PE) baseline, which simply computes the softmax outputs provided by the pretrained neural network. The posterior approximation is then formally a Dirac at $\theta^*$, the parameters of the pretrained network. Thus, the only confidence function we can compute for SGD-PE is the softmax response, or its empirical estimation defined in (3).

² https://github.com/geifmany/cifarvgg
³ https://keras.io/applications/#nasnet
We conduct two sets of experiments: we first evaluate the five methods against the AURC metric, and then their ability to detect out-of-distribution samples (AUROC and AUPR-in/out). The results for the latter are reported in Appendix B. In order to better understand the value of adding multiple uncertainty layers, we run the algorithms both on the last layer and on the full neural networks for MNIST and CIFAR10/100. We append the word "full" to denote the full-network versions of the algorithms in the tables below.
Given the size of both ImageNet and the NASNet network, we assess the potential benefit of multiple uncertainty layers on ImageNet by adding up to 3 dense hidden layers with 4032 neurons on top of NASNet. We apply the uncertainty algorithms to one, two, or all three layers. For example, in the case of Dropout, we compare the performance of adding from one to three dropout layers (note we do not add any layers without dropout in this case). As a control, we also run SGD-PE on the three fully-connected architectures. The dense layers added on top of NASNet are fine-tuned first, and these weights are then used both as a reference (SGD-PE) and as the starting point for the four algorithms.
We perform a hyperparameter search for all algorithms and datasets. Details are in Appendix A. We only report below results for the best hyperparameter values.
The results for the AURC metric are shown in Figure 2 for MNIST, while Figure 3 contains the outcomes for CIFAR100, and Figure 4 those for ImageNet. We recall that the lower the AURC, the better the result. Let us define min AURC as the minimum value achieved using either SR, STD, or the entropy-based function as a confidence function. For better readability, we define the normalized AURC as the ratio of min AURC over the AURC of SGD-PE (which is unique, using SR as the confidence function). Tables are provided in Section A.2. Note that our results are reported for a single run of each algorithm, since the Bayesian approaches already incorporate uncertainty estimates.
We would like to pursue several avenues of research in the future: first, to compare our methodology with ensemble methods over the full network, i.e. snapshot ensembles [HLP17], deep ensembles [LPB17] as well as other methods that can be described as being Bayesian about the last layer such as deep kernel learning [WHSX16]. Second, to make additional comparisons to temperature scaling [GPSW17] and to apply bootstrapping to the full dataset and training procedure. Third, to consider methods for variational inference and Laplace approximations, e.g. [RBB18], and alternative methods for Bayesian logistic regression on large datasets since the focus is on last layer Bayesian approaches using the features learned from deep neural networks, e.g. [HCB16, HB15].
We summarize the results we obtain as follows:
1) Adding Multiple Uncertainty Layers Does Not Help.
Except on the MNIST dataset, where the full-network counterparts improve the AURC, the last-layer algorithms and their full-network versions perform similarly for the four algorithms. On CIFAR10, the AURC is actually better for the last-layer versions of Dropout, Bootstrap, SGD, and SGLD, and the same observation holds on CIFAR100 (see Tables 3 and 4). For ImageNet, the AURC is mostly constant with respect to the number of uncertainty layers (from 1 to 3), with only a small maximum variation; more precisely, the best performance for all algorithms is obtained with 2 uncertainty layers.
We show histograms of the SR confidence function values to shed some light on the difference between MNIST and CIFAR10/100. See Figure 5 for CIFAR100 and Figure 6 for MNIST. These plots compare the SR distributions of the correctly classified and misclassified test points for last-layer SGD versus its full-network counterpart. Actually, the AURC is a direct measure of how well the two distributions are separated: in the extreme case where all the correctly classified points had higher SR values than the misclassified ones, the AURC would reach its minimum value (the excess over this minimum defines the E-AURC of [GUE18]).
In the case of MNIST, the histograms for correctly classified points are similar for the last-layer and full-network SGD versions. However, the full network exhibits a greater dispersion for incorrectly classified points (note the scale of the y-axis). Both facts combined lead to a better AURC for the full-network algorithm, as it can better separate the two sets of points. Indeed, MNIST is known to be an easy classification task, and a flat landscape is to be expected around local loss minima. In other words, there are many distinct representations which suffice to solve the problem. In this case, full-network methods are able to explore many of them, thus accounting for the uncertainty in the representation, while still correctly classifying the vast majority of points with high confidence. Accordingly, we suspect full-network approaches provide a more diverse set of predictions in this context, compared to the last-layer implementation, which is committed to a single representation. We believe that, when the loss landscape is more-or-less flat, full-network algorithms can take advantage of representation uncertainty to deliver stronger results.
A different behavior can be observed on CIFAR100, where the classification task is more difficult. The histograms of the full-network SGD are more dispersed for both correctly classified and misclassified points. In particular, as opposed to the MNIST scenario, a number of correctly classified points are no longer mapped to a high SR. Therefore, the confidence function SR is worse at ranking these examples, while the effect is lighter for the last-layer version, leading to a better AURC in the latter case. One possible explanation relates to the sensitivity of the representation found by the pretrained network. For hard classification problems (like CIFAR100 or ImageNet), the network is expected to end up in a strong, deep local minimum after training. Intuitively, this means that the quality of nearby representations in parameter space quickly degrades. Unfortunately, dithering the original strong local minimum is precisely what most full-network uncertainty methods do, so a number of previously correctly ranked points may suffer from poor representations. In this case, committing to the local optimum may pay off: last-layer models still exploit the fixed representation to compute useful uncertainty estimates on top. In difficult problems, when we bootstrap the data, each individual model is exposed to fewer distinct data points; we suspect this leads to reaching worse local minima than the aforementioned deep one. One could instead train an ensemble of networks on the very same data, which may help in practice for this type of hard problem. It could also be the case that, if many of those models end up in the same deep local minimum, their predictions will not be diverse enough to generate useful uncertainty estimates. In these cases, we expect last-layer models to help at a reasonable computational cost. Indeed, the AURC is slightly worse on CIFAR100 for the full-network methods (relative increases between 4% and 18% over SGD-PE, see Table 4).
Therefore, for harder classification tasks, our results support the idea that by explicitly decoupling representation learning (based on all but the last layer) from uncertainty estimation (performed entirely at the last layer) we capture most of the value provided by these algorithmic approaches in terms of selective classification.
2) Softmax Response (SR) is a Strong Confidence Function.
We have compared several confidence functions: SR, STD, and the entropy-based function defined in Section 3.2. We observe a common theme in all cases: the softmax response SR consistently outperforms all the other confidence functions. As an example, the risk-coverage curve for ImageNet is plotted in Figure 9 of the appendix.
3) SGD PointEstimate is actually a Strong Baseline.
SGD-PE is particularly competitive on CIFAR10/100, where it provides almost optimal performance. Its main advantage is simplicity: it can be applied off-the-shelf and no two-stage procedure is needed. However, the method suffers on both MNIST and ImageNet compared to the other algorithms. For MNIST, the explanation is similar to the previous discussion: MNIST being an easy classification task, the full-network versions of our algorithms are superior to the last-layer versions, which are themselves better than a single point estimate. For ImageNet, the results suggest that raw end-to-end softmax outputs may not be enough for more complex datasets. Ensemble techniques may bring additional stability and robustness in this context.
4) SGLD is Unstable on the Full Network.
When running SGLD on the full network for MNIST and CIFAR10/100, we observed that this algorithm is unstable: if the learning rate is not very small, SGLD tends to diverge, i.e. the accuracy decreases (and the loss increases) over the iterations. This phenomenon is not visible when SGLD is applied only to the last layer of the neural network. In the case of one dense layer endowed with a multinomial logistic regression model and a Gaussian prior over the weights and biases, the posterior distribution is strongly log-concave. In this setting, convergence properties of SGLD have been studied [Dal17, DK17], and its behaviour has been shown to be close to that of SGD [NDH17, BDM18]. We conclude that the full-network version of SGLD should be used carefully, while its last-layer counterpart should be easier to train. In the future, we intend to perform comparisons with a decreasing learning-rate schedule and run convergence diagnostics for the SGLD chain, e.g. the effective sample size.
The results for outofdistribution detection are in Section B.2. They support similar takeaway messages.
6 Conclusion
In this work, we showed that decoupling representation learning and uncertainty quantification in deep neural nets is a tractable approach to selective classification, an important problem for real-world applications where mistakes can be fatal. Vanilla methods that do not compute uncertainty estimates struggle to solve some of the most complex tasks we studied. In addition, our experiments indicate that the improvements obtained by adding several uncertainty layers (either at the top, or along the whole architecture) are at most modest, thus making it hard to justify their complexity overhead.
Appendix
In Appendix A, additional material concerning the experiments is provided: hyper-parameter tuning is detailed, and supplementary tables and plots for the AURC metric are presented. In Appendix B, metrics for out-of-distribution detection are first defined and then computed for the last-layer algorithms on MNIST and CIFAR10/100, with the related tables and plots.
Appendix A Additional Material for the Experiments
The accuracies for all algorithms and all datasets are presented in Table 1.
algorithm  mnist  cifar10  cifar100

sgd  0.981  0.936  0.706
sgld  0.981  0.935  0.707
bootstrap  0.981  0.935  0.704
dropout  0.980  0.935  0.705
sgdpe  0.978  0.936  0.705
sgd full  0.985  0.933  0.696
sgld full  0.982  0.929  0.677
bootstrap full  0.984  0.931  0.687
dropout full  0.984  0.931  0.700
A.1 Hyper-Parameter Tuning
We perform hyperparameter tuning for the last layer algorithms as follows.
Algorithms. The number of samples is chosen among 10, 100, and 1000 for SGD, SGLD and MC-Dropout, and between 10 and 100 for Bootstrap.
The thinning interval is chosen such that the parameters are saved at each epoch (one full pass over the data) for SGLD and SGD.
The number of SGD epochs is 10 for the Bootstrap, and 100 for MC-Dropout.
The dropout rate for the latter (the probability of zeroing out a neuron) is chosen among 0.1, 0.3, and 0.5.
MNIST. The learning rate is among 5 equally spaced values between and . We use a batch size of 32.
CIFAR10/100. The learning rate is among 7 equally spaced values between and . The batch size is 128.
ImageNet. The learning rate is among 4 equally spaced values between and , with a batch size of 512. In this case, the number of samples is 10, and only 10 epochs are completed for MC-Dropout.
For the full-network versions of the algorithms on MNIST and CIFAR10/100, the number of samples is equal to , and the number of epochs is 10 for Bootstrap and for MC-Dropout.
MNIST. The learning rate is among 4 equally spaced values between and . We use a batch size of 32.
CIFAR10/100. The learning rate is among 4 equally spaced values between and . The batch size is 128.
The metrics optimized by the hyperparameter search are defined in Section 3.1.
A.2 Additional Results for Selective Classification
Tables for the AURC metric are shown in Table 2 for MNIST, while Table 3 contains the outcomes for CIFAR10, and Table 4 those for CIFAR100. Finally, ImageNet results are displayed in Table 5. AURC sr (resp. AURC std) is the AURC obtained when the confidence function is the softmax response SR (resp. STD). min AURC is the minimum of these two values, and increase is the ratio of the min AURC over the AURC obtained by SGD-PE. For ImageNet, the ratio is over the AURC using SGD-PE on a network with 2 dense layers on top of NASNet. The AURC using the entropy-based confidence function defined in (6) is not reported because its results are clearly below those of its competitors. In Table 5, nbll indicates the number of dense layers added on top of NASNet (from 1 to 3).
Table 2: AURC on MNIST.

algorithm | AURC sr | AURC std | min AURC | increase
dropout | 8.74E-04 | 1.10E-03 | 8.74E-04 | 0.52
dropout full | 4.84E-04 | 5.57E-04 | 4.84E-04 | 0.29
bootstrap | 7.43E-04 | 7.55E-04 | 7.43E-04 | 0.45
bootstrap full | 7.68E-04 | 5.56E-04 | 5.56E-04 | 0.33
sgd | 1.18E-03 | 7.62E-04 | 7.62E-04 | 0.46
sgd full | 5.78E-04 | 5.37E-04 | 5.37E-04 | 0.32
sgld | 7.26E-04 | 7.28E-04 | 7.26E-04 | 0.44
sgld full | 9.03E-04 | 6.74E-04 | 6.74E-04 | 0.40
sgdpe | 1.67E-03 | – | 1.67E-03 | 1.00
Table 3: AURC on CIFAR10.

algorithm | AURC sr | AURC std | min AURC | increase
dropout | 6.56E-03 | 6.66E-03 | 6.56E-03 | 0.98
dropout full | 7.12E-03 | 7.32E-03 | 7.12E-03 | 1.07
bootstrap | 6.60E-03 | 6.90E-03 | 6.60E-03 | 0.99
bootstrap full | 7.14E-03 | 7.26E-03 | 7.14E-03 | 1.07
sgd | 6.56E-03 | 7.30E-03 | 6.56E-03 | 0.98
sgd full | 6.69E-03 | 7.02E-03 | 6.69E-03 | 1.00
sgld | 6.51E-03 | 6.80E-03 | 6.51E-03 | 0.98
sgld full | 7.44E-03 | 7.56E-03 | 7.44E-03 | 1.12
sgdpe | 6.66E-03 | – | 6.66E-03 | 1.00
Table 4: AURC on CIFAR100.

algorithm | AURC sr | AURC std | min AURC | increase
dropout | 9.10E-02 | 9.31E-02 | 9.10E-02 | 0.98
dropout full | 9.59E-02 | 1.12E-01 | 9.59E-02 | 1.04
bootstrap | 9.20E-02 | 9.90E-02 | 9.20E-02 | 0.99
bootstrap full | 1.02E-01 | 1.14E-01 | 1.02E-01 | 1.10
sgd | 9.09E-02 | 9.79E-02 | 9.09E-02 | 0.98
sgd full | 9.69E-02 | 1.09E-01 | 9.69E-02 | 1.05
sgld | 9.15E-02 | 9.45E-02 | 9.15E-02 | 0.99
sgld full | 1.09E-01 | 1.13E-01 | 1.09E-01 | 1.18
sgdpe | 9.25E-02 | – | 9.25E-02 | 1.00
Table 5: AURC on ImageNet. nbll is the number of dense layers added on top of NASNet.

algorithm | nbll | AURC sr | AURC std | min AURC | increase
dropout | 1 | 0.0974 | 0.1318 | 0.0974 | 0.94
bootstrap | 1 | 0.0975 | 0.1188 | 0.0975 | 0.94
sgd | 1 | 0.1007 | 0.2569 | 0.1007 | 0.98
sgld | 1 | 0.1023 | 0.1740 | 0.1023 | 0.99
sgdpe | 1 | 0.1352 | – | 0.1352 | 1.31
dropout | 2 | 0.0924 | 0.1117 | 0.0924 | 0.90
bootstrap | 2 | 0.0947 | 0.1073 | 0.0947 | 0.92
sgd | 2 | 0.0949 | 0.1080 | 0.0949 | 0.92
sgld | 2 | 0.0954 | 0.1101 | 0.0954 | 0.92
sgdpe | 2 | 0.1032 | – | 0.1032 | 1.00
dropout | 3 | 0.0929 | 0.1186 | 0.0929 | 0.90
bootstrap | 3 | 0.0972 | 0.1102 | 0.0972 | 0.94
sgd | 3 | 0.0974 | 0.1159 | 0.0974 | 0.94
sgld | 3 | 0.0975 | 0.1119 | 0.0975 | 0.94
sgdpe | 3 | 0.1064 | – | 0.1064 | 1.03
Figure 7 follows the same approach as Figures 5 and 6 of the article, except for the choice of confidence function. We observe the same phenomenon as in Figure 6: the histogram of the full-network version of SGD is more dispersed on the misclassified examples, but not on the correctly classified ones.
As already discussed in Section 3.1 of the article, calibration is a desirable property of probabilistic models. Before computing calibration metrics, we first introduce some notation and concepts. Our probabilistic model is said to be calibrated if, for all ,
(8) 
where is the predictive posterior distribution defined in (1). The empirical equivalent of (8) over the test dataset is
(9) 
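In generic notation (following [GPSW17], rather than the article's own symbols, which are defined in (1)), the calibration condition (8) states that confidence matches accuracy at every level: writing $\hat{p}(x)$ for the model's confidence and $\hat{y}(x)$ for its prediction, (8) and its empirical counterpart (9) read

```latex
% perfect calibration (cf. (8))
\mathbb{P}\big(\hat{y}(X) = Y \,\big|\, \hat{p}(X) = p\big) = p,
\qquad \forall\, p \in [0,1].

% empirical counterpart over a test set \{(x_i, y_i)\}_{i=1}^{n} (cf. (9))
\frac{\sum_{i=1}^{n} \mathbf{1}\{\hat{y}(x_i) = y_i\}\,\mathbf{1}\{\hat{p}(x_i) = p\}}
     {\sum_{i=1}^{n} \mathbf{1}\{\hat{p}(x_i) = p\}} \;\approx\; p .
```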
In practice, it is necessary to discretize into bins, where . For , define
For , we define the average accuracy and confidence in the bin as:
We can then relax equation (9) to , for all . When a model does not satisfy this for all , we say it is miscalibrated. There are several ways to measure miscalibration, for example (see e.g. [GPSW17]):

- a reliability diagram, a bar plot of against for every ;
- the expected calibration error (ECE), defined as
- the maximum calibration error (MCE), defined as
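Given per-example confidences and correctness indicators, the ECE and MCE of [GPSW17] can be computed in a few lines of NumPy. The following is a minimal sketch with equal-width bins; the function and variable names are ours.

```python
import numpy as np

def ece_mce(confidences, correct, n_bins=10):
    """Expected and maximum calibration error over equal-width bins,
    following the definitions in [GPSW17]."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # assign each confidence to one of n_bins equal-width bins on (0, 1]
    bins = np.clip(np.ceil(confidences * n_bins).astype(int) - 1, 0, n_bins - 1)
    n = len(confidences)
    ece, mce = 0.0, 0.0
    for m in range(n_bins):
        in_bin = bins == m
        if not in_bin.any():
            continue
        acc = correct[in_bin].mean()       # average accuracy in bin B_m
        conf = confidences[in_bin].mean()  # average confidence in bin B_m
        gap = abs(acc - conf)
        ece += in_bin.sum() / n * gap      # ECE: population-weighted gaps
        mce = max(mce, gap)                # MCE: worst-bin gap
    return ece, mce

# an overconfident toy model: 95% confidence, 75% accuracy -> ECE = MCE = 0.2
ece, mce = ece_mce([0.95] * 4, [1, 1, 1, 0])
```

A well-calibrated model concentrates its confidence values near the observed accuracy in each bin, driving both quantities to zero.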
The reliability diagram for the last-layer and full-network versions of SGD on CIFAR100 is plotted in Figure 8. The full-network version of SGD is better calibrated than the last-layer version, as corroborated by an ECE of 0.096 against 0.18. This is consistent with Figure 5, where the full-network version shows more dispersed values of .
In Figure 9, the risk-coverage curve on ImageNet for SGDPE and Bootstrap is displayed. We observe that the confidence function of Bootstrap (using the maximum of the predictive posterior distribution as a confidence estimate) achieves a better selection than the confidence function of SGDPE. Besides, the curve also supports the preference of the former confidence function over the latter.
Following the setup of Figures 5 and 6, Figure 10 presents the histograms of the values of the confidence function on ImageNet, with 1 or 3 dense layers on top of NASNet, using the Bootstrap algorithm. We observe the inverse phenomenon compared to Figure 5: as the number of dense layers on top of NASNet increases, the histograms become more concentrated for both the correctly classified and the misclassified examples. Consistently, the 1-dense-layer version of Bootstrap is slightly better calibrated than the 3-dense-layer version: the reliability diagram is plotted in Figure 11, and the ECE values are 0.08 against 0.11. This echoes empirical observations in [GPSW17]: as a neural network grows in size and complexity, it tends to become less calibrated. It emphasizes the importance of the architecture on which the uncertainty algorithms are applied: on a convolutional-type structure, the model seems to become better calibrated; conversely, on several stacked dense layers, it tends to lose its calibration property.
Appendix B Out-of-Distribution Detection
Out-of-distribution detection, i.e. determining when a data point is not drawn from the training data distribution, is an important and difficult task. Its importance stems from the fact that we need robust models that acknowledge their own limitations. Detection is hard because high-dimensional probability distributions are challenging to deal with, and oftentimes require unreasonable amounts of data. Consequently, a flurry of work has been developed [HG17, HAB18, PAD18, LLLS18, SAK18, HMD19, SSL18, LLS18a, DT18, NMT19]; unfortunately, describing all of it is beyond the scope of this paper.
B.1 AUROC and AUPR in/out
Uncertainty estimates are an opportunity to detect out-of-distribution samples; with this in mind, the task reduces to a binary classification (in/out of distribution), and standard metrics such as the Area Under the Receiver Operating Characteristic curve (AUROC) and the Area Under the Precision-Recall curve (AUPR) can be used; see for example [HG17, LLS18b]. The in-distribution samples may be treated as the positive class and the out-of-distribution samples as the negative class (or vice versa). This binary classification is based on a score and a threshold: scores above the threshold are classified as positive and those below as negative. In our case, the score is given by a confidence function, and the out-of-distribution samples are expected to be the least confident inputs according to it. Define the true positive rate as TP/(TP + FN) and the false positive rate as FP/(FP + TN), where TP is the number of true positives, FN the number of false negatives, FP the number of false positives and TN the number of true negatives. The ROC curve plots the true positive rate against the false positive rate, and the AUROC can be interpreted as the probability that a positive example has a greater score than a negative example. Consequently, a random detector corresponds to an AUROC of 0.5 and a perfect classifier to an AUROC of 1.
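The rank-based interpretation above suggests a direct way to compute the AUROC without tracing the full ROC curve. The snippet below is a small illustration; the names are ours, and the O(nm) pairwise comparison is chosen for clarity rather than efficiency.

```python
import numpy as np

def auroc(scores_in, scores_out):
    """AUROC for in-distribution (positive) vs out-of-distribution (negative)
    scores, via the rank (Mann-Whitney) formulation: the probability that a
    positive example scores above a negative one, counting ties as 1/2."""
    s_in = np.asarray(scores_in, dtype=float)
    s_out = np.asarray(scores_out, dtype=float)
    greater = (s_in[:, None] > s_out[None, :]).mean()   # positive beats negative
    ties = (s_in[:, None] == s_out[None, :]).mean()     # ties contribute 1/2
    return greater + 0.5 * ties

# a perfect detector gives 1.0; identical score distributions give 0.5
assert auroc([0.9, 0.8], [0.2, 0.1]) == 1.0
assert auroc([0.5, 0.5], [0.5, 0.5]) == 0.5
```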
The AUROC is not ideal when the positive and negative classes have greatly differing base rates, and the AUPR adjusts for these differing base rates [DG06, SR15]. The PR curve plots precision against recall. A random detector has an AUPR equal to the fraction of positive samples in the dataset, while a perfect classifier has an AUPR of 1. Since the baseline AUPR equals the fraction of positive samples, the positive class must be specified; in view of this, the AUPRs are displayed both when the in-distribution samples are treated as positive (AUPR in) and when the out-of-distribution samples are treated as positive (AUPR out).
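Similarly, the AUPR for either choice of positive class can be computed in average-precision form. This is a minimal sketch with our own naming, assuming higher scores mean more confident; for AUPR out, one would negate the scores and pass the out-of-distribution set as `scores_pos`, since those samples are expected to be the least confident.

```python
import numpy as np

def aupr(scores_pos, scores_neg):
    """Area under the precision-recall curve (average-precision form),
    where scores_pos holds the examples treated as the positive class."""
    scores = np.concatenate([scores_pos, scores_neg])
    labels = np.concatenate([np.ones(len(scores_pos)), np.zeros(len(scores_neg))])
    order = np.argsort(-scores)                     # highest score first
    labels = labels[order]
    tp = np.cumsum(labels)                          # true positives per threshold
    precision = tp / np.arange(1, len(labels) + 1)
    # average precision: precision evaluated at each newly recalled positive
    return float((precision * labels).sum() / labels.sum())

# perfect separation yields an AUPR of 1.0
assert aupr([0.9, 0.8], [0.2, 0.1]) == 1.0
```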
B.2 Experimental Results and Discussion
Empirical evaluation of out-of-distribution behavior is hard: there are many ways not to match the training distribution, some more radical than others. We want to test on reasonably similar out-of-distribution examples; thus, we decided to train our models on the first half of the classes while treating the other half as out-of-distribution samples. At this point, we do not include ImageNet in these experiments, since the full training of NASNet was too computationally intensive; we leave this as future work.
For CIFAR10/100, we follow a standard training procedure with a decaying learning rate over 250 epochs (https://github.com/geifmany/cifarvgg). For MNIST, we use the default Adam optimizer over 20 epochs (https://keras.io/optimizers/). The point-estimate weights are used both as a reference (SGDPE) and as the starting point for the last-layer algorithms. These algorithms, MCDropout, SGD, SGLD and Bootstrap, are then trained on (the encoded) half of the classes of the MNIST and CIFAR10/100 datasets.
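The class-splitting protocol above can be sketched in a few lines. This is a simplified illustration with hypothetical names, not the training code actually used.

```python
import numpy as np

def split_in_out(x, y, n_classes):
    """Treat the first half of the classes as in-distribution (for training)
    and the second half as out-of-distribution (evaluation only)."""
    in_mask = y < n_classes // 2
    x_in, y_in = x[in_mask], y[in_mask]
    x_out = x[~in_mask]            # labels of OOD points are never used
    return x_in, y_in, x_out

# e.g. for a CIFAR10-like label vector, classes 0-4 are in, 5-9 are out
y = np.array([0, 3, 5, 9, 4, 7])
x = np.arange(6)
x_in, y_in, x_out = split_in_out(x, y, 10)
assert list(y_in) == [0, 3, 4] and list(x_out) == [2, 3, 5]
```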
In Tables 8, 7 and 6, we report the results for AUROC. For completeness, the AUPR out results are shown in Tables 11, 10 and 9 and the AUPR in results are presented in Tables 14, 13 and 12.
The confidence functions used for the computation of AUROC and AUPR in/out are SR, STD and q, the entropy of . AUROC values using the entropy are not reported on the MNIST and CIFAR10 datasets because this confidence function gives consistently lower AUROC values. max is the maximum of the two or three AUROC/AUPR in/out values, and increase is the ratio of this max to the SGDPE reference. Recall that the higher the AUROC or AUPR in/out, the better. The takeaway messages are aligned with those of the previous section: last-layer versions perform comparably to full-network versions, SR dominates the other confidence functions, and SGDPE is a strong contender.
AUROC results (MNIST).

algorithm | AUROC sr | AUROC std | max AUROC | increase
dropout | 0.916 | 0.901 | 0.916 | 1.033
dropout full | 0.940 | 0.928 | 0.940 | 1.061
bootstrap | 0.872 | 0.885 | 0.885 | 0.998
bootstrap full | 0.898 | 0.908 | 0.909 | 1.025
sgd | 0.886 | 0.895 | 0.895 | 1.009
sgd full | 0.933 | 0.936 | 0.936 | 1.056
sgld | 0.903 | 0.918 | 0.918 | 1.036
sgld full | 0.938 | 0.941 | 0.941 | 1.062
sgdpe | 0.886 | – | 0.886 | 1.000
AUROC results (CIFAR10).

algorithm | AUROC sr | AUROC std | max AUROC | increase
dropout | 0.791 | 0.793 | 0.793 | 1.005
dropout full | 0.795 | 0.792 | 0.795 | 1.007
bootstrap | 0.790 | 0.777 | 0.790 | 1.001
bootstrap full | 0.789 | 0.794 | 0.794 | 1.006
sgd | 0.792 | 0.772 | 0.792 | 1.003
sgd full | 0.791 | 0.788 | 0.791 | 1.002
sgld | 0.789 | 0.794 | 0.794 | 1.006
sgld full | 0.790 | 0.786 | 0.790 | 1.000
sgdpe | 0.789 | – | 0.789 | 1.000
AUROC results (CIFAR100).

algorithm | AUROC q | AUROC sr | AUROC std | max AUROC | increase
dropout | 0.575 | 0.722 | 0.719 | 0.722 | 1.010
dropout full | 0.736 | 0.731 | 0.658 | 0.736 | 1.030
bootstrap | 0.499 | 0.717 | 0.697 | 0.717 | 1.003
bootstrap full | 0.653 | 0.720 | 0.703 | 0.720 | 1.007
sgd | 0.546 | 0.726 | 0.707 | 0.726 | 1.015
sgd full | 0.694 | 0.719 | 0.704 | 0.719 | 1.006
sgld | 0.599 | 0.728 | 0.718 | 0.728 | 1.018
sgld full | 0.576 | 0.713 | 0.710 | 0.713 | 0.998
sgdpe | – | 0.715 | – | 0.715 | 1.000
AUPR out results (MNIST).

algorithm | AUPR out q | AUPR out sr | AUPR out std | max AUPR out | increase
bootstrap | 0.611 | 0.891 | 0.895 | 0.895 | 0.997
bootstrap full | 0.584 | 0.909 | 0.913 | 0.913 | 1.017
dropout | 0.880 | 0.911 | 0.897 | 0.911 | 1.014
dropout full | 0.869 | 0.935 | 0.925 | 0.935 | 1.041
sgd | 0.619 | 0.903 | 0.905 | 0.905 | 1.008
sgd full | 0.599 | 0.932 | 0.932 | 0.932 | 1.038
sgld | 0.765 | 0.914 | 0.921 | 0.921 | 1.026
sgld full | 0.925 | 0.934 | 0.932 | 0.934 | 1.040
sgdpe | – | 0.898 | – | 0.898 | 1.000
AUPR out results (CIFAR10).

algorithm | AUPR out q | AUPR out sr | AUPR out std | max AUPR out | increase
bootstrap | 0.509 | 0.747 | 0.730 | 0.747 | 0.999
bootstrap full | 0.521 | 0.747 | 0.757 | 0.757 | 1.013
dropout | 0.566 | 0.748 | 0.751 | 0.751 | 1.005
dropout full | 0.651 | 0.749 | 0.744 | 0.749 | 1.002
sgd | 0.533 | 0.752 | 0.730 | 0.752 | 1.006
sgd full | 0.692 | 0.755 | 0.754 | 0.755 | 1.010
sgld | 0.512 | 0.747 | 0.754 | 0.754 | 1.008
sgld full | 0.573 | 0.749 | 0.749 | 0.751 | 1.004
sgdpe | – | 0.747 | – | 0.747 | 1.000
AUPR out results (CIFAR100).

algorithm | AUPR out q | AUPR out sr | AUPR out std | max AUPR out | increase
bootstrap | 0.509 | 0.670 | 0.631 | 0.670 | 1.003
bootstrap full | 0.645 | 0.678 | 0.636 | 0.678 | 1.015
dropout | 0.594 | 0.682 | 0.673 | 0.682 | 1.022
dropout full | 0.703 | 0.692 | 0.588 | 0.703 | 1.052
sgd | 0.570 | 0.687 | 0.658 | 0.687 | 1.028
sgd full | 0.668 | 0.674 | 0.635 | 0.674 | 1.009
sgld | 0.612 | 0.685 | 0.667 | 0.685 | 1.025
sgld full | 0.589 | 0.668 | 0.654 | 0.668 | 1.000
sgdpe | – | 0.668 | – | 0.668 | 1.000
AUPR in results (MNIST).

algorithm | AUPR in q | AUPR in sr | AUPR in std | max AUPR in | increase
bootstrap | 0.550 | 0.817 | 0.841 | 0.841 | 1.002
bootstrap full | 0.539 | 0.855 | 0.873 | 0.873 | 1.041
dropout | 0.817 | 0.914 | 0.899 | 0.914 | 1.090
dropout full | 0.765 | 0.938 | 0.925 | 0.938 | 1.119
sgd | 0.562 | 0.839 | 0.854 | 0.854 | 1.019
sgd full | 0.549 | 0.913 | 0.925 | 0.925 | 1.103
sgld | 0.641 | 0.866 | 0.894 | 0.895 | 1.067
sgld full | 0.881 | 0.924 | 0.935 | 0.935 | 1.114
sgdpe | – | 0.839 | – | 0.839 | 1.000
AUPR in results (CIFAR10).

algorithm | AUPR in q | AUPR in sr | AUPR in std | max AUPR in | increase
bootstrap | 0.503 | 0.811 | 0.799 | 0.811 | 1.001
bootstrap full | 0.510 | 0.809 | 0.813 | 0.813 | 1.003
dropout | 0.521 | 0.813 | 0.814 | 0.814 | 1.005
dropout full | 0.570 | 0.818 | 0.815 | 0.818 | 1.009
sgd | 0.509 | 0.808 | 0.790 | 0.808 | 0.996
sgd full | 0.606 | 0.809 | 0.805 | 0.809 | 0.998
sgld | 0.497 | 0.810 | 0.817 | 0.817 | 1.008
sgld full | 0.518 | 0.810 | 0.802 | 0.810 | 0.999
sgdpe | – | 0.810 | – | 0.810 | 1.000
AUPR in results (CIFAR100).

algorithm | AUPR in q | AUPR in sr | AUPR in std | max AUPR in | increase
bootstrap | 0.499 | 0.737 | 0.717 | 0.737 | 1.000
bootstrap full | 0.607 | 0.740 | 0.732 | 0.740 | 1.004
dropout | 0.542 | 0.734 | 0.732 | 0.734 | 0.996
dropout full | 0.737 | 0.741 | 0.701 | 0.741 | 1.005
sgd | 0.524 | 0.738 | 0.708 | 0.738 | 1.002
sgd full | 0.660 | 0.736 | 0.731 | 0.736 | 0.999
sgld | 0.560 | 0.750 | 0.737 | 0.750 | 1.017
sgld full | 0.544 | 0.735 | 0.732 | 0.735 | 0.997
sgdpe | – | 0.737 | – | 0.737 | 1.000
References
 [AAB15] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
 [ABA18] Kamyar Azizzadenesheli, Emma Brunskill, and Animashree Anandkumar. Efficient exploration through bayesian deep q-networks. arXiv preprint arXiv:1802.04412, 2018.
 [ABCG15] Claire Adam-Bourdarios, Glen Cowan, Cécile Germain, Isabelle Guyon, Balázs Kégl, and David Rousseau. The Higgs boson machine learning challenge. In Glen Cowan, Cécile Germain, Isabelle Guyon, Balázs Kégl, and David Rousseau, editors, Proceedings of the NIPS 2014 Workshop on High-energy Physics and Machine Learning, volume 42 of Proceedings of Machine Learning Research, pages 19–55, Montreal, Canada, 13 Dec 2015. PMLR.
 [ABW12] Sungjin Ahn, Anoop Korattikara Balan, and Max Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26  July 1, 2012, 2012.
 [AIP15] Ofélia Anjos, Carla Iglesias, Fatima Peres, Javier Martínez, Angela Garcia, and Javier Taboada. Neural networks applied to discriminate botanical origin of honeys. Food Chemistry, 175:128–136, 05 2015.
 [AOS16] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete Problems in AI Safety. arXiv e-prints, page arXiv:1606.06565, June 2016.
 [ASK18] Amr Alexandari, Avanti Shrikumar, and Anshul Kundaje. Selective Classification via Curve Optimization. arXiv e-prints, page arXiv:1802.07024, February 2018.
 [BB98] D. Barber and Christopher Bishop. Ensemble learning in bayesian neural networks. In Generalization in Neural Networks and Machine Learning, pages 215–237. Springer Verlag, January 1998.
 [BDM18] Nicolas Brosse, Alain Durmus, and Eric Moulines. The promises and pitfalls of stochastic gradient langevin dynamics. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8278–8288. Curran Associates, Inc., 2018.
 [Bis06] Christopher Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer-Verlag New York, 2006.
 [BW08] Peter L. Bartlett and Marten H. Wegkamp. Classification with a reject option using a hinge loss. J. Mach. Learn. Res., 9:1823–1840, June 2008.
 [C15] François Chollet et al. Keras. https://keras.io, 2015.
 [CDM16] Corinna Cortes, Giulia DeSalvo, and Mehryar Mohri. Boosting with abstention. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 1660–1668. Curran Associates, Inc., 2016.
 [CFG14] Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient hamiltonian Monte Carlo. In Proceedings of the 31st International Conference on Machine Learning, pages 1683–1691, 2014.
 [CPRD16] Roberto Calandra, Jan Peters, Carl Edward Rasmussen, and Marc Peter Deisenroth. Manifold gaussian processes for regression. In Neural Networks (IJCNN), 2016 International Joint Conference on, pages 3338–3345. IEEE, 2016.
 [Dal17] Arnak S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and logconcave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.
 [DDS09] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei. ImageNet: A LargeScale Hierarchical Image Database. In CVPR09, 2009.
 [DG06] Jesse Davis and Mark Goadrich. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, pages 233–240, New York, NY, USA, 2006. ACM.
 [DJV14] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 647–655, Bejing, China, 22–24 Jun 2014. PMLR.
 [DK17] Arnak S. Dalalyan and Avetik G. Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. arXiv e-prints, page arXiv:1710.00095, September 2017.
 [DT18] Terrance DeVries and Graham W. Taylor. Learning Confidence for Out-of-Distribution Detection in Neural Networks. arXiv e-prints, page arXiv:1802.04865, February 2018.
 [Efr12a] Bradley Efron. A 250-year argument: Belief, behavior, and the bootstrap. Bulletin of the American Mathematical Society, 50(1):129–146, apr 2012.
 [Efr12b] Bradley Efron. Bayesian inference and the parametric bootstrap. Ann. Appl. Stat., 6(4):1971–1997, 12 2012.
 [FHT01] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics New York, NY, USA:, 2001.
 [Fre65] Linton C. Freeman. Elementary Applied Statistics. New York: John Wiley and Sons, 1965.
 [Gal16] Yarin Gal. Uncertainty in deep learning. PhD thesis, 2016.
 [GECd18] Alexandre Garcia, Slim Essid, Chloé Clavel, and Florence d’Alché-Buc. Structured Output Learning with Abstention: Application to Accurate Opinion Prediction. arXiv e-prints, page arXiv:1803.08355, March 2018.
 [GEY17] Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4878–4887. Curran Associates, Inc., 2017.
 [GG16] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning  Volume 48, ICML’16, pages 1050–1059. JMLR.org, 2016.
 [GIP18] Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry Vetrov, and Andrew Gordon Wilson. Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs. arXiv e-prints, page arXiv:1802.10026, February 2018.
 [GPSW17] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
 [GSC13] Andrew Gelman, Hal S Stern, John B Carlin, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian data analysis. Chapman and Hall/CRC, 2013.
 [GSS15] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
 [GUE18] Y. Geifman, G. Uziel, and R. El-Yaniv. Bias-Reduced Uncertainty Estimation for Deep Neural Classifiers. arXiv e-prints, May 2018.
 [HAB18] Matthias Hein, Maksym Andriushchenko, and Julian Bitterwolf. Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem. arXiv e-prints, page arXiv:1812.05720, December 2018.

 [HB15] Matthew Hoffman and David Blei. Stochastic Structured Variational Inference. In Guy Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 361–369, San Diego, California, USA, 09–12 May 2015. PMLR.
 [HCB16] Jonathan Huggins, Trevor Campbell, and Tamara Broderick. Coresets for scalable bayesian logistic regression. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4080–4088. Curran Associates, Inc., 2016.
 [HG17] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. Proceedings of International Conference on Learning Representations, 2017.
 [HLP17] Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E. Hopcroft, and Kilian Q. Weinberger. Snapshot Ensembles: Train 1, get M for free. arXiv e-prints, page arXiv:1704.00109, March 2017.

 [HMD19] Dan Hendrycks, Mantas Mazeika, and Thomas Dietterich. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations, 2019.
 [HRH18] Siyu He, Siamak Ravanbakhsh, and Shirley Ho. Analysis of cosmic microwave background with deep learning, 2018.

 [HvC93] Geoffrey E. Hinton and Drew van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, COLT ’93, pages 5–13, New York, NY, USA, 1993. ACM.
 [IG17] Tomoharu Iwata and Zoubin Ghahramani. Improving Output Uncertainty Estimation and Generalization in Deep Learning via Neural Network Gaussian Processes. arXiv e-prints, page arXiv:1707.05922, July 2017.
 [KA13] Martin Krzywinski and Naomi Altman. Importance of being uncertain. Nature Methods, 10:809 EP –, Aug 2013.
 [KB15] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

 [KG17] Alex Kendall and Yarin Gal. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? In Advances in Neural Information Processing Systems 30 (NIPS), 2017.
 [Kri09] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
 [LAA17] Christian Leibig, Vaneeda Allken, Murat Seçkin Ayhan, Philipp Berens, and Siegfried Wahl. Leveraging uncertainty information from deep neural networks for disease detection. Scientific reports, 7(1):17816, 2017.
 [LBBH98] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998.
 [LKN18] Yun Liu, Timo Kohlberger, Mohammad Norouzi, George E Dahl, Jenny L Smith, Arash Mohtashamian, Niels Olson, Lily H Peng, Jason D Hipp, and Martin C Stumpe. Artificial intelligence–based breast cancer nodal metastasis detection: Insights into the black box for pathologists. Archives of pathology & laboratory medicine, 2018.
 [LLLS18] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 7167–7177. Curran Associates, Inc., 2018.
 [LLS18a] Shiyu Liang, Yixuan Li, and R. Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In International Conference on Learning Representations, 2018.
 [LLS18b] Shiyu Liang, Yixuan Li, and R. Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In International Conference on Learning Representations, 2018.
 [LPB17] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6402–6413. Curran Associates, Inc., 2017.
 [LPC15] Stefan Lee, Senthil Purushwalkam, Michael Cogswell, David Crandall, and Dhruv Batra. Why M Heads are Better than One: Training a Diverse Ensemble of Deep Networks. arXiv e-prints, page arXiv:1511.06314, November 2015.
 [MHB17] Stephan Mandt, Matthew D. Hoffman, and David M. Blei. Stochastic gradient descent as approximate bayesian inference. J. Mach. Learn. Res., 18(1):4873–4907, January 2017.
 [MKG18] R. Michelmore, M. Kwiatkowska, and Y. Gal. Evaluating Uncertainty Quantification in End-to-End Autonomous Driving Control. arXiv e-prints, November 2018.
 [NDH17] Tigran Nagapetyan, Andrew B. Duncan, Leonard Hasenclever, Sebastian J. Vollmer, Lukasz Szpruch, and Konstantinos Zygalakis. The True Cost of Stochastic Gradient Langevin Dynamics. arXiv e-prints, page arXiv:1706.02692, June 2017.
 [Nea96] Radford M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag, Berlin, Heidelberg, 1996.
 [NMT19] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. Do deep generative models know what they don’t know? In International Conference on Learning Representations, 2019.
 [NPAA18] Thejas Nair, Doina Precup, Douglas L. Arnold, and Tal Arbel. Exploring uncertainty measures in deep networks for multiple sclerosis lesion detection and segmentation. In MICCAI, 2018.

 [PAD18] Stanislav Pidhorskyi, Ranya Almohsen, and Gianfranco Doretto. Generative probabilistic novelty detection with adversarial autoencoders. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 6823–6834. Curran Associates, Inc., 2018.
 [RASC14] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPRW ’14, pages 512–519, Washington, DC, USA, 2014. IEEE Computer Society.
 [RBB18] Hippolyt Ritter, Aleksandar Botev, and David Barber. A scalable laplace approximation for neural networks. In International Conference on Learning Representations, 2018.
 [RT96] G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.
 [RTS18] Carlos Riquelme, George Tucker, and Jasper Snoek. Deep bayesian bandits showdown: An empirical comparison of bayesian deep networks for thompson sampling. arXiv preprint arXiv:1802.09127, 2018.
 [RWR18] A. Radovic, Mike Williams, Dérick Rousseau, Michael Kagan, Daniele Bonacorsi, Alexander Himmel, Adam Aurisano, Kazuhiro Terao, and Taritree M Wongjirad. Machine learning at the energy and intensity frontiers of particle physics. Nature, 560:41–48, 2018.
 [SAK18] Gabi Shalev, Yossi Adi, and Joseph Keshet. Outofdistribution detection using multiple semantic label representations. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 7386–7396. Curran Associates, Inc., 2018.
 [SHF16] Saurabh Singh, Derek Hoiem, and David Forsyth. Swapout: Learning an ensemble of deep architectures. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 28–36. Curran Associates, Inc., 2016.
 [SHK14] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
 [SR15] T. Saito and M. Rehmsmeier. The precisionrecall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE, 10(3):e0118432, 2015.
 [SRS15] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mr Prabhat, and Ryan Adams. Scalable bayesian optimization using deep neural networks. In International Conference on Machine Learning, pages 2171–2180, 2015.
 [SSL18] Alireza Shafaei, Mark Schmidt, and James J. Little. Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of “Outlier” Detectors. arXiv e-prints, page arXiv:1809.04729, September 2018.
 [TBB19] Sunil Thulasidasan, Tanmoy Bhattacharya, Jeffrey Bilmes, Gopinath Chennupati, and Jamal MohdYusof. Knows when it doesn’t know: Deep abstaining classifiers, 2019.
 [WHSX16] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. Deep kernel learning. In Arthur Gretton and Christian C. Robert, editors, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 370–378, Cadiz, Spain, 09–11 May 2016. PMLR.
 [WJ08] Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1(12):1–305, January 2008.
 [WT11] Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, pages 681–688, USA, 2011. Omnipress.
 [YCBL14] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Proceedings of the 27th International Conference on Neural Information Processing Systems  Volume 2, NIPS’14, pages 3320–3328, Cambridge, MA, USA, 2014. MIT Press.
 [ZL17] Lingxue Zhu and Nikolay Laptev. Deep and confident prediction for time series at uber. 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pages 103–110, 2017.
 [ZLA18] Jiaming Zeng, Adam Lesnikowski, and Jose M. Alvarez. The Relevance of Bayesian Layer Positioning to Model Uncertainty in Deep Bayesian Active Learning. arXiv e-prints, page arXiv:1811.12535, November 2018.