Uncertainty Estimation with Infinitesimal Jackknife, Its Distribution and Mean-Field Approximation

06/13/2020 · by Zhiyun Lu et al. · University of Southern California, Google

Uncertainty quantification is an important research area in machine learning. Many approaches have been developed to improve the representation of uncertainty in deep models and to avoid overconfident predictions. Existing approaches such as Bayesian neural networks and ensemble methods require modifications to the training procedure and are computationally costly for both training and inference. Motivated by this, we propose mean-field infinitesimal jackknife (mfIJ) – a simple, efficient, and general-purpose plug-in estimator for uncertainty estimation. The main idea is to use the infinitesimal jackknife, a classical tool from statistics for uncertainty estimation, to construct a pseudo-ensemble that can be described with a closed-form Gaussian distribution, without retraining. We then use this Gaussian distribution for uncertainty estimation. While the standard way is to sample models from this distribution and combine each sample's prediction, we develop a mean-field approximation to the inference, where Gaussian random variables need to be integrated with the softmax nonlinearity to generate probabilities for multinomial variables. The approach has many appealing properties: it functions as an ensemble without requiring multiple models, and it enables closed-form approximate inference using only the first and second moments of Gaussians. Empirically, mfIJ performs competitively when compared to state-of-the-art methods, including deep ensembles, temperature scaling, dropout, and Bayesian neural networks, on important uncertainty tasks. In particular, it outperforms many methods on out-of-distribution detection.


1 Introduction

Recent advances in deep neural nets have dramatically improved predictive accuracy in supervised tasks. For many applications such as autonomous vehicle control and medical diagnosis, decision-making also needs accurate estimation of the uncertainty pertinent to the prediction. Unfortunately, deep neural nets are known to output overconfident, mis-calibrated predictions Guo et al. (2017).

It is crucial to improve deep models' ability to represent uncertainty. There has been a steady development of new methods for uncertainty quantification in deep neural nets. One popular idea is to introduce additional stochasticity (such as temperature annealing or dropout in the network architecture) into existing trained models to represent uncertainty Gal and Ghahramani (2016); Guo et al. (2017). Another line of work uses an ensemble of models to collectively represent the uncertainty about the predictions. This ensemble can be obtained by varying training with respect to initialization Lakshminarayanan et al. (2017), hyper-parameters Ashukha et al. (2020), or data partitions, i.e., bootstraps Efron (1992); Schulam and Saria (2019). Yet another line of work uses Bayesian neural networks (bnn), which can be seen as an ensemble of an infinite number of models characterized by the posterior distribution Blundell et al. (2015); MacKay (1992). In practice, one samples models from the posterior or uses variational inference. Each of these methods offers different trade-offs among computational cost, memory consumption, parallelization, and modeling flexibility. For example, while ensemble methods are often state-of-the-art, they are both computationally and memory intensive, as they repeat the training procedure and store the resulting models.

These costs stand in stark contrast to many practitioners' desiderata. Ideally, neither training nor inference would incur additional memory or computational costs for estimating uncertainty beyond what is needed for making predictions. Additionally, it is desirable to be able to quantify uncertainty on an existing model when re-training is not possible.

In this work, we propose a new method to bridge the gap. The main idea of our approach is to use Infinitesimal jackknife, a classical tool from statistics for uncertainty estimation Jaeckel (1972), to construct a pseudo-ensemble that can be described with a closed-form Gaussian distribution. We then use this Gaussian distribution for uncertainty estimation. While the standard way is to sample models from this distribution and combine each sample’s prediction, we develop a mean-field approximation to the inference, where Gaussian random variables need to be integrated with the softmax nonlinear functions to generate probabilities for multinomial variables.

We show that the proposed approach, which we refer to as mean-field infinitesimal jackknife (mfIJ), often surpasses or is competitive with existing approaches on the evaluation metrics of NLL, ECE, and out-of-distribution detection accuracy on several benchmark datasets.

mfIJ shares appealing properties with several recent approaches for uncertainty estimation: constructing the pseudo-ensemble with the infinitesimal jackknife does not require changing existing training procedures Schulam and Saria (2019); Barber et al. (2019); Giordano et al. (2018, 2019); Koh and Liang (2017); approximating the ensemble with a distribution removes the need to store many models – an impractical task for modern learning models Chen et al. (2016); Maddox et al. (2019); and the pseudo-ensemble distribution has a similar form to the Laplace approximation for Bayesian inference Barber and Bishop (1998); MacKay (1992), so existing tools from computational statistics such as Kronecker product factorization can be directly applied.

The mean-field approximation brings an additional appeal. It is in closed form and needs only the first and second moments of the Gaussian random variables. In our case, the first moments are simply the predictions of the network, while the second moments involve the product between the inverse Hessian and a vector, which can be computed efficiently Agarwal et al. (2016); Martens (2010); Pearlmutter (1994). Additionally, the mean-field approximation can be applied whenever integrals of a similar form need to be approximated. In Appendix B.3, we demonstrate its utility by applying it to the recently proposed swag algorithm for uncertainty estimation, where the Gaussian distribution is derived differently Chen et al. (2016); Izmailov et al. (2018); Maddox et al. (2019); Mandt et al. (2017).

We describe our approach in §2, followed by a discussion on the related work in §3. Empirical studies are reported in §4, and we conclude in §5.

2 Approach

In this section, we start by introducing the necessary notation and defining the task of uncertainty estimation. We then describe the technique of infinitesimal jackknife in §2.1. We derive a closed-form Gaussian distribution over an infinite number of models estimated with the infinitesimal jackknife – we call this collection a pseudo-ensemble. We describe how to use this distribution for uncertainty estimation in §2.2. We present our efficient mean-field approximation to the Gaussian-softmax integral in §2.3. Lastly, we discuss the hyper-parameters of our method and present the algorithm in §2.4.

Notation We are given a training set of $N$ i.i.d. samples $\mathcal{D} = \{z_n\}_{n=1}^N$, where $z_n = (x_n, y_n)$ with input $x_n$ and target $y_n$. We fit the data to a parametric predictive model $f(x; \theta)$. We define the loss on a sample as $\ell(z; \theta)$ and optimize the model's parameters via empirical risk minimization on $\mathcal{D}$. The minimizer is given by

$$\hat{\theta} = \operatorname*{argmin}_{\theta} \; \frac{1}{N}\sum_{n=1}^{N}\ell(z_n; \theta). \qquad (1)$$

In practice, we are interested not only in the prediction but also in quantifying the uncertainty of making such a prediction. In this paper, we consider (deep) neural networks as the predictive model.

2.1 Infinitesimal jackknife and its distribution

Jackknife is a well-known resampling method to estimate the confidence interval of an estimator Tukey (1958); Efron and Stein (1981). It is a straightforward procedure. Each element $z_n$ is left out from the dataset to form a unique "leave-one-out" jackknife sample $\mathcal{D}_{-n} = \mathcal{D} \setminus \{z_n\}$. A jackknife sample's estimate of $\theta$ is given by

$$\hat{\theta}_{-n} = \operatorname*{argmin}_{\theta} \; \frac{1}{N-1}\sum_{m \ne n}\ell(z_m; \theta). \qquad (2)$$

We obtain $N$ such samples $\{\hat{\theta}_{-n}\}_{n=1}^{N}$ and use them to estimate the variances of $\hat{\theta}$ and of the predictions made with $\hat{\theta}$. In this vein, the jackknife is a form of ensemble method.

However, it is not feasible to retrain modern neural networks $N$ times, when $N$ is often on the order of millions. The infinitesimal jackknife is a classical tool to approximate $\hat{\theta}_{-n}$ without re-training on $\mathcal{D}_{-n}$. It is often used as a theoretical tool for asymptotic analysis Jaeckel (1972), and is closely related to influence functions in robust statistics Cook and Weisberg (1982). Recent studies have brought renewed interest in applying this methodology to machine learning problems Giordano et al. (2018); Koh and Liang (2017). Here, we briefly summarize the method.

Linear approximation. The basic idea behind the infinitesimal jackknife is to treat $\hat{\theta}$ and $\hat{\theta}_{-n}$ as special cases of an estimator on weighted samples,

$$\hat{\theta}(w) = \operatorname*{argmin}_{\theta} \; \sum_{n=1}^{N} w_n\,\ell(z_n; \theta), \qquad (3)$$

where the weights $w = (w_1, \ldots, w_N)$ form an $(N-1)$-simplex: $w_n \ge 0$ and $\sum_n w_n = 1$. Thus the maximum likelihood estimate $\hat{\theta}$ is recovered when $w = \frac{1}{N}\mathbf{1}$. A jackknife sample's estimate $\hat{\theta}_{-n}$, on the other end, is recovered when $w = \frac{1}{N-1}(\mathbf{1} - e_n)$, where $e_n$ is the all-zero vector except for a value of 1 at the $n$-th coordinate.

Using the first-order Taylor expansion of $\hat{\theta}(w)$ around $w = \frac{1}{N}\mathbf{1}$, we obtain (under the conditions of twice-differentiability and invertibility of the Hessian)

$$\hat{\theta}_{-n} \approx \hat{\theta} + \frac{1}{N}\,H(\hat{\theta})^{-1}\,\nabla_{\theta}\,\ell(z_n; \hat{\theta}), \qquad (4)$$

where $H(\hat{\theta}) = \frac{1}{N}\sum_n \nabla^2_{\theta}\,\ell(z_n; \hat{\theta})$ is the Hessian matrix of the empirical risk evaluated at $\hat{\theta}$, and $\nabla_{\theta}\,\ell(z_n; \hat{\theta})$ is the gradient of the loss on $z_n$ evaluated at $\hat{\theta}$. We use $H$ and $g_n$ as shorthands when there is enough context to avoid confusion.

An infinite number of infinitesimal jackknife samples. If the number of samples $N \to \infty$, we can characterize the "infinite" number of $\hat{\theta}_{-n}$ with a closed-form Gaussian distribution $\mathcal{N}(\hat{\theta}, \Sigma_{\mathrm{IJ}})$, using their sample mean and covariance as the distribution's mean and covariance,

$$\mathbb{E}\big[\hat{\theta}_{-n}\big] = \hat{\theta}, \qquad \Sigma_{\mathrm{IJ}} = \frac{1}{N^2}\,H^{-1} F H^{-1}, \qquad (5)$$

where $F = \frac{1}{N}\sum_n g_n g_n^{\mathsf{T}}$ denotes the observed Fisher information matrix.

Infinite infinitesimal bootstraps. The above procedure and analysis can be extended to bootstrapping (i.e., resampling with replacement). Similarly, to characterize the estimates from the bootstraps, we can also use a Gaussian distribution – details omitted here for brevity,

$$\mathcal{N}(\hat{\theta}, \Sigma_{\mathrm{B}}), \qquad \Sigma_{\mathrm{B}} = \frac{1}{N}\,H^{-1} F H^{-1}. \qquad (6)$$

We refer to the distributions $\mathcal{N}(\hat{\theta}, \Sigma_{\mathrm{IJ}})$ and $\mathcal{N}(\hat{\theta}, \Sigma_{\mathrm{B}})$ as the pseudo-ensemble distributions. We can further approximate $F$ by $H$ to obtain $\hat{\Sigma}_{\mathrm{IJ}} = \frac{1}{N^2}H^{-1}$ and $\hat{\Sigma}_{\mathrm{B}} = \frac{1}{N}H^{-1}$.
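To make the construction concrete, below is a minimal numpy sketch that builds these covariances from per-example gradients and a damped Hessian, assuming both fit in memory as dense arrays. The function name, damping value, and array layout are ours, and the scaling constants follow the reconstruction of eqs. (5)-(6) above rather than a reference implementation.

```python
import numpy as np

def pseudo_ensemble_covariances(per_example_grads, hessian, damping=1e-3):
    """Covariances of the pseudo-ensemble distributions, cf. eqs. (5) and (6).

    per_example_grads: (N, d) array; row n is the gradient g_n of the loss on
        sample z_n at the minimizer theta_hat.
    hessian: (d, d) average Hessian H of the loss at theta_hat.
    damping: small ridge term added before inversion (cf. Section 4.1).
    """
    N, d = per_example_grads.shape
    H = hessian + damping * np.eye(d)
    H_inv = np.linalg.inv(H)                          # dense inverse; only viable for small d
    F = per_example_grads.T @ per_example_grads / N   # observed Fisher information
    sandwich = H_inv @ F @ H_inv
    sigma_ij = sandwich / N**2                        # infinitesimal jackknife, eq. (5)
    sigma_b = sandwich / N                            # infinitesimal bootstrap, eq. (6)
    # Approximating F by H gives the cheaper variants mentioned above.
    sigma_ij_hat = H_inv / N**2
    sigma_b_hat = H_inv / N
    return sigma_ij, sigma_b, sigma_ij_hat, sigma_b_hat
```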

Lakshminarayanan et al. (2017) observed that using models trained on bootstrapped samples does not work as well empirically as other approaches, as the learner only sees about 63% of the unique data points in each bootstrap sample. We note that this is an empirical limitation rather than a theoretical one: in practice, we can only train a very limited number of models. We hypothesize, however, that we can get the benefit of combining an infinite number of models without training them. Empirical results validate this hypothesis.

2.2 Sampling-based uncertainty estimation with the pseudo-ensemble distributions

Given the general form of the pseudo-ensemble distributions $\theta \sim \mathcal{N}(\hat{\theta}, \Sigma)$, it is straightforward to see that, if we approximate the predictive function with a linear function of the parameters,

$$f(x; \theta) \approx f(x; \hat{\theta}) + \nabla_{\theta} f(x; \hat{\theta})^{\mathsf{T}}(\theta - \hat{\theta}),$$

we can then regard the predictions by the models as a Gaussian distributed random variable,

$$f(x; \theta) \sim \mathcal{N}\!\left(f(x; \hat{\theta}),\; \nabla_{\theta} f(x; \hat{\theta})^{\mathsf{T}}\,\Sigma\,\nabla_{\theta} f(x; \hat{\theta})\right).$$

For predictive functions whose outputs are approximately Gaussian, it might be adequate to characterize the uncertainty with the approximated mean and variance. However, for predictive functions whose outputs are categorical, this approach is not applicable.

A standard way is to use sampling to combine the discrete predictions from the models in the ensemble. For example, for classification where $p(y = c \mid x; \theta)$ is the probability of labeling $x$ with the $c$-th category, the averaged prediction from the ensemble is then

$$p(y = c \mid x) \;\approx\; \frac{1}{S}\sum_{s=1}^{S} \mathbb{1}\!\left[c = \operatorname*{argmax}_{c'}\,p(y = c' \mid x;\, \theta^{(s)})\right], \qquad (7)$$

where $\theta^{(s)} \sim \mathcal{N}(\hat{\theta}, \Sigma)$, and $\mathbb{1}[\cdot]$ is the indicator function. In the next section, we propose a new approach that avoids sampling and directly approximates the ensemble prediction of discrete labels.
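As a reference point, here is a minimal numpy sketch of the sampling-based baseline: it draws last-layer weights from the pseudo-ensemble distribution and averages the softmax outputs (the hard-vote variant via the indicator in eq. (7) simply replaces the average of probabilities with an average of one-hot argmax decisions). Function and argument names are ours.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def sampled_ensemble_predict(phi, w_hat, sigma, num_samples=100, rng=None):
    """Average predictions from last-layer weights sampled from N(w_hat, sigma).

    phi:   (d,) feature vector from the layers below the softmax layer.
    w_hat: (d, C) trained weights of the softmax layer.
    sigma: (d*C, d*C) pseudo-ensemble covariance of the vectorized (row-major) weights.
    """
    rng = np.random.default_rng() if rng is None else rng
    d, C = w_hat.shape
    flat = rng.multivariate_normal(w_hat.ravel(), sigma, size=num_samples)
    probs = np.stack([softmax(phi @ w.reshape(d, C)) for w in flat])
    return probs.mean(axis=0)   # soft average; a hard vote would average argmax decisions instead
```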

2.3 Mean-field approximation for Gaussian-softmax integration

In deep neural networks for classification, the predictions are the outputs of the softmax layer,

$$p(y = c \mid x; \theta) = \operatorname{softmax}_c(a) = \frac{e^{a_c}}{\sum_{c'} e^{a_{c'}}}, \qquad a = W^{\mathsf{T}}\phi(x), \qquad (8)$$

where $\phi(x)$ is the transformation of the input through the layers before the fully-connected layer, and $W$ is the connection weights of the softmax layer. We focus on the case where $\phi(x)$ is deterministic – extending to random $\phi(x)$ is straightforward. As discussed, we assume the pseudo-ensemble on $W$ forms a Gaussian distribution $\mathcal{N}(\hat{W}, \Sigma)$. Then the activations $a = W^{\mathsf{T}}\phi(x)$ are themselves Gaussian,

$$a \sim \mathcal{N}(\mu, V), \qquad \mu = \hat{W}^{\mathsf{T}}\phi(x), \qquad V_{cc'} = \phi(x)^{\mathsf{T}}\,\Sigma_{cc'}\,\phi(x), \qquad (9)$$

where $\Sigma_{cc'}$ is the covariance block between the weights of the $c$-th and $c'$-th output units.
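Concretely, the logit mean and covariance of eq. (9) can be computed as in the following sketch; the block layout chosen for the weight covariance is an assumption made for readability, and any vectorized layout works the same way.

```python
import numpy as np

def logit_gaussian(phi, w_hat, sigma_w):
    """Propagate last-layer weight uncertainty to the logits, cf. eq. (9).

    phi:     (d,) feature vector phi(x).
    w_hat:   (d, C) softmax-layer weights.
    sigma_w: (C, C, d, d) covariance blocks, where sigma_w[c, cp] is Cov(w_c, w_cp).
    """
    mu = phi @ w_hat                                   # (C,) logit means
    C = w_hat.shape[1]
    V = np.empty((C, C))
    for c in range(C):
        for cp in range(C):
            V[c, cp] = phi @ sigma_w[c, cp] @ phi      # V_{cc'} = phi^T Sigma_{cc'} phi
    return mu, V
```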

We give a detailed derivation in Appendix A; the quantity to compute is the expectation of the softmax under this Gaussian,

$$p(y = c \mid x) = \mathbb{E}_{a \sim \mathcal{N}(\mu, V)}\!\left[\frac{e^{a_c}}{\sum_{c'} e^{a_{c'}}}\right]. \qquad (10)$$

The key idea is to apply the mean-field approximation, which integrates each term of the denominator independently (see Appendix A), and to use the following well-known formula for the Gaussian integral of a sigmoid function $\sigma(x) = 1/(1 + e^{-x})$,

$$\int \sigma(x)\,\mathcal{N}(x;\, m,\, s^2)\,dx \;\approx\; \sigma\!\left(\frac{m}{\sqrt{1 + \lambda s^2}}\right),$$

where $\lambda$ is a constant and is usually chosen to be $\pi/8$ or $3/\pi^2$. In the softmax case, we arrive at three variants:

$$\text{mf}_0:\quad p(y = c \mid x) \;\approx\; \Big[\sum_{c'} \exp\!\Big(\tfrac{\mu_{c'} - \mu_c}{\sqrt{1 + \lambda V_{cc}}}\Big)\Big]^{-1}, \qquad (11)$$

$$\text{mf}_1:\quad p(y = c \mid x) \;\approx\; \Big[\sum_{c'} \exp\!\Big(\tfrac{\mu_{c'} - \mu_c}{\sqrt{1 + \lambda (V_{cc} + V_{c'c'})}}\Big)\Big]^{-1}, \qquad (12)$$

$$\text{mf}_2:\quad p(y = c \mid x) \;\approx\; \Big[\sum_{c'} \exp\!\Big(\tfrac{\mu_{c'} - \mu_c}{\sqrt{1 + \lambda (V_{cc} + V_{c'c'} - 2V_{cc'})}}\Big)\Big]^{-1}. \qquad (13)$$

The three approximations differ in how much information from $V$ is considered: $\text{mf}_0$ does not consider the variance of $a_{c'}$, $\text{mf}_1$ considers its variance $V_{c'c'}$, and $\text{mf}_2$ additionally considers its covariance $V_{cc'}$. Note that $\text{mf}_0$ and $\text{mf}_1$ are computationally preferred over $\text{mf}_2$, which uses the full $C \times C$ covariance, where $C$ is the number of classes. (Daunizeau (2017) derived an approximation in the form of $\text{mf}_2$ but did not apply it to uncertainty estimation.)
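The three variants reduce to a few lines of array code. The sketch below assumes the logit mean and covariance from eq. (9) are available as dense arrays, uses $\lambda = \pi/8$, and includes the renormalization discussed under "Implementation nuances" below; it follows our reconstruction of eqs. (11)-(13) rather than a reference implementation.

```python
import numpy as np

LAMBDA = np.pi / 8.0   # constant in the sigmoid-Gaussian formula; 3/pi**2 is another common choice

def mean_field_softmax(mu, V, variant=0, lam=LAMBDA):
    """Mean-field approximations mf0/mf1/mf2 to E[softmax(a)], a ~ N(mu, V).

    mu: (C,) logit means; V: (C, C) logit covariance.
    """
    C = mu.shape[0]
    diag = np.diag(V)
    diff = mu[None, :] - mu[:, None]                   # diff[c, c'] = mu_c' - mu_c
    if variant == 0:       # mf0: only the variance of a_c
        var = np.tile(diag[:, None], (1, C))
    elif variant == 1:     # mf1: independent marginals of a_c and a_c'
        var = diag[:, None] + diag[None, :]
    else:                  # mf2: full covariance of a_c - a_c'
        var = diag[:, None] + diag[None, :] - 2.0 * V
    p = 1.0 / np.exp(diff / np.sqrt(1.0 + lam * var)).sum(axis=1)
    return p / p.sum()     # renormalize: the raw values need not sum to one
```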

Intuition

The simple form of the mean-field approximations makes it possible to understand them intuitively. We focus on $\text{mf}_0$. We first rewrite it in the familiar "softmax" form:

$$p(y = c \mid x) \;\approx\; \frac{\exp(\mu_c / \tau_c)}{\sum_{c'} \exp(\mu_{c'} / \tau_c)}.$$

Note that this form is similar to a "softmax" with temperature scaling, $\exp(\mu_c/\tau)\,/\,\sum_{c'}\exp(\mu_{c'}/\tau)$. However, there are several important differences. In $\text{mf}_0$, the temperature scaling factor is category-specific:

$$\tau_c = \sqrt{1 + \lambda V_{cc}}. \qquad (14)$$

Importantly, the factor depends on the variance of the category. For a prediction with high variance $V_{cc}$, the temperature for that category is high, reducing the corresponding "probability". In other words, the scaling factor is both category-specific and data-dependent, providing additional flexibility over a global temperature scaling factor.

Implementation nuances

Because of this category-specific temperature scaling, the $\text{mf}_0$ output (and likewise the outputs of $\text{mf}_1$ and $\text{mf}_2$) is no longer a proper multinomial probability – the values do not necessarily sum to one. Proper normalization should be performed,

$$p(y = c \mid x) \;\leftarrow\; \frac{p(y = c \mid x)}{\sum_{c'} p(y = c' \mid x)}.$$

2.4 Other implementation considerations

Temperature scaling. Temperature scaling was shown to be useful for obtaining calibrated probabilities Guo et al. (2017). This can easily be included as well, by scaling the activations with an activation temperature $\tau_a$,

$$a \;\leftarrow\; a / \tau_a. \qquad (15)$$

We can also combine this with another "temperature" scaling factor $\tau_e$, representing how concentrated the models in the pseudo-ensemble are,

$$\Sigma \;\leftarrow\; \tau_e\,\Sigma. \qquad (16)$$

Here $\tau_e$ plays a role for the pseudo-ensembles similar to tempering the posterior Wenzel et al. (2020). Note that these two temperatures control variability differently. When $\tau_e \to 0$, the ensemble focuses on one model. When $\tau_a \to 0$, each model in the ensemble moves to "hard" decisions, as in eq. (7). Using $\text{mf}_0$ as an example,

$$p(y = c \mid x) \;\approx\; \frac{\exp\!\big(\mu_c \big/ \tau_a\sqrt{1 + \lambda\,\tau_e V_{cc}/\tau_a^2}\,\big)}{\sum_{c'} \exp\!\big(\mu_{c'} \big/ \tau_a\sqrt{1 + \lambda\,\tau_e V_{cc}/\tau_a^2}\,\big)}, \qquad (17)$$

where $V$ is computed at $\hat{\theta}$ as in eq. (9). Empirically, we can tune the temperatures $\tau_a$ and $\tau_e$ as hyper-parameters on a held-out set, to optimize the predictive performance.

Computation complexity and scalability. The bulk of the computation, as in Bayesian approximate inference, lies in the computation of $H^{-1}$ in $\Sigma_{\mathrm{IJ}}$ and $\Sigma_{\mathrm{B}}$, as in eq. (5) and eq. (6), or more precisely the product between the inverse Hessian and vectors, cf. eq. (9). For smaller models, computing and storing $H^{-1}$ exactly is attractive. For large models, one can compute the inverse Hessian-vector product approximately using multiple Hessian-vector products Pearlmutter (1994); Agarwal et al. (2016); Koh and Liang (2017). Alternatively, we can approximate the inverse Hessian using Kronecker factorization Ritter et al. (2018). In short, any advances in computing the inverse Hessian and related quantities can be used to accelerate the computation needed in this paper.
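One possible way to realize the inverse Hessian-vector product for large models is conjugate gradients on top of a Hessian-vector-product oracle (e.g., Pearlmutter's trick in an autodiff framework). The sketch below is a generic CG routine written under that assumption; it is not the exact procedure used in the paper.

```python
import numpy as np

def inverse_hvp(hvp, v, damping=1e-3, max_iter=100, tol=1e-6):
    """Approximate (H + damping*I)^{-1} v with conjugate gradients.

    hvp: callable returning the Hessian-vector product H @ x for a vector x.
    """
    x = np.zeros_like(v)
    r = v - (hvp(x) + damping * x)   # residual b - A x
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = hvp(p) + damping * p
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```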

An application example. In Algorithm 1, we exemplify the application of the mean-field approximation to Gaussian-softmax integration for uncertainty estimation. For brevity, we consider the last (fully-connected) layer's parameters $W$ as a part of the deep neural network's parameters, and assume the mapping $\phi(\cdot)$ prior to the fully-connected layer is deterministic.
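Since the pseudocode listing did not survive extraction here, the following sketch outlines the procedure end-to-end under our notation: propagate the last-layer pseudo-ensemble covariance to the logits, apply the two temperatures, and evaluate a mean-field approximation (the $\text{mf}_2$ form is shown). Function and variable names are ours, not the authors'.

```python
import numpy as np

def mf_ij_predict(phi, w_hat, sigma_w, tau_a=1.0, tau_e=1.0, lam=np.pi / 8):
    """Sketch of the uncertainty-estimation procedure described above.

    Assumes the pseudo-ensemble covariance sigma_w of the vectorized
    (row-major) last-layer weights has already been computed, e.g. from
    eq. (5) with a damped inverse Hessian.
    """
    d, C = w_hat.shape
    mu = (phi @ w_hat) / tau_a                              # temperature-scaled logit means
    # Propagate the weight covariance to the logits: V_{cc'} = phi^T Sigma_{cc'} phi.
    S = sigma_w.reshape(d, C, d, C)
    V = np.einsum('i,icjk,j->ck', phi, S, phi) * tau_e / tau_a**2
    diag = np.diag(V)
    diff = mu[None, :] - mu[:, None]
    var = diag[:, None] + diag[None, :] - 2.0 * V           # mf2 variant; mf0/mf1 drop terms
    p = 1.0 / np.exp(diff / np.sqrt(1.0 + lam * var)).sum(axis=1)
    p = p / p.sum()                                         # renormalize
    confidence = p.max()                                    # used as the OOD score in Section 4
    return p, confidence
```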

3 Related Work

Resampling methods, such as jackknife and bootstrap, are classical statistical tools for assessing confidence intervals Tukey (1958); Miller (1974); Efron and Stein (1981); Efron (1992). Recent results have shown that carefully designed jackknife estimators Barber et al. (2019) can achieve worst-case coverage guarantees on regression problems.

However, the exhaustive re-training in the jackknife or bootstrap can be cumbersome in practice. Recent works have leveraged the idea of influence functions Koh and Liang (2017); Cook and Weisberg (1982) to alleviate the computational challenge. Schulam and Saria (2019) combine the influence function with random data-weight samples to approximate the variance of predictions in bootstrapping; Alaa and van der Schaar (2019) derive a higher-order influence function approximation for jackknife estimators. Theoretical properties of the approximation are studied in Giordano et al. (2018, 2019). Madras et al. (2019) apply this approximation to identify underdetermined test points. The infinitesimal jackknife follows the same idea as those works Jaeckel (1972); Giordano et al. (2018, 2019). To avoid explicitly storing those models, we seek a Gaussian distribution to approximately characterize them. This connects to several existing lines of research.

The posterior of a Bayesian model can be approximated with a Gaussian distribution MacKay (1992); Ritter et al. (2018): $\mathcal{N}(\theta_{\mathrm{MAP}}, H^{-1})$, where $\theta_{\mathrm{MAP}}$ is interpreted as the maximum a posteriori estimate (thus incorporating any prior the model might want to include) and $H$ is the Hessian of the negative log-posterior. If we approximate the observed Fisher information matrix $F$ in eq. (5) with the Hessian $H$, the two pseudo-ensemble distributions and the Laplace approximation to the posterior have identical forms except that the Hessians are scaled differently, and this scaling can be captured by the ensemble temperature of eq. (16). Note that despite the similarity, the infinitesimal jackknife is a "frequentist" method and does not assume well-formed Bayesian modeling.

The trajectory of stochastic gradient descent gives rise to a sequence of models whose covariance converges to the same sandwich form $H^{-1} F H^{-1}$ (up to scaling) Chen et al. (2014, 2016); Maddox et al. (2019), similar in form to the pseudo-ensemble distributions. But note that those approaches do not collect information around the maximum likelihood estimate, while we do. It is also a classical result that the maximum likelihood estimator converges in distribution to a normal distribution; $\hat{\theta}$ and $\Sigma_{\mathrm{B}}$ are simply plug-in estimators of the true mean and covariance of this asymptotic distribution.

4 Experiments

We first describe the setup for our empirical studies. We then demonstrate the effectiveness of the proposed approach on the MNIST dataset, mainly contrasting the results from sampling to those from mean-field approximation and other implementation choices. We then provide a detailed comparison to popular approaches for uncertainty estimation. In the main text, we focus on classification problems. We evaluate on commonly used benchmark datasets, summarized in Table 1. In Appendix B.5, we report results on regression tasks.

4.1 Setup

Model and training details for classifiers.

 For MNIST, we train a two-layer MLP with 256 ReLU units per layer, using Adam optimizer for 100 epochs. For CIFAR-10, we train a ResNet-20 with Adam for 200 epochs. On CIFAR-100, we train a DenseNet-BC-121 with SGD optimizer for 300 epochs. For ILSVRC-2012, we train a ResNet-50 with SGD optimizer for 90 epochs.

Dataset                        | # classes | train / held-out / test | OOD Dataset                        | OOD held-out / test
MNIST LeCun et al. (1998)      | 10        | 55k / 5k / 10k          | NotMNIST Bulatov (2011)            | 5k / 13.7k
CIFAR-10 Krizhevsky (2009)     | 10        | 45k / 5k / 10k          | LSUN (resized) Yu et al. (2015)    | 1k / 9k
CIFAR-100 Krizhevsky (2009)    | 100       |                         | SVHN Netzer et al. (2011)          | 5k / 21k
ILSVRC-2012 Deng et al. (2009) | 1,000     | 1,281k / 25k / 25k      | Imagenet-O Hendrycks et al. (2019) | 2k / - *

*: the number of samples is limited; best results on the held-out set are reported.

Table 1: Datasets for Classification Tasks and Out-of-Distribution (OOD) Detection

Evaluation tasks and metrics. We evaluate on two tasks: predictive uncertainty on in-domain samples, and detection of out-of-distribution samples. For in-domain predictive uncertainty, we report the classification error rate (Err), negative log-likelihood (NLL), and expected calibration error (ECE, measured in $\ell_1$ distance) Guo et al. (2017) on the test set. NLL is a proper scoring rule Lakshminarayanan et al. (2017) and measures the KL-divergence between the ground-truth data distribution and the predictive distribution of the classifier (up to an additive constant). ECE measures the discrepancy between the histogram of the probabilities predicted by the classifier and the accuracies observed in the data – properly calibrated classifiers yield matching histograms. Both metrics are commonly used in the literature, and lower is better. In Appendix B.1, we give precise definitions.

On the task of out-of-distribution (OOD) detection, we assess how well $\max_c p(y = c \mid x)$, the classifier's output interpreted as a confidence score, can be used to distinguish invalid samples from normal in-domain images. Following common practice Hendrycks and Gimpel (2016); Liang et al. (2017); Lee et al. (2018), we report two threshold-independent metrics: area under the receiver operating characteristic curve (AUROC), and area under the precision-recall curve (AUPR). Since the precision-recall curve is sensitive to the choice of positive class, we report both "AUPR in : out", where in-distribution and out-of-distribution images are specified as the positive class respectively. We also report detection accuracy, the optimal accuracy achieved among all thresholds in classifying in-/out-of-domain samples. For all three metrics, higher is better.
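These metrics can be computed directly from the confidence scores; a small sketch using scikit-learn's roc_auc_score and average_precision_score (plus a brute-force threshold sweep for detection accuracy) is shown below. The helper name and the convention of treating in-domain as the positive class for AUROC are our choices.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def ood_metrics(conf_in, conf_out):
    """AUROC, AUPR, and detection accuracy from confidence scores max_c p(y=c|x)."""
    scores = np.concatenate([conf_in, conf_out])
    labels = np.concatenate([np.ones_like(conf_in), np.zeros_like(conf_out)])  # 1 = in-domain
    auroc = roc_auc_score(labels, scores)
    aupr_in = average_precision_score(labels, scores)           # in-domain as positive class
    aupr_out = average_precision_score(1 - labels, -scores)     # OOD as positive class
    # Detection accuracy: best accuracy over all thresholds on the scores.
    acc = max(((scores >= t) == labels).mean() for t in np.unique(scores))
    return auroc, aupr_in, aupr_out, acc
```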

Competing approaches for uncertainty estimation. We compare to popular approaches: (i) frequentist approaches: the maximum likelihood point estimator (mle) as a baseline, temperature scaling calibration (T. Scale) Guo et al. (2017), the deep ensemble method (ensemble) Lakshminarayanan et al. (2017), and resampling uncertainty estimation (rue) Schulam and Saria (2019); (ii) variants of Bayesian neural networks (bnn): dropout Gal and Ghahramani (2016) approximates the Bayesian posterior using stochastic forward passes of the network with dropout; bnn(vi) trains the network via stochastic variational inference to maximize the evidence lower bound; bnn(kfac) applies the Laplace approximation to construct a Gaussian posterior via layer-wise Kronecker-product factorized covariance matrices Ritter et al. (2018).

Hyper-parameter tuning. For in-domain uncertainty estimation, we use the NLL on the held-out sets to tune hyper-parameters. For the OOD detection task, we use the AUROC on the held-out sets to select hyper-parameters. We report the results of the best hyper-parameters on the test sets. The key hyper-parameters are the temperatures, the regularization or prior in bnn methods, and the dropout rates.

Other implementation details. When the Hessian needs to be inverted, we add a dampening term, following Ritter et al. (2018); Schulam and Saria (2019), to ensure positive semi-definiteness and that the smallest eigenvalue of the damped Hessian is 1. For bnn(vi), we use Flipout Wen et al. (2018) to reduce gradient variance and follow Snoek et al. (2019) for variational inference on deep ResNets. On ImageNet, we compute a Kronecker-product factorized Hessian matrix rather than the full one, due to the high dimensionality. For bnn(kfac) and rue, we use mini-batch approximations on subsets of the training set to scale up to ImageNet, as suggested in Ritter et al. (2018); Schulam and Saria (2019).

4.2 Infinitesimal jackknife on MNIST

Most uncertainty quantification methods, including the ones proposed in this paper, have “knobs” to tune. We first concentrate on MNIST and perform extensive studies of the proposed approaches to understand several design choices. Table 2 contrasts them.

Columns: in-domain metrics on MNIST (lower is better) and OOD detection against NotMNIST (higher is better).

Layer(s) | Approx. method         | Err (%) | NLL  | ECE (%) | OOD Acc. (%) | AUROC | AUPR (in : out)
all      | sampling (§2.2)        | 1.66    | 0.06 | 0.42    | 87.46        | 92.50 | 87.01 : 93.52
last     | sampling (§2.2)        | 1.74    | 0.06 | 0.41    | 87.58        | 93.48 | 91.66 : 94.27
last     | sampling (§2.2)        | 1.68    | 0.05 | 0.47    | 90.10        | 95.55 | 94.62 : 96.08
last     | sampling (§2.2)        | 1.67    | 0.05 | 0.50    | 90.77        | 96.20 | 95.79 : 96.47
last     | mean-field mf_0 (§2.3) | 1.67    | 0.05 | 0.20    | 91.93        | 96.91 | 96.67 : 96.99
last     | mean-field mf_1 (§2.3) | 1.67    | 0.05 | 0.47    | 91.91        | 96.93 | 96.72 : 97.03
last     | mean-field mf_2 (§2.3) | 1.67    | 0.05 | 0.46    | 91.91        | 96.94 | 96.72 : 97.03

Table 2: Performance of the Infinitesimal Jackknife Pseudo-Ensemble Distribution on MNIST

Use all layers or just the last layer. Uncertainty quantification for deep neural nets is computationally costly given their large number of parameters, especially when the methods need information on the curvature of the loss function. To this end, many approaches assume layer-wise independence Ritter et al. (2018) or low-rank structure Maddox et al. (2019); Mishkin et al. (2018), and in some cases restrict uncertainty quantification to only a few layers Zeng et al. (2018) – in particular, the last layer Kristiadi et al. (2020). The top portion of Table 2 shows that restricting to the last layer harms the in-domain ECE slightly but improves OOD detection significantly.

Effectiveness of mean-field approximation. Table 2 also shows that the mean-field approximation has similar performance to sampling (from the same distribution) on the in-domain tasks but noticeably improves OOD detection. Among the three variants, mf_0 performs the best.

Effect of ensemble and activation temperatures. We study the roles of the ensemble and activation temperatures introduced in §2.4. We grid search over the two and generate heatmaps of NLL and AUROC on the held-out sets, shown in Figure 1. Note that $\tau_e \to 0$ with $\tau_a = 1$ corresponds to mle.

What is particularly interesting is that for NLL, a higher activation temperature ($\tau_a > 1$) and a lower ensemble temperature ($\tau_e < 1$) work the best. For AUROC, however, lower temperatures on both work best. That a lower $\tau_e$ is preferred was also observed in Wenzel et al. (2020), and using $\tau_a > 1$ for better calibration is noted in Guo et al. (2017). On the other end, for OOD detection, Liang et al. (2017) suggest a very high activation temperature ($\tau_a = 1000$ in their work, likely due to using a single model instead of an ensemble).
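Selecting the pair of temperatures is just a two-dimensional grid search against the held-out NLL (or AUROC for OOD); a minimal sketch is below, where the grids and the nll_fn callable are placeholders rather than the values used in the paper.

```python
import numpy as np

def tune_temperatures(nll_fn,
                      tau_e_grid=(0.01, 0.1, 1.0, 10.0),
                      tau_a_grid=(0.5, 1.0, 2.0, 5.0)):
    """Pick (tau_e, tau_a) minimizing the held-out NLL; the grids are illustrative.

    nll_fn(tau_e, tau_a) should evaluate the held-out NLL of the mean-field
    predictions under the given temperatures (e.g., via mf_ij_predict above).
    """
    best = (None, None, np.inf)
    for tau_e in tau_e_grid:
        for tau_a in tau_a_grid:
            nll = nll_fn(tau_e, tau_a)
            if nll < best[2]:
                best = (tau_e, tau_a, nll)
    return best
```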

Figure 1: Best viewed in color. (a) NLL on the held-out set; (b) AUROC on the in-/out-of-distribution held-out sets; (c) calibration under distribution shift. Panels (a-b) show the effects of the softmax and ensemble temperatures on NLL and AUROC; the yellow star marks the best pair of temperatures. See text for details.

4.3 Comparison to other approaches

Given our results in the previous section, we report the mean-field infinitesimal jackknife (mfIJ) in the rest of the paper. Table 3 contrasts various methods on the in-domain tasks of MNIST, CIFAR-100, and ImageNet. Table 4 contrasts performance on the out-of-distribution (OOD) detection task. Results on CIFAR-10 (both in-domain and OOD), as well as CIFAR-100 OOD detection on SVHN, are in Appendix B.4.

Method    | MNIST: Err (%) | NLL  | ECE (%) | CIFAR-100: Err (%) | NLL  | ECE (%) | ImageNet: Err (%) | NLL  | ECE (%)
mle       | 1.67 | 0.10 | 1.18 | 24.3 | 1.03 | 10.4 | 23.66 | 0.92 | 3.03
T. Scale  | 1.67 | 0.06 | 0.74 | 24.3 | 0.92 | 3.13 | 23.66 | 0.92 | 2.09
ensemble  | 1.25 | 0.05 | 0.30 | 19.6 | 0.71 | 2.00 | 21.22 | 0.83 | 3.10
rue       | 1.72 | 0.08 | 0.85 | 24.3 | 0.99 | 8.60 | 23.63 | 0.92 | 2.83
dropout   | 1.67 | 0.06 | 0.68 | 23.7 | 0.84 | 3.43 | 24.93 | 0.99 | 1.62
bnn(vi)   | 1.72 | 0.14 | 1.13 | 25.6 | 0.98 | 8.35 | 26.54 | 1.17 | 4.41
bnn(kfac) | 1.71 | 0.06 | 0.16 | 24.1 | 0.89 | 3.36 | 23.64 | 0.92 | 2.95
mfIJ*     | 1.67 | 0.05 | 0.20 | 24.3 | 0.91 | 1.49 | 23.66 | 0.91 | 0.93

*: for CIFAR-100 and ImageNet, only the last layer is used due to the high computational cost.

Table 3: Comparing different uncertainty estimation methods on in-domain tasks (lower is better)
Method    | MNIST vs. NotMNIST: Acc. | ROC  | PR (in : out) | CIFAR-100 vs. LSUN: Acc. | ROC  | PR (in : out) | ImageNet vs. ImageNet-O: Acc. | ROC  | PR (in : out)
mle       | 67.6 | 53.8 | 40.1 : 72.5 | 72.7 | 80.0 | 83.5 : 75.2 | 58.4 | 51.6 | 78.4 : 26.3
T. Scale  | 67.4 | 66.7 | 48.8 : 77.0 | 76.6 | 84.3 | 86.7 : 80.3 | 58.5 | 54.5 | 79.2 : 27.8
ensemble  | 86.5 | 88.0 | 70.4 : 92.8 | 74.4 | 82.3 | 85.7 : 77.8 | 60.1 | 50.6 | 78.9 : 25.7
rue       | 61.1 | 64.7 | 60.5 : 68.4 | 75.2 | 83.0 | 86.7 : 77.8 | 58.4 | 51.6 | 78.3 : 26.3
dropout   | 88.8 | 91.4 | 78.7 : 93.5 | 69.8 | 77.3 | 81.1 : 72.9 | 59.5 | 51.7 | 79.0 : 26.3
bnn(vi)   | 86.9 | 81.1 | 59.8 : 89.9 | 62.5 | 67.7 | 71.4 : 63.1 | 57.8 | 52.0 | 75.7 : 26.8
bnn(kfac) | 88.7 | 93.5 | 89.1 : 93.8 | 72.9 | 80.4 | 83.9 : 75.5 | 60.3 | 53.1 | 79.6 : 26.9
mfIJ*     | 91.9 | 96.9 | 96.7 : 97.0 | 82.2 | 89.9 | 92.0 : 86.6 | 63.2 | 62.9 | 83.5 : 33.3

Acc.: detection accuracy (%). ROC: area under the ROC curve. PR (in : out): area under the precision-recall curve with the "in" and "out" domains as the positive class, respectively.
*: for CIFAR-100 and ImageNet, only the last layer is used for mfIJ due to the high computational cost.

Table 4: Comparing different methods on out-of-distribution detection (higher is better)

While the deep ensemble Lakshminarayanan et al. (2017) achieves the best performance on in-domain tasks most of the time, the proposed approach typically outperforms the other approaches, especially on the calibration metric ECE. On the OOD detection task, mfIJ significantly outperforms all other approaches in all metrics.

ImageNet-O is a particularly hard dataset for OOD detection Hendrycks et al. (2019). Its images come from ImageNet-22K and thus share similar low-level statistics with the in-domain data. Moreover, the images are chosen such that they are misclassified by existing networks (ResNet-50) with high confidence – so-called "natural adversarial examples". We follow Hendrycks et al. (2019) and use the 200-class subset of the test set, corresponding to the classes most confusable with the OOD images, as the in-distribution examples. Hendrycks et al. (2019) further demonstrate that many popular approaches to improving neural network robustness, such as adversarial training, hardly help on ImageNet-O. mfIJ improves over the other baselines by a clear margin.

Robustness to distributional shift. Snoek et al. (2019) point out that many uncertainty estimation methods are sensitive to distributional shift. Thus, we evaluate the robustness of mfIJ on rotated MNIST images over increasing rotation angles. The ECE curves in Figure 1(c) show that mfIJ is as robust as or better than the other approaches.

5 Conclusion

We propose a simple, efficient, and general-purpose confidence estimator for deep neural networks. The main idea is to approximate the ensemble of an infinite number of infinitesimal jackknife samples with a closed-form Gaussian distribution, and to derive an efficient mean-field approximation to the classification predictions obtained when the softmax layer is applied to Gaussian-distributed activations. Empirically, mfIJ surpasses or is competitive with state-of-the-art methods for uncertainty estimation while incurring lower computational cost and memory footprint.

References

  • N. Agarwal, B. Bullins, and E. Hazan (2016) Second-order stochastic optimization in linear time. stat 1050, pp. 15. Cited by: §1, §2.4.
  • A. M. Alaa and M. van der Schaar (2019) The discriminative jackknife: quantifying predictive uncertainty via higher-order influence functions. External Links: Link Cited by: §3.
  • A. Ashukha, A. Lyzhov, D. Molchanov, and D. Vetrov (2020) Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. arXiv preprint arXiv:2002.06470. Cited by: §1.
  • D. Barber and C. M. Bishop (1998) Ensemble learning for multi-layer networks. In Advances in neural information processing systems, pp. 395–401. Cited by: §1.
  • R. F. Barber, E. J. Candes, A. Ramdas, and R. J. Tibshirani (2019) Predictive inference with the jackknife+. arXiv preprint arXiv:1905.02928. Cited by: §1, §3.
  • C. M. Bishop (2006) Pattern recognition and machine learning. Springer. Cited by: Appendix A.
  • C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015) Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424. Cited by: §1.
  • Y. Bulatov (2011) NotMNIST dataset. Technical report. http://yaroslavvb.blogspot.it/2011/09/notmnist-dataset.html. Cited by: Table 1.
  • T. Chen, E. Fox, and C. Guestrin (2014) Stochastic gradient hamiltonian monte carlo. In International conference on machine learning, pp. 1683–1691. Cited by: §3.
  • X. Chen, J. D. Lee, X. T. Tong, and Y. Zhang (2016) Statistical inference for model parameters in stochastic gradient descent. arXiv preprint arXiv:1610.08637. Cited by: §1, §1, §3.
  • R. D. Cook and S. Weisberg (1982) Residuals and influence in regression. New York: Chapman and Hall. Cited by: §2.1, §3.
  • J. Daunizeau (2017) Semi-analytical approximations to statistical moments of sigmoid and softmax mappings of normal variables. arXiv preprint arXiv:1703.00091. Cited by: Appendix A, §2.3.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: Table 1.
  • B. Efron and C. Stein (1981) The jackknife estimate of variance. The Annals of Statistics, pp. 586–596. Cited by: §2.1, §3.
  • B. Efron (1992) Bootstrap methods: another look at the jackknife. In Breakthroughs in statistics, pp. 569–593. Cited by: §1, §3.
  • Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059. Cited by: §B.5, Table B.9, §1, §4.1.
  • R. Giordano, M. I. Jordan, and T. Broderick (2019) A higher-order swiss army infinitesimal jackknife. arXiv preprint arXiv:1907.12116. Cited by: §1, §3.
  • R. Giordano, W. Stephenson, R. Liu, M. I. Jordan, and T. Broderick (2018) A swiss army infinitesimal jackknife. arXiv preprint arXiv:1806.00550. Cited by: §1, §2.1, §3.
  • C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. arXiv preprint arXiv:1706.04599. Cited by: §1, §1, §2.4, §4.1, §4.1, §4.2.
  • D. Hendrycks and K. Gimpel (2016) A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136. Cited by: §4.1.
  • D. Hendrycks, K. Zhao, S. Basart, J. Steinhardt, and D. Song (2019) Natural adversarial examples. arXiv preprint arXiv:1907.07174. Cited by: §4.3, Table 1.
  • J. M. Hernández-Lobato and R. Adams (2015) Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pp. 1861–1869. Cited by: §B.5, Table B.9.
  • P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson (2018) Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407. Cited by: §1.
  • L. A. Jaeckel (1972) The infinitesimal jackknife. Bell Telephone Laboratories. Cited by: §1, §2.1, §3.
  • P. W. Koh and P. Liang (2017) Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1885–1894. Cited by: §1, §2.1, §2.4, §3.
  • A. Kristiadi, M. Hein, and P. Hennig (2020) Being bayesian, even just a bit, fixes overconfidence in relu networks. arXiv preprint arXiv:2002.10118. Cited by: §4.2.
  • A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Master's thesis, University of Toronto. Cited by: Table 1.
  • B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6405–6416. Cited by: §B.5, Table B.9, §1, §2.1, §4.1, §4.1, §4.3.
  • Y. LeCun, C. Cortes, and C. J. Burges (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist. Cited by: Table 1.
  • K. Lee, K. Lee, H. Lee, and J. Shin (2018) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pp. 7165–7175. Cited by: §4.1.
  • S. Liang, Y. Li, and R. Srikant (2017) Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690. Cited by: §4.1, §4.2.
  • D. J. MacKay (1992) Bayesian methods for adaptive models. Ph.D. Thesis, California Institute of Technology. Cited by: §1, §1, §3.
  • W. J. Maddox, P. Izmailov, T. Garipov, D. P. Vetrov, and A. G. Wilson (2019) A simple baseline for bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, pp. 13132–13143. Cited by: §B.3, §1, §1, §3, §4.2.
  • D. Madras, J. Atwood, and A. D’Amour (2019) Detecting extrapolation with local ensembles. arXiv preprint arXiv:1910.09573. Cited by: §3.
  • S. Mandt, M. D. Hoffman, and D. M. Blei (2017) Stochastic gradient descent as approximate bayesian inference. The Journal of Machine Learning Research 18 (1), pp. 4873–4907. Cited by: §1.
  • J. Martens (2010) Deep learning via hessian-free optimization.. In ICML, Vol. 27, pp. 735–742. Cited by: §1.
  • R. G. Miller (1974) The jackknife-a review. Biometrika 61 (1), pp. 1–15. Cited by: §3.
  • A. Mishkin, F. Kunstner, D. Nielsen, M. Schmidt, and M. E. Khan (2018) Slang: fast structured covariance approximations for bayesian deep learning with natural gradient. In Advances in Neural Information Processing Systems, pp. 6245–6255. Cited by: §4.2.
  • Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Ng (2011) Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning. Cited by: Table 1.
  • B. A. Pearlmutter (1994) Fast exact multiplication by the hessian. Neural computation 6 (1), pp. 147–160. Cited by: §1, §2.4.
  • H. Ritter, A. Botev, and D. Barber (2018) A scalable laplace approximation for neural networks. In 6th International Conference on Learning Representations, ICLR 2018-Conference Track Proceedings, Vol. 6. Cited by: Table B.9, §2.4, §3, §4.1, §4.1, §4.2.
  • P. Schulam and S. Saria (2019) Can you trust this prediction? auditing pointwise reliability after learning. arXiv preprint arXiv:1901.00403. Cited by: §B.5, Table B.9, §1, §1, §3, §4.1, §4.1.
  • J. Snoek, Y. Ovadia, E. Fertig, B. Lakshminarayanan, S. Nowozin, D. Sculley, J. Dillon, J. Ren, and Z. Nado (2019) Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, pp. 13969–13980. Cited by: §B.2, §4.1, §4.3.
  • J. Tukey (1958) Bias and confidence in not quite large samples. Ann. Math. Statist. 29, pp. 614. Cited by: §2.1, §3.
  • Y. Wen, P. Vicol, J. Ba, D. Tran, and R. Grosse (2018) Flipout: efficient pseudo-independent weight perturbations on mini-batches. arXiv preprint arXiv:1803.04386. Cited by: §4.1.
  • F. Wenzel, K. Roth, B. S. Veeling, J. Światkowski, L. Tran, S. Mandt, J. Snoek, T. Salimans, R. Jenatton, and S. Nowozin (2020) How good is the bayes posterior in deep neural networks really?. arXiv preprint arXiv:2002.02405. Cited by: §2.4, §4.2.
  • F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao (2015) Lsun: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365. Cited by: Table 1.
  • J. Zeng, A. Lesnikowski, and J. M. Alvarez (2018) The relevance of Bayesian layer positioning to model uncertainty in deep Bayesian active learning. arXiv preprint arXiv:1811.12535. Cited by: §4.2.

Appendix A Mean-Field Approximation for Gaussian-softmax Integration

In this section, we derive the mean-field approximation for Gaussian-softmax integration, eq. (10) in the main text. Assume the same notation as in §2.3, where the activation fed to the softmax follows a Gaussian $a \sim \mathcal{N}(\mu, V)$. Writing the softmax in terms of pairwise differences and integrating each term of the denominator independently,

$$p(y = c \mid x) = \mathbb{E}_{a}\!\left[\frac{1}{\sum_{c'} e^{a_{c'} - a_c}}\right] \;\approx\; \left[\,\sum_{c'} \mathbb{E}_{(a_c,\, a_{c'})}\!\big[e^{a_{c'} - a_c}\big]\right]^{-1}, \qquad \text{(A.1)}$$

where "integrate independently" means integrating each term in the summand independently, so that each expectation is taken with respect to the marginal distribution over the pair $(a_c, a_{c'})$. This approximation is prompted by the mean-field approximation for a nonlinear function of a sum. (Similar to the classical use of the mean-field approximation on Ising models, we use the term to capture the notion that the expectation is computed by considering only the weak, pairwise coupling effects between pairs of coordinates.)

Next we plug in the approximation to the Gaussian integral of the sigmoid function $\sigma(x) = 1/(1 + e^{-x})$, which states that

$$\int \sigma(x)\,\mathcal{N}(x;\, m,\, s^2)\,dx \;\approx\; \sigma\!\left(\frac{m}{\sqrt{1 + \lambda s^2}}\right), \qquad \text{(A.2)}$$

where $\lambda$ is a constant and is usually chosen to be $\pi/8$ or $3/\pi^2$. This is a well-known result, see Bishop [2006]. Noting that $e^{a_{c'} - a_c} = 1/\sigma(a_c - a_{c'}) - 1$, each pairwise expectation in eq. (A.1) can be evaluated with eq. (A.2). We further approximate by considering different ways to compute the bivariate expectations in the denominator.

Mean-Field 0 ($\text{mf}_0$)

In the denominator, we ignore the variance of $a_{c'}$ for $c' \ne c$, replace $a_{c'}$ with its mean $\mu_{c'}$, and compute the expectation only with respect to $a_c$, so that the relevant variance is $V_{cc}$. Applying eq. (A.2), we have

$$p(y = c \mid x) \;\approx\; \left[\,\sum_{c'} \exp\!\left(\frac{\mu_{c'} - \mu_c}{\sqrt{1 + \lambda V_{cc}}}\right)\right]^{-1}. \qquad \text{(A.3)}$$

Mean-Field 1 ($\text{mf}_1$)

If we replace the joint distribution of $(a_c, a_{c'})$ with the two independent marginals in the denominator, recognizing $\mathrm{var}(a_{c'} - a_c) = V_{cc} + V_{c'c'}$, we get

$$p(y = c \mid x) \;\approx\; \left[\,\sum_{c'} \exp\!\left(\frac{\mu_{c'} - \mu_c}{\sqrt{1 + \lambda\,(V_{cc} + V_{c'c'})}}\right)\right]^{-1}. \qquad \text{(A.4)}$$

Mean-Field 2 ($\text{mf}_2$)

Lastly, if we compute eq. (A.1) with the full covariance between $a_c$ and $a_{c'}$, recognizing $\mathrm{var}(a_{c'} - a_c) = V_{cc} + V_{c'c'} - 2V_{cc'}$, we get

$$p(y = c \mid x) \;\approx\; \left[\,\sum_{c'} \exp\!\left(\frac{\mu_{c'} - \mu_c}{\sqrt{1 + \lambda\,(V_{cc} + V_{c'c'} - 2V_{cc'})}}\right)\right]^{-1}. \qquad \text{(A.5)}$$

We note that Daunizeau [2017] has developed the approximation of the form of eq. (A.5) for computing the expectation of the softmax, though the author did not use it for uncertainty estimation.
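As a quick sanity check of these closed forms, the snippet below compares the mf_2 approximation (with renormalization) against a Monte-Carlo estimate of the Gaussian-softmax integral on a random problem instance; the problem size and random covariance are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
C, lam = 5, np.pi / 8
mu = rng.normal(size=C)
A = rng.normal(size=(C, C))
V = A @ A.T / C                      # a random PSD logit covariance

# Monte-Carlo estimate of E[softmax(a)], a ~ N(mu, V).
a = rng.multivariate_normal(mu, V, size=200000)
e = np.exp(a - a.max(axis=1, keepdims=True))
mc = (e / e.sum(axis=1, keepdims=True)).mean(axis=0)

# Mean-field 2 approximation, eq. (A.5), followed by renormalization.
diag = np.diag(V)
diff = mu[None, :] - mu[:, None]
var = diag[:, None] + diag[None, :] - 2.0 * V
mf2 = 1.0 / np.exp(diff / np.sqrt(1.0 + lam * var)).sum(axis=1)
mf2 /= mf2.sum()

print(np.round(mc, 3), np.round(mf2, 3))   # the two should be close
```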

Appendix B Experiments

B.1 Definitions of evaluation metrics

NLL is defined via the KL-divergence between the data distribution and the model's predictive distribution (equivalently, up to an additive constant, the cross-entropy),

$$\mathrm{NLL} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{c=1}^{C}\tilde{y}_{nc}\,\log p(y = c \mid x_n), \qquad \text{(B.1)}$$

where $\tilde{y}_n$ is the one-hot embedding of the label $y_n$.

ECE measures the discrepancy between the predicted probabilities and the empirical accuracy of a classifier in terms of $\ell_1$ distance. It is computed as the expected difference between per-bucket confidence and per-bucket accuracy, where all predictions are binned into $M$ buckets such that $B_m$ contains the predictions whose confidence falls within the interval $\big(\frac{m-1}{M}, \frac{m}{M}\big]$. ECE is defined as

$$\mathrm{ECE} = \sum_{m=1}^{M}\frac{|B_m|}{N}\,\Big|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\Big|, \qquad \text{(B.2)}$$

where $\mathrm{acc}(B_m) = \frac{1}{|B_m|}\sum_{n \in B_m}\mathbb{1}[\hat{y}_n = y_n]$ and $\mathrm{conf}(B_m) = \frac{1}{|B_m|}\sum_{n \in B_m}\max_c p(y = c \mid x_n)$.
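For reference, eq. (B.2) translates directly into a few lines of numpy; the default bucket count below is a placeholder, since the value used in the paper is not recoverable from this copy.

```python
import numpy as np

def expected_calibration_error(probs, labels, num_buckets=15):
    """ECE as in eq. (B.2): probs is (N, C) predicted probabilities, labels is (N,)."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, num_buckets + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bucket = (conf > lo) & (conf <= hi)
        if in_bucket.any():
            # |B_m|/N times |acc(B_m) - conf(B_m)|
            ece += in_bucket.mean() * abs(correct[in_bucket].mean() - conf[in_bucket].mean())
    return ece
```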

B.2 Details of experiments in the main text

Table B.5 provides key hyper-parameters used in training deep neural networks on different datasets.

Dataset             | MNIST       | CIFAR-10        | CIFAR-100       | ImageNet
Architecture        | MLP         | ResNet20        | DenseNet-BC-121 | ResNet50
Optimizer           | Adam        | Adam            | SGD             | SGD
Learning rate       | 0.001       | 0.001           | 0.1             | 0.1
Learning rate decay | exponential | staircase       | staircase       | staircase
Decay schedule      | 0.998       | at 80, 120, 160 | at 150, 225     | at 30, 60, 80
Weight decay        | 0           |                 |                 |
Batch size          | 100         | 8               | 128             | 256
Epochs              | 100         | 200             | 300             | 90

Table B.5: Hyper-parameters of neural network training

For method, we use in our implementation. For the ensemble approach, we use models on all datasets as in Snoek et al. [2019]. For rue, dropout, bnn (vi) and bnn(kfac), where sampling is applied at inference time, we use Monte-Carlo samples on MNIST, and on CIFAR-10, CIFAR-100 and ImageNet. We use buckets when computing ECE on MNIST, CIFAR-10 and CIFAR-100, and on ImageNet.

B.3 Applying the mean-field approximation to other Gaussian posterior inference tasks

Columns: in-domain metrics on MNIST (lower is better) and OOD detection against NotMNIST (higher is better).

Approx. method | Setting     | Err (%) | NLL  | ECE  | OOD Acc. (%) | AUROC | AUPR (in : out)
sampling       | 20 samples  | 1.55    | 0.06 | 0.60 | 85.78        | 91.15 | 91.91 : 90.34
sampling       | 100 samples | 1.55    | 0.06 | 0.60 | 88.33        | 93.19 | 86.95 : 95.72
sampling       | 500 samples | 1.55    | 0.06 | 0.60 | 82.68        | 88.80 | 82.51 : 91.23
mean-field     | mf_0        | 1.56    | 0.05 | 0.25 | 87.56        | 93.26 | 91.49 : 93.96
mean-field     | mf_1        | 1.55    | 0.05 | 0.46 | 85.28        | 90.68 | 85.15 : 92.39
mean-field     | mf_2        | 1.55    | 0.05 | 0.51 | 82.18        | 87.40 | 78.12 : 90.69

Table B.6: Uncertainty estimation with swag on MNIST

The mean-field approximation is interesting in its own right. In particular, it can be applied to approximate any Gaussian-softmax integral. In this section, we apply it to the swa-Gaussian posterior (swag) Maddox et al. [2019], whose covariance is a low-rank matrix plus a diagonal matrix derived from the SGD iterates. We tune the ensemble and activation temperatures with swag to perform the uncertainty tasks on MNIST, and the results are reported in Table B.6. We ran swag on the softmax layer for 50 epochs and collected models along the trajectory to form the posterior.

We use the default values for the other hyper-parameters, such as the swag learning rate and the rank of the low-rank component, as the main objective here is to combine the mean-field approximation with different Gaussian posteriors. As we can see from Table B.6, swag, when using sampling for approximate inference, has a variance larger than expected (in particular, higher than the variance in the sampling results of the infinitesimal jackknife, cf. Table 2 in the main text). Nonetheless, within this variance, using the mean-field approximation instead of sampling performs similarly. The notable exception is the in-domain tasks, where the mean-field approximation consistently outperforms sampling.

This suggests that the mean-field approximations can work with other Gaussian distributions as a replacement for sampling, reducing the computation cost.

Method    | CIFAR-10 in-domain (↓): Err (%) | NLL  | ECE (%) | LSUN OOD (↑): Acc. | AUROC | AUPR (in : out) | SVHN OOD (↑): Acc. | AUROC | AUPR (in : out)
mle       | 8.81  | 0.30 | 3.59 | 85.0 | 91.3 | 93.8 : 87.8 | 86.3 | 90.7 | 89.5 : 92.6
T. Scale  | 8.81  | 0.26 | 0.52 | 89.0 | 95.3 | 96.5 : 93.4 | 88.9 | 94.0 | 92.7 : 95.0
ensemble  | 6.66  | 0.20 | 1.37 | 88.0 | 94.4 | 95.9 : 92.1 | 88.7 | 93.4 | 92.4 : 94.8
rue       | 8.71  | 0.28 | 1.87 | 85.0 | 91.3 | 93.8 : 87.8 | 86.3 | 90.7 | 89.5 : 92.6
dropout   | 8.83  | 0.26 | 0.58 | 81.8 | 88.6 | 91.7 : 84.3 | 86.0 | 91.6 | 90.1 : 94.0
bnn vi    | 11.09 | 0.33 | 1.57 | 79.9 | 87.3 | 90.5 : 83.3 | 85.7 | 91.4 | 89.5 : 93.9
bnn ll-vi | 8.94  | 0.33 | 4.15 | 84.6 | 91.0 | 93.5 : 87.2 | 87.8 | 93.3 | 91.7 : 95.4
bnn(kfac) | 8.75  | 0.29 | 3.45 | 85.0 | 91.3 | 93.8 : 87.8 | 86.3 | 90.7 | 89.5 : 92.6
mfIJ      | 8.81  | 0.26 | 0.56 | 91.0 | 96.4 | 97.4 : 94.8 | 89.7 | 94.6 | 93.3 : 95.3

Table B.7: Uncertainty estimation on CIFAR-10
Method    | SVHN OOD detection (↑): Acc. | AUROC | AUPR (in : out)
mle       | 73.90 | 80.69 | 74.03 : 86.93
T. Scale  | 76.94 | 83.43 | 77.53 : 88.11
ensemble  | 75.99 | 83.48 | 77.75 : 88.85
rue       | 73.90 | 80.69 | 74.03 : 86.93
dropout   | 74.13 | 81.87 | 75.78 : 88.07
bnn ll-vi | 71.87 | 79.55 | 72.08 : 87.27
bnn(kfac) | 74.13 | 80.97 | 74.46 : 87.06
mfIJ      | 81.38 | 88.04 | 84.59 : 91.23

Table B.8: Out-of-Distribution Detection on CIFAR-100

B.4 More results on CIFAR-10 and CIFAR-100

Tables B.7 and B.8 supplement the main text with additional experimental results on the CIFAR-10 dataset with both in-domain and out-of-distribution detection tasks, and on the CIFAR-100 with out-of-distribution detection using the SVHN dataset. bnn ll-vi refers to stochastic variational inference on the last layer only.

The results support the same observation as in the main text: mfIJ performs similarly to other approaches on in-domain tasks, but noticeably outperforms them on out-of-distribution detection.

Metric   | Dataset  | ensemble Lakshminarayanan et al. [2017] | rue Schulam and Saria [2019] | dropout Gal and Ghahramani [2016] | bnn(pbp) Hernández-Lobato and Adams [2015] | bnn(kfac) Ritter et al. [2018] | mfIJ
NLL (↓)  | Housing  | 2.41 ± 0.25 | 2.69 ± 0.44 | 2.46 ± 0.06 | 2.57 ± 0.09 | 2.74 ± 0.46 | 2.66 ± 0.39
NLL (↓)  | Concrete | 3.06 ± 0.18 | 3.21 ± 0.14 | 3.04 ± 0.02 | 3.16 ± 0.02 | 3.25 ± 0.18 | 3.15 ± 0.09
NLL (↓)  | Energy   | 1.38 ± 0.22 | 1.48 ± 0.35 | 1.99 ± 0.02 | 2.04 ± 0.02 | 1.48 ± 0.35 | 1.47 ± 0.35
NLL (↓)  | Kin8nm   | -1.20 ± 0.02 | -1.13 ± 0.03 | -0.95 ± 0.01 | -0.90 ± 0.01 | -1.14 ± 0.03 | -1.15 ± 0.02
NLL (↓)  | Naval    | -5.63 ± 0.05 | -6.00 ± 0.22 | -3.80 ± 0.01 | -3.73 ± 0.01 | -5.99 ± 0.22 | -5.99 ± 0.22
NLL (↓)  | Power    | 2.79 ± 0.04 | 2.86 ± 0.04 | 2.80 ± 0.01 | 2.84 ± 0.01 | 2.86 ± 0.04 | 2.86 ± 0.04
NLL (↓)  | Wine     | 0.94 ± 0.12 | 0.99 ± 0.06 | 0.93 ± 0.01 | 0.97 ± 0.01 | 0.99 ± 0.08 | 0.99 ± 0.07
NLL (↓)  | Yacht    | 1.18 ± 0.21 | 1.14 ± 0.42 | 1.55 ± 0.03 | 1.63 ± 0.02 | 1.66 ± 0.88 | 1.18 ± 0.53
RMSE (↓) | Housing  | 3.28 ± 1.00 | 3.34 ± 0.94 | 2.97 ± 0.19 | 3.01 ± 0.18 | 3.42 ± 1.09 | 3.24 ± 0.93
RMSE (↓) | Concrete | 6.03 ± 0.58 | 5.67 ± 0.65 | 5.23 ± 0.12 | 5.67 ± 0.09 | 5.65 ± 0.63 | 5.60 ± 0.61
RMSE (↓) | Energy   | 2.09 ± 0.29 | 1.09 ± 0.35 | 1.66 ± 0.04 | 1.80 ± 0.05 | 1.08 ± 0.35 | 1.08 ± 0.35
RMSE (↓) | Kin8nm   | 0.09 ± 0.00 | 0.08 ± 0.00 | 0.10 ± 0.00 | 0.10 ± 0.00 | 0.08 ± 0.00 | 0.08 ± 0.00
RMSE (↓) | Naval    | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.01 ± 0.00 | 0.01 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00
RMSE (↓) | Power    | 4.11 ± 0.17 | 4.21 ± 0.15 | 4.02 ± 0.04 | 4.12 ± 0.03 | 4.21 ± 0.15 | 4.21 ± 0.15
RMSE (↓) | Wine     | 0.64 ± 0.04 | 0.65 ± 0.04 | 0.62 ± 0.01 | 0.64 ± 0.01 | 0.65 ± 0.04 | 0.65 ± 0.04
RMSE (↓) | Yacht    | 1.58 ± 0.48 | 0.82 ± 0.26 | 1.11 ± 0.09 | 1.02 ± 0.05 | 1.19 ± 1.08 | 0.80 ± 0.25

†: numbers are cited from the original paper.

Table B.9: NLLs and RMSEs of different methods on regression benchmark datasets.

B.5 Regression experiments

We conduct real-world regression experiments on UCI datasets. We follow the experimental setup in Gal and Ghahramani [2016], Hernández-Lobato and Adams [2015], Lakshminarayanan et al. [2017], where each dataset is randomly split into 20 train-test folds and the average results with the standard deviation are reported. We use the same architecture as previous works, with 1 hidden layer of 50 ReLU units. Since in regression tasks the output distribution can be computed analytically, we do not use the mean-field approximation in these experiments. Nevertheless, we still refer to our method as mfIJ to be consistent with the other datasets.

For the mfIJ approach, we use the pseudo-ensemble Gaussian distribution of the last-layer parameters to compute a Gaussian distribution of the network output, i.e., with mean $f(x; \hat{\theta})$ and variance $\sigma_f^2(x)$, as in eq. (9) of the main text. We also estimate the variance of the observation noise $\sigma^2$ from the residuals on the training set Schulam and Saria [2019]. Therefore, the predictive distribution of $y$ is $\mathcal{N}\!\big(f(x; \hat{\theta}),\, \sigma_f^2(x) + \sigma^2\big)$, and the negative log-likelihood (NLL) of a test point $(x, y)$ is

$$-\log p(y \mid x) = \frac{1}{2}\log\!\big(2\pi\,(\sigma_f^2(x) + \sigma^2)\big) + \frac{\big(y - f(x; \hat{\theta})\big)^2}{2\,(\sigma_f^2(x) + \sigma^2)}.$$

We tune the ensemble temperature $\tau_e$ on the held-out sets, and fix the activation temperature $\tau_a$ to 1.

For the sampling-based methods, rue and kfac, the NLL is computed from prediction samples as

$$-\log p(y \mid x) = \frac{1}{2}\log\!\big(2\pi\,\hat{\sigma}^2(x)\big) + \frac{\big(y - \hat{\mu}(x)\big)^2}{2\,\hat{\sigma}^2(x)},$$

where $\hat{\mu}(x)$ is the mean of the prediction samples, and $\hat{\sigma}^2(x)$ is the variance of the prediction samples. The sampling is conducted on all network layers for both rue and kfac.
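The per-point Gaussian NLL above is a one-liner; here is a hedged numpy sketch that also allows adding an observation-noise term (pass noise_var=0.0 if it is not used).

```python
import numpy as np

def gaussian_nll_from_samples(y, pred_samples, noise_var=0.0):
    """Per-point NLL under N(mean(samples), var(samples) + noise_var).

    y: (N,) targets; pred_samples: (S, N) predictions from S sampled models;
    noise_var: observation-noise variance, e.g. estimated from training residuals.
    """
    mu = pred_samples.mean(axis=0)
    var = pred_samples.var(axis=0) + noise_var
    return 0.5 * np.log(2 * np.pi * var) + (y - mu) ** 2 / (2 * var)
```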

Results

We report the NLL and RMSE on the test sets in Table B.9 and compare with other approaches. rue and bnn(kfac) use Monte-Carlo samples at inference time, and the ensemble method combines several independently trained models. We achieve similar or better performance than rue and kfac without the computational cost of explicit sampling. Compared to dropout and ensemble, which are the state-of-the-art methods on these datasets, we also achieve competitive results. We highlight the best-performing method on each dataset, as well as any method whose result is within one standard deviation of the best.

B.6 More results on comparing all layers versus just the last layer

We conduct an experiment similar to the top portion of Table 2 in the main text to study the effect of restricting the parameter uncertainty to the last layer only. We use NotMNIST as the in-domain dataset and treat MNIST as the out-of-distribution dataset. We use a two-layer MLP with 256 ReLU hidden units. Table B.10 supports the same observation as in the main text: restricting to the last layer improves the OOD task while having no significant negative impact on the in-domain tasks.

Columns: in-domain metrics on NotMNIST (lower is better) and OOD detection against MNIST (higher is better).

Layer(s) | Approx. method | Err (%) | NLL  | ECE  | OOD Acc. (%) | AUROC | AUPR (in : out)
all      | sampling       | 3.18    | 0.12 | 0.43 | 90.73        | 95.14 | 97.09 : 89.78
last     | sampling       | 3.18    | 0.12 | 0.43 | 92.26        | 96.23 | 97.82 : 92.06

Table B.10: Performance of the infinitesimal jackknife pseudo-ensemble distribution on NotMNIST, comparing all layers versus just the last layer