1 Introduction
Recent advances in deep neural nets have dramatically improved predictive accuracy on supervised tasks. For many applications, such as autonomous vehicle control and medical diagnosis, decision-making also requires an accurate estimate of the uncertainty associated with each prediction. Unfortunately, deep neural nets are known to output overconfident, miscalibrated predictions Guo et al. (2017).
It is therefore crucial to improve deep models' ability to represent uncertainty. There has been a steady development of new methods for uncertainty quantification in deep neural nets. One popular idea is to introduce additional stochasticity (such as temperature annealing or dropout in the network architecture) into existing trained models to represent uncertainty Gal and Ghahramani (2016); Guo et al. (2017). Another line of work uses an ensemble of models to collectively represent the uncertainty about the predictions. The ensemble can be obtained by varying training with respect to initialization Lakshminarayanan et al. (2017), hyperparameters Ashukha et al. (2020), or data partitions, i.e., bootstraps Efron (1992); Schulam and Saria (2019). Yet another line of work uses Bayesian neural networks (bnn), which can be seen as an ensemble of an infinite number of models characterized by the posterior distribution Blundell et al. (2015); MacKay (1992). In practice, one samples models from the posterior or uses variational inference. Each of those methods offers different tradeoffs among computational cost, memory consumption, parallelization, and modeling flexibility. For example, while ensemble methods are often state-of-the-art, they are both computationally and memory intensive, as they require repeating the training procedure and storing the resulting models.
These costs stand in stark contrast to many practitioners' desiderata. Ideally, neither training nor inference should incur additional memory or computational cost for estimating uncertainty beyond what is already needed for making predictions. It is also desirable to be able to quantify uncertainty on an existing model when retraining is not possible.
In this work, we propose a new method to bridge this gap. The main idea of our approach is to use the infinitesimal jackknife, a classical statistical tool for uncertainty estimation Jaeckel (1972), to construct a pseudo-ensemble that can be described with a closed-form Gaussian distribution. We then use this Gaussian distribution for uncertainty estimation. While the standard way is to sample models from this distribution and combine each sample's prediction, we develop a mean-field approximation to the inference, in which Gaussian random variables need to be integrated with the softmax nonlinearity to generate probabilities for multinomial variables.
We show that the proposed approach, which we refer to as mean-field infinitesimal jackknife (mf-IJ), often surpasses or is competitive with existing approaches on the evaluation metrics of NLL, ECE, and out-of-distribution detection accuracy on several benchmark datasets.
mf-IJ shares appealing properties with several recent approaches for uncertainty estimation: constructing the pseudo-ensemble with the infinitesimal jackknife does not require changing existing training procedures Schulam and Saria (2019); Barber et al. (2019); Giordano et al. (2018, 2019); Koh and Liang (2017); approximating the ensemble with a distribution removes the need to store many models – an impractical task for modern learning models Chen et al. (2016); Maddox et al. (2019); and the pseudo-ensemble distribution is similar in form to the Laplace approximation for Bayesian inference Barber and Bishop (1998); MacKay (1992), so existing techniques from computational statistics, such as Kronecker product factorization, can be directly applied. The mean-field approximation brings an additional appeal: it is in closed form and needs only the first and second moments of the Gaussian random variables. In our case, the first moments are simply the predictions of the networks, while the second moments involve the product between the inverse Hessian and a vector, which can be computed efficiently Agarwal et al. (2016); Martens (2010); Pearlmutter (1994). Additionally, the mean-field approximation can be applied wherever integrals of a similar form need to be approximated. In Appendix B.3, we demonstrate its utility by applying it to the recently proposed swag algorithm for uncertainty estimation, where the Gaussian distribution is derived differently Chen et al. (2016); Izmailov et al. (2018); Maddox et al. (2019); Mandt et al. (2017).

2 Approach
In this section, we start by introducing the necessary notation and defining the task of uncertainty estimation. We then describe the technique of infinitesimal jackknife in §2.1. We derive a closed-form Gaussian distribution over an infinite number of models estimated with the infinitesimal jackknife – we call them a pseudo-ensemble. We describe how to use this distribution for uncertainty estimation in §2.2. We present our efficient mean-field approximation to the Gaussian-softmax integral in §2.3. Lastly, we discuss hyperparameters of our method and present the algorithm in §2.4.
Notation We are given a training set of $N$ i.i.d. samples $\mathcal{D} = \{z_i = (x_i, y_i)\}_{i=1}^N$, with $x_i$ the input and $y_i$ the target. We fit the data to a parametric predictive model $f(x; \theta)$. We define the loss on a sample as $\ell(z_i; \theta)$ and optimize the model's parameters via empirical risk minimization on $\mathcal{D}$. The minimizer is given by

$$\hat{\theta} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \ell(z_i; \theta). \qquad (1)$$
In practice, we are interested in not only the prediction but also quantifying the uncertainty of making such a prediction. In this paper, we consider (deep) neural network as the predictive model.
2.1 Infinitesimal jackknife and its distribution
Jackknife is a well-known resampling method for estimating the confidence interval of an estimator Tukey (1958); Efron and Stein (1981). It is a straightforward procedure: each element $z_i$ is left out from the dataset to form a unique "leave-one-out" jackknife sample $\mathcal{D}_{-i} = \mathcal{D} \setminus \{z_i\}$. A jackknife sample's estimate of $\theta$ is given by

$$\hat{\theta}_{-i} = \arg\min_{\theta} \frac{1}{N} \sum_{j \neq i} \ell(z_j; \theta). \qquad (2)$$
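As a toy illustration of this resampling procedure (our own sketch, not from the paper; `jackknife_variance` and `mean` are hypothetical names), Tukey's jackknife variance estimate is obtained by refitting the estimator on each leave-one-out sample:

```python
def jackknife_variance(data, estimator):
    """Tukey's jackknife estimate of Var[estimator(data)].

    Recomputes the estimator on every leave-one-out sample -- the
    step that becomes infeasible when "refitting" means retraining
    a deep network.
    """
    n = len(data)
    theta_loo = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    theta_bar = sum(theta_loo) / n
    # Jackknife variance: (n-1)/n times the sum of squared deviations
    return (n - 1) / n * sum((t - theta_bar) ** 2 for t in theta_loo)


def mean(xs):
    return sum(xs) / len(xs)
```

For the sample mean this recovers the familiar unbiased variance divided by $N$; the point of the infinitesimal jackknife below is that each refit can be approximated instead of recomputed.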
We obtain $N$ such samples $\{\hat{\theta}_{-i}\}_{i=1}^N$ and use them to estimate the variances of $\hat{\theta}$ and of the predictions made with $\hat{\theta}$. In this vein, the jackknife is a form of ensemble method. However, it is not feasible to retrain modern neural networks $N$ times, when $N$ is often in the order of millions. Infinitesimal jackknife is a classical tool to approximate $\hat{\theta}_{-i}$ without retraining on $\mathcal{D}_{-i}$. It is often used as a theoretical tool for asymptotic analysis Jaeckel (1972), and is closely related to influence functions in robust statistics Cook and Weisberg (1982). Recent studies have brought renewed interest in applying this methodology to machine learning problems Giordano et al. (2018); Koh and Liang (2017). Here, we briefly summarize the method.

Linear approximation. The basic idea behind infinitesimal jackknife is to treat $\hat{\theta}$ and $\hat{\theta}_{-i}$ as special cases of an estimator on weighted samples
$$\hat{\theta}(w) = \arg\min_{\theta} \sum_{i=1}^{N} w_i\, \ell(z_i; \theta), \qquad (3)$$

where the weights $w = (w_1, \ldots, w_N)$ form a simplex: $w_i \ge 0$ and $\sum_i w_i = 1$. Thus the maximum likelihood estimate $\hat{\theta}$ is $\hat{\theta}(w)$ when $w = \frac{1}{N}\mathbf{1}$. A jackknife sample's estimate $\hat{\theta}_{-i}$, on the other end, is $\hat{\theta}(w)$ when $w = \frac{1}{N-1}(\mathbf{1} - e_i)$, where $e_i$ is the all-zero vector except taking a value of $1$ at the $i$-th coordinate.
Using the first-order Taylor expansion of $\hat{\theta}(w)$ around $w_0 = \frac{1}{N}\mathbf{1}$, we obtain (under the conditions of twice-differentiability and invertibility of the Hessian)

$$\hat{\theta}_{-i} \approx \hat{\theta} + \frac{1}{N}\, H(\hat{\theta})^{-1}\, \nabla_\theta \ell(z_i; \hat{\theta}), \qquad (4)$$

where $H(\hat{\theta}) = \frac{1}{N} \sum_i \nabla^2_\theta \ell(z_i; \hat{\theta})$ is the Hessian matrix of the empirical risk evaluated at $\hat{\theta}$, and $\nabla_\theta \ell(z_i; \hat{\theta})$ is the gradient of the loss on sample $z_i$ evaluated at $\hat{\theta}$. We use $H$ and $g_i$ as shorthands when there is enough context to avoid confusion.
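To make the linear approximation concrete, the following sketch (our illustration on a one-parameter least-squares model, not the paper's code) compares the infinitesimal-jackknife estimate $\hat{\theta} + \frac{1}{N} H^{-1} g_i$ with exact leave-one-out refits:

```python
# One-parameter least squares through the origin:
# loss_i(theta) = 0.5 * (theta * x_i - y_i)^2
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 2.0]
N = len(xs)

# Exact minimizer of the empirical risk
theta_hat = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
H = sum(x * x for x in xs) / N  # Hessian of the empirical risk (a scalar here)


def ij_loo(i):
    """Infinitesimal jackknife: theta_hat + (1/N) * H^{-1} * g_i."""
    g_i = (theta_hat * xs[i] - ys[i]) * xs[i]
    return theta_hat + g_i / (N * H)


def exact_loo(i):
    """Exact leave-one-out refit."""
    num = sum(x * y for j, (x, y) in enumerate(zip(xs, ys)) if j != i)
    den = sum(x * x for j, x in enumerate(xs) if j != i)
    return num / den
```

On this toy problem the first-order approximation tracks every refit more closely than the full-data estimate does, without re-solving any of the leave-one-out problems.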
An infinite number of infinitesimal jackknife samples. If the number of samples $N \to \infty$, we can characterize the "infinite" number of $\hat{\theta}_{-i}$ with a closed-form Gaussian distribution, with the following sample mean and covariance as the distribution's mean and covariance,

$$\hat{\theta}_{\text{jack}} \sim \mathcal{N}\big(\hat{\theta},\; \Sigma_{\text{jack}}\big), \qquad \Sigma_{\text{jack}} = \frac{1}{N}\, H^{-1} F H^{-1}, \qquad (5)$$

where $F = \frac{1}{N} \sum_i g_i g_i^\top$ denotes the observed Fisher information matrix. (The mean is $\hat{\theta}$ itself, since the gradients $g_i$ sum to zero at the optimum.)
Infinite infinitesimal bootstraps. The above procedure and analysis can be extended to bootstrapping (i.e., sampling with replacement). Similarly, to characterize the estimates from the bootstraps, we can also use a Gaussian distribution – details omitted here for brevity,

$$\hat{\theta}_{\text{boot}} \sim \mathcal{N}\big(\hat{\theta},\; \Sigma_{\text{boot}}\big). \qquad (6)$$

We refer to the distributions $\mathcal{N}(\hat{\theta}, \Sigma_{\text{jack}})$ and $\mathcal{N}(\hat{\theta}, \Sigma_{\text{boot}})$ as the pseudo-ensemble distributions. We can further approximate $F$ by $H$ to obtain simplified covariances for both.
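The covariance in eq. (5) is the classical "sandwich" estimator. The sketch below (ours; it assumes a least-squares model so the optimum is available in closed form) computes it with NumPy and checks it against the scaled empirical covariance of the pseudo-jackknife points from eq. (4):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 3
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

# Exact minimizer of (1/N) * sum_i 0.5 * (x_i^T theta - y_i)^2
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

resid = X @ theta_hat - y
G = X * resid[:, None]              # per-sample gradients g_i (rows)
H = X.T @ X / N                     # Hessian of the empirical risk
F = G.T @ G / N                     # observed Fisher information
H_inv = np.linalg.inv(H)
Sigma_jack = H_inv @ F @ H_inv / N  # sandwich covariance, eq. (5)

# Pseudo-jackknife displacements (1/N) * H^{-1} g_i; their mean is zero
# because the gradients sum to zero at the optimum.
deltas = (G @ H_inv) / N
emp_cov = deltas.T @ deltas / N
```

Up to the jackknife scaling factor, the sandwich form equals $N$ times the empirical covariance of the pseudo-jackknife points, so no model ever needs to be refit or stored.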
Lakshminarayanan et al. (2017) observed that using models trained on bootstrapped samples does not work as well empirically as other approaches, as the learner only sees about 63% of the unique data points in each bootstrap sample. We note that this is an empirical limitation rather than a theoretical one: in practice, we can only train a very limited number of models. We hypothesize that we can get the benefits of combining an infinite number of models without training them. Empirical results validate this hypothesis.
2.2 Sampling-based uncertainty estimation with the pseudo-ensemble distributions
Given the general form of the pseudo-ensemble distributions $\theta \sim \mathcal{N}(\hat{\theta}, \Sigma)$, it is straightforward to see that, if we approximate the predictive function with a linear function,

$$f(x; \theta) \approx f(x; \hat{\theta}) + \nabla_\theta f(x; \hat{\theta})^\top (\theta - \hat{\theta}),$$

we can then regard the predictions by the models as a Gaussian-distributed random variable,

$$f(x; \theta) \sim \mathcal{N}\big(f(x; \hat{\theta}),\; \nabla_\theta f(x; \hat{\theta})^\top\, \Sigma\, \nabla_\theta f(x; \hat{\theta})\big).$$

For predictive functions whose outputs are approximately Gaussian, it might be adequate to characterize the uncertainty with the approximated mean and variance. However, for predictive functions whose outputs are categorical, this approach is not applicable.
A standard way is to use sampling to combine the discrete predictions from the models in the ensemble. For example, for classification where $p_k(x; \theta)$ is the probability of labeling $x$ with the $k$-th category, the averaged prediction from the ensemble is then

$$\bar{p}_k(x) = \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}\Big[k = \arg\max_{k'} p_{k'}(x; \theta^{(m)})\Big], \qquad (7)$$

where $\theta^{(m)} \sim \mathcal{N}(\hat{\theta}, \Sigma)$, and $\mathbb{1}[\cdot]$ is the indicator function. In the next section, we propose a new approach that avoids sampling and directly approximates the ensemble prediction of discrete labels.
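A minimal sampling sketch in the spirit of eq. (7) (our illustration; the Gaussian over logits stands in for the pseudo-ensemble distribution, and `mc_hard_vote` is our name):

```python
import numpy as np


def mc_hard_vote(mu, Sigma, n_samples=5000, seed=0):
    """Average the hard argmax decisions of logits sampled from N(mu, Sigma)."""
    rng = np.random.default_rng(seed)
    logits = rng.multivariate_normal(mu, Sigma, size=n_samples)
    votes = logits.argmax(axis=1)
    return np.bincount(votes, minlength=len(mu)) / n_samples


mu = np.array([2.0, 0.5, -1.0])
p_tight = mc_hard_vote(mu, 1e-6 * np.eye(3))  # ensemble concentrated on one model
p_wide = mc_hard_vote(mu, 4.0 * np.eye(3))    # high variance spreads the vote
```

As the ensemble spread grows, the vote mass leaks from the top category to its competitors, which is exactly the uncertainty signal the sampling scheme extracts.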
2.3 Mean-field approximation for Gaussian-softmax integration
In deep neural networks for classification, the predictions $p_k$ are the outputs of the softmax layer,

$$p_k(x) = \mathrm{softmax}_k\big(w^\top \phi(x)\big) = \frac{\exp\big(w_k^\top \phi(x)\big)}{\sum_{k'} \exp\big(w_{k'}^\top \phi(x)\big)}, \qquad (8)$$

where $\phi(x)$ is the transformation of the input through the layers before the fully-connected layer, and $w = \{w_k\}$ are the connection weights in the softmax layer. We focus on the case where $\phi(x)$ is deterministic – extending to random $\phi(x)$ is straightforward. As discussed, we assume the pseudo-ensemble on $w$ forms a Gaussian distribution $\mathcal{N}(\mu_w, \Sigma_w)$. Then the activations $a_k = w_k^\top \phi(x)$ follow a Gaussian $a \sim \mathcal{N}(\mu, \Sigma)$ such that

$$\mu_k = \mu_{w_k}^\top \phi(x), \qquad \Sigma_{kk'} = \phi(x)^\top\, \Sigma_{w_k, w_{k'}}\, \phi(x). \qquad (9)$$
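The activation statistics in eq. (9) follow from the linear-Gaussian identity; a small sketch (ours; it assumes, for simplicity, independent per-class weight blocks with covariances $\Sigma_{w_k}$):

```python
import numpy as np

rng = np.random.default_rng(1)
D, K = 4, 3
phi = rng.normal(size=D)                            # phi(x): features before softmax
mu_w = rng.normal(size=(K, D))                      # pseudo-ensemble mean of class weights
Sigma_w = [np.eye(D) * (k + 1) for k in range(K)]   # per-class weight covariances

# Linear-Gaussian propagation of a_k = w_k^T phi
mu_a = mu_w @ phi
var_a = np.array([phi @ S @ phi for S in Sigma_w])

# Sanity check against sampling the weights directly
samples = np.stack(
    [rng.multivariate_normal(mu_w[k], Sigma_w[k], size=100_000) @ phi
     for k in range(K)],
    axis=1,
)
```

The closed-form moments match the sampled ones, so the logit Gaussian never requires drawing weight samples in practice.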
We give a detailed derivation in Appendix A to compute the expectation of $p_k$,

$$\mathbb{E}[p_k] = \int \mathrm{softmax}_k(a)\; \mathcal{N}(a; \mu, \Sigma)\; da. \qquad (10)$$
The key idea is to apply the mean-field approximation and use the following well-known formula to compute the Gaussian integral of a sigmoid function,

$$\int \sigma(a)\, \mathcal{N}(a; m, \varsigma^2)\, da \approx \sigma\!\left(\frac{m}{\sqrt{1 + \lambda \varsigma^2}}\right),$$

where $\lambda$ is a constant and is usually chosen to be $\pi/8$ or $3/\pi^2$. In the softmax case, we arrive at:
$$\text{mf-0}: \quad \mathbb{E}[p_k] \approx \mathrm{softmax}_k(\mu), \qquad (11)$$

$$\text{mf-1}: \quad \mathbb{E}[p_k] \propto \exp\!\left(\frac{\mu_k}{\sqrt{1 + \lambda \sigma_k^2}}\right), \qquad (12)$$

$$\text{mf-2}: \quad \mathbb{E}[p_k] \propto \left[\sum_{k'} \exp\!\left(\frac{\mu_{k'} - \mu_k}{\sqrt{1 + \lambda\,(\sigma_k^2 + \sigma_{k'}^2 - 2\Sigma_{kk'})}}\right)\right]^{-1}, \qquad (13)$$

where $\sigma_k^2 = \Sigma_{kk}$. The three approximations differ in how much information from $\Sigma$ is considered: mf-0 does not consider $\Sigma$, mf-1 considers its variances $\sigma_k^2$, and mf-2 considers its covariances $\Sigma_{kk'}$. Note that mf-0 and mf-1 are computationally preferred over mf-2, which uses the $O(K^2)$ covariances, where $K$ is the number of classes. (Daunizeau (2017) derived an approximation in the form of mf-2 but did not apply it to uncertainty estimation.)
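The three variants can be sketched as follows (our implementation with $\lambda = \pi/8$; function names are ours) and compared against a Monte Carlo estimate of the Gaussian-softmax integral:

```python
import numpy as np

LAM = np.pi / 8.0  # moderation constant for the sigmoid-Gaussian formula


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def mf0(mu, Sigma):
    """Ignore the covariance entirely."""
    return softmax(mu)


def mf1(mu, Sigma):
    """Use the per-category variances; re-normalize the raw scores."""
    var = np.diag(Sigma)
    p = np.exp(mu / np.sqrt(1.0 + LAM * var))
    return p / p.sum()


def mf2(mu, Sigma):
    """Use the full covariance via pairwise differences; re-normalize."""
    var = np.diag(Sigma)
    pair = var[None, :] + var[:, None] - 2.0 * Sigma   # Var[a_{k'} - a_k]
    diff = mu[None, :] - mu[:, None]                   # mu_{k'} - mu_k
    p = 1.0 / np.exp(diff / np.sqrt(1.0 + LAM * pair)).sum(axis=1)
    return p / p.sum()


def mc(mu, Sigma, n=200_000, seed=0):
    """Monte Carlo reference for E[softmax(a)], a ~ N(mu, Sigma)."""
    rng = np.random.default_rng(seed)
    a = rng.multivariate_normal(mu, Sigma, size=n)
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return (e / e.sum(axis=1, keepdims=True)).mean(axis=0)
```

With zero covariance all variants coincide with a plain softmax; with a high-variance top logit, the mean-field variants temper its probability toward the Monte Carlo value.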
Intuition

The simple form of the mean-field approximations makes it possible to understand them intuitively. We focus on mf-1. We first rewrite it in the familiar "softmax" form:

$$\mathbb{E}[p_k] \approx \frac{\exp(\mu_k / \beta_k)}{\sum_{k'} \exp(\mu_{k'} / \beta_{k'})}.$$

Note that this form is similar to a "softmax" with temperature scaling, $\mathrm{softmax}_k(\mu / T)$. However, there are several important differences. In mf-1, the temperature scaling factor is category-specific:

$$\beta_k = \sqrt{1 + \lambda \sigma_k^2}. \qquad (14)$$

Importantly, the factor depends on the variance of the category. For a prediction with high variance, the temperature for that category is high, reducing the corresponding "probability" $\exp(\mu_k / \beta_k)$. In other words, the scaling factor is both category-specific and data-dependent, providing additional flexibility over a global temperature scaling factor.
Implementation nuances

Because of this category-specific temperature scaling, the output of mf-1 (and likewise the approximation from mf-2) is no longer guaranteed to be a multinomial probability. Proper normalization should be performed,

$$\tilde{p}_k = \frac{\mathbb{E}[p_k]}{\sum_{k'} \mathbb{E}[p_{k'}]}.$$
2.4 Other implementation considerations
Temperature scaling. Temperature scaling was shown to be useful for obtaining calibrated probabilities Guo et al. (2017). This can be easily included as well, with an activation temperature $T$ applied to the logits,

$$p_k(x) = \mathrm{softmax}_k\!\left(\frac{w^\top \phi(x)}{T}\right). \qquad (15)$$
We can also combine with another "temperature" scaling factor $\tau$, representing how well the models in the pseudo-ensemble are concentrated,

$$w \sim \mathcal{N}(\mu_w,\; \tau \Sigma_w). \qquad (16)$$

Here $\tau = 1$ is for the pseudo-ensembles or the posterior Wenzel et al. (2020). Note that these two temperatures control variability differently. When $\tau \to 0$, the ensemble focuses on one model. When $T \to 0$, each model in the ensemble moves to "hard" decisions, as in eq. (7). Using mf-1 as an example,

$$\mathbb{E}[p_k] \propto \exp\!\left(\frac{\mu_k}{\sqrt{T^2 + \lambda \tau \sigma_k^2}}\right), \qquad (17)$$

where $\sigma_k^2$ is computed at $\tau = 1$. Empirically, we can tune the temperatures $T$ and $\tau$ as hyperparameters on a held-out set, to optimize the predictive performance.
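The joint effect of the two temperatures on the mean-field form can be sketched as follows (our implementation; the closed form results from applying the mf-1 formula of eq. (12) to logits divided by $T$ with covariance scaled by $\tau$):

```python
import numpy as np

LAM = np.pi / 8.0


def mf1_tempered(mu, var, T=1.0, tau=1.0):
    """mf-1 with activation temperature T and ensemble temperature tau."""
    p = np.exp(mu / np.sqrt(T ** 2 + LAM * tau * var))
    return p / p.sum()


mu = np.array([3.0, 1.0, 0.0])
var = np.array([2.0, 0.5, 0.1])
```

Setting $\tau \to 0$ collapses the ensemble to a single model (a plain temperature-scaled softmax), lowering $T$ sharpens decisions, and $\tau = 1$ with spread logits tempers confidence.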
Computational complexity and scalability. The bulk of the computation, as in Bayesian approximate inference, lies in computing $\Sigma_{\text{jack}}$ and $\Sigma_{\text{boot}}$, as in eq. (5) and eq. (6), or more precisely the product between the inverse Hessian and vectors, cf. eq. (9). For smaller models, computing and storing $H^{-1}$ exactly is attractive. For large models, one can compute the inverse-Hessian-vector product approximately using multiple Hessian-vector products Pearlmutter (1994); Agarwal et al. (2016); Koh and Liang (2017). Alternatively, we can approximate the inverse Hessian using Kronecker factorization Ritter et al. (2018). In short, any advances in computing the inverse Hessian and related quantities can be used to accelerate the computation needed in this paper.
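A matrix-free sketch of the inverse-Hessian-vector product via conjugate gradient (our illustration; in practice the `hvp` closure would come from Pearlmutter's trick or autodiff rather than an explicit matrix):

```python
import numpy as np


def conjugate_gradient(hvp, b, iters=50, tol=1e-12):
    """Solve H x = b using only the Hessian-vector product `hvp`.

    Assumes H is symmetric positive definite (e.g., a damped Hessian).
    """
    x = np.zeros_like(b)
    r = b - hvp(x)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x


# Demo with an explicit SPD "Hessian"; only the matvec closure is used.
rng = np.random.default_rng(0)
A = rng.normal(size=(10, 10))
H = A @ A.T + 10.0 * np.eye(10)   # damped, well-conditioned
b = rng.normal(size=10)
x = conjugate_gradient(lambda v: H @ v, b)
```

Each iteration costs one Hessian-vector product, so the Hessian itself never needs to be materialized or inverted.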
An application example. In Algorithm 1, we exemplify the application of the mean-field approximation to Gaussian-softmax integration for uncertainty estimation. For brevity, we consider the last (fully-connected) layer's parameters as a part of the deep neural network's parameters. We assume the mapping prior to the fully-connected layer is deterministic.
3 Related Work
Resampling methods, such as the jackknife and the bootstrap, are classical statistical tools for assessing confidence intervals Tukey (1958); Miller (1974); Efron and Stein (1981); Efron (1992). Recent results have shown that carefully designed jackknife estimators Barber et al. (2019) can achieve worst-case coverage guarantees on regression problems.
However, the exhaustive retraining in the jackknife or the bootstrap can be cumbersome in practice. Recent works have leveraged the idea of influence functions Koh and Liang (2017); Cook and Weisberg (1982) to alleviate the computational challenge. Schulam and Saria (2019) combine the influence function with random data-weight samples to approximate the variance of predictions in bootstrapping; Alaa and van der Schaar (2019) derive a higher-order influence function approximation for jackknife estimators. Theoretical properties of the approximation are studied in Giordano et al. (2018, 2019). Madras et al. (2019) apply this approximation to identify underdetermined test points. The infinitesimal jackknife follows the same idea as those works Jaeckel (1972); Giordano et al. (2018, 2019). To avoid explicitly storing those models, we seek a Gaussian distribution to approximately characterize them. This connects to several lines of existing research.
The posterior of a Bayesian model can be approximated with a Gaussian distribution $\mathcal{N}(\hat{\theta}, H^{-1})$ MacKay (1992); Ritter et al. (2018), where $\hat{\theta}$ is interpreted as the maximum a posteriori estimate (thus incorporating any prior the model might want to include). If we approximate the observed Fisher information matrix $F$ in eq. (5) with the Hessian $H$, the two pseudo-ensemble distributions and the Laplace approximation to the posterior have identical forms except that the Hessians are scaled differently, which can be captured by the ensemble temperature of eq. (16). Note that despite the similarity, the infinitesimal jackknife is a "frequentist" method and does not assume well-formed Bayesian modeling.
The trajectory of stochastic gradient descent gives rise to a sequence of models in which the covariance matrix among batch means converges to the sandwich form $H^{-1} F H^{-1}$ Chen et al. (2014, 2016); Maddox et al. (2019), similar in form to the pseudo-ensemble distributions. But note that those approaches do not collect information around the maximum likelihood estimate, while we do. It is also a classical result that the maximum likelihood estimator converges in distribution to a normal distribution; $\hat{\theta}$ and $\Sigma_{\text{jack}}$ are simply the plug-in estimators of the true mean and covariance of this asymptotic distribution.

4 Experiments
We first describe the setup for our empirical studies. We then demonstrate the effectiveness of the proposed approach on the MNIST dataset, mainly contrasting the results from sampling with those from the mean-field approximation and other implementation choices. We then provide a detailed comparison to popular approaches for uncertainty estimation. In the main text, we focus on classification problems. We evaluate on commonly used benchmark datasets, summarized in Table 1. In Appendix B.5, we report results on regression tasks.
4.1 Setup
Model and training details for classifiers.
For MNIST, we train a twolayer MLP with 256 ReLU units per layer, using Adam optimizer for 100 epochs. For CIFAR10, we train a ResNet20 with Adam for 200 epochs. On CIFAR100, we train a DenseNetBC121 with SGD optimizer for 300 epochs. For ILSVRC2012, we train a ResNet50 with SGD optimizer for 90 epochs.
Dataset  # of classes  train / heldout / test splits  OOD Dataset  heldout / test splits
MNIST LeCun et al. (1998)  10  55k / 5k / 10k  NotMNIST Bulatov (2011)  5k / 13.7k
CIFAR-10 Krizhevsky (2009)  10  45k / 5k / 10k  LSUN (resized) Yu et al. (2015)  1k / 9k
CIFAR-100 Krizhevsky (2009)  100  –  SVHN Netzer et al. (2011)  5k / 21k
ILSVRC-2012 Deng et al. (2009)  1,000  1,281k / 25k / 25k  ImageNet-O Hendrycks et al. (2019)  2k / –†
†: the number of samples is limited; best results on heldout are reported.
Evaluation tasks and metrics. We evaluate on two tasks: predictive uncertainty on in-domain samples, and detection of out-of-distribution samples. For in-domain predictive uncertainty, we report the classification error rate, negative log-likelihood (NLL), and expected calibration error (ECE) Guo et al. (2017), measured in $\ell_1$ distance, on the test set. NLL is a proper scoring rule Lakshminarayanan et al. (2017), and measures the KL-divergence between the ground-truth data distribution and the predictive distribution of the classifiers. ECE measures the discrepancy between the histogram of the probabilities predicted by the classifiers and the ones observed in the data – properly calibrated classifiers will yield matching histograms. Both metrics are commonly used in the literature, and lower is better. In Appendix B.1, we give precise definitions.
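ECE can be sketched as follows (our implementation; binning conventions vary slightly across papers): partition predictions into confidence bins and average the per-bin |accuracy − mean confidence|, weighted by bin counts.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-count-weighted average of |accuracy - mean confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # weight = fraction of samples falling into this bin
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece
```

A perfectly calibrated model (accuracy equals confidence in every bin) gets ECE 0; a bin predicted at 0.75 confidence but only 50% correct contributes its 0.25 gap.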
On the task of out-of-distribution (OOD) detection, we assess how well the classifier's output, interpreted as a probability, can be used to distinguish invalid samples from normal in-domain images. Following common practice Hendrycks and Gimpel (2016); Liang et al. (2017); Lee et al. (2018), we report two threshold-independent metrics: area under the receiver operating characteristic curve (AUROC), and area under the precision-recall curve (AUPR). Since the precision-recall curve is sensitive to the choice of positive class, we report "AUPR in : out", where in-distribution and out-of-distribution images are specified as positives respectively. We also report detection accuracy, the optimal accuracy achieved among all thresholds in classifying in-/out-of-domain samples. For all three metrics, higher is better.
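AUROC has a convenient threshold-free form as the Mann-Whitney statistic: the probability that a randomly chosen positive score exceeds a randomly chosen negative score, with ties counted half. A small sketch (ours; quadratic in sample size, fine for illustration):

```python
import numpy as np


def auroc(pos_scores, neg_scores):
    """P(random positive score > random negative score), ties counted half."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

Perfect separation of in-domain and OOD scores yields 1.0, chance-level scoring yields 0.5, so no score threshold ever needs to be chosen.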
Competing approaches for uncertainty estimation. We compare to popular approaches. (i) Frequentist approaches: the maximum likelihood point estimate (mle) as a baseline, temperature scaling calibration (T. Scale) Guo et al. (2017), the deep ensemble method (ensemble) Lakshminarayanan et al. (2017), and the resampling-based uncertainty estimator rue Schulam and Saria (2019). (ii) Variants of Bayesian neural networks (bnn): dropout Gal and Ghahramani (2016) approximates the Bayesian posterior using stochastic forward passes of the network with dropout; bnn(vi) trains the network via stochastic variational inference to maximize the evidence lower bound; bnn(kfac) applies the Laplace approximation to construct a Gaussian posterior, via layer-wise Kronecker-product-factorized covariance matrices Ritter et al. (2018).
Hyperparameter tuning. For indomain uncertainty estimation, we use the NLL on the heldout sets to tune hyperparameters. For the OOD detection task, we use AUROC on the heldout to select hyperparameters. We report the results of the best hyperparameters on the test sets. The key hyperparameters are the temperatures, regularization or prior in bnn methods, and dropout rates.
Other implementation details. When the Hessian needs to be inverted, we add a dampening term following Ritter et al. (2018); Schulam and Saria (2019) to ensure positive semi-definiteness and that the smallest eigenvalue of the damped Hessian is 1. For bnn(vi), we use Flipout Wen et al. (2018) to reduce gradient variances and follow Snoek et al. (2019) for variational inference on deep ResNets. On ImageNet, we compute the Kronecker-product-factorized Hessian matrix, rather than the full one, due to the high dimensionality. For bnn(kfac) and rue, we use minibatch approximations on subsets of the training set to scale up to ImageNet, as suggested in Ritter et al. (2018); Schulam and Saria (2019).

4.2 Infinitesimal jackknife on MNIST
Most uncertainty quantification methods, including the ones proposed in this paper, have “knobs” to tune. We first concentrate on MNIST and perform extensive studies of the proposed approaches to understand several design choices. Table 2 contrasts them.
Use all layers or just the last layer.

Uncertainty quantification on deep neural nets is computationally costly, given their large number of parameters, especially when the methods need information about the curvature of the loss function. To this end, many approaches assume layer-wise independence Ritter et al. (2018) or low-rank components Maddox et al. (2019); Mishkin et al. (2018) and, in some cases, restrict uncertainty quantification to only a few layers Zeng et al. (2018) – in particular, the last layer Kristiadi et al. (2020). The top portion of Table 2 shows that restricting to the last layer harms the in-domain ECE slightly but improves OOD detection significantly.

Effectiveness of the mean-field approximation. Table 2 also shows that the mean-field approximation has similar performance to sampling from the pseudo-ensemble distribution on the in-domain tasks but noticeably improves OOD detection. mf-1 performs the best among the three variants.
Effect of ensemble and activation temperatures. We study the roles of the ensemble and activation temperatures in §2.4. We grid-search the two and generate heatmaps of NLL and AUROC on the held-out sets, shown in fig:fTempAct. Note that $(T, \tau) = (1, 0)$ corresponds to mle.

What is particularly interesting is that for NLL, a higher activation temperature ($T > 1$) and a lower ensemble temperature ($\tau < 1$) work best. For AUROC, however, lower temperatures on both work best. That lower $\tau$ is preferred was also observed in Wenzel et al. (2020), and using $T$ for better calibration is noted in Guo et al. (2017). On the other end, for OOD detection, Liang et al. (2017) suggest a very high activation temperature ($T = 1000$ in their work, likely due to using a single model instead of an ensemble).
4.3 Comparison to other approaches
Given our results in the previous section, we report mf-1 of the infinitesimal jackknife in the rest of the paper. Table 3 contrasts various methods on the in-domain tasks of MNIST, CIFAR-100, and ImageNet. Table 4 contrasts performance on the out-of-distribution detection task (OOD). Results on CIFAR-10 (both in-domain and OOD), as well as CIFAR-100 OOD on SVHN, are in Appendix B.4.
Method  MNIST (Err% / NLL / ECE%)  CIFAR-100 (Err% / NLL / ECE%)  ImageNet (Err% / NLL / ECE%)
mle  1.67 / 0.10 / 1.18  24.3 / 1.03 / 10.4  23.66 / 0.92 / 3.03
T. Scale  1.67 / 0.06 / 0.74  24.3 / 0.92 / 3.13  23.66 / 0.92 / 2.09
ensemble  1.25 / 0.05 / 0.30  19.6 / 0.71 / 2.00  21.22 / 0.83 / 3.10
rue  1.72 / 0.08 / 0.85  24.3 / 0.99 / 8.60  23.63 / 0.92 / 2.83
dropout  1.67 / 0.06 / 0.68  23.7 / 0.84 / 3.43  24.93 / 0.99 / 1.62
bnn(vi)  1.72 / 0.14 / 1.13  25.6 / 0.98 / 8.35  26.54 / 1.17 / 4.41
bnn(kfac)  1.71 / 0.06 / 0.16  24.1 / 0.89 / 3.36  23.64 / 0.92 / 2.95
mf-IJ (ours)†  1.67 / 0.05 / 0.20  24.3 / 0.91 / 1.49  23.66 / 0.91 / 0.93
†: for CIFAR-100 and ImageNet, only the last layer is used due to the high computational cost.
Method  MNIST vs. NotMNIST (Acc. / ROC / PR in:out)  CIFAR-100 vs. LSUN (Acc. / ROC / PR in:out)  ImageNet vs. ImageNet-O (Acc. / ROC / PR in:out)
mle  67.6 / 53.8 / 40.1 : 72.5  72.7 / 80.0 / 83.5 : 75.2  58.4 / 51.6 / 78.4 : 26.3
T. Scale  67.4 / 66.7 / 48.8 : 77.0  76.6 / 84.3 / 86.7 : 80.3  58.5 / 54.5 / 79.2 : 27.8
ensemble  86.5 / 88.0 / 70.4 : 92.8  74.4 / 82.3 / 85.7 : 77.8  60.1 / 50.6 / 78.9 : 25.7
rue  61.1 / 64.7 / 60.5 : 68.4  75.2 / 83.0 / 86.7 : 77.8  58.4 / 51.6 / 78.3 : 26.3
dropout  88.8 / 91.4 / 78.7 : 93.5  69.8 / 77.3 / 81.1 : 72.9  59.5 / 51.7 / 79.0 : 26.3
bnn(vi)  86.9 / 81.1 / 59.8 : 89.9  62.5 / 67.7 / 71.4 : 63.1  57.8 / 52.0 / 75.7 : 26.8
bnn(kfac)  88.7 / 93.5 / 89.1 : 93.8  72.9 / 80.4 / 83.9 : 75.5  60.3 / 53.1 / 79.6 : 26.9
mf-IJ (ours)†  91.9 / 96.9 / 96.7 : 97.0  82.2 / 89.9 / 92.0 : 86.6  63.2 / 62.9 / 83.5 : 33.3
Acc.: detection accuracy (%). ROC: area under the ROC curve. PR in:out: area under the precision-recall curve with "in" and "out" domains as positives, respectively.
†: for CIFAR-100 and ImageNet, only the last layer is used due to the high computational cost.
While deep ensemble Lakshminarayanan et al. (2017) achieves the best performance on in-domain tasks most of the time, the proposed approach typically outperforms the other approaches, especially on the calibration metric ECE. On the OOD detection task, mf-IJ significantly outperforms all other approaches in all metrics.
ImageNet-O is a particularly hard dataset for OOD detection Hendrycks et al. (2019). The images are drawn from ImageNet-22K and thus share similar low-level statistics with the in-domain data. Moreover, the images are chosen such that they are misclassified by existing networks (ResNet-50) with high confidence, so-called "natural adversarial examples". We follow Hendrycks et al. (2019) in using the 200-class subset of the test set, consisting of the classes confusable with the OOD images, as the in-distribution examples. Hendrycks et al. (2019) further demonstrate that many popular approaches to improving neural network robustness, like adversarial training, hardly help on ImageNet-O. mf-IJ improves over the other baselines by a clear margin.
Robustness to distributional shift. Snoek et al. (2019) point out that many uncertainty estimation methods are sensitive to distributional shift. Thus, we evaluate the robustness of mf-IJ on rotated MNIST images with increasing rotation angles. The ECE curves in fig:fTempAct(c) show that mf-IJ is as robust as, or more robust than, the other approaches.
5 Conclusion
We propose a simple, efficient, and general-purpose confidence estimator for deep neural networks. The main idea is to approximate the ensemble of an infinite number of infinitesimal jackknife samples with a closed-form Gaussian distribution, and to derive an efficient mean-field approximation to classification predictions when the softmax layer is applied to Gaussian-distributed activations. Empirically, mf-IJ surpasses or is competitive with state-of-the-art methods for uncertainty estimation while incurring lower computational cost and memory footprint.
References
 Secondorder stochastic optimization in linear time. stat 1050, pp. 15. Cited by: §1, §2.4.
 The discriminative jackknife: quantifying predictive uncertainty via higherorder influence functions. External Links: Link Cited by: §3.

Pitfalls of indomain uncertainty estimation and ensembling in deep learning
. arXiv preprint arXiv:2002.06470. Cited by: §1.  Ensemble learning for multilayer networks. In Advances in neural information processing systems, pp. 395–401. Cited by: §1.
 Predictive inference with the jackknife+. arXiv preprint arXiv:1905.02928. Cited by: §1, §3.
 Pattern recognition and machine learning. Springer. Cited by: Appendix A.
 Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424. Cited by: §1.
 Notmnist dataset. Google (Books/OCR), Tech. Rep.[Online]. Available: http://yaroslavvb. blogspot. it/2011/09/notmnistdataset. html 2. Cited by: Table 1.
 Stochastic gradient hamiltonian monte carlo. In International conference on machine learning, pp. 1683–1691. Cited by: §3.
 Statistical inference for model parameters in stochastic gradient descent. arXiv preprint arXiv:1610.08637. Cited by: §1, §1, §3.
 Residuals and influence in regression. New York: Chapman and Hall. Cited by: §2.1, §3.
 Semianalytical approximations to statistical moments of sigmoid and softmax mappings of normal variables. arXiv preprint arXiv:1703.00091. Cited by: Appendix A, §2.3.

Imagenet: a largescale hierarchical image database.
In
2009 IEEE conference on computer vision and pattern recognition
, pp. 248–255. Cited by: Table 1.  The jackknife estimate of variance. The Annals of Statistics, pp. 586–596. Cited by: §2.1, §3.
 Bootstrap methods: another look at the jackknife. In Breakthroughs in statistics, pp. 569–593. Cited by: §1, §3.
 Dropout as a bayesian approximation: representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059. Cited by: §B.5, Table B.9, §1, §4.1.
 A higherorder swiss army infinitesimal jackknife. arXiv preprint arXiv:1907.12116. Cited by: §1, §3.
 A swiss army infinitesimal jackknife. arXiv preprint arXiv:1806.00550. Cited by: §1, §2.1, §3.
 On calibration of modern neural networks. arXiv preprint arXiv:1706.04599. Cited by: §1, §1, §2.4, §4.1, §4.1, §4.2.
 A baseline for detecting misclassified and outofdistribution examples in neural networks. arXiv preprint arXiv:1610.02136. Cited by: §4.1.
 Natural adversarial examples. arXiv preprint arXiv:1907.07174. Cited by: §4.3, Table 1.

Probabilistic backpropagation for scalable learning of bayesian neural networks
. In International Conference on Machine Learning, pp. 1861–1869. Cited by: §B.5, Table B.9.  Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407. Cited by: §1.
 The infinitesimal jackknife. Bell Telephone Laboratories. Cited by: §1, §2.1, §3.
Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1885–1894. Cited by: §1, §2.1, §2.4, §3.
Being Bayesian, even just a bit, fixes overconfidence in ReLU networks. arXiv preprint arXiv:2002.10118. Cited by: §4.2.
Learning multiple layers of features from tiny images. Master's thesis, University of Toronto. Cited by: Table 1.
Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6405–6416. Cited by: §B.5, Table B.9, §1, §2.1, §4.1, §4.1, §4.3.
The MNIST database of handwritten digits, 1998. URL http://yann.lecun.com/exdb/mnist. Cited by: Table 1.
A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pp. 7165–7175. Cited by: §4.1.
Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690. Cited by: §4.1, §4.2.
Bayesian methods for adaptive models. Ph.D. thesis, California Institute of Technology. Cited by: §1, §1, §3.
A simple baseline for Bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, pp. 13132–13143. Cited by: §B.3, §1, §1, §3, §4.2.
Detecting extrapolation with local ensembles. arXiv preprint arXiv:1910.09573. Cited by: §3.
Stochastic gradient descent as approximate Bayesian inference. The Journal of Machine Learning Research 18 (1), pp. 4873–4907. Cited by: §1.
Deep learning via Hessian-free optimization. In ICML, Vol. 27, pp. 735–742. Cited by: §1.
The jackknife: a review. Biometrika 61 (1), pp. 1–15. Cited by: §3.
SLANG: fast structured covariance approximations for Bayesian deep learning with natural gradient. In Advances in Neural Information Processing Systems, pp. 6245–6255. Cited by: §4.2.
Reading digits in natural images with unsupervised feature learning. NIPS. Cited by: Table 1.
Fast exact multiplication by the Hessian. Neural Computation 6 (1), pp. 147–160. Cited by: §1, §2.4.
A scalable Laplace approximation for neural networks. In 6th International Conference on Learning Representations, ICLR 2018, Conference Track Proceedings, Vol. 6. Cited by: Table B.9, §2.4, §3, §4.1, §4.1, §4.2.
Can you trust this prediction? Auditing pointwise reliability after learning. arXiv preprint arXiv:1901.00403. Cited by: §B.5, Table B.9, §1, §1, §3, §4.1, §4.1.
Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, pp. 13969–13980. Cited by: §B.2, §4.1, §4.3.
Bias and confidence in not quite large samples. Ann. Math. Statist. 29, pp. 614. Cited by: §2.1, §3.
Flipout: efficient pseudo-independent weight perturbations on mini-batches. arXiv preprint arXiv:1803.04386. Cited by: §4.1.
How good is the Bayes posterior in deep neural networks really? arXiv preprint arXiv:2002.02405. Cited by: §2.4, §4.2.
LSUN: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365. Cited by: Table 1.

The relevance of Bayesian layer positioning to model uncertainty in deep Bayesian active learning. arXiv preprint arXiv:1811.12535. Cited by: §4.2.
Appendix A Mean-Field Approximation for Gaussian-Softmax Integration
In this section, we derive the mean-field approximation for the Gaussian-softmax integration, eq. (10) in the main text. Assume the same notations as in §2.3, where the activation $z \in \mathbb{R}^C$ fed to the softmax follows a Gaussian $\mathcal{N}(\mu, \Sigma)$, with marginal means $\mu_c$, marginal variances $\sigma_c^2$, and covariances $\sigma_{cy}$.
\[
\mathbb{E}_{z\sim\mathcal{N}(\mu,\Sigma)}\big[\operatorname{softmax}(z)_y\big]
= \mathbb{E}\Big[\Big(\sum_{c} e^{z_c - z_y}\Big)^{-1}\Big]
\approx \Big(\sum_{c}\mathbb{E}_{p(z_c,\,z_y)}\big[e^{z_c - z_y}\big]\Big)^{-1}
\tag{A.1}
\]
where "integrate independently" means integrating each term in the summand independently, taking the expectation with respect to the marginal distribution over the pair $(z_c, z_y)$. This approximation is prompted by the mean-field approximation for expectations of nonlinear functions.²

²Similar to the classical use of the mean-field approximation on Ising models, we use the term mean-field approximation to capture the notion that the expectation is computed by considering the weak, pairwise coupling effect from points on the lattice, i.e., between $z_c$ and $z_y$.
Next we plug in the well-known approximation to $\mathbb{E}[s(z)]$, where $s(z) = (1+e^{-z})^{-1}$ is the sigmoid function, which states that
\[
\mathbb{E}_{z\sim\mathcal{N}(\mu,\sigma^2)}\big[s(z)\big] \approx s\!\left(\frac{\mu}{\sqrt{1+\lambda\sigma^2}}\right)
\tag{A.2}
\]
where $\lambda$ is a constant and is usually chosen to be $\pi/8$ or $3/\pi^2$. This is a well-known result; see Bishop [2006]. We further approximate eq. (A.1) by considering different ways to compute the bivariate expectations in the denominator.
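As a quick numerical check, the approximation in eq. (A.2) can be compared against a Monte-Carlo estimate. The sketch below is our own illustration (not the paper's code) and assumes the $\lambda = \pi/8$ choice:

```python
import numpy as np

LAM = np.pi / 8.0  # one common choice for lambda; 3 / pi**2 is the other


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def mean_sigmoid_approx(mu, var, lam=LAM):
    """Eq. (A.2): E[s(z)] ~ s(mu / sqrt(1 + lam * var)) for z ~ N(mu, var)."""
    return sigmoid(mu / np.sqrt(1.0 + lam * var))


def mean_sigmoid_mc(mu, var, n=200_000, seed=0):
    """Monte-Carlo reference for the same expectation."""
    rng = np.random.default_rng(seed)
    z = rng.normal(mu, np.sqrt(var), size=n)
    return sigmoid(z).mean()


mu, var = 1.0, 4.0
approx = mean_sigmoid_approx(mu, var)
mc = mean_sigmoid_mc(mu, var)
```

For moderate variances the closed form tracks the Monte-Carlo estimate to within a couple of percent, which is what makes it usable inside the softmax denominator below.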
Mean-Field 0
In the denominator, we ignore the variance of $z_c$ for $c \neq y$ and replace $z_c$ with its mean $\mu_c$, and compute the expectation only with respect to $z_y$. We arrive at
\[
\mathbb{E}\big[\operatorname{softmax}(z)_y\big]
\approx \mathbb{E}_{z_y}\Big[\Big(1+\sum_{c\neq y} e^{\mu_c - z_y}\Big)^{-1}\Big]
= \mathbb{E}_{z_y}\Big[s\Big(z_y - \log\sum_{c\neq y} e^{\mu_c}\Big)\Big]
\]
Applying eq. (A.2), we have
\[
\mathbb{E}\big[\operatorname{softmax}(z)_y\big]
\approx s\!\left(\frac{\mu_y - \log\sum_{c\neq y} e^{\mu_c}}{\sqrt{1+\lambda\sigma_y^2}}\right)
\tag{A.3}
\]
Mean-Field 1
If we instead replace each bivariate expectation with the expectation over the two independent marginals in the denominator, recognizing that $z_c - z_y$ is then Gaussian with mean $\mu_c - \mu_y$ and variance $\sigma_c^2 + \sigma_y^2$, and applying eq. (A.2) to each pairwise term, we get,
\[
\mathbb{E}\big[\operatorname{softmax}(z)_y\big]
\approx \Big(\sum_{c} \exp\Big(\frac{\mu_c - \mu_y}{\sqrt{1+\lambda(\sigma_c^2+\sigma_y^2)}}\Big)\Big)^{-1}
\tag{A.4}
\]
Mean-Field 2

If we keep the full bivariate marginal, the difference $z_c - z_y$ is Gaussian with mean $\mu_c - \mu_y$ and variance $\sigma_c^2 + \sigma_y^2 - 2\sigma_{cy}$, where $\sigma_{cy}$ is the covariance between $z_c$ and $z_y$, and the same derivation yields
\[
\mathbb{E}\big[\operatorname{softmax}(z)_y\big]
\approx \Big(\sum_{c} \exp\Big(\frac{\mu_c - \mu_y}{\sqrt{1+\lambda(\sigma_c^2+\sigma_y^2-2\sigma_{cy})}}\Big)\Big)^{-1}
\tag{A.5}
\]
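To make the mean-field idea concrete, here is a small numerical sketch (our own illustration, not the paper's code) of a Mean-Field-1-style formula with a diagonal covariance, checked against Monte-Carlo sampling; the constant $\lambda=\pi/8$ and the test values are our assumptions:

```python
import numpy as np

LAM = np.pi / 8.0  # scaling constant as in eq. (A.2)


def mf1_softmax(mu, var, lam=LAM):
    """Mean-field approximation of E[softmax(z)] for z ~ N(mu, diag(var)).

    Each pairwise term exp(z_c - z_y) in the denominator is replaced by
    exp((mu_c - mu_y) / sqrt(1 + lam * (var_c + var_y))).
    """
    diff = mu[None, :] - mu[:, None]                 # diff[y, c] = mu_c - mu_y
    scale = np.sqrt(1.0 + lam * (var[None, :] + var[:, None]))
    return 1.0 / np.exp(diff / scale).sum(axis=1)


def mc_softmax(mu, var, n=200_000, seed=0):
    """Monte-Carlo reference for E[softmax(z)]."""
    rng = np.random.default_rng(seed)
    z = rng.normal(mu, np.sqrt(var), size=(n, len(mu)))
    e = np.exp(z - z.max(axis=1, keepdims=True))     # numerically stable softmax
    return (e / e.sum(axis=1, keepdims=True)).mean(axis=0)


mu = np.array([2.0, 0.0, -1.0])
var = np.array([0.5, 1.0, 0.3])
p_mf = mf1_softmax(mu, var)
p_mc = mc_softmax(mu, var)
```

The approximate probabilities are not normalized by construction, but for moderate variances they sum close to one and stay near the sampling estimate, without drawing a single sample.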
Appendix B Experiments
B.1 Definitions of evaluation metrics
NLL is defined as the KL-divergence between the data distribution and the model's predictive distribution,
\[
\mathrm{NLL} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{c=1}^{C} \mathbf{1}_{y_n}(c)\,\log \hat{p}_c(x_n)
\tag{B.1}
\]
where $\mathbf{1}_{y_n}$ is the one-hot embedding of the label $y_n$.
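Eq. (B.1) is the per-example average of the negative log-probability assigned to the true class; a minimal implementation (our own sketch) might look like:

```python
import numpy as np


def nll(probs, labels):
    """Average negative log-likelihood, as in eq. (B.1).

    probs: (N, C) predictive probabilities; labels: (N,) integer classes.
    Indexing by the label is equivalent to the dot product with the
    one-hot embedding of the label.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))


probs = np.array([[0.8, 0.2],
                  [0.3, 0.7]])
labels = np.array([0, 1])
```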
ECE measures the discrepancy between the predicted probabilities and the empirical accuracy of a classifier in terms of the $\ell_1$ distance. It is computed as the expected difference between per-bucket confidence and per-bucket accuracy, where all $N$ predictions are binned into $M$ buckets, such that bucket $B_m$ contains the predictions whose confidence falls within the interval $\big(\frac{m-1}{M}, \frac{m}{M}\big]$. ECE is defined as,
\[
\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N}\,\big|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\big|
\tag{B.2}
\]
where $\mathrm{acc}(B_m) = \frac{1}{|B_m|}\sum_{n\in B_m}\mathbf{1}[\hat{y}_n = y_n]$ and $\mathrm{conf}(B_m) = \frac{1}{|B_m|}\sum_{n\in B_m}\hat{p}_n$, with $\hat{p}_n$ the confidence of prediction $n$.
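Eq. (B.2) translates directly into code; the following is our own sketch (equal-width buckets over the confidence of the arg-max class):

```python
import numpy as np


def ece(probs, labels, n_bins=10):
    """Expected calibration error, as in eq. (B.2).

    Predictions are bucketed by their confidence max_c p_c into n_bins
    equal-width intervals; each bucket contributes its size-weighted
    |accuracy - confidence| gap.
    """
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    # bucket index for the interval ((m-1)/M, m/M]; confidence 1.0 goes to the last bucket
    idx = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for m in range(n_bins):
        mask = idx == m
        if mask.any():
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return total


probs = np.array([[0.9, 0.1],
                  [0.9, 0.1],
                  [0.65, 0.35],
                  [0.55, 0.45]])
labels = np.array([0, 1, 0, 1])
```

On this toy input the buckets at confidence 0.9, 0.65, and 0.55 contribute 0.5·|0.5−0.9|, 0.25·|1−0.65|, and 0.25·|0−0.55| respectively.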
B.2 Details of experiments in the main text
Table B.5 provides key hyperparameters used in training deep neural networks on different datasets.
Dataset  MNIST  CIFAR-10  CIFAR-100  ImageNet
Architecture  MLP  ResNet-20  DenseNet-BC-121  ResNet-50
Optimizer  Adam  Adam  SGD  SGD
Learning rate  0.001  0.001  0.1  0.1
Learning rate decay  exponential (0.998)  staircase (at epochs 80, 120, 160)  staircase (at epochs 150, 225)  staircase (at epochs 30, 60, 80)
Weight decay  0
Batch size  100  8  128  256
Epochs  100  200  300  90
For our method, we use in our implementation. For the ensemble approach, we use models on all datasets, as in Snoek et al. [2019]. For rue, dropout, bnn(vi) and bnn(kfac), where sampling is applied at inference time, we use Monte-Carlo samples on MNIST, and on CIFAR-10, CIFAR-100 and ImageNet. We use buckets when computing ECE on MNIST, CIFAR-10 and CIFAR-100, and on ImageNet.
B.3 Applying the mean-field approximation to other Gaussian posterior inference tasks
Approx. method  MNIST: in-domain  NotMNIST: OOD detection
  (%)  NLL  ECE  Acc. (%)  AU-ROC  AU-PR (in : out)
sampling, 20 samples  1.55  0.06  0.60  85.78  91.15  91.91 : 90.34
sampling, 100 samples  1.55  0.06  0.60  88.33  93.19  86.95 : 95.72
sampling, 500 samples  1.55  0.06  0.60  82.68  88.80  82.51 : 91.23
mean-field  1.56  0.05  0.25  87.56  93.26  91.49 : 93.96
mean-field  1.55  0.05  0.46  85.28  90.68  85.15 : 92.39
mean-field  1.55  0.05  0.51  82.18  87.40  78.12 : 90.69
The mean-field approximation is interesting in its own right. In particular, it can be applied to approximate any Gaussian-softmax integral. In this section, we apply it to the swa-Gaussian posterior (swag) Maddox et al. [2019], whose covariance is a low-rank matrix plus a diagonal matrix derived from the SGD iterates. We tune the ensemble and activation temperatures with swag to perform uncertainty tasks on MNIST, and the results are reported in Table B.6. We ran swag on the softmax layer for 50 epochs, and collect models along the trajectory to form the posterior.
We use the default values for the other hyperparameters, such as the swag learning rate and the rank of the low-rank component, as the main objective here is to combine the mean-field approximation with different Gaussian posteriors. As we can see from Table B.6, swag, when using sampling for approximate inference, has a variance larger than expected.³ Nonetheless, within this variance, using the mean-field approximation instead of sampling performs similarly. The notable exception is the in-domain tasks, where the mean-field approximation consistently outperforms sampling.

³In particular, the variance is higher than the variance in the sampling results of the infinitesimal jackknife, cf. Table 2 in the main text.

This suggests that the mean-field approximation can work with other Gaussian distributions as a replacement for sampling to reduce the computation cost.
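Because the swag covariance is low-rank plus diagonal, the marginal variances needed by the mean-field formulas can be read off the factors without materializing the full covariance matrix; the following is our own sketch with made-up dimensions, applied directly to a Gaussian over logits:

```python
import numpy as np

rng = np.random.default_rng(0)
C, K = 5, 3                          # number of classes, rank of the low-rank part
mu = rng.normal(size=C)              # logit mean
d = 0.1 + 0.2 * rng.random(C)        # diagonal component of the covariance
L = 0.2 * rng.normal(size=(C, K))    # low-rank factor: cov = diag(d) + L @ L.T

# marginal logit variances for the mean-field formulas, without forming the C x C matrix
var_mf = d + (L ** 2).sum(axis=1)

# sampling-based alternative: draw z ~ N(mu, diag(d) + L L^T) directly from the factors
n = 100_000
z = mu + rng.normal(size=(n, C)) * np.sqrt(d) + rng.normal(size=(n, K)) @ L.T
```

The vector `var_mf` is exactly the diagonal of `diag(d) + L @ L.T`, so the mean-field route replaces the entire sampling loop with one elementwise computation on the factors.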
Method  SVHN: OOD detection
  Acc.  AU-ROC  AU-PR (in : out)
mle  73.90  80.69  74.03 : 86.93
T. scaling  76.94  83.43  77.53 : 88.11
Ensemble  75.99  83.48  77.75 : 88.85
rue  73.90  80.69  74.03 : 86.93
Dropout  74.13  81.87  75.78 : 88.07
bnn llvi  71.87  79.55  72.08 : 87.27
bnn (kfac)  74.13  80.97  74.46 : 87.06
ours  81.38  88.04  84.59 : 91.23
B.4 More results on CIFAR-10 and CIFAR-100
Tables B.7 and B.8 supplement the main text with additional experimental results: on the CIFAR-10 dataset, with both in-domain and out-of-distribution detection tasks, and on CIFAR-100, with out-of-distribution detection using the SVHN dataset. bnn llvi refers to stochastic variational inference on the last layer only.
The results support the same observations as in the main text: our method performs similarly to the other approaches on in-domain tasks, but noticeably outperforms them on out-of-distribution detection.
B.5 Regression experiments
We conduct real-world regression experiments on UCI datasets. We follow the experimental setup in Gal and Ghahramani [2016], Hernández-Lobato and Adams [2015], Lakshminarayanan et al. [2017], where each dataset is randomly split into 20 train-test folds, and the average results with the standard deviation are reported. We use the same architecture as previous works, with 1 hidden layer of 50 ReLU units. Since in regression tasks the output distribution can be computed analytically, we do not use the mean-field approximation in these experiments; nevertheless, we still refer to our method by the same name to be consistent with the other datasets.
For our approach, we use the pseudo-ensemble Gaussian distribution of the last-layer parameters to compute a Gaussian distribution of the network output $\mathcal{N}\big(\hat{y}(x_*), \hat\sigma^2(x_*)\big)$, as in eq. (9) in the main text. We also estimate the variance of the observation noise $\hat\sigma^2_{\mathrm{noise}}$ from the residuals on the training set Schulam and Saria [2019]. Therefore, the predictive distribution of a test point $x_*$ is given by $\mathcal{N}\big(\hat{y}(x_*),\, \hat\sigma^2(x_*) + \hat\sigma^2_{\mathrm{noise}}\big)$. We can compute the negative log-likelihood (NLL) as,
\[
-\log p(y_* \mid x_*) = \frac{1}{2}\log\big(2\pi \hat{s}^2(x_*)\big) + \frac{\big(y_* - \hat{y}(x_*)\big)^2}{2\,\hat{s}^2(x_*)},
\qquad \hat{s}^2(x_*) = \hat\sigma^2(x_*) + \hat\sigma^2_{\mathrm{noise}}.
\]
We tune the ensemble temperature on the held-out sets, and fix the activation temperature to a constant.
For the sampling-based methods, rue and bnn(kfac), the NLL from the prediction samples can be computed as,
\[
-\log p(y_* \mid x_*) = \frac{1}{2}\log\big(2\pi \bar\sigma^2\big) + \frac{(y_* - \bar{y})^2}{2\bar\sigma^2},
\]
where $\bar{y}$ is the mean of the prediction samples, and $\bar\sigma^2$ is the variance of the prediction samples. The sampling is conducted on all network layers in both rue and bnn(kfac).
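The per-point Gaussian NLL above is straightforward to compute from prediction samples; a minimal sketch (our own illustration, with made-up sample values):

```python
import numpy as np


def gaussian_nll(y, mean, var):
    """Negative log-density of y under N(mean, var)."""
    return 0.5 * np.log(2.0 * np.pi * var) + (y - mean) ** 2 / (2.0 * var)


def sample_nll(y, pred_samples):
    """NLL from the mean and variance of a set of prediction samples.

    An estimated observation-noise variance could also be added to `v`
    before evaluating the density, as done for our analytic predictive.
    """
    m = pred_samples.mean()
    v = pred_samples.var()
    return gaussian_nll(y, m, v)


samples = np.array([0.9, 1.0, 1.1])
```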
Results
We report the NLL and RMSE on the test sets in Table B.9 and compare with the other approaches. rue and bnn(kfac) use samples, and the ensemble method has models in the ensemble. We achieve similar or better performance than rue and kfac without the computational cost of explicit sampling. Compared to dropout and ensemble, which are the state-of-the-art methods on these datasets, we also achieve competitive results. We highlight the best-performing method on each dataset, as well as any method whose result is within one standard deviation of the best.
B.6 More results on comparing all layers versus just the last layer
We conduct an experiment similar to the top portion of Table 2 in the main text, to study the effect of restricting the parameter uncertainty to the last layer only. We use NotMNIST as the in-domain dataset and treat MNIST as the out-of-distribution dataset. We use a two-layer MLP with 256 ReLU hidden units. Table B.10 supports the same observation as in the main text: restricting to the last layer improves OOD detection, while it does not have a significant negative impact on the in-domain tasks.