1 Introduction
Deep neural networks have achieved remarkable performance in various tasks, but they have critical limitations in the reliability of their predictions. One example is that inference results are often overly confident even for unseen or ambiguous examples. Since many practical applications, including medical diagnosis, autonomous driving, and machine inspection, require accurate uncertainty estimation as well as high prediction accuracy for each inference, such an overconfidence issue makes deep neural networks inappropriate to deploy for real-world problems in spite of their impressive accuracy.
Regularization is a common technique in training deep neural networks to avoid overfitting and improve generalization performance (Srivastava et al., 2014; Huang et al., 2016; Ioffe & Szegedy, 2015). Although regularization is effective for learning robust models, its objective is not directly related to generating score distributions aligned with the uncertainty of predictions. Hence, existing deep neural networks are often poor at calibrating prediction accuracy and confidence.
Our goal is to learn deep neural networks that are able to estimate the uncertainty of each prediction while maintaining accuracy. In other words, we propose a generic framework to calibrate prediction score (confidence) with accuracy in deep neural networks. Our algorithm starts with the observation that the variance of prediction scores measured over multiple stochastic inferences is highly correlated with the accuracy and confidence of the prediction based on the average score. Based on its Bayesian interpretation, we employ stochastic regularization techniques such as stochastic depth or dropout to obtain multiple stochastic inference results. By exploiting this empirical observation together with its theoretical interpretation, we design a novel loss function that enables a deep neural network to predict confidence-calibrated scores based only on a single prediction, without multiple stochastic inferences. Our contribution is summarized below:

We provide a generic framework to estimate uncertainty of a prediction based on stochastic inferences in deep neural networks, which is supported by empirical observations and theoretical analysis.

We propose a novel variance-weighted confidence-integrated loss function in a principled way, which enables deep neural networks to produce confidence-calibrated predictions without performing stochastic inferences or introducing additional hyperparameters.

The proposed framework substantially reduces the overconfidence issue and estimates uncertainty accurately for various combinations of network architectures and datasets.
The rest of the paper is organized as follows. We first discuss prior research related to our algorithm and describe the theoretical background for the Bayesian interpretation of our approach in Sections 2 and 3, respectively. Section 4 presents our confidence calibration algorithm based on stochastic inferences, and Section 5 illustrates experimental results.
2 Related Work
Uncertainty estimation is a critical problem in deep neural networks and receives growing attention from the machine learning community. The Bayesian approach is a common tool that provides a mathematical framework for uncertainty estimation in deep neural networks. However, exact Bayesian inference is not tractable in deep neural networks due to its high computational cost, and various approximate inference techniques have been proposed, including MCMC (Neal, 1996), Laplace approximation (MacKay, 1992), and variational inference (Barber & Bishop, 1998; Graves, 2011; Hoffman et al., 2013). Recently, a Bayesian interpretation of multiplicative noise has been employed to estimate uncertainty in deep neural networks (Gal & Ghahramani, 2016; McClure & Kriegeskorte, 2016). There are several approaches outside Bayesian modeling, including post-processing (Niculescu-Mizil & Caruana, 2005; Platt, 2000; Zadrozny & Elkan, 2001; Guo et al., 2017) and deep ensembles (Lakshminarayanan et al., 2017). All the post-processing methods require a hold-out validation set to adjust prediction scores after training, and the ensemble-based technique employs multiple models to estimate uncertainty.

Stochastic regularization is a common technique to improve generalization performance by injecting random noise into deep neural networks. The most notable method is dropout (Srivastava et al., 2014), which randomly drops hidden units by multiplying them with Bernoulli random noise. There exist several variants that, for example, drop weights (Wan et al., 2013) or skip layers (Huang et al., 2016). Most stochastic regularization methods exploit stochastic inferences during training but perform deterministic inferences using the whole network during testing. In contrast, we also use stochastic inferences during testing to obtain diverse and reliable outputs.
Although the following works do not address uncertainty estimation, their main ideas are related to our objective. Label smoothing (Szegedy et al., 2016) encourages models to be less confident by preventing a network from assigning the full probability to a single class. A similar loss function is discussed for training confidence-calibrated classifiers in Lee et al. (2018), but that work focuses on discriminating in-distribution from out-of-distribution examples, rather than estimating uncertainty or alleviating the miscalibration of in-distribution examples. On the other hand, Pereyra et al. (2017) claim that blind label smoothing and entropy penalization, which integrate loss functions based on the same concept as Szegedy et al. (2016); Lee et al. (2018), enhance accuracy, but the improvement is marginal in practice.

3 Bayesian Interpretation of Stochastic Regularization
This section describes a Bayesian interpretation of stochastic regularization in deep neural networks and discusses the relation between stochastic regularization and uncertainty modeling.
3.1 Stochastic Methods for Regularization
One popular class of regularization techniques is stochastic regularization, which introduces random noise into a network to perturb its inputs or weights. We focus on multiplicative binary noise injection, where random binary noise is applied to the inputs or weights by element-wise multiplication, since such stochastic regularization techniques are widely used (Srivastava et al., 2014; Wan et al., 2013; Huang et al., 2016). Note that input perturbation can be reformulated as weight perturbation. For example, dropout, which injects binary noise into activations, is interpretable as a weight perturbation that masks out all the weights associated with the dropped inputs. Therefore, if a classification network with parameters $\theta$ is trained with stochastic regularization by minimizing the cross-entropy, the loss function can be defined by

$\mathcal{L}(\theta) = -\sum_{i=1}^{N} \log p(y_i \mid x_i, \hat{\omega}_i),$   (1)

where $\hat{\omega}_i = \theta \odot \epsilon_i$ is a set of parameters perturbed by element-wise multiplication with a random noise sample $\epsilon_i$, and $(x_i, y_i)$ is an input-output pair in the training dataset $\mathcal{D}$.

At inference time, the network is parameterized by the expectation of the perturbed parameters, $\theta = \mathbb{E}[\hat{\omega}]$, to predict an output $\hat{y}$, i.e.,

$\hat{y} = \operatorname*{argmax}_{y}\; p(y \mid x, \theta = \mathbb{E}[\hat{\omega}]).$   (2)
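As a sanity check of Equation 2, the following sketch (our own toy one-linear-layer model, not code from the paper) verifies that averaging many stochastic passes with multiplicative Bernoulli noise converges to the single deterministic pass that uses the expected weights:

```python
import random

def stochastic_forward(x, w, keep_prob, rng):
    """One stochastic pass: each weight is kept with probability `keep_prob`
    (multiplicative Bernoulli noise), mimicking dropout on the weights."""
    return sum(wi * (1.0 if rng.random() < keep_prob else 0.0) * xi
               for wi, xi in zip(w, x))

def deterministic_forward(x, w, keep_prob):
    """Inference-time pass with the expected perturbed weights
    E[theta ⊙ eps] = keep_prob * theta (Equation 2)."""
    return sum(wi * keep_prob * xi for wi, xi in zip(w, x))

rng = random.Random(0)
x, w, p = [1.0, 2.0, -1.0], [0.5, -0.3, 0.8], 0.8
avg = sum(stochastic_forward(x, w, p, rng) for _ in range(20000)) / 20000
det = deterministic_forward(x, w, p)
```

The Monte Carlo average `avg` and the deterministic output `det` agree up to sampling noise, which is exactly the weight-scaling rule used by dropout at test time.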
3.2 Bayesian Modeling
Given the dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ with $N$ examples, the Bayesian objective is to estimate the posterior distribution of the model parameter, denoted by $p(\omega \mid \mathcal{D})$, to predict a label $y$ for an input $x$, which is given by

$p(y \mid x, \mathcal{D}) = \int p(y \mid x, \omega)\, p(\omega \mid \mathcal{D})\, d\omega.$   (3)

A common technique for the posterior estimation is variational approximation, which introduces an approximate distribution $q(\omega)$ and minimizes the Kullback-Leibler (KL) divergence with the true posterior, $D_{\mathrm{KL}}(q(\omega)\,\Vert\,p(\omega \mid \mathcal{D}))$, as follows:

$\mathcal{L}_{\mathrm{VA}}(q) = -\sum_{i=1}^{N} \int q(\omega) \log p(y_i \mid x_i, \omega)\, d\omega + D_{\mathrm{KL}}\big(q(\omega)\,\Vert\,p(\omega)\big).$   (4)

The intractable integral and the summation over the entire dataset in Equation 4 are approximated by the Monte Carlo method and mini-batch optimization, resulting in

$\hat{\mathcal{L}}_{\mathrm{VA}}(q) = -\frac{N}{M}\sum_{i=1}^{M} \log p(y_i \mid x_i, \hat{\omega}_i) + D_{\mathrm{KL}}\big(q(\omega)\,\Vert\,p(\omega)\big),$   (5)

where $\hat{\omega}_i \sim q(\omega)$ is a sample from the approximate distribution, $N$ is the number of examples, and $M$ is the size of a mini-batch. Note that the first term is the data likelihood and the second term is the divergence of the approximate distribution with respect to the prior distribution.
3.3 Interpreting Stochastic Regularization as a Bayesian Model
Suppose that we train a classifier with $l_2$ regularization by a stochastic gradient descent method. Then, the loss function in Equation 1 is rewritten as

$\hat{\mathcal{L}}(\theta) = -\frac{1}{M}\sum_{i=1}^{M} \log p(y_i \mid x_i, \hat{\omega}_i) + \lambda \lVert\theta\rVert_2^2,$   (6)

where $l_2$ regularization is applied to the deterministic parameters $\theta$ with weight $\lambda$. Optimizing this loss function is equivalent to optimizing Equation 5 if there exists a proper prior $p(\omega)$ and $q(\omega)$ is approximated as a Gaussian mixture distribution (Gal & Ghahramani, 2016). Note that Gal & Ghahramani (2016) cast dropout training as approximate Bayesian inference. We can interpret training with stochastic depth (Huang et al., 2016) within the same framework by a simple modification. (See Appendix A and B for details.) Then, the predictive distribution of a model trained with stochastic regularization is approximately given by

$p(y \mid x, \mathcal{D}) \approx \int p(y \mid x, \omega)\, q(\omega)\, d\omega.$   (7)

Following Gal & Ghahramani (2016) and Teye et al. (2018), we estimate the predictive mean and uncertainty by Monte Carlo approximation, drawing $T$ parameter samples $\hat{\omega}_t \sim q(\omega)$:

$\mathbb{E}_{q}[\mathbf{y}] \approx \frac{1}{T}\sum_{t=1}^{T} p(\mathbf{y} \mid x, \hat{\omega}_t), \qquad \mathrm{Cov}_{q}[\mathbf{y}] \approx \frac{1}{T}\sum_{t=1}^{T} p(\mathbf{y} \mid x, \hat{\omega}_t)\, p(\mathbf{y} \mid x, \hat{\omega}_t)^{\top} - \mathbb{E}_{q}[\mathbf{y}]\,\mathbb{E}_{q}[\mathbf{y}]^{\top},$   (8)

where $\mathbf{y}$ denotes the score vector over the class labels. Equation 8 means that the average prediction and its variance can be computed directly from multiple stochastic inferences.

4 Confidence Calibration through Stochastic Inference
We present a novel confidence calibration technique for prediction in deep neural networks, realized by a variance-weighted confidence-integrated loss function. We first describe our observation that the variance of multiple stochastic inferences is closely related to the accuracy and confidence of predictions, and then provide an end-to-end training framework for confidence self-calibration. After training, prediction accuracy and uncertainty are directly accessible from the scores obtained by a single forward pass. This section presents our observations from stochastic inferences and the technical details of our confidence calibration technique.


4.1 Empirical Observations
Equation 8 suggests that the variation of models provides the variance of multiple stochastic predictions for a single example. Figure 1 presents how the variance of multiple stochastic inferences given by stochastic depth or dropout is related to the accuracy and confidence of the corresponding average prediction, where confidence is measured by the maximum score of the average prediction. In the figure, the accuracy and score of each bin are computed with the examples belonging to the corresponding bin of the normalized variance. We present results on CIFAR-100 with ResNet-34 and VGGNet with 16 layers. The histograms illustrate a strong correlation between the predicted variance and the reliability (accuracy and confidence) of a prediction; we can therefore estimate the accuracy and uncertainty of an example effectively based on its prediction variance given by multiple stochastic inferences.
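The quantities behind this observation can be sketched in plain Python. The toy noisy-logit model below is our own illustrative construction, and we normalize the variance via Bhattacharyya coefficients, a choice the paper states only later in Section 5.1:

```python
import math, random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def mean_and_normalized_variance(preds):
    """Average prediction over T stochastic passes and a dispersion score in
    [0, 1]: one minus the mean Bhattacharyya coefficient between each pass and
    the average prediction. It is 0 when all passes agree and grows as they
    diverge, serving as a normalized variance."""
    T, C = len(preds), len(preds[0])
    mean = [sum(p[c] for p in preds) / T for c in range(C)]
    bc = [sum(math.sqrt(p[c] * mean[c]) for c in range(C)) for p in preds]
    alpha = 1.0 - sum(bc) / T
    return mean, alpha

# Toy stochastic model: logits hit by multiplicative binary noise, as in dropout.
rng = random.Random(0)
logits = [2.0, 0.5, -1.0]
preds = [softmax([z * (1.0 if rng.random() < 0.7 else 0.0) for z in logits])
         for _ in range(50)]
mean_pred, alpha = mean_and_normalized_variance(preds)
```

By the Cauchy-Schwarz inequality the Bhattacharyya coefficient never exceeds one, so `alpha` always lies in [0, 1] and is exactly zero when every stochastic pass returns the same distribution.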
4.2 VarianceWeighted ConfidenceIntegrated Loss
The strong correlation of accuracy and confidence with the predicted variance observed in Figure 1 shows great potential for making confidence-calibrated predictions through stochastic inferences. However, variance computation involves multiple stochastic inferences, each requiring a separate forward pass. This incurs additional computational cost and may produce inconsistent results.
To overcome these limitations, we propose a generic framework for training accuracy-score-calibrated networks whose prediction score from a single forward pass directly provides the confidence of the prediction. This objective is achieved by designing a loss function that augments the standard cross-entropy loss with a confidence-calibration term, where the two terms are balanced by the variance measured from multiple stochastic inferences. Specifically, our variance-weighted confidence-integrated loss for the whole training dataset is defined by a linear interpolation of the standard cross-entropy with the ground-truth label and the cross-entropy with the uniform distribution $\mathcal{U}(y)$, which is formally given by

$\mathcal{L}_{\mathrm{VWCI}}(\theta) = \sum_{i=1}^{N} \frac{1}{T} \sum_{t=1}^{T} \Big[ -(1-\alpha_i) \log p(y_i \mid x_i, \hat{\omega}_{i,t}) + \alpha_i\, D_{\mathrm{KL}}\big(\mathcal{U}(y)\,\Vert\, p(y \mid x_i, \hat{\omega}_{i,t})\big) + \xi_i \Big],$   (9)

where $\alpha_i$ is the normalized variance, $\hat{\omega}_{i,t}$ is a sampled model parameter with binary noise for stochastic prediction, $T$ is the number of stochastic inferences, and $\xi_i$ is a constant.
The two terms in our variance-weighted confidence-integrated loss push the network in opposite directions: the first term encourages the network to fit the ground-truth label, while the second term forces the network to make a prediction close to the uniform distribution. These terms are linearly interpolated by an instance-specific balancing coefficient $\alpha_i$, which is given by normalizing the prediction variance of an example obtained from multiple stochastic inferences. Note that the normalized variance is distinct for each training example and is used to measure model uncertainty. Therefore, optimizing our loss function produces gradient signals that force predictions toward the uniform distribution for examples with high uncertainty, indicated by high variance, while intensifying the prediction confidence of examples with low variance.
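A minimal per-example sketch of this loss, assuming the KL term is expanded into a cross-entropy with the uniform distribution (dropping the constant); `vwci_example_loss` and the toy inputs are our own illustrative names, not the paper's code:

```python
import math

def vwci_example_loss(stoch_preds, label, alpha):
    """Per-example variance-weighted confidence-integrated loss (sketch of
    Eq. 9): cross-entropy with the ground truth and cross-entropy with the
    uniform distribution, linearly interpolated by the example's normalized
    variance `alpha` and averaged over the T stochastic predictions."""
    T, C = len(stoch_preds), len(stoch_preds[0])
    total = 0.0
    for p in stoch_preds:
        ce_gt = -math.log(p[label])                      # fit the true label
        ce_uniform = -sum(math.log(pc) for pc in p) / C  # flatten the scores
        total += (1.0 - alpha) * ce_gt + alpha * ce_uniform
    return total / T

confident = [[0.9, 0.05, 0.05]]          # one stochastic pass, correct & confident
low_var_loss = vwci_example_loss(confident, label=0, alpha=0.0)
high_var_loss = vwci_example_loss(confident, label=0, alpha=1.0)
```

For a confident correct prediction, a low `alpha` reduces the loss to the ordinary cross-entropy, while a high `alpha` penalizes the peaked distribution, which is exactly the intended balancing behavior.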
By training deep neural networks with the proposed loss function, we can estimate the uncertainty of each testing example with a single forward pass. Unlike ordinary models, a prediction score of our model is well-calibrated and represents the confidence of the prediction, which means that we can rely more on predictions with high scores.
4.3 ConfidenceIntegrated Loss
Our claim is that an adaptive combination of the cross-entropy losses with respect to the ground-truth and uniform distributions is a reasonable method for learning uncertainty. As a special case of the variance-weighted confidence-integrated loss, we also present a blind version of this combination, which can serve as a baseline uncertainty estimation technique. We refer to this baseline as the confidence-integrated loss, which is given by

$\mathcal{L}_{\mathrm{CI}}(\theta) = \sum_{i=1}^{N} \Big[ -\log p(y_i \mid x_i, \theta) + \beta\, D_{\mathrm{KL}}\big(\mathcal{U}(y)\,\Vert\, p(y \mid x_i, \theta)\big) + \xi \Big],$   (10)

where $p(y \mid x_i, \theta)$ is the predicted distribution with model parameter $\theta$ and $\xi$ is a constant. The main idea of this loss function is regularization with the uniform distribution: the score distributions of uncertain examples are expected to flatten first, while the distributions of confident ones remain intact, where the impact of the confidence-integrated loss term is controlled by a global hyperparameter $\beta$.
A loss function of this form is also employed in Pereyra et al. (2017) to regularize deep neural networks and improve classification accuracy. However, Pereyra et al. (2017) does not discuss confidence calibration, and the reported accuracy improvement is marginal. On the other hand, Lee et al. (2018) discusses a similar loss function but focuses on differentiating between in-distribution and out-of-distribution examples, measuring the loss of each example with only one of the two loss terms depending on its origin.
Contrary to the existing approaches, we employ the loss function in Equation 10 to estimate prediction confidence in deep neural networks. Although the confidence-integrated loss makes sense intuitively, the blind selection of the hyperparameter $\beta$ limits its generality compared to our variance-weighted confidence-integrated loss.
4.4 Relation to Other Calibration Approaches
There are several score calibration techniques (Guo et al., 2017; Zadrozny & Elkan, 2002; Naeini et al., 2015; Niculescu-Mizil & Caruana, 2005) that adjust confidence scores through post-processing. Among them, Guo et al. (2017) propose a method that calibrates the confidence of predictions by scaling the logits of a network with a global temperature $\tau$. The scaling is performed before applying the softmax function, and $\tau$ is trained on a validation dataset. As discussed in Guo et al. (2017), this simple technique is equivalent to maximizing the entropy of the output distribution $p(y \mid x_i)$. It is also identical to minimizing the KL divergence from the uniform distribution because

$D_{\mathrm{KL}}\big(p(y \mid x_i)\,\Vert\,\mathcal{U}(y)\big) = -\mathbb{H}\big(p(y \mid x_i)\big) + \xi,$   (11)

where $\xi$ is a constant. We can therefore formulate another confidence-integrated loss with the entropy as

$\mathcal{L}_{\mathrm{CI}'}(\theta) = \sum_{i=1}^{N} \Big[ -\log p(y_i \mid x_i, \theta) - \gamma\, \mathbb{H}\big(p(y \mid x_i, \theta)\big) \Big],$   (12)

where $\gamma$ is a constant. Equation 12 suggests that temperature scaling in Guo et al. (2017) is closely related to our framework.
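The identity underlying Equation 11 (with the constant equal to the log of the number of classes) is easy to verify numerically; the helpers below are an illustrative sketch, not code from Guo et al. (2017):

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution."""
    return -sum(pc * math.log(pc) for pc in p if pc > 0)

def kl_to_uniform(p):
    """D_KL(p || U) for a C-class distribution p. Expanding the definition
    gives log C - H(p), so minimizing this KL term is exactly maximizing
    the entropy of p, as stated around Equation 11."""
    C = len(p)
    return sum(pc * math.log(pc * C) for pc in p if pc > 0)

p = [0.7, 0.2, 0.1]
lhs = kl_to_uniform(p)
rhs = math.log(len(p)) - entropy(p)
```

Both sides agree for any distribution `p`, which is why the entropy-penalty loss of Equation 12 and the uniform-KL term of Equation 10 differ only by a constant and a sign convention.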
5 Experiments
5.1 Experimental Setting and Implementation Details
We choose four widely used deep neural network architectures to test our framework: ResNet (He et al., 2016), VGGNet (Simonyan & Zisserman, 2015), WideResNet (Zagoruyko & Komodakis, 2016), and DenseNet (Huang et al., 2017). We employ stochastic depth in ResNet as proposed in Huang et al. (2016), while employing dropout (Srivastava et al., 2014) before every fully-connected layer except the classification layer in the other architectures. Note that, as discussed in Section 3.3, since both stochastic depth and dropout inject multiplicative binary noise into within-layer activations or residual blocks, they are equivalent to noise injection into network weights. Hence, training with an $l_2$ regularization term enables us to interpret stochastic depth and dropout as Bayesian models.
We evaluate the proposed framework on two benchmarks, Tiny ImageNet and CIFAR-100. Tiny ImageNet contains 64×64 images of 200 object classes, whereas CIFAR-100 has 32×32 images of 100 object classes. There are 500 training images per class in both datasets. For testing, we use the validation set of Tiny ImageNet and the test set of CIFAR-100, which contain 50 and 100 images per class, respectively. To test the two benchmarks with the same architecture, we resize the images in Tiny ImageNet to 32×32.

All networks are trained by stochastic gradient descent with a momentum of 0.9 for 300 epochs. We set the initial learning rate to 0.1 and decay it by a factor of 0.2 at epochs 60, 120, 160, 200, and 250. Each batch consists of 64 training examples for ResNet, WideResNet, and DenseNet, and 256 for the VGG architecture. To train networks with the proposed variance-weighted confidence-integrated loss, we draw $T$ samples of network parameters for each input image and compute the normalized variance by running $T$ forward passes; the number of samples $T$ is set to 5. The normalized variance is estimated from the Bhattacharyya coefficients between the individual predictions and the average prediction.

5.2 Evaluation Metric
We measure the classification accuracy and calibration scores of the trained models: expected calibration error (ECE), maximum calibration error (MCE), negative log likelihood (NLL), and Brier score. Let $B_m$ be the set of indices of test examples whose scores for the ground-truth labels fall into the interval $\big(\frac{m-1}{M}, \frac{m}{M}\big]$, where $M$ is the number of bins. ECE and MCE are formally defined by

$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \big| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \big|, \qquad \mathrm{MCE} = \max_{m \in \{1,\dots,M\}} \big| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \big|,$

where $n$ is the number of test samples. Also, the accuracy and confidence of each bin are given by

$\mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbb{1}(\hat{y}_i = y_i), \qquad \mathrm{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i,$

where $\mathbb{1}$ is an indicator function, $\hat{y}_i$ and $y_i$ are the predicted and true labels of the $i$-th example, and $\hat{p}_i$ is its predicted confidence. NLL and Brier score are additional metrics for calibration and are defined as

$\mathrm{NLL} = -\sum_{i=1}^{n} \log p(y_i \mid x_i, \theta), \qquad \mathrm{Brier} = \frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} \big( p(c \mid x_i, \theta) - \mathbb{1}(y_i = c) \big)^2,$

where $C$ is the number of classes. We note that low values for all these calibration scores indicate a well-calibrated network.
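A compact sketch of ECE and MCE as defined above; the binning helper and the toy inputs are our own illustrative constructions:

```python
def calibration_errors(confidences, correct, n_bins=10):
    """ECE and MCE: partition predictions into confidence bins, then compare
    each bin's accuracy with its average confidence. ECE weights the gaps by
    bin size; MCE takes the worst gap."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        m = min(int(conf * n_bins), n_bins - 1)   # confidence in (0, 1]
        bins[m].append((conf, ok))
    ece, mce = 0.0, 0.0
    for b in bins:
        if not b:
            continue
        acc = sum(ok for _, ok in b) / len(b)
        avg_conf = sum(c for c, _ in b) / len(b)
        gap = abs(acc - avg_conf)
        ece += (len(b) / n) * gap
        mce = max(mce, gap)
    return ece, mce

# Calibrated: 80%-confident predictions that are right 80% of the time.
ece_good, mce_good = calibration_errors([0.8] * 5, [1, 1, 1, 1, 0])
# Overconfident: 85%-confident predictions that are always wrong.
ece_bad, mce_bad = calibration_errors([0.85, 0.85], [0, 0])
```

The calibrated toy set yields zero error, while the overconfident one yields an ECE and MCE of 0.85, matching the bin-wise definitions above.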











Table 1: Classification accuracy (%) and calibration scores on Tiny ImageNet (top) and CIFAR-100 (bottom). For CI, we present the mean and standard deviation of results from models trained with five different β's. We also show results from the oracle CI loss, CI[Oracle], which takes the most optimistic value among all β's in each column; the numbers for CI[Oracle] may therefore come from different β's. Refer to Appendix C for the full results.

Tiny ImageNet:

Model | Method | Acc (%) | ECE | MCE | NLL | Brier
ResNet-34 | Baseline | 50.82 | 0.067 | 0.147 | 2.050 | 0.628
ResNet-34 | CI | 50.09 ± 1.08 | 0.134 ± 0.079 | 0.257 ± 0.098 | 2.270 ± 0.212 | 0.665 ± 0.037
ResNet-34 | VWCI | 52.80 | 0.027 | 0.076 | 1.949 | 0.605
ResNet-34 | CI[Oracle] | 51.45 | 0.035 | 0.171 | 2.030 | 0.620
VGG-16 | Baseline | 46.58 | 0.346 | 0.595 | 4.220 | 0.844
VGG-16 | CI | 46.82 ± 0.81 | 0.226 ± 0.095 | 0.435 ± 0.107 | 3.224 ± 0.468 | 0.761 ± 0.054
VGG-16 | VWCI | 48.03 | 0.053 | 0.142 | 2.373 | 0.659
VGG-16 | CI[Oracle] | 47.39 | 0.122 | 0.320 | 2.812 | 0.701
WideResNet-16-8 | Baseline | 55.92 | 0.132 | 0.237 | 1.974 | 0.593
WideResNet-16-8 | CI | 55.80 ± 0.44 | 0.115 ± 0.040 | 0.288 ± 0.100 | 1.980 ± 0.114 | 0.594 ± 0.017
WideResNet-16-8 | VWCI | 56.66 | 0.046 | 0.136 | 1.866 | 0.569
WideResNet-16-8 | CI[Oracle] | 56.38 | 0.050 | 0.208 | 1.851 | 0.572
DenseNet-40-12 | Baseline | 42.50 | 0.020 | 0.154 | 2.423 | 0.716
DenseNet-40-12 | CI | 40.18 ± 1.68 | 0.059 ± 0.061 | 0.152 ± 0.082 | 2.606 ± 0.208 | 0.748 ± 0.035
DenseNet-40-12 | VWCI | 43.25 | 0.025 | 0.089 | 2.410 | 0.712
DenseNet-40-12 | CI[Oracle] | 41.21 | 0.025 | 0.094 | 2.489 | 0.726

CIFAR-100:

Model | Method | Acc (%) | ECE | MCE | NLL | Brier
ResNet-34 | Baseline | 77.19 | 0.109 | 0.304 | 1.020 | 0.345
ResNet-34 | CI | 77.56 ± 0.60 | 0.134 ± 0.131 | 0.251 ± 0.128 | 1.064 ± 0.217 | 0.360 ± 0.057
ResNet-34 | VWCI | 78.64 | 0.034 | 0.089 | 0.908 | 0.310
ResNet-34 | CI[Oracle] | 78.54 | 0.029 | 0.087 | 0.921 | 0.321
VGG-16 | Baseline | 73.78 | 0.187 | 0.486 | 1.667 | 0.437
VGG-16 | CI | 73.75 ± 0.35 | 0.183 ± 0.079 | 0.489 ± 0.214 | 1.526 ± 0.175 | 0.436 ± 0.034
VGG-16 | VWCI | 73.87 | 0.098 | 0.309 | 1.277 | 0.391
VGG-16 | CI[Oracle] | 73.78 | 0.083 | 0.285 | 1.289 | 0.396
WideResNet-16-8 | Baseline | 77.52 | 0.103 | 0.278 | 0.984 | 0.336
WideResNet-16-8 | CI | 77.35 ± 0.21 | 0.133 ± 0.091 | 0.297 ± 0.108 | 1.062 ± 0.180 | 0.356 ± 0.044
WideResNet-16-8 | VWCI | 77.74 | 0.038 | 0.101 | 0.891 | 0.314
WideResNet-16-8 | CI[Oracle] | 77.53 | 0.074 | 0.211 | 0.931 | 0.327
DenseNet-40-12 | Baseline | 65.91 | 0.074 | 0.134 | 1.238 | 0.463
DenseNet-40-12 | CI | 64.72 ± 1.46 | 0.070 ± 0.040 | 0.138 ± 0.055 | 1.312 ± 0.125 | 0.482 ± 0.028
DenseNet-40-12 | VWCI | 67.45 | 0.026 | 0.094 | 1.161 | 0.439
DenseNet-40-12 | CI[Oracle] | 66.20 | 0.019 | 0.053 | 1.206 | 0.456










Table 2: Comparison with temperature scaling (TS) on Tiny ImageNet (top) and CIFAR-100 (bottom). TS (case 1) uses the entire training set for both training and calibration; TS (case 2) uses 90% of the training set for training and the remaining 10% for calibration.

Tiny ImageNet:

Model | Method | Acc (%) | ECE | MCE | NLL | Brier
ResNet-34 | TS (case 1) | 50.82 | 0.162 | 0.272 | 2.241 | 0.660
ResNet-34 | TS (case 2) | 47.20 | 0.021 | 0.080 | 2.159 | 0.661
ResNet-34 | VWCI | 52.80 | 0.027 | 0.076 | 1.949 | 0.605
VGG-16 | TS (case 1) | 46.58 | 0.358 | 0.604 | 4.425 | 0.855
VGG-16 | TS (case 2) | 46.53 | 0.028 | 0.067 | 2.361 | 0.671
VGG-16 | VWCI | 48.03 | 0.053 | 0.142 | 2.373 | 0.659
WideResNet-16-8 | TS (case 1) | 55.92 | 0.200 | 0.335 | 2.259 | 0.627
WideResNet-16-8 | TS (case 2) | 53.95 | 0.027 | 0.224 | 1.925 | 0.595
WideResNet-16-8 | VWCI | 56.66 | 0.046 | 0.136 | 1.866 | 0.569
DenseNet-40-12 | TS (case 1) | 42.50 | 0.037 | 0.456 | 2.436 | 0.717
DenseNet-40-12 | TS (case 2) | 41.63 | 0.024 | 0.109 | 2.483 | 0.728
DenseNet-40-12 | VWCI | 43.25 | 0.025 | 0.089 | 2.410 | 0.712

CIFAR-100:

Model | Method | Acc (%) | ECE | MCE | NLL | Brier
ResNet-34 | TS (case 1) | 77.67 | 0.133 | 0.356 | 1.162 | 0.354
ResNet-34 | TS (case 2) | 77.40 | 0.036 | 0.165 | 0.886 | 0.323
ResNet-34 | VWCI | 78.64 | 0.034 | 0.089 | 0.908 | 0.310
VGG-16 | TS (case 1) | 73.66 | 0.197 | 0.499 | 1.770 | 0.445
VGG-16 | TS (case 2) | 72.69 | 0.031 | 0.074 | 1.193 | 0.389
VGG-16 | VWCI | 73.87 | 0.098 | 0.309 | 1.277 | 0.391
WideResNet-16-8 | TS (case 1) | 77.52 | 0.144 | 0.400 | 1.285 | 0.361
WideResNet-16-8 | TS (case 2) | 76.42 | 0.028 | 0.101 | 0.891 | 0.332
WideResNet-16-8 | VWCI | 77.74 | 0.038 | 0.101 | 0.891 | 0.314
DenseNet-40-12 | TS (case 1) | 65.91 | 0.095 | 0.165 | 1.274 | 0.468
DenseNet-40-12 | TS (case 2) | 64.96 | 0.082 | 0.163 | 1.306 | 0.481
DenseNet-40-12 | VWCI | 67.45 | 0.026 | 0.094 | 1.161 | 0.439
5.3 Results
Table 1 presents accuracy and calibration scores for several combinations of network architectures and benchmark datasets. The models trained with the VWCI loss consistently outperform the models with the CI loss, which are special cases of VWCI, as well as the baseline, in both classification accuracy and confidence calibration. The performance of CI is given by the average and standard deviation over 5 different choices of β; these 5 values of β are selected favorably to CI based on our preliminary experiments. CI[Oracle] denotes the most optimistic value among the 5 cases in each column. Note that VWCI presents outstanding results in most cases, even when compared with CI[Oracle], and that the performance of CI is sensitive to the choice of β (see Appendix C for details). These results imply that the proposed loss function balances the two conflicting loss terms effectively using the variance of multiple stochastic inferences, while the performance of CI varies depending on the hyperparameter setting in each dataset.
We also compare the proposed framework with the state-of-the-art post-processing method, temperature scaling (TS) (Guo et al., 2017). The main distinction between post-processing methods and our work is the need for a held-out dataset: our method calibrates scores during training without additional data, while Guo et al. (2017) requires a held-out validation set. To illustrate the effectiveness of our framework, we compare our approach with TS in two scenarios: 1) using the entire training set for both training and calibration, and 2) using 90% of the training set for training and the remaining 10% for calibration. Table 2 shows that case 1 suffers from poor calibration performance, and case 2 loses substantial accuracy due to the reduced training data, although it shows calibration scores comparable to VWCI. VWCI, on the other hand, presents consistently good results in terms of both classification accuracy and calibration performance.
Figure 2: Coverage of ResNet-34 models with respect to the confidence interval on Tiny ImageNet (left) and CIFAR-100 (right). Coverage is computed as the portion of examples with higher accuracy and confidence than the thresholds on the x-axis. To compare VWCI with CI[Oracle], we present results from multiple CI models with oracle β's for individual metrics, which are shown in the graph legends.

A critical benefit of our variance-driven weight in the VWCI loss is the capability to maintain examples with high accuracy and high confidence. This is an important property for building real-world decision making systems with confidence intervals, where decisions should be both highly accurate and confident. Figure 2 illustrates the portion of test examples that have higher accuracy and confidence than varying thresholds in ResNet-34, where VWCI presents better coverage than CI[Oracle] by controlling the weights of the two loss terms effectively based on the variance of multiple stochastic inferences. Note that the coverage of CI often depends significantly on the choice of β, as demonstrated in Figure 2 (right), while VWCI maintains higher coverage than CI using accurately calibrated prediction scores. These results imply that using the predictive uncertainty for balancing the loss terms is preferable to setting a constant coefficient.
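Our reading of the coverage quantity can be sketched as follows (a sketch under our own interpretation; the exact computation behind Figure 2 may differ):

```python
def coverage(confidences, correct, threshold):
    """Portion of examples that are simultaneously correct and at least
    `threshold`-confident, i.e., examples whose accuracy and confidence
    both exceed the threshold on the x-axis."""
    hits = sum(1 for c, ok in zip(confidences, correct)
               if ok and c >= threshold)
    return hits / len(confidences)

cov = coverage([0.95, 0.60, 0.90], [1, 1, 0], threshold=0.8)
```

Sweeping `threshold` over (0, 1] then traces out a coverage curve like those in Figure 2; a well-calibrated model keeps this curve high because its confident predictions are also the accurate ones.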
6 Conclusion
We presented a generic framework for uncertainty estimation of a prediction in deep neural networks by calibrating accuracy and score based on stochastic inferences. Based on the Bayesian interpretation of stochastic regularization and our empirical observations, we claim that the variation of multiple stochastic inferences for a single example is a crucial factor in estimating the uncertainty of the average prediction. Motivated by this fact, we designed the variance-weighted confidence-integrated loss to learn confidence-calibrated networks and enable uncertainty to be estimated from a single prediction. The proposed algorithm is also useful for understanding existing confidence calibration methods in a unified way, and we compared our algorithm with other variations within our framework to analyze their characteristics.
References
Barber & Bishop (1998) D. Barber and Christopher M. Bishop. Ensemble learning for multi-layer networks. In NIPS, 1998.
Gal & Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.
Graves (2011) Alex Graves. Practical variational inference for neural networks. In NIPS, 2011.
 Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In ICML, 2017.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
Hoffman et al. (2013) Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.
 Huang et al. (2016) Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
 Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, pp. 2261–2269, 2017.
 Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
 Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In NIPS, pp. 6405–6416, 2017.
 Lee et al. (2018) Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Training confidencecalibrated classifiers for detecting outofdistribution samples. In ICLR, 2018.

MacKay (1992)
David J. C. MacKay.
A practical bayesian framework for backpropagation networks.
Neural Comput., 4(3):448–472, May 1992. ISSN 08997667. doi: 10.1162/neco.1992.4.3.448. URL http://dx.doi.org/10.1162/neco.1992.4.3.448.  McClure & Kriegeskorte (2016) Patrick McClure and Nikolaus Kriegeskorte. Representation of uncertainty in deep neural networks through sampling. CoRR, abs/1611.01639, 2016.
 Naeini et al. (2015) Mahdi Pakdaman Naeini, Gregory F Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In AAAI, 2015.
 Neal (1996) Radford M. Neal. Bayesian Learning for Neural Networks. SpringerVerlag, 1996. ISBN 0387947248.

NiculescuMizil & Caruana (2005)
Alexandru NiculescuMizil and Rich Caruana.
Predicting good probabilities with supervised learning.
In ICML, 2005.  Pereyra et al. (2017) Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.

Platt (2000)
John Platt.
Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods.
10, 06 2000.  Simonyan & Zisserman (2015) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. In ICLR, 2015.
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. CVPR, 2016.
 Teye et al. (2018) Mattias Teye, Hossein Azizpour, and Kevin Smith. Bayesian uncertainty estimation for batch normalized deep networks. arXiv preprint arXiv:1802.06455, 2018.
 Wan et al. (2013) Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In ICML, 2013.

Zadrozny & Elkan (2001)
Bianca Zadrozny and Charles Elkan.
Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers.
In ICML, 2001.  Zadrozny & Elkan (2002) Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In KDD, 2002.
 Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.
Appendix
Appendix A Stochastic depth as approximate Bayesian inference
ResNet (He et al., 2016) adds skip connections to the network. If $\mathbf{H}^{l}$ denotes the output of the $l$-th layer and $f^{l}$ represents a typical convolutional transformation, the forward propagation is

$\mathbf{H}^{l} = \mathrm{ReLU}\big(f^{l}(\mathbf{H}^{l-1}) + \mathbf{H}^{l-1}\big),$   (13)

and $f^{l}$ is commonly defined by

$f^{l}(\mathbf{H}) = \mathbf{W}^{l}_{2} * \sigma\big(\mathbf{B}(\mathbf{W}^{l}_{1} * \mathbf{B}(\mathbf{H}))\big),$   (14)

where $\mathbf{W}^{l}_{1}$ and $\mathbf{W}^{l}_{2}$ are weight matrices, $*$ denotes convolution, and $\mathbf{B}$ and $\sigma$ indicate batch normalization and the ReLU function, respectively.

ResNet with stochastic depth (Huang et al., 2016) randomly drops a subset of residual blocks and bypasses them with shortcut connections. Let $b^{l}$ denote a Bernoulli random variable indicating whether the $l$-th residual block is active or not. The forward propagation is extended from Equation 13 to

$\mathbf{H}^{l} = \mathrm{ReLU}\big(b^{l} f^{l}(\mathbf{H}^{l-1}) + \mathbf{H}^{l-1}\big).$   (15)
Now we can transform the stochasticity from the layers to the parameter space as follows:

$\mathbf{H}^{l} = \mathrm{ReLU}\big(b^{l} f^{l}(\mathbf{H}^{l-1}) + \mathbf{H}^{l-1}\big)$   (16)
$= \mathrm{ReLU}\big(b^{l}\big(\mathbf{W}^{l}_{2} * \sigma(\mathbf{B}(\mathbf{W}^{l}_{1} * \mathbf{B}(\mathbf{H}^{l-1})))\big) + \mathbf{H}^{l-1}\big)$   (17)
$= \mathrm{ReLU}\big((b^{l}\mathbf{W}^{l}_{2}) * \sigma(\mathbf{B}(\mathbf{W}^{l}_{1} * \mathbf{B}(\mathbf{H}^{l-1}))) + \mathbf{H}^{l-1}\big)$   (18)
$= \mathrm{ReLU}\big((b^{l}\mathbf{W}^{l}_{2}) * \sigma(\mathbf{B}((b^{l}\mathbf{W}^{l}_{1}) * \mathbf{B}(\mathbf{H}^{l-1}))) + \mathbf{H}^{l-1}\big)$   (19)
$= \mathrm{ReLU}\big(\hat{\mathbf{W}}^{l}_{2} * \sigma(\mathbf{B}(\hat{\mathbf{W}}^{l}_{1} * \mathbf{B}(\mathbf{H}^{l-1}))) + \mathbf{H}^{l-1}\big)$   (20)
$= \mathrm{ReLU}\big(\hat{f}^{l}(\mathbf{H}^{l-1}) + \mathbf{H}^{l-1}\big),$   (21)

where $\hat{\mathbf{W}}^{l}_{1} = b^{l}\mathbf{W}^{l}_{1}$ and $\hat{\mathbf{W}}^{l}_{2} = b^{l}\mathbf{W}^{l}_{2}$. Equation 18 holds because scalar multiplication commutes with convolution, and Equation 19 holds since $b^{l}$ is a Bernoulli random variable ($b^{l} = (b^{l})^{2}$): when $b^{l} = 0$, the residual branch is already zeroed through $\hat{\mathbf{W}}^{l}_{2}$, so masking $\mathbf{W}^{l}_{1}$ as well changes nothing. All stochastic parameters in this block drop at once or not.
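A scalar Python sketch of the stochastic-depth forward pass in Equation 15; the residual transformations here are arbitrary toy functions standing in for the convolutional block of Equation 14:

```python
import random

def residual_block(h, f, b):
    """One stochastic-depth residual update (Eq. 15), scalar version: the
    transformation f is gated by the Bernoulli variable b, while the identity
    path always survives; max(., 0) plays the role of ReLU."""
    return max(b * f(h) + h, 0.0)

def stochastic_depth_forward(x, blocks, survival_prob, rng):
    """Forward pass through a stack of residual blocks, each kept
    independently with probability `survival_prob`."""
    h = x
    for f in blocks:
        b = 1.0 if rng.random() < survival_prob else 0.0
        h = residual_block(h, f, b)
    return h

rng = random.Random(0)
blocks = [lambda h: 0.5 * h, lambda h: -0.2 * h]
y = stochastic_depth_forward(1.0, blocks, survival_prob=0.5, rng=rng)
```

When a block is dropped (`b = 0`) the input passes through unchanged, which is exactly the weight-masking view of Equations 16-21: gating the block is the same as multiplying all of its weights by the Bernoulli variable.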
Appendix B Approximation of KLDivergence
Let $p(\mathbf{x}) = \mathcal{N}(\mathbf{0}, \mathbf{I}_K)$ and $q(\mathbf{x}) = \sum_{i} \pi_i\, \mathcal{N}(\boldsymbol{\mu}_i, \sigma^2 \mathbf{I}_K)$ with a probability vector $\boldsymbol{\pi}$, where $\pi_i \geq 0$ and $\sum_i \pi_i = 1$. In our work, $\boldsymbol{\mu}_i$ denotes the deterministic model parameter and $\sigma$ is small. The KL divergence between $q$ and $p$ is

$D_{\mathrm{KL}}\big(q(\mathbf{x})\,\Vert\,p(\mathbf{x})\big) = \int q(\mathbf{x}) \log \frac{q(\mathbf{x})}{p(\mathbf{x})}\, d\mathbf{x}$   (22)
$= -\mathbb{H}(q) - \int q(\mathbf{x}) \log p(\mathbf{x})\, d\mathbf{x}.$   (23)

We can reparameterize the first entropy term with $\mathbf{x} = \boldsymbol{\mu}_i + \sigma \boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_K)$:

$\mathbb{H}(q) = -\sum_i \pi_i \int \mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}_i, \sigma^2 \mathbf{I}_K)\, \log q(\mathbf{x})\, d\mathbf{x}$   (24)
$= -\sum_i \pi_i\, \mathbb{E}_{\boldsymbol{\epsilon}}\big[\log q(\boldsymbol{\mu}_i + \sigma\boldsymbol{\epsilon})\big].$   (25)

Using $q(\boldsymbol{\mu}_i + \sigma\boldsymbol{\epsilon}) \approx \pi_i\, \mathcal{N}(\boldsymbol{\mu}_i + \sigma\boldsymbol{\epsilon};\, \boldsymbol{\mu}_i, \sigma^2 \mathbf{I}_K)$ for small enough $\sigma$, i.e., assuming the mixture components barely overlap,

$\mathbb{H}(q) \approx -\sum_i \pi_i\, \mathbb{E}_{\boldsymbol{\epsilon}}\big[\log \big(\pi_i\, \mathcal{N}(\boldsymbol{\mu}_i + \sigma\boldsymbol{\epsilon};\, \boldsymbol{\mu}_i, \sigma^2 \mathbf{I}_K)\big)\big]$   (26)
$= -\sum_i \pi_i \Big( \log \pi_i - \tfrac{K}{2}\log(2\pi\sigma^2) - \tfrac{1}{2}\,\mathbb{E}_{\boldsymbol{\epsilon}}\big[\boldsymbol{\epsilon}^{\top}\boldsymbol{\epsilon}\big] \Big)$   (27)
$= \sum_i \pi_i \Big( \tfrac{K}{2}\log(2\pi\sigma^2) + \tfrac{K}{2} - \log \pi_i \Big).$   (28)

For the second term of Equation 23,

$\int q(\mathbf{x}) \log p(\mathbf{x})\, d\mathbf{x} = \sum_i \pi_i \int \mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}_i, \sigma^2 \mathbf{I}_K)\, \log \mathcal{N}(\mathbf{x};\, \mathbf{0}, \mathbf{I}_K)\, d\mathbf{x}$   (29)
$= -\frac{1}{2} \sum_i \pi_i \big( \boldsymbol{\mu}_i^{\top}\boldsymbol{\mu}_i + K\sigma^2 + K \log 2\pi \big).$   (30)

Then we can approximate

$D_{\mathrm{KL}}\big(q(\mathbf{x})\,\Vert\,p(\mathbf{x})\big) \approx \sum_i \pi_i \Big( \frac{\boldsymbol{\mu}_i^{\top}\boldsymbol{\mu}_i}{2} + \log \pi_i \Big) + \mathrm{const}.$   (31)
For a more general proof, see Gal & Ghahramani (2016).
Appendix C Full Experimental Results
We present the full results of Table 1, including the scores of the individual CI models with different β's. Tables 3 and 4 show the results on Tiny ImageNet and CIFAR-100, respectively.











Table 3: Full results on Tiny ImageNet. The five CI rows for each model correspond to the five different β's.

Model | Method | Acc (%) | ECE | MCE | NLL | Brier
ResNet-18 | Baseline | 46.38 | 0.029 | 0.086 | 2.227 | 0.674
ResNet-18 | CI[] | 46.48 | 0.022 | 0.073 | 2.216 | 0.672
ResNet-18 | CI[] | 47.20 | 0.022 | 0.060 | 2.198 | 0.666
ResNet-18 | CI[] | 47.03 | 0.021 | 0.157 | 2.193 | 0.667
ResNet-18 | CI[] | 47.58 | 0.055 | 0.111 | 2.212 | 0.666
ResNet-18 | CI[] | 47.92 | 0.241 | 0.380 | 2.664 | 0.742
ResNet-18 | VWCI | 48.57 | 0.026 | 0.054 | 2.129 | 0.651
ResNet-34 | Baseline | 50.82 | 0.067 | 0.147 | 2.050 | 0.628
ResNet-34 | CI[] | 48.89 | 0.132 | 0.241 | 2.257 | 0.668
ResNet-34 | CI[] | 50.17 | 0.127 | 0.227 | 2.225 | 0.653
ResNet-34 | CI[] | 49.16 | 0.119 | 0.219 | 2.223 | 0.663
ResNet-34 | CI[] | 51.45 | 0.035 | 0.171 | 2.030 | 0.620
ResNet-34 | CI[] | 50.77 | 0.255 | 0.426 | 2.614 | 0.722
ResNet-34 | VWCI | 52.80 | 0.027 | 0.076 | 1.949 | 0.605
VGG-16 | Baseline | 46.58 | 0.346 | 0.595 | 4.220 | 0.844
VGG-16 | CI[] | 47.26 | 0.325 | 0.533 | 3.878 | 0.830
VGG-16 | CI[] | 47.39 | 0.296 | 0.536 | 3.542 | 0.795
VGG-16 | CI[] | 47.11 | 0.259 | 0.461 | 3.046 | 0.763
VGG-16 | CI[] | 46.94 | 0.122 | 0.327 | 2.812 | 0.701
VGG-16 | CI[] | 45.40 | 0.130 | 0.320 | 2.843 | 0.717
VGG-16 | VWCI | 48.03 | 0.053 | 0.142 | 2.373 | 0.659
WideResNet-16-8 | Baseline | 55.92 | 0.132 | 0.237 | 1.974 | 0.593
WideResNet-16-8 | CI[] | 55.29 | 0.126 | 0.208 | 1.987 | 0.598
WideResNet-16-8 | CI[] | 55.53 | 0.120 | 0.237 | 1.949 | 0.592
WideResNet-16-8 | CI[] | 56.12 | 0.116 | 0.238 | 1.949 | 0.590
WideResNet-16-8 | CI[] | 56.38 | 0.050 | 0.456 | 1.851 | 0.572
WideResNet-16-8 | CI[] | 55.66 | 0.161 | 0.301 | 2.163 | 0.619
WideResNet-16-8 | VWCI | 56.66 | 0.046 | 0.136 | 1.866 | 0.569
DenseNet-40-12 | Baseline | 42.50 | 0.020 | 0.154 | 2.423 | 0.716
DenseNet-40-12 | CI[] | 41.20 | 0.030 | 0.156 | 2.489 | 0.726
DenseNet-40-12 | CI[] | 41.21 | 0.036 | 0.122 | 2.514 | 0.735
DenseNet-40-12 | CI[] | 40.61 | 0.025 | 0.097 | 2.550 | 0.739
DenseNet-40-12 | CI[] | 40.67 | 0.037 | 0.094 | 2.501 | 0.732
DenseNet-40-12 | CI[] | 37.23 | 0.169 | 0.291 | 2.975 | 0.810
DenseNet-40-12 | VWCI | 43.25 | 0.025 | 0.089 | 2.410 | 0.712











Table 4: Full results on CIFAR-100. The five CI rows for each model correspond to the five different β's.

Model | Method | Acc (%) | ECE | MCE | NLL | Brier
ResNet-18 | Baseline | 75.61 | 0.097 | 0.233 | 1.024 | 0.359
ResNet-18 | CI[] | 75.03 | 0.104 | 0.901 | 1.055 | 0.369
ResNet-18 | CI[] | 75.51 | 0.087 | 0.219 | 0.986 | 0.357
ResNet-18 | CI[] | 74.95 | 0.069 | 0.183 | 0.998 | 0.358
ResNet-18 | CI[] | 75.94 | 0.065 | 0.961 | 1.018 | 0.349
ResNet-18 | CI[] | 75.61 | 0.340 | 0.449 | 1.492 | 0.475
ResNet-18 | VWCI | 76.09 | 0.045 | 0.128 | 0.976 | 0.342
ResNet-34 | Baseline | 77.19 | 0.109 | 0.304 | 1.020 | 0.345
ResNet-34 | CI[] | 77.38 | 0.105 | 0.259 | 1.000 | 0.341
ResNet-34 | CI[] | 76.98 | 0.101 | 0.261 | 0.999 | 0.344
ResNet-34 | CI[] | 77.23 | 0.074 | 0.206 | 0.921 | 0.331
ResNet-34 | CI[] | 77.66 | 0.029 | 0.087 | 0.953 | 0.321
ResNet-34 | CI[] | 78.54 | 0.362 | 0.442 | 1.448 | 0.461
ResNet-34 | VWCI | 78.64 | 0.034 | 0.089 | 0.908 | 0.310
VGG-16 | Baseline | 73.78 | 0.187 | 0.486 | 1.667 | 0.437
VGG-16 | CI[] | 73.19 | 0.189 | 0.860 | 1.679 | 0.446
VGG-16 | CI[] | 73.70 | 0.183 | 0.437 | 1.585 | 0.434
VGG-16 | CI[] | 73.78 | 0.163 | 0.425 | 1.375 | 0.420
VGG-16 | CI[] | 73.68 | 0.083 | 0.285 | 1.289 | 0.396
VGG-16 | CI[] | 73.62 | 0.291 | 0.399 | 1.676 | 0.487
VGG-16 | VWCI | 73.87 | 0.098 | 0.309 | 1.277 | 0.391
WideResNet-16-8 | Baseline | 77.52 | 0.103 | 0.278 | 0.984 | 0.336
WideResNet-16-8 | CI[] | 77.04 | 0.109 | 0.280 | 1.011 | 0.345
WideResNet-16-8 | CI[] | 77.46 | 0.104 | 0.272 | 0.974 | 0.339
WideResNet-16-8 | CI[] | 77.53 | 0.074 | 0.211 | 0.931 | 0.327
WideResNet-16-8 | CI[] | 77.23 | 0.085 | 0.239 | 1.015 | 0.336
WideResNet-16-8 | CI[] | 77.48 | 0.295 | 0.485 | 1.378 | 0.434
WideResNet-16-8 | VWCI | 77.74 | 0.038 | 0.101 | 0.891 | 0.314
DenseNet-40-12 | Baseline | 65.91 | 0.074 | 0.134 | 1.238 | 0.463
DenseNet-40-12 | CI[] | 66.20 | 0.064 | 0.141 | 1.236 | 0.463
DenseNet-40-12 | CI[] | 63.61 | 0.086 | 0.177 | 1.360 | 0.496
DenseNet-40-12 | CI[] | 65.13 | 0.052 | 0.127 | 1.249 | 0.471
DenseNet-40-12 | CI[] | 65.86 | 0.019 | 0.053 | 1.206 | 0.456
DenseNet-40-12 | CI[] | 62.82 | 0.127 | 0.193 | 1.510 | 0.523
DenseNet-40-12 | VWCI | 67.45 | 0.026 | 0.094 | 1.161 | 0.439