1 Introduction
For safety-critical vision tasks such as autonomous driving or computer-aided diagnosis, it is essential to consider the prediction uncertainty of deep learning models. Bayesian neural networks and recent advances in their approximation provide the mathematical tools for quantifying this uncertainty
Bishop2006 ; Kingma2013 . One practical approximation is variational inference with Monte Carlo (MC) dropout Gal2016 . It is applied to obtain epistemic uncertainty, which is caused by uncertainty in the model weights due to training with data sets of limited size Bishop2006 ; Kendall2017 . However, it tends to be miscalibrated, i.e. the uncertainty does not correspond well to the model error Guo2017 ; Gal2017 .

We first take a step back and review the problem of the frequentist approach to prediction uncertainty: the weights of a deep model are obtained by maximum likelihood estimation Bishop2006 , and the normalized output likelihood for an unseen test input does not consider uncertainty in the weights Kendall2017 . The likelihood is generally unjustifiably high Guo2017 and can be misinterpreted as high prediction confidence. This miscalibration can also be observed for the model uncertainty provided by MC dropout variational inference. However, calibrated uncertainty is essential, as miscalibration can lead to decisions with fatal consequences in the aforementioned task domains. In the last decades, several non-parametric and parametric calibration approaches, such as isotonic regression Zadrozny2002 or Platt scaling Platt1999 , have been presented. Recently, temperature scaling (TS) has been demonstrated to yield well-calibrated model likelihood in non-Bayesian deep neural networks Guo2017 .

Our work extends temperature scaling to variational Bayesian inference with dropout to obtain well-calibrated model uncertainty. The main contributions of this paper are 1. a definition of perfect calibration of uncertainty and a definition of the expected uncertainty calibration error, 2. the derivation of temperature scaling for dropout variational inference, and 3. experimental results for different network architectures on CIFAR-10/100 Krizhevsky2009 that demonstrate the improvement in calibration by the proposed method. By using temperature scaling together with Bayesian inference, we expect better calibrated uncertainty. To the best of our knowledge, temperature scaling has not yet been used to calibrate model uncertainty in variational Bayesian inference. Our code is available at: github.com/mlaves/bayesiantemperaturescaling.
2 Methods
The presented approach to obtaining well-calibrated uncertainty is applied to a general multi-class classification task. Let the input $\mathbf{x}$ be a random variable with corresponding label $y \in \{1, \ldots, C\}$. Let $\mathbf{z} = f^{\mathbf{W}}(\mathbf{x})$ be the output (logits) of a neural network with weight matrices $\mathbf{W}$, and with model likelihood $p_c$ for class $c$, which is sampled from a probability vector $\mathbf{p} = \sigma(\mathbf{z})$, obtained by passing the model output through the softmax function $\sigma$. From a frequentist perspective, the softmax likelihood is often interpreted as the confidence of the prediction. Throughout this paper, we follow this definition. However, because the weights are optimized by minimizing the negative log-likelihood, modern deep models are prone to overly confident predictions and are therefore miscalibrated Guo2017 ; Gal2017 .

Let $\hat{y}$ be the most likely class prediction of input $\mathbf{x}$, with likelihood $\hat{p}$ and true label $y$. Then, following Guo et al. Guo2017 , perfect calibration is defined as
\[ P\left( \hat{y} = y \mid \hat{p} = p \right) = p , \quad \forall\, p \in [0, 1] \tag{1} \]
To determine model uncertainty, dropout variational inference is performed by training the model with dropout Srivastava2014 and using dropout at test time to sample from the approximate posterior by performing $N$ stochastic forward passes Gal2016 ; Kendall2017 . This is also referred to as MC dropout. In MC dropout, the final probability vector $\mathbf{p}$ is obtained by MC integration:
\[ \mathbf{p} = \frac{1}{N} \sum_{n=1}^{N} \sigma\!\left( f^{\hat{\mathbf{W}}_n}(\mathbf{x}) \right), \quad \hat{\mathbf{W}}_n \sim q(\mathbf{W}) \tag{2} \]
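The MC integration above can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' implementation; the function name `mc_dropout_predict` and the sample count are our own choices:

```python
import torch

def mc_dropout_predict(model, x, num_samples=25):
    """Approximate the predictive distribution by MC integration over
    stochastic forward passes with dropout kept active (Eq. 2)."""
    model.train()  # keeps dropout stochastic at test time; in practice,
                   # only the dropout layers should be set to train mode
    with torch.no_grad():
        # average the softmax outputs of N stochastic forward passes
        probs = torch.stack([
            torch.softmax(model(x), dim=-1) for _ in range(num_samples)
        ])
    return probs.mean(dim=0)  # shape: (batch, num_classes)
```

Each forward pass samples one set of weights from the approximate posterior; averaging the resulting softmax vectors yields the probability vector of Eq. (2).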
The entropy of the softmax likelihood is used to describe the uncertainty of a prediction Kendall2017 . We introduce a normalization to scale the values to the range between $0$ and $1$:
\[ \tilde{H}(\mathbf{p}) = -\frac{1}{\log C} \sum_{c=1}^{C} p_c \log p_c \tag{3} \]
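The normalized entropy can be sketched as follows (a minimal illustration; the function name and the numerical stabilizer `eps` are our own choices):

```python
import torch

def normalized_entropy(probs, eps=1e-12):
    """Predictive entropy scaled by 1/log(C) so values lie in [0, 1] (Eq. 3)."""
    num_classes = probs.shape[-1]
    h = -(probs * (probs + eps).log()).sum(dim=-1)
    return h / torch.log(torch.tensor(float(num_classes)))
```

A uniform prediction yields an uncertainty of 1, a one-hot prediction an uncertainty of 0, regardless of the number of classes.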
From Eq. (1) and Eq. (3), we define perfect calibration of uncertainty as
\[ P\left( \hat{y} \neq y \mid \tilde{H}(\mathbf{p}) = h \right) = h , \quad \forall\, h \in [0, 1] \tag{4} \]
That is, in a batch of inputs that are all classified with uncertainty $h$, a top-1 error of $h$ is expected (e.g. an expected error rate of 20 % at $h = 0.2$).

2.1 Expected Uncertainty Calibration Error (UCE)
A popular way to quantify miscalibration of neural networks with a scalar value is the expectation of the difference between predicted softmax likelihood and accuracy
\[ \mathbb{E}_{\hat{p}} \left[ \, \left| P(\hat{y} = y \mid \hat{p} = p) - p \right| \, \right] \tag{5} \]
which can be approximated by the Expected Calibration Error (ECE) Naeini2015 ; Guo2017 . In practice, the output of a neural network is partitioned into $M$ bins of equal width, and a weighted average of the difference between accuracy and confidence (softmax likelihood) is taken:
\[ \mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right| \tag{6} \]
with $n$ the total number of inputs and $B_m$ the set of indices of inputs whose confidence falls into bin $m$ (see Guo2017 for more details). We propose the following slightly modified notion of Eq. (5) to quantify miscalibration of uncertainty:
\[ \mathbb{E}_{\tilde{H}} \left[ \, \left| P(\hat{y} \neq y \mid \tilde{H}(\mathbf{p}) = h) - h \right| \, \right] \tag{7} \]
We refer to this as the Expected Uncertainty Calibration Error (UCE) and analogously approximate it with
\[ \mathrm{UCE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{err}(B_m) - \mathrm{uncert}(B_m) \right| \tag{8} \]
See appendix A.1 for the definitions of $\mathrm{err}(B_m)$ and $\mathrm{uncert}(B_m)$.
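The binned UCE estimator of Eq. (8) can be sketched as follows. The function name and the default bin count are our own choices, not from the paper:

```python
import numpy as np

def expected_uncertainty_calibration_error(uncertainties, errors, num_bins=15):
    """UCE (Eq. 8): bin predictions by normalized uncertainty and take the
    weighted average of |err(B_m) - uncert(B_m)| per bin."""
    uncertainties = np.asarray(uncertainties)
    errors = np.asarray(errors, dtype=float)  # 1 if the top-1 prediction is wrong
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    uce = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (uncertainties > lo) & (uncertainties <= hi)
        if mask.any():
            err = errors[mask].mean()          # error rate in this bin
            unc = uncertainties[mask].mean()   # mean uncertainty in this bin
            uce += mask.mean() * abs(err - unc)
    return uce
```

Replacing the uncertainties by confidences and the errors by correctness indicators yields the analogous ECE estimator of Eq. (6).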
2.2 Temperature Scaling for Dropout Variational Inference
State-of-the-art deep neural networks are generally miscalibrated with regard to the softmax likelihood Guo2017 . However, model uncertainty obtained with dropout variational inference also tends not to be well-calibrated Gal2017 . Fig. 1 (top row) shows reliability diagrams Niculescu2005 for an uncalibrated ResNet-101 He2016 trained on CIFAR-100 Krizhevsky2009 . The divergence from the identity function reveals the miscalibration.
In this work, dropout is inserted before the last layer with a fixed dropout probability, as in Gal2016 . Temperature scaling with temperature $T > 0$ is inserted before the final softmax activation and before MC integration:
\[ \mathbf{p} = \frac{1}{N} \sum_{n=1}^{N} \sigma\!\left( \mathbf{z}_n / T \right), \quad \mathbf{z}_n = f^{\hat{\mathbf{W}}_n}(\mathbf{x}) \tag{9} \]
$T$ is optimized with respect to the negative log-likelihood while performing MC dropout on the validation set. This is equivalent to maximizing the entropy of $\mathbf{p}$ Guo2017 . See appendix A.2 for more details on $T$.
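Fitting the temperature can be sketched as below. This is our own illustrative sketch, not the authors' code: it assumes the validation-set logits of $N$ stochastic forward passes have already been collected, and the function name, the log-parametrization of $T$, and the use of L-BFGS are our own choices:

```python
import torch

def fit_temperature(mc_logits, labels, max_iter=50):
    """Optimize a single temperature T by minimizing the NLL of the
    MC-integrated, temperature-scaled softmax (Eq. 9) on held-out data.

    mc_logits: tensor of shape (N, batch, classes) from N stochastic
    forward passes with dropout enabled; labels: (batch,)."""
    log_t = torch.zeros(1, requires_grad=True)  # T = exp(log_t) > 0
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)

    def nll():
        optimizer.zero_grad()
        t = log_t.exp()
        # temperature-scale each sampled logit vector, then MC-integrate
        probs = torch.softmax(mc_logits / t, dim=-1).mean(dim=0)
        loss = torch.nn.functional.nll_loss(probs.clamp_min(1e-12).log(), labels)
        loss.backward()
        return loss

    optimizer.step(nll)
    return log_t.exp().item()
```

Since dividing the logits by $T$ does not change their argmax, the accuracy of the calibrated model is unchanged.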
3 Experiments & Results
Tab. 1: Test set calibration results (ECE and UCE) for frequentist (Freq.) and MC dropout inference, uncalibrated and after temperature scaling (TS).

                             Uncalibrated               TS Calibrated
                             Freq.   MC Dropout         Freq.   MC Dropout
Data Set    Model            ECE     ECE     UCE        ECE     ECE     UCE
CIFAR-10    ResNet-18        8.95    8.41    7.60       1.40    0.47    5.27
CIFAR-100   ResNet-101       29.63   24.62   30.33      3.50    1.92    2.41
CIFAR-100   DenseNet-169     30.62   23.98   29.62      6.10    2.89    2.69
Tab. 1 reports test set results for the different networks He2016 ; Huang2017 and data sets used to evaluate the performance of temperature scaling for dropout variational inference. The proposed UCE metric is used to quantify calibration of uncertainty. Fig. 1 shows reliability diagrams Niculescu2005 for different calibration scenarios of ResNet-101 He2016 on CIFAR-100. For MC dropout, $N$ stochastic forward passes are performed. The uncalibrated ECE shows that MC dropout alone already reduces miscalibration of the model likelihood by up to 6.6 percentage points. With TS calibration, MC dropout reduces ECE by 45–66 %, and UCE drops drastically (especially for the larger networks). This illustrates how much TS calibration benefits from Bayesian inference using MC dropout. Additional reliability diagrams showing similar results can be found in the appendix, as well as details on the training procedure.
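The reliability diagrams referenced above plot per-bin accuracy against per-bin confidence. A rough sketch of computing such a curve (function name and bin count are our own choices):

```python
import numpy as np

def reliability_curve(confidences, correct, num_bins=10):
    """Per-bin mean confidence and accuracy, as shown in a reliability
    diagram; a perfectly calibrated model lies on the identity line."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    mean_conf, mean_acc = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            mean_conf.append(confidences[mask].mean())
            mean_acc.append(correct[mask].mean())
    return np.array(mean_conf), np.array(mean_acc)
```

For a diagram over uncertainty rather than confidence, the same binning is applied to the normalized entropies and the top-1 error indicators.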
4 Conclusion
Temperature scaling calibrates uncertainty obtained by dropout variational inference with high effectiveness. The experimental results confirm the hypothesis that the presented approach yields better calibrated uncertainty. In addition, substantially better calibrated softmax probabilities were achieved. MC dropout TS is simple to implement, and temperature scaling does not change the maximum of a network's output, so model accuracy is not compromised. It is therefore an obvious choice in Bayesian deep learning with dropout variational inference, because well-calibrated uncertainties are of utmost importance for safety-critical decision-making. However, many factors (e.g. network architecture, weight decay, dropout probability) influence uncertainty in Bayesian deep learning; these have not been discussed in this paper and are open to future work.
Acknowledgments
This work has received funding from the European Union EFRE projects OPhonLas and ProMoPro.
References
 (1) Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
 (2) Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.
 (3) Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, pages 1050–1059, 2016.
 (4) Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In NeurIPS, pages 5574–5584, 2017.
 (5) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In ICML, pages 1321–1330, 2017.
 (6) Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In NeurIPS, pages 3581–3590, 2017.
 (7) Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In KDD, pages 694–699, 2002.
 (8) John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74, 1999.
 (9) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images, 2009.
 (10) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15:1929–1958, 2014.
 (11) Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In AAAI, pages 2901–2907, 2015.
 (12) Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In ICML, pages 625–632, 2005.
 (13) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
 (14) G. Huang, Z. Liu, L. v. d. Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, pages 2261–2269, 2017.
 (15) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NeurIPS Autodiff Workshop, 2017.
 (16) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
Appendix A Appendix
A.1 Expected Uncertainty Calibration Error
We restate the definition of the Expected Uncertainty Calibration Error (UCE) from Eq. (8):

\[ \mathrm{UCE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{err}(B_m) - \mathrm{uncert}(B_m) \right| \]

The error per bin is defined as
\[ \mathrm{err}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}\!\left( \hat{y}_i \neq y_i \right) \tag{10} \]
where $\hat{y}_i$ and $y_i$ denote the predicted and true label of input $i$, and $\mathbf{1}$ is the indicator function. The uncertainty per bin is defined as
\[ \mathrm{uncert}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \tilde{H}(\mathbf{p}_i) \tag{11} \]
A.2 Temperature Scaling with Monte Carlo Dropout
Temperature scaling with MC dropout variational inference is derived by closely following the derivation of frequentist temperature scaling in the appendix of [5]. Let $\{ \mathbf{z}_{i,n} \}$ be the set of logit vectors obtained by MC dropout with $N$ stochastic forward passes for each input $\mathbf{x}_i$, $i = 1, \ldots, M$, with true labels $y_i$. Temperature scaling is the solution to the entropy maximization
\[ \max_{q} \; - \sum_{i=1}^{M} \sum_{n=1}^{N} \sum_{c=1}^{C} q(\mathbf{z}_{i,n})_c \, \log q(\mathbf{z}_{i,n})_c \tag{12} \]
subject to
\[ q(\mathbf{z}_{i,n})_c \geq 0 \quad \forall\, i, n, c \tag{13} \]
\[ \sum_{c=1}^{C} q(\mathbf{z}_{i,n})_c = 1 \quad \forall\, i, n \tag{14} \]
\[ \sum_{i=1}^{M} \sum_{n=1}^{N} z_{i,n}^{(y_i)} = \sum_{i=1}^{M} \sum_{n=1}^{N} \sum_{c=1}^{C} z_{i,n}^{(c)} \, q(\mathbf{z}_{i,n})_c \tag{15} \]
Guo et al. solve this constrained optimization problem with the method of Lagrange multipliers. We skip reviewing their proof, as one can see that the solution $q$ in the case of MC dropout integration is given by
\[ q(\mathbf{z}_{i,n})_c = \frac{\exp\!\left( z_{i,n}^{(c)} / T \right)}{\sum_{j=1}^{C} \exp\!\left( z_{i,n}^{(j)} / T \right)} \tag{16} \]
\[ \phantom{q(\mathbf{z}_{i,n})_c} = \sigma\!\left( \mathbf{z}_{i,n} / T \right)_c \tag{17} \]
which recovers frequentist temperature scaling for $N = 1$ [5]. $T$ is optimized on the validation set using MC dropout.
A.3 Training Settings
The model implementations from PyTorch 1.2 [15] are used and trained with the following settings:

- mini-batch training with a fixed batch size
- AdamW optimizer [16] with weight decay and an initial learning rate reduced on plateau
- negative log-likelihood (cross-entropy) loss
- an additional validation set (5,000 samples) randomly extracted from the training set
- dropout before the last linear layer of all models during training
- multiple stochastic forward passes with dropout at test time (MC dropout)
Code is available at: github.com/mlaves/bayesiantemperaturescaling.
A.4 Additional Reliability Diagrams
In this section, reliability diagrams for the other data set/model combinations from Tab. 1 are visualized to provide additional insight into the calibration performance. The proposed method is able to calibrate all models with respect to both UCE and ECE across all bins.