Well-calibrated Model Uncertainty with Temperature Scaling for Dropout Variational Inference

09/30/2019 ∙ by Max-Heinrich Laves, et al. ∙ uni hannover

In this paper, well-calibrated model uncertainty is obtained by using temperature scaling together with Monte Carlo dropout as approximation to Bayesian inference. The proposed approach can easily be derived from frequentist temperature scaling and yields well-calibrated model uncertainty as well as softmax likelihood.




1 Introduction

For safety-critical vision tasks such as autonomous driving or computer-aided diagnosis, it is essential that the prediction uncertainty of deep learning models is considered. Bayesian neural networks and recent advances in their approximation provide the mathematical tools for quantifying uncertainty [1, 2]. One practical approximation is variational inference with Monte Carlo (MC) dropout [3]. It is applied to obtain epistemic uncertainty, which is caused by uncertainty in the model weights due to training with data sets of limited size [1, 4]. However, this uncertainty tends to be miscalibrated, i.e., it does not correspond well to the model error [5, 6].

We first take a step back and review the problem with the frequentist approach to prediction uncertainty: the weights of a deep model are obtained by maximum likelihood estimation [1], and the normalized output likelihood for an unseen test input does not consider uncertainty in the weights [4]. The likelihood is generally unjustifiably high [5] and can be misinterpreted as high prediction confidence. This miscalibration can also be observed for model uncertainty provided by MC dropout variational inference. However, calibrated uncertainty is essential, as miscalibration can lead to decisions with fatal consequences in the aforementioned task domains. Over the last decades, several non-parametric and parametric calibration approaches such as isotonic regression [7] or Platt scaling [8] have been presented. Recently, temperature scaling (TS) has been demonstrated to lead to well-calibrated model likelihood in non-Bayesian deep neural networks [5].

Our work extends temperature scaling to variational Bayesian inference with dropout to obtain well-calibrated model uncertainty. The main contributions of this paper are (1) a definition of perfect calibration of uncertainty and a definition of the expected uncertainty calibration error, (2) the derivation of temperature scaling for dropout variational inference, and (3) experimental results for different network architectures on CIFAR-10/100 [9] that demonstrate the improvement of calibration by the proposed method. By using temperature scaling together with Bayesian inference, we expect better calibrated uncertainty. To the best of our knowledge, temperature scaling has not yet been used to calibrate model uncertainty in variational Bayesian inference. Our code is available at: github.com/mlaves/bayesian-temperature-scaling.

2 Methods

The presented approach for obtaining well-calibrated uncertainty is applied to a general multi-class classification task. Let input $x \in \mathcal{X}$ be a random variable with corresponding label $y \in \mathcal{Y} = \{1, \ldots, C\}$. Let $f_{w}(x)$ be the output (logits) of a neural network with weight matrices $w$, and with model likelihood $p(y = c \mid f_{w}(x))$ for class $c$, which is sampled from a probability vector $\mathbf{p} = \sigma_{\mathrm{SM}}(f_{w}(x))$, obtained by passing the model output through the softmax function $\sigma_{\mathrm{SM}}(\cdot)$. From a frequentist perspective, the softmax likelihood is often interpreted as confidence of prediction. Throughout this paper, we follow this definition. However, because the weights are optimized by minimizing the negative log-likelihood of $p(y \mid f_{w}(x))$, modern deep models are prone to overly confident predictions and are therefore miscalibrated [5, 6].

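Let $\hat{y} = \arg\max_{c} p^{(c)}$ be the most likely class prediction of input $x$ with likelihood $\hat{p} = \max_{c} p^{(c)}$ and true label $y$. Then, following Guo et al. [5], perfect calibration is defined as

$$ \mathbb{P}\left( \hat{y} = y \mid \hat{p} = q \right) = q, \quad \forall q \in [0, 1]. \tag{1} $$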


To determine model uncertainty, dropout variational inference is done by training the model with dropout [10] and using dropout at test time to sample from the approximate posterior by performing $N$ stochastic forward passes [3, 4]. This is also referred to as MC dropout. In MC dropout, the final probability vector is obtained by MC integration over the sampled weights $w_i$:

$$ \mathbf{p} = \frac{1}{N} \sum_{i=1}^{N} \sigma_{\mathrm{SM}}\left( f_{w_i}(x) \right). \tag{2} $$
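As an illustration of the MC integration described above, the following minimal sketch averages softmax outputs over several stochastic forward passes. The function `toy_stochastic_forward`, its weights, the dropout probability of 0.5 and the choice of 25 passes are hypothetical placeholders for this example and are not taken from the paper.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax of a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mc_integrate(logit_samples):
    """Average the softmax vectors of all stochastic forward passes."""
    n_passes = len(logit_samples)
    n_classes = len(logit_samples[0])
    probs = [softmax(z) for z in logit_samples]
    return [sum(p[c] for p in probs) / n_passes for c in range(n_classes)]


def toy_stochastic_forward(x, rng, drop_prob=0.5):
    """Hypothetical stand-in for a network with dropout before the last layer."""
    weights = [[1.0, -0.5], [0.2, 0.8], [-0.7, 0.3]]  # 3 classes, 2 features
    # Inverted dropout: zero a feature with probability drop_prob, rescale the rest.
    keep = [0.0 if rng.random() < drop_prob else 1.0 / (1.0 - drop_prob) for _ in x]
    dropped = [xi * k for xi, k in zip(x, keep)]
    return [sum(w * d for w, d in zip(row, dropped)) for row in weights]


rng = random.Random(0)  # fixed seed so the example is reproducible
samples = [toy_stochastic_forward([2.0, 1.0], rng) for _ in range(25)]  # 25 is arbitrary
p_hat = mc_integrate(samples)
print(p_hat)
```

Averaging is done over probabilities rather than logits, which matches the order of operations stated in the text (softmax first, then MC integration).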


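Entropy of the softmax likelihood is used to describe uncertainty of prediction [4]. We introduce normalization by the logarithm of the number of classes $C$ to scale the values to a range between $0$ and $1$:

$$ \tilde{H}(\mathbf{p}) = -\frac{1}{\log C} \sum_{c=1}^{C} p^{(c)} \log p^{(c)}. \tag{3} $$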
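The normalization can be checked with a small sketch. It assumes that the entropy is divided by the logarithm of the number of classes, which is the standard way to map entropy to the unit interval; the function name is chosen for this example only.

```python
import math


def normalized_entropy(probs):
    """Entropy of a probability vector divided by log(C), so it lies in [0, 1]."""
    n_classes = len(probs)
    # Terms with p = 0 contribute nothing (limit of p * log p as p -> 0).
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(n_classes)


uniform = normalized_entropy([0.25, 0.25, 0.25, 0.25])  # maximally uncertain
one_hot = normalized_entropy([1.0, 0.0, 0.0, 0.0])      # fully certain
peaked = normalized_entropy([0.7, 0.1, 0.1, 0.1])       # somewhere in between
print(uniform, one_hot, peaked)
```

A uniform prediction yields the maximum value of 1 and a one-hot prediction yields 0, independent of the number of classes.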


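From Eq. (1) and Eq. (3), we define perfect calibration of uncertainty as

$$ \mathbb{P}\left( \hat{y} \neq y \mid \tilde{H}(\mathbf{p}) = q \right) = q, \quad \forall q \in [0, 1]. \tag{4} $$

That is, in a batch of inputs that are all classified with the same normalized uncertainty, the top-1 error rate is expected to equal that uncertainty value.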

2.1 Expected Uncertainty Calibration Error (UCE)

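A popular way to quantify miscalibration of neural networks with a scalar value is the expectation of the difference between predicted softmax likelihood and accuracy

$$ \mathbb{E}_{\hat{p}} \left[ \, \left| \mathbb{P}\left( \hat{y} = y \mid \hat{p} = q \right) - q \right| \, \right], \tag{5} $$

which can be approximated by the Expected Calibration Error (ECE) [11, 5]. Practically, the output of a neural network is partitioned into $M$ bins with equal width and a weighted average of the difference between accuracy and confidence (softmax likelihood) is taken:

$$ \mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|, \tag{6} $$

with total number of inputs $n$ and set of indices $B_m$ of inputs whose confidence falls into bin $m$ (see [5] for more details). We propose the following slightly modified notion of Eq. (5) to quantify miscalibration of uncertainty:

$$ \mathbb{E}_{\tilde{H}} \left[ \, \left| \mathbb{P}\left( \hat{y} \neq y \mid \tilde{H}(\mathbf{p}) = q \right) - q \right| \, \right]. \tag{7} $$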
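We refer to this as the Expected Uncertainty Calibration Error (UCE) and analogously approximate it with

$$ \mathrm{UCE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{err}(B_m) - \mathrm{uncert}(B_m) \right|. \tag{8} $$

See appendix A.1 for definitions of $\mathrm{err}(B_m)$ and $\mathrm{uncert}(B_m)$.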
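The two binned estimators can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: bins are half-open intervals of equal width with the last bin closed at 1, the default of 10 bins is arbitrary, and `correct` holds 1 for a correct top-1 prediction and 0 otherwise.

```python
def bin_index(value, n_bins):
    """Map a value in [0, 1] to one of n_bins equal-width bins."""
    return min(int(value * n_bins), n_bins - 1)


def calibration_errors(confidences, uncertainties, correct, n_bins=10):
    """Return (ECE, UCE) from per-input confidence, normalized entropy and correctness."""
    n = len(correct)
    ece = 0.0
    uce = 0.0
    for m in range(n_bins):
        conf_bin = [i for i in range(n) if bin_index(confidences[i], n_bins) == m]
        unc_bin = [i for i in range(n) if bin_index(uncertainties[i], n_bins) == m]
        if conf_bin:  # ECE term: |accuracy - mean confidence|, weighted by bin size
            acc = sum(correct[i] for i in conf_bin) / len(conf_bin)
            conf = sum(confidences[i] for i in conf_bin) / len(conf_bin)
            ece += len(conf_bin) / n * abs(acc - conf)
        if unc_bin:  # UCE term: |error rate - mean uncertainty|, weighted by bin size
            err = sum(1 - correct[i] for i in unc_bin) / len(unc_bin)
            unc = sum(uncertainties[i] for i in unc_bin) / len(unc_bin)
            uce += len(unc_bin) / n * abs(err - unc)
    return ece, uce


# Overconfident toy model: always fully confident, but only half the answers are right.
ece_bad, uce_bad = calibration_errors([1.0] * 4, [0.0] * 4, [1, 1, 0, 0])
# Calibrated toy model: 90 % confidence, 10 % uncertainty, 9 of 10 answers right.
ece_ok, uce_ok = calibration_errors([0.9] * 10, [0.1] * 10, [1] * 9 + [0])
print(ece_bad, uce_bad, ece_ok, uce_ok)
```

Empty bins are skipped, so they contribute nothing to either sum.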

2.2 Temperature Scaling for Dropout Variational Inference

State-of-the-art deep neural networks are generally miscalibrated with regard to softmax likelihood [5]. Model uncertainty obtained with dropout variational inference also tends to be poorly calibrated [6]. Fig. 1 (top row) shows reliability diagrams [12] for an uncalibrated ResNet-101 [13] trained on CIFAR-100 [9]. The divergence from the identity function reveals miscalibration.

In this work, dropout is inserted before the last layer with a fixed dropout probability, as in [3]. Temperature scaling with $T > 0$ is inserted before the final softmax activation and before MC integration:

$$ \hat{\mathbf{p}} = \frac{1}{N} \sum_{i=1}^{N} \sigma_{\mathrm{SM}}\left( T^{-1} f_{w_i}(x) \right). \tag{9} $$

$T$ is optimized with respect to the negative log-likelihood while performing MC dropout on the validation set. This is equivalent to maximizing the entropy of the predicted distribution $\hat{\mathbf{p}}$ [5]. See appendix A.2 for more details.
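A minimal sketch of this procedure is given below: every stochastic pass is divided by the temperature, passed through softmax and averaged, and the temperature is chosen to minimize the validation negative log-likelihood. The golden-section search is a stand-in chosen to keep the example dependency-free (a gradient-based optimizer would be the usual choice), the search bounds are arbitrary, and the search assumes the loss is unimodal in the temperature. The toy validation data are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mc_probs(logit_samples, temperature):
    """Scale each stochastic pass by T, apply softmax, then average over passes."""
    probs = [softmax(z, temperature) for z in logit_samples]
    return [sum(p[c] for p in probs) / len(probs) for c in range(len(probs[0]))]


def nll(val_logits, val_labels, temperature):
    """Mean negative log-likelihood of the MC-averaged prediction."""
    total = 0.0
    for samples, label in zip(val_logits, val_labels):
        total -= math.log(mc_probs(samples, temperature)[label] + 1e-12)
    return total / len(val_labels)


def fit_temperature(val_logits, val_labels, lo=0.05, hi=20.0, iters=60):
    """Golden-section search over log(T); assumes a single minimum in [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = math.log(lo), math.log(hi)  # log-space keeps T strictly positive
    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if nll(val_logits, val_labels, math.exp(c)) < nll(val_logits, val_labels, math.exp(d)):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return math.exp((a + b) / 2.0)


# Hypothetical validation set: 4 inputs, 2 passes each, confident logits, 1 of 4 wrong.
val_logits = [[[4.0, 0.0], [4.0, 0.0]] for _ in range(4)]
val_labels = [0, 0, 0, 1]
T = fit_temperature(val_logits, val_labels)
print(T)
```

For this toy set the optimum is the temperature at which the predicted confidence equals the observed accuracy of 75 %, so the fitted value is larger than 1 and softens the overconfident predictions.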

3 Experiments & Results

Figure 1: Reliability diagrams ( bins) for ResNet-101 on CIFAR-100. Top row: Uncalibrated frequentist confidence (left), and confidence and uncertainty obtained by dropout variational inference (right). Bottom row: Results from calibration with TS. Dashed lines denote perfect calibration.
                                   Uncalibrated                  TS Calibrated
                               Freq.     MC Dropout          Freq.     MC Dropout
Data Set     Model             ECE       ECE      UCE        ECE       ECE      UCE
CIFAR-10     ResNet-18          8.95      8.41     7.60      1.40      0.47     5.27
CIFAR-100    ResNet-101        29.63     24.62    30.33      3.50      1.92     2.41
CIFAR-100    DenseNet-169      30.62     23.98    29.62      6.10      2.89     2.69

Table 1: ECE and UCE test set results in % ( bins). 0 % means perfect calibration. In TS calibration with MC dropout, the same value of $T$ was used to report both ECE and UCE.

Tab. 1 reports test set results for different networks [13, 14] and data sets used to evaluate the performance of temperature scaling for dropout variational inference. The proposed UCE metric is used to quantify calibration of uncertainty. Fig. 1 shows reliability diagrams [12] for different calibration scenarios of ResNet-101 [13] on CIFAR-100. For MC dropout, multiple stochastic forward passes are performed. The uncalibrated ECE shows that MC dropout already reduces miscalibration of model likelihood by up to 6.6 percentage points (DenseNet-169 on CIFAR-100, from 30.62 % to 23.98 %). With TS calibration, MC dropout reduces ECE by 45–66 % relative to the frequentist model, and UCE drops drastically (especially for larger networks). This illustrates how much TS calibration benefits from Bayesian inference using MC dropout. Additional reliability diagrams showing similar results can be found in the appendix, as well as details on the training procedure.
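The quoted reductions can be recomputed from Tab. 1. The sketch below assumes that, within each group of three values per row, the first is the frequentist ECE and the second is the MC dropout ECE; the dictionary layout is chosen for this example only.

```python
# ECE values in % copied from Tab. 1 as (frequentist, MC dropout) pairs.
# Assumption: the first two numbers of each group are ECE, the third is UCE.
table_ece = {
    "CIFAR-10 / ResNet-18": {"uncal": (8.95, 8.41), "ts": (1.40, 0.47)},
    "CIFAR-100 / ResNet-101": {"uncal": (29.63, 24.62), "ts": (3.50, 1.92)},
    "CIFAR-100 / DenseNet-169": {"uncal": (30.62, 23.98), "ts": (6.10, 2.89)},
}

abs_drop = {}  # percentage points gained by MC dropout before calibration
rel_drop = {}  # relative ECE reduction in % by MC dropout after TS calibration
for name, row in table_ece.items():
    freq, mc = row["uncal"]
    abs_drop[name] = freq - mc
    freq_ts, mc_ts = row["ts"]
    rel_drop[name] = 100.0 * (freq_ts - mc_ts) / freq_ts
    print(f"{name}: -{abs_drop[name]:.2f} pp uncalibrated, -{rel_drop[name]:.1f} % with TS")
```

Under this reading of the table, the relative reductions after calibration are roughly 66 %, 45 % and 53 %, which is consistent with the 45–66 % range stated in the text.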

4 Conclusion

Temperature scaling effectively calibrates uncertainty obtained by dropout variational inference. The experimental results confirm the hypothesis that the presented approach yields better calibrated uncertainty. In addition, substantially better calibrated softmax probability was achieved. MC dropout TS is simple to implement, and scaling does not change the maximum of the output of a network, so model accuracy is not compromised. Therefore, it is an obvious choice in Bayesian deep learning with dropout variational inference, because well-calibrated uncertainties are of utmost importance for safety-critical decision-making. However, there are many factors (e.g., network architecture, weight decay, dropout probability) influencing the uncertainty in Bayesian deep learning that have not been discussed in this paper and are open to future work.


This work has received funding from European Union EFRE projects OPhonLas and ProMoPro.


  • [1] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
  • [2] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.
  • [3] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, pages 1050–1059, 2016.
  • [4] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In NeurIPS, pages 5574–5584, 2017.
  • [5] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In ICML, pages 1321–1330, 2017.
  • [6] Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In NeurIPS, pages 3581–3590, 2017.
  • [7] Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In KDD, pages 694–699, 2002.
  • [8] John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74, 1999.
  • [9] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images, 2009.
  • [10] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15:1929–1958, 2014.
  • [11] Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In AAAI, pages 2901–2907, 2015.
  • [12] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In ICML, pages 625–632, 2005.
  • [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [14] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, pages 2261–2269, 2017.
  • [15] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NeurIPS Autodiff Workshop, 2017.
  • [16] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.

Appendix A

A.1 Expected Uncertainty Calibration Error

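We restate the definition of Expected Uncertainty Calibration Error (UCE) from Eq. (8):

$$ \mathrm{UCE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{err}(B_m) - \mathrm{uncert}(B_m) \right|. $$

The error per bin is defined as

$$ \mathrm{err}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}\left( \hat{y}_i \neq y_i \right), $$

where $\hat{y}_i$ is the predicted class and $y_i$ is the true label of input $i$. Uncertainty per bin is defined as

$$ \mathrm{uncert}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \tilde{H}(\mathbf{p}_i). $$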


A.2 Temperature Scaling with Monte Carlo Dropout

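Temperature scaling with MC dropout variational inference is derived by closely following the derivation of frequentist temperature scaling in the appendix of [5]. Let $\{z_{i,1}, \ldots, z_{i,N}\}$ be a set of logit vectors obtained by MC dropout with $N$ stochastic forward passes for each input $x_i$, $i = 1, \ldots, n$, with true labels $y_i$. Temperature scaling is the solution to entropy maximization

$$ \max_{q} \; -\sum_{i=1}^{n} \sum_{j=1}^{N} \sum_{c=1}^{C} q(z_{i,j})^{(c)} \log q(z_{i,j})^{(c)} $$

subject to

$$ q(z_{i,j})^{(c)} \geq 0 \;\; \forall i, j, c, \qquad \sum_{c=1}^{C} q(z_{i,j})^{(c)} = 1 \;\; \forall i, j, \qquad \sum_{i=1}^{n} \sum_{j=1}^{N} z_{i,j}^{(y_i)} = \sum_{i=1}^{n} \sum_{j=1}^{N} \sum_{c=1}^{C} z_{i,j}^{(c)} \, q(z_{i,j})^{(c)}. $$

Guo et al. solve this constrained optimization problem with the method of Lagrange multipliers. We skip reviewing their proof, as one can see that the solution for $q$ in the case of MC dropout integration provides

$$ \hat{\mathbf{p}}(x_i) = \frac{1}{N} \sum_{j=1}^{N} q(z_{i,j}), \qquad q(z_{i,j})^{(c)} = \frac{\exp\left( \lambda z_{i,j}^{(c)} \right)}{\sum_{k=1}^{C} \exp\left( \lambda z_{i,j}^{(k)} \right)}, $$

which recovers temperature scaling for $\lambda = 1/T$ [5]. $T$ is optimized on the validation set using MC dropout.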
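For readers who want the skipped step, the following is a sketch of the Lagrange multiplier argument under assumed notation: $q_{i,j}^{(c)}$ denotes the probability assigned to class $c$ for the logit vector $z_{i,j}$ of input $i$ and stochastic pass $j$, $\lambda$ is the multiplier of the logit-matching constraint, and $\beta_{i,j}$ are the multipliers of the normalization constraints. It mirrors the frequentist argument in [5] and is not a reproduction of the authors' own derivation.

```latex
% Lagrangian of the entropy maximization with equality constraints
\mathcal{L} = -\sum_{i,j,c} q_{i,j}^{(c)} \log q_{i,j}^{(c)}
  + \lambda \Big( \sum_{i,j,c} z_{i,j}^{(c)} q_{i,j}^{(c)} - \sum_{i,j} z_{i,j}^{(y_i)} \Big)
  + \sum_{i,j} \beta_{i,j} \Big( \sum_{c} q_{i,j}^{(c)} - 1 \Big)

% Stationarity with respect to a single probability
\frac{\partial \mathcal{L}}{\partial q_{i,j}^{(c)}}
  = -\log q_{i,j}^{(c)} - 1 + \lambda z_{i,j}^{(c)} + \beta_{i,j} = 0
\quad \Longrightarrow \quad
q_{i,j}^{(c)} = \exp\big( \lambda z_{i,j}^{(c)} + \beta_{i,j} - 1 \big)

% Enforcing normalization eliminates \beta_{i,j} and gives a softmax with T = 1 / \lambda
q_{i,j}^{(c)} = \frac{\exp\big( \lambda z_{i,j}^{(c)} \big)}{\sum_{k} \exp\big( \lambda z_{i,j}^{(k)} \big)}
```

The exponential form is positive by construction, so the non-negativity constraint is satisfied automatically, and averaging $q_{i,j}$ over the passes $j$ gives the temperature-scaled MC dropout prediction.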

A.3 Training Settings

The model implementations from PyTorch 1.2 [15] are used and trained with the following settings:

  • batch size of

  • AdamW optimizer [16] with initial learn rate of and

  • weight decay of

  • negative log-likelihood (cross-entropy) loss

  • reduce-on-plateau learn rate scheduler with factor of

  • additional validation set is randomly extracted from the training set (5000 samples)

  • dropout with probability of before the last linear layer was used in all models during training

  • in MC dropout, forward passes with dropout probability of were performed

Code is available at: github.com/mlaves/bayesian-temperature-scaling.

A.4 Additional Reliability Diagrams

In this section, reliability diagrams for the other data set/model combinations from Tab. 1 are visualized to provide additional insight into the calibration performance. The proposed method is able to calibrate all models with respect to both UCE and ECE across all bins.

Figure 2: Reliability diagrams ( bins) for ResNet-18 on CIFAR-10.
Figure 3: Reliability diagrams ( bins) for DenseNet-169 on CIFAR-100.