For safety-critical vision tasks such as autonomous driving or computer-aided diagnosis, it is essential to consider the prediction uncertainty of deep models, which Bayesian methods make it possible to quantify Bishop2006 ; Kingma2013 . One practical approximation of Bayesian inference is variational inference with Monte Carlo (MC) dropout Gal2016 . It is applied to obtain epistemic uncertainty, which is caused by uncertainty in the model weights due to training with data sets of limited size Bishop2006 ; Kendall2017 . However, this uncertainty tends to be miscalibrated, i.e., it does not correspond well to the model error Guo2017 ; Gal2017 .
We first take a step back and review the problem of the frequentist approach to prediction uncertainty: the weights of a deep model are obtained by maximum likelihood estimation Bishop2006 , and the normalized output likelihood for an unseen test input does not consider uncertainty in the weights Kendall2017 . The likelihood is generally unjustifiably high Guo2017 and can be misinterpreted as high prediction confidence. This miscalibration can also be observed for model uncertainty provided by MC dropout variational inference. However, calibrated uncertainty is essential, as miscalibration can lead to decisions with fatal consequences in the aforementioned task domains. In the last decades, several non-parametric and parametric calibration approaches such as isotonic regression Zadrozny2002 or Platt scaling Platt1999 have been presented. Recently, temperature scaling (TS) has been demonstrated to lead to well-calibrated model likelihood in non-Bayesian deep neural networks Guo2017 .
Our work extends temperature scaling to variational Bayesian inference with dropout to obtain well-calibrated model uncertainty. The main contributions of this paper are (1) a definition of perfect calibration of uncertainty and of the expected uncertainty calibration error, (2) the derivation of temperature scaling for dropout variational inference, and (3) experimental results for different network architectures on CIFAR-10/100 Krizhevsky2009 that demonstrate the improvement of calibration by the proposed method. By using temperature scaling together with Bayesian inference, we expect better calibrated uncertainty. To the best of our knowledge, temperature scaling has not yet been used to calibrate model uncertainty in variational Bayesian inference. Our code is available at: github.com/mlaves/bayesian-temperature-scaling.
The presented approach for obtaining well-calibrated uncertainty is applied to a general multi-class classification task. Let the input be a random variable with a corresponding label. Let the logits be the output of a neural network with given weight matrices, and let the model likelihood for each class be obtained by passing the logits through the softmax function. From a frequentist perspective, the softmax likelihood is often interpreted as confidence of prediction. Throughout this paper, we follow this definition. However, because the weights are optimized by minimizing the negative log-likelihood, modern deep models are prone to overly confident predictions and are therefore miscalibrated Guo2017 ; Gal2017 .
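Consider the most likely class prediction for an input, the likelihood of that prediction, and the true label. Following Guo et al. Guo2017 , a model is perfectly calibrated if the likelihood of the predicted class equals the probability that the prediction is correct.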
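In symbols, the calibration condition of Guo et al. can be sketched as follows. The notation is an assumption made here for illustration: $\hat{y}$ denotes the predicted class, $\hat{p}$ its softmax likelihood, and $y$ the true label.

```latex
\begin{equation*}
  \mathbb{P}\left( \hat{y} = y \mid \hat{p} = p \right) = p,
  \qquad \forall\, p \in [0, 1]
\end{equation*}
```

For example, among all predictions made with a likelihood of 0.8, a perfectly calibrated model is correct in 80 % of the cases.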
To determine model uncertainty, dropout variational inference is performed by training the model with dropout Srivastava2014 and keeping dropout active at test time to sample from the approximate posterior with multiple stochastic forward passes Gal2016 ; Kendall2017 . This is also referred to as MC dropout. In MC dropout, the final probability vector is obtained by MC integration, i.e., by averaging the softmax outputs of the stochastic forward passes.
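MC integration can be illustrated with a minimal, dependency-free sketch. Everything below is hypothetical and chosen for illustration only: the 3-class linear layer `WEIGHTS`, the helper names, the dropout probability of 0.5, and the 25 forward passes. It is not the authors' implementation, which uses PyTorch models.

```python
import math
import random

# Hypothetical 3-class linear layer used only for illustration.
WEIGHTS = [[1.0, -0.5, 0.2], [0.3, 0.8, -1.0], [-0.7, 0.1, 0.9]]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_stochastic_model(drop_prob, seed):
    """Return a forward function that applies dropout to its input on every call."""
    rng = random.Random(seed)

    def forward(features):
        # Inverted dropout: zero a feature with probability drop_prob, rescale the rest.
        kept = [0.0 if rng.random() < drop_prob else f / (1.0 - drop_prob)
                for f in features]
        return [sum(w * k for w, k in zip(row, kept)) for row in WEIGHTS]

    return forward


def mc_dropout_predict(forward, features, num_passes):
    """MC integration: average the softmax outputs of several stochastic passes."""
    mean = [0.0] * len(WEIGHTS)
    for _ in range(num_passes):
        for c, p in enumerate(softmax(forward(features))):
            mean[c] += p / num_passes
    return mean


# Assumed values: dropout probability 0.5 and 25 passes.
model = make_stochastic_model(drop_prob=0.5, seed=0)
probs = mc_dropout_predict(model, [1.0, 2.0, 3.0], num_passes=25)
print(probs)
```

The averaged vector is again a probability vector, because each softmax output sums to one and the average preserves that property.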
The entropy of the softmax likelihood is used to describe the uncertainty of a prediction Kendall2017 . We normalize the entropy by its maximum value, the logarithm of the number of classes, which scales it to a range between 0 and 1.
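A short sketch of such a normalized entropy follows. It assumes that the normalization divides by the logarithm of the number of classes, which is the maximum entropy of a categorical distribution; the function name is hypothetical.

```python
import math


def normalized_entropy(probs):
    """Entropy of a probability vector divided by log(C), so the result lies in [0, 1]."""
    num_classes = len(probs)
    # Terms with p == 0 contribute nothing to the entropy and are skipped.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(num_classes)


uniform = normalized_entropy([0.25, 0.25, 0.25, 0.25])  # maximally uncertain
one_hot = normalized_entropy([1.0, 0.0, 0.0, 0.0])      # fully certain
print(uniform, one_hot)
```

A uniform prediction yields the maximum uncertainty and a one-hot prediction the minimum, independent of the number of classes, which makes uncertainties comparable between CIFAR-10 and CIFAR-100 models.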
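We define uncertainty to be perfectly calibrated if it equals the probability of a wrong prediction. That is, in a batch of inputs that are all classified with the same uncertainty, the top-1 error is expected to equal that uncertainty.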
2.1 Expected Uncertainty Calibration Error (UCE)
A popular way to quantify miscalibration of neural networks with a scalar value is the expectation of the difference between predicted softmax likelihood and accuracy, which can be approximated by the Expected Calibration Error (ECE) Naeini2015 ; Guo2017 . In practice, the output of a neural network is partitioned into bins of equal width, and a weighted average of the difference between accuracy and confidence (softmax likelihood) is taken over the bins.
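Each bin is weighted by its share of the total number of inputs and contains the indices of the inputs whose confidence falls into that bin (see Guo2017 for more details). We propose a slightly modified notion of Eq. (5) to quantify miscalibration of uncertainty: the expectation of the difference between predicted uncertainty and error.
We refer to this as Expected Uncertainty Calibration Error (UCE) and analogously approximate it by a weighted average of the difference between error and uncertainty per bin.
See Appendix A.1 for the definitions of the error and the uncertainty per bin.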
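Both binned estimators can be sketched with one shared helper. The function names, the default of 10 bins, and the left-closed bin edges are assumptions made here for illustration (Guo et al. use right-closed bins); the inputs are plain Python lists.

```python
def binned_gap(scores, outcomes, num_bins=10):
    """Weighted average over bins of |mean(outcomes) - mean(scores)|.

    scores   -- values in [0, 1] used for binning (confidence or uncertainty)
    outcomes -- 0/1 indicators (correct for ECE, wrong for UCE)
    """
    n = len(scores)
    bins = [[] for _ in range(num_bins)]
    for i, s in enumerate(scores):
        # Equal-width bins; a score of exactly 1.0 goes into the last bin.
        bins[min(int(s * num_bins), num_bins - 1)].append(i)
    gap = 0.0
    for indices in bins:
        if not indices:
            continue  # empty bins carry zero weight
        mean_outcome = sum(outcomes[i] for i in indices) / len(indices)
        mean_score = sum(scores[i] for i in indices) / len(indices)
        gap += len(indices) / n * abs(mean_outcome - mean_score)
    return gap


def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE: compares accuracy with confidence per bin."""
    return binned_gap(confidences, correct, num_bins)


def expected_uncertainty_calibration_error(uncertainties, wrong, num_bins=10):
    """UCE: compares error with uncertainty per bin."""
    return binned_gap(uncertainties, wrong, num_bins)


# Ten inputs with uncertainty 0.2 each; half of them are misclassified.
uce = expected_uncertainty_calibration_error([0.2] * 10, [1] * 5 + [0] * 5)
print(uce)
```

In this toy batch the error rate is 0.5 while the stated uncertainty is 0.2, so the UCE is the gap between the two; it would be zero if only two of the ten inputs were misclassified.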
2.2 Temperature Scaling for Dropout Variational Inference
State-of-the-art deep neural networks are generally miscalibrated with regard to softmax likelihood Guo2017 . Model uncertainty obtained with dropout variational inference also tends to be poorly calibrated Gal2017 . Fig. 1 (top row) shows reliability diagrams Niculescu2005 for an uncalibrated ResNet-101 He2016 trained on CIFAR-100 Krizhevsky2009 . The divergence from the identity function reveals miscalibration.
In this work, dropout is inserted before the last layer with a fixed dropout probability, as in Gal2016 . Temperature scaling with a scalar temperature is inserted before the final softmax activation and before MC integration: the logits of each stochastic forward pass are divided by the temperature, passed through the softmax, and then averaged.
The temperature is optimized with respect to the negative log-likelihood on the validation set while performing MC dropout. This is equivalent to a constrained entropy maximization Guo2017 . See Appendix A.2 for more details.
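The procedure can be sketched as follows. This is an illustrative sketch only: the helper names, the toy validation logits, and the log-spaced grid search over the temperature are assumptions made here, and a gradient-based optimizer could be used instead.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mc_probs(passes, temperature):
    """Scale the logits of each pass, apply softmax, then average over the passes."""
    mean = [0.0] * len(passes[0])
    for logits in passes:
        for c, p in enumerate(softmax(logits, temperature)):
            mean[c] += p / len(passes)
    return mean


def nll(mc_logits, labels, temperature):
    """Mean negative log-likelihood of the true labels under the MC estimate."""
    total = 0.0
    for passes, label in zip(mc_logits, labels):
        total -= math.log(max(mc_probs(passes, temperature)[label], 1e-12))
    return total / len(labels)


def fit_temperature(mc_logits, labels, low=0.05, high=20.0, steps=200):
    """Pick the temperature with the lowest NLL on a log-spaced grid (assumed search)."""
    best_t, best_loss = low, float("inf")
    for k in range(steps + 1):
        t = low * (high / low) ** (k / steps)
        loss = nll(mc_logits, labels, t)
        if loss < best_loss:
            best_t, best_loss = t, loss
    return best_t


# Toy validation set: four inputs, two stochastic passes each, two classes.
# The model is very confident in class 0, but the last label is class 1.
val_logits = [[[8.0, 0.0], [10.0, 0.0]] for _ in range(4)]
val_labels = [0, 0, 0, 1]
t_opt = fit_temperature(val_logits, val_labels)
print(t_opt, nll(val_logits, val_labels, 1.0), nll(val_logits, val_labels, t_opt))
```

Because the toy model is overconfident, the fitted temperature is larger than one, which softens the averaged softmax output and lowers the validation NLL.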
3 Experiments & Results
Table 1 (column headers): Freq. | MC Dropout | Freq. | MC Dropout
Tab. 1 reports test set results for different networks He2016 ; Huang2017 and data sets used to evaluate the performance of temperature scaling for dropout variational inference. The proposed UCE metric is used to quantify calibration of uncertainty. Fig. 1 shows reliability diagrams Niculescu2005 for different calibration scenarios of ResNet-101 He2016 on CIFAR-100. For MC dropout, multiple stochastic forward passes are performed. The uncalibrated ECE shows that MC dropout already reduces miscalibration of the model likelihood. With TS calibration, MC dropout reduces ECE by 45–66 % and UCE drops drastically (especially for larger networks). This illustrates how much TS calibration benefits from Bayesian inference using MC dropout. Additional reliability diagrams showing similar results can be found in the appendix, as well as details on the training procedure.
Temperature scaling effectively calibrates uncertainty obtained by dropout variational inference. The experimental results confirm the hypothesis that the presented approach yields better calibrated uncertainty. In addition, substantially better calibrated softmax probability was achieved. MC dropout TS is simple to implement, and scaling does not change the maximum of the output of a network, so model accuracy is not compromised. Therefore, it is an obvious choice in Bayesian deep learning with dropout variational inference, because well-calibrated uncertainties are of utmost importance for safety-critical decision-making. However, there are many factors (e.g., network architecture, weight decay, dropout probability) influencing the uncertainty in Bayesian deep learning that have not been discussed in this paper and are open to future work.
This work has received funding from European Union EFRE projects OPhonLas and ProMoPro.
- (1) Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
- (2) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.
- (3) Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In ICML, pages 1050–1059, 2016.
- (4) Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In NeurIPS, pages 5574–5584, 2017.
- (5) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In ICML, pages 1321–1330, 2017.
- (6) Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In NeurIPS, pages 3581–3590, 2017.
- (7) Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In KDD, pages 694–699, 2002.
- (8) John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74, 1999.
- (9) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images, 2009.
- (10) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15:1929–1958, 2014.
- (11) Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. Obtaining Well Calibrated Probabilities Using Bayesian Binning. In AAAI, pages 2901–2907, 2015.
- (12) Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In ICML, pages 625–632, 2005.
- (13) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
- (14) G. Huang, Z. Liu, L. v. d. Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, pages 2261–2269, 2017.
- (15) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NeurIPS Autodiff Workshop, 2017.
- (16) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
Appendix A
A.1 Expected Uncertainty Calibration Error
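We restate the definition of Expected Uncertainty Calibration Error (UCE) from Eq. (8): a weighted average over all bins of the difference between error and uncertainty per bin.
The error per bin is defined as the fraction of misclassified inputs in that bin, where the predicted label of each input is compared with its true label. Uncertainty per bin is defined as the mean normalized entropy of the inputs in that bin.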
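Written out, these definitions take the following form. The symbols are assumptions made here for illustration: $B_m$ is the set of indices in bin $m$ of $M$ bins, $n$ the total number of inputs, $\hat{y}_i$ and $y_i$ the predicted and true label of input $i$, and $\tilde{H}(\hat{p}_i)$ the normalized entropy of its predicted probability vector.

```latex
\begin{align*}
  \mathrm{UCE} &= \sum_{m=1}^{M} \frac{|B_m|}{n}
      \left| \mathrm{err}(B_m) - \mathrm{uncert}(B_m) \right|, \\
  \mathrm{err}(B_m) &= \frac{1}{|B_m|} \sum_{i \in B_m}
      \mathbf{1}\left( \hat{y}_i \neq y_i \right), \\
  \mathrm{uncert}(B_m) &= \frac{1}{|B_m|} \sum_{i \in B_m}
      \tilde{H}\left( \hat{p}_i \right).
\end{align*}
```

This mirrors the ECE of Guo et al., with accuracy replaced by error and confidence replaced by uncertainty.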
A.2 Temperature Scaling with Monte Carlo Dropout
Temperature scaling with MC dropout variational inference is derived by closely following the derivation of frequentist temperature scaling in the appendix of Guo2017 . Consider a set of logit vectors obtained by MC dropout, with several stochastic forward passes for each input, together with the true labels. Temperature scaling is the solution to a constrained entropy maximization over the output distribution.
Guo et al. solve this constrained optimization problem with the method of Lagrange multipliers. We skip reviewing their proof, as one can see that the solution in the case of MC dropout integration is the average of the temperature-scaled softmax outputs over the stochastic forward passes, which recovers temperature scaling. The temperature is optimized on the validation set using MC dropout.
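The resulting predictive distribution can be sketched as below. The notation is assumed here for illustration: $\mathbf{z}_{j}$ is the logit vector of the $j$-th of $N$ stochastic forward passes for one input, $C$ the number of classes, and $T > 0$ the temperature.

```latex
\begin{equation*}
  \hat{p}^{(c)} = \frac{1}{N} \sum_{j=1}^{N}
    \frac{\exp\left( z_{j}^{(c)} / T \right)}
         {\sum_{l=1}^{C} \exp\left( z_{j}^{(l)} / T \right)}
\end{equation*}
```

Setting $T = 1$ gives back plain MC integration, and a single deterministic pass ($N = 1$) gives back frequentist temperature scaling.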
A.3 Training Settings
The model implementations from PyTorch 1.2 Paszke2017 are used and trained with the following settings:
- batch size of
- AdamW optimizer Loshchilov2019 with initial learning rate of and
- weight decay of
- negative log-likelihood (cross entropy) loss
- reduce-on-plateau learning rate scheduler with factor of
- additional validation set is randomly extracted from the training set (5000 samples)
- dropout with probability of before the last linear layer was used in all models during training
- in MC dropout, forward passes with dropout probability of were performed
Code is available at: github.com/mlaves/bayesian-temperature-scaling.
A.4 Additional Reliability Diagrams
In this section, reliability diagrams for the other data set/model combinations from Tab. 1 are visualized to provide additional insight into the calibration performance. The proposed method is able to calibrate all models with respect to both UCE and ECE across all bins.