Calibration of Model Uncertainty for Dropout Variational Inference

06/20/2020 ∙ by Max-Heinrich Laves, et al.

The model uncertainty obtained by variational Bayesian inference with Monte Carlo dropout is prone to miscalibration. In this paper, different logit scaling methods are extended to dropout variational inference to recalibrate model uncertainty. The expected uncertainty calibration error (UCE) is presented as a metric to measure miscalibration. The effectiveness of recalibration is evaluated on CIFAR-10/100 and SVHN for recent CNN architectures. Experimental results show that logit scaling considerably reduces miscalibration in terms of UCE. Well-calibrated uncertainty enables reliable rejection of uncertain predictions and robust detection of out-of-distribution data.


1 Introduction

Advances in deep learning have led to high-accuracy predictions for classification tasks, making deep-learning classifiers an attractive choice for safety-critical applications like autonomous driving (Chen et al., 2015) or computer-aided diagnosis (Esteva et al., 2017). However, the high accuracy of recent deep learning models is not sufficient for such applications. In cases where serious decisions are made upon a model's predictions, it is essential to also consider the uncertainty of these predictions. We need to know if the prediction of a model is likely to be incorrect or if invalid input data is presented to a deep model, e.g., data that is far away from the training domain or obtained from a defective sensor. The consequences of a false decision based on an uncertain prediction can be fatal.

Figure 1: Calibration of uncertainty: (left) reliability diagrams with uncertainty calibration error (UCE) and (right) detection of out-of-distribution (OoD) data. Uncalibrated uncertainty does not correspond well with the model error. Logit scaling is able to recalibrate deep Bayesian neural networks, which enables robust OoD detection. The dashed line denotes perfect calibration.

A natural expectation is that the certainty of a prediction should be directly correlated with the quality of the prediction. In other words, a prediction with high certainty is more likely to be accurate, whereas an uncertain prediction is likely to be incorrect. A common misconception is the assumption that the estimated class likelihood (of a softmax activation) can be used directly as a confidence measure for the predicted class. This assumption is dangerous in the context of critical decision-making. The estimated likelihood of a model trained by minimizing the negative log-likelihood (i.e., cross-entropy) is highly overconfident: the estimated likelihood is considerably higher than the observed frequency of accurate predictions with that likelihood (Guo et al., 2017).

Guo et al. proposed calibration of the likelihood estimate by scaling the logit output of a neural network to achieve a correspondence between the predicted likelihood and the expected likelihood. However, they follow a frequentist approach, which assumes a single best point estimate of the parameters (or weights) of a neural network. In frequentist inference, the weights of a deep model are obtained by maximum likelihood estimation (Bishop, 2006), and the normalized output likelihood for an unseen test input does not consider uncertainty in the weights (Kendall and Gal, 2017). Weight uncertainty (also referred to as model or epistemic uncertainty) is a considerable source of predictive uncertainty for models trained on data sets of limited size (Bishop, 2006; Kendall and Gal, 2017). Bayesian neural networks and recent advances in their approximation provide valuable mathematical tools for the quantification of model uncertainty (Gal and Ghahramani, 2016; Kingma and Welling, 2014). Instead of assuming the existence of a single best parameter set, we place distributions over the parameters and want to consider all possible parameter configurations, weighted by their posterior. More formally, given a training data set $\mathcal{D} = \{\boldsymbol{X}, \boldsymbol{Y}\}$ of labeled images and an unseen test image $\boldsymbol{x}^{\ast}$ with class label $y^{\ast}$, we are interested in evaluating the predictive distribution

$p(y^{\ast} = c \mid \boldsymbol{x}^{\ast}, \mathcal{D}) = \int p(y^{\ast} = c \mid \boldsymbol{x}^{\ast}, \boldsymbol{\omega})\, p(\boldsymbol{\omega} \mid \mathcal{D})\, \mathrm{d}\boldsymbol{\omega}$   (1)

This integral requires evaluating the posterior $p(\boldsymbol{\omega} \mid \mathcal{D})$, which involves the intractable marginal likelihood (Gal, 2016). One practical approximation of the posterior is variational inference with Monte Carlo (MC) dropout (Gal and Ghahramani, 2016). It is commonly used to obtain epistemic uncertainty, which is caused by uncertainty in the model weights. However, epistemic uncertainty from MC dropout still tends to be miscalibrated, i.e., the uncertainty does not correspond well with the model error (Gal et al., 2017a). The quality of the uncertainty highly depends on the approximate posterior (Louizos and Welling, 2017). In (Lakshminarayanan et al., 2017) it is stated that MC dropout uncertainty does not allow robust detection of out-of-distribution data. However, calibrated uncertainty is essential, as miscalibration can lead to decisions with catastrophic consequences in the aforementioned task domains.

We therefore propose a notion of perfect calibration of uncertainty and define the expected uncertainty calibration error (UCE), derived from the ECE. We then show how current confidence calibration techniques based on logit scaling can be extended to calibrate model uncertainty. We compare calibration results for temperature scaling, vector scaling, and auxiliary scaling (Guo et al., 2017; Kuleshov et al., 2018) using our UCE metric as well as the established ECE. We finally show how calibrated model uncertainty improves out-of-distribution (OoD) detection, as well as predictive accuracy by rejecting high-uncertainty predictions. To the best of our knowledge, logit scaling has not previously been used to calibrate model uncertainty in Bayesian inference for classification.

In summary, the main contributions of our work are:

  1. a new metric (UCE) to measure miscalibration of uncertainty,

  2. derivation of logit scaling for Gaussian Dropout,

  3. first to apply logit scaling calibration to a Bayesian classifier obtained from MC Dropout, and

  4. empirical evidence that logit scaling leads to well-calibrated model uncertainty, which allows robust OoD detection (in contrast to what is stated in (Lakshminarayanan et al., 2017)); shown for different network architectures on CIFAR-10/100 and SVHN.

Our code is available at: https://github.com/link-withheld.

2 Related Work

Overconfident predictions of neural networks have been addressed by entropy regularization techniques. Szegedy et al. presented label smoothing as a regularization of models during supervised training for classification (Szegedy et al., 2016). They state that a model trained with one-hot encoded labels is prone to becoming overconfident about its predictions, which causes overfitting and poor generalization. Pereyra et al. link label smoothing to confidence penalty and propose a simple way to prevent overconfident networks (Pereyra et al., 2017). Low-entropy output distributions are penalized by adding the negative entropy to the training objective. However, the referenced works do not apply entropy regularization to the calibration of confidence or uncertainty. In recent decades, several non-parametric and parametric calibration approaches such as isotonic regression (Zadrozny and Elkan, 2002) or Platt scaling (Platt, 1999) have been presented. Recently, temperature scaling has been demonstrated to lead to well-calibrated model likelihood in non-Bayesian deep neural networks (Guo et al., 2017). It uses a single scalar $T$ to scale the logits and smoothen ($T > 1$) or sharpen ($T < 1$) the softmax output and thus regularize the entropy. Logit scaling has also been introduced to approximate categorical distributions by the Gumbel-Softmax or Concrete distribution (Jang et al., 2016; Maddison et al., 2016). Recently, (Kull et al., 2019) stated that temperature scaling does not lead to classwise-calibrated models because the single parameter cannot calibrate each class individually. They proposed Dirichlet calibration to address this problem. To verify this statement, we will investigate classwise logit scaling in addition to temperature scaling. We will show later that temperature scaling for calibrating model uncertainty in Bayesian deep learning, which takes into account all classes, does not have this shortcoming. More complex methods, such as a neural network as an auxiliary recalibration model, have been used in calibrated regression (Kuleshov et al., 2018).

3 Methods

In this section, we discuss how model uncertainty is obtained by Monte Carlo Gaussian dropout and how it can be calibrated with logit scaling. We define the expected uncertainty calibration error as a new metric to quantify miscalibration and describe confidence penalty as an alternative to logit scaling.

3.1 Uncertainty Estimation

We assume a general multi-class classification task with $C$ classes. Let the input $\boldsymbol{x} \in \mathcal{X}$ be a random variable with corresponding label $y \in \mathcal{Y} = \{1, \ldots, C\}$. Let $f^{\boldsymbol{\omega}}(\boldsymbol{x})$ be the output (logits) of a neural network with weight matrices $\boldsymbol{\omega}$, and let $p(y = c \mid \boldsymbol{x}, \boldsymbol{\omega})$ denote the model likelihood for class $c$, taken from the probability vector $\boldsymbol{p}$ obtained by passing the model output through the softmax function $\sigma(\cdot)$. From a frequentist perspective, the softmax likelihood of the predicted class is often interpreted as the confidence of the prediction. Throughout this paper, we follow this definition.

To determine model uncertainty, dropout variational inference is performed by training the model with dropout (Srivastava et al., 2014) and using dropout at test time to sample from the approximate posterior distribution by performing $N$ stochastic forward passes (Gal and Ghahramani, 2016; Kendall and Gal, 2017). This is also referred to as MC dropout. In MC dropout, the final probability vector $\boldsymbol{p}$ is obtained by MC integration:

$\boldsymbol{p} = \frac{1}{N} \sum_{t=1}^{N} \sigma\!\left( f^{\hat{\boldsymbol{\omega}}_{t}}(\boldsymbol{x}) \right), \quad \hat{\boldsymbol{\omega}}_{t} \sim q(\boldsymbol{\omega})$   (2)

The entropy of the softmax likelihood is used to describe the uncertainty of a prediction (Kendall and Gal, 2017). In contrast to confidence as a quality measure of a prediction (see § 3.3), uncertainty takes into account the likelihoods of all classes. We propose to use the normalized entropy to scale the values to a range between 0 and 1:

$\tilde{H}(\boldsymbol{p}) = -\frac{1}{\log C} \sum_{c=1}^{C} p_c \log p_c, \quad \tilde{H} \in [0, 1]$   (3)
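As a minimal sketch (assuming a PyTorch classifier trained with dropout; model, x, and num_passes are hypothetical names), Eq. (2) and Eq. (3) could be computed as follows:

import math
import torch

def predictive_entropy(model, x, num_passes=25):
    # Keep dropout active at test time to sample from the approximate posterior.
    # (For models with batch norm, set only the dropout modules to train mode.)
    model.train()
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(num_passes)])
    p = probs.mean(dim=0)                                    # MC integration, Eq. (2)
    entropy = -(p * p.clamp_min(1e-12).log()).sum(dim=-1)    # entropy per input
    return p, entropy / math.log(p.size(-1))                 # normalized entropy, Eq. (3)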

Besides MC dropout there are other methods for estimating the model uncertainty, such as Bayes by Backprop (Blundell et al., 2015), which uses Monte Carlo gradient estimation to learn a distribution over the weights of a neural network, or SWAG (Maddox et al., 2019), which approximates the posterior distribution with a Gaussian using the trajectory of stochastic gradient descent. These methods are, however, not discussed in this paper.

3.2 Monte Carlo Gaussian Dropout

Figure 2: Implicit output distribution of MC dropout and the corresponding Gaussian dropout. Gaussian dropout replaces Bernoulli dropout and allows a learnable dropout rate $p$. The input and the weights of the convolutional layer are randomly initialized.

We will first review Gaussian dropout, which was proposed by (Wang and Manning, 2013), and subsequently use it to obtain model uncertainty with MC dropout. Dropout is a stochastic regularization technique, where entries of the input $\boldsymbol{x}$ to a weight layer $\boldsymbol{W}$ are randomly set to zero by elementwise multiplication with a (scaled) Bernoulli random variable

$\tilde{x}_i = x_i \cdot \epsilon_i / (1 - p)$   (4)

$\epsilon_i \sim \mathrm{Bernoulli}(1 - p)$   (5)

with dropout rate $p$. This introduces Bernoulli noise during optimization and reduces overfitting of the training data. The resulting output $\boldsymbol{b} = \boldsymbol{W}\tilde{\boldsymbol{x}}$ of a layer with dropout is a weighted sum of Bernoulli random variables. The central limit theorem then states that $\boldsymbol{b}$ is approximately normally distributed (see Fig. 2). Instead of sampling from the weights and computing the resulting output, we can directly sample from the implicit Gaussian distribution of dropout

$\boldsymbol{b} \sim \mathcal{N}\!\left(\boldsymbol{\mu}, \boldsymbol{\sigma}^{2}\right)$   (6)

with

$\boldsymbol{\mu} = \boldsymbol{W}\boldsymbol{x}$   (7)

$\boldsymbol{\sigma}^{2} = \frac{p}{1 - p}\, \boldsymbol{W}^{2}\boldsymbol{x}^{2}$   (8)

where the squaring is applied elementwise, using the reparameterization trick (Kingma et al., 2015)

$\boldsymbol{b} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \hat{\boldsymbol{\epsilon}}, \quad \hat{\boldsymbol{\epsilon}} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}).$   (9)

Gaussian dropout is a continuous approximation to Bernoulli dropout; in comparison, it better approximates the true posterior distribution and is expected to provide improved uncertainty estimates (Louizos and Welling, 2017). Throughout this paper, Gaussian dropout is used as a substitute for Bernoulli dropout to obtain epistemic uncertainty under the MC dropout framework. It can efficiently be implemented in a few lines of PyTorch code (see Fig. 3). The dropout rate $p$ is now a learnable parameter and does not need to be chosen carefully by hand. In fact, $p$ could be optimized w.r.t. uncertainty calibration, scaling the variance of the implicit Gaussian of dropout. A similar approach was presented by (Gal et al., 2017a) using the Concrete distribution. However, we focus on logit scaling methods for calibration and therefore fix $p$ in our subsequent experiments.

Gaussian dropout has been used in the context of uncertainty estimation in prior work. In (Louizos and Welling, 2017), it is used together with multiplicative normalizing flows to improve the approximate posterior. A similar Gaussian approximation of Batch Normalization was presented in (Teye et al., 2018), where Monte Carlo Batch Normalization is proposed as approximate Bayesian inference.

import torch
import torch.nn.functional as F

def Gaussian_dropout(x, p, layer):
    # Mean of the implicit Gaussian: the ordinary (bias-free) convolution, Eq. (7).
    mu = F.conv2d(x, layer.weight.data)
    # Variance: p/(1-p) * conv(x^2, W^2), Eq. (8); take the square root to get the std.
    sigma = F.conv2d(x ** 2, layer.weight.data ** 2)
    sigma = (p / (1 - p) * sigma).sqrt()
    # Reparameterization trick: sample b = mu + sigma * eps with eps ~ N(0, I), Eq. (9).
    eps = torch.randn_like(mu)
    return mu + sigma * eps

Figure 3: PyTorch implementation of Gaussian dropout for a 2D convolutional layer. Gaussian dropout can be used for all common weight layers.
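A small usage sketch (hypothetical shapes and dropout rate; it assumes the Gaussian_dropout function from Fig. 3 and a bias-free convolutional layer) illustrates that the sample statistics match the implicit Gaussian of Eqs. (6)-(8):

import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv2d(3, 16, kernel_size=3, bias=False)
x = torch.randn(2, 3, 32, 32)
p = 0.5

# Draw many samples from the implicit Gaussian of dropout for this layer.
samples = torch.stack([Gaussian_dropout(x, p, conv) for _ in range(200)])

# Empirical mean approaches the deterministic convolution (Eq. 7) and the
# empirical std approaches sqrt(p/(1-p) * conv(x^2, W^2)) (Eq. 8).
print((samples.mean(0) - conv(x)).abs().mean())
print((samples.std(0) - (p / (1 - p) * F.conv2d(x ** 2, conv.weight ** 2)).sqrt()).abs().mean())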

3.3 Calibration of Uncertainty

To give an insight into our general approach to calibration of uncertainty, we will first revisit the definition of perfect calibration of confidence (Guo et al., 2017) and show how this concept can be extended to calibration of uncertainty.

Let $\hat{y} = \arg\max_{c}\, p_c$ be the most likely class prediction of input $\boldsymbol{x}$ with likelihood $\hat{p} = \max_{c}\, p_c$ and true label $y$. Then, following (Guo et al., 2017), perfect calibration of confidence is defined as

$\mathbb{P}\left( \hat{y} = y \mid \hat{p} = p \right) = p, \quad \forall p \in [0, 1]$   (10)

That is, the probability of a correct prediction given the prediction confidence should exactly correspond to the prediction confidence.

From Eq. (10) and Eq. (3), we define perfect calibration of uncertainty as

$\mathbb{P}\left( \hat{y} \neq y \mid \tilde{H}(\boldsymbol{p}) = u \right) = u, \quad \forall u \in [0, 1]$   (11)

That is, in a batch of inputs that are all predicted with uncertainty $u$, a top-1 error of $u \cdot 100\,\%$ is expected. The confidence is interpreted as the probability of belonging to a particular class, which should naturally correlate with the model error of that class. This characteristic does not generally apply to entropy, and therefore the question arises why entropy should correspond to the model error. However, entropy is considered a measure of uncertainty, and we expect that a prediction with lower uncertainty is less likely to be false and vice versa. In fact, our experimental results for uncalibrated models show that the confidence is as miscalibrated as the normalized entropy (see Fig. 4).

3.4 Expected Uncertainty Calibration Error (UCE)

Due to optimizing the weights via minimization of the negative log-likelihood, modern deep models are prone to overly confident predictions and are therefore miscalibrated (Guo et al., 2017; Gal et al., 2017a). A popular way to quantify miscalibration of neural networks with a scalar value is the expectation of the difference between the predicted softmax likelihood and the accuracy

$\mathbb{E}_{\hat{p}}\!\left[\, \left| \mathbb{P}\left( \hat{y} = y \mid \hat{p} = p \right) - p \right| \,\right]$   (12)

based on the natural expectation that confidence should linearly correlate with the likelihood of a correct prediction. This expectation of the difference can be approximated by the Expected Calibration Error (ECE) (Naeini et al., 2015; Guo et al., 2017). The output of a neural network is partitioned into $M$ bins with equal width, and a weighted average of the difference between accuracy and confidence (softmax likelihood) is taken:

$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$   (13)

with $n$ the total number of inputs and $B_m$ the set of indices of inputs whose confidence falls into bin $m$ (see (Guo et al., 2017) for more details). We propose the following slightly modified notion of Eq. (12) to quantify miscalibration of uncertainty:

$\mathbb{E}_{\tilde{H}}\!\left[\, \left| \mathbb{P}\left( \hat{y} \neq y \mid \tilde{H}(\boldsymbol{p}) = u \right) - u \right| \,\right]$   (14)

We refer to this as the Expected Uncertainty Calibration Error (UCE) and approximate it analogously to Eq. (13) with

$\mathrm{UCE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{err}(B_m) - \mathrm{uncert}(B_m) \right|$   (15)

The error per bin is defined as

$\mathrm{err}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}\!\left( \hat{y}_i \neq y_i \right)$   (16)

where $\hat{y}_i$ and $y_i$ are the predicted and true class labels of input $i$, and $\mathbf{1}(\cdot)$ is the indicator function. Uncertainty per bin is defined as

$\mathrm{uncert}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \tilde{H}\!\left( \boldsymbol{p}_i \right)$   (17)

In (Kull et al., 2019), it is stated that the ECE has a fundamental limitation: due to binning across all classes, over-confidence on one class can be compensated by under-confidence on another class. Thus, a model can achieve low ECE values even if the confidence for each class is either over- or underestimated. They propose the classwise ECE (cECE) and, following that, we additionally define the classwise UCE (cUCE) as

$\mathrm{cUCE} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{UCE}(c)$   (18)

to evaluate classwise calibration. It is defined as the mean of the UCEs of all classes, which are denoted by $\mathrm{UCE}(c)$. Additionally, we plot $\mathrm{err}(B_m)$ vs. $\mathrm{uncert}(B_m)$ to create reliability diagrams and visualize calibration.
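A minimal sketch of Eqs. (15)-(17) under assumed inputs (errors is a boolean tensor marking top-1 misclassifications, uncertainty holds the normalized entropies of Eq. (3); both names are hypothetical):

import torch

def uce(errors, uncertainty, num_bins=15):
    # errors: bool tensor (n,), True where the top-1 prediction is wrong.
    # uncertainty: float tensor (n,) with normalized entropies in [0, 1].
    edges = torch.linspace(0.0, 1.0, num_bins + 1)
    idx = torch.bucketize(uncertainty, edges[1:-1])   # bin index per sample
    n = uncertainty.numel()
    total = uncertainty.new_tensor(0.0)
    for m in range(num_bins):
        in_bin = idx == m
        if in_bin.any():
            err_m = errors[in_bin].float().mean()     # err(B_m), Eq. (16)
            unc_m = uncertainty[in_bin].mean()        # uncert(B_m), Eq. (17)
            total += in_bin.float().sum() / n * (err_m - unc_m).abs()
    return total                                      # UCE, Eq. (15)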

3.5 Temperature Scaling for Dropout Variational Inference

State-of-the-art deep neural networks are generally miscalibrated with regard to the softmax likelihood (Guo et al., 2017). However, when obtaining model uncertainty with dropout variational inference, this uncertainty also tends to be not well-calibrated (Louizos and Welling, 2017; Gal et al., 2017a; Lakshminarayanan et al., 2017). Fig. 1 (left) shows reliability diagrams (Niculescu-Mizil and Caruana, 2005) for ResNet-101 trained on CIFAR-100. The divergence from the identity function reveals miscalibration. Furthermore, it is not possible to robustly detect OoD data from uncalibrated uncertainty (see Fig. 1 (right)): as the fraction of OoD data in a batch of test images is increased, there is almost no increase in mean uncertainty. We first address the problem using temperature scaling, which is the most straightforward logit scaling method for recalibration.

Temperature scaling with MC dropout variational inference is derived by closely following the derivation of frequentist temperature scaling in the appendix of (Guo et al., 2017). Let $\{\boldsymbol{z}_{i,1}, \ldots, \boldsymbol{z}_{i,N}\}$ be the set of logit vectors obtained by MC dropout with $N$ stochastic forward passes for each input $\boldsymbol{x}_i$ with true label $y_i$, $i = 1, \ldots, n$. Temperature scaling is the solution to the entropy maximization

$\max_{q} \; -\sum_{i=1}^{n} \sum_{t=1}^{N} \sum_{c=1}^{C} q(\boldsymbol{z}_{i,t})^{(c)} \log q(\boldsymbol{z}_{i,t})^{(c)}$   (19)

subject to

$q(\boldsymbol{z}_{i,t})^{(c)} \geq 0 \quad \forall\, i, t, c$   (20)
$\sum_{c=1}^{C} q(\boldsymbol{z}_{i,t})^{(c)} = 1 \quad \forall\, i, t$   (21)
$\sum_{i=1}^{n} \sum_{t=1}^{N} z_{i,t}^{(y_i)} = \sum_{i=1}^{n} \sum_{t=1}^{N} \sum_{c=1}^{C} z_{i,t}^{(c)}\, q(\boldsymbol{z}_{i,t})^{(c)}$   (22)

Guo et al. solve this constrained optimization problem with the method of Lagrange multipliers. We do not repeat their proof here, as one can see that the solution in the case of MC dropout integration provides

$q(\boldsymbol{z}_{i,t})^{(c)} \propto \exp\!\left( \lambda\, z_{i,t}^{(c)} \right)$   (23)
$q(\boldsymbol{z}_{i,t}) = \sigma\!\left( \boldsymbol{z}_{i,t} / T \right) \quad \text{with} \quad T = 1/\lambda$   (24)

which yields temperature scaling with temperature $T$ (Guo et al., 2017). A single scalar parameter $T$ cannot rescale the class logits individually. Thus, more complex logit scaling methods can be derived by using any suitable function of the logits at this point to smoothen or sharpen the softmax output (see next section).

In this work, Gaussian dropout is inserted between each weight layer with a fixed dropout rate $p$. Temperature scaling with temperature $T$ is inserted before the final softmax activation and before MC integration:

$\boldsymbol{p} = \frac{1}{N} \sum_{t=1}^{N} \sigma\!\left( f^{\hat{\boldsymbol{\omega}}_{t}}(\boldsymbol{x}) / T \right)$   (25)

First, the network is trained with Gaussian dropout until convergence on the training set. Next, we fix the weights $\boldsymbol{\omega}$ and optimize $T$ with respect to the negative log-likelihood on a separate calibration set using MC Gaussian dropout. This is equivalent to maximizing the entropy of $\boldsymbol{p}$ (Guo et al., 2017).
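A sketch of this calibration phase under assumed inputs (mc_logits and labels are hypothetical names for logits collected with N stochastic forward passes on the calibration set and the corresponding labels); parameterizing T via its logarithm to keep it positive is a choice of this sketch, not prescribed by the paper:

import torch
import torch.nn.functional as F

def fit_temperature(mc_logits, labels, max_iter=50):
    # mc_logits: (N, n, C) logits from N stochastic forward passes on the calibration set.
    # labels:    (n,) true class labels (long tensor).
    log_t = torch.zeros(1, requires_grad=True)   # T = exp(log_t), so T starts at 1
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        # Scale every stochastic forward pass by 1/T, then MC-average (Eq. 25).
        probs = torch.softmax(mc_logits / log_t.exp(), dim=-1).mean(dim=0)
        loss = F.nll_loss(probs.clamp_min(1e-12).log(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()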

3.6 Classwise Logit Scaling

It is stated by (Kull et al., 2019) that temperature scaling is inferior to more complex calibration methods when compared by means of classwise calibration. In (Guo et al., 2017), temperature scaling is used to calibrate the confidence, which takes into account only the probability of the predicted class. In contrast, we use temperature scaling to calibrate the model uncertainty, expressed via the normalized entropy. This considers all class probabilities, and thus we hypothesize that temperature scaling implicitly leads to well-calibrated classwise uncertainty.

To demonstrate this experimentally, we implement vector scaling and auxiliary scaling and compare them using the classwise UCE. Vector scaling is a multi-class extension of temperature scaling, where an individual scaling factor for each class is used to scale the logits before the final softmax:

$\boldsymbol{p} = \frac{1}{N} \sum_{t=1}^{N} \sigma\!\left( f^{\hat{\boldsymbol{\omega}}_{t}}(\boldsymbol{x}) \oslash \boldsymbol{T} \right)$   (26)

with $\boldsymbol{T} \in \mathbb{R}^{C}$ and elementwise division $\oslash$. Auxiliary scaling makes use of a more powerful auxiliary recalibration model $g_{\boldsymbol{\theta}}$, a two-layer fully-connected network with leaky ReLU activation after the hidden layer:

$\boldsymbol{p} = \frac{1}{N} \sum_{t=1}^{N} \sigma\!\left( g_{\boldsymbol{\theta}}\!\left( f^{\hat{\boldsymbol{\omega}}_{t}}(\boldsymbol{x}) \right) \right)$   (27)

which is inspired by (Kuleshov et al., 2018). The intuition behind this is that recalibration may require a more complex function than simple scaling. Both $\boldsymbol{T}$ and the parameters $\boldsymbol{\theta}$ of the auxiliary model are optimized w.r.t. the negative log-likelihood in a separate calibration phase by gradient descent. We initialize $\boldsymbol{T}$ and $\boldsymbol{\theta}$ such that recalibration starts from the identity function.

It must be emphasized that in contrast to temperature scaling, both vector and aux scaling can change the maximum of the softmax and thus affect model accuracy.
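A sketch of the two recalibration heads described above (the hidden layer width and the residual form used to start near the identity are assumptions of this sketch, not specified by the text):

import torch
import torch.nn as nn

class VectorScaling(nn.Module):
    # One scaling factor per class applied to the logits before the softmax (Eq. 26).
    def __init__(self, num_classes):
        super().__init__()
        self.t = nn.Parameter(torch.ones(num_classes))  # initialized as the identity mapping

    def forward(self, logits):
        return logits / self.t

class AuxScaling(nn.Module):
    # Two-layer fully-connected recalibration network on the logits (Eq. 27).
    def __init__(self, num_classes, hidden=128):  # hidden width is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_classes, hidden),
            nn.LeakyReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, logits):
        # Residual connection keeps the initial mapping close to the identity.
        return logits + self.net(logits)

In line with Eqs. (26) and (27), either head would be applied to the logits of every stochastic forward pass before the softmax and MC averaging, and its parameters optimized w.r.t. the NLL on the calibration set.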

3.7 Confidence Penalty

Additionally, we compare temperature scaling to entropy regularization, where low-entropy output distributions are penalized by adding the negative entropy $-H(\boldsymbol{p})$ of the softmax output to the negative log-likelihood training objective, weighted by an additional hyperparameter $\beta$. This leads to the following optimization objective:

$\mathcal{L}(\boldsymbol{\omega}) = \sum_{i=1}^{n} \left[ -\log p\!\left( y_i \mid \boldsymbol{x}_i, \boldsymbol{\omega} \right) - \beta\, H\!\left( \boldsymbol{p}_i \right) \right]$   (28)

We reproduce the experiment of Pereyra et al. on supervised image classification (Pereyra et al., 2017) and compare the quality of calibration of confidence and uncertainty to the logit scaling calibration methods. Calibration by confidence penalty must be performed during training and cannot be done afterwards. Thus, a separate calibration phase is omitted.
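A sketch of the training objective in Eq. (28); the value of beta shown here is arbitrary:

import torch
import torch.nn.functional as F

def confidence_penalty_loss(logits, labels, beta=0.1):
    # Negative log-likelihood plus a penalty on low-entropy (overconfident) outputs, Eq. (28).
    log_probs = F.log_softmax(logits, dim=-1)
    nll = F.nll_loss(log_probs, labels)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return nll - beta * entropy  # subtracting beta * H rewards higher-entropy outputs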

4 Experiments

The experimental results are presented in three parts: first, the proposed logit scaling methods are used to calibrate confidence and uncertainty and are compared with entropy regularization; second, predictions with high uncertainty are rejected; and third, the effect of out-of-distribution data on uncertainty is analyzed. All models were trained from random initialization. More details on the training procedure can be found in the appendix.

4.1 Uncertainty Calibration

To show the effectiveness of uncertainty calibration, we train ResNet-34 (He et al., 2016) and DenseNet-121 (Huang et al., 2017) on CIFAR-10 (Krizhevsky and Hinton, 2009) and SVHN (Netzer et al., 2011), as well as ResNet-101 and DenseNet-169 on CIFAR-100, with Gaussian dropout until convergence. We mainly focus on the calibration of uncertainty obtained by performing $N$ stochastic forward passes with MC Gaussian dropout. Additionally, we reproduce the experiments of (Guo et al., 2017) and analyze the calibration of the frequentist confidence along with likelihood values from MC dropout. Subsequently, the models are calibrated using the previously mentioned logit scaling methods. The validation set with 5,000 images is used as calibration set. We additionally train all networks in the exact same manner with the confidence penalty loss with fixed $\beta$. The proposed UCE and classwise UCE metrics are used to quantify the calibration of uncertainty. Reliability diagrams (top-1 error vs. uncertainty) are used to visualize (mis-)calibration. Classwise UCE values are given in Tab. 1, and the reliability diagrams show the corresponding UCE.

4.2 Rejection of Uncertain Predictions

An example application of well-calibrated uncertainty is the rejection of uncertain predictions. In, e.g., a medical imaging scenario, a critical decision should only be made on the basis of reliable predictions. We define an uncertainty threshold $u_{\mathrm{t}}$ and reject all predictions from the test set where $\tilde{H}(\boldsymbol{p}) > u_{\mathrm{t}}$. A decrease in false predictions on the remaining test set is expected.
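A sketch of the rejection rule (all tensor names are hypothetical; uncertainty holds the normalized entropies of Eq. (3)):

import torch

def reject_uncertain(preds, labels, uncertainty, threshold):
    # Keep only predictions whose uncertainty does not exceed the threshold u_t.
    keep = uncertainty <= threshold
    remaining_error = (preds[keep] != labels[keep]).float().mean()  # top-1 error after rejection
    rejected_fraction = 1.0 - keep.float().mean()
    return remaining_error, rejected_fraction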

4.3 Out-of-Distribution Detection

Deep neural networks only provide reliable predictions for data on which they have been trained. In practice, however, the trained network will encounter samples that lie outside the distribution of training data. Problematically, a miscalibrated model will still produce highly confident estimates for such out-of-distribution (OoD) data (Lee et al., 2018).

To our surprise, Bayesian neural networks have not been extensively studied for out-of-distribution detection. Epistemic uncertainty from MC dropout was successfully used to detect OoD samples in neural machine translation (Xiao et al., 2019). We reproduce the experiments presented by (Lakshminarayanan et al., 2017), where predictive uncertainty obtained from deep ensembles is used to detect if data from CIFAR-10 is provided to a network trained on SVHN. They state that uncertainty produced by MC dropout is over-confident and cannot robustly detect OoD data. We expect that well-calibrated uncertainty from Bayesian methods allows us to detect if data from CIFAR-10 is presented to a deep model trained on SVHN. However, the SVHN data set shows house numbers, whereas the CIFAR data sets contain everyday objects and animals; the data domains are clearly disjoint. To demonstrate the OoD detection ability under more difficult conditions, we additionally provide images from CIFAR-100 to a deep model trained on CIFAR-10 (note that the two CIFAR data sets have no mutual classes).

In this experiment, we compose a batch of 100 images from the test set of the training domain and stepwise replace images with out-of-distribution data. In practice, it is expected that models are applied to a mix of known and unknown classes. After each step, we evaluate the batch mean uncertainty and expect the mean uncertainty to increase as a function of the fraction of OoD data.
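A sketch of this batch-mixing procedure (in_batch, ood_batch, and uncertainty_fn are hypothetical names; uncertainty_fn could, e.g., return the normalized entropy of Eq. (3) under MC Gaussian dropout):

import torch

def mean_uncertainty_vs_ood_fraction(in_batch, ood_batch, uncertainty_fn, steps=10):
    # Stepwise replace in-distribution images with OoD images and record the batch mean uncertainty.
    n = in_batch.size(0)
    results = []
    for k in range(0, n + 1, n // steps):
        mixed = torch.cat([in_batch[k:], ood_batch[:k]], dim=0)
        results.append((k / n, uncertainty_fn(mixed).mean().item()))
    return results  # list of (OoD fraction, mean uncertainty) pairs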

5 Results

Data Set    Model          uncalibrated     conf. penalty    temp. scaling    vector scaling   aux. scaling
                           cECE    cUCE     cECE    cUCE     cECE    cUCE     cECE    cUCE     cECE    cUCE
CIFAR-10    ResNet-34      4.46    4.03     8.29    19.8     1.95    3.68     2.09    3.73     2.10    2.38
CIFAR-10    DenseNet-121   10.1    9.52     8.49    18.5     3.05    5.72     3.15    6.09     2.98    4.55
CIFAR-100   ResNet-101     20.5    23.2     14.6    19.4     10.8    11.5     10.7    11.4     32.9    35.3
CIFAR-100   DenseNet-169   32.4    37.1     15.6    20.6     12.9    13.9     12.8    13.8     48.9    52.6
SVHN        ResNet-34      2.37    2.07     9.11    22.3     1.47    3.47     1.44    3.43     1.34    1.85
SVHN        DenseNet-121   2.91    2.47     7.53    19.7     2.06    5.08     1.96    4.88     1.51    2.46
Table 1: Classwise ECE and UCE test set results in % ($M$ bins). 0 % means perfect calibration.
Figure 4: Reliability diagrams ($M$ bins) on CIFAR-100 for ResNet-101 (left) and DenseNet-169 (right). Top row: uncalibrated frequentist confidence, and likelihood and uncertainty obtained by MC Gaussian dropout. The following rows show the results of the logit scaling methods. The dotted lines illustrate perfect calibration. Additional diagrams can be found in the supplemental material.
Figure 5: (Left) The effect of the uncertainty threshold $u_{\mathrm{t}}$ on the test set error for the rejection of uncertain predictions. (Right) Test set results of out-of-distribution detection.

In this section, the results of the above-mentioned experimental setup are presented and discussed.

5.1 Uncertainty Calibration

Tab. 1 reports classwise UCE test set results and Fig. 4 shows reliability diagrams for the experimental setup described in the previous section. All logit scaling methods considerably reduce miscalibration on CIFAR-10/100 in terms of cECE and cUCE. For the smaller networks on CIFAR-10 and SVHN, the more powerful aux scaling yields the lowest cUCE. On CIFAR-100, however, aux scaling increases miscalibration. In this case, the auxiliary model has enough hidden units to easily overfit the calibration set (we observe a calibration set accuracy of 100 %). This results in worse calibration on the test set than for the uncalibrated model. A possible solution is to add regularization (e.g., early stopping or weight decay) during optimization of the auxiliary model $g_{\boldsymbol{\theta}}$. If the model is already well-calibrated (e.g., for SVHN in our experiments), temperature scaling and vector scaling can slightly worsen calibration. In this case, a larger calibration set is preferred, or recalibration can be omitted altogether. Confidence penalty only slightly reduces miscalibration for the larger models on CIFAR-100. In all other configurations, it leads to worse calibration. As hypothesized in § 2, temperature scaling results in classwise-calibrated uncertainty and is only marginally outperformed by the classwise logit scaling methods. The reliability diagrams in Fig. 4 give additional insight and show that calibrated uncertainty corresponds well with the model error. It is worth noting that the likelihood in the Bayesian approach is generally better calibrated than the frequentist confidence.

5.2 Rejection of Uncertain Predictions

Fig. 5 (left) shows the top-1 error as a function of a decreasing uncertainty threshold $u_{\mathrm{t}}$. For both uncalibrated and calibrated uncertainty, decreasing $u_{\mathrm{t}}$ reduces the top-1 error. Again, we can observe the underestimation of uncalibrated uncertainty: decreasing $u_{\mathrm{t}}$ has little effect at first, and few uncertain predictions are rejected. Using calibrated uncertainty with temperature or vector scaling, the relationship is almost linear, allowing robust rejection of uncertain predictions. Except for aux scaling on CIFAR-100, logit scaling is capable of reducing the top-1 error below 1 %. Further, we observe that confidence penalty can lead to over-estimation of uncertainty.

5.3 Out-of-Distribution Detection

Fig. 5 (right) shows the effect of calibrated uncertainty on OoD detection. All calibration approaches are able to improve the detection of OoD data. The benefit of calibration is most noticeable on ResNet (C10 → C100) and DenseNet (SVHN → C10, C10 → SVHN), where the mean uncertainty of the uncalibrated model stays almost constant for OoD data; thus, robust OoD detection is only possible after calibration. As in Fig. 5 (left), we can observe overestimation of uncertainty for confidence penalty. In some cases (e.g., DenseNet SVHN → C10), this results in more robust OoD detection. This is in contrast to the results presented in (Lakshminarayanan et al., 2017), where MC dropout uncertainty was not able to capture OoD data sufficiently.

6 Conclusion

In this paper, calibration of Bayesian model uncertainty is discussed. We derive logit scaling as an entropy maximization technique to recalibrate the uncertainty of deep models trained with Gaussian dropout. Following commonly accepted metrics for calibration of confidence, we present the (classwise) expected uncertainty calibration error to quantify miscalibration of uncertainty.

Logit scaling calibrates uncertainty obtained by Monte Carlo Gaussian dropout with high effectiveness. The experimental results show that better-calibrated uncertainty allows more robust predictions and detection of out-of-distribution data; a key feature that is particularly important in safety-critical applications. Logit scaling is easy to implement and more effective than confidence penalty during training. Simple scaling methods are preferred over more complex methods, as they provide similar results and do not tend to overfit the calibration set. Temperature scaling improves uncertainty estimation without affecting the accuracy of the model. Vector and auxiliary scaling also improve the calibration of uncertainty, but can have a (positive or negative) influence on predictive accuracy. Since entropy takes all classes into account, the classwise uncertainty calibrated by vector and auxiliary scaling is not substantially better than that calibrated by temperature scaling. Logit scaling calibrates not only the frequentist confidence but also the Bayesian uncertainty.

7 Outlook

Throughout this work, we used a fixed dropout rate $p$ for Gaussian dropout. In (Gal et al., 2017a), the Concrete distribution was used as a continuous approximation to the discrete Bernoulli distribution in dropout, which allows optimizing $p$ w.r.t. calibrated uncertainty. Using Gaussian dropout as described above, we can also recalibrate models by optimizing $p$ w.r.t. the NLL on the calibration set, which scales the variance of the implicit Gaussian to reduce underestimation of uncertainty.

In Bayesian active learning, we want to train a model with a minimal number of expert queries from a pool of unlabeled data. Calibrated uncertainty can further be useful for acquiring the most uncertain samples from the pool data to increase information efficiency (Gal et al., 2017b).

Additionally, pseudo-labels can be generated from the least uncertain predictions in semi-supervised learning. However, there are many factors (e. g. network architecture, weight decay, dropout configuration) influencing the uncertainty in Bayesian deep learning that have not been discussed in this paper and are open to future work.

References

  • C. M. Bishop (2006) Pattern recognition and machine learning. Springer. External Links: ISBN 978-0-387-31073-2 Cited by: §1.
  • C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015) Weight uncertainty in neural network. In ICML, pp. 1613–1622. Cited by: §3.1.
  • C. Chen, A. Seff, A. Kornhauser, and J. Xiao (2015) DeepDriving: learning affordance for direct perception in autonomous driving. In ICCV, Cited by: §1.
  • A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun (2017) Dermatologist-level classification of skin cancer with deep neural networks. Nature 542 (7639), pp. 115–118. Cited by: §1.
  • Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In ICML, pp. 1050–1059. Cited by: §1, §3.1.
  • Y. Gal, J. Hron, and A. Kendall (2017a) Concrete dropout. In NeurIPS, pp. 3581–3590. Cited by: §1, §3.2, §3.4, §3.5, §7.
  • Y. Gal, R. Islam, and Z. Ghahramani (2017b) Deep Bayesian active learning with image data. In ICML, pp. 1183–1192. Cited by: §7.
  • Y. Gal (2016) Uncertainty in deep learning. Ph.D. Thesis, Department of Engineering, University of Cambridge. Cited by: §1.
  • C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In ICML, pp. 1321–1330. Cited by: §1, §1, §2, §3.3, §3.3, §3.4, §3.5, §3.5, §3.5, §3.6, §4.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §4.1.
  • G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In CVPR, pp. 2261–2269. External Links: Document Cited by: §4.1.
  • E. Jang, S. Gu, and B. Poole (2016) Categorical Reparameterization with Gumbel-Softmax. In Bayesian Deep Learning Workshop, NeurIPS, Cited by: §2.
  • A. Kendall and Y. Gal (2017) What uncertainties do we need in Bayesian deep learning for computer vision?. In NeurIPS, pp. 5574–5584. Cited by: §1, §3.1.
  • D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In ICLR, Cited by: §1.
  • D. P. Kingma, T. Salimans, and M. Welling (2015) Variational dropout and the local reparameterization trick. In NeurIPS, pp. 2575–2583. Cited by: §3.2.
  • A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Cited by: §4.1.
  • V. Kuleshov, N. Fenner, and S. Ermon (2018) Accurate uncertainties for deep learning using calibrated regression. In ICML, pp. 2796–2804. Cited by: §1, §2, §3.6.
  • M. Kull, M. P. Nieto, M. Kängsepp, T. Silva Filho, H. Song, and P. Flach (2019) Beyond temperature scaling: obtaining well-calibrated multi-class probabilities with dirichlet calibration. In NeurIPS, pp. 12295–12305. Cited by: §2, §3.4, §3.6.
  • B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In NeurIPS, pp. 6402–6413. Cited by: item 4, §1, §3.5, §4.3, §5.3.
  • K. Lee, K. Lee, H. Lee, and J. Shin (2018) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In NeurIPS, pp. 7167–7177. Cited by: §4.3.
  • C. Louizos and M. Welling (2017) Multiplicative normalizing flows for variational bayesian neural networks. In ICML, pp. 2218–2227. Cited by: §1, §3.2, §3.2, §3.5.
  • C. J. Maddison, A. Mnih, and Y. W. Teh (2016) The concrete distribution: a continuous relaxation of discrete random variables. In Bayesian Deep Learning Workshop, NeurIPS, Cited by: §2.
  • W. J. Maddox, P. Izmailov, T. Garipov, D. P. Vetrov, and A. G. Wilson (2019) A simple baseline for bayesian uncertainty in deep learning. In NeurIPS, pp. 13132–13143. Cited by: §3.1.
  • M. P. Naeini, G. F. Cooper, and M. Hauskrecht (2015) Obtaining Well Calibrated Probabilities Using Bayesian Binning. In AAAI, pp. 2901–2907. Cited by: §3.4.
  • Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. In Deep Learning and Unsupervised Feature Learning Workshop (NeurIPS), Cited by: §4.1.
  • A. Niculescu-Mizil and R. Caruana (2005) Predicting good probabilities with supervised learning. In ICML, pp. 625–632. Cited by: §3.5.
  • G. Pereyra, G. Tucker, J. Chorowski, Ł. Kaiser, and G. Hinton (2017) Regularizing neural networks by penalizing confident output distributions. In arXiv, Note: arXiv:1701.06548 Cited by: §2, §3.7.
  • J. C. Platt (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pp. 61–74. Cited by: §2.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. JMLR 15, pp. 1929–1958. Cited by: §3.1.
  • C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In CVPR, pp. 2818–2826. External Links: Document Cited by: §2.
  • M. Teye, H. Azizpour, and K. Smith (2018) Bayesian uncertainty estimation for batch normalized deep networks. In arXiv, Note: arXiv:1802.06455 Cited by: §3.2.
  • S. Wang and C. Manning (2013) Fast dropout training. In ICML, pp. 118–126. Cited by: §3.2.
  • T. Z. Xiao, A. N. Gomez, and Y. Gal (2019) Wat heb je gezegd? detecting out-of-distribution translations with variational transformers. In Bayesian Deep Learning Workshop (NeurIPS), Cited by: §4.3.
  • B. Zadrozny and C. Elkan (2002) Transforming classifier scores into accurate multiclass probability estimates. In KDD, pp. 694–699. External Links: Document Cited by: §2.

Appendix A Reviews

This paper was submitted to International Conference on Machine Learning (ICML) 2020 and rejected with the following scores:

  • Below the acceptance threshold, I would rather not see it at the conference.

  • Borderline paper, but has merits that outweigh flaws.

  • Borderline paper, but has merits that outweigh flaws.

  • Borderline paper, but the flaws may outweigh the merits.

In the following, we disclose the anonymous reviews and our rebuttal.

a.1 Meta-Review

1. Please provide a meta-review for this paper that explains to both the program chairs and the authors the key positive and negative aspects of this submission. Because authors cannot see reviewer discussions, please also summarize any relevant points that can help improve the paper. Please be sure to make clear what your assessment of the pros/cons of this paper are, especially if your assessment is at odds with the overall reviewer scores. Please do not explicitly mention your recommendation in the meta-review (or you may have to edit it later).

The authors calibrate Gaussian dropout models and observe better calibrated uncertainty. After a discussion, the reviewers converged towards rejection being a more appropriate decision at this time. The reviewers agreed that the paper provides empirical evidence that model calibration is beneficial and that the analysis is sound. However, they generally felt that the novelty of the methods is limited and lacks justification.


a.2 Review #2

Questions

1. Please summarize the main claim(s) of this paper in two or three sentences.

The authors apply the standard calibration techniques to Gaussian dropout models and observe better calibrated uncertainty.

2. Merits of the Paper. What would be the main benefits to the machine learning community if this paper were presented at the conference? Please list at least one.

The paper provides additional empirical evidence that model calibration is beneficial.

3. Please provide an overall evaluation for this submission.

Below the acceptance threshold, I would rather not see it at the conference.

4. Score Justification Beyond what you’ve written above as ”merits“, what were the major considerations that led you to your overall score for this paper?

The results of the paper are trivial. Temperature scaling is a well-known technique that improves the performance of basically all classification models. This particular paper applies it to Gaussian dropout networks.

5. Detailed Comments for Authors Please comment on the following, as relevant: - The significance and novelty of the paper’s contributions. - The paper’s potential impact on the field of machine learning. - The degree to which the paper substantiates its main claims. - Constructive criticism and feedback that could help improve the work or its presentation. - The degree to which the results in the paper are reproducible. - Missing references, presentation suggestions, and typos or grammar improvements.

I have read the author response. Some of my points have been addressed. I am willing to slightly increase my score, but I still think that the paper is below the acceptance threshold.

It is not clear to me why the authors focus on Gaussian dropout. Their main results, eq. 23-24, can be applied to any ensemble. Overall, the result is trivial: one can just take the predictive distribution of any model, be it a single neural network, a deep ensemble, or the result of MC dropout integration, and apply the temperature scaling, vector scaling or matrix scaling to this distribution. Moreover, the authors use Gaussian dropout as an approximation of binary dropout. Why do that when one can just start with Gaussian dropout? Moreover, since the authors mentioned the framework of variational inference, why not just stick with fully factorized Gaussian variational inference from the beginning? It has been a standard technique in Bayesian deep learning for years and does not require the extra steps going from binary dropout to its Bayesian interpretation, to its Gaussian approximation. This makes the paper much more confusing.

”the main contributions of our work are … 3. first to apply logit scaling calibration to a Bayesian classifier obtained from MC dropout“ This has already been done by Ashukha et al 2020. They apply logit scaling to different kinds of ensembles, including Bayesian neural networks in general and both MC dropout and FFG variational inference in particular.

The expected calibration error is a biased metric. Its bias depends on the model, so it cannot be used to compare the calibration of different models (Vaicenavicius et al 2019). The same holds for UCE. How is UCE different from other biased estimates of calibration error (ECE, TACE, SCE and others by Nixon et al 2019)? I am not convinced that this metric can provide any additional insights. Introducing more biased metrics is harmful to the community as it would make the further results on comparing different methods even less reliable. Moreover, there already are some calibration metrics that do not have such problems (Widmann et al 2019).

Nixon, Jeremy, et al. "Measuring calibration in deep learning." arXiv preprint arXiv:1904.01685 (2019).
Vaicenavicius, Juozas, et al. "Evaluating model calibration in classification." arXiv preprint arXiv:1902.06977 (2019).
Widmann, David, Fredrik Lindsten, and Dave Zachariah. "Calibration tests in multi-class classification: A unifying framework." Advances in Neural Information Processing Systems. 2019.
Ashukha, Arsenii, et al. "Pitfalls of In-Domain Uncertainty Estimation and Ensembling in Deep Learning." In International Conference on Learning Representations. 2020.

6. Please rate your expertise on the topic of this submission, picking the closest match.

I have published one or more papers in the narrow area of this submission.

7. Please rate your confidence in your evaluation of this paper, picking the closest match.

I tried to check the important points carefully. It is unlikely, though possible, that I missed something that could affect my ratings.


a.3 Review #4

Questions

1. Please summarize the main claim(s) of this paper in two or three sentences.

The authors propose a methodology for calibrating model uncertainty (measured as entropy of the marginal posterior predictive distribution) instead of the parameters of the (marginal) posterior predictive distribution (ECE). They introduce their approach in the context of MC Dropout, and demonstrate results on a set of experiments.

2. Merits of the Paper. What would be the main benefits to the machine learning community if this paper were presented at the conference? Please list at least one.

The authors propose the aforementioned methodology, and back it up with a set of empirical experiments.

3. Please provide an overall evaluation for this submission. Borderline paper, but has merits that outweigh flaws.

4. Score Justification Beyond what you’ve written above as ”merits“, what were the major considerations that led you to your overall score for this paper?

The authors propose a method to calibrate the model uncertainty, but the indicated approach specifically calibrates the entropy of the marginal posterior predictive distribution, which contains both data and model uncertainty sources. Given that, I would have expected to see an experimental setup that compared the benefit of calibrating according to UCE vs. ECE. The listed experiments demonstrate that it is possible to apply calibration techniques developed for ECE to their proposed UCE, but the reader is left wondering whether UCE provides a marked improvement. The rejection experiments are useful, but it would have been good to compare the results to the alternative of thresholding on the max predicted probability (i.e., Hendrycks et al., 2017). I agree with the authors in the motivation for using model uncertainty, but I still think the paper would benefit from the comparison.

Addendum:
Thank you to the authors for the rebuttal! Given the noted inclusion of UCE vs. ECE experiments, comparison to max predicted probability, added discussion around Nixon et al. 2019, and updated text re: predictive entropy containing both data & model uncertainty, I have increased my score.

5. Detailed Comments for Authors Please comment on the following, as relevant: - The significance and novelty of the paper’s contributions. - The paper’s potential impact on the field of machine learning. - The degree to which the paper substantiates its main claims. - Constructive criticism and feedback that could help improve the work or its presentation. - The degree to which the results in the paper are reproducible. - Missing references, presentation suggestions, and typos or grammar improvements.

Significance: Considering model uncertainty and the extent to which it is calibrated is well-motivated, as is the usage of it for making rejections in order to improve performance. The experiments indicate that there is promise in both calibrating measures that incorporate model uncertainty. However, the experiments do not directly demonstrate the benefit over existing baselines using ECE and the parameters of the (marginal) predictive distribution. One other issue is that the entropy of the marginal posterior predictive distribution is a measure of both data uncertainty and model uncertainty.

Novelty: To the best of the reviewer’s knowledge, a calibration metric for predictive entropy has not been introduced before.

Presentation/clarity:

- p. 1, line 21: "considerably reduce" -> ”considerably reduces“
- p. 2, line 83, left: define ECE and cite Naeini et al., 2015.

6. Please rate your expertise on the topic of this submission, picking the closest match.

I have published one or more papers in the narrow area of this submission.

7. Please rate your confidence in your evaluation of this paper, picking the closest match.

I tried to check the important points carefully. It is unlikely, though possible, that I missed something that could affect my ratings.


a.4 Review #5

Questions

1. Please summarize the main claim(s) of this paper in two or three sentences.

The main claims are a new metric for uncertainty calibration and the introduction of logit scaling with Gaussian MC Dropout. The logit scaling with MC dropout is analyzed empirically.

2. Merits of the Paper. What would be the main benefits to the machine learning community if this paper were presented at the conference? Please list at least one.

Calibration and Bayesian approaches are often seen as two, different approaches for obtaining better calibrated predictions. This paper shows that calibration is also beneficial for Bayesian DNNs. Furthermore, uncertainty in deep learning is a highly relevant topic, as it is substantial for real world deep learning in safety critical environments. Additional insight is always welcome for advancing the field.

3. Please provide an overall evaluation for this submission.

Borderline paper, but has merits that outweigh flaws.

4. Score Justification Beyond what you’ve written above as "merits", what were the major considerations that led you to your overall score for this paper?

The paper is well written and the analysis is sound. It can still be improved, but due to the importance of the topic and the work's quality, I deem it over the acceptance threshold. The novelty of the methods is limited, but additional insight is sufficient for advancing a field.

5. Detailed Comments for Authors Please comment on the following, as relevant: - The significance and novelty of the paper’s contributions. - The paper’s potential impact on the field of machine learning. - The degree to which the paper substantiates its main claims. - Constructive criticism and feedback that could help improve the work or its presentation. - The degree to which the results in the paper are reproducible. - Missing references, presentation suggestions, and typos or grammar improvements.

The novel methods (uncertainty calibration metric and logit scaling for Gaussian dropout) are straightforward applications of known principles and ideas. However, the paper is still somewhat significant due to the novel insight presented by the authors. Especially the combination of Bayesian approximations and calibration is relevant, as it was recently shown that Bayesian methods do not always lead to better calibrated predictions. The paper can trigger additional research into the application of calibration methods for Bayesian approximations, which is especially interesting when considering that Bayesian methods are still very expensive and calibrated Bayesian methods may offer a way to mitigate the flaws of cheaper posterior approximations.

The claims of the paper are sufficiently substantiated. Approaches and equations are well explained and understandable. However, as the paper mostly depends on the results and analysis, this section should be extended. Some possible improvements are: comparison with Dirichlet calibration, better comparison with frequentist results (e.g. cECE) and analysis of class distribution changes (within the same dataset). The results are likely reproducible, due to the available code and the use of standard DNN architectures.

6. Please rate your expertise on the topic of this submission, picking the closest match.

I have seen talks or skimmed a few papers on this topic, and have not published in this area.

7. Please rate your confidence in your evaluation of this paper, picking the closest match.

I am willing to defend my evaluation, but it is fairly likely that I missed some details, didn’t understand some central points, or can’t be sure about the novelty of the work.


a.5 Review #6

Questions

1. Please summarize the main claim(s) of this paper in two or three sentences.

The authors study the problem of calibration of uncertainty inspired by calibration of confidence. Specifically, the authors modify several existing calibration methods to do calibration of uncertainty for Gaussian dropout. The proposed methods are tested on standard calibration tasks in comparison with the corresponding calibration of confidence methods.

2. Merits of the Paper. What would be the main benefits to the machine learning community if this paper were presented at the conference? Please list at least one.

The idea of calibration of uncertainty is interesting and reasonable. As far as I understand, this is the first work to give an attempt.

3. Please provide an overall evaluation for this submission.

Borderline paper, but the flaws may outweigh the merits.

4. Score Justification Beyond what you’ve written above as "merits", what were the major considerations that led you to your overall score for this paper?

Although it is interesting to see a paper attempting calibration of uncertainty, the method is very handwavy and lacks justification.

5. Detailed Comments for Authors Please comment on the following, as relevant: - The significance and novelty of the paper’s contributions. - The paper’s potential impact on the field of machine learning. - The degree to which the paper substantiates its main claims. - Constructive criticism and feedback that could help improve the work or its presentation. - The degree to which the results in the paper are reproducible. - Missing references, presentation suggestions, and typos or grammar improvements.

Compared to previous methods, the only difference is replacing the confidence probability by uncertainty which is measured by normalized entropy. The use of normalized entropy as an uncertainty metric and the definition of the perfect calibration of uncertainty still need justification. The authors did not provide a clear connection of normalized entropy and uncertainty as well as a connection between normalized entropy and top-1 error. Therefore, the basis of all the proposed methods in the paper seems very handwavy.

For the experiments, the authors seem to only compare with ECE in the first experiment. It will be better to report the ECE results on the other experiments as well. I’m curious if calibrated MC dropout is better than a calibrated point estimate. From the results of the first experiment, it did not seem to be true.

Update: Thank the authors for the clarification. However, without seeing the new results, the concerns about experiments remain. Thus I keep the original score.

6. Please rate your expertise on the topic of this submission, picking the closest match.

I have seen talks or skimmed a few papers on this topic, and have not published in this area.

7. Please rate your confidence in your evaluation of this paper, picking the closest match.

I am willing to defend my evaluation, but it is fairly likely that I missed some details, didn’t understand some central points, or can’t be sure about the novelty of the work.


a.6 Rebuttal

1. Author Response to Reviewers Please use this space to respond to any questions raised by reviewers, or to clarify any misconceptions. Please do not include any links to external material, nor include ”late-breaking“ results that are not responsive to reviewer concerns. We request that you understand that this year is especially difficult for many people, and to be considerate in your response.

We thank the reviewers for their valuable feedback. It allows us to improve our paper substantially.

We acknowledge Reviewer #2’s references to Ashukha et al., (2020) and other highly relevant work and will update our literature review accordingly. Reviewer #2’s main concern seems to be the disadvantages of ECE-like calibration metrics. After carefully reading the suggested literature (Widmann et al, 2019; Ashukha et al., 2020; Nixon et al., 2019), two major concerns with recent calibration metrics are raised, which do not apply to UCE: 1. Non-applicability to multi-class classification: In contrast to ECE, UCE considers all class predictions by using the predictive entropy as uncertainty metric. We already addressed that in our manuscript and compare to classwise ECE as suggested by Kull et al., (2019). 2. ”ECE-like scores are minimized by a model with constant uniform predictions“ (Ashukha et al., 2020; and analogously Nixon et al., 2019): This also does not apply to the UCE metric as uniform predictions would result in high entropy. Consider the following example: Binary classification with balanced class frequencies and a model with constant uniform predictions. This would result in ECE=0%, but UCE=50%.

UCE suffers from fixed bin sizes (Nixon et al., 2019), which we will discuss appropriately in our conclusion. This could easily be fixed by combining UCE with adaptive binning from ACE/TACE. We do not believe that the proposed UCE metric is harmful to the community as it does not have the major disadvantages compared to other ECE-like metrics. UCE is a useful metric and can give valuable insights into the calibration of uncertainty.

We focus on Gaussian dropout as we have derived our approach from the MC dropout framework for uncertainty estimation. We will adjust this section and refer to fully factorized Gaussian variational inference to reduce the reader’s confusion.

We thank reviewer #2 for pointing out that temperature scaling was recently applied to MC dropout by Ashukha et al., (2020). We further extend their work by applying more complex logit scaling calibration to a Bayesian classifier obtained from MC dropout. Our work therefore provides additional insights into the calibration of Bayesian neural nets. Our results suggest that the more complex calibration methods (like class-wise calibration) are advantageous compared to only temperature scaling (see bold values in Tab. 1).

Based on feedback from reviewers #4 and #6, we extended our experiments to emphasize the benefits of calibration according to UCE vs. ECE. We now also compare the rejection and OoD detection experiments to thresholding on the max predicted probability (i.e., Hendrycks et al., 2017). We added additional figures and corresponding text passages to the results section of the manuscript.

Based on the comment of Reviewer #6, we realized the lack of a clear connection between normalized entropy and uncertainty/top-1 error. The use of predictive entropy to measure predictive uncertainty in classification is well motivated in Gal, (2016) pp. 51–54. Normalization was introduced to restrict the values to [0, 1] independent of the number of classes C. Normalization is not essential for calibration but gives a more "intuitive" interpretation of the uncertainty values themselves. When all entries of the probability vector are predicted with equal probability, normalized entropy equals 1.0 and we expect the prediction to be false (i.e. the expectation of the top-1 error to be 1.0). We added a more detailed explanation on the use of normalized entropy to the manuscript.

Reviewer #4 mentioned that "the entropy of the marginal posterior predictive distribution is a measure of both data uncertainty and model uncertainty". Classification models trained by minimizing NLL (i.e. cross-entropy) already capture a data-dependent uncertainty. Therefore, the predictive entropy both contains data and model uncertainty. We added a sentence for clarification and changed the manuscript accordingly.

We hope that our revisions meet the expectations of the reviewers. The comments have greatly helped us to increase the quality of our work. We thank the reviewers for their valuable time.

Nixon, J. et al. "Measuring calibration in deep learning." arXiv preprint arXiv:1904.01685 (2019).
Widmann, D. et al. "Calibration tests in multi-class classification: A unifying framework." Advances in Neural Information Processing Systems. 2019.
Ashukha, A. et al. "Pitfalls of In-Domain Uncertainty Estimation and Ensembling in Deep Learning." In International Conference on Learning Representations. 2020.
Gal, Y. Uncertainty in Deep Learning. PhD thesis, Department of Engineering, University of Cambridge, 2016
