
Distance-Based Learning from Errors for Confidence Calibration

by Chen Xing, et al.

Deep neural networks (DNNs) are poorly calibrated when trained in conventional ways. To improve confidence calibration of DNNs, we propose a novel training method, distance-based learning from errors (DBLE). DBLE bases its confidence estimation on distances in the representation space. We first adapt prototypical learning to train the classification model in DBLE. It yields a representation space where the distance from a test sample to its ground-truth class center can calibrate the model performance. At inference, however, these distances are not available due to the lack of ground-truth labels. To circumvent this by approximately inferring the distance for every test sample, we propose to train a confidence model jointly with the classification model by learning only from mis-classified training samples, which we show to be highly beneficial for effective learning. On multiple datasets and DNN architectures, we demonstrate that DBLE outperforms alternative single-modal confidence calibration approaches. DBLE also achieves performance comparable to computationally expensive ensemble approaches, at a lower computational cost and with fewer parameters.


1 Introduction

Deep neural networks (DNNs) are being deployed in many important decision-making scenarios (Goodfellow et al., 2016). Making wrong decisions could be very costly in most of them (Brundage et al., 2018): it could cost human lives in medical diagnosis and autonomous transportation, and it could cause significant business losses in loan categorization and sales forecasting. To prevent this, it is highly desirable for a DNN to output its confidence in its decisions. In almost all of the aforementioned scenarios, detrimental consequences could be avoided by refraining from making decisions, or by consulting human experts, whenever the confidence in a decision is insufficient. In addition, by tracking the confidence in decisions, dataset shifts can be detected and developers can build insights towards improving the model performance.

Despite the success of DNNs in many scenarios (Mahajan et al., 2018; Park et al., 2019), confidence quantification (also referred to as ‘confidence calibration’) is a particularly challenging problem for them. For a ‘well-calibrated’ model, the predictions with higher confidence should be more likely to be accurate. However, as studied in (Nguyen et al., 2014; Guo et al., 2017), for DNNs with conventional (also referred to as ‘vanilla’) training to minimize the softmax cross-entropy loss with SGD, the outputs do not contain sufficient information for well-calibrated confidence estimation. Posterior probability estimates (the softmax outputs) can be interpreted as confidence estimates, but they calibrate the decision quality poorly (Gal & Ghahramani, 2016): the confidence value tends to be large even when the classification is inaccurate. Therefore, it would be desirable to redesign the training approach so that it sheds more light on the model's decision-making process and provides more accurate confidence calibration.

In this paper, we propose a novel training method, Distance-based Learning from Errors (DBLE), towards better-calibrated DNNs. Generally, DBLE learns a distance-based representation space through classification and exploits distances in the space to yield well-calibrated classification. Our motivation is that a test sample’s location in the representation space and its distance to training samples contain important information about the model’s decision-making process, which is useful for guiding confidence estimation. However, in vanilla training, neither training nor inference is based on distances in the representation space, so they are not optimized to fulfill this goal. Therefore, in DBLE, we propose to adapt prototypical learning for training and inference to learn a distance-based representation space through classification. In this space, a test sample’s distance to its ground-truth class center can calibrate the model’s performance. However, this distance cannot be calculated at inference directly since the ground-truth label is unknown. To this end, we propose to train a separate confidence model in DBLE jointly with the classification model to estimate this distance at inference. To train the confidence model, we utilize the mis-classified training samples during training. We demonstrate that on multiple DNN models and datasets, DBLE achieves superior confidence calibration without the increase in computational cost that ensemble methods incur.

2 Related Work

Many studies (Nguyen et al., 2014; Guo et al., 2017) have shown that the classification performance of DNNs is poorly calibrated under vanilla training. One direction of work to improve calibration is adapting regularization methods for vanilla training. Such regularization methods were often originally proposed for other purposes. For example, Label Smoothing (Szegedy et al., 2016), which was proposed for improving generalization, has been empirically shown to improve confidence calibration (Müller et al., 2019). Mixup (Zhang et al., 2017), which is designed mostly for adversarial robustness, also improves confidence calibration (Thulasidasan et al., 2019). The benefits from such approaches are typically limited, as the modified objective functions do not represent the goal of confidence estimation. To directly minimize a confidence scoring objective, Temperature Scaling (Guo et al., 2017) modifies learning by minimizing a Negative Log-Likelihood (Friedman et al., 2001) objective on a small training subset. It yields better-calibrated confidence by directly minimizing an evaluation metric of the calibration task. However, since it requires the training set to be split between the two tasks, classification and calibration, it results in a trade-off: if more training data is used for calibration, the model’s classification performance drops because less training data is left for classification.

Bayesian DNNs constitute another direction of related work. Bayesian DNNs promise to directly model the posterior distribution for the test set given the training set (Rohekar et al., 2019). In Bayesian DNNs, since computing the posterior over a large number of parameters is usually intractable, different approximations are applied, such as Markov chain Monte Carlo (MCMC) methods (Neal, 2012) and variational Bayesian methods (Blundell et al., 2015; Graves, 2011). Therefore, the calibration of Bayesian DNNs heavily depends on the degree of approximation and the pre-defined prior in variational methods. Moreover, in practice, Bayesian DNNs often yield prohibitively slow training due to the slow convergence of the posterior predictive estimation (Lakshminarayanan et al., 2017). Recent work on Bayesian DNNs (Heek & Kalchbrenner, 2019) reports that the training time of their model is an order of magnitude higher than that of vanilla training. As Bayesian-inspired approaches, Monte Carlo Dropout (Gal & Ghahramani, 2016) and Deep Ensembles (Lakshminarayanan et al., 2017) are two widely used methods, since they only require minor changes to the classic training method of DNNs and are relatively faster to train.

3 Learning a distance-based space for confidence calibration through classification

We first introduce how DBLE applies prototypical learning to train DNNs for classification. Then, we explain that under this new training method, a distance-based representation space can be achieved to benefit confidence calibration.

3.1 Prototypical Learning for Classification

DBLE bases the training of the classification model on prototypical learning (Snell et al., 2017). Prototypical learning was originally proposed to learn a distance-based representation space for few-shot learning (Ravi & Larochelle, 2016; Oreshkin et al., 2018; Xing et al., 2019). In prototypical learning, both training and prediction depend solely on the distance of the samples to their corresponding class ‘prototypes’ (referred to as ‘class centers’ in this paper) in the representation space. It trains the model with the goal of minimizing the intra-class distances while maximizing the inter-class distances, such that related samples are clustered together. We adapt this training method for classification training in DBLE. The following subsections describe the two main components of prototypical training: episodic training and the prototypical loss for classification. We then explain inference at test time in DBLE and why it leads to a representation space that benefits confidence calibration.

3.1.1 Episodic training

Vanilla DNN training for classification is based on variants of mini-batch gradient descent (Bottou, 2010). Episodic training, on the other hand, samples $K$-shot, $N$-way episodes at every update (as opposed to sampling a batch of training data). An episode $e$ is created by first sampling $N$ random classes out of the $M$ total classes (we do not require $N$ to be equal to $M$, because fitting the support samples of all classes in a batch into processor memory can be challenging when $M$ is very large), and then sampling two sets of training samples for these classes: (i) the support set $S_e$ containing $K$ examples for each of the $N$ classes and (ii) the query set $Q_e$ containing different examples from the same $N$ classes.
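As an illustration of the sampling procedure described above, the following is a minimal sketch using only the Python standard library. The function name `sample_episode` and its parameters `n_way`, `k_shot` and `n_query` are hypothetical names chosen for this example, and the toy label list stands in for a real training set.

```python
import random
from collections import defaultdict

def sample_episode(labels, n_way, k_shot, n_query, rng):
    """Return (classes, support, query) for one episode.

    labels  : list of integer class labels, one per training sample
    n_way   : number of classes drawn for the episode (assumed name)
    k_shot  : support samples per class (assumed name)
    n_query : query samples per class, disjoint from the supports
    support and query are lists of indices into `labels`.
    """
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    # Draw the episode's classes, then draw supports and queries per class.
    classes = rng.sample(sorted(by_class), n_way)
    support, query = [], []
    for c in classes:
        picked = rng.sample(by_class[c], k_shot + n_query)
        support.extend(picked[:k_shot])   # first k_shot indices are supports
        query.extend(picked[k_shot:])     # the rest are queries
    return classes, support, query

# Toy data: 5 classes with 10 samples each.
labels = [i % 5 for i in range(50)]
classes, support, query = sample_episode(
    labels, n_way=3, k_shot=2, n_query=4, rng=random.Random(0))
```

Drawing the supports and queries of a class in a single call without replacement is one simple way to keep the two sets disjoint.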

Figure 1: The training of DBLE. Taking a support set and a query image as input, the classification model $f_\theta$ learns a representation space through classification training. In this space, the query input is encoded as a representation and the class centers are computed from the support set. For classification, DBLE employs the proto-loss to update the classification model. For calibration, DBLE employs the centers and the sampled representation for the query to update the confidence model. If the query is correctly classified, the components in the dotted rectangle are not computed. Solid arrows represent the forward pass and dotted arrows represent the backward pass to update the network.

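At episode $e$, the model is updated to map each query $x_t$ to its correct label $y_t$, with the help of the support set $S_e$. The model is a function parameterized by $\theta$, and the loss is the negative log-likelihood of the ground-truth class label of each query sample given the support set, according to Eq. 1:

$$\mathcal{L}(\theta) = \mathbb{E}_{(S_e, Q_e)} \left[ -\sum_{(x_t, y_t) \in Q_e} \log p(y_t \mid x_t, S_e; \theta) \right]. \tag{1}$$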

3.1.2 Prototypical loss

How does the support set help the classification of query samples at episode $e$? We first employ the DNN classification model $f_\theta$ to encode inputs in a representation space, where $\theta$ are trainable parameters. Prototypical training then uses the support set $S_e$ to compute the center $c_k$ for each class $k$ (in the sampled episode). The query samples are classified based on their distance to each class center in the representation space. For every episode $e$, each center $c_k$ is computed by averaging the representations of the support samples from class $k$:

$$c_k = \frac{1}{|S_e^k|} \sum_{(x_j, y_j) \in S_e^k} f_\theta(x_j), \tag{2}$$

where $S_e^k \subset S_e$ is the subset of support samples belonging to class $k$. The prototypical loss calculates the predictive label distribution of query $x_t$ based on its distances to the centers:

$$p(y_t \mid x_t, S_e; \theta) = \frac{\exp\left(-d(h_t, c_{y_t})\right)}{\sum_{k'} \exp\left(-d(h_t, c_{k'})\right)}, \tag{3}$$

where $d(\cdot, \cdot)$ is the distance in the representation space and $h_t$ is the representation of $x_t$ in the space:

$$h_t = f_\theta(x_t). \tag{4}$$

The model is trained by minimizing Eq. 1 with $p(y_t \mid x_t, S_e; \theta)$ given by Eq. 3. Through this training, in the representation space in which we calculate $h_t$ and the class centers, the inter-class distances are maximized and the intra-class distances are minimized. Therefore, training samples belonging to the same class are clustered together and clusters representing different classes are pushed apart.
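To make the computation concrete, here is a minimal forward-only sketch of the class centers and the prototypical loss on toy 2-D representations. It assumes the Euclidean (L2) distance and takes precomputed representations as plain lists; the helper names `class_centers`, `proto_probs` and `proto_loss` are hypothetical, and no gradient computation or encoder is included.

```python
import math

def class_centers(support_reps, support_labels):
    """Average the support representations of each class."""
    sums, counts = {}, {}
    for h, y in zip(support_reps, support_labels):
        if y not in sums:
            sums[y] = [0.0] * len(h)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], h)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def proto_probs(h, centers):
    """Softmax over negative L2 distances from h to every center."""
    logits = {k: -math.dist(h, c) for k, c in centers.items()}
    m = max(logits.values())                      # subtract max for stability
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

def proto_loss(query_reps, query_labels, centers):
    """Mean negative log-likelihood of the ground-truth query labels."""
    nll = [-math.log(proto_probs(h, centers)[y])
           for h, y in zip(query_reps, query_labels)]
    return sum(nll) / len(nll)

# Toy episode: two classes, two supports each, one query of class 0.
centers = class_centers(
    [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]], [0, 0, 1, 1])
loss = proto_loss([[1.0, 1.0]], [0], centers)
```

In this toy episode the centers are `[0, 1]` and `[4, 1]`, so the query at `[1, 1]` is at distance 1 from its own center and 3 from the other, and its loss is small. In practice the representations would come from the encoder and the loss would be minimized with an automatic-differentiation framework.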

3.1.3 Inference and the test sample’s distance to its ground-truth class center

At inference, we calculate the center of every class $k$ using the training set, by averaging the representations of all corresponding training samples:

$$c_k^{test} = \frac{1}{|T_k|} \sum_{(x_j, y_j) \in T_k} f_\theta(x_j), \tag{5}$$

where $T_k$ is the set of all training samples belonging to class $k$. Then, given a test sample $(x_t, y_t)$ in the test set (where $y_t$ is the ground-truth label), the distances of its representation $h_t$ (Eq. 4) to all class centers are calculated. The prediction of the label of $x_t$ is based on these distances:

$$\hat{y}_t = \arg\min_{k \in \mathcal{K}} d(h_t, c_k^{test}), \tag{6}$$

where $\mathcal{K}$ consists of all classes in the training set. In other words, $x_t$ is assigned to the class with the closest center in the representation space. Consequently, for a test sample $x_t$, if at inference $h_t$ is too far away from its ground-truth class center $c_{y_t}^{test}$, it is likely to be mis-classified. Therefore, in this distance-based representation space, a test sample’s distance to its ground-truth class center, $d(h_t, c_{y_t}^{test})$, is a strong signal indicating how well the model would perform on this sample. In the next section, we show this empirically under DBLE and compare with other methods.
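The nearest-center rule described above can be sketched in a few lines. This is an illustration under the assumption of L2 distance; the function name `predict`, the class names and the toy centers are hypothetical.

```python
import math

def predict(h, centers):
    """Return the label of the closest class center and all distances."""
    dists = {k: math.dist(h, c) for k, c in centers.items()}
    return min(dists, key=dists.get), dists

# Toy centers, assumed to be precomputed from the training set.
centers = {"cat": [0.0, 0.0], "dog": [3.0, 4.0]}
label, dists = predict([2.5, 3.5], centers)
```

When the ground-truth label of a sample is known, as for training samples, the entry of `dists` under that label is the sample's distance to its ground-truth class center.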

3.2 Distance to the ground-truth class center calibrates classification performance in DBLE

(a) CIFAR-100


Figure 2: Average test accuracy as $d_t$ or $d'_t$ increases. $d_t$ is the distance of a test sample to its ground-truth class center and $d'_t$ is its distance to the predicted class center. It shows that $d_t$ in the space achieved by prototypical learning in DBLE can better estimate the model’s performance on a test sample, since DBLE’s accuracy curve is more monotonic and oscillates less as the distance increases.

We design an experiment to show that, in the space learned by prototypical learning in DBLE, the model’s performance on a test sample can be estimated by the test sample’s L2-distance to its ground-truth class center. On the other hand, other intuitive measures, such as the L2-distance to the predicted class center in prototypical learning, the L2-distance to the predicted class center in vanilla training, or the L2-distance to the ground-truth center in vanilla training, cannot calibrate performance very well. Here, we describe the empirical observations on DBLE’s classification model with prototypical learning, to motivate the confidence modeling of DBLE before we explain the method.

In the learned representation space (either the output space of a model from vanilla training or from prototypical learning), for every sample $(x_t, y_t)$ in the test set, we calculate its L2-distance to its ground-truth class center as:

$$d_t = d\left(f_\theta(x_t), c_{y_t}^{test}\right), \tag{7}$$

where $c_{y_t}^{test}$ is given by Eq. 5, $f_\theta$ is the trained classification model and $d(\cdot, \cdot)$ is the L2-distance. For comparison, we can also calculate the distance of $x_t$ to its predicted class center $c_{\hat{y}_t}^{test}$, where $\hat{y}_t$ is the predicted label, in both DBLE and vanilla training. We denote the distance of $x_t$ to its predicted class center as $d'_t$. We then sort $d_t$ or $d'_t$ in ascending order and partition them into equally spaced bins. For every bin, we calculate the model’s classification accuracy on the test samples lying in the bin. Then we plot the test accuracy curve as the distance increases. The curve shows the relationship between the distance and the model’s performance on the test samples, illustrating how well the calculated distance can act as a confidence score calibrating the model’s prediction.
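The binning step of this experiment could be sketched as follows. The function name `accuracy_by_distance`, the number of bins and the toy inputs are assumptions made for illustration; they are not the settings used in the experiment.

```python
def accuracy_by_distance(distances, correct, n_bins):
    """Accuracy of samples grouped into equally spaced distance bins.

    distances : per-sample distance to a class center
    correct   : per-sample bool, True if the prediction was right
    Returns a list of (bin_lower_edge, accuracy), with accuracy set to
    None for a bin that contains no samples.
    """
    lo, hi = min(distances), max(distances)
    width = (hi - lo) / n_bins or 1.0       # guard against zero range
    hits = [0] * n_bins
    totals = [0] * n_bins
    for d, ok in zip(distances, correct):
        b = min(int((d - lo) / width), n_bins - 1)  # largest d goes to last bin
        hits[b] += int(ok)
        totals[b] += 1
    return [(lo + i * width, hits[i] / totals[i] if totals[i] else None)
            for i in range(n_bins)]

# Toy data in which accuracy drops as the distance grows.
curve = accuracy_by_distance(
    [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0],
    [True, True, True, False, True, False, False, False],
    n_bins=2)
```

A distance measure that calibrates the model well would produce a curve whose accuracy falls steadily from the first bin to the last.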

As shown in Fig. 2, $d_t$ of DBLE, which is the distance of a test sample to its ground-truth class center, calibrates the model’s performance the best. The curve of $d_t$ under DBLE’s prototypical learning, despite a slight oscillation near the end, decreases monotonically to almost zero. This indicates that the farther away a sample is from its ground-truth class center, the more poorly the model performs on it. These results suggest that $d_t$ can be used as a gold confidence measure, calibrating the quality of the model’s decision on $x_t$ for a classification model under prototypical learning. This benefit of DBLE’s prototypical learning is due to the distance-based training and inference: if a test sample is farther away from its ground-truth class center, it is more likely to be mis-classified as another class.

Although we have observed this gold calibration in DBLE enabled by prototypical learning, it requires access to the ground-truth labels $y_t$ of test samples. At inference, however, we do not have access to the ground-truth label of $x_t$, so $d_t$ cannot be directly calculated. Therefore, we propose to use a separate confidence model to estimate $d_t$, trained jointly with the classification training of DBLE. It regresses $d_t$ by learning from the mis-classified training samples in DBLE’s classification training, as described next.

4 Confidence Modeling by Learning from Errors

In the classification model learned by DBLE, a test sample $x_t$’s L2-distance to its ground-truth class center, $d_t$ (Eq. 7), is highly calibrated with the model’s performance on it, as described in Sec. 3. However, $d_t$ cannot be directly computed without ground-truth labels, which is the case at inference. Therefore, we introduce a confidence model $g_\phi$, parameterized by $\phi$, that learns to estimate $d_t$ jointly with the training of classification in DBLE.

To train $g_\phi$, a straightforward option would be to use the distances of all training samples to their ground-truth class centers. In this way, $g_\phi$ could be trained to give an estimate of $d_t$ for the test sample $x_t$ at inference. However, correctly classified samples constitute the vast majority during training, especially considering that state-of-the-art DNNs yield much lower training errors than test errors (Neyshabur et al., 2014). Therefore, if all data is used, the training of $g_\phi$ would be dominated by the small distances of the correctly classified samples, which would make it harder for $g_\phi$ to capture the larger distances of the minority of mis-classified samples. Moreover, given that we choose $g_\phi$ to be a simple MLP with limited capacity for the purpose of fast training, it becomes more challenging to capture the larger distances of the incorrectly classified samples if all training samples are used (i.e. the confidence model would underfit the mis-classified training data). Therefore, we propose to track all training samples that are mis-classified in episode $e$ during the training of the classification model in DBLE. We save them in a set $M_e$ for each episode to train $g_\phi$. With our ablation studies in Sec. 5.4, we demonstrate the importance of learning from mis-classified training samples vs. learning from all training samples.

In the following, we introduce the training procedure of $g_\phi$ with mis-classified samples, for the estimation of $d_t$ at inference for a test sample $x_t$. We train $g_\phi$ by sampling from the isotropic Gaussian distribution $\mathcal{N}(h_s, \sigma_s^2 I)$, where the standard deviation $\sigma_s$ is $g_\phi$’s output given a mis-classified training sample $x_s$. $g_\phi$ is trained to output a larger $\sigma_s$ for an $x_s$ with a larger $d_s$. $g_\phi$ takes the representation $h_s$ (calculated with Eq. 4) of a mis-classified training sample $x_s$ in $M_e$ as input, and outputs $\sigma_s$ as

$$\sigma_s = g_\phi(h_s). \tag{8}$$

To train $g_\phi$, we first sample another representation $z_s$ for $x_s$ from the isotropic Gaussian distribution parameterized by $h_s$ and $\sigma_s$, $z_s \sim \mathcal{N}(h_s, \sigma_s^2 I)$. Then, we optimize $\phi$ based on the prototypical loss with $z_s$, using $z_s$’s predictive label distribution:

$$p(y_s \mid x_s; \phi) = \frac{\exp\left(-d(z_s, c_{y_s})\right)}{\sum_{k'} \exp\left(-d(z_s, c_{k'})\right)}. \tag{9}$$

When updating $\phi$ with the prototypical loss given $z_s$, a larger $\sigma_s$ is encouraged when the $d_s$ of $x_s$ is larger. This is because, when $h_s$ is fixed for $x_s$ in the space, maximizing Eq. 9 forces $z_s$ to be as close to its ground-truth class center as possible. Given that $z_s$ is sampled from $\mathcal{N}(h_s, \sigma_s^2 I)$, if $h_s$ is farther away from its ground-truth center (which is the case for mis-classified training samples), then a larger $\sigma_s$ is required for $z_s$ to get close to it. Fig. 3 visualizes how $\sigma_s$ changes before vs. after updating $\phi$. The training of DBLE is described in Fig. 1 and in Algorithm 1 in Appendix A.1. Note that in order to make the sampling of $z_s$ differentiable, we use the reparameterization trick (Kingma & Welling, 2013).
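The following forward-only sketch illustrates the reparameterized sampling and the loss on the sampled representation. It assumes a scalar standard deviation `sigma`, L2 distance, and two toy class centers; the names `reparameterize` and `neg_log_prob` are hypothetical, and the noise vector `eps` is fixed by hand to one possible draw so that the example is deterministic. A real implementation would compute `sigma` with the confidence model and backpropagate through it with an automatic-differentiation framework.

```python
import math

def reparameterize(h, sigma, eps):
    """z = h + sigma * eps, where eps is a draw from a standard normal."""
    return [hi + sigma * ei for hi, ei in zip(h, eps)]

def neg_log_prob(z, label, centers):
    """Prototypical loss of representation z for its ground-truth label."""
    logits = {k: -math.dist(z, c) for k, c in centers.items()}
    m = max(logits.values())
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return log_norm - logits[label]

centers = {"true": [0.0, 0.0], "wrong": [4.0, 0.0]}
h = [4.0, 0.0]        # a mis-classified sample sitting on the wrong center
eps = [-1.0, 0.0]     # one possible noise draw, fixed for illustration

# With no noise the sampled point stays on the wrong center.
loss_small_sigma = neg_log_prob(reparameterize(h, 0.0, eps), "true", centers)
# With a larger sigma this draw lands halfway between the two centers.
loss_large_sigma = neg_log_prob(reparameterize(h, 2.0, eps), "true", centers)
```

For this particular draw, the larger `sigma` moves the sampled point toward the ground-truth center and lowers the loss, which is the direction of the training signal described above.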

(a) Before updating the confidence model
(b) After updating the confidence model
Figure 3: The Gaussian distribution within one standard deviation of the mean, shown before and after updating the confidence model. Each dotted circle represents the $\sigma$ of the sample inside it. Red and blue dots represent training samples belonging to two different classes in the representation space. The dotted line is the decision boundary in the space. Take two mis-classified training samples and one correctly classified sample as examples. (a) Before the update, the $\sigma$ values of both correctly and wrongly located samples are initialized with a small value. (b) After the update, the $\sigma$ values of the mis-classified samples are much larger, because the proto-loss for calibration moves the representation sampled from the Gaussian to be as close to the correct class center as possible.

At inference, for every test sample $x_t$, we make the prediction $\hat{y}_t$ with $f_\theta$ (see Eq. 6), while for confidence estimation of the predictions we take advantage of $g_\phi$. After getting $h_t = f_\theta(x_t)$ and $\sigma_t = g_\phi(h_t)$, we sample multiple $z_t^u$ from $\mathcal{N}(h_t, \sigma_t^2 I)$ and average their predictive label distributions as the confidence estimate:

$$\hat{p}(\hat{y}_t \mid x_t) = \frac{1}{U} \sum_{u=1}^{U} p(\hat{y}_t \mid z_t^u), \tag{10}$$

where $U$ is the total number of representation samples $z_t^u$. $\hat{p}(\hat{y}_t \mid x_t)$ is used as the confidence score calibrating the prediction $\hat{y}_t$. In this way, for a test sample that is farther away from its ground-truth class center (and is therefore more likely to be mis-classified), the model adds more randomness to the representation sampling, since its estimated variance from $g_\phi$ is large. More randomness in the softmax input space leads to a lower expected softmax output (Gal & Ghahramani, 2016), which is the confidence score $\hat{p}(\hat{y}_t \mid x_t)$.
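A minimal sketch of this sampling-and-averaging step is given below, assuming a scalar standard deviation, L2 distance and toy class centers. The names `proto_probs` and `confidence` are hypothetical, and `sigma` is passed in directly rather than produced by a trained confidence model.

```python
import math
import random

def proto_probs(z, centers):
    """Softmax over negative L2 distances from z to every center."""
    logits = {k: -math.dist(z, c) for k, c in centers.items()}
    m = max(logits.values())
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def confidence(h, sigma, centers, n_samples=20, rng=None):
    """Predict with the nearest center, then average the predicted
    label's probability over n_samples noisy copies of h."""
    rng = rng or random.Random()
    pred = min(centers, key=lambda k: math.dist(h, centers[k]))
    acc = 0.0
    for _ in range(n_samples):
        z = [hi + sigma * rng.gauss(0.0, 1.0) for hi in h]
        acc += proto_probs(z, centers)[pred]
    return pred, acc / n_samples

centers = {0: [0.0, 0.0], 1: [4.0, 0.0]}
pred, conf_no_noise = confidence([1.0, 0.0], 0.0, centers)
_, conf_noisy = confidence([1.0, 0.0], 3.0, centers, rng=random.Random(0))
```

With `sigma` set to zero every sample coincides with the original representation, so the score reduces to the plain prototypical probability of the predicted label. The prediction itself never depends on the sampled noise, only the confidence score does.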

5 Experiments

In this section, we first compare our method DBLE with recent baselines on confidence calibration of DNNs. We conduct experiments on a variety of datasets and network architectures and evaluate DBLE’s performance on the two most commonly used metrics for confidence calibration: Expected Calibration Error (ECE) (Naeini et al., 2015) and Negative Log-Likelihood (NLL) (Friedman et al., 2001). Results show that DBLE outperforms single-modal baselines on confidence calibration in every scenario tested. We then conduct an ablation study to verify the effectiveness of the two main components of DBLE. Implementation and training details of DBLE and the baselines are described in Appendix A.2.

5.1 Experimental Setup

Baselines. We compare our method with baselines that use a single DNN: vanilla training, MC-Dropout (Gal & Ghahramani, 2016), Temperature Scaling (Guo et al., 2017), Mixup (Thulasidasan et al., 2019), Label Smoothing (Szegedy et al., 2016) and TrustScore (Jiang et al., 2018). We also compare DBLE with a Deep Ensemble (Lakshminarayanan et al., 2017) of 4 DNNs.

Datasets and Network Architectures. We conduct experiments on various combinations of datasets and architectures: MLP on MNIST (LeCun et al., 1998), VGG-11 (Simonyan & Zisserman, 2014) on CIFAR-10 (Krizhevsky et al., 2009), ResNet-50 (He et al., 2016) on CIFAR-100 (Krizhevsky et al., 2009) and ResNet-50 on Tiny-ImageNet (Deng et al., 2009).

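Evaluation Metrics. We evaluate DBLE on model calibration with Expected Calibration Error (ECE) and Negative Log-Likelihood (NLL). ECE approximates the expectation of the difference between accuracy and confidence. It partitions the confidence estimates (the likelihood of the predicted label $\hat{y}_t$) of all $N$ test samples into $B$ equally spaced bins and calculates the average confidence and accuracy of the test samples lying in each bin $S_b$:

$$\mathrm{ECE} = \sum_{b=1}^{B} \frac{|S_b|}{N} \left| \frac{1}{|S_b|} \sum_{x_t \in S_b} \mathbb{1}(\hat{y}_t = y_t) - \frac{1}{|S_b|} \sum_{x_t \in S_b} \hat{p}(\hat{y}_t \mid x_t) \right|, \tag{11}$$

where $\hat{y}_t$ is the predicted label of $x_t$. NLL averages the negative log-likelihood of all test samples:

$$\mathrm{NLL} = -\frac{1}{N} \sum_{t=1}^{N} \log \hat{p}(y_t \mid x_t). \tag{12}$$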
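Both metrics can be computed in a few lines. The sketch below is an illustration rather than the evaluation code used in the experiments: the function names are hypothetical, bins are taken as half-open intervals with the value 1.0 assigned to the last bin, and ECE is returned as a fraction rather than a percentage.

```python
import math

def expected_calibration_error(confidences, correct, n_bins=15):
    """Weighted average gap between accuracy and mean confidence per bin.

    confidences : confidence score of the predicted label, in [0, 1]
    correct     : per-sample bool, True if the prediction was right
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        b = min(int(conf * n_bins), n_bins - 1)   # conf == 1.0 -> last bin
        bins[b].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue                               # empty bins contribute 0
        avg_conf = sum(c for c, _ in members) / len(members)
        acc = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(acc - avg_conf)
    return ece

def negative_log_likelihood(true_label_probs):
    """Mean negative log of the probability given to the true label."""
    return -sum(math.log(p) for p in true_label_probs) / len(true_label_probs)

ece = expected_calibration_error(
    [0.95, 0.95, 0.65, 0.65], [True, True, True, False], n_bins=5)
nll = negative_log_likelihood([0.5, 0.25])
```

In the toy call, the two samples at confidence 0.95 are both correct (gap 0.05) and the two at 0.65 are half correct (gap 0.15), so each bin contributes half of its gap to the total.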


5.2 Results

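| Method | MNIST-MLP Accuracy (%) | MNIST-MLP ECE (%) | MNIST-MLP NLL | CIFAR10-VGG11 Accuracy (%) | CIFAR10-VGG11 ECE (%) | CIFAR10-VGG11 NLL |
|---|---|---|---|---|---|---|
| Vanilla Training | 98.32 | 1.73 | 0.29 | 90.48 | 6.3 | 0.43 |
| MC-Dropout | 98.32 | 1.71 | 0.34 | 90.48 | 3.9 | 0.47 |
| Temperature Scaling | 95.14 | 1.32 | 0.17 | 89.83 | 3.1 | 0.33 |
| Label Smoothing | 98.77 | 1.68 | 0.30 | 90.71 | 2.7 | 0.38 |
| Mixup | 98.83 | 1.74 | 0.24 | 90.59 | 3.3 | 0.37 |
| TrustScore | 98.32 | 2.14 | 0.26 | 90.48 | 5.3 | 0.40 |
| DBLE | 98.69 | 0.97 | 0.12 | 90.92 | 1.5 | 0.29 |
| Deep Ensemble-4 networks | 99.36 | 0.99 | 0.08 | 92.4 | 1.8 | 0.26 |

| Method | CIFAR100-ResNet50 Accuracy (%) | CIFAR100-ResNet50 ECE (%) | CIFAR100-ResNet50 NLL | Tiny-ImageNet-ResNet50 Accuracy (%) | Tiny-ImageNet-ResNet50 ECE (%) | Tiny-ImageNet-ResNet50 NLL |
|---|---|---|---|---|---|---|
| Vanilla Training | 71.57 | 19.1 | 1.58 | 46.71 | 25.2 | 2.95 |
| MC-Dropout | 71.57 | 9.7 | 1.48 | 46.72 | 17.4 | 3.17 |
| Temperature Scaling | 69.84 | 2.5 | 1.23 | 45.03 | 4.8 | 2.59 |
| Label Smoothing | 71.92 | 3.3 | 1.39 | 47.19 | 5.6 | 2.93 |
| Mixup | 71.85 | 2.9 | 1.44 | 46.89 | 6.8 | 2.66 |
| TrustScore | 71.57 | 10.9 | 1.43 | 46.71 | 19.2 | 2.75 |
| DBLE | 71.03 | 1.1 | 1.09 | 46.45 | 3.6 | 2.38 |
| Deep Ensemble-4 networks | 73.58 | 1.3 | 0.82 | 51.28 | 2.4 | 1.81 |

Table 1: Test Accuracy, ECE and NLL of DBLE and baselines under 4 scenarios.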

Table 1 shows the results. In every scenario tested, DBLE is comparable with vanilla training in test accuracy: applying prototypical learning to classification problems does not hurt the generalization of models on classification. On confidence calibration, DBLE performs better than all single-modal baselines in every scenario tested, on both ECE and NLL. Moreover, our method reaches results comparable to a Deep Ensemble of 4 networks with a smaller time complexity and parameter size. For MNIST-MLP, CIFAR10-VGG11 and CIFAR100-ResNet50, our method outperforms the Deep Ensemble (4 networks) on ECE. Among the other baselines, Temperature Scaling is the strongest overall: it attains the lowest NLL in every scenario and the lowest ECE in all scenarios except CIFAR10-VGG11, where Label Smoothing is lower (2.7 vs. 3.1). This is because Temperature Scaling directly optimizes NLL on the small sub-training set. MC-Dropout gives the least improvement on confidence calibration, especially on NLL. A potential reason for MC-Dropout’s limited improvement on NLL is that, when applying dropout at inference, it adds a similar level of randomness to mis-classified samples and correctly classified samples. Therefore, the predictive likelihood of the correct labels becomes smaller in general, which leads to worse NLL. Compared to Temperature Scaling, DBLE decouples classification and calibration, and therefore achieves better calibration without sacrificing classification performance. Compared to MC-Dropout, our model is trained to add randomness to the predictive distributions of mis-classified samples specifically, thus achieving significantly better calibration.

5.3 Computational Cost

DBLE adds extra training time complexity and trainable parameters compared with vanilla training. However, compared with Deep Ensembles and other Bayesian methods, DBLE’s cost is significantly lower. For training time complexity, although DBLE requires two forward passes at each update step (see Algorithm 1), its actual training time is less than twice that of vanilla training, since the number of iterations required until convergence is smaller. Empirically, we observe that DBLE’s total training time is 1.4x that of vanilla training on CIFAR10-VGG11 and 1.7x on CIFAR100-ResNet50, while a Deep Ensemble of 4 networks takes 4x the vanilla training time. DBLE also adds the trainable parameters of the confidence model compared with vanilla training. In practice these are the parameters of an MLP, which is much smaller than the classification model, while ensembling 4 DNNs in a Deep Ensemble increases the number of trainable parameters to 4x that of vanilla training.

5.4 Algorithm Analysis and Ablation Study

DBLE alleviates the NLL overfitting problem. The NLL overfitting problem of vanilla training has been observed in (Guo et al., 2017): after annealing the learning rate, as the test accuracy goes up, the model overfits to the NLL score (test NLL starts to increase instead of decreasing or staying flat). Fig. 4 shows that this phenomenon is alleviated by DBLE. Under vanilla training, the model starts overfitting to NLL after the first learning rate annealing. DBLE, on the other hand, decreases and maintains its test NLL every time the learning rate is annealed. We see the same trend for test ECE as well: with vanilla training the model overfits to ECE after learning rate annealing, while DBLE decreases and maintains its test ECE.

Figure 4: Test ECE and NLL of the last 45 epochs during training on CIFAR-100. The learning rate of both vanilla training and DBLE is annealed by 10x at epochs 60, 70 and 80.

Distance-based representation space and learning from errors are both essential for DBLE. We conduct ablation studies to verify the effectiveness of the two main design choices of our model: the distance-based space and learning calibration from training errors. Table 2 shows the results. In “Learning with errors in vanilla training”, we conduct calibration learning with errors in vanilla training. In other words, instead of updating the confidence model by maximizing Eq. 9, we update it by maximizing the softmax likelihood with the sampled representation as logits, while the classification model is updated with vanilla training. In “DBLE with calibration learning using all samples”, we update the confidence model with all training samples by maximizing Eq. 9. The results show, first, that learning to calibrate with errors also helps vanilla training. It improves calibration in vanilla training slightly since, to some extent, it introduces more randomness into the decision-making process for mis-classified samples. However, the improvement is very small without the distance-based space. Second, if we learn calibration with all training samples in DBLE, the calibration performance of DBLE degrades significantly. A potential reason is that the model underfits the mis-classified training samples when learning to estimate the distance. Auxiliary analysis of these two settings can be found in Appendix A.3.

| Method | CIFAR100 ECE (%) | CIFAR100 NLL | Tiny-ImageNet ECE (%) | Tiny-ImageNet NLL |
|---|---|---|---|---|
| Vanilla Training | 19.1 | 1.58 | 25.2 | 2.95 |
| Learning with errors in vanilla training | 18.3 | 1.43 | 20.9 | 2.61 |
| DBLE with calibration learning using all samples | 18.9 | 1.54 | 24.8 | 2.87 |
| DBLE | 1.1 | 1.09 | 3.6 | 2.38 |

Table 2: Ablation study on CIFAR-100 and Tiny-ImageNet.

6 Conclusion

Confidence calibration of DNNs, which has significant practical impacts, still remains an open problem. In this paper, we have proposed Distance-Based Learning from Errors (DBLE) to try to solve this problem. DBLE starts with applying prototypical learning to train the DNN classification model. It results in a distance-based representation space in which the model’s performance on a test sample is calibrated by the sample’s distance to its ground-truth class center. DBLE then trains a confidence model with mis-classified training samples for confidence estimation. We have empirically shown the effectiveness of DBLE on various classification datasets and network architectures.


Appendix A

A.1 Algorithm of one update of DBLE

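  Input: Training set $D_{train} = \{(x_i, y_i)\}_i$ with $y_i \in \{1, \dots, M\}$. $D_{train}^k = \{(x_i, y_i) \in D_{train} \mid y_i = k\}$.
   Build training episode $e$
   Select $N$ classes for episode $e$
   $C \leftarrow \mathrm{RandomSample}(\{1, \dots, M\}, N)$
   Sample supports and queries for every class in $C$
  for $k$ in $C$ do
     $S_e^k \leftarrow \mathrm{RandomSample}(D_{train}^k, K)$
     $Q_e^k \leftarrow \mathrm{RandomSample}(D_{train}^k \setminus S_e^k, Q)$
  end for
   Compute Loss
   Compute center representation for every class in $C$
  for $k$ in $C$ do
     $c_k \leftarrow \frac{1}{|S_e^k|} \sum_{(x_j, y_j) \in S_e^k} f_\theta(x_j)$
  end for
   Compute prototypical loss for classification
   $\mathcal{L}_{cls} \leftarrow 0$
  for $k$ in $C$ do
     for $(x_t, y_t)$ in $Q_e^k$ do
        $h_t \leftarrow f_\theta(x_t)$
        $\mathcal{L}_{cls} \leftarrow \mathcal{L}_{cls} + d(h_t, c_k) + \log \sum_{k' \in C} \exp(-d(h_t, c_{k'}))$
     end for
  end for
   Compute confidence loss
   Make predictions and track mis-classified training samples
   $M_e \leftarrow \emptyset$
  for $k$ in $C$ do
     for $(x_t, y_t)$ in $Q_e^k$ do
        $\hat{y}_t \leftarrow \arg\min_{k' \in C} d(h_t, c_{k'})$
        if $\hat{y}_t \neq y_t$ then
           $M_e \leftarrow M_e \cup \{(x_t, y_t)\}$
        end if
     end for
  end for
   Compute confidence loss with mis-classified training samples
   $\mathcal{L}_{conf} \leftarrow 0$
  for $(x_s, y_s)$ in $M_e$ do
     $\sigma_s \leftarrow g_\phi(h_s)$; $z_s \leftarrow h_s + \sigma_s \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$
     $\mathcal{L}_{conf} \leftarrow \mathcal{L}_{conf} + d(z_s, c_{y_s}) + \log \sum_{k' \in C} \exp(-d(z_s, c_{k'}))$
  end for
   Update $\theta$ with prototypical loss for classification $\mathcal{L}_{cls}$
   Update $\phi$ with confidence loss $\mathcal{L}_{conf}$
Algorithm 1 One update of DBLE. $M$ is the total number of classes in the training set, $N$ is the number of classes in every episode, $K$ is the number of supports for each class, and $Q$ is the number of queries for each class.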

A.2 Implementation and Training Details of DBLE and Baselines

For a fair comparison with baseline methods, in every scenario tested, the network architecture is identical for DBLE and all baselines. As regularization techniques such as BatchNorm and weight decay are observed to affect confidence calibration (Guo et al., 2017), we also keep them constant when comparing to baselines in every scenario. All models are trained with stochastic gradient descent with momentum (Sutskever et al., 2013). We use an initial learning rate of 0.1 and a fixed momentum coefficient of 0.9 for all methods tested. The learning rate schedule is tuned according to classification performance on the validation set. All other hyperparameters of the classification models for the baselines are also chosen based on accuracy on the validation set.

There are several hyperparameters unique to DBLE. We describe how we choose them in this paragraph. In DBLE, we stop the training when the classification model converges. The confidence model is a two-layer MLP with Dropout (Srivastava et al., 2014) added in between. We use the ReLU non-linearity (Glorot et al., 2011) for the MLP. We fix the dropout rate at 0.5. The number of classes, the number of shots and the batch size of the query set for every episode in DBLE are first tuned according to the classification performance on the validation set, to reach performance comparable with vanilla training. They are then fine-tuned according to DBLE’s calibration performance on the validation set. At inference, we fix the number of representation samples in Eq. 10 at 20, because we empirically observed that a number larger than 20 does not give performance improvements. We set the number of bins in Eq. 11 to 15, following (Guo et al., 2017).

A.3 Training the confidence model with all training samples vs. with mis-classified training samples

In addition to the final results we report in the ablation study (the last two rows of Table 2) comparing training the confidence model with all training samples vs. with mis-classified training samples, we also checked the statistics of $\sigma$, which is the standard deviation predicted by the confidence model. For CIFAR-10, the mean of $\sigma$ for the model trained with all samples is 0.203, while that for the model trained with mis-classified training samples is 0.740. This suggests that if we train the confidence model with all training samples, the $\sigma$ values the model outputs for test samples are indeed smaller in general, which is aligned with our intuition. The histograms of $\sigma$ under these two methods are shown in Figure 5.

Figure 5: The histogram of $\sigma$ for models trained with all training samples vs. with mis-classified training samples on CIFAR-10.