1 Introduction
Deep neural networks (DNNs) are being deployed in many important decision-making scenarios (Goodfellow et al., 2016). Wrong decisions can be very costly in most of them (Brundage et al., 2018): they can cost human lives in medical diagnosis and autonomous transportation, and cause significant business losses in loan categorization and sales forecasting. To prevent such outcomes, it is highly desirable for a DNN to output its confidence in its decisions. In almost all of the aforementioned scenarios, detrimental consequences could be avoided by refraining from making a decision, or by consulting human experts, whenever the confidence is insufficient. In addition, by tracking the confidence of decisions, dataset shifts can be detected and developers can gain insights for improving model performance.
Despite the success of DNNs in many scenarios (Mahajan et al., 2018; Park et al., 2019), confidence quantification (also referred to as ‘confidence calibration’) remains a particularly challenging problem for them. For a ‘well-calibrated’ model, predictions with higher confidence should be more likely to be accurate. However, as studied in (Nguyen et al., 2014; Guo et al., 2017), for DNNs with conventional (also referred to as ‘vanilla’) training, which minimizes the softmax cross-entropy loss with SGD, the outputs do not contain sufficient information for well-calibrated confidence estimation. Posterior probability estimates (the softmax outputs) can be interpreted as confidence estimates, but they calibrate the decision quality poorly (Gal & Ghahramani, 2016): the confidence value tends to be large even when the classification is inaccurate. Therefore, it would be desirable to redesign the training approach so that it sheds more light on the model’s decision-making process and provides more accurate confidence calibration.
In this paper, we propose a novel training method, Distance-Based Learning from Errors (DBLE), towards better-calibrated DNNs. DBLE learns a distance-based representation space through classification and exploits distances in this space to yield well-calibrated classification. Our motivation is that a test sample’s location in the representation space and its distance to training samples contain important information about the model’s decision-making process, which is useful for guiding confidence estimation. However, in vanilla training, neither training nor inference is based on distances in the representation space, so they are not optimized to fulfill this goal. Therefore, in DBLE, we propose to adapt prototypical learning for training and inference to learn a distance-based representation space through classification. In this space, a test sample’s distance to its ground-truth class center can calibrate the model’s performance. However, this distance cannot be calculated directly at inference since the ground-truth label is unknown. To this end, we propose to train a separate confidence model in DBLE, jointly with the classification model, to estimate this distance at inference. To train the confidence model, we utilize the training samples that are misclassified during training. We demonstrate that on multiple DNN models and datasets, DBLE achieves superior confidence calibration without the increase in computational cost that ensemble methods incur.
2 Related Work
Many studies (Nguyen et al., 2014; Guo et al., 2017) have shown that the classification performance of DNNs is poorly calibrated under vanilla training. One direction of work to improve calibration is adapting regularization methods for vanilla training. Such regularization methods were often originally proposed for other purposes. For example, Label Smoothing (Szegedy et al., 2016), which was proposed for improving generalization, has been empirically shown to improve confidence calibration (Müller et al., 2019). Mixup (Zhang et al., 2017), which was designed mostly for adversarial robustness, also improves confidence calibration (Thulasidasan et al., 2019). The benefits from such approaches are typically limited, as the modified objective functions do not represent the goal of confidence estimation. To directly minimize a confidence scoring objective, Temperature Scaling (Guo et al., 2017) modifies learning by minimizing a Negative Log-Likelihood (Friedman et al., 2001) objective on a small training subset. It yields better-calibrated confidence by directly minimizing an evaluation metric of the calibration task. However, since it requires the training set to be split between the two tasks, classification and calibration, it results in a trade-off: if more training data is used for calibration, the model’s classification performance drops because less training data is left for classification.
Bayesian DNNs constitute another direction of related work. Bayesian DNNs promise to directly model the posterior distribution for the test set given the training set (Rohekar et al., 2019). Since computing the posterior over a large number of parameters is usually intractable, different approximations are applied, such as Markov chain Monte Carlo (MCMC) methods (Neal, 2012) and variational Bayesian methods (Blundell et al., 2015; Graves, 2011). Therefore, the calibration of Bayesian DNNs heavily depends on the degree of approximation and, in variational methods, on the pre-defined prior. Moreover, in practice, Bayesian DNNs often train prohibitively slowly due to the slow convergence of the posterior predictive estimation (Lakshminarayanan et al., 2017). Recent work on Bayesian DNNs (Heek & Kalchbrenner, 2019) reports that the training time of their model is an order of magnitude higher than that of vanilla training. Among Bayesian-inspired approaches, Monte Carlo Dropout (Gal & Ghahramani, 2016) and Deep Ensembles (Lakshminarayanan et al., 2017) are two widely used methods, since they only require minor changes to the classic training method of DNNs and are relatively fast to train.
3 Learning a distance-based space for confidence calibration through classification
We first introduce how DBLE applies prototypical learning to train DNNs for classification. Then, we explain how this training method yields a distance-based representation space that benefits confidence calibration.
3.1 Prototypical Learning for Classification
DBLE bases the training of the classification model on prototypical learning (Snell et al., 2017). Prototypical learning was originally proposed to learn a distance-based representation space for few-shot learning (Ravi & Larochelle, 2016; Oreshkin et al., 2018; Xing et al., 2019). In prototypical learning, both training and prediction depend solely on the distance of the samples to their corresponding class ‘prototypes’ (referred to as ‘class centers’ in this paper) in the representation space. It trains the model to minimize the intra-class distances while maximizing the inter-class distances, such that related samples are clustered together. We adapt this training method for classification training in DBLE. The following subsections describe the two main components of prototypical training: episodic training and the prototypical loss for classification. We then explain inference at test time in DBLE and why it leads to a representation space that benefits confidence calibration.
3.1.1 Episodic training
Vanilla DNN training for classification is based on variants of mini-batch gradient descent (Bottou, 2010). Episodic training, on the other hand, samples $K$-shot, $N$-way episodes at every update (as opposed to sampling a batch of training data). An episode $e$ is created by first sampling $N$ random classes out of the $M$ total classes (we do not require $N$ to be equal to $M$, because fitting the support samples of all classes in a batch into processor memory can be challenging when $M$ is very large), and then sampling two sets of training samples for these classes: (i) the support set $S_e$, containing $K$ examples for each of the $N$ classes, and (ii) the query set $Q_e$, containing different examples from the same $N$ classes.
$$\mathcal{L}(\theta) = \mathbb{E}_{(S_e, Q_e)} \Big[ -\sum_{(x_i, y_i) \in Q_e} \log p(y_i \mid x_i, S_e; \theta) \Big] \tag{1}$$
At episode $e$, the model is updated to map each query $x_i$ to its correct label $y_i$, with the help of the support set $S_e$. The model is a function parameterized by $\theta$, and the loss is the negative log-likelihood of the ground-truth class label of each query sample given the support set, according to Eq. 1.
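As an illustration of the episode construction described above, the following is a minimal NumPy sketch, not the authors' implementation. The function name `sample_episode`, its arguments, and the choice of a fixed number of query examples per class (`n_query`) are assumptions made for this example.

```python
import numpy as np

def sample_episode(labels, n_way, k_shot, n_query, rng):
    """Return (classes, support_idx, query_idx) for one n_way, k_shot episode.

    n_query is the number of query examples per class (an assumption of this sketch).
    """
    # Sample the episode's classes out of all classes present in the data.
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, query = [], []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        support.append(idx[:k_shot])                   # k_shot support examples
        query.append(idx[k_shot:k_shot + n_query])     # different examples, same class
    return classes, np.concatenate(support), np.concatenate(query)

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 20)  # toy labels: 10 classes, 20 samples each
classes, support_idx, query_idx = sample_episode(
    labels, n_way=5, k_shot=3, n_query=4, rng=rng
)
```

Support and query indices are drawn from one permutation per class, so the two sets never share an example.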
3.1.2 Prototypical loss
How does the support set help the classification of query samples at episode $e$? We first employ the DNN classification model, $h_\theta$, to encode inputs in a representation space, where $\theta$ are the trainable parameters. Prototypical training then uses the support set $S_e$ to compute the center of each class (in the sampled episode). The query samples are classified based on their distance to each class center in the representation space. For every episode $e$, each center $c_k^e$ is computed by averaging the representations of the support samples from class $k$:
$$c_k^e = \frac{1}{|S_e^k|} \sum_{(x_j, y_j) \in S_e^k} h_\theta(x_j) \tag{2}$$
where $S_e^k \subset S_e$ is the subset of support samples belonging to class $k$. The prototypical loss calculates the predictive label distribution of query $x_i$ based on its distances to the centers:
$$p(y_i \mid x_i, S_e; \theta) = \frac{\exp\big(-d(z_i, c_{y_i}^e)\big)}{\sum_{k'} \exp\big(-d(z_i, c_{k'}^e)\big)} \tag{3}$$
where $d(\cdot, \cdot)$ is the distance in the representation space and $z_i$ is the representation of $x_i$ in the space:
$$z_i = h_\theta(x_i) \tag{4}$$
The model is trained by minimizing Eq. 1 with $p(y_i \mid x_i, S_e; \theta)$ given by Eq. 3. Through this training, in the representation space in which we calculate $z_i$ and the class centers, the inter-class distances are maximized and the intra-class distances are minimized. Therefore, training samples belonging to the same class are clustered together and clusters representing different classes are pushed apart.
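The class-center computation and the prototypical loss above can be sketched in NumPy as follows. This is a forward-pass illustration on toy representations rather than the authors' code: the function names are hypothetical, the representations are assumed to be already computed by the classification model, and the plain (non-squared) L2 distance is an assumption of the sketch.

```python
import numpy as np

def class_centers(z_support, y_support, classes):
    """Average the support representations of each class (one row per class)."""
    return np.stack([z_support[y_support == c].mean(axis=0) for c in classes])

def log_probs(z_query, centers):
    """Log-softmax over negative L2 distances to the centers, shape (Q, N)."""
    dist = np.linalg.norm(z_query[:, None, :] - centers[None, :, :], axis=-1)
    logits = -dist
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

def prototypical_loss(z_query, y_query, centers, classes):
    """Mean negative log-likelihood of the ground-truth class of each query."""
    cols = np.array([list(classes).index(y) for y in y_query])
    return -log_probs(z_query, centers)[np.arange(len(y_query)), cols].mean()

# Toy episode: 2 classes, 2 support samples each, 2-dimensional representations.
classes = np.array([0, 1])
z_support = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
y_support = np.array([0, 0, 1, 1])
centers = class_centers(z_support, y_support, classes)  # [[0, 1], [4, 1]]
z_query = np.array([[0.0, 1.0], [4.0, 1.0]])
y_query = np.array([0, 1])
loss = prototypical_loss(z_query, y_query, centers, classes)
```

In this toy episode each query sits exactly on its own class center, so the loss is small; moving a query towards the other center increases it.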
3.1.3 Inference and the test sample’s distance to its ground-truth class center
At inference, we calculate the center of every class $k$ using the training set, by averaging the representations of all corresponding training samples:
$$c_k^{\text{test}} = \frac{1}{|T_k|} \sum_{(x_j, y_j) \in T_k} h_\theta(x_j) \tag{5}$$
where $T_k$ is the set of all training samples belonging to class $k$. Then, given a test sample $x_t$ in the test set $D^{\text{test}} = \{(x_t, y_t)\}$ (where $y_t$ is the ground-truth label), the distances of its representation $z_t$ (Eq. 4) to all class centers are calculated. The prediction of the label of $x_t$ is based on these distances:
$$y'_t = \arg\min_{k \in \mathcal{K}} d\big(z_t, c_k^{\text{test}}\big) \tag{6}$$
where $\mathcal{K}$ consists of all classes in the training set. In other words, $x_t$ is assigned to the class with the closest center in the representation space. Consequently, if the representation $z_t$ of a test sample $x_t$ is too far away from its ground-truth class center $c_{y_t}^{\text{test}}$ at inference, the sample is likely to be misclassified. Therefore, in this distance-based representation space, a test sample’s distance to its ground-truth class center is a strong signal indicating how well the model will perform on this sample. In the next section, we show this empirically for DBLE and compare with other methods.
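The nearest-center prediction rule described above can be illustrated with a small NumPy sketch. The centers and test representations below are made-up toy values, and `predict_nearest_center` is a hypothetical helper name, not part of the original method's code.

```python
import numpy as np

def predict_nearest_center(z_test, centers):
    """Assign each test representation to the class with the closest center.

    Returns the predicted class indices and the full (T, N) distance matrix.
    """
    dist = np.linalg.norm(z_test[:, None, :] - centers[None, :, :], axis=-1)
    return dist.argmin(axis=1), dist

# Toy class centers (assumed to be averaged over the training set) and test points.
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
z_test = np.array([[0.2, 0.1], [2.5, 0.4], [1.4, 0.0]])
y_true = np.array([0, 1, 1])

pred, dist = predict_nearest_center(z_test, centers)
# Distance of every test sample to its ground-truth class center.
d_gt = dist[np.arange(len(y_true)), y_true]
```

In this toy example the third sample has the largest distance to its ground-truth center and is also the only one assigned to the wrong class, which is the behaviour the section relies on.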
3.2 Distance to the ground-truth class center calibrates classification performance in DBLE
We design an experiment to show that, in the space learned by the prototypical learning of DBLE, the model’s performance on a test sample can be estimated from the test sample’s L2 distance to its ground-truth class center. In contrast, other intuitive measures, such as the L2 distance to the predicted class center in prototypical learning, the L2 distance to the predicted class center in vanilla training, or the L2 distance to the ground-truth class center in vanilla training, do not calibrate performance well. Here, we describe these empirical observations on DBLE’s classification model with prototypical learning, to motivate the confidence modeling of DBLE before we explain the method.
In the learned representation space (the output space of a model trained with either vanilla training or prototypical learning), for every sample $x_t$ in the test set $D^{\text{test}}$, we calculate its L2 distance to its ground-truth class center $c_{y_t}^{\text{test}}$ as:
$$d_t = d\big(h_\theta(x_t), c_{y_t}^{\text{test}}\big) = \big\| h_\theta(x_t) - c_{y_t}^{\text{test}} \big\|_2 \tag{7}$$
where $c_{y_t}^{\text{test}}$ is given by Eq. 5, $h_\theta$ is the trained classification model and $d$ is the L2 distance. For comparison, we can also calculate $x_t$’s distance to its predicted class center $c_{y'_t}^{\text{test}}$, where $y'_t$ is the predicted label, in both DBLE and vanilla training. We denote $x_t$’s distance to its predicted class center as $d'_t$. We then sort $d_t$ or $d'_t$ in ascending order and partition them into equally-spaced bins. For every bin, we calculate the model’s classification accuracy on the test samples lying in the bin. Then we plot the test accuracy curve as the distance increases. The curve shows the relationship between the distance and the model’s performance on $x_t$, illustrating how well the calculated distance can act as a confidence score that calibrates the model’s prediction.
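The binning procedure described above (partition the distances into equally-spaced bins, then compute accuracy per bin) could be sketched as follows. The helper name `accuracy_by_distance`, the number of bins, and the toy inputs are assumptions of this example; it is meant only to make the analysis step concrete.

```python
import numpy as np

def accuracy_by_distance(dist, correct, n_bins):
    """Accuracy of the samples falling into each equally-spaced distance bin.

    dist:    (T,) distance of each test sample to a class center
    correct: (T,) 1 if the sample was classified correctly, else 0
    Empty bins are reported as NaN.
    """
    edges = np.linspace(dist.min(), dist.max(), n_bins + 1)
    # Interior edges map every sample to a bin id in 0..n_bins-1.
    bin_id = np.digitize(dist, edges[1:-1])
    acc = np.full(n_bins, np.nan)
    for b in range(n_bins):
        mask = bin_id == b
        if mask.any():
            acc[b] = correct[mask].mean()
    return edges, acc

# Toy data: samples close to the center are correct, far ones are not.
dist = np.array([0.1, 0.2, 0.3, 1.0, 1.5, 1.9, 2.0, 2.1])
correct = np.array([1, 1, 1, 1, 0, 0, 0, 0])
edges, acc = accuracy_by_distance(dist, correct, n_bins=2)
```

Plotting `acc` against the bin centers gives the kind of accuracy-versus-distance curve discussed in this section.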
As shown in Fig. 2, $d_t$ of DBLE, the distance of a test sample to its ground-truth class center, calibrates the model’s performance the best. The curve of $d_t$ under DBLE’s prototypical learning, despite a slight oscillation near the end, decreases monotonically to almost zero. This indicates that the farther away a sample is from its ground-truth class center, the more poorly the model performs on it. These results suggest that $d_t$ can be used as a gold confidence measure, calibrating the quality of the model’s decision on $x_t$ for a classification model under prototypical learning. This benefit of DBLE’s prototypical learning is due to the distance-based training and inference: if a test sample is farther away from its ground-truth class center, it is more likely to be misclassified as another class.
Although we have observed this gold calibration in DBLE, enabled by prototypical learning, it requires access to the ground-truth labels of test samples, $y_t$. At inference, however, we do not have access to the ground-truth label of $x_t$, so $d_t$ cannot be calculated directly. Therefore, we propose to use a separate confidence model to estimate $d_t$, trained jointly with the classification training of DBLE. It regresses $d_t$ by learning from the misclassified training samples in DBLE’s classification training, as described next.
4 Confidence Modeling by Learning from Errors
In the classification model learned by DBLE, a test sample $x_t$’s L2 distance to its ground-truth class center, $d_t$ (Eq. 7), is highly calibrated with the model’s performance on it, as described in Sec. 3. However, $d_t$ cannot be directly computed without ground-truth labels, which is the case at inference. Therefore, we introduce a confidence model $g_\phi$, parameterized by $\phi$, that learns to estimate $d_t$ jointly with the training of classification in DBLE.
To train $g_\phi$, a straightforward option would be to use the mapping from representation to distance for all training samples. In this way, $g_\phi$ could be trained to give an estimate of $d_t$ for the test sample $x_t$ at inference. However, correctly classified samples constitute the vast majority during training, especially considering that state-of-the-art DNNs yield much lower training errors than test errors (Neyshabur et al., 2014). Therefore, if all data is used, the training of $g_\phi$ would be dominated by the small distances of the correctly classified samples, which would make it harder for $g_\phi$ to capture the larger distances of the minority of misclassified samples. Moreover, given that we choose $g_\phi$ to be a simple MLP with limited capacity for the purpose of fast training, it becomes more challenging to capture the larger distances of the incorrectly classified samples if all training samples are used (i.e. the confidence model would underfit the misclassified training data). Therefore, we propose to track all training samples that are misclassified in episode $e$ during the training of the classification model in DBLE. We save them in $M_e$ for each episode to train $g_\phi$. With our ablation studies in Sec. 5.4, we demonstrate the importance of learning from misclassified training samples vs. learning from all samples.
In the following, we introduce the training procedure of $g_\phi$ with misclassified samples, for the estimation of $d_t$ at inference for a test sample $x_t$. We train $g_\phi$ by sampling from the isotropic Gaussian distribution $\mathcal{N}(z_s, \sigma_s^2 I)$, where the standard deviation $\sigma_s$ is $g_\phi$’s output given a misclassified training sample $x_s$. $g_\phi$ is trained to output a larger $\sigma_s$ for an $x_s$ with a larger $d_s$. $g_\phi$ takes the representation $z_s$ (calculated with Eq. 4) of a misclassified training sample $x_s$ in $M_e$ as input, and outputs $\sigma_s$ as
$$\sigma_s = g_\phi(z_s) \tag{8}$$
To train $g_\phi$, we first sample another representation $\tilde{z}_s$ for $x_s$ from the isotropic Gaussian distribution parameterized by $z_s$ and $\sigma_s$, $\tilde{z}_s \sim \mathcal{N}(z_s, \sigma_s^2 I)$. Then, we optimize $\phi$ based on the prototypical loss with $\tilde{z}_s$, using $\tilde{z}_s$’s predictive label distribution:
$$\phi^{*} = \arg\max_{\phi} \, \mathbb{E}_{e} \Big[ \sum_{(x_s, y_s) \in M_e} \log p(y_s \mid x_s; \phi) \Big], \qquad p(y_s \mid x_s; \phi) = \frac{\exp\big(-d(\tilde{z}_s, c_{y_s}^e)\big)}{\sum_{k'} \exp\big(-d(\tilde{z}_s, c_{k'}^e)\big)} \tag{9}$$
Updating $\phi$ with the prototypical loss given $\tilde{z}_s$ encourages a larger $\sigma_s$ when the $d_s$ of $x_s$ is larger. This is because, when $z_s$ is fixed for $x_s$ in the space, maximizing Eq. 9 forces $\tilde{z}_s$ to be as close to its ground-truth class center as possible. Given that $\tilde{z}_s$ is sampled from $\mathcal{N}(z_s, \sigma_s^2 I)$, if $z_s$ is farther away from its ground-truth center (which is the case for misclassified training samples), then a larger $\sigma_s$ is required for $\tilde{z}_s$ to get close to it. Fig. 3 visualizes how $\tilde{z}_s$ changes before vs. after updating $\phi$. The training of DBLE is described in Fig. 1 and in Algorithm 1 in Appendix A.1. Note that, in order to make the sampling of $\tilde{z}_s$ differentiable, we use the reparameterization trick (Kingma & Welling, 2013).
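The sampling step used to train the confidence model can be illustrated with a short NumPy sketch of the reparameterization trick: a draw from the Gaussian is written as the representation plus the standard deviation times standard normal noise, so that the draw is a deterministic function of the standard deviation. Everything below (function names, toy centers, the scalar standard deviation, the plain L2 distance) is an assumption made for illustration; gradients and the MLP itself are omitted.

```python
import numpy as np

def sample_representations(z, sigma, n_samples, rng):
    """Reparameterized draws z + sigma * eps with eps ~ N(0, I), shape (n_samples, D)."""
    eps = rng.standard_normal((n_samples, z.shape[0]))
    return z + sigma * eps

def class_probs(z_tilde, centers):
    """Softmax over negative L2 distances to the class centers, shape (n_samples, N)."""
    dist = np.linalg.norm(z_tilde[:, None, :] - centers[None, :, :], axis=-1)
    logits = -dist
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0]])
z, y = np.array([3.5, 0.0]), 0   # toy misclassified sample: label 0, but closer to class 1

# Average probability of the ground-truth class under a zero and a larger sigma.
p_small = class_probs(sample_representations(z, 0.0, 5000, rng), centers)[:, y].mean()
p_large = class_probs(sample_representations(z, 2.0, 5000, rng), centers)[:, y].mean()
```

In this toy setup, the sample far from its ground-truth center receives a higher average ground-truth probability when the standard deviation is larger, because some draws land near the correct center. This is only a numerical illustration on made-up values, not a reproduction of the paper's training.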
At inference, for every test sample $x_t$, we make the prediction with $h_\theta$ (see Eq. 6), while for confidence estimation of the prediction we take advantage of $g_\phi$. After getting $z_t$ and $\sigma_t = g_\phi(z_t)$, we sample multiple $\tilde{z}_t^u$ from $\mathcal{N}(z_t, \sigma_t^2 I)$ and average their predictive label distributions as the confidence estimate:
$$\hat{p}(y'_t \mid x_t) = \frac{1}{U} \sum_{u=1}^{U} p(y'_t \mid \tilde{z}_t^u), \qquad p(y'_t \mid \tilde{z}_t^u) = \frac{\exp\big(-d(\tilde{z}_t^u, c_{y'_t}^{\text{test}})\big)}{\sum_{k'} \exp\big(-d(\tilde{z}_t^u, c_{k'}^{\text{test}})\big)} \tag{10}$$
$U$ is the total number of representation samples $\tilde{z}_t^u$. $\hat{p}(y'_t \mid x_t)$ is used as the confidence score calibrating the prediction $y'_t$. In this way, for a test sample farther away from its ground-truth class center (which means it is more likely to be misclassified), the model adds more randomness to the representation sampling, since its estimated variance from $g_\phi$ is large. More randomness in the softmax input space leads to a lower expected softmax output (Gal & Ghahramani, 2016), which is the confidence score $\hat{p}(y'_t \mid x_t)$.
5 Experiments
In this section, we first compare our method, DBLE, with recent baselines on confidence calibration of DNNs. We conduct experiments on a variety of datasets and network architectures and evaluate DBLE’s performance on the two most commonly used metrics for confidence calibration: Expected Calibration Error (ECE) (Naeini et al., 2015) and Negative Log-Likelihood (NLL) (Friedman et al., 2001). Results show that DBLE outperforms single-model baselines on confidence calibration in every scenario tested. We then conduct an ablation study to verify the effectiveness of the two main components of DBLE. Implementation and training details of DBLE and the baselines are described in Appendix A.2.
5.1 Experimental Setup
Baselines. We compare our method with baselines that use a single DNN: vanilla training, MC-Dropout (Gal & Ghahramani, 2016), Temperature Scaling (Guo et al., 2017), Mixup (Thulasidasan et al., 2019), Label Smoothing (Szegedy et al., 2016) and TrustScore (Jiang et al., 2018). We also compare DBLE with Deep Ensemble (Lakshminarayanan et al., 2017) with 4 DNNs.
Datasets and Network Architectures. We conduct experiments on various combinations of datasets and architectures: MLP on MNIST (LeCun et al., 1998), VGG-11 (Simonyan & Zisserman, 2014) on CIFAR-10 (Krizhevsky et al., 2009), ResNet-50 (He et al., 2016) on CIFAR-100 (Krizhevsky et al., 2009) and ResNet-50 on Tiny-ImageNet (Deng et al., 2009).
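Evaluation Metrics. We evaluate DBLE on model calibration with Expected Calibration Error (ECE) and Negative Log-Likelihood (NLL). ECE approximates the expectation of the difference between accuracy and confidence. It partitions the confidence estimates (the likelihood of the predicted label $y'_t$) of all test samples into $B$ equally-spaced bins and calculates the average confidence and accuracy of the test samples lying in each bin $S_b$:
$$\mathrm{ECE} = \sum_{b=1}^{B} \frac{|S_b|}{N_{\text{test}}} \left| \frac{1}{|S_b|} \sum_{x_t \in S_b} \mathbb{1}(y'_t = y_t) - \frac{1}{|S_b|} \sum_{x_t \in S_b} \hat{p}(y'_t \mid x_t) \right| \tag{11}$$
where $y'_t$ is the predicted label of $x_t$, $\hat{p}(y'_t \mid x_t)$ is its confidence estimate and $N_{\text{test}}$ is the number of test samples. NLL averages the negative log-likelihood of all test samples:
$$\mathrm{NLL} = -\frac{1}{N_{\text{test}}} \sum_{(x_t, y_t) \in D^{\text{test}}} \log p(y_t \mid x_t) \tag{12}$$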
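For concreteness, the two metrics above can be computed as in the following NumPy sketch. It follows the standard definitions of ECE with equally-spaced confidence bins and of NLL; the function names and the toy inputs are assumptions of this example, not the evaluation code used for the reported results.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=15):
    """Weighted average over bins of |accuracy - mean confidence|.

    conf:    (T,) confidence of the predicted label of each test sample
    correct: (T,) 1 if the prediction was correct, else 0
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # mask.mean() is the fraction of test samples falling in this bin.
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

def negative_log_likelihood(probs, y):
    """Mean negative log-probability assigned to the ground-truth labels."""
    return -np.log(probs[np.arange(len(y)), y]).mean()

# Toy example: two confident correct predictions, two less confident ones (one wrong).
conf = np.array([0.95, 0.95, 0.65, 0.65])
correct = np.array([1, 1, 1, 0])
ece = expected_calibration_error(conf, correct, n_bins=10)   # 0.5*0.05 + 0.5*0.15

probs = np.array([[0.5, 0.5], [0.25, 0.75]])
nll = negative_log_likelihood(probs, np.array([0, 1]))
```

Note that this sketch returns ECE as a fraction in [0, 1]; multiplying by 100 gives a percentage.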
5.2 Results
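| Method | MNIST-MLP Accuracy | MNIST-MLP ECE | MNIST-MLP NLL | CIFAR-10-VGG-11 Accuracy | CIFAR-10-VGG-11 ECE | CIFAR-10-VGG-11 NLL |
| --- | --- | --- | --- | --- | --- | --- |
| Vanilla Training | 98.32 | 1.73 | 0.29 | 90.48 | 6.3 | 0.43 |
| MC-Dropout | 98.32 | 1.71 | 0.34 | 90.48 | 3.9 | 0.47 |
| Temperature Scaling | 95.14 | 1.32 | 0.17 | 89.83 | 3.1 | 0.33 |
| Label Smoothing | 98.77 | 1.68 | 0.30 | 90.71 | 2.7 | 0.38 |
| Mixup | 98.83 | 1.74 | 0.24 | 90.59 | 3.3 | 0.37 |
| TrustScore | 98.32 | 2.14 | 0.26 | 90.48 | 5.3 | 0.40 |
| DBLE | 98.69 | 0.97 | 0.12 | 90.92 | 1.5 | 0.29 |
| Deep Ensemble (4 networks) | 99.36 | 0.99 | 0.08 | 92.4 | 1.8 | 0.26 |

| Method | CIFAR-100-ResNet-50 Accuracy | CIFAR-100-ResNet-50 ECE | CIFAR-100-ResNet-50 NLL | Tiny-ImageNet-ResNet-50 Accuracy | Tiny-ImageNet-ResNet-50 ECE | Tiny-ImageNet-ResNet-50 NLL |
| --- | --- | --- | --- | --- | --- | --- |
| Vanilla Training | 71.57 | 19.1 | 1.58 | 46.71 | 25.2 | 2.95 |
| MC-Dropout | 71.57 | 9.7 | 1.48 | 46.72 | 17.4 | 3.17 |
| Temperature Scaling | 69.84 | 2.5 | 1.23 | 45.03 | 4.8 | 2.59 |
| Label Smoothing | 71.92 | 3.3 | 1.39 | 47.19 | 5.6 | 2.93 |
| Mixup | 71.85 | 2.9 | 1.44 | 46.89 | 6.8 | 2.66 |
| TrustScore | 71.57 | 10.9 | 1.43 | 46.71 | 19.2 | 2.75 |
| DBLE | 71.03 | 1.1 | 1.09 | 46.45 | 3.6 | 2.38 |
| Deep Ensemble (4 networks) | 73.58 | 1.3 | 0.82 | 51.28 | 2.4 | 1.81 |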
Table 1 shows the results. In every scenario tested, DBLE is comparable with vanilla training in test accuracy: applying prototypical learning to classification problems does not hurt the generalization of the models. On confidence calibration, DBLE performs better than all single-model baselines in every scenario tested, on both ECE and NLL. Moreover, our method reaches results comparable with a Deep Ensemble of 4 networks, with a smaller time complexity and parameter size. For MNIST-MLP, CIFAR-10-VGG-11 and CIFAR-100-ResNet-50, our method outperforms Deep Ensemble (4 networks) on ECE. Among the other baselines, Temperature Scaling achieves the best NLL in every scenario and the best ECE in every scenario except CIFAR-10-VGG-11, where Label Smoothing has a lower ECE; this is because Temperature Scaling directly optimizes NLL on the small training subset. MC-Dropout gives little improvement on confidence calibration, especially on NLL, where it is worse than vanilla training in three of the four scenarios. A potential reason for MC-Dropout’s limited improvement on NLL is that applying dropout at inference adds a similar level of randomness to misclassified samples and correctly classified samples. Therefore, the predictive likelihood of the correct labels becomes smaller in general, which leads to worse NLL. Compared to Temperature Scaling, DBLE decouples classification and calibration, and therefore achieves better calibration without sacrificing classification performance. Compared to MC-Dropout, our model is trained to add randomness to the predictive distributions of misclassified samples specifically, thus achieving significantly better calibration.
5.3 Computational Cost
DBLE adds extra training time complexity and trainable parameters compared with vanilla training. However, compared with Deep Ensembles and other Bayesian methods, DBLE’s cost is significantly lower. In terms of training time, although DBLE requires two forward passes at each update step (see Algorithm 1), its actual training time is less than twice that of vanilla training, since the number of iterations required until convergence is smaller. Empirically, we observe that DBLE’s total training time is 1.4x that of vanilla training on CIFAR-10-VGG-11 and 1.7x on CIFAR-100-ResNet-50, while a Deep Ensemble of 4 networks takes 4x the vanilla training time. In terms of parameters, DBLE adds the trainable parameters of the confidence model, which in practice is an MLP. Its size is usually a small fraction of that of the classification model, while ensembling 4 DNNs in Deep Ensemble increases the number of trainable parameters to 4x that of vanilla training.
5.4 Algorithm Analysis and Ablation Study
DBLE alleviates the NLL overfitting problem. The NLL overfitting problem of vanilla training has been observed in (Guo et al., 2017): after annealing the learning rate, as the test accuracy goes up, the model overfits to the NLL score (test NLL starts to increase instead of decreasing or staying flat). Fig. 4 shows that this phenomenon is alleviated by DBLE. Under vanilla training, the model starts overfitting to NLL after the first learning rate annealing. DBLE, on the other hand, decreases and then maintains its test NLL after every learning rate annealing. We see the same trend for test ECE as well: with vanilla training, the model overfits to ECE after learning rate annealing, while DBLE decreases and maintains its test ECE.
Figure 4: Test ECE and NLL of the last 45 epochs during training on CIFAR-100. The learning rate of both vanilla training and DBLE is annealed by 10x at epochs 60, 70 and 80.
Distance-based representation space and learning from errors are both essential for DBLE. We conduct ablation studies to verify the effectiveness of the two main design choices of our model: the distance-based space and learning calibration from training errors. Table 2 shows the results. In “Learning with errors in vanilla training”, we conduct calibration learning with errors in vanilla training. In other words, instead of updating $\phi$ by maximizing Eq. 9, we update $\phi$ by maximizing the softmax likelihood with $\tilde{z}_s$ as logits, and $\theta$ is updated with vanilla training. In “DBLE with calibration learning using all samples”, we update $\phi$ with all training samples by maximizing Eq. 9. We can see from the results that, first, learning to calibrate with errors also helps vanilla training. It improves calibration in vanilla training slightly since it introduces, to some extent, more randomness in the decision-making process for misclassified samples. However, the improvement is very small without the distance-based space. Second, we notice that if we learn calibration with all training samples in DBLE, DBLE’s calibration performance degrades significantly. The potential reason is that the model underfits the misclassified training samples when learning to estimate the distance. Auxiliary analysis on these two can be found in Appendix A.3.

| Method | CIFAR-100 ECE | CIFAR-100 NLL | Tiny-ImageNet ECE | Tiny-ImageNet NLL |
| --- | --- | --- | --- | --- |
| Vanilla Training | 19.1 | 1.58 | 25.2 | 2.95 |
| Learning with errors in vanilla training | 18.3 | 1.43 | 20.9 | 2.61 |
| DBLE with calibration learning using all samples | 18.9 | 1.54 | 24.8 | 2.87 |
| DBLE | 1.1 | 1.09 | 3.6 | 2.38 |
6 Conclusion
Confidence calibration of DNNs, which has significant practical impact, remains an open problem. In this paper, we have proposed Distance-Based Learning from Errors (DBLE) to address it. DBLE starts by applying prototypical learning to train the DNN classification model. This results in a distance-based representation space in which the model’s performance on a test sample is calibrated by the sample’s distance to its ground-truth class center. DBLE then trains a confidence model with misclassified training samples for confidence estimation. We have empirically shown the effectiveness of DBLE on various classification datasets and network architectures.
References
 Blundell et al. (2015) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.

 Bottou (2010) Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010. Springer, 2010.
 Brundage et al. (2018) Miles Brundage, Shahar Avin, Jack Clark, Helen Toner, Peter Eckersley, Ben Garfinkel, Allan Dafoe, Paul Scharre, Thomas Zeitzoff, Bobby Filar, Hyrum S. Anderson, Heather Roff, Gregory C. Allen, Jacob Steinhardt, Carrick Flynn, Seán Ó hÉigeartaigh, Simon Beard, Haydn Belfield, Sebastian Farquhar, Clare Lyle, Rebecca Crootof, Owain Evans, Michael Page, Joanna Bryson, Roman Yampolskiy, and Dario Amodei. The malicious use of artificial intelligence: Forecasting, prevention, and mitigation. arXiv:1802.07228, 2018.
 Deng et al. (2009) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
 Friedman et al. (2001) Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001.

 Gal & Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.
 Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, 2011.
 Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning, volume 1. MIT Press, 2016.
 Graves (2011) Alex Graves. Practical variational inference for neural networks. In NIPS, 2011.
 Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In ICML, 2017.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
 Heek & Kalchbrenner (2019) Jonathan Heek and Nal Kalchbrenner. Bayesian inference for large scale image classification. arXiv preprint arXiv:1908.03491, 2019.
 Jiang et al. (2018) Heinrich Jiang, Been Kim, Melody Guan, and Maya Gupta. To trust or not to trust a classifier. In NIPS, 2018.
 Kingma & Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, 2009.
 Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In NIPS, 2017.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Mahajan et al. (2018) Dhruv Mahajan, Ross B. Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. arXiv:1805.00932, 2018.
 Müller et al. (2019) Rafael Müller, Simon Kornblith, and Geoffrey Hinton. When does label smoothing help? arXiv preprint arXiv:1906.02629, 2019.
 Naeini et al. (2015) Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In AAAI, 2015.
 Neal (2012) Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
 Neyshabur et al. (2014) Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning. arXiv:1412.6614, 2014.
 Nguyen et al. (2014) Anh Mai Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. arXiv:1412.1897, 2014.
 Oreshkin et al. (2018) Boris Oreshkin, Pau Rodríguez López, and Alexandre Lacoste. Tadam: Task dependent adaptive metric for improved few-shot learning. In NIPS, 2018.
 Park et al. (2019) Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. arXiv:1904.08779, Apr 2019.
 Ravi & Larochelle (2016) Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. 2016.
 Rohekar et al. (2019) Raanan Y Rohekar, Yaniv Gurwicz, Shami Nisimov, and Gal Novik. Modeling uncertainty by learning a hierarchy of deep neural connections. arXiv preprint arXiv:1905.13195, 2019.
 Simonyan & Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NIPS, 2017.
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 2014.
 Sutskever et al. (2013) Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.

 Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
 Thulasidasan et al. (2019) Sunil Thulasidasan, Gopinath Chennupati, Jeff Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. arXiv preprint arXiv:1905.11001, 2019.
 Xing et al. (2019) Chen Xing, Negar Rostamzadeh, Boris N Oreshkin, and Pedro O Pinheiro. Adaptive cross-modal few-shot learning. arXiv preprint arXiv:1902.07104, 2019.
 Zhang et al. (2017) Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
Appendix A
A.1 Algorithm of one update of DBLE
A.2 Implementation and Training Details of DBLE and Baselines
For a fair comparison with baseline methods, in every scenario tested, the network architecture is identical for DBLE and all baselines. As regularization techniques such as BatchNorm and weight decay have been observed to affect confidence calibration (Guo et al., 2017), we also keep them constant when comparing to baselines in every scenario. All models are trained with stochastic gradient descent with momentum (Sutskever et al., 2013). We use an initial learning rate of 0.1 and a fixed momentum coefficient of 0.9 for all methods tested. The learning rate schedule is tuned according to classification performance on the validation set. All other hyperparameters of the classification models for the baselines are also chosen based on accuracy on the validation set.
There are several hyperparameters unique to DBLE. We describe how we choose them in this paragraph. In DBLE, we stop the training when the classification model converges. The confidence model is a two-layer MLP with Dropout (Srivastava et al., 2014) added in between. We use the ReLU non-linearity (Glorot et al., 2011) for the MLP. We fix the dropout rate at 0.5. The $N$ (number of classes), $K$ (number of shots) and the batch size of the query set for every episode in DBLE are first tuned according to the classification performance on the validation set, to reach performance comparable with vanilla training. Then $N$, $K$ and the batch size of the query set for every episode are fine-tuned according to DBLE’s calibration performance on the validation set. At inference, we fix the number of representation samples $U$ in Eq. 10 at 20, because we empirically observed that a $U$ larger than 20 does not give performance improvements. We set the number of bins in Eq. 11 to 15, following (Guo et al., 2017).
A.3 Training the confidence model with all training samples vs. with misclassified training samples
In addition to the final results we report in the ablation study (the last two rows of Table 2), which compare training the confidence model with all training samples vs. with misclassified training samples, we also checked the statistics of $\sigma$, the standard deviation predicted by the confidence model. For CIFAR-10, the mean of $\sigma$ for the model trained with all samples is 0.203, while that of the model trained with misclassified training samples is 0.740. This suggests that if we train the confidence model with all training samples, the $\sigma$ it outputs for test samples is indeed smaller in general, which is aligned with our intuition. The histograms of $\sigma$ under these two methods are shown in Figure 5.