Multi-Class Uncertainty Calibration via Mutual Information Maximization-based Binning

06/23/2020 ∙ by Kanil Patel, et al. ∙ Bosch

Post-hoc calibration is a common approach for providing high-quality confidence estimates of deep neural network predictions. Recent work has shown that widely used scaling methods underestimate their calibration error, while alternative Histogram Binning (HB) methods with verifiable calibration performance often fail to preserve classification accuracy. In the case of multi-class calibration with a large number of classes K, HB also faces the issue of severe sample-inefficiency due to a large class imbalance resulting from the conversion into K one-vs-rest class-wise calibration problems. The goal of this paper is to resolve the identified issues of HB in order to provide verified and calibrated confidence estimates using only a small holdout calibration dataset for bin optimization while preserving multi-class ranking accuracy. From an information-theoretic perspective, we derive the I-Max concept for binning, which maximizes the mutual information between labels and binned (quantized) logits. This concept mitigates potential loss in ranking performance due to lossy quantization, and by disentangling the optimization of bin edges and representatives allows simultaneous improvement of ranking and calibration performance. In addition, we propose a shared class-wise (sCW) binning strategy that fits a single calibrator on the merged training sets of all K class-wise problems, yielding reliable estimates from a small calibration set. The combination of sCW and I-Max binning outperforms the state of the art calibration methods on various evaluation metrics across different benchmark datasets and models, even when using only a small set of calibration data, e.g. 1k samples for ImageNet.




1 Introduction

Despite great ability in learning discriminative features, deep neural network (DNN) classifiers often make over-confident predictions. This can lead to potentially catastrophic consequences in safety-critical applications, e.g., medical diagnosis and perception tasks in autonomous driving. A perfectly calibrated classifier should be correct $p\%$ of the time on cases to which it assigns $p\%$ confidence, and any mismatch can be measured through the Expected Calibration Error (ECE).

Since the pioneering work of pmlr-v70-guo17a, scaling methods have been widely acknowledged as an efficient post-hoc calibration solution for modern DNNs. Recently, Kumar2019; wenger2020calibration showed that the ECE metric pmlr-v70-guo17a is underestimated, and more critically, that it fails to provide a verifiable measure for scaling methods, owing to the bias/variance tradeoff in the evaluation error. To estimate ECE from a set of evaluation samples, the prediction probabilities made by scaling methods need to be quantized/binned; this discrete approximation of a continuous function introduces a bias into the empirical estimate of ECE. Increasing the number of evaluation bins reduces this bias, as the quantization error is smaller; however, the estimation of the ground truth correctness then begins to suffer from high variance. Fig. 1-a) shows that the empirical ECE estimates of both the raw network outputs and the temperature scaling method (TS) pmlr-v70-guo17a vary with the number of evaluation bins. It is as yet unclear how to choose the optimal number of bins so that the estimation error is minimized for the given number of evaluation samples. In other words, which ECE estimate best reflects the true calibration error remains an open question, rendering the empirical ECE estimation of scaling methods non-verifiable.
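The dependence of the empirical estimate on the number of evaluation bins can be illustrated with a small sketch. The helper `top1_ece` and the synthetic over-confident predictions are assumptions made for illustration only; this is the usual equal-size-binned ECE estimator, not the evaluation code behind Fig. 1.

```python
import random

def top1_ece(confidences, correct, num_bins):
    """Equal-size-binned estimate of the top-1 ECE.

    confidences: top-1 confidences in [0, 1]; correct: 0/1 hit indicators.
    """
    conf_sum = [0.0] * num_bins
    hit_sum = [0.0] * num_bins
    for c, ok in zip(confidences, correct):
        b = min(int(c * num_bins), num_bins - 1)  # confidence 1.0 goes to the last bin
        conf_sum[b] += c
        hit_sum[b] += ok
    n = len(confidences)
    # sum_b (n_b / n) * |acc_b - conf_b| simplifies to sum_b |hits_b - conf_sum_b| / n
    return sum(abs(h - c) for h, c in zip(hit_sum, conf_sum)) / n

if __name__ == "__main__":
    random.seed(0)
    # Synthetic over-confident classifier: true accuracy is 0.1 below the confidence.
    conf = [random.uniform(0.5, 1.0) for _ in range(5000)]
    hits = [1 if random.random() < c - 0.1 else 0 for c in conf]
    for bins in (5, 15, 100, 1000):
        print(bins, round(top1_ece(conf, hits, bins), 4))
```

Running the loop with different bin counts on the same predictions is a quick way to see that a continuous-output calibrator has no single, obviously correct ECE estimate.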

Histogram binning (HB) was also exploited by pmlr-v70-guo17a for post-hoc calibration. The reported calibration performance was much worse than that of scaling methods, and HB additionally suffered from severe accuracy loss; nonetheless, HB offers verifiable ECEs Kumar2019. In contrast to scaling methods, it produces discrete prediction probabilities, averting quantization at evaluation, and thus its empirical ECE estimate is constant and independent of the number of evaluation bins in the example of Fig. 1-a). Without the bias induced by evaluation quantization, its ECE estimate can converge as the variance vanishes with an increased number of evaluation samples, see Fig. 1-b).
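Because a binning calibrator outputs only a finite set of values, its ECE can be estimated by grouping evaluation samples on the distinct output values, with no evaluation bins to choose. The function below is a minimal sketch of that idea; the name `discrete_ece` is hypothetical.

```python
from collections import defaultdict

def discrete_ece(confidences, correct):
    """ECE estimate for a calibrator with discrete outputs.

    Samples are grouped by their exact (discrete) confidence value, so the
    estimate does not depend on any evaluation-time binning choice.
    """
    groups = defaultdict(lambda: [0.0, 0])  # value -> [number of hits, count]
    for c, ok in zip(confidences, correct):
        groups[c][0] += ok
        groups[c][1] += 1
    n = len(confidences)
    # (count / n) * |hits / count - c| equals |hits - c * count| / n
    return sum(abs(hits - c * cnt) for c, (hits, cnt) in groups.items()) / n
```

With more evaluation samples only the per-group frequency estimates change, which is why the estimate can converge as the variance shrinks.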

(a) Top-1 prediction ECE (k evaluation samples)
(b) ECE convergence curve (based on bootstraps)
Figure 1: (a) Temperature scaling (TS) pmlr-v70-guo17a, equal-size histogram binning (HB) Zadrozny2001, and our proposal, i.e., sCW I-Max binning, are compared for post-hoc calibration of a CIFAR100 (WRN) classifier. (b) Binning offers a reliable ECE measure as the number of evaluation samples increases.

Aiming at verifiable ECE evaluation, we propose a novel binning scheme, I-Max binning, that is an outcome of bin optimization via mutual information (MI) maximization. The ranking performance (and accuracy) of a trained classifier depends on how well the ground truth label can be retrieved from the classifier's output logits. However, binning these logits inherently suffers from information loss. By design, I-Max binning aims to maximally retain the label information given the available number of bins, and also allows the two design objectives, calibration and accuracy, to be cleanly disentangled. We evaluate two common variants of HB, i.e., Equal (Eq.) size (uniformly partitioning the probability interval $[0, 1]$) and Eq. mass (uniformly distributing training samples over bins) binning, and explain their accuracy loss through sub-optimal performance at label-information preservation. To cope with a limited number of calibration samples, we propose a novel, sample-efficient strategy that fits one binning scheme for all per-class calibrations, which we call shared class-wise (sCW) I-Max binning.

I-Max binning is evaluated according to multiple performance metrics, including accuracy, ECE (class-wise and top-1 prediction), negative log-likelihood, and Brier score, and compared with benchmark calibration methods across multiple datasets and trained classifiers.

Compared to the baseline, I-Max obtains up to a 38.27% (49.78%) reduction in class-wise ECE and 66.11% (76.07%) reduction in top1-ECE for ImageNet (CIFAR100). Similarly, compared to the state-of-the-art scaling method GP wenger2020calibration and using its evaluation framework, we obtain a 38.14% (47.95%) reduction in class-wise ECE and 34.95% (50.00%) reduction in top1-ECE for ImageNet (CIFAR100). We additionally extend our sCW calibration idea to other calibration methods, and combine I-Max binning with scaling methods for further improved performance.

2 Related Work

Confidence calibration is an active research topic in deep learning. Bayesian DNNs pmlr-v37-blundell15; Kingma2015; pmlr-v70-louizos17a and their approximations DropoutGalG16; EnsemblesNIPS2017 are resource-demanding methods which consider predictive model uncertainty. However, applications with limited complexity overhead and latency require sampling-free and single-model based calibration methods. Examples include modifying the training loss pmlr-v80-kumar18a, scalable Gaussian processes Milios2018, sampling-free uncertainty estimation Postels_2019_ICCV, data augmentation patel2019onmanifold; OnMixupTrainThul; yun2019cutmix; hendrycks2020augmix and ensemble distribution distillation Malinin2020Ensemble. In comparison, a simple approach that requires no retraining of the models is post-hoc calibration pmlr-v70-guo17a.

Prediction probabilities (logits) scaling and binning are the two main solutions for post-hoc calibration. Scaling methods use parametric or non-parametric models to adjust the raw logits. pmlr-v70-guo17a investigated linear models, ranging from the single-parameter based temperature scaling to more complicated vector/matrix scaling. To avoid overfitting, Kull2019 suggested regularizing matrix scaling with a loss on the model weights. Recently, wenger2020calibration adopted a latent Gaussian process for multi-class calibration.

As an alternative to scaling, HB quantizes the raw confidences with either Eq. size or Eq. mass bins Zadrozny2001. Isotonic regression Zadrozny2002 and Bayesian binning into quantiles (BBQ) are often viewed as binning methods. However, from the verifiable ECE viewpoint, they are scaling methods: though isotonic regression fits a piecewise linear function, its predictions are continuous as they are interpolated for unseen data. BBQ considers multiple binning schemes with different numbers of bins, and combines them using a continuous Bayesian score, resulting in continuous predictions.

Based on empirical results, pmlr-v70-guo17a concluded that HB underperforms compared to scaling. However, one critical issue of scaling methods discovered in the recent work Kumar2019 is their non-verifiable ECE under sample-based evaluation. As HB offers verifiable ECE evaluation, Kumar2019 proposed to apply it after scaling. In this work, we improve the current HB design by casting bin optimization into an MI maximization problem. The resulting I-Max binning is superior to existing scaling methods on multiple evaluation criteria. Furthermore, our findings can also be used to improve scaling methods.

3 Method

Here we introduce the I-Max binning scheme, which addresses the issues of HB in terms of preserving label-information in multi-class calibration. After the problem setup in Sec. 3.1, Sec. 3.2 presents a sample-efficient technique for one-vs-rest calibration. In Sec. 3.3 we formulate the training objective of binning as MI maximization and derive a simple algorithm for I-Max binning.

3.1 Problem Setup

We address supervised multi-class classification tasks, where each input $x$ belongs to one of $K$ classes, and the ground truth labels are one-hot encoded, i.e., $y = [y_1, y_2, \dots, y_K] \in \{0, 1\}^K$. Let $f$ be a DNN trained using cross-entropy loss. $f$ maps each $x$ onto a logit vector $z = f(x) \in \mathbb{R}^K$, and its softmax responses $q = [q_1, \dots, q_K]$ are used to rank the $K$ possible classes of the current instance, i.e., $\arg\max_k q_k$ being the top-1 ranked class. The ranking performance of $q$ determines the classifier's top-1 prediction accuracy. Besides ranking, $q$ are frequently interpreted as an approximation to the ground truth prediction distribution. However, as the trained classifier tends to overfit to the cross-entropy loss rather than the accuracy (i.e., 0/1 loss), $q$ as prediction probabilities are typically poorly calibrated. ECE averaged over the $K$ classes quantifies the multi-class calibration performance of $f$:

$$\overline{\mathrm{CW}}\mathrm{ECE}(f) = \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}_{q_k} \left\{ \left| p(y_k = 1 \mid q_k) - q_k \right| \right\}. \tag{1}$$

It measures the expected deviation of the ground truth probability $p(y_k = 1 \mid q_k)$ from the predicted per-class confidence $q_k$. Multi-class post-hoc calibration uses a calibration set $\mathcal{C}$, independent from the classifier training set, in order to reduce $\overline{\mathrm{CW}}\mathrm{ECE}$ by revising $q_k$, e.g. by scaling or binning.

For evaluation, the unknown ground truth $p(y_k = 1 \mid q_k)$ is empirically estimated from the labeled samples that receive the same prediction confidence $q_k$. Since continuous prediction probabilities are naturally non-repetitive, additional quantization is inevitable for evaluating scaling methods.

3.2 One-vs-Rest Strategy for Multi-class Calibration

Histogram binning (HB) was initially developed for two-class calibration. When dealing with multi-class calibration, it separately calibrates each class in a one-vs-rest fashion: for any class-$k$, we use $y_k$ as the binary label for a two-class calibration task, in which class-1 means $y_k = 1$ and class-0 collects all other $K - 1$ classes. We can use a scalar logit $\lambda_k$ to parameterize the raw prediction probability $q_k$ of $y_k = 1$ using the sigmoid model, i.e.,

$$q_k = \sigma(\lambda_k) = \frac{1}{1 + e^{-\lambda_k}} \quad \text{with} \quad \lambda_k = \log q_k - \log(1 - q_k).$$

HB then revises $q_k$ by mapping the raw logit $\lambda_k$ onto a given number of bins, and reproducing them with the empirical frequency of class-1 samples in each bin. The empirical frequencies are estimated from the class-$k$ training set $\mathcal{S}_k = \{(y_k, \lambda_k)\}$ that is constructed from the calibration set $\mathcal{C}$ under the one-vs-rest strategy.
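The one-vs-rest histogram binning step can be sketched as follows. This is a minimal illustration under stated assumptions: Eq. mass edges are used for concreteness, bins are taken as left-closed intervals, and all function names are hypothetical rather than taken from the authors' code.

```python
import math
from bisect import bisect_right

def to_logit(q, eps=1e-12):
    """Probability-to-logit translation; q is clipped to avoid log(0)."""
    q = min(max(q, eps), 1.0 - eps)
    return math.log(q) - math.log(1.0 - q)

def eq_mass_edges(logits, num_bins):
    """Interior bin edges that put roughly equal sample counts in each bin."""
    s = sorted(logits)
    return [s[(len(s) * i) // num_bins] for i in range(1, num_bins)]

def fit_representatives(logits, labels, edges):
    """Bin representative = empirical frequency of class-1 samples per bin."""
    num_bins = len(edges) + 1
    pos = [0] * num_bins
    cnt = [0] * num_bins
    for lam, y in zip(logits, labels):
        m = bisect_right(edges, lam)  # index of the left-closed bin holding lam
        pos[m] += y
        cnt[m] += 1
    return [p / c if c else 0.0 for p, c in zip(pos, cnt)]

def calibrate(lam, edges, reps):
    """Calibrated probability of a raw one-vs-rest logit."""
    return reps[bisect_right(edges, lam)]
```

A typical use is to convert each class probability with `to_logit`, fit edges and representatives on the calibration logits, and then call `calibrate` on test logits.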

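It is important to note that the two-class ratio in the class-wise set $\mathcal{S}_k$ is $1 : (K - 1)$, so $\mathcal{S}_k$ can suffer from a severe class imbalance when the number of classes $K$ is large. For ImageNet with 1k classes, $\mathcal{S}_k$ constructed from a class-balanced calibration set $\mathcal{C}$ of size 25k contains only 25 class-1 samples, see Fig. 2-a). Too few class-1 samples will compromise the optimization of the calibration schemes. To address this, we propose to merge $\{\mathcal{S}_k\}$ across the $K$ classes, i.e., $\mathcal{S} = \bigcup_{k=1}^{K} \mathcal{S}_k$. The size of the resulting merged set will be $K |\mathcal{C}|$ and it will contain 25k class-1 samples ($K$ times more than $\mathcal{S}_k$), see Fig. 2-b).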

For bin optimization, both the class-wise set $\mathcal{S}_k$ and the merged set $\mathcal{S}$ serve as empirical approximations to the inaccessible ground truth distribution $p(\lambda_k, y_k)$. The former suffers from high variance, arising from insufficient samples, while the latter is biased due to having samples drawn from the other classes. As the calibration set size is usually small, the variance is expected to outweigh the bias in the approximation error (see an empirical analysis in the appendix). Given that, we propose to use $\mathcal{S}$ for training, yielding one binning scheme shared by all two-class calibration tasks, i.e., shared class-wise (sCW) binning instead of CW binning trained separately on each $\mathcal{S}_k$.
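The construction of the class-wise sets and of the merged set can be sketched in a few lines. The data layout (a list of probability vectors plus integer labels) and the function names are assumptions made for this illustration.

```python
import math

def logit(q, eps=1e-12):
    """Probability-to-logit translation with clipping."""
    q = min(max(q, eps), 1.0 - eps)
    return math.log(q / (1.0 - q))

def class_wise_sets(probs, labels):
    """Build the K one-vs-rest sets of (binary label, logit) pairs.

    probs: list of K-dimensional probability vectors from the classifier.
    labels: list of ground truth class indices.
    """
    num_classes = len(probs[0])
    return [[(int(y == k), logit(q[k])) for q, y in zip(probs, labels)]
            for k in range(num_classes)]

def merged_set(sets):
    """Shared class-wise training set: the union of all class-wise sets."""
    return [pair for s_k in sets for pair in s_k]
```

Each class-wise set holds one pair per calibration sample, so the merged set is K times larger and contains every calibration sample exactly once as a class-1 example.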

(a) train. set $\mathcal{S}_k$ for CW binning.
(b) train. set $\mathcal{S}$ for shared-CW binning.
Figure 2: Histogram of ImageNet (InceptionResNetv2) logits for (a) CW and (b) sCW training. By means of the set merging strategy to handle the class imbalance $1 : (K - 1)$, $\mathcal{S}$ has $K$ times more class-1 samples than $\mathcal{S}_k$ with the same 25k calibration samples from $\mathcal{C}$.

3.3 Bin Optimization via Mutual Information (MI) Maximization

Binning can be viewed as a quantizer $Q$ that maps the real-valued logit $\lambda$ to the bin interval $m \in \{0, 1, \dots, M - 1\}$ if $\lambda \in \mathcal{I}_m = [g_m, g_{m+1})$, where $M$ is the total number of bin intervals, and the bin edges $g_m$ are sorted ($g_m < g_{m+1}$, $g_0 = -\infty$ and $g_M = \infty$). Any logit binned to $\mathcal{I}_m$ will be reproduced to the same bin representative $r_m$. In the context of calibration, the bin representative $r_m$ assigned to the logit $\lambda_k$ is used as the calibrated prediction probability of class-$k$. As multiple classes can be assigned the same bin representative, we will then encounter ties when making top-1 predictions based on calibrated probabilities. Therefore, binning as lossy quantization generally does not preserve the raw logit-based ranking performance, being subject to potential accuracy loss.

Unfortunately, increasing the number of bins $M$ to reduce the quantization error is not a good solution here. For a given calibration set, the number of samples per bin generally reduces as $M$ increases, and a reliable frequency estimation for setting the bin representatives demands sufficient samples per bin.

Considering that the top-1 accuracy is an indicator of how well the ground truth label can be recovered from the logits, we propose bin optimization via maximizing the MI between the quantized logits $Q(\lambda)$ and the label $y$

$$\{g_m^*\} = \arg\max_{Q:\{g_m\}} I\big(y; m = Q(\lambda)\big) = \arg\max_{Q:\{g_m\}} H(m) - H(m \mid y), \tag{S1}$$

where the index $m$ is viewed as a discrete random variable with $P(m \mid y) = \int_{g_m}^{g_{m+1}} p(\lambda \mid y) \, d\lambda$ and $P(m) = \int_{g_m}^{g_{m+1}} p(\lambda) \, d\lambda$. Such a formulation offers a quantizer optimal at preserving the label information for a given budget on the number of bins. Unlike designing distortion-based quantizers, the reproducer values of raw logits, i.e., the bin representatives $\{r_m\}$, are not a part of the optimization space, as it is sufficient to know the mapped bin index $m$ of each logit for preserving the label information. In other words, the bin edge and representative settings are disentangled in the MI maximization formulation: the former for preserving label information, and the latter for calibration.
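For a given set of bin edges, the mutual information between the bin index and the binary label can be estimated from samples with a simple plug-in estimator. The sketch below uses empirical counts (the paper's Table 1 relies on a KDE-based evaluation instead), and the function name is hypothetical.

```python
import math
from bisect import bisect_right
from collections import Counter

def binned_mutual_information(logits, labels, edges):
    """Plug-in estimate (in nats) of the MI between bin index and label.

    edges: sorted interior bin edges; bins are left-closed intervals.
    """
    n = len(logits)
    joint = Counter((bisect_right(edges, lam), y) for lam, y in zip(logits, labels))
    count_m = Counter()
    count_y = Counter()
    for (m, y), c in joint.items():
        count_m[m] += c
        count_y[y] += c
    # sum over (m, y) of P(m, y) * log[ P(m, y) / (P(m) P(y)) ]
    return sum((c / n) * math.log(c * n / (count_m[m] * count_y[y]))
               for (m, y), c in joint.items())
```

Only the bin indices enter the estimate, which mirrors the point made above: the representatives play no role in how much label information the quantizer keeps.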

It is interesting to note the equality in (S1), which is based on the relation of MI to the entropy $H(m)$ and conditional entropy $H(m \mid y)$ of the bin index $m$. Specifically, the first term $H(m)$ is maximized if $P(m)$ is uniform, which is attained by Eq. mass binning. As a uniform sample distribution over the bins ensures a balanced empirical frequency estimation quality, it is a desirable choice for setting the bin representatives. However, for MI maximization, equally distributing the mass is insufficient due to the second term $H(m \mid y)$. More importantly, each bin interval should collect the $\lambda$ that actually encodes the label information, otherwise $H(m \mid y)$ will reduce to the entropy $H(m)$, and the MI will be zero. As a result, we observe from Fig. 3 that the bin edges of I-Max binning are densely located in an area where the uncertainty of $y$ given the logit is high. This uncertainty results from small gaps between the top class predictions. With small bin widths, such nearby prediction logits are more likely to fall into different bins, and are thus distinguishable after binning. On the other hand, Eq. mass binning has a single bin stretching across this high-uncertainty area due to the class imbalance (Fig. 3). Eq. size binning follows a pattern closer to I-Max binning [1]. However, its very narrow bin widths around zero may introduce large empirical frequency estimation errors when setting the bin representatives.

[1] Eq. size binning uniformly partitions the interval $[0, 1]$ in the probability domain. The observed dense and symmetric bin location around zero is the outcome of probability-to-logit translation via the logit function.

Figure 3: Histogram and KDE of CIFAR100 (WRN) logits in $\mathcal{S}$ constructed from k calibration samples. The bin edges of Eq. mass binning are located at the high mass region, mainly covering class-0 due to the imbalanced two-class ratio $1 : 99$. Both Eq. size and I-Max binning cover the high uncertainty region, but here only I-Max yields reasonable bin widths ensuring enough mass per bin.

For solving the problem (S1), we formulate an equivalent problem.

Theorem 1.

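The MI maximization problem given in (S1) is equivalent to

$$\max_{Q:\{g_m\}} I\big(y; m = Q(\lambda)\big) \equiv \min_{\{g_m, \phi_m\}} \mathcal{L}(\{g_m, \phi_m\}),$$

where the loss $\mathcal{L}(\{g_m, \phi_m\})$ is defined as

$$\mathcal{L}(\{g_m, \phi_m\}) = \sum_{m=0}^{M-1} \int_{g_m}^{g_{m+1}} p(\lambda) \sum_{y' \in \{0, 1\}} P(y = y' \mid \lambda) \log \frac{P(y = y')}{P_\sigma(y = y'; \phi_m)} \, d\lambda \quad \text{with} \quad P_\sigma(y; \phi_m) = \sigma\big[(2y - 1)\phi_m\big],$$

and $\{\phi_m\}$, a set of real-valued auxiliary variables, are introduced here to ease the optimization.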

Proof. See the appendix for the proof. ∎

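Next, we compute the derivatives of the loss $\mathcal{L}$ with respect to $\{g_m, \phi_m\}$. When the conditional distribution $P(y \mid \lambda)$ takes the sigmoid model, i.e., $P(y \mid \lambda) \approx \sigma[(2y - 1)\lambda]$, the stationary points of $\mathcal{L}$, zeroing the gradients over $\{g_m, \phi_m\}$, have a closed-form expression

$$g_m = \log \left\{ \frac{\log\left[\frac{1 + e^{\phi_m}}{1 + e^{\phi_{m-1}}}\right]}{\log\left[\frac{1 + e^{-\phi_{m-1}}}{1 + e^{-\phi_m}}\right]} \right\}, \qquad \phi_m = \log \left\{ \frac{\int_{g_m}^{g_{m+1}} \sigma(\lambda) p(\lambda) \, d\lambda}{\int_{g_m}^{g_{m+1}} \sigma(-\lambda) p(\lambda) \, d\lambda} \right\} \approx \log \left\{ \frac{\sum_{\lambda_n \in \mathcal{S}_m} \sigma(\lambda_n)}{\sum_{\lambda_n \in \mathcal{S}_m} \sigma(-\lambda_n)} \right\}, \tag{S12}$$

where the approximation for $\phi_m$ arises from using the logits in the training set $\mathcal{S}$ as an empirical approximation to $p(\lambda)$, and $\mathcal{S}_m = \mathcal{S} \cap \{\lambda_n \in [g_m, g_{m+1})\}$. So, we can solve the problem by iteratively and alternately updating $\{g_m\}$ and $\{\phi_m\}$ based on (S12). The convergence and initialization of such an iterative method as well as the sigmoid-model assumption are discussed along with the proof of Theorem 1 in the appendix.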
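The alternating updates can be sketched as below. This is an illustrative reading, not the authors' implementation: it assumes the sigmoid model, takes the auxiliary variable of a bin to be the log-ratio of summed sigmoid(logit) to summed sigmoid(-logit) over the logits in that bin, and places each interior edge where the cross-entropy costs of the two neighboring bins are equal. All function names, the Eq. mass initialization, and the empty-bin fallback are assumptions of the sketch.

```python
import math
import random
from bisect import bisect_right

def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def softplus(x):
    """log(1 + exp(x)), computed stably."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def update_phi(logits, edges, phi_old):
    """Per-bin auxiliary variable: log(sum sigmoid(l) / sum sigmoid(-l))."""
    num = [0.0] * (len(edges) + 1)
    den = [0.0] * (len(edges) + 1)
    for lam in logits:
        m = bisect_right(edges, lam)  # left-closed bin holding lam
        num[m] += sigmoid(lam)
        den[m] += sigmoid(-lam)
    # An empty bin keeps its previous value (a choice made in this sketch).
    return [math.log(n / d) if n > 0 and d > 0 else old
            for n, d, old in zip(num, den, phi_old)]

def update_edges(phi):
    """Interior edges from strictly increasing auxiliary variables."""
    edges = []
    for lo, hi in zip(phi, phi[1:]):
        a = softplus(hi) - softplus(lo)    # log[(1 + e^hi) / (1 + e^lo)]
        b = softplus(-lo) - softplus(-hi)  # log[(1 + e^-lo) / (1 + e^-hi)]
        edges.append(math.log(a / b))
    return edges

def eq_mass_edges(logits, num_bins):
    """Eq. mass initialization (assumes distinct logits, so no bin is empty)."""
    s = sorted(logits)
    return [s[len(s) * i // num_bins] for i in range(1, num_bins)]

def imax_fit(logits, init_edges, num_iters=50):
    """Alternate edge and auxiliary-variable updates for a fixed iteration count."""
    edges = list(init_edges)
    phi = update_phi(logits, edges, [0.0] * (len(edges) + 1))
    for _ in range(num_iters):
        edges = update_edges(phi)
        phi = update_phi(logits, edges, phi)
    return edges, phi

if __name__ == "__main__":
    random.seed(0)
    logits = [random.gauss(-2.0, 2.0) for _ in range(2000)]
    edges, phi = imax_fit(logits, eq_mass_edges(logits, 5))
    print([round(g, 3) for g in edges])
```

Under the sigmoid model the edge updates use only the logits, so labels are needed only afterwards, when the bin representatives are set from empirical frequencies.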

As the iterative method operates under an approximation of the inaccessible ground truth distribution $p(\lambda, y)$, here we synthesize an example, see Fig. 4, to assess its effectiveness [2]. In Fig. 4-a, we show the monotonic increase in MI over the iterative steps. Furthermore, Fig. 4-b shows that I-Max approaches the theoretical limits and provides an information-theoretic perspective on the sub-optimal performance of the alternative binning schemes.

[2] The formulation (S1) is a special case of the information bottleneck (IB), i.e., the case in which, for the corresponding setting of the Lagrange multiplier, stochastic quantization degenerates to a deterministic one. If stochastic binning were used for calibration, it would output a weighted sum of all bin representatives, thereby being continuous and not ECE verifiable. Given that, we do not use it for calibration.

(a) Convergence behavior
(b) Label-information vs. compression rate
Figure 4: MI evaluation: The KDEs of $p(\lambda \mid y)$ for $y \in \{0, 1\}$ shown in Fig. 3 are used as the ground truth distribution to synthesize a dataset and evaluate the MI of Eq. mass, Eq. size, and I-Max binning trained over $\mathcal{S}$. (a) The developed iterative solution for I-Max bin optimization over $\mathcal{S}$ successfully increases the MI over iterations, approaching the theoretical upper bound $I(y; \lambda)$. For comparison, I-Max is initialized with both Eq. size and Eq. mass bin edges, both of which are suboptimal at label information preservation. (b) We compare the three binning schemes with varying numbers of quantization levels against the information bottleneck (IB) limit on the label-information $I(y; m)$ vs. the compression rate $I(\lambda; m)$. The information-rate pairs achieved by I-Max binning are very close to the limit. The information loss of Eq. mass binning is considerably larger, whereas Eq. size binning gets stuck in the low rate regime, failing to reach the upper bound even with more bins.

4 Experiment

Datasets and Models

We evaluate post-hoc calibration methods on three benchmark datasets, i.e., ImageNet imagenet_cvpr09, CIFAR-100 cifar, and CIFAR-10 cifar, and across three modern DNNs for each dataset, i.e., InceptionResNetV2 inceptionresnetv2, DenseNet161 dnpaper and ResNet152 He2016DeepRL for ImageNet, and Wide ResNet (WRN) wrnpaper, DenseNet-BC (L=190, k=40) dnpaper and ResNext8x64 resnextpaper for the two CIFAR datasets. These models are publicly available pre-trained networks and details are reported in the appendix.

Training and Evaluation Details

We perform class-balanced random splits of the test set: the calibration and evaluation set sizes are both 25k for ImageNet, and 5k for the CIFAR datasets. All reported numbers are the means across random splits; stds can be found in the appendix. Note that some calibration methods only use a subset of the available calibration samples for training, showing their sample efficiency. Further calibrator training details are provided in the appendix.

We empirically evaluate MI, Accuracy (top-1 and top-5 ACCs), ECE (class-wise and top-1 prediction), NLL and Brier score; the latter is shown in the appendix. The ECEs of the scaling methods are obtained by using Eq. size binning for evaluation, with the number of evaluation bins set as in wenger2020calibration. Additional quantization schemes for ECE evaluation of scaling methods can be found in the appendix.

4.1 Eq. Size, Eq. Mass vs. I-Max Binning

When comparing different calibration methods, it is important to jointly consider ECE and ACC. A naive classifier that always predicts all classes with confidence $1/K$ minimizes the ECE to zero, but its ACC is only at chance level. Additionally, only having a small class-wise ECE may be an artifact. For instance, the class-wise ECE of ImageNet will be dominated by cases where a properly trained classifier ranks the class-$k$ as very unlikely, as the prior $1/K = 0.001$ ensures the correctness in 99.9% of cases. However, in practice, whether the prediction probability is calibrated only matters when the event is likely to happen. Given that, we introduce a threshold into the evaluation as in (1) to only consider cases when the predicted probability of class-$k$ exceeds its prior $1/K$.
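A thresholded class-wise ECE of this kind can be sketched as follows. The exact normalization used in the paper is not recoverable from the text, so this sketch makes its own assumptions: samples whose class probability does not exceed the threshold are dropped, the per-class ECE is normalized by the number of retained samples, and classes without retained samples are skipped. Equal-size bins are used for the discretization.

```python
def classwise_ece(probs, labels, num_bins=10, threshold=0.0):
    """Class-wise ECE restricted to predictions above `threshold`.

    probs: list of K-dimensional probability vectors.
    labels: list of ground truth class indices.
    """
    num_classes = len(probs[0])
    per_class = []
    for k in range(num_classes):
        conf_sum = [0.0] * num_bins
        pos_sum = [0.0] * num_bins
        kept = 0
        for q, y in zip(probs, labels):
            if q[k] <= threshold:
                continue  # ignore cases at or below the threshold
            b = min(int(q[k] * num_bins), num_bins - 1)
            conf_sum[b] += q[k]
            pos_sum[b] += (y == k)
            kept += 1
        if kept:
            per_class.append(sum(abs(p - c) for p, c in zip(pos_sum, conf_sum)) / kept)
    return sum(per_class) / len(per_class) if per_class else 0.0
```

With `threshold=0.0` a uniform prediction of the class prior on class-balanced data yields zero error, which is the artifact described above; setting `threshold` to the class prior removes those trivially correct cases from the average.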

In Tab. 1, we compare three binning schemes: Eq. size, Eq. mass and I-Max binning. The accuracy of the three binning schemes follows their MI (in nats); Eq. mass binning is highly sub-optimal at label information preservation, and thus shows a severe accuracy drop. The accuracy of Eq. size binning is closer to that of I-Max binning, but still lower, in particular for the top-5 ACC. Also note that I-Max can approach the theoretical MI limit $I(y; \lambda)$. The advantages of I-Max become even more prominent when comparing the NLLs of the binning schemes. For all three ECE evaluation metrics, I-Max binning improves on the baseline calibration performance, and outperforms Eq. size binning. Eq. mass binning is excluded from this comparison because its poor accuracy makes the method impractical. Overall, I-Max successfully mitigates the negative impact of quantization on ACCs while still providing an improved and verifiable ECE performance. Additionally, I-Max achieves an even better calibration when using sCW with 1k calibration samples, instead of the standard CW binning with 25k calibration samples, highlighting the effectiveness of sCW binning.

In the appendix we perform additional ablations on the number of bins and calibration samples. In addition, a post-hoc analysis investigates how the quantization error of the binning schemes changes the ranking order. Observations are consistent with the intuition behind the problem formulation (see Sec. 3.3) and the empirical results from Tab. 1: MI maximization is a proper criterion for multi-class calibration, and it maximally mitigates the potential accuracy loss.

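| Binn. | sCW(?) | size | MI | ACC top-1 (%) | ACC top-5 (%) | CW-ECE | CW-ECE (cls-prior) | top1-ECE | NLL |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | - | - | | 80.33 | 95.10 | 0.000442 | 0.0486 | 0.0357 | 0.8406 |
| Eq. Mass | no | 25k | 0.0026 | 7.78 | 27.92 | 0.000173 | 0.0016 | 0.0606 | 3.5960 |
| Eq. Mass | yes | 1k | 0.0026 | 5.02 | 26.75 | 0.000165 | 0.0022 | 0.0353 | 3.5272 |
| Eq. Size | no | 25k | 0.0053 | 78.52 | 89.06 | 0.000310 | 0.1344 | 0.0547 | 1.5159 |
| Eq. Size | yes | 1k | 0.0062 | 80.14 | 88.99 | 0.000298 | 0.1525 | 0.0279 | 1.2671 |
| I-Max | no | 25k | 0.0066 | 80.27 | 95.01 | 0.000346 | 0.0342 | 0.0329 | 0.8499 |
| I-Max | yes | 1k | 0.0066 | 80.20 | 94.86 | 0.000296 | 0.0302 | 0.0200 | 0.7860 |

Table 1: ACCs and ECEs of Eq. mass, Eq. size and I-Max binning for the case of ImageNet (InceptionResNetV2). "CW-ECE (cls-prior)" is the class-wise ECE evaluated only on predictions above the class prior. Due to the poor accuracy of Eq. mass binning, its ECEs are not considered for comparison. The MI is empirically evaluated based on KDE analogous to Fig. 4, where the MI upper bound is $I(y; \lambda)$. For the other datasets and models, we refer to the appendix.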

4.2 Scaling vs. I-Max Binning

In Tab. 2, we compare I-Max binning to benchmark scaling methods. Among the scaling methods using parametric models, matrix scaling Kull2019 has a large model capacity, while TS pmlr-v70-guo17a only uses a single parameter. As a non-parametric method, GP wenger2020calibration yields state-of-the-art calibration performance. Benefiting from its model capacity, matrix scaling achieves the best accuracy. I-Max binning achieves the best calibration on CIFAR-100; on ImageNet, it has the best class-wise ECE (cls-prior), and is similar to GP on top1-ECE. For a broader scope of comparison, we refer to the appendix.

To showcase the complementary nature of scaling and binning, we investigate combining binning with GP (a top performing non-parametric scaling method, though with the drawback of high complexity) and TS (a commonly used scaling method). Here, we propose to bin the raw logits and use the GP/TS scaled logits of the samples per bin for setting the bin representatives, replacing the empirical frequency estimates. As GP is then only needed at the calibration learning phase, complexity is no longer an issue. More importantly, binning offers GP/TS (and any scaling method) verifiable ECE performance. The benefit is mutual: GP helps improve the ACCs and ECEs of binning.
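One way to read this combination is sketched below: the bin edges are applied to the raw logits, and each bin representative is set to the mean scaled probability of the calibration samples in that bin. Averaging is an assumption of this sketch, as is the temperature-like scaling of the one-vs-rest logit, which only stands in for a fitted TS or GP calibrator; all names are hypothetical.

```python
import math
from bisect import bisect_right

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def scaled_representatives(logits, edges, scale_fn):
    """Set each bin representative to the mean scaled probability in the bin.

    logits: raw one-vs-rest logits of the calibration samples.
    edges: sorted interior bin edges applied to the raw logits.
    scale_fn: maps a raw logit to a scaled probability (stand-in for TS/GP).
    """
    num_bins = len(edges) + 1
    total = [0.0] * num_bins
    count = [0] * num_bins
    for lam in logits:
        m = bisect_right(edges, lam)
        total[m] += scale_fn(lam)
        count[m] += 1
    return [t / c if c else None for t, c in zip(total, count)]

def temperature_scaler(temperature):
    """Hypothetical stand-in for a fitted scaling method."""
    return lambda lam: sigmoid(lam / temperature)
```

The scaling model is evaluated only on calibration samples; at test time a logit is binned and mapped to its precomputed representative, so the output stays discrete.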

Using the combination of the best binning method and scaling method (i.e. I-Max (sCW) w. GP) only marginally degrades the accuracy performance of the baseline, by as little as 0.13% (0.01%) on top-1 ACC for ImageNet (CIFAR100) and 0.23% on top-5 ACC for ImageNet. This minor drop is insignificant compared to the improvement in calibration. We obtain a 38.27% (49.78%) reduction in class-wise ECE and 66.11% (76.07%) reduction in top1-ECE of the baseline for ImageNet (CIFAR100). Similarly, compared to the state-of-the-art scaling method (i.e. GP wenger2020calibration), we obtain a 38.14% (47.95%) reduction in class-wise ECE and 34.95% (50.00%) reduction in top1-ECE for ImageNet (CIFAR100).
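The relative reductions discussed here can be recomputed from the Table 2 entries. In the sketch below, the pairs are (class-wise ECE, top1-ECE) values copied from the Baseline, GP and I-Max w. GP rows; the assignment of the unlabeled table columns to these two metrics is this sketch's reading of the table.

```python
def relative_reduction(reference, value):
    """Relative reduction of `value` with respect to `reference`, in percent."""
    return 100.0 * (reference - value) / reference

# (class-wise ECE, top1-ECE) pairs read off Table 2.
TABLE2 = {
    "ImageNet": {"Baseline": (0.0486, 0.0357), "GP": (0.0485, 0.0186),
                 "I-Max w. GP": (0.0300, 0.0121)},
    "CIFAR100": {"Baseline": (0.1113, 0.0748), "GP": (0.1074, 0.0358),
                 "I-Max w. GP": (0.0559, 0.0179)},
}

def reductions(dataset, reference):
    """Percent reductions of I-Max w. GP relative to the reference method."""
    ref = TABLE2[dataset][reference]
    new = TABLE2[dataset]["I-Max w. GP"]
    return tuple(round(relative_reduction(r, v), 2) for r, v in zip(ref, new))
```

For example, `reductions("ImageNet", "Baseline")` evaluates to `(38.27, 66.11)`.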

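| Calibrator | CIFAR100 (WRN): ACC top-1 (%) | CIFAR100 (WRN): CW-ECE (cls-prior) | CIFAR100 (WRN): top1-ECE | ImageNet (InceptionResNetV2): ACC top-1 (%) | ImageNet (InceptionResNetV2): ACC top-5 (%) | ImageNet (InceptionResNetV2): CW-ECE (cls-prior) | ImageNet (InceptionResNetV2): top1-ECE |
|---|---|---|---|---|---|---|---|
| Baseline | 81.35 | 0.1113 | 0.0748 | 80.33 | 95.10 | 0.0486 | 0.0357 |
| Mtx Scal. Kull2019 | 81.44 | 0.1085 | 0.0692 | 80.78 | 95.38 | 0.0508 | 0.0282 |
| TS pmlr-v70-guo17a | 81.35 | 0.0911 | 0.0511 | 80.33 | 95.10 | 0.0559 | 0.0439 |
| GP wenger2020calibration | 81.34 | 0.1074 | 0.0358 | 80.33 | 95.11 | 0.0485 | 0.0186 |
| I-Max | 81.30 | 0.0518 | 0.0231 | 80.20 | 94.86 | 0.0302 | 0.0200 |
| I-Max w. TS | 81.34 | 0.0510 | 0.0365 | 80.20 | 94.87 | 0.0354 | 0.0402 |
| I-Max w. GP | 81.34 | 0.0559 | 0.0179 | 80.20 | 94.87 | 0.0300 | 0.0121 |

Table 2: ACCs and ECEs of I-Max binning ( bins) and scaling methods. All methods use 1k calibration samples, except for Mtx. Scal., which requires the complete calibration set, i.e., 25k/5k for ImageNet/CIFAR100. Additional scaling methods can be found in the appendix.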

4.3 Shared Class Wise Helps Scaling Methods

Although they incur no quantization loss, some scaling methods, i.e., Beta pmlr-v54-kull17a, Isotonic regression Zadrozny2002, and Platt scaling Platt1999, suffer from even more severe accuracy degradation than I-Max binning. As they also use the one-vs-rest strategy for multi-class calibration, we find that the proposed shared CW strategy is beneficial for reducing their accuracy loss and improving their ECE performance, with only 1k calibration samples, see Tab. 3.

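| Calibrator | sCW(?) | CIFAR100 (WRN): ACC top-1 (%) | CIFAR100 (WRN): CW-ECE (cls-prior) | CIFAR100 (WRN): top1-ECE | ImageNet (InceptionResNetV2): ACC top-1 (%) | ImageNet (InceptionResNetV2): ACC top-5 (%) | ImageNet (InceptionResNetV2): CW-ECE (cls-prior) | ImageNet (InceptionResNetV2): top1-ECE |
|---|---|---|---|---|---|---|---|---|
| Baseline | - | 81.35 | 0.1113 | 0.0748 | 80.33 | 95.10 | 0.0489 | 0.0357 |
| Beta pmlr-v54-kull17a | no | 81.02 | 0.1066 | 0.0638 | 77.80 | 86.83 | 0.1662 | 0.1586 |
| Beta | yes | 81.35 | 0.0942 | 0.0357 | 80.33 | 95.10 | 0.0625 | 0.0603 |
| I-Max w. Beta | yes | 81.34 | 0.0508 | 0.0161 | 80.20 | 94.87 | 0.0381 | 0.0574 |
| Isotonic Reg. Zadrozny2002 | no | 80.62 | 0.0989 | 0.0785 | 77.82 | 88.36 | 0.1640 | 0.1255 |
| Isotonic Reg. | yes | 81.30 | 0.0602 | 0.0257 | 80.22 | 95.05 | 0.0345 | 0.0209 |
| I-Max w. Isotonic Reg. | yes | 81.34 | 0.0515 | 0.0212 | 80.20 | 94.87 | 0.0299 | 0.0170 |
| Platt Scal. Platt1999 | no | 81.31 | 0.0923 | 0.1035 | 80.36 | 94.91 | 0.0451 | 0.0961 |
| Platt Scal. | yes | 81.35 | 0.0816 | 0.0462 | 80.33 | 95.10 | 0.0565 | 0.0415 |
| I-Max w. Platt Scal. | yes | 81.34 | 0.0511 | 0.0323 | 80.20 | 94.87 | 0.0293 | 0.0392 |

Table 3: ACCs and ECEs of scaling methods using the one-vs-rest conversion for multi-class calibration. Here we compare using 1k samples for both CW and sCW scaling for ImageNet and CIFAR100. For each scheme and metric, we display the best variant of the scaling methods in bold.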

5 Conclusion

In this work, we proposed I-Max binning for multi-class calibration. By maximally preserving the label-information under lossy quantization, it outperforms histogram binning (HB) in reducing potential accuracy losses. Additionally, through the novel shared class-wise (sCW) strategy, we solved the sample-inefficiency issue of training binning and scaling methods that rely on one-vs-rest conversion for multi-class calibration. Our experiments showed that I-Max binning yields consistent class-wise and top-1 calibration improvements over multiple datasets and model architectures, outperforming HB and state-of-the-art scaling methods. The combination of I-Max and scaling methods offers further, and more importantly, verifiable calibration performance gains.

Future work includes diversifying the calibration set through data augmentation or multiple-inference models for generalization to dataset shifts unobserved during training. Another exciting primary direction is the inclusion of I-Max during the classifier’s training, allowing I-Max to influence the logit distribution to exploit further calibration gains. Last but not least, as a quantization scheme, I-Max binning may also be exploited for calibration evaluation, providing better, ideally verifiable empirical ECE estimates of scaling methods.


S1 No Extra Normalization after Class-wise Calibrations

There is a group of calibration schemes that rely on one-vs-rest conversion to turn multi-class calibration into class-wise calibrations, e.g., histogram binning (HB), Platt scaling and Isotonic regression. After per-class calibration, the calibrated prediction probabilities $q_k$ of all $K$ classes no longer fulfill the normalization constraint, i.e., $\sum_{k=1}^{K} q_k \neq 1$ in general. An extra normalization step was taken in pmlr-v70-guo17a to regain the normalization constraint. Here, we note that this extra normalization is unnecessary and partially undoes the per-class calibration effect. For HB, normalization even renders its ECE evaluation non-verifiable, as for any scaling method.

The one-vs-rest strategy essentially marginalizes the multi-class predictive distribution over each class. After such marginalization, each class and its prediction probability shall be treated independently, thus no longer being constrained by the multi-class normalization constraint. This is analogous to training a CIFAR or ImageNet classifier with sigmoid rather than softmax cross entropy loss, e.g., Ryou_2019_ICCV. At training and test time, each class prediction probability is individually taken from the respective sigmoid response without normalization. The class with the largest response is then top-ranked, and normalization itself has no influence on the ranking performance.
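A toy example makes both effects concrete: dividing the per-class calibrated probabilities by their sum leaves the class ranking untouched but changes every calibrated value. The numbers below are invented for illustration.

```python
def renormalize(probs):
    """Rescale per-class probabilities so that they sum to one."""
    total = sum(probs)
    return [p / total for p in probs]

def ranking(scores):
    """Class indices sorted from most to least likely."""
    return sorted(range(len(scores)), key=lambda k: -scores[k])

if __name__ == "__main__":
    calibrated = [0.7, 0.4, 0.1]  # hypothetical one-vs-rest calibrated outputs
    normalized = renormalize(calibrated)
    print(ranking(calibrated) == ranking(normalized))  # ranking is unchanged
    print(calibrated, normalized)                      # values are not
```

Since the normalizer depends on the whole probability vector, the normalized outputs of a binning calibrator are no longer restricted to the finite set of bin representatives.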

Figure S1: Empirical approximation error of $\mathcal{S}_k$ vs. $\mathcal{S}$, where Jensen-Shannon divergence (JSD) is used to measure the difference between the empirical distributions underlying the training sets for class-wise bin optimization. Overall, the merged set $\mathcal{S}$ is a more sample efficient choice than $\mathcal{S}_k$.

S2 $\mathcal{S}_k$ vs. $\mathcal{S}$ as Empirical Approximations to $p(\lambda_k, y_k)$ for Bin Optimization

In Sec. 3.2 of the main paper, we discussed the class imbalance issue, arising from one-vs-rest conversion, see Fig. 2. To tackle it, we proposed to merge the class-wise (CW) bin training sets and use the merged one to train a single binning scheme for all classes, i.e., shared CW (sCW) instead of CW binning. Tab. 3 showed that our proposal is also beneficial to scaling methods that use one-vs-rest for multi-class calibration.

As pointed out in Sec. 3.2, both and are empirical approximations to the inaccessible ground truth for bin optimization. In Fig. S1, we empirically analyze their approximation errors. From the CIFAR10 test set, we take k samples to approximate per-class logit distribution by means of histogram density estimation (here, we focus on as its empirical estimation suffers from the class imbalance, being much more challenging than as illustrated in Fig. 2 of the main paper), and then use it as the baseline for comparison, i.e., in Fig. S1. The rest of the k samples in the CIFAR10 test set are reserved for constructing and . For each class, we respectively evaluate the square root of the Jensen-Shannon divergence (JSD) from the baseline to the empirical distribution of or attained at different numbers of samples. Despite suffering from the bias induced by set merging, achieves noticeably smaller JSDs than in nearly all cases, and proves to be more sample efficient. Only after using all k samples, has slightly smaller JSDs than in the class-, and and . From CIFAR10 to CIFAR100 and ImageNet, the class imbalance issue becomes more severe. We expect the sample inefficiency issue of to become more critical, and therefore it is advisable to use for bin optimization.
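The comparison metric used in this analysis, the square root of the JSD between two histogram density estimates, can be sketched as follows. This is an illustrative implementation, not the evaluation code behind Fig. S1; the function names, the base-2 logarithm and the shared bin edges are assumptions.

```python
import math


def histogram(samples, edges):
    """Normalized histogram over bins given by sorted edges (last bin closed).

    Assumes at least one sample falls inside [edges[0], edges[-1]].
    """
    counts = [0] * (len(edges) - 1)
    for x in samples:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= x < edges[i + 1] or (is_last and x == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts]


def kld(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sqrt_jsd(p, q):
    """Square root of the Jensen-Shannon divergence between two histograms."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return math.sqrt(max(0.0, 0.5 * kld(p, m) + 0.5 * kld(q, m)))
```

With base-2 logarithms the value lies in [0, 1], which makes results comparable across classes and sample sizes.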

S3 Proof of Theorem 1

Theorem 1.

The mutual information (MI) maximization problem given as follows:


is equivalent to


where the loss is defined as


As a set of real-valued auxiliary variables, are introduced here to ease the optimization.

Proof. Before starting our proof, we note that the upper-case $P$ indicates probability mass functions of discrete random variables, e.g., the label and the bin interval index, whereas the lower-case $p$ is reserved for probability density functions of continuous random variables, e.g., the raw logit.


The key to prove the equivalence is to show the inequality


and the equality is attainable by minimizing over .

By the definition of MI, we firstly expand as


where the conditional distribution is given as


From the above expression, we note that MI maximization effectively depends only on the bin edges . The bin representatives can be arbitrary as long as they can indicate the condition . So, the bin interval index is sufficient to serve the role in conditioning the probability mass function of , i.e., . After optimizing the bin edges, we have the freedom to set the bin representatives for the sake of post-hoc calibration.

Next, based on the MI expression, we compute its sum with


The equality is based on


From the equality to , it is simply because of identifying the term in of the equality

as the Kullback-Leibler divergence (KLD) between two probability mass functions of

. As the probability mass function and the KLD are both non-negative, we arrive at the inequality at , where the equality holds if . By further noting that is convex over and nulls out its gradient over , we then arrive at


The obtained equality then concludes our proof


S3.1 Convergence of the Iterative Method

For convenience, we recall the update equations for in Sec. 3.3 of the main paper here


In the following, we show that the updates on and according to (S12) continuously decrease the loss , i.e.,


The second inequality is based on the explained property of . Namely, it is convex over and the minimum for any given is attained by . As is the log-probability ratio of , we shall have


where in this case is induced by and . Plugging and into (S7), the resulting at the iteration yields the update equation of as given in (S12).

To prove the first inequality, we start from showing that is a local minimum of . The update equation on is an outcome of solving the stationary point equation of over under the condition for any


Being a stationary point is a necessary condition for a local extremum when the function’s first-order derivative exists at that point, i.e., the first-derivative test. To further show that the local extremum is actually a local minimum, we resort to the second-derivative test, i.e., we check whether the Hessian matrix of is positive definite at the stationary point . Due to with the monotonically increasing function in its update equation, we have


implying that all eigenvalues of the Hessian matrix are positive (equivalently, is positive definite). Therefore,

as the stationary point of is a local minimum.

It is important to note that from the stationary point equation (S15), as a local minimum is unique among with for any . In other words, the first inequality holds under the condition for any . Binning is a lossy data processing step. In order to maximally preserve the label information, it is natural to exploit all bins in the optimization, not wasting any single bin in the area without mass, i.e., . That said, it is reasonable to constrain with over iterations, thereby concluding that the iterative method will converge to a local minimum based on the two inequalities (S13).

S3.2 Initialization of the Iterative Method

We propose to initialize the iterative method by modifying the k-means++ algorithm kmeans++, which was developed to initialize the cluster centers for k-means clustering algorithms. It is based on the following identification


As is a constant with respect to , minimizing is equivalent to minimizing the term on the RHS of (S17). The last approximation is reached by turning the binning problem into a clustering problem, i.e., grouping the logit samples in the training set according to the KLD measure, where are effectively the centers of each cluster. The k-means++ algorithm kmeans++ initializes the cluster centers based on the Euclidean distance. In our case, we instead use the JSD as the distance measure to initialize . Compared with KLD, JSD is symmetric and bounded.
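A k-means++ style seeding with JSD in place of the Euclidean distance could look like the sketch below. It is an illustration under several assumptions that are not taken from the text: each logit is mapped to a Bernoulli distribution through a sigmoid, the sampling weight is the JSD to the nearest already chosen center (rather than a squared distance), and the names `bernoulli_jsd` and `jsd_seeding` are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bernoulli_jsd(a, b):
    """JSD in bits between Bernoulli(sigmoid(a)) and Bernoulli(sigmoid(b))."""
    def kld(p, q):
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    p = [sigmoid(a), 1.0 - sigmoid(a)]
    q = [sigmoid(b), 1.0 - sigmoid(b)]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return max(0.0, 0.5 * kld(p, m) + 0.5 * kld(q, m))


def jsd_seeding(logits, num_centers, rng):
    """Pick initial centers; far-away logits (in JSD) are more likely chosen.

    Assumes the logits contain more than num_centers distinct values.
    """
    centers = [rng.choice(logits)]
    while len(centers) < num_centers:
        # Distance of every sample to its nearest already chosen center.
        weights = [min(bernoulli_jsd(x, c) for c in centers) for x in logits]
        centers.append(rng.choices(logits, weights=weights, k=1)[0])
    return sorted(centers)


# Hypothetical logit samples, for illustration only.
example_logits = [-6.0, -5.5, -1.0, 0.0, 0.5, 4.0, 5.0]
example_centers = jsd_seeding(example_logits, 3, random.Random(0))
```

Because JSD is bounded, a single extreme logit cannot dominate the sampling weights the way it could with an unbounded KLD.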

S3.3 A Remark on the Iterative Method Derivation

The closed-form update on in (S12) is based on the sigmoid-model approximation, which has been validated through our empirical experiments. It is expected to work with properly trained classifiers that do not overly overfit to the cross-entropy loss, e.g., by using data augmentation and other regularization techniques at training. Nevertheless, even in corner cases where classifiers are poorly trained, the iterative method can still be operated without the sigmoid-model approximation. Namely, as shown in Fig. 3 of the main paper, we can resort to KDE for an empirical estimation of the ground truth distribution . Using the KDEs, we can compute the gradient of over and perform iterative gradient based update on , replacing the closed-form based update. Essentially, the sigmoid-model approximation is only necessary to find the stationary points of the gradient equations, speeding up the convergence of the method. If attempting to keep the closed-form update on , an alternative solution could be to use the KDEs for adjusting the sigmoid-model, e.g., , where and are chosen to match the KDE based approximation to . After setting and , they will be used as a scaling and bias term in the original closed-form update equations


S4 Post-hoc Analysis on the Experiment Results in Sec. 4.1

In Tab. 1 of Sec. 4.1, we compared three different binning schemes by measuring their ACCs and ECEs. The observation on their accuracy performance is aligned with our mutual information maximization viewpoint introduced in Sec. 3.3 and Fig. 3. Here, we re-present Fig. 3 and provide an alternative explanation to strengthen our understanding of how the location of bin edges affects the accuracy, e.g., why Eq. Size binning performed acceptably at the top-1 ACC, but failed at the top-5 ACC. Specifically, Fig. S2 shows the histograms of raw logits that are grouped based on their ranks instead of their labels as in Fig. 3. As expected, the logits with low ranks (i.e., below top- in Fig. S2) are small and thus take the left hand side of the plot, whereas the top-1 logits are mostly located on the right hand side. Besides sorting logits according to their ranks, we additionally estimate the density of the logits associated with the ground truth (GT) classes (i.e., in Fig. S2). With a properly trained classifier, the histogram of top-1 logits shall largely overlap with the density curve , i.e., the top-1 prediction is correct in most cases.

Figure S2: Histogram of CIFAR100 (WRN) logits in constructed from k calibration samples, using the same setting as Fig. 3 in the main paper. Instead of categorizing the logits according to their two-class label as in Fig. 3, here we sort them according to their ranks given by the CIFAR100 WRN classifier. As a baseline, we also plot the KDE of logits associated to the ground truth classes, i.e., .

From the bin edge location of Eq. Mass binning, it attempts to attain small quantization errors for logits of low ranks rather than top-. This will certainly degrade the accuracy performance after binning. On the contrary, Eq. Size binning aims at small quantization error for the top-1 logits, but ignores top-5 ones. As a result, we observed its poor top-5 ACCs. I-Max binning nicely distributes its bin edges in the area where the GT logits are likely to be located, and the bin width becomes smaller in the area where the top- logits are close by (i.e., the overlap region between the red and blue histograms). Note that any logit larger than zero must be top-1 ranked, as there can exist at most one class with prediction probability larger than 0.5. Given that, the bins located above zero no longer serve to maintain the ranking order, but rather to reduce the precision loss of the top-1 prediction probability after binning.

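| Binn. | Top-1 ACC (class index) | Top-1 ACC (raw logits) | Top-5 ACC (class index) | Top-5 ACC (raw logits) | Top-1 ECE (class index) | Top-1 ECE (raw logits) | NLL |
|---|---|---|---|---|---|---|---|
| Baseline | 80.33 | - | 95.10 | - | 0.0357 | - | 0.8406 |
| Eq. Mass | 5.02 | 80.33 | 26.75 | 95.10 | 0.0353 | 0.7884 | 3.5272 |
| Eq. Size | 80.14 | 80.21 | 88.99 | 95.10 | 0.0279 | 0.0277 | 1.2671 |
| I-Max | 80.20 | 80.33 | 94.86 | 95.10 | 0.0200 | 0.0202 | 0.7860 |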
Table S1: Comparison of sCW binning methods in the case of ImageNet - InceptionResNetV2. As sCW binning creates ties at top predictions, the ACCs initially reported in Tab. 1 of Sec. 4.1 use the class index as the secondary sorting criterion. Here, we add and which are attained by using the raw logits as the secondary sorting criterion. As the CW ECEs are not affected by this change, here we only report the new .

The second part of our post-hoc analysis is on the sCW binning strategy. When using the same binning scheme for all per-class calibration, the chance of creating ties in top- predictions is much higher than with CW binning, e.g., more than one class is top-1 ranked according to the calibrated prediction probabilities. Our reported ACCs in the main paper are attained by simply returning the first found class, i.e., using the class index as the secondary sorting criterion. This is certainly a suboptimal solution. Here, we investigate how the ties affect ACCs of sCW binning. To this end, we use raw logits (before binning) as the secondary sorting criterion. The resulting and are shown in Tab. S1. Interestingly, such a simple change reduces the accuracy loss of Eq. Mass and I-Max binning to zero, indicating that they can preserve the top- ranking order of the raw logits but not in a strict monotonic sense, i.e., some are replaced by . As opposed to I-Max binning, Eq. Mass binning performs poorly at calibration, i.e., it has very high NLL and ECE. This is because it trivially ranks many classes as top-1, but each of them has the same very small confidence score. Given that, even though the accuracy loss is no longer an issue, it is still not a good solution for multi-class calibration. For Eq. Size binning, resolving ties only helps restore the baseline top-5 but not top-1 ACC. Its poor bin representative setting due to unreliable empirical frequency estimation over too narrow bins can result in a permutation among the top- predictions.
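The two tie-breaking rules discussed above can be illustrated with a small sketch. The function name and the toy values are hypothetical; the point is only that classes falling into the same bin share one calibrated probability, so a secondary sorting criterion decides the top-1 class.

```python
def top1_with_ties(calibrated, raw_logits, use_logits):
    """Top-1 class under a primary key (calibrated probability) and a
    secondary key: raw logit if use_logits, else lowest class index."""
    if use_logits:
        key = lambda k: (calibrated[k], raw_logits[k])
    else:
        key = lambda k: (calibrated[k], -k)  # lowest index wins ties
    return max(range(len(calibrated)), key=key)


# Classes 0 and 1 fall into the same bin and share one representative.
calibrated = [0.8, 0.8, 0.1]
raw_logits = [1.2, 2.5, -3.0]

by_index = top1_with_ties(calibrated, raw_logits, use_logits=False)
by_logit = top1_with_ties(calibrated, raw_logits, use_logits=True)
```

In this toy case the class-index rule returns class 0, whereas the raw-logit rule recovers class 1, the top-1 class before binning.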

Concluding from the above, our post-hoc analysis confirms that I-Max binning outperforms the other two binning schemes at mitigating the accuracy loss and multi-class calibration. In particular, there exists a simple solution to close the accuracy gap to the baseline, at the same time still retaining the desirable calibration gains.

S5 Ablation on the Number of Bins and Calibration Set Size

In Tab. 1 of Sec. 4.1, sCW I-Max binning is the top performing one at the ACCs, ECEs and NLL measures. In this part, we further investigate how the number of bins and calibration set size influence its performance. Tab. S2 shows that in order to benefit from more bins we shall accordingly increase the number of calibration samples. More bins help reduce the quantization loss, but increase the empirical frequency estimation error for setting the bin representatives. Given that, we observe reduced ACCs and increased ECEs for having bins with only k calibration samples. By increasing the calibration set size to k, we start seeing the benefits of having more bins to reduce quantization error for better ACCs. Next, we further exploit a scaling method, i.e., GP wenger2020calibration, for improving the sample efficiency of binning at setting the bin representatives. As a result, the combination is particularly beneficial to improve the ACCs and top-1 ECE. Overall, more bins are beneficial to ACCs, while ECEs favor fewer bins.

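| Binn. | Bins | 1k: Top-1 ACC | 1k: Top-5 ACC | 1k: CW ECE | 1k: Top-1 ECE | 5k: Top-1 ACC | 5k: Top-5 ACC | 5k: CW ECE | 5k: Top-1 ECE |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | - | 80.33 | 95.10 | 0.0486 | 0.0357 | 80.33 | 95.10 | 0.0486 | 0.0357 |
| GP | - | 80.33 | 95.11 | 0.0485 | 0.0186 | 80.33 | 95.11 | 0.0445 | 0.0177 |
| I-Max | 10 | 80.09 | 94.59 | 0.0316 | 0.0156 | 80.14 | 94.59 | 0.0330 | 0.0107 |
| I-Max | 15 | 80.20 | 94.86 | 0.0302 | 0.0200 | 80.21 | 94.90 | 0.0257 | 0.0107 |
| I-Max | 20 | 80.10 | 94.94 | 0.0266 | 0.0234 | 80.25 | 94.98 | 0.0220 | 0.0133 |
| I-Max | 30 | 80.15 | 94.99 | 0.0343 | 0.0266 | 80.25 | 95.02 | 0.0310 | 0.0150 |
| I-Max | 40 | 80.11 | 95.05 | 0.0365 | 0.0289 | 80.24 | 95.08 | 0.0374 | 0.0171 |
| I-Max | 50 | 80.21 | 94.95 | 0.0411 | 0.0320 | 80.23 | 95.06 | 0.0378 | 0.0219 |
| I-Max w. GP | 10 | 80.09 | 94.59 | 0.0396 | 0.0122 | 80.14 | 94.59 | 0.0330 | 0.0072 |
| I-Max w. GP | 15 | 80.20 | 94.87 | 0.0300 | 0.0121 | 80.21 | 94.88 | 0.0256 | 0.0080 |
| I-Max w. GP | 20 | 80.23 | 94.95 | 0.0370 | 0.0133 | 80.25 | 95.00 | 0.0270 | 0.0091 |
| I-Max w. GP | 30 | 80.26 | 95.04 | 0.0383 | 0.0141 | 80.27 | 95.02 | 0.0389 | 0.0097 |
| I-Max w. GP | 40 | 80.27 | 95.11 | 0.0424 | 0.0145 | 80.26 | 95.08 | 0.0402 | 0.0108 |
| I-Max w. GP | 50 | 80.30 | 95.08 | 0.0427 | 0.0153 | 80.28 | 95.08 | 0.0405 | 0.0114 |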
Table S2: Ablation on the number of bins and calibration samples for sCW I-Max binning, where the basic setting is identical to the Tab. 1 in Sec. 4.1 of the main paper.

S6 Empirical ECE Estimation of Scaling Methods under Multiple Evaluation Schemes

As mentioned in the main paper, scaling methods suffer from not being able to provide verifiable ECEs, see Fig. 1. In Tab. S3 we adopt four different quantization schemes to empirically evaluate the ECEs of scaling methods. As we can see, the obtained results depend on the evaluation scheme. On the contrary, I-Max binning with and without GP is not affected, and more importantly, its ECEs are better than those of scaling methods, regardless of the evaluation scheme.

| Calibrator | ACC | CW ECE (eq. size) | CW ECE (eq. mass) | CW ECE (KMeans) | CW ECE (I-Max) | Mean |
|---|---|---|---|---|---|---|
| Baseline | 80.33 | 0.0486 ± 0.0003 | 0.0459 ± 0.0004 | 0.0484 ± 0.0004 | 0.0521 ± 0.0004 | 0.0488 |
| 25k Calibration Samples | | | | | | |
| BBQ (Naeini2015) | 53.89 | 0.0287 ± 0.0009 | 0.0376 ± 0.0014 | 0.0372 ± 0.0014 | 0.0316 ± 0.0008 | 0.0338 |
| Beta pmlr-v54-kull17a | 80.47 | 0.0706 ± 0.0003 | 0.0723 ± 0.0005 | 0.0742 ± 0.0005 | 0.0755 ± 0.0004 | 0.0732 |
| Isotonic Reg. Zadrozny2002 | 80.08 | 0.0644 ± 0.0015 | 0.0646 ± 0.0015 | 0.0652 ± 0.0016 | 0.0655 ± 0.0015 | 0.0649 |
| Platt Platt1999 | 80.48 | 0.0597 ± 0.0007 | 0.0593 ± 0.0008 | 0.0613 ± 0.0008 | 0.0634 ± 0.0008 | 0.0609 |
| Vec Scal. Kull2019 | 80.53 | 0.0494 ± 0.0002 | 0.0472 ± 0.0004 | 0.0498 ± 0.0003 | 0.0531 ± 0.0003 | 0.0499 |
| Mtx Scal. Kull2019 | 80.78 | 0.0508 ± 0.0003 | 0.0488 ± 0.0004 | 0.0512 ± 0.0005 | 0.0544 ± 0.0004 | 0.0513 |
| 1k Calibration Samples | | | | | | |
| TS pmlr-v70-guo17a | 80.33 | 0.0559 ± 0.0015 | 0.0548 ± 0.0018 | 0.0573 ± 0.0017 | 0.0598 ± 0.0015 | 0.0570 |
| GP wenger2020calibration | 80.33 | 0.0485 ± 0.0037 | 0.0450 ± 0.0040 | 0.0475 ± 0.0039 | 0.0520 ± 0.0038 | 0.0483 |
| I-Max | 80.20 | 0.0302 ± 0.0041 (all schemes) | | | | |
| I-Max w. GP | 80.20 | 0.0300 ± 0.0041 (all schemes) | | | | |

| Calibrator | ACC | Top-1 ECE (eq. size) | Top-1 ECE (eq. mass) | Top-1 ECE (KMeans) | Top-1 ECE (I-Max) | Mean |
|---|---|---|---|---|---|---|
| Baseline | 80.33 | 0.0357 ± 0.0010 | 0.0345 ± 0.0010 | 0.0348 ± 0.0012 | 0.0352 ± 0.0016 | 0.0351 |
| 25k Calibration Samples | | | | | | |
| BBQ (Naeini2015) | 53.89 | 0.2689 ± 0.0033 | 0.2690 ± 0.0034 | 0.2690 ± 0.0034 | 0.2689 ± 0.0032 | 0.2689 |
| Beta pmlr-v54-kull17a | 80.47 | 0.0346 ± 0.0022 | 0.0360 ± 0.0017 | 0.0360 ± 0.0022 | 0.0357 ± 0.0019 | 0.0356 |
| Isotonic Reg. Zadrozny2002 | 80.08 | 0.0468 ± 0.0020 | 0.0434 ± 0.0019 | 0.0436 ± 0.0020 | 0.0468 ± 0.0015 | 0.0452 |
| Platt Platt1999 | 80.48 | 0.0775 ± 0.0015 | 0.0772 ± 0.0015 | 0.0771 ± 0.0016 | 0.0773 ± 0.0014 | 0.0772 |
| Vec Scal. Kull2019 | 80.53 | 0.0300 ± 0.0010 | 0.0298 ± 0.0012 | 0.0300 ± 0.0016 | 0.0303 ± 0.0011 | 0.0300 |
| Mtx Scal. Kull2019 | 80.78 | 0.0282 ± 0.0014 | 0.0287 ± 0.0011 | 0.0286 ± 0.0014 | 0.0289 ± 0.0014 | 0.0286 |
| 1k Calibration Samples | | | | | | |
| TS pmlr-v70-guo17a | 80.33 | 0.0439 ± 0.0022 | 0.0452 ± 0.0022 | 0.0454 ± 0.0020 | 0.0443 ± 0.0020 | 0.0447 |
| GP wenger2020calibration | 80.33 | 0.0186 ± 0.0034 | 0.0182 ± 0.0019 | 0.0186 ± 0.0026 | 0.0190 ± 0.0022 | 0.0186 |
| I-Max | 80.20 | 0.0200 ± 0.0033 (all schemes) | | | | |
| I-Max w. GP | 80.20 | 0.0121 ± 0.0048 (all schemes) | | | | |
Table S3: ECEs of scaling methods under various evaluation quantization schemes for ImageNet InceptionResNetV2. Overall, we consider four evaluation schemes, namely (1) : equal size binning; (2) : equal mass binning, (3) : MSE-based KMeans clustering and (4) : I-Max binning. All of them use bins. Note that, the ECEs of I-Max binning (as a calibrator rather than evaluation scheme) are agnostic to the evaluation scheme, i.e., offering verifiable ECEs. Furthermore, BBQ suffers from severe accuracy degradation.
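The dependence of an empirical ECE on the evaluation quantizer can be reproduced with a toy sketch. Only the first two schemes (equal size and equal mass) are shown; the function names and the six toy samples are hypothetical and unrelated to the numbers in Tab. S3.

```python
def ece(confidences, correct, bins):
    """Expected calibration error for a given grouping of sample indices."""
    n = len(confidences)
    total = 0.0
    for idx in bins:
        if not idx:
            continue  # empty bins carry no mass
        conf = sum(confidences[i] for i in idx) / len(idx)
        acc = sum(correct[i] for i in idx) / len(idx)
        total += len(idx) / n * abs(acc - conf)
    return total


def equal_size_bins(confidences, num_bins):
    """Group sample indices by equal-width intervals on [0, 1]."""
    bins = [[] for _ in range(num_bins)]
    for i, c in enumerate(confidences):
        bins[min(int(c * num_bins), num_bins - 1)].append(i)
    return bins


def equal_mass_bins(confidences, num_bins):
    """Group sample indices so that each bin holds about the same count."""
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    n = len(order)
    return [order[b * n // num_bins:(b + 1) * n // num_bins]
            for b in range(num_bins)]


# Toy confidences from a continuous-valued (scaling-type) calibrator.
conf = [0.55, 0.60, 0.65, 0.70, 0.90, 0.95]
hit = [1, 0, 1, 0, 1, 1]

ece_size = ece(conf, hit, equal_size_bins(conf, 2))
ece_mass = ece(conf, hit, equal_mass_bins(conf, 2))
```

The same predictions yield two different ECE estimates, which is the effect reported for scaling methods. A binning calibrator outputs only a finite set of values, so grouping samples by output value needs no additional quantizer.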

S7 Training Details

S7.1 Pre-trained Classification Networks

We evaluate post-hoc calibration methods on three benchmark datasets, i.e., ImageNet imagenet_cvpr09, CIFAR-100 cifar, and CIFAR-10 cifar, and across three modern DNNs for each dataset, i.e., InceptionResNetV2 inceptionresnetv2, DenseNet161 dnpaper and ResNet152 He2016DeepRL for ImageNet, and Wide ResNet (WRN) wrnpaper, DenseNet-BC (, dnpaper and ResNext8x64 resnextpaper for the two CIFAR datasets.

All these models are publicly available pre-trained networks and details are reported at the respective websites, i.e., ImageNet classifiers: and CIFAR classifiers:

S7.2 Training Scaling Methods

The hyper-parameters were chosen based on the respective original publications of the scaling methods, with some exceptions. We found that the following parameters were the best for all the scaling methods. All scaling methods use the Adam optimizer with batch size for CIFAR and for ImageNet. The learning rate was set to for temperature scaling pmlr-v70-guo17a and Platt scaling Platt1999, for vector scaling pmlr-v70-guo17a and for matrix scaling pmlr-v70-guo17a. Matrix scaling was further regularized as suggested by Kull2019 with a loss on the bias vector and the off-diagonal elements of the weighting matrix. BBQ Naeini2015, isotonic regression Zadrozny2002 and Beta pmlr-v54-kull17a hyper-parameters were taken directly from wenger2020calibration.

S7.3 Training I-Max Binning

The I-Max bin optimization started from k-means++ initialization, which uses JSD instead of Euclidean metric as the distance measure, see Sec. S3.2. Then, we iteratively and alternatively updated and according to (S12) until iterations. With the attained bin edges , we set the bin representatives based on the empirical frequency of class-. If a scaling method is combined with binning, an alternative setting for is to take the averaged prediction probabilities based on the scaled logits of the samples per bin, e.g., in Tab. 2 of Sec. 4.2. Note that, for CW binning in Tab. 1, the number of samples from the minority class is too few, i.e., . We only have about samples per bin, which are too few to use empirical frequency estimates. Alternatively, we set based on the raw prediction probabilities.
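Setting bin representatives from empirical frequencies, once the bin edges are fixed, can be sketched as follows. This is a minimal illustration that assumes binary one-vs-rest labels in {0, 1} and sorted interior bin edges; the function names and the toy data are hypothetical.

```python
import bisect


def bin_index(logit, inner_edges):
    """Index of the bin containing logit; inner_edges are sorted interior
    edges, so len(inner_edges) + 1 bins cover the whole real line."""
    return bisect.bisect_right(inner_edges, logit)


def empirical_representatives(logits, labels, inner_edges):
    """Per-bin empirical frequency of the positive one-vs-rest label.

    Returns None for bins that received no calibration sample.
    """
    num_bins = len(inner_edges) + 1
    positives = [0] * num_bins
    counts = [0] * num_bins
    for x, y in zip(logits, labels):
        m = bin_index(x, inner_edges)
        counts[m] += 1
        positives[m] += y
    return [positives[m] / counts[m] if counts[m] else None
            for m in range(num_bins)]


# Toy calibration set: logits with binary one-vs-rest labels.
logits = [-5.0, -4.0, -3.0, -1.0, 0.5, 2.0, 3.0]
labels = [0, 0, 0, 1, 0, 1, 1]
reps = empirical_representatives(logits, labels, inner_edges=[-2.0, 1.0])
```

With only a handful of samples per bin these frequencies become unreliable, which is the situation described above for CW binning.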

S8 Extension of Tab. 1 to More Datasets and Models

Tab. 1 from Sec. 4.1 of the main paper is replicated across datasets and models, where the basic setting remains the same. Specifically, three different ImageNet models can be found in Tab. S4, Tab. S5 and Tab. S6. Three models for CIFAR100 can be found in Tab. S7, Tab. S8 and Tab. S9. Similarly, CIFAR10 models can be found in Tab. S10, Tab.