## 1 Introduction

Despite great ability in learning discriminative features, deep neural network (DNN) classifiers often make over-confident predictions. This can lead to potentially catastrophic consequences in safety critical applications, e.g., medical diagnosis and perception tasks in autonomous driving. A perfectly calibrated classifier should be correct $100p\%$ of the time on cases to which it assigns confidence $p$, and any mismatch can be measured through the Expected Calibration Error (ECE).

Since the pioneering work of pmlr-v70-guo17a, scaling methods have been widely acknowledged as an efficient post-hoc calibration solution for modern DNNs. Recently, Kumar2019; wenger2020calibration showed that the ECE metric pmlr-v70-guo17a is underestimated, and more critically, that it fails to provide a verifiable measure for *scaling methods*, owing to the bias/variance tradeoff in the evaluation error. To estimate ECE from a set of evaluation samples, the prediction probabilities made by scaling methods need to be quantized/binned; this discrete approximation of a continuous function introduces a bias into the empirical estimate of ECE. Increasing the number of evaluation bins reduces this bias, as the quantization error is smaller; however, the estimation of the ground truth correctness then begins to suffer from high variance. Fig. 1-a) shows that the empirical ECE estimates of both the raw network outputs and the temperature scaling method (TS) pmlr-v70-guo17a vary with the number of evaluation bins. It is as yet unclear how to choose the optimal number of bins so that the estimation error is minimized for the given number of evaluation samples. In other words, which ECE estimate best reflects the true calibration error remains an open question, rendering the empirical ECE estimation of *scaling methods* non-verifiable.
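The dependence of the empirical ECE on the number of evaluation bins can be reproduced on synthetic data. The following is a minimal sketch, not the evaluation code behind Fig. 1: the function name `top1_ece`, the equal-width binning, and the synthetic over-confident classifier (true accuracy set to the squared confidence) are all assumptions made for illustration.

```python
import random


def top1_ece(confidences, correct, n_bins):
    """Empirical ECE with equal-width bins over [0, 1]."""
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    for c, ok in zip(confidences, correct):
        b = min(int(c * n_bins), n_bins - 1)  # bin index of this confidence
        conf_sum[b] += c
        hit_sum[b] += ok
    # sum_b (n_b / n) * |acc_b - conf_b| = sum_b |hits_b - conf_sum_b| / n
    return sum(abs(h - c) for h, c in zip(hit_sum, conf_sum)) / len(confidences)


random.seed(0)
# Hypothetical over-confident classifier: accuracy is conf ** 2 (assumption).
conf = [random.uniform(0.5, 1.0) for _ in range(2000)]
hit = [1 if random.random() < c ** 2 else 0 for c in conf]

# Nested refinements of the same partition: the estimate can only grow.
ece_by_bins = {b: top1_ece(conf, hit, b) for b in (5, 20, 100, 500)}
for b, value in ece_by_bins.items():
    print(b, round(value, 4))
```

Because each finer partition refines the coarser one, the estimate is non-decreasing in the number of bins for a fixed sample set, which illustrates why no single bin count can be trusted without further analysis.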

Histogram binning (HB) was also exploited by pmlr-v70-guo17a for post-hoc calibration. The reported calibration performance was much worse than scaling methods, and it additionally suffered from severe accuracy loss; nonetheless, HB offers verifiable ECEs Kumar2019. In contrast to scaling methods, it produces discrete prediction probabilities, averting quantization at evaluation, and thus its empirical ECE estimate is constant and irrelevant to the number of evaluation bins in the example of Fig. 1-a). Without bias induced by evaluation quantization, its ECE estimate can converge as the variance vanishes with an increased number of evaluation samples, see Fig. 1-b).

Aiming at verifiable ECE evaluation, we propose a novel binning scheme, I-Max binning, that is an outcome of bin optimization via mutual information (MI) maximization. The ranking performance (and accuracy) of a trained classifier depends on how well the ground truth label can be retrieved from the classifier’s output logits. However, binning these logits inherently suffers from information loss. By design, I-Max binning aims to maximally retain the label information given the available number of bins, and also allows the two design objectives, calibration and accuracy, to be nicely disentangled. We evaluate two common variants of HB, i.e., Equal (Eq.) size (uniformly partitioning the probability interval $[0,1]$) and Eq. mass (uniformly distributing training samples over bins) binning, and explain their accuracy loss through sub-optimal performance at label-information preservation. To cope with a limited number of calibration samples, we propose a novel, sample-efficient strategy that fits one binning scheme for all per-class calibrations, which we call shared class-wise (sCW) I-Max binning.

I-Max binning is evaluated according to multiple performance metrics, including accuracy, ECE (class-wise and top-1 prediction), negative log-likelihood, and Brier score, and compared with benchmark calibration methods across multiple datasets and trained classifiers.

Compared to the baseline, I-Max obtains up to a 38.27% (49.78%) reduction in class-wise ECE and 66.11% (76.07%) reduction in top1-ECE for ImageNet (CIFAR100). Similarly, compared to the state-of-the-art scaling method GP wenger2020calibration and using its evaluation framework, we obtain a 38.14% (47.95%) reduction in class-wise ECE and 34.95% (50.00%) reduction in top1-ECE for ImageNet (CIFAR100); all of these figures follow from the entries of Tab. 2. We additionally extend our sCW calibration idea to other calibration methods, as well as combine I-Max binning with scaling methods for further improved performance.

## 2 Related Work

Confidence calibration is an active research topic in deep learning. Bayesian DNNs pmlr-v37-blundell15; Kingma2015; pmlr-v70-louizos17a and their approximations DropoutGalG16; EnsemblesNIPS2017 are resource-demanding methods which consider predictive model uncertainty. However, applications with limited complexity overhead and latency require sampling-free and single-model based calibration methods. Examples include modifying the training loss pmlr-v80-kumar18a, scalable Gaussian processes Milios2018, sampling-free uncertainty estimation Postels_2019_ICCV, data augmentation patel2019onmanifold; OnMixupTrainThul; yun2019cutmix; hendrycks2020augmix and ensemble distribution distillation Malinin2020Ensemble. In comparison, a simple approach that requires no retraining of the models is post-hoc calibration pmlr-v70-guo17a.

Prediction probabilities (logits) scaling and binning are the two main solutions for post-hoc calibration. Scaling methods use parametric or non-parametric models to adjust the raw logits. pmlr-v70-guo17a investigated linear models, ranging from the single-parameter based temperature scaling to more complicated vector/matrix scaling. To avoid overfitting, Kull2019 suggested regularizing matrix scaling with a loss on the model weights. Recently, wenger2020calibration adopted a latent Gaussian process for multi-class calibration.

As an alternative to scaling, HB quantizes the raw confidences with either Eq. size or Eq. mass bins Zadrozny2001. Isotonic regression Zadrozny2002 and Bayesian binning into quantiles (BBQ) Naeini2015 are often viewed as binning methods. However, from the verifiable ECE viewpoint, they are scaling methods: though isotonic regression fits a piecewise linear function, its predictions are continuous as they are interpolated for unseen data. BBQ considers multiple binning schemes with different numbers of bins, and combines them using a continuous Bayesian score, resulting in continuous predictions.

Based on empirical results pmlr-v70-guo17a concluded that HB underperforms compared to scaling. However, one critical issue of scaling methods discovered in the recent work Kumar2019 is their non-verifiable ECE under sample-based evaluation. As HB offers verifiable ECE evaluation, Kumar2019 proposed to apply it after scaling. In this work, we improve the current HB design by casting bin optimization into a MI maximization problem. The resulting I-Max binning is superior to existing scaling methods on multiple evaluation criteria. Furthermore, our findings can also be used to improve scaling methods.

## 3 Method

Here we introduce the I-Max binning scheme, which addresses the issues of HB in terms of preserving label-information in multi-class calibration. After the problem setup in Sec. 3.1, Sec. 3.2 presents a sample-efficient technique for one-vs-rest calibration. In Sec. 3.3 we formulate the training objective of binning as MI maximization and derive a simple algorithm for I-Max binning.

### 3.1 Problem Setup

We address supervised multi-class classification tasks, where each input $x$ belongs to one of $K$ classes, and the ground truth labels are one-hot encoded, i.e., $\mathbf{y} = [y_1, \dots, y_K] \in \{0,1\}^K$. Let $f$ be a DNN trained using cross-entropy loss. $f$ maps each $x$ onto a logit vector $\mathbf{z} \in \mathbb{R}^K$, and its softmax responses $\mathbf{q} = [q_1, \dots, q_K]$ are used to rank the $K$ possible classes of the current instance, i.e., $\arg\max_k q_k$ being the top-1 ranked class. The ranking performance of $\mathbf{q}$ determines the classifier’s top-1 prediction accuracy. Besides ranking, $\mathbf{q}$ is frequently interpreted as an approximation to the ground truth prediction distribution. However, as the trained classifier tends to overfit to the cross-entropy loss rather than the accuracy (i.e., the 0/1 loss), $\mathbf{q}$ is typically poorly calibrated when read as prediction probabilities. ECE averaged over the $K$ classes quantifies the multi-class calibration performance of $f$:

$$\mathrm{cwECE}(f) = \frac{1}{K}\sum_{k=1}^{K} \mathbb{E}_{q_k}\Big[\big|\,p(y_k = 1 \mid q_k) - q_k\,\big|\Big]. \tag{1}$$

It measures the expected deviation of the ground truth probability $p(y_k = 1 \mid q_k)$ from the predicted per-class confidence $q_k$. Multi-class post-hoc calibration uses a calibration set $\mathcal{C}$, independent from the classifier training set, in order to reduce $\mathrm{cwECE}$ by revising $q_k$, e.g. by scaling or binning.

For evaluation, the unknown ground truth $p(y_k = 1 \mid q_k)$ is empirically estimated from the labeled samples that receive the same prediction confidence $q_k$. Since continuous prediction probabilities are naturally non-repetitive, additional quantization is inevitable for evaluating scaling methods.
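A binned estimate of the class-wise ECE can be sketched as follows. This is an illustrative estimator under stated assumptions, not the paper's evaluation code: the function name `classwise_ece` and the use of equal-width evaluation bins are choices made here, and the per-class ground truth probability is replaced by the empirical frequency of positives in each bin.

```python
def classwise_ece(probs, labels, n_bins=10):
    """Binned class-wise ECE.

    probs:  list of per-sample probability vectors (length K each)
    labels: list of integer class indices
    """
    n, num_classes = len(probs), len(probs[0])
    total = 0.0
    for k in range(num_classes):
        conf_sum = [0.0] * n_bins  # sum of predicted q_k per bin
        pos_sum = [0.0] * n_bins   # number of samples with y_k = 1 per bin
        for q, y in zip(probs, labels):
            b = min(int(q[k] * n_bins), n_bins - 1)
            conf_sum[b] += q[k]
            pos_sum[b] += 1.0 if y == k else 0.0
        # per-class ECE: sum_b (n_b / n) * |freq_b - conf_b|
        total += sum(abs(p - c) for p, c in zip(pos_sum, conf_sum)) / n
    return total / num_classes


# Two samples, both predicted as class 0 with full confidence, one of them wrong.
print(classwise_ece([[1.0, 0.0], [1.0, 0.0]], [0, 1]))
```

In this toy case each class contributes an error of 0.5, so the average over the two classes is 0.5.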

### 3.2 One-vs-Rest Strategy for Multi-class Calibration

Histogram binning (HB) was initially developed for two-class calibration. When dealing with multi-class calibration, it separately calibrates each class in a *one-vs-rest* fashion: for any class-$k$ we use $y_k$ as the binary label for a two-class calibration task, in which class-$1$ means $y_k = 1$ and class-$0$ collects all other $K-1$ classes. We can use a scalar logit $\lambda_k$ to parameterize the raw prediction probability $q_k$ of $y_k = 1$ using the $\sigma$-model, i.e.,

$$q_k = \sigma(\lambda_k) = \frac{1}{1 + e^{-\lambda_k}}, \qquad \lambda_k = \log q_k - \log(1 - q_k). \tag{2}$$

HB then revises $q_k$ by mapping the raw logit $\lambda_k$ onto a given number of bins, and reproducing them with the empirical frequency of class-$1$ samples in each bin. The empirical frequencies are estimated from the class-$k$ training set $\mathcal{S}_k = \{(y_k, \lambda_k)\}$ that is constructed from the calibration set $\mathcal{C}$ under the one-vs-rest strategy.
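The two steps of HB for a single one-vs-rest task (assign each logit to a bin, then reproduce it with the empirical frequency of positives in that bin) can be sketched as below. The helper names `fit_hb` and `apply_hb` are hypothetical, the bin edges are taken as given, and returning `None` for an empty bin is an assumption of this sketch rather than something the text prescribes.

```python
import bisect


def fit_hb(logits, labels, edges):
    """Estimate one representative per bin as the frequency of label 1.

    edges: sorted inner bin edges in the logit domain; M = len(edges) + 1 bins,
           each bin being half-open, [lower edge, upper edge).
    """
    n_bins = len(edges) + 1
    pos = [0] * n_bins
    cnt = [0] * n_bins
    for lam, y in zip(logits, labels):
        m = bisect.bisect_right(edges, lam)  # index of the bin containing lam
        pos[m] += y
        cnt[m] += 1
    # Assumption: an empty bin has no estimate and is marked with None.
    return [pos[m] / cnt[m] if cnt[m] else None for m in range(n_bins)]


def apply_hb(lam, edges, reps):
    """Calibrated probability of a logit: the representative of its bin."""
    return reps[bisect.bisect_right(edges, lam)]


reps = fit_hb([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 0], edges=[0.0])
print(reps, apply_hb(0.3, [0.0], reps))
```

The output of `apply_hb` takes at most as many distinct values as there are bins, which is what makes the ECE of a binning method verifiable.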

It is important to note that the two-class ratio in $\mathcal{S}_k$ is $1$-to-$(K-1)$, so $\mathcal{S}_k$ can suffer from a severe class imbalance when $K$ is large. For ImageNet with $1$k classes, $\mathcal{S}_k$ constructed from a class-balanced $\mathcal{C}$ of size $25$k contains only $25$ class-$1$ samples, see Fig. 2-a). Too few class-$1$ samples will compromise the optimization of the calibration schemes. To address this, we propose to merge $\{\mathcal{S}_k\}$ across all $K$ classes, i.e., $\mathcal{S} = \bigcup_{k=1}^{K} \mathcal{S}_k$. The size of the resulting merged set $\mathcal{S}$ will be $K|\mathcal{C}|$ and it will contain $25$k class-$1$ samples ($K$ times more than in $\mathcal{S}_k$), see Fig. 2-b).

For bin optimization, both $\mathcal{S}_k$ and $\mathcal{S}$ serve as empirical approximations to the inaccessible ground truth distribution $p(\lambda_k \mid y_k)$. The former suffers from high variance, arising from insufficient samples, while the latter is biased due to having samples drawn from the other classes. As the calibration set size is usually small, the variance is expected to dominate the bias in the approximation error (see an empirical analysis in the appendix). Given that, we propose to use $\mathcal{S}$ for training, yielding one binning scheme shared by all $K$ two-class calibration tasks, i.e., *shared* class-wise (sCW) binning instead of CW binning respectively trained on $\{\mathcal{S}_k\}$.
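How the class-wise sets and the merged set are built, and how the number of positive samples changes, can be illustrated with a toy calibration set. The sketch below uses hypothetical helper names (`class_wise_sets`, `shared_set`), a made-up 4-class problem with two samples per class, and a clipping constant `eps` that is only there to keep the logit finite.

```python
import math


def logit(q, eps=1e-12):
    """Probability-to-logit translation, clipped for numerical safety (assumption)."""
    q = min(max(q, eps), 1.0 - eps)
    return math.log(q) - math.log(1.0 - q)


def class_wise_sets(probs, labels):
    """One-vs-rest conversion: one list of (binary label, scalar logit) pairs per class."""
    num_classes = len(probs[0])
    return [[(int(y == k), logit(q[k])) for q, y in zip(probs, labels)]
            for k in range(num_classes)]


def shared_set(sets):
    """Merge all class-wise sets into one training set for a shared binning scheme."""
    return [pair for s in sets for pair in s]


K = 4
labels = [k for k in range(K) for _ in range(2)]  # class-balanced, 2 samples per class
probs = [[0.7 if k == y else 0.1 for k in range(K)] for y in labels]

S_k = class_wise_sets(probs, labels)
S = shared_set(S_k)
positives_per_class = [sum(y for y, _ in s) for s in S_k]
positives_shared = sum(y for y, _ in S)
print(positives_per_class, positives_shared, len(S))
```

Each class-wise set sees only 2 positives out of 8 samples, while the merged set holds all 8 positives among 32 pairs, mirroring the 25 versus 25k contrast described for ImageNet.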

### 3.3 Bin Optimization via Mutual Information (MI) Maximization

Binning can be viewed as a quantizer $Q$ that maps the real-valued logit $\lambda \in \mathbb{R}$ to the bin interval $\mathcal{I}_m = [g_m, g_{m+1})$ if $\lambda \in \mathcal{I}_m$, where $M$ is the total number of bin intervals, $m \in \{0, \dots, M-1\}$, and the bin edges $\{g_m\}$ are sorted ($g_m < g_{m+1}$, and $g_0 = -\infty$, $g_M = \infty$). Any logit binned to $\mathcal{I}_m$ will be reproduced to the same bin representative $r_m$. In the context of calibration, the bin representative $r_m$ assigned to the logit $\lambda_k$ is used as the calibrated prediction probability of the class-$k$. As multiple classes can be assigned the same bin representative, we will then encounter ties when making top-$k$ predictions based on calibrated probabilities. Therefore, binning as lossy quantization generally does not preserve the raw logit-based ranking performance, being subject to potential accuracy loss.

Unfortunately, increasing $M$ to reduce the quantization error is not a good solution here. For a given calibration set, the number of samples per bin generally reduces as $M$ increases, and a reliable frequency estimation for setting the bin representatives demands sufficient samples per bin.

Considering that the top-$k$ accuracy is an indicator of how well the ground truth label can be recovered from the logits, we propose bin optimization via maximizing the MI between the quantized logits $Q(\lambda)$ and the label $y$

$$\{g_m^*\} = \arg\max_{\{g_m\}} I\big(y; m = Q(\lambda)\big) \overset{(a)}{=} \arg\max_{\{g_m\}} \big[H(m) - H(m \mid y)\big], \tag{3}$$

where the index $m$ is viewed as a discrete random variable with $P(m \mid y) = \int_{g_m}^{g_{m+1}} p(\lambda \mid y)\,\mathrm{d}\lambda$ and $P(m) = \int_{g_m}^{g_{m+1}} p(\lambda)\,\mathrm{d}\lambda$. Such a formulation offers a quantizer optimal at preserving the label information for a given budget on the number of bins. Unlike designing distortion-based quantizers, the reproducer values of raw logits, i.e., the bin representatives $\{r_m\}$, are not a part of the optimization space, as it is sufficient to know the mapped bin index $m$ of each logit for preserving the label information. In other words, the bin edge and representative settings are disentangled in the MI maximization formulation: the former for preserving label information, and the latter for calibration.

It is interesting to note the equality $(a)$ in (3), which is based on the relation of MI to the entropy $H(m)$ and conditional entropy $H(m \mid y)$ of $m$. Specifically, the first term $H(m)$ is maximized if $P(m)$ is uniform, which is attained by Eq. mass binning. As a uniform sample distribution over the bins ensures a balanced empirical frequency estimation quality, it is a desirable choice for setting the bin representatives. However, for MI maximization, equally distributing the mass is insufficient due to the second term $H(m \mid y)$. More importantly, each bin interval should collect the $\lambda$ that actually encodes the label information, otherwise $H(m \mid y)$ will reduce to the entropy $H(m)$, and the MI will be zero.

As a result, we observe from Fig. 3 that the bin edges of I-Max binning are densely located in an area where the uncertainty of $y$ given the logit is high. This uncertainty results from small gaps between the top class predictions. With small bin widths, such nearby prediction logits are more likely to fall into different bins, and thus remain distinguishable after binning. On the other hand, Eq. mass binning has a single bin stretching across this high-uncertainty area due to the class imbalance (Fig. 3). Eq. size binning follows a pattern closer to I-Max binning. (Eq. size binning uniformly partitions the interval $[0,1]$ in the probability domain; the observed dense and symmetric bin location around zero is the outcome of the probability-to-logit translation.) However, its very narrow bin widths around zero may introduce large empirical frequency estimation errors when setting the bin representatives.
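The quantity being maximized can be estimated from labeled logits for any candidate set of bin edges. The sketch below computes the plug-in estimate of the MI between a binary label and the bin index, in nats; the function name `binned_mi` and the toy data are assumptions for illustration, and the plug-in estimator is a choice made here rather than the estimator used in the paper's experiments.

```python
import bisect
import math


def binned_mi(logits, labels, edges):
    """Plug-in estimate of I(y; m) in nats for sorted inner bin edges."""
    n = len(logits)
    joint = {}
    for lam, y in zip(logits, labels):
        m = bisect.bisect_right(edges, lam)  # bin index of this logit
        joint[(y, m)] = joint.get((y, m), 0) + 1
    count_y, count_m = {}, {}
    for (y, m), c in joint.items():
        count_y[y] = count_y.get(y, 0) + c
        count_m[m] = count_m.get(m, 0) + c
    # sum over (y, m) of P(y, m) * log[ P(y, m) / (P(y) P(m)) ]
    return sum(c / n * math.log(c * n / (count_y[y] * count_m[m]))
               for (y, m), c in joint.items())


logits = [-2.0, -1.0, 1.0, 2.0]
labels = [0, 0, 1, 1]
mi_good_edge = binned_mi(logits, labels, [0.0])   # edge separates the labels
mi_bad_edge = binned_mi(logits, labels, [-1.5])   # edge in an uninformative place
mi_single_bin = binned_mi(logits, labels, [])     # one bin keeps no information
print(mi_good_edge, mi_bad_edge, mi_single_bin)
```

With the same number of bins, the edge placed where the label actually changes retains the full label entropy, while the misplaced edge retains only part of it, which is the effect the discussion of Eq. mass binning describes.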

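For solving the problem (3), we formulate an equivalent problem.

###### Theorem 1.

The MI maximization problem given in (3) is equivalent to

$$\{g_m^*\} = \arg\min_{\{g_m\}} \min_{\{\phi_m\}} \mathcal{L}(\{g_m, \phi_m\}), \tag{4}$$

where the loss $\mathcal{L}(\{g_m, \phi_m\})$ is defined as

$$\mathcal{L}(\{g_m, \phi_m\}) = \sum_{m=0}^{M-1} \int_{g_m}^{g_{m+1}} p(\lambda) \sum_{y' \in \{0,1\}} P(y = y' \mid \lambda)\, \log \frac{P(y = y')}{P_\sigma(y = y'; \phi_m)}\,\mathrm{d}\lambda, \qquad P_\sigma(y; \phi_m) = \sigma\big[(2y - 1)\phi_m\big], \tag{5}$$

and $\{\phi_m\}$ as a set of real-valued auxiliary variables are introduced here to ease the optimization.

Proof. See the appendix for the proof. ∎

Next, we compute the derivatives of the loss $\mathcal{L}$ with respect to $\{g_m, \phi_m\}$. When the conditional distribution $P(y \mid \lambda)$ takes the $\sigma$-model, i.e., $P(y \mid \lambda) \approx \sigma[(2y - 1)\lambda]$, the stationary points of $\mathcal{L}$, zeroing the gradients over $\{g_m, \phi_m\}$, have a closed-form expression

$$g_m = \log\left\{\frac{\log\left[\dfrac{1 + e^{\phi_m}}{1 + e^{\phi_{m-1}}}\right]}{\log\left[\dfrac{1 + e^{-\phi_{m-1}}}{1 + e^{-\phi_m}}\right]}\right\}, \qquad \phi_m = \log\frac{\int_{g_m}^{g_{m+1}} \sigma(\lambda)\, p(\lambda)\,\mathrm{d}\lambda}{\int_{g_m}^{g_{m+1}} \sigma(-\lambda)\, p(\lambda)\,\mathrm{d}\lambda} \approx \log\frac{\sum_{\lambda_n \in \mathcal{S}_m} \sigma(\lambda_n)}{\sum_{\lambda_n \in \mathcal{S}_m} \sigma(-\lambda_n)}, \tag{6}$$

where the approximation for $\phi_m$ arises from using the logits in the training set $\mathcal{S}$ as an empirical approximation to $p(\lambda)$ and $\mathcal{S}_m = \mathcal{S} \cap [g_m, g_{m+1})$. So, we can solve the problem by iteratively and alternately updating $\{g_m\}$ and $\{\phi_m\}$ based on (6). The convergence and initialization of such an iterative method as well as the $\sigma$-model assumption are discussed along with the proof of Theorem 1 in the appendix.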
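The alternating update can be sketched in a few lines. The code below is an illustration under explicit assumptions, not the authors' implementation: it assumes the $\sigma$-model, sets each auxiliary variable to the log-ratio of summed $\sigma(\lambda)$ and $\sigma(-\lambda)$ over the logits currently in a bin, and places each inner edge at the point where the per-sample loss under the two neighbouring auxiliary variables is equal. The equal-mass initialisation, the fixed iteration count, the fallbacks for degenerate bins, and the name `imax_edges` are all choices made here; the appendix describes a k-means++-style initialisation instead.

```python
import bisect
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def imax_edges(logits, n_bins, n_iter=50):
    """Alternate between auxiliary variables phi and inner bin edges.

    Returns (edges, phi); edges[i] separates bin i from bin i + 1.
    """
    lam = sorted(logits)
    n = len(lam)
    # Assumption: equal-mass initialisation of the inner edges.
    edges = [lam[(m * n) // n_bins] for m in range(1, n_bins)]
    phi = [0.0] * n_bins
    for _ in range(n_iter):
        pos = [0.0] * n_bins
        neg = [0.0] * n_bins
        for x in lam:
            m = bisect.bisect_right(edges, x)
            pos[m] += sigmoid(x)
            neg[m] += sigmoid(-x)
        # phi update: log-odds of the average sigmoid response in each bin.
        phi = [math.log(p / q) if p > 0 and q > 0 else phi[m]
               for m, (p, q) in enumerate(zip(pos, neg))]
        new_edges = []
        for m in range(n_bins - 1):
            num = softplus(phi[m + 1]) - softplus(phi[m])
            den = softplus(-phi[m]) - softplus(-phi[m + 1])
            # Edge update; keep the old edge if the neighbours are not ordered.
            new_edges.append(math.log(num / den) if num > 0 and den > 0 else edges[m])
        edges = sorted(new_edges)
    return edges, phi


random.seed(0)
data = [random.gauss(0.0, 2.0) for _ in range(2000)]  # synthetic logits
edges, phi = imax_edges(data, n_bins=4)
print([round(g, 3) for g in edges])
```

Note that under the $\sigma$-model the update uses only the logits, not the labels. By the mean value argument behind the edge formula, each updated edge lies strictly between the two neighbouring auxiliary variables whenever they are strictly ordered, so the edges stay sorted.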

As the iterative method operates under an approximation of the inaccessible ground truth distribution, here we synthesize an example, see Fig. 4, to assess its effectiveness. (The formulation (3) is a special case of the information bottleneck (IB), i.e., the limit of the Lagrange multiplier in which stochastic quantization degenerates to a deterministic one. If stochastic binning were used for calibration, it would output a weighted sum of all bin representatives, thereby being continuous and not ECE verifiable. Given that, we do not use it for calibration.) In Fig. 4-a, we show the monotonic increase in MI over the iterative steps. Furthermore, Fig. 4-b shows that I-Max approaches the theoretical limits and provides an information-theoretic perspective on the sub-optimal performance of the alternative binning schemes.

## 4 Experiment

#### Datasets and Models

We evaluate post-hoc calibration methods on three benchmark datasets, i.e., ImageNet imagenet_cvpr09, CIFAR-100 cifar, and CIFAR-10 cifar, and across three modern DNNs for each dataset, i.e., InceptionResNetV2 inceptionresnetv2, DenseNet161 dnpaper and ResNet152 He2016DeepRL for ImageNet, and Wide ResNet (WRN) wrnpaper, DenseNet-BC (L=190, k=40) dnpaper and ResNext8x64 resnextpaper for the two CIFAR datasets. These models are publicly available pre-trained networks and details are reported in the appendix.

#### Training and Evaluation Details

We perform class-balanced random splits of the data test set: the calibration and evaluation set sizes are both 25k for ImageNet, and 5k for the CIFAR datasets. All reported numbers are the means across random splits; stds can be found in the appendix. Note that some calibration methods only use a subset of the available calibration samples for training, showing their sample efficiency. Further calibrator training details are provided in the appendix.

We empirically evaluate MI, Accuracy (top-1 and top-5 ACCs), ECE (class-wise and top-1 prediction), NLL and Brier score; the latter is shown in the appendix. The ECEs of the scaling methods are obtained by using Eq. size binning with evaluation bins, as in wenger2020calibration.
Additional quantization schemes for ECE evaluation of *scaling methods* can be found in the appendix.

### 4.1 Eq. Size, Eq. Mass vs. I-Max Binning

When comparing different calibration methods, it is important to jointly consider ECE and ACC. A naive classifier that always predicts all classes with confidence $1/K$ minimizes the ECE to zero, but its ACC is only at chance level. Additionally, only having a small class-wise ECE may be an artifact. For instance, the class-wise ECE of ImageNet will be dominated by cases where a properly trained classifier ranks the class-$k$ as very unlikely, as the prior ensures the correctness in $(K-1)/K$ of the cases. However, in practice, whether the prediction probability is calibrated only matters when the event is likely to happen. Given that, we introduce a threshold into the class-wise ECE evaluation of (1), considering only cases where the predicted probability of class-$k$ exceeds its prior $1/K$.

In Tab. 1, we compare three binning schemes: Eq. size, Eq. mass and I-Max binning. The accuracies of the three binning schemes increase with their MI (in nats); Eq. mass binning is highly sub-optimal at label information preservation, and thus shows a severe accuracy drop. The accuracy of Eq. size binning is closer to that of I-Max binning, but still lower, in particular at top-5. Also note that I-Max can approach the theoretical MI limit. The advantages of I-Max become even more prominent when comparing the NLLs of the binning schemes. For all three ECE evaluation metrics, I-Max binning improves on the baseline calibration performance, and outperforms Eq. size binning. Eq. mass binning is excluded from this comparison, as its poor accuracy renders the method impractical. Overall, I-Max successfully mitigates the negative impact of quantization on ACCs while still providing an improved and verifiable ECE performance. Additionally, I-Max achieves an even better calibration when using sCW with 1k calibration samples, instead of the standard CW binning with 25k calibration samples, highlighting the effectiveness of sCW binning.

In the appendix we perform additional ablations on the number of bins and calibration samples. Accordingly, a post-hoc analysis investigates how the quantization error of the binning schemes changes the ranking order. Observations are consistent with the intuition behind the problem formulation (see Sec. 3.3) and empirical results from Tab. 1 that MI maximization is a proper criterion for multi-class calibration and it maximally mitigates the potential accuracy loss.

Binn. | sCW(?) | size | MI | Acc top1 | Acc top5 | class-wise ECE | class-wise ECE (thresholded) | top1-ECE | NLL |
---|---|---|---|---|---|---|---|---|---|
Baseline | ✗ | - | - | 80.33 | 95.10 | 0.000442 | 0.0486 | 0.0357 | 0.8406 |
Eq. Mass | ✗ | 25k | 0.0026 | 7.78 | 27.92 | 0.000173 | 0.0016 | 0.0606 | 3.5960 |
Eq. Mass | ✓ | 1k | 0.0026 | 5.02 | 26.75 | 0.000165 | 0.0022 | 0.0353 | 3.5272 |
Eq. Size | ✗ | 25k | 0.0053 | 78.52 | 89.06 | 0.000310 | 0.1344 | 0.0547 | 1.5159 |
Eq. Size | ✓ | 1k | 0.0062 | 80.14 | 88.99 | 0.000298 | 0.1525 | 0.0279 | 1.2671 |
I-Max | ✗ | 25k | 0.0066 | 80.27 | 95.01 | 0.000346 | 0.0342 | 0.0329 | 0.8499 |
I-Max | ✓ | 1k | 0.0066 | 80.20 | 94.86 | 0.000296 | 0.0302 | 0.0200 | 0.7860 |

### 4.2 Scaling vs. I-Max Binning

In Tab. 2, we compare I-Max binning to benchmark scaling methods. Namely, matrix scaling Kull2019 has a large model capacity compared to other scaling methods using parametric models, while TS pmlr-v70-guo17a only uses a single parameter. As a non-parametric method, GP wenger2020calibration yields state-of-the-art calibration performance. Benefiting from its model capacity, matrix scaling achieves the best accuracy. I-Max binning achieves the best calibration on CIFAR-100; on ImageNet, it has the best class-wise ECE, and is similar to GP on top1-ECE. For a broader scope of comparison, we refer to the appendix.

To showcase the complementary nature of scaling and binning, we investigate combining binning with GP (a top performing non-parametric scaling method, though with the drawback of high complexity) and TS (a commonly used scaling method). Here, we propose to bin the raw logits and use the GP/TS scaled logits of the samples per bin for setting the bin representatives, replacing the empirical frequency estimates. As GP is then only needed at the calibration learning phase, complexity is no longer an issue. More importantly, binning offers GP/TS (and any scaling method) verifiable ECE performance. Being mutually beneficial, GP helps improving ACCs and ECEs of binning.
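The combination of binning with a scaling method can be sketched as follows: the raw logits decide the bin, and the scaled outputs of the samples in each bin set the representative. The sketch makes several assumptions that the text does not fix: it uses temperature scaling in the one-vs-rest logit domain with a made-up temperature `T = 2.0`, it averages scaled probabilities (rather than scaled logits) within a bin, and the names `reps_from_scaled` and `calibrate` are hypothetical.

```python
import bisect
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reps_from_scaled(raw_logits, scaled_probs, edges):
    """Bin by raw logit; representative = mean scaled probability in the bin."""
    n_bins = len(edges) + 1
    total = [0.0] * n_bins
    count = [0] * n_bins
    for lam, q in zip(raw_logits, scaled_probs):
        m = bisect.bisect_right(edges, lam)
        total[m] += q
        count[m] += 1
    # Assumption: empty bins are marked with None.
    return [total[m] / count[m] if count[m] else None for m in range(n_bins)]


def calibrate(lam, edges, reps):
    return reps[bisect.bisect_right(edges, lam)]


T = 2.0  # hypothetical temperature, normally fitted on the calibration set
raw = [-2.0, -1.0, 1.0, 3.0]
scaled = [sigmoid(lam / T) for lam in raw]
reps = reps_from_scaled(raw, scaled, edges=[0.0])
outputs = {calibrate(lam, [0.0], reps) for lam in (-5.0, -0.5, 0.5, 4.0)}
print(reps, len(outputs))
```

The scaling model is only evaluated while fitting the representatives; at test time a logit is mapped to one of finitely many values, so the prediction stays discrete and its ECE remains verifiable.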

Using the combination of the best binning method and scaling method (i.e., I-Max (sCW) w. GP) only marginally degrades the accuracy of the baseline, by as little as 0.13% (0.01%) on top-1 ACC for ImageNet (CIFAR100) and 0.23% on top-5 ACC for ImageNet. This minor drop is insignificant compared to the improvement in calibration. We obtain a 38.27% (49.78%) reduction in class-wise ECE and 66.11% (76.07%) reduction in top1-ECE of the baseline for ImageNet (CIFAR100). Similarly, compared to the state-of-the-art scaling method (i.e., GP wenger2020calibration), we obtain a 38.14% (47.95%) reduction in class-wise ECE and 34.95% (50.00%) reduction in top1-ECE for ImageNet (CIFAR100). All of these figures follow from the entries of Tab. 2.

Results for CIFAR100 (WRN) and ImageNet (InceptionResNetV2); the class-wise ECE is the thresholded variant.

Calibrator | CIFAR100 Acc top1 | CIFAR100 class-wise ECE | CIFAR100 top1-ECE | ImageNet Acc top1 | ImageNet Acc top5 | ImageNet class-wise ECE | ImageNet top1-ECE |
---|---|---|---|---|---|---|---|
Baseline | 81.35 | 0.1113 | 0.0748 | 80.33 | 95.10 | 0.0486 | 0.0357 |
Mtx Scal. Kull2019 | 81.44 | 0.1085 | 0.0692 | 80.78 | 95.38 | 0.0508 | 0.0282 |
TS pmlr-v70-guo17a | 81.35 | 0.0911 | 0.0511 | 80.33 | 95.10 | 0.0559 | 0.0439 |
GP wenger2020calibration | 81.34 | 0.1074 | 0.0358 | 80.33 | 95.11 | 0.0485 | 0.0186 |
I-Max | 81.30 | 0.0518 | 0.0231 | 80.20 | 94.86 | 0.0302 | 0.0200 |
I-Max w. TS | 81.34 | 0.0510 | 0.0365 | 80.20 | 94.87 | 0.0354 | 0.0402 |
I-Max w. GP | 81.34 | 0.0559 | 0.0179 | 80.20 | 94.87 | 0.0300 | 0.0121 |

### 4.3 Shared Class Wise Helps Scaling Methods

Though without quantization loss, some scaling methods, i.e., Beta pmlr-v54-kull17a, Isotonic regression Zadrozny2002, and Platt scaling Platt1999, even suffer from more severe accuracy degradation than I-Max binning. As they also use the one-vs-rest strategy for multi-class calibration, we find that the proposed shared CW binning strategy is beneficial for reducing their accuracy loss and improving their ECE performance, with only k calibration samples, see Tab. 3.

Results for CIFAR100 (WRN) and ImageNet (InceptionResNetV2); the class-wise ECE is the thresholded variant.

Calibrator | sCW(?) | CIFAR100 Acc top1 | CIFAR100 class-wise ECE | CIFAR100 top1-ECE | ImageNet Acc top1 | ImageNet Acc top5 | ImageNet class-wise ECE | ImageNet top1-ECE |
---|---|---|---|---|---|---|---|---|
Baseline | - | 81.35 | 0.1113 | 0.0748 | 80.33 | 95.10 | 0.0489 | 0.0357 |
Beta pmlr-v54-kull17a | ✗ | 81.02 | 0.1066 | 0.0638 | 77.80 | 86.83 | 0.1662 | 0.1586 |
Beta | ✓ | 81.35 | 0.0942 | 0.0357 | 80.33 | 95.10 | 0.0625 | 0.0603 |
I-Max w. Beta | ✓ | 81.34 | 0.0508 | 0.0161 | 80.20 | 94.87 | 0.0381 | 0.0574 |
Isotonic Reg. Zadrozny2002 | ✗ | 80.62 | 0.0989 | 0.0785 | 77.82 | 88.36 | 0.1640 | 0.1255 |
Isotonic Reg. | ✓ | 81.30 | 0.0602 | 0.0257 | 80.22 | 95.05 | 0.0345 | 0.0209 |
I-Max w. Isotonic Reg. | ✓ | 81.34 | 0.0515 | 0.0212 | 80.20 | 94.87 | 0.0299 | 0.0170 |
Platt Scal. Platt1999 | ✗ | 81.31 | 0.0923 | 0.1035 | 80.36 | 94.91 | 0.0451 | 0.0961 |
Platt Scal. | ✓ | 81.35 | 0.0816 | 0.0462 | 80.33 | 95.10 | 0.0565 | 0.0415 |
I-Max w. Platt Scal. | ✓ | 81.34 | 0.0511 | 0.0323 | 80.20 | 94.87 | 0.0293 | 0.0392 |

## 5 Conclusion

In this work, we proposed I-Max binning for multi-class calibration. By maximally preserving the label-information under lossy quantization, it outperforms histogram binning (HB) in reducing potential accuracy losses. Additionally, through the novel shared class-wise (sCW) strategy, we solved the sample-inefficiency issue of training binning and scaling methods that rely on one-vs-rest conversion for multi-class calibration. Our experiments showed that I-Max binning yields consistent class-wise and top-1 calibration improvements over multiple datasets and model architectures, outperforming HB and state-of-the-art scaling methods. The combination of I-Max and scaling methods offers further, and more importantly, verifiable calibration performance gains.

Future work includes diversifying the calibration set through data augmentation or multiple-inference models for generalization to dataset shifts unobserved during training. Another exciting primary direction is the inclusion of I-Max during the classifier’s training, allowing I-Max to influence the logit distribution to exploit further calibration gains. Last but not least, as a quantization scheme, I-Max binning may also be exploited for calibration evaluation, providing better, ideally verifiable empirical ECE estimates of scaling methods.

## References

## S1 No Extra Normalization after Class-wise Calibrations

There is a group of calibration schemes that rely on one-vs-rest conversion to turn multi-class calibration into $K$ class-wise calibrations, e.g., histogram binning (HB), Platt scaling and Isotonic regression. After per-class calibration, the calibrated prediction probabilities of all classes no longer fulfill the normalization constraint, i.e., in general $\sum_{k=1}^{K} q_k \neq 1$. An extra normalization step was taken in pmlr-v70-guo17a to regain the normalization constraint. Here, we note that this extra normalization is unnecessary and partially undoes the per-class calibration effect. For HB, normalization even renders its ECE evaluation non-verifiable, as for any scaling method.

The one-vs-rest strategy essentially marginalizes the multi-class predictive distribution over each class. After such marginalization, each class and its prediction probability shall be treated independently, thus no longer being constrained by the multi-class normalization constraint. This is analogous to training a CIFAR or ImageNet classifier with sigmoid rather than softmax cross entropy loss, e.g., Ryou_2019_ICCV. At training and test time, each class prediction probability is individually taken from the respective sigmoid-response without normalization. The class with the largest response is then top-ranked, and normalization itself has no influence on the ranking performance.

## S2 $\mathcal{S}_k$ vs. $\mathcal{S}$ as Empirical Approximations to $p(\lambda_k \mid y_k)$ for Bin Optimization

In Sec. 3.2 of the main paper, we discussed the class imbalance issue, arising from one-vs-rest conversion, see Fig. 2. To tackle it, we proposed to merge the class-wise (CW) bin training sets and use the merged one to train a single binning scheme for all classes, i.e., shared CW (sCW) instead of CW binning. Tab. 3 showed that our proposal is also beneficial to scaling methods that use one-vs-rest for multi-class calibration.

As pointed out in Sec. 3.2, both $\mathcal{S}_k$ and $\mathcal{S}$ are empirical approximations to the inaccessible ground truth $p(\lambda_k \mid y_k)$ for bin optimization. In Fig. S1, we empirically analyze their approximation errors. From the CIFAR10 test set, we take a subset of the samples to approximate the per-class logit distribution $p(\lambda_k \mid y_k = 1)$ by means of histogram density estimation, and then use it as the baseline for comparison in Fig. S1. (Here, we focus on $p(\lambda_k \mid y_k = 1)$ as its empirical estimation suffers from the class imbalance, being much more challenging than $p(\lambda_k \mid y_k = 0)$, as illustrated in Fig. 2 of the main paper.) The remaining samples in the CIFAR10 test set are reserved for constructing $\mathcal{S}_k$ and $\mathcal{S}$. For each class, we respectively evaluate the square root of the Jensen-Shannon divergence (JSD) from the baseline to the empirical distribution of $\mathcal{S}_k$ or $\mathcal{S}$ attained at different numbers of samples. Despite suffering from the bias induced by set merging, $\mathcal{S}$ achieves noticeably smaller JSDs than $\mathcal{S}_k$ in nearly all cases, and proves to be more sample efficient. Only after using all reserved samples does $\mathcal{S}_k$ attain slightly smaller JSDs than $\mathcal{S}$ for a few classes.

From CIFAR10 to CIFAR100 and ImageNet, the class imbalance issue becomes more severe. We expect the sample inefficiency issue of $\mathcal{S}_k$ to become more critical, and therefore it is advisable to use $\mathcal{S}$ for bin optimization.
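The distance used in this comparison, the square root of the JSD between two histogram estimates, can be computed as in the sketch below. The function name `js_distance`, the natural-log base, and the example histograms are assumptions for illustration; the inputs are expected to be normalized histograms over the same bins.

```python
import math


def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence (in nats)."""
    def kl(a, b):
        # Terms with a zero mass in `a` contribute nothing to the KLD.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    mix = [(x + y) / 2.0 for x, y in zip(p, q)]
    jsd = 0.5 * kl(p, mix) + 0.5 * kl(q, mix)
    return math.sqrt(max(0.0, jsd))  # guard against tiny negative rounding


reference = [0.1, 0.2, 0.4, 0.3]  # hypothetical baseline histogram
estimate = [0.2, 0.2, 0.3, 0.3]   # hypothetical small-sample histogram
print(js_distance(reference, estimate))
```

Unlike the KLD, this measure is symmetric and bounded by $\sqrt{\log 2}$, which is reached when the two histograms have disjoint support.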

## S3 Proof of Theorem 1

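###### Theorem 1.

The mutual information (MI) maximization problem given as follows:

$$\{g_m^*\} = \arg\max_{\{g_m\}} I\big(y; m = Q(\lambda)\big) \tag{S1}$$

is equivalent to

$$\{g_m^*\} = \arg\min_{\{g_m\}} \min_{\{\phi_m\}} \mathcal{L}(\{g_m, \phi_m\}), \tag{S2}$$

where the loss $\mathcal{L}(\{g_m, \phi_m\})$ is defined as

$$\mathcal{L}(\{g_m, \phi_m\}) = \sum_{m=0}^{M-1} \int_{g_m}^{g_{m+1}} p(\lambda) \sum_{y' \in \{0,1\}} P(y = y' \mid \lambda)\, \log \frac{P(y = y')}{P_\sigma(y = y'; \phi_m)}\,\mathrm{d}\lambda, \tag{S3}$$

$$P_\sigma(y; \phi_m) = \sigma\big[(2y - 1)\phi_m\big]. \tag{S4}$$

As a set of real-valued auxiliary variables, $\{\phi_m\}$ are introduced here to ease the optimization.

Proof. Before starting our proof, we note that the upper-case $P$ indicates probability mass functions of discrete random variables, e.g., the label $y$ and the bin interval index $m$; whereas the lower-case $p$ is reserved for probability density functions of continuous random variables, e.g., the raw logit $\lambda$.

The key to proving the equivalence is to show the inequality

$$I\big(y; m = Q(\lambda)\big) \geq -\mathcal{L}(\{g_m, \phi_m\}), \tag{S5}$$

and the equality is attainable by minimizing $\mathcal{L}$ over $\{\phi_m\}$.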

By the definition of MI, we firstly expand as

(S6) |

where the conditional distribution is given as

(S7) |

From the above expression, we note that MI maximization effectively only depends on the bin edges $\{g_m\}$. The bin representatives can be arbitrary as long as they can indicate the condition $\lambda \in [g_m, g_{m+1})$. So, the bin interval index $m$ is sufficient to serve the role in conditioning the probability mass function of $y$, i.e., $P(y \mid m)$. After optimizing the bin edges, we have the freedom to set the bin representatives for the sake of post-hoc calibration.

Next, based on the MI expression, we compute its sum with

(S8) |

The equality is based on

(S9) |

The next equality follows from identifying the inner sum over $y$ as the Kullback-Leibler divergence (KLD) between two probability mass functions of $y$. As the probability mass function $P(m)$ and the KLD are both non-negative, we reach the inequality, where the equality holds if $P_\sigma(y; \phi_m) = P(y \mid m)$. By further noting that $\mathcal{L}$ is convex over $\{\phi_m\}$ and that $P_\sigma(y; \phi_m) = P(y \mid m)$ nulls out its gradient over $\{\phi_m\}$, we then reach

(S10) |

The obtained equality then concludes our proof

(S11) |

∎

### S3.1 Convergence of the Iterative Method

For convenience, we recall the update equations for in Sec. 3.3 of the main paper here

(S12) |

In the following, we show that the updates on and according to (S12) continuously decrease the loss , i.e.,

(S13) |

The second inequality is based on the explained property of . Namely, it is convex over and the minimum for any given is attained by . As is the log-probability ratio of , we shall have

(S14) |

where in this case is induced by and . Plugging and into (S7), the resulting at the iteration yields the update equation of as given in (S12).

To prove the first inequality, we start from showing that is a local minimum of . The update equation on is an outcome of solving the stationary point equation of over under the condition for any

(S15) |

Being a stationary point is the necessary condition of local extremum when the function’s first-order derivative exists at that point, i.e., first-derivative test. To further show that the local extremum is actually a local minimum, we resort to the second-derivative test, i.e., if the Hessian matrix of is positive definite at the stationary point . Due to with the monotonically increasing function in its update equation, we have

(S16) |

implying that all eigenvalues of the Hessian matrix are positive (equivalently, is positive definite). Therefore, as the stationary point of is a local minimum.

It is important to note that from the stationary point equation (S15), as a local minimum is unique among with for any . In other words, the first inequality holds under the condition for any . Binning is a lossy data processing step. In order to maximally preserve the label information, it is natural to exploit all bins in the optimization, not wasting any single bin in the area without mass, i.e., . Having said that, it is reasonable to constrain with over iterations, thereby concluding that the iterative method will converge to a local minimum based on the two inequalities (S13).

### S3.2 Initialization of the Iterative Method

We propose to initialize the iterative method by modifying the k-means++ algorithm kmeans++, which was developed to initialize the cluster centers for k-means clustering algorithms. It is based on the following identification

(S17) |

(S18) |

As is a constant with respect to , minimizing is equivalent to minimizing the term on the RHS of (S17). The last approximation is reached by turning the binning problem into a clustering problem, i.e., grouping the logit samples in the training set according to the KLD measure, where are effectively the centers of each cluster. The k-means++ algorithm kmeans++ initializes the cluster centers based on the Euclidean distance. In our case, we instead use the JSD as the distance measure to initialize . Compared with KLD, JSD is symmetric and bounded.
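The seeding idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each scalar logit is mapped to a Bernoulli distribution through the sigmoid, the JSD between two such Bernoulli distributions serves as the distance, and new centers are sampled with probability proportional to that distance (the original k-means++ uses squared Euclidean distance). The function names `bernoulli_jsd` and `jsd_kmeanspp_init` are hypothetical.

```python
import numpy as np

def bernoulli_jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence (in nats) between Bernoulli(p) and Bernoulli(q)."""
    p = np.clip(p, eps, 1.0 - eps)
    q = np.clip(q, eps, 1.0 - eps)
    m = 0.5 * (p + q)

    def kld(a, b):
        # KLD between Bernoulli(a) and Bernoulli(b)
        return a * np.log(a / b) + (1.0 - a) * np.log((1.0 - a) / (1.0 - b))

    return 0.5 * kld(p, m) + 0.5 * kld(q, m)

def jsd_kmeanspp_init(logits, num_centers, seed=0):
    """k-means++-style seeding with JSD as the distance (illustrative assumption)."""
    rng = np.random.default_rng(seed)
    logits = np.asarray(logits, dtype=float)
    probs = 1.0 / (1.0 + np.exp(-logits))      # assumed logit-to-probability map
    chosen = [int(rng.integers(len(logits)))]  # first center: uniform at random
    # Distance of every sample to its nearest already-chosen center.
    dist = bernoulli_jsd(probs, probs[chosen[0]])
    for _ in range(num_centers - 1):
        # Far-away samples are more likely to become the next center.
        idx = int(rng.choice(len(logits), p=dist / dist.sum()))
        chosen.append(idx)
        dist = np.minimum(dist, bernoulli_jsd(probs, probs[idx]))
    return np.sort(logits[chosen])             # centers in the logit domain

logits = np.linspace(-8.0, 4.0, 200)           # synthetic logits for illustration
centers = jsd_kmeanspp_init(logits, num_centers=5, seed=1)
```

Because the JSD is bounded by log 2, no single outlying logit can dominate the sampling weights, which is one practical reason to prefer it over the unbounded KLD in the seeding step.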

### S3.3 A Remark on the Iterative Method Derivation

The closed-form update on in (S12) is based on the -model approximation, which has been validated through our empirical experiments. It is expected to work with properly trained classifiers that do not severely overfit to the cross-entropy loss, e.g., using data augmentation and other regularization techniques at training. Nevertheless, even in corner cases where classifiers are poorly trained, the iterative method can still be operated without the -model approximation. Namely, as shown in Fig. 3 of the main paper, we can resort to KDE for an empirical estimation of the ground truth distribution . Using the KDEs, we can compute the gradient of over and perform iterative gradient-based updates on , replacing the closed-form update. Essentially, the -model approximation is only necessary to find the stationary points of the gradient equations, speeding up the convergence of the method. If attempting to keep the closed-form update on , an alternative solution could be to use the KDEs for adjusting the -model, e.g., , where and are chosen to match the KDE-based approximation to . After setting and , they will be used as a scaling and bias term in the original closed-form update equations

(S19) |

## S4 Post-hoc Analysis on the Experiment Results in Sec. 4.1

In Tab. 1 of Sec. 4.1, we compared three different binning schemes by measuring their ACCs and ECEs. The observation on their accuracy performance is aligned with our mutual information maximization viewpoint introduced in Sec. 3.3 and Fig. 3. Here, we re-present Fig. 3 and provide an alternative explanation to strengthen our understanding of how the location of bin edges affects the accuracy, e.g., why Eq. Size binning performed acceptably at the top- ACC, but failed at the top- ACC. Specifically, Fig. S2 shows the histograms of raw logits that are grouped based on their ranks instead of their labels as in Fig. 3. As expected, the logits with low ranks (i.e., below top- in Fig. S2) are small and thus take the left hand side of the plot, whereas the top- logits are mostly located on the right hand side. Besides sorting logits according to their ranks, we additionally estimate the density of the logits associated with the ground truth (GT) classes (i.e., in Fig. S2). With a properly trained classifier, the histogram of top- logits shall largely overlap with the density curve , i.e., top- prediction being correct in most cases.

Judging from its bin edge locations, Eq. Mass binning attempts to attain small quantization errors for logits of low ranks rather than top-. This will certainly degrade the accuracy performance after binning. On the contrary, Eq. Size binning aims at small quantization error for the top- logits, but ignores top- ones. As a result, we observed its poor top- ACCs. I-Max binning nicely distributes its bin edges in the area where the GT logits are likely to be located, and the bin width becomes smaller in the area where the top- logits are close by (i.e., the overlap region between the red and blue histograms). Note that any logit larger than zero must be top- ranked, as there can exist at most one class with prediction probability larger than . Given that, the bins located above zero no longer serve to maintain the ranking order, but rather to reduce the precision loss of the top- prediction probability after binning.
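The remark about positive logits can be checked numerically. The sketch below assumes that the class-wise logit of class k is the log-odds log(p_k / (1 - p_k)) of its softmax probability p_k, and it reads the probability threshold in the sentence above as 1/2; both are assumptions made for illustration, as is the synthetic data. Under them, a positive logit means p_k > 1/2, which at most one class per sample can satisfy.

```python
import numpy as np

def classwise_logits(raw_logits):
    """Assumed class-wise logit: log(p_k / (1 - p_k)) with softmax probabilities p_k."""
    z = raw_logits - raw_logits.max(axis=1, keepdims=True)  # stabilise the softmax
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    return np.log(p) - np.log1p(-p)

rng = np.random.default_rng(0)
raw = 3.0 * rng.normal(size=(1000, 10))  # synthetic raw logits, 10 classes
cw = classwise_logits(raw)

# Number of classes with a positive class-wise logit, per sample.
num_positive = (cw > 0).sum(axis=1)
has_positive = num_positive == 1

# Where a positive logit exists, it belongs to the top-ranked class.
positive_class = (cw > 0).argmax(axis=1)[has_positive]
positive_is_top = positive_class == raw.argmax(axis=1)[has_positive]
```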

Binn. | | | | | | | NLL |
---|---|---|---|---|---|---|---|
Baseline | 80.33 | - | 95.10 | - | 0.0357 | - | 0.8406 |
Eq. Mass | 5.02 | 80.33 | 26.75 | 95.10 | 0.0353 | 0.7884 | 3.5272 |
Eq. Size | 80.14 | 80.21 | 88.99 | 95.10 | 0.0279 | 0.0277 | 1.2671 |
I-Max | 80.20 | 80.33 | 94.86 | 95.10 | 0.0200 | 0.0202 | 0.7860 |

The second part of our post-hoc analysis is on the sCW binning strategy. When using the same binning scheme for all per-class calibration, the chance of creating ties in top- predictions is much higher than with CW binning, e.g., more than one class is top- ranked according to the calibrated prediction probabilities. Our reported ACCs in the main paper are attained by simply returning the first found class, i.e., using the class index as the secondary sorting criterion. This is certainly a suboptimal solution. Here, we investigate how the ties affect the ACCs of sCW binning. To this end, we use raw logits (before binning) as the secondary sorting criterion. The resulting and are shown in Tab. S1. Interestingly, such a simple change reduces the accuracy loss of Eq. Mass and I-Max binning to zero, indicating that they can preserve the top- ranking order of the raw logits but not in a strict monotonic sense, i.e., some are replaced by . As opposed to I-Max binning, Eq. Mass binning performs poorly at calibration, i.e., it has a very high NLL and ECE. This is because it trivially ranks many classes as top-, but each of them has the same very small confidence score. Given that, even though the accuracy loss is no longer an issue, it is still not a good solution for multi-class calibration. For Eq. Size binning, resolving ties only helps restore the baseline top- but not top- ACC. Its poor bin representative setting, due to unreliable empirical frequency estimation over too narrow bins, can result in a permutation among the top- predictions.
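The two tie-breaking rules discussed above can be sketched as a two-key sort. This is an illustrative sketch rather than the evaluation code behind Tab. S1; the helper names `top_k_with_tiebreak` and `top_k_accuracy` and the toy inputs are hypothetical.

```python
import numpy as np

def top_k_with_tiebreak(calibrated, secondary, k):
    """Top-k class indices per sample, ranked by the calibrated score (descending);
    ties are resolved by `secondary` (descending)."""
    calibrated = np.asarray(calibrated, dtype=float)
    secondary = np.asarray(secondary, dtype=float)
    top = np.empty((calibrated.shape[0], k), dtype=int)
    for i in range(calibrated.shape[0]):
        # np.lexsort treats the LAST key as primary; negate for descending order.
        order = np.lexsort((-secondary[i], -calibrated[i]))
        top[i] = order[:k]
    return top

def top_k_accuracy(top, labels):
    """Fraction of samples whose label appears among the returned classes."""
    return float(np.mean([labels[i] in top[i] for i in range(len(labels))]))

# Toy example: binning maps two classes per sample to the same probability.
calibrated = np.array([[0.4, 0.4, 0.1],
                       [0.2, 0.5, 0.5]])
raw_logits = np.array([[1.0, 2.0, -1.0],
                       [0.0, 3.0, 1.0]])
labels = np.array([1, 1])

# Rule 1: class index as secondary criterion (lower index wins a tie).
by_index = -np.tile(np.arange(calibrated.shape[1]), (calibrated.shape[0], 1))
acc_index = top_k_accuracy(top_k_with_tiebreak(calibrated, by_index, k=1), labels)

# Rule 2: raw logits (before binning) as secondary criterion.
acc_logit = top_k_accuracy(top_k_with_tiebreak(calibrated, raw_logits, k=1), labels)
```

In this toy case the class-index rule gets one of the two samples right, while the raw-logit rule recovers both, because the raw logits still carry the ordering that binning collapsed.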

Concluding from the above, our post-hoc analysis confirms that I-Max binning outperforms the other two binning schemes at mitigating the accuracy loss and multi-class calibration. In particular, there exists a simple solution to close the accuracy gap to the baseline, at the same time still retaining the desirable calibration gains.

## S5 Ablation on the Number of Bins and Calibration Set Size

In Tab. 1 of Sec. 4.1, sCW I-Max binning is the top performing one at the ACCs, ECEs and NLL measures. In this part, we further investigate how the number of bins and the calibration set size influence its performance. Tab. S2 shows that in order to benefit from more bins we shall accordingly increase the number of calibration samples. More bins help reduce the quantization loss, but increase the empirical frequency estimation error for setting the bin representatives. Given that, we observe reduced ACCs and increased ECEs for having bins with only k calibration samples. By increasing the calibration set size to k, we start seeing the benefits of having more bins to reduce quantization error for better ACCs. Next, we further exploit a scaling method, i.e., GP wenger2020calibration, for improving the sample efficiency of binning at setting the bin representatives. As a result, the combination is particularly beneficial to improve the ACCs and top- ECE. Overall, more bins are beneficial to ACCs, while ECEs favor fewer bins.

Binn. | Bins | | | | | | | | |
---|---|---|---|---|---|---|---|---|---|
Baseline | - | 80.33 | 95.10 | 0.0486 | 0.0357 | 80.33 | 95.10 | 0.0486 | 0.0357 |
 | | 1k Calibration Samples | | | | 5k Calibration Samples | | | |
GP | | 80.33 | 95.11 | 0.0485 | 0.0186 | 80.33 | 95.11 | 0.0445 | 0.0177 |
I-Max | 10 | 80.09 | 94.59 | 0.0316 | 0.0156 | 80.14 | 94.59 | 0.0330 | 0.0107 |
I-Max | 15 | 80.20 | 94.86 | 0.0302 | 0.0200 | 80.21 | 94.90 | 0.0257 | 0.0107 |
I-Max | 20 | 80.10 | 94.94 | 0.0266 | 0.0234 | 80.25 | 94.98 | 0.0220 | 0.0133 |
I-Max | 30 | 80.15 | 94.99 | 0.0343 | 0.0266 | 80.25 | 95.02 | 0.0310 | 0.0150 |
I-Max | 40 | 80.11 | 95.05 | 0.0365 | 0.0289 | 80.24 | 95.08 | 0.0374 | 0.0171 |
I-Max | 50 | 80.21 | 94.95 | 0.0411 | 0.0320 | 80.23 | 95.06 | 0.0378 | 0.0219 |
I-Max w. GP | 10 | 80.09 | 94.59 | 0.0396 | 0.0122 | 80.14 | 94.59 | 0.0330 | 0.0072 |
I-Max w. GP | 15 | 80.20 | 94.87 | 0.0300 | 0.0121 | 80.21 | 94.88 | 0.0256 | 0.0080 |
I-Max w. GP | 20 | 80.23 | 94.95 | 0.0370 | 0.0133 | 80.25 | 95.00 | 0.0270 | 0.0091 |
I-Max w. GP | 30 | 80.26 | 95.04 | 0.0383 | 0.0141 | 80.27 | 95.02 | 0.0389 | 0.0097 |
I-Max w. GP | 40 | 80.27 | 95.11 | 0.0424 | 0.0145 | 80.26 | 95.08 | 0.0402 | 0.0108 |
I-Max w. GP | 50 | 80.30 | 95.08 | 0.0427 | 0.0153 | 80.28 | 95.08 | 0.0405 | 0.0114 |

## S6 Empirical ECE Estimation of Scaling Methods under Multiple Evaluation Schemes

As mentioned in the main paper, scaling methods suffer from not being able to provide verifiable ECEs, see Fig. 1. In Tab. S3 we adopt four different quantization schemes to empirically evaluate the ECEs of scaling methods. As we can see, the obtained results depend on the evaluation scheme. On the contrary, I-Max binning with and without GP is not affected, and more importantly, its ECEs are better than those of scaling methods, regardless of the evaluation scheme.

Calibrator | | | | | | Mean |
---|---|---|---|---|---|---|
Baseline | 80.33 | 0.0486 ± 0.0003 | 0.0459 ± 0.0004 | 0.0484 ± 0.0004 | 0.0521 ± 0.0004 | 0.0488 |
25k Calibration Samples | | | | | | |
BBQ (Naeini2015) | 53.89 | 0.0287 ± 0.0009 | 0.0376 ± 0.0014 | 0.0372 ± 0.0014 | 0.0316 ± 0.0008 | 0.0338 |
Beta pmlr-v54-kull17a | 80.47 | 0.0706 ± 0.0003 | 0.0723 ± 0.0005 | 0.0742 ± 0.0005 | 0.0755 ± 0.0004 | 0.0732 |
Isotonic Reg. Zadrozny2002 | 80.08 | 0.0644 ± 0.0015 | 0.0646 ± 0.0015 | 0.0652 ± 0.0016 | 0.0655 ± 0.0015 | 0.0649 |
Platt Platt1999 | 80.48 | 0.0597 ± 0.0007 | 0.0593 ± 0.0008 | 0.0613 ± 0.0008 | 0.0634 ± 0.0008 | 0.0609 |
Vec Scal. Kull2019 | 80.53 | 0.0494 ± 0.00020 | 0.0472 ± 0.0004 | 0.0498 ± 0.0003 | 0.0531 ± 0.0003 | 0.0499 |
Mtx Scal. Kull2019 | 80.78 | 0.0508 ± 0.0003 | 0.0488 ± 0.0004 | 0.0512 ± 0.0005 | 0.0544 ± 0.0004 | 0.0513 |
1k Calibration Samples | | | | | | |
TS pmlr-v70-guo17a | 80.33 | 0.0559 ± 0.0015 | 0.0548 ± 0.0018 | 0.0573 ± 0.0017 | 0.0598 ± 0.0015 | 0.0570 |
GP wenger2020calibration | 80.33 | 0.0485 ± 0.0037 | 0.0450 ± 0.0040 | 0.0475 ± 0.0039 | 0.0520 ± 0.0038 | 0.0483 |
I-Max | 80.20 | 0.0302 ± 0.0041 | | | | |
I-Max w. GP | 80.20 | 0.0300 ± 0.0041 | | | | |

Calibrator | | | | | | Mean |
---|---|---|---|---|---|---|
Baseline | 80.33 | 0.0357 ± 0.0010 | 0.0345 ± 0.0010 | 0.0348 ± 0.0012 | 0.0352 ± 0.0016 | 0.0351 |
25k Calibration Samples | | | | | | |
BBQ (Naeini2015) | 53.89 | 0.2689 ± 0.0033 | 0.2690 ± 0.0034 | 0.2690 ± 0.0034 | 0.2689 ± 0.0032 | 0.2689 |
Beta pmlr-v54-kull17a | 80.47 | 0.0346 ± 0.0022 | 0.0360 ± 0.0017 | 0.0360 ± 0.0022 | 0.0357 ± 0.0019 | 0.0356 |
Isotonic Reg. Zadrozny2002 | 80.08 | 0.0468 ± 0.0020 | 0.0434 ± 0.0019 | 0.0436 ± 0.0020 | 0.0468 ± 0.0015 | 0.0452 |
Platt Platt1999 | 80.48 | 0.0775 ± 0.0015 | 0.0772 ± 0.0015 | 0.0771 ± 0.0016 | 0.0773 ± 0.0014 | 0.0772 |
Vec Scal. Kull2019 | 80.53 | 0.0300 ± 0.0010 | 0.0298 ± 0.0012 | 0.0300 ± 0.0016 | 0.0303 ± 0.0011 | 0.0300 |
Mtx Scal. Kull2019 | 80.78 | 0.0282 ± 0.0014 | 0.0287 ± 0.0011 | 0.0286 ± 0.0014 | 0.0289 ± 0.0014 | 0.0286 |
1k Calibration Samples | | | | | | |
TS pmlr-v70-guo17a | 80.33 | 0.0439 ± 0.0022 | 0.0452 ± 0.0022 | 0.0454 ± 0.0020 | 0.0443 ± 0.0020 | 0.0447 |
GP wenger2020calibration | 80.33 | 0.0186 ± 0.0034 | 0.0182 ± 0.0019 | 0.0186 ± 0.0026 | 0.0190 ± 0.0022 | 0.0186 |
I-Max | 80.20 | 0.0200 ± 0.0033 | | | | |
I-Max w. GP | 80.20 | 0.0121 ± 0.0048 | | | | |
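The dependence on the evaluation scheme can be illustrated with a small ECE estimator. The sketch below is an assumption-laden toy, not the evaluation code used for Tab. S3: `ece_equal_width` quantizes continuous confidences into equal-width evaluation bins, while `ece_discrete` groups samples by their distinct confidence values, which is only meaningful when a calibrator emits a finite set of values (as a binning scheme does). Both function names and the toy inputs are hypothetical.

```python
import numpy as np

def ece_equal_width(conf, correct, num_bins):
    """Empirical ECE with equal-width evaluation bins on [0, 1]."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    # Using only the inner edges gives bin indices in 0 .. num_bins - 1.
    idx = np.digitize(conf, edges[1:-1])
    ece = 0.0
    for b in range(num_bins):
        in_bin = idx == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return ece

def ece_discrete(conf, correct):
    """Empirical ECE for discrete outputs: one group per distinct confidence value."""
    conf = np.asarray(conf, dtype=float).ravel()
    correct = np.asarray(correct, dtype=float).ravel()
    ece = 0.0
    for value in np.unique(conf):
        group = conf == value
        ece += group.mean() * abs(correct[group].mean() - value)
    return ece

# Two samples with continuous confidences: the estimate changes with the bin count.
conf = [0.55, 0.65]
correct = [1, 0]
ece_2_bins = ece_equal_width(conf, correct, num_bins=2)    # both samples share one bin
ece_10_bins = ece_equal_width(conf, correct, num_bins=10)  # each sample has its own bin
```

With two evaluation bins the over- and under-confident samples cancel inside one bin, whereas ten bins separate them, so the same predictions yield two different estimates. For discrete outputs, `ece_discrete` needs no evaluation bin count at all.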

## S7 Training Details

### S7.1 Pre-trained Classification Networks

We evaluate post-hoc calibration methods on three benchmark datasets, i.e., ImageNet imagenet_cvpr09, CIFAR-100 cifar, and CIFAR-10 cifar, and across three modern DNNs for each dataset, i.e., InceptionResNetV2 inceptionresnetv2, DenseNet161 dnpaper and ResNet152 He2016DeepRL for ImageNet, and Wide ResNet (WRN) wrnpaper, DenseNet-BC (, ) dnpaper and ResNext8x64 resnextpaper for the two CIFAR datasets.

All these models are publicly available pre-trained networks and details are reported at the respective websites, i.e., ImageNet classifiers: https://github.com/Cadene/pretrained-models.pytorch and CIFAR classifiers: https://github.com/bearpaw/pytorch-classification

### S7.2 Training Scaling Methods

The hyper-parameters were chosen based on the original publications of the respective scaling methods, with some exceptions. We found that the following parameters were the best for all the scaling methods. All scaling methods use the Adam optimizer with batch size for CIFAR and for ImageNet. The learning rate was set to for temperature scaling pmlr-v70-guo17a and Platt scaling Platt1999, for vector scaling pmlr-v70-guo17a and for matrix scaling pmlr-v70-guo17a. Matrix scaling was further regularized as suggested by Kull2019 with a loss on the bias vector and the off-diagonal elements of the weighting matrix. The hyper-parameters of BBQ Naeini2015, isotonic regression Zadrozny2002 and Beta pmlr-v54-kull17a were taken directly from wenger2020calibration.

### S7.3 Training I-Max Binning

The I-Max bin optimization started from k-means++ initialization, which uses JSD instead of Euclidean metric as the distance measure, see Sec. S3.2. Then, we iteratively and alternatively updated and according to (S12) until iterations. With the attained bin edges , we set the bin representatives based on the empirical frequency of class-. If a scaling method is combined with binning, an alternative setting for is to take the averaged prediction probabilities based on the scaled logits of the samples per bin, e.g., in Tab. 2 of Sec. 4.2. Note that, for CW binning in Tab. 1, the number of samples from the minority class is too few, i.e., . We only have about samples per bin, which are too few to use empirical frequency estimates. Alternatively, we set based on the raw prediction probabilities.
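Setting the bin representatives from already optimized bin edges can be sketched as a per-bin average. This is a minimal illustration, not the authors' code: passing binary class indicators as `values` gives the empirical frequency per bin, while passing the scaled prediction probabilities gives the alternative used when a scaling method is combined with binning. The helper name `per_bin_mean`, the NaN convention for empty bins, and the toy inputs are assumptions.

```python
import numpy as np

def per_bin_mean(logits, values, inner_edges):
    """Average `values` over the samples whose logit falls into each bin.

    inner_edges: sorted bin edges excluding -inf and +inf, so that
    len(inner_edges) + 1 bins cover the whole real line.
    """
    logits = np.asarray(logits, dtype=float)
    values = np.asarray(values, dtype=float)
    bin_idx = np.digitize(logits, inner_edges)
    num_bins = len(inner_edges) + 1
    reps = np.full(num_bins, np.nan)  # empty bins stay NaN (illustrative choice)
    for b in range(num_bins):
        in_bin = bin_idx == b
        if in_bin.any():
            reps[b] = values[in_bin].mean()
    return reps, bin_idx

# Toy calibration set for one class: logits and whether the sample has that label.
logits = np.array([-3.0, -2.0, -1.0, 0.5, 1.0, 2.0])
is_class_k = np.array([0, 0, 1, 1, 1, 0])
edges = np.array([-1.5, 0.0])  # hypothetical optimized bin edges (3 bins)

reps, bin_idx = per_bin_mean(logits, is_class_k, edges)
calibrated = reps[bin_idx]  # calibrated probability assigned to each sample
```

With few samples per bin the empirical frequency becomes noisy, which is the situation described above for the minority class in CW binning.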

## S8 Extending Tab. 1 to More Datasets and Models

Tab. 1 from Sec. 4.1 of the main paper is replicated across datasets and models, where the basic setting remains the same. Specifically, three different ImageNet models can be found in Tab. S4, Tab. S5 and Tab. S6. Three models for CIFAR100 can be found in Tab. S7, Tab. S8 and Tab. S9. Similarly, CIFAR10 models can be found in Tab. S10, Tab.
