
Estimating Expected Calibration Errors

by Nicolas Posocco, et al.

Uncertainty in the predictions of probabilistic classifiers is a key concern when models are used to support human decision making, when they feed broader probabilistic pipelines, or when sensitive automatic decisions have to be taken. Studies have shown that most models are not intrinsically well calibrated, meaning that their decision scores are not consistent with posterior probabilities. Hence being able to calibrate these models, or to enforce calibration while learning them, has regained interest in recent literature. In this context, properly assessing calibration is paramount to quantify new contributions tackling calibration. However, commonly used metrics leave room for improvement, and the evaluation of calibration could benefit from deeper analyses. This paper therefore focuses on the empirical evaluation of calibration metrics in the context of classification. More specifically, it evaluates different estimators of the Expected Calibration Error (ECE), including legacy estimators and some novel ones proposed in this paper. We build an empirical procedure to quantify the quality of these ECE estimators, and use it to decide which estimator should be used in practice in different settings.





1 Introduction

Almost all currently used classifiers are not intrinsically well calibrated [11], which means that their output scores cannot be interpreted as probabilities. This is an issue when the model is used for decision making, as a component in a more general probabilistic pipeline, or simply when one needs to quantify the uncertainty in the model's predictions, for example in high-risk applications.

To overcome this calibration issue, two main tracks have been explored: correcting the calibration of the model via some post-training procedure [13, 11, 8, 7], or regularizing the model to enforce calibration during training [9]. Whether the goal is to compare calibration methods quantitatively or to evaluate the uncertainty of predictions, one needs to quantify calibration precisely. The recent trend in the literature is to use estimators of the Expected Calibration Error (ECE) [10], which we focus on in this work.

We propose a few improvements on current ECE estimators as well as a novel approach for the estimation of this metric based on kernel density estimation. Through these new estimators we also introduce a continuous equivalent of the reliability diagram, built on the proposed notion of Local Calibration Error (LCE). This notion can be used in practice to evaluate the uncertainty of the predicted probabilities themselves, with an optional uncertainty interval. Furthermore, we designed the first experimental setup that enables the assessment of calibration metrics, in order to identify which estimators are the most relevant.

In this paper, we first present the context of this study in Section 2 and set up the formal definition of calibration in Section 3. The theoretical calibration metric, namely the Expected Calibration Error, and its legacy and newly proposed estimators, are presented in Section 4, where we also introduce the concept of Local Calibration Error. We finally assess in Section 5 the relevance of legacy and proposed estimators empirically using a broad empirical setup. (Footnote 1: The code ensuring the reproducibility of the experiments presented in this work is available at …)

2 Context and Related Work

The oldest attempt to quantify calibration is the reliability diagram [3, 11] for binary classification. Although it has been useful for the evaluation of early calibration methods, it does not provide the point estimate (a single value) required to systematically compare the calibration of different models. The first point estimate, proposed in [16], exploited a decision theory framework to use profit maximisation as a proxy for calibration quality, and required a specific type of dataset to be usable in practice. Mirroring the procedure used to compute the reliability diagram, the empirical Expected Calibration Error (ECE) was designed [3], and was later proven to be an estimator of the natural theoretical notion of calibration error [4]. Meanwhile, some works have used the negative log-likelihood (NLL) or the Brier score [16], which are both weak proxies for the calibration of classifiers [6]. Using reliability diagrams has become even more difficult in multiclass settings [17].

Recent works mostly rely on the binning-based legacy estimator of the ECE to quantify calibration. Defects of this estimator have been highlighted, such as its reliance on a hyperparameter and its bias-variance trade-off [12]. More recently, [7] clarified the notion of calibration for multiclass classifiers, and new estimators of the ECE with adaptive binning have been proposed in [12], alongside uncertainty-aware reliability diagrams [1]. Although the notion of calibration was originally defined for classifiers, it is currently being generalized to regression [5, 15].

In this context we aim at improving the evaluation of calibration in the setting of classification, and specifically focus on estimators of the ECE as the theoretical definition itself has been consistently adopted.

Definition 1

Let us consider the random variable $(X, Y)$, from which i.i.d. samples are drawn to build a training set and a holdout set of size $n$: $\{(x_i, y_i)\}_{1 \le i \le n}$. A classifier $M$ is a function learnt from the training set which outputs scores $M(x) \in \Delta^K$ (ideally the probabilities) of belonging to class $k$ for $k \in \{1, \dots, K\}$, where $\Delta^K$ is the probability simplex, which ensures that the scores sum up to one. In the rest of the paper the indexed notation $s_k$ represents the $k$th element of any vector $s$, and $x_i$ denotes the $i$th sample of the holdout set. For readability we use the notation $s_i^k$ for the output score $M(x_i)_k$.

3 Calibration

In this section we present and properly formalize the four different notions of calibration, and derive the corresponding Expected Calibration Errors (ECE).

Calibration characterizes how well the scores output by a model correspond to actual posterior probabilities. The first and simplest calibration notion [13] focuses on a specific class. It extends to the simultaneous calibration of all classes, with their associated scores considered independently, namely class-wise calibration [17]. Under this notion a classifier is well calibrated if all its one-vs-rest submodels are calibrated. The calibration concept for binary classification is equivalent to class-specific calibration focusing on the positive class, and to class-wise calibration, since the score for the negative class is determined by the score for the positive class. The more recently introduced confidence calibration [4] is only concerned with the model outputting relevant scores for the class it predicts for each sample. Finally, the most rigorous evaluation of calibration should take into account all classes as non-independent; the corresponding definition, multiclass calibration [13], is almost never used in practice for computability reasons. Throughout this paper, we only tackle the confidence and class-wise settings. All these notions are formalized in the following definition.

Definition 2

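Different calibration notions of a probabilistic classifier. A probabilistic classifier $M$ with $K$ classes is

Calibrated for class $k$: $P(Y = k \mid M(X)_k = s) = s$ for all $s \in [0, 1]$
Class-wise calibrated: $P(Y = k \mid M(X)_k = s) = s$ for all $s \in [0, 1]$ and all classes $k \in \{1, \dots, K\}$
Confidence calibrated: $P(Y = \arg\max_j M(X)_j \mid \max_j M(X)_j = s) = s$ for all $s \in [0, 1]$
Multiclass calibrated: $P(Y = k \mid M(X) = s) = s_k$ for all score vectors $s$ in the probability simplex and all classes $k \in \{1, \dots, K\}$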

The Expected Calibration Error (ECE) of a given model can be naturally derived from these theoretical formulations by computing the expected deviation from perfect theoretical calibration. This concept is applied to the different calibration settings and results in the following formulations:

Definition 3

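Expected calibration error (ECE) of a given model $M$ in the different settings, where $\mathrm{ECE}^{k}$ is the class-specific ECE associated with class $k$, $\mathrm{ECE}_{cw}$ the class-wise ECE [17], $\mathrm{ECE}_{conf}$ the confidence ECE [4] and $\mathrm{ECE}_{mc}$ the multiclass ECE.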

By replacing the expectation over the absolute differences by a maximum over them, we obtain the formulations of the Maximum Calibration Error (MCE) [10], which focus on the largest gap between posterior probabilities and the scores given by the model.
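The class-specific, class-wise and confidence formulations described in this section can be written out as follows. This is a sketch in assumed notation, not necessarily the authors' own: $M(X)_k$ denotes the score the model gives to class $k$ and $K$ the number of classes. The multiclass variant is omitted, and the last line gives the class-specific MCE obtained by swapping the expectation for a supremum.

```latex
\begin{align*}
\mathrm{ECE}^{k}(M) &= \mathbb{E}\Big[\, \big| P\big(Y = k \mid M(X)_k\big) - M(X)_k \big| \,\Big] \\
\mathrm{ECE}_{cw}(M) &= \frac{1}{K} \sum_{k=1}^{K} \mathrm{ECE}^{k}(M) \\
\mathrm{ECE}_{conf}(M) &= \mathbb{E}\Big[\, \big| P\big(Y = \arg\max_j M(X)_j \mid \max_j M(X)_j\big) - \max_j M(X)_j \big| \,\Big] \\
\mathrm{MCE}^{k}(M) &= \sup_{s \in [0, 1]} \big| P\big(Y = k \mid M(X)_k = s\big) - s \big|
\end{align*}
```

The class-wise line matches the rule stated in Section 4.5.2, where the class-wise ECE is the mean of the class-specific values.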

4 Estimation of calibration quantification

In this section we describe the challenges of calibration quantification, then present the existing tools to handle these challenges, namely the reliability diagram and the legacy estimator. We then introduce a new formalization of these estimators based on binning and sample mapping, which helps us define new binning-based estimators. Finally we present the new notion of Local Calibration Error, on which we rely to build continuous estimators of the ECE based on Kernel Density Estimation. All estimators are written for the class-specific calibration setting, and can then be transposed to the other settings using Definition 3.

4.1 Challenges of such quantification

Quantifying calibration is challenging in practice for two main reasons. First, calibration is intrinsically a local notion: miscalibration is defined on the neighbourhood of a given output score. Thus any global quantification of calibration depends on a procedure that aggregates local measures. This is what differentiates the ECE, which implicitly weights all parts of the score distribution according to its local density, from the MCE, which only cares about the worst-case scenario. Second, since calibration depends on score distributions, any relevant estimator relies on these scores, which means that we are limited by the amount of available validation data to perform such quantification.

A good calibration metric should specifically quantify calibration: contrary to the Brier score and the NLL, whose values only carry partial information on calibration, we expect a good metric to be independent of confounding factors. It should also be theoretically well founded as well as tractable in practice. Finally, a good calibration metric should be able to take into account cost matrices for the classification task, when available, risk management being intrinsically linked to such cost matrices.

The ECE satisfies the identified required properties for homogeneous cost matrices, since it derives directly from the theoretical notion of calibration and has an immediate interpretation. However, it does not allow heterogeneous cost matrices and, as we will see in the next sections, current estimators provide poor estimations of the true value of the ECE. For these reasons we focus on the setting of homogeneous cost classification, and try to provide better estimators for the ECE. Such estimators should be robust to hyperparameter choice, a problem which can be solved by the use of a relevant heuristic. They should be data-efficient too, in order to provide good estimates with a low variance even with few labeled holdout data points. They should provide low-bias estimates when a sufficient amount of data is available, and should finally be consistent and computable in a reasonable amount of time.

4.2 Reliability Diagram

The reliability diagram introduces the classical way of calculating the ECE. To build the reliability diagram (in the binary setting), a uniform binning scheme is used (the interval $[0, 1]$ is split into $B$ equal bins), and each holdout sample is mapped into a bin based on the score given by the model for the positive class (a procedure defined below as 1-bin mapping). For each bin, the average score for the positive class and the proportion of samples belonging to the positive class are calculated. The first is then plotted against the second. If the model is well calibrated, each point should fall on the line $y = x$. The local offset of each point tells us whether the model is locally over- or under-confident in its scores for the positive class. Such a diagram can be seen in Figure 1 (left).

Originally designed for the binary classification case, it can be easily extended to confidence calibration in the multiclass setting. In that case, samples are sent into bins based on the score the model outputs for the class it predicts, and the ratio of correct predictions is plotted against the average over the scores given for the predicted class.
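The construction of the diagram's points in the binary setting can be sketched as below. This is a minimal illustration rather than the authors' code: the function name `reliability_diagram_points` is hypothetical, and bins are taken as left-closed, a convention the text does not specify.

```python
import numpy as np

def reliability_diagram_points(scores, labels, n_bins=15):
    """Return (mean score, positive rate, count) for each non-empty uniform bin.

    scores: scores given to the positive class, in [0, 1].
    labels: 1 for positive samples, 0 otherwise.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Uniform binning of [0, 1]; a score equal to 1.0 goes into the last bin.
    bin_ids = np.minimum((scores * n_bins).astype(int), n_bins - 1)
    points = []
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            points.append((scores[in_bin].mean(),   # average score in the bin
                           labels[in_bin].mean(),   # proportion of positives
                           int(in_bin.sum())))      # number of samples
    return points
```

Plotting the first two entries of each tuple against each other gives the diagram; points off the diagonal show where the model is locally over- or under-confident.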

4.3 Binning based estimators

In order to present different binning-based estimators of the ECE, we formalize the binning and affectation mapping objects. We denote by $s_i$ the score of the class of interest for the $i$th sample, which depends on whether we consider class-specific or class-wise (fixed class) calibration, or confidence (predicted class) calibration.

Definition 4

Binning schemes
The segment $[0, 1]$ is split into $B$ bins used to assign each data point to one (or more) bins. These bins are defined by their respective thresholds. Hence to define a binning scheme one only needs to specify the increasing splitting function $t$ that computes the right threshold $t(b)$ of each bin $b$.

Two main binning schemes have been used to compute the ECE in the literature. Uniform binning splits the segment into bins of equal size: $t(b) = b / B$. Adaptive binning splits the segment so that each bin contains the same number of samples: $t(b) = s_{\sigma(\lceil b n / B \rceil)}$, $\sigma$ being the permutation which sorts the $n$ samples based on the score predicted for the class of interest.
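The two binning schemes can be sketched as functions returning the right threshold of each bin. The names `uniform_thresholds` and `adaptive_thresholds` are hypothetical, and the rounding used to split the samples into equal-mass groups is an assumption; the sketch also assumes there are at least as many samples as bins.

```python
import numpy as np

def uniform_thresholds(n_bins):
    """Right thresholds of n_bins equal-width bins on [0, 1]."""
    return np.arange(1, n_bins + 1) / n_bins

def adaptive_thresholds(scores, n_bins):
    """Right thresholds such that each bin holds roughly the same number of samples."""
    sorted_scores = np.sort(np.asarray(scores, dtype=float))
    n = len(sorted_scores)
    # Index (0-based) of the last sorted sample falling in each bin.
    last_idx = np.ceil(np.arange(1, n_bins + 1) * n / n_bins).astype(int) - 1
    thresholds = sorted_scores[last_idx]
    thresholds[-1] = 1.0  # the last bin always closes the segment [0, 1]
    return thresholds
```

With four scores `[0.1, 0.2, 0.3, 0.9]` and two bins, the adaptive scheme places its first threshold at 0.2, so each bin receives two samples, whereas the uniform scheme would put three of them in the first bin.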

Definition 5

Affectation mapping
Given a binning of the domain $[0, 1]$ into $B$ bins, an affectation mapping of $n$ samples in these bins is an $n \times B$ matrix $W$ composed of non-negative weights, such that $W_{i,b}$ is the weight of the affectation of sample $i$ to bin $b$. The rows of such a matrix sum up to 1.

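Using this formalisation, we start from the 1-bin mapping, in which every sample is assigned to a single bin with unit weight, and move to the newly proposed convex mapping, in which each sample may contribute to up to two bins in the computation of the binning-based ECE estimators. This mechanism is the one referred to as linear binning in the kernel density estimation field. These two mappings can be written mathematically as follows, where $c_b$ is the geometric centre of the $b$th bin. For the 1-bin mapping, $W_{i,b} = \mathbb{1}(t(b-1) < s_i \le t(b))$. For the convex mapping, a score $s_i$ lying between two consecutive centres $c_b \le s_i \le c_{b+1}$ gets $W_{i,b} = (c_{b+1} - s_i) / (c_{b+1} - c_b)$ and $W_{i,b+1} = (s_i - c_b) / (c_{b+1} - c_b)$, all its other weights being zero; a score below the first centre or above the last one is assigned entirely to the nearest bin.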
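The two mappings can be sketched as functions building the weight matrix from the scores and the right thresholds of the bins. This is an illustration under assumptions: the function names are hypothetical, bins are right-closed, and scores outside the range of bin centres are given entirely to the nearest bin, which the text does not spell out.

```python
import numpy as np

def one_bin_mapping(scores, thresholds):
    """Unit weight in the first bin whose right threshold is >= the score."""
    scores = np.asarray(scores, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    bin_ids = np.searchsorted(thresholds, scores, side="left")
    bin_ids = np.minimum(bin_ids, len(thresholds) - 1)
    W = np.zeros((len(scores), len(thresholds)))
    W[np.arange(len(scores)), bin_ids] = 1.0
    return W

def convex_mapping(scores, thresholds):
    """Split each sample between the two bins whose centres surround its score."""
    scores = np.asarray(scores, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    edges = np.concatenate(([0.0], thresholds))
    centres = (edges[:-1] + edges[1:]) / 2  # geometric centre of each bin
    W = np.zeros((len(scores), len(centres)))
    for i, s in enumerate(scores):
        if s <= centres[0]:
            W[i, 0] = 1.0       # below the first centre: nearest bin only
        elif s >= centres[-1]:
            W[i, -1] = 1.0      # above the last centre: nearest bin only
        else:
            right = np.searchsorted(centres, s, side="right")  # first centre > s
            left = right - 1
            width = centres[right] - centres[left]
            # Linear interpolation weights: the closer centre gets more weight.
            W[i, left] = (centres[right] - s) / width
            W[i, right] = (s - centres[left]) / width
    return W
```

A score sitting exactly on a bin edge, such as 0.5 with two uniform bins, is shared equally between the two neighbouring bins by the convex mapping, while the 1-bin mapping puts it entirely in one of them.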

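The original estimator of the ECE is basically a weighted mean of the absolute differences calculated when the reliability diagram is computed (here expressed in the class-specific case). If $W$ is a 1-bin mapping on a uniform binning and $\mathbb{1}$ is the indicator function, the legacy estimator is:

$\widehat{\mathrm{ECE}}^{k} = \sum_{b=1}^{B} \frac{\sum_{i} W_{i,b}}{n} \left| \frac{\sum_{i} W_{i,b}\, \mathbb{1}(y_i = k)}{\sum_{i} W_{i,b}} - \frac{\sum_{i} W_{i,b}\, s_i}{\sum_{i} W_{i,b}} \right| \quad (1)$

Such an estimator can be defined in the same way for the class-wise and confidence settings.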

We unify binning-based estimators under equation (1) with different binning/mapping schemes. One variant uses an adaptive binning with 1-bin mapping, another uses a uniform binning and a convex mapping, and a third combines both improvements on the legacy estimator, adaptive binning and convex mapping. In the case of class-wise calibration, the estimator defined in [12] is equivalent to the adaptive-binning variant with 1-bin mapping when all bins contain the same amount of samples.
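The unified view can be sketched as a single function that takes any affectation matrix and returns the weighted mean of per-bin gaps between observed frequency and mean score. The name `binned_ece` is hypothetical and the handling of empty bins (they are skipped) is an assumption.

```python
import numpy as np

def binned_ece(W, scores, is_class):
    """ECE estimate from an affectation matrix W (n samples x B bins).

    scores:   score of the class of interest for each sample.
    is_class: 1 where the ground truth is the class of interest, else 0.
    """
    W = np.asarray(W, dtype=float)
    scores = np.asarray(scores, dtype=float)
    is_class = np.asarray(is_class, dtype=float)
    mass = W.sum(axis=0)        # total weight received by each bin
    used = mass > 0             # empty bins contribute nothing
    mean_score = (W.T @ scores)[used] / mass[used]
    frequency = (W.T @ is_class)[used] / mass[used]
    # Bin gaps weighted by the share of the samples' weight in each bin.
    return float(np.sum(mass[used] / len(scores) * np.abs(frequency - mean_score)))
```

Because the function only sees the matrix, switching between uniform or adaptive binning and between 1-bin or convex mapping amounts to changing how the matrix is built.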

4.4 Local Calibration Error

We define the notion of Local Calibration Error (LCE), and then use it to build the reliability curve, a continuous version of the reliability diagram. Let us first begin with the formal definition of the LCE:

Definition 6

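Local calibration error (LCE) of a given model $M$ at a score $s \in [0, 1]$, for the class-specific and the confidence settings:

$\mathrm{LCE}^{k}(s) = \left| P(Y = k \mid M(X)_k = s) - s \right|$
$\mathrm{LCE}_{conf}(s) = \left| P(Y = \arg\max_j M(X)_j \mid \max_j M(X)_j = s) - s \right|$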

For the class-specific case, to estimate the LCE of a model for all scores $s \in [0, 1]$, we have to estimate $P(Y = k \mid M(X)_k = s)$. We resort to Bayes' rule to reduce this estimation to that of the density of $M(X)_k$, the density of $M(X)_k$ given $Y = k$, and the scalar $P(Y = k)$. We can then rely on kernel density estimation (KDE) to estimate the two densities. Theoretically, this approach is continuous. In our implementation, however, both KDEs are evaluated numerically in Fourier space (the first one on all scores for the class $k$ and the second one on all scores for the class $k$ when the ground truth is the class $k$), which makes the computation efficient, with $O(N \log N)$ complexity if $N$ is the number of numeric subdivisions of the domain $[0, 1]$. We use steps of 0.0003 for precision, and mirror the data around 0 and 1, which are the limits of the domain. This mirroring implies a slight bias in estimations due to a leak of density mass. Once again the confidence LCE can be estimated in the same way with the relevant scores and classes.
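The Bayes step can be spelled out as below, in assumed notation: $p_k$ is the density of the scores for class $k$, $p_{k \mid k}$ the density of those scores restricted to samples whose ground truth is $k$, and $\pi_k = P(Y = k)$. Hats denote the two KDE estimates and the empirical class frequency.

```latex
\begin{align*}
P\big(Y = k \mid M(X)_k = s\big) &= \frac{p_{k \mid k}(s)\, \pi_k}{p_k(s)} \\
\widehat{\mathrm{LCE}}^{k}(s) &= \left| \frac{\hat{p}_{k \mid k}(s)\, \hat{\pi}_k}{\hat{p}_k(s)} - s \right|
\end{align*}
```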

A continuous equivalent of the reliability diagram can be derived from such an object. The reliability curve associated with the classifier $M$ and the class $k$, for class-specific calibration, is the function $s \mapsto P(Y = k \mid M(X)_k = s)$ on $[0, 1]$.

An example of such a reliability curve is shown in Figure 1 (middle).

The main benefit of the proposed notion of local calibration error is its usability in practice to know the uncertainty of a model on a specific score, which cannot be evaluated with enough precision using previous tools (points in a reliability diagram can be interpolated to the same end, yet the precision of such a procedure is very low, and interpolation at that scale is questionable).

We propose to compute this curve on bootstrapped versions of the holdout set, in order to quantify the uncertainty on this LCE. In this context, the median curve is considered as the reliability curve and percentiles of interest are used for uncertainty quantification. This idea, illustrated in Figure 1 (right), allows the prediction of confidence intervals for the class probabilities instead of point estimates, by only looking at the uncertainty on the bootstrapped reliability curve at the score output by the model.

Figure 1: Reliability diagram with 15 bins (left), reliability curve with a bandwidth of 0.03 (middle) and the bootstrapped version with the same bandwidth (right). Each plot brings one more level of insight.

4.5 Density based estimator

Based on the definition of this Local Calibration Error we can derive a new estimator, which is formalized as follows:

$\widehat{\mathrm{ECE}}^{k}_{kde} = \int_0^1 \widehat{\mathrm{LCE}}^{k}(s)\, p_k(s)\, ds$

where $p_k$ is the probability density function of the scores given by the model for class $k$.
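A density-based estimate along these lines can be sketched as below. It is a simplified illustration, not the authors' implementation: the KDE is evaluated directly on a grid rather than in Fourier space, a Gaussian kernel is assumed, and the names `gaussian_kde_mirrored` and `kde_ece` as well as the default bandwidth and grid size are hypothetical.

```python
import numpy as np

def gaussian_kde_mirrored(samples, grid, bandwidth):
    """Gaussian KDE on [0, 1] with the samples reflected around 0 and 1."""
    samples = np.asarray(samples, dtype=float)
    if len(samples) == 0:
        return np.zeros_like(grid)
    reflected = np.concatenate((samples, -samples, 2.0 - samples))
    z = (grid[:, None] - reflected[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    # Normalise by the number of original samples so the mass on [0, 1] stays near 1.
    return kernel.sum(axis=1) / (len(samples) * bandwidth)

def kde_ece(scores, is_class, bandwidth=0.03, n_grid=1001):
    """Return (ECE estimate, grid, local calibration error on the grid)."""
    scores = np.asarray(scores, dtype=float)
    is_class = np.asarray(is_class, dtype=bool)
    grid = np.linspace(0.0, 1.0, n_grid)
    p_s = gaussian_kde_mirrored(scores, grid, bandwidth)                     # all scores
    p_s_given_k = gaussian_kde_mirrored(scores[is_class], grid, bandwidth)   # scores with Y = k
    prior = is_class.mean()                                                  # estimate of P(Y = k)
    # Bayes' rule; the posterior is set to 0 where the score density vanishes.
    posterior = np.divide(prior * p_s_given_k, p_s,
                          out=np.zeros_like(p_s), where=p_s > 0)
    lce = np.abs(posterior - grid)
    integrand = lce * p_s
    step = grid[1] - grid[0]
    ece = float(np.sum((integrand[1:] + integrand[:-1]) / 2.0) * step)       # trapezoidal rule
    return ece, grid, lce
```

The returned `posterior`-based curve is what the reliability curve plots; recomputing it on bootstrap resamples of the holdout set would give the percentile bands described in Section 4.4.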


4.5.1 Heuristics for hyperparameter choices

For all binning-based estimators, we investigate the use of a simple heuristic to select the number of bins used for the estimations: the number of bins is the square root of the number of samples. For the KDE-based approach, we propose to use Silverman's rule [14] to select the bandwidth (the bandwidth is estimated for one of the two densities, and the same bandwidth is used to estimate the other one). Other heuristics are often used for KDE computations, yet Silverman's rule is to our knowledge the only one which provides satisfying results in small data contexts, for which legacy estimators struggle the most.
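Both heuristics fit in a few lines. The sketch below uses the common rule-of-thumb form of Silverman's bandwidth, 0.9 times the smaller of the standard deviation and IQR/1.34, times $n^{-1/5}$; whether the authors use exactly this variant is an assumption, as is rounding the square root to the nearest integer. The function names are hypothetical.

```python
import math
import numpy as np

def sqrt_bin_count(n_samples):
    """Square-root heuristic for the number of bins (rounded, at least 1)."""
    return max(1, int(round(math.sqrt(n_samples))))

def silverman_bandwidth(samples):
    """Silverman's rule-of-thumb bandwidth for a Gaussian kernel."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    std = samples.std(ddof=1)
    q75, q25 = np.percentile(samples, [75, 25])
    # Use the more conservative of the two spread measures.
    spread = min(std, (q75 - q25) / 1.34)
    return 0.9 * spread * n ** (-1 / 5)
```

Note that when most scores pile up at 0 or 1 the interquartile range can collapse, which drives this bandwidth towards zero; that is one way the rule can misbehave on very sharp score distributions.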

4.5.2 From class-specific to the other settings

To translate the class-specific estimators into the class-wise case, class-specific ECEs are estimated for all classes, and the class-wise ECE is the mean of these values. To get to the confidence case, scores for the class of interest are replaced by the score for the predicted class, and the class of interest is the ground truth label.
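These two translations can be sketched as thin wrappers around any class-specific estimator taking `(scores, is_class)`. The wrapper names are hypothetical, and `single_bin_ece` is only a deliberately crude stand-in used to keep the example self-contained, not one of the estimators studied here.

```python
import numpy as np

def classwise_ece(probas, labels, class_specific_ece):
    """Mean of the class-specific ECE estimates over all classes."""
    n_classes = probas.shape[1]
    per_class = [class_specific_ece(probas[:, k], labels == k) for k in range(n_classes)]
    return float(np.mean(per_class))

def confidence_ece(probas, labels, class_specific_ece):
    """Same estimator applied to the top score and to prediction correctness."""
    predicted = probas.argmax(axis=1)
    return float(class_specific_ece(probas.max(axis=1), labels == predicted))

def single_bin_ece(scores, is_class):
    """Stand-in estimator: one bin covering [0, 1]."""
    return abs(np.mean(is_class) - np.mean(scores))

probas = np.array([[0.8, 0.2], [0.6, 0.4], [0.3, 0.7], [0.1, 0.9]])
labels = np.array([0, 1, 1, 1])
cw = classwise_ece(probas, labels, single_bin_ece)
conf = confidence_ece(probas, labels, single_bin_ece)
```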

5 Experimental setup

We assess a few empirical properties of the different estimators. As pointed out in [12], the main difficulty with the empirical evaluation of calibration methods and calibration metrics is that we do not have access to ground truths in general. This is why we worked on a setup which gives us access to arbitrarily precise estimates of the ECE, considered as the ground truth, in the class-wise and confidence settings.

5.1 Procedure

We aim to compare the estimators quantitatively in terms of approximation, data efficiency and variance. To do so we build curves which indicate the expected performance of each estimator with its corresponding parameters, for different sizes of holdout set. In order to obtain statistically robust results we introduce various degrees of variability in our experiment: at the distribution level, in the algorithm used to train the models, and in the train/holdout set splits. The results are thus based on numerous realistic output score distributions.

The distribution variability is introduced by creating synthetic sample sets from Gaussian mixtures, where each class is composed of 4 modes of the mixture. For each mode we build the mean vector with elements drawn uniformly at random, and the covariance matrix is built as follows: we first sample a matrix with uniformly drawn elements, then multiply it by its transpose to get the required positive definite matrix. This sample set generation is carried out for various numbers of classes and dimensions of the feature space, with 5 different large datasets sampled from each combination, resulting in 45 synthetic distributions.
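The generation scheme can be sketched as below. The sampling ranges for the means and for the matrix entries are placeholders chosen for illustration, as are the function name `sample_gaussian_mixture` and the choice of equally likely modes; none of these values come from the text.

```python
import numpy as np

def sample_gaussian_mixture(n_samples, n_classes, n_features, modes_per_class=4,
                            mean_range=(-10.0, 10.0), factor_range=(-1.0, 1.0), seed=0):
    """Draw (X, y) from a Gaussian mixture with modes_per_class modes per class."""
    rng = np.random.default_rng(seed)
    n_modes = n_classes * modes_per_class
    # One mean vector per mode, elements drawn uniformly (range is an assumption).
    means = rng.uniform(*mean_range, size=(n_modes, n_features))
    # One random square matrix A per mode; A @ A.T is symmetric positive semi-definite.
    factors = rng.uniform(*factor_range, size=(n_modes, n_features, n_features))
    covariances = factors @ np.transpose(factors, (0, 2, 1))
    # Each sample picks a mode uniformly at random (assumption), then its class follows.
    mode_ids = rng.integers(n_modes, size=n_samples)
    X = np.empty((n_samples, n_features))
    for m in range(n_modes):
        idx = np.flatnonzero(mode_ids == m)
        if len(idx) > 0:
            X[idx] = rng.multivariate_normal(means[m], covariances[m], size=len(idx))
    y = mode_ids // modes_per_class
    return X, y
```

Calling the function with different seeds for each combination of class count and feature dimension would give several large datasets per combination, in the spirit of the setup described above.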

In order to produce various relevant score distributions from these data distributions, we trained 4 different types of models (logistic regression, Gaussian naïve Bayes classifier, support vector classifier and random forest) on 3 train sets of size 300 sampled from the previously generated large datasets. For each of these trained models we compute the "ground truth" ECE using the legacy estimator with high granularity (2000 bins) on the remaining holdout set. Then, for sizes taken between 30 and 500 on a logarithmic scale, we build 200 evaluation sets which are bootstrapped versions of the holdout set. The "ground truth" is used as a reference to compute the approximation error (the absolute value of the difference between the estimated ECE and its true value, normalized by the ground truth). Among those 200 values per evaluation set size, we keep the 95th percentile of the approximation errors, below which 95% of such errors lie. For each evaluation set size and estimator, we finally plot the median over the 540 95th percentiles obtained across the score distributions (45 data distributions × 4 model types × 3 train sets). The resulting curves can be seen in Figure 2. The number of evaluations of the learning algorithms plus the ECE estimators makes this experiment long to run, but as all the estimators have limited computational complexity the overall computation remains feasible.
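The error statistic for one score distribution and one evaluation size can be sketched as below. The function name `approximation_error_p95` is hypothetical, `estimator` stands for any class-specific estimator taking `(scores, is_class)`, and the number of evaluation sizes on the logarithmic scale (10 here) is an assumption.

```python
import numpy as np

def approximation_error_p95(estimator, scores, is_class, reference, eval_size,
                            n_bootstraps=200, seed=0):
    """95th percentile of the normalized absolute error over bootstrapped evaluation sets."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    is_class = np.asarray(is_class)
    errors = np.empty(n_bootstraps)
    for b in range(n_bootstraps):
        idx = rng.integers(len(scores), size=eval_size)   # sampling with replacement
        estimate = estimator(scores[idx], is_class[idx])
        errors[b] = abs(estimate - reference) / reference  # normalized by the ground truth
    return float(np.percentile(errors, 95))

# Evaluation set sizes between 30 and 500 on a logarithmic scale.
eval_sizes = np.unique(np.geomspace(30, 500, num=10).round().astype(int))
```

Repeating this for every score distribution and taking the median of the returned values per evaluation size gives one point of a curve such as those in Figure 2.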

Figure 2: Median 95th percentile of the approximation error (absolute value of the normalized relative deviation with respect to the ground truth ECE) for the different estimators of the ECE in each of the two settings, one panel per setting (lower is better), for evaluation sets of size between 30 and 500 samples. The scale is logarithmic for both axes.

5.2 Results analysis

For all settings, the error of all estimators is very high in small data regimes (the estimation error is around 10% of the true value), and thus one should not evaluate calibration on so little data, no matter which estimator is used.

For the confidence setting, the best performing estimator in almost all data regimes is the density-based (KDE) estimator with Silverman's rule. This is good news, since we now have a procedure to estimate the ECE which does not rely on a sensitive hyperparameter choice, but instead on a simple heuristic. As for the convex mapping scheme, it empirically improves the performance of both the legacy estimator and the one using adaptive binning; the latter underperforms on its own, probably because of the increased variance induced by the adaptivity. It is worth noting that above 300 samples, many of the estimators show similar performances. As for the square root heuristic for the automatic choice of the number of bins, the graphs suggest that the number of bins grows slightly too fast on average as the number of samples increases.

For the class-wise setting, there is no estimator which clearly outperforms the others in all data regimes. For fewer than 100 samples, one of the binning-based estimators, used with the square root heuristic, seems to be the best choice. The same estimator, this time with a fixed small number of bins, is then the most precise one. The observation made earlier about the square root heuristic still holds, and Silverman's heuristic for the bandwidth seems to be a less relevant choice in the class-wise setting than in the confidence one. We assume this is due to the sharpness of the score distributions for each class in the class-wise setting (most of the density being very close to 0 and 1), a context in which Silverman's bandwidth is known to underperform for kernel density estimation.

6 Conclusions

We have introduced a few improvements on the legacy estimators, from the proposal of new binning schemes to the use of heuristics that automatically pick relevant values for the hyperparameters of ECE estimators. On top of this, a novel approach has been built to properly define the notion of local calibration error, which produces novel estimators for the ECE. By testing all approaches on a synthetic experimental setup for which we had access to very precise estimates of the theoretical ECE, we have been able to compare all candidate estimators. This systematic evaluation, which had never been done until now, allowed us to formulate some recommendations on which estimator to use in which context.

Our proposed solutions lead to natural potential future works. First, the introduced calibration curve suggests a natural post-training calibration method, since it can be seen as a calibration map. Such a method would be interesting to evaluate, yet poses the problem that the associated calibration maps are not monotonic, which is considered a prerequisite for post-hoc calibration procedures in the literature. Second, multiclass calibration, whose evaluation is still an open problem today, could potentially be evaluated in the score space using an adapted variant of our KDE approach, which we think would not suffer as much as legacy estimators from the increase in dimensionality. Finally, even if this paper uses classical kernels and a mirroring approach to constrain density estimations to the domain $[0, 1]$, which allows standard and fast KDE computation, some preliminary investigations using a beta pseudo-kernel (the second one introduced in [2]), which is naturally constrained to this domain, show promising results. Because this kernel has a different shape for each support point in $[0, 1]$, it is computationally prohibitive for now, and needs further exploration.


References

  • [1] J. Bröcker and L. A. Smith (2007) Increasing the reliability of reliability diagrams. Weather and Forecasting 22 (3), pp. 651–661.
  • [2] S. X. Chen (1999) Beta kernel estimators for density functions. Computational Statistics & Data Analysis 31 (2), pp. 131–145.
  • [3] M. H. DeGroot and S. E. Fienberg (1983) The comparison and evaluation of forecasters. Journal of the Royal Statistical Society: Series D (The Statistician) 32 (1-2), pp. 12–22.
  • [4] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In 34th International Conference on Machine Learning, ICML 2017, Vol. 3, pp. 2130–2143. arXiv:1706.04599, ISBN 9781510855144.
  • [5] G. Keren, N. Cummins, and B. Schuller (2018) Calibrated prediction intervals for neural network regressors. IEEE Access 6, pp. 54033–54041. arXiv:1803.09546, ISSN 21693536.
  • [6] M. Kull and P. Flach (2015) Novel decompositions of proper scoring rules for classification: score adjustment as precursor to calibration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Vol. 9284, pp. 68–85. ISBN 9783319235271, ISSN 16113349.
  • [7] M. Kull, M. Perello-Nieto, M. Kängsepp, H. Song, P. Flach, et al. (2019) Beyond temperature scaling: obtaining well-calibrated multiclass probabilities with Dirichlet calibration. In Advances in Neural Information Processing Systems 32.
  • [8] M. Kull, T. Silva Filho, and P. Flach (2017) Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers. In Artificial Intelligence and Statistics, pp. 623–631.
  • [9] A. Kumar, S. Sarawagi, and U. Jain (2018) Trainable calibration measures for neural networks from kernel mean embeddings. In International Conference on Machine Learning, pp. 2805–2814.
  • [10] M. P. Naeini, G. Cooper, and M. Hauskrecht (2015) Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 29.
  • [11] A. Niculescu-Mizil and R. Caruana (2005) Predicting good probabilities with supervised learning. In ICML 2005 - Proceedings of the 22nd International Conference on Machine Learning, pp. 625–632. ISBN 1595931805.
  • [12] J. V. Nixon, M. W. Dusenberry, L. Zhang, G. Jerfel, and D. Tran (2019) Measuring calibration in deep learning. In CVPR Workshops, Vol. 2. arXiv:1904.01685, ISSN 23318422.
  • [13] J. Platt et al. (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers 10 (3), pp. 61–74.
  • [14] B. W. Silverman (1986) Density estimation for statistics and data analysis. Vol. 26, CRC Press.
  • [15] H. Song, T. Diethe, M. Kull, and P. Flach (2019) Distribution calibration for regression. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 5897–5906.
  • [16] B. Zadrozny and C. Elkan (2001) Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In International Conference on Machine Learning (ICML), pp. 1–8. ISBN 1-55860-778-1.
  • [17] B. Zadrozny and C. Elkan (2002) Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 704. ISBN 158113567X.