
Variable-Based Calibration for Machine Learning Classifiers

The deployment of machine learning classifiers in high-stakes domains requires well-calibrated confidence scores for model predictions. In this paper we introduce the notion of variable-based calibration to characterize calibration properties of a model with respect to a variable of interest, generalizing traditional score-based calibration and metrics such as expected calibration error (ECE). In particular, we find that models with near-perfect ECE can exhibit significant variable-based calibration error as a function of features of the data. We demonstrate this phenomenon both theoretically and in practice on multiple well-known datasets, and show that it can persist after the application of existing recalibration methods. To mitigate this issue, we propose strategies for detection, visualization, and quantification of variable-based calibration error. We then examine the limitations of current score-based recalibration methods and explore potential modifications. Finally, we discuss the implications of these findings, emphasizing that an understanding of calibration beyond simple aggregate measures is crucial for endeavors such as fairness and model interpretability.


1 Introduction

Predictive models built by machine learning algorithms are increasingly informing decisions across high-stakes applications such as medicine (rajkomar2019machine), employment (chalfin2016productivity), and criminal justice (zavrsnik2021). There is also broad recent interest in developing systems where humans and machine learning models collaborate to make predictions and decisions (kleinberg2018human; bansal2021most; de2021classification; steyvers2022bayesian). A critical aspect of using model predictions in such contexts is calibration. In particular, in order to trust the predictions from a machine learning classifier, these predictions must be accompanied by well-calibrated confidence scores.

In practice, however, it has been well documented that machine learning classifiers such as deep neural networks tend to produce poorly calibrated class probabilities (guo2017calibration; vaicenavicius2019evaluating; ovadia2019). As a result, a variety of recalibration techniques have been developed, which aim to ensure that a model’s confidence (or score) matches its true accuracy. A widely used approach is post-hoc recalibration: methods that use a separate labeled dataset to learn a mapping from the original model’s class probabilities to calibrated probabilities, often via a relatively simple one-dimensional mapping (e.g., plattprobabilistic; kull2017; kumar2019verified). These methods have been shown to generally improve the empirical calibration error of a model, as commonly measured by the expected calibration error (ECE).

However, as we show in this paper, aggregate measures of score-based calibration error such as ECE can hide significant systematic miscalibration in other dimensions of a model’s performance. To address this issue we introduce the notion of variable-based calibration to better understand how the calibration error of a model can vary as a function of a variable of interest. The variable of interest can be an input variable to the model or some other metadata variable; in this paper we focus in particular on real-valued variables. For example, in prediction problems involving individuals (e.g., credit-scoring or medical diagnosis) one such variable would be age. Detecting systematic miscalibration is important for problems such as assessing the fairness of a model, for instance detecting that a model is significantly overconfident for some age ranges and underconfident for others.

As an illustrative example, consider a simple classifier trained to predict the presence of cardiovascular disease (https://www.kaggle.com/sulianova/cardiovascular-disease-dataset). After the application of standard post-hoc calibration, this model attains a relatively low ECE of 0.74%. This low ECE is reflected in the reliability diagram shown in Figure 1(a), which shows near-perfect alignment with the diagonal. If a user of this model were only to consider aggregate metrics such as ECE, they might reasonably conclude that the model is generally well-calibrated. However, evaluating model error and predicted error with respect to the variable Patient Age reveals an undesirable and systematic miscalibration pattern with respect to this variable, as illustrated in Figure 1(b): the model is underconfident by upwards of five percentage points for younger patients, and is significantly overconfident for older patients.

(a) Reliability diagram (for accuracy)
(b) Variable-based calibration plot (for error)
Figure 1:

Calibration plots for a neural network predicting cardiovascular disease, after recalibration with Platt scaling: (a) reliability diagram, (b) LOESS-smoothed estimates with confidence intervals of actual and model-predicted error as a function of patient age. This dataset consists of 70,000 records of patient data (49,000 train, 6,000 validation, 15,000 test), with a binary prediction task of determining the presence of cardiovascular disease.

In this paper, we systematically investigate variable-based calibration for classification models, from both theoretical and empirical perspectives. In particular, our contributions are as follows:

  1. We introduce the notion of variable-based calibration and define a per-variable calibration metric (VECE).

  2. We characterize theoretically the relationship between variable-based miscalibration measured via VECE and traditional score-based miscalibration measured via ECE.

  3. We demonstrate, across multiple well-known tabular, text, and image datasets and a variety of models, that significant variable-based miscalibration can exist in practice, even after the application of standard score-based recalibration methods.

  4. We investigate variable-based recalibration methods and demonstrate empirically that these methods can simultaneously reduce both ECE and VECE. (All of our code is available online at https://github.com/markellekelly/variable-wise-calibration.)

2 Related Work

Visualizing Model Performance by Variable:

A number of techniques have been developed for visual understanding and diagnosis of model performance with respect to a particular variable of interest. One such technique is partial dependence plots (friedman2001greedy; molnar2020interpretable), which visualize the effect of an input feature of interest on model predictions. Another approach is dashboards such as FairVis (cabrera2019fairvis) which enable the exploration of model performance (e.g., accuracy, false positive rate) across various data subgroups. However, none of this prior work has investigated the visualization of per-variable calibration properties of a model, i.e., how a model’s own predictions of accuracy (or error) vary as a function of a particular variable.

Quantifying Model Calibration by Variable:

Work on calibration for machine learning classifiers has largely focused on score-based calibration: reliability diagrams, the ECE, and standard recalibration methods are all defined with respect to confidence scores (murphy1977reliability; huang2020tutorial; song2021classifier). An exception to this is in the realm of fairness, where researchers have generally called for disaggregated model evaluation, e.g., computing metrics of interest individually for sensitive sub-populations (mitchell2019model; raji2020closing). To this end, several notions of calibration that move beyond standard aggregate measures have been introduced: johnsonmasses check calibration across all identifiable subpopulations of the data, pan2020field evaluate calibration over data subsets corresponding to a categorical variable of interest, and luolocalized compute “local calibration” based on the average classification error on similar samples. This prior work does not, however, address the issue of variable-based calibration from a general perspective. In contrast, in this paper we develop a systematic framework that provides theoretical insight and enables empirical calibration estimation, visualization, and recalibration on a per-variable basis.

3 Background on Score-Based ECE

Consider a classification problem mapping inputs $x$ to predictions for labels $y \in \{1, \ldots, K\}$. Let $f$ be a black-box classifier which outputs label probabilities $p(y = k \mid x)$ for each $k \in \{1, \ldots, K\}$. Then, for the standard 0-1 loss function, the predicted label is $\hat{y} = \arg\max_k p(y = k \mid x)$ and the corresponding confidence score is $s(x) = \max_k p(y = k \mid x)$. It is of interest to determine whether such a model is well-calibrated, that is, whether its confidence matches the true probability that a prediction is correct.

For a given confidence score $s$, we define the accuracy $A(s) = P(\hat{y} = y \mid s)$. Then the calibration error (CE), as a function of the confidence score $s$, is defined as the difference between accuracy and confidence score (kumar2019verified):

$$\text{CE}(s) = |A(s) - s| \qquad (1)$$

where $s \in [1/K, 1]$. In this paper, we will focus on the expectation of the calibration error with respect to the distribution of $s$, known as the ECE:

$$\text{ECE} = E_s\big[\text{CE}(s)\big] = E_s\big[|A(s) - s|\big] \qquad (2)$$

where an ECE of zero corresponds to “perfect” calibration. In practice, ECE is often estimated empirically on a labeled test dataset by creating bins over $s$ according to some binning scheme (e.g., guo2017calibration):

$$\widehat{\text{ECE}} = \sum_{b=1}^{B} \frac{n_b}{n} \left| \hat{A}_b - \hat{s}_b \right| \qquad (3)$$

where $n_b$ is the number of datapoints in bin $b$, $n$ is the total number of datapoints, and $\hat{A}_b$ and $\hat{s}_b$ are the estimated accuracy and estimated average value of confidence, respectively, in bin $b$.
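To make the estimator concrete, the following is a minimal sketch (ours, not from the paper’s released code) of the binned estimate in Equation 3, using simple equal-width bins over the scores:

```python
import numpy as np

def expected_calibration_error(scores, correct, n_bins=10):
    """Binned empirical ECE (Equation 3) with equal-width bins over scores.

    scores: array of confidence scores s(x) in [1/K, 1]
    correct: array of 0/1 indicators of whether each prediction was correct
    """
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(scores)
    edges = np.linspace(scores.min(), scores.max(), n_bins + 1)
    # Assign each point to one of n_bins bins over the score range
    idx = np.clip(np.digitize(scores, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        n_b = mask.sum()
        if n_b == 0:
            continue
        acc_b = correct[mask].mean()   # estimated accuracy in bin b
        conf_b = scores[mask].mean()   # estimated mean confidence in bin b
        ece += (n_b / n) * abs(acc_b - conf_b)
    return ece
```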

4 Variable-Based Calibration Error

In many applications, we may be motivated to understand miscalibration of a classification model relative to one or more particular variables of interest. As shown in Figure 1, traditional reliability diagrams and the ECE miscalibration measure may be insufficient to fully characterize this type of variable-based miscalibration.

Consider a real-valued variable $V$ taking values $v$. In general $V$ can be a variable related to the inputs $x$ of the model, e.g., one of the input features, or another feature (e.g., metadata) defined per instance but not used in the model, or some function of the inputs $x$. To evaluate model calibration with respect to $V$, we introduce the notion of variable-based calibration error (VCE), defined pointwise as a function of $v$:

$$\text{VCE}(v) = |A(v) - S(v)| \qquad (4)$$

where $A(v) = P(\hat{y} = y \mid V = v)$ is the accuracy of the model, marginalizing over inputs to the model that do not involve $V$, conditioned on $V = v$. $S(v)$ is the expected model score conditioned on a particular value $v$:

$$S(v) = E\big[s(x) \mid V = v\big] \qquad (5)$$

In general, conditioning on $V = v$ will induce a distribution over inputs $x$, which in turn induces a distribution over scores $s(x)$ and predictions $\hat{y}$. As an example of VCE($v$), in the context of Figure 1(b), the VCE at a particular patient age $v$ is the absolute difference between the estimated model error rate and the model-predicted error at that age.

The expected value of VCE($v$), with respect to the density $p(v)$, is defined as:

$$\text{VECE} = E_v\big[\text{VCE}(v)\big] = E_v\big[|A(v) - S(v)|\big] \qquad (6)$$

Comment

Note that CE (and ECE) can be seen as a special case of VCE (and VECE) given the correspondence of Equations 1 and 2 with Equations 4 and 6 when $V$ is the model score itself (i.e., $V = s(x)$). In the rest of the paper, however, we view CE and ECE as being distinct from VCE and VECE in order to highlight the differences between score-based and variable-based calibration.

As with ECE, a practical way to compute an empirical estimate of VECE is by binning, where bins are defined by some binning scheme (e.g., equal weight) over values $v$ of the variable (rather than over scores $s$):

$$\widehat{\text{VECE}} = \sum_{b=1}^{B} \frac{n_b}{n} \left| \hat{A}_b - \hat{S}_b \right| \qquad (7)$$

Here $b$ is a bin corresponding to some sub-range of $v$ values, $n_b$ is the number of points within this bin, and $\hat{A}_b$ and $\hat{S}_b$ are empirical estimates of the model’s accuracy and the model’s average confidence (average score) within bin $b$. For example, the ECE in Figure 1 is 0.74%, while the VECE is 2.04%.

The definitions of VCE and VECE above are in terms of a continuous variable $V$, which is our primary focus in this paper. In general, the definitions above and the theoretical results in Section 5 also apply to discrete-valued $V$, as well as to multivariate $V$.
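Analogously, a minimal sketch (ours) of the binned VECE estimate in Equation 7, using equal-weight (quantile) bins over the variable:

```python
import numpy as np

def variable_based_ece(v, scores, correct, n_bins=10):
    """Binned empirical VECE (Equation 7): equal-weight bins over variable v."""
    order = np.argsort(v)
    scores = np.asarray(scores, dtype=float)[order]
    correct = np.asarray(correct, dtype=float)[order]
    n = len(scores)
    # Split the points, sorted by v, into n_bins groups of (nearly) equal size
    groups = np.array_split(np.arange(n), n_bins)
    vece = 0.0
    for g in groups:
        if len(g) == 0:
            continue
        acc_b = correct[g].mean()   # estimated accuracy within the bin
        conf_b = scores[g].mean()   # estimated mean confidence within the bin
        vece += (len(g) / n) * abs(acc_b - conf_b)
    return vece
```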

5 Theoretical Results

In this section, we establish a number of results on the relationship between ECE and VECE. All proofs can be found in Appendix A.

First, we show that the ECE and VECE can differ by a gap of up to 50%.

Theorem 1 (VECE bound). There exist $K$-ary classifiers $f$ and variables $V$ such that the classifier has both $\text{ECE} = 0$ and variable-based $\text{VECE} = \frac{K-1}{2K}$.

For example, in the binary case with $K = 2$, the difference between ECE and VECE can be as large as 0.25. As the number of classes grows, this gap approaches 0.5. Thus, we can have models that are perfectly calibrated according to ECE (with ECE = 0) but that can have variable-based calibration error (VECE) ranging from 0.25 to 0.5. We will show later in our experimental results section that this type of gap is not just a theoretical artifact but also exists in real-world datasets, for real-world classifiers and for specific variables of interest. The proof of Theorem 1 is by construction, using a model that is very underconfident for certain regions of $V$ and very overconfident in other regions of $V$, but perfectly calibrated with respect to the score $s$.
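As a concrete illustration (our own worked example, consistent with the construction in Appendix A): let $K = 2$, suppose the classifier always outputs confidence $s = 0.75$, and suppose its accuracy is $1$ for values of $V$ above the median and $0.5$ below it. The accuracy at the single occurring score is $\frac{1}{2}(1) + \frac{1}{2}(0.5) = 0.75 = s$, so $\text{ECE} = 0$; but $\text{VCE}(v) = |1 - 0.75| = |0.5 - 0.75| = 0.25$ for every $v$, so $\text{VECE} = 0.25$.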

In earlier work, kumar2019verified proved that the binned empirical estimator consistently underestimates the true ECE, and showed by construction that this gap can approach 0.5. Our results complement this work in that we are concerned with the true theoretical relationship between two different measures of calibration, namely ECE and VECE, whereas kumar2019verified relate the estimate (Equation 3) with the true ECE (Equation 2).

Theorem 2 (ECE bound). There exist $K$-ary classifiers $f$ and variables $V$ such that the classifier has $\text{VECE} = 0$ and $\text{ECE}$ arbitrarily close to $\frac{K-1}{2K}$.

Again, we prove this by construction, where $f$ is well-calibrated with respect to a variable $V$, but its low scores are very underconfident and its high scores are very overconfident.

The results above illustrate that the ECE and VECE measures can be very different for the same model $f$. In our experimental results we will also show that it is not uncommon (particularly for uncalibrated models) for ECE and VECE to be equal. To understand the case of equality, we first define the notion of consistent over- or under-confidence with respect to a variable:

Definition 1 (Consistent overconfidence). Let $f$ be a classifier with scores $s(x)$. For a variable $V$ taking values $v$, $f$ is consistently overconfident if $S(v) > A(v)$ for all $v$, i.e., the expected value of the model’s scores as a function of $v$ is always greater than the true accuracy as a function of $v$.

Consistent underconfidence can be defined analogously, using $S(v) < A(v)$. In the special case where the variable is defined as the score itself, the condition becomes $s > A(s)$ for all $s$, leading to consistent overconfidence with respect to the scores.

For the case of consistent over- or under-confidence for a model $f$, we have the following result:

Theorem 3 (Equality conditions of ECE and VECE). Let $f$ be a classifier that is consistently under- or over-confident with respect both to its score $s$ and to a variable $V$. Then the ECE and VECE of $f$ are equal.

The results above provide insight into the relationship between ECE and VECE. Specifically, if the miscalibration is “one-sided” (i.e., consistently over- or under-confident for both the score $s$ and a variable $V$), then ECE and VECE will be in agreement. However, when the classifier is both over- and under-confident (as a function of either $s$ or $V$), then ECE and VECE can differ significantly and, as a result, ECE can mask significant systematic miscalibration with respect to variables of interest.

6 Mitigation of Variable-Based Miscalibration

6.1 Diagnosis of Variable-Based Miscalibration

In order to better detect and characterize per-variable miscalibration, we discuss below variable-based calibration plots, which we have found useful in practice. Figure 1(b) shows an example of a variable-based calibration plot for age, and in Section 7 we explore how these plots can be used to characterize miscalibration across different classifiers, datasets, and variables of interest.

For ease of interpretation, in the results below we focus on the model’s error rate and predicted error, rather than accuracy and confidence, although the two views are equivalent. Particularly for models with high accuracy, we find that it is more intuitive to discuss differences in error rate than in accuracy.

To generate these plots, we first compute the individual error (an indicator of whether the prediction is incorrect) and the predicted error $1 - s(x)$ for each observation. We then construct nonparametric error curves with LOESS (further details are available in Appendix B). This approach allows us to obtain 95% confidence bands for the error rate and mean predicted error, based on standard error, thus putting the differences between the curves into perspective.
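As a sketch of how such a plot can be produced (our own code, not the authors’; the paper uses a locally quadratic LOESS fit with confidence bands, whereas statsmodels’ lowess is locally linear and we omit the bands for brevity):

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

def variable_calibration_plot(v, scores, correct, frac=0.85):
    """Plot LOESS-smoothed actual vs. model-predicted error as a function of v."""
    err = 1.0 - np.asarray(correct, dtype=float)      # per-point error indicator
    pred_err = 1.0 - np.asarray(scores, dtype=float)  # per-point predicted error
    smoothed_err = lowess(err, v, frac=frac)          # columns: (sorted v, fit)
    smoothed_pred = lowess(pred_err, v, frac=frac)
    plt.plot(smoothed_err[:, 0], smoothed_err[:, 1], label="Error rate")
    plt.plot(smoothed_pred[:, 0], smoothed_pred[:, 1], label="Predicted error")
    plt.xlabel("v")
    plt.ylabel("Error")
    plt.legend()
    plt.show()
```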

Beyond visualization, we can use VECE scores to discover which variables in a dataset exhibit the most systematic variable-based miscalibration. Ranking features in order of decreasing VECE can highlight variables that may be worth investigating. An example of such a ranking for the UCI Adult Income dataset, based on a neural network with post-hoc beta recalibration (kull2017), is shown in Table 1. The years of education and age variables rank highest in VECE, so a model developer or a user of the model might find it useful to generate a variable-based calibration plot for each of these. The weekly work hours and census weight variables are of lesser concern, but could also be explored. We will perform an in-depth investigation of miscalibration with respect to the variable Age in Section 7.1.

VECE VCE(v*)
Years of education 9.95% 20.13%
Age 9.59% 23.44%
Weekly work hours 7.94% 18.21%
Census weight 5.06% 12.08%
Table 1: Variable-based calibration error (VECE) and worst-case calibration error (VCE(v*)) for Adult Income dataset features

It is also possible to define the maximum value of VCE($v$), i.e., the worst-case calibration error, as well as the value $v^*$ that incurs this worst-case error:

$$\text{VCE}(v^*) = \max_v \text{VCE}(v), \qquad v^* = \arg\max_v \text{VCE}(v)$$

Estimating either $v^*$ or VCE($v^*$) accurately may be difficult in practice, particularly for small sample sizes $n$, since it involves the non-parametric estimation (bowman1996graphical) of the difference of two curves as a function of $v$, as the shapes of the curves need not follow any convenient parametric form (e.g., see Figure 1(b)). One simple estimation strategy is to smooth both curves with LOESS and compute the maximum difference between the two estimated curves. Using this LOESS-based approach, worst-case calibration errors for the Adult Income model are also shown in Table 1.
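A minimal sketch of this strategy (ours), evaluating both LOESS curves on a shared grid (the xvals argument assumes a recent version of statsmodels):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def worst_case_vce(v, scores, correct, frac=0.85, grid_size=200):
    """Estimate max_v VCE(v) and the v* attaining it from LOESS-smoothed curves."""
    grid = np.linspace(np.min(v), np.max(v), grid_size)
    acc_curve = lowess(np.asarray(correct, dtype=float), v, frac=frac, xvals=grid)
    conf_curve = lowess(np.asarray(scores, dtype=float), v, frac=frac, xvals=grid)
    gap = np.abs(acc_curve - conf_curve)   # pointwise |A(v) - S(v)| estimate
    i = int(np.argmax(gap))
    return gap[i], grid[i]                 # (estimated VCE(v*), v*)
```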

6.2 Recalibration Methods

We found empirically, across multiple datasets, that standard score-based recalibration techniques often reduce ECE while neglecting variable-based systematic miscalibration. Because calibration error can vary as a function of a feature of interest $V$, we propose incorporating information about $V$ during recalibration. In particular, we introduce the concept of variable-based recalibration, a family of recalibration methods that adjust confidence scores with respect to some variable of interest $V$. As an illustrative example, we perform experiments in Section 7 with a modification of probability calibration trees (leatharttrees). This technique involves performing logistic calibration separately for data splits defined by decision trees trained over the input space. We alter the method to train the decision trees with only $V$ as input, with a minimum leaf size of one-tenth of the total calibration set size. We then perform beta calibration at each leaf (kull2017), as we found that it performs empirically better than logistic calibration. In the multi-class case, we use Dirichlet calibration, an extension of beta calibration for $K$-class classification (kull2019). Our use of split-based recalibration using decision trees is intended to provide a straightforward illustration of the potential benefits of variable-based calibration, rather than to provide a state-of-the-art methodology that can balance ECE and VECE (which we leave to future work). We also investigated variable-based recalibration methods that operate continuously over $V$ (rather than on separate data splits) using extensions of logistic and beta calibration, but found that these were not as reliable in our experiments as the tree-based approach (see Appendix C for details); a sketch of the tree-based procedure follows.
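A minimal sketch of the tree-based procedure for the binary case (our own simplified implementation: for brevity we fit each per-leaf beta map as an unconstrained logistic regression on $\ln s$ and $-\ln(1-s)$, without the sign constraints of full beta calibration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

def _beta_features(s, eps=1e-6):
    s = np.clip(s, eps, 1 - eps)
    return np.column_stack([np.log(s), -np.log(1.0 - s)])

def fit_tree_based_recalibrator(v_cal, s_cal, y_cal):
    """Split the calibration set by a shallow tree on v; calibrate each leaf."""
    v_cal = np.asarray(v_cal).reshape(-1, 1)
    tree = DecisionTreeClassifier(max_depth=2,
                                  min_samples_leaf=len(v_cal) // 10)
    tree.fit(v_cal, y_cal)                 # tree predicts the label from v alone
    leaf_ids = tree.apply(v_cal)
    leaf_models = {}
    for leaf in np.unique(leaf_ids):
        m = leaf_ids == leaf
        leaf_models[leaf] = LogisticRegression().fit(
            _beta_features(np.asarray(s_cal)[m]), np.asarray(y_cal)[m])

    def recalibrate(v, s):
        """Map raw scores s to recalibrated probabilities, split by leaf of v."""
        ids = tree.apply(np.asarray(v).reshape(-1, 1))
        s = np.asarray(s, dtype=float)
        out = np.empty_like(s)
        for leaf, model in leaf_models.items():
            m = ids == leaf
            if m.any():
                out[m] = model.predict_proba(_beta_features(s[m]))[:, 1]
        return out

    return recalibrate
```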

7 Variable-Based Miscalibration in Practice

In this section, we explore several examples where the ECE obscures systematic miscalibration relative to some variable of interest, particularly after score-based recalibration. In our experiments we use four datasets that span tabular, text, and image data. For each dataset and variable of interest $V$, we investigate both (1) several score-based calibration methods and (2) our variable-based recalibration (the tree-based method described in Section 6.2), comparing the resulting ECE, VECE, and variable-based calibration plots. In particular, we calibrate with scaling-binning (kumar2019verified), Platt scaling (plattprobabilistic), beta calibration (kull2017), and, for the multi-class case, Dirichlet calibration (kull2019). The datasets are split into training, calibration, and test sets. Each calibration method is trained on the same calibration set, and all metrics and figures are produced from the final test set. The ECE and VECE are computed with an equal-support binning scheme. Further details regarding datasets, models, and calibration are in Appendix B.

(a) Uncalibrated
(b) Beta-calibrated
(c) Variable-based calibrated
Figure 2: Variable-based calibration plots for the Adult Income model for Age

7.1 Adult Census Records: Predicting Income

The UCI Adult Income dataset (https://archive.ics.uci.edu/ml/datasets/adult) consists of 1994 Census records, where the goal is to predict whether an individual’s annual income is greater than $50,000. We model this data with a simple feed-forward neural network and evaluate the model’s calibration error with respect to age (i.e., let $V$ = age). Uncalibrated, this model has an ECE and VECE of 20.67% (see Table 2). The ECE and VECE are equal precisely because of the model’s consistent overconfidence as a function of both the confidence score $s$ and age (see Definition 1). The overconfidence with respect to age is reflected in the variable-based calibration plot (Figure 2(a)). The model’s error rate varies significantly as a function of age, with very high error for individuals around age 50, and much lower error for younger and older people. However, its confidence remains nearly constant at close to 100% (i.e., a predicted error close to 0%) across all ages.

ECE VECE
Uncalibrated 20.67% 20.67%
Scaling-binning 2.27% 9.25%
Platt scaling 4.57% 10.13%
Beta calibration 1.65% 9.59%
Variable-based calibration 1.64% 2.11%
Table 2: Adult Income model calibration error

After recalibrating, the ECE is dramatically reduced, with beta calibration achieving an ECE of 1.65%. However, the corresponding VECE is still very high (over 9%). As shown in Figure 2(b), the model’s self-predicted error has increased substantially, but remains nearly constant as a function of age. Thus, despite a significant improvement in ECE, this recalibrated model still harbors unfairness with respect to age, exhibiting overconfidence in its predictions for individuals in the 35-65 age range, and underconfidence for those outside of it. As the model is no longer consistently overconfident, the ECE and VECE diverge, as predicted theoretically.

Variable-based calibration obtains a significantly lower VECE of 2.11%, while simultaneously reducing the ECE. This improvement in VECE is reflected in Figure 2(c): the model’s predicted error now varies with age to match the true error rate. In this case, simple variable-based recalibration improves the age-wise systematic miscalibration of the model, without detriment to the overall calibration error.

(a) Uncalibrated
(b) Beta-calibrated
(c) Variable-based calibrated
Figure 3: Variable-based calibration plots for the Yelp model for Review Length

7.2 Yelp Reviews: Predicting Sentiment

To explore variable-based calibration in an NLP context, we use a fine-tuned large language model, BERT (devlin2018), on the Yelp review dataset (https://www.yelp.com/dataset). The model predicts whether a review has a positive or negative rating based on its text. In this case there are no easily-interpretable features directly input to the model. Instead, to better diagnose model behavior, we can analyze real-valued characteristics of the text, such as the length of each review or part-of-speech statistics. Here we focus on review length in characters.

Figure 3(a) shows the model’s error and predicted error with respect to review length. The error rate is lowest for reviews of around 300-700 characters, near the median review length. Very short and very long reviews are associated with a higher error rate. This model is consistently overconfident, with an uncalibrated ECE and VECE of 1.93% (see Table 3).

ECE VECE
Uncalibrated 1.93% 1.93%
Scaling-binning 4.23% 4.23%
Platt scaling 3.04% 0.64%
Beta calibration 1.73% 0.37%
Variable-based calibration 1.70% 0.23%
Table 3: Yelp model calibration error

The ECE and VECE diverge after beta calibration, which obtains the lowest ECE of 1.73% and a substantially reduced VECE of 0.37%. Figure 3(b) reflects this: the model’s predicted error aligns more closely with its actual error rate, although it is still notably overconfident for very short reviews.

Variable-based recalibration reduces the VECE slightly further, while yielding a small improvement to the overall ECE. After variable-based calibration, the predicted error curve matches the true relationship between review length and true error rate more faithfully, reducing overconfidence for short reviews (Figure 3(c)).

(a) Uncalibrated
(b) Calibrated with Platt scaling
(c) Variable-based calibrated
Figure 4: Variable-based calibration plots for the Bank Marketing model for Age

7.3 Bank Marketing: Predicting Subscriptions

We also investigate miscalibration on a simple neural network modeling the UCI Bank Marketing dataset (https://archive.ics.uci.edu/ml/datasets/bank+marketing). The model predicts whether a bank customer will subscribe to a bank term deposit as a result of direct marketing. Uncalibrated, the model is overconfident, with ECE and VECE over 4.5% (see Table 4). Consider the calibration error with respect to customer age, both before (Figure 4(a)) and after (Figure 4(b)) recalibration. The best-performing recalibration technique, Platt scaling, uniformly increases the predicted error across age, reducing both ECE and VECE, but resulting in underconfidence for most ages and overconfidence at the edges of the distribution.

ECE VECE
Uncalibrated 4.69% 4.69%
Scaling-binning 4.37% 3.39%
Platt scaling 2.38% 2.83%
Beta calibration 2.48% 2.77%
Variable-based calibration 2.10% 0.52%
Table 4: Bank Marketing model calibration error

Variable-based recalibration achieves competitive ECE, while reducing VECE to about half of one percent. Figure 4(c) reflects this improvement. The predicted error after variable-based recalibration matches the true error rate more closely, reducing the miscalibration with respect to customer age.

(a) Uncalibrated
(b) Dirichlet calibrated
(c) Variable-based calibrated
Figure 5: Variable-based calibration plots for the CIFAR-10H model for Median Reaction Time

7.4 CIFAR-10H: Image Classification

As a multi-class example, we investigate variable-based miscalibration on CIFAR-10H, a 10-class image dataset including labels and reaction times from human annotators (peterson2019human). We use a standard deep learning image classification architecture (a DenseNet model) to predict the image category, and investigate median annotator reaction time, which is metadata that is not provided to the model. Instead of Platt scaling and beta calibration, here we use Dirichlet calibration (to accommodate the multiple classes).

Here, Dirichlet calibration achieves the lowest overall ECE and variable-based calibration obtains the lowest VECE (see Table 5). The variable-based calibration plots are shown in Figure 5. We see that variable-based recalibration reduces underconfidence for examples with low median reaction times (where the majority of data points lie).

ECE VECE
Uncalibrated 1.90% 1.92%
Scaling-binning 3.83% 3.60%
Dirichlet calibration 0.80% 1.12%
Variable-based calibration 1.31% 0.60%
Table 5: CIFAR-10H model calibration error

Summary of Experimental Results

First, our results demonstrate the potential of variable-based recalibration. While score-based recalibration techniques generally improved the ECE, variable-based recalibration performed better across datasets in terms of simultaneously reducing both the ECE and VECE, without any significant increase in model error rate or the VECE for other variables (details in Appendix B). The results also illustrate that variable-based calibration plots enable meaningful characterization of the relationships between variables of interest and predicted/true error, providing more detailed insight into a model’s performance than a single number (i.e., ECE or VECE).

8 Discussion and Conclusions

Discussion of Limitations

There are several potential limitations of this work. First, we focused on the mitigation of miscalibration for one variable at a time. Although we did not observe recalibration with respect to one variable worsening the VECE for another variable, this behavior has not been analyzed theoretically. Further, a more thorough investigation of miscalibration and recalibration across intersections of variables is still warranted. We also emphasize that the variable-based calibration technique used in the paper is primarily for illustration; the development of new methods for simultaneously reducing score-based and variable-based miscalibration is a useful direction for future work.

Conclusions

To summarize, in this paper, we demonstrated theoretically and empirically that ECE can obscure significant miscalibration with respect to variables of potential importance to a developer or user of a classification model. To better detect and characterize this type of miscalibration, we introduced the VECE measure and corresponding variable-based calibration plots, and characterized the theoretical relationship between VECE and ECE. In a case study across several datasets and models, we showed that VECE, variable-based calibration plots, and variable-based recalibration are all useful tools for understanding and mitigating miscalibration on a per-variable level. Looking forward, to mitigate biases in calibration error, we recommend moving beyond purely score-based calibration analysis. In addition to promoting fairness, these techniques offer new insight into model behavior and provide actionable avenues for improvement.


References

Appendix A Proofs for Section 5

Theorem 1 (VECE bound). There exist $K$-ary classifiers $f$ and variables $V$ such that the classifier has both $\text{ECE} = 0$ and variable-based $\text{VECE} = \frac{K-1}{2K}$.

Proof. Let $V$ be a continuous variable with density $p(v)$. Recall that $\text{VCE}(v) = |A(v) - S(v)|$, where $A(v)$ is the accuracy of model $f$ as a function of $v$, and the score $s(x)$ is the probability that the model assigns to its label prediction $\hat{y}$.

The reliability diagram for a $K$-ary classifier has scores $s \in [1/K, 1]$, where the leftmost value for this interval is a result of the fact that the score is defined as the maximum of the $K$ class probabilities. Let $c = \frac{1}{2}\left(\frac{1}{K} + 1\right)$ be the midpoint of this interval.

Assume that the scores $s$ have a uniform distribution of the form $s \sim U(c - \epsilon, c + \epsilon)$, where $\epsilon$ is some constant with $0 \leq \epsilon \leq \frac{K-1}{2K}$, and that the scores and the variable $V$ are independent.

Further assume that the accuracy of the model depends on $v$ and a threshold $m$ in the following manner:

$$A(v) = \begin{cases} 1 & \text{if } v \geq m \\ 1/K & \text{if } v < m \end{cases}$$

where $m$ is defined such that $P(V < m) = 1/2$.

The marginal accuracy as a function of the score (marginalizing over $v$) can be written as

$$A(s) = \frac{1}{2} \cdot 1 + \frac{1}{2} \cdot \frac{1}{K} = c.$$

The marginal accuracy as a function of $v$ (marginalizing over $s$) is $A(v)$ as defined above.

This setup is designed so that the score is close to the accuracy as a function of $s$ (to minimize ECE), but the variable-based expected scores are relatively far away from accuracy as a function of $v$.

Under these assumptions we can write the ECE as

$$\text{ECE} = E_s\big[|A(s) - s|\big] = E_s\big[|c - s|\big] = \frac{\epsilon}{2} \qquad (8)$$

We can write the VECE as

$$\text{VECE} = E_v\big[|A(v) - S(v)|\big] = \frac{1}{2}\left|1 - c\right| + \frac{1}{2}\left|\frac{1}{K} - c\right| = \frac{K-1}{2K} \qquad (9)$$

Thus, as $\epsilon \to 0$, $\text{ECE} \to 0$ and $\text{VECE} \to \frac{K-1}{2K}$ (taking $\epsilon = 0$ gives $\text{ECE} = 0$ exactly).

Theorem 2 (ECE bound). There exist $K$-ary classifiers $f$ and variables $V$ such that the classifier has $\text{VECE} = 0$ and $\text{ECE}$ arbitrarily close to $\frac{K-1}{2K}$.

Proof. Let $V$ be a continuous variable with density $p(v)$. Recall that a $K$-ary classifier has scores $s \in [1/K, 1]$, where we let $c = \frac{1}{2}\left(\frac{1}{K} + 1\right)$ be the midpoint of this interval. Assume that $f$ produces scores from two uniform distributions, with equal probability: $s \sim U\left(\frac{1}{K}, \frac{1}{K} + \epsilon\right)$ and $s \sim U(1 - \epsilon, 1)$, where $\epsilon$ is some constant with $0 < \epsilon \leq \frac{K-1}{2K}$, and that the scores and the variable $V$ are independent. Finally, suppose the accuracy of the model is independent of $s$ and $V$, with constant value $c$.

Under these assumptions, $S(v) = E[s] = c$ for all $v$ by independence, so we can write the VECE as

$$\text{VECE} = E_v\big[|A(v) - S(v)|\big] = |c - c| = 0 \qquad (10)$$

We can write the ECE as

$$\text{ECE} = E_s\big[|A(s) - s|\big] = E_s\big[|c - s|\big] = \frac{K-1}{2K} - \frac{\epsilon}{2} \qquad (11)$$

Thus, as $\epsilon \to 0$, $\text{VECE} = 0$ and $\text{ECE} \to \frac{K-1}{2K}$.

Definition 1 (Consistent overconfidence). Let $f$ be a classifier with scores $s(x)$. For a variable $V$ taking values $v$, $f$ is consistently overconfident if $S(v) > A(v)$ for all $v$, i.e., the expected value of the model’s scores as a function of $v$ is always greater than the true accuracy as a function of $v$. Consistent underconfidence is defined analogously with $S(v) < A(v)$. In the special case where the variable is defined as the score itself, we have $s > A(s)$ for all $s$, etc.

Theorem 3 (Equality conditions for ECE and VECE). Let $f$ be a classifier that is consistently under- or over-confident with respect both to its score $s$ and to a variable $V$. Then the ECE and VECE of $f$ are equal.

Proof. Without loss of generality, suppose $f$ is consistently underconfident with respect to its scores $s$ and to $V$.

Then we have, by consistent underconfidence:

$$\text{ECE} = E_s\big[A(s) - s\big] = E_s\big[A(s)\big] - E[s], \qquad \text{VECE} = E_v\big[A(v) - S(v)\big] = E_v\big[A(v)\big] - E[s] \qquad (12)$$

By the law of total probability,

$$E_s\big[A(s)\big] = P(\hat{y} = y) = E_v\big[A(v)\big] \qquad (13)$$

So $\text{ECE} = P(\hat{y} = y) - E[s] = \text{VECE}$.

Appendix B Calibration, Model, and Dataset Details

Here, we include additional information and plots for each dataset and model discussed in Section 7. Code for reproducing all tables and plots is available online at https://github.com/markellekelly/variable-wise-calibration.

On each dataset, we test several existing recalibration techniques: Platt scaling, scaling-binning, beta calibration, and (for the multi-class case) Dirichlet calibration. For scaling-binning, we calibrate over 10 bins, and for Dirichlet calibration, we use a lambda value of 1e-3, values chosen based on the respective authors’ provided examples. Here and in Section 7, we present the uncalibrated and variable-based calibrated output, along with the best-performing score-based calibration method (for the Adult and Yelp datasets, beta calibration; for Bank Marketing, Platt scaling; for CIFAR, Dirichlet calibration).

Our variable-based recalibration is performed as follows. Given the calibration set, a decision tree classifier is trained to predict the outcome with $V$ (the single variable of interest) as its only input. We use a maximum depth of two and a minimum leaf size of one-tenth of the size of the calibration set. The calibration set is then split according to the leaf nodes of the trained decision tree, and separately the rest of the dataset is split according to the same rules. Standard beta calibration is then performed separately for each split, using the subset of the original calibration set as the new calibration set, and computing recalibrated probabilities for the corresponding subset of the original dataset.

Variable-based calibration plots are created with LOESS, with quadratic local fit and an assumed symmetric distribution of the errors, with empirically-chosen smoothing factors between 0.8 and 0.9.

We note the VECE for each numeric variable in each dataset before and after the recalibration described. We find in general empirically that variable-based calibration with respect to one variable is not detrimental to the VECE of other variables.

Finally, we observe that variable-based recalibration does not tend to significantly degrade accuracy. Accuracies for each dataset before and after recalibration are shown in Table 6.

Adult Income Yelp Bank Marketing CIFAR
Uncalibrated 79.1% 98.0% 88.9% 97.2%
Score-based calibrated 79.1% 98.0% 88.7% 96.9%
Variable-based calibrated 79.1% 98.0% 88.7% 96.0%
Table 6: Accuracies for all four datasets before calibration, after the best-performing score-based calibration (as reported in the main paper and below), and after variable-based calibration.

b.1 Adult Income

The Adult Income dataset was modeled with a multi-layer perceptron, with two hidden layers of sizes 100 and 75. Of the 48,842 observations, 32,561 were used for training, 2,500 were used for calibration, and 13,781 were used for testing. The dataset includes six continuous variables: age, fnlwgt (the estimated number of people an individual represents), education-num (a number representing the individual’s years of education), capital-gain, capital-loss, and hours-per-week (the number of hours per week that an individual works).

Based on the beta-calibrated model, education-num and age rank the highest in VECE, as shown in Section 6. For all six variables, VECE is reduced by performing recalibration with respect to age:

Uncalibrated Beta-calibrated Variable-based calibrated
education-num 20.67% 9.95% 8.53%
age 20.67% 9.59% 2.11%
hours-per-week 20.67% 7.94% 6.02%
fnlwgt 20.67% 5.06% 4.10%
capital-gain 20.67% 1.50% 1.39%
capital-loss 20.67% 1.50% 1.39%
Table 7: VECE for numeric variables in the Adult Income dataset: uncalibrated, beta-calibrated, and after variable-based recalibration with respect to age.

Uncalibrated, the model’s ECE and VECE are 20.67%. Of the score-based calibration methods tested, beta calibration achieves the lowest ECE of 1.65%. Relevant reliability diagrams and variable-based calibration plots for the uncalibrated, beta-calibrated, and variable-based-calibrated models are shown in Figure 6.

Figure 6: Reliability diagrams and variable-based calibration plots for the Adult Income model, uncalibrated (top), beta-calibrated (middle), and variable-based calibrated (bottom)

b.2 Yelp

The Yelp dataset was modeled with a fine-tuned BERT model. 100,000 observations were randomly sampled from the full Yelp dataset. Of these, 70,500 were used for training, 10,000 were used for calibration, and 19,500 were used for testing. Several continuous features were generated from the raw text reviews, including length in characters, number of special characters, and proportions of each part of speech. Based on the beta-calibrated model, review length ranked highest in VECE, followed by proportion of stop words, as shown in Table 8.

Uncalibrated Beta-calibrated Variable-based calibrated
Length (characters) 1.93% 0.37% 0.23%
Stop-word Proportion 1.93% 0.29% 0.28%
Named Entity Count 1.93% 0.21% 0.22%
Table 8: VECE for numeric variables in the Yelp dataset: uncalibrated, beta-calibrated, and after variable-based recalibration with respect to length in characters.

Uncalibrated, the model’s ECE and VECE are 1.93%. Of the score-based calibration methods tested, beta calibration achieves the lowest ECE of 1.73%. Relevant reliability diagrams and variable-based calibration plots for the uncalibrated, beta-calibrated, and variable-based-calibrated models are shown in Figure 7.

Figure 7: Reliability diagrams and variable-based calibration plots for the Yelp model, uncalibrated (top), beta-calibrated (middle), and variable-based calibrated (bottom)

b.3 Bank Marketing

The Bank Marketing dataset was modeled with a multi-layer perceptron, with two hidden layers of sizes 100 and 75. Of the 45,211 total observations, 31,647 were used for training, 1,000 were used for calibration, and 12,564 were used for testing. Based on the model calibrated with Platt scaling, account balance ranked highest in VECE, followed by age, as shown in Table 9.

Uncalibrated Calibrated with Platt scaling Variable-based calibrated
Account balance 5.35% 4.17% 3.22%
Age 4.69% 2.83% 0.52%
Table 9: VECE for numeric variables in the Bank Marketing dataset: uncalibrated, calibrated with Platt scaling, and after variable-based recalibration with respect to age.

Uncalibrated, the model’s ECE is 4.69%. Of the score-based calibration methods tested, Platt scaling achieves the lowest ECE of 2.38%. Relevant reliability diagrams and variable-based calibration plots for the uncalibrated, calibrated with Platt scaling, and variable-based-calibrated models are shown in Figure 8.

Figure 8: Reliability diagrams and variable-based calibration plots for the Bank Marketing model, uncalibrated (top), calibrated with Platt scaling (middle), and variable-based calibrated (bottom)

b.4 Cifar-10h

The CIFAR-10H dataset was modeled with a DenseNet model. Of the 10,000 total observations, 4,057 were used for training, 2,000 were used for calibration, and 3,943 were used for testing.

Uncalibrated, the model’s ECE and VECE are 1.90% and 1.92%, respectively. Of the score-based calibration methods tested, Dirichlet calibration achieves the lowest ECE of 0.80%. Relevant reliability diagrams and variable-based calibration plots for the uncalibrated, Dirichlet-calibrated, and variable-based-calibrated models are shown in Figure 9.

Figure 9: Reliability diagrams and variable-based calibration plots for the CIFAR-10H model, uncalibrated (top), Dirichlet-calibrated (middle), and variable-based calibrated (bottom)

Appendix C Alternate Recalibration Methods

As an alternative variable-based calibration method, we extend logistic and beta calibration, which operate continuously over the score, to incorporate information regarding $V$. In particular, logistic calibration learns a mapping of scores $s \mapsto \hat{s}$, with parameters $a$ and $b$ learned via logistic regression:

$$\hat{s} = \frac{1}{1 + e^{-(a s + b)}}$$

This can be augmented to include $V$ by simply training the logistic regression on both $s$ and $v$, learning the following mapping:

$$\hat{s} = \frac{1}{1 + e^{-(a s + b + c v)}}$$

where $c$ is the logistic regression coefficient corresponding to $V$.

Similarly, beta calibration learns the following mapping, where the parameters $a$, $b$, and $c$ are learned by training a logistic regression on $\ln s$ and $-\ln(1 - s)$ (see kull2017 for more details):

$$\hat{s} = \frac{1}{1 + e^{-(a \ln s - b \ln(1 - s) + c)}}$$

This can also be augmented with $V$, including it as a third input to the regression:

$$\hat{s} = \frac{1}{1 + e^{-(a \ln s - b \ln(1 - s) + c + d v)}}$$

where $d$ is the coefficient corresponding to $V$.

In contrast to the tree-based method detailed in the main paper, which splits the data along $V$ and then separately calibrates each set, these methods learn one calibration mapping for the entire dataset. Empirically, we find that augmented beta calibration is a promising approach, simultaneously reducing ECE and VECE, although some attention must be paid to the fit of the logistic regression (e.g., by including a quadratic term). However, in our experiments, this technique ultimately was not as reliable as tree-based calibration (perhaps because the functional form of beta calibration is not flexible enough to always be able to correct systematic miscalibration as a function of $V$).
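A minimal sketch of augmented beta calibration under the same simplification as before (our own code: an unconstrained logistic regression, with the optional quadratic term in $v$ mentioned above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_augmented_beta(s_cal, v_cal, y_cal, quadratic=False, eps=1e-6):
    """Logistic regression on ln(s), -ln(1-s), and v (optionally v^2)."""
    def features(s, v):
        s = np.clip(np.asarray(s, dtype=float), eps, 1 - eps)
        v = np.asarray(v, dtype=float)
        cols = [np.log(s), -np.log(1.0 - s), v]
        if quadratic:          # quadratic term in v, for a better fit
            cols.append(v ** 2)
        return np.column_stack(cols)

    model = LogisticRegression().fit(features(s_cal, v_cal), y_cal)

    def recalibrate(s, v):
        """Map raw scores to calibrated probabilities of the positive class."""
        return model.predict_proba(features(s, v))[:, 1]

    return recalibrate
```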

Here, we include the results of augmented-beta variable-based (VB) recalibration on the Adult Income, Yelp, and Bank Marketing datasets. The models for the Adult and Bank Marketing datasets include a quadratic term for $V$, which obtained a better fit. (Note that this formulation only applies to binary classification, so we do not include results for the CIFAR dataset here.)

ECE VECE
Uncalibrated 20.67% 20.67%
Beta calibration 1.65% 9.59%
Tree-based VB calibration 1.64% 2.11%
Augmented-beta VB calibration 1.49% 1.87%
Table 10: Adult Income model calibration error
(a) Uncalibrated
(b) Beta-calibrated
(c) Augmented-beta-calibrated
Figure 10: Variable-based calibration plots for the Adult Income model for Age
ECE VECE
Uncalibrated 1.93% 1.93%
Beta calibration 1.73% 0.37%
Tree-based VB calibration 1.70% 0.23%
Augmented-beta VB calibration 1.73% 0.37%
Table 11: Yelp model calibration error
(a) Uncalibrated
(b) Beta-calibrated
(c) Augmented-beta-calibrated
Figure 11: Variable-based calibration plots for the Yelp model for Review Length
ECE VECE
Uncalibrated 4.69% 4.69%
Platt scaling 2.38% 2.83%
Tree-based VB calibration 2.10% 0.52%
Augmented-beta VB calibration 2.09% 1.13%
Table 12: Bank Marketing model calibration error
(a) Uncalibrated
(b) Calibrated with Platt scaling
(c) Augmented-beta-calibrated
Figure 12: Variable-based calibration plots for the Bank Marketing model for Age