1 Introduction
Predictive models built by machine learning algorithms are increasingly informing decisions across high-stakes applications such as medicine (rajkomar2019machine), employment (chalfin2016productivity), and criminal justice (zavrsnik2021). There is also broad recent interest in developing systems where humans and machine learning models collaborate to make predictions and decisions (kleinberg2018human; bansal2021most; de2021classification; steyvers2022bayesian). A critical aspect of using model predictions in such contexts is calibration. In particular, in order to trust the predictions from a machine learning classifier, these predictions must be accompanied by well-calibrated confidence scores.
In practice, however, it has been well-documented that machine learning classifiers such as deep neural networks tend to produce poorly-calibrated class probabilities (guo2017calibration; vaicenavicius2019evaluating; ovadia2019). As a result, a variety of recalibration techniques have been developed, which aim to ensure that a model's confidence (or score) matches its true accuracy. A widely used approach is post-hoc recalibration: methods which use a separate labeled dataset to learn a mapping from the original model's class probabilities to calibrated probabilities, often with a relatively simple one-dimensional mapping (e.g., plattprobabilistic; kull2017; kumar2019verified). These methods have been shown to generally improve the empirical calibration error of the model, as commonly measured by the expected calibration error (ECE).

However, as we show in this paper, aggregate measures of score-based calibration error such as ECE can hide significant systematic miscalibration in other dimensions of a model's performance. To address this issue we introduce the notion of variable-based calibration to better understand how the calibration error of a model can vary as a function of a variable of interest. The variable of interest can be an input variable to the model or some other metadata variable; we focus in particular in this paper on real-valued variables. For example, in prediction problems involving individuals (e.g., credit-scoring or medical diagnosis) one such variable would be age. Detecting systematic miscalibration is important for problems such as assessing the fairness of a model, for instance detecting that a model is significantly overconfident for some age ranges and underconfident for others.
As an illustrative example, consider a simple classifier trained to predict the presence of cardiovascular disease (https://www.kaggle.com/sulianova/cardiovascular-disease-dataset). After the application of standard post-hoc calibration, this model attains a relatively low ECE of 0.74%. This low ECE is reflected in the reliability diagram shown in Figure 1(a), which shows near-perfect alignment with the diagonal. If a user of this model were only to consider aggregate metrics such as ECE, they might reasonably conclude that the model is generally well-calibrated. However, evaluating model error and predicted error with respect to the variable Patient Age reveals an undesirable and systematic miscalibration pattern with respect to this variable, as illustrated in Figure 1(b): the model is underconfident by upwards of five percentage points for younger patients, and is significantly overconfident for older patients.
Figure 1: Calibration plots for a neural network predicting cardiovascular disease, after recalibration with Platt scaling: (a) reliability diagram, (b) LOESS-smoothed estimates with confidence intervals of actual and model-predicted error as a function of patient age. This dataset consists of 70,000 records of patient data (49,000 train, 6,000 validation, 15,000 test), with a binary prediction task of determining the presence of cardiovascular disease.
In this paper, we systematically investigate variablebased calibration for classification models, from both theoretical and empirical perspectives. In particular, our contributions are as follows:

We introduce the notion of variable-based calibration and define a per-variable calibration metric (VECE).

We characterize theoretically the relationship between variable-based miscalibration measured via VECE and traditional score-based miscalibration measured via ECE.

We demonstrate, across multiple well-known tabular, text, and image datasets and a variety of models, that significant variable-based miscalibration can exist in practice, even after the application of standard score-based recalibration methods.

We investigate variable-based recalibration methods and demonstrate empirically that these methods can simultaneously reduce both ECE and VECE. (All of our code is available online at https://github.com/markellekelly/variable-wise-calibration.)
2 Related Work
Visualizing Model Performance by Variable:
A number of techniques have been developed for visual understanding and diagnosis of model performance with respect to a particular variable of interest. One such technique is partial dependence plots (friedman2001greedy; molnar2020interpretable), which visualize the effect of an input feature of interest on model predictions. Another approach is dashboards such as FairVis (cabrera2019fairvis), which enable the exploration of model performance (e.g., accuracy, false positive rate) across various data subgroups. However, none of this prior work has investigated the visualization of per-variable calibration properties of a model, i.e., how a model's own predictions of accuracy (or error) vary as a function of a particular variable.
Quantifying Model Calibration by Variable:
Work on calibration for machine learning classifiers has largely focused on score-based calibration: reliability diagrams, the ECE, and standard recalibration methods are all defined with respect to confidence scores (murphy1977reliability; huang2020tutorial; song2021classifier). An exception to this is in the realm of fairness, where researchers have generally called for disaggregated model evaluation, e.g., computing metrics of interest individually for sensitive subpopulations (mitchell2019model; raji2020closing). To this end, several notions of calibration that move beyond standard aggregate measures have been introduced: johnsonmasses check calibration across all identifiable subpopulations of the data, pan2020field evaluate calibration over data subsets corresponding to a categorical variable of interest, and luolocalized compute "local calibration" based on the average classification error on similar samples. This prior work does not, however, address the issue of variable-based calibration from a general perspective; in contrast, in this paper we develop a systematic framework that provides theoretical insight and enables empirical calibration estimation, visualization, and recalibration on a per-variable basis.

3 Background on Score-Based ECE
Consider a classification problem mapping inputs $x \in \mathcal{X}$ to predictions for labels $y \in \mathcal{Y} = \{1, \ldots, K\}$. Let $f$ be a black-box classifier which outputs label probabilities $f_k(x)$ for each class $k \in \{1, \ldots, K\}$. Then, for the standard 0-1 loss function, the predicted label is $\hat{y} = \arg\max_k f_k(x)$ and the corresponding confidence score is $s(x) = \max_k f_k(x)$. It is of interest to determine whether such a model is well-calibrated, that is, whether its confidence matches the true probability that a prediction is correct. For a given confidence score $s$, we define $\mathrm{acc}(s) = P(\hat{y} = y \mid s(x) = s)$. Then the calibration error (CE), as a function of the confidence score $s$, is defined as the difference between accuracy and confidence score (kumar2019verified):
$\mathrm{CE}(s) = \left| \mathrm{acc}(s) - s \right| \qquad (1)$
where $s \in [1/K, 1]$. In this paper, we will focus on the expectation of the calibration error with respect to the distribution of scores $s$, known as the ECE:
$\mathrm{ECE} = \mathbb{E}_s\left[ \left| \mathrm{acc}(s) - s \right| \right] \qquad (2)$
where an ECE of zero corresponds to "perfect" calibration. In practice, ECE is often estimated empirically on a labeled test dataset by creating $B$ bins over $s$ according to some binning scheme (e.g., guo2017calibration):
$\widehat{\mathrm{ECE}} = \sum_{b=1}^{B} \frac{n_b}{n} \left| \widehat{\mathrm{acc}}(B_b) - \widehat{\mathrm{conf}}(B_b) \right| \qquad (3)$
where $n_b$ is the number of datapoints in bin $B_b$, $n$ is the total number of datapoints, and $\widehat{\mathrm{acc}}(B_b)$ and $\widehat{\mathrm{conf}}(B_b)$ are the estimated accuracy and estimated average value of confidence $s$, respectively, in bin $B_b$.
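The binned estimator in Equation 3 can be sketched in a few lines of Python. This is a minimal illustration, using equal-width bins over $[0, 1]$ for simplicity (the experiments in this paper use an equal-support scheme), and is not the authors' released implementation.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE estimate (Equation 3): weighted average, over score bins,
    of the gap between empirical accuracy and mean confidence."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)  # 1 if prediction was right, else 0
    n = len(confidences)
    edges = np.linspace(0.0, 1.0, n_bins + 1)   # equal-width bins over the score
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # half-open bins (lo, hi]; include the left edge only for the first bin
        in_bin = (confidences > lo) & (confidences <= hi)
        if i == 0:
            in_bin |= confidences == lo
        n_b = in_bin.sum()
        if n_b == 0:
            continue
        acc_b = correct[in_bin].mean()       # estimated accuracy in bin b
        conf_b = confidences[in_bin].mean()  # estimated mean confidence in bin b
        ece += (n_b / n) * abs(acc_b - conf_b)
    return ece
```

For instance, a model whose predictions all carry confidence 0.9 but are correct only half the time has an estimated ECE of 0.4.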
4 Variable-Based Calibration Error
In many applications, we may be motivated to understand miscalibration of a classification model relative to one or more particular variables of interest. As shown in Figure 1, traditional reliability diagrams and the ECE miscalibration measure may be insufficient to fully characterize this type of variable-based miscalibration.
Consider a real-valued variable $V$ taking values $v$. In general $V$ can be a variable related to the inputs of the model, e.g., one of the input features, or another feature (e.g., metadata) defined per instance but not used in the model, or some function of the inputs $x$. To evaluate model calibration with respect to $V$, we introduce the notion of variable-based calibration error (VCE), defined pointwise as a function of $v$:
$\mathrm{VCE}(v) = \left| \mathrm{acc}(v) - \mathbb{E}[s \mid v] \right| \qquad (4)$
where $\mathrm{acc}(v)$ is the accuracy of the model, marginalizing over inputs to the model that don't involve $V$, conditioned on $V = v$. $\mathbb{E}[s \mid v]$ is the expected model score conditioned on a particular value $v$:
$\mathbb{E}[s \mid v] = \int s \, p(s \mid v) \, ds \qquad (5)$
In general, conditioning on $v$ will induce a distribution over inputs $x$, which in turn induces a distribution over scores $s(x)$ and predictions $\hat{y}$. As an example of $\mathrm{VCE}(v)$, in the context of Figure 1(b), the VCE at a given age $v$ is the absolute difference between the estimated model accuracy and the estimated expected score at that age.
The expected value of $\mathrm{VCE}(v)$, with respect to the density $p(v)$, is defined as:
$\mathrm{VECE} = \mathbb{E}_v\left[ \left| \mathrm{acc}(v) - \mathbb{E}[s \mid v] \right| \right] \qquad (6)$
Comment
Note that CE (and ECE) can be seen as a special case of VCE (and VECE), given the correspondence of Equations 1 and 2 with Equations 4 and 6 when the variable $V$ is the model score itself (i.e., $v = s(x)$). In the rest of the paper, however, we view CE and ECE as being distinct from VCE and VECE in order to highlight the differences between score-based and variable-based calibration.
As with ECE, a practical way to compute an empirical estimate of VECE is by binning, where bins are defined by some binning scheme (e.g., equal weight) over values $v$ of the variable (rather than over scores $s$):
$\widehat{\mathrm{VECE}} = \sum_{b=1}^{B} \frac{n_b}{n} \left| \widehat{\mathrm{acc}}(B_b) - \widehat{\mathrm{conf}}(B_b) \right| \qquad (7)$
Here $B_b$ is a bin corresponding to some subrange of $v$ values, $n_b$ is the number of points within this bin, and $\widehat{\mathrm{acc}}(B_b)$ and $\widehat{\mathrm{conf}}(B_b)$ are empirical estimates of the model's accuracy and the model's average confidence (average score), respectively, within bin $B_b$. For example, the $\widehat{\mathrm{ECE}}$ in Figure 1 is 0.74%, while the $\widehat{\mathrm{VECE}}$ is 2.04%.
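The binned VECE estimator of Equation 7 can be sketched analogously to the ECE estimator, binning over $v$ rather than over the score. This is an illustrative reimplementation (using equal-support bins), not the authors' released code.

```python
import numpy as np

def variable_based_ece(v, confidences, correct, n_bins=10):
    """Binned VECE estimate (Equation 7): bins are equal-support subranges
    of the variable V rather than of the confidence score."""
    v = np.asarray(v, dtype=float)
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)  # 1 if prediction was right, else 0
    n = len(v)
    order = np.argsort(v)                       # sort observations by V
    vece = 0.0
    for idx in np.array_split(order, n_bins):   # equal-count bins over V
        if len(idx) == 0:
            continue
        acc_b = correct[idx].mean()             # empirical accuracy in the bin
        conf_b = confidences[idx].mean()        # mean confidence in the bin
        vece += (len(idx) / n) * abs(acc_b - conf_b)
    return vece
```

A model can be underconfident for low values of $V$ and overconfident for high values in a way that partially cancels in the score-based ECE but not in the VECE: for example, constant confidence 0.9 with accuracy 1.0 on one half of $V$'s range and 0.0 on the other gives an ECE of 0.4 but a VECE of 0.5.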
The definitions of VCE and VECE above are in terms of a continuous variable $V$, which is our primary focus in this paper. In general, the definitions above and the theoretical results in Section 5 also apply to discrete-valued $V$, as well as to multivariate $V$.
5 Theoretical Results
In this section, we establish a number of results on the relationship between ECE and VECE. All proofs can be found in Appendix A.
First, we show that the ECE and VECE can differ by a gap of up to 50%.
Theorem (VECE bound). There exist $K$-ary classifiers $f$ and variables $V$ such that the classifier has both $\mathrm{ECE} = 0$ and variable-based calibration error $\mathrm{VECE}$ arbitrarily close to $\frac{K-1}{2K}$.
For example, in the binary case with $K = 2$, the difference between ECE and VECE can be as large as 0.25. As the number of classes grows, this gap approaches 0.5. Thus, we can have models that are perfectly calibrated according to ECE (with $\mathrm{ECE} = 0$) but that can have variable-based calibration error ranging from 0.25 to 0.5. We will show later in our experimental results section that this type of gap is not just a theoretical artifact but also exists in real-world datasets, for real-world classifiers and for specific variables of interest. The proof of this theorem is by construction, using a model that is very underconfident for certain regions of $V$ and very overconfident in other regions of $V$, but perfectly calibrated with respect to its scores $s$.
In earlier work, kumar2019verified proved that the binned empirical estimator consistently underestimates the true ECE, and showed by construction that this gap can approach 0.5. Our results complement this work in that we are concerned with the true theoretical relationship between two different measures of calibration, namely ECE and VECE, whereas kumar2019verified relate the empirical estimate (Equation 3) to the true ECE (Equation 2).
Theorem (ECE bound). There exist $K$-ary classifiers $f$ and variables $V$ such that the classifier has $\mathrm{VECE} = 0$ and $\mathrm{ECE}$ arbitrarily close to $\frac{K-1}{2K}$.
Again, we prove this by construction, where $f$ is well-calibrated with respect to a variable $V$, but its low scores are very underconfident, and its high scores are very overconfident.
The results above illustrate that the ECE and VECE measures can be very different for the same model $f$. In our experimental results we will also show that it is not uncommon (particularly for uncalibrated models) for ECE and VECE to be equal. To understand the case of equality, we first define the notion of consistent over- or under-confidence with respect to a variable:
Definition (Consistent overconfidence). Let $f$ be a classifier with scores $s(x)$. For a variable $V$ taking values $v$, $f$ is consistently overconfident if $\mathbb{E}[s \mid v] > \mathrm{acc}(v)$ for all $v$, i.e., the expected value of the model's scores as a function of $v$ is always greater than the true accuracy as a function of $v$.
Consistent underconfidence can be defined analogously, using $\mathbb{E}[s \mid v] < \mathrm{acc}(v)$. In the special case where the variable is defined as the score itself, we have the condition $s > \mathrm{acc}(s)$ for all $s$, leading to consistent overconfidence for the scores.
For the case of consistent over- or under-confidence for a model $f$, we have the following result:
Theorem (Equality conditions for ECE and VECE). Let $f$ be a classifier that is consistently under- or over-confident with respect both to its scores $s$ and to a variable $V$. Then the ECE and VECE of $f$ are equal.
The results above provide insight into the relationship between ECE and VECE. Specifically, if the miscalibration is "one-sided" (i.e., consistently over- or under-confident for both the score $s$ and a variable $V$) then ECE and VECE will be in agreement. However, when the classifier is both over- and under-confident (as a function of either $s$ or $v$), then ECE and VECE can differ significantly and, as a result, ECE can mask significant systematic miscalibration with respect to variables of interest.
6 Mitigation of Variable-Based Miscalibration
6.1 Diagnosis of Variable-Based Miscalibration
In order to better detect and characterize per-variable miscalibration, we discuss below variable-based calibration plots, which we have found useful in practice. Figure 1(b) shows an example of a variable-based calibration plot for age, and in Section 7 we explore how these plots can be used to characterize miscalibration across different classifiers, datasets, and variables of interest.
For ease of interpretation, in the results below we focus on the model's error rate and predicted error, rather than accuracy and confidence, although the two views are equivalent. Particularly for models with high accuracy, we find that it is more intuitive to discuss differences in error rate than in accuracy.
To generate these plots, we first compute the individual error and predicted error (i.e., one minus the confidence score) for each observation. We then construct nonparametric error curves with LOESS (further details are available in Appendix B). This approach allows us to obtain 95% confidence intervals for the error rate and mean predicted error, based on the standard error, thus putting the differences between the curves into perspective.
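As a sketch of how such curves can be computed, the snippet below smooths the per-observation error and predicted error against a variable with LOWESS from statsmodels. Note this is an assumption-laden stand-in: statsmodels performs local linear fits (rather than the quadratic fit described in Appendix B) and does not produce the confidence bands directly.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def variable_based_calibration_curves(v, confidences, correct, frac=0.8):
    """Return LOWESS-smoothed (v, error rate) and (v, predicted error) curves,
    the two quantities plotted in a variable-based calibration plot."""
    v = np.asarray(v, dtype=float)
    error = 1.0 - np.asarray(correct, dtype=float)           # 0/1 error per point
    pred_error = 1.0 - np.asarray(confidences, dtype=float)  # 1 - confidence
    err_curve = lowess(error, v, frac=frac)   # columns: sorted v, smoothed value
    pred_curve = lowess(pred_error, v, frac=frac)
    return err_curve, pred_curve
```

The two returned arrays can then be plotted against each other (e.g., with matplotlib) to visualize where the predicted error diverges from the true error rate.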
Beyond visualization, we can use VECE scores to discover which variables in a dataset exhibit the most systematic variable-based miscalibration. Ranking features in order of decreasing VECE can highlight variables that may be worth investigating. An example of such a ranking for the UCI Adult Income dataset, based on a neural network with post-hoc beta recalibration (kull2017), is shown in Table 1. The years of education and age variables rank highest in VECE, so a model developer or a user of a model might find it useful to generate a variable-based calibration plot for each of these. The weekly work hours and census weight variables are of lesser concern, but could also be explored. We will perform an in-depth investigation of miscalibration with respect to the variable Age in Section 7.1.
Table 1: VECE and worst-case VCE for each variable in the Adult Income dataset, after beta calibration.

Variable             VECE     max VCE(v)
Years of education   9.95%    20.13%
Age                  9.59%    23.44%
Weekly work hours    7.94%    18.21%
Census weight        5.06%    12.08%
It is also possible to define the maximum value of $\mathrm{VCE}(v)$, i.e., the worst-case calibration error $\mathrm{VCE}(v^*) = \max_v \mathrm{VCE}(v)$, as well as the value $v^* = \arg\max_v \mathrm{VCE}(v)$ that incurs this worst-case error.
Estimating either $v^*$ or $\mathrm{VCE}(v^*)$ accurately may be difficult in practice, particularly for small sample sizes $n$, since it involves the nonparametric estimation (bowman1996graphical) of the difference of two curves as a function of $v$, as the shapes of the curves need not follow any convenient parametric form (e.g., see Figure 1(b)). One simple estimation strategy is to smooth both curves with LOESS and compute the maximum difference between the two estimated curves. Using this LOESS-based approach, worst-case calibration errors for the Adult Income model are also shown in Table 1.
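This smooth-then-take-the-maximum-gap strategy can be sketched as follows; as a simplifying assumption, statsmodels' local-linear LOWESS stands in for the quadratic LOESS fit used for the figures.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def worst_case_vce(v, confidences, correct, frac=0.8):
    """Estimate max_v VCE(v) and the value v* attaining it, by smoothing the
    accuracy and confidence curves over V and taking their maximum gap."""
    v = np.asarray(v, dtype=float)
    acc_curve = lowess(np.asarray(correct, dtype=float), v, frac=frac)
    conf_curve = lowess(np.asarray(confidences, dtype=float), v, frac=frac)
    gap = np.abs(acc_curve[:, 1] - conf_curve[:, 1])  # pointwise |acc - conf|
    i = int(np.argmax(gap))
    return gap[i], acc_curve[i, 0]  # (estimated VCE(v*), estimated v*)
```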
6.2 Recalibration Methods
We found empirically, across multiple datasets, that standard score-based recalibration techniques often reduce ECE while neglecting variable-based systematic miscalibration. Because calibration error can vary as a function of a feature of interest $V$, we propose incorporating information about $V$ during recalibration. In particular, we introduce the concept of variable-based recalibration, a family of recalibration methods that adjust confidence scores with respect to some variable of interest $V$. As an illustrative example, we perform experiments in Section 7 with a modification of probability calibration trees (leatharttrees). This technique involves performing logistic calibration separately for data splits, defined by decision trees trained over the input space. We alter the method to train decision trees with only $V$ as input, with a minimum leaf size of one-tenth of the total calibration set size. We then perform beta calibration at each leaf (kull2017), as we found that it performs empirically better than logistic calibration. In the multiclass case, we use Dirichlet calibration, an extension of beta calibration for $K$-class classification (kull2019). Our use of split-based recalibration using decision trees is intended to provide a straightforward illustration of the potential benefits of variable-based calibration, rather than to provide a state-of-the-art methodology that can balance ECE and VECE (which we leave to future work). We also investigated variable-based recalibration methods that operate continuously over $V$ (rather than on separate data splits) using extensions of logistic and beta calibration, but found that these were not as reliable in our experiments as the tree-based approach (see Appendix C for details).

7 Variable-Based Miscalibration in Practice
In this section, we explore several examples where the ECE obscures systematic miscalibration relative to some variable of interest, particularly after score-based recalibration. In our experiments we use four datasets that span tabular, text, and image data. For each dataset and variable of interest $V$, we investigate both (1) several score-based calibration methods and (2) our variable-based recalibration (the tree-based method described in Section 6.2), comparing the resulting ECE, VECE, and variable-based calibration plots. In particular, we calibrate with scaling-binning (kumar2019verified), Platt scaling (plattprobabilistic), beta calibration (kull2017), and, for the multiclass case, Dirichlet calibration (kull2019). The datasets are split into training, calibration, and test sets. Each calibration method is trained on the same calibration set, and all metrics and figures are produced from the final test set. The ECE and VECE are computed with an equal-support binning scheme. Further details regarding datasets, models, and calibration are in Appendix B.
7.1 Adult Census Records: Predicting Income
The UCI Adult Income dataset (https://archive.ics.uci.edu/ml/datasets/adult) consists of 1994 Census records, where the goal is to predict whether an individual's annual income is greater than $50,000. We model this data with a simple feedforward neural network and evaluate the model's calibration error with respect to age (i.e., $V$ = age). Uncalibrated, this model has an ECE and VECE of 20.67% (see Table 2). The ECE and VECE are equal precisely because of the model's consistent overconfidence as a function of both the confidence score and age (see the definition of consistent overconfidence in Section 5). The overconfidence with respect to age is reflected in the variable-based calibration plot (Figure 2(a)). The model's error rate varies significantly as a function of age, with very high error for individuals around age 50, and much lower error for younger and older people. However, its confidence remains nearly constant at close to 100% (i.e., a predicted error close to 0%) across all ages.

Table 2: ECE and VECE for the Adult Income model, before and after recalibration.

Method                      ECE     VECE
Uncalibrated                20.67%  20.67%
Scaling-binning             2.27%   9.25%
Platt scaling               4.57%   10.13%
Beta calibration            1.65%   9.59%
Variable-based calibration  1.64%   2.11%
After recalibrating, the ECE is dramatically reduced, with beta calibration achieving an ECE of 1.65%. However, the corresponding VECE is still very high (over 9%). As shown in Figure 2(b), the model's self-predicted error has increased substantially, but remains nearly constant as a function of age. Thus, despite a significant improvement in ECE, this recalibrated model still harbors unfairness with respect to age, exhibiting overconfidence in its predictions for individuals in the 35-65 age range, and underconfidence for those outside of it. As the model is no longer consistently overconfident, the ECE and VECE diverge, as predicted theoretically.
Variable-based calibration obtains a significantly lower VECE of 2.11%, while simultaneously reducing the ECE. This improvement in VECE is reflected in Figure 2(c). The model's predicted error now varies with age to match the true error rate. In this case, simple variable-based recalibration improves the age-wise systematic miscalibration of the model, without detriment to the overall calibration error.
7.2 Yelp Reviews: Predicting Sentiment
To explore variable-based calibration in an NLP context, we use a fine-tuned large language model, BERT (devlin2018), on the Yelp review dataset (https://www.yelp.com/dataset). The model predicts whether a review has a positive or negative rating based on its text. In this case there are no easily-interpretable features directly input to the model. Instead, to better diagnose model behavior, we can analyze real-valued characteristics of the text, such as the length of each review or part-of-speech statistics. Here we focus on review length in characters.
Figure 3(a) shows the model's error and predicted error with respect to review length. The error rate is lowest for reviews of around 300 to 700 characters, around the median review length. Very short and very long reviews are associated with a higher error rate. This model is consistently overconfident, with an uncalibrated ECE and VECE of 1.93% (see Table 3).
Table 3: ECE and VECE for the Yelp review model, before and after recalibration.

Method                      ECE     VECE
Uncalibrated                1.93%   1.93%
Scaling-binning             4.23%   4.23%
Platt scaling               3.04%   0.64%
Beta calibration            1.73%   0.37%
Variable-based calibration  1.70%   0.23%
The ECE and VECE diverge after beta calibration, which obtains the lowest ECE among the score-based methods (1.73%) and a substantially reduced VECE of 0.37%. Figure 3(b) reflects this: the model's predicted error aligns more closely with its actual error rate, although it is still notably overconfident for very short reviews.
Variable-based recalibration reduces the VECE slightly further, while yielding a small improvement in the overall ECE. After variable-based calibration, the predicted error curve matches the true relationship between review length and error rate more faithfully, reducing overconfidence for short reviews (Figure 3(c)).
7.3 Bank Marketing: Predicting Subscriptions
We also investigate miscalibration for a simple neural network modeling the UCI Bank Marketing dataset (https://archive.ics.uci.edu/ml/datasets/bank+marketing). The model predicts whether a bank customer will subscribe to a bank term deposit as a result of direct marketing. Uncalibrated, the model is overconfident, with ECE and VECE over 4.5% (see Table 4). Consider the calibration error with respect to customer age, both before (Figure 4(a)) and after (Figure 4(b)) recalibration. The best-performing recalibration technique, Platt scaling, uniformly increases the predicted error across age, reducing both ECE and VECE, but resulting in underconfidence for most ages and overconfidence at the edges of the distribution.
Table 4: ECE and VECE for the Bank Marketing model, before and after recalibration.

Method                      ECE     VECE
Uncalibrated                4.69%   4.69%
Scaling-binning             4.37%   3.39%
Platt scaling               2.38%   2.83%
Beta calibration            2.48%   2.77%
Variable-based calibration  2.10%   0.52%
Variable-based recalibration achieves competitive ECE, while reducing VECE to about half of one percent. Figure 4(c) reflects this improvement. The predicted error after variable-based recalibration matches the true error rate more closely, reducing the miscalibration with respect to customer age.
Figure 5: Variable-based calibration plots for the CIFAR-10H model for median reaction time.

7.4 CIFAR-10H: Image Classification
As a multiclass example, we investigate variable-based miscalibration on CIFAR-10H, a 10-class image dataset including labels and reaction times from human annotators (peterson2019human). We use a standard deep learning image classification architecture (a DenseNet model) to predict the image category, and investigate median annotator reaction time, which is metadata that is not provided to the model. Instead of Platt scaling and beta calibration, here we use Dirichlet calibration (to accommodate the multiple classes).
Here, Dirichlet calibration achieves the lowest overall ECE and variablebased calibration obtains the lowest VECE (see Table 5). The variablebased calibration plots are shown in Figure 5. We see that variablebased recalibration reduces underconfidence for examples with low median reaction times (where the majority of data points lie).
Table 5: ECE and VECE for the CIFAR-10H model, before and after recalibration.

Method                      ECE     VECE
Uncalibrated                1.90%   1.92%
Scaling-binning             3.83%   3.60%
Dirichlet calibration       0.80%   1.12%
Variable-based calibration  1.31%   0.60%
Summary of Experimental Results
First, our results demonstrate the potential of variable-based recalibration. While score-based recalibration techniques generally improved the ECE, variable-based recalibration performed better across datasets in terms of simultaneously reducing both the ECE and VECE, without any significant increase in model error rate or in the VECE for other variables (details in Appendix B). The results also illustrate that variable-based calibration plots enable meaningful characterization of the relationships between variables of interest and predicted/true error, providing more detailed insight into a model's performance than a single number (i.e., ECE or VECE).
8 Discussion and Conclusions
Discussion of Limitations
There are several potential limitations of this work. First, we focused on the mitigation of miscalibration for one variable at a time. Although we did not observe recalibration with respect to one variable worsening the VECE for another variable, this behavior has not been analyzed theoretically. Further, a more thorough investigation of miscalibration and recalibration across intersections of variables is still warranted. We also emphasize that the variable-based calibration technique used in this paper is primarily for illustration; the development of new methods for simultaneously reducing score-based and variable-based miscalibration is a useful direction for future work.
Conclusions
To summarize, in this paper we demonstrated theoretically and empirically that ECE can obscure significant miscalibration with respect to variables of potential importance to a developer or user of a classification model. To better detect and characterize this type of miscalibration, we introduced the VECE measure and corresponding variable-based calibration plots, and characterized the theoretical relationship between VECE and ECE. In a case study across several datasets and models, we showed that VECE, variable-based calibration plots, and variable-based recalibration are all useful tools for understanding and mitigating miscalibration on a per-variable level. Looking forward, to mitigate biases in calibration error, we recommend moving beyond purely score-based calibration analysis. In addition to promoting fairness, these techniques offer new insight into model behavior and provide actionable avenues for improvement.
References
Appendix A Proofs for Section 5
Theorem (VECE bound). There exist $K$-ary classifiers $f$ and variables $V$ such that the classifier has both $\mathrm{ECE} = 0$ and variable-based calibration error $\mathrm{VECE}$ arbitrarily close to $\frac{K-1}{2K}$.
Proof. Let $V$ be a continuous variable with density $p(v)$. Recall that $\mathrm{VCE}(v) = |\mathrm{acc}(v) - \mathbb{E}[s \mid v]|$, where $\mathrm{acc}(v)$ is the accuracy of model $f$ as a function of $v$, and the score $s$ is the probability that the model assigns to its label prediction $\hat{y}$.

The reliability diagram for a $K$-ary classifier has scores $s \in [1/K, 1]$, where the leftmost value for this interval is a result of the fact that the score is defined as the maximum of the $K$ class probabilities. Let $c = \frac{1}{2}\left(\frac{1}{K} + 1\right)$ be the midpoint of this interval.

Assume that the scores $s$ have a uniform distribution of the form $s \sim \mathrm{Uniform}(c - \epsilon, c + \epsilon)$, where $\epsilon$ is some constant with $0 < \epsilon < 1 - c$, and that the scores and the variable $V$ are independent. Further assume that the accuracy of the model depends on $s$ and $v$ in the following manner:

$\mathrm{acc}(s, v) = \begin{cases} s + \delta & \text{if } v \le m \\ s - \delta & \text{if } v > m \end{cases}$

where $m$ is the median of $V$ and $\delta$ is defined such that $\delta = 1 - c - \epsilon$, so that $\mathrm{acc}(s, v) \in [0, 1]$.

The marginal accuracy as a function of the score (marginalizing over $v$) can be written as

$\mathrm{acc}(s) = \tfrac{1}{2}(s + \delta) + \tfrac{1}{2}(s - \delta) = s.$

The marginal accuracy as a function of $v$ (marginalizing over $s$) is

$\mathrm{acc}(v) = c + \delta \ \text{ for } v \le m, \qquad \mathrm{acc}(v) = c - \delta \ \text{ for } v > m.$

This setup is designed so that the score matches the accuracy as a function of $s$ (to minimize ECE), but the variable-based expected scores are relatively far away from the accuracy as a function of $v$.

Under these assumptions we can write the ECE as

$\mathrm{ECE} = \mathbb{E}_s\left[ \left| \mathrm{acc}(s) - s \right| \right] = 0. \qquad (8)$

We can write the VECE as

$\mathrm{VECE} = \mathbb{E}_v\left[ \left| \mathrm{acc}(v) - \mathbb{E}[s \mid v] \right| \right] = \delta = 1 - c - \epsilon. \qquad (9)$

Thus, as $\epsilon \to 0$, $\mathrm{ECE} = 0$ and $\mathrm{VECE} \to 1 - c = \frac{K-1}{2K}$.
Theorem (ECE bound). There exist $K$-ary classifiers $f$ and variables $V$ such that the classifier has $\mathrm{VECE} = 0$ and $\mathrm{ECE}$ arbitrarily close to $\frac{K-1}{2K}$.
Proof. Let $V$ be a continuous variable with density $p(v)$. Recall that a $K$-ary classifier has scores $s \in [1/K, 1]$, where we let $c = \frac{1}{2}\left(\frac{1}{K} + 1\right)$ be the midpoint of this interval. Assume that $f$ produces scores from two uniform distributions, with equal probability: $s \sim \mathrm{Uniform}(1/K, 1/K + \epsilon)$ and $s \sim \mathrm{Uniform}(1 - \epsilon, 1)$, where $\epsilon$ is some constant with $0 < \epsilon < 1 - c$, and that the scores and the variable $V$ are independent. Finally, suppose the accuracy of the model is independent of $s$ and $v$, with $\mathrm{acc}(s, v) = c$.

Under these assumptions we can write the VECE as

$\mathrm{VECE} = \mathbb{E}_v\left[ \left| \mathrm{acc}(v) - \mathbb{E}[s \mid v] \right| \right] = \left| c - \mathbb{E}[s] \right| = 0, \qquad (10)$

since, by symmetry of the two uniform components around $c$, $\mathbb{E}[s] = c$. We can write the ECE as

$\mathrm{ECE} = \mathbb{E}_s\left[ \left| \mathrm{acc}(s) - s \right| \right] = \mathbb{E}_s\left[ \left| c - s \right| \right] = \frac{K-1}{2K} - \frac{\epsilon}{2}. \qquad (11)$

Thus, as $\epsilon \to 0$, $\mathrm{VECE} = 0$ and $\mathrm{ECE} \to \frac{K-1}{2K}$.
Definition (Consistent overconfidence). Let $f$ be a classifier with scores $s(x)$. For a variable $V$ taking values $v$, $f$ is consistently overconfident if $\mathbb{E}[s \mid v] > \mathrm{acc}(v)$ for all $v$, i.e., the expected value of the model's scores as a function of $v$ is always greater than the true accuracy as a function of $v$. Consistent underconfidence is defined analogously with $\mathbb{E}[s \mid v] < \mathrm{acc}(v)$. In the special case where the variable is defined as the score itself, we have $s > \mathrm{acc}(s)$ for all $s$.
Theorem (Equality conditions for ECE and VECE). Let $f$ be a classifier that is consistently under- or over-confident with respect both to its scores $s$ and to a variable $V$. Then the ECE and VECE of $f$ are equal.
Proof. Without loss of generality, suppose $f$ is consistently underconfident with respect to both its scores $s$ and the variable $V$.

Then we have, by consistent underconfidence:

$\mathrm{ECE} = \mathbb{E}_s\left[ \left| \mathrm{acc}(s) - s \right| \right] = \mathbb{E}_s\left[ \mathrm{acc}(s) - s \right] = P(\hat{y} = y) - \mathbb{E}[s] = \mathbb{E}_v\left[ \mathrm{acc}(v) - \mathbb{E}[s \mid v] \right] = \mathbb{E}_v\left[ \left| \mathrm{acc}(v) - \mathbb{E}[s \mid v] \right| \right] = \mathrm{VECE}. \qquad (12)$
Appendix B Calibration, Model, and Dataset Details
Here, we include additional information and plots for each dataset and model discussed in Section 7. Code for reproducing all tables and plots is available online at https://github.com/markellekelly/variable-wise-calibration.
On each dataset, we test several existing recalibration techniques: Platt scaling, scaling-binning, beta calibration, and (for the multiclass case) Dirichlet calibration. For scaling-binning, we calibrate over 10 bins, and for Dirichlet calibration, we use a lambda value of 1e-3; these values were chosen based on the respective authors' provided examples. Here and in Section 7, we present the uncalibrated and variable-based calibrated output, along with the best-performing score-based calibration method (for the Adult and Yelp datasets, beta calibration; for Bank Marketing, Platt scaling; for CIFAR, Dirichlet calibration).
Our variable-based recalibration is performed as follows. Given the calibration set, a decision tree classifier is trained to predict the outcome with input $V$ (the single variable of interest). We use a maximum depth of two and a minimum leaf size of one-tenth of the size of the calibration set. The calibration set is then split according to the leaf nodes of the trained decision tree, and separately the rest of the dataset is split according to the same rules. Standard beta calibration is then performed separately for each split, using the subset of the original calibration set as the new calibration set, and computing recalibrated probabilities for the corresponding subset of the original dataset.
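A simplified sketch of this split-then-calibrate procedure is given below. As a simplifying assumption, it substitutes Platt-style logistic calibration (via scikit-learn) for the beta calibration used in our experiments; the tree depth and minimum-leaf-size behavior follow the description above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

def _logit(p):
    """Log-odds of a probability, clipped away from 0 and 1."""
    p = np.clip(np.asarray(p, dtype=float), 1e-6, 1 - 1e-6)
    return np.log(p / (1 - p))

def fit_variable_based_recalibrator(v_cal, score_cal, y_cal,
                                    max_depth=2, min_leaf_frac=0.1):
    """Split the calibration set by a shallow decision tree over V, then fit
    one calibration map per leaf (logistic here; beta in the paper)."""
    v_cal = np.asarray(v_cal, dtype=float).reshape(-1, 1)
    y_cal = np.asarray(y_cal)
    tree = DecisionTreeClassifier(
        max_depth=max_depth,
        min_samples_leaf=max(1, int(min_leaf_frac * len(v_cal))),
    ).fit(v_cal, y_cal)
    calibrators = {}
    leaves = tree.apply(v_cal)                 # leaf index for each point
    z = _logit(score_cal).reshape(-1, 1)
    for leaf in np.unique(leaves):
        m = leaves == leaf
        if len(np.unique(y_cal[m])) == 2:      # need both classes to fit
            calibrators[leaf] = LogisticRegression().fit(z[m], y_cal[m])
    return tree, calibrators

def recalibrate(tree, calibrators, v, score):
    """Map raw positive-class scores to per-leaf recalibrated probabilities."""
    v = np.asarray(v, dtype=float).reshape(-1, 1)
    out = np.asarray(score, dtype=float).copy()
    leaves = tree.apply(v)
    z = _logit(score).reshape(-1, 1)
    for leaf, lr in calibrators.items():
        m = leaves == leaf
        if m.any():
            out[m] = lr.predict_proba(z[m])[:, 1]
    return out
```

On data where the model's overconfidence varies with $V$, the per-leaf calibrators pull predicted probabilities toward the empirical accuracy within each region of $V$, which is exactly the behavior the score-based methods cannot express.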
Variable-based calibration plots are created with LOESS, using a quadratic local fit and an assumed symmetric distribution of the errors, with empirically chosen smoothing factors between 0.8 and 0.9.
We report the VECE for each numeric variable in each dataset before and after the recalibration described above. Empirically, we find that variable-based calibration with respect to one variable is generally not detrimental to the VECE of other variables.
Finally, we observe that variable-based recalibration does not tend to significantly degrade accuracy. Accuracies for each dataset before and after recalibration are shown in Table 6.
Table 6: Classification accuracy for each dataset before and after recalibration.

                            Adult Income   Yelp    Bank Marketing   CIFAR
Uncalibrated                    79.1%      98.0%        88.9%       97.2%
Score-based calibrated          79.1%      98.0%        88.7%       96.9%
Variable-based calibrated       79.1%      98.0%        88.7%       96.0%
B.1 Adult Income
The Adult Income dataset was modeled with a multilayer perceptron, with two hidden layers of sizes 100 and 75. Of the 48,842 observations, 32,561 were used for training, 2,500 were used for calibration, and 13,781 were used for testing. The dataset includes six continuous variables: age, fnlwgt (the estimated number of people an individual represents), education-num (a number representing the individual's years of education), capital-gain, capital-loss, and hours-per-week (the number of hours per week that an individual works).
Based on the beta-calibrated model, education-num and age rank highest in VECE, as shown in Table 7. For all six variables, VECE is reduced by performing recalibration with respect to age:
Table 7: VECE for each numeric variable in the Adult Income dataset.

                  Uncalibrated   Beta-calibrated   Variable-based calibrated
education-num        20.67%           9.95%                 8.53%
age                  20.67%           9.59%                 2.11%
hours-per-week       20.67%           7.94%                 6.02%
fnlwgt               20.67%           5.06%                 4.10%
capital-gain         20.67%           1.50%                 1.39%
capital-loss         20.67%           1.50%                 1.39%
Uncalibrated, the model's ECE and VECE are both 20.67%. Of the score-based calibration methods tested, beta calibration achieves the lowest ECE, 1.65%. Relevant reliability diagrams and variable-based calibration plots for the uncalibrated, beta-calibrated, and variable-based-calibrated models are shown in Figure 6.
B.2 Yelp
The Yelp dataset was modeled with a fine-tuned BERT model. 100,000 observations were randomly sampled from the full Yelp dataset. Of these, 70,500 were used for training, 10,000 were used for calibration, and 19,500 were used for testing. Several continuous features were generated from the raw text reviews, including length in characters, number of special characters, and the proportion of each part of speech. Based on the beta-calibrated model, review length ranked highest in VECE, followed by the proportion of stop words, as shown in Table 8.
Table 8: VECE for each generated feature in the Yelp dataset.

                       Uncalibrated   Beta-calibrated   Variable-based calibrated
Length (characters)        1.93%           0.37%                 0.23%
Stop-word proportion       1.93%           0.29%                 0.28%
Named entity count         1.93%           0.21%                 0.22%
Uncalibrated, the model's ECE and VECE are both 1.93%. Of the score-based calibration methods tested, beta calibration achieves the lowest ECE, 1.73%. Relevant reliability diagrams and variable-based calibration plots for the uncalibrated, beta-calibrated, and variable-based-calibrated models are shown in Figure 7.
B.3 Bank Marketing
The Bank Marketing dataset was modeled with a multilayer perceptron, with two hidden layers of sizes 100 and 75. Of the 45,211 total observations, 31,647 were used for training, 1,000 were used for calibration, and 12,564 were used for testing. Based on the model calibrated with Platt scaling, account balance ranked highest in VECE, followed by age, as shown in Table 9.
Table 9: VECE for account balance and age in the Bank Marketing dataset.

                  Uncalibrated   Platt-scaled   Variable-based calibrated
Account balance       5.35%          4.17%               3.22%
Age                   4.69%          2.83%               0.52%
Uncalibrated, the model's ECE is 4.69%. Of the score-based calibration methods tested, Platt scaling achieves the lowest ECE, 2.38%. Relevant reliability diagrams and variable-based calibration plots for the uncalibrated, Platt-scaled, and variable-based-calibrated models are shown in Figure 8.
B.4 CIFAR-10H
The CIFAR-10H dataset was modeled with a DenseNet model. Of the 10,000 total observations, 4,057 were used for training, 2,000 were used for calibration, and 3,943 were used for testing.
Uncalibrated, the model's ECE and VECE are 1.90% and 1.92%, respectively. Of the score-based calibration methods tested, Dirichlet calibration achieves the lowest ECE, 0.80%. Relevant reliability diagrams and variable-based calibration plots for the uncalibrated, Dirichlet-calibrated, and variable-based-calibrated models are shown in Figure 9.
Appendix C Alternate Recalibration Methods
As an alternative variable-based calibration method, we extend logistic and beta calibration, which operate continuously over the score $s$, to incorporate information regarding $v$. In particular, logistic calibration learns a mapping of scores $s \mapsto \hat{q}$, with parameters $a$ and $b$ learned via logistic regression:
$$\hat{q} = \frac{1}{1 + e^{-(as + b)}}.$$
This can be augmented to include $v$ by simply training the logistic regression on both $s$ and $v$, learning the following mapping:
$$\hat{q} = \frac{1}{1 + e^{-(as + c_V v + b)}},$$
where $c_V$ is the logistic regression coefficient corresponding to $v$.
Similarly, beta calibration learns the following mapping, where the parameters $a$, $b$, and $c$ are learned by training a logistic regression on $\ln s$ and $-\ln(1-s)$ (see kull2017 for more details):
$$\hat{q} = \frac{1}{1 + e^{-(a \ln s \, - \, b \ln(1-s) \, + \, c)}}.$$
This can also be augmented with $v$, including it as a third input to the regression:
$$\hat{q} = \frac{1}{1 + e^{-(a \ln s \, - \, b \ln(1-s) \, + \, c_V v \, + \, c)}}.$$
In contrast to the tree-based method detailed in the main paper, which splits the data along $v$ and then calibrates each subset separately, these methods learn a single calibration mapping for the entire dataset. Empirically, we find that augmented beta calibration is a promising approach, simultaneously reducing ECE and VECE, although some attention must be paid to the fit of the logistic regression (e.g., by including a quadratic term in $v$). However, in our experiments, this technique was ultimately not as reliable as tree-based calibration (perhaps because the functional form of beta calibration is not flexible enough to always correct systematic miscalibration as a function of $v$).
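A sketch of augmented beta calibration (again our own illustration, not the authors' code, assuming scikit-learn; the function name and the `quadratic` flag for the optional quadratic term in $v$ are our own):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def augmented_beta_calibration(s_cal, v_cal, correct_cal, s_new, v_new,
                               quadratic=False, eps=1e-6):
    """Beta calibration with v (and optionally v^2) as extra inputs
    to the underlying logistic regression."""
    def features(s, v):
        s = np.clip(s, eps, 1 - eps)
        cols = [np.log(s), -np.log(1 - s), v]
        if quadratic:
            cols.append(v ** 2)
        return np.column_stack(cols)

    # one global mapping over (score, v), rather than per-split mappings
    lr = LogisticRegression().fit(features(s_cal, v_cal), correct_cal)
    return lr.predict_proba(features(s_new, v_new))[:, 1]
```

Setting the coefficient on $v$ to zero recovers standard beta calibration, so the augmented model can only improve the training fit on the calibration set.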
Here, we include the results of augmented-beta variable-based (VB) recalibration on the Adult Income, Yelp, and Bank Marketing datasets. The models for the Adult and Bank Marketing datasets include a quadratic term in $v$, which obtained a better fit. (Note that this formulation applies only to binary classification, so we do not include results for the CIFAR dataset here.)
Adult Income:
                                   ECE      VECE
Uncalibrated                     20.67%    20.67%
Beta calibration                  1.65%     9.59%
Tree-based VB calibration         1.64%     2.11%
Augmented-beta VB calibration     1.49%     1.87%
Yelp:
                                   ECE      VECE
Uncalibrated                      1.93%     1.93%
Beta calibration                  1.73%     0.37%
Tree-based VB calibration         1.70%     0.23%
Augmented-beta VB calibration     1.73%     0.37%
Bank Marketing:
                                   ECE      VECE
Uncalibrated                      4.69%     4.69%
Platt scaling                     2.38%     2.83%
Tree-based VB calibration         2.10%     0.52%
Augmented-beta VB calibration     2.09%     1.13%