Honest calibration assessment for binary outcome predictions

by   Timo Dimitriadis, et al.

Probability predictions from binary regressions or machine learning methods ought to be calibrated: If an event is predicted to occur with probability x, it should materialize with approximately that frequency, which means that the so-called calibration curve p(x) should equal the bisector for all x in the unit interval. We propose honest calibration assessment based on novel confidence bands for the calibration curve, which are valid only subject to the natural assumption of isotonicity. Besides testing the classical goodness-of-fit null hypothesis of perfect calibration, our bands facilitate inverted goodness-of-fit tests whose rejection allows for the sought-after conclusion of a sufficiently well specified model. We show that our bands have a finite sample coverage guarantee, are narrower than existing approaches, and adapt to the local smoothness and variance of the calibration curve p. In an application to model predictions of an infant having a low birth weight, the bounds give informative insights on model calibration.



page 8


Binary Classifier Calibration: Bayesian Non-Parametric Approach

A set of probabilistic predictions is well calibrated if the events that...

Calibration tests in multi-class classification: A unifying framework

In safety-critical applications a probabilistic model is usually require...

Model-based ROC (mROC) curve: examining the effect of case-mix and model calibration on the ROC plot

The performance of a risk prediction model is often characterized in ter...

Making Heads and Tails of Models with Marginal Calibration for Sparse Tagsets

For interpreting the behavior of a probabilistic model, it is useful to ...

Individual Calibration with Randomized Forecasting

Machine learning applications often require calibrated predictions, e.g....

Application of Equal Local Levels to Improve Q-Q Plot Testing Bands with R Package qqconf

Quantile-Quantile (Q-Q) plots are often difficult to interpret because i...

Temporal Probability Calibration

In many applications, accurate class probability estimates are required,...

Code Repositories


Replication code (simulation+application) for the paper 'Honest calibration assessment for binary outcome predictions'.

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Let be given covariates in and independent binary observations such that for some unknown function . In practice, the covariates can be probability predictions for the components of , e.g., stemming from a test sample of binary regressions, machine learning methods, or any other statistical model for binary data. A reliable interpretation of these predictions relies on the property of calibration, meaning that the so-called calibration curve is sufficiently close to the identity. For instance, if a fetus is predicted to have a low birth weight with probability , decisions on a potential medical treatment rely on this probability prediction being accurate enough, for some small .

Testing calibration, closely related to goodness-of-fit testing, is crucial in applications (Tutz2011; Hosmer2013Book) and is still regularly carried out by the classical test of HosmerLemeshow1980, which groups the predictions into bins and applies a -test. It is however subject to multiple criticisms: First, its ad hoc choice of bins can result in untenable instabilities (Bertolini2000; Allison2014). Second, placing the hypothesis of calibration in the null only allows for rejecting calibration rather than showing that a model is sufficiently well calibrated, where the latter would be highly desirable for applied researchers. Third, the test rejects essentially all, even acceptably well-specified models in large samples (Nattino2020; Paul2013), resulting in calls for a goodness-of-fit tests with inverted hypotheses (Nattino2020Rejoinder).

We propose a statistically sound solution to these criticisms by constructing honest, simultaneous confidence bands for the function . That is, for a given small number , we compute data-dependent functions and on such that


which we call calibration band. It allows for the desirable conclusion that with confidence , the true calibration curve lies inside the band, simultaneously for all values of the predicted probabilities.

Hence, it resolves the above mentioned criticisms of classical goodness-of-fit tests. Figure 1 shows the bands in a large data example for probit model predictions for the binary outcome of a fetus having a low birth weight. See Section 6 for additional details. The test of Hosmer and Lemeshow clearly rejects calibration even though our bands indicate a well-calibrated model, especially for the, in this application, most important region of small probability predictions shown in the magnified right panel of the figure. Our bands also nest a goodness-of-fit test with classical hypotheses by checking whether the band contains the bisector for all relevant values

. It is important to notice that even though we build our bands on the model predictions, the methodology applies equally to both, causal and predictive regressions. An open-source implementation in the statistical software

R is available under https://github.com/marius-cp/calibrationband.

Figure 1: Left: Calibration bands for the first model specification of the low birth weight application in Section 6

. The blue band denotes the calibration band and the grey step function the isotonic regression estimate. Right: Magnified version focusing on predicted probabilities below


Our confidence bands are valid in finite samples subject only to the mild assumption that the function is increasing,


which is natural in the context of assessing calibration as decreasing parts of the calibration curve can be dismissed as nonsensical predictions resulting from severely misspecified models (DGJ_2021; Roelofs2020). As can be expected for a non-parametric, pathwise and almost universally valid confidence band, we require large data sets of at least above

observations to obtain sensibly narrow bands. These are exactly the sample sizes where the classical goodness-of-fit tests become uninformative by rejecting all models in applications. For example, in a simulation study on assessing a logistic regression model with minor misspecification,

Kramer2007 find that the Hosmer-Lemeshow test at level achieves a power of for and essentially for .

A theoretical analysis shows that the proposed confidence band adapts locally to the smoothness of the function and to the variance of the observations. Adaptivity to the smoothness means that the width of the bands decreases faster with the sample size in regions where is constant, and at a slower rate where is steeper. This property is known for more general confidence bands for a monotone mean function developed by Yang2019, which are proved to be more conservative than our bands in the case of binary outcomes. Adaptivity to the variance means that the band is substantially narrower at if is close to zero or one, compared to near . In many practical applications, including the low birth weight predictions analyzed in this article, predicted probabilities close to zero or one are of most relevance and a sharp assessment of calibration is particularly important.

Existing methods for the construction of confidence bands in this setting are rare with the following two exceptions: First, Nattino2014 propose the use of confidence bands based on a parametric assumption on the function , which we show to have incorrect coverage in almost all of our simulation settings. Second, the nonparametric bands of Yang2019 are valid but shown to be wider than our bands as we show in theory and simulations.

We explain the absence of competing methods by their theoretical difficulties. Using asymptotic theory of the isotonic regression estimator is complicated as it requires the estimation of nuisance quantities such as the derivative of the unknown function , the convergence rate depends on the functional form of , it is subject to more restrictive assumptions and only results in bands with a pointwise interpretation (Wright1981). Resampling schemes are theoretically found to be inconsistent for the isotonic regression (Sen2010; GuntuboyinaSen2018). Other non-parametric approaches in the literature for constructing confidence bands for functions, many of them presented in the review by Hall2013, are often pointwise, not simultaneous, and require the selection of tuning parameters that may lead to instabilities, similar to the choice of the bins in the Hosmer-Lemeshow test. In contrast, the confidence bands proposed here are simple to compute and do not involve any implementation decisions resulting in a stable and reproducible method as recently called for by Stodden2016; Yu2020.

2 Construction of the confidence bands

In what follows we focus on confidence bounds and for , where . Indeed, if


defines a confidence band satisfying (1) with the auxiliary values , and , .

Our confidence bands are based on the classical confidence bounds of Clopper1934 for a binomial parameter. Suppose that

is a binomial random variable with parameters

and . For let


denotes the distribution function of the binomial distribution with parameters

and , while

stands for the quantile function of the beta distribution with parameters

. Then

For the representation of and in terms of beta quantiles we refer to Johnson2005.

Assumption (2) allows to construct confidence bands for as follows. For arbitrary indices , the random sum

is stochastically larger than a binomial random variable with parameters and , and it is stochastically smaller than a binomial variable with parameters and . Thus, as explained in Lemma B.1,


If we combine these bounds for all pairs in a given set , then we may claim with confidence that simultaneously for all ,

Specifically, let be the set of all index pairs such that and and . If there are tied covariate values, selects the outermost indices of the tied values. Hence, if contains different points, then . Consequently, for a given confidence level , we may combine the bounds and with to obtain a first confidence band.

Theorem 2.1.

For let


If satisfies the isotonicity assumption (2), then the resulting confidence band satisfies requirement (1).

In the definition (4), taking the minimum over index pairs with is equivalent to the minimum over if . When there are ties in the covariate, it is possible that the minimum is attained with an index but . Analogously, it is possible that the maximum in (5) is attained with an index but .

The confidence band proposed in Theorem 2.1 has two potential drawbacks. First, a natural nonparametric estimator for the function under the assmption (2) is given by a minimizer of over all isotonic functions (DGJ_2021). This minimizer is unique on the set . But there is no guarantee that . Second, the upper and lower bounds in (4) and (5) may even cross, resulting in an empty, and hence, nonsensical confidence band. These problems can be dealt with by using the non-crossing confidence band with


This band satisfies on , no matter how is defined for . Our simulations experiments indicate that with probability ; see Section 5 for details.

A potential obstacle in the practical application of the confidence bands proposed in this section is that their computation requires steps. This can be relieved by reducing the number of distinct values in the covariate by rounding, which often has almost no visible effect on the appearance of the bands. If differences in the covariate smaller than for some are regarded as negligible, one can round up to the next multiple of for the computation of the upper bound, and round off to the next lower multiple of to compute the lower bound. This still yields a valid confidence band for the function but guarantees that , which also implies a less conservative correction of the confidence level. The number should not be too small since the resulting confidence bands are constant on intervals of length thereby limiting their adaptivity.

3 Relation to Yang2019

The methods of Yang2019 may be adapted to the present regression setting with covariates as follows: With the isotonic estimator introduced before, let



This defines a confidence band with the following property:


where is any fixed isotonic function minimizing . Thus one obtains a confidence band with guaranteed covergage probability for an isotonic approximation of , even if (2) is violated. The proof of (9) follows from the arguments of Yang2019, noting that the random variables are subgaussian with scale parameter . That means, for all , which implies that for arbitrary ,

see Hoeffding1963. The following result shows that the confidence bands and are always contained in the band .

Theorem 3.1.

For all

and any data vector


For the applications considered in the present paper, the validity of a confidence band in case of violating (2) is not essential. It should be mentioned, however, that the band has a computational advantage. For the computation of in (7), it suffices to take the minimum over endpoints of constancy regions of , that is, all such that and or , see Proposition B.3 in the appendix. Likewise, for the computation of in (8), it suffices to take the maximum over all such that or and . While the computation of or requires steps, the following lemma implies that the computation of requires only steps.

Lemma 3.2.

The cardinality of is smaller than .

4 Theoretical properties of the confidence bands

This section illustrates consistency and adaptivity properties of the confidence band , where the subscript indicates the sample size, and we consider a triangular scheme of observations , . We are interested in situations in which the observed covariates could be the realizations of the order statistics of a random sample. Thus we have to extend the framework of Yang2019 and consider the following assumption.

(A)  Let denote Lebesgue measure, and let for . There exist constants such that for sufficiently large ,

for arbitrary intervals such that .

This assumption comprises the setting of Yang2019. Let be a differentiable distribution function on such that is bounded away from . If for , then (A) is satisfied for any and arbitrary . The arguments in Moesching2020 can be modified to show that if are the order statistics of independent random variables with distribution function , then Condition (A) is satisfied almost surely, provided that are chosen appropriately.

Theorem 4.1.

Suppose that condition (A) is satisfied. Let .

(i) Suppose that is constant on some non-degenerate interval . Then

for any fixed interval .

(ii) Suppose that is Lipschitz-continuous on a non-degenerate interval . Then

(iii) Suppose that for some constant , as . Then,

An analogous statement holds for the lower bound .

(iv) Suppose that is discontinuous at some point . Then for any number strictly between and and a suitable constant ,

with asymptotic probability one as .

Parts (i-ii) of this theorem are analogous to the results of Yang2019. Part (iii) demonstrates that our bounds are particularly accurate in regions where is close to or . Presumably, the conclusions in parts (iii-iv) are not satisfied for the confidence band .

5 Simulations

Here, we illustrate that our calibration bands have correct coverage in the sense of (1) and are narrower than existing techniques. We consider both, the raw method in (4) and (5) and the non-crossing variant in (6). Both methods are combined with the rounding technique to three digits after the comma as described in the end of Section 2 in order to facilitate faster computation at a minimal cost in accuracy. For comparison, we use the isotonic bands of Yang2019 given in (7) and (8) with a minimal variance factor of and the parametric bands of Nattino2014, implemented in the GivitiR package in the statistical software R (R2022). Replication material for the simulations and applications is available under https://github.com/marius-cp/replication_DDHPZ22.

Figure 2: Illustration of the five simulated calibration curves , where the solid red line corresponds to the shape parameter value and the dashed blue line to .

We use 1000 replications, a significance level of and simulate the predictions . The binary outcomes are generated by based on five distinct functional forms of the calibration curve for depending on a shape parameter . All specifications of satisfy the isotonicity assumption in (2) and they cover smooth, non-smooth as well as discontinuous setups. The choice results in perfectly calibrated forecasts with whereas the misscalibration increases with . In particular, we consider the following specifications, which are illustrated in Figure 2 for two exemplary shape values .

  1. Monomial:  First, we use a calibration curve defined by , where . This is already used in the simulations assessing the CORP reliability diagram in DGJ_2021.

  2. S-shaped:  Second, the calibration curve follows an S-shaped form , where pronounces the curves for larger values of .

  3. Kink:  Third,

    linearly interpolates the points

    and for , resulting in a kink at the point for all .

  4. Disc:  Fourth, we have a perfect calibration close to the borders , and a rotating, miscalibrated disc, within , where the rotation increases with .

  5. Step:  Fifth, we use a step function with equidistant steps in the unit interval. Formally, it is given by , where and . Notice that this choice does not nest a correctly specified model, but its misspecification still increases with .

Figure 3 presents the average coverage rates for a range of sample sizes between 512 and 32 768. We find that, as predicted by our theory, our calibration bands have conservative coverage throughout all simulation setups and sample sizes. We observe coverage rates between 0.998 and 1, with the majority of 179 out of the 212 displayed coverage values being exactly one. We dispense with a presentation of the coverage rates of the non-crossing and Yang2019 bands, as both are guaranteed to be larger by Theorem 3.1. The non-crossing bands differ from the raw ones in less than one out of a hundred thousand simulated forecast values. These deviations occur exclusively for large values of in the Step and Disc specifications within constancy regions of the calibration curve .

Figure 3: Empirical coverage rates of our calibration bands and the GiVitI bands for averaged over all forecast values , for the five specifications of the calibration curve , different shape values and a range of sample sizes . Notice that the choices in the monomial, and in step specification are not defined.

The parametric bands of Nattino2014 rarely achieve correct coverage rates unless in the cases and for the S-shaped calibration curve. This can be explained as these bands are based on the assumption of a certain parametric form of , which is rarely satisfied. The results get worse for the non-smooth and the two discontinuous specifications.

Figure 4: Top: Average empirical widths of the confidence bands by sample size for the non-crossing and Yang2019 bands for each of the five specifications of given in the main text for a fixed degree of misspecification . Bottom: Average empirical widths by forecast value for two sample sizes. In both panels, the solid red line corresponds to the non-crossing bands and the dashed blue line to the Yang2019 bands.

Figure 4 displays the average widths of the non-crossing and Yang2019 bands. We present the theoretically wider non-crossing bands instead of the raw versions thereof. Their average widths is however non-distinguishable in these displays. We fix a medium degree of miscalibration . The upper plot panel displays the widths averaged over all simulation runs and forecast values depending on the sample size . We find that the size of both bands shrinks with and that we can reconfirm the ordering established in Theorem 3.1. We further see that our bands are only narrow enough for practical use in large samples. The relative gain in width of our bands is the highest for large sample sizes, exactly for which we propose the application of our method for calibration validation. It is worth noting that the bands of Yang2019 are more generally valid than for the special case of binary observations.

The lower plot panel shows the widths averaged over the simulation replications, but depending on the forecast value for two selected sample sizes. It shows that the relative gains in width upon the bands of Yang2019 are particularly pronounced close to the edges of the unit interval. These regions of predicted probabilities close to zero or one are often of the highest interest in assessing calibration as for example in the subsequent application to low birth weight probability predictions.

6 Application: Predicting low birth weight probabilities

We apply our calibration bands to binary regressions predicting the probability of a fetus having a low birth weight, defined as weighting less than 2500 grams at birth (def_who). We use U.S. Natality Data from the Natality2017, which provides demographic and health data for 3 864 754 births in the year 2017. For the data set at hand, a low birth weight is observed in 8.1% of the cases.

Figure 5: Calibration bands for the second model specification on the left and for the third specification on the right for the low birth weight application. The blue band denotes the calibration band and the grey step function the isotonic regression estimate. The bisector is given in red color whenever it is not contained in the calibration band.

We estimate three binary regression models by maximum likelihood on the same randomly drawn subset that contains all but 1 000 000 observations that we leave for external model validation. All three models contain standard risk factors such as the mother’s age, body mass index and smoking behavior but they differ as follows. The first model uses a probit link function, and the explanatory variable week of gestation is categorized into four left-closed and right-open intervals with lower interval limits of 0, 28, 32 and 37 weeks, pertaining to the standard definitions of the World Health Organization of extremely, very, moderate and non preterm (Quinn2016PretermDef). Through this categorization, the model specification can capture the week of gestation in a non-linear fashion. In contrast, the second model uses the week of gestation as a continuous explanatory variable and the third specification employs the cauchit instead of the probit link function, which is known to produce less confident predictions close to zero and one (Koenker2009). Additional details of the model specifications are given in Appendix 1.

The classical Hosmer-Lemeshow test rejects perfect calibration of all three models with p-values of essentially zero for both, internal and external model validation, which leaves an applied researcher without any useful conclusions on model calibration. We show our calibration bands for the first model in Figure 1 and for the other two model specifications in Figure 5. We use the non-crossing method with rounding to three digits with a confidence level of .

For the first model, the calibration bands encompass the bisector for all forecast values, meaning that we cannot reject the null hypothesis of perfect calibration at the level. More importantly, we are certain that the true calibration curve lies within the the band at any point , implying that we are confident that the model is at least as well calibrated as specified by the band. This is especially notable in the important region of predictions below in the magnified right panel of Figure 1, where the confidence bands are remarkably close to the bisector implying a particularly well calibrated model. E.g., we are confident to conclude that a prediction of occurs on average with a probability between and .

In contrast, we reject calibration for both, the second and third model specifications as shown in Figure 5. However, these bands are much more informative than a simple test rejection as they directly show the exact form of model miscalibration. E.g., for the second model specification, we can conclude that the predicted probabilities are particularly miscalibrated for values larger than whereas the third specification entails miscalibrated probabilities for predictions below that are presumably of the highest importance for medical decision making. While small predicted values from the second specification might still be treated as relatively reliable, they should be interpreted with great caution when stemming from the third model. The wider bands for the third model specification between predicted probabilities of and are caused by relatively little predictions in this interval.


A. Henzi and J. Ziegel gratefully acknowledge financial support from the Swiss National Science Foundation.

Appendix A Model specifications in the low birth weight application

We give some additional details on the model specifications of the application here. The first two models are based on the probit link function whereas the third one uses the cauchit link function (Koenker2009)

. The second model uses the week of gestation as a continuous variable whereas the first and third models use the week of gestation as a categorical variable with left-closed and right-open intervals with lower interval limits of 0, 28, 32 and 37 weeks, which corresponds to the standard categorization of the World Health Organization


Additionally, all three models contain the following common explanatory variables: the mother’s age and its squared term, her body mass index prior to pregnancy, her smoking behavior as a categorical variable with left-closed and right-open intervals with lower limits of 0, 1, 9, and 20 cigarettes per day averaged over all three trimesters, individual binary variables for mother’s diabetes, any form of hypertension, mother’s education below or equal to eight years, employed infertility treatments, a cesarean in a previous pregnancy, a preterm birth in a previous pregnancy, current multiple pregnancy, the sex of the unborn child, and an infection of one of the following: gonorrhea, syphilis, chlamydia, hepatitis b, hepatitis c. Additional details on the data are given in the user guide under


Appendix B Proofs and Technical Lemmas

Lemma B.1.

Let be independent Bernoulli variables with expectations , and let . Then for any ,

Proof of Lemma b.1.

For the upper bound, note that is increasing in . If , then . By Shaked2007, is stochastically larger than with binomial distribution with parameters and , so , where the last inequality follows from the validity of the Clopper-Pearson confidence bounds. The proof for the lower bound is similar. ∎

The proof of Theorem 3.1 uses standard results for isotonic least squares regression and the following inequalities of Hoeffding1963.

Lemma B.2.

Let be independent random variables with values in and expectations . Suppose that , and set . Then for arbitrary ,

where .

Corollary 1.

For integers , and any number ,

where .

In addition, the proof of Theorem 3.1 makes use of the following proposition which is of independent interest, since it implies a more efficient method for computing the bounds of Yang2019.

Proposition B.3.

For an arbitrary observation vector , let be an increasing function minimizing . For some and any index , let

Then, the minimum for is attained at some such that and or . The maximum for is attained at some such that or and .

Proof of Proposition b.3.

Consider the statement about . The claim about follows from the fact that for fixed , is increasing and is decreasing in . As to the upper index , note that is the minimum of over all such that . Let be indices such that for . Then, for ,


Consequently, for ,

is a concave function of , and it is increasing in if . This implies that

Consequently, the minimum of over all is attained at some such that or , and this entails that . The statement about follows from the one about when are replaced by and by . ∎

Proof of Theorem 3.1.

The inequalities and , as well as hold by construction. It is therefore sufficient to show that and . As to the inequality , we know that equals

for some with and or , where . As explained later, this implies that


But then it follows from Corollary 1 that is greater than or equal to

Inequality (10) follows from a standard result about isotonic regression (see for example Henzi2020, Characterization II). The index interval may be partitioned into index intervals , where is any value in . For such an index interval, , with equality if .

The inequality for the lower bound follows from the one for the upper bound when are replaced by and by . ∎

Proof of Lemma 3.2.

Let be the different elements of , where we assume that . There exists a partition of into index intervals such that . For any integer , let be the number of indices such that . Since , the numbers satisfy the following constraints: , and . The question is, how large the number can be under these constraints, where we drop the restriction that the are integers. Suppose that and for integers . Then we may replace with , where is the minimum of and . This does not affect the constraints, but the sum increases strictly, while or . Eventually, we obtain an integer such that if and for . In particular,

whence , while

where . ∎

For the proof of Theorem 4.1, we need an inequality for the auxiliary function in Lemma B.2 which follows from Duembgen1998.

Lemma B.4.

For arbitrary , and , the inequality implies that

Proof of Theorem 4.1.

For notational convenience, we often drop the additional subscript , e.g. we write instead of