# Distribution-free binary classification: prediction sets, confidence intervals and calibration

We study three notions of uncertainty quantification—calibration, confidence intervals and prediction sets—for binary classification in the distribution-free setting, that is without making any distributional assumptions on the data. With a focus towards calibration, we establish a 'tripod' of theorems that connect these three notions for score-based classifiers. A direct implication is that distribution-free calibration is only possible, even asymptotically, using a scoring function whose level sets partition the feature space into at most countably many sets. Parametric calibration schemes such as variants of Platt scaling do not satisfy this requirement, while nonparametric schemes based on binning do. To close the loop, we derive distribution-free confidence intervals for binned probabilities for both fixed-width and uniform-mass binning. As a consequence of our 'tripod' theorems, these confidence intervals for binned probabilities lead to distribution-free calibration. We also derive extensions to settings with streaming data and covariate shift.


### 1 Introduction

Let $\mathcal{X}$ and $\mathcal{Y} = \{0, 1\}$ denote the feature and label spaces for binary classification. Consider a predictor $f : \mathcal{X} \to \mathcal{Z}$ that produces a prediction in some space $\mathcal{Z}$. If $\mathcal{Z} = \{0, 1\}$, $f$ corresponds to a point prediction for the class label, but often class predictions are based on a ‘scoring function’. Examples are $\mathcal{Z} = \mathbb{R}$ for SVMs, and $\mathcal{Z} = [0, 1]$ for logistic regression, random forests with class probabilities, or deep models with a softmax top layer. In such cases, a higher value of $f(X)$ is often interpreted as higher belief that $Y = 1$. In particular, if $\mathcal{Z} = [0, 1]$, it is tempting to interpret $f(X)$ as a probability, and hope that

$$f(X) \approx \mathbb{P}(Y = 1 \mid X). \tag{1}$$

However, such hope is unfounded, and in general (1) will be far from true without strong distributional assumptions, which may not hold in practice. Valid uncertainty estimates that are related to (1) can be provided, but ML models do not satisfy these out of the box. This paper discusses three notions of uncertainty quantification: calibration, prediction sets (PS) and confidence intervals (CI), defined next. A function $f : \mathcal{X} \to [0, 1]$ is said to be (perfectly) calibrated if

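$$\mathbb{E}[Y \mid f(X) = a] = a \quad \text{a.s. for all } a \text{ in the range of } f. \tag{2}$$

Define $\mathcal{L} \equiv \{\{0\}, \{1\}, \{0, 1\}, \emptyset\}$ and fix $\alpha \in (0, 1)$. A function $S : \mathcal{X} \to \mathcal{L}$ is a $(1-\alpha)$-PS if

$$\mathbb{P}(Y \in S(X)) \geq 1 - \alpha. \tag{3}$$

Finally, let $\mathcal{I}$ denote the set of all subintervals of $[0, 1]$. A function $C : \mathcal{X} \to \mathcal{I}$ is a $(1-\alpha)$-CI if

$$\mathbb{P}(\mathbb{E}[Y \mid X] \in C(X)) \geq 1 - \alpha. \tag{4}$$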

All three notions are ‘natural’ in their own sense, but also different at first sight. We show that they are in fact tightly connected (see Figure 1), and focus on the implications of this result for calibration. Our analysis is in the distribution-free setting, that is, we are concerned with understanding what kinds of valid uncertainty quantification are possible without distributional assumptions on the data.

Our work primarily extends the ideas of Vovk et al. [47, Section 5] and Barber [3]. We also discuss Platt scaling [36], binning [51] and the recent work of Vaicenavicius et al. [44]. Other related work is cited as needed, and further discussed in Section 5. All proofs appear, in order, in the Appendix.

Notation: Let $P$ denote any distribution over $\mathcal{X} \times \mathcal{Y}$. In practice, the available labeled data is often split randomly into the training set and the calibration set. Typically, we use $n$ to denote the number of calibration data points, so $\{(X_i, Y_i)\}_{i \in [n]}$ is the calibration data, where we use the shorthand $[a] := \{1, 2, \ldots, a\}$. A prototypical test point is denoted $(X_{n+1}, Y_{n+1})$. All data are drawn i.i.d. from $P$, denoted succinctly as $\{(X_i, Y_i)\}_{i \in [n+1]} \sim P^{n+1}$. As above, random variables are denoted in upper case. The learner observes realized values of all random variables $(X_i, Y_i)$, except $Y_{n+1}$.

### 2 Calibration, confidence intervals and prediction sets

Calibration captures the intuition of (1) but is a weaker requirement, and was first studied in the meteorological literature for assessing probabilistic rain forecasts [5, 39, 31, 7]. Murphy and Epstein [31] described the ideal notion of calibration, called perfect calibration (2), which has also been referred to as calibration in the small [45], or sometimes simply as calibration [12, 44, 7]. The types of functions that can achieve perfect calibration can be succinctly captured as follows.

###### Proposition 1.

A function $f : \mathcal{X} \to [0, 1]$ is perfectly calibrated if and only if there exists a space $\mathcal{Z}$ and a function $g : \mathcal{X} \to \mathcal{Z}$, such that

$$f(x) = \mathbb{E}[Y \mid g(X) = g(x)] \quad \text{almost surely } P_X. \tag{5}$$

(If parsing (5) is tricky: to evaluate $f$ at $x$, first set $z = g(x)$, then calculate $\mathbb{E}[Y \mid g(X) = z]$.) Vaicenavicius et al. [44] stated and gave a short proof for the ‘only if’ direction. While the other direction is also straightforward, together they lead to an appealingly simple and complete characterization. The proof of Proposition 1 is in Appendix A.

It is helpful to consider two extreme cases of Proposition 1. First, setting $g$ to be the identity function yields that the Bayes classifier $\mathbb{E}[Y \mid X]$ is perfectly calibrated. Second, setting $g(\cdot)$ to any constant implies that $\mathbb{E}[Y]$ is also a perfect calibrator. Naturally, we cannot hope to estimate the Bayes classifier without assumptions, but even the simplest calibrator $\mathbb{E}[Y]$ can only be approximated in finite samples. Since Proposition 1 states that calibration is possible iff the RHS of (5) is known exactly for some $g$, perfect calibration is impossible in practice. Thus we resort to satisfying the requirement (2) approximately, which is implicitly the goal of many empirical calibration techniques.

###### Definition 1 (Approximate calibration).

A predictor $f : \mathcal{X} \to [0, 1]$ is $(\varepsilon, \alpha)$-approximately calibrated for some $\alpha \in (0, 1)$ and a function $\varepsilon : [0, 1] \to [0, 1]$ if with probability at least $1 - \alpha$, we have

$$|\mathbb{E}[Y \mid f(X)] - f(X)| \leq \varepsilon(f(X)). \tag{6}$$

Note that when the definition is applied to a test point $(X_{n+1}, Y_{n+1})$, there may be two sources of randomness in $f(X_{n+1})$: the randomness in the test point, as well as randomness in $f$; the latter may be statistical randomness via learning on the training data, or algorithmic randomness used to train $f$. There can also be randomness in $\varepsilon$. All probabilities and expectations in this paper should be viewed through this lens. In practice, calibration is often achieved via a post-processing step. Hence, with an increasing amount of calibration data, one might hope that $\varepsilon$ in Definition 1 vanishes to $0$. We formalize this below.

###### Definition 2 (Asymptotic calibration).

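A sequence of predictors $\{f_n\}_{n \in \mathbb{N}}$ from $\mathcal{X} \to [0, 1]$ is asymptotically calibrated at level $\alpha \in (0, 1)$ if there exists a sequence of functions $\{\varepsilon_n\}_{n \in \mathbb{N}}$ such that $f_n$ is $(\varepsilon_n, \alpha)$-approximately calibrated for every $n$, and $\varepsilon_n(f_n(X_{n+1})) = o_P(1)$.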

We will show that the notions of approximate and asymptotic calibration are related to prediction sets (3) and confidence intervals (4). PSs and CIs are only ‘informative’ if the sets or intervals produced by them are small: confidence intervals are measured by their length (denoted as $|C(\cdot)|$), and prediction sets are measured by their diameter ($\operatorname{diam}(S(\cdot))$). Observe that for binary classification, the diameter of a PS is either $0$ or $1$.

For a given distribution, one might expect prediction sets to have a larger diameter than the length of the confidence intervals, since we want to cover the actual value of $Y$ and not its (conditional) expectation. As an example, if $\mathbb{E}[Y \mid X = x] = 0.5$ for every $x$, then the shortest possible confidence interval is the degenerate interval $[0.5, 0.5]$, whose length is $0$. However, a valid $(1-\alpha)$-PS has no choice but to output $\{0, 1\}$ for at least a $(1 - 2\alpha)$ fraction of the points (and a random guess for the other $2\alpha$ fraction), and thus must have expected diameter at least $1 - 2\alpha$ even in the limit of infinite data.
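The lower bound in this example can be checked with a short calculation. The sketch below assumes the conditional label probability is $0.5$ everywhere and that the prediction set outputs $\{0, 1\}$ on a fraction $p$ of points and a single guessed label on the rest; the symbol $p$ is introduced here only for illustration.

```latex
% Y | X = x is Bernoulli(0.5) for every x; p = fraction of points with S(X) = {0, 1}.
% A single guessed label covers Y with probability 1/2, and {0, 1} always covers Y.
\begin{align*}
\mathbb{P}(Y \in S(X)) &= p \cdot 1 + (1 - p) \cdot \tfrac{1}{2} \;\geq\; 1 - \alpha
  \quad\Longleftrightarrow\quad p \;\geq\; 1 - 2\alpha, \\
\mathbb{E}\left[\operatorname{diam}(S(X))\right] &= p \cdot 1 + (1 - p) \cdot 0 \;=\; p \;\geq\; 1 - 2\alpha.
\end{align*}
```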

Recently, Barber [3] built on an earlier result of Vovk et al. [47] to show that if an algorithm provides an interval $C$ which is a $(1-\alpha)$-CI for all product distributions $P^{n+1}$ (of the training data and test point), then $C$ is also a $(1-\alpha)$-PS whenever $P_X$ is a nonatomic distribution. An immediate implication is that $C$ must always contain one of the end-points $0$ or $1$ with probability $1 - \alpha$. Since this implication holds for all distributions $P$, including the one with $\mathbb{E}[Y \mid X] \equiv 0.5$ discussed above, it implies that distribution-free CIs must necessarily be wide, and in particular their length cannot shrink to $0$ as $n \to \infty$. This can be treated as an impossibility result for the existence of (distribution-free) informative CIs.

One way to circumvent these impossibilities is to consider CIs for functions with ‘lower resolution’ than $\mathbb{E}[Y \mid X]$. To this end, we introduce a notion of a CI or PS ‘with respect to $f$’ (w.r.t. $f$). As we discuss in Section 3 (and Section 3.1 in particular), these notions are connected to calibration.

###### Definition 3 (CI or PS w.r.t. f).

A function $C : \mathcal{Z} \to \mathcal{I}$ is a $(1-\alpha)$-CI with respect to $f : \mathcal{X} \to \mathcal{Z}$ if

$$\mathbb{P}(\mathbb{E}[Y \mid f(X)] \in C(f(X))) \geq 1 - \alpha. \tag{7}$$

Analogously, a function $S : \mathcal{Z} \to \mathcal{L}$ is a $(1-\alpha)$-PS with respect to $f : \mathcal{X} \to \mathcal{Z}$ if

$$\mathbb{P}(Y \in S(f(X))) \geq 1 - \alpha. \tag{8}$$

When instantiated for a test point $(X_{n+1}, Y_{n+1})$, the probability in definitions (7) and (8) is not only over the test point, but also over the randomness in the pair $(f, C)$ or $(f, S)$, which are usually learned on labeled data. In order to produce PSs and CIs, one typically fixes a function $f$ learned on an independent split of the labeled data, and considers learning a $C$ or $S$ that provides guarantees (7) and (8). For example, $S$ can be produced using inductive conformal techniques [37, 34, 26]. In this case, $C$ or $S$ would be random as well; to make this explicit, we often denote $C$ or $S$ as $\hat{C}_n$ or $\hat{S}_n$.

### 3 Relating the notions of distribution-free uncertainty quantification

As alluded to above, we consider a standard setting for valid distribution-free uncertainty quantification where the ‘training’ data is used to learn a scoring function $f : \mathcal{X} \to \mathcal{Z}$ and then held-out ‘calibration’ data is used to estimate uncertainty. We establish that in this setting, the notions of calibration, PSs and CIs are closely related. Figure 1 summarizes this section’s takeaway message. Here, and in the rest of the section, if $P$ is the distribution of the data, then we denote the distribution of the random variable $Z = f(X)$ as $P_{f(X)}$.

In Section 3.1, we show that if an algorithm provides a CI, it can be used to provide a calibration guarantee and vice versa (Theorem 1). This result is true even if the CI and calibration guarantees are not assumption-free. Section 3.2 shows that for all distributions $P$ such that $P_{f(X)}$ is nonatomic, if an algorithm constructs a distribution-free CI with respect to $f$, then it can be used to construct a distribution-free PS with respect to $f$ (Theorem 2). This result might seem surprising since one typically expects the length of CIs to shrink to $0$ in the limit of infinite data, whereas PSs have a fixed distribution-dependent lower bound on their diameter. Connecting our results, we infer the key impossibility result for asymptotic calibration in Section 3.3 (Theorem 3). Informally, our result shows that for a large class of standard scoring functions $f$ (such as logistic regression, deep networks with a final softmax layer, SVMs), it is impossible to achieve distribution-free asymptotic calibration without a ‘discretization’ step. Parametric schemes such as Platt scaling [36] do not perform such discretization and thus cannot lead to distribution-free calibration. To complement this lower bound, we provide calibration guarantees for one possible discretization step (histogram binning) in Section 4.

#### 3.1 Relating calibration and confidence intervals

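Given a predictor $f$ that is $(\varepsilon, \alpha)$-approximately calibrated, there is a natural way to construct a function $C$ that is a $(1-\alpha)$-CI:

$$C(f(x)) = [f(x) - \varepsilon(f(x)),\ f(x) + \varepsilon(f(x))], \quad x \in \mathcal{X}. \tag{9}$$

On the other hand, given $C$ that is a $(1-\alpha)$-CI with respect to $f$, define for $z \in \operatorname{Range}(f)$:

$$u_C(z) := \sup\{g : g \in C(z)\}, \quad l_C(z) := \inf\{g : g \in C(z)\}, \quad m_C(z) := (u_C(z) + l_C(z))/2. \tag{10}$$

Consider the midpoint $m_C(f(x))$ as a ‘corrected’ prediction for $x \in \mathcal{X}$:

$$\tilde{f}(x) := m_C(f(x)), \quad x \in \mathcal{X}. \tag{11}$$

Then $\tilde{f}$ is $(\varepsilon, \alpha)$-approximately calibrated for a non-trivial $\varepsilon$. These claims are formalized next.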

###### Theorem 1.

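Fix any $\alpha \in (0, 1)$. Let $f : \mathcal{X} \to [0, 1]$ be a predictor that is $(\varepsilon, \alpha)$-approximately calibrated for some function $\varepsilon$. Then the function $C$ in (9) is a $(1-\alpha)$-CI with respect to $f$.

Conversely, fix a scoring function $f : \mathcal{X} \to \mathcal{Z}$. If $C$ is a $(1-\alpha)$-CI with respect to $f$, then the predictor $\tilde{f}$ in (11) is $(\varepsilon, \alpha)$-approximately calibrated for an $\varepsilon$ determined by the half-lengths of the intervals $C(z)$.

The proof is in Appendix B. An important implication of Theorem 1 is that having a sequence of predictors that is asymptotically calibrated yields a sequence of confidence intervals with vanishing length as $n \to \infty$. This is formalized in the following corollary, also proved in Appendix B.

###### Corollary 1.

Fix any $\alpha \in (0, 1)$. If a sequence of predictors $\{f_n\}_{n \in \mathbb{N}}$ is asymptotically calibrated at level $\alpha$, then construction (9) yields a sequence of functions $\{C_n\}_{n \in \mathbb{N}}$ such that each $C_n$ is a $(1-\alpha)$-CI with respect to $f_n$ and $|C_n(f_n(X_{n+1}))| = o_P(1)$.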
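The two directions of Theorem 1 are constructive, so they can be written down directly. The following is a minimal sketch of constructions (9) to (11), not the authors' code: the function names, the toy scoring function `f`, and the constant `eps` are all hypothetical choices made for illustration.

```python
from typing import Callable, Tuple

Interval = Tuple[float, float]


def interval_from_eps(eps: Callable[[float], float]) -> Callable[[float], Interval]:
    """Construction (9): C(z) = [z - eps(z), z + eps(z)] for a score z = f(x)."""
    def C(z: float) -> Interval:
        return (z - eps(z), z + eps(z))
    return C


def midpoint_predictor(f: Callable[[float], float],
                       C: Callable[[float], Interval]) -> Callable[[float], float]:
    """Constructions (10)-(11): f_tilde(x) = m_C(f(x)), the midpoint of C(f(x))."""
    def f_tilde(x: float) -> float:
        lower, upper = C(f(x))          # l_C and u_C for a closed interval
        return (lower + upper) / 2.0    # m_C
    return f_tilde


# Toy usage (all values are illustrative assumptions).
f = lambda x: 0.25 * x      # hypothetical scoring function for x in [0, 4]
eps = lambda z: 0.125       # hypothetical constant calibration error
C = interval_from_eps(eps)
f_tilde = midpoint_predictor(f, C)
```

In this toy case the interval is centred at the score, so the midpoint predictor returns the original score; the correction in (11) only changes predictions when the interval is not symmetric around `f(x)`.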

Next, we show that for a large class of scoring functions, CIs and PSs are also related in the distribution-free setting. This connection along with Corollary 2 (below) leads to an impossibility result for distribution-free asymptotic calibration for certain functions (Theorem 3 in Section 3.3).

#### 3.2 Relating distribution-free confidence intervals and prediction sets

Suppose a function satisfies a CI guarantee with respect to $f$ no matter what the data-generating distribution $P$ is. We show that such a function would also provide a PS guarantee for all $P$ such that $P_{f(X)}$ is nonatomic. To write our theorem, we define the ‘discretize’ function to transform a confidence interval $C$ to a prediction set: $\operatorname{disc}(C) := C \cap \{0, 1\}$. In the following theorem, the CI and PS guarantees provided (per equations (7) and (8)) are to be understood as marginal over both the calibration and test data. To make this explicit, we denote the CI function as $\hat{C}_n$.

###### Theorem 2.

Fix $f : \mathcal{X} \to \mathcal{Z}$ and $\alpha \in (0, 1)$. If $\hat{C}_n$ is a $(1-\alpha)$-CI with respect to $f$ for all distributions $P$, then $\operatorname{disc}(\hat{C}_n)$ is a $(1-\alpha)$-PS with respect to $f$ for all distributions $P$ for which $P_{f(X)}$ is nonatomic.

The proof is in Appendix B. It adapts the proof of Barber [3, Theorem 1]. Their result connects the notions of CI and PS, but not with respect to $f$ (as in equations (3), (4)). By adapting the result for CIs and PSs with respect to $f$, and using Theorem 1, we are able to relate CIs and PSs to calibration and use this to prove an impossibility result for asymptotic calibration. This is done in the proof of Theorem 3 in Section 3.3. A corollary of Theorem 2 that is used in Theorem 3 (but is also important on its own) is stated next.

###### Corollary 2.

Fix $f : \mathcal{X} \to \mathcal{Z}$ and $\alpha \in (0, 1)$. If $\hat{C}_n$ is a $(1-\alpha)$-CI with respect to $f$ for all $P$, and there exists a distribution $P$ such that $P_{f(X)}$ is nonatomic, then we can construct a distribution $Q$ such that

$$\mathbb{E}_{Q^{n+1}} \left| \hat{C}_n(f(X_{n+1})) \right| \geq 0.5 - \alpha.$$

The proof is in Appendix B. For a given $f$, the bound in the corollary needs the existence of a $P$ such that $P_{f(X)}$ is nonatomic. These $f$ are characterized in the discussion after Corollary 3 (Section 3.3), and formally in the proof of Theorem 3. One expects the length of a confidence interval to vanish as $n \to \infty$. Corollary 2 shows that this is impossible in a distribution-free manner for certain $f$.

#### 3.3 Necessary condition for distribution-free asymptotic calibration

The characterization of calibration in Proposition 1 shows that a function is a calibrated probabilistic classifier if and only if it takes the form (5) for some function , and in particular is calibrated by defining . Observe that for the purposes of calibration, the actual values taken by are only as informative as the partition of provided by its level sets. Denote this partition as , where . Then we may equivalently rewrite (5) as identifying values where . This allows us to re-characterize calibration as follows.

###### Corollary 3 (to Proposition 1).

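Any calibrated classifier $f$ is characterized by a partition of $\mathcal{X}$ into subsets $\{\mathcal{X}_a\}_{a \in \mathcal{A}}$ and corresponding conditional probabilities $\{f_a = \mathbb{E}[Y \mid X \in \mathcal{X}_a]\}_{a \in \mathcal{A}}$ for some index set $\mathcal{A}$.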

Corollary 1 shows that asymptotic calibration allows construction of CIs whose lengths vanish asymptotically. Corollary 2 shows however that asymptotically vanishing CIs are impossible (without distributional assumptions) for $f$ if there exists a distribution $P$ such that $P_{f(X)}$ is nonatomic. Consequently asymptotic calibration is also impossible for such $f$. If the index set $\mathcal{A}$ of the partition induced by $f$ is countable, then by the axioms of probability, $\sum_{a \in \mathcal{A}} P(X \in \mathcal{X}_a) = 1$ and so $P(X \in \mathcal{X}_a) > 0$ for at least some $a$. Thus $P_{f(X)}$ cannot be nonatomic for any $P$. On the other hand, if $\mathcal{A}$ is uncountable we can show that there always exists a $P$ such that $P_{f(X)}$ is nonatomic. Hence distribution-free asymptotic calibration is impossible for such $f$. This argument is formalized in the following theorem. In the statement, we use $\mathcal{X}^{(f)}$ to denote the partition that a function $f$ induces on $\mathcal{X}$, and we use $|\mathcal{X}^{(f)}|$ to denote its cardinality (which may be infinite). Also $\aleph_0$ denotes the largest cardinality of a countable set, which corresponds to the cardinality of $\mathbb{N}$. The proof of the following theorem is in Appendix B.

###### Theorem 3.

Let $\alpha \in (0, 0.5)$ be a fixed threshold. If a sequence of scoring functions $\{f_n\}_{n \in \mathbb{N}}$ is asymptotically calibrated at level $\alpha$ for every distribution $P$ then

$$\limsup_{n \to \infty} \left| \mathcal{X}^{(f_n)} \right| \leq \aleph_0.$$

In words, the cardinality of the partition induced by $f_n$ must be at most countable for large enough $n$. The following phrasing is convenient: $f$ is said to lead to a fine partition of $\mathcal{X}$ if $|\mathcal{X}^{(f)}| > \aleph_0$. Then, for the purposes of distribution-free asymptotic calibration, Theorem 3 requires us to consider $f$ that do not lead to fine partitions. Popular scoring functions such as logistic regression, deep neural nets with softmax output and SVMs lead to continuous $f$ that induce fine partitions of $\mathcal{X}$ and thus cannot be asymptotically calibrated without distributional assumptions.

This impossibility result can be extended to many parametric calibration schemes that ‘recalibrate’ an existing $f$ through a wrapper $h_n : \mathcal{Z} \to [0, 1]$ learnt on the calibration data, with the goal that $h_n \circ f$ is nearly calibrated: $\mathbb{E}[Y \mid h_n(f(X))] \approx h_n(f(X))$. For instance, consider methods like Platt scaling [36], temperature scaling [12] and beta calibration [20]. Each of these methods learns a continuous and monotonic (hence bijective) wrapper $h_n$, and thus $\mathbb{E}[Y \mid h_n(f(X))] = \mathbb{E}[Y \mid f(X)]$. (Monotonicity assumes that the parameters satisfy natural constraints as discussed in the original papers: $a, b \geq 0$ for beta scaling with at least one of them nonzero, $A < 0$ for Platt scaling and $T > 0$ for temperature scaling.) If $h_n$ is a good calibrator, we would have $\mathbb{E}[Y \mid f(X)] \approx h_n(f(X))$. One way to formalize this is to consider whether an interval around $h_n(f(X))$ is a CI for $\mathbb{E}[Y \mid f(X)]$. In other words: does there exist a function $\varepsilon_n : [0, 1] \to [0, 1]$ such that for every distribution $P$,

$$\tilde{C}_n(f(X)) := \left[ h_n(f(X)) - \varepsilon_n(h_n(f(X))),\ h_n(f(X)) + \varepsilon_n(h_n(f(X))) \right]$$

is a $(1-\alpha)$-CI with respect to $f$ and $\varepsilon_n(h_n(f(X_{n+1}))) = o_P(1)$? Theorem 3 shows that this is impossible if $f$ leads to a fine partition of $\mathcal{X}$, irrespective of the properties of $h_n$. Thus the aforementioned parametric calibration methods cannot lead to asymptotic calibration in general. We conjecture that the lower bound also generalizes to other continuous parametric methods that are not necessarily monotonic.
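The role of the level sets can be seen numerically. The sketch below is an illustration under assumed values, not an experiment from the paper: it applies a strictly monotone Platt-style sigmoid wrapper and a fixed-width binning map to the same continuous scores and counts the distinct outputs. The sigmoid parameters `a` and `b`, the sample size, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 1.0, size=1000)  # continuous scores f(X)


def sigmoid_wrapper(z, a=-4.0, b=2.0):
    """Platt-style wrapper 1 / (1 + exp(a * z + b)); strictly increasing when a < 0."""
    return 1.0 / (1.0 + np.exp(a * z + b))


def fixed_width_bin(z, num_bins=10):
    """Index of the fixed-width bin of [0, 1] that contains each score."""
    return np.minimum((z * num_bins).astype(int), num_bins - 1)


n_levels_raw = np.unique(scores).size
n_levels_wrapped = np.unique(sigmoid_wrapper(scores)).size   # bijection: same level sets
n_levels_binned = np.unique(fixed_width_bin(scores)).size    # at most num_bins level sets
```

The monotone wrapper relabels the scores but leaves the induced partition as fine as before, whereas binning collapses it to at most `num_bins` regions, which is the property Theorem 3 requires.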

A well-known calibration method that does not produce a fine partition of $\mathcal{X}$ is histogram binning [51]. In Section 4, we analyze histogram binning and show that any scoring function can be ‘binned’ to achieve distribution-free calibration. We explicitly quantify the finite-sample approximate calibration guarantees that automatically also lead to asymptotic calibration. We also discuss calibration in the online setting and calibration under covariate shift.

### 4 Achieving distribution-free approximate calibration

In Section 4.1, we prove a distribution-free approximate calibration guarantee given a fixed partitioning of the feature space into finitely many sets. This calibration guarantee also leads to asymptotic calibration. In Section 4.2, we discuss a natural method for obtaining such a partition using sample-splitting, called histogram binning. Histogram binning inherits the bound in Section 4.1. This shows that binning schemes lead to distribution-free approximate calibration. In Sections 4.3 and 4.4 we discuss extensions of this scheme to adaptive sampling and covariate shift respectively.

#### 4.1 Distribution-free calibration given a fixed sample-space partition

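Suppose we have a fixed partition of $\mathcal{X}$ into regions $\{\mathcal{X}_b\}_{b \in [B]}$, and let $\pi_b = \mathbb{E}[Y \mid X \in \mathcal{X}_b]$ be the expected label probability in region $\mathcal{X}_b$. Denote the partition-identity function as $\mathcal{B} : \mathcal{X} \to [B]$ where $\mathcal{B}(x) = b$ if and only if $x \in \mathcal{X}_b$. Given a calibration set $\{(X_i, Y_i)\}_{i \in [n]}$, let $\hat{s}_b := |\{i \in [n] : \mathcal{B}(X_i) = b\}|$ be the number of points from the calibration set that belong to region $\mathcal{X}_b$. In this subsection, we assume that $\hat{s}_b \geq 1$ (in Section 4.2 we show that the partition can be constructed to ensure that $\hat{s}_b$ is $\Omega(n/B)$ with high probability). Define

$$\hat{\pi}_b := \frac{1}{\hat{s}_b} \sum_{i : \mathcal{B}(X_i) = b} Y_i \quad \text{and} \quad \hat{V}_b := \frac{1}{\hat{s}_b} \sum_{i : \mathcal{B}(X_i) = b} (Y_i - \hat{\pi}_b)^2 \tag{12}$$

as the empirical average and variance of the $Y$ values in a partition. We now deploy an empirical Bernstein bound [2] to produce a confidence interval for $\pi_b$.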

###### Theorem 4.

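For any $\alpha \in (0, 1)$, with probability at least $1 - \alpha$,

$$|\pi_b - \hat{\pi}_b| \leq \sqrt{\frac{2 \hat{V}_b \ln(3B/\alpha)}{\hat{s}_b}} + \frac{3 \ln(3B/\alpha)}{\hat{s}_b}, \quad \text{simultaneously for all } b \in [B].$$

The theorem is proved in Appendix C. Using the crude deterministic bound $\hat{V}_b \leq 1$ we get that the length of the confidence interval for partition $b$ is $O(1/\sqrt{\hat{s}_b})$. However, if for some $b$, $\mathcal{X}_b$ is highly informative or homogeneous in the sense that $\pi_b$ is close to $0$ or $1$, we expect $\hat{V}_b \ll 1$. In this case, Theorem 4 adapts and provides an $O(1/\hat{s}_b)$ length interval for $\pi_b$. Let $b^\star = \arg\min_{b \in [B]} \hat{s}_b$ denote the index of the region with the minimum number of calibration examples.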

###### Corollary 4.

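For $\alpha \in (0, 1)$, the function $h_n(\cdot) := \hat{\pi}_{\mathcal{B}(\cdot)}$ is $(\varepsilon, \alpha)$-approximately calibrated with

$$\varepsilon(\cdot) = \sqrt{\frac{\hat{V}_{b^\star} \ln(3B/\alpha)}{2 \hat{s}_{b^\star}}} + \frac{3 \ln(3B/\alpha)}{2 \hat{s}_{b^\star}}.$$

Thus, $\{h_n\}_{n \in \mathbb{N}}$ is asymptotically calibrated at level $\alpha$.

The proof is in Appendix C. Thus, any finite partition of $\mathcal{X}$ can be used for asymptotic calibration. However, the finite sample guarantee of Corollary 4 can be unsatisfactory if the sample-space partition is chosen poorly, since it might lead to small $\hat{s}_{b^\star}$. In Section 4.2, we present a data-dependent partitioning scheme that provably guarantees that $\hat{s}_{b^\star}$ scales as $\Omega(n/B)$ with high probability.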
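The per-bin quantities in (12) and the half-width of Theorem 4 are straightforward to compute. The following is a minimal sketch assuming NumPy, 0-based bin indices, and at least one calibration point per bin; the function name and the toy data are hypothetical.

```python
import numpy as np


def empirical_bernstein_bins(bin_ids, labels, num_bins, alpha):
    """Per-bin estimate pi_hat_b and the Theorem 4 half-width.

    Assumes bins are indexed 0..num_bins-1 and every bin is non-empty.
    """
    bin_ids = np.asarray(bin_ids)
    labels = np.asarray(labels, dtype=float)
    log_term = np.log(3.0 * num_bins / alpha)
    pi_hat = np.empty(num_bins)
    half_width = np.empty(num_bins)
    for b in range(num_bins):
        y_b = labels[bin_ids == b]
        s_b = y_b.size                               # number of points in bin b
        pi_hat[b] = y_b.mean()                       # empirical average in (12)
        v_b = np.mean((y_b - pi_hat[b]) ** 2)        # empirical variance in (12)
        half_width[b] = np.sqrt(2.0 * v_b * log_term / s_b) + 3.0 * log_term / s_b
    return pi_hat, half_width


# Toy usage: bin 0 is homogeneous (all labels 1), bin 1 is balanced.
bin_ids = np.repeat([0, 1], 100)
labels = np.concatenate([np.ones(100), np.tile([0.0, 1.0], 50)])
pi_hat, half_width = empirical_bernstein_bins(bin_ids, labels, num_bins=2, alpha=0.1)
```

In the toy data the homogeneous bin has zero empirical variance, so only the second term of the bound remains and its interval is much shorter than that of the balanced bin, which is the adaptivity discussed after Theorem 4.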

#### 4.2 Identifying a data-dependent partition using sample splitting

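Here, we describe ways of constructing the partition $\{\mathcal{X}_b\}_{b \in [B]}$ through histogram binning [51]. Binning uses a sample splitting strategy, where the partition is learned on the first part and $\{\hat{\pi}_b\}_{b \in [B]}$ are estimated on the second part. Formally, the labeled data is split at random into the training set $\mathcal{D}_{\text{tr}}$ and calibration set $\mathcal{D}_{\text{cal}}$. Then $\mathcal{D}_{\text{tr}}$ is used to train an underlying scoring classifier $g : \mathcal{X} \to [0, 1]$ (in general the range of the classifier could be any interval of $\mathbb{R}$ but for simplicity we describe it for $[0, 1]$). The classifier $g$ usually does not satisfy a valid calibration guarantee out of the box but can be calibrated using binning as follows.

A binning scheme $\mathcal{B}$ is any partition of $[0, 1]$ into $B$ non-overlapping intervals $I_1, \ldots, I_B$, such that $\bigcup_{b \in [B]} I_b = [0, 1]$ and $I_b \cap I_{b'} = \emptyset$ for $b \neq b'$. $\mathcal{B}$ and $g$ induce a partition of $\mathcal{X}$ as follows:

$$\mathcal{X}_b = \{x \in \mathcal{X} : g(x) \in I_b\}, \quad b \in [B]. \tag{13}$$

The simplest binning scheme corresponds to fixed-width binning. In this case, bins have the form

$$I_i = \left[ \frac{i-1}{B}, \frac{i}{B} \right), \ i = 1, \ldots, B-1 \quad \text{and} \quad I_B = \left[ \frac{B-1}{B}, 1 \right].$$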

However, fixed-width binning suffers from the drawback that very few calibration points might fall into some bins, which would make the estimates less calibrated. To remedy this, we consider uniform-mass binning, which aims to guarantee that each region $\mathcal{X}_b$ contains an approximately equal number of data points from the calibration set. This is done by estimating the empirical quantiles of $g(X)$. First, the calibration set $\mathcal{D}_{\text{cal}}$ is randomly split into two parts, $\mathcal{D}^1_{\text{cal}}$ and $\mathcal{D}^2_{\text{cal}}$. Then $\hat{q}_j$, for $j \in [B-1]$, is simply defined as the $(j/B)$-th quantile of the empirical distribution of the values $g(X_i)$ for $X_i \in \mathcal{D}^1_{\text{cal}}$. Consequently, the bins are defined as (with half-open intervals, matching the fixed-width convention, so that the bins do not overlap):

$$I_1 = [0, \hat{q}_1), \quad I_i = [\hat{q}_{i-1}, \hat{q}_i), \ i = 2, \ldots, B-1 \quad \text{and} \quad I_B = [\hat{q}_{B-1}, 1].$$

Then, only $\mathcal{D}^2_{\text{cal}}$ is used for calibrating the underlying classifier. Kumar et al. [21] showed that uniform-mass binning provably controls the number of calibration samples that fall into each bin (see Appendix F.2). Building on their result, we show the following guarantee for $\hat{s}_{b^\star}$.

###### Theorem 5.

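There exists a universal constant $c$ such that if $|\mathcal{D}^1_{\text{cal}}| \geq c B \ln(2B/\alpha)$, then with probability at least $1 - \alpha$,

$$\hat{s}_{b^\star} \geq \frac{|\mathcal{D}^2_{\text{cal}}|}{2B} - \sqrt{\frac{|\mathcal{D}^2_{\text{cal}}| \ln(2B/\alpha)}{2}}.$$

Thus even if $|\mathcal{D}^1_{\text{cal}}|$ does not grow with $n$, as long as $|\mathcal{D}^2_{\text{cal}}| = \Omega(n)$, uniform-mass binning is approximately calibrated at level $\alpha$, and hence also asymptotically calibrated for any $\alpha \in (0, 1)$.

The proof is in Appendix C. In words, if we use a small number of points (independent of $n$) for uniform-mass binning, and the rest to estimate bin probabilities, we achieve (approximate/asymptotic) distribution-free calibration.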
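The uniform-mass procedure described in this section can be sketched in a few lines. This is an illustrative implementation under assumptions, not the authors' code: it uses `numpy.quantile` with its default interpolation for the empirical quantiles, 0-based bin indices, left-closed bins, and synthetic scores whose labels are drawn with probability equal to the score. All function and variable names are hypothetical.

```python
import numpy as np


def uniform_mass_edges(scores_part1, num_bins):
    """Interior bin edges: empirical (j / B)-quantiles of the scores, j = 1..B-1."""
    levels = np.arange(1, num_bins) / num_bins
    return np.quantile(scores_part1, levels)


def assign_bins(scores, edges):
    """0-based bin index of each score; bin b covers [edges[b-1], edges[b])."""
    return np.searchsorted(edges, scores, side="right")


def fit_bin_means(scores_part2, labels_part2, edges):
    """Average label per bin on the second calibration split, plus bin counts."""
    num_bins = edges.size + 1
    bins = assign_bins(scores_part2, edges)
    counts = np.bincount(bins, minlength=num_bins)
    sums = np.bincount(bins, weights=labels_part2, minlength=num_bins)
    return sums / np.maximum(counts, 1), counts


# Toy usage on synthetic data (an assumption made for illustration).
rng = np.random.default_rng(0)
scores = rng.uniform(size=2500)
labels = rng.binomial(1, scores)
edges = uniform_mass_edges(scores[:500], num_bins=10)      # first split: learn the bins
pi_hat, counts = fit_bin_means(scores[500:], labels[500:], edges)  # second split: estimate
calibrated = pi_hat[assign_bins(np.array([0.05, 0.95]), edges)]    # binned predictions
```

Only a small first split is spent on the quantiles, in line with the remark after Theorem 5, and `counts.min()` plays the role of the smallest bin count that the theorem lower-bounds.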

#### 4.3 Distribution-free calibration in the online setting

So far, we have considered the batch setting with a fixed calibration set of size $n$. However, often a practitioner might want to query additional calibration data until a desired confidence level is achieved. This is called the online or adaptive setting. In this case, the results of Section 4 are no longer valid since the number of calibration samples is unknown a priori and may even be dependent on the data. In order to quantify uncertainty in the online setting, we use time-uniform concentration bounds [14, 15]; these hold simultaneously for all possible values of the calibration set size $n$.

Fix a partition of $\mathcal{X}$, $\{\mathcal{X}_b\}_{b \in [B]}$. For some value of $n$, let the calibration data be given as $\mathcal{D}^{(n)}_{\text{cal}}$. We use the superscript notation to emphasize the dependence on the current size of the calibration set. Let $\{(X^b_i, Y^b_i)\}_{i \in [\hat{s}^{(n)}_b]}$ be examples from the calibration set that fall into the partition $\mathcal{X}_b$, where $\hat{s}^{(n)}_b$ is the total number of points that are mapped to $\mathcal{X}_b$. Let the empirical label average and cumulative (unnormalized) empirical variance be denoted as

$$\bar{Y}^b_t = \frac{1}{t} \sum_{i=1}^{t} Y^b_i, \qquad \hat{V}^+_b = 1 \vee \sum_{i=1}^{\hat{s}^{(n)}_b} \left( Y^b_i - \bar{Y}^b_{i-1} \right)^2. \tag{14}$$

Note the normalization difference between $\hat{V}^+_b$ and $\hat{V}_b$ used in the batch setting. The following theorem constructs confidence intervals for $\{\pi_b\}_{b \in [B]}$ that are valid uniformly for any value of $n$.

###### Theorem 6.

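For any $\alpha \in (0, 1)$, with probability at least $1 - \alpha$,

$$|\pi_b - \hat{\pi}_b| \leq \frac{7 \sqrt{\hat{V}^+_b \ln\left(1 + \ln \hat{V}^+_b\right)} + 5.3 \ln\left(\frac{6.3 B}{\alpha}\right)}{\hat{s}^{(n)}_b}, \quad \text{simultaneously for all } b \in [B] \text{ and all } n \in \mathbb{N}. \tag{15}$$

Thus $h_n$ is asymptotically calibrated at any level $\alpha \in (0, 1)$.

The proof is in Appendix C. Due to the crude bound $\hat{V}^+_b \leq \hat{s}^{(n)}_b$, we can see that the width of the confidence intervals roughly scales as $O\left(\sqrt{\ln(1 + \ln \hat{s}^{(n)}_b) / \hat{s}^{(n)}_b}\right)$. In comparison to the batch setting, only a small price is paid for not knowing beforehand how many examples will be used for calibration.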
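The online bound can be tracked incrementally for a single bin as labels arrive. The sketch below reads (15) with the whole numerator divided by the current bin count, and it needs a value for the running mean before any label has been seen; the choice `y_bar_0 = 0.5` is an assumption made here, as are the function name and the toy label stream.

```python
import math


def online_bin_interval(labels_in_bin, num_bins, alpha, y_bar_0=0.5):
    """Running mean and half-width of (15) after each new label in one bin.

    y_bar_0 is an assumed initial value of the running mean (not from the source).
    """
    log_term = 5.3 * math.log(6.3 * num_bins / alpha)
    running_sum, sq_dev_sum = 0.0, 0.0
    prev_mean = y_bar_0
    out = []
    for t, y in enumerate(labels_in_bin, start=1):
        sq_dev_sum += (y - prev_mean) ** 2      # deviation from the mean *before* seeing y
        running_sum += y
        prev_mean = running_sum / t             # running label average
        v_plus = max(1.0, sq_dev_sum)           # cumulative variance term in (14)
        half_width = (7.0 * math.sqrt(v_plus * math.log(1.0 + math.log(v_plus)))
                      + log_term) / t
        out.append((prev_mean, half_width))
    return out


# Toy usage: an alternating stream of labels falling into one of 10 bins.
stream = [0, 1] * 500
trace = online_bin_interval(stream, num_bins=10, alpha=0.1)
final_mean, final_half_width = trace[-1]
```

Because the bound holds for all sample sizes simultaneously, a practitioner could stop collecting data the first time `half_width` falls below a target, which is the adaptive use case motivating this section.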

#### 4.4 Calibration under covariate shift

Here, we briefly consider the problem of calibration under covariate shift [41]. In this setting, calibration data is from a ‘source’ distribution $P = P_X \times P_{Y \mid X}$, while the test point is from a shifted ‘target’ distribution $\tilde{P} = \tilde{P}_X \times P_{Y \mid X}$, meaning that the ‘shift’ occurs only in the covariate distribution while $P_{Y \mid X}$ does not change. We assume the likelihood ratio (LR)

$$w : \mathcal{X} \to \mathbb{R}; \quad w(x) := d\tilde{P}_X(x) / dP_X(x)$$

is well-defined. The following is unambiguous: if $w$ is arbitrarily ill-behaved and unknown, the covariate shift problem is hopeless, and one should not expect any distribution-free guarantees. Nevertheless, one can still make nontrivial claims using a ‘modular’ approach towards assumptions:

1. Condition (A): $w$ is known exactly and is bounded.

2. Condition (B): an asymptotically consistent estimator for $w$ can be constructed.

We show the following: under Condition (A), a weighted estimator using $w$ delivers approximate and asymptotic distribution-free calibration; under Condition (B), weighting with a plug-in estimator for $w$ continues to deliver asymptotic distribution-free calibration. It is clear that Condition (B) will always require distributional assumptions: asymptotic consistency is nontrivial for ill-behaved $w$. Nevertheless, the above two-step approach makes it clear where the burden of assumptions lies: not with the calibration step, but with the $w$ estimation step. Estimation of $w$ is a well-studied problem in the covariate-shift literature and there is some understanding of what assumptions are needed to accomplish it, but there has been less work on recognizing the resulting implications for calibration. Luckily, many practical methods exist for estimating $w$ given unlabeled samples from $\tilde{P}_X$ [4, 16, 17]. In summary, if Condition (B) can be met, then distribution-free calibration is realizable, and if Condition (B) is not met (even with infinite samples), then it implies that $w$ is probably very ill-behaved, and so distribution-free calibration is also likely to be impossible.

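For a fixed partition $\{\mathcal{X}_b\}_{b \in [B]}$, one can use the labeled data from the source distribution to estimate $\mathbb{E}_{\tilde{P}}[Y \mid X \in \mathcal{X}_b]$ (unlike $\mathbb{E}_{P}[Y \mid X \in \mathcal{X}_b]$ as before), given oracle access to $w$:

$$\check{\pi}^{(w)}_b := \frac{\sum_{i : \mathcal{B}(X_i) = b} w(X_i) Y_i}{\sum_{i : \mathcal{B}(X_i) = b} w(X_i)}. \tag{16}$$

As alluded to earlier, assume that

$$\text{for all } x \in \mathcal{X}, \quad L \leq w(x) \leq U \quad \text{for some } 0 < L \leq 1 \leq U < \infty. \tag{17}$$

The ‘standard’ i.i.d. assumption on the test point equivalently assumes $w$ is known and $L = U = 1$. We now present our first claim: $\check{\pi}^{(w)}_b$ satisfies a distribution-free approximate calibration guarantee. To show the result, we assume that the sample-space partition was constructed via uniform-mass binning (on the source domain) with sufficiently many points, as required by Theorem 5. This guarantees that all regions satisfy $\hat{s}_b = \Omega(n/B)$ with high probability.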

###### Theorem 7.

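Assume $w$ is known and bounded (17). Then for an explicit universal constant $c > 0$, with probability at least $1 - \alpha$,

$$\left| \check{\pi}^{(w)}_b - \mathbb{E}_{\tilde{P}}[Y \mid X \in \mathcal{X}_b] \right| \leq c \left( \frac{U}{L} \right)^2 \sqrt{\frac{B \ln(6B/\alpha)}{2n}}, \quad \text{simultaneously for all } b \in [B],$$

as long as $n$ is sufficiently large. Thus the predictor $x \mapsto \check{\pi}^{(w)}_{\mathcal{B}(x)}$ is asymptotically calibrated at any level $\alpha \in (0, 1)$.

The proof is in Appendix D. Theorem 7 establishes distribution-free calibration under Condition (A). For Condition (B), using unlabeled samples from the source and target domains, assume that we construct an estimator $\hat{w}_k$ of $w$ that is consistent, meaning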

 (18)

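We now define an estimator by plugging in $\hat{w}_k$ for $w$ in the right hand side of (16):

$$\check{\pi}^{(\hat{w}_k)}_b := \frac{\sum_{i : \mathcal{B}(X_i) = b} \hat{w}_k(X_i) Y_i}{\sum_{i : \mathcal{B}(X_i) = b} \hat{w}_k(X_i)}.$$

###### Proposition 2.

If $\hat{w}_k$ is consistent (18), then the predictor $x \mapsto \check{\pi}^{(\hat{w}_k)}_{\mathcal{B}(x)}$ is asymptotically calibrated at any level $\alpha \in (0, 1)$.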

In Appendix D, we illustrate through preliminary simulations that $w$ can be estimated using unlabeled data from the target distribution, and consequently approximate calibration can be achieved on the target domain. Recently, Park et al. [35] also considered calibration under covariate shift through importance weighting, but they do not show validity guarantees in the same sense as Theorem 7. For real-valued regression, distribution-free prediction sets under covariate shift were constructed using conformal prediction [42] under Condition (A); that work is thus a precursor to our modular approach.
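The weighted estimator (16), and its plug-in variant, amount to a per-bin weighted average. The following is a minimal sketch assuming NumPy, 0-based bin indices, non-empty bins, and weights supplied by the caller (either the true likelihood ratio or an estimate of it); the function name and the toy numbers are hypothetical.

```python
import numpy as np


def weighted_bin_means(bin_ids, labels, weights, num_bins):
    """Estimator (16): importance-weighted label average in each bin.

    `weights` holds w(X_i), or an estimate of it, for every calibration point.
    Assumes every bin contains at least one point with positive weight.
    """
    bin_ids = np.asarray(bin_ids)
    labels = np.asarray(labels, dtype=float)
    weights = np.asarray(weights, dtype=float)
    num = np.bincount(bin_ids, weights=weights * labels, minlength=num_bins)
    den = np.bincount(bin_ids, weights=weights, minlength=num_bins)
    return num / den


# Toy usage with two bins (illustrative numbers only).
bin_ids = np.array([0, 0, 0, 1, 1])
labels = np.array([1, 0, 1, 0, 1])
shifted = weighted_bin_means(bin_ids, labels, [2.0, 1.0, 1.0, 0.5, 1.5], num_bins=2)
unshifted = weighted_bin_means(bin_ids, labels, np.ones(5), num_bins=2)
```

With all weights equal to one the estimator reduces to the plain bin averages of Section 4.1, matching the remark that the i.i.d. setting corresponds to a constant likelihood ratio.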

### 5 Other related work

The problem of measuring calibration in binary classification was first studied in the meteorological and statistics literature [5, 39, 31, 28, 29, 30, 7, 9, 6, 10]; we refer the reader to the review by Dawid [8] for more details. Two common ways of measuring calibration that resulted from these works are reliability diagrams [9] and estimates of the squared expected calibration error (ECE) [39]: $\mathbb{E}[(\mathbb{E}[Y \mid f(X)] - f(X))^2]$. Squared ECE can easily be generalized to multiclass settings, and related notions such as absolute deviation ECE and top-label ECE have sometimes been considered, for instance in [12, 32]. ECE is typically estimated through binning, which provably leads to underestimation of ECE for calibrators with continuous output [44, 21]. This fact parallels our results showing that distribution-free calibration is not achievable by continuous methods. Certain methods have been proposed to estimate ECE without binning [53, 50], but they require distributional assumptions for provable guarantees.
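For concreteness, a common plug-in estimate of squared ECE via fixed-width binning can be sketched as follows. This is one standard form of the binned estimator, written under assumptions (NumPy, scores in $[0, 1]$, fixed-width bins), and it is not taken from the cited works; the function name is hypothetical.

```python
import numpy as np


def binned_squared_ece(scores, labels, num_bins=10):
    """Plug-in squared ECE: bin-mass-weighted squared gap between the average
    label and the average score inside each fixed-width bin."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    bins = np.minimum((scores * num_bins).astype(int), num_bins - 1)
    counts = np.bincount(bins, minlength=num_bins)
    nonempty = counts > 0
    label_sum = np.bincount(bins, weights=labels, minlength=num_bins)
    score_sum = np.bincount(bins, weights=scores, minlength=num_bins)
    mean_label = label_sum[nonempty] / counts[nonempty]
    mean_score = score_sum[nonempty] / counts[nonempty]
    mass = counts[nonempty] / scores.size
    return float(np.sum(mass * (mean_label - mean_score) ** 2))


# Toy usage: a score of 0.8 with 80% positives is calibrated within its bin,
# while a score of 0.9 with no positives is not.
ece_good = binned_squared_ece([0.8] * 5, [1, 1, 1, 1, 0])
ece_bad = binned_squared_ece([0.9] * 5, [0, 0, 0, 0, 0])
```

Averaging scores within a bin hides miscalibration that cancels inside the bin, which is one way to see why such estimates can understate the ECE of a continuous-output calibrator.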

Apart from classical methods for calibration [36, 51, 52, 33]

, some new methods have been proposed recently in the ML literature, primarily for calibration of deep neural networks

[23, 12, 22, 43, 40, 19, 18, 49, 27]. These calibration methods perform well in practice but do not have distribution-free guarantees. A calibration framework that generalizes binning schemes is Venn prediction [46, 47, 45, 48, 24]; we briefly discuss this framework and show some connections to our work in Appendix E.

Calibration has natural applications in numerous sensitive domains where uncertainty estimation is desirable (healthcare, finance, forecasting). Recently, calibrated classifiers have been used as a part of the pipeline for anomaly detection [13, 25] and label shift estimation [38, 1, 11].

### 6 Conclusion

We analyze calibration for binary classification problems from the standpoint of robustness to distributional assumptions. By connecting calibration to other ways of quantifying uncertainty, we establish that popular parametric scaling methods cannot provide provable informative calibration guarantees in the distribution-free setting. In contrast, we show that a standard nonparametric method, histogram binning, satisfies approximate and asymptotic calibration guarantees without distributional assumptions. We also establish guarantees for the cases of streaming data and covariate shift.

Takeaway message. Recent calibration methods that perform binning on top of parametric methods (Platt-binning [21] and IROvA-TS [53]) have achieved strong empirical performance. In light of the theoretical findings in our paper, we recommend some form of binning as the last step of calibrated prediction due to the robust distribution-free guarantees provided by Theorem 4.
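As an illustration of binning as a final step, the sketch below fits a simple uniform-mass (equal-frequency) binning calibrator on held-out scores and labels. The function names `fit_uniform_mass_binning` and `predict_binned` are hypothetical, and the sketch omits details that a guarantee such as Theorem 4 may rely on (for example, how calibration data is split and how ties among scores are handled), so it should be read as a simplified version under those assumptions rather than the exact procedure analyzed in the paper.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple


def fit_uniform_mass_binning(
    scores: Sequence[float], labels: Sequence[int], num_bins: int
) -> Tuple[List[float], List[float]]:
    """Fit equal-frequency bins on calibration data.

    Returns (interior bin edges, mean label per bin). Hypothetical helper:
    edges are empirical quantiles of the calibration scores.
    """
    n = len(scores)
    if n != len(labels) or num_bins < 1 or n < num_bins:
        raise ValueError("need equal-length inputs with at least num_bins points")

    ordered = sorted(scores)
    # Interior edge k is the first score of the (k+1)-th block of roughly n/num_bins points.
    edges = [ordered[(k * n) // num_bins] for k in range(1, num_bins)]

    label_sums = [0.0] * num_bins
    counts = [0] * num_bins
    for s, y in zip(scores, labels):
        b = bisect_right(edges, s)  # a score equal to an edge goes to the upper bin
        label_sums[b] += y
        counts[b] += 1

    overall_mean = sum(labels) / n  # fallback for bins left empty by ties
    bin_means = [
        label_sums[b] / counts[b] if counts[b] > 0 else overall_mean
        for b in range(num_bins)
    ]
    return edges, bin_means


def predict_binned(edges: Sequence[float], bin_means: Sequence[float], score: float) -> float:
    """Map a new score to the mean label of the bin it falls in."""
    return bin_means[bisect_right(edges, score)]
```

A usage example: with `edges, means = fit_uniform_mass_binning(cal_scores, cal_labels, num_bins=10)`, the recalibrated prediction for a new score `s` is `predict_binned(edges, means, s)`. The output takes at most `num_bins` distinct values, which matches the requirement that the level sets of the final scoring function partition the feature space into countably many sets.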

Machine learning is regularly deployed in real-world settings, including areas having high impact on individual lives such as granting of loans, pricing of insurance and diagnosis of medical conditions. Often, instead of hard classifications, these systems are required to produce soft probabilistic predictions, for example of the probability that a startup may go bankrupt in the next few years (in order to determine whether to give it a loan) or the probability that a person will recover from a disease (in order to price an insurance product). Unfortunately, even though classifiers produce numbers between 0 and 1, these are well known to not be ‘calibrated’ and hence cannot be interpreted as probabilities in any real sense, and using them in lieu of probabilities can be both misleading (to the bank granting the loan) and unfair (to the individual at the receiving end of the decision).

Thus, following early research in meteorology and statistics, in the last couple of decades the ML community has embraced the formal goal of calibration as a way to quantify uncertainty as well as to interpret classifier outputs. However, there exist other alternatives to quantify uncertainty, such as confidence intervals for the regression function and prediction sets for the binary label. There is not much guidance on which of these should be employed in practice, and what the relationship between them is, if any. Further, while there are many post-hoc calibration techniques, it is unclear which of these require distributional assumptions to work and which do not—this is critical because making distributional assumptions (for convenience) on financial or medical data is highly suspect.

This paper explicitly relates the three aforementioned notions of uncertainty quantification without making distributional assumptions, and describes what is possible and what is not. Importantly, by providing distribution-free guarantees on well-known variants of binning, we identify a conceptually simple and theoretically rigorous way to ensure calibration in high-risk real-world settings. Our tools are thus likely to lead to fairer systems, better estimates of risks of high-stakes decisions, and more human-interpretable outputs of classifiers that apply out-of-the-box in many real-world settings because of the assumption-free guarantees.

### Acknowledgements

The authors would like to thank Tudor Manole, Charvi Rastogi and Michael Cooper Stanley for comments on an initial version of this paper.

### References

• Alexandari et al. [2019] Amr Alexandari, Anshul Kundaje, and Avanti Shrikumar. Adapting to label shift with bias-corrected calibration. arXiv preprint: 1901.06852, 2019.
• Audibert et al. [2007] Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Tuning bandit algorithms in stochastic environments. In Proceedings of the 18th International Conference on Algorithmic Learning Theory, 2007.
• Barber [2020] Rina Barber. Is distribution-free inference possible for binary regression? arXiv preprint: 2004.09477, 2020.
• Bickel et al. [2007] Steffen Bickel, Michael Brückner, and Tobias Scheffer. Discriminative learning for differing training and test distributions. In Proceedings of the 24th International Conference on Machine Learning, 2007.
• Brier [1950] Glenn W Brier. Verification of forecasts expressed in terms of probability. Monthly weather review, 78(1):1–3, 1950.
• Bröcker [2012] Jochen Bröcker. Estimating reliability and resolution of probability forecasts through decomposition of the empirical score. Climate dynamics, 39(3-4):655–667, 2012.
• Dawid [1982] A Philip Dawid. The well-calibrated Bayesian. Journal of the American Statistical Association, 77(379):605–610, 1982.
• Dawid [2014] A Philip Dawid. Probability forecasting. Wiley StatsRef: Statistics Reference Online, 2014.
• DeGroot and Fienberg [1983] Morris H DeGroot and Stephen E Fienberg. The comparison and evaluation of forecasters. Journal of the Royal Statistical Society: Series D (The Statistician), 32(1-2):12–22, 1983.
• Ferro and Fricker [2012] Christopher AT Ferro and Thomas E Fricker. A bias-corrected decomposition of the Brier score. Quarterly Journal of the Royal Meteorological Society, 138(668):1954–1960, 2012.
• Garg et al. [2020] Saurabh Garg, Yifan Wu, Sivaraman Balakrishnan, and Zachary C Lipton. A unified view of label shift estimation. arXiv preprint: 2003.07554, 2020.
• Guo et al. [2017] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, 2017.
• Hendrycks et al. [2019] Dan Hendrycks, Mantas Mazeika, and Thomas Dietterich. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations, 2019.
• Howard et al. [2018] Steven R Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Uniform, nonparametric, non-asymptotic confidence sequences. arXiv preprint: 1810.08240, 2018.
• Howard et al. [2020] Steven R. Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Time-uniform chernoff bounds via nonnegative supermartingales. Probability Surveys, 17:257–317, 2020.
• Huang et al. [2007] Jiayuan Huang, Arthur Gretton, Karsten Borgwardt, Bernhard Schölkopf, and Alex J. Smola. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems. 2007.
• Kanamori et al. [2009] Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10:1391–1445, 2009.
• Kendall and Gal [2017] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems. 2017.
• Kuleshov et al. [2018] Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Accurate uncertainties for deep learning using calibrated regression. In Proceedings of the 35th International Conference on Machine Learning, 2018.
• Kull et al. [2017] Meelis Kull, Telmo M. Silva Filho, and Peter Flach. Beyond sigmoids: How to obtain well-calibrated probabilities from binary classifiers with beta calibration. Electronic Journal of Statistics, 11(2):5052–5080, 2017.
• Kumar et al. [2019] Ananya Kumar, Percy S Liang, and Tengyu Ma. Verified uncertainty calibration. In Advances in Neural Information Processing Systems. 2019.
• Kumar et al. [2018] Aviral Kumar, Sunita Sarawagi, and Ujjwal Jain. Trainable calibration measures for neural networks from kernel mean embeddings. In Proceedings of the 35th International Conference on Machine Learning, 2018.
• Lakshminarayanan et al. [2017] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.
• Lambrou et al. [2015] Antonis Lambrou, Ilia Nouretdinov, and Harris Papadopoulos. Inductive Venn prediction. Annals of Mathematics and Artificial Intelligence, 74(1-2):181–201, 2015.
• Lee et al. [2018] Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Training confidence-calibrated classifiers for detecting out-of-distribution samples. In International Conference on Learning Representations, 2018.
• Lei [2014] Jing Lei. Classification with confidence. Biometrika, 101(4):755–769, 2014.
• Milios et al. [2018] Dimitrios Milios, Raffaello Camoriano, Pietro Michiardi, Lorenzo Rosasco, and Maurizio Filippone. Dirichlet-based gaussian processes for large-scale calibrated classification. In Advances in Neural Information Processing Systems, 2018.
• Murphy [1972a] Allan H Murphy. Scalar and vector partitions of the probability score: Part i. two-state situation. Journal of Applied Meteorology, 11(2):273–282, 1972a.
• Murphy [1972b] Allan H Murphy. Scalar and vector partitions of the probability score: Part ii. n-state situation. Journal of Applied Meteorology, 11(8):1183–1192, 1972b.
• Murphy [1973] Allan H Murphy. A new vector partition of the probability score. Journal of applied Meteorology, 12(4):595–600, 1973.
• Murphy and Epstein [1967] Allan H Murphy and Edward S Epstein. Verification of probabilistic predictions: A brief review. Journal of Applied Meteorology, 6(5):748–755, 1967.
• Naeini et al. [2015] Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In 29th AAAI Conference on Artificial Intelligence, 2015.
• Niculescu-Mizil and Caruana [2005] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, 2005.
• Papadopoulos et al. [2002] Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman. Inductive confidence machines for regression. In European Conference on Machine Learning, 2002.
• Park et al. [2020] Sangdon Park, Osbert Bastani, James Weimer, and Insup Lee. Calibrated prediction with covariate shift via unsupervised domain adaptation. In 23rd International Conference on Artificial Intelligence and Statistics, 2020.
• Platt [1999] John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74. MIT Press, 1999.
• Proedrou et al. [2002] Kostas Proedrou, Ilia Nouretdinov, Vladimir Vovk, and Alex Gammerman. Transductive confidence machines for pattern recognition. In European Conference on Machine Learning, 2002.
• Saerens et al. [2002] Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural computation, 14(1):21–41, 2002.
• Sanders [1963] Frederick Sanders. On subjective probability forecasting. Journal of Applied Meteorology, 2(2):191–201, 1963.
• Seo et al. [2019] Seonguk Seo, Paul Hongsuck Seo, and Bohyung Han. Learning for single-shot confidence calibration in deep neural networks through stochastic inferences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
• Shimodaira [2000] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2):227–244, 2000.
• Tibshirani et al. [2019] Ryan J Tibshirani, Rina Foygel Barber, Emmanuel Candes, and Aaditya Ramdas. Conformal prediction under covariate shift. In Advances in Neural Information Processing Systems. 2019.
• Tran et al. [2019] Gia-Lac Tran, Edwin V Bonilla, John Cunningham, Pietro Michiardi, and Maurizio Filippone. Calibrating deep convolutional gaussian processes. In The 22nd International Conference on Artificial Intelligence and Statistics, 2019.
• Vaicenavicius et al. [2019] Juozas Vaicenavicius, David Widmann, Carl Andersson, Fredrik Lindsten, Jacob Roll, and Thomas B Schön. Evaluating model calibration in classification. In 22nd International Conference on Artificial Intelligence and Statistics, 2019.
• Vovk and Petej [2014] Vladimir Vovk and Ivan Petej. Venn-Abers predictors. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, 2014.
• Vovk et al. [2004] Vladimir Vovk, Glenn Shafer, and Ilia Nouretdinov. Self-calibrating probability forecasting. In Advances in Neural Information Processing Systems, 2004.
• Vovk et al. [2005] Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic learning in a random world. 2005.
• Vovk et al. [2015] Vladimir Vovk, Ivan Petej, and Valentina Fedorova. Large-scale probabilistic predictors with and without guarantees of validity. In Advances in Neural Information Processing Systems, 2015.
• Wenger et al. [2020] Jonathan Wenger, Hedvig Kjellström, and Rudolph Triebel. Non-parametric calibration for classification. 2020.
• Widmann et al. [2019] David Widmann, Fredrik Lindsten, and Dave Zachariah. Calibration tests in multi-class classification: a unifying framework. In Advances in Neural Information Processing Systems, 2019.
• Zadrozny and Elkan [2001] Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Proceedings of the 18th International Conference on Machine Learning, 2001.
• Zadrozny and Elkan [2002] Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.
• Zhang et al. [2020] Jize Zhang, Bhavya Kailkhura, and T Han. Mix-n-match: Ensemble and compositional methods for uncertainty calibration in deep learning. In International Conference on Machine Learning, 2020.

### Appendix A Proof of Proposition 1

The ‘if’ part of the proposition is due to Vaicenavicius et al. [44, Proposition 1]; we reproduce it for completeness. Let $\sigma(g)$ and $\sigma(f)$ be the sub-$\sigma$-algebras generated by $g$ and $f$ respectively. By definition of $f$, we know that $f$ is $\sigma(g)$-measurable and, hence, $\sigma(f) \subseteq \sigma(g)$. We now have:

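$$
\begin{aligned}
\mathbb{E}[Y \mid f(X)] &= \mathbb{E}\big[\mathbb{E}[Y \mid g(X)] \mid f(X)\big] && \text{(by the tower rule, since } \sigma(f) \subseteq \sigma(g)\text{)} \\
&= \mathbb{E}[f(X) \mid f(X)] && \text{(by property (5))} \\
&= f(X).
\end{aligned}
$$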

The ‘only if’ part can be verified for $g = f$. Since $f$ is perfectly calibrated,

$$
\mathbb{E}[Y \mid f(X) = f(x)] = f(x),
$$

almost surely.

### Appendix B Proofs of results in Section 3

#### B.1 Proof of Theorem 1

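Assume that one is given a predictor $f$ that is $(\varepsilon, \alpha)$-approximately calibrated. Then the assertion follows from the definition of $(\varepsilon, \alpha)$-approximate calibration since:

$$
\big|\mathbb{E}[Y \mid f(X)] - f(X)\big| \le \varepsilon(f(X)) \implies \mathbb{E}[Y \mid f(X)] \in C(f(X)).
$$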

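Now we show the proof in the other direction. Since $\varepsilon$ is a constant-valued function that depends on $C$, let us denote its constant output as $\varepsilon_C$.

If $m_C$ was injective, we would have $\mathbb{E}[Y \mid m_C(f(X))] = \mathbb{E}[Y \mid f(X)]$, and thus if $\mathbb{E}[Y \mid f(X)] \in C(f(X))$ (which happens with probability at least $1 - \alpha$), we would have $\mathbb{E}[Y \mid m_C(f(X))] \in C(f(X))$ and so

$$
\big|\mathbb{E}[Y \mid m_C(f(X))] - m_C(f(X))\big| \le \sup_{z \in \mathrm{Range}(f)} \{|C(z)|/2\} = \varepsilon_C.
$$

This serves as an intuition for the proof in the general case, when $m_C$ need not be injective. Note that,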

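$$
\begin{aligned}
\big|\mathbb{E}[Y \mid m_C(f(X))] - m_C(f(X))\big|
&= \big|\mathbb{E}[Y \mid m_C(f(X))] - \mathbb{E}[m_C(f(X)) \mid m_C(f(X))]\big| \\
&\overset{(1)}{=} \big|\mathbb{E}\big[\mathbb{E}[Y \mid f(X)] \mid m_C(f(X))\big] - \mathbb{E}[m_C(f(X)) \mid m_C(f(X))]\big| \\
&\overset{(2)}{=} \big|\mathbb{E}\big[\mathbb{E}[Y \mid f(X)] - m_C(f(X)) \mid m_C(f(X))\big]\big| \\
&\overset{(3)}{\le} \mathbb{E}\big[\,\big|\mathbb{E}[Y \mid f(X)] - m_C(f(X))\big| \mid m_C(f(X))\big],
\end{aligned}
\tag{19}
$$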

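where we use the tower rule in (1) (since $m_C(f(X))$ is a function of $f(X)$), linearity of expectation in (2) and Jensen’s inequality in (3). To be clear, the outermost expectation above is over $f(X)$ (conditioned on $m_C(f(X))$). Consider the event

$$
A : \mathbb{E}[Y \mid f(X)] \in C(f(X)).
$$

On $A$, since $m_C(f(X))$ is the midpoint of the interval $C(f(X))$, we have:

$$
\big|\mathbb{E}[Y \mid f(X)] - m_C(f(X))\big| \le \frac{u_C(f(X)) - l_C(f(X))}{2} \le \sup_{z \in \mathrm{Range}(f)} \left(\frac{|C(z)|}{2}\right) = \varepsilon_C.
$$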

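By the monotonicity property of conditional expectation, we also have that conditioned on $A$,

$$
\mathbb{E}\big[\,\big|\mathbb{E}[Y \mid f(X)] - m_C(f(X))\big| \mid m_C(f(X))\big] \le \mathbb{E}[\varepsilon_C \mid m_C(f(X))] = \varepsilon_C,
$$

with probability 1. Thus by the relationship proved in the series of equations ending in (19), we have that conditioned on $A$, with probability 1,

$$
\big|\mathbb{E}[Y \mid m_C(f(X))] - m_C(f(X))\big| \le \varepsilon_C.
$$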

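Since we are given that $C$ is a $(1-\alpha)$-CI with respect to $f$, $P(A) \ge 1 - \alpha$. For any event $B$, it holds that $P(B) \ge P(B \mid A)\,P(A)$. Setting

$$
B : \big|\mathbb{E}[Y \mid m_C(f(X))] - m_C(f(X))\big| \le \varepsilon_C,
$$

we obtain:

$$
P\big(\big|\mathbb{E}[Y \mid m_C(f(X))] - m_C(f(X))\big| \le \varepsilon_C\big) \ge 1 - \alpha.
$$

Thus, we conclude that $m_C(f(\cdot))$ is $(\varepsilon_C, \alpha)$-approximately calibrated. ∎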
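The construction used in this direction of the proof can be sketched in a few lines: given a confidence-interval function $C(z) = [l_C(z), u_C(z)]$, the recalibrated predictor outputs the midpoint $m_C(z)$, and $\varepsilon_C$ is the largest half-width. In the sketch below, the names `midpoint_predictor` and `max_half_width` are hypothetical, and the supremum over $\mathrm{Range}(f)$ is replaced by a maximum over a finite collection of score values, which is an assumption that holds exactly only when $f$ takes finitely many values (as with binning).

```python
from typing import Callable, Iterable, Tuple

Interval = Tuple[float, float]  # (lower endpoint, upper endpoint)


def midpoint_predictor(ci: Callable[[float], Interval]) -> Callable[[float], float]:
    """Return z -> m_C(z), the midpoint of the interval C(z)."""

    def m_c(z: float) -> float:
        lower, upper = ci(z)
        return (lower + upper) / 2.0

    return m_c


def max_half_width(ci: Callable[[float], Interval], score_values: Iterable[float]) -> float:
    """Compute max over the given scores of |C(z)| / 2 (a finite stand-in for eps_C)."""
    half_widths = []
    for z in score_values:
        lower, upper = ci(z)
        half_widths.append((upper - lower) / 2.0)
    if not half_widths:
        raise ValueError("score_values must be non-empty")
    return max(half_widths)


# Illustrative interval function with two level sets (made-up numbers).
def example_ci(z: float) -> Interval:
    return (0.2, 0.4) if z < 0.5 else (0.6, 0.9)


m_c = midpoint_predictor(example_ci)
eps_c = max_half_width(example_ci, [0.1, 0.7])
```

Here `m_c(0.1)` is approximately 0.3 and `eps_c` is approximately 0.15; whenever the true conditional probability lies in the interval, it is within `eps_c` of the midpoint, which is the step used on the event $A$ above.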

#### B.2 Proof of Corollary 1

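Let $\{f_n\}_{n \in \mathbb{N}}$ be an asymptotically calibrated sequence with the corresponding sequence of functions $\{\varepsilon_n\}_{n \in \mathbb{N}}$ that satisfy $\varepsilon_n(f_n(X_{n+1})) = o_P(1)$. From Theorem 1, we can construct corresponding functions $C_n$ that are $(1-\alpha)$-CI with respect to $f_n$ and satisfy

$$
|C_n(f_n(X_{n+1}))| = 2\varepsilon_n(f_n(X_{n+1})) = o_P(1).
$$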

This concludes the proof. ∎

#### B.3 Proof of Theorem 2

In the proof we write the test point as . Suppose is a -CI with respect to for all distributions . We show that covers the label itself for distributions such that