Nested Conformal Prediction and the Generalized Jackknife+

10/23/2019 · by Arun K. Kuchibhotla, et al.

We provide an alternate unified framework for conformal prediction, a general methodology for constructing assumption-free prediction intervals. Instead of beginning by choosing a conformity score, our framework starts with a sequence of nested sets {F_t(x)}_{t∈T} for some ordered set T that specifies all potential prediction sets. We show that most proposed conformity scores in the literature, including several based on quantiles, straightforwardly result in nested families. Then, we argue that what conformal prediction does is find a mapping α ↦ t(α), meaning that it calibrates or rescales T to [0,1]. Nestedness is a natural and intuitive requirement because the optimal prediction sets (e.g., level sets of conditional densities) are also nested, but we also formally prove that nested sets are universal, meaning that any conformal prediction method can be represented in our framework. Finally, to demonstrate its utility, we show how to develop the full conformal, split conformal, cross-conformal and the recent jackknife+ methods within our nested framework, thus immediately generalizing the latter two classes of methods to new settings. Specifically, we prove the validity of the leave-one-out, K-fold, subsampling and bootstrap variants of the latter two methods for any nested family.


1 Introduction

Conformal prediction is a general framework for constructing prediction sets that are valid in finite samples without distributional assumptions, pioneered by Vladimir Vovk and coauthors over more than a decade. This method/principle of construction can be wrapped around (almost) any prediction algorithm and has hence gained much interest in the machine learning and statistics literature. We refer the reader to Vovk et al. (2005) and Balasubramanian et al. (2014) for details. In this paper, we provide an equally general alternate framework for accomplishing the same goals, which we call nested conformal prediction. To be clear, our ideas are transparently and directly motivated by conformal prediction itself (hence the name), and so we do not seek to replace the original "score-based" conformal prediction in any way or form, but to view it through an alternate and equivalent lens. Of course, we argue that our viewpoint is more "natural" (for some definition of natural), but admit immediately that this is a subjective claim. The main difference between the viewpoints is simply summarized as follows:

Instead of beginning with a (non-)conformity score that specifies how "odd" a point is relative to other points in the dataset, we begin with a sequence of nested sets {F_t(x)}_{t∈T} for some ordered set T that specifies all potential prediction sets. Rather than forming prediction regions by thresholding scores to exclude nonconforming points, we form prediction regions by mapping each level α ∈ [0,1] to an index t(α) ∈ T.

To gain some intuition for why conformal prediction based on a nested sequence of sets is "natural", consider the example of split conformal or inductive conformal prediction, discussed in Lei et al. (2018) and Balasubramanian et al. (2014, Chapter 2.3) respectively. In the regression setting, the split conformal method is based on splitting the data (X_i, Y_i), i = 1, …, n, into two parts: training {(X_i, Y_i) : 1 ≤ i ≤ n/2} and calibration {(X_i, Y_i) : n/2 < i ≤ n}, where we have assumed that n is even and the split is into halves for simplicity of exposition. Based on the training data, one constructs an estimate μ̂ of the conditional mean of Y given X. The construction of a valid prediction set is based on the calibration residuals R_i := |Y_i − μ̂(X_i)|, n/2 < i ≤ n, and the prediction set for a new point x is given by

Ĉ(x) := [μ̂(x) − Q_{1−α}(R), μ̂(x) + Q_{1−α}(R)],

where Q_{1−α}(A) for a finite set A represents the ⌈(1−α)(|A|+1)⌉-th smallest element of A, formally defined later. An alternative description of split conformal is as follows:

  1. Based on the training data, construct the sequence of prediction regions

    F_t(x) := [μ̂(x) − t, μ̂(x) + t], t ∈ T := [0, ∞).

    Note that F_t(x) is a random set and is random through μ̂. It is clear that, regardless of the quality of μ̂, for any (X, Y) from the same distribution as the training data and any α ∈ [0, 1], there exists a t(α) such that

    P(Y ∈ F_{t(α)}(X)) ≥ 1 − α.

    Hence we can rewrite our nested family as

    {F_{t(α)}(x) : α ∈ [0, 1]}.     (1)
  2. The only issue now is that we do not know the map α ↦ t(α), that is, given α we do not know which of these prediction intervals to use. Hence we use the calibration data to "estimate" this map, and this is done by finding the smallest t such that F_t contains at least a (1 − α) proportion of the calibration data (see the sketch after this list). Because the sequence F_t(x) is increasing in t, finding the smallest such t leads to the smallest prediction set within this family.
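For illustration, here is a minimal sketch of this calibration step in Python for the residual family above; the fitted mean estimator mu_hat is a hypothetical input (any regressor trained on the first half of the data), and the rank ⌈(1−α)(m+1)⌉ is the finite-sample correction used throughout the paper.

```python
import numpy as np

def calibrate_t(mu_hat, X_cal, y_cal, alpha):
    """Estimate the map alpha -> t(alpha) for F_t(x) = [mu_hat(x) - t, mu_hat(x) + t].

    The smallest t with y in F_t(x) is the residual |y - mu_hat(x)|, so the
    calibrated index is a finite-sample-corrected quantile of the residuals.
    """
    residuals = np.abs(y_cal - mu_hat(X_cal))
    m = len(residuals)
    k = int(np.ceil((1 - alpha) * (m + 1)))  # rank with the (m + 1) correction
    if k > m:
        return np.inf  # alpha is too small for this calibration size
    return np.sort(residuals)[k - 1]
```

The level-(1−α) prediction interval at a new point x is then [mu_hat(x) − t̂, mu_hat(x) + t̂] with t̂ = calibrate_t(mu_hat, X_cal, y_cal, alpha).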

Because of (1), we can, in general, view any given collection of nested sets as a sequence of prediction sets (corresponding to different prediction levels) with an unknown mapping of the index t to the prediction coverage error. The calibration set is then used to "estimate" this map, possibly conservatively, so that finite sample validity still holds true. For this reason, the framework we consider starts with a nested (increasing) collection of sets {F_t(x)}_{t∈T} for an ordered set T.

Another reason that the assumption of nestedness is natural is the fact that the optimal prediction sets are nested: suppose Y_1, Y_2, … are exchangeable random variables with a common distribution that has density p(·) (with respect to some measure). The "oracle" prediction set for a future observation is given by F*_α := {y : p(y) ≥ t_α}, with t_α defined as the minimum t such that P(p(Y) ≤ t) ≥ α. Because {y : p(y) ≥ t} is decreasing with t and t_α is non-decreasing with α, F*_α is decreasing with α; that is, the oracle sets are nested.

To recap, we have so far only argued that considering nested prediction sets is “natural”, but we have not formulated an algorithm to produce prediction sets using this new viewpoint. The rest of this paper is devoted to four tasks:

  1. presenting both split conformal and full conformal in the language of nested conformal prediction, and translating various conformity scores in the literature to nested prediction sets;

  2. proving that the frameworks of nested conformal and score-based conformal are fully equivalent (meaning that they include each other), thus implying that nested conformal is also “universal” in the same sense in which standard conformal is known to be universal;

  3. demonstrating the utility of the unified framework by easily extending the recent ideas of cross-conformal prediction and the jackknife+ to our general nested framework, immediately making the latter tools applicable to a wide variety of settings beyond the initial scope of those works;

  4. developing leave-one-out, K-fold, subsampling and bootstrap variants of cross-conformal and the jackknife+ (where the underlying algorithm is trained multiple times on different subsets of the data).

Before proceeding with the fully general setup involving full conformal prediction, we first demonstrate carefully how existing conformal scores in the literature (when used with split conformal prediction) can be transformed into nested prediction sets, by working through a few examples. The experienced reader can immediately jump to the following section on universality if desired.

2 Split Conformal based on Nested Prediction Sets

Earlier we showed that in a simple regression setup with the conformity scores as held-out residuals, split conformal intervals can be naturally expressed in terms of nested sets, but we did not present any algorithm using nested sets. Below, we complete the earlier story by presenting our method, demonstrating its equivalence to the usual split conformal method, and extending that story to split conformal prediction with other conformity scores.

Suppose (X_i, Y_i), i = 1, …, n, denote the training dataset. Let I_1, I_2 be a partition of {1, 2, …, n}. For t ∈ T and each x, let F_t(x) (with T an ordered index set) denote a nested sequence of sets constructed based on the first split of the training data {(X_i, Y_i)}_{i∈I_1}, that is, F_t(x) ⊆ F_{t′}(x) whenever t ≤ t′. The sets in almost all examples are random through the training data, although they are not required to be random. Consider the score

r(x, y) := inf{t ∈ T : y ∈ F_t(x)},     (2)

where r is a mnemonic for "radius": r(x, y) can be informally thought of as the smallest "radius" of the sets F_t(x) that capture y (for a multivariate response, that is y ∈ R^d, thinking of F_t(x) as appropriate balls/ellipsoids might help with that intuition). Define the scores r_i := r(X_i, Y_i) for the second split {(X_i, Y_i)}_{i∈I_2} of the training data and set

Q_{1−α}(r; I_2) := the ⌈(1−α)(|I_2|+1)⌉-th smallest value of {r_i : i ∈ I_2}.

The final prediction set is given by

Ĉ(x) := F_{Q_{1−α}(r; I_2)}(x) = {y : r(x, y) ≤ Q_{1−α}(r; I_2)}.     (3)

The following well-known finite sample coverage guarantee holds true (Lei et al., 2018).

Proposition 2.1.

If (X_i, Y_i), i = 1, …, n+1, are exchangeable, then the prediction set Ĉ(·) in (3) satisfies

P(Y_{n+1} ∈ Ĉ(X_{n+1})) ≥ 1 − α.

Moreover, if the scores {r_i}_{i∈I_2} are almost surely distinct, then the prediction set also satisfies

P(Y_{n+1} ∈ Ĉ(X_{n+1})) ≤ 1 − α + 1/(|I_2| + 1).     (4)
Proof.

See Appendix A.1 for a proof. ∎

Remark 2.1.

We have assumed that the nested sets F_t(x) are increasing in t. If instead they are decreasing in t, then the infimum in (2) should be replaced by a supremum; Proposition 2.1 continues to hold in this setting.
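To make the recipe (2)-(3) concrete, the following is a minimal sketch of the generic calibration step, assuming the user supplies a radius function radius_fn (a hypothetical name, not from the paper) implementing r(x, y) for a family fit on the first split.

```python
import numpy as np

def nested_split_conformal(radius_fn, X_cal, y_cal, alpha):
    """Calibrate a generic nested family via its radius score (2).

    radius_fn(x, y) returns r(x, y) = inf{t : y in F_t(x)} for nested sets
    fit on the first split I_1; (X_cal, y_cal) holds the second split I_2.
    Returns the threshold Q_{1-alpha}(r; I_2) from (3); the prediction set
    at a new x is F_Q(x) = {y : radius_fn(x, y) <= Q}.
    """
    scores = np.array([radius_fn(x, y) for x, y in zip(X_cal, y_cal)])
    m = len(scores)
    k = int(np.ceil((1 - alpha) * (m + 1)))
    if k > m:
        return np.inf  # the family must then cover everything
    return np.sort(scores)[k - 1]
```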

The following are some examples from the literature written in terms of nested conformal prediction; a short code sketch of the resulting "radius" scores follows the list.

  1. Split Conformal (Lei et al., 2018). Let μ̂ be a mean estimator based on {(X_i, Y_i)}_{i∈I_1}. Consider the nested sets for t ∈ T := [0, ∞) given by

    F_t(x) := [μ̂(x) − t, μ̂(x) + t].

    Observe now that

    r(x, y) = inf{t ≥ 0 : y ∈ F_t(x)} = |y − μ̂(x)|.

    Hence r_i = |Y_i − μ̂(X_i)| for i ∈ I_2, and

    Ĉ(x) = [μ̂(x) − Q_{1−α}(r; I_2), μ̂(x) + Q_{1−α}(r; I_2)].

  2. Locally Weighted Split Conformal (Lei et al., 2018). Let μ̂ and σ̂ be the mean and spread estimators based on {(X_i, Y_i)}_{i∈I_1}. Consider the nested sets for t ∈ T := [0, ∞) given by

    F_t(x) := [μ̂(x) − t σ̂(x), μ̂(x) + t σ̂(x)].

    This implies that

    r(x, y) = |y − μ̂(x)| / σ̂(x).

    Hence r_i = |Y_i − μ̂(X_i)| / σ̂(X_i) for i ∈ I_2, and

    Ĉ(x) = [μ̂(x) − σ̂(x) Q_{1−α}(r; I_2), μ̂(x) + σ̂(x) Q_{1−α}(r; I_2)].

  3. Conformalized Quantiles (Romano et al., 2019). Let q̂_{α/2}, q̂_{1−α/2} be quantile estimators based on {(X_i, Y_i)}_{i∈I_1}. Consider the nested sets for t ∈ T := R given by

    F_t(x) := [q̂_{α/2}(x) − t, q̂_{1−α/2}(x) + t].

    Note that these sets are monotonically increasing in t if q̂_{α/2}(x) ≤ q̂_{1−α/2}(x). Observe now that

    r(x, y) = max{q̂_{α/2}(x) − y, y − q̂_{1−α/2}(x)}.

    Hence r_i = max{q̂_{α/2}(X_i) − Y_i, Y_i − q̂_{1−α/2}(X_i)} for i ∈ I_2, and

    Ĉ(x) = [q̂_{α/2}(x) − Q_{1−α}(r; I_2), q̂_{1−α/2}(x) + Q_{1−α}(r; I_2)].
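For concreteness, here is a minimal sketch of the three radius scores derived above; mu_hat, sigma_hat, q_lo and q_hi are hypothetical fitted estimators (trained on the first split), not objects defined in the paper.

```python
import numpy as np

# Radius scores r(x, y) = inf{t : y in F_t(x)} for the three families above.

def r_split(x, y, mu_hat):
    # F_t(x) = [mu_hat(x) - t, mu_hat(x) + t]
    return np.abs(y - mu_hat(x))

def r_locally_weighted(x, y, mu_hat, sigma_hat):
    # F_t(x) = [mu_hat(x) - t * sigma_hat(x), mu_hat(x) + t * sigma_hat(x)]
    return np.abs(y - mu_hat(x)) / sigma_hat(x)

def r_cqr(x, y, q_lo, q_hi):
    # F_t(x) = [q_lo(x) - t, q_hi(x) + t]
    return np.maximum(q_lo(x) - y, y - q_hi(x))
```

Any of these can be passed as radius_fn to the calibration sketch given earlier in this section.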

Table 1 summarizes these and other examples from the literature covered by the nested conformal framework. It is likely that recent works involving localization (Guan, 2019) can also fit into our framework, but we do not intend to be comprehensive, since the literature is currently growing at a rapid rate.

Reference                   Method
Lei et al. (2018)           split conformal
Lei et al. (2018)           locally weighted split conformal
Romano et al. (2019)        CQR
Kivaranovic et al. (2019)   CQR-m
Sesia and Candès (2019)     CQR-r
Chernozhukov et al. (2019)  split distributional conformal
Izbicki et al. (2019)       conditional-density-based conformal

Table 1: Examples from the literature covered by the nested conformal framework. The methods listed are split conformal, locally weighted split conformal, CQR, CQR-m, CQR-r and split distributional conformal; the corresponding nested families are built from estimates of the conditional mean, the conditional quantiles, or (for Izbicki et al., 2019) an estimate of the conditional density.

Having discussed the case of split conformal prediction, which is conceptually simpler, in some detail, we now move to the setting of full conformal prediction.

3 The equivalence of nested conformal prediction and score-based conformal prediction

The aim of this section is simple: we show that every instance of nested conformal prediction can be cast in terms of score-based conformal prediction, and vice versa.

We first recall the definition of a (transductive) conformal prediction set; see Section 1.3 of Balasubramanian et al. (2014) for details. For this section, it is useful to think of Z := X × Y as the observation space, and for any m ≥ 1, let Z^m denote the set of all sequences (z_1, …, z_m) of m elements of Z.

A (non-)conformity m-measure is a measurable function A : Z^m → R^m that assigns to every sequence of examples (z_1, …, z_m) a corresponding sequence of real numbers (α_1, …, α_m) and that is equivariant with respect to permutations, meaning that for any permutation π of {1, …, m},

A(z_1, …, z_m) = (α_1, …, α_m) implies A(z_{π(1)}, …, z_{π(m)}) = (α_{π(1)}, …, α_{π(m)}).
The conformal prediction set determined by A as a nonconformity measure is defined by

Ĉ(x) := {y : p^{(x,y)} > α},     (5)

where for each (x, y), the corresponding p-value is defined by

p^{(x,y)} := (1/(n+1)) Σ_{i=1}^{n+1} 1{α_i^{(x,y)} ≥ α_{n+1}^{(x,y)}},

and the corresponding sequence of nonconformity scores is defined by

(α_1^{(x,y)}, …, α_{n+1}^{(x,y)}) := A(Z_1, …, Z_n, (x, y)).

The nested conformal predictor, in contrast, starts with a nested sequence of sets. For any m ≥ 1 and sequence (z_1, …, z_m) ∈ Z^m, let {F_t(·; z_1, …, z_m)}_{t∈T} be a sequence of nested sets that is invariant to permutations of z_1, …, z_m. For observations Z_1 = (X_1, Y_1), …, Z_n = (X_n, Y_n) and a possible future observation (x, y), define the scores

r_i^{(x,y)} := inf{t ∈ T : Y_i ∈ F_t(X_i; Z_1, …, Z_n, (x, y))} for 1 ≤ i ≤ n, and
r_{n+1}^{(x,y)} := inf{t ∈ T : y ∈ F_t(x; Z_1, …, Z_n, (x, y))}.

The nested conformal predictor is then given by

Ĉ(x) := {y : (1/(n+1)) Σ_{i=1}^{n+1} 1{r_i^{(x,y)} ≥ r_{n+1}^{(x,y)}} > α}.
It is clear that the nested conformal predictor is a special case of the (transductive) conformal predictor, with scores defined via nested sets rather than via a generic function A. Below, we prove that the converse also holds.

Remark 3.1.

Given that any conformal predictor is also a nested conformal predictor, one might ask why we consider nested conformal predictors at all. It often seems easier to specify the shape of the prediction set than to come up with a good non-conformity score. For example, if we want to use conditional quantiles, a natural sequence of prediction sets is F_t(x) = [q̂_{α/2}(x) − t, q̂_{1−α/2}(x) + t], but the corresponding non-conformity score is not obvious a priori. This perspective also leads to another natural sequence, for instance F_t(x) = [q̂_{α/2}(x) − t(q̂_{1−α/2}(x) − q̂_{α/2}(x)), q̂_{1−α/2}(x) + t(q̂_{1−α/2}(x) − q̂_{α/2}(x))], and once again a non-conformity score is tricky to guess directly. Of course, we realize that this is a subjective judgment.
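To illustrate the definition above, here is a minimal sketch of the nested conformal predictor over a finite grid of candidate responses; fit_radius is a hypothetical user-supplied routine (not from the paper) that fits the nested family on the augmented data, in a permutation-invariant way, and returns the radius function r.

```python
import numpy as np

def full_nested_conformal(fit_radius, X, y, x_new, y_grid, alpha):
    """Sketch of the (transductive) nested conformal predictor.

    For each candidate y, the nested family is refit on the augmented
    data and the candidate is kept if its rank-based p-value exceeds alpha.
    """
    n = len(y)
    keep = []
    for y_cand in y_grid:
        X_aug = np.vstack([X, x_new])
        y_aug = np.append(y, y_cand)
        r = fit_radius(X_aug, y_aug)             # permutation-invariant fit
        scores = np.array([r(X_aug[i], y_aug[i]) for i in range(n + 1)])
        p_value = np.mean(scores >= scores[-1])  # (n+1)^{-1} sum 1{r_i >= r_{n+1}}
        if p_value > alpha:
            keep.append(y_cand)
    return np.array(keep)
```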

We now argue that any conformal prediction set is a nested conformal set and that any prediction set invariant to the ordering of the observations can be improved by a nested conformal set. Proposition 1.2 of Balasubramanian et al. (2014) claims that if Z_1, …, Z_{n+1} are exchangeable, then P(Y_{n+1} ∈ Ĉ(X_{n+1})) ≥ 1 − α for all α ∈ [0, 1]. Our next result states that any conformal prediction set can be obtained as a nested conformal set.

Proposition 3.1.

Suppose Ĉ(·) represents a conformal prediction set based on some nonconformity measure. Then there exists a nested sequence of sets {F_t(·)}_{t∈T} such that the corresponding nested conformal set matches the conformal prediction set Ĉ(·).

Proof.

See Appendix A.2 for a proof. ∎

Proposition 1.3 of Balasubramanian et al. (2014) shows that conformal prediction is universal in a particular sense (informally, any valid scheme for producing assumption-free confidence sets can be replaced by a conformal prediction scheme that is at least as efficient). Since everything that can be accomplished via nested conformal prediction can also be done via conformal prediction and vice versa, nested conformal prediction is also universal in the same sense.

Full conformal prediction is an elegant framework, but it is often criticized for being computationally intensive. Indeed, that was one of the motivations for the development of the split conformal method (Papadopoulos et al., 2002; Lei et al., 2018). Two other methods proposed in the literature can be seen as “middle grounds” between full and split conformal. These are cross-conformal (Vovk, 2015) and the related jackknife+ (Barber et al., 2019). The latter was originally introduced in the context of standard regression setups with the usual residuals, but we next show how to generalize both of these to use any nested prediction sets.

4 Extensions of the jackknife+ and cross-conformal using nested sets

In Section 2, we used one part of the training data to construct the nested sets and the remaining part to calibrate them for finite sample validity. This, although computationally efficient, can be statistically inefficient due to the reduction of the sample size used for calibration. Instead of splitting the data into two parts, it is statistically more efficient to split it into multiple parts. In this section, we describe such versions of nested conformal prediction sets and prove their validity. These versions in the score-based conformal framework are called cross-conformal prediction and the jackknife+, developed in Vovk (2015) and Barber et al. (2019) respectively, the latter only for a specific score function.

4.1 Nested leave-one-out generalization of cross-conformal

We now derive a generalized jackknife+ method based on nested prediction sets. This method is a special case of the cross-conformal method discussed in Section 4.2. We present it here separately because of its close connection to the jackknife, which is more familiar in statistics.

Suppose {F_t^{−i}(x)}_{t∈T}, for each 1 ≤ i ≤ n, denotes a collection of nested sets constructed based only on Z_1, …, Z_{i−1}, Z_{i+1}, …, Z_n. It is assumed that these sets are invariant to the permutation of their input points. (Note that this is an additional assumption compared to the split conformal version.) Define the i-th residual

R_i := inf{t ∈ T : Y_i ∈ F_t^{−i}(X_i)}, 1 ≤ i ≤ n.

The residual for a new point (x, y) is defined as

r^{−i}(x, y) := inf{t ∈ T : y ∈ F_t^{−i}(x)}.

Define the prediction set

Ĉ(x) := {y : Σ_{i=1}^n 1{r^{−i}(x, y) > R_i} < (1 − α)(n + 1)}.     (6)
Theorem 4.1.

If (X_i, Y_i), i = 1, …, n+1, are exchangeable and the nested sets constructed based on Z_1, …, Z_{i−1}, Z_{i+1}, …, Z_n are invariant to their ordering, then

P(Y_{n+1} ∈ Ĉ(X_{n+1})) ≥ 1 − 2α.

Proof.

See Appendix A.3 for a proof. ∎
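The following is a minimal sketch of the set (6), with the same hypothetical fit_radius routine as before; for continuous responses the membership condition is evaluated on a grid of candidate y values.

```python
import numpy as np

def nested_jackknife_plus(fit_radius, X, y, x_new, y_grid, alpha):
    """Sketch of the leave-one-out (generalized jackknife+) set in (6)."""
    n = len(y)
    loo_radius, R = [], np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        r_i = fit_radius(X[mask], y[mask])  # family fit without point i
        loo_radius.append(r_i)
        R[i] = r_i(X[i], y[i])              # residual R_i
    keep = []
    for y_cand in y_grid:
        votes = sum(loo_radius[i](x_new, y_cand) > R[i] for i in range(n))
        if votes < (1 - alpha) * (n + 1):
            keep.append(y_cand)
    return np.array(keep)
```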

4.2 Nested K-fold generalizations of cross-conformal

The jackknife+ extension discussed in the previous section is based on splitting the data of size n into n parts, in other words, a leave-one-out version. Following this reformulation, we now define a K-fold cross-validation version. This was first developed in Vovk (2015), although no validity guarantee was provided. The validity of this method was proved in Appendix E of Vovk and Wang (2019) under a restriction on α that leads to a non-trivial bound only in a limited regime. Barber et al. (2019) proved a non-trivial validity guarantee for all K.

Suppose S_1, S_2, …, S_K denote a disjoint partition of {1, …, n} such that |S_1| = ⋯ = |S_K|. For exchangeability, this equality of fold sizes is very important. Let m := n/K (assume this is an integer). Let {F_t^{−S_k}(x)}_{t∈T} be a sequence of nested sets computed based on {Z_i : i ∉ S_k}. Define the score

R_i := inf{t ∈ T : Y_i ∈ F_t^{−S_{k(i)}}(X_i)},

where k(i) ∈ {1, …, K} is such that i ∈ S_{k(i)}. The cross-conformal prediction set is now defined as

Ĉ_{cross}(x) := {y : Σ_{i=1}^n 1{r^{−S_{k(i)}}(x, y) > R_i} < (1 − α)(n + 1)}, where r^{−S_k}(x, y) := inf{t ∈ T : y ∈ F_t^{−S_k}(x)}.

It is clear that if K = n then S_i = {i} for every 1 ≤ i ≤ n. The following result proves the validity of Ĉ_{cross}(·) as an extension of Theorem 4 of Barber et al. (2019). This, clearly, reduces to Theorem 4.1 if K = n.

Theorem 4.2.

If (X_i, Y_i), i = 1, …, n+1, are exchangeable and the nested sets constructed based on {Z_i : i ∉ S_k} are invariant to their ordering, then

P(Y_{n+1} ∈ Ĉ_{cross}(X_{n+1})) ≥ 1 − 2α − min{2(1 − 1/K)/(n/K + 1), (1 − K/n)/(K + 1)},

where the excess term over 2α vanishes when K = n.

Proof.

See Appendix A.3 for a proof. ∎
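A minimal sketch of the K-fold variant follows, assuming n is divisible by K (the equal fold sizes required above) and the same hypothetical fit_radius routine.

```python
import numpy as np

def nested_kfold_cross_conformal(fit_radius, X, y, x_new, y_grid, alpha, K):
    """Sketch of the K-fold cross-conformal set of Section 4.2."""
    n = len(y)
    assert n % K == 0, "equal fold sizes are needed for exchangeability"
    folds = np.array_split(np.arange(n), K)
    fold_radius, R = [], np.empty(n)
    for S in folds:
        mask = np.ones(n, dtype=bool)
        mask[S] = False
        r_k = fit_radius(X[mask], y[mask])  # family fit without fold S
        fold_radius.append(r_k)
        for i in S:
            R[i] = r_k(X[i], y[i])
    keep = []
    for y_cand in y_grid:
        votes = sum(fold_radius[k](x_new, y_cand) > R[i]
                    for k, S in enumerate(folds) for i in S)
        if votes < (1 - alpha) * (n + 1):
            keep.append(y_cand)
    return np.array(keep)
```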

4.3 Generalizations of CV+ and Jackknife+

The prediction sets Ĉ(·) in (6) and Ĉ_{cross}(·) are defined implicitly. These sets can be written in terms of nested sets as

Ĉ_{cross}(x) = {y : Σ_{i=1}^n 1{y ∉ F^{−S_{k(i)}}_{R_i}(x)} < (1 − α)(n + 1)}.

In this section, we show that there exists an explicit interval that always contains Ĉ_{cross}(x) whenever {F_t(x)}_{t∈T} is a collection of nested intervals (instead of just nested sets). We have shown in Table 1 that many existing conformal prediction sets are intervals, and hence we may write F^{−S_{k(i)}}_{R_i}(x) = [ℓ_i(x), u_i(x)] for some functions ℓ_i and u_i. In particular, this shows that our framework provides a cross-conformal generalization of the CQR, CQR-m, CQR-r and distributional conformal methods. Using this notation, we can write Ĉ_{cross}(x) as

Ĉ_{cross}(x) = {y : Σ_{i=1}^n 1{y ∉ [ℓ_i(x), u_i(x)]} < (1 − α)(n + 1)}.

Define

F̂_L(y) := (1/n) Σ_{i=1}^n 1{ℓ_i(x) ≤ y} and F̂_U(y) := (1/n) Σ_{i=1}^n 1{u_i(x) < y}.

It is clear that F̂_L is the empirical cumulative distribution function (cdf) of the left end points ℓ_1(x), …, ℓ_n(x), and F̂_U is the left-continuous version of the empirical cdf of the right end points u_1(x), …, u_n(x). Now observe that y ∈ Ĉ_{cross}(x) if and only if

n(1 − F̂_L(y)) + n F̂_U(y) < (1 − α)(n + 1).     (7)

Because n(1 − F̂_L(y)) and n F̂_U(y) are both non-negative for all y, we get that

y ∈ Ĉ_{cross}(x) implies n(1 − F̂_L(y)) < (1 − α)(n + 1) and n F̂_U(y) < (1 − α)(n + 1).

The set of y such that n(1 − F̂_L(y)) < (1 − α)(n + 1) is exactly the same as the set of all y that are no smaller than the ⌊α(n+1)⌋-th smallest of the ℓ_i(x)'s, and the set of y such that n F̂_U(y) < (1 − α)(n + 1) is exactly the same as the set of all y that are no larger than the ⌈(1−α)(n+1)⌉-th smallest of the u_i(x)'s. Since ℓ_i(x) ≤ u_i(x) for all i, we get that

Ĉ_{cross}(x) ⊆ Ĉ_{CV}(x) := [q̂⁻_α({ℓ_i(x)}), q̂⁺_{1−α}({u_i(x)})],

where q̂⁻_α({ℓ_i(x)}) denotes the ⌊α(n+1)⌋-th smallest value of {ℓ_i(x) : 1 ≤ i ≤ n} and q̂⁺_{1−α}({u_i(x)}) denotes the ⌈(1−α)(n+1)⌉-th smallest value of {u_i(x) : 1 ≤ i ≤ n}. For K = n, Ĉ_{CV}(x) is the analogue of the jackknife+ and is denoted by Ĉ_{JP}(x). These prediction intervals are the generalizations of the CV+ and the jackknife+ procedures of Barber et al. (2019). We proved above that Ĉ_{CV}(x) and Ĉ_{JP}(x) are always non-empty intervals containing the corresponding cross-conformal sets. Because Ĉ_{cross}(x) ⊆ Ĉ_{CV}(x) for all x, we readily get that for all α,

P(Y_{n+1} ∈ Ĉ_{CV}(X_{n+1})) ≥ P(Y_{n+1} ∈ Ĉ_{cross}(X_{n+1})),

so the coverage guarantees of Theorems 4.1 and 4.2 carry over to Ĉ_{JP}(·) and Ĉ_{CV}(·).
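Under the interval assumption, Ĉ_{CV}(x) reduces to a pair of order statistics of the endpoints, as in this minimal sketch; the arrays lowers and uppers are hypothetical inputs holding ℓ_i(x) and u_i(x) at the test point.

```python
import numpy as np

def explicit_cv_plus_interval(lowers, uppers, alpha):
    """Sketch of the explicit interval containing the cross-conformal set:
    [floor(alpha(n+1))-th smallest lower, ceil((1-alpha)(n+1))-th smallest upper]."""
    n = len(lowers)
    k_lo = int(np.floor(alpha * (n + 1)))
    k_hi = int(np.ceil((1 - alpha) * (n + 1)))
    lo = -np.inf if k_lo < 1 else np.sort(lowers)[k_lo - 1]
    hi = np.inf if k_hi > n else np.sort(uppers)[k_hi - 1]
    return lo, hi
```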

Remark 4.1.

(Computation of Ĉ_{CV}(x) and Ĉ_{cross}(x)) Unlike Ĉ_{CV}(x), the prediction set Ĉ_{cross}(x) can be empty for some x. The set is empty if and only if

Σ_{i=1}^n 1{y ∉ [ℓ_i(x), u_i(x)]} ≥ (1 − α)(n + 1) for all y.     (8)

This means that at no point do more than α(n + 1) − 1 of the intervals [ℓ_i(x), u_i(x)] overlap. Of course, this will be satisfied if all the intervals are mutually disjoint.

If Ĉ_{cross}(x) is non-empty, then it would be a union of one or more intervals, and the exact set can be obtained by verifying condition (7) for all y in the interval Ĉ_{CV}(x). One need not check this for every real y; it suffices to verify it for the endpoints ℓ_i(x) and u_i(x) that belong to Ĉ_{CV}(x).

5 Nested Conformal based on Multiple Repetitions of Splits

In Section 2, we described the nested conformal version of split conformal, which is based on one particular split of the data into two parts, and in the previous section we discussed partitions of the data into K parts. In practice, however, to reduce the additional variance due to randomization, one might wish to consider several (say M) different splits of the data into two parts and combine these predictions. Lei et al. (2018) discuss a combination of split conformal prediction sets based on a Bonferroni correction; in this section, we consider an alternative combination method that we call subsampling conformal. The same idea can also be used for the cross-conformal version, where the partition of the data into K folds is repeated M times. The methods to be discussed are related¹ to those proposed in Carlsson et al. (2014), Vovk (2015), and Linusson et al. (2017), but these papers do not provide validity results for their methods.

¹ In a forthcoming work, Rina F. Barber and coauthors explore the jackknife+ and CV+ methods for the case of ensemble prediction algorithms, which are particular ways of forming the nested sets, and are different from our use of bootstrap and subsampling.

5.1 Subsampling Conformal based on Nested Prediction Sets

Fix M ≥ 1, the number of subsamples. Let I_1, I_2, …, I_M denote independent and identically distributed random "variables" drawn uniformly from the collection of subsets of {1, …, n}; one can also restrict to subsets of size m for some m < n. For each set I_j, define the p-value for the new prediction at (x, y) as

p_j(x, y) := (1 + Σ_{i∉I_j} 1{r_i ≥ r(x, y)}) / (1 + |I_j^c|),

where the scores r_i, i ∉ I_j, and r(x, y) are computed as

r_i := inf{t ∈ T : Y_i ∈ F_t^{I_j}(X_i)} and r(x, y) := inf{t ∈ T : y ∈ F_t^{I_j}(x)},

based on nested sets {F_t^{I_j}(·)}_{t∈T} computed based on the observations in I_j. Define the prediction set Ĉ_M(x) as

Ĉ_M(x) := {y : (1/M) Σ_{j=1}^M p_j(x, y) > α}.

It is clear that for M = 1, Ĉ_M(x) is the same as the split conformal prediction set discussed in Section 2. The following result proves the validity of Ĉ_M(·).

Theorem 5.1.

If (X_i, Y_i), i = 1, …, n+1, are exchangeable, then for any M ≥ 1 and α ∈ (0, 1),

P(Y_{n+1} ∈ Ĉ_M(X_{n+1})) ≥ 1 − 2α.

Proof.

See Appendix A.5 for a proof. ∎
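Here is a minimal sketch of the averaged p-value defining Ĉ_M(x), with the hypothetical fit_radius routine as before; splits is a list of random index sets I_j, and a candidate y belongs to Ĉ_M(x) when the returned average exceeds α.

```python
import numpy as np

def subsampling_conformal_pvalue(fit_radius, X, y, x_new, y_cand, splits):
    """Average of the M split-conformal p-values p_j(x, y)."""
    n = len(y)
    p_values = []
    for I in splits:
        out = np.setdiff1d(np.arange(n), I)  # calibration indices, complement of I_j
        r = fit_radius(X[I], y[I])           # family fit on observations in I_j
        scores = np.array([r(X[i], y[i]) for i in out])
        p = (1 + np.sum(scores >= r(x_new, y_cand))) / (1 + len(out))
        p_values.append(p)
    return np.mean(p_values)
```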

Note that we can write p_j(x, y) as p(x, y; I_j) by adding, as an argument, the set of observations used in computing the nested sets. Using this notation, we can write, for large M,

(1/M) Σ_{j=1}^M p(x, y; I_j) ≈ E[p(x, y; I)],     (9)

where the expectation is taken with respect to the random "variable" I drawn uniformly from a collection of subsets of {1, …, n}, such as all subsets, or all subsets of size m for some m < n. Because any uniformly drawn element of the latter collection can be obtained by sampling from {1, …, n} without replacement (subsampling), the above combination of prediction intervals can be thought of as subbagging, introduced in Bühlmann and Yu (2002).

Section 2.3 of Lei et al. (2018) combines the p-values by taking the minimum: they define the set

Ĉ_{Bonf}(x) := {y : min_{1≤j≤M} p_j(x, y) > α/M}.

Because p_1(x, y), …, p_M(x, y) are independent and identically distributed (conditional on the data), averaging is a more natural stabilizer than the minimum; all the p-values contribute equally to the average, whereas the minimum places all its weight on a single p-value.

Vovk (2015, Appendix B) describes a version of Ĉ_M(·) using bootstrap samples instead of subsamples, which corresponds to bagging. We consider this version in the following subsection.

5.2 Bootstrap Conformal based on Nested Prediction Sets

The subsampling prediction set Ĉ_M(·) is based on index sets obtained by sampling without replacement. A statistically more popular alternative is to form index sets by sampling with replacement, which corresponds to the bootstrap.

Let I_1, …, I_M denote independent and identically distributed bags (each of size m) obtained by random sampling with replacement from {1, …, n}. For each j, consider the scores

r_i := inf{t ∈ T : Y_i ∈ F_t^{I_j}(X_i)} for i ∉ I_j, and r(x, y) := inf{t ∈ T : y ∈ F_t^{I_j}(x)},

based on nested sets computed from the observations in I_j; here I_j should be thought of as a bag rather than a set of observations because of repetitions of indices. Consider the prediction interval