Efficiency of conformalized ridge regression

8 April 2014 ∙ Evgeny Burnaev et al. ∙ Royal Holloway, University of London

Conformal prediction is a method of producing prediction sets that can be applied on top of a wide range of prediction algorithms. The method has a guaranteed coverage probability under the standard IID assumption regardless of whether the assumptions (often considerably more restrictive) of the underlying algorithm are satisfied. However, for the method to be really useful it is desirable that, in the case where the assumptions of the underlying algorithm are satisfied, the conformal predictor loses little in efficiency as compared with the underlying algorithm (while, being a conformal predictor, it has the stronger guarantee of validity). In this paper we explore the degree to which this additional requirement of efficiency is satisfied in the case of Bayesian ridge regression; we find that asymptotically conformal prediction sets differ little from ridge regression prediction intervals when the standard Bayesian assumptions are satisfied.

1 Introduction

This paper discusses theoretical properties of the procedure described in the abstract as applied to Bayesian ridge regression in the primal form. The procedure itself has been discussed earlier in the Bayesian context under the names of frequentizing ([14], Section 3) and de-Bayesing ([12], p. 101); in this paper, however, we prefer the name “conformalizing”. The procedure has also been studied empirically (see, e.g., [12], Figures 10.1–10.5, and [14], Figure 1, corrected in [11], Figure 11.1). To our knowledge, this paper is the first to explore the procedure theoretically.

The purpose of conformalizing is to make prediction algorithms, first of all Bayesian algorithms, valid under the assumption that the observations are generated independently from the same probability measure; we will refer to this assumption as the IID assumption. This is obviously a desirable step provided that we do not lose much if the assumptions of the original algorithm happen to be satisfied. The situation here resembles that in nonparametric hypothesis testing (see, e.g., [7]), where nonparametric analogues of some classical parametric tests relying on Gaussian assumptions turned out to be surprisingly efficient even when the Gaussian assumptions are satisfied.

We start the main part of the paper from Section 2, in which we define the ridge regression procedure and the corresponding prediction intervals in a Bayesian setting involving strong Gaussian assumptions. It contains standard material, and so no proofs are given. The following section, Section 3, applies the conformalizing procedure to ridge regression in a way that facilitates theoretical analysis in the following sections; the resulting “conformalized ridge regression” is similar to, but somewhat different from, the algorithm called “ridge regression confidence machine” in [12].

Section 4 contains our main result. It shows that asymptotically we lose little when we conformalize ridge regression and the Gaussian assumptions are satisfied; namely, conformalizing changes the prediction interval by $O(n^{-1/2})$ with high probability, where $n$ is the number of observations. Our main result gives precise asymptotic distributions for the differences between the left and right end-points of the prediction intervals output by the Bayesian and conformal predictors. These are theoretical counterparts of the preliminary empirical results obtained in [12] (Figures 10.1–10.5 and Section 8.5, pp. 205–207) and [13]. We then discuss and interpret our main result using the notions of efficiency and conditional validity (introduced in the previous two sections). Section 5 gives a more explicit description of conformalized ridge regression, and in Section 6 we prove the main result.

Other recent theoretical work about efficiency and conditional validity of conformal predictors includes Lei and Wasserman’s [6]. Whereas our predictor is obtained by conformalizing ridge regression, Lei and Wasserman’s conformal predictor is specially crafted to achieve asymptotic efficiency and conditional validity. It is intuitively clear that whereas our algorithm is likely to produce reasonable results in practice (in situations where ridge regression produces reasonable results), Lei and Wasserman’s algorithm is primarily of theoretical interest. A significant advantage of their algorithm, however, is that it is guaranteed to be asymptotically efficient and conditionally valid under their regularity assumptions, whereas our algorithm is guaranteed to be asymptotically efficient and conditionally valid only under the Gaussian assumptions.

2 Bayesian ridge regression

Much of the notation introduced in this section will be used throughout the paper. We are given a training sequence $(x_1, y_1), \ldots, (x_n, y_n)$ and a test object $x_{n+1}$, and our goal is to predict its label $y_{n+1}$. Each observation $(x_i, y_i)$, $i = 1, \ldots, n+1$, consists of an object $x_i \in \mathbb{R}^p$ and a label $y_i \in \mathbb{R}$. We are interested in the case where the number $n$ of training observations is large, whereas the number $p$ of attributes is fixed. Our setting is probabilistic; in particular, the observations are generated by a probability measure.

In this section we do not assume anything about the distribution of the objects $x_1, x_2, \ldots$, but given the objects, the labels are generated by the rule

$y_i = w \cdot x_i + \xi_i,$   (1)

where $w$ is a random vector distributed as $N(0, (\sigma^2/a) I_p)$ (the Gaussian distribution being parameterized by its mean and covariance matrix, and $I_p$ being the $p \times p$ unit matrix), each $\xi_i$ is distributed as $N(0, \sigma^2)$, the random elements $w, \xi_1, \xi_2, \ldots$ are independent (given the objects), and $\sigma^2$ and $a$ are given positive numbers.

The conditional distribution for the label $y_{n+1}$ of the test object $x_{n+1}$ given the training sequence and $x_{n+1}$ is $N(\hat y_{n+1}, \sigma_{n+1}^2)$, where

$\hat y_{n+1} := x_{n+1}' (X'X + a I_p)^{-1} X' Y,$   (2)
$\sigma_{n+1}^2 := \sigma^2 x_{n+1}' (X'X + a I_p)^{-1} x_{n+1} + \sigma^2,$   (3)

$X$ is the design matrix for the training sequence (the $n \times p$ matrix whose $i$th row is $x_i'$, $i = 1, \ldots, n$), and $Y := (y_1, \ldots, y_n)'$ is the vector of the training labels; see, e.g., [12], (10.24). Therefore, the Bayesian prediction interval is

$\bigl( \hat y_{n+1} - z_{\varepsilon/2}\,\sigma_{n+1},\; \hat y_{n+1} + z_{\varepsilon/2}\,\sigma_{n+1} \bigr),$   (4)

where $\varepsilon \in (0,1)$ is the significance level (the permitted probability of error, so that $1 - \varepsilon$ is the required coverage probability) and $z_{\varepsilon/2}$ is the $(1 - \varepsilon/2)$-quantile of the standard normal distribution $N(0,1)$.

The prediction interval (4) enjoys several desiderata: it is unconditionally valid, in the sense that its error probability is equal to the given significance level $\varepsilon$; it is also valid conditionally on the training sequence and the test object $x_{n+1}$; finally, this prediction interval is the shortest possible conditionally valid interval. We will refer to the class of algorithms producing prediction intervals (4) (and depending on the parameters $\sigma^2$ and $a$) as Bayesian ridge regression (BRR).
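A minimal numerical sketch of the interval (4), assuming the model of this section with known $\sigma^2$ and $a$; the function and variable names (brr_interval, x_new, etc.) are illustrative rather than taken from the paper.

```python
import numpy as np
from scipy.stats import norm

def brr_interval(X, y, x_new, sigma2=1.0, a=1.0, eps=0.05):
    """Bayesian ridge regression prediction interval (4) for the label of x_new.

    X is the n-by-p design matrix, y the vector of training labels; sigma2 and a
    are the (known) parameters of the Gaussian model (1); eps is the significance level.
    """
    p = X.shape[1]
    K = np.linalg.inv(X.T @ X + a * np.eye(p))       # (X'X + aI)^{-1}
    y_hat = x_new @ K @ (X.T @ y)                    # centre of the interval, cf. (2)
    var = sigma2 * (x_new @ K @ x_new) + sigma2      # predictive variance, cf. (3)
    half = norm.ppf(1 - eps / 2) * np.sqrt(var)      # z_{eps/2} * sigma_{n+1}
    return y_hat - half, y_hat + half

# Toy usage under the model (1).
rng = np.random.default_rng(0)
n, p, sigma2, a = 100, 2, 1.0, 1.0
w = rng.normal(scale=np.sqrt(sigma2 / a), size=p)        # w ~ N(0, (sigma^2/a) I_p)
X = rng.standard_normal((n, p))
y = X @ w + rng.normal(scale=np.sqrt(sigma2), size=n)    # labels generated by (1)
print(brr_interval(X, y, rng.standard_normal(p), sigma2, a))
```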

3 Conformalized ridge regression

Conformalized ridge regression (CRR) is a special case of conformal predictors; the latter are defined in, e.g., [12], Chapter 2, but we will reproduce the definition in our current context. First we define the CRR conformity measure as the function that maps any finite sequence of observations $(x_1, y_1), \ldots, (x_m, y_m)$ of any length $m$ to the sequence of the following conformity scores $\alpha_1, \ldots, \alpha_m$: for each $i$, the score $\alpha_i$ is computed from the vector $C = (C_1, \ldots, C_m)'$ of ridge regression residuals,

$C := \bar Y - \bar X (\bar X' \bar X + a I_p)^{-1} \bar X' \bar Y$

(cf. (2)), where $\bar X$ is the overall design matrix (the $m \times p$ matrix whose $i$th row is $x_i'$, $i = 1, \ldots, m$) and $\bar Y$ is the overall vector of labels (the vector in $\mathbb{R}^m$ whose $i$th element is $y_i$, $i = 1, \ldots, m$).

Remark.

We interpret $\alpha_i$ as the degree to which the element $(x_i, y_i)$ conforms to the full sequence $(x_1, y_1), \ldots, (x_m, y_m)$. Intuitively, $(x_i, y_i)$ conforms to the sequence if its ridge regression residual $C_i$ is neither among the largest nor among the smallest. Instead of the simple residuals we could have used deleted or studentized residuals (see, e.g., [12], pp. 34–35), but we choose the simplest definition, which makes calculations feasible. Another possibility is to use $-|C_i|$ as conformity scores; this choice leads to what was called “ridge regression confidence machines” in [12], Chapter 2, but its analysis is less feasible.

Given a significance level $\varepsilon \in (0,1)$, a training sequence $(x_1, y_1), \ldots, (x_n, y_n)$ and a test object $x_{n+1}$, conformalized ridge regression outputs the prediction set

$\Gamma^\varepsilon := \{ y \in \mathbb{R} : p^y > \varepsilon \},$   (5)

where the p-values $p^y$ are defined by

$p^y := \frac{\bigl|\{ i = 1, \ldots, n+1 : \alpha_i^y \le \alpha_{n+1}^y \}\bigr|}{n+1}$

and the conformity scores $\alpha_1^y, \ldots, \alpha_{n+1}^y$ are defined as the values of the CRR conformity measure on the sequence

$(x_1, y_1), \ldots, (x_n, y_n), (x_{n+1}, y)$   (6)

(i.e., the training sequence extended by the test object with postulated label $y$).

Define the prediction interval output by CRR as the closure of the convex hull of the prediction set (5); we will refer to the left and right end-points of this interval when stating our main result. (Later we will introduce assumptions that will guarantee that the prediction set itself is an interval from some $n$ on.) As discussed later in Section 5, CRR is computationally efficient in the on-line mode.
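The following sketch approximates the construction above on a finite grid of candidate labels instead of the exact computation of Section 5, and it uses the absolute-residual scores mentioned in the Remark rather than the paper’s exact conformity measure (6); the function name and defaults are illustrative only.

```python
import numpy as np

def conformal_ridge_interval(X, y, x_new, a=1.0, eps=0.05, grid_size=1000):
    """Grid-based approximation to a conformal prediction set for ridge regression.

    For each candidate label y_star, the candidate observation (x_new, y_star) is
    appended to the data, ridge residuals of the augmented sequence are computed,
    and y_star is kept if its score is not among the eps most extreme ones.
    """
    n, p = X.shape
    spread = y.max() - y.min()
    grid = np.linspace(y.min() - spread, y.max() + spread, grid_size)
    X_bar = np.vstack([X, x_new])                                   # overall design matrix
    K = np.linalg.solve(X_bar.T @ X_bar + a * np.eye(p), X_bar.T)   # (X_bar'X_bar + aI)^{-1} X_bar'
    kept = []
    for y_star in grid:
        y_bar = np.append(y, y_star)              # overall label vector with the postulated label
        resid = y_bar - X_bar @ (K @ y_bar)       # ridge residuals of the augmented sequence
        scores = np.abs(resid)                    # absolute-residual (non)conformity scores
        p_value = np.mean(scores >= scores[-1])   # fraction at least as strange as the test one
        if p_value > eps:
            kept.append(y_star)
    return (min(kept), max(kept)) if kept else None
```

The exact prediction set can be computed without a grid by exploiting the fact that the residuals are affine functions of the postulated label (see Section 5).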

CRR relies on different assumptions about the data as compared with BRR. Instead of the Gaussian model (1), where $w \sim N(0, (\sigma^2/a) I_p)$ and $\xi_i \sim N(0, \sigma^2)$, it uses the assumption that is standard in machine learning: we consider observations $(x_1, y_1), (x_2, y_2), \ldots$ that are IID (independent and identically distributed).

Proposition 1 ([12], Proposition 2.3).

If $(x_1, y_1), \ldots, (x_{n+1}, y_{n+1})$ are IID observations, the coverage probability of CRR (i.e., the probability of $y_{n+1} \in \Gamma^\varepsilon$, where $\Gamma^\varepsilon$ is defined by (5)) is at least $1 - \varepsilon$.

Proposition 1 asserts the unconditional validity of CRR. Its validity conditional on the training sequence and the test object $x_{n+1}$ is not, however, guaranteed (and it is intuitively clear that ensuring validity conditional on the test object prevents us from relying on the IID assumption about the objects). For a discussion of conditional validity in the context of conformal prediction, see [6], Section 2, and, more generally, [10]. Efficiency (narrowness of the prediction intervals) is not guaranteed either.

The kind of validity asserted in Proposition 1 is sometimes called “conservative validity” since $1 - \varepsilon$ is only a lower bound on the coverage probability. However, the definition of conformal predictors can be slightly modified (using randomization for treatment of borderline cases) to achieve exact validity; in practice, the difference between conformal predictors and their modified (“smoothed”) version is negligible. For details, see, e.g., [12], p. 27.
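A sketch of the smoothing idea mentioned above: ties between the test score and the other scores are broken by a single uniform random number, which makes the error probability exactly equal to the significance level. The function name and the score convention (larger score means stranger, as in the previous sketch) are illustrative assumptions.

```python
import numpy as np

def smoothed_p_value(scores, rng):
    """Smoothed conformal p-value; scores[-1] is the score of the test observation,
    scores[:-1] are those of the training observations."""
    tau = rng.uniform()                               # tie-breaking random variable in [0, 1)
    strictly_larger = np.sum(scores[:-1] > scores[-1])
    ties = np.sum(scores[:-1] == scores[-1]) + 1      # the test score always ties with itself
    return (strictly_larger + tau * ties) / len(scores)
```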

4 Main result

In this section we show that under the Gaussian model (1) complemented by other natural (and standard) assumptions CRR is asymptotically close to BRR, and therefore is approximately conditionally valid and efficient. On the other hand, Proposition 1 guarantees the unconditional validity of CRR under the IID assumption, regardless of whether (1) holds.

In this section we assume an infinite sequence of observations $(x_1, y_1), (x_2, y_2), \ldots$ but consider only the first $n$ of them and let $n \to \infty$. We make both the IID assumption about the objects (the objects are generated independently from the same distribution) and the assumption (1); however, we relax the assumption that $w$ is distributed as $N(0, (\sigma^2/a) I_p)$. These are all the assumptions used in our main result:

(A1)

The random objects $x_1, x_2, \ldots$ are IID.

(A2)

The second-moment matrix $\mathbb{E}(x_1 x_1')$ of the objects exists and is non-singular.

(A3)

The random vector $w$ is independent of the objects $x_1, x_2, \ldots$.

(A4)

The labels are generated by $y_i = w \cdot x_i + \xi_i$, where the $\xi_i$ are Gaussian noise variables distributed as $N(0, \sigma^2)$ and independent between themselves, of the objects $x_1, x_2, \ldots$, and of $w$.

Notice that the assumptions imply that the random observations $(x_i, y_i)$, $i = 1, 2, \ldots$, are IID given $w$. It will be clear from the proof that the assumptions can be relaxed further (but we have tried to make them as simple as possible).

Theorem 2.

Under the assumptions (A1)–(A4), the prediction sets output by CRR are intervals from some $n$ on almost surely, and the differences between the upper and lower ends of the prediction intervals for BRR and CRR are asymptotically Gaussian:

(7)
(8)

where the asymptotic variances in (7) and (8) are expressed in terms of the $(1 - \varepsilon/2)$-quantile and the density of the noise distribution $N(0, \sigma^2)$, the expectation of the objects, and their second-moment matrix.

The theorem will be proved in Section 6, and in the rest of this section we will discuss it. We can see from (7) and (8) that the symmetric difference between the prediction intervals output by BRR and CRR shrinks to 0 at the rate $O(n^{-1/2})$ in Lebesgue measure with high probability.
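The following self-contained simulation sketch illustrates this shrinkage under the model (1): it compares the upper end-point of the Bayesian interval (4) with the upper end-point of a grid-based conformal interval. It uses the absolute-residual scores of the Remark in Section 3 and a crude grid, so the numbers only indicate the order of magnitude of the effect, not the exact constants of Theorem 2; all names and parameter values are illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
p, sigma2, a, eps = 2, 1.0, 1.0, 0.1
z = norm.ppf(1 - eps / 2)

def brr_upper(X, y, x_new):
    """Upper end-point of the Bayesian prediction interval (4)."""
    K = np.linalg.inv(X.T @ X + a * np.eye(p))
    y_hat = x_new @ K @ (X.T @ y)
    var = sigma2 * (x_new @ K @ x_new) + sigma2
    return y_hat + z * np.sqrt(var)

def conformal_upper(X, y, x_new):
    """Upper end-point of a grid-based conformal interval (absolute-residual scores)."""
    X_bar = np.vstack([X, x_new])
    K = np.linalg.solve(X_bar.T @ X_bar + a * np.eye(p), X_bar.T)
    grid = np.linspace(y.min() - 3.0, y.max() + 3.0, 2000)
    upper = -np.inf
    for y_star in grid:
        y_bar = np.append(y, y_star)
        scores = np.abs(y_bar - X_bar @ (K @ y_bar))
        if np.mean(scores >= scores[-1]) > eps:      # conformal p-value exceeds eps
            upper = max(upper, y_star)
    return upper

for n in (200, 1600):
    diffs = []
    for _ in range(20):
        w = rng.normal(scale=np.sqrt(sigma2 / a), size=p)
        X = rng.standard_normal((n, p))
        y = X @ w + rng.normal(scale=np.sqrt(sigma2), size=n)
        x_new = rng.standard_normal(p)
        diffs.append(abs(brr_upper(X, y, x_new) - conformal_upper(X, y, x_new)))
    print(n, np.mean(diffs))   # the mean difference should shrink roughly like n**(-1/2)
```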

Let us first see what the typical values of the standard deviation (the square root of the variance) in (7) and (8) are. It is easy to check that the standard deviation is proportional to $\sigma$; therefore, let us assume $\sigma = 1$. The second term in the variance does not affect it significantly. Indeed, expressing the second-moment matrix of the objects through their covariance matrix and using the Sherman–Morrison formula (see, e.g., [5], (3)), we have:

(9)

(we write the second-moment matrix rather than the covariance matrix because the latter is permitted to be singular: see Appendix A for details). The first term, on the other hand, can affect the variance more significantly, and the significant dependence of the variance on $\varepsilon$ is natural: the accuracy obtained from the Gaussian model is better for small $\varepsilon$ since it uses all data for estimating the end-points of the prediction interval rather than relying, under the IID model, on the scarcer information provided by observations in the tails of the distribution generating the labels. Figure 1 illustrates the dependence of the standard deviation of the asymptotic distribution on $\varepsilon$. The upper line in it corresponds to one extreme value of the remaining parameter and the lower line corresponds to the other. The possible values for the standard deviation lie between the upper and lower lines. The asymptotic behaviour of the standard deviation as $\varepsilon \to 0$ is given by

(10)

uniformly in the remaining parameter.

Figure 1: The limits for the standard deviation in Theorem 2 as a function of (left) and (right) shown as solid (blue) lines; the asymptotic expression in (10) shown as a dotted (red) line. In all cases .

The assumptions (A1)–(A4) do not involve $a$, and Theorem 2 continues to hold if we set $a := 0$; this can be checked by going through the proof of Theorem 2 in Section 6. Theorem 2 can thus also be considered as an efficiency result about conformalizing the standard non-Bayesian least squares procedure; this procedure outputs precisely the intervals (4) with $a = 0$ as its prediction intervals (see, e.g., [8], p. 131). The least squares procedure has guaranteed coverage probability under weaker assumptions than BRR (not requiring assumptions about $w$); however, its validity is not conditional, similarly to CRR.

5 Further details of CRR

By the definition of the CRR conformity measure, we can rewrite the conformity scores in (6) as

(11)

where the vector of residuals is $C = (I_{n+1} - \bar H)\bar Y$, $I_{n+1}$ is the unit matrix, $\bar H := \bar X(\bar X'\bar X + a I_p)^{-1}\bar X'$ is the hat matrix, $\bar X$ is the overall design matrix (the $(n+1) \times p$ matrix whose $i$th row is $x_i'$, $i = 1, \ldots, n+1$), and $\bar Y$ is the overall vector of labels with the label of the test object set to $y$ (i.e., $\bar Y$ is the vector in $\mathbb{R}^{n+1}$ whose $i$th element is $y_i$, $i = 1, \ldots, n$, and whose $(n+1)$th element is $y$). If we modify the definition of CRR by using the signed residuals themselves (with one choice of sign) as conformity scores, we obtain the definition of upper CRR; the opposite choice of sign gives lower CRR. It is easy to see that the prediction set output by CRR at significance level $\varepsilon$ is the intersection of the prediction sets output by upper and lower CRR at significance levels $\varepsilon/2$. We will concentrate on upper CRR in the rest of this paper: lower CRR is analogous, and CRR is determined by upper and lower CRR.

Let us represent the upper CRR prediction set in a more explicit form (following [12], Section 2.3). We are given the training sequence $(x_1, y_1), \ldots, (x_n, y_n)$ and a test object $x_{n+1}$; let $y$ be a postulated label for $x_{n+1}$ and $\bar Y = (y_1, \ldots, y_n, y)'$ be the vector of labels. The vector of conformity scores is an affine function of $y$, say $A + B y$, where $A$ and $B$ do not depend on $y$ and are computed from the hat matrix $\bar H$ (cf. (11)). The components of $A$ and $B$, respectively, will be denoted by $A_i$ and $B_i$.

If we define

$S_i := \{ y \in \mathbb{R} : \alpha_i^y \le \alpha_{n+1}^y \}, \quad i = 1, \ldots, n+1,$   (12)

the definition of the p-values can be rewritten as

$p^y = \frac{|\{ i : y \in S_i \}|}{n+1};$

remember that the prediction set is defined by (5). As shown (under a slightly different definition of the conformity scores) in [12], pp. 30–34, the prediction set can be computed efficiently in the on-line mode.
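The efficiency comes from the fact that the vector of residuals, and hence the vector of conformity scores, is an affine function of the postulated label $y$, so the p-value can change only at a limited number of easily computable critical points. The following sketch verifies the affine decomposition numerically; the names (X_bar, H_bar, M, A, B) and the explicit form of the hat matrix for a ridge parameter $a > 0$ are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, a = 50, 3, 1.0
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
x_new = rng.standard_normal(p)

X_bar = np.vstack([X, x_new])                                              # overall design matrix
H_bar = X_bar @ np.linalg.solve(X_bar.T @ X_bar + a * np.eye(p), X_bar.T)  # hat matrix
M = np.eye(n + 1) - H_bar                                                  # residual-making matrix

# Residuals for a postulated label y_star:  C(y_star) = A + B * y_star
A = M @ np.append(y, 0.0)     # the part not depending on y_star
B = M[:, -1]                  # the coefficient of y_star (last column of M)

y_star = 2.5                  # any postulated label
assert np.allclose(M @ np.append(y, y_star), A + B * y_star)
```

Sorting the resulting critical points is what makes the exact computation referred to above feasible without scanning a grid of candidate labels.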

6 Proof of Theorem 2

For concreteness, we concentrate on the convergence (7) for the upper ends of the conformal and Bayesian prediction intervals. We split the proof into a series of steps.

Regularizing the rays in upper CRR

Upper CRR looks difficult to analyze in general, since the sets (12) may be rays pointing in opposite directions. Fortunately, the awkward case will be excluded for large $n$ under our assumptions (see Lemma 4 below). The following lemma gives a simple sufficient condition for its absence.

Lemma 3.

Suppose that, for each ,

(13)

where stands for the Euclidean norm. Then for all .

Intuitively, in the case of a small , (13) being violated for some means that all

lie approximately in the same hyperplane, and

is well outside it. The condition (13) can be expressed by saying that the matrix is positive definite.

Proof.

First we assume $a = 0$ (so that ridge regression becomes least squares); an extension to $a > 0$ will be easy. In this case $\bar H$ is the projection matrix onto the column space of the overall design matrix $\bar X$ and $I_{n+1} - \bar H$ is the projection matrix onto the orthogonal complement of that column space. We can have for (or even ) only if the angle between and the hyperplane is or less; in other words, if the angle between and that hyperplane is or more; in other words, if there is an element of such that its last coordinate is and its projection onto the other coordinates has length at most 1.

To reduce the case $a > 0$ to $a = 0$, add the dummy objects $\sqrt{a}\, e_j$, $j = 1, \ldots, p$, labelled by 0 at the beginning of the training sequence; here $e_1, \ldots, e_p$ is the standard basis of $\mathbb{R}^p$. ∎

Lemma 4.

The case for is excluded from some on almost surely under (A1)–(A4).

Proof.

We will check that (13) holds from some on. Let us set, without loss of generality, . Let . Since a.s.,

where

is the smallest eigenvalue of the given matrix. Since

a.s.,

for all from some on. ∎

Simplified upper CRR

Let us now find the upper CRR prediction set under the assumption that for all (cf. Lemmas 3 and 4 above). In this case each set (12) is

except for ; notice that only are defined. The p-value for any potential label of is

Therefore, the upper CRR prediction set at significance level is the ray

where and stands, as usual, for the th order statistic of .

Proof proper

As before, stands for the design matrix based on the first observations. A simple but tedious computation (see Appendix A) gives

(14)

where (cf. (3)). The first term in (14) is the centre of the Bayesian prediction interval (4); it does not depend on . We can see that

(15)

where is the th order statistic in the series

(16)

of residuals adjusted by dividing by . The behaviour of the order statistics of residuals is well studied: see, e.g., the theorem in [2]. The presence of complicates the situation, and so we first show that is small with high probability.

Lemma 5.

Let $\xi_1, \xi_2, \ldots$ be a sequence of IID random variables with a finite second moment. Then

$\frac{1}{\sqrt{n}} \max_{i = 1, \ldots, n} |\xi_i| \to 0$

in probability (and even almost surely) as $n \to \infty$.

Proof.

By the strong law of large numbers the sequence $\frac{1}{n}\sum_{i=1}^n \xi_i^2$ converges a.s. as $n \to \infty$, and so $\xi_n^2 / n \to 0$ a.s. This implies that $\max_{i = 1, \ldots, n} \xi_i^2 / n \to 0$ a.s. ∎
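A quick numerical illustration of Lemma 5 for standard Gaussian noise (any IID sequence with a finite second moment would do): the ratio $\max_{i \le n}|\xi_i| / \sqrt{n}$ visibly decays as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (10**3, 10**5, 10**7):
    xi = rng.standard_normal(n)                    # IID noise with finite second moment
    print(n, np.max(np.abs(xi)) / np.sqrt(n))      # tends to 0 as n grows (Lemma 5)
```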

Corollary 6.

Under the conditions of the theorem, in probability.

Proof.

Similarly to the proof of Lemma 4, we have, for almost all sequences ,

from some on. It remains to combine this with Lemma 5 and the fact that, by Assumption (A1), is bounded by a constant with high probability. ∎

Corollary 7.

Under the conditions of the theorem, in probability.

Proof.

Suppose that, on the contrary, there are and such that with probability at least for infinitely many . Fix such and . Suppose, for concreteness, that, with probability at least for infinitely many , we have , i.e., . The last inequality implies that for at least values of . By the definition (16) of this in turn implies that for at least values of . By Corollary 6, however, the last addend is less than with probability at least from some on (the fact that is bounded with high probability follows, e.g., from Lemma 8 below). This implies with positive probability from some on, and this contradiction completes the proof. ∎

The last (and most important) component of the proof is the following version of the theorem in [2], itself a version of the famous Bahadur representation theorem [1].

Lemma 8 ([2], theorem).

Under the conditions of Theorem 2,

(17)

where is the empirical distribution function of the noise and is the ridge regression estimate of .

For details of the proof (under our assumptions), see Appendix B.

By (15), Corollary 6, and Slutsky’s lemma (see, e.g., [9], Lemma 2.8), it suffices to prove (7) with the left-hand side replaced by . Moreover, by Corollary 7 and Slutsky’s lemma, it suffices to prove (7) with the left-hand side replaced by ; this is what we will do.

Lemma 8 holds in the situation where is a constant vector (the distribution of is allowed to be degenerate). Let be a Borel set in such that (17) holds for all , where the “a.s.” is now interpreted as “for almost all sequences ”. By Lebesgue’s dominated convergence theorem, it suffices to prove (7) with the left-hand side replaced by for a fixed and a fixed sequence . Therefore, we fix and ; the only remaining source of randomness is . Finally, by the definition of the set , it suffices to prove (7) with the left-hand side replaced by

(18)

Without loss of generality we will assume that as (this extra assumption about will ensure that Lindeberg’s condition is satisfied below).

Since and

where is the distribution function of , we have

by the central limit theorem (in its simplest form).

Since is the ridge regression estimate,

(19)
(20)

Furthermore, for

This gives

(the asymptotic, and even exact, normality is obvious from the formula for ).

Let us now calculate the covariance between the two addends in (18):

where and the last equality uses the decomposition with the second addend having zero expected value. Since

where , , . An easy computation gives , and so we have

as , where is the arithmetic mean of . Finally, this implies that (18) converges in law to

the asymptotic normality of (18) follows from the central limit theorem with Lindeberg’s condition, which holds since (18) is a linear combination of the noise random variables with coefficients whose maximum is as (this uses the assumption made earlier).

A more intuitive (but not necessarily simpler) proof can be obtained by noticing that and the residuals are asymptotically (precisely when ) independent.

7 Conclusion

The results of this paper are asymptotic; it would be very interesting to obtain their non-asymptotic counterparts. In non-asymptotic settings, however, it is not always true that conformalized ridge regression loses little in efficiency as compared with the Bayesian prediction interval; this is illustrated in [12], Section 8.5, and illustrated and explained in [13]. The main difference is that CRR and the Bayesian predictor start producing informative predictions after different numbers of observations. CRR, like any other conformal predictor (or any other method whose validity depends only on the IID assumption), starts producing informative predictions only after the number of observations exceeds the inverse significance level $1/\varepsilon$ (e.g., at least 20 observations when $\varepsilon = 0.05$). After this theoretical lower bound is exceeded, however, the difference between CRR and Bayesian predictions quickly becomes very small.

Another interesting direction of further research is to extend our results to kernel ridge regression.

Acknowledgements

We are grateful to Albert Shiryaev for inviting us in September 2013 to Kolmogorov’s dacha in Komarovka, where this project was conceived, and to Glenn Shafer for his advice about terminology. This work was supported in part by EPSRC (grant EP/K033344/1).

References

  • [1] R. Raj Bahadur. A note on quantiles in large samples. Annals of Mathematical Statistics, 37:577–580, 1966.
  • [2] Raymond J. Carroll. On the distribution of quantiles of residuals in a linear model. Technical Report Mimeo Series No. 1161, Department of Statistics, University of North Carolina at Chapel Hill, March 1978. Available from http://www.stat.ncsu.edu/information/library/mimeo.php.
  • [3] Samprit Chatterjee and Ali S. Hadi. Sensitivity Analysis in Linear Regression. Wiley, New York, 1988.
  • [4] László Györfi, Michael Kohler, Adam Krzyżak, and Harro Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, New York, 2002.
  • [5] Harold V. Henderson and Shayle R. Searle. On deriving the inverse of a sum of matrices. SIAM Review, 23:53–60, 1981.
  • [6] Jing Lei and Larry Wasserman. Distribution free prediction bands for nonparametric regression. Journal of the Royal Statistical Society B, 76:71–96, 2014.
  • [7] Ronald H. Randles, Thomas P. Hettmansperger, and George Casella. Introduction to the Special Issue: Nonparametric statistics. Statistical Science, 19:561, 2004.
  • [8] George A. F. Seber and Alan J. Lee. Linear Regression Analysis. Wiley, Hoboken, NJ, second edition, 2003.
  • [9] Aad W. van der Vaart. Asymptotic Statistics. Cambridge University Press, Cambridge, 1998.
  • [10] Vladimir Vovk. Conditional validity of inductive conformal predictors. Machine Learning, 92:349–376, 2013.
  • [11] Vladimir Vovk. Kernel ridge regression. In Bernhard Schölkopf, Zhiyuan Luo, and Vladimir Vovk, editors, Empirical Inference: Festschrift in Honour of Vladimir N. Vapnik, chapter 11, pages 105–116. Springer, Berlin, 2013.
  • [12] Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, New York, 2005.
  • [13] Vladimir Vovk, Ilia Nouretdinov, and Alex Gammerman. On-line predictive linear regression. Annals of Statistics, 37:1566–1590, 2009.
  • [14] Larry Wasserman. Frasian inference. Statistical Science, 26:322–325, 2011.

Appendix A Various computations

For the reader’s convenience, this appendix provides details of various routine calculations.

A singular in (9)

Apply (9) to and , where , in place of and , respectively, and let .

Computing for simplified upper CRR

In addition to the notation for the design matrix based on the first observations, we will use the notation for the hat matrix based on the first observations and for the hat matrix based on the first observations; the elements of will be denoted as and the elements of as ; as always, stands for the diagonal element . To compute we will use the formulas (2.18) in [3].

Since is the last column of and

we have

Therefore,

Next, letting stand for the predictions computed from the first observations,