A Residual Bootstrap for High-Dimensional Regression with Near Low-Rank Designs

07/04/2016 · Miles E. Lopes, et al. · UC Berkeley

We study the residual bootstrap (RB) method in the context of high-dimensional linear regression. Specifically, we analyze the distributional approximation of linear contrasts c^⊤(β̂_ρ − β), where β̂_ρ is a ridge-regression estimator. When regression coefficients are estimated via least squares, classical results show that RB consistently approximates the laws of contrasts, provided that p ≪ n, where the design matrix is of size n × p. Up to now, relatively little work has considered how additional structure in the linear model may extend the validity of RB to the setting where p/n ≍ 1. In this setting, we propose a version of RB that resamples residuals obtained from ridge regression. Our main structural assumption on the design matrix is that it is nearly low rank, in the sense that its singular values decay according to a power-law profile. Under a few extra technical assumptions, we derive a simple criterion for ensuring that RB consistently approximates the law of a given contrast. We then specialize this result to study confidence intervals for mean response values X_i^⊤β, where X_i is the ith row of the design. More precisely, we show that conditionally on a Gaussian design with near low-rank structure, RB simultaneously approximates all of the laws of X_i^⊤(β̂_ρ − β), i = 1, …, n. This result is also notable as it imposes no sparsity assumptions on β. Furthermore, since our consistency results are formulated in terms of the Mallows (Kantorovich) metric, the existence of a limiting distribution is not required.


1 Introduction

Until recently, much of the emphasis in the theory of high-dimensional statistics has been on “first order” problems, such as estimation and prediction. As the understanding of these problems has become more complete, attention has begun to shift increasingly towards “second order” problems, dealing with hypothesis tests, confidence intervals, and uncertainty quantification [1, 2, 3, 4, 5, 6]. In this direction, much less is understood about the effects of structure, regularization, and dimensionality, leaving many questions open. One collection of such questions that has attracted growing interest deals with the operating characteristics of the bootstrap in high dimensions [7, 8, 9]. Since the bootstrap is among the most widely used tools for approximating the sampling distributions of test statistics and estimators, there is much practical value in understanding what factors allow the bootstrap to succeed in the high-dimensional regime.

The regression model and linear contrasts.

In this paper, we focus our attention on high-dimensional linear regression, and our aim is to know when the residual bootstrap (RB) method consistently approximates the laws of linear contrasts. (A review of RB is given in Section 2.)

To specify the model, suppose that we observe a response vector Y ∈ ℝ^n, generated according to

(1)   Y = Xβ + ε,

where X ∈ ℝ^{n×p} is the observed design matrix, β ∈ ℝ^p is an unknown vector of coefficients, and the error variables ε = (ε_1, …, ε_n) are drawn i.i.d. from an unknown distribution F_0, with mean 0 and unknown variance σ² < ∞. As is conventional in high-dimensional statistics, we assume the model (1) is embedded in a sequence of models indexed by n. Hence, we allow X, β, and p to vary implicitly with n. We place no constraints on the ratio p/n until Section 3.3 (Theorem 3), and in Section 3.4 we will assume further that p/n is bounded strictly between 0 and 1. The distribution F_0 is fixed with respect to n, and none of our results require F_0 to have more than four moments.

Although we are primarily interested in cases where the design matrix is deterministic, we will also study the performance of the bootstrap conditionally on a Gaussian design. For this reason, we will retain conditional notation such as L(· | X) even when the design is non-random, so that confusion does not arise in relating different sections of the paper. Likewise, the symbol E refers to unconditional expectation over all sources of randomness. Whenever the design is random, we will assume that X and the errors ε are independent, and we write P_X for the distribution of X.

Within the context of the regression model, we will be focused on linear contrasts c^⊤(β̂ − β), where c ∈ ℝ^p is a fixed vector and β̂ is an estimate of β. The importance of contrasts arises from the fact that they unify many questions about a linear model. For instance, testing the significance of the jth coefficient β_j may be addressed by choosing c to be the jth standard basis vector e_j. Another important problem is quantifying the uncertainty of point predictions, which may be addressed by choosing c = X_i, i.e. the ith row of the design matrix. In this case, an approximation to the law of the contrast X_i^⊤(β̂ − β) leads to a confidence interval for the mean response value X_i^⊤β. Further applications of contrasts occur in the broad topic of ANOVA [10].

Intuition for structure and regularization in RB.

The following two paragraphs explain the core conceptual aspects of the paper. To understand the role of regularization in applying RB to high-dimensional regression, it is helpful to think of RB in terms of two ideas. First, if β̂_LS denotes the ordinary least-squares estimator, then it is a simple but important fact that contrasts can be written as

c^⊤(β̂_LS − β) = c^⊤(X^⊤X)^{−1}X^⊤ε,

where ε = (ε_1, …, ε_n) is the error vector in (1). Hence, if it were possible to sample directly from F_0, then the law of any such contrast could be easily determined. Since F_0 is unknown, the second key idea is to use the residuals of some estimator β̂ as a proxy for samples from F_0. When p ≪ n, the least-squares residuals are a good proxy [11, 12]. However, it is well known that least squares tends to overfit when p/n ≍ 1. When β̂_LS fits “too well”, this means that its residuals are “too small”, and hence they give a poor proxy for F_0. Therefore, by using a regularized estimator β̂_ρ, overfitting can be avoided, and the residuals of β̂_ρ may offer a better way of obtaining “approximate samples” from F_0.

The form of regularized regression we will focus on is ridge regression:

(2)   β̂_ρ := argmin_{b ∈ ℝ^p} { ‖Y − Xb‖₂² + ρ‖b‖₂² } = (X^⊤X + ρ I_p)^{−1} X^⊤Y,

where ρ > 0 is a user-specified regularization parameter. As will be seen in Sections 3.2 and 3.3, the residuals obtained from ridge regression lead to a particularly good approximation of F_0 when the design matrix is nearly low-rank, in the sense that most of its singular values are close to 0. In essence, this condition is a form of sparsity, since it implies that the rows of X nearly lie in a low-dimensional subspace of ℝ^p. However, this type of structural condition has a significant advantage over the more well-studied assumption that β is sparse. Namely, the assumption that X is nearly low-rank can be inspected directly in practice, whereas sparsity in β is typically unverifiable. In fact, our results will impose no conditions on β, other than that its norm remains bounded as the model dimensions grow. Finally, it is worth noting that the occurrence of near low-rank design matrices is actually very common in applications, and is often referred to as collinearity [13, ch. 17].
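To make the structural condition concrete, the short sketch below (ours, not code from the paper; the dimensions, penalty value, and power-law exponent are arbitrary placeholders) builds a design with power-law singular values, computes a ridge estimator in the generic parameterization of (2), and checks how much of the spectrum is carried by the leading directions, which is exactly the kind of inspection a practitioner can perform on a real design matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 300                                   # illustrative sizes only

# Build a design whose singular values follow a power-law profile.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((p, n)))
sing_vals = np.sqrt(n) * np.arange(1, n + 1) ** -1.0
X = U @ np.diag(sing_vals) @ V.T

beta = rng.standard_normal(p) / np.sqrt(p)
Y = X @ beta + 0.5 * rng.standard_normal(n)

# Ridge estimator: beta_hat = argmin ||Y - X b||^2 + rho * ||b||^2.
rho = 1.0
beta_hat = np.linalg.solve(X.T @ X + rho * np.eye(p), X.T @ Y)

# Near low-rank check: how fast does the spectrum of X decay?
s = np.linalg.svd(X, compute_uv=False)
print("top 5 singular values:", np.round(s[:5], 2))
print("energy in top 10 directions:", round(float(np.sum(s[:10] ** 2) / np.sum(s ** 2)), 3))
```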

Contributions and outline.

The primary contribution of this paper is a complement to the work of Bickel and Freedman [12] (hereafter B&F 1983), who showed that in general, the RB method fails to approximate the laws of least-squares contrasts when p/n ≍ 1. Instead, we develop an alternative set of results, proving that even when p/n ≍ 1, RB can successfully approximate the laws of “ridged contrasts” c^⊤(β̂_ρ − β) for many choices of c, provided that the design matrix is nearly low rank. A particularly interesting consequence of our work is that RB successfully approximates the law for a certain choice of c that was shown in B&F 1983 to “break” RB when applied to least squares. Specifically, such a c can be chosen as one of the rows of X with a high leverage score (see Section 4). This example corresponds to the practical problem of setting confidence intervals for mean response values X_i^⊤β. (See [12, p. 41], as well as Lemma 2 and Theorem 4 in Section 3.4.) Lastly, from a technical point of view, a third notable aspect of our results is that they are formulated in terms of the Mallows-d₂ metric, which frees us from having to impose conditions that force a limiting distribution to exist.

Apart from B&F 1983, the most closely related works we are aware of are the recent papers [7] and [8], which also consider RB in the high-dimensional setting. However, these works focus on the role of sparsity in β and do not make use of low-rank structure in the design, whereas our work deals only with structure in the design and imposes no sparsity assumptions on β.

The remainder of the paper is organized as follows. In Section 2, we formulate the problem of approximating the laws of contrasts, and describe our proposed methodology for RB based on ridge regression. Then, in Section 3 we state several results that lay the groundwork for Theorem 4, which shows that RB can successfully approximate all of the laws L(X_i^⊤(β̂_ρ − β) | X), i = 1, …, n, conditionally on a Gaussian design. Due to space constraints, all proofs are deferred to material that will appear in a separate work.

Notation and conventions.

If U and V are random variables, then L(U | V) denotes the law of U, conditionally on V. If a_n and b_n are two sequences of real numbers, then the notation a_n ≲ b_n means that there is an absolute constant c > 0 and an integer n_0 ≥ 1 such that a_n ≤ c·b_n for all n ≥ n_0. The notation a_n ≍ b_n means that a_n ≲ b_n and b_n ≲ a_n. For a square matrix A whose eigenvalues are real, we denote them by λ_1(A) ≥ λ_2(A) ≥ ⋯.

2 Problem setup and methodology

Problem setup.

For any c ∈ ℝ^p, it is clear that conditionally on X, the law of c^⊤(β̂_ρ − β) is completely determined by F_0, and hence it makes sense to write

(3)   L(c^⊤(β̂_ρ − β) | X)

for this law. The problem we aim to solve is to approximate the distribution (3) for suitable choices of c.

Review of the residual bootstrap (RB) procedure.

We briefly explain the steps involved in the residual bootstrap procedure, applied to the ridge estimator β̂_ρ of β. To proceed somewhat indirectly, consider the following “bias-variance” decomposition of c^⊤(β̂_ρ − β), conditionally on X,

(4)   c^⊤(β̂_ρ − β) = c^⊤(β̂_ρ − E[β̂_ρ | X]) + c^⊤(E[β̂_ρ | X] − β).

Note that the distribution L(c^⊤(β̂_ρ − E[β̂_ρ | X]) | X) has mean zero, and that the second term on the right side is the bias of c^⊤β̂_ρ as an estimator of c^⊤β. Furthermore, the distribution L(c^⊤(β̂_ρ − E[β̂_ρ | X]) | X) may be viewed as the “variance component” of the law (3). We will be interested in situations where the regularization parameter ρ may be chosen small enough so that the bias component is small. In this case, one has c^⊤(β̂_ρ − β) ≈ c^⊤(β̂_ρ − E[β̂_ρ | X]), and then it is enough to find an approximation to the law of the variance component, which is unknown. To this end, a simple manipulation of the formula (2) leads to

(5)   c^⊤(β̂_ρ − E[β̂_ρ | X]) = c^⊤(X^⊤X + ρ I_p)^{−1} X^⊤ε.

Now, to approximate the law of the right side, let F̂ be any centered estimate of F_0. (Typically, F̂ is obtained by using the centered residuals of some estimator of β, but this is not necessary in general.) Also, let ε^* = (ε^*_1, …, ε^*_n) be an i.i.d. sample from F̂. Then, replacing ε with ε^* in line (5) yields the bootstrap quantity

(6)   c^⊤(X^⊤X + ρ I_p)^{−1} X^⊤ε^*.

At this point, we define the (random) measure L(c^⊤(X^⊤X + ρ I_p)^{−1} X^⊤ε^* | X, F̂) to be the RB approximation to the law (3). Hence, it is clear that the RB approximation is simply a “plug-in rule”.

A two-stage approach.

An important feature of the procedure just described is that we are free to use any centered estimator F̂ of F_0. This fact offers substantial flexibility in approximating the law (3). One way of exploiting this flexibility is to consider a two-stage approach, where a “pilot” ridge estimator β̂_ϱ is used to first compute residuals whose centered empirical distribution function is F̂_ϱ, say. Then, in the second stage, the distribution F̂_ϱ is used to approximate the law (3) via the relation (6). To be more detailed, if ê_i = Y_i − X_i^⊤β̂_ϱ, i = 1, …, n, are the residuals of β̂_ϱ, then we define F̂_ϱ to be the distribution that places mass 1/n at each of the values ê_i − ē, with ē = (1/n)∑_{i=1}^n ê_i. Here, it is important to note that the pilot value ϱ is chosen to optimize F̂_ϱ as an approximation to F_0. By contrast, the choice of ρ depends on the relative importance of width and coverage probability for confidence intervals based on c^⊤β̂_ρ. Theorems 1, 3, and 4 will offer some guidance in selecting ρ and ϱ.

Resampling algorithm.

To summarize the discussion above, if B is a user-specified number of bootstrap replicates, our proposed method for approximating the law (3) is given below.

  1. Select ρ and ϱ, and compute the residuals ê_i = Y_i − X_i^⊤β̂_ϱ, i = 1, …, n.

  2. Compute the centered distribution function F̂_ϱ, putting mass 1/n at each ê_i − ē.

  3. For j = 1, …, B:

    • Draw a vector ε^* ∈ ℝ^n of i.i.d. samples from F̂_ϱ.

    • Compute z_j := c^⊤(X^⊤X + ρ I_p)^{−1} X^⊤ε^*.

  4. Return the empirical distribution of z_1, …, z_B.

Clearly, as B → ∞, the empirical distribution of z_1, …, z_B converges weakly to L(c^⊤(X^⊤X + ρ I_p)^{−1} X^⊤ε^* | X, F̂_ϱ), with probability 1. As is conventional, our theoretical analysis in the next section will ignore Monte Carlo issues, and address only the performance of this law as an approximation to the law (3).
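For illustration, here is a minimal implementation of the resampling algorithm above. It is a sketch under our generic ridge parameterization from (2); the function and variable names, the synthetic data, and the quantile-based interval at the end are ours and are not taken from the paper.

```python
import numpy as np

def residual_bootstrap(X, Y, c, rho, rho_pilot, B=2000, rng=None):
    """Two-stage residual bootstrap for a ridged contrast c^T (beta_hat_rho - beta).
    Returns B bootstrap draws approximating its law (a sketch of the algorithm in
    Section 2, under an assumed ridge parameterization)."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    G = X.T @ X

    # Step 1: pilot ridge fit and its residuals.
    beta_pilot = np.linalg.solve(G + rho_pilot * np.eye(p), X.T @ Y)
    resid = Y - X @ beta_pilot

    # Step 2: centered residual distribution F_hat.
    resid_centered = resid - resid.mean()

    # Steps 3-4: resample residuals and recompute the ridged contrast.
    w = X @ np.linalg.solve(G + rho * np.eye(p), c)      # each draw is w^T eps*
    eps_star = rng.choice(resid_centered, size=(B, n), replace=True)
    return eps_star @ w

# Example: a 90% bootstrap interval for a mean response X_0^T beta.
rng = np.random.default_rng(1)
n, p = 80, 120
X = rng.standard_normal((n, p)) * (np.arange(1, p + 1) ** -1.0)   # decaying column scales
beta = rng.standard_normal(p)
Y = X @ beta + rng.standard_normal(n)

c = X[0]                                  # contrast for the first mean response
rho, rho_pilot = 1.0, 0.1
z = residual_bootstrap(X, Y, c, rho, rho_pilot, B=2000, rng=2)
beta_hat = np.linalg.solve(X.T @ X + rho * np.eye(p), X.T @ Y)
q_lo, q_hi = np.quantile(z, [0.05, 0.95])
print("90% interval for X_0^T beta:", (c @ beta_hat - q_hi, c @ beta_hat - q_lo))
```

The interval at the end simply inverts the bootstrap law of the contrast, using the quantiles of z_1, …, z_B as a stand-in for the quantiles of c^⊤(β̂_ρ − β).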

3 Main results

The following metric will be central to our theoretical results, and has been a standard tool in the analysis of the bootstrap, beginning with the work of Bickel and Freedman [14].

The Mallows (Kantorovich) metric.

For two random vectors U and V in a Euclidean space, the Mallows-d₂ metric is defined by

(7)   d₂(L(U), L(V))² := inf_{π ∈ Π} E_π[ ‖U − V‖₂² ],

where the infimum is over the class Π of joint distributions π whose marginals are L(U) and L(V). It is worth noting that convergence in d₂ is strictly stronger than weak convergence, since it also requires convergence of second moments. Additional details may be found in the paper [14].
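On the real line, d₂ is attained by coupling quantiles, and for two empirical distributions with the same number of atoms this reduces to sorting both samples and averaging squared differences. The following small helper (ours, for illustration only) does exactly that.

```python
import numpy as np

def mallows_d2(x, y):
    """Mallows (Kantorovich / Wasserstein-2) distance between two empirical
    distributions on the real line with the same number of atoms."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    assert x.shape == y.shape, "equal sample sizes are assumed here"
    return float(np.sqrt(np.mean((x - y) ** 2)))

rng = np.random.default_rng(0)
a = rng.standard_normal(5000)
b = rng.standard_normal(5000) + 0.3      # the same law shifted by 0.3
print(round(mallows_d2(a, b), 3))        # close to 0.3, the size of the shift
```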

3.1 A bias-variance decomposition for bootstrap approximation

To give some notation for analyzing the bias-variance decomposition in line (4), we define the following quantities based upon the ridge estimator β̂_ρ. Namely, the conditional variance is

v_ρ(c) := var(c^⊤β̂_ρ | X) = σ² ‖X(X^⊤X + ρ I_p)^{−1} c‖₂².

To express the bias of c^⊤β̂_ρ, we define the vector δ_ρ ∈ ℝ^p according to

(8)   δ_ρ := E[β̂_ρ | X] − β,

and then put

(9)   b_ρ(c) := c^⊤δ_ρ.

We will sometimes omit the arguments of v_ρ(c) and b_ρ(c) to lighten notation. Note that b_ρ(c) does not depend on F_0, and v_ρ(c) only depends on F_0 through σ².

The following result gives a regularized and high-dimensional extension of some lemmas in Freedman’s early work [11] on RB for least squares. The result does not require any structural conditions on the design matrix, or on the true parameter β.

Theorem 1 (consistency criterion).

Suppose X is fixed. Let F̂ be any estimator of F_0, and let c ∈ ℝ^p be any non-zero vector. Then, with probability 1, the following inequality holds for every n ≥ 1 and every ρ > 0,

(10)
Remarks.

Observe that the normalization by the standard deviation v_ρ(c)^{1/2} ensures that the bound is non-trivial, since the normalized law has variance equal to 1 for all n (and hence does not become degenerate for large n). To consider the choice of ρ, it is simple to verify that the ratio b_ρ(c)²/v_ρ(c) decreases monotonically as ρ decreases. Note also that as ρ becomes small, the variance v_ρ(c) becomes large, and likewise, confidence intervals based on c^⊤β̂_ρ become wider. In other words, there is a trade-off between the width of the confidence interval and the size of the bound (10).

Sufficient conditions for consistency of RB.

An important practical aspect of Theorem 1 is that for any given contrast c, the variance v_ρ(c) can be easily estimated, since it only requires an estimate of σ², which can be obtained from F̂. Consequently, whenever theoretical bounds on d₂(F̂, F_0) and the bias-variance ratio b_ρ(c)²/v_ρ(c) are available, the right side of line (10) can be controlled. In this way, Theorem 1 offers a simple route for guaranteeing that RB is consistent. In Sections 3.2 and 3.3 to follow, we derive a bound on d₂(F̂_ϱ, F_0) in the case where F̂ is chosen to be the pilot distribution F̂_ϱ. Later on, in Section 3.4, we study RB consistency in the context of prediction with a Gaussian design, and there we derive high-probability bounds on both the bias and variance quantities when c is a particular row of X.
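As an illustration of the point about estimating the variance, the sketch below evaluates the conditional variance of a ridged contrast under our generic parameterization of (2), namely var(c^⊤β̂_ρ | X) = σ²‖X(X^⊤X + ρI)^{−1}c‖₂², with σ² supplied by any estimate (for example, the variance of the centered pilot residuals). The function name and the plug-in choice of σ̂² are ours, not the paper's.

```python
import numpy as np

def contrast_variance(X, c, rho, sigma2):
    """Conditional variance of c^T beta_hat_rho for the ridge estimator
    beta_hat_rho = (X^T X + rho I)^{-1} X^T Y (an assumed parameterization):
    sigma^2 * || X (X^T X + rho I)^{-1} c ||_2^2."""
    p = X.shape[1]
    A = np.linalg.solve(X.T @ X + rho * np.eye(p), c)
    return sigma2 * float(np.sum((X @ A) ** 2))

# Example with synthetic data; sigma2 would typically come from the centered
# pilot residuals, e.g. sigma2_hat = np.var(Y - X @ beta_pilot).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
print(contrast_variance(X, c=X[0], rho=1.0, sigma2=1.0))
```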

3.2 A link between bootstrap consistency and MSPE

If β̂ is an estimator of β, its mean-squared prediction error (MSPE), conditionally on X, is defined as

(11)   mspe(β̂) := (1/n) E[ ‖X(β̂ − β)‖₂² | X ].

The previous subsection showed that in-law approximation of contrasts is closely tied to the approximation of F_0. We now take a second step of showing that if the centered residuals of an estimator β̂ are used to approximate F_0, then the quality of this approximation can be bounded naturally in terms of mspe(β̂). This result applies to any estimator β̂ computed from the observations (1).

Theorem 2.

Suppose X is fixed. Let β̂ be any estimator of β, and let F̂ be the empirical distribution of the centered residuals of β̂. Also, let F̂_n denote the empirical distribution of n i.i.d. samples from F_0. Then for every n ≥ 1,

(12)
Remarks.

As we will see in the next section, the MSPE of ridge regression can be bounded in a sharp way when the design matrix is approximately low rank, and there we will analyze mspe(β̂_ϱ) for the pilot estimator. Consequently, when near low-rank structure is available, the only remaining issue in controlling the right side of line (12) is to bound the quantity d₂(F̂_n, F_0). The very recent work of Bobkov and Ledoux [15] provides an in-depth study of this question, and they derive a variety of bounds under different tail conditions on F_0. We summarize one of their results below.

Lemma 1 (Bobkov and Ledoux, 2014).

If F_0 has a finite fourth moment, then

(13)   E[ d₂(F̂_n, F_0)² ] ≤ C·n^{−1/2}, where C depends only on the fourth moment of F_0.

Remarks. The fact that the squared distance is bounded at the rate n^{−1/2} is an indication that d₂ is a rather strong metric on distributions. For a detailed discussion of this result, see Corollaries 7.17 and 7.18 in the paper [15]. Although it is possible to obtain faster rates when more stringent tail conditions are placed on F_0, we will only need a fourth moment, since the MSPE term in Theorem 2 will often have a slower rate than n^{−1/2}, as discussed in the next section.
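For a purely numerical illustration of Lemma 1 (ours, not an experiment from the paper), one can approximate d₂(F̂_n, F_0) by quantile coupling against a large Monte Carlo reference sample standing in for F_0; the averaged squared distance shrinks as n grows, in line with the bound in Lemma 1. The heavy-tailed t distribution used below (which has four finite moments) is only a placeholder choice of F_0.

```python
import numpy as np

rng = np.random.default_rng(0)
ref = rng.standard_t(df=5, size=500_000)     # large Monte Carlo stand-in for F_0

def d2_to_reference(sample, ref):
    """Approximate d_2(empirical law of `sample`, F_0) by quantile coupling,
    using a large reference sample in place of the true quantile function."""
    n = len(sample)
    u = (np.arange(n) + 0.5) / n             # mid-point probability grid
    return float(np.sqrt(np.mean((np.sort(sample) - np.quantile(ref, u)) ** 2)))

for n in [100, 400, 1600]:
    sq = [d2_to_reference(rng.standard_t(df=5, size=n), ref) ** 2 for _ in range(50)]
    print(n, round(float(np.mean(sq)), 4))   # average squared d_2 shrinks with n
```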

3.3 Consistency of ridge regression in MSPE for near low rank designs

In this subsection, we show that when the tuning parameter ϱ is set at a suitable rate, the pilot ridge estimator β̂_ϱ is consistent in MSPE when the design matrix is near low-rank, even when p/n is large, and without any sparsity constraints on β. We now state some assumptions.

A 1.

There is a number , and absolute constants , such that

A 2.

There are absolute constants , such that for every , and .

A 3.

The vector satisfies .

Due to Theorem 2, the following bound shows that the residuals of β̂_ϱ may be used to extract a consistent approximation to F_0. Two other notable features of the bound are that it is non-asymptotic and dimension-free.

Theorem 3.

Suppose that X is fixed and that Assumptions 1–3 hold. Assume further that ϱ is chosen at a rate adapted to the decay exponent ν (a different choice is made in each of two regimes of ν). Then,

(14)

Also, both bounds in (14) are tight in the sense that β can be chosen so that mspe(β̂_ϱ) attains either rate.

Remarks.

Since the sample eigenvalues are observable, they may be used to estimate ν and guide the selection of ϱ. However, from a practical point of view, we found it easier to select ϱ via cross-validation in numerical experiments, rather than via an estimate of ν.
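As a concrete illustration of the cross-validation route (ours; the paper does not prescribe a specific implementation), one can use scikit-learn's RidgeCV to pick a penalty by 5-fold cross-validated prediction error. Note that sklearn's `alpha` follows its own scaling convention, which need not match ϱ in (2), so the selected value should be treated only as a data-driven starting point.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def select_penalty_by_cv(X, Y, alphas=np.logspace(-4, 4, 50)):
    """Pick a ridge penalty by 5-fold cross-validated prediction error.
    sklearn's `alpha` has its own scaling convention; the returned value is
    only a data-driven starting point for choosing the pilot parameter."""
    model = RidgeCV(alphas=alphas, cv=5, fit_intercept=False).fit(X, Y)
    return model.alpha_

# Example with synthetic data (illustrative sizes only).
rng = np.random.default_rng(0)
n, p = 100, 200
X = rng.standard_normal((n, p)) * (np.arange(1, p + 1) ** -1.0)
Y = X @ rng.standard_normal(p) + rng.standard_normal(n)
print("cross-validated penalty:", select_penalty_by_cv(X, Y))
```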

A link with Pinsker’s Theorem.

In the particular case when F_0 is a centered Gaussian distribution, the “prediction problem” of estimating Xβ is very similar to estimating the mean parameters of a Gaussian sequence model, with error measured in the ℓ₂ norm. In the alternative sequence-model format, the decay condition on the eigenvalues of the design translates into an ellipsoid constraint on the mean parameter sequence [16, 17]. For this reason, Theorem 3 may be viewed as a “regression version” of error bounds for the sequence model under an ellipsoid constraint (cf. Pinsker’s Theorem, [16, 17]). Because the latter problem has a very well-developed literature, there may be various “neighboring results” elsewhere. Nevertheless, we could not find a direct reference for our stated MSPE bound in the current setup. For the purposes of our work in this paper, the more important point to take away from Theorem 3 is that it can be coupled with Theorem 2 for proving consistency of RB.

3.4 Confidence intervals for mean responses, conditionally on a Gaussian design

In this section, we consider the situation where the design matrix X has rows X_1, …, X_n drawn i.i.d. from a multivariate normal distribution N(0, Σ), with p < n. (The covariance matrix Σ may vary with n.) Conditionally on a realization of X, we analyze the RB approximation of the laws L(X_i^⊤(β̂_ρ − β) | X), i = 1, …, n. As discussed in Section 1, this corresponds to the problem of setting confidence intervals for the mean responses X_i^⊤β. Assuming that the population eigenvalues λ_i(Σ) obey a decay condition, we show below in Theorem 4 that RB succeeds with high probability over the draw of X. Moreover, this consistency statement holds for all of the laws simultaneously. That is, among the n distinct laws L(X_i^⊤(β̂_ρ − β) | X), i = 1, …, n, even the worst bootstrap approximation is still consistent. We now state some population-level assumptions.

A 4.

The operator norm of Σ remains bounded: ‖Σ‖_op ≲ 1.

Next, we impose a decay condition on the eigenvalues of Σ. This condition also ensures that Σ is invertible for each fixed p, even though the bottom eigenvalue λ_p(Σ) may become arbitrarily small as p becomes large. It is important to notice that the decay exponent used here for the population eigenvalues is distinct from the exponent ν, which described the sample eigenvalues in the previous section.

A 5.

There is a number , and absolute constants , such that for all ,

A 6.

There are absolute constants such that for all , we have the bounds
         and .

The following lemma collects most of the effort needed in proving our final result in Theorem 4. Here it is also helpful to recall the notation and from Assumption 2.

Lemma 2.

Suppose that the matrix X has rows drawn i.i.d. from N(0, Σ), and that Assumptions 2–6 hold. Furthermore, assume that ρ is chosen at a suitable rate (the same scaling is assumed in Theorem 4). Then, the statements below are true.

(i) (bias inequality)
Fix any . Then, there is an absolute constant , such that for all large , the following event holds with -probability at least ,

(15)

(ii) (variance inequality)
There are absolute constants such that for all large , the following event holds with -probability at least ,

(16)

(iii) (mspe inequalities)
Suppose that is chosen as when , and that is chosen as when . Then, there are absolute constants such that for all large ,

Remarks.

Note that the two rates in part (iii) coincide as the decay exponent approaches the boundary between the two regimes. At a conceptual level, the entire lemma may be explained in relatively simple terms. Viewing the bias, variance, and MSPE quantities as functionals of a Gaussian matrix, the proof involves deriving concentration bounds for each of them. Indeed, this is plausible given that these quantities are smooth functionals of X. However, the difficulty of the proof arises from the fact that they are also highly non-linear functionals of X.

We now combine Lemmas 1 and 2 with Theorems 1 and 2 to show that all of the laws L(X_i^⊤(β̂_ρ − β) | X), i = 1, …, n, can be simultaneously approximated via our two-stage RB method.

Theorem 4.

Suppose that F_0 has a finite fourth moment, Assumptions 2–6 hold, and ρ is chosen at a suitable rate. Also suppose that the pilot parameter ϱ is chosen at a rate adapted to the decay exponent, as in Lemma 2(iii). Then, there is a sequence of positive numbers δ_n with δ_n → 0, such that the event

(17)

has probability (over the draw of X) tending to 1 as n → ∞.

Remark.

Lemma 2 gives explicit bounds on the numbers δ_n, as well as on the probabilities of the corresponding events, but we have stated the result in this way for the sake of readability.

4 Simulations

In four different settings of (n, p) and the eigenvalue decay parameter, we compared the nominal 90% confidence intervals (CIs) of four methods: “oracle”, “ridge”, “normal”, and “OLS”, to be described below. In each setting, we generated random designs X with i.i.d. rows drawn from N(0, Σ), where the eigenvalues of Σ follow a power-law decay profile, and the eigenvectors of Σ were drawn randomly by setting them to be the Q factor in a QR decomposition of a standard Gaussian matrix. Then, for each realization of X, we generated realizations of Y according to the model (1), where F_0 is a centered distribution with a specified number of degrees of freedom, rescaled to have a fixed standard deviation. For each X, and each corresponding Y, we considered the problem of setting a CI for the mean response value X_i^⊤β, where X_i is the row of X with the highest leverage score, i.e. the row maximizing the diagonal of the hat matrix X(X^⊤X)^{−1}X^⊤. This problem was shown in B&F 1983 to be a case where the standard RB method based on least squares fails when p/n is not small. Below, we refer to this method as “OLS”.
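The design-generation recipe just described can be sketched as follows. All numerical values (dimensions, decay exponent, degrees of freedom, and error scale) are placeholders, since the exact settings used in the experiments are not reproduced here, and the rescaled t distribution is used only as a stand-in for the centered, rescaled error law.

```python
import numpy as np

def simulate_design(n, p, decay, rng):
    """Gaussian design whose covariance has a power-law spectrum and random
    eigenvectors given by the Q factor of a QR decomposition of a standard
    Gaussian matrix."""
    eigvals = np.arange(1, p + 1) ** (-decay)            # placeholder decay profile
    Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
    Sigma_sqrt = Q @ np.diag(np.sqrt(eigvals)) @ Q.T
    return rng.standard_normal((n, p)) @ Sigma_sqrt

rng = np.random.default_rng(0)
n, p = 100, 60                                           # placeholder sizes
X = simulate_design(n, p, decay=1.0, rng=rng)

# Row with the highest leverage score (the contrast studied in B&F 1983).
H = X @ np.linalg.solve(X.T @ X, X.T)                    # hat matrix (requires p < n)
i_star = int(np.argmax(np.diag(H)))
print("highest-leverage row index:", i_star)

# Errors: centered distribution rescaled to a fixed standard deviation
# (a rescaled t distribution is used here purely as a placeholder).
df, sd = 5, 0.5
eps = rng.standard_t(df, size=n) * sd / np.sqrt(df / (df - 2))
```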

To describe the other three methods, “ridge” refers to the interval centered at X_i^⊤β̂_ρ whose endpoints are determined by quantiles of the B bootstrap samples computed by the proposed algorithm in Section 2, applied with c = X_i. To choose the parameters ρ and ϱ for a given X and Y, we first computed the value ρ̂_CV that optimized the MSPE of a ridge estimator with respect to 5-fold cross-validation; i.e. cross-validation was performed for every distinct pair (X, Y). We then put ρ = 5·ρ̂_CV and ϱ = 0.1·ρ̂_CV, as we found the prefactors 5 and 0.1 to work adequately across various settings. (Optimizing with respect to MSPE is motivated by Theorems 1, 2, and 3. Also, choosing ϱ to be somewhat smaller than ρ conforms with the constraints on ρ and ϱ in Theorem 4.) The method “normal” refers to the CI based on the normal approximation to the law of the least-squares contrast X_i^⊤(β̂_LS − β), using the usual unbiased estimate of σ² based on OLS residuals. The “oracle” method refers to the interval centered at X_i^⊤β̂_ρ whose half-width is the empirical quantile of |X_i^⊤(β̂_ρ − β)| over all realizations of Y based on a given X. (This accounts for the randomness in the data-driven choice of ρ.)

Within a given setting of (n, p) and the decay parameter, we refer to the “coverage” of a method as the fraction of instances in which the method’s CI contained the parameter X_i^⊤β. Also, we refer to “width” as the average width of a method’s intervals over all of the instances. The four settings correspond to moderate/high dimension and moderate/fast decay of the eigenvalues of Σ. Even in the case of moderate dimension, the results show that the OLS intervals are too narrow and have coverage noticeably less than 90%. As expected, this effect becomes more pronounced in the high-dimensional settings. The ridge and normal intervals perform reasonably well across settings, with both performing much better than OLS. However, it should be emphasized that our study of RB is motivated by the desire to gain insight into the behavior of the bootstrap in high dimensions, rather than by an attempt to outperform particular methods. In future work, we plan to investigate the relative merits of the ridge and normal intervals in greater detail.

            method:    oracle   ridge   normal   OLS
setting 1   width      0.21     0.20    0.23     0.16
            coverage   0.90     0.87    0.91     0.81
setting 2   width      0.22     0.26    0.26     0.06
            coverage   0.90     0.88    0.88     0.42
setting 3   width      0.20     0.21    0.22     0.16
            coverage   0.90     0.90    0.91     0.81
setting 4   width      0.21     0.26    0.23     0.06
            coverage   0.90     0.92    0.87     0.42

Table 1: Comparison of nominal confidence intervals
Acknowledgements.

MEL thanks Prof. Peter J. Bickel for many helpful discussions, and gratefully acknowledges the DOE CSGF under grant DE-FG02-97ER25308, as well as the NSF-GRFP.

Appendix A Proof of Theorem 1

Proof.

Due to line (4) and Lemma 8.8 in B&F 1981,

(18)

If is a random vector whose entries are drawn i.i.d. from , then the definition of gives the matching relations

(19)

To make use of these relations, we apply Lemma 8.9 in B&F 1981, which implies that if is a generic deterministic vector, and if and are random vectors with i.i.d. entries, then

Therefore,

(20)

Combining this with line (18) and dividing through by the normalizing constant proves the claim.∎

Appendix B Proof of Theorem 2

Proof.

By the triangle inequality,

(21)

Let F̃ be the (uncentered) empirical distribution of the residuals of β̂, which places mass 1/n at each value ê_i = Y_i − X_i^⊤β̂, for i = 1, …, n. The proofs of Lemmas 2.1 and 2.2 in Freedman 1981 show that

(22)

where we have used the algebraic identity ê_i = ε_i − X_i^⊤(β̂ − β), which holds for any estimator β̂. This completes the proof.∎

Appendix C Proof of Theorem 3

Proof.

We begin with a simple bias-variance decomposition,

(23)

We will handle the bias and variance terms separately. To consider the bias term, note that where

Hence,

(24)

If we let , then the eigenvalues of are of the form . In particular, it is simple to check¹ (¹Note that if and , then is maximized at . Also, if , then there is at least one that scales at the rate of .) that whenever , and so

(25)

Note that this bound is tight, since it is achieved whenever is parallel to the top eigenvector of .

To consider the variance term, note that and so

(26)

It is natural to decompose the sum in terms of the index set

(27)

which satisfies . We will bound the variance term in two complementary cases; either or . First assume . Then,

(28)
(29)