1 Introduction
Until recently, much of the emphasis in the theory of highdimensional statistics has been on “first order” problems, such as estimation and prediction. As the understanding of these problems has become more complete, attention has begun to shift increasingly towards “second order” problems, dealing with hypothesis tests, confidence intervals, and uncertainty quantification
[1, 2, 3, 4, 5, 6]. In this direction, much less is understood about the effects of structure, regularization, and dimensionality — leaving many questions open. One collection of such questions that has attracted growing interest deals with the operating characteristics of the bootstrap in high dimensions [7, 8, 9]. Due to the fact that bootstrap is among the most widely used tools for approximating the sampling distributions of test statistics and estimators, there is much practical value in understanding what factors allow for the bootstrap to succeed in the highdimensional regime.
The regression model and linear contrasts.
In this paper, we focus our attention on highdimensional linear regression, and our aim is to know when the residual bootstrap (RB) method consistently approximates the laws of linear contrasts. (A review of RB is given in Section 2.)
To specify the model, suppose that we observe a response vector
, generated according to(1) 
where is the observed design matrix, is an unknown vector of coefficients, and the error variables are drawn i.i.d. from an unknown distribution , with mean
and unknown variance
. As is conventional in highdimensional statistics, we assume the model (1) is embedded in a sequence of models indexed by . Hence, we allow , , and to vary implicitly with . We will leave unconstrained until Section 3.3, where we will assume in Theorem 3, and then in Section 3.4, we will assume further that is bounded strictly between 0 and 1. The distribution is fixed with respect to , and none of our results requireto have more than four moments.
Although we are primarily interested in cases where the design matrix is deterministic, we will also study the performance of the bootstrap conditionally on a Gaussian design. For this reason, we will use the symbol even when the design is nonrandom so that confusion does not arise in relating different sections of the paper. Likewise, the symbol refers to unconditional expectation over all sources of randomness. Whenever the design is random, we will assume , denoting the distribution of by , and the distribution of by .
Within the context of the regression, we will be focused on linear contrasts , where is a fixed vector and is an estimate of . The importance of contrasts arises from the fact that they unify many questions about a linear model. For instance, testing the significance of the th coefficient may be addressed by choosing to be the standard basis vector . Another important problem is quantifying the uncertainty of point predictions, which may be addressed by choosing , i.e. the th row of the design matrix. In this case, an approximation to the law of the contrast leads to a confidence interval for the mean response value . Further applications of contrasts occur in the broad topic of ANOVA [10].
Intuition for structure and regularization in RB.
The following two paragraphs explain the core conceptual aspects of the paper. To understand the role of regularization in applying RB to highdimensional regression, it is helpful to think of RB in terms of two ideas. First, if
denotes the ordinary least squares estimator, then it is a simple but important fact that contrasts can be written as
where . Hence, if it were possible to sample directly from , then the law of any such contrast could be easily determined. Since is unknown, the second key idea is to use the residuals of some estimator as a proxy for samples from . When , the leastsquares residuals are a good proxy [11, 12]. However, it is wellknown that leastsquares tends to overfit when . When fits “too well”, this means that its residuals are “too small”, and hence they give a poor proxy for . Therefore, by using a regularized estimator , overfitting can be avoided, and the residuals of may offer a better way of obtaining “approximate samples” from .The form of regularized regression we will focus on is ridge regression:
(2) 
where is a userspecificed regularization parameter. As will be seen in Sections 3.2 and 3.3, the residuals obtained from ridge regression lead to a particularly good approximation of when the design matrix is nearly lowrank, in the sense that most of its singular values are close to 0. In essence, this condition is a form of sparsity, since it implies that the rows of nearly lie in a lowdimensional subspace of . However, this type of structural condition has a significant advantage over the the more wellstudied assumption that is sparse. Namely, the assumption that is nearly lowrank can be inspected directly in practice — whereas sparsity in is typically unverifiable. In fact, our results will impose no conditions on , other than that remains bounded as . Finally, it is worth noting that the occurrence of near lowrank design matrices is actually very common in applications, and is often referred to as collinearity [13, ch. 17].
Contributions and outline.
The primary contribution of this paper is a complement to the work of Bickel and Freedman [12] (hereafter B&F 1983) — who showed that in general, the RB method fails to approximate the laws of leastsquares contrasts when . Instead, we develop an alternative set of results, proving that even when , RB can successfully approximate the laws of “ridged contrasts” for many choices of , provided that the design matrix is nearly low rank. A particularly interesting consequence of our work is that RB successfully approximates the law for a certain choice of that was shown in B&F 1983 to “break” RB when applied to leastsquares. Specifically, such a can be chosen as one of the rows of with a high leverage score (see Section 4). This example corresponds to the practical problem of setting confidence intervals for mean response values . (See [12, p. 41], as well as Lemma 2 and Theorem 4 in Section 3.4). Lastly, from a technical point of view, a third notable aspect of our results is that they are formulated in terms of the Mallows metric, which frees us from having to impose conditions that force a limiting distribution to exist.
Apart from B&F 1983, the most closely related works we are aware of are the recent papers [7] and [8], which also consider RB in the highdimensional setting. However, these works focus on role of sparsity in and do not make use of lowrank structure in the design, whereas our work deals only with structure in the design and imposes no sparsity assumptions on .
The remainder of the paper is organized as follows. In Section 2, we formulate the problem of approximating the laws of contrasts, and describe our proposed methodology for RB based on ridge regression. Then, in Section 3 we state several results that lay the groundwork for Theorem 4, which shows that that RB can successfully approximate all of the laws , , conditionally on a Gaussian design. Due to space constraints, all proofs are deferred to material that will appear in a separate work.
Notation and conventions.
If and
are random variables, then
denotes the law of , conditionally on . If and are two sequences of real numbers, then the notation means that there is an absolute constant and an integer such that for all . The notation means that and . For a square matrixwhose eigenvalues are real, we denote them by
.2 Problem setup and methodology
Problem setup.
For any , it is clear that conditionally on , the law of is completely determined by , and hence it makes sense to use the notation
(3) 
The problem we aim to solve is to approximate the distribution for suitable choices of .
Review of the residual bootstrap (RB) procedure.
We briefly explain the steps involved in the residual bootstrap procedure, applied to the ridge estimator of . To proceed somewhat indirectly, consider the following “biasvariance” decomposition of , conditionally on ,
(4) 
Note that the distribution has mean zero, and so that the second term on the right side is the bias of as an estimator of . Furthermore, the distribution may be viewed as the “variance component” of . We will be interested in situations where the regularization parameter may be chosen small enough so that the bias component is small. In this case, one has and then it is enough to find an approximation to the law , which is unknown. To this end, a simple manipulation of leads to
(5) 
Now, to approximate , let be any centered estimate of . (Typically, is obtained by using the centered residuals of some estimator of , but this is not necessary in general.) Also, let be an i.i.d. sample from . Then, replacing with in line (5) yields
(6) 
At this point, we define the (random) measure to be the RB approximation to . Hence, it is clear that the RB approximation is simply a “plugin rule”.
A twostage approach.
An important feature of the procedure just described is that we are free to use any centered estimator of . This fact offers substantial flexibility in approximating . One way of exploiting this flexibility is to consider a twostage approach, where a “pilot” ridge estimator is used to first compute residuals whose centered empirical distribution function is , say. Then, in the second stage, the distribution is used to approximate via the relation (6). To be more detailed, if are the residuals of , then we define to be the distribution that places mass at each of the values with . Here, it is important to note that the value is chosen to optimize as an approximation to . By contrast, the choice of
depends on the relative importance of width and coverage probability for confidence intervals based on
. Theorems 1, 3, and 4 will offer some guidance in selecting and .Resampling algorithm.
To summarize the discussion above, if is userspecified number of bootstrap replicates, our proposed method for approximating is given below.

Select and , and compute the residuals .

Compute the centered distribution function , putting mass at each .

For :

Draw a vector of i.i.d. samples from .

Compute .


Return the empirical distribution of .
Clearly, as , the empirical distribution of converges weakly to , with probability 1. As is conventional, our theoretical analysis in the next section will ignore Monte Carlo issues, and address only the performance of as an approximation to .
3 Main results
The following metric will be central to our theoretical results, and has been a standard tool in the analysis of the bootstrap, beginning with the work of Bickel and Freedman [14].
The Mallows (Kantorovich) metric.
For two random vectors and in a Euclidean space, the Mallows metric is defined by
(7) 
where the infimum is over the class
whose marginals are and . It is worth noting that convergence in is strictly stronger than weak convergence, since it also requires convergence of second moments. Additional details may be found in the paper [14].3.1 A biasvariance decomposition for bootstrap approximation
To give some notation for analyzing the biasvariance decomposition of in line (4), we define the following quantities based upon the ridge estimator . Namely, the variance is
To express the bias of , we define the vector according to
(8) 
and then put
(9) 
We will sometimes omit the arguments of and to lighten notation. Note that does not depend on , and only depends on through .
The following result gives a regularized and highdimensional extension of some lemmas in Freedman’s early work [11] on RB for least squares. The result does not require any structural conditions on the design matrix, or on the true parameter .
Theorem 1 (consistency criterion).
Suppose is fixed. Let be any estimator of , and let be any vector such that . Then with probability 1, the following inequality holds for every , and every ,
(10) 
Remarks.
Observe that the normalization ensures that the bound is nontrivial, since the distribution has variance equal to 1 for all (and hence does not become degenerate for large ). To consider the choice of , it is simple to verify that the ratio decreases monotonically as decreases. Note also that as becomes small, the variance becomes large, and likewise, confidence intervals based on become wider. In other words, there is a tradeoff between the width of the confidence interval and the size of the bound (10).
Sufficient conditions for consistency of RB.
An important practical aspect of Theorem 1 is that for any given contrast , the variance can be easily estimated, since it only requires an estimate of , which can be obtained from . Consequently, whenever theoretical bounds on and are available, the right side of line (10) can be controlled. In this way, Theorem 1 offers a simple route for guaranteeing that RB is consistent. In Sections 3.2 and 3.3 to follow, we derive a bound on in the case where is chosen to be . Later on in Section 3.4, we study RB consistency in the context of prediction with a Gaussian design, and there we derive high probability bounds on both and where is a particular row of .
3.2 A link between bootstrap consistency and MSPE
If is an estimator of , its meansquared prediction error (MSPE), conditionally on , is defined as
(11) 
The previous subsection showed that inlaw approximation of contrasts is closely tied to the approximation of . We now take a second step of showing that if the centered residuals of an estimator are used to approximate , then the quality of this approximation can be bounded naturally in terms of . This result applies to any estimator computed from the observations (1).
Theorem 2.
Suppose is fixed. Let be any estimator of , and let be the empirical distribution of the centered residuals of . Also, let denote the empirical distribution of i.i.d. samples from . Then for every ,
(12) 
Remarks.
As we will see in the next section, the MSPE of ridge regression can be bounded in a sharp way when the design matrix is approximately low rank, and there we will analyze for the pilot estimator. Consequently, when near lowrank structure is available, the only remaining issue in controlling the right side of line (12) is to bound the quantity . The very recent work of Bobkov and Ledoux [15] provides an indepth study of this question, and they derive a variety bounds under different tail conditions on . We summarize one of their results below.
Lemma 1 (Bobkov and Ledoux, 2014).
If has a finite fourth moment, then
(13) 
Remarks. The fact that the squared distance is bounded at the rate of is an indication that is a rather strong metric on distributions. For a detailed discussion of this result, see Corollaries 7.17 and 7.18 in the paper [15]. Although it is possible to obtain faster rates when more stringent tail conditions are placed on , we will only need a fourth moment, since the term in Theorem 2 will often have a slower rate than , as discussed in the next section.
3.3 Consistency of ridge regression in MSPE for near low rank designs
In this subsection, we show that when the tuning parameter is set at a suitable rate, the pilot ridge estimator is consistent in MSPE when the design matrix is near lowrank — even when is large, and without any sparsity constraints on . We now state some assumptions.
A 1.
There is a number , and absolute constants , such that
A 2.
There are absolute constants , such that for every , and .
A 3.
The vector satisfies .
Due to Theorem 2, the following bound shows that the residuals of may be used to extract a consistent approximation to . Two other notable features of the bound are that it is nonasymptotic and dimensionfree.
Theorem 3.
Remarks.
Since the eigenvalues are observable, they may be used to estimate and guide the selection of . However, from a practical point of view, we found it easier to select via crossvalidation in numerical experiments, rather than via an estimate of .
A link with Pinsker’s Theorem.
In the particular case when
is a centered Gaussian distribution, the “prediction problem” of estimating
is very similar to estimating the mean parameters of a Gaussian sequence model, with error measured in the norm. In the alternative sequencemodel format, the decay condition on the eigenvalues of translates into an ellipsoid constraint on the mean parameter sequence [16, 17]. For this reason, Theorem 3 may be viewed as “regression version” of error bounds for the sequence model under an ellipsoid constraint (cf. Pinsker’s Theorem, [16, 17]). Due to the fact that the latter problem has a very well developed literature, there may be various “neighboring results” elsewhere. Nevertheless, we could not find a direct reference for our stated MSPE bound in the current setup. For the purposes of our work in this paper, the more important point to take away from Theorem 3 is that it can be coupled with Theorem 2 for proving consistency of RB.3.4 Confidence intervals for mean responses, conditionally on a Gaussian design
In this section, we consider the situation where the design matrix has rows
drawn i.i.d. from a multivariate normal distribution
, with . (The covariance matrix may vary with .) Conditionally on a realization of , we analyze the RB approximation of the laws . As discussed in Section 1, this corresponds to the problem of setting confidence intervals for the mean responses . Assuming that the population eigenvalues obey a decay condition, we show below in Theorem 4 that RB succeeds with high probability. Moreover, this consistency statement holds for all of the laws simultaneously. That is, among the distinct laws , , even the worst bootstrap approximation is still consistent. We now state some populationlevel assumptions.A 4.
The operator norm of satisfies .
Next, we impose a decay condition on the eigenvalues of . This condition also ensures that is invertible for each fixed — even though the bottom eigenvalue may become arbitrarily small as becomes large. It is important to notice that we now use for the decay exponent of the population eigenvalues, whereas we used when describing the sample eigenvalues in the previous section.
A 5.
There is a number , and absolute constants , such that for all ,
A 6.
There are absolute constants such that for all , we have the bounds
and .
The following lemma collects most of the effort needed in proving our final result in Theorem 4. Here it is also helpful to recall the notation and from Assumption 2.
Lemma 2.
Suppose that the matrix has rows drawn i.i.d. from , and that Assumptions 2–6 hold. Furthermore, assume that chosen so that . Then, the statements below are true.
(i) (bias inequality)
Fix any . Then, there is an absolute constant , such that for all large , the following event holds with probability at least ,
(15) 
(ii) (variance inequality)
There are absolute constants such that for all large , the following event holds with probability at least ,
(16) 
(iii) (mspe inequalities)
Suppose that is chosen as when , and that is chosen as when . Then, there are absolute constants such that for all large ,
Remarks.
Note that the two rates in part (iii) coincide as approaches . At a conceptual level, the entire lemma may be explained in relatively simple terms. Viewing the quantities , and as functionals of a Gaussian matrix, the proof involves deriving concentration bounds for each of them. Indeed, this is plausible given that these quantities are smooth functionals of . However, the difficulty of the proof arises from the fact that they are also highly nonlinear functionals of .
Remark.
Lemma 2 gives explicit bounds on the numbers , as well as the probabilities of the corresponding events, but we have stated the result in this way for the sake of readability.
4 Simulations
In four different settings of and the decay parameter , we compared the nominal confidence intervals (CIs) of four methods: “oracle”, “ridge”, “normal”, and “OLS”, to be described below. In each setting, we generated random designs with i.i.d. rows drawn from , where ,
, and the eigenvectors of
were drawn randomly by setting them to be the factor in a decomposition of a standard Gaussian matrix. Then, for each realization of , we generated realizations of according to the model (1), where , and is the centered distribution on degrees of freedom, rescaled to have standard deviation . For each , and each corresponding , we considered the problem of setting a CI for the mean response value , where is the row with the highest leverage score, i.e. and . This problem was shown in B&F 1983 to be a case where the standard RB method based on leastsquares fails when . Below, we refer to this method as “OLS”.To describe the other three methods, “ridge” refers to the interval , where is the quantile of the numbers computed in the proposed algorithm in Section 2, with and . To choose the parameters and for a given and , we first computed as the value that optimized the MSPE error of a ridge estimator with respect to 5fold cross validation; i.e. cross validation was performed for every distinct pair . We then put and , as we found the prefactors 5 and 0.1 to work adequately across various settings. (Optimizing with respect to MSPE is motivated by Theorems 1, 2, and 3. Also, choosing to be somewhat smaller than conforms with the constraints on and in Theorem 4.) The method “normal” refers to the CI based on the normal approximation , where , , and
is the usual unbiased estimate of
based on OLS residuals. The “oracle” method refers to the interval , with , and being the empirical quantile of over all realizations of based on a given . (This accounts for the randomness in .)Within a given setting of the triplet , we refer to the “coverage” of a method as the fraction of the instances where the method’s CI contained the parameter . Also, we refer to “width” as the average width of a method’s intervals over all of the instances. The four settings of correspond to moderate/high dimension and moderate/fast decay of the eigenvalues . Even in the moderate case of , the results show that the OLS intervals are too narrow and have coverage noticeably less than 90%. As expected, this effect becomes more pronounced when . The ridge and normal intervals perform reasonably well across settings, with both performing much better than OLS. However, it should be emphasized that our study of RB is motivated by the desire to gain insight into the behavior of the bootstrap in high dimensions — rather than trying to outperform particular methods. In future work, we plan to investigate the relative merits of the ridge and normal intervals in greater detail.
oracle  ridge  normal  OLS  
setting 1  width  0.21  0.20  0.23  0.16 
coverage  0.90  0.87  0.91  0.81  
setting 2  width  0.22  0.26  0.26  0.06 
coverage  0.90  0.88  0.88  0.42  
setting 3  width  0.20  0.21  0.22  0.16 
coverage  0.90  0.90  0.91  0.81  
setting 4  width  0.21  0.26  0.23  0.06 
coverage  0.90  0.92  0.87  0.42 
Acknowledgements.
MEL thanks Prof. Peter J. Bickel for many helpful discussions, and gratefully acknowledges the DOE CSGF under grant DEFG0297ER25308, as well as the NSFGRFP.
Appendix A Proof of Theorem 1
Proof.
Due to line (4) and Lemma 8.8 in B&F 1981,
(18) 
If is a random vector whose entries are drawn i.i.d. from , then the definition of gives the matching relations
(19) 
To make use of these relations, we apply Lemma 8.9 in B&F 1981, which implies that if is a generic deterministic vector, and if and are random vectors with i.i.d. entries, then
Therefore,
(20) 
Combining this with line (18) and dividing through by proves the claim.∎
Appendix B Proof of Theorem 2
Proof.
By the triangle inequality,
(21) 
Let be the (uncentered) empirical distribution of the residuals of , which places mass at each value , for . The proofs of Lemmas 2.1 and 2.2 in Freedman 1981, show that
(22) 
where we have used the algebraic identity , which holds for any estimator . This completes the proof.∎
Appendix C Proof of Theorem 3
Proof.
We begin with a simple biasvariance decomposition,
(23) 
We will handle the bias and variance terms separately. To consider the bias term, note that where
Hence,
(24) 
If we let , then the eigenvalues of are of the form . In particular, it is simple to check^{1}^{1}1 Note that if and , then is maximized at . Also, if , then there at least one that scales at the rate of . that whenever , and so
(25) 
Note that this bound is tight, since it is achieved whenever is parallel to the top eigenvector of .
To consider the variance term, note that and so
(26) 
It is natural to decompose the sum in terms of the index set
(27) 
which satisfies . We will bound the variance term in two complementary cases; either or . First assume . Then,
(28)  
(29)  
Comments
There are no comments yet.