1 Introduction and Motivation
In the vast literature on high-dimensional linear regression, it has become customary to assume an underlying linear model along with a sparsity constraint on the true regression parameter. Although results exist for model misspecification, it is often not clear just what is being estimated. Suppose the statistician is unwilling to assume sparsity of the parameter or even just a linear model? Minimax lower bounds for this problem imply the impossibility of consistent estimation of the parameter vector without structural constraints; seeRaskutti et al. (2011). Thus, for consistent estimation with sparsity as the structural constraint, the number of non-zero elements of the parameter vector must be less than the sample size
. Now consider the following popular procedure in applied statistics and data science. High dimensional data is first explored either in a principled way (e.g., lasso or best subset selection) or even in an unprincipled way to select a manageable set of variables, and then linear regression is applied to the reduced data. For practical purposes, this final set of variables is often much smaller than the sample size and the total number of initial variables, and yet treated as a high-dimensional linear regression. By definition, this procedure makes use of all the data to come up with a “significant” subset of variables and no sparsity constraints are required. The current article is about understanding what is being estimated by this procedure in a model-free high-dimensional framework.
Variable selection plays a central role in data analysis when data on too many variables is available. This could be for logistical reasons or to obtain a parsimonious set of variables for interpretation purposes. As described above, it has been common practice to explore the data to first select a subset of variables, and then ignore the selection process for estimation and inference. The implications of such a method of data analysis have been recognized for a long time and can often be disastrous in terms of providing misleading conclusions, see Berk et al. (2013) and the references therein for a discussion. These considerations have led to the recent field of post-selection inference. Regression applications, in particular the structure of response and covariates, do not play any special role in this general problem, and the exploration methodology above is typically practiced whenever there are too many variables to consider for a final statistical analysis. In this paper, however, we focus on linear regression, as it leads to tractable closed form estimation yielding a more transparent analysis. We should mention that a similar analysis can be done for other -estimation problems; see Section 6 for more details.
In addressing the problem of post model-selection inference, perhaps the main question needing an answer is “what is being estimated by the estimator from the data analysis?” A major thrust of the present article is to provide an answer to this question in a very general setting for linear regression, an answer that will be seen to lead to a valid interpretation of the post selection linear regression estimator. This question was answered in a very restrictive setting in Berk et al. (2013) from an intuitive point of view. In particular, Berk et al. (2013)
assumed that the covariates in the data are fixed (non-random) and the response is normally distributed. This distributional assumption allows for a simple explanation of what is being estimated by the least squares linear regression estimator on a subset of covariates. We are not aware of other work that treats this question in the fully general setting we consider. However, in a related vein,Belloni and Chernozhukov (2013) established the rate of convergence results for the least squares linear regression estimator post lasso type model selection, see their Theorem 4, comparing its behavior with respect to the sparse oracle estimator.
Before answering the main question posed above one must clarify “what does it mean to say is estimating ?” It is natural to answer this by showing that is consistent for , however, because we are in a high-dimensional setting, the norm underlying this consistency must be made precise. To then answer our main question in full generality, we establish various deterministic inequalities that are uniform over subsets of covariates. Finally, we apply these inequalities to both independent and dependent data to obtain concrete rates of convergence. We use the dependence structure of data introduced by Wu (2005), which is based on the idea of coupling and covers the dependence structure of many linear and non-linear time series. In the process of applying our results to dependent observations, we prove a tail bound for zero mean dependent sums that extends the results of Wu and Wu (2016).
Our main results include uniform-in-model estimation error of the least squares linear regression estimator in terms of the -, -norms, and also uniform-in-model asymptotic linear representation of the estimator in terms of the -norm. Each model here corresponds to a distinct subset of covariates. These results are established for both independent and dependent observations. All of our results are non-asymptotic in nature and allow for the total number of covariates to grow almost exponentially in the sample size when the observations have exponential tails. The rates we obtain are comparable to the ones obtained by Portnoy (1988), see also Portnoy (1984, 1985) and He and Shao (1996)
for more results, though there are many differences in the settings considered. Portnoy assumes a true linear model with fixed covariates, but deals with a more general class of loss functions and his results are not uniform-in-model.
There is a rich literature on uniform asymptotic linear representations, which have been used in optimal -estimation problems. See Section 4 of Arcones (2005) and Sections 10.2, 10.3, Equation (10.25) of Dodge and Jurevckova (2000) for examples where uniform asymptotic linear representations are established for a large class of -estimators indexed by a subset of
. The main focus there is to choose a tuning parameter that asymptotically leads to an estimator with “smallest” variance, and to take into account this randomness in proving that the final estimator with the tuning parameter estimate has an asymptotic normal distribution with “smallest” variance. It is possible to derive some of our results by viewing the problem of least squares linear regression of the response on a subset of covariates as a parametrized-estimation problem indexed by the set of subsets, and then applying the general results of Arcones (2005).
The remainder of our paper is organized as follows. In Section 2, we introduce our notation and general framework. In Section 3, we derive various deterministic inequalities for linear regression that form the core of the paper. The application of these results to the case of independent observations is considered in Section 4. The application of the deterministic inequalities to the case of dependent observations is considered Section 5. An extension of our results to a class of general -estimators is given in Section 6. Proofs of the results in this section along with several examples will be provided in a future paper. A discussion of our results along with their implications is given in Section 7A and Appendix B, respectively.
Suppose are random vectors in . Throughout the paper, we implicitly think of as a function of and so the sequence of random vectors should be thought of as a triangular array. The term “model” is used to specify the subset of covariates used in the regression and does not refer to any probability model. We do not assume a linear model (in any sense) to be true anywhere for any choice of covariates in this or in the subsequent sections. In this sense all our results are applicable in the case of misspecified linear regression models.
For any vector for and , let denote the -th coordinate of . For any non-empty model given by a subset of , let denote a sub-vector of with indices in . For instance, if and , then . The notation is used to denote the cardinality of . For any non-empty model and any symmetric matrix , let denote the sub-matrix of with indices in and for , let denotes the value at the -th row and the -th column of . Define the -norm of a vector for as
Let denote the number of non-zero entries in (note this is not a norm). For any matrix , let
denote the minimum eigenvalue of. Also, let the elementwise maximum and the operator norm be defined, respectively, as
The following inequalities will be used throughout without any special mention. For any matrix and ,
For any , define the set of models
so that is the power set of with the deletion of the empty set. The set denotes the set of all non-empty models of size bounded by . The main importance of our results is the “uniform-in-model” feature. These results are proved uniform over for some that is allowed to diverge with .
Traditionally, it is common to include an intercept term when fitting the linear regression. To avoid extra notation, we assume that all covariates under consideration are included in the vectors . So, take the first coordinate of all ’s to be 1, that is, for all if an intercept is required. For any
, define the ordinary least squares empirical risk (or objective) function as
Expanding the square function it is clear that
Only the second and the third term depend on and since the quantities in these terms play a significant role in our analysis, define
The least squares linear regression estimator is defined as
Based on the quadratic expansion (4) of the empirical objective , the estimator is given by the closed form expression
assuming non-singularity of . It is worth mentioning that is not equal to . The matrix being the average of rank one matrices in , its rank is at most . This implies that the least squares estimator is not uniquely defined unless .
It is clear from Equation (7) that is a (non-linear) function of two averages and
. Assuming for a moment that the random vectorsare independent and identically distributed (iid) with finite fourth moments, it follows that and converge in probability to their expectations. The iid assumption here can be relaxed to weak dependence and non-identically distributed random vectors. Define the “expected” matrix and vector as
If the convergence (in probability) of to holds, then by a Slutsky type argument, it follows that converges to , where
These convergence statements are only about a single model and not uniform. By uniform-in-model -norm consistency of to for , we mean that
As shown above convergence of to only requires convergence of to and to . The specific structure of these matrices being the average of random matrices and random vectors is not required. In the following section in proving deterministic inequalities, we generalize the linear regression estimator by the function as
assuming the existence of the inverse of . We call this the linear regression map. In the next section, we shall bound
in terms of the differences and In this regard, thinking of as a function of , our results are essentially about studying Lipschitz continuity properties and understanding what kind of norms are best suited for this purpose. The following error norms will be very useful for these results:
The quantity RIP is a norm for any and is not a norm for . This error norm is very closely related to the restricted isometry property used in compressed sensing and high-dimensional linear regression literature. Also, define the -sparse minimum eigenvalue of a matrix as
3 Deterministic Results for Linear Regression
All our results in this section depend on the error norms and in (11). These are, respectively, the maximal -sparse eigenvalue of and the maximal -sparse -norm of . At first glance, it may not be clear how these quantities behave. We first present a simple inequality for RIP and in terms of and .
For any ,
It is easy to see that
Here we have used inequalities (1). A similar proof implies the second result. ∎
In many cases, it is much easier to control the maximum elementwise norm rather than the RIP error norm. However the factor on the right hand side often leads to sub-optimal dependence in the dimension. For the special cases of independent and dependent random vectors discussed in Sections 4 and 5, we directly control RIP and .
The sequence of lemmas to follow are related to uniform consistency in - and -norms. To state these results, the following quantities that represent the strength of regression (or linear association) are required. For
(Uniform -consistency) Let be any integer such that
Then simultaneously for all ,
Recall from the linear regression map (10) that
Fix . Then
By definition of the operator norm,
To control , note that
represents the identity matrix of dimension. Now combining bounds on , we get
Using the triangle inequality of -norm and assumption (14), it follows for all that
This proves the result. ∎
As will be seen in the application of Theorem 3.1, the complicated looking bound provided above gives the optimal bound. Combining Proposition 3.1 and Theorem 3.1, we get the following simple corollary that gives sub-optimal rates.
Let be any integer such that
Remark 3.1 (Bounding in (13)) The bound for uniform -consistency requires a bound on in addition to bounds on the error norms related to -matrices and -vectors. It is a priori not clear how this quantity might vary as the dimension of the model changes. In the classical analysis of linear regression where a true linear model is assumed, the true parameter vector is seen as something chosen by nature and hence its norm is not in control of statistician. So, in the classical analysis, an growth rate on is imposed as an assumption.
From the viewpoint taken in this paper under misspecification the nature picks the whole distribution sequence of random vectors and the quantity came up in the analysis. In the full generality of linear regression maps, we do not know of any techniques to bound the norm of this vector. It is, however, possible to bound this, if is defined by a least squares linear regression problem. Recall the definition of from (8) and from (9). Observe that by definition of ,
Hence for every ,
It is immediate from these results that if the second moment of the response is uniformly bounded, then behaves like a constant when is well-conditioned. See Foygel and Srebro (2011) for a similar calculation.
Based on uniform-in-model -bound, the following result is trivially proved.
(Uniform -consistency) Let be such that
Then simultaneously for all ,
The proof follows by using the first inequality in (1). ∎
The results above only prove a rate of convergence which gives uniform consistency. These results are not readily applicable for inference. From classical asymptotic theory, we know that for inference about a parameter an asymptotic distribution result is required. It is also well-known that asymptotic normality of an estimator is usually proved by proving an asymptotic linear representation. In what follows we prove uniform-in-model linear representation for the linear regression map. The result in terms of the regression map itself can be too abstract. For this reason, it might be helpful to revisit the usual estimators and from (6) and (9) to understand what kind of representation is possible. From the definition of , we have
Assuming and are close, one would expect
Note, by substituting all the definitions, that
This being an average the left hand side quantity in (20) is called the linear representation error. Now using essentially the same argument and letting (and ) take place of (and ), we get the following result. Recall the notation and from Equations (13) and (12).
(Uniform Linear Representation) Let be any integer such that
Then for all models ,
Furthermore, using Theorem 3.1, we get
From the definition (10) of , we have
Adding and subtracting from in (24), it follows that
Now adding and subtracting from in this equation, we get
The right hand side is almost the quantity we need to bound to complete the result. Multiplying both sides of the equation by and then applying the Euclidean norm implies that for ,
This proves the first part of the result. The second part of the result follows by the application of Theorem 3.1. ∎
If the minimal and maximal -sparse eigenvalues of are of the same order, then the upper and lower bounds for the linear representation error match up to the order under the additional assumption that the minimal and maximal sparse eigenvalues of are of the same order.
Remark 3.3 (Improved -Error Bounds) Uniform linear representation error bounds (22) and (23) prove more than just linear representation. These bounds allow us to improve the bounds provided for uniform -consistency. Bound (22) is of the form
Therefore, assuming , it follows that for all ,
This is a more precise result than informed by Theorem 3.1 since here we characterize the estimation error exactly up to a factor of 2. Also, note that in case of and the upper and lower bounds here are Euclidean norms of averages of random vectors. Dealing with linear functionals like averages is much simpler than dealing with non-linear functionals such as .
instead of . Here is the identity matrix in . Bounding this quantity might not require bounded condition number of , however, we only deal with in the following sections.
Summarizing all the results in this section it is enough to control
to derive uniform-in-model results in any linear regression type problem. In this respect, these are the norms in which one should measure the accuracy of the gram matrix and the inner product of covariates and response. So, if one wants to use shrinkage estimators because and are high-dimensional “objects”, then the estimation accuracy should be measured with respect to RIP and for uniform-in-model type results.
Before proceeding to the rates of convergence of these error norms for independent and dependence data, we describe the importance of defining the linear regression map with general matrices instead of just gram matrices. Of course, it is more general now but it would be worthless in case no interesting applications exist. The goal now is to provide a few such interesting examples.
Heavy-Tailed Observations: The -norm is a supremum over all models of size less than and so the supremum is over
models. Note that this bound is polynomial in the total number of covariates but is exponential in the size of models under consideration. Therefore, if the total number of covariates is allowed to diverge, then the question we are interested in is inherently high-dimensional. If the usual gram matrices are used then
and so, RIP in this case is a supremum of at least many averages. As is well-understood from the literature on concentration of measure or even the union bound, one would require exponential tails on the initial random vectors to allow a good control on if the usual gram matrix is used. Does this mean that the situation is hopeless if the initial random vectors do not exponential tails? The short answer is not necessarily. Viewing the matrix (the “population” gram matrix) as a target, there have been many variations of the sample mean gram matrix estimator that are shown to provide exponential tails even though the initial observations are heavy tailed. See, for example, Catoni (2012), Wei and Minsker (2017) and Catoni and Giulini (2017) along with the references therein for more details on the estimator and its properties. It should be noted that they do not study the estimator accuracy with respect to the RIP-norm. We do not prove it here and will be explored in the future.
Real data, more often than not, is contaminated with outliers and it is a hard problem to remove/classify observations in case contamination is present. Robust statistics provide estimators that can ignore or down-weigh the observations suspected to be outliers and behave comparably when there is no contamination present in the data. Some simple examples include entry-wise median, or trimmed mean. SeeMinsker (2015) and reference therein for some more examples. Almost none of these estimators are simple averages but behave regularly in the sense that they can be expressed as averages up to a negligible asymptotic error. Chen et al. (2013) provide a simple estimator of the gram matrix under adversarial corruption and case-wise contamination.
Indirect Observations: This example is taken from Loh and Wainwright (2012). The setting is as follows. Instead of observing the real random vectors , , , we observe a sequence with linked with via some conditional distribution that is for ,
As discussed in page 4 of Loh and Wainwright (2012), this setting includes some interesting cases like missing data and noisy covariates. A brief hint of the settings is given below:
If with independent of . Also, is assumed to be of mean zero with a known covariance matrix.
For some fraction , we observe a random vector such that for each component , we independently observe with probability and with probability .
If where is again a random vector independent of and is the Hadamard product. The problem of missing data is a special case.
On page 6, Loh and Wainwright (2012) provide various estimators in place of in (5). The assumption in Lemma 12 of Loh and Wainwright (2012) is essentially a bound on the RIP-norm in our notation and they verify this assumption in all the examples above. So, all our results in this section apply to these settings.
In the following two sections, we prove finite sample non-asymptotic bounds for and for
Let be any integer such that . Then for all models ,
It is worth recalling here that and are non-random matrices given in (8). So, Theorem 3.4 proves an asymptotic linear representation. Remark 3.5 (Non-uniform Bounds) The bound above applies for any satisfying the assumption and noting that for , as well as , Theorem 3.4 implies that
The point made here is that even though the bound in Theorem 3.4 only uses the maximal model size, it can recover model size dependent bounds since the result is proved for every .
Remark 3.6 (Post-selection Consistency) One of the main importance of our results is in proving consistency of the least squares linear regression estimator after data exploration. Suppose a random model chosen based on data satisfies with probability converging to one. Then with probability converging to one,
Similar bound also holds for the linear representation error. Therefore, the uniform-in-model results above allows one to prove consistency and asymptotic normality of the least squares linear regression estimator after data exploration. See Belloni and Chernozhukov (2013) for similar applications and methods of choosing the random model .
Remark 3.7 (Bounding ) As shown in Remark 3.1, for the setting of averages