Various statistical procedures have been proposed over the last decade for solving high dimensional statistical problems, where the dimensionality of the parameter space may exceed or even be much larger than the sample size. Under certain low-dimensional structural assumption such as sparsity, the high-dimensional problem becomes statistically identifiable and estimation procedures are constructed in various ways to achieve estimation minimax optimality[21, 1, 31, 25, 29, 9, 34, 22] and variable selection consistency [33, 16, 27, 28]. See the book  and the survey article  for a selective review on this subject.
On the other hand, due to the intractable limiting distribution of sparsity-inducing estimators such as the Lasso , little progress has been made on how to conduct inference. Vanilla bootstrap and subsampling techniques fail to work for the Lasso even in the low-dimensional regime due to the non-continuity and the unknown parameter-dependence of the limiting distribution . Moreover, Leeb has shown in a series of his work [13, 14, 15] that there is no free lunch—one cannot achieve super-efficiency and accurate estimation of the sampling distribution of the super-efficient estimator at the same time. Recently, initiating by the pioneer work [32, 24, 10], people start seeking point estimators in high-dimensional problems that are not super-efficient but permit statistical inference, such as constructing confidence intervals and conducting hypothesis testing. Reviews and comparisons regarding other statistical approaches for quantifying uncertainties in high-dimensional problems can be found in  and .
consistency of the Lasso estimator. This new class of post-processing estimators are no longer super-efficient but shown to admit asymptotically normal limiting distributions, which facilitates statistical inference. However, as we empirically observed in the numerical experiments, this solution is still not satisfactory since confidence intervals based on these de-sparsified estimators tend to be under-coveraged for unknown signals with non-zero true values, meaning that the actual coverage probabilities of the confidence intervals are lower than their nominal significance level; and tend to be over-coveraged for zero unknown true signals, meaning that the actually coverage probability higher than nominal. This unappealing practical performance of de-sparsified estimators can be partly explained by the somehow crude de-biasing procedure for correcting the bias, as the “bias-corrected” estimator has not been fully escaped from super-efficiency.
In this paper, we take a different route by directly imposing a zero-bias constraint for the low-dimensional parameter of interest accompanied by an -penalized procedure for estimating the remaining high–dimensional nuisance parameter
. This new zero-bias constraint requires the projection of the fitted residual of the response vector onto certain carefully chosen directions to vanish. From a semiparametric perspective, a carefully chosen constraint has the effect of forcing the efficient score function along certain least favourable submodel to be close to zero when evaluated near the truth. We show that the resultingConstrained Lasso (CLasso) estimator admits an asymptotically normal limit and achieves optimal semiparametric efficiency, meaning that its asymptotic covariance matrix attains the semiparametric Cramér-Rao lower bound. We propose an iterative algorithm for numerically computing the CLasso estimator via iteratively updating via solving a linear system, and updating
via solving a Lasso programming. The algorithm enjoys globally linear convergence up to the statistical precision of the problem, meaning the typical distance between the sampling distribution of the estimator and its asymptotic normal distribution. Different from gradient-based procedures where the optimization error typically contracts at a constant factor independent of sample sizeand dimensionality (but depends on the conditional number) of the problem, our algorithm exhibits a contraction factor proportional to (this quantity encodes the typical difficulty of high-dimensional statistical problems with sparsity level ) that decays towards zero as . Moreover, our algorithm involves no step size and is tuning free. In our numerical experiments, a few iterations such as are typically suffice for the algorithm to well converge.
More interestingly, we find a close connection between the CLasso and the aforementioned de-sparsified procedures proposed in —when initialized at the Lasso estimator, the de-sparsified estimator is shown to be asymptotically equivalent to the first iterate from our iterative algorithm for solving the CLasso. This close connection explains and solves the under- and over-coverage issue associated with the de-sparsified estimator: by refining the de-sparsified estimator through more iterations, the resulting CLasso estimator is capable of escaping from the super-efficiency region centered around the Lasso initialization, and leads to more balanced coverages for truth unknown signals with both zero and non-zero values. Depending on the convergence speed of the algorithm, the improvement on the coverage can be significant—this also suggests a poor performance of the de-sparsified estimator when algorithmic rate of convergence is large.
Overall, our results suggest that by incorporating constraints with widely-used high dimensional penalized methods, we are able to remove the bias term appearing in limiting distributions of low-dimensional components in high-dimensional models that prevents us from conducting statistical inference at the price of losing super-efficiency. By carefully selecting the constraints, we can achieve the best efficiency in the semiparametric sense.
The rest of the paper is organized as follows. In Section 2, we motivate and formally introduce the CLasso method. We also propose an iterative algorithm for implementing the CLasso, and discuss its relation with de-sparsified estimators. In Section 3, we provide theory of the proposed method, along with a careful convergent analysis of the iterative algorithm. In Section 4, we conduct numerical experiments and apply our method to a real data. We postpone all the proofs to Section 5 and conclude the paper with a discussion in Section 6.
2 Constrained Lasso
To begin with, we formulate the problem and describe the key observation in Section 2.1 that motivates our method. In Section 2.2, we formally introduce the new method, termed Constrained Lasso (CLasso), proposed in this paper. In Section 2.3, we describe an iterative algorithm for implementing the CLasso. In Section 2.4, we illustrate a close connection between the proposed method and a class of de-sparsifying based methods.
Consider the linear model:
where is the design matrix, is the unknown regression coefficient vector, is the response vector and is a Gaussian noise vector. Suppose among all components of , we are only interested in conducting statistical inference for its first components, denoted by , and the remaining part is a nuisance parameter. Correspondingly, we divide the design matrix into two parts: design matrix for the parameter of interest and design matrix for the nuisance part. Under this setup, we can rewrite the model into a semiparametric form:
We are interested in the regime where the nuisance parameter is high-dimensional, or , while the parameter of interest is low-dimensional, or . A widely-used method for estimating the regression coefficient , or the pair, is the Lasso ,
where is a regularization parameter controlling the magnitude of the -penalty term in the objective function. Under the assumption that the true unknown regression coefficient vector is -sparse with , the optimal scaling of regularization parameter is of order , leading to minimax-rate of estimation and prediction . However, due to the -penalty term, the resulting estimator is biased, with a bias magnitude proportional to (see, for example,  for fix-dimensional results and  for high-dimensional results). This -magnitude bias destroys the root -consistency of as the dimensionality grows with , rendering statistical inference for an extremely difficult task. The most relevant method in the literature is a class of post-processing procedures developed in [32, 24, 10], where an estimator of is constructed by removing from the original Lasso estimator an“estimated bias term” that prevents from achieving the root- consistency. As we discussed in the introduction, this post-processing procedure tends to have unappealing empirical performance due to the seemingly crude bias-correction when the original statistical problem is hard, meaning that is relatively large.
In this work, we take a different route by directly removing the bias term through combining a bias-eliminating constraint with the Lasso procedure (3). This new approach deals with the bias directly and is free of post-processing. Surprisingly, as we will show in Section 2.4, the de-sparsified Lasso estimator proposed in  is asymptotically equivalent to the first iterate in our algorithm (Algorithm. 1) for solving the Constrained Lasso (CLASSO).
To motivate the method, let us first consider a naive correction to the original Lasso programming (3) by removing the penalty term of , which will be referred to as the un-penalized Lasso (UP Lasso),
It can be shown that the KKT condition of the UP Lasso is
By plugging in , where denotes the true parameter, and rearranging the terms, the first KKT condition on can be rearranged as
Under some reasonable assumption on , the first term converges to a normal limit, while the second term has a typical order that does not vanish as increases. As a consequence, the UP Lasso estimate still fails to achieve the root -consistency of .
After taking a more careful look at the decomposition (6), we find that the second bias term will be exactly zero if columns of are orthogonal to columns of (or designs of and are orthogonal). This suggests that the last bias term in is primarily due to the non-zero projection of the bias in the nuisance part onto the column space of . Consequently, if we replace the KKT condition (5a) on by
for some suitable matrix such that product is close to zero in some proper metric, then the same argument leads to
where we use to denote the residual of after subtracting . Since now is close to zero, the second bias term will be vanished as , leading to the asymptotic normality of . This observation motivates the new method proposed in the next subsection.
According to the observations in the previous subsection, we propose a new high-dimensional -estimator of that simultaneous solves the following two estimation equations that are obtained via modifying the KKT conditions (5) of the UP Lasso,
where the construction of the critical matrix
will be specified later. At this moment, we can simply viewas some “good” matrix that makes the product
close to a zero matrix. The second equation (7b) corresponds to the KKT condition of the Lasso programming of regressing the residual on . Therefore, problem (7) can be expressed into an equivalent form as
We say columns of is in a general position, or simply is in a general position, if the affine span of any points , for arbitrary signs , does not contain any element of . In many examples, such as when entries of is drawn from a continuous distribution on , satisfies this condition. When is in a general position, for any the residual regression equation (8b) has a unique solution, which is also the unique solution of equation (7b) . Therefore, both equations (7) and (8) are well-posted.
We will refer to these two equivalent methods of estimating as Constrained Lasso (CLasso). In the special case of , equation (7) coincides with the KKT condition of the UP Lasso problem (4) and therefore UP Lasso (4) is a special case of the CLasso with . However, for general matrices, equation (7) may not correspond to the KKT condition of any optimization problem. The following theorem show that when is in a general position, the CLasso is a well-posed procedure with a unique solution with a high probability.
Suppose the assumptions in Section 3.1 holds. In addition suppose we have for some constant independent of (for precise definitions of those quantities, please refer to Section 3.1), and is in a general position and satisfies the sparse eigenvalue condition (SEC):
is in a general position and satisfies the sparse eigenvalue condition (SEC):for all vectors with sparsity level for some sufficiently large constant . Then under the same choice of as in Theorem 3, we have that with probability at least for some , the estimating equation (7) or equation (8) admits a unique solution.
According to results in Section 3.1, are constants and is typically of order . Consequently, the additional assumption is always satisfied as long as we are in the regime , where recall that is the sparsity of the true unknown nuisance parameter
. This assumption for statistical inference in high dimensional sparse linear regression turns out to be stronger than common sufficient conditionfor estimation consistency (see [4, 11] for some detailed discussions concerning these conditions). The sparse eigenvalue condition is also stronger than the restricted eigenvalue condition made in Section 3.1 for proving Lasso consistency, although both of them can be verified for a class of random design matrices .
Now we specify our choice for the critical matrix . Let us introduce some notation first. For any by matrix , we use to denote the th row of and its th column. For any vector and any index set , we use the shorthand to denote the vector formed by keeping the components whose indices are in unchanged and setting the rest to be zero. From now on, we always assume the design to be random with zero mean, meaning that rows and of the two design matrices and are i.i.d. random vectors with dimensions and , respectively, and satisfy and . Denote the covariance matrix of by
Under the random design assumption, the ideal choice of would be
since it satisfies the population level uncorrelated condition , from which we may expect its empirical version to be close to zero with a high probability. To motivate our constructing procedure for , we use another equivalent definition of as the minimizer of , or equivalently, for each , the th column is the minimizer of the mean squared residual , where recall that denotes the th component of any matrix . In practice, these population level quantities are rarely known and we propose to estimate each column via the following node-wise regression  by minimizing a penalized averaging squared residuals,
The CLasso has an interpretation from the semiparametric efficiency theory. Let
denote the probability distribution of the linear model (2) with parameter pair . When is not orthogonal to , is not the least favourable direction (see the following for a brief explanation) of the nuisance part for estimating . Therefore, equation (7a) is not the right constraint to impose. In fact, in model (2), the (multivariate) least favourable direction is given by (the same as previously defined in (9)), meaning that -dimensional the sub-model is the hardest parametric sub-problem within the original statistical distribution family that passes through , which is (this can be proved by applying the Gauss-Markov theorem, see Section 2.3.3 in  for a rigorous statement and more details). Therefore, in order to achieve the best asymptotic efficiency, we need focus on the score function (derivative of the negative likelihood function) along the path in the least favourable sub-model (the corresponding score function is called efficient score function),
and enforce it to be zero at to remove the bias, that is, by requiring
This constraint can be interpreted as force the impact of the bias from the nuisance part on the least favourable sub-model to vanish. When is not directly available, we may again replace it with any approximation under which , and therefore the efficient score function at , is close to zero. This also leads to the node-wise regression procedure (10) of choosing in the CLasso, where the KKT condition of the node-wise regression implies an element-wise sup-norm bound on the product (see Theorem 4
). This second semiparametric interpretation heuristically explains the optimality of the CLasso in terms of achieving the smallest asymptotic variance (for a rigorous statement, see Section 2.3.3 in and Corollary 5).
2.3 Iterative algorithm for solving the Constrained Lasso
We propose an iterative algorithm for solving the CLasso problem (7) and its equivalent form (8). More specifically, we iteratively solve from equation (7a) and from equation (7b) in an alternating manner. More precisely, at iteration with a current iterate for , the first equation (7a) yields an updating formula for as
In the case , this reduces to , which is the least square estimate for fitting the residual obtained by subtracting the nuisance part from the response. Next, given the newly updated iterate for , we update by using the equivalence between equation (7b) and equation (8b) via
Since this optimization problem shares the same structure as the Lasso programming by equating the response variable with the current residual, we can use the state-of-the-art algorithm (such as the glmnet package in R) of the Lasso programming to efficiently find . In practice, the following unadjusted Lasso estimate serves as a good initialization of the algorithm,
We may also consider a more general form of the algorithm by allowing the regularization parameter to change across the iterations. The reason for using a -dependent is as follows. In order for the algorithm to have globally exponential convergence from any initialization, we need to pick a slightly larger that grows proportionally to at the beginning. However, a larger tends to incur a large bias in , which in turn induces a large bias in . Therefore, as becomes close to as the algorithm proceeds, we may gradually reduce to make it close to the optimal with order . Rigorous analysis of the convergence of this algorithm and the associated estimation error bounds can be found in Section 3.4. At the end of this subsection, we summarize the full algorithm for implementing the CLasso in Algorithm. 1 below.
|Input: response , design matrices and|
|Output: Fitted and , and the asymptotic covariance matrix of|
|Find matrix :|
|Estimate noise variance via the scaled Lasso |
|Iterative algorithm for solving :|
|Initialize and at the Lasso solution, that is, set|
|For to ( is the number of iterations)|
|(Here as , for example, for ,|
|Output solution and .|
2.4 Relation with De-sparsified Lasso estimator
In this subsection, we discuss the relationship between the de-sparsified Lasso estimator [24, 32] and the proposed CLasso method. For simplicity, we consider the special case when the parameter of interest is one dimensional. Recall that and are the full design matrix and regression coefficient vector, respectively. Throughout this subsection, we consider with .
First, we briefly review the de-sparsified Lasso procedure. Following the presentation of , we denote by an proxy of the inverse of the sample covariance matrix , in the sense of making the product close to the -dimensional identify matrix . Their de-sparsified Lasso estimator is defined as
where is the unadjusted Lasso estimate, which is also our initialization in Algorithm. 1. Focusing on the first component of , the parameter of interest, we express it as
where denotes the first row of . According to , the first row takes the form as
where is constructed in the node-wise regression (10) with , and
By plugging in these into formula (11), we obtain
In comparison, it is easy to write out the updating formula for in the first iteration of Algorithm. 1 in a similar form
By comparing formulas (12) and (13), the only difference is in the denominator and , which are both expected to converge to the population level squared residual , since tends to be small, while both the empirical product and the averaging squared norm tend to converge to the population level quantity as and . More precisely, we have the following proposition.
Under the assumption in Theorem 3, we have that as and ,
According to Proposition 2, the de-sparsified Lasso estimator is asymptotically equivalent to the first iterate in the iterative Algorithm. 1 when initialized at the original Lasso estimator. As a consequence, the de-sparsified Lasso estimator tends to be close to the Lasso estimator when the convergence of the algorithm is slow. Since the Lasso estimator has super-efficiency—meaning that it shrinks small signals to be exactly zero and incurs some amount of shrinkage for non-zero signals, the de-sparsified estimator may also inherit the super-efficiency from the Lasso to some extent. This explains our empirical observations in Section 4 that for the de-sparsified Lasso, the coverage probabilities of confidence intervals for unknown true signals with non-zero values tend to significantly below the nominal level, while the coverage probabilities of zero signals almost always attain, even exceed the nominal level. In comparison, the CLasso estimator, a refined de-sparsified Lasso estimator though applying more iterations, tends to be fully escaped from the local super-efficiency region. Consequently, the penalty-induced bias in those non-zero signals has been fully corrected and the shrinking-to-zero signals have been fully released from zero (see Figure. 1 and Figure. 2 and captions therein for an illustration). As a result, the CLasso tends to produce a more balanced coverage probabilities between zero and non-zero signals (see our empirical studies in Section 4 for more details). This improvement over the de-sparsified Lasso becomes more prominent as the convergence of the iterative algorithm becomes slow, that is, when the algorithmic convergence rate (see Theorem 6) becomes relatively large.
We show the asymptotic normality of the CLasso estimator under suitable conditions on and the design matrix in Section 3.1. In Section 3.2, we show that the constructed via node-wise regression (10) satisfies those conditions and in addition leads to asymptotic optimality in terms of semiparametric efficiency. In Section 3.4, we turn to the algorithmic aspect of the CLasso by showing a globally linear contraction rate of the iterative algorithm proposed in Section 2.3.
3.1 Asymptotic normality of the Constrained Lasso
For technical convenience, we study the following variant of the CLasso problem by adding constraint for some sufficiently large so that the truth is feasible in the second Lasso programming,
This additional constraint becomes redundant as long as the initialization of the iterative algorithm satisfies , since our proof indicates that for any , the time- iterate still satisfies the same constraint.
Recall that the true data generating model is with and is assumed to be -sparse. It is good to keep in mind that we are always working in the regime that and . For any by matrix , we denote its element-wise sup norm by , the to norm by , the to norm by . Let denote the index set corresponding to the support of the -sparse vector . Let denote a cone in . This cone plays a key role in the analysis, since we will show that as well as any iterate in Algorithm. 1 belongs to this cone with high probability due to the regularization. Recall that is the residual of after the part has been removed.
We make the following assumption on the design matrix for the nuisance part, which is a standard assumption in high-dimensional linear regression under the sparsity constraint (for discussions about this condition, see, for example, ).
Restricted eigenvalue condition (REC):
The nuisance design matrix satisfies
Assume REC. Moreover, suppose there are constants such that , , , , , and the design matrices has been normalized so that and . If and the truth of the nuisance parameter satisfies and is -sparse, then for some constant ,
where the randomness is with respect to the noise vector in the linear model.
Theorem 3 shows that if the remainder term , then is asymptotically equivalent to a normally distributed vector . This theorem applies to any design and matrix , and does not use any randomness in them. Let us make some quick remark regarding the conditions in Theorem 3. Since we are interested in the regime that , or more simply, , assumptions on like , and are easily satisfied for some sufficiently large (see, for example, Theorem 4). The design matrix column normalization condition is also standard. The less obvious assumptions are and , which controls the bias magnitude in and critically depends on the choice of . More importantly, in order to make the remainder term in the local expansion of to vanish, needs to decay reasonably fast as . In Theorem 4 below, we show that under mild assumptions on the design, the constructed via node-wise regression (10) has nice properties that makes and hold with high probability with respect to the randomness in the design. By plugging in these bounds, Theorem 3 implies that remainder term is indeed of higher-order relative to as and . Although we assume the noise in the linear model to be Gaussian, the proof can be readily extend to noises with sub-Gaussian tails.
3.2 Semiparametric efficiency of the CLasso
In this subsection, we show that the matrix chosen via optimization procedure (10) satisfies the conditions in Theorem 3. Moreover, the corresponding CLasso estimator is semiparametric efficient—it has the smallest asymptotic variance, or achieves the Cramér-Rao lower bound from a semiparametric efficiency perspective. Recall that is the residual matrix, where is the solution of the node-wise regression (10). Let denote the inverse of the semiparametric efficient information matrix of , which is also the Cramér-Rao lower bound of the asymptotic covariance matrix of any asymptotically unbiased estimator of (see  for more details on the precise definition of semiparametric optimality of ). Recall that we assume both design matrices and to be random.
Let denote the entire design matrix. Rows of are i.i.d. with zero mean and sub-Gaussian tails, that is, and there exists some constant , such that for any vector ,
If Assumption D holds and , then in the node-wise regression (10), with probability at least with respect to the randomness in the design , we have
In addition, if we choose , then for some constant depending on , the largest eigenvalue of and , it holds with probability at least with respect to the randomness in the design that
The choice of heavily depends on the tail behavior of the design . For example, if the design instead has a heavier sub-exponential tail, then we need to increase the regularization parameter to . Similar to the theory in , we do not need to impose any sparsity condition on ’s as in —the only assumption is the boundedness of , which tends to be mild and satisfied in most real situations. In fact, in the proof we find that a “slow rate” type bound  for the penalized estimator suffices for the proof and we do not need to go to the “fast rate” regime that demands sparsity.
Theorem 4 also implies that we may choose the critical quantities and appearing in Theorem 3 to be of order . Finally, by combining Theorem 3 and Theorem 4 with Slutsky’s theorem, we obtain the following corollary showing the semiparametric optimality of the CLasso estimator .
3.3 Confidence intervals and hypothesis testing
In this subsection, we construct asymptotically valid statistical inference procedures based on the form of the asymptotic normal limit of .
For any -dimensional vector , we can construct an confidence interval for linear functional as
where we use notation to denote the square root for any symmetric matrix . Consequently, we have for any ,
implying that is an asymptotically valid confidence interval with significance level for .
In the special case when we are only interested in one component of , say , in the linear model (1), then in order to minimize the asymptotic length of its confidence interval, we take as the parameter of interest and as the nuisance parameter in the semiparametric formulation (2), where for any vector we use notation to denote the its sub-vector without the -th component. Let and to denote the corresponding design matrices. Then, the previous procedure leads to an asymptotically valid -confidence interval of as
By converting the confidence interval (15), we can construct the following asymptotically valid procedure for testing vs for any contrast by rejecting if