The least absolute shrinkage and selection operator, or Lasso, introduced by Tibshirani (1996) has received tremendous attention in the statistics literature over the past two decades. The main attraction of this method lies in its ability to perform model selection and parameter estimation at very low computational cost, and in the fact that the estimator can be used in high-dimensional settings where the number of variables exceeds the number of observations.
For these reasons, the Lasso has also become a very popular and powerful tool in econometrics, and the same can be said of the estimator’s many variants, among them the adaptive Lasso estimator of Zou (2006), where the $\ell_1$-penalty term is randomly weighted according to some preliminary estimator. This particular method has been used in econometrics in the context of diffusion processes (DeGregorio & Iacus, 2012), for instrumental variables (Caner & Fan, 2015), in the framework of stationary and non-stationary autoregressions (Kock & Callot, 2015; Kock, 2016) and for autoregressive distributed lag (ARDL) models (Medeiros & Mendes, 2017), to name just a few.
Despite the popularity of this method, there are still many open questions on how to construct valid confidence regions in connection with the adaptive Lasso estimator. Pötscher & Schneider (2010) demonstrate that the oracle property from Zou (2006) and Huang et al. (2008) cannot be used to conduct valid inference and that resampling techniques also fail. They give confidence intervals with exact coverage in finite samples as well as an extensive asymptotic study in the framework of orthogonal regressors. However, settings more general than the orthogonal case have not been considered yet.
In this paper, we consider an arbitrary low-dimensional linear regression model where the regressor matrix exhibits full column rank. We allow for the adaptive Lasso estimator to be tuned componentwise with some tuning parameters possibly being equal to zero, so that not all coordinates have to be penalized. Due to this componentwise structure, three possible asymptotic regimes arise: the one where each zero component is detected with asymptotic probability less than one, usually termed conservative model selection; the one where each zero component is detected with asymptotic probability equal to one, usually referred to as consistent model selection; and the mixed case where some components are tuned conservatively and some are tuned consistently. The framework we consider encompasses the latter two regimes.
The main challenge for inference in connection with the adaptive Lasso (and related) estimators lies in the fact that the finite-sample distribution depends on the unknown parameter in a complicated manner, and that this dependence persists in large samples. Because of this, the coverage probability of a confidence region varies over the parameter space, and in order to conduct valid inference, one needs to guard against the lowest possible coverage by considering the minimal coverage probability over the parameter space. This is the approach taken in this paper.
Since explicit expressions for the finite-sample distribution and the coverage probabilities of confidence regions are unknown when the regressors are not orthogonal, our study is set in an asymptotic framework. We determine the appropriate uniform rate of convergence and derive the asymptotic distribution of an appropriately scaled estimator that has been centered at the true parameter. While the limit distribution is still only implicitly defined through a minimization problem, the key observation and finding is that one may explicitly characterize the set of minimizers once the union over all true parameters is taken. This is done by heavily exploiting the structure of the corresponding optimization and leads to a compact set that is determined by the asymptotic Gram matrix as well as the asymptotic deviations between the componentwise tuning parameters and the maximal one. This result can then be used to show how any confidence region with positive asymptotic coverage needs to include . Even more so, such confidence sets will necessarily always have asymptotic coverage equal to one, showing that it is impossible to construct classical confidence regions with arbitrary coverage in this setting.
The paper is organized as follows. We introduce the model and the assumptions as well as the estimator in Section 2. In Section 3, we study the relationship of the adaptive Lasso to the least-squares estimator. The consistency properties with respect to parameter estimation, rates of convergence and model selection are derived in Section 4. Section 5 looks at the asymptotic distribution of the estimator and deduces that it is always contained in a compact set, independently of the unknown parameter. These results are used to construct the confidence regions in Section 6, where their shape is also illustrated. We summarize in Section 7 and relegate all proofs to Appendix A for readability.
2 Setting and Notation
We consider the linear regression model

$$y = X\beta + \varepsilon,$$

where $y \in \mathbb{R}^n$ is the response vector, $X \in \mathbb{R}^{n \times p}$ the non-stochastic regressor matrix assumed to have full column rank, $\beta \in \mathbb{R}^p$ the unknown parameter vector and $\varepsilon \in \mathbb{R}^n$ the unobserved stochastic error term consisting of independent and identically distributed components with finite second moments, defined on some probability space. To define the adaptive Lasso estimator first introduced by Zou (2006), let
where are non-negative tuning parameters and
is the ordinary least-squares (LS) estimator. We assume the event to have zero probability for all and do not consider this occurring in the subsequent analysis. The adaptive Lasso estimator we consider is given by
which always exists and is uniquely defined in our setting. Note that, in contrast to Zou (2006), we allow for componentwise partial tuning where the tuning parameter may vary over coordinates and may be equal to zero, so that not all components need to be penalized. This is in contrast to the typical case of uniform tuning with a single positive tuning parameter. We also look at the leading case of with , in the notation of Zou (2006). For all asymptotic considerations, we will assume that converges to a positive definite matrix as .
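As a rough illustration of the definition above, the following Python sketch computes a componentwise-tuned adaptive Lasso by coordinate descent. It assumes the objective $\|y - Xb\|^2 + 2\sum_j \lambda_j |b_j| / |\hat\beta_{LS,j}|$; the factor 2 and the absence of a $1/n$ normalization are assumptions made for this sketch, and the names `adaptive_lasso`, `lam`, `n_iter` and `tol` are hypothetical rather than taken from the paper.

```python
import numpy as np


def adaptive_lasso(X, y, lam, n_iter=1000, tol=1e-12):
    """Coordinate descent for the ASSUMED objective
        ||y - X b||^2 + 2 * sum_j lam[j] * |b_j| / |b_ls[j]|,
    where b_ls is the least-squares estimator (weights with exponent 1).
    Entries of lam may be zero, in which case that component is unpenalized."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    lam = np.asarray(lam, dtype=float)
    b_ls = np.linalg.solve(X.T @ X, X.T @ y)  # needs full column rank
    w = lam / np.abs(b_ls)                    # assumes no LS component is exactly zero
    col_sq = np.sum(X**2, axis=0)
    b = b_ls.copy()                           # start at the LS estimator
    for _ in range(n_iter):
        b_old = b.copy()
        for j in range(X.shape[1]):
            # inner product of column j with the residual that excludes coordinate j
            z = X[:, j] @ (y - X @ b) + col_sq[j] * b[j]
            # soft-thresholding solves the one-dimensional subproblem exactly
            b[j] = np.sign(z) * max(abs(z) - w[j], 0.0) / col_sq[j]
        if np.max(np.abs(b - b_old)) < tol:
            break
    return b, b_ls
```

Under this assumed objective and with orthonormal regressor columns, each coordinate update reduces to soft-thresholding the corresponding LS component at $\lambda_j / |\hat\beta_{LS,j}|$, and setting a tuning parameter to zero leaves that component equal to its LS value.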
We define the active set to be . The quantity is given by the largest tuning parameter, and stands for the extended real line. Finally, the symbol depicts convergence in probability and convergence in distribution, respectively. For the sake of readability, we do not show the dependence of the following quantities on in the notation: , , , , , and .
3 Relationship to LS estimator
The following finite-sample relationship between the adaptive Lasso and the LS estimator is essential for proving the results in the subsequent section and also gives some insight into the idea behind the results on the shape of the confidence regions in Sections 5 and 6. It shows that the difference between the adaptive Lasso and the LS estimator is always contained in a bounded and closed set that depends on the regressor matrix as well as on the tuning parameters. Note that the statements in Lemma 1 and Corollary 2 hold for all .
Lemma 1 (Relationship to LS estimator).
Lemma 1 can be used to show under which tuning regime the adaptive Lasso asymptotically behaves like the LS estimator, as is stated in the following corollary.
Corollary 2 (Equivalence of LS and adaptive Lasso estimator).
If , and are asymptotically equivalent in the sense that
Corollary 2 shows that in case , the adaptive Lasso estimator is asymptotically equivalent to the LS estimator, so that this becomes a trivial case. How the estimator behaves in terms of parameter estimation and model selection for different asymptotic tuning regimes is treated in the next section.
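The finite-sample relationship discussed in this section can be probed numerically. The sketch below assumes the objective $\|y - Xb\|^2 + 2\sum_j \lambda_j |b_j| / |\hat\beta_{LS,j}|$ (the normalization is an assumption). Its first-order conditions give $|(X'X(\hat\beta_{AL} - \hat\beta_{LS}))_j| \le \lambda_j / |\hat\beta_{LS,j}|$ for every $j$, with equality whenever $\hat\beta_{AL,j} \neq 0$. This is a consequence of the optimality conditions under that assumption and is not claimed to be the exact form of the set in Lemma 1; the function names are hypothetical.

```python
import numpy as np


def solve_adaptive_lasso(X, y, lam, n_iter=5000, tol=1e-13):
    """Coordinate descent for the ASSUMED objective
    ||y - X b||^2 + 2 * sum_j lam[j] * |b_j| / |b_ls[j]|."""
    b_ls = np.linalg.solve(X.T @ X, X.T @ y)
    w = lam / np.abs(b_ls)                   # componentwise penalty weights
    col_sq = np.sum(X**2, axis=0)
    b = b_ls.copy()
    for _ in range(n_iter):
        b_old = b.copy()
        for j in range(X.shape[1]):
            z = X[:, j] @ (y - X @ b) + col_sq[j] * b[j]
            b[j] = np.sign(z) * max(abs(z) - w[j], 0.0) / col_sq[j]
        if np.max(np.abs(b - b_old)) < tol:
            break
    return b, b_ls, w


def first_order_check(X, y, lam):
    """Return the estimate, |X'X (b_al - b_ls)| per component, and the bounds w."""
    b, b_ls, w = solve_adaptive_lasso(X, y, lam)
    lhs = np.abs(X.T @ X @ (b - b_ls))
    return b, lhs, w


# Hypothetical example with correlated regressors and one unpenalized component.
rng = np.random.default_rng(1)
Z = rng.normal(size=(200, 3))
X = Z @ np.array([[1.0, 0.6, 0.0], [0.0, 0.8, 0.3], [0.0, 0.0, 1.0]])
y = X @ np.array([1.0, 0.0, 0.0]) + rng.normal(size=200)
b, lhs, w = first_order_check(X, y, np.array([3.0, 3.0, 0.0]))
print(np.all(lhs <= w + 1e-6))  # bound holds componentwise, up to solver tolerance
```

The check makes the bounded-set idea concrete: the difference to the LS estimator is confined to a region whose size is governed by the Gram matrix and the ratios of tuning parameters to LS components.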
4 Consistency in parameter estimation and model selection
We start our investigation by deriving the pointwise convergence rate of the estimator.
Proposition 3 (Pointwise convergence rate).
Let . Then the adaptive Lasso estimator is pointwise -consistent for in the sense that for every , there exists a real number such that
The fact that the pointwise convergence rate is given by only if does not diverge has implicitly been noted in the oracle property in Theorem 2 of Zou (2006), reflected in the assumption of (note that in that reference corresponds to in our notation, assuming uniform tuning over all components). In the one-dimensional case, it can be learned from Theorem 5 Part 2 in Pötscher & Schneider (2009) that the sequence is not stochastically bounded if diverges (to make the connection from that reference to our notation, note that there and set and ). However, neither of these references determines the slower rate of explicitly when it applies.
The uniform convergence rate is presented in the next proposition.
Proposition 4 (Uniform convergence rate).
Let . Then the adaptive Lasso estimator is uniformly -consistent for in the sense that for every , there exists a real number such that
Proposition 4 shows that the uniform convergence rate is slower than if , in which case it is also slower than the pointwise rate. The fact that the uniform rate may differ from the pointwise one has been noted in Pötscher & Schneider (2009).
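A toy calculation may help to see why a uniform rate can be slower than a pointwise one. The sketch assumes one regressor with $x'x = n$ and the objective $\|y - xb\|^2 + 2\lambda |b| / |\hat\beta_{LS}|$ (an assumed normalization), for which the estimator is $\hat\beta_{LS}\,\max(1 - \lambda/(n\hat\beta_{LS}^2), 0)$. The shrinkage away from the LS value is then $\min(|\hat\beta_{LS}|, \lambda/(n|\hat\beta_{LS}|))$: of order $\lambda/n$ at a fixed non-zero value, but as large as $\sqrt{\lambda/n}$ in the worst case. The tuning sequence `lam = n ** 0.75` is a hypothetical choice, and this is a heuristic rather than the proof of Proposition 4.

```python
import numpy as np


def adaptive_lasso_1d(ls, lam, n):
    """One-dimensional adaptive Lasso under the ASSUMED objective with x'x = n:
    ls * max(1 - lam / (n * ls^2), 0)."""
    ls = np.asarray(ls, dtype=float)
    return ls * np.maximum(1.0 - lam / (n * ls**2), 0.0)


def shrinkage(ls, lam, n):
    """Absolute difference between the adaptive Lasso and the LS value."""
    return np.abs(adaptive_lasso_1d(ls, lam, n) - ls)


n = 10_000
lam = n ** 0.75                          # hypothetical: lam -> infinity, lam / n -> 0
grid = np.linspace(1e-4, 2.0, 200_001)   # candidate LS values (zero excluded)
worst = shrinkage(grid, lam, n).max()    # worst case over the grid
fixed = shrinkage(1.0, lam, n)           # shrinkage at the fixed value 1

print(worst, np.sqrt(lam / n))  # worst-case shrinkage next to sqrt(lam / n)
print(fixed, lam / n)           # fixed-value shrinkage next to lam / n
```

The worst case occurs for LS values near the selection threshold, which is why parameters drifting with the sample size matter for uniform statements.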
Theorem 5 (Consistency in parameter estimation).
The following statements are equivalent.
is pointwise consistent for .
is uniformly consistent for .
Condition 4 in Theorem 5 states that the adaptive Lasso only chooses correct and never underparametrized models with asymptotic probability equal to 1. It underlines the fact that is a basic condition that we will assume in all subsequent statements.
Theorem 6 (Consistency in model selection).
Suppose that as . If as well as as for all , then the adaptive Lasso estimator performs consistent model selection in the sense that
This statement is particularly interesting for the case of partial tuning where some are set to zero and the corresponding components are not penalized, revealing that the other components can still be tuned consistently in this case.
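The distinction between conservative and consistent tuning can be made concrete in a toy case. The sketch assumes a single regressor with $x'x = n$, Gaussian errors with standard deviation `sigma`, and the objective $\|y - xb\|^2 + 2\lambda |b| / |\hat\beta_{LS}|$; all three are assumptions of this illustration. Under them the estimator is zero exactly when $n\hat\beta_{LS}^2 \le \lambda$, so for a true coefficient of zero the detection probability is $2\Phi(\sqrt{\lambda}/\sigma) - 1$, which depends on the tuning parameter only.

```python
import math


def zero_detection_prob(lam, sigma=1.0):
    """P(estimator = 0) when the true coefficient is zero, in the ASSUMED
    one-dimensional Gaussian setting: 2 * Phi(sqrt(lam) / sigma) - 1."""
    return math.erf(math.sqrt(lam) / (sigma * math.sqrt(2.0)))


for n in (100, 10_000, 1_000_000):
    bounded = zero_detection_prob(1.0)        # tuning parameter stays bounded
    diverging = zero_detection_prob(n**0.25)  # hypothetical diverging sequence
    print(n, round(bounded, 4), round(diverging, 4))
```

With a bounded tuning parameter the probability stays below one for every sample size (conservative selection), whereas a diverging sequence pushes it towards one (consistent selection).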
5 Asymptotic distribution
In this section, we investigate the asymptotic distribution and subsequently construct confidence regions in Section 6. We perform our analysis for the case when which, by Theorem 6, encompasses the tuning regime of consistent model selection and often is the regime of choice in applications. If the estimator is tuned uniformly over all components, the condition is in fact equivalent to consistent tuning, given the basic condition of .
The requirement also corresponds to the case where the convergence rate of the adaptive Lasso estimator is given by rather than , as can be seen from Proposition 4. Pötscher & Schneider (2009) and Pötscher & Schneider (2010) demonstrate that in order to get a representative and full picture of the behavior of the estimator from asymptotic considerations, one needs to consider a moving-parameter framework where the unknown parameter is allowed to vary over sample size. For these reasons, we study the asymptotic distribution of , which is done in the following.
measuring the two different deviations between each tuning parameter and the maximal one. Note that we have and for uniform tuning and that not penalizing the -th parameter leads to . Assuming the existence of these limits does not pose a restriction, as we could always perform our analyses on convergent subsequences and characterize the limiting behavior for all accumulation points.
Theorem 7 (Asymptotic distribution).
Assume that and . Moreover, define by for . Then
There are a few things worth mentioning about Theorem 7. First of all, in contrast to the one-dimensional case, the asymptotic limit of the appropriately scaled and centered estimator may still be random. However, this can only occur if is non-zero and finite for some component , meaning that the maximal tuning parameter diverges faster (in some sense) than the tuning parameter for the -th component, but not too much faster. When no randomness occurs in the limit, the rate of the stochastic component of the estimator is obviously smaller by an order of magnitude compared to the bias component. In particular, this will always be the case for uniform tuning when .
As is expected, the proof of Theorem 7 will be carried out by looking at the corresponding asymptotic minimization problem of the quantity of interest, which can be shown to be the minimization of . However, since this limiting function is not finite on an open subset of , the reasoning of why the appropriate minimizers converge in distribution to the minimizer of is not as straightforward as might be anticipated.
The assumption of converging in in the above theorem is not restrictive in the sense that otherwise, we simply revert to converging subsequences and characterize the limiting behavior for all accumulation points, which will prove to be all we need for Proposition 8 and the confidence regions in Section 6.
While we cannot explicitly minimize for a fixed other than in trivial cases, surprisingly, we can still explicitly determine the set of all minimizers of over all , which yields the same set regardless of the realization of in . This is done in the following proposition.
Proposition 8 (Set of minimizers).
Then for any we have
So, while the limit of will, in general, be random, the set is not. In fact, Proposition 8 shows that for any , the union of limits over all possible sequences of unknown parameters is always given by the same compact set . This observation is central for the construction of confidence regions in the following section. It also shows that while in general, a stochastic component will survive in the limit, it is always restricted to have bounded support that depends on the regressor matrix and the tuning parameter through the matrix and the quantities and . Interestingly, only depends on for the components when , in which case the set loses a dimension. This can be seen as a result of the -th component being penalized much less than the maximal so that the scaling factor used in Theorem 7 is not enough for this component to survive in the limit. Note that in case of uniform tuning where and , does not depend on the sequence of tuning parameters at all. Also, we have for and , a fact that has been shown in Pötscher & Schneider (2009) and used in Pötscher & Schneider (2010).
A very “quick-and-dirty” way to motivate the result in Proposition 8 is to rewrite
and observe that the second term on the right-hand side is whereas the first term is always contained in the set
by Lemma 1, which contains the set in the limit. Theorem 7 and Proposition 8 can therefore be viewed as the theory that makes this observation precise by sharpening the set and showing that it only contains the limits. This can then be used for constructing confidence regions, which is done in the following section.
6 Confidence regions – coverage and shape
Theorem 9 (Confidence regions).
Let and . Then every open superset of satisfies
For , define . We then have that
for any .
Theorem 9 essentially shows the following. The set is the “boundary case” for confidence sets in the sense that if we take a “slightly larger” set, multiplied with the appropriate factor and centered at the adaptive Lasso estimator, we get a confidence region with minimal asymptotic coverage probability equal to 1. If, however, we take a “slightly smaller” set, we end up with a confidence region of asymptotic minimal coverage 0. Nothing can be said in general about the case when we use itself. All this entails that constructing valid confidence regions based on the Lasso is not possible in the classical sense where one can go for an arbitrary prescribed coverage level, if at least one component of the estimator is tuned to perform consistent model selection. The reason for this is the fact that when controlling the bias of the estimator, the stochastic component either has the same bounded support as the bias component or vanishes completely, as has been pointed out in Section 5.
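The zero-or-one behavior of the minimal coverage can be explored by simulation in a toy case. The sketch below assumes one regressor with $x'x = n$, standard Gaussian errors, the objective $\|y - xb\|^2 + 2\lambda |b| / |\hat\beta_{LS}|$ and the hypothetical tuning sequence $\lambda = n^{3/4}$. Under these assumptions the relevant limiting set works out to $[-1, 1]$ after scaling by $\sqrt{\lambda/n}$, so intervals $\hat\beta_{AL} \pm d\sqrt{\lambda/n}$ are examined for a scaling `d` slightly below and slightly above one. The function name and the grid of parameters are illustrative choices.

```python
import numpy as np


def min_coverage(d, n=10**6, reps=2000, seed=0):
    """Smallest simulated coverage of [b_al - d*s, b_al + d*s], s = sqrt(lam/n),
    over true parameters phi * s on a grid (ASSUMED one-dimensional setting)."""
    rng = np.random.default_rng(seed)
    lam = n ** 0.75                       # hypothetical: lam -> infinity, lam / n -> 0
    s = np.sqrt(lam / n)
    coverages = []
    for phi in np.arange(0.0, 3.0001, 0.05):
        beta = phi * s                    # parameter moving with the sample size
        ls = beta + rng.standard_normal(reps) / np.sqrt(n)    # LS estimator
        al = ls * np.maximum(1.0 - lam / (n * ls**2), 0.0)    # adaptive Lasso
        coverages.append(np.mean(np.abs(al - beta) <= d * s))
    return min(coverages)


print(min_coverage(0.9))  # interval slightly smaller than the boundary case
print(min_coverage(1.1))  # interval slightly larger than the boundary case
```

The smaller interval fails for parameters just below the selection threshold, where the estimator is set to zero and the estimation error is pure bias, while the larger interval absorbs this bias for every parameter on the grid.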
The statements in Theorem 9 can be strengthened in the following way. Let and .
If and , then for any we have
If , then any closed and proper subset of fulfills
Note that for uniform tuning, both refinements hold since and .
One might wonder how this type of confidence region compares to the confidence ellipse based on the LS estimator. Note that the regions will be multiplied by a different factor and centered at a different estimator. In general, the following observation can be made. For , let with be such that is an asymptotic -confidence region for . If we contrast this with , we see that since both and have positive, finite volume and since , the regions based on the adaptive Lasso are always larger by an order of magnitude.
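To get a feeling for the difference in size, the following sketch compares interval half-lengths in a toy one-dimensional setting. It assumes a regressor with $x'x = n$, known error standard deviation `sigma`, an LS interval of half-length $z_{1-\alpha/2}\,\sigma/\sqrt{n}$, and an adaptive Lasso interval of half-length $d\sqrt{\lambda/n}$ with some `d` larger than one. The latter scaling presumes the objective $\|y - xb\|^2 + 2\lambda |b| / |\hat\beta_{LS}|$, and all names and tuning sequences are hypothetical.

```python
import math
from statistics import NormalDist


def half_lengths(n, lam, alpha=0.05, sigma=1.0, d=1.1):
    """Half-lengths of the LS interval and of the adaptive-Lasso-based
    interval in the ASSUMED one-dimensional setting."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # standard normal quantile
    ls_half = z * sigma / math.sqrt(n)
    al_half = d * math.sqrt(lam / n)
    return ls_half, al_half


for n in (10**2, 10**4, 10**6):
    ls_half, al_half = half_lengths(n, lam=n**0.5)  # hypothetical diverging tuning
    print(n, ls_half, al_half, al_half / ls_half)
```

In this setting the ratio of the two half-lengths equals $d\sqrt{\lambda}/(z_{1-\alpha/2}\,\sigma)$, so it grows without bound whenever the tuning parameter diverges.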
We now illustrate the shape of . We start with and the matrix
We consider the case of uniform tuning, so that and and show the resulting set in Figure 1. The color indicates the value of at the specific point inside the set. The higher the absolute value of the correlation of the covariates is, the flatter and more stretched the confidence set becomes. As one may expect intuitively, in the case of positive correlation, the confidence set covers more of the area where the signs of the covariates are equal. A negative correlation causes the opposite behavior seen in Figure 1. Note that the corners of the set touch the boundary of the ellipse for a certain value of .
For the case of , we again start with an example with uniform tuning so that and and consider the matrix
The resulting set is depicted in Figure 2. To give a better impression of the shape, the set is colored depending on the value of the third coordinate. Here, the high correlation between the first and third covariate stretches the set in the direction where the signs of the covariates differ. Figure 2 also shows the projections of this three-dimensional set onto three planes where one component is held fixed at a time. The projection onto the plane where the second component is held constant clearly shows the behavior explained above. On the other hand, the other two projections emphasize that for covariates with a lower correlation in absolute value the confidence set is less distorted.
Finally, Figure 3 illustrates the partially tuned case with the same matrix . The first component is not penalized whereas the remaining ones are tuned uniformly. This implies that and . Due to the condition for all , the resulting set is an intersection of a plane with the set in Figure 1(a). The fact that the confidence set is only two-dimensional might appear odd and is due to the fact that the unpenalized component exhibits a faster convergence rate so that the factor with which is multiplied is not enough for this component to survive in the limit.
7 Summary and conclusions
We give a detailed study of the asymptotic behavior of the adaptive Lasso estimator with partial consistent tuning in a low-dimensional linear regression model. We do so within a framework that takes into account the non-uniform behavior of the estimator, non-trivially generalizing results from Pötscher & Schneider (2009) that were derived for the case of orthogonal regressors. We use these distributional results to show that valid confidence regions based on the estimator can essentially only have asymptotic coverage equal to 0 or to 1, a fact that has been observed before for the one-dimensional case in Pötscher & Schneider (2010). We illustrate the shape of these regions and demonstrate the effect of componentwise tuning at different rates as well as the implications of partial tuning on the confidence set.
Appendix A Proofs
We introduce the following additional notation for the proofs. The symbol denotes the -th unit vector in and the sign function is given by for . For a function , the directional derivative of at in the direction of is denoted by , given by
For a vector and an index set , contains only the components of corresponding to indices in . Finally, denotes convergence in probability.
A.1 Proofs for Section 3
Proof of Lemma 1.
Consider the function
which can, using the normal equations of the LS estimator, be rewritten to
Note that is minimized at and that, since all directional derivatives have to be non-negative at the minimizer of a convex function. After some basic calculations we get
for all . When , this implies that
holds. When , the equations in (1) imply
If , clearly, (2) also holds. If , we have yielding
which completes the proof. ∎
A.2 Proofs for Section 4
Proof of Proposition 3.
Consider the function defined by which can be written as
is minimized at and, since , we have which implies that
where in the latter sum we have dropped the non-positive terms for and have used the fact that on the terms for . Now note that both and are bounded by 1 and that the sequences and for are tight, so that we can bound the right-hand side of the above inequality by a term that is stochastically bounded times . Moreover, since converges to and all matrices are positive definite, we can bound the left-hand side of the above inequality from below by a positive constant times , so that we can arrive at
which proves the claim. ∎
Proof of Proposition 4.
Proof of Theorem 5.
We have 3 $\Rightarrow$ 2 by Proposition 4 and clearly, 2 $\Rightarrow$ 1 holds. To show 1 $\Rightarrow$ 3, assume that is consistent for and that for some along a subsequence . Let . On the event , which by consistency has asymptotic probability equal to 1, we have
by Equation (3). By consistency and the convergence of , the left-hand side converges to zero in probability, whereas the right-hand side converges to in probability along the subsequence , yielding a contradiction. This shows the equivalence of the first three statements.
The final implication we show is 4 $\Rightarrow$ 3. For this, assume that so that there exists a subsequence such that as for some . We first look at the case of . Note that is stochastically bounded, since implies
As and , the quadratic term on the left-hand side dominates the linear term on the right-hand side which is only possible if is . Now note that by Equation 3, implies
The fact that and that and are stochastically bounded for fixed shows that the left-hand side of the above display is also bounded in probability. The right-hand side, however, diverges to regardless of the value of . We therefore have for all , which is a contradiction to 4. If , we first observe that is always contained in a compact set by Lemma 1 and the convergence of to . This implies that for some for all . Again, by Equation 3,
whenever . The left-hand side is bounded by whereas the right-hand side converges to in probability. We therefore get for all satisfying , also yielding a contradiction to 4. ∎
Proof of Theorem 6.
Since the condition guards against false negatives asymptotically by Theorem 5, we only need to show that the estimator detects all zero coefficients with asymptotic probability equal to one. Assume that and that . The partial derivative of with respect to is given by
Since is -consistent for , converges, and is tight, the left-hand side of the above display is stochastically bounded. The behavior of the right-hand side is governed by as is also stochastically bounded for . If does not converge to zero, then the right-hand side diverges because does. If , we have eventually, so that which also diverges by assumption. ∎
A.3 Proofs for Section 5
Assume that and . Moreover, suppose that and . Then for any , the term