Asymptotic Confidence Regions Based on the Adaptive Lasso with Partial Consistent Tuning

10/05/2018
by Nicolai Amann, et al.

We construct confidence sets based on an adaptive Lasso estimator with componentwise tuning in the framework of a low-dimensional linear regression model. We consider the case where at least one of the components is penalized at the rate of consistent model selection and where certain components may not be penalized at all. We perform a detailed study of the consistency properties and the asymptotic distribution that includes the effects of componentwise tuning within a so-called moving-parameter framework. These results enable us to explicitly provide a set M such that every open superset acts as a confidence set with uniform asymptotic coverage equal to 1 whereas every proper closed subset with non-empty interior is a confidence set with uniform asymptotic coverage equal to 0. The shape of the set M depends on the regressor matrix as well as the deviations within the componentwise tuning parameters. Our findings can be viewed as a generalization of Pötscher & Schneider (2010) who considered confidence intervals based on components of the adaptive Lasso estimator for the case of orthogonal regressors.


1 Introduction

The least absolute shrinkage and selection operator, or Lasso, by Tibshirani (1996) has received tremendous attention in the statistics literature in the past two decades. The main attraction of this method lies in its ability to perform model selection and parameter estimation at very low computational cost and the fact that the estimator can be used in high-dimensional settings where the number of variables exceeds the number of observations (“p > n”).

For these reasons, the Lasso has also turned into a very popular and powerful tool in econometrics, and similar things can be said about the estimator’s many variants, among them the adaptive Lasso estimator of Zou (2006), where the ℓ1-penalty term is randomly weighted according to some preliminary estimator. This particular method has been used in econometrics in the context of diffusion processes (DeGregorio & Iacus, 2012), for instrumental variables (Caner & Fan, 2015), in the framework of stationary and non-stationary autoregressions (Kock & Callot, 2015; Kock, 2016), and for autoregressive distributed lag (ARDL) models (Medeiros & Mendes, 2017), to name just a few.

Despite the popularity of this method, there are still many open questions on how to construct valid confidence regions in connection with the adaptive Lasso estimator. Pötscher & Schneider (2010) demonstrate that the oracle property from Zou (2006) and Huang et al. (2008) cannot be used to conduct valid inference and that resampling techniques also fail. They give confidence intervals with exact coverage in finite samples as well as an extensive asymptotic study in the framework of orthogonal regressors. However, settings more general than the orthogonal case have not been considered yet.

In this paper, we consider an arbitrary low-dimensional linear regression model (“n > p”) where the regressor matrix exhibits full column rank. We allow the adaptive Lasso estimator to be tuned componentwise, with some tuning parameters possibly being equal to zero, so that not all coordinates have to be penalized. Due to this componentwise structure, three possible asymptotic regimes arise: the one where each zero component is detected with asymptotic probability less than one, usually termed conservative model selection; the one where each zero component is detected with asymptotic probability equal to one, usually referred to as consistent model selection; and the mixed case where some components are tuned conservatively and some are tuned consistently. The framework we consider encompasses the latter two regimes.

The main challenge for inference in connection with the adaptive Lasso (and related) estimators lies in the fact that the finite-sample distribution depends on the unknown parameter in a complicated manner, and that this dependence persists in large samples. Because of this, the coverage probability of a confidence region varies over the parameter space, and in order to conduct valid inference, one needs to guard against the lowest possible coverage and consider the minimal coverage over the parameter space. This is what we do in this paper.
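To make the idea of guarding against the lowest coverage concrete, the following minimal sketch (in Python; the design, error distribution, parameter grid, and the generic region interface are illustrative assumptions and not part of the paper) estimates the coverage of a given confidence procedure by Monte Carlo over a grid of true parameters and reports the minimum.

```python
import numpy as np

def minimal_coverage(region, X, beta_grid, sigma=1.0, n_rep=2000, seed=0):
    """Monte Carlo estimate of the minimal coverage of a confidence procedure.

    `region(y, X)` is assumed to return a callable `contains(beta)` that
    reports whether a candidate parameter vector lies in the region built
    from the data (y, X).
    """
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    coverages = []
    for beta in beta_grid:
        hits = 0
        for _ in range(n_rep):
            y = X @ beta + sigma * rng.standard_normal(n)
            hits += region(y, X)(beta)
        coverages.append(hits / n_rep)
    # validity is judged by the worst case over the grid of true parameters
    return min(coverages)
```

The only point of the sketch is that validity is judged by the minimum over the parameter grid rather than by the coverage at a single, favourable parameter value.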

Since explicit expressions for the finite-sample distribution and the coverage probabilities of confidence regions are unknown when the regressors are not orthogonal, our study is set in an asymptotic framework. We determine the appropriate uniform rate of convergence and derive the asymptotic distribution of an appropriately scaled estimator that has been centered at the true parameter. While the limit distribution is still only implicitly defined through a minimization problem, the key observation and finding is that one may explicitly characterize the set of minimizers once the union over all true parameters is taken. This is done by heavily exploiting the structure of the corresponding optimization problem and leads to a compact set M that is determined by the asymptotic Gram matrix as well as the asymptotic deviations between the componentwise tuning parameters and the maximal one. This result can then be used to show that any confidence region with positive asymptotic coverage needs to include the set M. Even more so, such confidence sets will necessarily always have asymptotic coverage equal to one, showing that it is impossible to construct classical confidence regions with arbitrary coverage in this setting.

The paper is organized as follows. We introduce the model, the assumptions, and the estimator in Section 2. In Section 3, we study the relationship of the adaptive Lasso to the least-squares estimator. The consistency properties with respect to parameter estimation, rates of convergence, and model selection are derived in Section 4. Section 5 looks at the asymptotic distribution of the estimator and deduces that it is always contained in a compact set, independently of the unknown parameter. These results are used to construct the confidence regions in Section 6, where their shape is also illustrated. We summarize in Section 7 and relegate all proofs to Appendix A for readability.

2 Setting and Notation

We consider the linear regression model

y = Xβ + ε,

where y is the response vector, X the non-stochastic regressor matrix assumed to have full column rank, β the unknown parameter vector, and ε the unobserved stochastic error term consisting of independent and identically distributed components with finite second moments, defined on some probability space. To define the adaptive Lasso estimator, first introduced by Zou (2006), let

where the componentwise tuning parameters are non-negative and β̂_LS denotes the ordinary least-squares (LS) estimator. We assume the event that a component of β̂_LS equals zero to have zero probability for every sample size and do not consider this event in the subsequent analysis. The adaptive Lasso estimator β̂_AL we consider is the resulting minimizer of the penalized least-squares objective, which always exists and is uniquely defined in our setting. Note that, in contrast to Zou (2006), we allow for componentwise partial tuning where the tuning parameter may vary over coordinates and may be equal to zero, so that not all components need to be penalized. This is in contrast to the typical case of uniform tuning with a single positive tuning parameter. We also look at the leading case of γ = 1 in the notation of Zou (2006). For all asymptotic considerations, we will assume that X'X/n converges to a positive definite matrix C as n → ∞.
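Since the displayed objective did not survive extraction, the following sketch assumes a common parametrization of the adaptive Lasso criterion with componentwise tuning, namely minimizing ||y − Xβ||² + 2 Σ_j λ_j |β_j| / |β̂_LS,j| with λ_j ≥ 0, so that λ_j = 0 leaves the j-th component unpenalized; the factor 2 and this exact weighting are assumptions rather than the paper's parametrization. Under this form, cyclic coordinate descent reduces to componentwise soft-thresholding.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def adaptive_lasso(X, y, lam, n_iter=200):
    """Adaptive Lasso with componentwise tuning parameters lam[j] >= 0.

    Minimizes the assumed objective
        ||y - X b||^2 + 2 * sum_j lam[j] * |b_j| / |b_ls[j]|
    by cyclic coordinate descent; lam[j] = 0 leaves coordinate j unpenalized.
    Returns the adaptive Lasso and the LS estimator.
    """
    b_ls, *_ = np.linalg.lstsq(X, y, rcond=None)     # LS estimator (assumed non-zero componentwise)
    w = np.asarray(lam, dtype=float) / np.abs(b_ls)  # adaptive, componentwise penalty weights
    b = b_ls.copy()
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r_j = y - X @ b + X[:, j] * b[j]          # partial residual without coordinate j
            b[j] = soft_threshold(X[:, j] @ r_j, w[j]) / col_sq[j]
    return b, b_ls
```

Setting lam to the zero vector recovers the LS estimator (up to convergence of the coordinate descent), a single common positive value corresponds to uniform tuning, and mixing zero and positive entries gives the partial tuning studied in the paper.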

We define the active set as the set of indices of the non-zero components of the true parameter β and write λ_max for the largest of the componentwise tuning parameters; the extended real line is R ∪ {−∞, ∞}. Convergence in probability and convergence in distribution are denoted by the usual arrows. For the sake of readability, we suppress the dependence of several quantities on the sample size n in the notation.

3 Relationship to LS estimator

The following finite-sample relationship between the adaptive Lasso and the LS estimator is essential for proving the results in the subsequent section and will also give some insight into the idea behind the results on the shape of the confidence regions in Sections 5 and 6. It shows that the difference between the adaptive Lasso and the LS estimator is always contained in a bounded and closed set that depends on the regressor matrix as well as on the tuning parameters. Note that the statements in Lemma 1 and Corollary 2 hold for every sample size.

Lemma 1 (Relationship to LS estimator).

Lemma 1 can be used to show under which tuning regime the adaptive Lasso asymptotically behaves in the same way as the LS estimator, as stated in the following corollary.

Corollary 2 (Equivalence of LS and adaptive Lasso estimator).

If , and are asymptotically equivalent in the sense that

Corollary 2 shows that in case , the adaptive Lasso estimator is asymptotically equivalent to the LS estimator, so that this becomes a trivial case. How the estimator behaves in terms of parameter estimation and model selection for different asymptotic tuning regimes is treated in the next section.
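The condition of Corollary 2 did not survive extraction; purely as an illustration, the following sketch (reusing the hypothetical adaptive_lasso from the sketch in Section 2 and an assumed tuning schedule that tends to zero) compares the √n-scaled difference between the adaptive Lasso and the LS estimator as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = np.array([1.0, 0.0, -0.5])

for n in (100, 1000, 10000):
    diffs = []
    for _ in range(200):
        X = rng.standard_normal((n, 3))
        y = X @ beta + rng.standard_normal(n)
        lam = np.full(3, n ** -0.25)       # assumed schedule with vanishing tuning parameters
        b_al, b_ls = adaptive_lasso(X, y, lam)
        diffs.append(np.sqrt(n) * np.abs(b_al - b_ls).max())
    print(n, np.mean(diffs))               # average scaled difference between the two estimators
```

Under this illustrative schedule the printed values shrink as n grows, in line with the qualitative content of the corollary; the exact rate condition is the one stated in Corollary 2.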

4 Consistency in parameter estimation and model selection

We start our investigation by deriving the pointwise convergence rate of the estimator.

Proposition 3 (Pointwise convergence rate).

Let . Then the adaptive Lasso estimator is pointwise -consistent for in the sense that for every , there exists a real number such that

The fact that the pointwise convergence rate is given by only if does not diverge has implicitly been noted in Zou (2006)'s oracle property, Theorem 2 in that reference, reflected in the assumption imposed there (note that the quantity in that reference corresponds to the one in our notation, assuming uniform tuning over all components). In the one-dimensional case, it can be learned from Theorem 5, Part 2 in Pötscher & Schneider (2009) that the sequence is not stochastically bounded if diverges (the connection from that reference to our notation is made by identifying the corresponding quantities and scalings). However, neither of these references determines the slower rate explicitly when it applies.

The uniform convergence rate is presented in the next proposition.

Proposition 4 (Uniform convergence rate).

Let . Then the adaptive Lasso estimator is uniform -consistent for in the sense that for every , there exists a real number such that

Proposition 4 shows that the uniform convergence rate is slower than if , in which case it is also slower than the pointwise rate. The fact that the uniform rate may differ from the pointwise one has been noted in Pötscher & Schneider (2009).

Theorem 7 in Section 5 shows that the limits of the appropriately scaled estimation error for certain parameter sequences are non-zero, demonstrating that the uniform rate given in Proposition 4 indeed cannot be improved upon.

Theorem 5 (Consistency in parameter estimation).

The following statements are equivalent.

  1. The adaptive Lasso estimator is pointwise consistent for the true parameter β.

  2. The adaptive Lasso estimator is uniformly consistent for the true parameter β.

  3. as n → ∞.

  4. Each component of the estimator is non-zero with asymptotic probability 1 whenever the corresponding component of β is non-zero.

Condition 4 in Theorem 5 states that the adaptive Lasso only chooses correct and never underparametrized models with asymptotic probability equal to 1. It underlines the fact that this is a basic condition that we will assume in all subsequent statements.

Theorem 6 (Consistency in model selection).

Suppose that as . If as well as as for all , then the adaptive Lasso estimator performs consistent model selection in the sense that

Remark.

Inspecting the proof of Theorem 6 shows that in fact a more refined statement than Theorem 6 holds. Assume that . We then have that whenever and

This statement is particularly interesting for the case of partial tuning, where some tuning parameters are set to zero and the corresponding components are not penalized; it reveals that the remaining components can still be tuned consistently in this case.
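As an illustration of the remark, the following sketch (again based on the assumed objective and an assumed tuning schedule, with diverging tuning parameters for the penalized components and a zero tuning parameter for the unpenalized one) records how often each component is set exactly to zero.

```python
import numpy as np

rng = np.random.default_rng(1)
beta = np.array([0.0, 0.0, 1.0])        # the first two true coefficients are zero

for n in (100, 400, 1600):
    zero_freq = np.zeros(3)
    for _ in range(200):
        X = rng.standard_normal((n, 3))
        y = X @ beta + rng.standard_normal(n)
        # partial tuning: component 1 unpenalized, components 2 and 3 tuned consistently
        lam = np.array([0.0, n ** 0.25, n ** 0.25])   # assumed schedule
        b_al, _ = adaptive_lasso(X, y, lam)
        zero_freq += (b_al == 0.0)
    print(n, zero_freq / 200)            # empirical frequency of an exact zero per component
```

In runs of this kind the unpenalized first component is essentially never set to zero, while the penalized second component, whose true coefficient is zero, is set to zero with empirical frequency approaching one; the non-zero third component is retained.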

5 Asymptotic distribution

In this section, we investigate the asymptotic distribution and subsequently construct confidence regions in Section 6. We perform our analysis for the case in which the maximal tuning parameter diverges, which, by Theorem 6, encompasses the tuning regime of consistent model selection and often is the regime of choice in applications. If the estimator is tuned uniformly over all components, this requirement is in fact equivalent to consistent tuning, given the basic condition from Theorem 5.

The requirement also corresponds to the case where the convergence rate of the adaptive Lasso estimator is slower than √n, as can be seen from Proposition 4. Pötscher & Schneider (2009) and Pötscher & Schneider (2010) demonstrate that, in order to get a representative and full picture of the behavior of the estimator from asymptotic considerations, one needs to consider a moving-parameter framework where the unknown parameter is allowed to vary with the sample size. For these reasons, we study the asymptotic distribution of the appropriately scaled and centered estimator, which is done in the following.
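The need for the moving-parameter analysis can be illustrated by a small experiment: with the true parameter held fixed, the √n-scaled estimation error of a consistently tuned component looks innocuous, while letting the same coefficient shrink with the sample size (here at the assumed rate 2/√n) produces a bias in the scaled error that does not disappear. All rates, schedules, and the √n scaling below are illustrative assumptions made for this sketch; the paper's own scaling is the one used in Theorem 7.

```python
import numpy as np

rng = np.random.default_rng(2)

def scaled_errors(beta_of_n, n, reps=300):
    """Monte Carlo draws of sqrt(n) * (second adaptive-Lasso component minus its true value)."""
    out = np.empty(reps)
    for r in range(reps):
        beta = beta_of_n(n)
        X = rng.standard_normal((n, 2))
        y = X @ beta + rng.standard_normal(n)
        lam = np.full(2, n ** 0.25)                    # assumed consistent-tuning schedule
        b_al, _ = adaptive_lasso(X, y, lam)
        out[r] = np.sqrt(n) * (b_al[1] - beta[1])
    return out

for n in (200, 800, 3200):
    fixed  = scaled_errors(lambda m: np.array([1.0, 0.5]), n)               # fixed parameter
    moving = scaled_errors(lambda m: np.array([1.0, 2.0 / np.sqrt(m)]), n)  # shrinking parameter
    print(n, round(fixed.mean(), 2), round(moving.mean(), 2))
```

In runs of this sketch, the fixed-parameter bias of the scaled error shrinks with n, whereas under the shrinking parameter sequence it typically settles near minus the scaled coefficient; this non-uniformity is exactly what the moving-parameter framework is designed to capture.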

Throughout Sections 5 and 6, let and be defined by

measuring the two different deviations of each tuning parameter from the maximal one. Note that we have and for uniform tuning and that not penalizing the j-th parameter leads to . Note that assuming the existence of these limits does not pose a restriction, as we could always perform our analyses on convergent subsequences and characterize the limiting behavior for all accumulation points.

Theorem 7 (Asymptotic distribution).

Assume that and . Moreover, define by for . Then

where

with .

There are a few things worth mentioning about Theorem 7. First of all, in contrast to the one-dimensional case, the asymptotic limit of the appropriately scaled and centered estimator may still be random. However, this can only occur if the corresponding limiting quantity is non-zero and finite for some component j, meaning that the maximal tuning parameter diverges faster (in some sense) than the tuning parameter for the j-th component, but not too much faster. When no randomness occurs in the limit, the rate of the stochastic component of the estimator is obviously smaller by an order of magnitude than that of the bias component. In particular, this will always be the case for uniform tuning when .

As expected, the proof of Theorem 7 is carried out by looking at the corresponding asymptotic minimization problem of the quantity of interest, which can be shown to be the minimization of the limiting function. However, since this limiting function is not finite on an open subset of its domain, the reasoning as to why the appropriate minimizers converge in distribution to the minimizer of the limit is not as straightforward as might be anticipated.

The convergence assumption in the above theorem is not restrictive in the sense that otherwise we simply revert to convergent subsequences and characterize the limiting behavior for all accumulation points, which will prove to be all we need for Proposition 8 and the confidence regions in Section 6.

While we cannot explicitly minimize the limiting function for a fixed parameter other than in trivial cases, surprisingly, we can still explicitly describe the set of all minimizers once the union over all true parameters is taken, and this set is the same regardless of the realization of the random component. This is done in the following proposition.

Proposition 8 (Set of minimizers).

Define

Then for any we have

So, while the limit of the scaled and centered estimator will, in general, be random, the set M is not. In fact, Proposition 8 shows that for any realization of the random component, the union of limits over all possible sequences of unknown parameters is always given by the same compact set M. This observation is central for the construction of confidence regions in the following section. It also shows that while, in general, a stochastic component will survive in the limit, it is always restricted to have bounded support that depends on the regressor matrix and the tuning parameters through the matrix C and the asymptotic deviations within the componentwise tuning parameters. Interestingly, M only depends on these deviations for the components when , in which case the set loses a dimension. This can be seen as a result of the j-th component being penalized much less than the maximal one, so that the scaling factor used in Theorem 7 is not enough for this component to survive in the limit. Note that in the case of uniform tuning where and , M does not depend on the sequence of tuning parameters at all. Also, we have for and , a fact that has been shown in Pötscher & Schneider (2009) and used in Pötscher & Schneider (2010).

A very “quick-and-dirty” way to motivate the result in Proposition 8 is to rewrite

and observe that the second term on the right-hand side is asymptotically negligible whereas the first term is always contained in the set

by Lemma 1, a set which contains M in the limit. Theorem 7 and Proposition 8 can therefore be viewed as the theory that makes this observation precise by sharpening this set and showing that it contains only the limits. This can then be used for constructing confidence regions, which is done in the following section.
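The boundedness in this motivation can be visualized by simulation: whatever true parameter generated the data, the difference between the adaptive Lasso and the LS estimator stays inside a bounded region determined by the design and the tuning parameters. The sketch below (fixed design, assumed objective from the Section 2 sketch, arbitrary scaling) only illustrates this boundedness; it does not reproduce the exact scaling of Theorem 7 or the set M itself.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
n = 200
X = rng.standard_normal((n, 2)) @ np.array([[1.0, 0.6], [0.0, 0.8]])  # fixed, correlated design
lam = np.full(2, 5.0)

diffs = []
for _ in range(1000):
    beta = rng.uniform(-2.0, 2.0, size=2)      # vary the true parameter freely
    y = X @ beta + rng.standard_normal(n)
    b_al, b_ls = adaptive_lasso(X, y, lam)
    diffs.append(b_al - b_ls)
diffs = np.array(diffs)

# the scatter stays within a bounded region around the origin
plt.scatter(diffs[:, 0], diffs[:, 1], s=4)
plt.xlabel("first component of the difference")
plt.ylabel("second component of the difference")
plt.show()
```

Shrinking lam shrinks the scattered region, consistent with Lemma 1's statement that the bounding set depends on the regressor matrix and the tuning parameters.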

6 Confidence regions – coverage and shape

The insights from Theorem 7 and Proposition 8 can now be used for deriving the following theorem on confidence regions.

Theorem 9 (Confidence regions).

Let and . Then every open superset of satisfies

For , define . We then have that

for any .

Theorem 9 essentially shows the following. The set M is the “boundary case” for confidence sets in the sense that if we take a “slightly larger” set, multiply it with the appropriate factor and center it at the adaptive Lasso estimator, we obtain a confidence region with minimal asymptotic coverage probability equal to 1. If, however, we take a “slightly smaller” set, we end up with a confidence region of minimal asymptotic coverage equal to 0. Nothing can be said in general about the case when we use M itself. All this entails that constructing valid confidence regions based on the adaptive Lasso is not possible in the classical sense, where one can aim for an arbitrary prescribed coverage level, if at least one component of the estimator is tuned to perform consistent model selection. The reason is that, when the bias of the estimator is controlled, the stochastic component either has the same bounded support as the bias component or vanishes completely, as has been pointed out in Section 5.

Remark.

The statements in Theorem 9 can be strengthened in the following way. Let and .

  1. If and , then for any we have

  2. If , then any closed and proper subset of fulfills

Note that for uniform tuning, both refinements hold since and .

Part 1 holds since then has non-empty interior and therefore contains an open superset of . Part 2 hinges on the fact that the limits in Theorem 7 are always non-random under the given assumptions.

One might wonder how this type of confidence region compares to the confidence ellipse based on the LS estimator. Note that the regions will be multiplied by a different factor and centered at a different estimator. In general, the following observation can be made. For , let with be such that is an asymptotic -confidence region for . If we contrast this with , we see that since both and have positive, finite volume and since , the regions based on the adaptive Lasso are always larger by an order of magnitude.
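For reference, the LS-based comparison region mentioned above can be written down in the standard way; the sketch below assumes Gaussian errors with known variance and the usual chi-square calibration, which is textbook material rather than part of the paper's construction.

```python
import numpy as np
from scipy.stats import chi2

def ls_confidence_ellipse(X, y, alpha=0.05, sigma2=1.0):
    """Classical confidence ellipse for the whole parameter vector based on the
    LS estimator with known error variance sigma2:
        { b : (b_ls - b)' X'X (b_ls - b) <= sigma2 * chi2_{p, 1 - alpha} }.
    Returns the LS estimate and a membership test for candidate parameters."""
    p = X.shape[1]
    b_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
    gram = X.T @ X
    q = sigma2 * chi2.ppf(1.0 - alpha, df=p)
    def contains(b):
        d = b_ls - np.asarray(b, dtype=float)
        return float(d @ gram @ d) <= q
    return b_ls, contains
```

Such an ellipse shrinks at the parametric rate, whereas, as discussed in the text, regions built from the set M around the adaptive Lasso estimator are scaled by a different factor and are therefore larger by an order of magnitude; the function above can also serve as the region argument of the minimal_coverage sketch from the introduction.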

We now illustrate the shape of the set M. We start with the two-dimensional case and the matrix

We consider the case of uniform tuning and show the resulting set in Figure 1. The color indicates the value of at the specific point inside the set. The higher the absolute value of the correlation of the covariates, the flatter and more stretched the confidence set becomes. As one may expect intuitively, in the case of positive correlation, the confidence set covers more of the area where the signs of the covariates are equal. A negative correlation causes the opposite behavior, as seen in Figure 1. Note that the corners of the set touch the boundary of the ellipse for a certain value of .

Figure 1: An example of the set M with uniform tuning in two dimensions.

For the case of three dimensions, we again start with an example with uniform tuning and consider the matrix

The resulting set is depicted in Figure 2. To give a better impression of the shape, the set is colored depending on the value of the third coordinate. Here, the high correlation between the first and third covariates stretches the set in the direction where the signs of the covariates differ. Figure 2(b) shows the projections of the three-dimensional set of Figure 2(a) onto three planes where one component is held fixed at a time. The projection onto the plane where the second component is held constant clearly shows the behavior explained above. On the other hand, the other two projections emphasize that for covariates with a lower correlation in absolute value, the confidence set is less distorted.

Figure 2: An example of the set M with uniform tuning in three dimensions. The three-dimensional set is depicted in (a), and its two-dimensional projections are shown in (b).

Finally, Figure 3 illustrates the partially tuned case with the same matrix. The first component is not penalized whereas the remaining ones are tuned uniformly. This implies that and . Due to the condition for all , the resulting set is the intersection of a plane with the set in Figure 2(a). The fact that the confidence set is only two-dimensional might appear odd; it is due to the unpenalized component exhibiting a faster convergence rate, so that the factor with which the estimator is multiplied is not enough for this component to survive in the limit.

Figure 3: An example of the set M with partial tuning in three dimensions. The first component is not penalized, resulting in the set being part of a two-dimensional subspace.

7 Summary and conclusions

We give a detailed study of the asymptotic behavior of the adaptive Lasso estimator with partial consistent tuning in a low-dimensional linear regression model. We do so within a framework that takes into account the non-uniform behavior of the estimator, non-trivially generalizing results from Pötscher & Schneider (2009) that were derived for the case of orthogonal regressors. We use these distributional results to show that valid confidence regions based on the estimator can essentially only have asymptotic coverage equal to 0 or to 1, a fact that has been observed before for the one-dimensional case in Pötscher & Schneider (2010). We illustrate the shape of these regions and demonstrate the effect of componentwise tuning at different rates as well as the implications of partial tuning on the confidence set.

Appendix A Appendix – Proofs

We introduce the following additional notation for the proofs. The symbol denotes the -th unit vector in and the sign function is given by for . For a function , the directional derivative of at in the direction of is denoted by , given by

For a vector and an index set , contains only the components of corresponding to indices in . Finally, denotes convergence in probability.

a.1 Proofs for Section 3

Proof of Lemma 1.

Consider the function

which can, using the normal equations of the LS estimator, be rewritten as

Note that this function is minimized at the adaptive Lasso estimator and that all directional derivatives have to be non-negative at the minimizer of a convex function. After some basic calculations, we get

(1)

for all . When , this implies that

and therefore

(2)

holds. When , the equations in (1) imply

(3)

If , clearly, (2) also holds. If , we have yielding

which completes the proof. ∎

Proof of Corollary 2.

By Lemma 1, we have

Since with positive definite, the claim follows. ∎

a.2 Proofs for Section 4

Proof of Proposition 3.

Consider the function defined by which can be written as

is minimized at and, since , we have which implies that

where in the latter sum we have dropped the non-positive terms for and have used the fact that on the terms for . Now note that both and are bounded by 1 and that the sequences and for are tight, so that we can bound the right-hand side of the above inequality by a term that is stochastically bounded times . Moreover, since converges to and all matrices are positive definite, we can bound the left-hand side of the above inequality from below by a positive constant times , so that we can arrive at

which proves the claim. ∎

Proof of Proposition 4.

Let c denote the infimum of all eigenvalues of X'X/n and C, taken over n, and note that c > 0. By Lemma 1 we have

For any we therefore have

The claim now follows from the uniform √n-consistency of the LS estimator. ∎

Proof of Theorem 5.

We have 3 ⇒ 2 by Proposition 4 and, clearly, 2 ⇒ 1 holds. To show 1 ⇒ 3, assume that the estimator is consistent for the true parameter and that for some along a subsequence . Let . On the event , which by consistency has asymptotic probability equal to 1, we have

by Equation (3). By consistency and the convergence of , the left-hand side converges to zero in probability, whereas the right-hand side converges to in probability along the subsequence , yielding a contradiction. This shows the equivalence of the first three statements.

Moreover, 1 ⇒ 4 since for

by consistency in parameter estimation.

The final implication we show is 4 ⇒ 3. For this, assume that so that there exists a subsequence such that as for some . We first look at the case of . Note that is stochastically bounded, since implies

As and , the quadratic term on the left-hand side dominates the linear term on the right-hand side, which is only possible if is . Now note that, by Equation (3), implies

The fact that and that and are stochastically bounded for fixed shows that the left-hand side of the above display is also bounded in probability. The right-hand side, however, diverges to regardless of the value of . We therefore have for all , which is a contradiction to 4. If , we first observe that is always contained in a compact set by Lemma 1 and the convergence of to . This implies that for some for all . Again, by Equation (3),

whenever . The left-hand side is bounded by whereas the right-hand side converges to in probability. We therefore get for all satisfying , also yielding a contradiction to 4. ∎

Proof of Theorem 6.

Since the condition guards against false negatives asymptotically by Theorem 5, we only need to show that the estimator detects all zero coefficients with asymptotic probability equal to one. Assume that and that . The partial derivative of with respect to is given by

which yields

Since is -consistent for , converges, and is tight, the left-hand side of the above display is stochastically bounded. The behavior of the right-hand side is governed by as is also stochastically bounded for . If does not converge to zero, then the right-hand side diverges because does. If , we have eventually, so that which also diverges by assumption. ∎

a.3 Proofs for Section 5

Lemma 10.

Assume that and . Moreover, suppose that and . Then for any , the term

satisfies where