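Consider a data matrix $X \in \mathbb{R}^{n \times p}$ with $p$ features and $n$ rows, and a response $y \in \mathbb{R}^n$ from the standard linear model
$$y = X\beta + z,$$
where $z$ is the noise term. In this paper, we study the Lasso, which, given a penalty parameter $\lambda > 0$, finds the solution to the convex optimization problem [lasso]:
$$\hat\beta(\lambda) = \arg\min_{b \in \mathbb{R}^p} \ \tfrac{1}{2}\|y - Xb\|_2^2 + \lambda \|b\|_1.$$
A variable $j$ is selected by the Lasso if $\hat\beta_j(\lambda) \neq 0$, and the selection is false if the variable is a noise variable in the sense that $\beta_j = 0$.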
The Lasso is perhaps the most popular method in the high-dimensional setting, where the number of features $p$ is very large and sparse solutions are therefore desired. Here the sparsity, defined as $k = |\{j : \beta_j \neq 0\}|$, indicates that only $k$ out of the sea of all $p$ explanatory variables are in effect and have non-zero regression coefficients. Owing to the sparsity of its solution, the Lasso is also widely recognized as a feature selection tool in practice.
Therefore, one particularly interesting and important question is to understand how well the Lasso performs as a feature selector. The best-case scenario is that one can find a well-tuned $\lambda$ such that the Lasso estimator discovers all the true variables and only the true variables, resulting in zero type I error and full power. This is known as feature selection consistency. In theory, when the signals are strong enough and the sparsity is small enough, consistency can be guaranteed asymptotically [wainwright2009information]. In practice, nevertheless, model selection consistency is usually a pipe dream: for example, even in a moderately sparse regime, consistency is impossible [su2017false], and thus the tradeoff between the type I error and the statistical power is unavoidable. Therefore, it is much more sensible to evaluate the performance of the Lasso not at some single “best” point (since there is no such point), but rather to focus on the entire Lasso path and evaluate the overall tradeoff between the type I error and the statistical power. To assess the quality of each selected model at some $\lambda$, we use the false discovery proportion (FDP) and the true positive proportion (TPP) as measures. Formally, the FDP and TPP are defined as:
$$\mathrm{FDP}(\lambda) = \frac{|\{j : \hat\beta_j(\lambda) \neq 0,\ \beta_j = 0\}|}{|\{j : \hat\beta_j(\lambda) \neq 0\}| \vee 1}, \qquad \mathrm{TPP}(\lambda) = \frac{|\{j : \hat\beta_j(\lambda) \neq 0,\ \beta_j \neq 0\}|}{k \vee 1}.$$
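As a concrete illustration, the two proportions can be computed directly from an estimated coefficient vector and the true one. The following is a minimal sketch: the function name `fdp_tpp` and the toy vectors are hypothetical, and it assumes the common convention that an empty denominator is replaced by one.

```python
def fdp_tpp(beta_hat, beta):
    """Return (FDP, TPP) for an estimate beta_hat of the true vector beta.

    Assumption: empty denominators are replaced by 1, so that selecting
    nothing gives FDP = 0 and an all-zero truth gives TPP = 0.
    """
    selected = [j for j, b in enumerate(beta_hat) if b != 0]
    false_disc = sum(1 for j in selected if beta[j] == 0)
    true_disc = len(selected) - false_disc
    k = sum(1 for b in beta if b != 0)  # sparsity of the truth
    return false_disc / max(len(selected), 1), true_disc / max(k, 1)


# Toy example: 3 true signals; the estimate finds 2 of them plus 1 noise variable.
beta = [2.0, -1.5, 0.7, 0.0, 0.0, 0.0]
beta_hat = [1.8, 0.0, 0.4, 0.0, -0.1, 0.0]
fdp, tpp = fdp_tpp(beta_hat, beta)
print(fdp, tpp)
```

In this toy example one of the three selected variables is false and two of the three true signals are found, so the pair is (1/3, 2/3).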
In this paper, we characterize the exact region in the TPP–FDP diagram where the Lasso tradeoff curves lie. A complete theoretical study of such a diagram enhances our understanding of the advantages and limitations of the Lasso. It can be used to theoretically guide the data analysis procedure and to explain why the Lasso has good empirical performance in certain scenarios.
1.1 Prior Art and Our Contribution
In [su2017false], Su et al. proved that under the linear sparsity regime, where the ratio $k/p$ is roughly constant, it is impossible for the Lasso to achieve feature selection consistency, and they established a tight lower boundary of the tradeoff curves. Moreover, they showed that when $n/p$ is less than 1 (i.e., in the high-dimensional setting) and when the sparsity is large enough, the TPP of the Lasso is always bounded away from 1, regardless of $\lambda$. This phenomenon is closely related to the Donoho–Tanner (DT) phase transition [donoho2009observed, donoho2009counting, donoho2005neighborliness, donoho2006high, donoho2011noise, bayati2015universality]. The results in [su2017false] can be visualized in the schematic plots in Figure 1. However, the dichotomy offered by their lower boundary does not give a complete picture: it is clear that the red region is unachievable, yet the lower boundary says little about the entire “Possibly Achievable” region above it. This ambiguity is resolved in this paper.
In a more recent paper [wang2020price], the authors identified a region in the TPP–FDP diagram, termed the “Lasso Crescent”, in the noiseless case, which is enclosed by a sharp upper boundary and a sharp lower boundary. To gain more insight into those two boundaries, a new notion termed “effect size heterogeneity” is proposed: while all the other conditions remain unchanged, the more heterogeneous the magnitudes of the effect sizes are, the better the Lasso performs. As a result, the upper boundary (or the worst tradeoff curve) in the noiseless case is given by the homogeneous effect sizes, while the lower boundary (or the best tradeoff of TPP–FDP) is achieved asymptotically by the most heterogeneous effect sizes. Though achievability is not the primary focus in [wang2020price], they partially refine the achievable region from the entire white region in Figure 1 to the Lasso Crescent. But their scenario is limited to the case without taking the DT phase transition into account, and it is still unclear whether the entire region enclosed in the Lasso Crescent is achievable or not.
In this paper, we study the exact achievable region, taking into account the cases below and above the DT phase transition. We depict the complete Lasso achievability diagrams in terms of the TPP–FDP tradeoff in all possible scenarios. On top of [su2017false], we specifically find three inherently different tradeoff diagrams as shown in Figure 2: We enclose the achievable region by exact upper boundaries, and notably, we identify two distinct sub-cases (Left and Middle panels of Figure 2) when it is below the DT phase transition, as opposed to the corresponding panels (Left and Middle panels) in Figure 1. Our work provides a worthwhile understanding of the DT phase transition, in the sense that we consider not only the possible region of the TPP (i.e., power), but also the possible region of (TPP, FDP) jointly.
To establish the theoretical results of our work, we leverage the powerful Approximate Message Passing (AMP) theory that has been developed recently [donoho2013high, bayati2011dynamics, bayati2011lasso, donoho2009message]. We also use a homotopy argument to show the asymptotic achievability of every point within the claimed region.
2 Main Results
We start presenting our results by laying out the working assumptions as follows.
We assume that $X$ consists of i.i.d. $\mathcal{N}(0, 1/n)$ entries so that each column has approximately unit norm. As the index $l \to \infty$, we assume $n_l, p_l \to \infty$ with $n_l/p_l \to \delta$ in the limit for some positive constant $\delta$. The index $l$ is often omitted for the sake of simplicity.
The noise $z$ is drawn i.i.d. from $\mathcal{N}(0, \sigma^2)$, where $\sigma$ is any fixed constant. For the completeness of our definition, $X$, $\beta$, and $z$ are jointly independent.
From now on, we consider the tradeoff between the TPP and FDP as $\lambda$ varies in $(0, \infty)$, i.e., along the Lasso path. Notation-wise, we use fdp and tpp to denote values in $[0, 1]$, and reserve TPP and FDP for the corresponding random variables.
2.1 The Complete Lasso Tradeoff Diagram
In this subsection, we characterize the feasible region of the Lasso on the TPP–FDP diagram. We provide a set of constraints that any Lasso path must satisfy, and further show that any point in the region defined by the constraints is achievable asymptotically. We term such a region the feasible region of the Lasso.
For any $\delta$ and $\epsilon$, we define the feasible region of the Lasso to be the set of (tpp, fdp) pairs that satisfy the following constraints: (1) $0 \le \textnormal{tpp} \le 1$; (2) $0 \le \textnormal{fdp} \le 1 - \epsilon$; (3) $\textnormal{fdp} \ge q^\star(\textnormal{tpp})$; (4) $\epsilon\,\textnormal{tpp} \le \delta\,(1 - \textnormal{fdp})$. The function $q^\star$ in the above definition is a deterministic increasing function of tpp, which is defined in (3.7) and was first proposed in [su2017false]. Recall that $\epsilon$ is the sparsity ratio.
We now state our main theorem. To avoid any confusion, we say that a point (tpp, fdp) is asymptotically achievable if there exist some prior and noise level, and a sequence of (possibly adaptive) $\lambda$, such that $(\mathrm{TPP}(\lambda), \mathrm{FDP}(\lambda))$ converges to (tpp, fdp). (Footnote: Here $\mathrm{TPP}(\lambda)$ and $\mathrm{FDP}(\lambda)$ are the TPP and FDP calculated at a realization of the design matrix, the regression coefficients, and the noise.)
[The complete Lasso tradeoff diagram] For any $\delta$ and $\epsilon$, under the working assumptions, the following conditions hold:
(a) Any Lasso tradeoff curve lies inside the feasible region asymptotically.
(b) Any point in the feasible region is asymptotically achievable by the Lasso.
Theorem 1 provides a complete characterization of the location of the Lasso solution on the TPP–FDP diagram. It can be seen from this theorem that the region is essentially the union of all the Lasso paths asymptotically:
We pause here to explain the intuition behind the constraints that define the feasible region: constraint (1) comes directly from the definition; constraint (2) follows from the fact that the Lasso outperforms a random guess, and hence selects no more than a $1-\epsilon$ fraction of false signals; constraint (3) is the main result of [su2017false, Theorem 2.1]; constraint (4) comes from the fact that the number of selected variables does not exceed $n$.
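The four constraints translate into a simple membership test. The sketch below is illustrative only: the function name `in_feasible_region` is hypothetical, the algebraic forms of constraints (2) and (4) are the ones suggested by the intuition above (a random guess has FDP $1-\epsilon$, and at most $n$ variables are selected), and the lower boundary of constraint (3) is not implemented but passed in by the caller.

```python
def in_feasible_region(tpp, fdp, eps, delta, lower_boundary, tol=1e-12):
    """Check the four constraints, as assumed here, for a (tpp, fdp) pair.

    lower_boundary is a caller-supplied function u -> lower bound on fdp;
    it stands in for the curve of constraint (3), which is not implemented.
    """
    c1 = -tol <= tpp <= 1 + tol                   # (1) tpp is a proportion
    c2 = -tol <= fdp <= 1 - eps + tol             # (2) no worse than a random guess
    c3 = fdp >= lower_boundary(tpp) - tol         # (3) above the lower boundary
    c4 = eps * tpp <= delta * (1 - fdp) + tol     # (4) at most n discoveries
    return c1 and c2 and c3 and c4


# Hypothetical placeholder for the lower boundary, for illustration only.
placeholder = lambda u: 0.0
print(in_feasible_region(0.5, 0.6, eps=0.3, delta=0.5, lower_boundary=placeholder))
print(in_feasible_region(1.0, 0.5, eps=0.3, delta=0.5, lower_boundary=placeholder))
```

With $\epsilon = 0.3$ and $\delta = 0.5$, the first point passes every check, while the second violates constraint (4): full power would require more than $n$ discoveries at that FDP.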
In Figure 2, we plot three different cases of the Lasso tradeoff diagrams with different and , where the region is marked in blue. The difference between these diagrams comes from different active constraints defining .
We rewrite the boundary lines of the constraints (2) and (4) of Definition 2.1 as
$$\textnormal{fdp} = 1 - \epsilon \quad (2.1), \qquad \textnormal{fdp} = 1 - \frac{\epsilon}{\delta}\,\textnormal{tpp} \quad (2.2),$$
and then we can describe the three tradeoff diagrams accordingly.
Case 1: When $\delta \ge 1$, the line (2.2) is always above the line (2.1) on the unit interval, and thus only the constraint from (2.1) is active. We note that this case is below the DT phase transition. This case corresponds to the left plot in Figure 2.
Case 2: When $\delta < 1$ and $\epsilon$ is sufficiently small, the lines (2.1) and (2.2) intersect. Hence both constraints are active. However, (2.2) is always above the lower boundary curve of constraint (3) and thus never intersects with it on the unit interval. Similar to Case 1, it is below the DT phase transition. This case corresponds to the middle plot of Figure 2.
Case 3: When $\delta < 1$ and $\epsilon$ is sufficiently large, again (2.1) and (2.2) intersect, and (2.2) also intersects with the lower boundary curve within the unit interval. We observe the DT phase transition. This case corresponds to the right plot of Figure 2. Surprisingly, the maximum TPP achievable by the Lasso is exactly where (2.2) and the lower boundary curve intersect.
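The split between the first case and the other two can be read off by intersecting the two upper boundary lines. The short computation below is a sketch under the assumption that constraint (2) has boundary $\textnormal{fdp} = 1-\epsilon$ and constraint (4) has boundary $\textnormal{fdp} = 1 - \epsilon\,\textnormal{tpp}/\delta$, which are the forms suggested by the intuition given for the constraints.

```latex
\begin{align*}
1 - \epsilon = 1 - \frac{\epsilon}{\delta}\,\textnormal{tpp}
  \quad &\Longleftrightarrow \quad \textnormal{tpp} = \delta, \\
\delta \ge 1: \quad & 1 - \frac{\epsilon}{\delta}\,\textnormal{tpp} \;\ge\; 1 - \epsilon
  \ \text{ for all } \textnormal{tpp} \in [0,1]
  \ \text{ (only constraint (2) binds)}, \\
\delta < 1: \quad & \textnormal{fdp} \;\le\;
  \min\Big\{\, 1 - \epsilon,\ 1 - \frac{\epsilon}{\delta}\,\textnormal{tpp} \Big\}
  \ \text{ (a poly-line with a kink at } (\delta,\ 1-\epsilon)\text{)}.
\end{align*}
```

Under these assumptions, the sampling ratio alone decides whether constraint (4) is ever active, while the sparsity ratio decides whether the sloped segment reaches the lower boundary before full power is attained.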
In particular, when (i.e., Case 2 or 3), consider the following equation in from [su2017false, Equation (C.5)],
There is a unique positive such that when , (2.3) has a unique positive root in . When there is no positive root . This is known as the DT phase transition point. With , we can define the maximum achievable TPP as in [su2017false, Lemma C.2]
When , Lasso can have power arbitrarily close to , which falls into Case 2 in our discussion above. When , the power is at most and bounded away from 1, and this corresponds to Case 3. We can show in the following lemma an equivalent characterization of the DT phase transition. The curve intersects with in if and only if , and in the intersecting scenario, the intersection point is at . This lemma provides an alternative view of the DT phase transition: the setting is above the DT phase transition if and only if the curve and line intersect with each other. From this perspective, the maximum power is not a magic output of some mysterious machine, but is indeed the power when exactly discoveries are made by Lasso when it is on the lower boundary . The proof of this lemma can be found in Appendix A.
Before we end this section, we emphasize the difference between our findings in Theorem 1 and the results in [su2017false]. We prove here that the asymptotically achievable points constitute exactly the feasible region, while [su2017false] only showed that all Lasso paths lie above the lower boundary curve, without further specifying which of the remaining points are achievable and which are not. Furthermore, below the DT phase transition, we separate Case 1 and Case 2 (the Left and Middle panels in Figure 2), which are not distinguished therein. Those two diagrams are very different. The case in the Middle panel guarantees that one cannot make too many mistakes with full power, since the FDP has a non-trivial upper bound.
To elaborate on the last point, we demonstrate three more tradeoff diagrams below, focusing on the case when . As it is clear from Figure 3, our complete Lasso tradeoff diagrams show that the achievable region is relatively narrow in its vertical direction when the TPP is large (close to ). In the left and right panel, we see that it is above the DT phase transition in both cases, and thus there is a single value of FDP (around and ) when the TPP achieves its maximum. From the middle panel, we see that the range of the FDP is also narrow () when the TPP is close to . We want to emphasize that according to the result in [su2017false], the lower bounds of FDP in all cases (, and ) are the best possible value achievable when the TPP is close to its maximum. However, our complete Lasso tradeoff diagram also guarantees that it is impossible to have a much worse FDP than the best possible ones when the TPP is large.
In this section, we establish part (a) of Theorem 1. To this end, we first show that under our assumptions, the TPP and FDP have uniform limits. As will be clear in subsection 3.1, it is not hard to check that such limits of the pairs (TPP, FDP) are indeed within the feasible region.
To prove part (b) of Theorem 1, we first show that the boundary of the feasible region can be achieved by a sequence of priors and noises. Then, we extend the achievability result from the boundary to the entire interior of the region by a homotopy argument.
3.1 Proof of Part (a) of Theorem 1
In this subsection, we first invoke the following lemma to derive uniform limits of TPP and FDP over $\lambda$. We denote by $\Pi^\star$ the conditional distribution of the prior $\Pi$ given that it is not zero, that is, the distribution of $(\Pi \mid \Pi \neq 0)$.
(Lemma A.1 and A.2 in [su2017false], see also Theorem 1 in [supp] and Theorem 1.5 in [BM12]). Fix any , , , and . When and , we have the following uniform convergence,
where the two deterministic functions are
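and where $W \sim \mathcal{N}(0,1)$ is independent of $\Pi$. In addition, the pair $(\alpha, \tau)$ is the unique solution to (3.5). (Footnote: The first equation is known as the state evolution equation, and the second is the calibration equation. The notation $\eta_t(\cdot)$ is the soft-thresholding operator defined as $\eta_t(x) = \mathrm{sign}(x)\,(|x| - t)_+$. We note that the solution also satisfies $\alpha > \alpha_0$, where $\alpha_0$ is the unique root of $(1+t^2)\Phi(-t) - t\phi(t) = \delta/2$ in $t \ge 0$.)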
The guarantee of uniform convergence of TPP and FDP along the Lasso path allows us to directly deal with the two deterministic functions and , instead of considering TPP and FDP as random variables for each finite and , which depends on the realizations of the signal, the noise, and the design matrix. Thus, we can focus on the properties of and .
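To make the two deterministic limits tangible, the sketch below evaluates them numerically for a prior with finitely many atoms. It rests on assumptions that should be checked against (3.3)–(3.5): the state evolution equation is taken to be $\tau^2 = \sigma^2 + \frac{1}{\delta}\mathbb{E}(\eta_{\alpha\tau}(\Pi + \tau W) - \Pi)^2$, and the limits are taken to be $\textnormal{tpp} = \mathbb{P}(|\Pi^\star + \tau W| > \alpha\tau)$ and $\textnormal{fdp} = 2(1-\epsilon)\Phi(-\alpha) / (2(1-\epsilon)\Phi(-\alpha) + \epsilon\,\textnormal{tpp})$, the standard AMP forms. The function names, the two-atom prior, and the parameter values are hypothetical, and the curve is parameterized by $\alpha$ rather than $\lambda$.

```python
import math


def Phi(x):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def soft(x, t):
    """Soft-thresholding: sign(x) * max(|x| - t, 0)."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def gauss_mean(f, lo=-8.0, hi=8.0, m=2000):
    """Trapezoid-rule approximation of E f(W) for W ~ N(0, 1)."""
    h = (hi - lo) / m
    total = 0.5 * (f(lo) * phi(lo) + f(hi) * phi(hi))
    for i in range(1, m):
        w = lo + i * h
        total += f(w) * phi(w)
    return total * h


def solve_tau(alpha, delta, sigma, atoms, n_iter=60):
    """Fixed-point iteration for tau in the assumed state evolution equation.

    atoms: list of (value, probability) pairs describing the prior Pi,
    including the atom at zero.
    """
    tau = max(sigma, 1.0)
    for _ in range(n_iter):
        mse = sum(
            pr * gauss_mean(lambda w, v=v: (soft(v + tau * w, alpha * tau) - v) ** 2)
            for v, pr in atoms
        )
        tau = math.sqrt(sigma ** 2 + mse / delta)
    return tau


def limits(alpha, tau, eps, signal_atoms):
    """Assumed limits (tpp, fdp); signal_atoms describes Pi given Pi != 0."""
    tpp = sum(
        pr * (Phi(v / tau - alpha) + Phi(-v / tau - alpha)) for v, pr in signal_atoms
    )
    null = 2.0 * (1.0 - eps) * Phi(-alpha)  # mass of selected noise variables
    return tpp, null / (null + eps * tpp)


eps, delta, sigma, alpha = 0.2, 0.5, 1.0, 1.5
tau = solve_tau(alpha, delta, sigma, [(0.0, 1 - eps), (4.0, eps)])
tpp, fdp = limits(alpha, tau, eps, [(4.0, 1.0)])
print(round(tau, 3), round(tpp, 3), round(fdp, 3))
```

Sweeping `alpha` over a grid traces out one deterministic tradeoff curve for the chosen prior and noise level. The plain fixed-point iteration is a convenience choice here and is only expected to converge when `alpha` exceeds the threshold $\alpha_0$.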
To prove part (a) of Theorem 1, we will prove that any pair of limits satisfies all four constraints of Definition 2.1. Constraint (1) is trivial, since TPP is bounded between $0$ and $1$ by definition, and so is its limit. We now prove the remaining constraints from Definition 2.1 in the order (4), (2), (3). We start with constraint (4).
For any , and , we have . Proof of Lemma 3.1. Let be the number of false discoveries, and be the number of total discoveries. Recall the sparsity is . By definition, we have and . So, . We view as a function of , and treat as fixed. Note that by definition, , and the number of discoveries made by the Lasso is less than or equal to , so always holds. It is not hard to see that, at the two endpoints and , attains its maximum . Hence, for any , and , . By the uniform convergence of TPP and FDP in (3.1)(3.2), we finally have .∎
The key element of this lemma is simply that “the Lasso never selects more than variables”, which results in . This is a simple corollary of the KKT condition of the Lasso.
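The counting argument behind constraint (4) can be written out in a few lines. The following is a sketch, assuming $V$ denotes the number of false discoveries, $R$ the total number of discoveries, $k/p \to \epsilon$, and $n/p \to \delta$; the symbols $V$ and $R$ are introduced here for illustration.

```latex
\begin{align*}
R - V &= k \cdot \mathrm{TPP}, \qquad V = R \cdot \mathrm{FDP}
  && \text{(definitions of TPP and FDP)} \\
k \cdot \mathrm{TPP} &= R\,(1 - \mathrm{FDP}) \;\le\; n\,(1 - \mathrm{FDP})
  && \text{(the Lasso selects at most } n \text{ variables)} \\
\frac{k}{p}\,\mathrm{TPP} &\le \frac{n}{p}\,(1 - \mathrm{FDP})
  \;\;\Longrightarrow\;\;
  \epsilon\,\textnormal{tpp} \le \delta\,(1 - \textnormal{fdp})
  && \text{(passing to the limit)} \\
\textnormal{fdp} &\le 1 - \frac{\epsilon}{\delta}\,\textnormal{tpp}.
\end{align*}
```

When no variable is selected, both sides of the second line are zero under the convention $\mathrm{FDP} = 0$, so the inequality still holds.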
The next lemma proves constraint (2) in Definition 2.1. Under the working assumptions, the limit of the FDP along the Lasso path is at most $1-\epsilon$. To see why this lemma should be intuitively true, recall that the sparsity ratio is $\epsilon$, and if we select signals uniformly at random, each variable that we select is false with probability $1-\epsilon$. As a result of randomly selecting variables, we end up with an FDP of $1-\epsilon$ on average. It is natural for the Lasso to produce a better result. Therefore, this lemma is a sanity check on the Lasso, claiming that the Lasso performs no worse than a random guess. Consequently, $1-\epsilon$ serves as a simple upper bound for the FDP.
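The random-guess benchmark can be verified exactly in finite samples. The sketch below computes the expected FDP when a fixed number of variables is drawn uniformly at random without replacement; the function name `expected_fdp_random_guess` and the sizes `p`, `k`, `m` are hypothetical choices for illustration.

```python
from fractions import Fraction
from math import comb


def expected_fdp_random_guess(p, k, m):
    """Exact E[FDP] when m of p variables are selected uniformly at random
    and exactly k of the p variables are true signals."""
    total = Fraction(0)
    # v = number of false selections, which follows a hypergeometric law
    for v in range(max(0, m - k), min(m, p - k) + 1):
        prob = Fraction(comb(p - k, v) * comb(k, m - v), comb(p, m))
        total += prob * Fraction(v, m)
    return total


p, k = 50, 10  # sparsity ratio k / p = 0.2
for m in (1, 5, 25):
    print(m, expected_fdp_random_guess(p, k, m))
```

For every selection size the expectation equals $1 - k/p$, which is the finite-sample counterpart of the bound $1-\epsilon$.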
Proof of Lemma 3.1. Observe the probability
, where the inequality holds as the standard normal distribution is uni-modal at the origin and. Therefore, by (3.3) we have
Next, we proceed to prove constraint (3) in Definition 2.1. We introduce the lower boundary of any Lasso tradeoff curve as follows: let be the largest solution to
where $\Phi$ and $\phi$ denote the cumulative distribution function and the probability density function of the standard normal distribution, respectively. Then the lower boundary is defined as
With the definition of , we can now state the fundamental tradeoff between and , which serves as a lower bound for FDP. [Theorem 2.1 in [su2017false]] For any , and , we have and the inequality is tight.
Combining all those pieces, we can now prove part (a) of Theorem 1.
3.2 Proof of Part (b) of Theorem 1
In the previous subsection, we have shown that, asymptotically, all (TPP, FDP) pairs along the Lasso path lie inside the feasible region. We now proceed to prove that every point in the feasible region is indeed asymptotically achievable by some Lasso solution.
We note that for each prior , each specific noise level , and each fixed , the pair (TPP, FDP) is asymptotically a fixed point specified as the in Lemma 3.1. Therefore, given and , the entire achievable region is composed by the trajectory of resulting from the variation of the three parameters: and . To simplify the analysis of the trajectory of , we will always fix some of these parameters.
Following the proposal in the preceding paragraph, we first “fix” the penalty parameter and consider the variation of the noise and the prior. We focus on two extreme scenarios that are easy to analyze, namely when $\lambda$ is large enough and when $\lambda \to 0$. We can make $\lambda$ large enough so that the Lasso makes no discovery for any fixed prior and noise, and hence the trajectory of the limiting (TPP, FDP) pair is a vertical line at $\textnormal{tpp} = 0$. (Footnote: We note that this is the limiting regime when $\lambda \to \infty$, which is the moment when we are about to have an infinitesimal positive power. When this happens, the FDP can be non-zero, depending on the possibility of the first variable being a false variable. For notational convenience, we abuse the notation a little bit, and use $\lambda = \infty$ to denote this limiting regime in the following.) In the other case when $\lambda \to 0$, there is almost zero shrinkage. Thus, the Lasso tends to make the maximum possible number of discoveries. Notice that when $n$ is larger than $p$, the TPP tends to $1$ almost surely, since the Lasso eventually behaves as the least squares regression. However, when $n$ is less than $p$, the Lasso selects at most $n$ variables, and thus the maximum of the TPP can either be $1$ or strictly less than $1$, depending on the sparsity. Therefore, with different sampling ratio $\delta$ and sparsity ratio $\epsilon$, one expects the limiting (TPP, FDP) pairs to have different trajectories when we vary the prior and the noise. Another useful extreme case corresponds to “fixing” the prior to be some (sequence of) priors, such that when we vary $\sigma$ from $0$ to $\infty$, the corresponding Lasso path approaches the lower boundary of the feasible region (i.e., the curve of constraint (3)) and the upper boundary of the feasible region (i.e., $\textnormal{fdp} = 1 - \epsilon$).
We will prove that the four trajectories in the above discussion jointly constitute the boundary of the feasible region, whose achievability is guaranteed asymptotically. It is then a short step to prove the achievability of the interior of the feasible region by a homotopy argument.
We start with the following lemma, which analyzes the trajectory of when and , separately. Recall that we define in equation (2.3).
For any prior and any noise level , let be the corresponding TPP–FDP tradeoff curve. The following statements hold:
As , we have , for any and ;
When , as , we have lying on the vertical line ;
When and , as , we have lying on the line ;
When and , as , we have lying on the poly-line of and .
Next, we need to find some specific priors to approach the upper and lower boundary when varies. However, a single prior isn’t enough to attain the boundary. Instead, we show in the following lemma that there exists a sequence of priors, such that when , the Lasso tradeoff curves approximate the lower boundary in (3.7).
[Lemma 4.8 in [wang2020price]] Suppose , then there exists a sequence of priors , such that as , converges uniformly to .
This lemma finds a sequence of priors that achieves the lower boundary when . We now need to show when , the tradeoff curves of the same sequence of priors approach the upper boundary . This claim is intuitively correct: in the presence of infinite noises, the should be
for any prior, since the signal to noise ratio is infinitely small, and any discovery of the Lasso is equivalent to the discovery of a random guess. Formally, we have the following lemma. For any prior, when the noise level , .
Combining the three preceding lemmas, it is easy to check that in those extreme cases, the limiting points form the entire boundary of the feasible region. Now we introduce a homotopy lemma to bridge from the achievability of the boundary of the feasible region to the achievability of its interior. The idea of homotopy is quite intuitive: suppose there are two curves and a continuous transformation that moves the first curve onto the second. It is easy to imagine that the trajectories of the two endpoints of the moving curve during the transformation form two other curves. Together, these four curves surround a region, which is determined by the starting curve and the transformation. The homotopy theory guarantees that every point in this region is passed by the transforming curve during the transformation.
We formalize this idea in the following lemma. Since its form is similar in spirit to the one-dimensional intermediate value theorem, we term it as an intermediate value theorem on the plane. Its proof is given in Appendix B.
[Intermediate value theorem on the plane] If a continuous curve in is parameterized by and if the four curves:
join together as a simple closed curve , , then encloses an interior area , and , such that . In other words, every point inside the boundary curve is realizable by some . (Footnote: We actually want to specify the orientation of the curve to be positively oriented, i.e., counter-clockwise oriented, as is the convention for Jordan curves. So if we assume , , and are all positively oriented, then should be . More details can be found in Appendix B.)
Following our proposal at the beginning of this section, we are now equipped with all we need to prove part (b) of Theorem 1.
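A toy instance may help to visualize the lemma. In the sketch below, the two curves `lower` and `upper` and the linear deformation `curve` are hypothetical stand-ins, not the Lasso boundaries; because the deformation keeps the first coordinate fixed, the planar statement reduces to the one-dimensional intermediate value theorem on each vertical fiber, and bisection recovers the parameters of any enclosed point. The lemma itself covers deformations without this special structure.

```python
def lower(u):
    """Hypothetical lower curve on [0, 1] (stand-in for a lower boundary)."""
    return u * u / 2.0


def upper(u):
    """Hypothetical upper curve on [0, 1] (stand-in for an upper boundary)."""
    return 0.8 - 0.3 * u


def curve(s, u):
    """Continuous deformation: s = 0 gives the lower curve, s = 1 the upper."""
    return (u, (1.0 - s) * lower(u) + s * upper(u))


def find_parameters(x, y, tol=1e-12):
    """Bisection in s for a point (x, y) with lower(x) <= y <= upper(x)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if curve(mid, x)[1] < y:
            lo = mid  # the deformed curve is still below the target
        else:
            hi = mid
    return 0.5 * (lo + hi), x


s, u = find_parameters(0.6, 0.4)
print(s, u, curve(s, u))
```

Every point between the two toy curves is hit by some member of the family, which mirrors how the proof sweeps the feasible region by varying the noise level.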
Proof of Theorem 1(b).
Fix any and . By Lemma 3.1, to prove the asymptotic achievability of by , we only need to prove every point of can be achieved by some . Note that is a function of and . So we can denote , , and
By Lemma 3.2, there exists a sequence of priors such that the lower boundary is the uniform limit of when . Combining this with our definition of , we know that is exactly the curve . By Lemma 3.2, we have for any , and thus , which is the upper boundary. We can view the lower boundary as curve and the upper boundary as curve in the Lemma 3.2. Given this, our goal is to find a transformation such that transforms to and encloses exactly the region .
We define the transformation as . From the uniform convergence, we know itself is continuous. It is straightforward to verify that is the lower boundary and is the upper boundary. (Footnote: Notice that corresponds to the case when .)
Now, consider the tradeoff curve of where . Notice that for each , the curve is continuous and bounded on any , and from Proposition 3.2 we know has a well-defined limit as , and . Therefore, we can make a continuous extension of from to using the natural compactification. So, without loss of generality, we can think of as a compact interval on the extended real line. Let , and .
Now, we observe that the curve corresponds to the case that , or effectively . Therefore by Lemma 3.2, it is just the line , where is the intersection of and . By Lemma 3.2, curve is just the curve in the range . By the part (1) of Proposition 3.2, is always on the segment joining () and (), so is just this segment. Similarly, by the part (2-4) of Proposition 3.2, is always on the poly-line joining from down to the endpoint of curve .
We observe that is the boundary of the region . Therefore, every point in is achievable by some by the homotopy Lemma 3.2. ∎
To better illustrate our theoretical results and understand the proof of Theorem 1(b), we present the following simulations where we fix the signals and vary the sampling ratio $\delta$ and the magnitude of the noise $\sigma$. As indicated by Theorem 1, the Lasso path indeed sweeps the complete achievable region enclosed by the upper and lower boundaries in (2.1), (2.2), and (3.7), when $\sigma$ varies from $0$ to $\infty$. In the simulations, we fix , and set and , respectively. Consequently, the sparsity ratio is fixed to while the sampling ratio is and . We note that these are the same parameters as we used in Figure 1 and Figure 2. Across all simulations, we fix $\beta$ to be a $k$-sparse vector with 5 different levels of magnitudes: or , and each level contains variables. When , the Lasso path is close to the lower boundary. As $\sigma$ increases, the Lasso path gradually moves upward and becomes worse, since for each fixed TPP the FDP is larger. When $\sigma$ is sufficiently large, the Lasso tradeoff curve approaches the upper boundary and behaves similarly to the random guess. We plot the Lasso tradeoff curves for 8 levels of $\sigma$ in Figure 4. From the results, it is not hard to see the alignment of our theoretical results and the real simulations: our theoretical achievable TPP–FDP regions indeed enclose all Lasso paths (up to small random errors) and the boundaries are tight. The R code for the simulations and for plotting the Lasso tradeoff diagrams is available at https://github.com/HuaWang-wharton/CompleteLassoDiagram.
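For readers who want to reproduce the qualitative behavior without the repository, the sketch below traces a few (TPP, FDP) points along a Lasso path on a small synthetic problem. It is not the authors' R code: the problem sizes, the signal magnitude, the noise level, the penalty grid, and the plain cyclic coordinate descent solver are all assumptions made for illustration, and the sizes are far too small to match the asymptotic curves closely.

```python
import random


def soft_threshold(x, t):
    """Soft-thresholding operator: sign(x) * max(|x| - t, 0)."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=100):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    r = list(y)  # residual y - X b
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # inner product of column j with the partial residual
            rho = sum(X[i][j] * r[i] for i in range(n)) + col_sq[j] * b[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            if new != b[j]:
                d = new - b[j]
                for i in range(n):
                    r[i] -= X[i][j] * d
                b[j] = new
    return b


def fdp_tpp(b_hat, beta):
    """(FDP, TPP) with empty denominators replaced by 1."""
    selected = [j for j, v in enumerate(b_hat) if v != 0.0]
    false_disc = sum(1 for j in selected if beta[j] == 0.0)
    k = sum(1 for v in beta if v != 0.0)
    return false_disc / max(len(selected), 1), (len(selected) - false_disc) / max(k, 1)


rng = random.Random(0)
n, p, k, sigma = 40, 60, 10, 0.5          # hypothetical sizes
X = [[rng.gauss(0.0, n ** -0.5) for _ in range(p)] for _ in range(n)]
beta = [4.0] * k + [0.0] * (p - k)
y = [sum(X[i][j] * beta[j] for j in range(k)) + sigma * rng.gauss(0.0, 1.0)
     for i in range(n)]
# smallest penalty at which the all-zero vector is optimal
lam_max = max(abs(sum(X[i][j] * y[i] for i in range(n))) for j in range(p))

path = []
for frac in (1.0, 0.6, 0.3, 0.1):
    b_hat = lasso_cd(X, y, frac * lam_max)
    path.append((frac,) + fdp_tpp(b_hat, beta))
for frac, fdp, tpp in path:
    print(f"lambda = {frac:.1f} * lam_max  FDP = {fdp:.3f}  TPP = {tpp:.3f}")
```

Repeating the loop for several values of `sigma` and plotting FDP against TPP gives a small-scale analogue of the sweep described above.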
5 Conclusions and Future Work
Our result provides the first complete discussion of the Lasso tradeoff diagram between the TPP and FDP under all possible circumstances. In contrast to previous works, we focus on quantifying the exact achievable region of (TPP, FDP) pairs asymptotically, and therefore resolve the open question of determining the achievability of points in the tradeoff diagram. Notably, we discover that even below the DT phase transition, there is a finer sub-classification into the case $\delta \ge 1$ and the case $\delta < 1$ (compare the Left and Middle panels in Figure 2 to those in Figure 1). Furthermore, comparing Figure 2 to Figure 4, we confirm that our result is asymptotically exact and aligns with simulations with a moderate sample size.
In closing, we introduce several directions for future research. First, it would be of interest to extend the results to other penalized regression-based models, including but not limited to the SLOPE [slope], the SCAD [fan2001variable], the group Lasso [friedman2010note] and the sparse group Lasso [friedman2010note, simon2013sparse]. We believe that the homotopy tool can be applied in the search of the exact tradeoff diagrams of these methods. Such diagrams are of great theoretical interest and would surely enhance our understanding of the advantages and limitations of different methods. Second, it is interesting and important to leverage our understanding of the complete diagram to better analyze practical problems and guide more informed fine-tuning of parameters. As illustrated in Figure 3, we can have a very narrow estimate of the false discoveries when we know Lasso has large power. A complete tradeoff diagram can be used to theoretically guide our analysis and to explain why certain methods have good empirical performance in certain scenarios. This can be a good scaffold to develop better methods.
Appendix A Basic Lemmas
In this section, we prove Lemma 2.1 and Lemma 3.2. They are basic facts about the complete Lasso tradeoff diagram, and their proofs are based on the technical tool of the approximate message passing theory. We devote Appendix B to a detailed discussion of the homotopy theory.
To start with, we list two known facts for the reader’s reference. Their proofs are essentially algebraic calculations from the state-evolution equation (The first equation in (3.5)). (Corollary 1.7 in [BM12]) Fix , and . , defined in equation (3.5), is an increasing function of , and .
(Theorem 3.3 in [mousavi2018consistent]) Fix , and . , defined in equation (3.5), is a continuously differentiable (positive) function of , and has exactly a one-time sign change from negative to positive, and therefore is quasi-convex.
The two lemmas above are properties of the parameters in the state evolution equation. We use them together with the following two lemmas to prove Lemma 3.2.
Suppose , where is defined in (2.3). We have
Though it is not hard to prove this fact using pure calculus, we give a simple proof that leverages the definition of $\epsilon^\star$.
Proof of Lemma A.
Let and Notice that for any , we have
where the first inequality is due to the well-known fact for . So is strictly decreasing when , and thus for all .
Now, suppose (A.1) does not hold, and thus we can find some such that
Since , we know
But and . By the continuity of , we know there is a root for in and , contradicting with the definition of , or the fact that has a unique positive root. ∎
for and , we have
Proof of Lemma A.
We prove by contradiction and suppose . By Lemma A, we know the only regime for this to hold is when monotonically decreases as decreases. By easy calculation, we have
This inequality implies that when is small enough, we must have
For this , from the state evolution equation (3.5), we have
which is clearly a contradiction. ∎
Given the lemmas above, we can now prove Lemma 3.2.
Proof of Lemma 3.2.
The last equality is due to Lemma A, which implies that exist in .
To prove (2), notice that when , we have
By the second equation in (3.5) again, we know that as on the left-hand side, we must have
Since by definition , we must have
We now proceed to prove (3) and (4). Note that when , is always positive. Recall is the solution (it is defined in the footnote below Lemma 3.1) , where . Since and , we know that the solution to must be positive.
Now, for any and , we consider the following two cases:
When , we have:
As we have shown in Lemma A, when only case (b) happens, and thus we have proven (3); while when , both cases (a) and (b) are possible, and thus we have proven (4). ∎