Without a priori knowledge about data, it is often difficult to estimate a model or make predictions, either because the number of observations is too small or because the problem dimension is too high. When a problem solution is known to be sparse, sparsity-inducing penalties have proven useful for improving both the quality of the prediction and its interpretability. In particular, the $\ell_1$-norm has been used for that purpose in the Lasso formulation (Tibshirani, 1996).
Controlling the regularization often requires tuning a parameter. In a few cases, the regularization path, that is, the set of solutions for all values of the regularization parameter, can be shown to be piecewise linear (Rosset & Zhu, 2007). This property is exploited in homotopy methods, which follow the piecewise linear path by computing the direction of the current linear segment and the points where the direction changes (also known as kinks). Piecewise linearity of regularization paths was discovered by Markowitz (1952) for portfolio selection; it was similarly exploited by Osborne et al. (2000) and Efron et al. (2004) for the Lasso, and by Hastie et al. (2004) for the support vector machine (SVM). As observed by Gärtner et al. (2010), all of these examples are in fact particular instances of parametric quadratic programming formulations, for which path-following algorithms appear early in the optimization literature (Ritter, 1962).
In this paper, we study the number of linear segments of the Lasso regularization path. Even though experience with data suggests that this number is linear in the problem size (Rosset & Zhu, 2007), it is known that discrepancies can be observed between worst-case and empirical complexities. This is notably the case for the simplex algorithm (Dantzig, 1951), which performs well empirically for solving linear programs even though it suffers from exponential worst-case complexity (Klee & Minty, 1972). Similarly, by using geometrical tools originally developed to analyze the simplex algorithm, Gärtner et al. (2010) have shown that the complexity of the SVM regularization path can be exponential. However, to the best of our knowledge, none of these results apply to the Lasso regularization path, whose theoretical complexity remains unknown. The goal of our paper is to fill this gap.
Our first contribution is to show that in the worst case the number of linear segments of the Lasso regularization path is exactly $(3^p+1)/2$, where $p$ is the number of variables (predictors). We remark that our proof is constructive and significantly different from the ones proposed by Klee & Minty (1972) for the simplex algorithm and by Gärtner et al. (2010) for SVMs. Our approach does not rely on geometry but on an adversarial scheme. Given a Lasso problem with $p$ variables, we show how to build a new problem with $p+1$ variables that increases the complexity of the path by a multiplicative factor. This results in explicit pathological examples that are surprisingly simple, unlike pathological examples for the simplex algorithm or SVMs.
Worst-case complexity analyses are by nature pessimistic. Our second contribution, on approximate regularization paths, is more optimistic. We show that an approximate path for the Lasso with at most $O(1/\sqrt{\varepsilon})$ segments can always be obtained, where every point on the path is guaranteed to be optimal up to a relative $\varepsilon$-duality gap. We follow in part the methodology of Giesen et al. (2010) and Jaggi (2011), who have presented weaker results but in a more general setting for parameterized convex optimization problems. Our analysis builds upon approximate optimality conditions, which we maintain along the path, leading to a practical approximate homotopy algorithm.
2 Background on the Lasso
In this section, we present the Lasso formulation of Tibshirani (1996) and well-known facts, which we exploit later in our analysis. To keep the paper self-contained, we include simple proofs of these results. Let $y$ be a vector in $\mathbb{R}^n$ and $X$ be a matrix in $\mathbb{R}^{n \times p}$. The Lasso is formulated as:
$$\min_{w \in \mathbb{R}^p} \; \frac{1}{2}\|y - Xw\|_2^2 + \lambda \|w\|_1, \tag{1}$$
where the $\ell_1$-norm induces sparsity in the solution $w$ and $\lambda$ controls the amount of regularization. Under a few assumptions, which are detailed in the sequel, the solution of this problem is unique. We denote it by $w^\star(\lambda)$ and define the regularization path $\mathcal{P}$ as the set of all solutions for all positive values of $\lambda$:
$$\mathcal{P} \triangleq \{ w^\star(\lambda) : \lambda > 0 \}.$$
(Footnote 1: For technical reasons, we enforce $\lambda > 0$ even though the limit $\lim_{\lambda \to 0^+} w^\star(\lambda)$ may exist.)
The following lemma presents classical optimality and uniqueness conditions for the Lasso solution (see Fuchs, 2005), which are useful to characterize $\mathcal{P}$:
Lemma 1 (Optimality Conditions of the Lasso).
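A vector $w^\star$ in $\mathbb{R}^p$ is a solution of Eq. (1) if and only if for all $j$ in $\{1,\dots,p\}$, with $x^j$ denoting the $j$-th column of $X$,
$$x^{j\top}(y - Xw^\star) = \lambda\,\mathrm{sign}(w^\star_j) \;\; \text{if } w^\star_j \neq 0, \qquad |x^{j\top}(y - Xw^\star)| \le \lambda \;\; \text{otherwise}. \tag{2}$$
Define $J \triangleq \{ j \in \{1,\dots,p\} : |x^{j\top}(y - Xw^\star)| = \lambda \}$. Assuming the matrix $X_J$ to be full rank, the solution is unique and we have
$$w^\star_J = (X_J^\top X_J)^{-1}(X_J^\top y - \lambda \eta_J), \tag{3}$$
where $\eta \triangleq \mathrm{sign}(X^\top(y - Xw^\star))$ is in $\{-1,0,1\}^p$, and the notation $u_J$ for a vector $u$ denotes the vector of size $|J|$ recording the entries of $u$ indexed by $J$.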
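Eq. (2) can be obtained by considering subgradient optimality conditions. These can be written as $0 \in \{ -X^\top(y - Xw^\star) + \lambda z : z \in \partial\|w^\star\|_1 \}$, where $\partial\|w^\star\|_1$ denotes the subdifferential of the $\ell_1$-norm at $w^\star$. A classical result (Borwein & Lewis, 2006) says that the subgradients $z \in \partial\|w^\star\|_1$ are the vectors in $\mathbb{R}^p$ such that for all $j$ in $\{1,\dots,p\}$, $z_j = \mathrm{sign}(w^\star_j)$ if $w^\star_j \neq 0$, and $|z_j| \le 1$ otherwise. This gives Eq. (2). The equalities in Eq. (2) define a linear system that has a unique solution given by (3) when $X_J$ is full rank.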
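Let us now show the uniqueness of the Lasso solution. Consider another solution $w'^\star$ and choose a scalar $\alpha$ in $(0,1)$. By convexity, $w^\star_\alpha \triangleq \alpha w^\star + (1-\alpha) w'^\star$ is also a solution. For all $j \notin J$, we have $|x^{j\top}(y - Xw^\star_\alpha)| \le \alpha |x^{j\top}(y - Xw^\star)| + (1-\alpha) |x^{j\top}(y - Xw'^\star)| < \lambda$. Combining this inequality with the conditions (2), we necessarily have $[w^\star_\alpha]_{J^c} = w^\star_{J^c} = 0$ (here $J^c$ denotes the complement of the set $J$ in $\{1,\dots,p\}$), and the vector $[w^\star_\alpha]_J$ is also a solution of the following reduced problem:
$$\min_{v \in \mathbb{R}^{|J|}} \; \frac{1}{2}\|y - X_J v\|_2^2 + \lambda \|v\|_1.$$
When $X_J$ is full rank, the Hessian $X_J^\top X_J$ is positive definite and this reduced problem is strictly convex. Thus, it admits a unique solution $[w^\star_\alpha]_J = w^\star_J$. It is then easy to conclude that $w^\star = w^\star_\alpha = w'^\star$. ∎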
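The conditions of Lemma 1 are easy to check numerically. The following is a minimal sketch, assuming the conditions take the standard form (correlation equal to $\lambda\,\mathrm{sign}(w_j)$ on non-zero coordinates, bounded by $\lambda$ elsewhere); the function name `lasso_kkt_violation` and the toy data are illustrative and not part of the paper.

```python
import numpy as np

def lasso_kkt_violation(X, y, w, lam, tol=1e-10):
    """Largest violation of the Lasso optimality conditions at w (0 means optimal)."""
    corr = X.T @ (y - X @ w)           # x_j^T (y - X w) for every column j
    active = np.abs(w) > tol           # coordinates treated as non-zero
    # Non-zero coordinates: the correlation must equal lam * sign(w_j).
    v_active = np.abs(corr[active] - lam * np.sign(w[active]))
    # Zero coordinates: the absolute correlation must not exceed lam.
    v_inactive = np.maximum(np.abs(corr[~active]) - lam, 0.0)
    return float(np.max(np.concatenate([v_active, v_inactive, [0.0]])))

# Toy check with X = I, where the Lasso solution is the soft-thresholding of y.
X = np.eye(2)
y = np.array([3.0, 0.5])
lam = 1.0
w_star = np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)
print(lasso_kkt_violation(X, y, w_star, lam))       # optimal point
print(lasso_kkt_violation(X, y, np.zeros(2), lam))  # non-optimal point
```

A strictly positive return value indicates that the candidate vector is not a solution for the given value of the regularization parameter.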
With the assumption that the matrix $X_J$ is always full rank, we can formally recall a well-known property of the Lasso (see Markowitz, 1952; Osborne et al., 2000; Efron et al., 2004) in the following lemma:
Lemma 2 (Piecewise Linearity of the Path).
The existence/uniqueness of the regularization path was shown in Lemma 1.
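Let us define the set of sparsity patterns $\{-1,0,1\}^p$. Let us now consider $\lambda_1 < \lambda_2$ such that $\mathrm{sign}(w^\star(\lambda_1)) = \mathrm{sign}(w^\star(\lambda_2))$. For all $\theta$ in $[0,1]$, it is easy to see that the vector $w^\star_\theta \triangleq \theta w^\star(\lambda_1) + (1-\theta) w^\star(\lambda_2)$ satisfies the optimality conditions of Lemma 1 for $\lambda = \theta\lambda_1 + (1-\theta)\lambda_2$, and that $w^\star(\theta\lambda_1 + (1-\theta)\lambda_2) = w^\star_\theta$.
This shows that whenever two solutions $w^\star(\lambda_1)$ and $w^\star(\lambda_2)$ have the same signs for $\lambda_1 \neq \lambda_2$, the regularization path between $\lambda_1$ and $\lambda_2$ is a linear segment. As an important consequence, the number of linear segments of the path is smaller than $3^p$, the number of possible sparsity patterns in $\{-1,0,1\}^p$. The path is therefore piecewise linear with a finite number of kinks.
Moreover, since the function $\lambda \mapsto w^\star(\lambda)$ is piecewise linear, it is piecewise continuous and has right and left limits for every $\lambda > 0$. It is easy to show that these limits satisfy the optimality conditions of Eq. (2). By uniqueness of the Lasso solution, they are equal to $w^\star(\lambda)$ and the function is in fact continuous. ∎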
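The linearity of the path on a segment can be illustrated with the closed form of Lemma 1. The sketch below assumes the usual expression $(X_J^\top X_J)^{-1}(X_J^\top y - \lambda \eta_J)$ for the active coefficients; the helper `segment_solution`, the data, and the sign pattern are hypothetical, and the code only checks that the formula is affine in $\lambda$ for a fixed sign pattern, not that the pattern is the correct one.

```python
import numpy as np

def segment_solution(X, y, J, eta_J, lam):
    """Active coefficients for a fixed active set J and fixed signs eta_J."""
    XJ = X[:, J]
    return np.linalg.solve(XJ.T @ XJ, XJ.T @ y - lam * eta_J)

# Hypothetical data and sign pattern, chosen only for illustration.
X = np.array([[1.0, 0.5], [0.0, 1.0], [1.0, 1.0]])
y = np.array([2.0, 1.0, 0.5])
J = [0, 1]
eta_J = np.array([1.0, 1.0])

lam1, lam2, theta = 0.1, 0.4, 0.3
w1 = segment_solution(X, y, J, eta_J, lam1)
w2 = segment_solution(X, y, J, eta_J, lam2)
w_mid = segment_solution(X, y, J, eta_J, theta * lam1 + (1 - theta) * lam2)

# With the same signs at both ends, intermediate points are convex combinations.
print(np.allclose(w_mid, theta * w1 + (1 - theta) * w2))
```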
It can be shown that this algorithm maintains the optimality conditions of Lemma 1 when $\lambda$ decreases. Two assumptions nevertheless have to be made for the algorithm to be correct. First, $X_J^\top X_J$ has to be invertible, which is a reasonable assumption commonly made when working with real data and when one is interested in sparse solutions. When $X_J^\top X_J$ becomes ill-conditioned, which may typically occur for small values of $\lambda$, the algorithm has to stop and the path is truncated. Second, one assumes in Step 7 of the algorithm that the value $\lambda'$ corresponds to a single event: either $|x^{j\top}(y - Xw^\star(\lambda))| = \lambda$ for some $j$ in $J^c$, or $w^\star_j(\lambda)$ hits zero for some $j$ in $J$. In other words, variables enter or exit the path one at a time. Even though this assumption is reasonable most of the time, it can be problematic from a numerical point of view in rare cases. When the length of a linear segment of $\mathcal{P}$ is smaller than the numerical precision, the algorithm can fail. In contrast, our approximate homotopy algorithm presented in Section 4 is robust to this issue. In the next section, we present our worst-case complexity analysis of the regularization path, showing that Algorithm 1 can have exponential complexity.
3 Worst-Case Complexity
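We consider the set of sparsity patterns in $\{-1,0,1\}^p$ encountered along the path $\mathcal{P}$. We have seen in the proof of Lemma 2 that whenever $\mathrm{sign}(w^\star(\lambda_1)) = \mathrm{sign}(w^\star(\lambda_2))$, for $\lambda_1 < \lambda_2$, then $\mathrm{sign}(w^\star(\lambda)) = \mathrm{sign}(w^\star(\lambda_1))$ for all $\lambda$ in $[\lambda_1, \lambda_2]$, and thus the number of linear segments of $\mathcal{P}$ is upper-bounded by $3^p$. With an additional argument, we can further reduce this number, as stated in the following proposition: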
Proposition 1 (Upper-bound Complexity).
Let us assume the same conditions as in Lemma 2. The number of linear segments in the regularization path of the Lasso is at most $(3^p+1)/2$.
We have already noticed that the number of linear segments of the path is at most $3^p$. Let us consider $\eta \triangleq \mathrm{sign}(w^\star(\lambda_1)) \neq 0$ for some $\lambda_1 > 0$. We now show that for all $\lambda > 0$, we have $\mathrm{sign}(w^\star(\lambda)) \neq -\eta$, and therefore the number of different sparsity patterns on the path is in fact less than or equal to $(3^p+1)/2$.
Let us assume that there exists with , and look for a contradiction. We define the set , and consider the solution of the reduced problem for all :
which is well defined since the optimization problem is strictly convex (the conditions of Lemma 2 imply that is full rank). We remark that , and . Given the optimality conditions of Lemma 1, it is then easy to show that . Since the signs of and are opposite to each other and non-zero, we have . Independently, it is also easy to show that the function should be non-increasing, and we obtain a contradiction. ∎
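The counting argument behind Proposition 1 can be checked by brute force for small dimensions. This is a minimal sketch, assuming the bound $(3^p+1)/2$: it enumerates all sign patterns and keeps at most one pattern from each pair of opposite non-zero patterns, plus the zero pattern. The function name is illustrative.

```python
from itertools import product

def max_patterns_without_opposites(p):
    """Size of a largest set of patterns in {-1,0,1}^p with no two opposite non-zero patterns."""
    kept = set()
    for eta in product((-1, 0, 1), repeat=p):
        opposite = tuple(-s for s in eta)
        # Keep eta unless its opposite has already been kept.
        # The zero pattern is its own opposite and is kept once.
        if opposite not in kept:
            kept.add(eta)
    return len(kept)

for p in range(1, 6):
    print(p, max_patterns_without_opposites(p), (3**p + 1) // 2)
```

Each pair of opposite patterns contributes at most one element, which gives $1 + (3^p - 1)/2 = (3^p+1)/2$.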
In the next proposition, we present our adversarial strategy to build a pathological regularization path. Given a Lasso problem with $p$ variables and a path $\mathcal{P}$, we design an additional variable along with an extra dimension, such that the number of kinks of the new path $\tilde{\mathcal{P}}$ increases by a multiplicative factor compared to $\mathcal{P}$. We call our strategy “adversarial” since it consists of iteratively designing “pathological” variables.
Proposition 2 (Adversarial Strategy).
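Let us consider $y$ in $\mathbb{R}^n$ and $X$ in $\mathbb{R}^{n \times p}$ such that the conditions of Lemma 2 are satisfied and $y$ is in the span of $X$. We denote by $\mathcal{P}$ the regularization path of the Lasso problem corresponding to $(y, X)$, by $k$ the number of linear segments of $\mathcal{P}$, and by $\lambda_1$ the smallest value of the parameter $\lambda$ corresponding to a kink of $\mathcal{P}$. We define the vector $\tilde{y}$ in $\mathbb{R}^{n+1}$ and the matrix $\tilde{X}$ in $\mathbb{R}^{(n+1) \times (p+1)}$ as follows:
$$\tilde{y} \triangleq \begin{bmatrix} y \\ y_{n+1} \end{bmatrix}, \qquad \tilde{X} \triangleq \begin{bmatrix} X & 2\alpha y \\ 0 & \alpha y_{n+1} \end{bmatrix},$$
where $y_{n+1} \neq 0$ and $0 < \alpha < \lambda_1 / (2 y^\top y + y_{n+1}^2)$.
Then, the regularization path $\tilde{\mathcal{P}}$ of the Lasso problem associated to $(\tilde{y}, \tilde{X})$ exists and has $3k-1$ linear segments. Moreover, let us consider the sequence $\{\eta^1 = 0, \eta^2, \dots, \eta^k\}$ of sparsity patterns in $\{-1,0,1\}^p$ of $\mathcal{P}$ (the signs of the solutions $w^\star(\lambda)$), ordered from large to small values of $\lambda$. The sequence of sparsity patterns in $\{-1,0,1\}^{p+1}$ of the new path $\tilde{\mathcal{P}}$ is the following:
$$\left\{ \begin{bmatrix} \eta^1 \\ 0 \end{bmatrix}, \dots, \begin{bmatrix} \eta^k \\ 0 \end{bmatrix}, \begin{bmatrix} \eta^k \\ 1 \end{bmatrix}, \begin{bmatrix} \eta^{k-1} \\ 1 \end{bmatrix}, \dots, \begin{bmatrix} \eta^1 \\ 1 \end{bmatrix}, \begin{bmatrix} -\eta^2 \\ 1 \end{bmatrix}, \dots, \begin{bmatrix} -\eta^k \\ 1 \end{bmatrix} \right\}. \tag{4}$$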
Let us first make some remarks about this proposition:
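According to Eq. (4), the sparsity patterns of the new path $\tilde{\mathcal{P}}$ are related to those of $\mathcal{P}$. More precisely, they have either the form $[\eta^{i\top}, 0]^\top$ or $[\pm\eta^{i\top}, 1]^\top$, where $\eta^i$ is a sparsity pattern in $\{-1,0,1\}^p$ of $\mathcal{P}$.
The last column of $\tilde{X}$ involves a factor $\alpha$ that controls its norm. With $\alpha$ small enough, the $(p+1)$-th variable enters the path $\tilde{\mathcal{P}}$ late. As shown in Eq. (4), the first $k$ sparsity patterns of $\tilde{\mathcal{P}}$ do not involve this variable and are exactly the same as those of $\mathcal{P}$.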
Let us give some intuition about the pathological behavior of the path . The first kinks of are the same as those of , and after these first kinks we have . Then, the
$(p+1)$-th variable enters the path and we heuristically have
The left side of Eq. (5) tells us that when the -th variable is inactive, the coefficients associated to the first variables should be close to . At the same time, the right side of Eq. (5) tells us that when the -th variable is active, these same coefficients should be instead close to . According to Eq. (4), the signs of these coefficients along the path switch from to by following the sequence , resulting in a path with linear segments. The proof below more rigorously describes this strategy:
Existence of the new regularization path:
Let us rewrite the Lasso problem for .
with the change of variable and assuming . The solution of this problem is unique since it is a point of and we therefore have
Since the last column of is not in the span of the first columns by construction of , it is then easy to see that the conditions of Lemma 2 are necessarily satisfied and therefore is in fact the unique solution of Eq. (6). Since this is true for all , the regularization path is well defined, and we denote from now on the above solutions by and .
Maximum number of linear segments:
We now show that the number of linear segments of the path is upper-bounded by . Eq. (7) shows that has the form , where in is one of the sparsity patterns from , whereas we have three possibilities for , namely . Since one can not have two non-zero sparsity patterns that are opposite to each other on the same path, as shown in the proof of Proposition 1, the number of possible sparsity patterns reduces to .
Characterization of the first linear segments:
Let us consider and show that and by checking the optimality conditions of Lemma 1. The first equalities/inequalities in Eq. (2) are easy to verify, the last one being also satisfied:
where the last inequality is obtained from the definition of . Since this inequality is strict, this also ensures that there exists such that and for all . We have therefore shown that the first sparsity patterns of the regularization path are given in Eq. (4).
Characterization of the last segments:
We mainly use here the form of Eq. (7) and a few continuity arguments to characterize the rest of the path. First, we remark that for all in , there exists a value for such that . This is true because: (i) is continuous; (ii) ; (iii) . Point (i) was shown in Lemma 2, point (ii) in the previous paragraph, and point (iii) is necessary to have the term in Eq. (6) go to when goes to .
We now consider two values such that , and for all . On this open interval, we have that , and the continuous function ranges from to . Combining this observation with Eq. (7), we obtain that all sparsity patterns of the form for in appear on the regularization path. With similar continuity arguments, it is easy to show that all sparsity patterns of the form for in appear on the path as well.
We had previously identified $k$ of the sparsity patterns, and have now identified $2k-1$ different ones. Since we have at most $3k-1$ linear segments, the set of sparsity patterns on the path is entirely characterized. The fact that the sequence of sparsity patterns is the one given in Eq. (4) can easily be shown by reusing similar continuity arguments. ∎
With this proposition in hand, we can now state the main result of this section:
Theorem 1 (Worst-case Complexity).
In the worst case, the regularization path of the Lasso has exactly $(3^p+1)/2$ linear segments.
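We start with $n = p = 1$, and define $y = 1$ and $X = 1$, leading to a path with $2$ segments. We then recursively apply Proposition 2, keeping $n = p$, choosing at iteration $p+1$, $y_{n+1} = 1$, and a factor $\alpha_{p+1}$ satisfying the conditions of Proposition 2. Denoting by $k_p$ the number of linear segments at iteration $p$, we have that $k_{p+1} = 3k_p - 1$, and it is easy to show that $k_p = (3^p+1)/2$. According to Proposition 1, this is the longest possible regularization path. Note that this example has a particularly simple shape:
$$y = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}, \qquad X = \begin{bmatrix} \alpha_1 & 2\alpha_2 & 2\alpha_3 & \cdots & 2\alpha_p \\ 0 & \alpha_2 & 2\alpha_3 & \cdots & 2\alpha_p \\ 0 & 0 & \alpha_3 & \cdots & 2\alpha_p \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \alpha_p \end{bmatrix}, \qquad \alpha_1 = 1.$$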
3.1 Numerical Simulations
We have implemented Algorithm 1 in Matlab, optimizing numerical precision regardless of computational efficiency, which has allowed us to check our theoretical results for small values of . For instance, we obtain a path with linear segments for , and present such a pathological path in Figure 1. Note that when gets larger, these examples quickly lead to precision issues where some kinks are very close to each other. Our implementation and our pathological examples will be made publicly available. In the next section, we present more optimistic results on approximate regularization paths.
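The smallest non-trivial pathological example can be verified by hand or with a few lines of code. The sketch below is not the Matlab implementation mentioned above. It assumes that the construction of Proposition 2 appends the column $[2\alpha y;\ \alpha y_{n+1}]$ and the entry $y_{n+1}$, that the closed form of Lemma 1 holds on each segment, and that the recursion is $k_{p+1} = 3k_p - 1$. The helpers `extend_problem` and `certify`, the value $\alpha = 0.3$, and the sampled values of $\lambda$ are illustrative choices.

```python
import numpy as np

def extend_problem(X, y, alpha, y_new=1.0):
    """Add one variable and one dimension (assumed form of the adversarial construction)."""
    n, p = X.shape
    X_ext = np.zeros((n + 1, p + 1))
    X_ext[:n, :p] = X
    X_ext[:n, p] = 2.0 * alpha * y
    X_ext[n, p] = alpha * y_new
    return X_ext, np.append(y, y_new)

def certify(X, y, lam, eta, tol=1e-9):
    """True if the closed-form candidate with sign pattern eta is optimal at lam."""
    eta = np.asarray(eta, dtype=float)
    J = np.flatnonzero(eta)
    w = np.zeros(X.shape[1])
    if J.size:
        XJ = X[:, J]
        w[J] = np.linalg.solve(XJ.T @ XJ, XJ.T @ y - lam * eta[J])
    corr = X.T @ (y - X @ w)
    signs_ok = np.all(np.sign(w[J]) == eta[J])
    active_ok = np.allclose(corr[J], lam * eta[J], atol=tol)
    bounded_ok = np.all(np.abs(corr) <= lam + tol)
    return bool(signs_ok and active_ok and bounded_ok)

# p = 1: y = 1, X = 1, a single kink at lam = 1, so alpha must be below 1/3.
X1, y1 = np.array([[1.0]]), np.array([1.0])
X2, y2 = extend_problem(X1, y1, alpha=0.3)

# One value of lam inside each of the five expected segments for p = 2.
checks = [(1.5, (0, 0)), (0.9, (1, 0)), (0.7, (1, 1)), (0.3, (0, 1)), (0.05, (-1, 1))]
results = [certify(X2, y2, lam, eta) for lam, eta in checks]
print(results)

# Assumed recursion on the number of segments versus the closed form.
k = 2
for p in range(2, 8):
    k = 3 * k - 1
    assert k == (3**p + 1) // 2
```

With these values, five distinct sparsity patterns are certified for $p = 2$, which matches $(3^2+1)/2 = 5$.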
4 Approximate Homotopy
We now present another complexity analysis when exact solutions of Eq. (1) are not required. We follow in part the methodology of Giesen et al. (2010), later refined by Jaggi (2011), on approximate regularization paths of parameterized convex functions. Their results are quite general but, as we show later, we obtain stronger results with an analysis tailored to the Lasso.
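A natural tool to guarantee the quality of approximate solutions is the duality gap. Writing the Lagrangian of problem (1) and minimizing with respect to the primal variable yields the following dual formulation of (1):
$$\max_{\kappa \in \mathbb{R}^n} \; -\frac{1}{2}\kappa^\top\kappa - \kappa^\top y \quad \text{s.t.} \quad \|X^\top\kappa\|_\infty \le \lambda, \tag{8}$$
where $\kappa$ in $\mathbb{R}^n$ is a dual variable. Let us denote by $f_\lambda$ the objective function of the primal problem (1) and by $g_\lambda$ the objective function of the dual (8). Given a pair of feasible primal and dual variables $(w, \kappa)$, the difference $\delta_\lambda(w, \kappa) \triangleq f_\lambda(w) - g_\lambda(\kappa)$ is called a duality gap and provides an optimality guarantee (see Borwein & Lewis, 2006):
$$0 \le f_\lambda(w) - f_\lambda(w^\star(\lambda)) \le \delta_\lambda(w, \kappa).$$
In plain words, it upper bounds the difference between the current value of the objective function $f_\lambda(w)$ and the optimal value of the objective function $f_\lambda(w^\star(\lambda))$. In this paper, we use a relative duality gap criterion to guarantee the quality of an approximate solution. (Footnote 3: Note that our criterion is not exactly the same as in Jaggi (2011). Whereas Jaggi (2011) considers a formulation where the $\ell_1$-norm appears in a constraint, Eq. (1) involves an $\ell_1$-penalty. Even though these formulations have the same regularization path, they involve slightly different objective functions, dual formulations, and duality gaps.)
Definition 1 ($\varepsilon$-approximate Solution).
Let $\varepsilon$ be in $[0,1]$. A vector $w$ in $\mathbb{R}^p$ is said to be an $\varepsilon$-approximate solution of problem (1) if there exists $\kappa$ in $\mathbb{R}^n$ such that $\|X^\top\kappa\|_\infty \le \lambda$ and $f_\lambda(w) - g_\lambda(\kappa) \le \varepsilon f_\lambda(w)$.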
Given a set , we say that is an -approximate regularization path if any point of is an -approximate solution for problem (1).
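The relative duality gap criterion can be evaluated as follows. This is a sketch under stated assumptions: the dual objective is taken to be $-\frac{1}{2}\kappa^\top\kappa - \kappa^\top y$ under the constraint $\|X^\top\kappa\|_\infty \le \lambda$, which is one standard way of writing the Lasso dual, and the dual point is obtained by rescaling the residual so that it becomes feasible. The function names are illustrative.

```python
import numpy as np

def primal_objective(X, y, w, lam):
    r = y - X @ w
    return 0.5 * r @ r + lam * np.abs(w).sum()

def dual_objective(y, kappa):
    # Assumed dual form; valid for any kappa with ||X^T kappa||_inf <= lam.
    return -0.5 * kappa @ kappa - kappa @ y

def relative_duality_gap(X, y, w, lam):
    """Relative gap (f - g) / f for a dual point built from the residual of w."""
    kappa = X @ w - y
    scale = np.max(np.abs(X.T @ kappa))
    if scale > lam:                      # shrink kappa until it is dual feasible
        kappa = kappa * (lam / scale)
    f = primal_objective(X, y, w, lam)
    if f == 0.0:                         # f >= 0, so f == 0 means w is optimal
        return 0.0
    return (f - dual_objective(y, kappa)) / f

X = np.eye(2)
y = np.array([3.0, 0.5])
lam = 1.0
gap_exact = relative_duality_gap(X, y, np.array([2.0, 0.0]), lam)   # exact solution
gap_near = relative_duality_gap(X, y, np.array([1.9, 0.0]), lam)    # perturbed point
print(gap_exact, gap_near)
```

A vector is accepted as an approximate solution at a given tolerance when the returned value does not exceed that tolerance.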
Our goal is now to build $\varepsilon$-approximate regularization paths and study their complexity. To that effect, we introduce approximate optimality conditions based on small perturbations of those given in Lemma 1:
Definition 2 ( Condition).
Let and . A vector in satisfies the condition if and only if for all ,
Note that when , this condition reduces to the exact optimality conditions of Lemma 1. Of interest for us is the relation between Definitions 1 and 2. Let us consider a vector such that is satisfied. Then, the vector is feasible for the dual (8) and we can compute a duality gap:
From Eq. (9), it is easy to show that , and we can obtain the following bound:
From this upper bound, we derive our first result:
Proposition 3 (Approximate Analysis).
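Let $y$ be in $\mathbb{R}^n$ and $X$ in $\mathbb{R}^{n \times p}$ such that the conditions of Lemma 2 are satisfied. Let $\lambda_\infty$ be the value of $\lambda$ corresponding to the start of the path, and $\lambda_1$ be the one corresponding to the last kink. For all $\varepsilon$ in $(0,1)$, there exists an $\varepsilon$-approximate regularization path with at most $\lceil \log(\lambda_\infty/\lambda_1)/\sqrt{\varepsilon} \rceil$ linear segments.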
From Eq. (9), one can show by a simple calculation that an exact solution for a given satisfies . According to Eq. (10), there exists a dual variable such that . Thus, for any chosen in , the solution is an -approximate solution for the parameter . Between and , we can obtain an -approximate piecewise linear (in fact piecewise constant) regularization path by sampling solutions for in with . The number of segments of the corresponding approximate path is at most . ∎
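The sampling argument can be illustrated numerically. The sketch below assumes a geometric grid with ratio $1 - \sqrt{\varepsilon}$ between the start of the path and the last kink, and compares the number of grid points with $\lceil \log(\lambda_\infty/\lambda_1)/\sqrt{\varepsilon} \rceil$, which follows from $-\log(1-x) \ge x$. It also checks on a toy problem that an exact solution computed at $\lambda$ remains $\varepsilon$-approximate for $\lambda'$ in $[\lambda(1-\sqrt{\varepsilon}), \lambda]$, using the assumed dual form $-\frac{1}{2}\kappa^\top\kappa - \kappa^\top y$. All names and numbers are illustrative.

```python
import math
import numpy as np

def geometric_grid(lam_max, lam_min, eps):
    """Values lam_max * (1 - sqrt(eps))**k whose intervals cover [lam_min, lam_max]."""
    shrink = 1.0 - math.sqrt(eps)
    grid = [lam_max]
    while grid[-1] * shrink > lam_min:
        grid.append(grid[-1] * shrink)
    return grid

def relative_gap(X, y, w, lam):
    """Relative duality gap of w at parameter lam, with a rescaled residual as dual point."""
    kappa = X @ w - y
    scale = np.max(np.abs(X.T @ kappa))
    if scale > lam:
        kappa = kappa * (lam / scale)
    r = y - X @ w
    f = 0.5 * r @ r + lam * np.abs(w).sum()
    g = -0.5 * kappa @ kappa - kappa @ y
    return (f - g) / f

# Number of sampled solutions versus the logarithmic bound.
lam_max, lam_min, eps = 1000.0, 1.0, 0.01
grid = geometric_grid(lam_max, lam_min, eps)
bound = math.ceil(math.log(lam_max / lam_min) / math.sqrt(eps))
print(len(grid), bound)

# Toy check: the exact solution at lam = 1 (X = I, soft-thresholding) reused at smaller values.
X, y = np.eye(2), np.array([3.0, 0.5])
w_exact = np.array([2.0, 0.0])
eps_toy = 0.04                                   # sqrt(eps_toy) = 0.2
gaps = [relative_gap(X, y, w_exact, lam) for lam in (0.8, 0.9, 1.0)]
print(gaps)
```

The resulting path is piecewise constant: each sampled solution is reused on the interval just below the value at which it was computed.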
Note that the term $\log(\lambda_\infty/\lambda_1)$ is possibly large, but it is controlled by a logarithmic function and can be considered as constant for finite precision machines. In other words, the complexity of the approximate path is upper-bounded by $O(1/\sqrt{\varepsilon})$.
In contrast, the analyses of Giesen et al. (2010) and Jaggi (2011) give us:
an approximate path with linear segments can be obtained with a weaker approximation guarantee than ours. Namely, a bound along the path, where is a duality gap, whereas we use relative duality gaps of the form ;444When there exists such that , the relative duality gap guarantee is similar (up to a constant) to the simple bound . However, we have for the Lasso that when goes to , as long as is in the span of . Note that as noticed in footnote 3, Jaggi (2011) uses a slightly different duality gap than ours. Interestingly, this bound is proven to be optimal in the context of parameterized convex functions on the -ball. Our result show that such bound can be improved for the Lasso.
a methodology to obtain relative duality gaps along the path, which can easily provide complexity bounds for the full path of different problems, notably support vector machines, but not for the Lasso.
Proposition 3 is optimistic, but not practical, since it requires sampling exact solutions of the path $\mathcal{P}$. We introduce an approximate homotopy method in Algorithm 2, which does not require computing exact solutions and still enjoys a similar complexity. It exploits the piecewise linearity of the path, but uses a first-order method (Beck & Teboulle, 2009; Fu, 1998) when the linear segments of the path are too short.
Note that when , Algorithm 2 reduces to
Our approach exploits the following ideas, which we formally prove in the sequel. Assume that satisfies . Then,
is an -approximation for all in . This guarantees us that one can always make step sizes for greater than or equal to ;
the direction followed in Step 8 maintains , but when two kinks are too close to each other—that is, , we directly look for a solution for the parameter that satisfies . Any first-order method can be used for that purpose, e.g., a proximal gradient method (Beck & Teboulle, 2009), using the current value as a warm start.
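The first-order fallback mentioned in the second item can be any proximal gradient scheme. The following is a minimal sketch of a plain (non-accelerated) proximal gradient method with a warm start; it is not Algorithm 2 itself, the stopping rule is a fixed iteration count chosen for simplicity, and the function names are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient_lasso(X, y, lam, w0, n_iter=500):
    """Proximal gradient iterations for the Lasso, started from w0."""
    L = np.linalg.norm(X, 2) ** 2        # Lipschitz constant of the smooth part (assumes X != 0)
    w = np.array(w0, dtype=float)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)         # gradient of 0.5 * ||y - X w||^2
        w = soft_threshold(w - grad / L, lam / L)
    return w

X = np.eye(2)
y = np.array([3.0, 0.5])

# Solve at lam = 1, then reuse the result as a warm start for a smaller parameter.
w_a = proximal_gradient_lasso(X, y, 1.0, np.zeros(2))
w_b = proximal_gradient_lasso(X, y, 0.9, w_a)
print(w_a, w_b)
```

Warm starting matters in this setting because consecutive parameters on the path are close, so the previous solution is typically already near the new one.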
Note also that when $X_J^\top X_J$ is not invertible, the method uses first-order steps. The next proposition precisely describes the guarantees of our algorithm.
Proposition 4 (Analysis of Algorithm 2).
Let be in and in . For all and , Algorithm 2 returns an -approximate regularization path on . Moreover, it terminates in at most iterations, where .
We first show that any solution on the path is an -approximate solution. First, it is easy to check that is always satisfied at Step 6. This is either a consequence of Step 13, or because the direction maintains when varies between and . From Eq. (10), we obtain that is an -approximate solution whenever is satisfied. Thus, we only need to check that is also an -approximate solution for