Complexity Analysis of the Lasso Regularization Path

05/01/2012 · by Julien Mairal, et al.

The regularization path of the Lasso can be shown to be piecewise linear, making it possible to "follow" and explicitly compute the entire path. In this paper, we analyze this popular strategy and prove that its worst-case complexity is exponential in the number of variables. We then contrast this pessimistic result with an (optimistic) approximate analysis: we show that an approximate path with at most O(1/√ε) linear segments can always be obtained, where every point on the path is guaranteed to be optimal up to a relative ε-duality gap. We complete our theoretical analysis with a practical algorithm to compute these approximate paths.

1 Introduction

Without a priori knowledge about the data, it is often difficult to estimate a model or make predictions, either because the number of observations is too small, or because the problem dimension is too high. When a problem solution is known to be sparse, sparsity-inducing penalties have proven useful to improve both the quality of the prediction and its interpretability. In particular, the ℓ1-norm has been used for that purpose in the Lasso formulation (Tibshirani, 1996).

Controlling the regularization often requires tuning a parameter. In a few cases, the regularization path, that is, the set of solutions for all values of the regularization parameter, can be shown to be piecewise linear (Rosset & Zhu, 2007). This property is exploited in homotopy methods, which consist of following the piecewise linear path by computing the direction of the current linear segment and the points where the direction changes (also known as kinks). Piecewise linearity of regularization paths was discovered by Markowitz (1952) for portfolio selection; it was similarly exploited by Osborne et al. (2000) and Efron et al. (2004) for the Lasso, and by Hastie et al. (2004) for the support vector machine (SVM). As observed by Gärtner et al. (2010), all of these examples are in fact particular instances of parametric quadratic programming formulations, for which path-following algorithms appeared early in the optimization literature (Ritter, 1962).

In this paper, we study the number of linear segments of the Lasso regularization path. Even though experience with data suggests that this number is linear in the problem size (Rosset & Zhu, 2007), it is known that discrepancies can be observed between worst-case and empirical complexities. This is notably the case for the simplex algorithm (Dantzig, 1951), which performs well in practice for solving linear programs even though it suffers from exponential worst-case complexity (Klee & Minty, 1972). Similarly, by using geometrical tools originally developed to analyze the simplex algorithm, Gärtner et al. (2010) have shown that the complexity of the SVM regularization path can be exponential. However, to the best of our knowledge, none of these results apply to the Lasso regularization path, whose theoretical complexity remains unknown. The goal of our paper is to fill this gap.

Our first contribution is to show that in the worst case the number of linear segments of the Lasso regularization path is exactly (3^p + 1)/2, where p is the number of variables (predictors). We remark that our proof is constructive and significantly different from the ones proposed by Klee & Minty (1972) for the simplex algorithm and by Gärtner et al. (2010) for SVMs. Our approach does not rely on geometry but on an adversarial scheme. Given a Lasso problem with p variables, we show how to build a new problem with p + 1 variables, increasing the complexity of the path by a multiplicative factor. This results in explicit pathological examples that are surprisingly simple, unlike the pathological examples for the simplex algorithm or SVMs.

Worst-case complexity analyses are by nature pessimistic. Our second contribution, on approximate regularization paths, is more optimistic. In fact, we show that an approximate path for the Lasso with at most O(1/√ε) segments can always be obtained, where every point on the path is guaranteed to be optimal up to a relative ε-duality gap. We follow in part the methodology of Giesen et al. (2010) and Jaggi (2011), who have presented weaker results but in a more general setting for parameterized convex optimization problems. Our analysis builds upon approximate optimality conditions, which we maintain along the path, leading to a practical approximate homotopy algorithm.

The paper is organized as follows: Section 2 presents a brief overview of the Lasso. Section 3 is devoted to our worst-case complexity analysis, and Section 4 to our results on approximate regularization paths.

2 Background on the Lasso

In this section, we present the Lasso formulation of Tibshirani (1996) and well-known facts, which we exploit later in our analysis. For self-containedness and clarity, we include simple proofs of these results. Let y be a vector in R^n and X = [x_1, …, x_p] be a matrix in R^{n×p}. The Lasso is formulated as:

    min_{θ ∈ R^p}  (1/2) ||y − Xθ||²₂ + λ ||θ||₁,        (1)

where the ℓ1-norm induces sparsity in the solution θ and λ ≥ 0 controls the amount of regularization. Under a few assumptions, which are detailed in the sequel, the solution of this problem is unique. We denote it by θ*(λ) and define the regularization path P as the set of all solutions for all positive values of λ (for technical reasons, we enforce λ > 0 even though the limit of θ*(λ) when λ goes to 0 may exist).

The following lemma presents classical optimality and uniqueness conditions for the Lasso solution (see Fuchs, 2005), which are useful to characterize the path P:

Lemma 1 (Optimality Conditions of the Lasso).

A vector θ in R^p is a solution of Eq. (1) if and only if, for all i in {1, …, p},

    x_i^T (y − Xθ) = λ sign(θ_i)   if θ_i ≠ 0,      |x_i^T (y − Xθ)| ≤ λ   if θ_i = 0.        (2)

Define J ≜ {i in {1, …, p} : |x_i^T (y − Xθ)| = λ}. Assuming the matrix X_J, obtained by keeping the columns of X indexed by J, to be full rank, the solution is unique and we have

    θ_J = (X_J^T X_J)^{-1} (X_J^T y − λ η_J)   and   θ_{J^c} = 0,        (3)

where η_J ≜ sign(X_J^T (y − Xθ)) is in {−1, +1}^{|J|}, and the notation u_J for a vector u denotes the vector of size |J| recording the entries of u indexed by J.

Proof.

Eq. (2) can be obtained by considering subgradient optimality conditions. These can be written as X^T (y − Xθ) ∈ λ ∂||θ||₁, where ∂||θ||₁ denotes the subdifferential of the ℓ1-norm at θ. A classical result (Borwein & Lewis, 2006) says that the subgradients κ of the ℓ1-norm at θ are the vectors in R^p such that, for all i in {1, …, p}, κ_i = sign(θ_i) if θ_i ≠ 0, and |κ_i| ≤ 1 otherwise. This gives Eq. (2). The equalities in Eq. (2) define a linear system that has the unique solution given by Eq. (3) when X_J is full rank.

Let us now show the uniqueness of the Lasso solution. Consider another solution θ' and choose a scalar α in (0, 1). By convexity, αθ + (1 − α)θ' is also a solution; since the quadratic loss is strictly convex in Xθ, this is only possible if Xθ' = Xθ. Hence, for all i in J^c, we have |x_i^T (y − Xθ')| = |x_i^T (y − Xθ)| < λ. Combining this inequality with the conditions (2), we necessarily have θ'_{J^c} = 0 (J^c denotes the complement of the set J in {1, …, p}), and the vector θ' is also a solution of the following reduced problem:

    min_{θ_J ∈ R^{|J|}}  (1/2) ||y − X_J θ_J||²₂ + λ ||θ_J||₁.

When X_J is full rank, the Hessian X_J^T X_J is positive definite and this reduced problem is strictly convex. Thus, it admits a unique solution, and it is then easy to conclude that θ' = θ. ∎
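For concreteness, the conditions of Lemma 1 are easy to check numerically. The following sketch (in Python/NumPy; the function names and tolerances are our own and not part of the paper) verifies Eq. (2) for a candidate solution and evaluates the closed form of Eq. (3) on a given active set:

import numpy as np

def check_lasso_optimality(X, y, theta, lam, tol=1e-8):
    """Check the optimality conditions (2) for a candidate Lasso solution."""
    corr = X.T @ (y - X @ theta)                    # x_i^T (y - X theta) for every i
    active = np.abs(theta) > tol
    ok_active = np.allclose(corr[active], lam * np.sign(theta[active]), atol=tol)
    ok_inactive = np.all(np.abs(corr[~active]) <= lam + tol)
    return ok_active and ok_inactive

def closed_form_on_support(X, y, J, eta_J, lam):
    """Closed form (3): theta_J = (X_J^T X_J)^{-1} (X_J^T y - lam * eta_J), zero elsewhere."""
    XJ = X[:, J]
    theta = np.zeros(X.shape[1])
    theta[J] = np.linalg.solve(XJ.T @ XJ, XJ.T @ y - lam * eta_J)
    return theta

Such a check is useful later on: the homotopy method of Algorithm 1 maintains exactly these conditions as λ decreases.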

With the assumption that the matrix X_J is always full rank, we can formally recall a well-known property of the Lasso (see Markowitz, 1952; Osborne et al., 2000; Efron et al., 2004) in the following lemma:

Lemma 2 (Piecewise Linearity of the Path).

Assume that for any λ > 0 and any solution of Eq. (1), the matrix X_J defined in Lemma 1 is full rank. Then, the regularization path P = {θ*(λ) : λ > 0} is well defined, unique, and continuous piecewise linear.

Proof.

The existence and uniqueness of the solutions θ*(λ) were shown in Lemma 1.

Let us define η(λ) ≜ sign(θ*(λ)) in {−1, 0, 1}^p, the sparsity pattern of the solution. Let us now consider λ1 > λ2 > 0 such that η(λ1) = η(λ2). For all α in [0, 1], it is easy to see that the solution αθ*(λ1) + (1 − α)θ*(λ2) satisfies the optimality conditions of Lemma 1 for the parameter αλ1 + (1 − α)λ2, and that its sparsity pattern is η(λ1).

This shows that whenever two solutions θ*(λ1) and θ*(λ2) have the same signs, the regularization path between λ1 and λ2 is a linear segment. As an important consequence, the number of linear segments of the path is smaller than 3^p, the number of possible sparsity patterns in {−1, 0, 1}^p. The path P is therefore piecewise linear with a finite number of kinks.

Moreover, since the function λ ↦ θ*(λ) is piecewise linear, it is piecewise continuous and has right and left limits at every λ > 0. It is easy to show that these limits satisfy the optimality conditions of Eq. (2). By uniqueness of the Lasso solution, they are equal to θ*(λ), and the function is in fact continuous. ∎
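A direct consequence of Lemma 2 that is convenient in practice: between two kinks sharing the same sign pattern, the exact solution at any intermediate λ is a convex combination of the two kink solutions. A minimal sketch, assuming (lam1, theta1) and (lam2, theta2) are two such recorded kinks (the naming is our own):

import numpy as np

def interpolate_on_segment(lam, lam1, theta1, lam2, theta2):
    """Exact Lasso solution for lam in [lam2, lam1] when sign(theta1) == sign(theta2)
    (piecewise linearity, Lemma 2); assumes lam1 > lam2."""
    assert lam2 <= lam <= lam1 and np.array_equal(np.sign(theta1), np.sign(theta2))
    alpha = (lam - lam2) / (lam1 - lam2)       # convex-combination weight
    return alpha * theta1 + (1.0 - alpha) * theta2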

Assuming again that X_J is always full rank, we can now present in Algorithm 1 the homotopy method (Osborne et al., 2000; Efron et al., 2004).

1:  Inputs: a vector y in R^n; a matrix X in R^{n×p};
2:  initialization: set λ to ||X^T y||_∞; we have θ*(λ) = 0 (trivial solution);
3:  set J such that |x_i^T y| = λ for i in J;
4:  while λ > 0 do
5:     set η_J to sign(X_J^T (y − Xθ*(λ)));
6:     compute the direction of the path: d_J = (X_J^T X_J)^{-1} η_J and d_{J^c} = 0;
7:     find the smallest step τ > 0 such that one of the following events occurs: there exists i in J^c such that |x_i^T (y − X(θ*(λ) + τd))| = λ − τ (add i to J); or there exists i in J such that θ*_i(λ) ≠ 0 and θ*_i(λ) + τ d_i = 0 (remove i from J);
8:     replace λ by λ − τ and θ*(λ) by θ*(λ) + τd; record the pair (λ, θ*(λ));
9:  end while
10:  Return: sequence of recorded values (λ, θ*(λ)).
Algorithm 1 Homotopy Algorithm for the Lasso.
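The following is a minimal NumPy sketch of Algorithm 1, mainly to make Step 7 (the computation of the next kink) concrete. The function name, the tolerances, and the stopping value lam_min are our own choices and not part of the paper; ties and rank deficiencies are not handled.

import numpy as np

def lasso_homotopy(X, y, lam_min=1e-6):
    """Follow the piecewise-linear Lasso path from lam = ||X^T y||_inf down to lam_min,
    recording (lam, theta) at every kink (a sketch of Algorithm 1)."""
    _, p = X.shape
    corr = X.T @ y
    lam = float(np.max(np.abs(corr)))
    J = [int(np.argmax(np.abs(corr)))]             # active set
    theta = np.zeros(p)
    path = [(lam, theta.copy())]

    while lam > lam_min and J:
        XJ = X[:, J]
        eta = np.sign(XJ.T @ (y - X @ theta))      # signs of the active correlations
        dJ = np.linalg.solve(XJ.T @ XJ, eta)       # direction of the segment (from Eq. (3))
        a = X.T @ (y - X @ theta)                  # current correlations
        b = X.T @ (XJ @ dJ)                        # their rate of change along the segment

        # Step 7: smallest positive step tau at which a variable enters or leaves.
        tau, event = lam - lam_min, ("stop", None)
        for k, i in enumerate(J):                  # an active coefficient hits zero
            if abs(dJ[k]) > 1e-12:
                t = -theta[i] / dJ[k]
                if 1e-10 < t < tau:
                    tau, event = t, ("remove", i)
        for i in range(p):                         # an inactive variable reaches the boundary
            if i in J:
                continue
            for s in (+1.0, -1.0):                 # solve |a_i - tau*b_i| = lam - tau
                if abs(s - b[i]) > 1e-12:
                    t = (s * lam - a[i]) / (s - b[i])
                    if 1e-10 < t < tau:
                        tau, event = t, ("add", i)

        theta[J] += tau * dJ                       # move along the linear segment
        lam -= tau
        if event[0] == "add":
            J.append(event[1])
        elif event[0] == "remove":
            theta[event[1]] = 0.0
            J.remove(event[1])
        path.append((lam, theta.copy()))
    return path

On typical random designs, the number of recorded kinks grows roughly linearly with p, in line with the empirical observation of Rosset & Zhu (2007); the examples of Section 3 are designed precisely to defeat this behavior.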

It can be shown that this algorithm maintains the optimality conditions of Lemma 1 as λ decreases. Two assumptions nevertheless have to be made for the algorithm to be correct. First, X_J^T X_J has to be invertible, which is a reasonable assumption commonly made when working with real data and when one is interested in sparse solutions. When X_J^T X_J becomes ill-conditioned, which may typically occur for small values of λ, the algorithm has to stop and the path is truncated. Second, one assumes in Step 7 of the algorithm that the value τ corresponds to a single event: either a correlation reaches the boundary for some i in J^c, or a coefficient hits zero for some i in J. In other words, variables enter or exit the path one at a time. Even though this assumption is reasonable most of the time, it can be problematic from a numerical point of view in rare cases. When the length of a linear segment of P is smaller than the numerical precision, the algorithm can fail. In contrast, our approximate homotopy algorithm presented in Section 4 is robust to this issue. In the next section, we present our worst-case complexity analysis of the regularization path, showing that Algorithm 1 can have exponential complexity.

3 Worst-Case Complexity

We denote by S the set of sparsity patterns in {−1, 0, 1}^p encountered along the path P. We have seen in the proof of Lemma 2 that whenever sign(θ*(λ1)) = sign(θ*(λ2)) for λ1 > λ2 > 0, the sparsity pattern is the same for all λ in [λ2, λ1], and thus the number of linear segments of P is upper-bounded by |S| ≤ 3^p. With an additional argument, we can further reduce this number, as stated in the following proposition:

Proposition 1 (Upper-bound Complexity).

Let us assume the same conditions as in Lemma 2. The number of linear segments in the regularization path of the Lasso is at most (3^p + 1)/2.

Proof.

We have already noticed that the number of linear segments of the path is at most 3^p. Let us consider η(λ) ≜ sign(θ*(λ)) for λ > 0. We now show that for all λ1, λ2 > 0 we cannot have η(λ1) = −η(λ2) ≠ 0, and therefore the number of different sparsity patterns on the path P is in fact less than or equal to (3^p + 1)/2 (the all-zero pattern, plus at most one pattern from each of the (3^p − 1)/2 pairs of opposite non-zero patterns).

Let us assume that there exist λ1 > λ2 > 0 with η(λ1) = −η(λ2) ≠ 0, and look for a contradiction. We define the set J ≜ {i : η_i(λ1) ≠ 0}, and consider the solution of the reduced problem for all λ > 0:

    w*(λ) ≜ argmin_{w ∈ R^{|J|}}  (1/2) ||y − X_J w||²₂ + λ ||w||₁,

which is well defined since the optimization problem is strictly convex (the conditions of Lemma 2 imply that X_J is full rank). We remark that w*(λ1) = θ*_J(λ1) and w*(λ2) = θ*_J(λ2), and that all their entries are non-zero. Given the optimality conditions of Lemma 1, it is then easy to show that X_J^T (y − X_J w*(λ1)) = λ1 η_J and X_J^T (y − X_J w*(λ2)) = −λ2 η_J, where η_J ≜ η_J(λ1). Since the signs of w*(λ1) and w*(λ2) are opposite to each other and non-zero, we have η_J^T (w*(λ2) − w*(λ1)) = −||w*(λ2)||₁ − ||w*(λ1)||₁ < 0. Independently, subtracting the two optimality conditions and taking the inner product with w*(λ2) − w*(λ1) shows that this quantity should be non-negative, and we obtain a contradiction. ∎

In the next proposition, we present our adversarial strategy to build a pathological regularization path. Given a Lasso problem with p variables and a path P, we design an additional variable along with an extra dimension, such that the number of kinks of the new path P̃ increases by a multiplicative factor compared to P. We call our strategy "adversarial" since it consists of iteratively designing "pathological" variables.

Proposition 2 (Adversarial Strategy).

Let us consider y in R^n and X in R^{n×p} such that the conditions of Lemma 2 are satisfied and y is in the span of the columns of X. We denote by P the regularization path of the Lasso problem corresponding to (y, X), by k the number of linear segments of P, and by λ_1 the smallest value of the parameter λ corresponding to a kink of P. We define the vector ỹ in R^{n+1} and the matrix X̃ in R^{(n+1)×(p+1)} by appending one row and one column to (y, X), where the last column of X̃ involves a scaling factor α > 0 chosen sufficiently small.

Then, the regularization path P̃ of the Lasso problem associated to (ỹ, X̃) exists and has 3k − 1 linear segments. Moreover, let us consider η_1, …, η_k, the sequence of sparsity patterns in {−1, 0, 1}^p of P (the signs of the solutions θ*(λ)), ordered from large to small values of λ. The sequence of sparsity patterns in {−1, 0, 1}^{p+1} of the new path P̃ is the following:

    (η_1, 0), …, (η_k, 0), (η_k, 1), (η_{k−1}, 1), …, (η_1, 1), (−η_2, 1), …, (−η_k, 1).        (4)

Let us first make some remarks about this proposition:
- According to Eq. (4), the sparsity patterns of the new path P̃ are related to those of P. More precisely, they have either the form (η_i, 0) or the form (±η_i, 1), where η_i is a sparsity pattern in {−1, 0, 1}^p of P.
- The last column of X̃ involves a factor α that controls its norm. With α small enough, the (p+1)-th variable enters the path P̃ late. As shown in Eq. (4), the first k sparsity patterns of P̃ do not involve this variable and are exactly the same as those of P.
- Let us give some intuition about the pathological behavior of the path P̃. The first k kinks of P̃ are the same as those of P, and after these first k kinks we have λ ≤ λ_1. Then, the (p+1)-th variable enters the path and we heuristically have

    θ̃*_{1…p}(λ) ≈ θ*(λ)   if θ̃*_{p+1}(λ) = 0,      θ̃*_{1…p}(λ) ≈ −θ*(λ)   if θ̃*_{p+1}(λ) ≠ 0.        (5)

The left side of Eq. (5) tells us that when the (p+1)-th variable is inactive, the coefficients associated to the first p variables should be close to the original solutions θ*. At the same time, the right side of Eq. (5) tells us that when the (p+1)-th variable is active, these same p coefficients should instead be close to the opposite solutions −θ*. According to Eq. (4), the signs of these coefficients along the path switch from η_k to −η_k by following the sequence η_k, η_{k−1}, …, η_1, −η_2, …, −η_k, resulting in a path with 3k − 1 linear segments. The proof below describes this strategy more rigorously:

Proof.

Existence of the new regularization path:
Let us rewrite the Lasso problem for the pair (ỹ, X̃):

    min_{θ̃ ∈ R^{p+1}}  (1/2) ||ỹ − X̃θ̃||²₂ + λ ||θ̃||₁.        (6)

Let θ̃ be a solution for a given λ > 0. By fixing the last entry θ̃_{p+1} in Eq. (6) and optimizing with respect to the first p entries, we obtain a problem equivalent to (6), which, after an appropriate change of variable, is a Lasso problem on the original pair (y, X). The solution of this reduced problem is unique since it is a point of the path P, and we therefore have

(7)

Since the last column of X̃ is not in the span of the first p columns by construction, it is then easy to see that the conditions of Lemma 2 are necessarily satisfied, and therefore θ̃ is in fact the unique solution of Eq. (6). Since this is true for all λ > 0, the regularization path P̃ is well defined, and we denote from now on the above solutions by θ̃*(λ).

Maximum number of linear segments:
We now show that the number of linear segments of the path P̃ is upper-bounded by 3k − 1. Eq. (7) shows that the sparsity pattern of θ̃*(λ) has the form (±η_i, s), where η_i in {−1, 0, 1}^p is one of the k sparsity patterns from P, whereas we have three possibilities for the sign s of the last entry, namely {−1, 0, 1}. Since one cannot have two non-zero sparsity patterns that are opposite to each other on the same path, as shown in the proof of Proposition 1, the number of possible sparsity patterns reduces to 3k − 1.

Characterization of the first k linear segments:
Let us consider λ ≥ λ_1 and show that θ̃*(λ) = (θ*(λ), 0) by checking the optimality conditions of Lemma 1. The first p equalities and inequalities in Eq. (2) are easy to verify since they coincide with those of the original problem; the last one, corresponding to the (p+1)-th variable, is also satisfied:

    |x̃_{p+1}^T (ỹ − X̃ θ̃*(λ))| < λ,

where the last (strict) inequality is obtained from the definition of α. Since this inequality is strict, this also ensures that there exists λ' < λ_1 such that the (p+1)-th variable remains inactive and θ̃*(λ) = (θ*(λ), 0) for all λ ≥ λ'. We have therefore shown that the first k sparsity patterns of the regularization path P̃ are those given in Eq. (4).

Characterization of the last 2k − 1 segments:
We mainly use here the form of Eq. (7) and a few continuity arguments to characterize the rest of the path. First, we remark that there exists a value of λ at which the (p+1)-th variable becomes active. This is true because: (i) the path λ ↦ θ̃*(λ) is continuous; (ii) the (p+1)-th variable is inactive for λ ≥ λ_1; (iii) it is active for small enough values of λ. Point (i) was shown in Lemma 2, point (ii) in the previous paragraph, and point (iii) is necessary to have the data-fitting term in Eq. (6) go to 0 when λ goes to 0.

We now consider two values λ' > λ'' > 0 such that the (p+1)-th variable is active with a constant sign for all λ in (λ'', λ'). On this open interval, Eq. (7) applies with a fixed sign for the last variable, and the point of the original path appearing in Eq. (7) moves continuously along P. Combining this observation with Eq. (7), we obtain that all sparsity patterns of the form (η_i, 1) for i in {1, …, k} appear on the regularization path. With similar continuity arguments, it is easy to show that all sparsity patterns of the form (−η_i, 1) appear on the path as well.

We had previously identified k of the sparsity patterns, and have now identified 2k − 1 different new ones. Since we have at most 3k − 1 linear segments, the set of sparsity patterns on the path P̃ is entirely characterized. The fact that the sequence of sparsity patterns is the one given in Eq. (4) can easily be shown by reusing similar continuity arguments. ∎

With this proposition in hand, we can now state the main result of this section:

Theorem 1 (Worst-case Complexity).

In the worst case, the regularization path of the Lasso has exactly (3^p + 1)/2 linear segments.

Proof.

We start with p = 1, choosing y and X = x such that y is a non-zero multiple of x, which leads to a path with 2 segments. We then recursively apply Proposition 2, choosing at each iteration a scaling factor α small enough to satisfy its conditions. Denoting by S_k the number of linear segments obtained at iteration k (that is, with k variables), we have S_{k+1} = 3S_k − 1 with S_1 = 2, and it is easy to show that S_p = (3^p + 1)/2. According to Proposition 1, this is the longest possible regularization path. Note that the resulting pathological examples have a particularly simple, explicit form. ∎
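For completeness, the recursion can be unrolled explicitly; this is a short worked calculation assuming the recursion S_{k+1} = 3S_k − 1 and the base case S_1 = 2 stated above:

    S_{k+1} − 1/2 = 3 (S_k − 1/2)   ⇒   S_p − 1/2 = 3^{p−1} (S_1 − 1/2) = 3^p / 2   ⇒   S_p = (3^p + 1)/2.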

3.1 Numerical Simulations

We have implemented Algorithm 1 in Matlab, optimizing for numerical precision regardless of computational efficiency, which has allowed us to check our theoretical results for small values of p. For instance, we obtain a path with exactly (3^p + 1)/2 linear segments, and present such a pathological path in Figure 1. Note that when p gets larger, these examples quickly lead to precision issues where some kinks are very close to each other. Our implementation and our pathological examples will be made publicly available. In the next section, we present more optimistic results on approximate regularization paths.

Figure 1: Pathological regularization path with p variables and the worst-case number of kinks. The curves represent the values of the coefficients at every kink of the path. For visibility purposes, we use a non-linear scale and report transformed values of the coefficients. Best seen in color.

4 Approximate Homotopy

We now present another complexity analysis when exact solutions of Eq. (1) are not required. We follow in part the methodology of Giesen et al. (2010), later refined by Jaggi (2011), on approximate regularization paths of parameterized convex functions. Their results are quite general but, as we show later, we obtain stronger results with an analysis tailored to the Lasso.

A natural tool to guarantee the quality of approximate solutions is the duality gap. Writing the Lagrangian of problem (1) and minimizing with respect to the primal variable θ yields the following dual formulation of (1):

    max_{κ ∈ R^n}  κ^T y − (1/2) ||κ||²₂   such that   ||X^T κ||_∞ ≤ λ,        (8)

where κ in R^n is a dual variable. Let us denote by f the objective function of the primal problem (1) and by g the objective function of the dual (8). Given a pair of feasible primal and dual variables (θ, κ), the difference f(θ) − g(κ) is called a duality gap and provides an optimality guarantee (see Borwein & Lewis, 2006):

    f(θ) − f(θ*(λ))  ≤  f(θ) − g(κ).

In plain words, it upper bounds the difference between the current value of the objective function f(θ) and the optimal value of the objective function f(θ*(λ)). In this paper, we use a relative duality gap criterion to guarantee the quality of an approximate solution. (Note that our criterion is not exactly the same as the one of Jaggi (2011): whereas Jaggi (2011) considers a formulation where the ℓ1-norm appears in a constraint, Eq. (1) involves an ℓ1-penalty. Even though these formulations have the same regularization path, they involve slightly different objective functions, dual formulations, and duality gaps.)

Definition 1 (ε-Approximate Solution).

Let ε be a non-negative scalar. A vector θ in R^p is said to be an ε-approximate solution of problem (1) for the parameter λ if there exists a dual-feasible κ in R^n, that is, ||X^T κ||_∞ ≤ λ, such that the relative duality gap between f(θ) and g(κ) is at most ε.

Given a set of parameters Λ ⊆ (0, +∞), we say that a function λ ↦ θ(λ) defined on Λ is an ε-approximate regularization path if every point θ(λ) is an ε-approximate solution of problem (1) for the parameter λ.
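To make Definition 1 concrete, the sketch below evaluates a relative duality gap for a candidate θ, using the classical feasible dual point obtained by rescaling the residual. The normalization of the gap (here by the dual value) and the function names are our own choices; Definition 1 fixes the precise criterion used in the paper.

import numpy as np

def relative_duality_gap(X, y, theta, lam):
    """Duality gap for the Lasso (1), normalized here by the dual value,
    using the rescaled residual as a feasible dual point for (8)."""
    residual = y - X @ theta
    primal = 0.5 * residual @ residual + lam * np.abs(theta).sum()
    corr = np.max(np.abs(X.T @ residual))
    kappa = residual if corr <= lam else (lam / corr) * residual   # enforce ||X^T kappa||_inf <= lam
    dual = kappa @ y - 0.5 * kappa @ kappa                         # dual objective of Eq. (8)
    return (primal - dual) / dual

def is_eps_approximate(X, y, theta, lam, eps):
    """Check an epsilon-approximation criterion in the spirit of Definition 1."""
    return relative_duality_gap(X, y, theta, lam) <= eps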

Our goal is now to build ε-approximate regularization paths and study their complexity. To that end, we introduce approximate optimality conditions based on small perturbations of those given in Lemma 1:

Definition 2 (Approximate Optimality Conditions).

Let λ > 0 and ε ≥ 0. A vector θ in R^p satisfies the approximate optimality conditions for (λ, ε) if and only if, for all i in {1, …, p},

(9)

Note that when ε = 0, this condition reduces to the exact optimality conditions of Lemma 1. Of interest to us is the relation between Definitions 1 and 2. Let us consider a vector θ satisfying the approximate optimality conditions for (λ, ε). Then, an appropriately rescaled residual κ ∝ (y − Xθ) is feasible for the dual (8) and we can compute the corresponding duality gap.

From Eq. (9), it is easy to show that ||X^T (y − Xθ)||_∞ ≤ (1 + ε) λ, and we can obtain the following bound:

(10)

From this upper bound, we derive our first result:

Proposition 3 (Approximate Analysis).

Let y be in R^n and X in R^{n×p} such that the conditions of Lemma 2 are satisfied. Let λ_max be the value of λ corresponding to the start of the path, and λ_1 be the one corresponding to the last kink. For all ε > 0, there exists an ε-approximate regularization path with at most O(log(λ_max/λ_1)/√ε) linear segments.

Proof.

From Eq. (9), one can show by a simple calculation that an exact solution θ*(λ) for a given λ satisfies the approximate optimality conditions for every parameter λ' in an interval [ρ(ε)λ, λ], where 1 − ρ(ε) is of the order of √ε. According to Eq. (10), there then exists a dual variable κ such that the corresponding relative duality gap is at most ε. Thus, for any λ' chosen in this interval, the solution θ*(λ) is an ε-approximate solution for the parameter λ'. Between λ_max and λ_1, we can therefore obtain an ε-approximate piecewise linear (in fact piecewise constant) regularization path by sampling exact solutions θ*(λ_j) at the geometrically decreasing values λ_j ≜ ρ(ε)^j λ_max. The number of segments of the corresponding approximate path is at most of the order of log(λ_max/λ_1)/√ε. ∎

Note that the term log(λ_max/λ_1) is possibly large, but it is controlled by a logarithmic function and can be considered as a constant for finite-precision machines. In other words, the complexity of the approximate path is upper-bounded by O(1/√ε). In contrast, the analysis of Giesen et al. (2010) and Jaggi (2011) gives us:
- an approximate path with O(1/ε) linear segments, obtained with a weaker approximation guarantee than ours, namely an absolute bound on the duality gap along the path, whereas we use relative duality gaps. (When the optimal value of (1) is bounded away from zero, the relative guarantee is similar, up to a constant, to the absolute one; however, for the Lasso, the optimal value goes to 0 when λ goes to 0, as long as y is in the span of X. Note also that, as mentioned above, Jaggi (2011) uses a slightly different duality gap than ours.) Interestingly, the O(1/ε) bound is proven to be optimal in the context of parameterized convex functions over the ℓ1-ball; our result shows that such a bound can be improved for the Lasso;
- a methodology to obtain relative duality gaps along the path, which can easily provide complexity bounds for the full path of different problems, notably support vector machines, but not for the Lasso.

Proposition 3 is optimistic but not practical, since it requires sampling exact solutions of the path P. We introduce in Algorithm 2 an approximate homotopy method which does not require computing exact solutions and still enjoys a similar complexity. It exploits the piecewise linearity of the path, but uses a first-order method (Beck & Teboulle, 2009; Fu, 1998) when the linear segments of the path are too short.

1:  Inputs: a vector y in R^n; a matrix X in R^{n×p}; the required precision ε; a target value λ_min;
2:  initialization: set λ to ||X^T y||_∞; set θ to 0;
3:  set the guaranteed step size according to ε (of the order of √ε λ);
4:  set J such that |x_i^T y| = λ for i in J;
5:  while λ > λ_min do
6:     if X_J^T X_J is not invertible then go to 12;
7:     set η_J to the signs of the active correlations;
8:     compute the approximate direction of the path, as in Algorithm 1:
find the smallest step τ > 0 such that either there exists i in J^c whose correlation reaches the boundary (add i to J), or there exists i in J whose coefficient hits zero (remove i from J);
9:     if τ is at least the guaranteed step size then
10:        replace λ by λ − τ and update θ accordingly;
11:    else
12:        replace λ by λ minus the guaranteed step size;
13:        use a first-order optimization method to find a solution θ satisfying the approximate optimality conditions of Definition 2 for the new value of λ;
14:        set J to the set of active variables of θ.
15:    end if
16:    record the pair (λ, θ);
17:  end while
18:  Return: sequence of recorded values (λ, θ).
Algorithm 2 Approximate Homotopy for the Lasso.

Note that when ε = 0, Algorithm 2 reduces to Algorithm 1. Our approach exploits the following ideas, which we formally prove in the sequel. Assume that θ satisfies the approximate optimality conditions of Definition 2 for (λ, ε). Then,
- θ is an ε-approximate solution for all λ' in an interval [λ − Δ, λ], where Δ is the guaranteed step size, of the order of √ε λ. This guarantees that one can always make step sizes for λ greater than or equal to Δ;
- the direction followed in Step 8 maintains the approximate optimality conditions, but when two kinks are too close to each other, that is, when the exact step τ is smaller than Δ, we directly look for a solution for the parameter λ − Δ that satisfies the approximate optimality conditions. Any first-order method can be used for that purpose, e.g., a proximal gradient method (Beck & Teboulle, 2009), using the current value θ as a warm start (see the sketch below).
Note also that when X_J^T X_J is not invertible, the method uses first-order steps. The next proposition precisely describes the guarantees of our algorithm.
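As a concrete instance of the first-order method invoked in Step 13, here is a minimal proximal-gradient (ISTA-style) routine, warm-started at the current iterate and stopped with a relative duality gap in the spirit of Definition 1; the step size rule, the stopping test, and the names are our own assumptions, not the paper's.

import numpy as np

def ista_lasso(X, y, lam, theta0, eps, max_iter=10_000):
    """Proximal gradient descent on 0.5*||y - X theta||^2 + lam*||theta||_1,
    warm-started at theta0 (e.g., the current homotopy iterate)."""
    L = np.linalg.norm(X, 2) ** 2                       # Lipschitz constant of the gradient
    theta = theta0.copy()
    for _ in range(max_iter):
        z = theta - X.T @ (X @ theta - y) / L           # gradient step
        theta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-thresholding
        # Stop once a relative duality gap (rescaled residual as dual point) is below eps.
        r = y - X @ theta
        primal = 0.5 * r @ r + lam * np.abs(theta).sum()
        corr = np.max(np.abs(X.T @ r))
        kappa = r if corr <= lam else (lam / corr) * r
        dual = kappa @ y - 0.5 * kappa @ kappa
        if primal - dual <= eps * dual:
            break
    return theta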

Proposition 4 (Analysis of Algorithm 2).

Let y be in R^n and X in R^{n×p}. For all ε > 0 and λ_min > 0, Algorithm 2 returns an ε-approximate regularization path on [λ_min, ||X^T y||_∞]. Moreover, it terminates in at most O(log(λ_max/λ_min)/√ε) iterations, where λ_max ≜ ||X^T y||_∞.

Proof.

We first show that any solution on the returned path is an ε-approximate solution. First, it is easy to check that the approximate optimality conditions are always satisfied at Step 6: this is either a consequence of Step 13, or because the direction computed in Step 8 maintains them when λ varies along the current segment. From Eq. (10), we obtain that θ is an ε-approximate solution whenever the conditions are satisfied. Thus, we only need to check that θ is also an ε-approximate solution for λ' in