in signal processing, sparsity inducing penalties have had a tremendous impact on machine learning(Bach et al., 2012)2007), Group Lasso (Yuan and Lin, 2006), Sparse Group Lasso (Simon et al., 2013; Ndiaye et al., 2016), multitask Lasso (Obozinski et al., 2010), Square-Root Lasso (Belloni et al., 2011). All of these estimators fall under the framework of Generalized Linear Models (McCullagh and Nelder, 1989), where the output is assumed to follow an exponential family distribution whose mean is a linear combination of the input variables. The key property of
regularization is that it allows to jointly perform feature selection and prediction, which is particularly useful in high dimensional settings. Indeed, it can drastically reduce the number of variables needed for prediction, thus improving model interpretability and computation time for prediction. Amongst the algorithms proposed to solve these, coordinate descent111throughout the paper, this means cyclic and proximal coordinate descent unless specified otherwise (Tseng, 2001; Friedman et al., 2007) is the most popular in machine learning scenarios (Fan et al., 2008; Friedman et al., 2010; Richtárik and Takáč, 2014; Fercoq and Richtárik, 2015; Perekrestenko et al., 2017; Karimireddy et al., 2018). It consists in updating the vector of parameters one coefficient at a time, looping over all the predictors until convergence.
Since only a fraction of the coefficients are non-zero in the optimal parameter vector, a recurring idea to speed up solvers is to limit the size of the optimization problem by ignoring features which are not included in the solution. To do so, two approaches can be distinguished:
One common idea between the current state-of-art methods for screening (Gap Safe rules Fercoq et al. 2015; Ndiaye et al. 2017) and working sets (, Johnson and Guestrin 2015, 2018) is to rely heavily on a dual point to identify useful features. The quality of such a dual point for the dual problem is critical here as it has a direct impact on performance. However, although a lot of attention has been devoted to creating a sequence of primal iterates that converge fast to the optimum (Fercoq and Richtárik, 2015), the construction of dual iterates has not been scrutinized, and the standard approach to obtain dual iterates from primal ones (Mairal, 2010), although converging, is crude.
In this paper, we propose a principled way to construct a sequence of dual points that converges faster than the standard approach proposed by Mairal (2010). Based on an extrapolation procedure inspired by Scieur et al. (2016), it comes with no significant extra computational costs, while retaining convergence guarantees of the standard approach. This construction was first introduced for non smooth optimization by Massias et al. (2018) for the Lasso case only, while we generalize it here to any Generalized Linear Model (GLM). More precisely, we properly define, quantify and prove the asymptotic Vector AutoRegressive (VAR) behavior of dual iterates for sparse GLMs solved with proximal gradient descent or cyclic coordinate descent. The resulting new construction:
provides a tighter control of optimality through duality gap evaluation,
improves the performance of Gap safe rules,
improves the working set policy proposed in Massias et al. (2017), thanks to better feature identification,
is easy to implement and combine with other solvers.
The article proceeds as follows. We introduce the framework of -regularized GLMs and duality in Section 2. As a seminal example, we present our results on dual iterates regularity and dual extrapolation for the Lasso in Section 3. We generalize it to a variety of problems in Sections 5 and 4. Results of Section 6 demonstrate a systematic improvement in computing time when dual extrapolation is used together with Gap Safe rules or working set policies.
For any integer , we denote by the set . The design matrix is composed of observations stored row-wise, and whose -th column is ; the vector (resp. ) is the response vector for regression (resp. binary classification). The support of is , of cardinality . For , and are and restricted to features in . As much as possible, exponents between parenthesis () denote iterates and subscripts () denote vector entries or matrix columns. The sign function is with the convention
. The sigmoid function is. The soft-thresholding of at level is . Applied to vectors, , and (for ) act element-wise. Element-wise product between vectors of same length is denoted by . The vector of size whose entries are all equal to 0 (resp. 1) is denoted by (resp. ). On square matrices, is the spectral norm (and the standard Euclidean norm for vectors read ), is the -norm. For a symmetric positive definite matrix , is the -weighted inner product, whose associated norm is denoted . We extend the small- notation to vector valued functions in the following way: for and , if and only if , tends to 0 when tends to 0. For a convex and proper function , its Fenchel-Legendre conjugate is defined by , and its subdifferential at is .
2 GLMs, Vector AutoRegressive sequences and sign identification
We consider the following optimization problem: [Sparse Generalized Linear Model] ∈_β∈^p ⏟∑_i = 1^n f_i( β^⊤x_i) + λβ_1_(β) , where all are convex222by that we mean close, convex and proper following the framework in Bauschke and Combettes (2011)., differentiable functions with -Lipschitz gradients. The parameter is a non-negative scalar, controlling the trade-off between data fidelity and regularization.
Two popular instances of Section 2 are the Lasso (, ) and Sparse Logistic regression (, ).
Note that we could use a more complex regularizer in Section 2, to handle group penalties for example. For the sake of clarity we rather remain specific, and generalize to other penalties when needed in Section 4.2.
[Duality for sparse GLMs] A dual formulation of Section 2 reads:
where . The dual solution is unique because the ’s are -strongly convex (Hiriart-Urruty and Lemaréchal, 1993, Thm 4.2.1 p. 82) and the KKT conditions read:
If for we write , the link equation reads .
For any , one has , and . The duality gap can thus be used as an upper bound for the sub-optimality of a primal vector : for any , any , and any feasible :
These results holds because Slater’s condition is met: Section 2 is unconstrained and the objective function has domain (Boyd and Vandenberghe, 2004, §5.2.3), therefore strong duality holds. The latter shows that even though is unknown in practice and the sub-optimality gap cannot be evaluated, creating a dual feasible point allows to construct an upper bound which can be used as a tractable stopping criterion.
In high dimension, solvers such as proximal gradient descent (PG) and coordinate descent (CD) are slowed down due to the large number of features. However, by design of the penalty, is expected to be sparse, especially for large values of . Thus, a key idea to speed up these solvers is to identify the support of so that features outside of it can be safely ignored. Doing this leads to a smaller problem that is faster to solve. Removing features when it is guaranteed that they are not in the support of the solution is at the heart of the so-called Gap Safe Screening rules (Fercoq et al., 2015; Ndiaye et al., 2017):
Therefore, while running an iterative solver and computing the duality gap at iteration , the criterion (5) can be tested for all features , and the features guaranteed to be inactive at optimum can be ignored.
Equations 5 and 4 do not require a specific choice of , provided it is in . It is up to the user and so far it has not attracted much attention in the literature. Thanks to the link equation , a natural way to construct a dual feasible point at iteration , when only a primal vector is available, is:
This was coined residuals rescaling (Mairal, 2010) following the terminology used for the Lasso case where is equal to the residuals, .
To improve the control of sub-optimality, and to better identify useful features, the aim of our proposed dual extrapolation is to obtain a better dual point (closer to the optimum ). The idea is to do it at a low computational cost by exploiting the structure of the sequence of dual iterates ; we explain what is meant by “structure”, and how to exploit it, in the following: [Vector AutoRegressive sequence] We say that is a Vector AutoRegressive (VAR) sequence (of order 1) if there exists and such that for :
We also say that the sequence , converging to , is an asymptotic VAR sequence if there exist and such that for :
[Extrapolation for VAR sequences (Scieur, 2018, Thm 3.2.2)] Let be a VAR sequence in , satisfying with a symmetric positive definite matrix such that , and . Assume that for , the family is linearly independent and define
where . The justification for this extrapolation procedure is the following: since , converges, say to . For , we have . Let be the coefficients of ’s characteristic polynomial. By Cayley-Hamilton’s theorem, . Given that ,
is not an eigenvalue ofand , so we can normalize these coefficients to have . For , we have:
Therefore, it is natural to seek to approximate as an affine combination of the last iterates . Using iterates might be costly for large , so one might rather consider only a smaller number , find such that approximates . Since is a fixed point of , should be one too. Under the normalizing condition , this means that
should be as close to as possible; this leads to solving:
which admits a closed-form solution:
where . In practice, the next proposition shows that when does not have full column rank, it is theoretically sound to use a lower value for the number of extrapolation coefficients : If is not invertible, then .
Let be such that , with (the proof is similar if , etc.). Then and, setting , . Since , it follows that
and subsequently for all . By going to the limit, . ∎
Finally, we will now state the results on sign identification, which implies support identification. For these results, which connect sparse GLMs to VAR sequences and extrapolation, we need to make the following assumption:
the loss functionis not strictly convex and several global minima may exist. However, following earlier results by Rosset et al. (2004), Tibshirani (2013) showed that when the entries of are sampled from a continuous distribution, Section 2 is satisfied almost surely. It is also worth noting that other works on support identification (Nutini et al., 2017; Sun et al., 2019; Poon et al., 2018) involve a non-degeneracy condition which boils down to Section 2 in our case. Motivated by practical applications on real-life datasets, we will therefore use Section 2.
In the following, we extend results by Hale et al. (2008) about sign identification from proximal gradient to coordinate descent. [Sign identification for proximal gradient and coordinate descent] Let Section 2 hold. Let be the sequence of iterates converging to and produced by proximal gradient descent or coordinate descent when solving Section 2 (reminded in lines 1 and 1 of Algorithm 1).
There exists such that:
. The smallest epochfor which this holds is when sign identification is achieved.
For lighter notation in this proof, we denote and . For , the subdifferential inclusion (3) reads:
Motivated by these conditions, the equicorrelation set introduced by Tibshirani (2013) is:
We introduce the saturation gap associated to Section 2:
As is unique, is well-defined, and strictly positive by definition of . By Equation 19, the support of any solution is included in the equicorrelation set and we have equality for almost every since we assumed that the solution is unique (Tibshirani, 2013, Lemma 13) – the non equality holding only at the kinks of the Lasso path, which we exclude from our analysis.
Because of Section 2, we only need to show that the coefficients outside the equicorrelation eventually vanish. The proof requires to study the primal iterates after each update (instead of after each epoch), hence we use the notation for the primal iterate after the -th update of coordinate descent. This update only modifies the -th coordinate, with :
Note that at optimality, for every , one has:
Let us consider an update of coordinate descent such that the updated coordinate verifies and , hence, . Then:
where we used the following inequality (Hale et al., 2008, Lemma 3.2):
Now notice that by definition of the saturation gap (21), and since :
This can only be true for a finite number of updates, since otherwise taking the limit would give , and (Eq. (21)). Therefore, after a finite number of updates, for . For , by Section 2, so has the same sign eventually since it converges to .
The proof for proximal gradient descent is a result of Hale et al. (2008, Theorem 4.5), who provide the bound . ∎
3 A seminal example: the Lasso case
Dual extrapolation was originally proposed for the Lasso in the algorithm (Massias et al., 2018). As the VAR model holds exactly in this case, we first devote special attention to it. We will make use of asymptotic VAR models and generalize to all sparse GLMs in Section 4.
Using the identification property of coordinate descent and proximal gradient descent, we can formalize the VAR behavior of dual iterates:
When is obtained by a cyclic coordinate descent or proximal gradient descent applied to the Lasso problem, is a VAR sequence after sign identification.
Let be the indices of the support of , in increasing order. As the sign is identified, coefficients outside the support are 0 and remain 0. We decompose the -th epoch of coordinate descent into individual coordinate updates:
Let denote the initialization (the beginning of the epoch, ), the iterate after coordinate has been updated, etc., up to after coordinate has been updated, at the end of the epoch ().
Let , then and are equal everywhere, except at coordinate :
where we have used sign identification: . Therefore
This leads to the following linear recurrent equation:
Hence, one gets recursively
We can thus write the following VAR equations for at the end of each epoch coordinate descent:
Let , and denote respectively , and restricted to features in the support . Notice that since we are in the identified sign regime, . With , a proximal gradient descent update reads:
Hence the equivalent of Equation 32 for proximal gradient descent is:
. Right: sequence of residuals after each update of coordinate descent (first iterates in blue, last in yellow). After four updates, the iterates alternate geometrically between the same two constraint hyperplanes.
Figure 1 represents the Lasso dual for a toy problem and illustrates the VAR nature of . As highlighted in Tibshirani (2017), the iterates correspond to the iterates of Dykstra’s algorithm to project onto . During the first updates, the dual iterates do not have a regular trajectory. However, after a certain number of updates (corresponding to sign identification), they alternate in a geometric fashion between two hyperplanes. In this regime, it becomes beneficial to use extrapolation to obtain a point closer to . Equation 31 shows why we combine extrapolation with cyclic coordinate descent: if the coefficients are not always updated in the same order (see Massias et al. 2018, Figure 1(c-d)), the matrix depends on the epoch, and the VAR structure may no longer hold. Having highlighted the VAR behavior of , we can introduce our proposed dual extrapolation: [Extrapolated dual point for the Lasso] For a fixed number of proximal gradient descent or coordinate descent epochs, let denote the residuals at epoch of the algorithm. We define the extrapolated residuals
where is defined as in (17) with .Then, we define the extrapolated dual point as:
In practice, we use and do not compute if cannot be inverted. Additionally, to impose monotonicity of the dual objective, and guarantee a behavior at least as good at , we use as dual point at iteration :
the analysis by Scieur et al. (2016) requires to be symmetrical, which is the case for proximal gradient descent but not for cyclic coordinate descent (as and only commute if and are collinear). To circumvent this issue, we can make symmetrical: instead of considering cyclic updates, we could consider that iterates are produced by a cyclic pass over the coordinates, followed by a cyclic pass over the coordinates in reverse order. The matrix of the VAR in this case is no longer , but (the ’s are symmetrical). A more recent result from Bollapragada et al. (2018) indicates that the bound still holds for a non-symmetric (coherent with the practical results from Section 6), at the price of a much more complex analysis.
for both proximal gradient and coordinate descent we have instead of as soon as : if the support of is of size smaller than (), 1 is an eigenvalue of . Indeed, for coordinate descent, if , there exists a vector , orthogonal to the vectors . The matrix being the orthogonal projection onto , we therefore have for every , hence . For proximal gradient descent, is not invertible when , hence 1 is an eigenvalue of