# Implicit differentiation of Lasso-type models for hyperparameter optimization

Setting regularization parameters for Lasso-type estimators is notoriously difficult, though crucial in practice. The most popular hyperparameter optimization approach is grid-search on held-out validation data. Grid-search, however, requires choosing a predefined grid for each parameter, which scales exponentially in the number of parameters. Another approach is to cast hyperparameter optimization as a bi-level optimization problem that one can solve by gradient descent. The key challenge for these methods is estimating the gradient with respect to the hyperparameters. Computing this gradient via forward or backward automatic differentiation is possible yet usually suffers from high memory consumption. Alternatively, implicit differentiation typically involves solving a linear system, which can be prohibitive and numerically unstable in high dimension. In addition, implicit differentiation usually assumes smooth loss functions, which is not the case for Lasso-type problems. This work introduces an efficient implicit differentiation algorithm, without matrix inversion, tailored for Lasso-type problems. Our approach scales to high-dimensional data by leveraging the sparsity of the solutions. Experiments demonstrate that the proposed method outperforms a large number of standard methods for optimizing the error on held-out data or the Stein Unbiased Risk Estimator (SURE).


## Code Repositories

### sparse-ho

Fast hyperparameter setting for non-smooth estimators.



## 1 Introduction

In many statistical applications, the number of parameters p is much larger than the number of observations n. In such scenarios, a popular approach to tackle linear regression problems is to consider convex ℓ1-type penalties, used in the Lasso (Tibshirani96), Group-Lasso (Yuan_Lin06), Elastic-Net (Zou_Hastie05) or adaptive Lasso (Zou06). These Lasso-type estimators rely on regularization hyperparameters, trading data fidelity against sparsity. Unfortunately, setting these hyperparameters is hard in practice: estimators based on ℓ1-type penalties are indeed more sensitive to the choice of hyperparameters than ℓ2-regularized estimators.

To control for overfitting, it is customary to use different datasets for model training (computing the regression coefficients) and hyperparameter selection (choosing the best regularization parameters). A metric, e.g., the hold-out loss, is optimized on a validation dataset (Stone_Ramer65). Alternatively, one can rely on statistical criteria that penalize complex models, such as AIC/BIC (Liu_Yang11) or SURE (Stein Unbiased Risk Estimator, Stein81). In all cases, hyperparameters are tuned to optimize a chosen metric.

The canonical hyperparameter optimization method is grid-search. It consists in fitting and selecting the best model over a predefined grid of parameter values. The complexity of grid-search is exponential in the number of hyperparameters, making it competitive only when the number of hyperparameters is small. Other hyperparameter selection strategies include random search (Bergstra_Bengio12) and Bayesian optimization (Brochu_Cora_deFreitas10; Snoek_Larochelle_Ryan12), which aims to learn an approximation of the metric over the parameter space and relies on an exploration policy to find the optimum.

Another line of work for hyperparameter optimization (HO) relies on gradient descent in the hyperparameter space. This strategy has been widely explored for smooth objective functions (Larsen_Hansen_Svarer_Ohlsson96; Bengio00; Larsen_Svarer_Andersen_Hansen12). The main challenge for this class of methods is estimating the gradient with respect to the hyperparameters. Gradient estimation techniques are mostly divided into two categories. Implicit differentiation requires the exact solution of the optimization problem and involves the resolution of a linear system (Bengio00). This can be expensive to compute and can lead to numerical instabilities, especially when the system is ill-conditioned (Lorraine_Vicol_Duvenaud2019). Alternatively, iterative differentiation computes the gradient using the iterates of an optimization algorithm. Backward iterative differentiation (Domke12) is computationally efficient when the number of hyperparameters is large. However, it is memory consuming since it requires storing all intermediate iterates. In contrast, forward iterative differentiation (Deledalle_Vaiter_Fadili_Peyre14; Franceschi_Donini_Frasconi_Pontil17) does not require storing the iterates but can be computationally expensive with a large number of hyperparameters; see Baydin_Pearlmutter_Radul_Siskind18 for a survey.

This article proposes to investigate the use of these methods to set the regularization hyperparameters in an automatic fashion for Lasso-type problems. To cover the cases of both low and high number of hyperparameters, two estimators are investigated, namely the Lasso and the weighted Lasso which have respectively one or as many parameters as features. Our contributions are as follows:

• We show that forward iterative differentiation of block coordinate descent (BCD), a state-of-the-art solver for Lasso-type problems, converges towards the true gradient. Crucially, we show that this scheme converges linearly once the support is identified and that its limit does not depend on the initial starting point.

• These results lead to the proposed algorithm (Algorithm 2), where the computation of the Jacobian is decoupled from the computation of the regression coefficients. The latter can be done with state-of-the-art convex solvers, and, interestingly, it does not require solving a potentially ill-conditioned linear system.

• We show through an extensive benchmark on simulated and real high dimensional data that the proposed method outperforms state-of-the-art HO methods.

Our work should not be confused with the line of work by Gregor_LeCun10, where the Lasso objective value is differentiated w.r.t. optimization parameters instead of the solution. In other words, we tackle argmin rather than min differentiation.

Notation The design matrix is X ∈ ℝ^{n×p} (corresponding to n samples and p features) and the observation vector is y ∈ ℝ^n. The regularization parameter, possibly multivariate, is denoted by λ = (λ₁, …, λ_r)^⊤ ∈ ℝ^r. We denote β̂(λ) ∈ ℝ^p the regression coefficients associated to λ, and Ĵ(λ) ∈ ℝ^{p×r} the weak Jacobian (evan1992measure) of β̂(λ) with respect to λ. For a function ψ with weak derivatives of order two, we denote by ∇_β ψ (resp. ∇_λ ψ) its weak gradient with respect to the first parameter (resp. the second parameter). The weak Hessian ∇²ψ(β,λ) is a matrix in ℝ^{(p+r)×(p+r)} which has a block structure

$$\nabla^2\psi(\beta,\lambda)=\begin{pmatrix}\nabla^2_\beta\psi(\beta,\lambda) & \nabla^2_{\beta,\lambda}\psi(\beta,\lambda)\\ \nabla^2_{\lambda,\beta}\psi(\beta,\lambda) & \nabla^2_\lambda\psi(\beta,\lambda)\end{pmatrix}.$$

The support of β̂(λ) (the indices of non-zero coefficients) is denoted by Ŝ, and ŝ represents its cardinality (the number of non-zero coefficients). The sign vector sign β ∈ ℝ^p is the vector of component-wise signs of β (with the convention that sign(0) = 0). Note that, to ease the reading, we drop λ in the notation when it is clear from the context and use β̂, Ĵ, Ŝ and ŝ.

## 2 Background

### 2.1 Problem setting

To favor sparse coefficients, we consider Lasso-type estimators based on non-smooth regularization functions. Such problems consist in finding:

$$\hat\beta^{(\lambda)} \in \operatorname*{arg\,min}_{\beta\in\mathbb{R}^p} \psi(\beta,\lambda)\;.$$

The Lasso (Tibshirani96) is recovered with the number of hyperparameters set to r = 1:

$$\psi(\beta,\lambda) = \tfrac{1}{2n}\lVert y - X\beta\rVert_2^2 + e^{\lambda}\lVert\beta\rVert_1\;,$$

while the weighted Lasso (wLasso, Zou06, introduced to reduce the bias of the Lasso) has r = p hyperparameters and reads:

$$\psi(\beta,\lambda) = \tfrac{1}{2n}\lVert y - X\beta\rVert_2^2 + \sum_{j=1}^{p} e^{\lambda_j}\lvert\beta_j\rvert\;.$$

Note that we adopt the hyperparameter parametrization of Pedregosa16, i.e., we write the regularization parameter as e^λ. This avoids working with a positivity constraint in the optimization process and fixes scaling issues in the line search. It is also coherent with the usual choice of a geometric grid for grid-search (Friedman_Hastie_Tibshirani10).
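As a concrete illustration, the two objectives above can be written in a few lines of NumPy. This is only a sketch of the e^λ parametrization; the function names are ours, not the paper's:

```python
import numpy as np

def lasso_objective(beta, lam, X, y):
    """psi(beta, lambda) = ||y - X beta||^2 / (2n) + e^lambda * ||beta||_1."""
    n = X.shape[0]
    return 0.5 / n * np.linalg.norm(y - X @ beta) ** 2 \
        + np.exp(lam) * np.abs(beta).sum()

def wlasso_objective(beta, lams, X, y):
    """Weighted Lasso: one hyperparameter lambda_j per feature."""
    n = X.shape[0]
    return 0.5 / n * np.linalg.norm(y - X @ beta) ** 2 \
        + (np.exp(lams) * np.abs(beta)).sum()
```

With the exponential parametrization, any λ ∈ ℝ yields a valid (positive) regularization strength, so no positivity constraint is needed during optimization.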

Other formulations could be investigated, like the Elastic-Net or non-convex formulations such as the MCP (Zhang10). Our theory does not cover non-convex cases, though we illustrate numerically that our method behaves properly on them. Handling such non-convex cases is left as a question for future work. The HO problem can be expressed as a nested bi-level optimization problem. For a given differentiable criterion C (e.g., the hold-out loss or the SURE), it reads:

$$\operatorname*{arg\,min}_{\lambda\in\mathbb{R}^r} \Big\{\mathcal{L}(\lambda) \triangleq C\big(\hat\beta^{(\lambda)}\big)\Big\} \quad\text{s.t.}\quad \hat\beta^{(\lambda)} \in \operatorname*{arg\,min}_{\beta\in\mathbb{R}^p}\psi(\beta,\lambda). \tag{1}$$

Note that the SURE itself is not necessarily weakly differentiable with respect to λ. However, a weakly differentiable approximation can be constructed (Ramani_Blu_Unser08; Deledalle_Vaiter_Fadili_Peyre14). Under the hypothesis that Section 2.1 has a unique solution for every λ, the function λ ↦ β̂(λ) is weakly differentiable (Vaiter_Deledalle_Peyre_Dossal_Fadili13). Using the chain rule, the gradient of L with respect to λ then writes:

$$\nabla_\lambda \mathcal{L}(\lambda) = \hat{\mathcal{J}}^\top_{(\lambda)}\,\nabla C\big(\hat\beta^{(\lambda)}\big). \tag{2}$$

Computing the weak Jacobian Ĵ(λ) of the inner problem is the main challenge: once the hypergradient ∇_λL(λ) has been computed, one can use standard gradient descent, λ^{(t+1)} = λ^{(t)} − ρ∇_λL(λ^{(t)}), for a step size ρ > 0. Note, however, that L is usually non-convex and convergence towards a global minimum is not guaranteed. In this work, we propose an efficient algorithm to compute Ĵ(λ) for Lasso-type problems, relying on improved forward differentiation.
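The outer gradient-descent loop using the chain rule of Equation (2) can be sketched as follows. `solve_inner` and `jacobian` are stand-ins for an inner Lasso solver and a Jacobian routine (both are assumptions here, not the paper's API):

```python
import numpy as np

def hyper_gradient_descent(lam0, grad_criterion, solve_inner, jacobian,
                           step=0.1, n_iter=20):
    """Plain gradient descent on the hyperparameter lambda."""
    lam = lam0
    for _ in range(n_iter):
        beta_hat = solve_inner(lam)              # inner problem: beta_hat(lambda)
        jac = jacobian(beta_hat, lam)            # weak Jacobian, shape (p, r)
        hypergrad = jac.T @ grad_criterion(beta_hat)   # Equation (2)
        lam = lam - step * hypergrad             # gradient step on lambda
    return lam
```

In practice a line search replaces the fixed step, as done in the experiments below; this sketch only shows where the Jacobian enters.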

### 2.2 Implicit differentiation (smooth case)

Implicit differentiation, which can be traced back to Larsen_Hansen_Svarer_Ohlsson96, is based on the knowledge of β̂(λ) and requires solving a linear system (Bengio00, Sec. 4). Since then, it has been extensively applied in various contexts. Chapelle_Vapnick_Bousquet_Mukherjee02; Seeger08 used implicit differentiation to select hyperparameters of kernel-based models. Kunisch_Pock13 applied it to image restoration. Pedregosa16 showed that each inner optimization problem could be solved only approximately, leveraging noisy gradients. Related to our work, Foo_Do_Ng08 applied implicit differentiation to a “weighted” Ridge-type estimator (a Ridge penalty with one hyperparameter per feature).

Yet, all the aforementioned methods have a common drawback: they are limited to the smooth setting, since they rely on optimality conditions for smooth optimization. They proceed as follows: if β ↦ ψ(β,λ) is a smooth convex function (for any fixed λ) in Section 2.1, then for all λ, the solution β̂(λ) satisfies the following fixed-point equation:

$$\nabla_\beta\psi\big(\hat\beta^{(\lambda)},\lambda\big)=0. \tag{3}$$

This equation can then be differentiated with respect to λ:

$$\nabla^2_{\beta,\lambda}\psi\big(\hat\beta^{(\lambda)},\lambda\big) + \hat{\mathcal{J}}^\top_{(\lambda)}\,\nabla^2_{\beta}\psi\big(\hat\beta^{(\lambda)},\lambda\big)=0. \tag{4}$$

Assuming that ∇²_β ψ(β̂(λ),λ) is invertible, this leads to a closed-form solution for the weak Jacobian:

$$\hat{\mathcal{J}}^\top_{(\lambda)} = -\nabla^2_{\beta,\lambda}\psi\big(\hat\beta^{(\lambda)},\lambda\big)\Big(\nabla^2_{\beta}\psi\big(\hat\beta^{(\lambda)},\lambda\big)\Big)^{-1}, \tag{5}$$

which in practice is computed by solving a linear system. Unfortunately this approach cannot be generalized for non-smooth problems since Equation 3 no longer holds.
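The smooth-case recipe of Equations (3)-(5) can be checked numerically on a ridge estimator, where everything is differentiable. This is a sketch under the e^λ parametrization, not the paper's Lasso case; `ridge_solution` and `ridge_jacobian` are our names:

```python
import numpy as np

def ridge_solution(lam, X, y):
    """beta_hat(lambda) for psi = ||y - X beta||^2/(2n) + e^lambda ||beta||^2/2."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + np.exp(lam) * np.eye(p), X.T @ y / n)

def ridge_jacobian(lam, X, y):
    """d beta_hat / d lambda via Equation (5): one linear system, no inversion."""
    n, p = X.shape
    beta = ridge_solution(lam, X, y)
    hess = X.T @ X / n + np.exp(lam) * np.eye(p)   # grad^2_beta psi
    cross = np.exp(lam) * beta                     # grad^2_{beta,lambda} psi
    return -np.linalg.solve(hess, cross)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
y = rng.standard_normal(20)
lam = -1.0
jac = ridge_jacobian(lam, X, y)
# sanity check against central finite differences
eps = 1e-6
fd = (ridge_solution(lam + eps, X, y) - ridge_solution(lam - eps, X, y)) / (2 * eps)
assert np.allclose(jac, fd, atol=1e-5)
```

For the Lasso, Equation (3) no longer holds and this construction breaks down, which motivates the non-smooth treatment below.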

### 2.3 Implicit differentiation (non-smooth case)

Related to our work, Mairal_Bach_Ponce12 used implicit differentiation with respect to the dictionary on Elastic-Net models to perform dictionary learning. Regarding Lasso problems, the literature is quite scarce, see (Dossal_Kachour_Fadili_Peyre_Chesneau12; Zou_Hastie_Tibshirani07) and (Vaiter_Deledalle_Peyre_Dossal_Fadili13; Tibshirani_Taylor11) for a more generic setting encompassing the weighted Lasso. General methods for gradient estimation of non-smooth optimization schemes exist (Vaiter_Deledalle_Peyre_Fadili_Dossal17) but are not practical since they depend on a possibly ill-posed linear system to invert. Amos_Brandon17 applied implicit differentiation to estimators based on a quadratic objective function with linear constraints, whereas Niculae_Blondel17 used implicit differentiation on a smooth objective function with simplex constraints. However, none of these approaches leverages the sparsity of Lasso-type estimators.

## 3 Hypergradients for Lasso-type problems

To tackle hyperparameter optimization of non-smooth Lasso-type problems, we propose in this section an efficient algorithm for hypergradient estimation. Our algorithm relies on implicit differentiation, thus enjoying low memory cost, yet it does not require naively solving a (potentially ill-conditioned) linear system of equations. In the sequel, we assume access to a (weighted) Lasso solver, such as ISTA (Daubechies_Defrise_DeMol04) or Block Coordinate Descent (BCD, Tseng_Yun09, see also Algorithm 5).

### 3.1 Implicit differentiation

Our starting point is the key observation that Lasso-type solvers induce a fixed-point iteration that we can leverage to compute the Jacobian. Indeed, proximal BCD algorithms (Tseng_Yun09) consist in a local gradient step composed with a soft-thresholding step (ST); for the Lasso:

$$\beta_j \leftarrow \mathrm{ST}\!\left(\beta_j - \frac{X_{:,j}^\top(X\beta - y)}{\lVert X_{:,j}\rVert^2},\; \frac{ne^{\lambda}}{\lVert X_{:,j}\rVert^2}\right), \tag{6}$$

where ST(t, τ) = sign(t)·max(|t| − τ, 0) for t ∈ ℝ and τ ≥ 0 (extended to vectors component-wise). The solution of the optimization problem satisfies, for any α > 0, the fixed-point equation (Combettes_Wajs05, Prop. 3.1):

$$\hat\beta^{(\lambda)} = \mathrm{ST}\!\left(\hat\beta^{(\lambda)} - \frac{1}{\alpha}X^\top\big(X\hat\beta^{(\lambda)} - y\big),\; \frac{ne^{\lambda}}{\alpha}\right). \tag{7}$$

The former can be differentiated with respect to λ, see Section A.1 in Appendix, leading to a closed-form solution for the Jacobian of the Lasso and the weighted Lasso. [Adapting Vaiter_Deledalle_Peyre_Dossal_Fadili13] Let Ŝ be the support of the vector β̂(λ). Suppose that X_{:,Ŝ}^⊤X_{:,Ŝ} ≻ 0; then a weak Jacobian Ĵ of the Lasso writes:

$$\hat{\mathcal{J}}_{\hat S} = -ne^{\lambda}\big(X_{:,\hat S}^\top X_{:,\hat S}\big)^{-1}\operatorname{sign}\hat\beta_{\hat S}, \tag{8}$$

$$\hat{\mathcal{J}}_{\hat S^c} = 0, \tag{9}$$

and for the weighted Lasso:

$$\hat{\mathcal{J}}_{\hat S,\hat S} = -n\big(X_{:,\hat S}^\top X_{:,\hat S}\big)^{-1}\operatorname{diag}\!\big(e^{\lambda_{\hat S}}\odot\operatorname{sign}\hat\beta_{\hat S}\big), \tag{10}$$

$$\hat{\mathcal{J}}_{j_1,j_2} = 0 \quad\text{if } j_1\notin\hat S \text{ or } j_2\notin\hat S. \tag{11}$$

The proof of Section 3.1 can be found in Section A.1. Note that the positivity condition in Section 3.1 is satisfied if the (weighted) Lasso has a unique solution. Moreover, even in cases with multiple solutions, there exists at least one solution satisfying the positivity condition (Vaiter_Deledalle_Peyre_Dossal_Fadili13).

Section 3.1 shows that the Jacobian of the weighted Lasso is row and column sparse. This is key for algorithmic efficiency. Indeed, a priori, one has to store a possibly dense p × p matrix, which is prohibitive when p is large. Section 3.1 leads to a simple algorithm (see Algorithm 1) to compute the Jacobian cheaply, as it only requires storing and inverting an ŝ × ŝ matrix. Even if the linear system to solve is of size ŝ × ŝ, instead of p × p for a smooth objective function, the system to invert can be ill-conditioned, especially when a large support size is encountered. This leads to numerical instabilities and slows down the resolution (see an illustration in Figure 2). Forward (Algorithm 3 in Appendix) and backward (Algorithm 4 in Appendix) iterative differentiation, which do not require solving linear systems, can overcome these issues.
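A minimal sketch of the closed form of Equations (8)-(9): only an ŝ × ŝ system on the support is solved, never a dense p × p one. The function name is ours:

```python
import numpy as np

def lasso_jacobian(beta_hat, lam, X):
    """Weak Jacobian of the Lasso w.r.t. lambda, Equations (8)-(9)."""
    n, p = X.shape
    supp = np.flatnonzero(beta_hat)                  # support S-hat
    jac = np.zeros(p)                                # zero outside the support (Eq. 9)
    Xs = X[:, supp]
    # s x s linear system restricted to the support (Eq. 8)
    jac[supp] = -n * np.exp(lam) * np.linalg.solve(
        Xs.T @ Xs, np.sign(beta_hat[supp]))
    return jac
```

For instance, with X = Id₃, λ = 0 and β̂ = (1, 0, −2), the support is {0, 2} and the Jacobian is (−3, 0, 3).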

### 3.2 Link with iterative differentiation

Iterative differentiation in the field of hyperparameter setting can be traced back to Domke12, who derived a backward differentiation algorithm for gradient descent, heavy ball and L-BFGS algorithms applied to smooth loss functions. Agrawal_Amos_Barratt_Boyd_Diamond_Kolter19 generalized it to a specific subset of convex programs. MacLaurin_Duvenaud_Adams15 derived a backward differentiation for stochastic gradient descent. On the other hand, Deledalle_Vaiter_Fadili_Peyre14 used forward differentiation of (accelerated) proximal gradient descent for hyperparameter optimization with non-smooth penalties. Franceschi_Donini_Frasconi_Pontil17 proposed a benchmark of forward mode versus backward mode, varying the number of hyperparameters to learn.

Forward differentiation consists in differentiating each step of the algorithm (with respect to λ in our case). For the Lasso solved with BCD, it amounts to differentiating Equation 6, and leads to the following recursive equation for the Jacobian, with z_j = β_j − X_{:,j}^⊤(Xβ − y)/‖X_{:,j}‖²:

$$\mathcal{J}_j \leftarrow \partial_1\mathrm{ST}\!\left(z_j, \tfrac{ne^{\lambda}}{\lVert X_{:,j}\rVert^2}\right)\left(\mathcal{J}_j - \tfrac{1}{\lVert X_{:,j}\rVert^2}X_{:,j}^\top X\mathcal{J}\right) + \partial_2\mathrm{ST}\!\left(z_j, \tfrac{ne^{\lambda}}{\lVert X_{:,j}\rVert^2}\right)\frac{ne^{\lambda}}{\lVert X_{:,j}\rVert^2}, \tag{12}$$

see Algorithm 3 (in Appendix) for full details. Our proposed algorithm exploits the fact that, after a fixed number of epochs, the support and the signs of the iterates are constant. It is thus possible to decouple the computation of the Jacobian: first solve Section 2.1, then apply the forward differentiation recursion steps, see Algorithm 2. This can be seen as the forward counterpart, in a non-smooth case, of the recent paper Lorraine_Vicol_Duvenaud2019.

An additional benefit of such updates is that they can be restricted to the (current) support, which leads to faster Jacobian computation. We now show that the Jacobian computed using forward differentiation and our method, Algorithm 2, converges toward the true Jacobian.

Assuming the Lasso solution (Section 2.1) (or the weighted Lasso solution) is unique, Algorithms 3 and 2 converge toward the implicit differentiation solution defined in Section 3.1. Moreover, once the support has been identified, the convergence of the Jacobian is linear and its limit does not depend on the initial starting point. ∎ The proof of Section 3.2 can be found in Sections A.3 and A.2.

As an illustration, Figure 1 shows the computation times of a single gradient and the distance to the “optimum” of this gradient as a function of the number of iterations in the inner optimization problem, for forward iterative differentiation (Algorithm 3), backward iterative differentiation (Algorithm 4), and the proposed algorithm (Algorithm 2). Backward iterative differentiation is several orders of magnitude slower than the forward and our implicit forward method. Moreover, once the support has been identified (after 20 iterations), the proposed implicit forward method converges faster than the other methods. Note also that in Sections 3.2 and 3.1 the Jacobian for the Lasso only depends on the support (the indices of the non-zero coefficients) of the regression coefficients β̂. In other words, once the support of β̂ is correctly identified, even if the values of the non-zero coefficients are not correctly estimated, the Jacobian is exact; see Sun_Jeong_Nutini_Schmidt2019 for support identification guarantees.
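The forward recursion of Equation (12) can be sketched for ISTA (full proximal gradient steps rather than BCD, for brevity): the Jacobian is updated alongside the coefficients and, once the support is identified, it matches the closed form of Equation (8). All names here are ours, and the snippet is only a toy illustration:

```python
import numpy as np

def ST(t, tau):
    """Soft-thresholding, component-wise on vectors."""
    return np.sign(t) * np.maximum(np.abs(t) - tau, 0.0)

def ista_forward_diff(X, y, lam, n_iter=2000):
    """ISTA iterates plus the forward-mode Jacobian recursion w.r.t. lambda."""
    n, p = X.shape
    alpha = np.linalg.norm(X, ord=2) ** 2          # step normalization
    beta, jac = np.zeros(p), np.zeros(p)
    for _ in range(n_iter):
        z = beta - X.T @ (X @ beta - y) / alpha
        thresh = n * np.exp(lam) / alpha
        d1 = (np.abs(z) > thresh).astype(float)    # weak derivative of ST in t
        # differentiate the ISTA step w.r.t. lambda (chain rule, cf. Eq. 12)
        jac = d1 * (jac - X.T @ (X @ jac) / alpha) - np.sign(z) * d1 * thresh
        beta = ST(z, thresh)
    return beta, jac

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 10))
y = rng.standard_normal(30)
lam = -2.0
beta, jac = ista_forward_diff(X, y, lam)
supp = np.flatnonzero(beta)
if supp.size:  # compare with the closed form of Equation (8)
    closed = -X.shape[0] * np.exp(lam) * np.linalg.solve(
        X[:, supp].T @ X[:, supp], np.sign(beta[supp]))
    assert np.allclose(jac[supp], closed, atol=1e-4)
```

Off the support the recursion zeroes the Jacobian, matching Equation (9); the decoupled Algorithm 2 runs the same recursion, but only after the coefficients have been computed.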

## 4 Experiments

All the experiments are written in Python using Numba (Lam_Pitrou_Seibert15) for the critical parts such as the BCD loop. We compare our gradient computation technique against other competitors (see the competitors section) on the HO problem (Section 2.1).

Solving the inner optimization problem. Note that our proposed method has the appealing property that it can be used with any solver. For instance, for the Lasso, one can combine the proposed algorithm with a state-of-the-art solver such as that of Massias_Gramfort_Salmon18, which would be tedious to combine with iterative differentiation methods. However, for the comparison to be fair, we used the same vanilla BCD algorithm for all methods (recalled in Algorithm 5). We stop the Lasso-type solvers when the suboptimality of the cost function of the Lasso or wLasso falls below a given tolerance ε, fixed at the same value for all methods throughout the different benchmarks.

Line search. For each hypergradient-based method, the gradient step is combined with a line-search strategy following the work of Pedregosa16. The procedure is detailed in the Appendix (see https://github.com/fabianp/hoag for details).

Initialization. Since the function to optimize is not convex, initialization plays a crucial role in the final solution as well as in the convergence of the algorithm. For instance, initializing λ in a flat zone of L could lead to slow convergence. In the numerical experiments, the Lasso is initialized with a value of λ close to λ_max, where λ_max is the smallest λ such that 0 is a solution of Section 2.1.

Competitors. In this section we compare the empirical performance of the proposed algorithm to different competitors. Competitors are divided into two categories. Firstly, the ones relying on the hyperparameter gradient:

• Implicit forward (proposed): described in Algorithm 2.

• Implicit: implicit differentiation, which requires solving a linear system, as described in Algorithm 1.

• F. Iterdiff.: forward iterative differentiation (Deledalle_Vaiter_Fadili_Peyre14; Franceschi_Donini_Frasconi_Pontil17), which jointly computes the regression coefficients and the Jacobian, as shown in Algorithm 3.

Secondly, the ones not based on hyperparameter gradient:

• Grid-search: as recommended by Friedman_Hastie_Tibshirani10, we use values of λ on a uniformly-spaced grid.

• Random-search: we sample values uniformly at random on the same interval as for the grid-search, as suggested by Bergstra13.

• Bayesian: sequential model-based optimization (SMBO) using a Gaussian process to model the objective function. We used the implementation of Bergstra13. The constraint space for the hyperparameter search was set to the same range, and the expected improvement (EI) was used as the acquisition function.

The cost and the quantity computed by each algorithm can be found in Table 1. Backward iterative differentiation (Domke12) is not included in the benchmark since it was several orders of magnitude slower than the other techniques (see Figure 1). This is due to the high cost of the BCD algorithm in backward mode, see Table 1.

### 4.1 Application to held-out loss

When using the held-out loss, each dataset is split into 3 equal parts: the training set, the validation set and the test set.

(Lasso, held-out criterion). For the Lasso and the held-out loss, the bilevel optimization Section 2.1 reads:

$$\operatorname*{arg\,min}_{\lambda\in\mathbb{R}} \big\lVert y^{\mathrm{val}} - X^{\mathrm{val}}\hat\beta^{(\lambda)}\big\rVert^2 \quad\text{s.t.}\quad \hat\beta^{(\lambda)} \in \operatorname*{arg\,min}_{\beta\in\mathbb{R}^p} \tfrac{1}{2n}\big\lVert y^{\mathrm{train}} - X^{\mathrm{train}}\beta\big\rVert_2^2 + e^{\lambda}\lVert\beta\rVert_1. \tag{13}$$

Figure 2 (top) shows, on 3 datasets (see Appendix D for dataset details), the distance to the “optimum” of L(λ) as a function of time. Here the goal is to find the solution of Equation 13. The “optimum” is chosen as the minimum of L(λ) among all the methods. Figure 2 (bottom) shows the loss on the test set (independent from the training set and the validation set). This illustrates how well the estimator generalizes. Firstly, it can be seen that on all datasets the proposed implicit forward method outperforms forward iterative differentiation, which illustrates Section 3.2 and corroborates the cost of each algorithm in Table 1. Secondly, it can be seen that on the 20news dataset (Figure 2, top) the convergence of implicit differentiation (Algorithm 1) is slower than that of the other hypergradient-based methods, and even slower than grid-search. In this case, this is due to the very slow convergence of the conjugate gradient algorithm (Nocedal_Wright06) when solving the ill-conditioned linear system in Algorithm 1.

(MCP, held-out criterion). We also applied our algorithm to an estimator based on a non-convex penalty: the MCP (Zhang10) with 2 hyperparameters. Since the penalty is non-convex, the estimator may not be continuous with respect to the hyperparameters and the theory developed above does not hold. However, the proposed method experimentally outperforms its competitors for the HO; see Appendix C for full details.

### 4.2 Application to another criterion: SURE

Evaluating models on held-out data makes sense if the design is formed from random samples, as is often considered in supervised learning. However, this assumption does not hold for certain kinds of applications in signal or image processing. For these applications, the held-out loss cannot be used as the criterion for optimizing the hyperparameters of a given model. In this case, one may use a proxy of the prediction risk, like the Stein Unbiased Risk Estimation (SURE, Stein81). The SURE is an unbiased estimator of the prediction risk under weak differentiability conditions. The drawback of this criterion is that it requires knowledge of the variance σ² of the noise. The SURE is defined as follows:

$$\mathrm{SURE}(\lambda) = \big\lVert y - X\hat\beta^{(\lambda)}\big\rVert^2 - n\sigma^2 + 2\sigma^2\,\mathrm{dof}\big(\hat\beta^{(\lambda)}\big),$$

where the degrees of freedom (dof, Efron86) is defined as

$$\mathrm{dof}\big(\hat\beta^{(\lambda)}\big) = \frac{1}{\sigma^2}\sum_{i=1}^{n}\operatorname{cov}\big(y_i, (X\hat\beta^{(\lambda)})_i\big).$$

The dof can be seen as a measure of the complexity of the model; for instance, for the Lasso, the dof equals the support size ŝ, see Zou_Hastie_Tibshirani07. The SURE can thus be seen as a criterion trading data fidelity against model complexity. However, the dof is not differentiable (not even continuous in the Lasso case), yet it is possible to construct a weakly differentiable approximation of it based on Finite-Difference Monte Carlo (see Deledalle_Vaiter_Fadili_Peyre14 for full details), with ε > 0 and δ ∼ N(0, Id_n):

$$\mathrm{dof}_{\mathrm{FDMC}}(y,\lambda,\delta,\epsilon) = \frac{1}{\epsilon}\Big\langle X\hat\beta^{(\lambda)}(y+\epsilon\delta) - X\hat\beta^{(\lambda)}(y),\;\delta\Big\rangle.$$

We use this smooth approximation in the bi-level optimization problem to find the best hyperparameter. The bi-level optimization problem then reads:

$$\begin{aligned}
\operatorname*{arg\,min}_{\lambda\in\mathbb{R}}\;& \big\lVert y - X\hat\beta^{(\lambda)}(y)\big\rVert^2 + 2\sigma^2\,\mathrm{dof}_{\mathrm{FDMC}}(y,\lambda,\delta,\epsilon)\\
\text{s.t. }& \hat\beta^{(\lambda)}(y) \in \operatorname*{arg\,min}_{\beta\in\mathbb{R}^p}\tfrac{1}{2n}\lVert y - X\beta\rVert_2^2 + e^{\lambda}\lVert\beta\rVert_1\\
& \hat\beta^{(\lambda)}(y+\epsilon\delta) \in \operatorname*{arg\,min}_{\beta\in\mathbb{R}^p}\tfrac{1}{2n}\lVert y + \epsilon\delta - X\beta\rVert_2^2 + e^{\lambda}\lVert\beta\rVert_1
\end{aligned} \tag{14}$$

Note that solving this problem requires computing two Jacobians of the solution (instead of one for the held-out loss), at the points y and y + εδ.
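The SURE with the FDMC dof approximation described above can be sketched as follows. `solver(y, lam)` is a stand-in for any (weighted) Lasso solver returning β̂(λ) for observations y; the function name is ours:

```python
import numpy as np

def sure_fdmc(solver, X, y, lam, sigma, epsilon, delta):
    """SURE(lambda) with the Finite-Difference Monte-Carlo dof approximation."""
    n = X.shape[0]
    beta = solver(y, lam)                       # beta_hat(lambda) at y
    beta_pert = solver(y + epsilon * delta, lam)  # beta_hat(lambda) at y + eps*delta
    dof = (X @ beta_pert - X @ beta) @ delta / epsilon  # dof_FDMC
    return np.linalg.norm(y - X @ beta) ** 2 - n * sigma ** 2 \
        + 2 * sigma ** 2 * dof
```

As in Equation (14), two inner problems are solved per evaluation; differentiating the criterion with respect to λ therefore involves the Jacobians at both perturbed and unperturbed observations.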

(Lasso, SURE criterion). To investigate the estimation performance of the proposed method in comparison to the competitors described above, we used as metric the (normalized) Mean Squared Error (MSE) between the estimated and true coefficients. The entries of the design matrix X are i.i.d. random Gaussian variables. The number of rows n is fixed. We then generated β* with 5 non-zero coefficients. The vector y was computed by adding to Xβ* additive Gaussian noise controlled by the Signal-to-Noise Ratio. Following Deledalle_Vaiter_Fadili_Peyre14, we set ε accordingly. We varied the number of features p between 200 and 10,000 on a linear grid of size 10. For a fixed number of features, we performed 50 repetitions and each point of the curves represents the mean of these repetitions. Comparing efficiency in time between methods is difficult since they are not directly comparable. Indeed, grid-search and random-search discretize the HO space whereas the other methods work in the continuous space, which is already an advantage. However, to be able to compare the hypergradient methods, and possibly compare them to the others, we computed the total amount of time for a method to return its optimal value of λ. In order to have a fair comparison, we compared a fixed number of line-search evaluations for each hypergradient method, 50 evaluations of the Bayesian methods, and finally 50 evaluations on a fixed or random grid. We are aware that the cost of each of these evaluations is not the same, but this allows us to see that our method stays competitive in time when optimizing one parameter. Moreover, we will also see that our method scales better with a large number of hyperparameters to optimize.

Figure 3 shows the influence of the number of features on the MSE and the computation time. First, the MSE of all competitors is comparable, which means that the value of λ obtained by the different methods is approximately the same. The only method that performs worse than the others is implicit differentiation. We attribute this to instabilities in the matrix inversion it requires. Second, it can be seen that our method has the same estimation performance as state-of-the-art methods like grid-search. This illustrates that our approach is comparable to the others in terms of quality of the estimator. Yet, its running time is the lowest of all hypergradient-based strategies and competes with grid-search and random-search.

(Weighted Lasso vs Lasso, SURE criterion). As our method leverages the sparsity of the solution, it can be used for HO with a large number of hyperparameters, contrary to classical implicit differentiation. The weighted Lasso (wLasso, Zou06) has p hyperparameters and was introduced to reduce the bias of the Lasso. However, setting these p hyperparameters is impossible with grid-search.

Figure 4 shows the estimation MSE and the running time of the different methods to obtain the hyperparameter values as a function of the number of features used to simulate the data. The simulation setting is the same as for the Lasso problems investigated in Figure 3. We compared the classical Lasso estimator and the weighted Lasso estimator, where the regularization hyperparameters were chosen using the proposed method and forward iterative differentiation as described in Algorithm 3. Section 2.1 is not convex for the weighted Lasso and a descent algorithm like ours can be trapped in local minima, crucially depending on the starting point. To alleviate this problem, we introduced a regularized version of Section 2.1:

$$\operatorname*{arg\,min}_{\lambda\in\mathbb{R}^p} C\big(\hat\beta^{(\lambda)}\big) + \gamma\sum_{j=1}^{p}\lambda_j^2 \quad\text{s.t.}\quad \hat\beta^{(\lambda)} \in \operatorname*{arg\,min}_{\beta\in\mathbb{R}^p}\psi(\beta,\lambda). \tag{15}$$

The solution obtained by solving Section 4.2 is then used as the initialization for our algorithm. In this experiment the regularization coefficient γ is kept constant. We see in Figure 4 that the weighted Lasso gives a lower MSE than the Lasso and allows for a better recovery of β*. This experiment shows that the amount of time needed to obtain the vector of p hyperparameters of the weighted Lasso via our algorithm is in the same range as that needed to obtain the unique hyperparameter of the Lasso problem. It also shows that our proposed method is much faster than the naive way of computing the Jacobian using forward iterative differentiation. A maximum running-time threshold was used for this experiment, checking the running time at each line-search iteration, which explains why the forward iterative differentiation of the wLasso does not explode in time in Figure 4.

## Conclusion

In this work we studied the performance of several methods to select hyperparameters of Lasso-type estimators, showing results for the Lasso and the weighted Lasso, which have respectively one or p hyperparameters. We exploited the sparsity of the solutions and the specific structure of the iterates of forward differentiation, leading to our algorithm that efficiently computes the full Jacobian of these estimators with respect to the hyperparameters. This allowed us to select them through a standard gradient descent and obtain an approach that scales to a high number of hyperparameters. Importantly, contrary to a classical implicit differentiation approach, the proposed algorithm does not require solving a linear system. Finally, thanks to its two-step nature, it is possible to leverage in the first step the availability of state-of-the-art Lasso solvers that make use of techniques such as active sets or screening rules. Such algorithms, which involve calls to inner solvers run on subsets of features, are discontinuous with respect to the hyperparameters, which would significantly challenge a single-step approach based on automatic differentiation.

#### Acknowledgments

This work was funded by ERC Starting Grant SLAB ERC-YStG-676943.

## Appendix A Proofs

### a.1 Proof of Section 3.1

We start with a lemma on the weak derivative of the soft-thresholding. The soft-thresholding ST(t, τ) = sign(t)·max(|t| − τ, 0) is weakly differentiable with weak derivatives

$$\frac{\partial\mathrm{ST}}{\partial t}(t,\tau) = \mathbb{1}_{\{|t|>\tau\}}, \tag{16}$$

and

$$\frac{\partial\mathrm{ST}}{\partial \tau}(t,\tau) = -\operatorname{sign}(t)\cdot\mathbb{1}_{\{|t|>\tau\}}, \tag{17}$$

where

$$\mathbb{1}_{\{|t|>\tau\}} = \begin{cases}1, & \text{if } |t|>\tau,\\ 0, & \text{otherwise.}\end{cases} \tag{18}$$

###### Proof.

(Section 3.1, Lasso ISTA) The soft-thresholding is differentiable almost everywhere (a.e.), thus Equation 7 can be differentiated a.e. thanks to the previous lemma, and, with τ = ne^λ/α, for any α > 0:

$$\hat{\mathcal{J}} = \begin{pmatrix}\mathbb{1}_{\{|\hat\beta_1|>\tau\}}\\ \vdots\\ \mathbb{1}_{\{|\hat\beta_p|>\tau\}}\end{pmatrix}\odot\Big(\mathrm{Id}_p - \tfrac{1}{\alpha}X^\top X\Big)\hat{\mathcal{J}} - \frac{ne^{\lambda}}{\alpha}\begin{pmatrix}\operatorname{sign}(\hat\beta_1)\,\mathbb{1}_{\{|\hat\beta_1|>\tau\}}\\ \vdots\\ \operatorname{sign}(\hat\beta_p)\,\mathbb{1}_{\{|\hat\beta_p|>\tau\}}\end{pmatrix}.$$

Inspecting coordinates inside and outside the support of β̂ leads to:

$$\begin{cases}\hat{\mathcal{J}}_{\hat S^c} = 0\\[2pt] \hat{\mathcal{J}}_{\hat S} = \hat{\mathcal{J}}_{\hat S} - \tfrac{1}{\alpha}X_{:,\hat S}^\top X_{:,\hat S}\hat{\mathcal{J}}_{\hat S} - \tfrac{ne^{\lambda}}{\alpha}\operatorname{sign}\hat\beta_{\hat S}.\end{cases} \tag{19}$$

Rearranging the terms of Equation 19 yields:

$$X_{:,\hat S}^\top X_{:,\hat S}\,\hat{\mathcal{J}}_{\hat S} = -ne^{\lambda}\operatorname{sign}\hat\beta_{\hat S} \tag{20}$$

$$\hat{\mathcal{J}}_{\hat S} = -ne^{\lambda}\big(X_{:,\hat S}^\top X_{:,\hat S}\big)^{-1}\operatorname{sign}\hat\beta_{\hat S}. \tag{21}$$

(Section 3.1, Lasso BCD)

The fixed-point equation for the BCD case is

$$\hat\beta_j = \mathrm{ST}\!\left(\hat\beta_j - \frac{1}{\lVert X_{:,j}\rVert_2^2}X_{:,j}^\top\big(X\hat\beta - y\big),\; \frac{ne^{\lambda}}{\lVert X_{:,j}\rVert_2^2}\right). \tag{22}$$

As before, we can differentiate this fixed-point equation (Equation 22), with τ_j = ne^λ/‖X_{:,j}‖₂²:

$$\hat{\mathcal{J}}_j = \mathbb{1}_{\{|\hat\beta_j|>\tau_j\}}\cdot\Big(\hat{\mathcal{J}}_j - \tfrac{1}{\lVert X_{:,j}\rVert_2^2}X_{:,j}^\top X\hat{\mathcal{J}}\Big) - \frac{ne^{\lambda}}{\lVert X_{:,j}\rVert_2^2}\operatorname{sign}(\hat\beta_j)\,\mathbb{1}_{\{|\hat\beta_j|>\tau_j\}}, \tag{23}$$

leading to the same result. ∎

### a.2 Proof of Section 3.2 in the ISTA case

###### Proof.

(Lasso case, ISTA) In Algorithm 3, β^{(k)} follows ISTA steps and thus converges toward the solution β̂ of the Lasso. Let Ŝ be the support of the Lasso estimator β̂, and ν(Ŝ) the smallest eigenvalue of X_{:,Ŝ}^⊤X_{:,Ŝ}. Under the uniqueness assumption, proximal gradient descent (ISTA) achieves sign identification (Hale_Yin_Zhang08): there exists k₀ ∈ ℕ such that for all k ≥ k₀:

$$\operatorname{sign}\beta^{(k+1)} = \operatorname{sign}\hat\beta. \tag{24}$$

Recalling that the update of the Jacobian for the Lasso solved with ISTA is the following:

$$\mathcal{J}^{(k+1)} = \big|\operatorname{sign}\beta^{(k+1)}\big|\odot\Big(\mathrm{Id} - \tfrac{1}{\lVert X\rVert_2^2}X^\top X\Big)\mathcal{J}^{(k)} - \frac{ne^{\lambda}}{\lVert X\rVert_2^2}\operatorname{sign}\beta^{(k+1)},$$

it is clear that J^{(k+1)} is sparse with the sparsity pattern Ŝ for all k ≥ k₀. Thus we have, for all k ≥ k₀:

$$\begin{aligned}
\mathcal{J}^{(k+1)}_{\hat S} &= \mathcal{J}^{(k)}_{\hat S} - \tfrac{1}{\lVert X\rVert_2^2}X_{:,\hat S}^\top X\mathcal{J}^{(k)} - \tfrac{ne^{\lambda}}{\lVert X\rVert_2^2}\operatorname{sign}\hat\beta_{\hat S}\\
&= \mathcal{J}^{(k)}_{\hat S} - \tfrac{1}{\lVert X\rVert_2^2}X_{:,\hat S}^\top X_{:,\hat S}\mathcal{J}^{(k)}_{\hat S} - \tfrac{ne^{\lambda}}{\lVert X\rVert_2^2}\operatorname{sign}\hat\beta_{\hat S}\\
&= \Big(\mathrm{Id}_{\hat S} - \tfrac{1}{\lVert X\rVert_2^2}X_{:,\hat S}^\top X_{:,\hat S}\Big)\mathcal{J}^{(k)}_{\hat S} - \tfrac{ne^{\lambda}}{\lVert X\rVert_2^2}\operatorname{sign}\hat\beta_{\hat S}.
\end{aligned} \tag{25}$$

One can remark that Ĵ_Ŝ, defined in Equation 8, satisfies the following:

$$\hat{\mathcal{J}}_{\hat S} = \Big(\mathrm{Id}_{\hat S} - \tfrac{1}{\lVert X\rVert_2^2}X_{:,\hat S}^\top X_{:,\hat S}\Big)\hat{\mathcal{J}}_{\hat S} - \frac{ne^{\lambda}}{\lVert X\rVert_2^2}\operatorname{sign}\hat\beta_{\hat S}. \tag{26}$$

Combining Equations 26 and 25, and denoting ν(Ŝ) the smallest eigenvalue of X_{:,Ŝ}^⊤X_{:,Ŝ}, we have for all k ≥ k₀:

$$\begin{aligned}
\mathcal{J}^{(k+1)}_{\hat S} - \hat{\mathcal{J}}_{\hat S} &= \Big(\mathrm{Id}_{\hat S} - \tfrac{1}{\lVert X\rVert_2^2}X_{:,\hat S}^\top X_{:,\hat S}\Big)\big(\mathcal{J}^{(k)}_{\hat S} - \hat{\mathcal{J}}_{\hat S}\big)\\
\big\lVert \mathcal{J}^{(k+1)}_{\hat S} - \hat{\mathcal{J}}_{\hat S}\big\rVert_2 &\le \Big(1 - \tfrac{\nu(\hat S)}{\lVert X\rVert_2^2}\Big)\big\lVert \mathcal{J}^{(k)}_{\hat S} - \hat{\mathcal{J}}_{\hat S}\big\rVert_2\\
\big\lVert \mathcal{J}^{(k)}_{\hat S} - \hat{\mathcal{J}}_{\hat S}\big\rVert_2 &\le \Big(1 - \tfrac{\nu(\hat S)}{\lVert X\rVert_2^2}\Big)^{k-k_0}\big\lVert \mathcal{J}^{(k_0)}_{\hat S} - \hat{\mathcal{J}}_{\hat S}\big\rVert_2.
\end{aligned}$$

Thus the sequence of Jacobians converges linearly to Ĵ once the support is identified. ∎

###### Proof.

(wLasso case, ISTA) Recalling that the update of the Jacobian for the wLasso solved with ISTA is the following:

$$\mathcal{J}^{(k+1)} = \big|\operatorname{sign}\beta^{(k+1)}\big|\odot\Big(\mathrm{Id} - \tfrac{1}{\lVert X\rVert_2^2}X^\top X\Big)\mathcal{J}^{(k)} - \frac{n}{\lVert X\rVert_2^2}\operatorname{diag}\big(e^{\lambda}\odot\operatorname{sign}\beta^{(k+1)}\big), \tag{27}$$

the proof follows exactly the same steps as in the ISTA Lasso case to show convergence in spectral norm of the sequence toward Ĵ.

### a.3 Proof of Section 3.2 in the BCD case

The goal of the proof is to show that the Jacobian sequence generated by the Block Coordinate Descent algorithm (Algorithm 3) converges toward the true Jacobian Ĵ. The main difficulty is to show that the Jacobian sequence follows an asymptotic Vector AutoRegressive (VAR) recursion (see Massias_Vaiter_Gramfort_Salmon19 for more detail), i.e., that there exist k₀ ∈ ℕ, A and B such that for all k ≥ k₀:

$$\mathcal{J}^{(k+1)} = A\mathcal{J}^{(k)} + B, \tag{28}$$

with A a contracting operator. We follow exactly the proof of Massias_Vaiter_Gramfort_Salmon19.

###### Proof.

(Lasso, BCD)

Let j₁ < … < j_ŝ be the indices of the support of β̂, in increasing order. As the sign is identified, coefficients outside the support are 0 and remain 0. We decompose the k-th epoch of coordinate descent into individual coordinate updates: let J̃^{(0)} denote the initialization (i.e., the beginning of the epoch, J̃^{(0)} = J^{(k)}), J̃^{(1)} the iterate after coordinate j₁ has been updated, etc., up to J̃^{(ŝ)} after coordinate j_ŝ has been updated, i.e., at the end of the epoch (J̃^{(ŝ)} = J^{(k+1)}). For s ∈ {1, …, ŝ}, J̃^{(s)} and J̃^{(s−1)} are equal everywhere except at coordinate j_s:

$$\begin{aligned}
\tilde{\mathcal{J}}^{(s)}_{j_s} &= \tilde{\mathcal{J}}^{(s-1)}_{j_s} - \tfrac{1}{\lVert X_{:,j_s}\rVert^2}X_{:,j_s}^\top X\tilde{\mathcal{J}}^{(s-1)} - \tfrac{ne^{\lambda}}{\lVert X_{:,j_s}\rVert^2}\operatorname{sign}\hat\beta_{j_s} \quad\text{(after sign identification)} \tag{29}\\
&= \tilde{\mathcal{J}}^{(s-1)}_{j_s} - \tfrac{1}{\lVert X_{:,j_s}\rVert^2}X_{:,j_s}^\top X_{:,\hat S}\tilde{\mathcal{J}}^{(s-1)}_{\hat S} - \tfrac{ne^{\lambda}}{\lVert X_{:,j_s}\rVert^2}\operatorname{sign}\hat\beta_{j_s} \tag{30}
\end{aligned}$$

$$X_{:,\hat S}\,\tilde{\mathcal{J}}^{(s)}_{\hat S} = \underbrace{\Big(\mathrm{Id}_n - \tfrac{X_{:,j_s}X_{:,j_s}^\top}{\lVert X_{:,j_s}\rVert^2}\Big)}_{A_s}X_{:,\hat S}\,\tilde{\mathcal{J}}^{(s-1)}_{\hat S}\; \underbrace{-\;\tfrac{ne^{\lambda}\operatorname{sign}\hat\beta_{j_s}}{\lVert X_{:,j_s}\rVert^2}X_{:,j_s}}_{b_s} \tag{31}$$

We thus have:

$$X_{:,\hat S}\,\tilde{\mathcal{J}}^{(\hat s)}_{\hat S} = \underbrace{A_{\hat s}\cdots A_1}_{A\,\in\,\mathbb{R}^{n\times n}}X_{:,\hat S}\,\tilde{\mathcal{J}}^{(0)}_{\hat S} + \dots