I Introduction
Given their proven utility to control the model complexity, hyperparameters are crucial for a successful application of many statistical learning schemes in realworld engineering problems. The generalization capability and performance of such schemes on unknown instances can be improved with a careful hyperparameter selection. Regularized models control the tradeoff between a data fidelity term and a complexity term known as regularizer by means of one or several hyperparameters. Ridge regression, Lasso, Group Lasso, and Elastic net are instances of regularized models. While the regression weights can be optimized efficiently via the proximal gradient descent (PGD) method and its variants, the associated hyperparameter optimization (HO) is a nonconvex, challenging problem
[1]. One main motivation to develop HO schemes for PGDbased learning algorithms is the interest in solving for models with sparsity, which can enhance their interpretability.Given a dataset in batch form, a commonly applied criterion for hyperparameter optimization is the leaveoneout (LOO) validation error, because it reflects the ability of an estimator to predict outputs for unobserved patterns
[2]. The computational cost of evaluating the LOO validation error grows superlinearly with the number of data points, so that it is often approximated by Nfold cross validation (CV) with a small N (e.g., 10). Common practice to search for (sub)optimal hyperparameters is to use grid search or random search [1, 3] because of their simplicity.An improved form of random search are configurationevaluation methods, which focus the computation resources in promising hyperparameter configurations by quickly eliminating poor ones, important examples of it being the Hyperband [4] and Bayesian optimizationbased approaches in [5].
Gradientbased (exact and approximate) HO methods have been proposed recently for problems where the cost function is smooth. Several recent approaches formulate a bilevel program where an inner program is the optimization of the model parameters (model weights in the case of regression) and the outer program is the minimization of a surrogate of the generalization capability (e.g. validation MSE). In particular, [6] applies the implicit function Theorem to a stationarity condition to obtain the hypergradient (gradient of the outer cost function w.r.t. the hyperparameters); however, this approach requires calculating the Hessian w.r.t. the model parameters and, consequently, it cannot be applied to widely used nonsmooth regularizers (such as Lasso/group Lasso).
The approaches in [7, 8, 9] obtain a hypergradient by modeling the optimization of the regression weights as a dynamical system, where the state space is the parameter space and each iteration corresponds to a mapping from/to the same space. While [7] requires the aforementioned mapping to be invertible, [8, 9] avoid such a requirement by resorting to an approximation. This work combines ideas from [6, 8] to formulate a different implicit equation, derive the exact hypergradient, and develop a method that can work with nonsmooth regularizers and, additionally, admits an online variant.
If the data is received in a streaming fashion, and the data distribution is timevarying, one may be interested in algorithms that find the right regularization parameter in different time segments or data windows, such as the adaptive approach in [7], which is specific for Lasso estimators. On the contrary, our approach is general enough to be applied to several generalizations of Lasso, such as Group Lasso.
On the other hand, methods that train a neural network to predict optimal regression weights given a hyperparameter vector
[10] have been proposed, together with approximations that alternate between updating the neural network weights and the hyerparameters. However, these methods incur in heavy overparameterization, to the point of requiring more neural network parameters than the dimensionality of the regression weights and the hyperparameters together.Another approximation alleviating computation in hyperparameter search is porposed in [11], where the structure of specific estimators such as Lasso is exploited to approximately compute the LOO error metric at a very low cost. Note the difference with [10] because here it is only the error metric what is approximated, instead of the parameter vector. Despite the reduced computation, using this approximation for HO still requires a grid/random search scheme, which does not scale well with the dimensionality.
In this paper, we propose and evaluate a method that jointly optimizes the regression weights and hyperparameter of a (Group) Lasso regression model and converges to a stationary point of the LOO error curve. Our method can be extended to other estimators with proximable, nonsmooth regularizers. The formulation is inspired by the forwardmode gradient computation in [8]
, but where we use efficient approximations based on online (stochastic) gradient descent.
The contributions and structure of the present paper are listed in the following: Sec. II
provides the general formulation for the HO in supervised learning and presents the use of PGD for our problem. In Sec.
III, we present the derivation of the hypergradient (gradient w.r.t the hyperparameters). In Sec. IV, we discuss how to design our method for nonsmooth cost functions in problems such as Lasso and Group Lasso. The main contribution is presented in Sec. V, consisting in the derivation of an online algorithm and an approximate scheme, both aimed at saving computation. Sec. VI contains numerical tests with synthetic data, and concludes the paper.Ii Problem formulation
Given a set of training input/label pairs , with and , consider the supervised learning problem of minimizing a linear combination of empirical risk (data fit) and structural risk (regularization term):
(1) 
for . This can be for instance particularized to the Lasso regression problem with , , and ; section IV discusses other estimators.
It is well known that minimizing the empirical risk (insample error) does not guarantee that the estimated model will predict labels of unobserved inputs with low error. The role of regularization is to select the right model complexity, and the right choice of the hyperparameter is crucial. To this end, any estimator in the form (1) can be embedded in the bilevel optimization problem (minimization of the validation error):
(2) 
where denotes the set of validation samples, and denotes the training batch associated with the th validation sample. A typical choice in supervised learning is . Since (2) may have several local minima, the notation is reserved for a global minimizer, whereas will be used throughout the text to denote a stationary point.
Regarding the collection of training batches and the validation samples: In a heldout validation scheme, , and . In fold crossvalidation (CV), is the trainandvalidate dataset; the folds are a partition of ; and . Leaveoneout (LOO) validation is a special case of CV where , and ; and therefore, .
The rest of this section reviews how is obtained. The next section will discuss the minimization of (2) via the computation of the gradient w.r.t. the hyperparameter , also referred to as hypergradient [12, 8].
Iia Proximal Gradient Descent
The proximal gradient descent (PGD) algorithm allows to iteratively compute given the training batch and the hyperparameter , and it is advocated here for its simplicity. Extending our formulation to accommodate algorithms such as the accelerated PGD (which gives rise to FISTA when applied to regularized problems) is out of the scope of the present paper and left as future work.
Given a function , the proximity (prox) operator is defined as [13]
(3) 
If is such that the prox operator can be computed in closed form, it is said that is a proximable function, and problem (1) can be solved efficiently via proximal gradient descent (PGD):
(4) 
where is a step size sequence satisfying , where is the Lipschitz smoothness parameter of the empirical risk (aggregate loss component of the cost function). In fact, for , it holds that The PGD step (4) is the composition of a gradient step with the prox operator, and the iteration is frequently split in two steps, yielding the equivalent forwardbackward iterations:
(5a)  
(5b) 
Moreover, for the optimality condition holds:
(6) 
Iii Computing the Hypergradient
The condition in (6) establishes optimality w.r.t. the weight vector, but not w.r.t. the hyperparameter . To optimize over , we leverage the forwardmode gradient computation described by [8] in this section. The condition for being a stationary point for the optimization in (2) is:
(7) 
The hypergradient can be written using the chain rule as
(8) 
where the argument of is the derivative (Jacobian) matrix (column vector if is scalar). In the sequel, we leverage the technique in [8] to compute the latter.
Consider a generic iterative algorithm, whose th iterate is , and a hyperparameter vector . The th iteration can be expressed as: where
is a smooth mapping that represents the operation performed at the latter. The following equation [8, eq. (13)] is fulfilled by the iterates :
(9) 
In the case of PGD, the mapping is the composition [cf. (5)]. For simplicity, we will consider in the sequel a constant step size for PGD, so that , and
(10) 
(11) 
The derivations so far have followed a path common to [9], where an approximation to the hypergradient is computed by reversemode gradient computation [8]. However, differently to this work, in our approach we identify a fixed point equation for the derivatives at the convergence point of PGD:
(12) 
where ; if the linear equation has a solution, it can be expressed in closed form as , where
(13) 
Iiia Hypergradient descent (HGD)
If the iterates
(14) 
(where denotes projection onto the positive orthant) are executed, with an appropriate step size sequence , the sequence will converge to a stationary point of (2).
Remark. Existence of requires the prox operator to be smooth. However, important estimation problems such as Lasso regression rely on nonsmooth prox operators. In the next section, a slight modification of the hypergradient descent is proposed in order to deal with those problems.
Iv Nonsmooth prox operators
In this section, we propose the hypersubgradient descent method, and its extension for large datasets, namely, the online hypersubgradient descent (OHSD) method.
If the prox operator is nonsmooth, its derivatives may not exist at all points, and thus may not be computable. One can instead compute a valid subderivative (which will be denoted by ) by replacing the derivatives of the prox operator with the corresponding subderivatives.
If is replaced in (14) with , the resulting algorithm will be termed hereafter as hypersubgradient descent (HSGD).
The HSGD will be advocated in the next section to optimize the hyperparameters for several estimation problems with nonsmooth prox operators, namely Lasso and Group Lasso.Before proceeding, some of the functions that have been presented before as generic functions, will be particularized to facilitate the readability of the derivations and algorithms.
Regularized leastsquares (LS) linear estimators such as Lasso use the loss function
. Consequently, the forward operator and its Jacobian arewhere , and . If the LOO validation scheme is chosen, then can be computed efficiently as
(15) 
(16) 
If the validation error metric is , then
The equations for particular cases of will be presented after the HSGD algorithm.
Iva Hypersubgradient descent (HSGD)
Let and be valid subderivative (subJacobian) matrices of w.r.t. and , respectively. Then, a valid subderivative matrix of with respect to is [cf. (13)]
(17) 
where ; and the HSGD iterates can be written as
(18) 
Remark. The inverse at (17) will not exist if is rankdefficient. This happens when the model dimensionality is less than , and may also happen when the input data have a high degree of colinearity. In such cases, the LS solution of the linear system can be used. Another option is to numerically approximate by using an iterative algorithm based on the forwardgradient iteration at (10).
IvB Application of HSGD to Lasso and Group Lasso
Depending on the choice of the function , we obtain different regularized estimators, and associated prox operators and HSGD iterates.
IvB1 Lasso
The regularizer is ; its prox operator is known as softthresholding [14], and the latter can be computed entrywise as
(19) 
The corresponding subderivatives , and are defined so that is diagonal and
(20a)  
(20b) 
IvB2 Group Lasso
The regularizer depends on an a priori defined group structure. With denoting the dimensionality of , and the number of groups, let be a partition of . Let denote the subvector of containing the components indexed by . The regularizer is ; its prox operator is known as multidimensional softthresholding [15], and the latter can be computed groupwise as
(21) 
With denoting the subset of the partition where belongs, the corresponding subderivative matrices , and are defined so that is diagonal, and
(22a)  
(22b) 
The HSGD algorithm applied to Lasso and Group Lasso is summarized in the Algorithm 1. The approach in this paper can be extended also to other estimators with proximable regularizers, particularly several generalizations of Lasso such as Weighted Lasso and Fused Lasso, which are left out of the scope of this article for space constraints.
V Approximate algorithms
This section presents two approximations that improve the efficiency of HSGD.
Va Online Hypersubgradient Descent (OHSGD)
To avoid having to evaluate for all in each iteration of HSGD, the online optimization technique is applied here, which consists in doing a gradient descent iteration per , using the corresponding contribution to the subgradient (also known as stochastic subgradient):
(23a)  
(23b)  
(23c) 
To save computation, the instance of PGD that calculates should be initialized at if .
VB OHSGD with inexact weight vector
The algorithm proposed in the previous section requires to evaluate . The iterates produced by PGD converge to the exact optimizer, but in practice one has to stop the inner loop after a certain stopping criterion is met. Clearly, there is a tradeoff between the number of iterations in the th (inner) loop and the suboptimality of its final iterate, .
Even if one is interested in a very precise approximation of , most of the times PGD is run to evaluate for far away from , and is only used to compute the hypergradient. It is well known that when applying gradient methods, using coarsely approximated (hyper) gradients before getting close to a stationary point usually does not hinder the convergence, and may significantly alleviate computation. Even if the number of hypergradient steps required for converge increases, the computation savings in the inner loop usually yield a faster overall convergence. In addition, if the the prox operator is computationally heavy, fast (inexact) approximations of the prox operator also lower the complexity per iteration (inexact PGD method) [16].
Vi Numeric tests
For the two experiments in this section, data are generated so that the inputs are i.i.d., and , with being a 10sparse vector, and generated i.i.d. so that
has a signaltonoise ratio (SNR) of
. The trainandvalidate set contains 200 samples. A test set is generated with the same model and 2000 samples.The first experiment is run in order to visually compare in Fig. 1 the convergence rates of HSGD and OHSGD with different constant stepsizes , in terms of the number of PGD/ISTA iterations executed before producing a given value of . The tolerance to stop the inner loop is set to 1e3.
The second experiment consists in evaluating the convergence rate of OHSGD with inexact weight vectors within a scale of coarserfiner approximate values of the optimal solution of (6). Fig. 2 shows the value of the iterates (averaged over the last to show a stable value, since online iterates hover around the optimizer) against the number of PGD (ISTA) iterations. The PGD loop is stopped when the distance between 0 and subgradient of the training loss is smaller than tol. To confirm the optimality of , Fig. 3 shows the LOO and test error curves for a grid of values for .
The results show that approximate weights as with a subgradient tolerance as coarse as 0.1 still allow convergence of to , and the computation is significantly reduced with respect to instances of OHSGD that calculate the weights more exactly.
Concluding remarks: In this paper, the (hyper)gradient of the validation error w.r.t. the hyperparameters has been derived for estimators with nonsmooth regularizers exploiting the structure of PGD. An algorithm has been developed (with an online variant) to optimize hyperparameters for Lasso and Group Lasso. Actually, this approach is flexible enough to accomodate any convex, proximable regularization term.
References
 [1] James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl, “Algorithms for hyperparameter optimization,” in Proc. Advances Neural Inf. Process. Syst., 2011, pp. 2546–2554.
 [2] Darren Homrighausen and Daniel J McDonald, “Leaveoneout crossvalidation is risk consistent for lasso,” Machine learning, vol. 97, no. 12, pp. 65–78, 2014.
 [3] James Bergstra and Yoshua Bengio, “Random search for hyperparameter optimization,” J. Mach. Learn. Res., vol. 13, no. Feb, pp. 281–305, 2012.
 [4] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar, “Hyperband: A novel banditbased approach to hyperparameter optimization,” J. Mach. Learn. Res., vol. 18, no. 1, pp. 6765–6816, 2018.
 [5] Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank Hutter, “Fast bayesian optimization of machine learning hyperparameters on large datasets,” in Artificial Intelligence and Stat., 2017, pp. 528–536.
 [6] Fabian Pedregosa, “Hyperparameter optimization with approximate gradient,” arXiv preprint arXiv:1602.02355, 2016.

[7]
Ricardo P Monti, Christoforos Anagnostopoulos, and Giovanni Montana,
“Adaptive regularization for lasso models in the context of
nonstationary data streams,”
Stat. Analysis and Data Mining: The ASA Data Science Journal
, vol. 11, no. 5, pp. 237–247, 2018.  [8] Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil, “Forward and reverse gradientbased hyperparameter optimization,” in Proc. Int. Conf. Mach. Learn., 2017, vol. 70, pp. 1165–1173.
 [9] Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil, “Bilevel programming for hyperparameter optimization and metalearning,” in Proc. Int. Conf. Mach. Learn., 2018, pp. 1568–1577.
 [10] Jonathan Lorraine and David Duvenaud, “Stochastic hyperparameter optimization through hypernetworks,” arXiv preprint arXiv:1802.09419, 2018.
 [11] Shuaiwen Wang, Wenda Zhou, Arian Maleki, Haihao Lu, and Vahab Mirrokni, “Approximate leaveoneout for highdimensional nondifferentiable learning problems,” arXiv preprint arXiv:1810.02716, 2018.
 [12] Dougal Maclaurin, David Duvenaud, and Ryan Adams, “Gradientbased hyperparameter optimization through reversible learning,” in International Conference on Machine Learning, 2015, pp. 2113–2122.
 [13] N. Parikh and S. Boyd, “Proximal algorithms,” Found. Trends Optim., vol. 1, no. 3, pp. 127–239, 2014.
 [14] Ingrid Daubechies, Michel Defrise, and Christine De Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint,” Communications on Pure and Applied Mathem., vol. 57, no. 11, pp. 1413–1457, 2004.
 [15] Arnau Tibau Puig, Ami Wiesel, Gilles Fleury, and Alfred O Hero, “Multidimensional shrinkagethresholding operator and group lasso penalties,” IEEE Signal Processing Letters, vol. 18, no. 6, pp. 363–366, 2011.
 [16] Mark Schmidt, Nicolas L Roux, and Francis R Bach, “Convergence rates of inexact proximalgradient methods for convex optimization,” in Proc. Advances Neural Inf. Process. Syst., 2011, pp. 1458–1466.
Comments
There are no comments yet.