# Hyperparameter optimization with approximate gradient

Most models in machine learning contain at least one hyperparameter to control for model complexity. Choosing an appropriate set of hyperparameters is both crucial in terms of model accuracy and computationally challenging. In this work we propose an algorithm for the optimization of continuous hyperparameters using inexact gradient information. An advantage of this method is that hyperparameters can be updated before model parameters have fully converged. We also give sufficient conditions for the global convergence of this method, based on regularity conditions of the involved functions and summability of errors. Finally, we validate the empirical performance of this method on the estimation of regularization constants of L2-regularized logistic regression and kernel Ridge regression. Empirical benchmarks indicate that our approach is highly competitive with respect to state of the art methods.

## Authors

• 22 publications
02/14/2018

### Stealing Hyperparameters in Machine Learning

Hyperparameters are critical in machine learning, as different hyperpara...
11/01/2018

### Efficient Online Hyperparameter Optimization for Kernel Ridge Regression with Applications to Traffic Time Series Prediction

Computational efficiency is an important consideration for deploying mac...
05/13/2019

### Tabular Benchmarks for Joint Architecture and Hyperparameter Optimization

Due to the high computational demands executing a rigorous comparison be...
03/07/2019

### Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions

Hyperparameter optimization can be formulated as a bilevel optimization ...
01/17/2022

### Efficient Hyperparameter Tuning for Large Scale Kernel Ridge Regression

Kernel methods provide a principled approach to nonparametric learning. ...
05/23/2021

### Regularization Can Help Mitigate Poisoning Attacks... with the Right Hyperparameters

Machine learning algorithms are vulnerable to poisoning attacks, where a...
07/27/2020

### Stabilizing Bi-Level Hyperparameter Optimization using Moreau-Yosida Regularization

This research proposes to use the Moreau-Yosida envelope to stabilize th...

## Code Repositories

### hoag

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Most models in machine learning feature at least one hyperparameter to control for model complexity. Regularized models, for example, control the trade-off between a data fidelity term and a regularization term through one or several hyperparameters. Among its most well-known instances are the LASSO [tibshirani1996regression], in which regularization is added to a squared loss to encourage sparsity in the solutions, or -regularized logistic regression, in which squared regularization (known as weight decay

in the context of neural networks) is added to obtain solutions with small euclidean norm. Another class of hyperparameters are the kernel parameters in support vector machines. For example, the popular radial basis function (RBF) kernel depends on a “width” parameter, while polynomial kernels depend on a discrete hyperparameter specifying the degree. Hyperparameters can be broadly categorized into two groups: continuous hyperparameters, such as regularization parameters or the width of an RBF kernel and discrete hyperparameters, such as the degree of a polynomial. In this work we focus on continuous hyperparameters.

The problem of identifying the optimal set of hyperparameters is known as hyperparameter optimization. Hyperparameters cannot be estimated to minimize the same cost function as model parameters, since this would favor models with excessive complexity. For example, if regularization parameters were chosen to minimize the same loss as model parameters, then models with no regularization would always yield the smallest loss. For this reason, hyperparameter optimization algorithms seek to optimize a criterion of model quality which is different from the cost function used to fit model parameters. This criterion can be a goodness of fit on unseen data, such as a cross-validation loss, or some criteria of model quality on the train set such as SURE [stein1981estimation], AIC/BIC [liu2011parametric] or Mallows  [mallows1973some], to name a few.

Choosing the appropriate set of hyperparameters has often a dramatic influence in model accuracy and many hyperparameter optimization algorithms have been proposed in the literature. For example, in the widely used grid-search algorithm, the model is trained over a range of values for the hyperparameters and the value that gives the best performance on the cross-validation loss is chosen. This not only scales poorly with the number of hyperparameters, but also involves fitting the full model for values of hyperparameters that are very unpromising. Random search [bergstra2011algorithms] has been proven to yield a faster exploration of the hyperparameter space than grid search, specially in spaces with multiple hyperparameters. However, none of these methods make use of previous evaluations to make an informed decision of the next iterate. As such, convergence to a global minima can be very slow.

In recent years, sequential model-based optimization (SMBO) techniques have emerged as a powerful tool for hyperparameter optimization (see e.g. [brochu2010tutorial] for an review on current methodologies). These techniques proceed by fitting a probabilistic model to the data and then using this model as an inexpensive proxy in order to determine the most promising location to evaluate next. This probabilistic model typically relies on a Gaussian process regressor but other approaches exist using trees [bergstra2011algorithms] or ensemble methods [lacoste2014sequential]. The model is built using only function evaluations, and for this reason SMBO is often considered as a black-box optimization method.

A third family of methods, which includes the method that we present, estimate the optimal hyperparameters using smooth optimization techniques such as gradient descent. We will refer to these methods as gradient-based hyperparameter optimization methods. These methods use local information about the cost function in order to compute the gradient of the cost function with respect to hyperparameters. However, computing the gradient with respect to hyperparameters has reveled to be a major bottleneck in this approach. For this reason we propose an algorithm that replaces the gradient with an approximation. More precisely, we make the following contributions:

• We propose a gradient-based hyperparameter optimization algorithm that uses approximate gradient information rather than the true gradient.

• We provide sufficient conditions for the convergence of this method to a stationary point.

• We compare this approach against state-of-the art methods for the task of estimation of regularization and kernel parameter on two different models and three datasets.

Notation We denote the gradient of a real-valued function by . If this function has several input arguments, we denote its gradient with respect to the -th argument. Similarly, denotes the Hessian and denotes the second order differential with respect to variables and . For functions that are not real-valued, we denote its differential by . We denote the projection operator onto a set by . That is, , where denotes the euclidean norm for vectors.

Throughout the paper we take the convention of denoting real-valued functions with lowercase letters (such as and ) and vector-valued functions with uppercase letters (such as ). Model parameters are denoted using lowercase Latin letters (such as ) while hyperparameters are denoted using Greek lowercase letters (such as ).

### 1.1 Problem setting

As mentioned in the introduction, the goal of hyperparameter optimization is to choose the hyperparameters that optimizes some criteria, such as a cross-validation loss or a SURE/AIC/BIC criteria. We will denote this criteria by , where is the number of hyperparameters. In its simplest form, the hyperparameter optimization problem can be seen as the problem of minimizing the cost function over a domain . Some approaches, such as sequential model-based optimization, only require function evaluations of this cost function. The methods we are interested in however use local information of the objective function.

The cost function (e.g. the cross-validation error) depends on the model parameters, which we will denote by . These are commonly not available in closed form but rather defined implicitly as the minimizers of some cost function that we will denote , where is the number of model parameters. This makes the hyperparmater optimization problem can be naturally expressed as a nested or bi-level optimization problem:

 argminλ∈D{f(λ)≜g(X(λ),λ)}s.t. X(λ)∈argminx∈Rph(x,λ), (HO)

where the minimization over is commonly referred to as the inner optimization problem. A notable example of hyperparameter optimization problem is that of regularization parameter selection by cross-validation. For simplicity, we restrict the discussion to the case of simple or hold-out cross-validation, where the dataset is split only once, although the methods presented here extend naturally to other cross-validation schemes. In this setting, the dataset is split in two: a train set (denoted ) and a test or hold-out set (denoted . In this case, the outer cost function is a goodness of fit or loss on the test set, while the inner one is a trade-off between a data fitting term on the train set and a penalty term. If the penalty term is a squared -norm, then the problem adopts the form:

 argminλ∈D loss(Stest,X(λ)) (1) s.t. X(λ)∈argminx∈Rp loss(Strain,x)+eλ∥x∥2.

The trade-off in the inner optimization between the goodness of fit term and the penalty term is controlled through the hyperparamter . Higher values of bias the model parameters towards vectors with small euclidean norm, and the goal of the hyperparameter optimization problem is to find the right trade-off between these two terms. The parametrization of the regularization parameter by an exponential () in Eq. (1) might seem unusual, but given that this regularization parameter is commonly optimized over a log-spaced grid, we will find this parametrization useful in later sections.

Turning back to the general problem (HO), we will now describe an approach to compute the derivative of the cost function with respect to hyperparameters. This approach, which we will refer to as implicit differentiation [larsen1996design, bengio2000gradient, NIPS2007_3286], relies on the observation that under some regularity conditions it is possible to replace the inner optimization problem by an implicit equation. For example, if is smooth and verifies that all stationary points are global minima (as is the case for convex functions), then the values are characterized by the implicit equation . Deriving the implicit equation with respect to leads to the equation , which, assuming invertible, characterizes the derivative of

. The chain rule, together with this equation, allows us to write the following formula for the gradient of

:

 ∇f =∇2g+(\DifX)T∇1g (2) =∇2g−(∇21,2h)T(∇21h)−1∇1g.

This formula allows to compute the gradient of given the following quantities: model parameters ( and are evaluated at ) and , which is usually computed as the solution to the linear system for . In the section that follows, we present an algorithm that relaxes the condition of both knowledge of the exact model parameters and exact solution of the linear system.

## 2 Hoag: Hyperparameter optimization with approximate gradient

As we have seen in the previous section, computing an exact gradient of can be computationally demanding. In this section we present an algorithm that uses an approximation, rather than the true gradient, in order to estimate the optimal hyperparameters. This approach yields a trade-off between speed and accuracy: a loose approximation can be computed faster but might result in slow convergence or even divergence of the algorithm. At iteration , this trade-off is balanced by the tolerance parameter . The sequence of tolerance parameters will turn out to play a major role in the convergence of the algorithm, although the time being, we will treat it as free parameter. We now describe our main contribution, the Hoag algorithm: [Hoag] At iteration perform the following: Solve the inner optimization problem up to tolerance . That is, find such that

Solve the linear system for up to tolerance . That is, find such that
Update hyperparameters:

This algorithm consists of four steps. The first two steps of the algorithm compute approximations to the quantities used in Eq. (2) to compute the gradient of . However, since these are not computed to full accuracy, , computed in step is a noisy estimate of the gradient. This approximation is then used as a replacement of the true gradient in a projected gradient-descent () iteration.

This procedure requires access to three quantities at iteration : a -optimal solution to the inner optimization problem which can be computed with any solver, the first-order derivatives of , (), and an -optimal solution to a linear system involving . In practice, this system is solved using a conjugate-gradient method, which only requires access to the matrix through matrix-vector products. For example, in machine learning problems such as the ones introduced in Eq. (1), the quantity corresponds to the Hessian of the inner optimization problem. Efficient schemes for multiplication by the Hessian can be derived for least squares, logistic regression [lin2008trust] and other general loss functions [pearlmutter1994fast].

### 2.1 Related work

There exists a large variety of hyperparameter optimization methods, and a full review of this literature would be outside the scope of this work. Below, we comment on the relationship between Hoag and some of the most closely related methods.

Regarding gradient-based hyperparameter optimization methods we will distinguish two main approaches, implicit differentiation and iterative differentiation, depending on how the gradient with respect to hyperparameters is computed.

Implicit differentiation. This approach consists in deriving an implicit equation for the gradient using the optimality conditions of the inner optimization problem (as we did in Eq. (2)). Originally motivated by the problem of setting the regularization parameter in the context of neural networks [larsen1996design, larsen1998adaptive, bengio2000gradient], has also been applied to the problem of selecting kernel parameters [chapelle2002choosing, seeger2008cross] or multiple regularization parameters in log-linear models [NIPS2007_3286]. This approach has also been successfully applied to the problem of image reconstruction [kunisch2013bilevel, calatroni2015bilevel], in which case the simplicity of the cost function function allows for a particularly simple expression of the gradient with respect to hyperparameters.

Iterative differentiation. In this approach, the gradient with respect to hyperparameters is computed by differentiating each step of the inner optimization algorithm and then using the chain rule to aggregate the results. Since the gradient is computed after a finite number of steps of the inner optimization routine, the estimated gradient is naturally an approximation to the true gradient. This method was first proposed by domke2012generic

and later extended to the setting of stochastic gradient descent by

MacDuvAda2015hyper. We note also that contrary to the implicit differentiation approach, this method can be applied to problems with non-smooth cost functions [deledalle2014stein, ochs2015bilevel].

Hoag, while belonging to the class of implicit differentiation methods, is related to iterative differentiation methods in that it allows the gradient with respect to hyperparameters to be computed approximately.

Finally, we note that similar approaches have also been considered in the setting of sequential model-based optimization. swersky2014freeze proposes an approach in which the inner optimization is “freezed” whenever the method decides that the current hyperparameter values are not promising. It does so by introducing a prior on training curves as a function of input hyperparameters. This approach however requires to make strong assumptions on the shape of the training curves which gradient-based methods do not make.

## 3 Analysis

In this section we will prove that the summability of the tolerance sequence is sufficient to guarantee convergence of the iterates in Hoag. The analysis of this algorithm is inspired by the work of d2008smooth, schmidt2011convergence, friedlander2012hybrid on inexact-gradient algorithms for convex optimization.

We will start this section by enumerating the regularity conditions that we assume for the hyperparameter optimization problem. The following conditions are assumed through the section:

• (A1) -smoothness. We assume that the first derivatives of and the second derivatives of are Lipschitz continuous functions.

• (A2) Nonsingular Hessian. We assume that the matrix , which corresponds to the Hessian of the inner optimization problem, is invertible at the values .

• (A3) Convex compact domain. The domain under which the hyperparameters are optimized, , is a convex non-empty and compact subset of .

These assumptions are verified by many models of interest. For example, for the problem of estimation of regularization parameters of Eq. (1), it allows twice-differentiable loss functions such as logistic regression or least squares (assumption A1) and strongly convex penalties (A2), such as squared regularization. Note that condition (A2) need not be verified on all its domain, only on the points , which would allow in principle to consider models that are defined through a non-convex cost functions. Assumption (A3) requires that the domain of the hyperparameters is a convex compact domain. In practice, hyperparameters are optimized over a -dimensional interval, i.e., a domain of the form . Our analysis however only require this domain to be convex and compact, a constraint that subsumes -dimensional intervals.

The rest of the section is devoted to prove (under conditions) the convergence of Hoag. The proof is divided in two parts. First, we will prove that the difference between the true gradient and the approximate gradient is bounded by (Theorem 1) and in a second part we will prove that if the sequence is summable, then this implies the convergence to a stationary point of (Theorem 2). Because of space limitation, the proofs are omitted and can be found in Appendix A.

###### Theorem 1 (The gradient error is bounded).

For sufficiently large , the error in the gradient is bounded by a constant factor of . That is,

 \norm∇f(λk)−pk=O(εk).

This theorem gives a bound on the gradient from the sequence that bounds the inner optimization and the linear system solution. Is will be the key ingredient in order to show convergence to a stationary point, which is the main result of this section. This property sometimes referred to as global convergence [nocedal2006numerical]:

###### Theorem 2 (Global convergence).

If the tolerance sequence is summable, that is, if is positive and verifies

 ∞∑i=1εi<∞,

then the sequence of iterates in the Hoag algorithm has limit , and this limit verifies the stationary point condition:

 ⟨∇f(λ∗),α−λ∗⟩≥0,∀α∈D .

In particular, if belongs to the interior of it is verified that

 ∇f(λ∗)=0.

This results gives sufficient conditions for the convergence of Hoag. The summability of the tolerance sequence suggest several natural candidates for this sequence, such as the quadratic sequence, or the exponential sequence, , with . We will empirically evaluate different tolerance sequences on different problems and different datasets in the next section.

## Experiments

In this section we compare the empirical performance of Hoag. We start by discussing some implementation details such as the choice of step size. Then, we compare the convergence of different tolerance decrease strategies that were suggested by the theoretical analysis. In a third part, we compare the performance of Hoag against other hyperparameter optimization methods.

Adaptive step size. Our algorithm relies on the knowledge of the Lipschitz constant for the cost function . However, in practice this is not known in advance. Furthermore, since the cost function is costly to evaluate, it is not feasible to perform backtracking line search. To overcome this we use a procedure in which the step size is corrected depending on the gain estimated from the previous step. In the experiments we use this technique although we do not have a formal analysis of the algorithm for this choice of step size.

Let denote the distance between the current iterate and the past iterate, . The -smooth property of the function , together with Lemma 1, implies that there exists a constant such that the following inequality is verified:

 g(λk,xk)≤ g(λk−1,xk−1)+Cεk+ (3) εk−1(C+M)Δk−LΔ2k,

where is the Lipschitz constant of (for loss functions such as logistic or least squares this can easily be computed from the data). This inequality can be derived from the properties of -smooth functions, and the details can be found in Appendix B. The procedure consists in decreasing the step (multiplication by ) whenever the equation is not satisfied and to increase it (multiplication by ) whenever the equation is satisfied to ensure that we are using a step size as large as possible. The constants that we used in the experiments are .

Stopping criterion. The stopping criterion given in Algorithm 2 depends on which is generally unknown. However, for objective functions in the inner optimization which are -strongly convex ( can be taken as the amount of regularization in -regularized objectives), it is possible to lower bound the quantity by . Hence, it is sufficient to ensure . Details can be found in Appendix B.

Initialization. The previous sections tells us how to adjust the step size but relies on an initial value of this parameter. We have found that a reasonable initialization is to initalize it to so that the first update in Hoag is of magnitude at most (it can be smaller due to the projection), where is the approximate gradient on the first step. The initialization of the tolerance decrease sequence is set to . We also limit the maximum precision to avoid numerical instabilities to , which is also the precision for “exact” methods, i.e., those that do not use a tolerance sequence. The initialization of regularization parameters is set to and the width of an RBF kernel is initialized to , where n_feat is the number of features or dimensionality of the dataset.

Although Hoag can be applied more generally, in our experiments we focus on two problems: -regularized logistic regression and kernel Ridge regression. We follow the setting described in Eq. (1), in which an initial dataset is partitioned into two sets, a train set and a test set , where denotes the input features and the target variables.

The first problem consists in estimating the regularization parameter in the widely-used -regularized logistic regression model. In this case, the loss function of the inner optimization problem is the regularized logistic loss function. In the setting of classification, the validation loss or outer cost function is commonly the zero-one loss. However, this loss is non-smooth and so does not verify assumption (A1). To overcome this and following [NIPS2007_3286], we use the logistic loss as the validation loss. This yield a problem of the form:

 argminλ∈D m∑i=1ψ(b′ia′TiX(λ)) (4) n∑i=1ψ(biaTix)+eλ∥x∥2,

where is the logistic loss, i.e., . The second problem that we consider is that of kernel Ridge regression with an RBF kernel. In this setting, the problem contains two hyperparameters: the first hyperparameter () controls the width of the RBK kernel and the second hyperparameter () controls the amount of regularization. The inner optimization depends on the kernel through the kernel matrix, formed by computing the kernel of all pairwise input samples. We denote such matrix as , where the entry is given by , where is the RBF kernel function: . Similarly, the outer optimization also depends on the kernel through the matrix , where its entries are the kernel product between features from the train set and features from the test set, that is, . Denoting the full hyperparameter vector as , the kernel matrix on the train set as, the full hyperparameter optimization problem takes the form

 argminλ∈D \normb−Ktest(eλ1)X(λ)2 (5) s.t. (Ktrain(eλ1)+eλ2I)X(λ)=b,

where for simplicity the inner optimization is already set as an implicit equation. Note that in this setting, and unlike in the logistic regression problem, the outer optimization function depends on the hyperparameters not only through the model parameters but also through the kernel matrix.

The solver used for the inner optimization problem of the logistic regression problem is L-BFGS [liu1989limited], while for Ridge regression we used a linear conjugate descent method. In all cases, the domain for hyperparameters is the -dimensional interval .

For the experiments, we use four different datasets. The dataset 20news and real-sim are studied with an -regularized logistic regression model (1 hyperparameter) while the Parkinson dataset using a Kernel ridge regression model (2 hyperparameters). The MNIST dataset is investigated in a high-dimensional hyperparameter space using a similar setting to [MacDuvAda2015hyper, §3.2] and reported in in Appendix B. Datasets and models are described in more detail in Appendix B.

In all cases, the dataset is randomly split in three equally sized parts: a train set, test set and a third validation set that we will use to measure the generalization performance of the different approaches.

### 3.1 Tolerance decrease sequence

We report in Figure 2 the convergence of different tolerance decrease strategies. From Theorem 2, the sole condition on these sequences is that they are summable. Three notable examples of summable sequences are the quadratic, cubic and exponential sequences. Hence, we choose one representative of each of these strategies. More precisely, the decrease sequences that we choose are a quadratic decrease sequence of the form , a cubic one of the form and an exponential of the form . The value taken as true minima of the hyperparameter optimization problem is computed by taken the minimum reached by 10 randomly initialized instances of Hoag with exponential decrease tolerance.

The plot shows the relative accuracy of the different variants as a function of time. It can be seen that non-exact methods feature a cheaper iteration cost, yielding a faster convergence overall. Note that despite the global convergence result of Theorem 2, Hoag is not guaranteed to be monotonically decreasing, and in fact, some degree of oscillation is expected when the decrease in the tolerance does not match the convergence rate (see e.g. schmidt2011convergence). This can be appreciated in Figure 2, where the quadratic decrease sequence (and to some extent the cubit too) exhibits oscillations in the two first plots.

### 3.2 Comparison with other hyperparameter optimization methods

We now compare against other hyperparameter optimization methods. The methods against which we compare are:

• Hoag. The method we present in this paper, with an exponentially decreasing tolerance sequence. A Python implementation is made freely available at https://github.com/fabianp/hoag.

• Grid Search. This method consists simply in splitting the domain of the hyperparameter into an equally-spaced grid. We split the interval into a grid of values.

• Random. This is the random search method [bergstra2012random]

samples the hyperparameters from a predefined distribution. We choose to samples from a uniform distribution in the interval

.

• SMBO. Sequential model-based optimization using Gaussian Process. We used the implementation found in the Python package BayesianOptimization (http://github.com/fmfn/BayesianOptimization/). As initialization for this method, we choose values equally spaced between and . The acquisition function used is the expected improvement.

• Iterdiff. This is the iterative differentiation approach from [domke2012generic], using the same inner-optimization algorithm as Hoag. While the original implementation used to have a backtracking line search procedure to estimate the step size, we found that this performed worst than any of the alternatives. For this reason, we use the adaptive step size strategy presented in Section Experiments (assuming a zero tolerance parameter ).

For all methods, the number of iterations used in the inner optimization algorithm (L-BFGS or GD) is set to 100, which is the same used by the other methods and the default in the scikit-learn (http://scikit-learn.org) package.

We report in Figure 3 the results of comparing the accuracy of these methods as a function of time. Note that it is expected that the different methods have different starting points. This is because Grid Search and SMBO naturally start from a pre-defined grid that starts from the extremes of the interval, while random search simply chooses a random point from the domain. For Hoag and Iterdiff, we take the initialization .

In the upper row of Figure 3 we can see the suboptimality of the different procedures as a function of time. We observe that Hoag and Iterdiff have similar behavior, although Hoag features a smaller cost per iteration. This can be explained because once Hoag has made a step it can use the previous solution of the inner optimization problem as a warm-start to compute the next gradient. This is not the case in Iterdiff since the computation of the gradient relies crucially on having sufficient iterations of the inner optimization algorithm.

We note that in the Parkinson dataset, solution is inside a region that is almost flat (the different cost functions can be seen in Figure 1 of the supplementary material). This can explain the difficulty of the methods to go beyond the suboptimality level. In this case, SMBO, who starts by computing the cost function at the extremes of the domain converges instantly to this region, which explains its fast convergence, although it is unable to improve the initially reached suboptimality.

Suboptimality plots are a standard way to compare the performance of different optimization methods. However, for the context of machine learning it can be argued that estimating hyperparameters up to a high precision is unimportant and that methods should be compared in terms of generalization performance. In the lower row of Figure 3, we display the test loss () on a validation set, that is, using a third set of samples which is different from both the train and test set. This figure reveals two main effects. First, unsurprisingly, optimization beyond of relative suboptimality is not reflected in this metric. Second, the fast (but noisy) early iterations of Hoag achieve the fastest convergence in two out of three datasets.

## 4 Discussion and future work

In previous sections we have presented and discussed several aspects of the Hoag algorithm. Finally, we outline some future directions that we think are worthwhile exploring.

Given the success of recent stochastic optimization techniques [schmidt2013minimizing, johnson2013accelerating] it seems natural to study a stochastic variant of this algorithm, that is, one in which the updates in the inner and outer optimization schemes have a cost that is independent of the number of samples. However, the dependency on the Hessian of the inner optimization () in the implicit equation (2) makes this non-trivial.

Little is known of the structure of solutions for the hyperparameter optimization problem (HO). In fact, assumption (A3) is introduced almost exclusively in order to guarantee existence of solutions. At the same time recent progress on the setting of image restoration, which can be considered a subproblem of (HO), has given sufficient conditions on the input data for such solution to exist in an unbounded domain [reyes2015structure]. The characterization of solutions for the HO problem can potentially simplify the assumptions made in this paper.

The analysis presented in this paper can be extended in several ways. For instance, the analysis of Hoag is provided for a constant step size and not for the adaptive step size strategy used in the experiments. Also, we have focused on proving asymptotic convergence of our algorithm. An interesting future direction would be to study rates of convergence, which might give insight into an optimal choice for the tolerance decrease sequence.

Although we found the method to be quite robust in practice, there are situations where it can get stuck in flat regions. For example, if the initial step is too big, it might land in a region with a large regularization parameter where the curvature is amost zero (hence the reason to normalize the first step by its norm). An interesting direction of future work is to make the method robust to such flat regions, scaping from flat regions and allowing the method to make bigger steps in early iterations.

### Acknowledgments

I am in debt with Gabriel Peyré for numerous discussions, suggestions and pointers. I would equally like to thank the anonymous reviewers for many insightful comments, and to Justin Domke for posting the code of his Iterative differentiation method.

Feedback and comments are welcome at the author’s blog (http://goo.gl/WoV8R5).

The author acknowledges financial support from the “Chaire Economie des Nouvelles Données”, under the auspices of Institut Louis Bachelier, Havas-Media and Université Paris-Dauphine (ANR 11-LABX-0019).

## Appendix A Analysis

###### Lemma 3 (The gradient error is bounded).

For sufficiently large , the error in the gradient is bounded by a constant factor of . That is,

 \norm∇f(λk)−pk=O(εk).
###### Proof.

Before starting the proof, we introduce the following notation for convenience. We denote by and the Hessian of the inner optimization function evaluated at the model parameters and at the -th iteration approximation, respectively. That is,

 Ak=∇21h(λk,X(λk)) and ^Ak=∇21h(λk,xk).

In a similar way we define and to be the gradient of the outer loss and cross derivatives of the inner loss, respectively:

 bk=∇1g(λk,X(λk)) and ^bk=∇1g(λk,xk)Dk=∇21,2h(λk,X(λk)) % and ^Dk=∇21,2h(λk,xk)

Note that and are the quantities involved in the Hoag algorithm. On the other hand, the quantities and are an “ideal” version of the former, computed when the tolerance is zero. It is not surprising though that the difference between and its hat counterpart will play a fundamental role in the proof.

The proof is structured in two parts. In part (i) we will prove that several sequences of interest are bounded, while in part (ii) we will use this to prove the main result.

Part (i). We first note that that both and are bounded and denote such bounds by and , respectively. is bounded as a direct consequence of assumption (A3). On the other hand, is continuously differentiable as a result of the implicit function theorem. Since its domain is a bounded set, is also bounded. We prove that the following sequences are bounded:

• . By the termination condition of the inner optimization problem we have that . Using the reverse triangular inequality we further have

 εk≥\normX(λk)−xk≥\abs\normX(λk)−\normxk⟹\normxk≤\normX(λk)+εk≤ζ+εk

hence is a bounded sequence since defines a summable (hence bounded) sequence.

• , and . Assumption (A1) implies that there exists constants such that

 \normAk−A0 ≤LE\norm[λk,X(λk)]−[λ0,X(λ0)] ≤LE√ξ2+η2 \normDk−D0 ≤LE\norm[λk,X(λk)]−[λ0,X(λ0)] ≤LE√ξ2+η2 \normbk−b0 ≤Lg\norm[λk,X(λk)]−[λ0,X(λ0)] ≤Lg√ξ2+η2 ,

and so and are all bounded sequences.

• . By assumption (A2), exists and for all . Since is a compact set, the limit of the sequence verifies , and hence is bounded.

Part (ii). By the Lipschitz assumption on and , there exists finite numbers and such that we have the following sequence of inequalities:

 ∥Ak−^Ak∥ ≤LE∥X(λk)−xk∥≤LEεk ∥bk−^bk∥ ≤Lg∥X(λk)−xk∥≤Lgεk

Let be the solution to the linear system of equations and be the solution to . Since by assumption is invertible, and by the Cauchy-Schwarz inequality, we have that

 \normzk≤∥A−1k∥\normbk,

and hence the sequence is also bounded.

By the summability condition of and the boundedness of proved in part (i) of this proof, for all sufficiently large it is verified that and by classical results related to the sensitivity of linear systems (see e.g. [higham2002accuracy, §7.1]) we have the following sequence of inequalities:

 \normzk−^zk ≤εk1−εk∥A−1k∥Lg(∥A−1k∥Lf+\normzk∥A−1k∥Lg) (6) ≤εk1−ρ(∥A−1k∥Lf+\normzk∥A−1k∥Lg) =O(εk) (bound on all % terms involved)

where the first inequality is derived from [higham2002accuracy, §7.1] and the second one comes from the definition of .

This last equation provides a bound on the difference between solving the linear system at and solving the linear system at , assuming that the linear system is solved to full precision. However, in practice we do not attempt to solve the linear system to full precision but rather compute an approximation such that (step of Hoag) and we are interest in bounding the distance between and its noise-less version . We have the following sequence of inequalities:

 \normqk−zk=∥qk−^zk+^zk−zk∥ ≤∥qk−^zk∥+∥^zk−zk∥(% triangular inequality) ≤∥^A−1k^Ak(qk−^zk)∥+∥^zk−zk∥ =∥^A−1k(^Akqk−^bk)∥+∥^zk−zk∥(definition of ^zk) ≤∥^A−1k∥∥^Akqk−^bk∥+∥^zk−zk∥(Cauchy-Schwarz) ≤∥^A−1k∥εk+∥^zk−zk∥ (definition of qk) =O(εk) (Eq.~{}(???)).

Finally, using this we can write that the difference between and the true gradient. Let be defined as

 ck=∇2g(λk,X(λk)) and ^ck=∇2g(λk,xk)

Then it is verified that

 \norm∇f(λk)−pk=∥ck−DTzk−^ck−^DTkqk∥ ≤∥ck−^ck∥+∥DTkzk−^DTkqk∥ (triangular inequality) ≤∥ck−^ck∥+∥DTkzk−^DTkzk+^DTkzk−^DTkqk∥ (Add and remove ^DTkzk) ≤∥ck−^ck∥+∥DTkzk−^DTkzk∥+∥^DTkzk−^DTkqk∥ (triangular inequality) ≤∥ck−^ck∥+∥Dk−^Dk∥∥zk∥+∥^Dk∥∥zk−qk∥ (Cauchy-Schwartz) ≤Lgεk+LEεk∥zk∥+∥^Dk∥∥zk−qk∥ (Assumption (A1)) ≤Lgεk+LEεk∥zk∥+∥^Dk∥O(εk) (previous inequality) =O(εk)(bound on ∥zk∥ % and ∥^Dk∥)

which completes the proof. ∎

###### Theorem 4 (Global convergence).

The sequence of iterates in the HOAG algorithm has limit , and this limit verifies the stationary point condition:

 ⟨∇f(λ∗),α−λ∗⟩≥0,∀α∈D .

In particular, if belongs to the interior of it is verified that

 ∇f(λ∗)=0.
###### Proof.

By assumption, is -smooth. This implies the following inequality for any pair of values :

 f(β)≤f(α)+∇f(α)T(β−α)+L2∥β−α∥2. (7)

This is a classical result on quadratic upper bounds for -smooth functions (see e.g. [nesterov2004introductory, Lemma 1.2.3]). We will also make use of the following inequality concerning the projection , which stems from the fact that projections onto convex sets are firmly nonexpansive operators (see e.g. parikh2013proximal). Let , then the following is verified:

 ∥PD(η)−PD(ν)∥2≤(η−ν)T(PD(η)−PD(ν))

In particular, for , this reduces to

 ∥λk−λk+1∥2≤1LpTk(λk−λk+1) (8)

Setting now in Eq. (7), we have the following sequence of inequalities

 f(λk+1) ≤f(λk)−∇f(λk)T(λk−λk+1) (9) +L2\normλk+1−λk2 =f(λk)−(∇f(λk)−pk+pk)T(Δλk−λk+1) +L2\normλk+1−λk2 =f(λk)−(∇f(λk)−pk)T(λk−λk+1) −pTk(λk−λk+1)+L2\normλk+1−λk2 ≤f(λk)−(∇f(λk)−pk)T(λk−λk+1) −L2\normλk+1−λk2%(byEq. (???)) ≤f(λk)+∥∇f(λk)−pk∥∥λk−λk+1∥ −L2\normλk+1−λk2%(Cauchy−Schwartz)

By Lemma 1, . Since is bounded, we have that there exists and such that for sufficiently large

 ∥∇f(λk)−pk∥∥λk−λk+1∥

which applied to the previous inequality results in

 f(λk+1)≤f(λk)+M(εk)−L2\normλk+1−λk2,

or equivalently

 \normλk+1−λk2≤2L(f(λk)−f(λk+1)+M(εk)).

Let be an lower bound on the function . This bound exist and is finite because has continuous derivatives and is defined on a compact set. Summing the last expression from to we obtain

 ∞∑k=m\normλk+1−λk2≤2L(f(λm)−C+M∞∑k=m(εk)).

Since is a summable sequence we conclude that the right-hand side of this expression is finite. Hence, for the sum on the left-hand side to be finite we must have as . This implies that the limit of the sequence exists. We denote this limit by . Furthermore, we have the following sequence of equalities

 0 =limk→∞{λk+1−λk} =limk→∞{PD(λk−1Lpk)−λk} (definition of λk+1) =PD(λ∗−1L∇f(λ∗))−λ∗ (Taking limits + Lemma~{}???)

from where verifies the fixed point equation

 λ∗=PD(λ∗−1L∇f(λ∗)). (11)

In the introduction we have formulated the projection as . However, it also admits the equivalent definition in terms of an unconstrained optimization problem:

 PD(α)=argminλ∈RsID(λ)+12∥α−λ∥2,

where is the indicator function, which is if and otherwise. In light of this, the first order optimality conditions on Eq. (11) imply that

 0∈∂λ(ID(λ)+12∥α−λ∥), ⟹1L∇f(λ∗)∈∂ID(λ∗) (λ=λ∗,α=λ∗−1L∇f(λ∗))

where denotes the subgradient. Since is a convex set, is a convex function, hence its subgradient is a a monotone mapping (see e.g. rockafellar2009variational). By definition of monotonicity, it is verified that

 (v1−v0)T(x1−x0)≥0 for all v0∈∂ID(x0), v1∈∂ID(x1)

Let be an arbitrary element of . Then since the subgradient of the indicator function is either zero if is in the interior, or it contains all positive numbers (including zero) if is in the border. In any case, letting we have that and so it must be verified

 1L∇f(λ∗)T(α−λ∗)≥0⟹∇f(λ∗)T(α−λ∗)≥0.

If belongs to the interior of , then there exists a ball of radius around contained within . In particular, for every vector within this ball we have . Since this ball contains a basis of , it must be verified that , which concludes the proof. ∎

## Appendix B Experiments

### Stopping criterion

In order to bound the tolerance we used the following lemma

###### Lemma 5.

Let be -strongly convex and -smooth. We denote by the minimum of this function. Then the following is verified for all in the domain

 ∥x∗−x∥≤μ−1∥∇s(x)∥
###### Proof.

Using basic properties of -strongly convex functions (see e.g. [nesterov2004introductory]), we have the following sequence of inequalities for all in the domain:

 μ\normx−y2 ≤⟨∇s(x)−∇s(y),x−y⟩ (by strong convexity) ≤\norm∇s(x)−∇s(y)\normx−y (by Cauchy-Schwarz)

from where . Specializing at and using (the inner optimization is unconstrained) yields the desired result. ∎