# On the Iteration Complexity of Hypergradient Computation

We study a general class of bilevel problems, consisting in the minimization of an upper-level objective which depends on the solution to a parametric fixed-point equation. Important instances arising in machine learning include hyperparameter optimization, meta-learning, and certain graph and recurrent neural networks. Typically the gradient of the upper-level objective (hypergradient) is hard or even impossible to compute exactly, which has raised the interest in approximation methods. We investigate some popular approaches to compute the hypergradient, based on reverse mode iterative differentiation and approximate implicit differentiation. Under the hypothesis that the fixed point equation is defined by a contraction mapping, we present a unified analysis which allows for the first time to quantitatively compare these methods, providing explicit bounds for their iteration complexity. This analysis suggests a hierarchy in terms of computational efficiency among the above methods, with approximate implicit differentiation based on conjugate gradient performing best. We present an extensive experimental comparison among the methods which confirm the theoretical findings.

## Authors

• 4 publications
• 9 publications
• 54 publications
• 9 publications
• ### Convergence Properties of Stochastic Hypergradients

Bilevel optimization problems are receiving increasing attention in mach...
11/13/2020 ∙ by Riccardo Grazzi, et al. ∙ 0

• ### Bilevel Optimization for Machine Learning: Algorithm Design and Convergence Analysis

Bilevel optimization has become a powerful framework in various machine ...
07/31/2021 ∙ by Kaiyi Ji, et al. ∙ 0

• ### Computing Least and Greatest Fixed Points in Absorptive Semirings

We present two methods to algorithmically compute both least and greates...
06/01/2021 ∙ by Matthias Naaf, et al. ∙ 0

• ### Convergence of parallel overlapping domain decomposition methods for the Helmholtz equation

We analyse parallel overlapping Schwarz domain decomposition methods for...
06/09/2021 ∙ by Shihua Gong, et al. ∙ 0

• ### Nonsmooth Implicit Differentiation for Machine Learning and Optimization

In view of training increasingly complex learning architectures, we esta...
06/08/2021 ∙ by Jérôme Bolte, et al. ∙ 0

• ### Differentiable Implicit Soft-Body Physics

We present a differentiable soft-body physics simulator that can be comp...
02/11/2021 ∙ by Junior Rojas, et al. ∙ 0

• ### Implicit Deep Learning

We define a new class of "implicit" deep learning prediction rules that ...
08/17/2019 ∙ by Laurent El Ghaoui, et al. ∙ 0

## Code Repositories

### hypertorch

None

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Several problems arising in machine learning and related disciplines can be formulated as bilevel problems, where the lower-level problem is a fixed point equation whose solution is part of an upper-level objective. Instances of this framework include hyperparameter optimization (Maclaurin et al., 2015; Franceschi et al., 2017; Liu et al., 2018; Lorraine et al., 2019; Elsken et al., 2019), meta-learning (Andrychowicz et al., 2016; Finn et al., 2017; Franceschi et al., 2018), as well as recurrent and graph neural networks (Almeida, 1987; Pineda, 1987; Scarselli et al., 2008).

In large scale scenarios, there are thousands or even millions of parameters to find in the upper-level problem, making black-box approaches like grid and random search (Bergstra and Bengio, 2012) or Bayesian optimization (Snoek et al., 2012) impractical. This has made gradient-based methods (Domke, 2012; Maclaurin et al., 2015; Pedregosa, 2016) popular in such settings, but also it has raised the issue of designing efficient procedures to approximate the gradient of the upper-level objective (hypergradient) when finding a solution to the lower-level problem is costly.

The principal goal of this paper is to study the degree of approximation to the hypergradient of certain iterative schemes based on iterative or implicit differentiation111 The reader interested in the convergence analysis of gradient-based algorithms for bilevel optimization is referred to (Pedregosa, 2016; Rajeswaran et al., 2019) and references therein.. In the rest of the introduction we present the bilevel framework, alongside some relevant examples in machine learning. We then outline the gradient approximation methods that we analyse in the paper and highlight our main contributions. Finally, we discuss and compare our results with previous work in the field.

#### The bilevel framework.

In this work, we consider the following bilevel problem.

 minλ∈Λf(λ):=E(w(λ),λ) (1) \ \ \ subject~{}to ~{}w(λ)=Φ(w(λ),λ),

where is a closed convex subset of and and are continuously differentiable functions. We assume that the lower-level problem in (1) (which is a fixed point-equation) admits a unique solution. However, in general, explicitly computing such solution is either impossible or expensive. When is differentiable, this issue affects the evaluation of the hypergradient , which at best can only be approximately computed.

A prototypical example of the bilevel problem (1) is

 minλ∈Λf(λ):=E(w(λ),λ)) (2) \ \ \ subject~{}to ~{}w(λ)=argminu∈Rdℓ(u,λ),

where

is a loss function, twice continuously differentiable and strictly convex w.r.t. the first variable. Indeed if we let

be such that , where is differentiable, then Problem (2) and Problem (1) are equivalent. Specific examples of Problem (2) include hyperparameter optimization and meta-learning, and are discussed in Section 3.1.

Other instances of the bilevel problem (1), which are not of the special form of Problem (2), arise in the context of so-called equilibrium models (EQM). Notably, these comprise some types of connectionist models employed in domains with structured data. Stable recurrent neural networks (Miller and Hardt, 2019), graph neural networks (Scarselli et al., 2008) and the formulations by Bai et al. (2019) belong to this class. EQM differ from standard (deep) neural networks in that the internal representations are given by fixed points of learnable dynamics rather than compositions of a finite number of layers. The learning problem for such type of models can be written as

 minλ=(γ,θ)∈Λf(λ):=n∑i=1Ei(wi(γ),θ), (3) \ \ \ subject~{}to wi(γ)=ϕi(wi(γ),γ), for 1≤i≤n,

where the operators (here ) are associated to the training points ’s, and the error functions are the losses incurred by a standard supervised algorithm on the transformed dataset . A specific example is discussed in Section 3.2

In this paper, we present a unified analysis which allows for the first time to quantitatively compare popular methods to approximate in the general setting of Problem (1). The strategies we consider can be divided in two categories:

1. Iterative Differentiation (ITD) (Maclaurin et al., 2015; Franceschi et al., 2017, 2018; Finn et al., 2017). One defines the sequence of functions , where are the fixed-point iterates generated by the map . Then is approximated by , which in turn is computed using forward (FMAD) or reverse (RMAD) mode automatic differentiation (Griewank and Walther, 2008).

2. Approximate Implicit Differentiation (AID) (Pedregosa, 2016; Rajeswaran et al., 2019; Lorraine et al., 2019). First, an (implicit) equation for is obtained through the implicit function theorem. Then, this equation is approximately solved by using a two stage scheme. We analyse two specific methods in this class: the fixed-point method (Lorraine et al., 2019)

, also referred to as recurrent backpropagation in the context of recurrent neural networks

(Almeida, 1987; Pineda, 1987), and the conjugate gradient method (Pedregosa, 2016).

Both schemes can be efficiently implemented using automatic differentiation (Griewank and Walther, 2008; Baydin et al., 2018) achieving similar cost in time, while ITD has usually a larger memory cost than AID222This is true when ITD is implemented using RMAD, which is the standard approach when is high dimensional. .

#### Contributions.

Although there is a vast amount of literature on the two hypergradient approximation strategies, it remains unclear whether it is better to use one or the other. In this work, we shed some light over this issue both theoretically and experimentally. Specifically our contributions are the following:

• We provide iteration complexity results for ITD and AID when the mapping defining the fixed point equation is a contraction. In particular, we prove non-asymptotic linear rates for the approximation errors of both approaches.

• We make a theoretical and numerical comparison among different ITD and AID strategies considering several experimental scenarios.

We note that, to the best of our knowledge, non-asymptoptic rates of convergence for AID were only recently given in the case of meta-learning (Rajeswaran et al., 2019). Furthermore, we are not aware of any previous results about non-asymptotic rates of convergence for ITD.

#### Related Work.

Iterative differentiation for functions defined implicitly has been extensively studied in the automatic differentiation literature. In particular (Griewank and Walther, 2008, Chap. 15) derives asymptotic linear rates for ITD under the assumption that is a contraction. Another attempt to theoretically analyse ITD is made by Franceschi et al. (2018) in the context of the bilevel problem (2). There, the authors provide sufficient conditions for the asymptotic convergence of the set of minimizers of the approximate problem to the set of minimizers of the bilevel problem. In contrast, in this work, we give rates for the convergence of the approximate hypergradient to the true one (i.e. ). ITD is also considered in (Shaban et al., 2019) where is approximated via a procedure which is reminiscent of truncated backpropagation. The authors bound the norm of the difference between

and its truncated version as a function of the truncation steps. This is different from our analysis which directly considers the problem of estimating the gradient of

.

In the case of AID, an asymptotic analysis is presented in

(Pedregosa, 2016), where the author proves the convergence of an inexact gradient projection algorithm for the minimization of the function defined in Problem (2), using increasingly accurate estimates of . Whereas Rajeswaran et al. (2019) present complexity results on the setting of meta-learning with biased regularization. Here, we extend this line of work by providing complexity results for AID in the more general setting of Problem (1).

We also mention the papers by Amos and Kolter (2017) and Agrawal et al. (2019), which present techniques to differentiate through the solutions of quadratic or convex problems respectively. Using such techniques allows one to treat these optimization problems as layers of a neural network and to use backpropagation for the end-to-end training of the resulting learning model. In the former work, the gradient is obtained by implicitly differentiating through the KKT conditions of the lower-level problem, while the latter performs implicit differentiation on the residual maps of Minty’s parametrization.

A different approach to solve bilevel problems of the form (2) is presented by Mehra and Hamm (2019), who consider a sequence of “single level” objectives involving a quadratic regularization term penalizing violations of the lower-level first-order stationary conditions. The authors provide asymptotic convergence guarantees for the method, as the regularization parameter tends to infinity, and show that it outperforms both ITD and AID on different settings where the lower-level problem is non-convex.

All previously mentioned works except (Griewank and Walther, 2008) consider bilevel problems of the form (2). Another exception is (Liao et al., 2018), which proposes two improvements to recurrent backpropagation, one based on conjugate gradient on the normal equations, and another based on Neumann series approximation of the inverse.

## 2 Theoretical Analysis

In this section we establish non-asymptotic bounds on the hypergradient (i.e. ) approximation errors for both ITD and AID schemes (proofs can be found in Appendix A). In particular, in Section 2.1 we address the iteration complexity of ITD, while in Section 2.2, after giving a general bound for AID, we focus on two popular implementations of the AID scheme: one based on the conjugate gradient method and the other on the fixed-point method.

#### Notations.

We denote by

applied to a vector (matrix) the Euclidean (spectral) norm. For a differentiable function

we denote by the derivative of at . When , we denote by the gradient of . For a real-valued function we denote by and the partial derivatives w.r.t. the first and second variable respectively. We also denote by and the second derivative of w.r.t. the first variable and the mixed second derivative of w.r.t. the first and second variable. For a vector-valued function we denote, for every , by and the partial Jacobians w.r.t. the first and second variable respectively. Finally, we indicate with the identity operator.

In the rest of the section, referring to Problem (1), we will group the assumptions as follows. Assumption A is general while Assumption B and C are specific enrichments for ITD and AID respectively.

###### Assumption A.

For every ,

1. [(i)]

2. is the unique fixed point of .

3. is invertible.

4. and are Lipschitz continuous with constants and respectively.

5. and are Lipschitz continuous with constants and respectively.

A direct consequence of Assumption A1-2 and of the implicit function theorem is that and are differentiable on . Specifically, for every , it holds that

 w′(λ) =(I−∂1Φ(w(λ),λ))−1∂2Φ(w(λ),λ) (4) ∇f(λ) =∇2E(w(λ),λ)+w′(λ)⊤∇1E(w(λ),λ). (5)

See Theorem A.1 for details. In the special case of Problem (2), equation (4) reduces (see Corollary A.1) to

 w′(λ)=−∇21ℓ(w(λ),λ)−1∇221ℓ(w(λ),λ).

Before starting with the study of the two methods ITD and AID, we give a lemma which introduces three additional constants that will occur in the complexity bounds.

###### Lemma 2.1.

Let and let be such that . Then there exist s.t.

 sup∥w∥≤2Dλ∥∇1E(w,λ)∥≤LE,λ,  sup∥w∥≤2Dλ∥∂2Φ(w,λ)∥≤LΦ,λ

The proof exploits the fact that the image of a continuous function applied to a compact set remains compact.

### 2.1 Iterative Differentiation

In this section we replace in (1) by the -th iterate of , for which we additionally require the following.

###### Assumption B.

For every , is a contraction with constant .

The approximation of the hypergradient is then obtained as in Algorithm 1.

Assumption B looks quite restrictive, however it is satisfied in a number of interesting cases:

1. [(a)]

2. In the setting of the bilevel optimization problem (2), suppose that for every , is -strongly convex and -Lipschitz smooth for some continuously differentiable functions , and . Set ,

 α(λ)=2μℓ(λ)+Lℓ(λ), and  qλ=κ(λ)−1κ(λ)+1. (6)

Then, is a contraction with constant (see Appendix B).

3. For strongly convex quadratic functions, accelerated methods like Nesterov’s (Nesterov, 1983) or heavy-ball (Polyak, 1987) can be formulated as fixed-point iterations of a contraction in the norm defined by a suitable positive definite matrix.

4. In certain graph and recurrent neural networks of the form (3), where the transition function is assumed to be a contraction (Scarselli et al., 2008; Almeida, 1987; Pineda, 1987).

The following lemma is a simple consequence of the theory on Neumann series and shows that Assumption B is stronger than Assumption A1-2. For reader’s convenience the proof is given in Appendix A.

###### Lemma 2.2.

Let Assumption B be satisfied. Then, for every , has a unique fixed point and, for every , is invertible and

 ∥(I−∂1Φ(w,λ))−1∥≤11−qλ.

In particular, 1 and 2 in Assumption A hold.

With Assumption B in force and if is defined as at point 1 in Algorithm 1, we have the following proposition that is essential for the final bound.

###### Proposition 2.1.

Suppose that Assumptions B and A3 hold and let , with . Moreover, for every , let be computed by Algorithm 1 and let and be as in Lemma 2.1. Then, is differentiable and, for every ,

 ∥w′t(λ)−w′(λ)∥≤(ν2,λ+ν1,λLΦ,λ1−qλ)Dλtqt−1λ+LΦ,λ1−qλqtλ. (7)

Leveraging Proposition 2.1, we give the main result of this section.

###### Theorem 2.1.

(ITD bound) Suppose that Assumptions A3-4 and B hold and let with . Moreover, for every , let and be defined according to Algorithm 1 and let , and be as in Lemma 2.1. Then, is differentiable and, for every ,

 ∥∇ft(λ)−∇f(λ)∥≤(c1(λ)+c2(λ)tqλ+c3(λ))qtλ, (8)

where

 c1(λ) =(η2,λ+η1,λLΦ,λ1−qλ)Dλ, c2(λ) =(ν2,λ+ν1,λLΦ,λ1−qλ)LE,λDλ, c3(λ) =LE,λLΦ,λ1−qλ.

In this generality this is a new result that provides a non-asymptotic linear rate of convergence for the gradient of towards that of .

### 2.2 Approximate Implicit Differentiation

In this section we study another approach to approximate the gradient of . We derive from (4) and (5) that

 ∇f(λ)=∇2E(w(λ),λ)+∂2Φ(w(λ),λ)⊤v(λ) (9)

where is the solution of the linear system

 (I−∂1Φ(w(λ),λ)⊤)v=∇1E(w(λ),λ). (10)

However, in the above formulas is usually not known explicitly or is expensive to compute exactly. To solve this issue is estimated as in Algorithm 2. Note that, unlike ITD, this procedure is agnostic about the algorithms used to compute the sequences and . Interestingly, in the context of Problem (2), choosing in Algorithm 2 yields the same procedure studied by Pedregosa (2016).

The number of iterations and in Algorithm 2 give a direct way of trading off accuracy and speed. To quantify this trade off we consider the following assumptions.

###### Assumption C.

For every ,

1. [(i)]

2. , is invertible.

3. , , and as .

4. and as .

If Assumption C1 holds, then, for every , since the map is continuous, we have

 sup∥w∥≤2Dλ∥(I−∂1Φ(w,λ))−1∥≤1μλ<+∞, (12)

for some . We note that, in view of Lemma 2.2, Assumption B implies Assumption C1 (which in turn implies Assumption A2) and in (12) one can take . We stress that, Assumption C2-3 are general and do not specify the type of algorithms solving the fixed-point equation and the liner system (11). It is only required that such algorithms have explicit rates of convergence and respectively. Finally, we note that Assumption C2 is less restrictive than Assumption B and encompasses the procedure at point 1 in Algorithm 1: indeed in such case C2 holds with .

It is also worth noting that the AID procedure requires only to store the last lower-level iterate, i.e. . This is a considerable advantage over ITD, which instead requires to store the entire lower-level optimization trajectory , if implemented using RMAD.

The iteration complexity bound for AID is given below. This is a general bound which depends on the rate of convergence of the sequence and the rate of convergence of the sequence .

###### Theorem 2.2.

(AID bound) Suppose that Assumptions A134 and C13 hold. Let , , . Let , and be as in Lemma 2.1 and let be defined according to (12). Let be defined as in Algorithm 2 and let . Then,

 ^Δ≤(η2,λ+η1,λLΦ,λμλ+ν2,λLE,λμλ+ν1,λLΦ,λLE,λμ2λ)×Dλρλ(t)+LΦ,λLE,λμλσλ(k). (13)

Furthermore, if Assumption B holds, then and

 (14)

where , and are defined in Theorem 2.1.

Theorem 2.2 provides a non-asymptotic rate of convergence for which contrasts with the asymptotic result given in Pedregosa (2016). In this respect, making Assumption C1 instead of the weaker Assumption A2 is critical.

Depending on the choice of the solver for the linear system (11) different AID methods are obtained. In the following we consider two cases.

#### AID with the Conjugate Gradient Method (AID-CG).

For the sake of brevity we set and . Then, the linear system (11) is equivalent to the following minimization problem

 minv∈Rd12∥Aλ,tv−bλ,t∥2, (15)

which, if is symmetric (so that is also symmetric) is in turn equivalent to

 minv∈Rd12v⊤Aλ,tv−v⊤bλ,t. (16)

Several first order methods solving problems (15) or (16) satisfy assumption C3 with linear rates and require only Jacobian-vector products. In particular, for the symmetric case (16), the conjugate gradient method features the following linear rate

 ∥vt,k(λ)−vt(λ)∥≤2√κ(Aλ,t)(√κ(Aλ,t)−1√κ(Aλ,t)+1)k∥vt,0(λ)−vt(λ)∥, (17)

where is the condition number of . In the setting of case 1 outlined in Section 2.1, and

 μℓ(λ)I≼∇21ℓ(wt(λ),λ)≼Lℓ(λ)I.

Therefore the condition number of satisfies and hence

 √κ(Aλ,t)−1√κ(Aλ,t)+1≤√κ(λ)−1√κ(λ)+1≤κ(λ)−1κ(λ)+1=qλ. (18)

#### AID with the Fixed-Point Method (AID-FP).

In this paragraph we make a specific choice for the sequence in Assumption C3. We let Assumption B be satisfied and consider the following algorithm. For every and , we choose and,

 fork=1,2,…⌊vt,k(λ)=∂1Φ(wt(λ),λ)⊤vt,k−1(λ)+∇1E(wt(λ),λ). (19)

In such case the rate of convergence is linear. More precisely, since (from Assumption B), then the mapping

 T:v↦∂1Φ(wt(λ),λ)v+∇1E(wt(λ),λ)

is a contraction with constant . Moreover, the fixed-point of is the solution of (11). Therefore, . In the end the following result holds.

###### Theorem 2.3.

If Assumption B holds and is defined according to (19), then Assumption C3 is satisfied with .

Now, plugging the rate into the general bound (14) yields

 (20)

However, an analysis similar to the one in Section 2.1 shows that the above result can be slightly improved as follows.

###### Theorem 2.4.

(AID-FP bound) Suppose that Assumptions A134 and Assumption B hold. Suppose also that (19) holds. Let be defined according to Algorithm 2 and . Then, for every ,

 ^Δ≤(c1(λ)+c2(λ)1−qkλ1−qλ)ρλ(t)+c3(λ)qkλ, (21)

where , and are given in Theorem 2.1.

We end this section with a discussion about the consequences of the presented results.

### 2.3 Discussion

Theorem 2.2 shows that Algorithm 2 computes an approximate gradient of with a linear convergence rate (in and ), provided that the solvers for the lower-level problem and the linear system converge linearly. Furthermore, under Assumption B, both AID-FP and ITD converge linearly. However, if in Algorithm 2 we define as at point 1 in Algorithm 1 (so that ), and take , then the bound for AID-FP (21) is lower than that of ITD (8), since for every . This analysis suggests that AID-FP converge faster than ITD.

We now discuss the choice of the algorithm to solve the linear system (11) in Algorithm 2. Theorem 2.4 provides a bound for AID-FP, which considers procedure (19). However, we see from (14) in Theorem 2.2 that a solver for the linear system with rate of convergence faster than may give a better bound. The above discussion, together with (17) and (18), proves that AID-CG has a better asymptotic rate than AID-FP for instances of Problem (2) where the lower-level objective is Lipschitz smooth and strongly convex (case 1 outlined in Section 2.1).

Finally, we note that both ITD and AID consider the initialization . However, in a gradient-based bilevel optimization algorithm, it might be more convenient to use a warm start strategy where is set based on previous upper-level iterations. Our analysis can be applied also in this case, but the related upper bounds will depend on the upper-level dynamics. This aspect makes it difficult to theoretically analyse the benefit of a warm start strategy, which remains an open question.

## 3 Experiments

In the first part of this section we focus on the hypergradient approximation error and show that the upper bounds presented in the previous section give a good estimate of the actual convergence behaviour of ITD and AID strategies on a variety of settings. In the second part we present a series of experiments pertaining optimization on both the settings of hyperparameter optimization, as in Problem (2), and learning equilibrium models, as in Problem (3). The algorithms have been implemented333 The code is freely available at the following link. https://github.com/prolearner/hypertorch

in PyTorch

(Paszke et al., 2019). In the following, we shorthand AID-FP and AID-CG with FP and CG, respectively.

In this section, we consider several problems of type (2) with synthetic generated data (see Section C.1 for more details) where and are the training and validation sets respectively, with , , being the number of examples in each set and the number of features. Specifically we consider the following settings, which are representative instances of relevant bilevel problems in machine learning.

Logistic Regression with Regularization (LR). This setting is similar to the one in Pedregosa (2016), but we introduce multiple regularization parameters:

 f(λ) =∑(xe,ye)∈D′ψ(yex⊤ew(λ)), w(λ) =argminw∈Rp∑(xe,ye)∈Dψ(yex⊤ew)+12w⊤diag(λ)w,

where , and is the diagonal matrix formed by the elements of .

Kernel Ridge Regression (KRR).

We extend the setting presented by Pedregosa (2016) considering a -dimensional Gaussian kernel parameter in place of the usual one:

 f(β,γ) =12∥y′−K′(γ)w(β,γ)∥2, (22) w(β,γ) =argminw∈Rne12w⊤(K(γ)+βI)w−w⊤y,

where , and , are respectively the validation and training kernel matrices (see Section C.1).

Biased Regularization (BR). Inspired by Denevi et al. (2019); Rajeswaran et al. (2019), we consider the following.

 f(λ) =12∥X′w(λ)−y′∥2, w(λ) =argminw∈Rp12∥Xw−y∥2+β2∥w−λ∥2,

where and .

Hyper-representation (HR). The last setting, reminiscent of (Franceschi et al., 2018; Bertinetto et al., 2019)

, concerns learning a (common) linear transformation of the data and is formulated as

 f(H) =12∥X′Hw(H)−y′∥2 w(H) =argminw∈Rd12∥XHw−y∥2+β2∥w∥2

where and .

LR and KRR are high dimensional extensions of classical hyperperparameter optimization problems, while BR and HR, are typically encountered in multi-task/meta-learning as single task objectives444In multi-task/meta-learning the upper-level objectives are averaged over multiple tasks and the hypergradient is simply the average of the single task one.. Note that Assumption B (i.e. is a contraction) can be satisfied for each of the aforesaid scenarios, since they all belong to case 1 of Section 2.1 (KRR, BR and HR also to case 2).

We solve the lower-level problem in the same way for both ITD and AID methods. In particular, in LR we use the gradient descent method with optimal step size as in case 1 of Section 2.1, while for the other cases we use the heavy-ball method with optimal step size and momentum constants. Note that this last method is not a contraction in the original norm, but only in a suitable norm depending on the lower-level problem itself. To compute the exact hypergradient, we differentiate directly using RMAD for KRR, BR and HR, where the closed form expression for is available, while for LR we use CG with in place of the (unavailable) analytic gradient.

Figure 1 shows how the approximation error is affected by the number of lower-level iterations . As suggested by the iteration complexity bounds in Section 2, all the approximations, after a certain number of iterations, converge linearly to the true hypergradient555

The asymptotic error can be quite large probably due to numerical errors (more details in

Appendix C).. Furthermore, in line with our analysis (see Section 2.3), CG gives the best gradient estimate (on average), followed by FP, while ITD performs the worst. For HR, the error of all the methods increases significantly at the beginning, which can be explained by the fact that the heavy ball method is not a contraction in the original norm and may diverge at first. CG outperforms FP on 3 out of 4 settings but both remain far from convergence.

### 3.2 Bilevel Optimization

In this section, we aim to solve instances of the bilevel problem (1) in which has a high dimensionality.

Kernel Ridge Regression on Parkinson. We take as defined in Problem (22) where the data is taken from the UCI Parkinson dataset (Little et al., 2008), containing 195 biomedical voice measurements (22 features) from people with Parkinson’s disease. To avoid projections, we replace and respectively with and in the RHS of the two equations in (22). We split the data randomly into three equal parts to make the train, validation and test sets.

Logistic Regression on 20 Newsgroup666http://qwone.com/ jason/20Newsgroups/. This dataset contains 18000 news divided in 20 topics and the features consist in 101631 tf-idf sparse vectors. We split the data randomly into three equal parts for training, validation and testing. We aim to solve the bilevel problem

 minλ∈RpCE(X′w(λ),y′)w(λ)=argminw∈Rp×cCE(Xw,y)+12cpc∑i=1p∑j=1exp(λj)w2ij

where is the average cross-entropy loss, and . To improve the performance, we use warm-starts on the lower-level problem, i.e. we take for all methods, where are the upper-level iterates.

Training Data Optimization on Fashion MNIST. Similarly to (Maclaurin et al., 2015), we optimize the features of a set of 10 training points, each with a different class label on the Fashion MNIST dataset (Xiao et al., 2017). More specifically we define the bilevel problem as

 minX∈Rc×pCE(X′w(X),y′)w(X)=argminw∈Rp×cCE(Xw,y)+β2cp∥w∥2

where , , , and contains the usual training set.

We solve each problem using (hyper)gradient descent with fixed step size selected via grid search (additional details are provided in Section C.2). The results in Table 1 show the upper-level objective and test accuracy both computed on the approximate lower-level solution after bilevel optimization777For completeness, we also report in the Appendix (Table 2) the upper-level objective and test accuracy both computed on the exact lower-level solution .. For Parkinson and Fashion MNIST, there is little difference among the methods for a fixed . For 20 newsgroup, CG reaches the lowest objective value, followed by CG . We recall that for ITD we have cost in memory which is linear in and that, in the case of 20 newsgroup for some between and , this cost exceeded the 11GB on the GPU. AID methods instead, require little memory and, by setting , yield similar or even better performance at a lower computation time. Finally, we stress that since the upper-level objective is nonconvex, possibly with several minima, gradient descent with a more precise estimate of the hypergradient may get more easily trapped in a bad local minima.

Equilibrium Models. Our last set of experiments investigates the behaviour of the hypergradient approximation methods on a simple instance of EQM (see Problem (3)) on non-structured data. EQM are an attractive class of models due to their mathematical simplicity, enhanced interpretability and memory efficiency. A number of works (Miller and Hardt, 2019; Bai et al., 2019) have recently shown that EQMs can perform on par with standard deep nets on a variety of complex tasks, renewing the interest in these kind of models.

We use a subset of

instances randomly sampled from the MNIST dataset as training data and employ a multiclass logistic classifier paired with a cross-entropy loss. We picked a small training set and purposefully avoided stochastic optimization methods to better focus on issues related to the computation of the hypergradients itself, avoiding the introduction of other sources of noise. We parametrize

as

 ϕi(wi,γ)=tanh(Awi+Bxi+c)for1≤i≤ne (23)

where is the -th example, and

. Such a model may be viewed as a (infinite layers) feed-forward neural network with tied weights or as a recurrent neural network with static inputs. Additional experiments with convolutional equilibrium models may be found in Appendix

C.3. Imposing ensures that the transition functions (23), and hence

, are contractions. This can be achieved during optimization by projecting the singular values of

onto the interval for . We note that regularizing the norm of or adding or penalty terms on may encourage, but does not strictly enforce, .

We conducted a series of experiments to ascertain the importance of the contractiveness of the map , as well as to understand which of the analysed methods is to be preferred in this setting. Since here is not symmetric, the conjugate gradient method must be applied on the normal equations of Problem (15). We set and use fixed-point iterations to solve the lower-level problem in all the experiments. The first three plots of Figure 2 report training objectives, test accuracies and norms of the estimated hypergradient for each of the three methods, either applying or not the constraint on , while the last explores the sensitivity of the methods to the choice of the learning rate. Unconstrained runs are marked with . Referring to the rightmost plot, it is clear (large shaded regions) that not constraining the spectral norm results in unstable behaviour of the “memory-less” AID methods (green and blue lines) for all but a few learning rates, while ITD (violet), as expected, suffers comparatively less. On the contrary, when is enforced, all the approximation methods are successful and stable, with FP to be preferred being faster then CG on the normal equations and requiring substantially less memory than ITD. As a side note, referring to Figure 2 left and center-left, we observe that projecting onto the spectral ball acts as powerful regularizer, in line with the findings of Sedghi et al. (2019).

## 4 Conclusions

We studied a general class of bilevel problems where at the lower-level we seek for a solution to a parametric fixed point equation. This formulation encompasses several learning algorithms recently considered in the literature. We established results on the iteration complexity of two strategies to compute the hypergradient (ITD and AID) under the assumption that the fixed point equation is defined by a contraction mapping. Our practical experience with these methods on a number of bilevel problems indicates that there is a trade-off between the methods, with AID based on the conjugate gradient method being preferable due to a potentiality better approximation of the hypergradient and lower space complexity. When the contraction assumption is not satisfied, however, our experiments on equilibrium models suggest that ITD is more reliable than AID methods. In the future, it would be valuable to extend the ideas presented here to other challenging machine learning scenarios not covered by our theoretical analysis. These include bilevel problems in which the lower-level is only locally contactive, nonsmooth, possibly nonexpansive or can only be solved via a stochastic procedure. At the same time, there is a need to clarify the tightness of the iteration complexity bounds presented here.

#### Acknowledgments.

This work was supported in part by a grant from SAP SE.

## References

• A. Agrawal, B. Amos, S. Barratt, S. Boyd, S. Diamond, and J. Z. Kolter (2019) Differentiable convex optimization layers. In Advances in Neural Information Processing Systems, pp. 9558–9570. Cited by: §1.
• L. B. Almeida (1987)

A learning rule for asynchronous perceptrons with feedback in a combinatorial environment.

.
In Proceedings, 1st First International Conference on Neural Networks, Vol. 2, pp. 609–618. Cited by: item 2, §1, item 3.
• B. Amos and J. Z. Kolter (2017) Optnet: differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 136–145. Cited by: §1.
• M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas (2016) Learning to learn by gradient descent by gradient descent. In Advances in neural information processing systems, pp. 3981–3989. Cited by: §1.
• S. Bai, J. Z. Kolter, and V. Koltun (2019) Deep equilibrium models. In Advances in Neural Information Processing Systems, pp. 688–699. Cited by: §1, §3.2.
• A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind (2018) Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research 18 (153). Cited by: §1.
• J. Bergstra and Y. Bengio (2012) Random search for hyper-parameter optimization. Journal of Machine Learning Research 13 (Feb), pp. 281–305. Cited by: §1.
• L. Bertinetto, J. F. Henriques, P. H. Torr, and A. Vedaldi (2019) Meta-learning with differentiable closed-form solvers. ICLR. Cited by: §3.1.
• G. Denevi, C. Ciliberto, R. Grazzi, and M. Pontil (2019)

Learning-to-learn stochastic gradient descent with biased regularization

.
In Proc. 36th International Conference on Machine Learning, PMLR, Vol. 97, pp. 1566–1575. Cited by: §3.1.
• J. Domke (2012) Generic methods for optimization-based modeling. In Artificial Intelligence and Statistics, pp. 318–326. Cited by: §1.
• T. Elsken, J. H. Metzen, and F. Hutter (2019) Neural architecture search: a survey.. Journal of Machine Learning Research 20 (55), pp. 1–21. Cited by: §1.
• C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. Cited by: item 1, §1.
• L. Franceschi, M. Donini, P. Frasconi, and M. Pontil (2017) Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1165–1173. Cited by: item 1, §1.
• L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil (2018) Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, pp. 1563–1572. Cited by: item 1, §1, §1, §3.1.
• A. Griewank and A. Walther (2008) Evaluating derivatives: principles and techniques of algorithmic differentiation. Vol. 105, Siam. Cited by: item 1, §1, §1, §1.
• R. Liao, Y. Xiong, E. Fetaya, L. Zhang, K. Yoon, X. Pitkow, R. Urtasun, and R. Zemel (2018) Reviving and improving recurrent back-propagation. In International Conference on Machine Learning, pp. 3088–3097. Cited by: §1.
• M. Little, P. McSharry, E. Hunter, J. Spielman, and L. Ramig (2008) Suitability of dysphonia measurements for telemonitoring of parkinson’s disease. Nature Precedings, pp. 1–1. Cited by: §3.2.
• H. Liu, K. Simonyan, and Y. Yang (2018) DARTS: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: §1.
• J. Lorraine, P. Vicol, and D. Duvenaud (2019) Optimizing millions of hyperparameters by implicit differentiation. arXiv preprint arXiv:1911.02590. Cited by: item 2, §1.
• D. Maclaurin, D. Duvenaud, and R. Adams (2015) Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113–2122. Cited by: item 1, §1, §1, §3.2.
• A. Mehra and J. Hamm (2019) Penalty method for inversion-free deep bilevel optimization. arXiv preprint arXiv:1911.03432. Cited by: §1.
• J. Miller and M. Hardt (2019) Stable recurrent models. ICLR. Cited by: §1, §3.2.
• Y. E. Nesterov (1983) A method for solving the convex programming problem with convergence rate o (1/k^ 2). In Dokl. akad. nauk Sssr, Vol. 269, pp. 543–547. Cited by: item 2.
• A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019)

PyTorch: an imperative style, high-performance deep learning library

.
In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Cited by: §3.
• F. Pedregosa (2016) Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pp. 737–746. Cited by: item 2, §1, §1, §2.2, §2.2, §3.1, §3.1, footnote 1.
• F. J. Pineda (1987) Generalization of back-propagation to recurrent neural networks. Physical review letters 59 (19), pp. 2229. Cited by: item 2, §1, item 3.
• B. T. Polyak (1987) Introduction to optimization. Optimization Software Inc. Publication Division, New York, NY, USA. Cited by: Appendix A, item 2.
• A. Rajeswaran, C. Finn, S. M. Kakade, and S. Levine (2019) Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems, pp. 113–124. Cited by: item 2, §1, §1, §3.1, footnote 1.
• F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2008) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80. Cited by: §1, §1, item 3.
• H. Sedghi, V. Gupta, and P. M. Long (2019) The singular values of convolutional layers. ICLR. Cited by: §C.3, §3.2, footnote 13.
• A. Shaban, C. Cheng, N. Hatch, and B. Boots (2019) Truncated back-propagation for bilevel optimization. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1723–1732. Cited by: §1.
• J. Snoek, H. Larochelle, and R. P. Adams (2012) Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959. Cited by: §1.
• H. Xiao, K. Rasul, and R. Vollgraf (2017) External Links: cs.LG/1708.07747 Cited by: §3.2.

## Appendix A Proofs of the Results in Section 2

In this section we provide complete proofs of the results presented in the main body, which are restated here for the convenience of the reader. We also report few necessary additional results.

###### Theorem A.1.

(Differentiability of ). Consider problem (1) and suppose that Assumption A1-2 holds. Then and are differentiable on and, for every

 w′(λ) =(I−∂1Φ(w(λ),λ))−1∂2Φ(w(λ),λ) (24) ∇f(λ) =∇2E(w(λ),λ)+w′(λ)⊤∇1E(w(λ),λ). (25)
###### Proof.

The function is continuously differentiable on . Then, we have

 ∂1G(w(λ),λ)=I−∂1Φ(w(λ),λ),

which is invertible due to Assumption A2. Thus, since , the implicit function theorem yields that is continuously differentiable with derivative

 w′(λ) =∂1G(w(λ),λ)−1∂2G(w(λ),λ) =(I−∂1Φ(w(λ),λ))−1∂2Φ(w(λ),λ).

Finally, (25

) follows from the chain rule for differentiation. ∎

###### Corollary A.1.

Suppose that in problem (2), the function is twice continuously differentiable and strongly convex w.r.t. the first variable. Let be a differentiable function. Then the conclusions of Theorem A.1 hold and

 (∀λ∈Λ)w′(λ)=−∇21ℓ(w,λ)−1∇221ℓ(w(λ),λ).
###### Proof.

Define . Then, Fermat’s rule for the lower-problem in (2) yields that is a fixed point for , while is invertible since

 I−∂1Φ(w(λ),λ)=α(λ)∇21ℓ(w,λ)

and . Therefore, Theorem A.1 applies and, since , (24) yields . ∎

See 2.2

###### Proof.

Let and . Since is Lipschitz continuous with constant , it follows that

 ∥∂1Φ(w,λ)∥≤qλ<1. (26)

Therefore,

 ∞∑k=0∥∂1Φ(w,λ)∥k≤∞∑k=0qkλ=11−qλ.

Thus, is invertible and and the bound follows. ∎

In the following technical lemma we give two results which are fundamental for the proofs of the ITD bound (Theorem 2.1) and the AID-FP bound (Theorem 2.4). The first result is standard (see (Polyak, 1987), Lemma 1, Section 2.2).