We study a general class of bilevel problems, consisting in the minimization of an upper-level objective which depends on the solution to a parametric fixed-point equation. Important instances arising in machine learning include hyperparameter optimization, meta-learning, and certain graph and recurrent neural networks. Typically the gradient of the upper-level objective (hypergradient) is hard or even impossible to compute exactly, which has raised the interest in approximation methods. We investigate some popular approaches to compute the hypergradient, based on reverse mode iterative differentiation and approximate implicit differentiation. Under the hypothesis that the fixed point equation is defined by a contraction mapping, we present a unified analysis which allows for the first time to quantitatively compare these methods, providing explicit bounds for their iteration complexity. This analysis suggests a hierarchy in terms of computational efficiency among the above methods, with approximate implicit differentiation based on conjugate gradient performing best. We present an extensive experimental comparison among the methods which confirm the theoretical findings.READ FULL TEXT VIEW PDF
Bilevel optimization problems are receiving increasing attention in mach...
Bilevel optimization has become a powerful framework in various machine
We present two methods to algorithmically compute both least and greates...
We analyse parallel overlapping Schwarz domain decomposition methods for...
In view of training increasingly complex learning architectures, we esta...
We present a differentiable soft-body physics simulator that can be comp...
We define a new class of "implicit" deep learning prediction rules that
Several problems arising in machine learning and related disciplines can be formulated as bilevel problems, where the lower-level problem is a fixed point equation whose solution is part of an upper-level objective. Instances of this framework include hyperparameter optimization (Maclaurin et al., 2015; Franceschi et al., 2017; Liu et al., 2018; Lorraine et al., 2019; Elsken et al., 2019), meta-learning (Andrychowicz et al., 2016; Finn et al., 2017; Franceschi et al., 2018), as well as recurrent and graph neural networks (Almeida, 1987; Pineda, 1987; Scarselli et al., 2008).
In large scale scenarios, there are thousands or even millions of parameters to find in the upper-level problem, making black-box approaches like grid and random search (Bergstra and Bengio, 2012) or Bayesian optimization (Snoek et al., 2012) impractical. This has made gradient-based methods (Domke, 2012; Maclaurin et al., 2015; Pedregosa, 2016) popular in such settings, but also it has raised the issue of designing efficient procedures to approximate the gradient of the upper-level objective (hypergradient) when finding a solution to the lower-level problem is costly.
The principal goal of this paper is to study the degree of approximation to the hypergradient of certain iterative schemes based on iterative or implicit differentiation111 The reader interested in the convergence analysis of gradient-based algorithms for bilevel optimization is referred to (Pedregosa, 2016; Rajeswaran et al., 2019) and references therein.. In the rest of the introduction we present the bilevel framework, alongside some relevant examples in machine learning. We then outline the gradient approximation methods that we analyse in the paper and highlight our main contributions. Finally, we discuss and compare our results with previous work in the field.
In this work, we consider the following bilevel problem.
where is a closed convex subset of and and are continuously differentiable functions. We assume that the lower-level problem in (1) (which is a fixed point-equation) admits a unique solution. However, in general, explicitly computing such solution is either impossible or expensive. When is differentiable, this issue affects the evaluation of the hypergradient , which at best can only be approximately computed.
A prototypical example of the bilevel problem (1) is
is a loss function, twice continuously differentiable and strictly convex w.r.t. the first variable. Indeed if we letbe such that , where is differentiable, then Problem (2) and Problem (1) are equivalent. Specific examples of Problem (2) include hyperparameter optimization and meta-learning, and are discussed in Section 3.1.
Other instances of the bilevel problem (1), which are not of the special form of Problem (2), arise in the context of so-called equilibrium models (EQM). Notably, these comprise some types of connectionist models employed in domains with structured data. Stable recurrent neural networks (Miller and Hardt, 2019), graph neural networks (Scarselli et al., 2008) and the formulations by Bai et al. (2019) belong to this class. EQM differ from standard (deep) neural networks in that the internal representations are given by fixed points of learnable dynamics rather than compositions of a finite number of layers. The learning problem for such type of models can be written as
where the operators (here ) are associated to the training points ’s, and the error functions are the losses incurred by a standard supervised algorithm on the transformed dataset . A specific example is discussed in Section 3.2
In this paper, we present a unified analysis which allows for the first time to quantitatively compare popular methods to approximate in the general setting of Problem (1). The strategies we consider can be divided in two categories:
Iterative Differentiation (ITD) (Maclaurin et al., 2015; Franceschi et al., 2017, 2018; Finn et al., 2017). One defines the sequence of functions , where are the fixed-point iterates generated by the map . Then is approximated by , which in turn is computed using forward (FMAD) or reverse (RMAD) mode automatic differentiation (Griewank and Walther, 2008).
Approximate Implicit Differentiation (AID) (Pedregosa, 2016; Rajeswaran et al., 2019; Lorraine et al., 2019). First, an (implicit) equation for is obtained through the implicit function theorem. Then, this equation is approximately solved by using a two stage scheme. We analyse two specific methods in this class: the fixed-point method (Lorraine et al., 2019)
, also referred to as recurrent backpropagation in the context of recurrent neural networks(Almeida, 1987; Pineda, 1987), and the conjugate gradient method (Pedregosa, 2016).
Both schemes can be efficiently implemented using automatic differentiation (Griewank and Walther, 2008; Baydin et al., 2018) achieving similar cost in time, while ITD has usually a larger memory cost than AID222This is true when ITD is implemented using RMAD, which is the standard approach when is high dimensional. .
Although there is a vast amount of literature on the two hypergradient approximation strategies, it remains unclear whether it is better to use one or the other. In this work, we shed some light over this issue both theoretically and experimentally. Specifically our contributions are the following:
We provide iteration complexity results for ITD and AID when the mapping defining the fixed point equation is a contraction. In particular, we prove non-asymptotic linear rates for the approximation errors of both approaches.
We make a theoretical and numerical comparison among different ITD and AID strategies considering several experimental scenarios.
We note that, to the best of our knowledge, non-asymptoptic rates of convergence for AID were only recently given in the case of meta-learning (Rajeswaran et al., 2019). Furthermore, we are not aware of any previous results about non-asymptotic rates of convergence for ITD.
Iterative differentiation for functions defined implicitly has been extensively studied in the automatic differentiation literature. In particular (Griewank and Walther, 2008, Chap. 15) derives asymptotic linear rates for ITD under the assumption that is a contraction. Another attempt to theoretically analyse ITD is made by Franceschi et al. (2018) in the context of the bilevel problem (2). There, the authors provide sufficient conditions for the asymptotic convergence of the set of minimizers of the approximate problem to the set of minimizers of the bilevel problem. In contrast, in this work, we give rates for the convergence of the approximate hypergradient to the true one (i.e. ). ITD is also considered in (Shaban et al., 2019) where is approximated via a procedure which is reminiscent of truncated backpropagation. The authors bound the norm of the difference between
and its truncated version as a function of the truncation steps. This is different from our analysis which directly considers the problem of estimating the gradient of.
In the case of AID, an asymptotic analysis is presented in(Pedregosa, 2016), where the author proves the convergence of an inexact gradient projection algorithm for the minimization of the function defined in Problem (2), using increasingly accurate estimates of . Whereas Rajeswaran et al. (2019) present complexity results on the setting of meta-learning with biased regularization. Here, we extend this line of work by providing complexity results for AID in the more general setting of Problem (1).
We also mention the papers by Amos and Kolter (2017) and Agrawal et al. (2019), which present techniques to differentiate through the solutions of quadratic or convex problems respectively. Using such techniques allows one to treat these optimization problems as layers of a neural network and to use backpropagation for the end-to-end training of the resulting learning model. In the former work, the gradient is obtained by implicitly differentiating through the KKT conditions of the lower-level problem, while the latter performs implicit differentiation on the residual maps of Minty’s parametrization.
A different approach to solve bilevel problems of the form (2) is presented by Mehra and Hamm (2019), who consider a sequence of “single level” objectives involving a quadratic regularization term penalizing violations of the lower-level first-order stationary conditions. The authors provide asymptotic convergence guarantees for the method, as the regularization parameter tends to infinity, and show that it outperforms both ITD and AID on different settings where the lower-level problem is non-convex.
All previously mentioned works except (Griewank and Walther, 2008) consider bilevel problems of the form (2). Another exception is (Liao et al., 2018), which proposes two improvements to recurrent backpropagation, one based on conjugate gradient on the normal equations, and another based on Neumann series approximation of the inverse.
In this section we establish non-asymptotic bounds on the hypergradient (i.e. ) approximation errors for both ITD and AID schemes (proofs can be found in Appendix A). In particular, in Section 2.1 we address the iteration complexity of ITD, while in Section 2.2, after giving a general bound for AID, we focus on two popular implementations of the AID scheme: one based on the conjugate gradient method and the other on the fixed-point method.
We denote by
applied to a vector (matrix) the Euclidean (spectral) norm. For a differentiable functionwe denote by the derivative of at . When , we denote by the gradient of . For a real-valued function we denote by and the partial derivatives w.r.t. the first and second variable respectively. We also denote by and the second derivative of w.r.t. the first variable and the mixed second derivative of w.r.t. the first and second variable. For a vector-valued function we denote, for every , by and the partial Jacobians w.r.t. the first and second variable respectively. Finally, we indicate with the identity operator.
For every ,
is the unique fixed point of .
and are Lipschitz continuous with constants and respectively.
and are Lipschitz continuous with constants and respectively.
Before starting with the study of the two methods ITD and AID, we give a lemma which introduces three additional constants that will occur in the complexity bounds.
Let and let be such that . Then there exist s.t.
The proof exploits the fact that the image of a continuous function applied to a compact set remains compact.
In this section we replace in (1) by the -th iterate of , for which we additionally require the following.
For every , is a contraction with constant .
The approximation of the hypergradient is then obtained as in Algorithm 1.
Let , set , and compute,
Compute using automatic differentiation.
Assumption B looks quite restrictive, however it is satisfied in a number of interesting cases:
Leveraging Proposition 2.1, we give the main result of this section.
In this generality this is a new result that provides a non-asymptotic linear rate of convergence for the gradient of towards that of .
where is the solution of the linear system
However, in the above formulas is usually not known explicitly or is expensive to compute exactly. To solve this issue is estimated as in Algorithm 2. Note that, unlike ITD, this procedure is agnostic about the algorithms used to compute the sequences and . Interestingly, in the context of Problem (2), choosing in Algorithm 2 yields the same procedure studied by Pedregosa (2016).
Let and compute by steps of an algorithm converging to , starting from .
Compute after steps of a solver for the system
Compute the approximate gradient as
The number of iterations and in Algorithm 2 give a direct way of trading off accuracy and speed. To quantify this trade off we consider the following assumptions.
For every ,
, is invertible.
, , and as .
and as .
for some . We note that, in view of Lemma 2.2, Assumption B implies Assumption C1 (which in turn implies Assumption A2) and in (12) one can take . We stress that, Assumption C2-3 are general and do not specify the type of algorithms solving the fixed-point equation and the liner system (11). It is only required that such algorithms have explicit rates of convergence and respectively. Finally, we note that Assumption C2 is less restrictive than Assumption B and encompasses the procedure at point 1 in Algorithm 1: indeed in such case C2 holds with .
It is also worth noting that the AID procedure requires only to store the last lower-level iterate, i.e. . This is a considerable advantage over ITD, which instead requires to store the entire lower-level optimization trajectory , if implemented using RMAD.
The iteration complexity bound for AID is given below. This is a general bound which depends on the rate of convergence of the sequence and the rate of convergence of the sequence .
Theorem 2.2 provides a non-asymptotic rate of convergence for which contrasts with the asymptotic result given in Pedregosa (2016). In this respect, making Assumption C1 instead of the weaker Assumption A2 is critical.
Depending on the choice of the solver for the linear system (11) different AID methods are obtained. In the following we consider two cases.
For the sake of brevity we set and . Then, the linear system (11) is equivalent to the following minimization problem
which, if is symmetric (so that is also symmetric) is in turn equivalent to
Several first order methods solving problems (15) or (16) satisfy assumption C3 with linear rates and require only Jacobian-vector products. In particular, for the symmetric case (16), the conjugate gradient method features the following linear rate
Therefore the condition number of satisfies and hence
In such case the rate of convergence is linear. More precisely, since (from Assumption B), then the mapping
is a contraction with constant . Moreover, the fixed-point of is the solution of (11). Therefore, . In the end the following result holds.
Now, plugging the rate into the general bound (14) yields
However, an analysis similar to the one in Section 2.1 shows that the above result can be slightly improved as follows.
We end this section with a discussion about the consequences of the presented results.
Theorem 2.2 shows that Algorithm 2 computes an approximate gradient of with a linear convergence rate (in and ), provided that the solvers for the lower-level problem and the linear system converge linearly. Furthermore, under Assumption B, both AID-FP and ITD converge linearly. However, if in Algorithm 2 we define as at point 1 in Algorithm 1 (so that ), and take , then the bound for AID-FP (21) is lower than that of ITD (8), since for every . This analysis suggests that AID-FP converge faster than ITD.
We now discuss the choice of the algorithm to solve the linear system (11) in Algorithm 2. Theorem 2.4 provides a bound for AID-FP, which considers procedure (19). However, we see from (14) in Theorem 2.2 that a solver for the linear system with rate of convergence faster than may give a better bound. The above discussion, together with (17) and (18), proves that AID-CG has a better asymptotic rate than AID-FP for instances of Problem (2) where the lower-level objective is Lipschitz smooth and strongly convex (case 1 outlined in Section 2.1).
Finally, we note that both ITD and AID consider the initialization . However, in a gradient-based bilevel optimization algorithm, it might be more convenient to use a warm start strategy where is set based on previous upper-level iterations. Our analysis can be applied also in this case, but the related upper bounds will depend on the upper-level dynamics. This aspect makes it difficult to theoretically analyse the benefit of a warm start strategy, which remains an open question.
for CG and FP. Mean and standard deviation (shaded areas) are computed over 20 values ofsampled uniformly from .
In the first part of this section we focus on the hypergradient approximation error and show that the upper bounds presented in the previous section give a good estimate of the actual convergence behaviour of ITD and AID strategies on a variety of settings. In the second part we present a series of experiments pertaining optimization on both the settings of hyperparameter optimization, as in Problem (2), and learning equilibrium models, as in Problem (3). The algorithms have been implemented333 The code is freely available at the following link. https://github.com/prolearner/hypertorch
in PyTorch(Paszke et al., 2019). In the following, we shorthand AID-FP and AID-CG with FP and CG, respectively.
In this section, we consider several problems of type (2) with synthetic generated data (see Section C.1 for more details) where and are the training and validation sets respectively, with , , being the number of examples in each set and the number of features. Specifically we consider the following settings, which are representative instances of relevant bilevel problems in machine learning.
where , and is the diagonal matrix formed by the elements of .
where , and , are respectively the validation and training kernel matrices (see Section C.1).
|ITD||2.39 (75.8)||2.11 (69.7)|
|FP||2.37 (81.8)||2.20 (77.3)|
|CG||2.37 (78.8)||2.20 (77.3)|
|FP||2.71 (80.3)||2.60 (78.8)|
|CG||2.33 (77.3)||2.02 (77.3)|
|1.08 (61.3)||0.97 (62.8)||0.89 (64.2)|
|1.03 (62.1)||1.02 (62.3)||0.84 (64.4)|
|0.93 (63.7)||0.78 (63.3)||0.64 (63.1)|
|0.94 (63.6)||0.97 (63.0)|
|0.82 (64.3)||0.75 (64.2)|
|0.41 (84.1)||0.43 (83.8)|
|0.41 (84.1)||0.43 (83.8)|
|0.42 (83.9)||0.42 (84.0)|
where and .
, concerns learning a (common) linear transformation of the data and is formulated as
where and .
LR and KRR are high dimensional extensions of classical hyperperparameter optimization problems, while BR and HR, are typically encountered in multi-task/meta-learning as single task objectives444In multi-task/meta-learning the upper-level objectives are averaged over multiple tasks and the hypergradient is simply the average of the single task one.. Note that Assumption B (i.e. is a contraction) can be satisfied for each of the aforesaid scenarios, since they all belong to case 1 of Section 2.1 (KRR, BR and HR also to case 2).
We solve the lower-level problem in the same way for both ITD and AID methods. In particular, in LR we use the gradient descent method with optimal step size as in case 1 of Section 2.1, while for the other cases we use the heavy-ball method with optimal step size and momentum constants. Note that this last method is not a contraction in the original norm, but only in a suitable norm depending on the lower-level problem itself. To compute the exact hypergradient, we differentiate directly using RMAD for KRR, BR and HR, where the closed form expression for is available, while for LR we use CG with in place of the (unavailable) analytic gradient.
Figure 1 shows how the approximation error is affected by the number of lower-level iterations .
As suggested by the iteration complexity bounds in Section 2, all the approximations,
after a certain number of iterations,
converge linearly to the true hypergradient555 The asymptotic error can be quite large probably due to numerical errors (more details in
The asymptotic error can be quite large probably due to numerical errors (more details inAppendix C).. Furthermore, in line with our analysis (see Section 2.3), CG gives the best gradient estimate (on average), followed by FP, while ITD performs the worst. For HR, the error of all the methods increases significantly at the beginning, which can be explained by the fact that the heavy ball method is not a contraction in the original norm and may diverge at first. CG outperforms FP on 3 out of 4 settings but both remain far from convergence.
In this section, we aim to solve instances of the bilevel problem (1) in which has a high dimensionality.
Kernel Ridge Regression on Parkinson. We take as defined in Problem (22) where the data is taken from the UCI Parkinson dataset (Little et al., 2008), containing 195 biomedical voice measurements (22 features) from people with Parkinson’s disease. To avoid projections, we replace and respectively with and in the RHS of the two equations in (22). We split the data randomly into three equal parts to make the train, validation and test sets.
Logistic Regression on 20 Newsgroup666http://qwone.com/ jason/20Newsgroups/. This dataset contains 18000 news divided in 20 topics and the features consist in 101631 tf-idf sparse vectors. We split the data randomly into three equal parts for training, validation and testing. We aim to solve the bilevel problem
where is the average cross-entropy loss, and . To improve the performance, we use warm-starts on the lower-level problem, i.e. we take for all methods, where are the upper-level iterates.
Training Data Optimization on Fashion MNIST. Similarly to (Maclaurin et al., 2015), we optimize the features of a set of 10 training points, each with a different class label on the Fashion MNIST dataset (Xiao et al., 2017). More specifically we define the bilevel problem as
where , , , and contains the usual training set.
We solve each problem using (hyper)gradient descent with fixed step size selected via grid search (additional details are provided in Section C.2). The results in Table 1 show the upper-level objective and test accuracy both computed on the approximate lower-level solution after bilevel optimization777For completeness, we also report in the Appendix (Table 2) the upper-level objective and test accuracy both computed on the exact lower-level solution .. For Parkinson and Fashion MNIST, there is little difference among the methods for a fixed . For 20 newsgroup, CG reaches the lowest objective value, followed by CG . We recall that for ITD we have cost in memory which is linear in and that, in the case of 20 newsgroup for some between and , this cost exceeded the 11GB on the GPU. AID methods instead, require little memory and, by setting , yield similar or even better performance at a lower computation time. Finally, we stress that since the upper-level objective is nonconvex, possibly with several minima, gradient descent with a more precise estimate of the hypergradient may get more easily trapped in a bad local minima.
Equilibrium Models. Our last set of experiments investigates the behaviour of the hypergradient approximation methods on a simple instance of EQM (see Problem (3)) on non-structured data. EQM are an attractive class of models due to their mathematical simplicity, enhanced interpretability and memory efficiency. A number of works (Miller and Hardt, 2019; Bai et al., 2019) have recently shown that EQMs can perform on par with standard deep nets on a variety of complex tasks, renewing the interest in these kind of models.
We use a subset of
instances randomly sampled from the MNIST dataset as training data and employ a multiclass logistic classifier paired with a cross-entropy loss. We picked a small training set and purposefully avoided stochastic optimization methods to better focus on issues related to the computation of the hypergradients itself, avoiding the introduction of other sources of noise. We parametrizeas
where is the -th example, and
. Such a model may be viewed as a (infinite layers) feed-forward neural network with tied weights or as a recurrent neural network with static inputs. Additional experiments with convolutional equilibrium models may be found in AppendixC.3. Imposing ensures that the transition functions (23), and hence
, are contractions. This can be achieved during optimization by projecting the singular values ofonto the interval for . We note that regularizing the norm of or adding or penalty terms on may encourage, but does not strictly enforce, .
We conducted a series of experiments to ascertain the importance of the contractiveness of the map , as well as to understand which of the analysed methods is to be preferred in this setting. Since here is not symmetric, the conjugate gradient method must be applied on the normal equations of Problem (15). We set and use fixed-point iterations to solve the lower-level problem in all the experiments. The first three plots of Figure 2 report training objectives, test accuracies and norms of the estimated hypergradient for each of the three methods, either applying or not the constraint on , while the last explores the sensitivity of the methods to the choice of the learning rate. Unconstrained runs are marked with . Referring to the rightmost plot, it is clear (large shaded regions) that not constraining the spectral norm results in unstable behaviour of the “memory-less” AID methods (green and blue lines) for all but a few learning rates, while ITD (violet), as expected, suffers comparatively less. On the contrary, when is enforced, all the approximation methods are successful and stable, with FP to be preferred being faster then CG on the normal equations and requiring substantially less memory than ITD. As a side note, referring to Figure 2 left and center-left, we observe that projecting onto the spectral ball acts as powerful regularizer, in line with the findings of Sedghi et al. (2019).
We studied a general class of bilevel problems where at the lower-level we seek for a solution to a parametric fixed point equation. This formulation encompasses several learning algorithms recently considered in the literature. We established results on the iteration complexity of two strategies to compute the hypergradient (ITD and AID) under the assumption that the fixed point equation is defined by a contraction mapping. Our practical experience with these methods on a number of bilevel problems indicates that there is a trade-off between the methods, with AID based on the conjugate gradient method being preferable due to a potentiality better approximation of the hypergradient and lower space complexity. When the contraction assumption is not satisfied, however, our experiments on equilibrium models suggest that ITD is more reliable than AID methods. In the future, it would be valuable to extend the ideas presented here to other challenging machine learning scenarios not covered by our theoretical analysis. These include bilevel problems in which the lower-level is only locally contactive, nonsmooth, possibly nonexpansive or can only be solved via a stochastic procedure. At the same time, there is a need to clarify the tightness of the iteration complexity bounds presented here.
This work was supported in part by a grant from SAP SE.
A learning rule for asynchronous perceptrons with feedback in a combinatorial environment.. In Proceedings, 1st First International Conference on Neural Networks, Vol. 2, pp. 609–618. Cited by: item 2, §1, item 3.
Learning-to-learn stochastic gradient descent with biased regularization. In Proc. 36th International Conference on Machine Learning, PMLR, Vol. 97, pp. 1566–1575. Cited by: §3.1.
PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Cited by: §3.
In this section we provide complete proofs of the results presented in the main body, which are restated here for the convenience of the reader. We also report few necessary additional results.
Let and . Since is Lipschitz continuous with constant , it follows that
Thus, is invertible and and the bound follows. ∎
In the following technical lemma we give two results which are fundamental for the proofs of the ITD bound (Theorem 2.1) and the AID-FP bound (Theorem 2.4). The first result is standard (see (Polyak, 1987), Lemma 1, Section 2.2).