Truncated Back-propagation for Bilevel Optimization

10/25/2018
by   Amirreza Shaban, et al.
Georgia Institute of Technology

Bilevel optimization has been recently revisited for designing and analyzing algorithms in hyperparameter tuning and meta learning tasks. However, due to its nested structure, evaluating exact gradients for high-dimensional problems is computationally challenging. One heuristic to circumvent this difficulty is to use the approximate gradient given by performing truncated back-propagation through the iterative optimization procedure that solves the lower-level problem. Although promising empirical performance has been reported, its theoretical properties are still unclear. In this paper, we analyze the properties of this family of approximate gradients and establish sufficient conditions for convergence. We validate this on several hyperparameter tuning and meta learning tasks. We find that optimization with the approximate gradient computed using few-step back-propagation often performs comparably to optimization with the exact gradient, while requiring far less memory and half the computation time.


1 Introduction

Bilevel optimization has been recently revisited as a theoretical framework for designing and analyzing algorithms for hyperparameter optimization [1] and meta learning [2]. Mathematically, these problems can be formulated as a stochastic optimization problem with an equality constraint (see Section 1.1):

(1)   min_λ F(λ) := E_ξ[ f(ŵ_T(λ, ξ), λ, ξ) ]   s.t.   ŵ_T(λ, ξ) ≈ argmin_w g(w, λ, ξ),

where w and λ are the parameter and the hyperparameter, F and f are the expected and the sampled upper-level objectives, g is the sampled lower-level objective, and ξ is a random variable called the context. The notation ŵ_T(λ, ξ) ≈ argmin_w g(w, λ, ξ) means that ŵ_T is the return value of a prespecified iterative algorithm (e.g. T steps of gradient descent) that approximately finds a local minimum of g(·, λ, ξ). This algorithm is part of the problem definition and can also be parametrized by λ (e.g. through its step size). The motivation to explicitly consider the approximate solution ŵ_T, rather than an exact minimizer of g, is that an exact minimizer is usually not available in closed form. This setup enables us to account for the imperfections of the lower-level optimization algorithm.

Solving the bilevel optimization problem in (1) is challenging due to the complicated dependency of the upper-level problem on λ induced by ŵ_T. This difficulty is further aggravated when λ and w are high-dimensional, precluding the use of black-box optimization techniques such as grid/random search [3] and Bayesian optimization [4, 5].

Recently, first-order bilevel optimization techniques have been revisited to solve these problems. These methods rely on an estimate of the Jacobian d_λ ŵ_T to optimize λ. Pedregosa [6] and Gould et al. [7] assume that ŵ_T is an exact critical point of the lower-level objective and compute d_λ ŵ_T by implicit differentiation. By contrast, Maclaurin et al. [8] and Franceschi et al. [9] treat the iterative optimization algorithm in the lower-level problem as a dynamical system, and compute d_λ ŵ_T by automatic differentiation through the dynamical system. In comparison, the latter approach is less sensitive to the optimality of ŵ_T and can also learn hyperparameters that control the lower-level optimization process (e.g. the step size). However, due to superlinear time or space complexity (see Section 2.2), neither of these methods is applicable when both w and λ are high-dimensional [9].

Few-step reverse-mode automatic differentiation [10, 11] and few-step forward-mode automatic differentiation [9] have recently been proposed as heuristics to address this issue. By ignoring long-term dependencies, the time and space complexities to compute approximate gradients can be greatly reduced. While exciting empirical results have been reported, the theoretical properties of these methods remain unclear.

In this paper, we study the theoretical properties of these truncated back-propagation approaches. We show that, when the lower-level problem is locally strongly convex around ŵ_T, on-average convergence to an ε-approximate stationary point is guaranteed by K-step truncated back-propagation. We also identify additional problem structures with which asymptotic convergence to an exact stationary point is guaranteed. Empirically, we verify the utility of this strategy on hyperparameter optimization and meta learning tasks. We find that, compared to optimization with full back-propagation, optimization with truncated back-propagation usually shows competitive performance while requiring half as much computation time and significantly less memory.

1.1 Applications

Hyperparameter Optimization

The goal of hyperparameter optimization [12, 13] is to find hyperparameters λ for an optimization problem such that its approximate solution ŵ_T has low cost under some cost function f. In general, λ can parametrize both the objective of the lower-level problem and the algorithm used to solve it. This setup is a special case of the bilevel optimization problem (1) in which the upper-level objective does not depend directly on λ. In contrast to meta learning (discussed below), the problem can be deterministic [9]. See Section 4.2 for examples.

Many low-dimensional problems, such as choosing the learning rate and regularization constant for training neural networks, can be effectively solved with grid search. However, problems with thousands of hyperparameters are increasingly common, for which gradient-based methods are more appropriate [8, 14].

Meta Learning

Another important application of bilevel optimization, meta learning (or learning-to-learn) uses statistical learning to optimize an algorithm alg(λ, ·) over a distribution of tasks τ and contexts ξ:

(2)   min_λ E_τ E_{ξ∼τ} [ f( alg(λ, ξ), λ, ξ ) ].

It treats alg as a parametric function, with hyperparameter λ, that takes task-specific context information ξ as input and outputs a decision alg(λ, ξ). The goal of meta learning is to optimize the algorithm's performance (e.g. the generalization error) across tasks through empirical observations. This general setup subsumes multiple problems commonly encountered in the machine learning literature, such as multi-task learning [15, 16] and few-shot learning [17, 18, 19].

Bilevel optimization emerges from meta learning when the algorithm alg computes its output by internally solving a lower-level minimization problem with variable w. The motivation to use this class of algorithms is that the lower-level problem can be designed so that, even for tasks distant from the training set, alg falls back upon a sensible optimization-based approach [20, 11]. By contrast, treating alg as a general function approximator relies on the availability of a large amount of meta training data [21, 22].

In other words, the decision is ŵ_T(λ, ξ), where ŵ_T is an approximate minimizer of some lower-level objective g. Therefore, we can identify

(3)   alg(λ, ξ) = ŵ_T(λ, ξ) ≈ argmin_w g(w, λ, ξ)

and write (2) as (1). (footnote 1: We have replaced E_τ E_{ξ∼τ} with E_ξ, which is valid since both describe the expectation over the joint distribution. The algorithm alg only perceives ξ, not τ.) Compared with λ, the lower-level variable w is usually task-specific and fine-tuned based on the given context ξ. For example, in few-shot learning, a warm start initialization or regularization function (λ) can be learned through meta learning, so that a task-specific network (w) can be quickly trained using regularized empirical risk minimization with the few examples in ξ. See Section 4.3 for an example.

2 Bilevel Optimization

2.1 Setup

Let w denote the lower-level parameter and λ the hyperparameter. We consider solving (1) with first-order methods that sample ξ (like stochastic gradient descent) and focus on the problem of computing the gradient for a given sample ξ. Therefore, we will simplify the notation below by omitting the dependency of variables and functions on λ and ξ (e.g. we write ŵ_T(λ, ξ) as ŵ_T and g(w, λ, ξ) as g(w)). We use d to denote the total derivative with respect to a variable and ∂ to denote the partial derivative.

To optimize λ, stochastic first-order methods use estimates of the gradient

d_λ f = ∂_λ f + (d_λ ŵ_T) ∂_{ŵ_T} f.

Here we assume that both ∂_λ f and ∂_{ŵ_T} f are available through a stochastic first-order oracle, and focus on the problem of computing the matrix-vector product (d_λ ŵ_T) ∂_{ŵ_T} f when both w and λ are high-dimensional.

2.2 Computing the hypergradient

Like [8, 9], we treat the iterative optimization algorithm that solves the lower-level problem as a dynamical system. Given an initial condition w_0 at t = 0, the update rule can be written as (footnote 2: For notational simplicity, we consider the case where w_t is the state of (4); our derivation can be easily generalized to include other internal states, e.g. momentum.)

(4)   w_t = Φ_t(w_{t-1}, λ),   t = 1, …, T,

in which Φ_t defines the transition, ŵ_T = w_T, and T is the number of iterations performed. For example, in gradient descent, Φ_t(w_{t-1}, λ) = w_{t-1} − η_t ∇_w g(w_{t-1}, λ), where η_t is the step size.

By unrolling the iterative update scheme (4) as a computational graph, we can view ŵ_T as a function of λ and compute the required derivative d_λ ŵ_T by automatic differentiation [23]. Specifically, it can be shown by the chain rule (footnote 3: When t = T, the empty product A_{T+1} ⋯ A_T is defined to be the identity.) that

(5)   d_λ ŵ_T = Σ_{t=1}^{T} B_t A_{t+1} A_{t+2} ⋯ A_T,

where A_t = ∂_{w_{t-1}} Φ_t(w_{t-1}, λ), for t = 1, …, T, and B_t = ∂_λ Φ_t(w_{t-1}, λ).

The computation of (5) can be implemented either in reverse mode or forward mode [9]. Reverse-mode differentiation (RMD) computes (5) by back-propagation: starting from α_T = ∂_{ŵ_T} f and u_T = ∂_λ f, it iterates

(6)   u_{t-1} = u_t + B_t α_t,   α_{t-1} = A_t α_t,   t = T, …, 1,

and finally d_λ f = u_0. Forward-mode differentiation (FMD) computes (5) by forward propagation: starting from Z_0 = 0, it iterates

(7)   Z_t = Z_{t-1} A_t + B_t,   t = 1, …, T,   so that d_λ ŵ_T = Z_T.
Method                        Time           Space             Exact
FMD                           O(m T c(Φ))    O(m d)            yes
RMD                           O(T c(Φ))      O(d T)            yes
Checkpointing every L steps   O(T c(Φ))      O(d (T/L + L))    yes
K-RMD                         O(K c(Φ))      O(d K)            no
Table 1: Comparison of the additional time and space needed to compute (d_λ ŵ_T) ∂_{ŵ_T} f, where d = dim(w), m = dim(λ), and c(Φ) is the time complexity of computing the transition function Φ. Checkpointing doubles the constant in time complexity, compared with other approaches.

The choice between RMD and FMD is a trade-off based on the sizes d and m (see Table 1 for a comparison). For example, one drawback of RMD is that all the intermediate variables w_0, …, w_T need to be stored in memory in order to compute A_t and B_t in the backward pass. Therefore, RMD is only applicable when T is small, as in [20]. Checkpointing [24] can reduce this memory requirement, but it doubles the computation time. Complementary to RMD, FMD propagates the Jacobian Z_t = d_λ w_t in line with the forward evaluation of the dynamical system (4), and does not require any additional memory to save the intermediate variables. However, propagating the m×d matrix Z_t instead of vectors requires memory of size m d and is m-times slower compared with RMD.
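To make the mechanics concrete, here is a minimal sketch (our illustration, not code from the paper) of computing the hypergradient by full reverse-mode differentiation through T unrolled gradient-descent steps, using PyTorch autograd as the back-propagation machinery. The quadratic lower-level objective g and the upper-level objective f below are placeholders chosen only so the example runs; the dimensions, step size, and names are likewise assumptions.

```python
import torch

torch.manual_seed(0)
d, m, T, eta = 5, 3, 50, 0.1           # dim(w), dim(lambda), horizon, lower-level step size

M = torch.randn(d, m)                  # couples w and lambda in the toy lower-level objective
w_target = torch.randn(d)              # defines the toy upper-level objective
lam = torch.zeros(m, requires_grad=True)

def g(w, lam):                         # toy lower-level objective, strongly convex in w
    return 0.5 * ((w - M @ lam) ** 2).sum()

def f(w, lam):                         # toy upper-level objective (no direct lambda term here)
    return 0.5 * ((w - w_target) ** 2).sum()

# Forward pass: unroll T gradient-descent steps, keeping the graph so that
# reverse-mode differentiation can later flow back through every transition.
w = torch.zeros(d, requires_grad=True)
for t in range(T):
    grad_w = torch.autograd.grad(g(w, lam), w, create_graph=True)[0]
    w = w - eta * grad_w               # w_t = Phi_t(w_{t-1}, lambda)

# Full RMD: back-propagate f(w_T) through the entire unrolled graph to lambda.
hypergrad = torch.autograd.grad(f(w, lam), lam)[0]
print(hypergrad)
```

Forward-mode differentiation would instead carry the m×d Jacobian d_λ w_t along with the forward pass, trading the O(dT) memory of RMD for an m-fold increase in computation, as summarized in Table 1.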

3 Truncated Back-Propagation

In this paper, we investigate the idea of approximating (5) with partial sums, which was previously proposed as a heuristic for bilevel optimization [10, 11]. Formally, we consider performing K-step truncated back-propagation (K-RMD), i.e. running the backward recursion (6) only for the last K steps, and using the intermediate variable u_{T−K} to construct an approximate gradient:

(8)   h_K := ∂_λ f + Σ_{t=T−K+1}^{T} B_t α_t = u_{T−K}.

This approach requires storing only the last K iterates {w_t}_{t=T−K}^{T}, and it also saves computation time. Note that K-RMD can be combined with checkpointing for further savings, although we do not investigate this.
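Continuing the toy sketch from Section 2.2 (again our illustration, not the authors' code), K-step truncation amounts to detaching the iterate K steps before the end so that gradients stop flowing through the earlier transitions:

```python
def k_rmd_hypergrad(lam, K, T=50, eta=0.1, d=5):
    """Approximate hypergradient h_K: back-propagate through the last K transitions only."""
    w = torch.zeros(d, requires_grad=True)
    for t in range(T):
        grad_w = torch.autograd.grad(g(w, lam), w, create_graph=True)[0]
        w = w - eta * grad_w
        if t < T - K:
            # Cut the graph: the dependence on earlier iterates (and on lambda through
            # them) is discarded, so only the last K steps contribute to the gradient.
            w = w.detach().requires_grad_()
    return torch.autograd.grad(f(w, lam), lam)[0]

h1 = k_rmd_hypergrad(lam, K=1)   # one-step truncated hypergradient
```

As written, the early steps still build and immediately discard their graphs; in a memory-critical implementation one would run the first T−K steps without gradient tracking and only start recording at step T−K.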

3.1 General properties

We first establish some intuition about why using h_K to optimize λ is reasonable. While building up an approximate gradient by truncating back-propagation can lead to large bias in general optimization problems, the bilevel optimization problem in (1) has some nice structure. Here we show that if the lower-level objective g is locally strongly convex around ŵ_T, then the bias of h_K can be exponentially small in K. That is, choosing a small K suffices to give a good gradient approximation in finite precision. The proof is given in Appendix A.

Proposition 3.1.

Assume g is β-smooth and locally μ-strongly convex in w around the last iterates {w_t}_{t=T−K}^{T}. Let h_K be the approximate gradient in (8). For K ≤ T, it holds that

(9)

where γ < 1 is a contraction factor determined by μ, β, and the step size. In particular, if g is globally μ-strongly convex, then

(10)

Note that γ < 1 under these assumptions. Therefore, Proposition 3.1 says that if the lower-level iterates converge to the neighborhood of a strict local minimum of the lower-level optimization, then the bias of using the approximate gradient of K-RMD decays exponentially in K. This exponentially decaying property is the main reason why using h_K to update the hyperparameter works.

Next we show that, when the lower-level problem is second-order continuously differentiable, h_K actually is a sufficient descent direction. This is a much stronger property than the small bias shown in Proposition 3.1, and it is critical in order to prove convergence to exact stationary points (cf. Theorem 3.4). To build intuition, here we consider a simpler setting in which g is globally strongly convex and the upper-level objective has no direct dependence on λ. These assumptions will be relaxed in the next subsection.

Lemma 3.2.

Let g be globally strongly convex, and suppose the upper-level objective has no direct dependence on λ (∂_λ f = 0). Assume g is second-order continuously differentiable and its cross-derivative ∂_λ ∇_w g has full column rank for all w and λ. For all K ≥ 1, with T large enough and the lower-level step size small enough, there exists c > 0 such that ⟨h_K, d_λ F⟩ ≥ c ‖d_λ F‖². This implies h_K is a sufficient descent direction, i.e. ⟨h_K, d_λ F⟩ > 0 whenever d_λ F ≠ 0.

The full proof of this non-trivial result is given in Appendix B. Here we provide some ideas about why it is true. First, by Proposition 3.1, we know the bias decays exponentially. However, this alone is not sufficient to show that h_K is a sufficient descent direction. To show the desired result, Lemma 3.2 relies on the assumption that g is second-order continuously differentiable and the fact that using gradient descent to optimize a well-conditioned function has linear convergence [25]. These two structural properties further reduce the bias in Proposition 3.1 and lead to Lemma 3.2. The full rank assumption on the cross-derivative is made to simplify the proof; we conjecture that this condition can be relaxed, and we leave this to future work.

3.2 Convergence

With these insights, we analyze the convergence of bilevel optimization with truncated back-propagation. Using Proposition 3.1, we can immediately deduce that optimizing with h_K converges on-average to an ε-approximate stationary point.

Theorem 3.3.

Suppose F is smooth and bounded below, and suppose the approximate gradients h_K are appropriately bounded. Using h_K as a stochastic first-order oracle with a decaying step size to update λ by gradient descent, it follows that, after N iterations, the average norm of the hypergradient is bounded by a term that vanishes as N grows plus a term controlled by the bias of h_K, where the hypergradient in the nth iteration is d_λ F evaluated at λ_n. That is, under the assumptions in Proposition 3.1, learning with h_K converges on average to an ε-approximate stationary point, where ε is proportional to the bias bound in Proposition 3.1.

We see that the bias becomes small as K increases. As a result, it is sufficient to perform K-step truncated back-propagation with K = O(log(1/ε)) to update λ.
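To spell out the dependence of K on ε, assume (per Proposition 3.1) a bias bound of the form Cγ^K with γ < 1; the constant C and the exact form of the bound stand in for the quantities in (9)–(10):

```latex
\[
  \|h_K - \mathrm{d}_\lambda f\| \;\le\; C\,\gamma^{K} \;\le\; \epsilon
  \quad\Longleftarrow\quad
  K \;\ge\; \frac{\log(C/\epsilon)}{\log(1/\gamma)} \;=\; O\big(\log(1/\epsilon)\big),
\]
```

so the truncation depth needed for an ε-accurate hypergradient grows only logarithmically in 1/ε.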

Next, using Lemma 3.2, we show that the bias term in Theorem 3.3 can be removed if the problem is more structured. As promised, we relax the simplifications made in Lemma 3.2 into the weaker conditions of the theorem below, and only assume that g is locally strongly convex.

Theorem 3.4.

Under the assumptions in Proposition 3.1 and Theorem 3.3, if in addition

  1. g is second-order continuously differentiable,

  2. the cross-derivative ∂_λ ∇_w g has full column rank around the limit point,

  3. the problem is deterministic (i.e. f = F),

then for all K ≥ 1, with T large enough and the step size small enough, the limit point of the λ iterates is an exact stationary point, i.e. the hypergradient d_λ F(λ_n) converges to zero, where d_λ F(λ_n) is the hypergradient in the nth iteration.

Theorem 3.4 shows that if the direct partial derivative ∂_λ f does not interfere strongly with the partial derivative computed through back-propagating the lower-level optimization procedure, then optimizing with h_K converges to an exact stationary point. This is a very strong result for an interesting special case. It shows that even with one-step back-propagation (K = 1), updating λ with h_1 can converge to a stationary point.

This non-interference assumption unfortunately is necessary; otherwise, truncating the full RMD leads to a constant bias, as we show below (proved in Appendix E).

Theorem 3.5.

There is a problem, satisfying all of the conditions in Theorem 3.4 except the non-interference assumption, such that optimizing with h_K does not converge to a stationary point.

Note however that the non-interference assumption is satisfied when ∂_λ f = 0, i.e. when the upper-level problem does not directly depend on the hyperparameter. This is the case for many practical applications: e.g. hyperparameter optimization, meta-learning regularization models, image denoising [26, 14], data hyper-cleaning [9], and task interaction [27].

3.3 Relationship with implicit differentiation

The gradient estimate h_K is related to implicit differentiation, which is a classical first-order approach to solving bilevel optimization problems [12, 13]. Assume g is second-order continuously differentiable and that its optimal solution w* uniquely exists and satisfies ∇_w g(w*, λ) = 0. By the implicit function theorem [28], the total derivative of f with respect to λ can be written as

(11)   d_λ f = ∂_λ f − ∂²_{λw} g ( ∂²_{ww} g )^{-1} ∂_w f,

where all derivatives are evaluated at w*(λ) and λ.

Here we show that, in the limit where ŵ_T converges to w*, h_K can be viewed as approximating the matrix inverse in (11) with an order-K Taylor series. This can be seen from the next proposition.

Proposition 3.6.

Under the assumptions in Proposition 3.1, suppose ŵ_T converges to a stationary point w* of the lower-level problem. For K ≤ T, it satisfies that

(12)

By Proposition 3.6, we can write the matrix inverse (∂²_{ww} g)^{-1} in (11) as an infinite series. That is, h_K captures the first K terms of this Taylor series, and the residual term has an upper bound as in Proposition 3.1.
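As an illustration of this connection, suppose the lower-level transition is a plain gradient step with a constant step size γ ∈ (0, 1/β], where g is μ-strongly convex and β-smooth near w*; under this assumption (ours, made only for concreteness) the inverse in (11) expands as a Neumann series whose truncation corresponds to h_K:

```latex
\[
  \big(\partial^2_{ww} g\big)^{-1}
  \;=\; \gamma \sum_{i=0}^{\infty} \big(I - \gamma\,\partial^2_{ww} g\big)^{i}
  \;\approx\; \gamma \sum_{i=0}^{K-1} \big(I - \gamma\,\partial^2_{ww} g\big)^{i},
\]
```

and the neglected tail shrinks geometrically, since ‖I − γ ∂²_{ww} g‖ ≤ 1 − γμ < 1 under the stated assumptions.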

Given this connection, we can compare the use of h_K with solving (11) inexactly using K steps of conjugate gradient for high-dimensional problems [6]. First, both approaches require local strong convexity to ensure a good approximation. Specifically, let κ denote the condition number of ∂²_{ww} g locally around the limit. Using h_K incurs a bias on the order of ((κ−1)/(κ+1))^K, whereas using (11) and inverting the matrix with K iterations of conjugate gradient incurs a bias on the order of ((√κ−1)/(√κ+1))^K [29]. Therefore, when w* is available, solving (11) with conjugate gradient is preferable. However, in practice, this is hardly true. When an approximate solution ŵ_T to the lower-level problem is used, adopting (11) gives no control on the approximation error, nor does it necessarily yield a descent direction. On the contrary, h_K is based on Proposition 3.1, which uses a weaker assumption and does not require the convergence of ŵ_T to a stationary point. Truncated back-propagation can also optimize hyperparameters that control the lower-level optimization process, which the implicit differentiation approach cannot do.
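For comparison, the following sketch (our illustration of the implicit-differentiation route of Pedregosa [6], not the paper's method) evaluates (11) approximately: it solves ∂²_{ww} g · v = ∂_w f with a few conjugate-gradient iterations, using Hessian-vector products so the Hessian is never formed. It assumes lam is a 1-D tensor with requires_grad=True and w_star is a 1-D (approximate) minimizer of g(·, lam); all names and the iteration count are placeholders.

```python
import torch

def implicit_hypergrad(w_star, lam, g, f, n_cg=10):
    """Hypergradient via eq. (11): d_lambda f = partial_lambda f - (d2g/dlam dw) v,
    where v approximately solves (d2g/dw2) v = partial_w f by conjugate gradient."""
    w = w_star.detach().requires_grad_()

    grad_f_w, grad_f_lam = torch.autograd.grad(f(w, lam), (w, lam), allow_unused=True)
    if grad_f_lam is None:                       # upper objective has no direct lambda term
        grad_f_lam = torch.zeros_like(lam)

    grad_g_w = torch.autograd.grad(g(w, lam), w, create_graph=True)[0]

    def hvp(v):                                  # Hessian-vector product (d2g/dw2) @ v
        return torch.autograd.grad(grad_g_w, w, grad_outputs=v, retain_graph=True)[0]

    # Conjugate gradient on (d2g/dw2) v = grad_f_w (assumes a positive-definite Hessian).
    v = torch.zeros_like(grad_f_w)
    r = grad_f_w.clone()
    p = r.clone()
    for _ in range(n_cg):
        Ap = hvp(p)
        step = (r @ r) / (p @ Ap)
        v = v + step * p
        r_new = r - step * Ap
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new

    # Mixed second derivative times v: d/dlambda of (grad_w g . v).
    mixed = torch.autograd.grad(grad_g_w, lam, grad_outputs=v, retain_graph=True)[0]
    return grad_f_lam - mixed
```

As discussed above, this route is only meaningful when ∇_w g(w*, λ) ≈ 0, whereas h_K does not require it.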

4 Experiments

4.1 Toy problem

Consider a simple deterministic problem in which the upper-level objective involves a norm and an elementwise sine of the lower-level solution, and ŵ_T(λ) is defined as the result of T steps of gradient descent on the lower-level objective g(·, λ) with a fixed learning rate and a fixed initialization. A plot of the upper-level objective is shown in Figure 1. We will use this problem to visualize the theorems and explore the empirical properties of truncated back-propagation.

This deterministic problem satisfies all of the assumptions in the previous section, particularly those of Theorem 3.4: g is smooth and strongly convex in w. Although the upper-level objective is somewhat complicated, with many saddle points, it satisfies the non-interference assumption because ∂_λ f = 0.

Figure 1: Graph of the upper-level objective and visualization of Prop. 3.1.
Figure 2: The descent ratio at various K along the optimization trajectory, for the original and the modified upper-level objective respectively.

Figure 1 visualizes Proposition 3.1 by plotting the approximation error ‖h_K − d_λ f‖ and the theoretical bound as functions of K. For this problem, the constants in the bound can be computed analytically. Figure 4 (left) plots the iterates of λ when optimizing with K-RMD and a decaying meta-learning rate. (footnote 4: Because the gradient magnitude varies widely with K, we tune the meta-learning rate to ensure that the first update has a fixed norm.) In comparison with the true gradient at these points, we see that h_K is indeed a descent direction. Figure 2 (left) visualizes this in a different way, by plotting the normalized inner product between h_K and the true gradient for various K at each point along the trajectory. As promised by Lemma 3.2, this ratio stays well away from zero.

To demonstrate the biased convergence of Theorem 3.3, we break the non-interference assumption of Theorem 3.4 by changing the upper objective so that ∂_λ f ≠ 0. The guarantee of Lemma 3.2 no longer applies, and we see in Figure 2 (right) that the ratio can become negative. Indeed, Figure 3 shows that optimizing with truncated gradients converges to a suboptimal point. However, it also shows that using larger K rapidly decreases the bias.

For the original objective, Theorem 3.4 guarantees exact convergence. Figure 4 shows optimization trajectories for various K, and a log-scale plot of their convergence rates. Note that, because the lower-level problem cannot be perfectly solved within T steps, the optimal λ is offset from the origin. Truncated back-propagation can handle this, but it breaks the assumptions required by the implicit differentiation approach to bilevel optimization.

Figure 3: Biased convergence for the modified objective. The red X marks the optimal λ.
Figure 4: Convergence for the original objective.

4.2 Hyperparameter optimization problems

4.2.1 Data hypercleaning

In this section, we evaluate -RMD on a hyperparameter optimization problem. The goal of data hypercleaning [9]

is to train a linear classifier for MNIST 

[30], with the complication that half of our training labels have been corrupted. To do this with hyperparameter optimization, let be the weights of the classifier, with the outer objective measuring the cross-entropy loss on a cleanly labeled validation set. The inner objective is defined as weighted cross-entropy training loss plus regularization:

where are the training examples,

denotes the sigmoid function,

, and is the Frobenius norm. We optimize to minimize validation loss, presumably by decreasing the weight of the corrupted examples. The optimization dimensions are , . Franceschi et al. [9] previously solved this problem with full RMD, and it happens to satisfy many of our theoretical assumptions, making it an interesting case for empirical study.555 We have reformulated the constrained problem from [9] as an unconstrained one that more closely matches our theoretical assumptions. For the same reason, we regularized to make it strongly convex. Finally, we do not retrain on the hypercleaned training + validation data. This is because, for our purposes, comparing the performance of across is sufficient.
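A sketch of how this weighted inner objective might look in code (our illustration; the sigmoid weighting and the regularization term follow the description above, but the regularization constant, names, and shapes are placeholder assumptions):

```python
import torch
import torch.nn.functional as F

c_reg = 1e-3                                    # regularization constant (placeholder value)

def inner_objective(W, lam, X_train, y_train):
    """Weighted cross-entropy training loss plus l2 regularization.
    sigmoid(lam[i]) is the learned weight of training example i; W is (784, 10)."""
    per_example = F.cross_entropy(X_train @ W, y_train, reduction="none")
    return (torch.sigmoid(lam) * per_example).mean() + c_reg * W.pow(2).sum()

def outer_objective(W, X_val, y_val):
    """Cross-entropy on the cleanly labeled validation set (no direct lambda dependence)."""
    return F.cross_entropy(X_val @ W, y_val)
```

The hypergradient with respect to λ can then be estimated with the K-RMD recipe of Section 3, unrolling T gradient-descent steps on inner_objective and back-propagating outer_objective through only the last K of them.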

We optimize the lower-level problem through T steps of gradient descent, and consider how adjusting K changes the performance of K-RMD. (footnote 6: See Appendix G.1 for more experimental setup.) Our hypothesis is that K-RMD for small K works almost as well as full RMD in terms of validation and test accuracy, while requiring less time and far less memory. We also hypothesize that K-RMD does almost as well as full RMD in identifying which samples were corrupted [9]. Because our formulation of the problem is unconstrained, the weights are never exactly zero. However, we can calculate an F1 score by setting a threshold on σ(λ_i): if σ(λ_i) falls below the threshold, then the hyper-cleaner has marked example i as corrupted. (footnote 7: F1 scores for other choices of the threshold were very similar. See Appendix G.1 for details.)

Table 2 reports these metrics for various K. We see that 1-RMD is somewhat worse than the others, and that validation error (the outer objective) decreases with K more quickly than generalization error. The F1 score is already maximized at K = 5. These preliminary results indicate that in situations with limited memory, K-RMD with a small K may be a reasonable fallback: it achieves results close to full backprop, and it runs about twice as fast.

K      Test Acc.   Val. Acc.   Val. Error   F1
1      87.50       89.32       0.413        0.85
5      88.05       89.90       0.383        0.89
25     88.12       89.94       0.382        0.89
50     88.17       90.18       0.381        0.89
100    88.33       90.24       0.380        0.88
Table 2: Hypercleaning metrics after 1000 hyper-iterations.

From a theoretical optimization perspective, we wonder whether K-RMD converges to a stationary point of F. Data hypercleaning satisfies all of the assumptions of Theorem 3.4 except that the full-column-rank condition does not hold. In particular, the validation loss is deterministic and satisfies ∂_λ f = 0. Figure 5 plots the norm of the true gradient on a log scale at the K-RMD iterates for various K. We see that, despite satisfying almost all assumptions, this problem exhibits biased convergence. The limiting gradient norm decreases slowly with K, but recall from Table 2 that practical metrics improve more quickly.

Figure 5: Norm of the true gradient vs. hyper-iteration for hypercleaning.

4.2.2 Task interaction

We next consider the problem of multitask learning [27]. Similar to [9], we formulate this as a hyperparameter optimization problem as follows. The lower-level objective learns a separate linear model for each task, with parameter set w = {w_j}: it is the sum of the tasks' training losses, each the loss of a multi-class linear logistic regression model, plus a regularization term in which a nonnegative, symmetric hyperparameter matrix C encodes the similarity between each pair of tasks, scaled by a regularization constant. After T iterations of gradient descent with a fixed learning rate, this yields ŵ_T. The upper-level objective estimates the loss of the learned models on a validation set. Presumably, this will be improved by tuning C to reflect the true similarities between the tasks. The tasks that we consider are image recognition trained on very small subsets of the CIFAR-10 and CIFAR-100 datasets. (footnote 8: See Appendix G.2 for more details.)
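The displayed lower-level objective is not reproduced here, but one plausible instantiation of such a similarity-coupled regularizer (our guess at the form, in the spirit of Evgeniou et al. [27]; not necessarily the paper's exact choice) penalizes pairwise differences between task weights in proportion to the entries of C:

```python
import torch

def multitask_inner_objective(W, C, rho, task_loss):
    """Sum of per-task training losses plus a pairwise coupling term.
    W: (N, p) stacked task weights; C: (N, N) nonnegative symmetric similarity matrix;
    rho: regularization constant; task_loss(i, w_i): training loss of task i."""
    data_term = sum(task_loss(i, W[i]) for i in range(W.shape[0]))
    diffs = W.unsqueeze(1) - W.unsqueeze(0)             # (N, N, p), diffs[j, k] = w_j - w_k
    coupling = (C * diffs.pow(2).sum(-1)).sum()         # sum_{j,k} C[j,k] * ||w_j - w_k||^2
    return data_term + 0.5 * rho * coupling
```

Because C enters the lower-level objective only, its hypergradient flows entirely through the unrolled optimization, matching the ∂_λ f = 0 (non-interference) setting discussed in Section 3.2.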

From an optimization standpoint, we are most interested in the upper-level loss on the validation set, since that is what is directly optimized, and its value is a good indication of the performance of the inexact gradient. Figure 6 plots this learning curve along with two other metrics of theoretical interest: the norm of the true gradient, and the cosine similarity between the true and approximate gradients. On CIFAR-100, the validation error and gradient norm plots show that K-RMD converges to an approximate stationary point with a bias that rapidly decreases as K increases, agreeing with Proposition 3.1. Also, we find that negative values occur in the cosine similarity of K-RMD, which implies that not all the assumptions in Theorem 3.4 hold for this problem (e.g. the relevant Jacobian might not be full rank, or the inner problem might not be locally strongly convex around ŵ_T). On CIFAR-10, some unusual behavior occurs: for some values of K, the truncated gradient and the full gradient directions eventually become almost the same. We believe this is a very interesting observation, but explaining it is beyond the scope of the paper.

In Table 3, we report the testing accuracy over 10 trials. While in general increasing the number of back-propagation steps improves accuracy, the gaps are small. A thorough investigation of the relationship between convergence and generalization is an interesting open question of both theoretical and practical importance.

Method Avg. Acc. Avg. Iter. Sec/iter.

CIFAR-10

-RMD
-RMD
-RMD
Full RMD

CIFAR-100

-RMD
-RMD
-RMD
Full RMD
Table 3: Test accuracy for task interaction. Few-step K-RMD achieves performance similar to full RMD.
Figure 6: Upper-level objective loss (first column), norm of the exact gradient (second column), and cosine similarity (last column) vs. hyper-iteration on CIFAR10 (first row) and CIFAR100 (second row) datasets.

4.3 Meta-learning: One-shot classification

The aim of this experiment is to evaluate the performance of truncated back-propagation in multi-task, stochastic optimization problems. We consider in particular the one-shot classification problem [20], where each task is an N-way classification problem and the goal is to learn a hyperparameter λ such that each task can be solved with few training samples.

In each hyper-iteration, we sample a task, a training set, and a validation set as follows. First, N classes are randomly chosen from a pool of classes to define the sampled task. Then the training set is created by randomly drawing one training example from each of the N classes and assigning it a label in {1, …, N}. The validation set is constructed similarly, but with more examples from each class. The lower-level objective is the N-way cross-entropy loss of a deep neural network, parametrized by w and optionally by the hyperparameter λ, on the sampled training set, plus a regularization term: to prevent overfitting in the lower-level optimization, we regularize each parameter to be close to a center value with a nonnegative weight, where both the center and the weight are hyperparameters (a sketch appears below). The upper-level objective is the loss of the trained network on the sampled validation set. In contrast to the other experiments, this is a stochastic optimization problem. Also, the upper-level objective depends directly on the hyperparameter λ, in addition to the indirect dependence through ŵ_T (i.e. ∂_λ f ≠ 0).
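For instance, the proximal regularization toward a learned center could look like the following sketch (our illustration; net_loss, w_center, and gamma are placeholder names, and both w_center and gamma would be meta-learned as part of λ):

```python
import torch

def lower_level_objective(w, w_center, gamma, support_x, support_y, net_loss):
    """Few-shot training loss plus a proximity term that keeps the task-specific
    parameters w (a flattened vector) close to the meta-learned center w_center."""
    task_loss = net_loss(w, support_x, support_y)             # N-way cross-entropy
    proximity = 0.5 * gamma * ((w - w_center) ** 2).sum()     # pull w toward the center
    return task_loss + proximity
```

Setting gamma to zero recovers the non-regularized variant discussed at the end of this section, whose gradient error no longer decays exponentially.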

We use the Omniglot dataset [31] and a neural network similar to the one used in [20], with small modifications. Please refer to Appendix G.3 for more details about the model and the data splits. We fix the lower-level horizon T and optimize over the hyperparameter λ. The average accuracy of each model is evaluated over randomly sampled training and validation sets from the meta-testing dataset. For comparison, we also try using a very short horizon, which is common in recent work on few-shot learning [20].

The statistics are shown in Table 4 and the learning curves in Figure 7. In addition to saving memory, all truncated methods are faster than full RMD, sometimes as much as five times faster. These results suggest that running few-step back-propagation for more hyper-iterations can be more efficient than full RMD. To support this hypothesis, we also ran two of the truncated variants for an especially large number of hyper-iterations. Even with this many hyper-iterations, the total runtime is less than that of full RMD, and the results are significantly improved. We also find that while using a short horizon is faster, it achieves a lower accuracy at the same number of iterations.

Finally, we verify some of our theorems in practice. Figure 7 (fourth plot) shows that when the lower-level problem is regularized, the relative error between the K-RMD approximate gradient and the exact gradient decays exponentially as K increases, as guaranteed by Proposition 3.1. However, this exponential decay is not seen for the non-regularized model. This suggests that the local strong convexity assumption is essential in order to have exponential decay in practice. Figure 7 (third plot) shows the cosine similarity between the inexact gradient and the full gradient over the course of meta-training. Note that the cosine similarities are always positive, indicating that the inexact gradients are indeed descent directions. The cosine similarities also show a slight decay over time.

Method Accuracy iter. Sec/iter.
-RMD
-RMD
-RMD
Full RMD
-RMD
-RMD
Short horizon
Table 4: Results for one-shot learning on the Omniglot dataset. K-RMD reaches performance similar to full RMD, is considerably faster, and requires less memory.
Figure 7: Plots 1 and 2: Testing accuracy and error vs. number of hyper-iterations for different RMD depths. K-RMD methods show performance similar to full RMD. Plot 3: Cosine similarity between the inexact gradient and full RMD over hyper-iterations. Plot 4: Relative error between the inexact gradient and full RMD vs. reverse depth. The regularized version shows exponential decay.

5 Conclusion

We analyze K-RMD, a first-order heuristic for solving bilevel optimization problems in which the lower-level problem is itself approximately solved in an iterative way. We show that K-RMD is a valid alternative to full RMD from both theoretical and empirical standpoints. Theoretically, we identify sufficient conditions under which the hyperparameters converge to an approximate or exact stationary point of the upper-level objective. The key observation is that when the lower-level iterate is near a strict local minimum of the lower-level objective, the gradient approximation error decays exponentially with the reverse depth K. Empirically, we explore the properties of this optimization method with four proof-of-concept experiments. We find that although exact convergence appears to be uncommon in practice, the performance of K-RMD is close to full RMD in terms of application-specific metrics (such as generalization error). It is also roughly twice as fast. These results suggest that in hyperparameter optimization or meta learning applications with memory constraints, truncated back-propagation is a reasonable choice.

Our experiments use a modest number of parameters, hyperparameters, and lower-level iterations. This is because we need to be able to calculate both K-RMD and full RMD in order to compare their performance. One promising direction for future research is to use K-RMD for bilevel optimization problems that require powerful function approximators at both levels of optimization. Truncated RMD makes this approach feasible and enables comparing bilevel optimization to other meta-learning methods on difficult benchmarks.

References

  • Domke [2012] Justin Domke. Generic methods for optimization-based modeling. In Artificial Intelligence and Statistics, pages 318–326, 2012.
  • Franceschi et al. [2017a] Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. A bridge between hyperparameter optimization and learning-to-learn. arXiv preprint arXiv:1712.06283, 2017a.
  • Bergstra and Bengio [2012] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
  • Srinivas et al. [2010] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on International Conference on Machine Learning, 2010.
  • Snoek et al. [2012] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
  • Pedregosa [2016] Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pages 737–746, 2016.
  • Gould et al. [2016] Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, and Edison Guo. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. arXiv preprint arXiv:1607.05447, 2016.
  • Maclaurin et al. [2015] Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pages 2113–2122, 2015.
  • Franceschi et al. [2017b] Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on International Conference on Machine Learning, 2017b.
  • Luketina et al. [2016] Jelena Luketina, Mathias Berglund, Klaus Greff, and Tapani Raiko. Scalable gradient-based tuning of continuous regularization hyperparameters. In International Conference on Machine Learning, pages 2952–2960, 2016.
  • Baydin et al. [2018] Atilim Gunes Baydin, Robert Cornish, David Martinez Rubio, Mark Schmidt, and Frank Wood. Online learning rate adaptation with hypergradient descent. In International Conference on Learning Representations, 2018.
  • Larsen et al. [1996] Jan Larsen, Lars Kai Hansen, Claus Svarer, and M Ohlsson. Design and regularization of neural networks: the optimal use of a validation set. In Neural Networks for Signal Processing [1996] VI. IEEE Signal Processing Society Workshop, pages 62–71. IEEE, 1996.
  • Bengio [2000] Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural computation, 12(8):1889–1900, 2000.
  • Chen et al. [2014] Yunjin Chen, Rene Ranftl, and Thomas Pock. Insights into analysis operator learning: From patch-based sparse models to higher order mrfs. IEEE Transactions on Image Processing, 23(3):1060–1072, 2014.
  • Caruana [1998] Rich Caruana. Multitask learning. In Learning to learn, pages 95–133. Springer, 1998.
  • Ranjan et al. [2017] Rajeev Ranjan, Vishal M Patel, and Rama Chellappa.

    Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition.

    IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • Fei-Fei et al. [2006] Li Fei-Fei, Rob Fergus, and Pietro Perona. One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence, 28(4):594–611, 2006.
  • Ravi and Larochelle [2017] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017.
  • Snell et al. [2017] Jake Snell, Kevin Swersky, and Richard S Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 2017.
  • Finn et al. [2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML), 2017.
  • Andrychowicz et al. [2016] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
  • Li and Malik [2017] Ke Li and Jitendra Malik. Learning to optimize neural nets. arXiv preprint arXiv:1703.00441, 2017.
  • Baydin et al. [2017] Atilim Gunes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18:153:1–153:43, 2017.
  • Hascoet and Araya-Polo [2006] Laurent Hascoet and Mauricio Araya-Polo. Enabling user-driven checkpointing strategies in reverse-mode automatic differentiation. arXiv preprint cs/0606042, 2006.
  • Hazan et al. [2016] Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
  • Roth and Black [2005] Stefan Roth and Michael J Black. Fields of experts: A framework for learning image priors. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pages 860–867. IEEE, 2005.
  • Evgeniou et al. [2005] Theodoros Evgeniou, Charles A Micchelli, and Massimiliano Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6(Apr):615–637, 2005.
  • Rudin et al. [1964] Walter Rudin et al. Principles of mathematical analysis, volume 3. McGraw-hill New York, 1964.
  • Shewchuk et al. [1994] Jonathan Richard Shewchuk et al. An introduction to the conjugate gradient method without the agonizing pain, 1994.
  • LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Lake et al. [2015] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
  • Horn and Johnson [1990] Roger A Horn and Charles R Johnson. Matrix analysis. Cambridge university press, 1990.
  • Kingma and Ba [2015] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.


Appendix

Appendix A Proof of Proposition 3.1

See 3.1

Proof.

Let . By definition of ,

Therefore, when is locally -strongly convex with respect to in the neighborhood of ,

Suppose is

-smooth but nonconvex. In the worst case, if the smallest eigenvalue of

is , then for . This gives the bound in (9). However, if is globally strongly convex, then

The bound (10) uses the fact that

Appendix B Proof of Lemma 3.2

See 3.2

Proof.

To illustrate the idea, here we prove the case where . For , similar steps can be applied. To prove the statement, we first expand the inner product by definition

where we recall as by assumption.

Next we show a technical lemma, which provides a critical tool to bound the second term above; its proof is given in the next section.

Lemma B.1.

Let be -strongly convex and -smooth. Assume and are Lipschitz continuous in , and assume has full column rank. For ,

By Lemma B.1, we can then write

Because

()

and is non-singular by assumption,

for some , when is large enough and is small enough. The implication holds because . ∎

B.1 Proof of Lemma B.1

Proof.

Let and be the Lipschitz constant of and . First, we see that the inner product can be lower bounded by the following terms

where

The above lower bounds can be shown by the following inequalities: