1 Introduction
Bilevel optimization has recently been revisited as a theoretical framework for designing and analyzing algorithms for hyperparameter optimization [1] and meta learning [2]. Mathematically, these problems can be formulated as a stochastic optimization problem with an equality constraint (see Section 1.1):
$$\min_{\lambda} \; F(\lambda) := \mathbb{E}_{\xi}\big[\, f(\hat{w}_{T}(\lambda, \xi), \lambda; \xi) \,\big] \quad \text{s.t.} \quad \hat{w}_{T}(\lambda, \xi) \approx \operatorname*{argmin}_{w} \; g(w, \lambda; \xi) \tag{1}$$
where $w$ and $\lambda$ are the parameter and the hyperparameter, $F$ and $f$ are the expected and the sampled upper-level objectives, $g$ is the sampled lower-level objective, and $\xi$ is a random variable called the context. The notation $\hat{w}_T(\lambda, \xi) \approx \operatorname{argmin}_w g(w, \lambda; \xi)$ means that $\hat{w}_T$ is the return value of a pre-specified iterative algorithm (e.g. gradient descent) that approximately finds a local minimum of $g$. This algorithm is part of the problem definition and can also be parametrized by $\lambda$ (e.g. step size). The motivation for explicitly considering the approximate solution $\hat{w}_T$, rather than an exact minimizer $w^*$ of $g$, is that $w^*$ is usually not available in closed form. This setup enables us to account for the imperfections of the lower-level optimization algorithm.

Solving the bilevel optimization problem in (1) is challenging due to the complicated dependency of the upper-level problem on $\lambda$ induced by $\hat{w}_T$. This difficulty is further aggravated when $w$ and $\lambda$ are high-dimensional, precluding the use of black-box optimization techniques such as grid/random search [3] and Bayesian optimization [4, 5].
Recently, first-order bilevel optimization techniques have been revisited to solve these problems. These methods rely on an estimate of the Jacobian $\nabla_\lambda \hat{w}_T$ to optimize $\lambda$. Pedregosa [6] and Gould et al. [7] assume that $\hat{w}_T = w^*$ and compute $\nabla_\lambda w^*$ by implicit differentiation. By contrast, Maclaurin et al. [8] and Franceschi et al. [9] treat the iterative optimization algorithm in the lower-level problem as a dynamical system, and compute $\nabla_\lambda \hat{w}_T$ by automatic differentiation through the dynamical system. In comparison, the latter approach is less sensitive to the optimality of $\hat{w}_T$ and can also learn hyperparameters that control the lower-level optimization process (e.g. step size). However, due to superlinear time or space complexity (see Section 2.2), neither of these methods is applicable when both $w$ and $\lambda$ are high-dimensional [9].

Few-step reverse-mode automatic differentiation [10, 11] and few-step forward-mode automatic differentiation [9] have recently been proposed as heuristics to address this issue. By ignoring long-term dependencies, the time and space complexities of computing approximate gradients can be greatly reduced. While exciting empirical results have been reported, the theoretical properties of these methods remain unclear.
In this paper, we study the theoretical properties of these truncated backpropagation approaches. We show that, when the lower-level problem is locally strongly convex around its solutions, on-average convergence to an approximate stationary point is guaranteed by $K$-step truncated backpropagation. We also identify additional problem structures under which asymptotic convergence to an exact stationary point is guaranteed. Empirically, we verify the utility of this strategy for hyperparameter optimization and meta learning tasks. We find that, compared to optimization with full backpropagation, optimization with truncated backpropagation usually shows competitive performance while requiring half as much computation time and significantly less memory.
1.1 Applications
Hyperparameter Optimization
The goal of hyperparameter optimization [12, 13] is to find hyperparameters $\lambda$ for an optimization problem $\min_w g(w, \lambda)$ such that the approximate solution $\hat{w}_T(\lambda)$ has low cost for some cost function $f$. In general, $\lambda$ can parametrize both the objective $g$ and the algorithm used to solve it. This setup is a special case of the bilevel optimization problem (1) where the upper-level objective does not depend directly on $\lambda$. In contrast to meta learning (discussed below), the problem can be deterministic [9]. See Section 4.2 for examples.
Many low-dimensional problems, such as choosing the learning rate and regularization constant for training neural networks, can be effectively solved with grid search. However, problems with thousands of hyperparameters are increasingly common, for which gradient-based methods are more appropriate [8, 14].

Meta Learning
Another important application of bilevel optimization, meta learning (or learning-to-learn), uses statistical learning to optimize an algorithm $\mathrm{alg}$ over a distribution of tasks $\tau$ and contexts $\xi$:
$$\min_{\lambda} \; \mathbb{E}_{\tau}\, \mathbb{E}_{\xi \sim p(\xi \mid \tau)}\big[\, f(\mathrm{alg}(\lambda, \xi), \lambda; \xi) \,\big] \tag{2}$$
It treats $\mathrm{alg}$ as a parametric function, with hyperparameter $\lambda$, that takes task-specific context information $\xi$ as input and outputs a decision $w$. The goal of meta learning is to optimize the algorithm's performance (e.g. the generalization error) across tasks through empirical observations. This general setup subsumes multiple problems commonly encountered in the machine learning literature, such as multi-task learning [15, 16] and few-shot learning [17, 18, 19].

Bilevel optimization emerges from meta learning when the algorithm $\mathrm{alg}$ computes $w$ by internally solving a lower-level minimization problem with variable $w$. The motivation to use this class of algorithms is that the lower-level problem can be designed so that, even for tasks distant from the training set, $\mathrm{alg}$ falls back upon a sensible optimization-based approach [20, 11]. By contrast, treating $\mathrm{alg}$ as a general function approximator relies on the availability of a large amount of meta training data [21, 22].
In other words, the decision is $w = \hat{w}_T(\lambda, \xi)$, where $\hat{w}_T$ is an approximate minimizer of some function $g$. Therefore, we can identify
$$\mathrm{alg}(\lambda, \xi) = \hat{w}_{T}(\lambda, \xi) \approx \operatorname*{argmin}_{w} \; g(w, \lambda; \xi) \tag{3}$$
and write (2) as (1).¹ Compared with $\lambda$, the lower-level variable $w$ is usually task-specific and fine-tuned based on the given context $\xi$. For example, in few-shot learning, a warm-start initialization or regularization function (parametrized by $\lambda$) can be learned through meta learning, so that a task-specific network (with weights $w$) can be quickly trained using regularized empirical risk minimization with few examples. See Section 4.3 for an example.

¹ We have replaced $\mathbb{E}_{\tau}\mathbb{E}_{\xi \mid \tau}$ with $\mathbb{E}_{\xi}$, which is valid since both describe the expectation over the joint distribution. The algorithm only perceives $\xi$, not $\tau$.

2 Bilevel Optimization
2.1 Setup
Let $n$ and $m$ denote the dimensions of $w$ and $\lambda$, respectively. We consider solving (1) with first-order methods that sample $\xi$ (like stochastic gradient descent) and focus on the problem of computing the gradients for a given $\xi$. Therefore, we will simplify the notation below by omitting the dependency of variables and functions on $\lambda$ and $\xi$ (e.g. we write $g(w, \lambda; \xi)$ as $g(w)$ and $\hat{w}_T(\lambda, \xi)$ as $\hat{w}_T$). We use $\nabla_x$ to denote the total derivative with respect to a variable $x$, and $\partial_x$ to denote the partial derivative, with the convention that $\nabla_\lambda f = \nabla_\lambda \hat{w}_T\, \partial_{\hat{w}_T} f + \partial_\lambda f$.

To optimize $\lambda$, stochastic first-order methods use estimates of the gradient $\nabla_\lambda F$. Here we assume that both $\partial_{\hat{w}_T} f$ and $\partial_\lambda f$ are available through a stochastic first-order oracle, and focus on the problem of computing the matrix–vector product $\nabla_\lambda \hat{w}_T\, \partial_{\hat{w}_T} f$ when both $w$ and $\lambda$ are high-dimensional.

2.2 Computing the hypergradient
Like [8, 9], we treat the iterative optimization algorithm that solves the lower-level problem as a dynamical system. Given an initial condition $w_0$ at $t = 0$, the update rule can be written as²

² For notational simplicity, we consider the case where $w_t$ is the state of (4); our derivation can be easily generalized to include other internal states, e.g. momentum.
$$w_{t+1} = \Phi(w_t, \lambda) \tag{4}$$
in which $\Phi$ defines the transition and $T$ is the number of iterations performed. For example, in gradient descent, $\Phi(w_t, \lambda) = w_t - \alpha \nabla_w g(w_t, \lambda)$, where $\alpha$ is the step size.
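As a concrete illustration, this gradient-descent transition can be simulated in a few lines of numpy (a minimal sketch of ours, using a simple quadratic stand-in for the lower-level objective $g$):

```python
import numpy as np

def grad_g(w, lam):
    # Stand-in lower-level objective: g(w, lam) = 0.5 * ||w - lam||^2,
    # whose exact minimizer is w = lam.
    return w - lam

def unroll(w0, lam, alpha, T):
    """Run the dynamical system w_{t+1} = Phi(w_t, lam) = w_t - alpha * grad_g."""
    w = w0
    for _ in range(T):
        w = w - alpha * grad_g(w, lam)  # the transition Phi(w_t, lam)
    return w

lam = np.array([1.0, -2.0])
w_hat = unroll(np.zeros(2), lam, alpha=0.5, T=50)
# For this strongly convex g, w_hat approaches the exact minimizer lam.
```

Everything that follows (reverse-mode, forward-mode, and truncated differentiation) amounts to differentiating through exactly this kind of loop.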
By unrolling the iterative update scheme (4) as a computational graph, we can view $\hat{w}_T$ as a function of $\lambda$ and compute the required derivative $\nabla_\lambda \hat{w}_T$ [23]. Specifically, it can be shown by the chain rule that³

$$\nabla_\lambda \hat{w}_T = \sum_{t=1}^{T} B_t A_{t+1} A_{t+2} \cdots A_T \tag{5}$$

where $A_t = \partial_{w_{t-1}} \Phi(w_{t-1}, \lambda)$ and $B_t = \partial_{\lambda} \Phi(w_{t-1}, \lambda)$, for $t = 1, \dots, T$.

³ When $t = T$, the empty product $A_{T+1} \cdots A_T = I$.
The computation of (5) can be implemented either in reverse mode or forward mode [9]. Reverse-mode differentiation (RMD) computes (5) by backpropagation:

$$\alpha_T = \partial_{\hat{w}_T} f, \qquad \alpha_{t-1} = A_t \alpha_t \quad \text{for } t = T, \dots, 1, \tag{6}$$

and finally $\nabla_\lambda \hat{w}_T\, \partial_{\hat{w}_T} f = \sum_{t=1}^{T} B_t \alpha_t$. Forward-mode differentiation (FMD) computes (5) by forward propagation:
$$Z_0 = 0, \qquad Z_t = Z_{t-1} A_t + B_t \quad \text{for } t = 1, \dots, T, \qquad \nabla_\lambda \hat{w}_T = Z_T \tag{7}$$
Table 1: Complexity of computing the hypergradient term $\nabla_\lambda \hat{w}_T\, \partial_{\hat{w}_T} f$ ($n = \dim w$, $m = \dim \lambda$; up to constant factors).

Method  Time  Space  Exact
FMD  $O(mT)$  $O(mn)$  ✓
RMD  $O(T)$  $O(nT)$  ✓
Checkpointing every $K$ steps  $O(2T)$  $O(n(T/K + K))$  ✓
$K$-RMD  $O(T + K)$  $O(nK)$  ✗
The choice between RMD and FMD is a trade-off based on the sizes of $w$ and $\lambda$ (see Table 1 for a comparison). For example, one drawback of RMD is that all the intermediate variables $w_0, \dots, w_{T-1}$ need to be stored in memory in order to compute $A_t$ and $B_t$ in the backward pass. Therefore, RMD is only applicable when $T$ is small, as in [20]. Checkpointing [24] can reduce this memory requirement, but it doubles the computation time. Complementary to RMD, FMD propagates the matrix $Z_t = \nabla_\lambda w_t$ in line with the forward evaluation of the dynamical system (4), and does not require any additional memory to save the intermediate variables. However, propagating the matrix $Z_t$ instead of vectors requires memory of size $O(mn)$ and is $O(m)$ times slower compared with RMD.
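To make forward-mode differentiation concrete, the numpy sketch below (ours; it uses a toy quadratic lower-level objective, and the column-Jacobian convention $Z_t = \nabla_\lambda w_t \in \mathbb{R}^{n \times m}$ with update $Z_t = A_t Z_{t-1} + B_t$) propagates $Z_t$ alongside the forward pass and checks it against finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 4, 3, 80

# Toy quadratic lower-level objective (our stand-in, not from the paper):
#   g(w, lam) = 0.5 * w^T H w - w^T (P lam)
Q = rng.normal(size=(n, n))
H = Q @ Q.T + np.eye(n)                    # symmetric positive definite
alpha = 1.0 / np.linalg.eigvalsh(H).max()  # step size <= 1/beta
P = rng.normal(size=(n, m))

def phi(w, lam):
    return w - alpha * (H @ w - P @ lam)   # gradient-descent transition

def fmd(lam, w0):
    """Forward-mode differentiation: propagate Z_t = d w_t / d lam (n x m)
    in line with the forward pass."""
    w, Z = w0, np.zeros((n, m))
    A = np.eye(n) - alpha * H              # dPhi/dw (constant for quadratic g)
    B = alpha * P                          # dPhi/dlam
    for _ in range(T):
        w, Z = phi(w, lam), A @ Z + B
    return w, Z

lam0, w0 = rng.normal(size=m), np.zeros(n)
_, Z = fmd(lam0, w0)

def w_T(lam):                              # unrolled solver, for a check
    w = w0
    for _ in range(T):
        w = phi(w, lam)
    return w

eps = 1e-6                                 # finite-difference check of Z
Z_fd = np.stack([(w_T(lam0 + eps * e) - w_T(lam0 - eps * e)) / (2 * eps)
                 for e in np.eye(m)], axis=1)
```

Note that the $n \times m$ matrix `Z` is carried through every step, which is exactly the $O(mn)$ memory and $O(m)$-fold slowdown discussed above.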
3 Truncated Back-propagation
In this paper, we investigate the idea of approximating (5) with partial sums, which was previously proposed as a heuristic for bilevel optimization [10, 11]. Formally, we consider performing $K$-step truncated backpropagation ($K$-RMD) and then using the intermediate variables $\alpha_t$ to construct an approximate gradient:
$$h_K = \partial_\lambda f + \sum_{t=T-K+1}^{T} B_t \alpha_t \tag{8}$$
This approach requires storing only the last $K$ iterates $w_{T-K}, \dots, w_{T-1}$, and it also saves computation time. Note that $K$-RMD can be combined with checkpointing for further savings, although we do not investigate this.
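The estimator in (8) can be sketched in numpy as follows (ours; a diagonal quadratic stand-in for $g$ and a linear upper-level $f$, so that $A_t$ and $B_t$ are constant and no iterates need to be stored):

```python
import numpy as np

n, T = 3, 200
H = np.diag([1.0, 2.0, 4.0])     # toy lower-level Hessian (strongly convex g)
alpha = 1.0 / 4.0                # step size <= 1/beta, as in Proposition 3.1
c = np.ones(n)                   # upper-level f(w) = c^T w, so df/dw = c

def phi(w, lam):
    # Phi(w, lam) = w - alpha * grad_w g(w, lam), with g = 0.5 w^T H w - w^T lam
    return w - alpha * (H @ w - lam)

def h_K(lam, w0, K):
    """K-step truncated RMD estimate of d f(w_T)/d lam."""
    w = w0
    for _ in range(T):           # forward pass; iterates need not be stored
        w = phi(w, lam)          # here since A_t, B_t are constant
    A = np.eye(n) - alpha * H    # A_t = dPhi/dw
    B = alpha * np.eye(n)        # B_t = dPhi/dlam
    a, h = c.copy(), np.zeros(n)
    for _ in range(K):           # backprop through the last K steps only
        h += B @ a               # accumulate B_t alpha_t
        a = A.T @ a              # alpha_{t-1} = A_t alpha_t
    return h

lam0, w0 = np.array([1.0, -1.0, 2.0]), np.zeros(n)
errs = [np.linalg.norm(h_K(lam0, w0, K) - h_K(lam0, w0, T)) for K in (1, 5, 25)]
# The bias ||h_K - h_T|| shrinks geometrically in K, as Proposition 3.1 predicts.
```

Here $K = T$ recovers full RMD, and the truncation error decays like $(1-\alpha\mu)^K$.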
3.1 General properties
We first establish some intuitions about why using $K$-RMD to optimize $\lambda$ is reasonable. While building up an approximate gradient by truncating backpropagation can lead to large bias in general optimization problems, the bilevel optimization problem in (1) has some nice structure. Here we show that if the lower-level objective $g$ is locally strongly convex around the iterates, then the bias of $h_K$ can be exponentially small in $K$. That is, choosing a small $K$ would suffice to give a good gradient approximation in finite precision. The proof is given in Appendix A.
Proposition 3.1.
Assume $g$ is $\beta$-smooth and locally $\mu$-strongly convex in $w$ around the iterates $w_{T-K}, \dots, w_T$. Let the step size satisfy $\alpha \le 1/\beta$. For $0 \le K \le T$, it holds that

$$\|h_T - h_K\| \le C\, (1 - \alpha\mu)^{K} \tag{9}$$

where $C$ is a constant depending on the smoothness of $f$ and $\Phi$. In particular, if $g$ is globally strongly convex, then

$$\|\nabla_\lambda F - h_K\| \le C'\, (1 - \alpha\mu)^{K} \tag{10}$$

for a constant $C'$ independent of $T$.
Note $0 \le 1 - \alpha\mu < 1$, since $\alpha\mu \le \mu/\beta \le 1$. Therefore, Proposition 3.1 says that if $\hat{w}_T$ converges to the neighborhood of a strict local minimum of the lower-level optimization, then the bias of using the approximate gradient of $K$-RMD decays exponentially in $K$. This exponentially decaying property is the main reason why using $h_K$ to update the hyperparameter works.
Next we show that, when the lower-level problem is second-order continuously differentiable, $h_K$ actually is a sufficient descent direction. This is a much stronger property than the small bias shown in Proposition 3.1, and it is critical in order to prove convergence to exact stationary points (cf. Theorem 3.4). To build intuition, here we consider a simpler problem in which $g$ is globally strongly convex and $\partial_\lambda f = 0$. These assumptions will be relaxed in the next subsection.
Lemma 3.2.
Let $g$ be globally strongly convex and $\partial_\lambda f = 0$. Assume $\Phi$ is second-order continuously differentiable and $B_t$ has full column rank for all $t$. Let $h_K$ be defined as in (8). For all $K \ge 1$, with $T$ large enough and $\alpha$ small enough, there exists $c > 0$, s.t. $\langle h_K, \nabla_\lambda F \rangle \ge c\, \|\nabla_\lambda F\|^2$. This implies $h_K$ is a sufficient descent direction, i.e. moving $\lambda$ along $-h_K$ decreases $F$.
The full proof of this nontrivial result is given in Appendix B. Here we provide some ideas about why it is true. First, by Proposition 3.1, we know the bias decays exponentially. However, this alone is not sufficient to show that $h_K$ is a sufficient descent direction. To show the desired result, Lemma 3.2 relies on the assumption that $\Phi$ is second-order continuously differentiable and the fact that using gradient descent to optimize a well-conditioned function has linear convergence [25]. These two new structural properties further reduce the bias in Proposition 3.1 and lead to Lemma 3.2. Here the full rank assumption on $B_t$ is made to simplify the proof. We conjecture that this condition can be relaxed when $K > 1$. We leave this to future work.
3.2 Convergence
With these insights, we analyze the convergence of bilevel optimization with truncated backpropagation. Using Proposition 3.1, we can immediately deduce that optimizing $\lambda$ with $h_K$ converges on average to an approximate stationary point.
Theorem 3.3.
Suppose $F$ is smooth and bounded below, and suppose there is $\epsilon \ge 0$ such that $\|h_K - \nabla_\lambda F\| \le \epsilon$ at every iteration. Using $h_K$ as a stochastic first-order oracle with a decaying step size $O(1/\sqrt{k})$ to update $\lambda$ with gradient descent, it follows that, after $N$ iterations,

$$\frac{1}{N} \sum_{k=1}^{N} \mathbb{E}\big[\|\nabla_\lambda F_k\|^2\big] \le O\!\left(\frac{1}{\sqrt{N}}\right) + O(\epsilon^2),$$

where $\nabla_\lambda F_k$ is the hypergradient in the $k$-th iteration. That is, under the assumptions in Proposition 3.1, learning with $h_K$ converges to an approximate stationary point, where $\epsilon = O((1 - \alpha\mu)^K)$.
We see that the bias $\epsilon$ becomes small as $K$ increases. As a result, it is sufficient to perform $K$-step truncated backpropagation with $K = O(\log(1/\epsilon))$ to update $\lambda$ with bias at most $\epsilon$.
Next, using Lemma 3.2, we show that the bias term in Theorem 3.3 can be removed if the problem is more structured. As promised, we relax the simplifications made in Lemma 3.2 into assumptions 2 and 3 below, and only assume $g$ is locally strongly convex.
Theorem 3.4.
Under the assumptions in Proposition 3.1 and Theorem 3.3, if in addition

1. $\Phi$ is second-order continuously differentiable,
2. $B_t$ has full column rank around $\hat{w}_T$,
3. $\langle \partial_\lambda f,\, h_K - \partial_\lambda f \rangle \ge 0$ (non-interference),
4. the problem is deterministic (i.e. $\xi$ is fixed),

then for all $K \ge 1$, with $T$ large enough and $\alpha$ small enough, the limit point is an exact stationary point, i.e. $\lim_{k \to \infty} \nabla_\lambda F_k = 0$, where $\nabla_\lambda F_k$ is the hypergradient in the $k$-th iteration.
Theorem 3.4 shows that if the partial derivative $\partial_\lambda f$ does not interfere strongly with the partial derivative computed through backpropagating the lower-level optimization procedure (assumption 3), then optimizing $\lambda$ with $h_K$ converges to an exact stationary point. This is a very strong result for an interesting special case. It shows that even with one-step backpropagation ($K = 1$), updating $\lambda$ can converge to a stationary point.
This non-interference assumption unfortunately is necessary; otherwise, truncating the full RMD computation leads to a constant bias, as we show below (proved in Appendix E).
Theorem 3.5.
There is a problem, satisfying all but assumption 3 in Theorem 3.4, such that optimizing with $h_K$ does not converge to a stationary point.
Note however that the non-interference assumption is satisfied when $\partial_\lambda f = 0$, i.e. when the upper-level problem does not directly depend on the hyperparameter. This is the case for many practical applications: e.g. hyperparameter optimization, meta-learning regularization models, image denoising [26, 14], data hyper-cleaning [9], and task interaction [27].
3.3 Relationship with implicit differentiation
The gradient estimate $h_K$ is related to implicit differentiation, which is a classical first-order approach to solving bilevel optimization problems [12, 13]. Assume $g$ is second-order continuously differentiable and that its optimal solution $w^*(\lambda)$ uniquely exists such that $\nabla_w g(w^*(\lambda), \lambda) = 0$. By the implicit function theorem [28], the total derivative of $w^*$ with respect to $\lambda$ can be written as

$$\nabla_\lambda w^* = -\,\partial_\lambda \nabla_w g \, \big(\nabla_w^2 g\big)^{-1} \tag{11}$$

where all derivatives are evaluated at $w^*$ and $\lambda$.
Here we show that, in the limit where $\hat{w}_T$ converges to $w^*$, $h_K$ can be viewed as approximating the matrix inverse in (11) with an order-$K$ Taylor series. This can be seen from the next proposition.
Proposition 3.6.
Under the assumptions in Proposition 3.1, suppose $\hat{w}_T$ converges to a stationary point $w^*$ as $T \to \infty$. Let $A_\infty = I - \alpha \nabla_w^2 g(w^*)$ and $B_\infty = -\alpha\, \partial_\lambda \nabla_w g(w^*)$. For $1 \le K \le T$, it satisfies that

$$\lim_{T \to \infty} \sum_{t=T-K+1}^{T} B_t A_{t+1} \cdots A_T = B_\infty \sum_{k=0}^{K-1} A_\infty^{k} \tag{12}$$
By Proposition 3.6, we can write $(\nabla_w^2 g)^{-1}$ in (11) as

$$\big(\nabla_w^2 g\big)^{-1} = \alpha \sum_{k=0}^{\infty} \big(I - \alpha \nabla_w^2 g\big)^{k}.$$

That is, $h_K$ captures the first $K$ terms of this Taylor series, and the residual term has an upper bound as in Proposition 3.1.
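This Neumann-series view of the inverse is easy to verify numerically (a small sketch of ours, with a toy diagonal Hessian standing in for $\nabla_w^2 g$):

```python
import numpy as np

H = np.diag([1.0, 2.0, 4.0])           # Hessian of a strongly convex g (toy)
alpha = 1.0 / 4.0                      # alpha <= 1/beta gives ||I - alpha H|| < 1

def neumann_inverse(H, alpha, K):
    """Order-K Neumann (Taylor) approximation:
       H^{-1} = alpha * sum_{k=0}^{inf} (I - alpha H)^k, truncated at K terms."""
    n = H.shape[0]
    A = np.eye(n) - alpha * H
    S, Ak = np.zeros((n, n)), np.eye(n)
    for _ in range(K):
        S += Ak                        # accumulate A^k
        Ak = Ak @ A
    return alpha * S

H_inv = np.linalg.inv(H)
err = [np.linalg.norm(neumann_inverse(H, alpha, K) - H_inv) for K in (1, 10, 100)]
# The truncation error decays like ||I - alpha H||^K = (1 - alpha * mu)^K.
```

This is the same geometric rate $(1-\alpha\mu)^K$ that governs the bias of $h_K$ in Proposition 3.1.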
Given this connection, we can compare the use of $h_K$ with solving (11) inexactly using $K$ steps of conjugate gradient descent for high-dimensional problems [6]. First, both approaches require local strong convexity to ensure a good approximation. Specifically, let $\kappa = \beta/\mu$ be the condition number of $\nabla_w^2 g$ locally around the limit. Using $h_K$ has a bias in $O((1 - 1/\kappa)^K)$, whereas using (11) and inverting the matrix with $K$ iterations of conjugate gradient has a bias in $O\big(((\sqrt{\kappa} - 1)/(\sqrt{\kappa} + 1))^K\big)$ [29]. Therefore, when $w^*$ is available, solving (11) with conjugate gradient descent is preferable. However, in practice, this is hardly true. When an approximate solution to the lower-level problem is used, adopting (11) gives no control on the approximation error, nor does it necessarily yield a descent direction. On the contrary, $h_K$ is based on Proposition 3.1, which uses a weaker assumption and does not require the convergence of $\hat{w}_T$ to a stationary point. Truncated backpropagation can also optimize the hyperparameters that control the lower-level optimization process, which the implicit differentiation approach cannot do.
4 Experiments
4.1 Toy problem
Consider the following simple problem for the hyperparameter $\lambda$:

$$\min_{\lambda} \; f(\hat{w}_T(\lambda), \lambda) \quad \text{s.t.} \quad \hat{w}_T(\lambda) \approx \operatorname*{argmin}_{w} \; g(w, \lambda)$$

where $\|\cdot\|$ denotes the norm, sine is applied elementwise, and we define $\hat{w}_T(\lambda)$ as the result of $T$ steps of gradient descent on $g$ with learning rate $\alpha$, initialized at $w_0$. A plot of $F$ is shown in Figure 1. We will use this problem to visualize the theorems and explore the empirical properties of truncated backpropagation.
This deterministic problem satisfies all of the assumptions in the previous section, particularly those of Theorem 3.4: $g$ is smooth and strongly convex in $w$, with $\Phi$ second-order continuously differentiable and $B_t$ full column rank. Although $F$ is somewhat complicated, with many saddle points, it satisfies the non-interference assumption because $\partial_\lambda f = 0$.
Figure 1 visualizes Proposition 3.1 by plotting the approximation error $\|h_T - h_K\|$ and the theoretical bound as functions of $K$. For this problem, the relevant constants can be found analytically. Figure 4 (left) plots the iterates $\lambda_k$ when optimizing with $K$-RMD and a decaying meta-learning rate.⁴ In comparison with the true gradient at these points, we see that $h_K$ is indeed a descent direction. Figure 2 (left) visualizes this in a different way, by plotting the cosine similarity $\langle h_K, \nabla_\lambda F\rangle / (\|h_K\|\,\|\nabla_\lambda F\|)$ for various $K$ at each point along the trajectory. As promised by Lemma 3.2, this ratio stays well away from zero.

⁴ Because $\|h_K\|$ varies widely with $K$, we tune the meta-learning rate so that the first update has a fixed norm.
To demonstrate the biased convergence of Theorem 3.3, we break assumption 3 of Theorem 3.4 by changing the upper objective so that $\partial_\lambda f \ne 0$ and assumption 3 fails. The guarantee of Lemma 3.2 no longer applies, and we see in Figure 2 (right) that $\langle h_K, \nabla_\lambda F\rangle$ can become negative. Indeed, Figure 3 shows that optimizing with $h_K$ converges to a suboptimal point. However, it also shows that using larger $K$ rapidly decreases the bias.
For the original objective $f$, Theorem 3.4 guarantees exact convergence. Figure 4 shows optimization trajectories for various $K$, and a log-scale plot of their convergence rates. Note that, because the lower-level problem cannot be perfectly solved within $T$ steps, the optimal $\lambda$ is offset from the origin. Truncated backpropagation can handle this, but it breaks the assumptions required by the implicit differentiation approach to bilevel optimization.
4.2 Hyperparameter optimization problems
4.2.1 Data hypercleaning
In this section, we evaluate $K$-RMD on a hyperparameter optimization problem. The goal of data hyper-cleaning [9] is to train a linear classifier for MNIST [30], with the complication that half of our training labels have been corrupted. To do this with hyperparameter optimization, let $w$ be the weights of the classifier, with the outer objective $f$ measuring the cross-entropy loss on a cleanly labeled validation set. The inner objective is defined as a weighted cross-entropy training loss plus regularization:

$$g(w, \lambda) = \sum_{i} \sigma(\lambda_i)\, \ell(w; x_i, y_i) + c\,\|w\|_F^2,$$

where $(x_i, y_i)$ are the training examples, $\sigma$ denotes the sigmoid function, $\ell$ is the cross-entropy loss, and $\|\cdot\|_F$ is the Frobenius norm. We optimize $\lambda$ to minimize the validation loss, presumably by decreasing the weight of the corrupted examples. Both the parameter $w$ and the hyperparameter $\lambda$ are high-dimensional in this problem. Franceschi et al. [9] previously solved this problem with full RMD, and it happens to satisfy many of our theoretical assumptions, making it an interesting case for empirical study.⁵

We optimize the lower-level problem through $T$ steps of gradient descent, and consider how adjusting $K$ changes the performance of $K$-RMD.⁶ Our hypothesis is that $K$-RMD for small $K$ works almost as well as full RMD in terms of validation and test accuracy, while requiring less time and far less memory. We also hypothesize that $K$-RMD does almost as well as full RMD in identifying which samples were corrupted [9]. Because our formulation of the problem is unconstrained, the weights $\sigma(\lambda_i)$ are never exactly zero. However, we can calculate an F1 score by setting a threshold on $\sigma(\lambda_i)$: if $\sigma(\lambda_i)$ falls below the threshold, then the hyper-cleaner has marked example $i$ as corrupted.⁷

⁵ We have reformulated the constrained problem from [9] as an unconstrained one that more closely matches our theoretical assumptions. For the same reason, we regularized $g$ to make it strongly convex. Finally, we do not retrain on the hyper-cleaned training + validation data. This is because, for our purposes, comparing the performance of $K$-RMD across $K$ is sufficient.
⁶ See Appendix G.1 for more experimental setup.
⁷ F1 scores for other choices of the threshold were very similar. See Appendix G.1 for details.
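As an illustration of such a weighted lower-level objective, here is a minimal numpy sketch (ours; the shapes, names, and softmax parametrization are our assumptions, not the paper's exact setup):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def inner_loss(W, lam, X, Y, c):
    """Weighted multi-class cross-entropy plus L2 regularization (sketch):
       g(W, lam) = sum_i sigmoid(lam_i) * CE(softmax(W x_i), y_i) + c ||W||_F^2
    X: (N, d) features, Y: (N,) int labels, W: (k, d), lam: (N,) per-example."""
    logits = X @ W.T                                  # (N, k)
    logits -= logits.max(axis=1, keepdims=True)       # stabilize softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(Y)), Y]             # per-example cross-entropy
    return np.sum(sigmoid(lam) * ce) + c * np.sum(W ** 2)

rng = np.random.default_rng(0)
N, d, k = 20, 5, 3
X, Y = rng.normal(size=(N, d)), rng.integers(0, k, size=N)
lam = np.zeros(N)                                     # sigmoid(0) = 0.5 weights
W = np.zeros((k, d))
loss0 = inner_loss(W, lam, X, Y, c=0.1)
# At W = 0 every class is equally likely, so each cross-entropy term is log(k)
# and the weighted loss equals N * 0.5 * log(k).
```

Driving $\lambda_i \to -\infty$ for a corrupted example pushes its weight $\sigma(\lambda_i)$ toward zero, which is exactly what the hyper-cleaner is meant to learn.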
Table 2 reports these metrics for various $K$. We see that $K = 1$ is somewhat worse than the others, and that the validation error (the outer objective $f$) decreases with $K$ more quickly than the generalization error. The F1 score is already maximized at $K = 5$. These preliminary results indicate that in situations with limited memory, $K$-RMD with small $K$ (e.g. $K = 5$) may be a reasonable fallback: it achieves results close to full backprop, and it runs about twice as fast.
Table 2: Data hyper-cleaning results for various $K$.

$K$  Test Acc.  Val. Acc.  Val. Error  F1
1  87.50  89.32  0.413  0.85
5  88.05  89.90  0.383  0.89
25  88.12  89.94  0.382  0.89
50  88.17  90.18  0.381  0.89
100  88.33  90.24  0.380  0.88
From a theoretical optimization perspective, we wonder whether $K$-RMD converges to a stationary point of $F$. Data hyper-cleaning satisfies all of the assumptions of Theorem 3.4 except that $B_t$ does not have full column rank. In particular, the validation loss is deterministic and satisfies $\partial_\lambda f = 0$. Figure 5 plots the norm of the true gradient $\|\nabla_\lambda F\|$ on a log scale at the $K$-RMD iterates for various $K$. We see that, despite satisfying almost all assumptions, this problem exhibits biased convergence. The limiting gradient norm decreases slowly with $K$, but recall from Table 2 that the practical metrics improve more quickly.
4.2.2 Task interaction
We next consider the problem of multi-task learning [27]. Similar to [9], we formulate this as a hyperparameter optimization problem as follows. The lower-level objective learns a set of linear models with parameters $w = \{w_j\}$:

$$g(w, \lambda) = \sum_{j} \ell_{\mathrm{tr}}(w_j) + \rho \sum_{i,j} C_{i,j}\, \|w_i - w_j\|^2,$$

where $\ell_{\mathrm{tr}}$ is the training loss of the multi-class linear logistic regression model, $\rho$ is a regularization constant, and $C$ is a nonnegative, symmetric hyperparameter matrix that encodes the similarity between each pair of tasks. After $T$ iterations of gradient descent with learning rate $\alpha$, this yields $\hat{w}_T(\lambda)$. The upper-level objective estimates the logistic regression loss of the learned model $\hat{w}_T$ on a validation set. Presumably, this will be improved by tuning $C$ to reflect the true similarities between the tasks. The tasks that we consider are image recognition trained on very small subsets of the datasets CIFAR-10 and CIFAR-100.⁸

⁸ See Appendix G.2 for more details.

From an optimization standpoint, we are most interested in the upper-level loss on the validation set, since that is what is directly optimized, and its value is a good indication of the performance of the inexact gradient. Figure 6 plots this learning curve along with two other metrics of theoretical interest: the norm of the true gradient, and the cosine similarity between the true and approximate gradients. On CIFAR-100, the validation error and gradient norm plots show that $K$-RMD converges to an approximate stationary point with a bias that rapidly decreases as $K$ increases, agreeing with Proposition 3.1. Also, we find that negative values exist in the cosine similarity of $K$-RMD, which implies that not all the assumptions in Theorem 3.4 hold for this problem (e.g. $B_t$ might not be full rank, or the inner problem might not be locally strongly convex around $\hat{w}_T$). On CIFAR-10, some unusual behavior happens: for certain $K$, the truncated gradient and the full gradient directions eventually become almost the same. We believe this is a very interesting observation, but it is beyond the scope of the paper to explain.

In Table 3, we report the test accuracy over 10 trials. While in general increasing the number of backpropagation steps improves accuracy, the gaps are small. A thorough investigation of the relationship between convergence and generalization is an interesting open question of both theoretical and practical importance.
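The pairwise task-coupling regularizer in the lower-level objective above can be sketched as follows (ours; the vectorized form assumes $C$ is symmetric and the double sum runs over all ordered pairs):

```python
import numpy as np

def multitask_reg(W, C, rho):
    """Pairwise task-coupling regularizer (sketch):
         rho * sum_{i,j} C[i, j] * ||W[i] - W[j]||^2
    W: (k, d) stacked per-task weights, C: (k, k) nonnegative symmetric."""
    diff2 = ((W[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)  # (k, k) sq. dists
    return rho * np.sum(C * diff2)

W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
C = np.ones((3, 3)) - np.eye(3)        # all tasks coupled equally
print(multitask_reg(W, C, rho=0.5))    # prints 4.0
```

Tuning the hyperparameter matrix $C$ then controls how strongly each pair of task-specific weight vectors is pulled together.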
Method  Avg. Acc.  Avg. Iter.  Sec/iter.

CIFAR-10
$K$-RMD
$K$-RMD
$K$-RMD
Full RMD

CIFAR-100
$K$-RMD
$K$-RMD
$K$-RMD
Full RMD
4.3 Meta-learning: One-shot classification
The aim of this experiment is to evaluate the performance of truncated backpropagation on multi-task, stochastic optimization problems. We consider in particular the one-shot classification problem [20], where each task is an $N$-way classification problem and the goal is to learn a hyperparameter $\lambda$ such that each task can be solved with few training samples.
In each hyper-iteration, we sample a task, a training set, and a validation set as follows: First, $N$ classes are randomly chosen from a pool of possible classes to define the sampled task $\tau$. Then the training set $\xi_{\mathrm{tr}}$ is created by randomly drawing one training example from each of the $N$ classes and assigning a label to it. The validation set $\xi_{\mathrm{val}}$ is constructed similarly, but with more examples from each class. The lower-level objective is defined as:

$$g(w, \lambda; \xi) = \ell_{\xi_{\mathrm{tr}}}(w, \lambda) + \sum_{i} \mu_i\, \|w_i - c_i\|^2,$$

where $\ell_{\xi_{\mathrm{tr}}}$ is the $N$-way cross-entropy loss on the sampled training set, computed with a deep neural network parametrized by $w$ and optionally the hyperparameter $\lambda$. To prevent overfitting in the lower-level optimization, we regularize each parameter $w_i$ to be close to a center $c_i$ with weight $\mu_i$. Both $c$ and $\mu$ are hyperparameters. The upper-level objective $f$ is the loss of the trained network on the sampled validation set $\xi_{\mathrm{val}}$. In contrast to the other experiments, this is a stochastic optimization problem. Also, $f$ depends directly on the hyperparameter $\lambda$, in addition to the indirect dependence through $\hat{w}_T$ (i.e. $\partial_\lambda f \ne 0$).
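The regularized fine-tuning loop above can be sketched in a few lines (ours; a toy quadratic task loss replaces the neural network, and $\lambda = (c, \mu)$ plays the role of the meta-learned hyperparameters):

```python
import numpy as np

def grad_inner(w, center, mu, grad_loss):
    """Gradient of the regularized lower-level objective for one task:
       g(w, lam; xi) = loss(w; xi) + 0.5 * mu * ||w - center||^2,
    where lam = (center, mu) are the meta-learned hyperparameters."""
    return grad_loss(w) + mu * (w - center)

def adapt(center, mu, grad_loss, alpha=0.1, T=100):
    """Task-specific fine-tuning: T steps of gradient descent from the center."""
    w = center.copy()                  # warm-start at the meta-learned center
    for _ in range(T):
        w = w - alpha * grad_inner(w, center, mu, grad_loss)
    return w

# Toy task loss 0.5 * ||w - target||^2 (gradient: w - target) stands in for
# the N-way cross-entropy loss of the network.
target = np.array([2.0, -1.0])
w_task = adapt(center=np.zeros(2), mu=1.0, grad_loss=lambda w: w - target)
# With mu = 1 and center = 0, the regularized minimizer is target / 2:
# the regularizer pulls the task solution halfway back toward the center.
```

Meta learning then differentiates through this adaptation loop (with full or truncated RMD) to update $c$ and $\mu$.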
We use the Omniglot dataset [31] and a neural network similar to the one used in [20], with small modifications. Please refer to Appendix G.3 for more details about the model and the data splits. We fix the horizon $T$ and optimize over the hyperparameter $\lambda = (c, \mu)$. The average accuracy of each model is evaluated over randomly sampled training and validation sets from the meta-testing dataset. For comparison, we also try using a very short horizon $T$, which is common in recent work on few-shot learning [20].
The statistics are shown in Table 4 and the learning curves in Figure 7. In addition to saving memory, all truncated methods are faster than full RMD, sometimes even five times faster. These results suggest that running few-step backpropagation with more hyper-iterations can be more efficient than full RMD. To support this hypothesis, we also ran $K$-RMD with small $K$ for an especially large number of hyper-iterations. Even with this many hyper-iterations, the total runtime is less than that of full RMD, and the results are significantly improved. We also find that while using a short horizon is faster, it achieves a lower accuracy at the same number of iterations.
Finally, we verify some of our theorems in practice. Figure 7 (fourth plot) shows that when the lower-level problem is regularized, the relative error between the $K$-RMD approximate gradient and the exact gradient decays exponentially as $K$ increases. This was guaranteed by Proposition 3.1. However, this exponential decay is not seen for the non-regularized model ($\mu = 0$). This suggests that the local strong convexity assumption is essential in order to have exponential decay in practice. Figure 7 (third plot) shows the cosine similarity between the inexact gradient and the full gradient over the course of meta-training. Note that the cosine similarity measures are always positive, indicating that the inexact gradients are indeed descent directions. It also seems that the cosine similarities show a slight decay over time.
Method  Accuracy  Iter.  Sec/iter.

$K$-RMD
$K$-RMD
$K$-RMD
Full RMD
$K$-RMD
$K$-RMD
Short horizon
5 Conclusion
We analyze $K$-RMD, a first-order heuristic for solving bilevel optimization problems in which the lower-level optimization is itself approximated in an iterative way. We show that $K$-RMD is a valid alternative to full RMD from both theoretical and empirical standpoints. Theoretically, we identify sufficient conditions under which the hyperparameters converge to an approximate or exact stationary point of the upper-level objective. The key observation is that when $\hat{w}_T$ is near a strict local minimum of the lower-level objective, the gradient approximation error decays exponentially with the reverse depth $K$. Empirically, we explore the properties of this optimization method with four proof-of-concept experiments. We find that although exact convergence appears to be uncommon in practice, the performance of $K$-RMD is close to full RMD in terms of application-specific metrics (such as generalization error). It is also roughly twice as fast. These results suggest that in hyperparameter optimization or meta learning applications with memory constraints, truncated backpropagation is a reasonable choice.
Our experiments use a modest number of parameters $w$, hyperparameters $\lambda$, and a modest horizon length $T$. This is because we need to be able to compute both $K$-RMD and full RMD in order to compare their performance. One promising direction for future research is to use $K$-RMD for bilevel optimization problems that require powerful function approximators at both levels of optimization. Truncated backpropagation makes this approach feasible and enables comparing bilevel optimization to other meta-learning methods on difficult benchmarks.
References
 Domke [2012] Justin Domke. Generic methods for optimizationbased modeling. In Artificial Intelligence and Statistics, pages 318–326, 2012.
 Franceschi et al. [2017a] Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. A bridge between hyperparameter optimization and learning-to-learn. arXiv preprint arXiv:1712.06283, 2017a.
 Bergstra and Bengio [2012] James Bergstra and Yoshua Bengio. Random search for hyperparameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
 Srinivas et al. [2010] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, 2010.
 Snoek et al. [2012] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
 Pedregosa [2016] Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pages 737–746, 2016.
 Gould et al. [2016] Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, and Edison Guo. On differentiating parameterized argmin and argmax problems with application to bilevel optimization. arXiv preprint arXiv:1607.05447, 2016.
 Maclaurin et al. [2015] Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradientbased hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pages 2113–2122, 2015.
 Franceschi et al. [2017b] Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning, 2017b.
 Luketina et al. [2016] Jelena Luketina, Mathias Berglund, Klaus Greff, and Tapani Raiko. Scalable gradientbased tuning of continuous regularization hyperparameters. In International Conference on Machine Learning, pages 2952–2960, 2016.
 Baydin et al. [2018] Atilim Gunes Baydin, Robert Cornish, David Martinez Rubio, Mark Schmidt, and Frank Wood. Online learning rate adaptation with hypergradient descent. In International Conference on Learning Representations, 2018.
 Larsen et al. [1996] Jan Larsen, Lars Kai Hansen, Claus Svarer, and M Ohlsson. Design and regularization of neural networks: the optimal use of a validation set. In Neural Networks for Signal Processing [1996] VI. IEEE Signal Processing Society Workshop, pages 62–71. IEEE, 1996.
 Bengio [2000] Yoshua Bengio. Gradientbased optimization of hyperparameters. Neural computation, 12(8):1889–1900, 2000.
 Chen et al. [2014] Yunjin Chen, Rene Ranftl, and Thomas Pock. Insights into analysis operator learning: From patchbased sparse models to higher order mrfs. IEEE Transactions on Image Processing, 23(3):1060–1072, 2014.
 Caruana [1998] Rich Caruana. Multitask learning. In Learning to learn, pages 95–133. Springer, 1998.

Ranjan et al. [2017]
Rajeev Ranjan, Vishal M Patel, and Rama Chellappa.
Hyperface: A deep multitask learning framework for face detection, landmark localization, pose estimation, and gender recognition.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.  FeiFei et al. [2006] Li FeiFei, Rob Fergus, and Pietro Perona. Oneshot learning of object categories. IEEE transactions on pattern analysis and machine intelligence, 28(4):594–611, 2006.
 Ravi and Larochelle [2017] Sachin Ravi and Hugo Larochelle. Optimization as a model for fewshot learning. In International Conference on Learning Representations, 2017.
 Snell et al. [2017] Jake Snell, Kevin Swersky, and Richard S Zemel. Prototypical networks for fewshot learning. In Advances in Neural Information Processing Systems, 2017.
 Finn et al. [2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Modelagnostic metalearning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML), 2017.
 Andrychowicz et al. [2016] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
 Li and Malik [2017] Ke Li and Jitendra Malik. Learning to optimize neural nets. arXiv preprint arXiv:1703.00441, 2017.
 Baydin et al. [2017] Atilim Gunes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18:153:1–153:43, 2017.
 Hascoet and ArayaPolo [2006] Laurent Hascoet and Mauricio ArayaPolo. Enabling userdriven checkpointing strategies in reversemode automatic differentiation. arXiv preprint cs/0606042, 2006.
 Hazan et al. [2016] Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(34):157–325, 2016.
 Roth and Black [2005] Stefan Roth and Michael J Black. Fields of experts: A framework for learning image priors. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pages 860–867. IEEE, 2005.
 Evgeniou et al. [2005] Theodoros Evgeniou, Charles A Micchelli, and Massimiliano Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6(Apr):615–637, 2005.
 Rudin et al. [1964] Walter Rudin et al. Principles of mathematical analysis, volume 3. McGrawhill New York, 1964.
 Shewchuk et al. [1994] Jonathan Richard Shewchuk et al. An introduction to the conjugate gradient method without the agonizing pain, 1994.
 LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Lake et al. [2015] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Humanlevel concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
 Horn and Johnson [1990] Roger A Horn and Charles R Johnson. Matrix analysis. Cambridge university press, 1990.
 Kingma and Ba [2015] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
 He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, LiJia Li, Kai Li, and Li FeiFei. Imagenet: A largescale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
References
 Domke [2012] Justin Domke. Generic methods for optimization-based modeling. In Artificial Intelligence and Statistics, pages 318–326, 2012.
 Franceschi et al. [2017a] Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. A bridge between hyperparameter optimization and learning-to-learn. arXiv preprint arXiv:1712.06283, 2017a.
 Bergstra and Bengio [2012] James Bergstra and Yoshua Bengio. Random search for hyperparameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
 Srinivas et al. [2010] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, 2010.
 Snoek et al. [2012] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.
 Pedregosa [2016] Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pages 737–746, 2016.
 Gould et al. [2016] Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, and Edison Guo. On differentiating parameterized argmin and argmax problems with application to bilevel optimization. arXiv preprint arXiv:1607.05447, 2016.
 Maclaurin et al. [2015] Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pages 2113–2122, 2015.
 Franceschi et al. [2017b] Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning, 2017b.
 Luketina et al. [2016] Jelena Luketina, Mathias Berglund, Klaus Greff, and Tapani Raiko. Scalable gradient-based tuning of continuous regularization hyperparameters. In International Conference on Machine Learning, pages 2952–2960, 2016.
 Baydin et al. [2018] Atilim Gunes Baydin, Robert Cornish, David Martinez Rubio, Mark Schmidt, and Frank Wood. Online learning rate adaptation with hypergradient descent. In International Conference on Learning Representations, 2018.
 Larsen et al. [1996] Jan Larsen, Lars Kai Hansen, Claus Svarer, and M Ohlsson. Design and regularization of neural networks: the optimal use of a validation set. In Neural Networks for Signal Processing [1996] VI. IEEE Signal Processing Society Workshop, pages 62–71. IEEE, 1996.
 Bengio [2000] Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural Computation, 12(8):1889–1900, 2000.
 Chen et al. [2014] Yunjin Chen, Rene Ranftl, and Thomas Pock. Insights into analysis operator learning: From patch-based sparse models to higher-order MRFs. IEEE Transactions on Image Processing, 23(3):1060–1072, 2014.
 Caruana [1998] Rich Caruana. Multitask learning. In Learning to learn, pages 95–133. Springer, 1998.

 Ranjan et al. [2017] Rajeev Ranjan, Vishal M Patel, and Rama Chellappa. HyperFace: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
 Fei-Fei et al. [2006] Li Fei-Fei, Rob Fergus, and Pietro Perona. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594–611, 2006.
 Ravi and Larochelle [2017] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017.
 Snell et al. [2017] Jake Snell, Kevin Swersky, and Richard S Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 2017.
 Finn et al. [2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML), 2017.
 Andrychowicz et al. [2016] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
 Li and Malik [2017] Ke Li and Jitendra Malik. Learning to optimize neural nets. arXiv preprint arXiv:1703.00441, 2017.
 Baydin et al. [2017] Atilim Gunes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18:153:1–153:43, 2017.
 Hascoet and Araya-Polo [2006] Laurent Hascoet and Mauricio Araya-Polo. Enabling user-driven checkpointing strategies in reverse-mode automatic differentiation. arXiv preprint cs/0606042, 2006.
 Hazan et al. [2016] Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
 Roth and Black [2005] Stefan Roth and Michael J Black. Fields of experts: A framework for learning image priors. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pages 860–867. IEEE, 2005.
 Evgeniou et al. [2005] Theodoros Evgeniou, Charles A Micchelli, and Massimiliano Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6(Apr):615–637, 2005.
 Rudin et al. [1964] Walter Rudin et al. Principles of mathematical analysis, volume 3. McGraw-Hill, New York, 1964.
 Shewchuk et al. [1994] Jonathan Richard Shewchuk et al. An introduction to the conjugate gradient method without the agonizing pain, 1994.
 LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Lake et al. [2015] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
 Horn and Johnson [1990] Roger A Horn and Charles R Johnson. Matrix analysis. Cambridge University Press, 1990.
 Kingma and Ba [2015] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
 He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
Appendix
Appendix A Proof of Proposition 3.1
See Proposition 3.1.
Proof.
Let . By definition of ,
Therefore, when is locally strongly convex with respect to in the neighborhood of ,
Suppose is smooth but nonconvex. In the worst case, if the smallest eigenvalue of is , then for . This gives the bound in (9). However, if is globally strongly convex, then
The bound in (10) uses the fact that . ∎
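Since the displayed inequalities above did not survive extraction, it may help to recall the generic local strong convexity fact that arguments of this type rely on. It is stated here in hypothetical notation ($g$ for the lower-level objective, $w^*$ for its local minimizer, $\mu > 0$ for the strong convexity constant) and need not match the authors' exact statement:

```latex
% If g is \mu-strongly convex in a neighborhood of its local minimizer w^*,
% then \nabla g(w^*) = 0, so for w in that neighborhood
\[
  \langle \nabla g(w),\, w - w^* \rangle \;\ge\; \mu\, \lVert w - w^* \rVert^2 ,
\]
% and by the Cauchy--Schwarz inequality the gradient norm controls the
% distance to the minimizer:
\[
  \lVert \nabla g(w) \rVert \;\ge\; \mu\, \lVert w - w^* \rVert .
\]
```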
Appendix B Proof of Lemma 3.2
See 3.2
Proof.
To illustrate the idea, we prove here the case where ; similar steps can be applied for . To prove the statement, we first expand the inner product by its definition
where we recall as by assumption.
Next we show a technical lemma, which provides a critical tool to bound the second term above; its proof is given in the next section.
Lemma B.1.
Let be strongly convex and smooth. Assume and are Lipschitz continuous in , and assume has full column rank. For ,
By Lemma B.1, we can then write
Because
() 
and is nonsingular by assumption,
for some , when is large enough and is small enough. The implication holds because . ∎
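The nonsingularity assumption above is typically used through the following standard linear algebra bound, stated in generic notation (the matrix $M$ here is hypothetical; this is a standard fact rather than necessarily the authors' exact step):

```latex
% For a nonsingular matrix M, the smallest singular value is positive,
% so M cannot annihilate a nonzero vector:
\[
  \lVert M x \rVert \;\ge\; \sigma_{\min}(M)\, \lVert x \rVert ,
  \qquad \sigma_{\min}(M) > 0 ,
\]
% hence \lVert M x \rVert > 0 for every x \neq 0.
```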
B.1 Proof of Lemma B.1
Proof.
Let and be the Lipschitz constants of and , respectively. First, we see that the inner product can be lower bounded by the following terms
where
The above lower bounds can be shown by the following inequalities: