1 Introduction
Metalearning (Schmidhuber, 1987; Thrun and Pratt, 1998) aims to learn the learning process itself over a task distribution. Many gradient-based metalearning approaches assume a set of parameters that do not participate in inner-optimization (Lee and Choi, 2018; Flennerhag et al., 2020; Raghu et al., 2020), which can be seen as hyperparameters. These hyperparameters are important in helping the inner-learner converge faster and generalize better. As they are usually very high-dimensional, such as element-wise learning rates (Li et al., 2017), we cannot metalearn them with simple hyperparameter optimization (HO) techniques such as random search (Bergstra and Bengio, 2012) or Bayesian optimization (Snoek et al., 2012) due to the overly extensive search space.
In this case, we can use gradient-based HO methods that directly optimize the high-dimensional hyperparameters by minimizing the validation loss w.r.t. the hyperparameters (Bengio, 2000). Due to the expensive computational cost of evaluating hypergradients (i.e., gradients w.r.t. the hyperparameters), there have been many efforts to improve the effectiveness and efficiency of these algorithms. Unfortunately, none of the existing algorithms simultaneously satisfy the following criteria that should be met for practical use: 1) scalability to the hyperparameter dimension, 2) online optimization, 3) memory efficiency, and 4) avoidance of short-horizon bias. See Table 1 for a comparison of existing gradient-based HO algorithms against these four criteria.
Forward-Mode Differentiation (FMD) (Franceschi et al., 2017) in Table 1 is an algorithm that forward-propagates Jacobians (i.e., derivatives of the update function) from the first to the last step, which is analogous to real-time recurrent learning (RTRL) (Williams and Zipser, 1989) in recurrent neural networks. FMD allows online optimization (i.e., updating hyperparameters at every inner-step) with the intermediate Jacobians and also computes the hypergradients over the entire horizon. However, a critical limitation is that its time and space complexity increase linearly w.r.t. the hyperparameter dimension. Thus, we cannot use FMD for solving many practical metalearning problems that come with millions of hyperparameters, which is the main problem we tackle in this paper.

Secondly, Reverse-Mode Differentiation (RMD) (Maclaurin et al., 2015) backpropagates the Jacobian-vector products (JVPs) from the last to the initial step, which is structurally identical to backpropagation through time (BPTT) (Werbos, 1990)
. RMD is scalable to the hyperparameter dimension, but its space complexity increases linearly w.r.t. the horizon length (i.e., the number of inner-gradient steps used to compute the hypergradient). It is possible to reduce the memory burden by checkpointing some of the previous weights and interpolating between them to approximate the trajectory (Fu et al., 2016). However, RMD and its variants are not scalable for online optimization. This is because, unlike FMD, they do not retain the intermediate Jacobians and thus need to recompute the whole second-order term for every online HO step.

Thirdly, algorithms based on the Implicit Function Theorem (IFT) are applicable to high-dimensional HO (Bengio, 2000; Pedregosa, 2016). Under the assumption that the main model parameters have converged, the best-response Jacobian, i.e., how the converged model parameters change w.r.t. the hyperparameters, can be expressed using only information available at the last step, such as the inverse Hessian at convergence. Thus, we do not have to explicitly unroll the previous update steps. Due to the heavy cost of computing the inverse-Hessian-vector product, Lorraine et al. (2020a) propose to approximate it with an iterative method, which works well for high-dimensional HO problems. However, it is still not straightforward to use this method for online optimization because of the convergence assumption; computing hypergradients before convergence does not guarantee their quality.
To our knowledge, short-horizon approximations such as one-step lookahead (1-step in Table 1) (Luketina et al., 2016) are the only existing methods that fully support online optimization while remaining scalable to the hyperparameter dimension. One-step lookahead computes hypergradients only over a single update step and ignores the past learning trajectory, which is computationally efficient as only a single JVP is computed per online HO step. However, this approximation suffers from short-horizon bias (Wu et al., 2018) by definition.
In this paper, we propose a novel HO algorithm that simultaneously satisfies all the aforementioned criteria for practical HO. The key idea is to distill the entire second-order term into a single JVP. As a result, we only need to compute a single JVP for each online HO step, and at the same time the distilled JVP can consider a longer horizon than short-horizon approximations such as one-step lookahead or the first-order method. We summarize the contributions of this paper as follows:

We propose HyperDistill, a novel HO algorithm that satisfies the aforementioned four criteria for practical HO, each of which is crucial for an HO algorithm to be applicable to current metalearning frameworks.

We show how to efficiently distill the second-order term of the hypergradient into a single JVP.

We empirically demonstrate that our algorithm converges faster and provides better generalization performance at convergence, with three recent metalearning models and on two benchmark image datasets.
2 Related Work
Hyperparameter optimization
When the hyperparameter dimension is small (e.g. less than ), random search (Bergstra and Bengio, 2012) or Bayesian optimization (Snoek et al., 2012) works well. However, when the hyperparameter is high-dimensional, gradient-based HO is often preferred since random or Bayesian search can become infeasible. Some of the best-known gradient-based HO methods are based on the Implicit Function Theorem, and compute or approximate the inverse Hessian only at convergence. Bengio (2000) computes the exact inverse Hessian, and Luketina et al. (2016) approximate the inverse Hessian with the identity matrix, which is identical to the one-step lookahead approximation. Pedregosa (2016) approximates the inverse Hessian with the conjugate gradient (CG) method. Lorraine et al. (2020b) propose a Neumann approximation, which is numerically more stable than the CG approximation. On the other hand, Domke (2012) proposes unrolled differentiation for solving bilevel optimization, and Shaban et al. (2019) analyze truncated unrolled differentiation, which is computationally more efficient. Unrolled differentiation can be further categorized into forward (FMD) and reverse mode (RMD) (Franceschi et al., 2017). FMD is more suitable for optimizing low-dimensional hyperparameters (Im et al., 2021; Micaelli and Storkey, 2020), but RMD is more scalable to the hyperparameter dimension. Maclaurin et al. (2015) propose a more memory-efficient RMD, which reverses the SGD trajectory with momentum. Fu et al. (2016) further reduce the memory burden of RMD by approximating the learning trajectory with linear interpolation. Luketina et al. (2016) can also be understood as a short-horizon approximation of RMD for online optimization. Our method also supports online optimization, but the critical difference is that our algorithm can alleviate the short-horizon bias (Wu et al., 2018). RMD is basically a type of backpropagation and is available in deep learning libraries
(Grefenstette et al., 2020).

Metalearning
Metalearning (Schmidhuber, 1987; Thrun and Pratt, 1998) aims to learn a model that generalizes over a distribution of tasks (Vinyals et al., 2016; Ravi and Larochelle, 2017). While there exists a variety of approaches, in this paper we focus on gradient-based metalearning (Finn et al., 2017b), especially methods with high-dimensional hyperparameters that do not participate in inner-optimization. For instance, there have been many attempts to precondition the inner-gradients for faster inner-optimization, either by warping the parameter space with warp layers interleaved between consecutive layers (Lee and Choi, 2018; Flennerhag et al., 2020), or by directly modulating the inner-gradients with a diagonal (Li et al., 2017) or block-diagonal matrix (Park and Oliva, 2019). Perturbation functions are another form of hyperparameter that helps the inner-learner generalize better (Lee et al., 2020; Ryu et al., 2020; Tseng et al., 2020). It is also possible to let the whole feature extractor be a hyperparameter and only adapt the last fully-connected layer (Raghu et al., 2020). On the other hand, some of the metalearning literature does not assume a task distribution, but tunes hyperparameters with a held-out validation set, similarly to the conventional HO setting. In this case, the one-step lookahead method (Luketina et al., 2016) is mostly used for scalable online HO, in the context of domain generalization (Li et al., 2018), handling class imbalance (Ren et al., 2018; Shu et al., 2019), gradient-based neural architecture search (Liu et al., 2019), and tuning the coefficient of a norm-based regularizer (Balaji et al., 2018). Although we mainly focus on the metalearning setting in this work, whose goal is to transfer knowledge through a task distribution, it is straightforward to apply our method to any conventional HO problem.
3 Background
In this section, we first introduce RMD and its approximations for efficient computation. We then introduce our novel algorithm that supports high-dimensional online HO over the entire horizon.
3.1 Hyperparameter unrolled differentiation
We first introduce our notation. Throughout this paper, we refer to as the weight and as the hyperparameter. The series of weights evolves with the update function over steps . The function takes the previous weight and the hyperparameter as inputs, and its form depends on the current minibatch . Note that are functions of the hyperparameter . The question is how to find a good hyperparameter that yields a good response at the last step. In gradient-based HO, we find the optimal by minimizing the validation loss as a function of .
(1) 
Note that we let the loss function itself be modulated by the hyperparameter for generality. According to the chain rule, the hypergradient decomposes into
(2) 
On the right-hand side, the first-order (FO) term directly computes the gradient w.r.t. by fixing . The second-order (SO) term computes the indirect effect of through the response . can be easily computed similarly to , but the response Jacobian is computationally more challenging, as it unrolls into the following form.
(3) 
Eq. (3) involves the Jacobians and at the intermediate steps. Evaluating them or their vector products is computationally expensive in terms of either time (FMD) or space (FMD, RMD) (Franceschi et al., 2017). Therefore, how to approximate Eq. (3) is the key to developing an efficient and effective HO algorithm.
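As a concrete illustration of Eq. (2)-(3), the following sketch accumulates the response Jacobian forward in time on a deliberately tiny toy problem (a scalar weight, a scalar hyperparameter, and a quadratic inner loss, all chosen by us for illustration; none of these choices come from the paper) and checks the resulting hypergradient against finite differences:

```python
# Toy illustration of Eq. (2)-(3): hypergradient via the unrolled response
# Jacobian. The inner problem below is our own illustrative choice:
#   inner loss  L_tr(w; lam) = 0.5 * (w - lam)^2, so one SGD step is
#   Phi(w, lam) = w - eta * (w - lam)
#   val loss    L_val(w) = 0.5 * (w - 1.0)^2   (no direct lam dependence)
eta, T = 0.1, 20

def inner_run(lam, w0=0.0):
    """Run T inner steps, accumulating dw_t/dlam forward in time,
    i.e. the unrolled product of step Jacobians in Eq. (3)."""
    w, dw_dlam = w0, 0.0
    for _ in range(T):
        A = 1.0 - eta              # dPhi/dw at this step
        B = eta                    # dPhi/dlam at this step
        dw_dlam = A * dw_dlam + B  # chain rule through one update
        w = w - eta * (w - lam)
    return w, dw_dlam

def hypergrad(lam):
    w_T, dw_dlam = inner_run(lam)
    dL_dw = w_T - 1.0              # dL_val/dw; the first-order term is 0 here
    return dL_dw * dw_dlam         # second-order term of Eq. (2)

# Sanity check against a central finite difference of the validation loss.
lam0, eps = 0.3, 1e-5
g = hypergrad(lam0)
L_val = lambda lam: 0.5 * (inner_run(lam)[0] - 1.0) ** 2
g_fd = (L_val(lam0 + eps) - L_val(lam0 - eps)) / (2 * eps)
assert abs(g - g_fd) < 1e-6
```

Carrying `dw_dlam` forward like this is exactly the FMD strategy; its cost scales with the hyperparameter dimension, which is why FMD breaks down with millions of hyperparameters.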
3.2 Reverse-mode differentiation and its approximations
Basically, RMD is structurally analogous to backpropagation through time (BPTT) (Werbos, 1990). In RMD, we first obtain and then backpropagate and from the last to the first step in the form of JVPs (see Algorithm 1). Whereas RMD is much faster than FMD, as we only need to compute one or two JVPs per step, it usually requires storing all the previous weights to compute the previous-step JVPs, unless we consider reversible training with a momentum optimizer (Maclaurin et al., 2015). Therefore, when is high-dimensional, RMD is only applicable to short-horizon problems such as few-shot learning (e.g. in Finn et al. (2017a)).
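The backward sweep can be sketched on a tiny quadratic inner problem (our own illustrative setup, not the paper's): the same hypergradient comes out of a reverse pass of JVPs, at the price of storing the whole weight trajectory:

```python
# Reverse-mode sketch in the spirit of Algorithm 1, on the toy update
# Phi(w, lam) = w - eta*(w - lam): the step Jacobians are A_t = 1 - eta
# (w.r.t. w) and B_t = eta (w.r.t. lam). Setup is ours, for illustration.
eta, T, lam = 0.1, 20, 0.3

# Forward pass: store the whole trajectory (the memory cost RMD pays).
ws = [0.0]
for _ in range(T):
    ws.append(ws[-1] - eta * (ws[-1] - lam))

# Backward pass: propagate the JVP from the last step to the first.
alpha = ws[-1] - 1.0                  # dL_val/dw at w_T, L_val = 0.5*(w-1)^2
hypergrad = 0.0
for t in reversed(range(T)):
    hypergrad += alpha * eta          # contribution through B_t
    alpha = alpha * (1.0 - eta)       # pull the JVP back through A_t

# For this linear toy, dw_T/dlam has the closed form 1 - (1-eta)^T.
expected = (ws[-1] - 1.0) * (1.0 - (1.0 - eta) ** T)
assert abs(hypergrad - expected) < 1e-9
```

In this scalar toy the Jacobians are constants, so storage is trivial; with high-dimensional weights the stored trajectory `ws` is exactly what makes RMD memory-hungry over long horizons.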
Trajectory approximation.
Instead of storing all the previous weights for computing and , we can approximate the learning trajectory by linearly interpolating between the last weight and the initial weight . Algorithm 2 illustrates this procedure, called DrMAD (Fu et al., 2016), where each intermediate weight is approximated by for . and are also approximated by and , respectively. However, although DrMAD dramatically lowers the space complexity, it does not reduce the number of JVPs per hypergradient step. For each online HO step we need to compute JVPs, so the total number of JVPs to complete a single trajectory accumulates up to , which is clearly not scalable for an online optimization algorithm.
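A minimal sketch of the interpolation trick (function and variable names are ours):

```python
# DrMAD's trajectory approximation: keep only w_0 and w_T and reconstruct
# every intermediate weight as w_t ≈ (1 - t/T) * w_0 + (t/T) * w_T,
# so the stored-weight cost is O(1) in the horizon length.
import numpy as np

def drmad_trajectory(w0, wT, T):
    """Return T+1 linearly interpolated points standing in for w_0..w_T."""
    return [(1.0 - t / T) * np.asarray(w0) + (t / T) * np.asarray(wT)
            for t in range(T + 1)]

w0, wT = np.zeros(3), np.array([1.0, 2.0, 3.0])
traj = drmad_trajectory(w0, wT, T=4)
assert np.allclose(traj[0], w0) and np.allclose(traj[-1], wT)
assert np.allclose(traj[2], 0.5 * wT)  # the midpoint lies halfway
```

The backward JVP sweep then runs over these reconstructed points instead of the true trajectory, trading exactness for constant memory.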
Short-horizon approximations.
One-step lookahead approximation (Luketina et al., 2016) is currently one of the most popular high-dimensional online HO methods that avoids computing an excessive number of JVPs (Li et al., 2018; Ren et al., 2018; Shu et al., 2019; Liu et al., 2019; Balaji et al., 2018). The idea is very simple: for each online HO step we only consider the last step and ignore the rest of the learning trajectory for computational efficiency. Specifically, for each step we compute the hypergradient by viewing as constant, which yields (see Eq. (3)). Alternatively, we may completely ignore all the second-order derivatives, such that (Flennerhag et al., 2020; Ryu et al., 2020). While these approximations enable online HO at low cost, they are intrinsically vulnerable to short-horizon bias (Wu et al., 2018) by definition.
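The short-horizon bias is easy to exhibit numerically on a deliberately tiny quadratic inner problem (again an illustrative setup of our own choosing, not the paper's): the one-step estimate keeps the right sign but badly underestimates the magnitude of the true hypergradient over a long horizon:

```python
# Contrast of the two short-horizon approximations on the toy update
# Phi(w, lam) = w - eta*(w - lam):
#   first-order (FO):   drop the second-order term entirely
#   one-step lookahead: keep only the last step, dw_T/dlam ≈ B_{T-1} = eta
eta, T, lam = 0.1, 20, 0.3

w = 0.0
for _ in range(T):
    w = w - eta * (w - lam)

dL_dw = w - 1.0                       # dL_val/dw with L_val = 0.5*(w-1)^2
g_fo = 0.0                            # L_val has no direct lam term here
g_1step = dL_dw * eta                 # single JVP through the last step
g_full = dL_dw * (1.0 - (1.0 - eta) ** T)   # exact unrolled Jacobian

# Right sign, wrong magnitude: the short-horizon bias in miniature.
assert g_1step * g_full > 0
assert abs(g_1step) < abs(g_full)
```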
4 Approach
We next introduce our novel online HO method based on knowledge distillation. Our method can overcome all the aforementioned limitations at the same time.
4.1 Hypergradient distillation
The key idea is to distill the whole second-order term in Eq. (2) into a single JVP evaluated at a distilled weight point and with a distilled dataset . We denote the normalized JVP as with . Specifically, we want to solve the following knowledge distillation problem for each online HO step :
(4) 
so that we use instead of . Online optimization is now feasible because for each online HO step we only need to compute a single JVP, rather than the JVPs required by RMD or DrMAD. Also, unlike short-horizon approximations, the whole trajectory information is distilled into the JVP, alleviating the short-horizon bias (Wu et al., 2018).
Notice that solving Eq. (4) only w.r.t. is simply a vector projection.
(5) 
Then, plugging Eq. (5) into Eq. (4) and making use of , we can easily convert the optimization problem in Eq. (4) into the following equivalent problem (see Appendix A).
(6) 
Here, and match the hypergradient direction, while matches its size.
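The size part of this problem is just a vector projection, which the following sketch verifies numerically (all names are ours; `g` stands in for the second-order term and `u` for the normalized distilled JVP):

```python
# Sketch of the projection in Eq. (5): given the true second-order term g
# and a unit direction u (the normalized distilled JVP), the best scale is
# c* = <g, u>, which minimizes ||g - c*u|| over all scalars c.
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=8)                 # stand-in for the second-order term
u = rng.normal(size=8)
u /= np.linalg.norm(u)                 # normalized JVP direction

c_star = g @ u                         # closed-form projection coefficient
err_star = np.linalg.norm(g - c_star * u)

# No other scale does better, since ||g - c*u||^2 is quadratic in c
# with its minimum at c = <g, u>.
for c in (c_star - 0.5, c_star + 0.5, 0.0):
    assert err_star <= np.linalg.norm(g - c * u) + 1e-12
```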
Technical challenge.
4.2 Distilling the hypergradient direction
Hessian approximation.
To circumvent this technical difficulty, we start by simplifying the optimization objective in Eq. (5). We approximate as
(7) 
with , which we tune on a metavalidation set. Note that Eq. (7) is still too expensive to use for online optimization as it consists of JVPs. We thus need further distillation, which we explain later. Eq. (7) is simply a Hessian identity approximation. For instance, vanilla SGD with learning rate corresponds to . Approximating the Hessian as , we have . Plugging Eq. (7) into Eq. (5) and letting , we have
(8) 
where . Instead of maximizing directly, we now maximize w.r.t. and as a proxy objective.
Lipschitz continuity assumption.
Now we are ready to see how to distill the hypergradient direction and without evaluating . The important observation is that the maximum of in Eq. (8) is achieved when is well-aligned with the other . This intuition is directly related to the following Lipschitz continuity assumption on .
(9) 
where is the Lipschitz constant. Eq. (9) captures which can minimize over , which is equivalent to maximizing since . For the metric , we let , where are additional constants introduced for notational convenience. Squaring both sides of Eq. (9) and summing over all , we can easily derive the following lower bound of (see Appendix B).
(10) 
We now maximize this lower bound instead of directly maximizing . Interestingly, it corresponds to the following simple minimization problems for and .
(11) 
Efficient sequential update.
Eq. (11) tells us how to determine the distilled and for each HO step . Since is expensive to compute, we approximate it as , yielding the following weighted average as the approximate solution for .
(12) 
The following sequential update allows us to efficiently evaluate Eq. (12) for each HO step. Denoting , we have
(13) 
Note that the online update in Eq. (13) does not require evaluating . It only incorporates the past learning trajectory through the sequential updates. Therefore, the only additional cost is the memory for storing and updating the weighted running average .
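A small sketch of this running-average bookkeeping, under our reading of Eq. (12) as a decayed average of the trajectory (variable names and the exact weighting are our assumptions):

```python
# Sketch of the sequential update in Eq. (13): maintain a gamma-decayed
# running average of the inner weights so the distilled point can be
# updated in O(1) time and memory per step.
import numpy as np

def distilled_weights(ws, gamma):
    """Reference: normalized gamma-decayed average over the whole trajectory."""
    coeffs = np.array([gamma ** (len(ws) - 1 - s) for s in range(len(ws))])
    coeffs /= coeffs.sum()
    return sum(c * w for c, w in zip(coeffs, ws))

gamma = 0.9
rng = np.random.default_rng(1)
ws = [rng.normal(size=4) for _ in range(6)]   # stand-in inner trajectory

# Online version: numerator and normalizer updated sequentially.
num, den = np.zeros(4), 0.0
for w in ws:
    num = gamma * num + w
    den = gamma * den + 1.0
    # num / den equals the reference average over the steps seen so far

assert np.allclose(num / den, distilled_weights(ws, gamma))
```

With gamma near 0 the average collapses onto the latest weight (the one-step lookahead limit); larger gamma keeps more of the past trajectory.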
For , we have assumed the Euclidean distance metric with , but it is not straightforward to define a Euclidean distance between datasets. Instead, we simply interpret and as probabilities with which we proportionally subsample each dataset.
(14) 
where denotes random subsampling of instances from . There may be a better distance metric for datasets and a corresponding solution, but we leave this as future work. See Algorithm 3 for the overall description of our algorithm, which we name HyperDistill.
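One possible reading of this subsampling step, with decayed weights in the spirit of Eq. (12) reused as inclusion probabilities (the weighting and the normalization are our assumptions, sketched for illustration only):

```python
# Hypothetical sketch of Eq. (14): build the distilled dataset by randomly
# subsampling each past minibatch D_s with probability proportional to a
# gamma-decayed weight, so recent batches dominate the distilled set.
import random

def subsample_distilled(batches, gamma, seed=0):
    """batches: minibatches D_0..D_{t-1}; newer batches get higher
    inclusion probability under the gamma-decayed weighting."""
    rng = random.Random(seed)
    t = len(batches)
    weights = [gamma ** (t - 1 - s) for s in range(t)]
    z = max(weights)                    # normalize so the newest batch has p=1
    distilled = []
    for w, batch in zip(weights, batches):
        p = w / z
        distilled.extend(x for x in batch if rng.random() < p)
    return distilled

batches = [[f"b{s}_{i}" for i in range(100)] for s in range(5)]
d = subsample_distilled(batches, gamma=0.5)
# The most recent batch is kept in full; older batches are thinned out.
assert all(x in d for x in batches[-1])
assert sum(x.startswith("b0") for x in d) < sum(x.startswith("b4") for x in d)
```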
Role of the decaying factor

Note that Eq. (12) reveals the role of as a decaying factor: the larger the , the longer the portion of the past learning trajectory we consider. In this sense, our method is a generalization of the one-step lookahead approximation, i.e. , which yields and , ignoring all information about the past learning trajectory except the last step. may be too pessimistic in most cases, so we need to carefully find a better-performing for each task.
4.3 Distilling the hypergradient size
Now we need to plug the distilled and into in Eq. (5) to obtain the scaling factor for online HO steps . However, whereas evaluating the single JVP is tolerable, evaluating defeats the purpose, as it is the very target we aim to approximate. Also, it is not straightforward to apply a trick similar to the one we used in Sec. 4.2.
Linear estimator.
We thus introduce a linear function that estimates by periodically fitting , the parameter of the estimator. Then, for each HO step , we can use instead of fully evaluating . Based on the observation that the lower bound in Eq. (10) is roughly proportional to , we conveniently set to as follows:

(15) 
Collecting samples.
We next see how to collect samples for fitting the parameter , where and . For this, we need to efficiently collect:

, the second-order term computed over the horizon of size .

, the distilled JVP computed over the horizon of size .
1) : Note that DrMAD in Algorithm 2 (line 6) sequentially backpropagates for . The important observation is that, at step , this incomplete second-order term can be seen as a valid second-order term computed over a horizon of size . This is because the reparameterization gives
(16) 
for , which simply shifts the trajectory index by steps so that the last step is always . Therefore, we can efficiently obtain a valid second-order term for all through a single backward pass along the interpolated trajectory (see Figure 1). Each requires only one or two additional JVPs. Also, as we use DrMAD instead of RMD, we only store , so the memory cost is constant w.r.t. the total horizon size .
Estimating .
Now we are ready to estimate . For , we have and collect . For , we have and collect . Finally, we estimate . See Algorithm 3 and Algorithm 4 for details. In practice, we set EstimationPeriod in Algorithm 3 to every completions of the inner-optimizations, i.e. . Thus, the computational cost of LinearEstimation is marginal in terms of wall-clock time (see Table 3).
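The size estimator itself reduces to a one-parameter least-squares fit, sketched below on synthetic samples (the data-generating line and all names are ours, purely for illustration):

```python
# Sketch of the linear size estimator of Sec. 4.3: periodically collect
# sample pairs (x_i, y_i) -- here, y_i stands in for the scale of a valid
# second-order term and x_i for the matching feature from Eq. (15) -- and
# fit a scalar theta by least squares through the origin.
import numpy as np

def fit_theta(xs, ys):
    xs, ys = np.asarray(xs), np.asarray(ys)
    return float(xs @ ys / (xs @ xs))   # closed-form 1-D least squares

def estimate_size(theta, x):
    return theta * x                    # cheap stand-in for evaluating c_t

rng = np.random.default_rng(2)
xs = rng.uniform(0.5, 2.0, size=32)
ys = 3.0 * xs + rng.normal(scale=0.01, size=32)   # nearly linear samples

theta = fit_theta(xs, ys)
assert abs(theta - 3.0) < 0.1           # recovers the underlying slope
```

Because the fit is a single closed-form division, refreshing `theta` every few inner-optimizations adds essentially no overhead, consistent with the marginal wall-clock cost reported in Table 3.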
5 Experiments
Baselines.
We demonstrate the efficacy of our algorithm by comparing to the following baselines.
1) First-Order Approximation (FO). Computationally the most efficient HO algorithm, which completely ignores the second-order term, i.e. . 2) One-step Lookahead Approximation (1-step) (Luketina et al., 2016). A short-horizon approximation where only a single step is unrolled to compute each hypergradient. 3) DrMAD (Fu et al., 2016). An approximation of RMD that linearly interpolates between the initial and the last weight to save memory (see Algorithm 2). 4) Neumann IFT (N.IFT) (Lorraine et al., 2020a). An IFT-based method that approximates the inverse-Hessian-vector product with a Neumann series. Note that this method supports online optimization around convergence. Specifically, among the total inner-steps, N.IFT means that for the last steps we perform online HO, each with inversion steps. It requires total JVPs. We tune among , roughly computing JVPs per inner-optimization. 5) HyperDistill. Our high-dimensional online HO algorithm based on the idea of knowledge distillation. We tune the decaying factor within . The linear regression is done every inner-optimization problems. We report the mean and 95% confidence intervals over 5 metatraining runs.
Target metalearning models.
We test on the following three metalearning models.
1) Almost No Inner Loop (ANIL) (Raghu et al., 2020). The intuition of ANIL is that the need for task-specific adaptation diminishes when the task distribution is homogeneous. Following this intuition, based on a typical layer convolutional network with channels (Finn et al., 2017a), we designate the three bottom layers as the high-dimensional hyperparameter, and the 4th convolutional layer and the last fully-connected layer as the weight, similarly to Javed and White (2019).
2) WarpGrad (Flennerhag et al., 2020). Secondly, we consider WarpGrad, whose goal is to metalearn nonlinear warp layers that facilitate fast inner-optimization and better generalization. We use a 3-layer convolutional network with 32 channels. Every layer is interleaved with two warp layers that do not participate in the inner-optimization; these constitute the high-dimensional hyperparameter.
3) MetaWeightNet (Shu et al., 2019). Lastly, we consider solving the label corruption problem with MetaWeightNet, which metalearns a small MLP that takes a 1D loss as input and outputs a reweighted loss. The parameters of the MLP are considered a high-dimensional hyperparameter. Labels are independently corrupted to random classes with probability . Note that we aim to metalearn the MLP over a task distribution and apply it to diverse unseen tasks, instead of solving a single task. Also, in this meta model the direct gradient is zero, . In this case, in HyperDistill amounts to nothing more than rescaling the learning rate, so we simply set .
Use of Reptile.
Note that for all the above metalearning models, we also metalearn the weight initialization with Reptile (Nichol et al., 2018), representing a more practical metalearning scenario than learning from random initialization. We use the Reptile learning rate . Note that in Algorithm 3 and Algorithm 4 denotes the Reptile initialization parameter.
Task distribution.
We consider the following two datasets. To generate each task, we randomly sample classes from each dataset and further randomly split them into training and test examples. 1) TinyImageNet (1). This dataset contains classes of general categories. We split them into , , and classes for metatraining, metavalidation, and metatest. Each class has examples of size . 2) CIFAR100 (Krizhevsky et al., 2009). This dataset contains classes of general categories. We split them into , , and classes for metatraining, metavalidation, and metatest. Each class has examples of size .
Experimental setup.
Metatraining: For inner-optimization of the weights, we use SGD with momentum and set the learning rate to for MetaWeightNet and for the others. The number of inner-steps is and the batch size is . We use random cropping and horizontal flipping as data augmentations. For the hyperparameter optimization, we also use SGD with momentum, with learning rate for MetaWeightNet and for the others, which we linearly decay toward over the total inner-optimizations. We perform parallel metalearning with the metabatch size set to . Metatesting: We solve tasks to measure average performance, with exactly the same inner-optimization setup as metatraining. We repeat this over different metatraining runs and report the mean and confidence intervals (see Table 2).
5.1 Analysis
We perform the following analysis with the WarpGrad model on the CIFAR100 dataset.
HyperDistill provides faster convergence and better generalization.
Figure 2 shows that HyperDistill converges much faster during metatraining than the baselines for all the metalearning models and datasets we considered. We see that the convergence of an offline method such as DrMAD is significantly worse than that of a simple first-order method, demonstrating the importance of frequent updates via online optimization. HyperDistill converges significantly better than FO and 1-step because it is online and at the same time alleviates the short-horizon bias. As a result, Table 2 shows that the metatest performance of HyperDistill is significantly better than that of the baselines, although it requires a comparable number of JVPs per inner-optimization.
HyperDistill is a reasonable approximation of the true hypergradient.
We see from Figure 3 that the hypergradient obtained from HyperDistill is more similar to the exact RMD hypergradient than those obtained from FO and 1-step, demonstrating that HyperDistill can actually alleviate the short-horizon bias. HyperDistill is even comparable to N.IFT, which computes JVPs, whereas HyperDistill computes only a single JVP. These results indicate that the approximations we used in Eq. (7) and DrMAD in Eq. (16) are accurate enough. Figure 3 shows that with careful tuning of (e.g. ), the direction of the approximated second-order term in Eq. (7) can be much more accurate than the second-order term of 1-step (). In Figure 3, as HyperDistill distills such a good approximation, it provides a better direction for the second-order term than 1-step. Although the gap may seem marginal, even N.IFT performs only similarly, showing that matching the direction of the second-order term without unrolling the full gradient steps is an inherently challenging problem. Figure 4 shows that the samples collected according to Algorithm 4 are largely linear, supporting our choice of Eq. (15). Figure 4 also shows that the range of the fitted is accurate and stable, explaining why we do not have to perform the estimation frequently. Note that the DrMAD approximation (Eq. (16)) is accurate (Figure 3), helping to predict the hypergradient size.
HyperDistill is computationally efficient.
Figure 4 shows the superior computational efficiency of HyperDistill in terms of the trade-off between metatest performance and the amount of JVP computation. Note that wall-clock time is roughly proportional to the number of JVPs per inner-optimization. In Appendix F, we can see that the actual increase in memory cost and wall-clock time is very marginal compared to the 1-step approximation.
6 Conclusion
In this work, we proposed a novel HO method, HyperDistill, that can optimize high-dimensional hyperparameters in an online manner. This is achieved by approximating the exact second-order term via knowledge distillation. We demonstrated that HyperDistill provides faster metaconvergence and better generalization performance on realistic metalearning methods and datasets. We also verified that this is thanks to the accuracy of the approximations we proposed.
References
 [1] Note: https://tinyimagenet.herokuapp.com/ Cited by: §5.
 MetaReg: towards domain generalization using metaregularization. In NeurIPS, Cited by: §2, §3.2.
 Gradientbased optimization of hyperparameters. Neural Computation 12 (8), pp. 1889–1900. Cited by: §1, §1, §2.
 Random search for hyperparameter optimization.. J. Mach. Learn. Res. 13. Cited by: §1, §2.
 Generic methods for optimizationbased modeling. In AISTATS, Cited by: §2.
 ModelAgnostic MetaLearning for Fast Adaptation of Deep Networks. In ICML, Cited by: §3.2, §5.
 Modelagnostic metalearning for fast adaptation of deep networks. In ICML, Cited by: §2.
 Metalearning with warped gradient descent. In ICLR, Cited by: §1, §2, §3.2, §5.
 Forward and reverse gradientbased hyperparameter optimization. In ICML, Cited by: §1, §2, §3.1.

 DrMAD: distilling reverse-mode automatic differentiation for optimizing hyperparameters of deep neural networks. In IJCAI, Cited by: §1, §2, §3.2, §5, Algorithm 2.
 Generalized inner loop metalearning. Cited by: §2.
 Online hyperparameter optimization by realtime recurrent learning. External Links: 2102.07813 Cited by: §2.
 Metalearning representations for continual learning. In NeurIPS, Cited by: §5.
 Learning Multiple Layers of features from Tiny Images. Cited by: §5.
 Meta Dropout: Learning to Perturb Latent Features for Generalization. In ICLR, Cited by: §2.
 Gradientbased metalearning with learned layerwise metric and subspace. In ICML, Cited by: §1, §2.
 Learning to generalize: metalearning for domain generalization. In AAAI, Cited by: §2, §3.2.
 Metasgd: learning to learn quickly for few shot learning. arXiv preprint arXiv:1707.09835. Cited by: §1, §2.
 DARTS: differentiable architecture search. In ICLR, Cited by: §2, §3.2.
 Optimizing millions of hyperparameters by implicit differentiation. In AISTATS, Cited by: §1, §5.
 Optimizing millions of hyperparameters by implicit differentiation. In AISTATS, Cited by: §2.
 Scalable gradientbased tuning of continuous regularization hyperparameters. In ICML, Cited by: Table 1, §1, §2, §2, §3.2, §5.
 Gradientbased hyperparameter optimization through reversible learning. In ICML, Cited by: §1, §2, §3.2.
 Nongreedy gradientbased hyperparameter optimization over long horizons. External Links: 2007.07869 Cited by: §2.
 On FirstOrder MetaLearning Algorithms. arXiv eprints. Cited by: §5.
 Metacurvature. In NeurIPS, Cited by: §2.
 Hyperparameter optimization with approximate gradient. In ICML, Cited by: §1, §2.
 Rapid learning or feature reuse? towards understanding the effectiveness of maml. In ICLR, Cited by: §1, §2, §5.
 Optimization as a model for fewshot learning. In ICLR, Cited by: §2.
 Learning to Reweight Examples for Robust Deep Learning. ICML. Cited by: §2, §3.2.
 MetaPerturb: transferable regularizer for heterogeneous tasks and architectures. In NeurIPS, Cited by: §2, §3.2.
 Evolutionary principles in selfreferential learning, or on learning how to learn: the metameta… hook. Ph.D. Thesis, Technische Universität München. Cited by: §1, §2.
 Truncated backpropagation for bilevel optimization. In AISTATS, Cited by: §2.
 MetaWeightNet: Learning an Explicit Mapping For Sample Weighting. In NeurIPS, Cited by: §2, §3.2, §5.

 Practical bayesian optimization of machine learning algorithms. In NIPS, Cited by: §1, §2.
 Learning to learn. Kluwer Academic Publishers, Norwell, MA, USA. External Links: ISBN 0792380479 Cited by: §1, §2.
 Crossdomain fewshot classification via learned featurewise transformation. In ICLR, Cited by: §2.
 Matching Networks for One Shot Learning. In NIPS, Cited by: §2.
 Backpropagation through time: what it does and how to do it. Proceedings of the IEEE. Cited by: §1, §3.2.
 A learning algorithm for continually running fully recurrent neural networks. Neural Computation. Cited by: §1.
 Understanding shorthorizon bias in stochastic metaoptimization. In ICLR, Cited by: §1, §2, §3.2, §4.1.
Appendix A Derivation of Equation (6)
Let and for notational simplicity. Note that . Then,
(19)  
Plugging this into Eq. (19) and using the assumption , we have
(20) 
Note that Eq. (20) already encodes the closed-form solution . Therefore, the above is a joint optimization, so we do not have to alternate between optimizing and .
Appendix B Derivation of Equation (10)
Let and for notational simplicity. Note that and we are given the following inequalities.
Squaring both sides and multiplying by ,
Summing the inequalities over all ,
Rearranging the terms,
Appendix C Metavalidation Performance
Appendix D Hyper-hyperparameter Analysis
Our algorithm, HyperDistill, has a hyper-hyperparameter that we tune with a metavalidation set in the range . Figure 6 shows that for all values of and all the experimental setups we consider, HyperDistill outperforms all the baselines by significant margins. This demonstrates that the performance of HyperDistill is not very sensitive to the value of .
Appendix E More Details of MetaWeightNet Experiments
For the additional experimental setup, note that we use the architecture recommended by the original paper. Also, we found that lower-bounding the output of the weighting function with stabilizes training. Figure 7 shows the resulting loss weighting function learned with each algorithm. We see that the weighting function learned with HyperDistill tends to output lower values than those of the baselines.
Appendix F Computational Efficiency
Table 3 shows the computational efficiency measured in actual memory usage and the average wall-clock time required to complete a single inner-optimization. We can see from the table that, whereas HyperDistill requires slightly more memory and wall-clock time than the 1-step or Neumann IFT methods, the additional cost is certainly tolerable considering the superior metatest performance shown in Table 2.