# Online Hyperparameter Meta-Learning with Hypergradient Distillation

Many gradient-based meta-learning methods assume a set of parameters that do not participate in inner-optimization, which can be considered as hyperparameters. Although such hyperparameters can be optimized using the existing gradient-based hyperparameter optimization (HO) methods, they suffer from the following issues: unrolled differentiation methods do not scale well to high-dimensional hyperparameters or long horizons, Implicit Function Theorem (IFT) based methods are restrictive for online optimization, and short horizon approximations suffer from short horizon bias. In this work, we propose a novel HO method that overcomes these limitations by approximating the second-order term with knowledge distillation. Specifically, we parameterize a single Jacobian-vector product (JVP) for each HO step and minimize the distance from the true second-order term. Our method allows online optimization and is scalable to both the hyperparameter dimension and the horizon length. We demonstrate the effectiveness of our method on three different meta-learning methods and two benchmark datasets.


## 1 Introduction

Meta-learning (Schmidhuber, 1987; Thrun and Pratt, 1998) aims to learn a learning process itself over a task distribution. Many gradient-based meta-learning approaches assume a set of parameters that do not participate in inner-optimization (Lee and Choi, 2018; Flennerhag et al., 2020; Raghu et al., 2020), which can be seen as hyperparameters. Those hyperparameters are important in helping the inner-learner converge faster and generalize better. As they are usually very high-dimensional, such as element-wise learning rates (Li et al., 2017), we cannot meta-learn them with simple hyperparameter optimization (HO) techniques such as random search (Bergstra and Bengio, 2012) or Bayesian optimization (Snoek et al., 2012), because the search space is too large.

In this case, we can use gradient-based HO methods that can directly optimize the high-dimensional hyperparameters by minimizing the validation loss w.r.t. the hyperparameters (Bengio, 2000). Due to the expensive computational cost of evaluating the hypergradients (i.e. the gradient w.r.t. the hyperparameters), there have been many efforts to improve the effectiveness and the efficiency of the algorithms. Unfortunately, none of the existing algorithms satisfies all of the following criteria, which should be met for practical use: 1) scalable to hyperparameter dimension, 2) online optimization, 3) memory-efficient, 4) avoid short-horizon bias. See Table 1 for a comparison of existing gradient-based HO algorithms on these four criteria.

Forward-Mode Differentiation (FMD) (Franceschi et al., 2017) in Table 1 is an algorithm that forward-propagates Jacobians (i.e. derivatives of the update function) from the first to the last step, which is analogous to real-time recurrent learning (RTRL) (Williams and Zipser, 1989) in recurrent neural networks. FMD allows online optimization (i.e. updating hyperparameters at every inner-step) with the intermediate Jacobians and also computes the hypergradients over the entire horizon. However, a critical limitation is that the time and space complexity increases linearly w.r.t. the hyperparameter dimension. Thus, we cannot use FMD for solving many practical meta-learning problems that come with millions of hyperparameters, which is the main problem we tackle in this paper.

Secondly, Reverse-Mode Differentiation (RMD) (Maclaurin et al., 2015) back-propagates the Jacobian-vector products (JVPs) from the last to the initial step, which is structurally identical to backprop through time (BPTT) (Werbos, 1990). RMD is scalable to the hyperparameter dimension, but the space complexity increases linearly w.r.t. the horizon length (i.e., the number of inner-gradient steps used to compute the hypergradient). It is possible to reduce the memory burden by checkpointing some of the previous weights and further interpolating between the weights to approximate the trajectory (Fu et al., 2016). However, RMD and its variants are not scalable for online optimization. This is because, unlike FMD, they do not retain the intermediate Jacobians and thus need to recompute the whole second-order term for every online HO step.

Thirdly, algorithms based on Implicit Function Theorem (IFT) are applicable to high-dimensional HO (Bengio, 2000; Pedregosa, 2016). Under the assumption that the main model parameters have arrived at convergence, the best-response Jacobian, i.e. how the converged model parameters change w.r.t. the hyperparameters, can be expressed by only the information available at the last step, such as the inverse of Hessian at convergence. Thus, we do not have to explicitly unroll the previous update steps. Due to the heavy cost of computing inverse-Hessian-vector product, Lorraine et al. (2020a) propose to approximate it by an iterative method, which works well for high-dimensional HO problems. However, still it is not straightforward to use the method for online optimization because of the convergence assumption. That is, computing hypergradients before convergence does not guarantee the quality of the hypergradients.

To our knowledge, the short horizon approximation such as one-step lookahead (1-step in Table 1) (Luketina et al., 2016) is the only existing method that fully supports online optimization while being scalable to the hyperparameter dimension at the same time. It computes hypergradients only over a single update step and ignores the past learning trajectory, which is computationally efficient as only a single JVP is computed per online HO step. However, this approximation suffers from the short horizon bias (Wu et al., 2018) by definition.

In this paper, we propose a novel HO algorithm that can simultaneously satisfy all the aforementioned criteria for practical HO. The key idea is to distill the entire second-order term into a single JVP. As a result, we only need to compute the single JVP for each online HO step, and at the same time the distilled JVP can consider longer horizon than short horizon approximations such as one-step lookahead or first-order method. We summarize the contribution of this paper as follows:

• We propose HyperDistill, a novel HO algorithm that satisfies the aforementioned four criteria for practical HO, each of which is crucial for a HO algorithm to be applied to the current meta-learning frameworks.

• We show how to efficiently distill the hypergradient second-order term into a single JVP.

• We empirically demonstrate that our algorithm converges faster and provides better generalization performance at convergence, with three recent meta-learning models and on two benchmark image datasets.

## 2 Related Work

#### Hyperparameter optimization

When the hyperparameter dimension is small, random search (Bergstra and Bengio, 2012) or Bayesian optimization (Snoek et al., 2012) works well. However, when the hyperparameter is high-dimensional, gradient-based HO is often preferred since random or Bayesian search could become infeasible. Some of the most well-known methods for gradient-based HO are based on the Implicit Function Theorem and compute or approximate the inverse Hessian only at convergence. Bengio (2000) computes the exact inverse Hessian, and Luketina et al. (2016) approximate the inverse Hessian with the identity matrix, which is identical to the one-step lookahead approximation. Pedregosa (2016) approximates the inverse Hessian with the conjugate gradient (CG) method. Lorraine et al. (2020b) propose the Neumann approximation, which is numerically more stable than the CG approximation.

On the other hand, Domke (2012) proposes unrolled differentiation for solving bi-level optimization, and Shaban et al. (2019) analyze truncated unrolled differentiation, which is computationally more efficient. Unrolled differentiation can be further categorized into forward mode (FMD) and reverse mode (RMD) (Franceschi et al., 2017). FMD is more suitable for optimizing low-dimensional hyperparameters (Im et al., 2021; Micaelli and Storkey, 2020), but RMD is more scalable to the hyperparameter dimension. Maclaurin et al. (2015) propose a more memory-efficient RMD, which reverses the SGD trajectory with momentum. Fu et al. (2016) further reduce the memory burden of RMD by approximating the learning trajectory with linear interpolation. Luketina et al. (2016) can also be understood as a short horizon approximation of RMD for online optimization. Our method also supports online optimization, but the critical difference is that our algorithm can alleviate the short horizon bias (Wu et al., 2018). RMD is basically a type of backpropagation and it is available in deep learning libraries (Grefenstette et al., 2020).

#### Meta-learning

Meta-learning (Schmidhuber, 1987; Thrun and Pratt, 1998) aims to learn a model that generalizes over a distribution of tasks (Vinyals et al., 2016; Ravi and Larochelle, 2017). While there exists a variety of approaches, in this paper we focus on gradient-based meta-learning (Finn et al., 2017b), especially the methods with high-dimensional hyperparameters that do not participate in inner-optimization. For instance, there have been many attempts to precondition the inner-gradients for faster inner-optimization, either by warping the parameter space with every pair of consecutive layers interleaved with a warp layer (Lee and Choi, 2018; Flennerhag et al., 2020) or by directly modulating the inner-gradients with a diagonal (Li et al., 2017) or block-diagonal matrix (Park and Oliva, 2019). A perturbation function is another form of hyperparameter that helps the inner-learner generalize better (Lee et al., 2020; Ryu et al., 2020; Tseng et al., 2020). It is also possible to let the whole feature extractor be the hyperparameter and only adapt the last fully-connected layer (Raghu et al., 2020). On the other hand, some of the meta-learning literature does not assume a task distribution, but tunes the hyperparameters with a holdout validation set, similarly to the conventional HO setting. In this case, the one-step lookahead method (Luketina et al., 2016) is mostly used for scalable online HO, in the context of domain generalization (Li et al., 2018), handling class imbalance (Ren et al., 2018; Shu et al., 2019), gradient-based neural architecture search (Liu et al., 2019), and the coefficient of a norm-based regularizer (Balaji et al., 2018). Although we mainly focus on the meta-learning setting in this work, whose goal is to transfer knowledge through a task distribution, it is straightforward to apply our method to any conventional HO problem.

## 3 Background

In this section, we first introduce RMD and its approximations for efficient computation. We then introduce our novel algorithm that supports high-dimensional online HO over the entire horizon.

### 3.1 Hyperparameter unrolled differentiation

We first introduce notations. Throughout this paper, we will specify $w$ as weight and $\lambda$ as hyperparameter. The series of weights $w_0, w_1, \dots, w_T$ evolve with the update function $w_t = \Phi(w_{t-1}, \lambda; D_t)$ over steps $t = 1, \dots, T$. The function $\Phi$ takes the previous weight $w_{t-1}$ and the hyperparameter $\lambda$ as inputs and its form depends on the current mini-batch $D_t$. Note that $w_1, \dots, w_T$ are functions w.r.t. the hyperparameter $\lambda$. The question is how to find a good hyperparameter $\lambda$ that yields a good response $w_T$ at the last step. In gradient-based HO, we find the optimal $\lambda$ by minimizing the validation loss $L^{val}$ as a function of $\lambda$.

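$$\min_{\lambda} L^{val}(w_T(\lambda), \lambda) \tag{1}$$

Note that we let the loss function $L^{val}$ itself be modulated by $\lambda$ for generality. According to the chain rule, the hypergradient is decomposed into

$$\frac{dL^{val}(w_T,\lambda)}{d\lambda} = \underbrace{\frac{\partial L^{val}(w_T,\lambda)}{\partial \lambda}}_{g^{FO}_T:\ \text{First-order term}} + \underbrace{\frac{\partial L^{val}(w_T,\lambda)}{\partial w_T}\frac{dw_T}{d\lambda}}_{g^{SO}_T:\ \text{Second-order term}} \tag{2}$$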

On the right hand side, the first-order (FO) term directly computes the gradient w.r.t. $\lambda$ by fixing $w_T$. The second-order (SO) term computes the indirect effect of $\lambda$ through the response $w_T$. The factor $\alpha_T := \partial L^{val}(w_T,\lambda) / \partial w_T$ can be easily computed similarly to $g^{FO}_T$, but the response Jacobian $dw_T / d\lambda$ is more computationally challenging as it is unrolled into the following form.

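$$\frac{dw_T}{d\lambda} = \sum_{t=1}^{T}\left(\prod_{s=t+1}^{T} A_s\right) B_t, \quad \text{where } A_s = \frac{\partial \Phi(w_{s-1},\lambda;D_s)}{\partial w_{s-1}},\quad B_t = \frac{\partial \Phi(w_{t-1},\lambda;D_t)}{\partial \lambda} \tag{3}$$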

Eq. (3) involves the Jacobians $A_s$ and $B_t$ at the intermediate steps. Evaluating them or their vector products is computationally expensive in terms of either time (FMD) or space (FMD, RMD) (Franceschi et al., 2017). Therefore, how to approximate Eq. (3) is the key to developing an efficient and effective HO algorithm.
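As an illustration of Eq. (3), the following minimal sketch evaluates the unrolled response Jacobian for a scalar toy problem and compares it against a finite-difference estimate. The update function `Phi(w, lam; d) = w - eta * (w - lam * d)`, the data values, and all function names are assumptions made for this example only; they are not the models used in the paper.

```python
# Toy check of Eq. (3): dw_T/dlam = sum_t (prod_{s>t} A_s) * B_t for scalar w and lam.
# Assumed (hypothetical) inner update: Phi(w, lam; d) = w - eta * (w - lam * d),
# i.e. one SGD step on 0.5 * (w - lam * d)**2 with a scalar "mini-batch" d.

def unroll(lam, data, w0=0.0, eta=0.1):
    """Run the inner loop and return the final weight w_T."""
    w = w0
    for d in data:
        w = w - eta * (w - lam * d)
    return w

def response_jacobian(data, eta=0.1):
    """Evaluate the sum in Eq. (3) explicitly for the toy update."""
    T = len(data)
    total = 0.0
    for t in range(1, T + 1):
        B_t = eta * data[t - 1]          # dPhi/dlam at step t
        prod_A = (1.0 - eta) ** (T - t)  # A_s = dPhi/dw = 1 - eta for every step s
        total += prod_A * B_t
    return total

data = [1.0, 2.0, 0.5, 1.5, 3.0]
lam, eps = 0.7, 1e-6
finite_diff = (unroll(lam + eps, data) - unroll(lam - eps, data)) / (2 * eps)
exact = response_jacobian(data)
```

In this scalar case the Jacobians are plain numbers, so the sum is cheap. In the high-dimensional setting each product with $A_s$ or $B_t$ is a JVP, which is where the time and memory costs discussed above come from.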

### 3.2 Reverse-mode differentiation and its approximations

RMD is structurally analogous to backpropagation through time (BPTT) (Werbos, 1990). In RMD, we first obtain $w_T$ and back-propagate $A$ and $B$ from the last to the first step in the form of JVPs (See Algorithm 1). While RMD is much faster than FMD, as we only need to compute one or two JVPs per step, it usually requires storing all the previous weights $w_0, \dots, w_{T-1}$ to compute the previous-step JVPs, unless we consider reversible training with a momentum optimizer (Maclaurin et al., 2015). Therefore, when $w$ is high-dimensional, RMD is only applicable to short-horizon problems such as few-shot learning (e.g. as in Finn et al. (2017a)).

#### Trajectory approximation.

Instead of storing all the previous weights for computing $A$ and $B$, we can approximate the learning trajectory by linearly interpolating between the last weight $w_T$ and the initial weight $w_0$. Algorithm 2 illustrates the procedure called DrMAD (Fu et al., 2016), where each intermediate weight $w_t$ is approximated by $\hat{w}_t = (1 - \frac{t}{T}) w_0 + \frac{t}{T} w_T$ for $t = 1, \dots, T-1$. $A_t$ and $B_t$ are also approximated by $\hat{A}_t$ and $\hat{B}_t$, respectively. However, although DrMAD dramatically lowers the space complexity, it does not reduce the number of JVPs per hypergradient step. For each online HO step $t = 1, \dots, T$ we need to back-propagate through all $t$ previous steps, computing $O(t)$ JVPs, so the total number of JVPs needed to complete a single trajectory accumulates up to $O(T^2)$, which is not scalable as an online optimization algorithm.
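The trajectory approximation can be sketched in a few lines. The example below assumes weights stored as plain Python lists and uniform spacing of the interpolation coefficient $t/T$; the function name `interpolate_trajectory` is hypothetical and not taken from the DrMAD implementation.

```python
def interpolate_trajectory(w0, wT, T):
    """Return [w_hat_0, ..., w_hat_T] with w_hat_t = (1 - t/T) * w0 + (t/T) * wT.

    Only the two endpoints are stored; every intermediate weight is
    recomputed on demand, so memory does not grow with the horizon T.
    """
    return [
        [(1.0 - t / T) * a + (t / T) * b for a, b in zip(w0, wT)]
        for t in range(T + 1)
    ]

w0 = [0.0, 2.0]    # initial weight (made-up values)
wT = [4.0, -2.0]   # last weight (made-up values)
trajectory = interpolate_trajectory(w0, wT, T=4)
```

The memory saving comes from never materializing the true $w_1, \dots, w_{T-1}$. The backward pass still visits every step, which is why the JVP count is unchanged.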

#### Short-horizon approximations.

One-step lookahead approximation (Luketina et al., 2016) is currently one of the most popular high-dimensional online HO methods that can avoid computing an excessive number of JVPs (Li et al., 2018; Ren et al., 2018; Shu et al., 2019; Liu et al., 2019; Balaji et al., 2018). The idea is very simple: for each online HO step we only care about the last step and ignore the rest of the learning trajectory for computational efficiency. Specifically, for each step $t$ we compute the hypergradient by viewing $w_{t-1}$ as constant, which yields $dw_t / d\lambda \approx B_t$ (See Eq. (3)). Or, we may completely ignore all the second-order derivatives for computational efficiency, such that $dw_t / d\lambda \approx 0$ (Flennerhag et al., 2020; Ryu et al., 2020). While those approximations enable online HO with low cost, they are intrinsically vulnerable to short-horizon bias (Wu et al., 2018) by definition.
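To make the gap between the full response Jacobian and the one-step approximation concrete, the sketch below uses an assumed scalar update `Phi(w, lam) = w - eta * a * (w - lam)`, for which $A = 1 - \eta a$ and $B = \eta a$ at every step. The constants are illustrative only.

```python
# Assumed toy update: Phi(w, lam) = w - eta * a * (w - lam),
# so A = dPhi/dw = 1 - eta * a and B = dPhi/dlam = eta * a at every step.

def full_response(T, eta, a):
    """Eq. (3) over the whole horizon: sum_t A^(T - t) * B."""
    A, B = 1.0 - eta * a, eta * a
    return sum(A ** (T - t) * B for t in range(1, T + 1))

def one_step_response(eta, a):
    """One-step lookahead: treat w_{t-1} as constant, keeping only B_t."""
    return eta * a

eta, a, T = 0.1, 1.0, 50
full = full_response(T, eta, a)     # geometric sum, equals 1 - (1 - eta * a)**T
short = one_step_response(eta, a)   # ignores the effect of lam on earlier steps
```

Here the one-step estimate captures only a small fraction of the true sensitivity of $w_T$ to $\lambda$, because the contribution of $\lambda$ through all earlier steps is dropped. This toy only shows the magnitude gap; in general the direction can differ as well.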

## 4 Approach

We next introduce our novel online HO method based on knowledge distillation. Our method can overcome all the aforementioned limitations at the same time.

The key idea is to distill the whole second-order term $g^{SO}_t$ in Eq. (2) into a single JVP evaluated at a distilled weight point $w$ and with a distilled dataset $D$. We denote the normalized JVP as $f_t(w, D) := u / \|u\|$ with $u = \alpha_t \, \partial\Phi(w,\lambda;D) / \partial\lambda$, so that $\|f_t(w,D)\| = 1$. Specifically, we want to solve the following knowledge distillation problem for each online HO step $t = 1, \dots, T$:

$$\pi^*_t, w^*_t, D^*_t = \arg\min_{\pi, w, D} \left\| \pi f_t(w, D) - g^{SO}_t \right\| \tag{4}$$

so that we use $\pi^*_t f_t(w^*_t, D^*_t)$ instead of $g^{SO}_t$. Online optimization is now feasible because for each online HO step $t$ we only need to compute the single JVP $f_t(w^*_t, D^*_t)$ rather than computing $O(t)$ JVPs for RMD or DrMAD. Also, unlike short horizon approximations, the whole trajectory information is distilled into the JVP, alleviating the short horizon bias (Wu et al., 2018).

Notice that solving Eq. (4) only w.r.t. $\pi$ is simply a vector projection.

$$\tilde{\pi}_t(w, D) = f_t(w, D)^\top g^{SO}_t \tag{5}$$

Then, plugging Eq. (5) into $\pi$ in Eq. (4) and making use of $\|f_t(w, D)\| = 1$, we can easily convert the optimization problem Eq. (4) into the following equivalent problem (See Appendix A).

$$w^*_t, D^*_t = \arg\max_{w, D} \tilde{\pi}_t(w, D), \qquad \pi^*_t = \tilde{\pi}_t(w^*_t, D^*_t) \tag{6}$$

$w^*_t$ and $D^*_t$ match the hypergradient direction and $\pi^*_t$ matches the size.
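The projection step in Eq. (5) and its link to Eq. (6) can be checked numerically. In the sketch below the vectors are made-up stand-ins for the second-order term and the normalized JVP, and the helper names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def residual(pi, f, g):
    """Distillation error || pi * f - g || from Eq. (4)."""
    return math.sqrt(sum((pi * fi - gi) ** 2 for fi, gi in zip(f, g)))

g_so = [3.0, -1.0, 2.0]          # stand-in for the second-order term (made-up numbers)
f = normalize([1.0, 0.0, 1.0])   # stand-in for the normalized JVP, ||f|| = 1
pi_tilde = dot(f, g_so)          # Eq. (5): optimal scale for this fixed direction

# For a unit vector f, the squared error at the optimal scale is
# ||g||^2 - pi_tilde^2, so a larger pi_tilde means a smaller error.
squared_error = residual(pi_tilde, f, g_so) ** 2
```

Because the squared error at the optimal scale equals $\|g^{SO}_t\|^2 - \tilde{\pi}_t^2$, minimizing the distance in Eq. (4) is the same as maximizing $\tilde{\pi}_t$ over $(w, D)$ whenever $\tilde{\pi}_t \ge 0$, which is the form used in Eq. (6).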

#### Technical challenge.

However, solving Eq. (6) requires evaluating $g^{SO}_t$ for $t = 1, \dots, T$, which is tricky as $g^{SO}_t$ is the target we aim to approximate. We next show how to roughly solve Eq. (6) even without evaluating $g^{SO}_t$ (for $w^*_t$ and $D^*_t$) or by sparsely evaluating an approximation of $g^{SO}_t$ (for $\pi^*_t$).

### 4.2 Distilling the hypergradient direction

#### Hessian approximation.

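In order to circumvent the technical difficulty, we start by making the optimization objective in Eq. (5) simpler. We approximate $g^{SO}_t$ as

$$g^{SO}_t = \alpha_t \sum_{i=1}^{t}\left(\prod_{j=i+1}^{t} A_j\right) B_i \approx \sum_{i=1}^{t} \gamma^{t-i} \alpha_t B_i \tag{7}$$

with $\gamma \ge 0$, which we tune on a meta-validation set. Note that Eq. (7) is still too expensive to use for online optimization as it consists of $t$ JVPs. We thus need further distillation, which we will explain later. Eq. (7) is simply a Hessian identity approximation. For instance, vanilla SGD with learning rate $\eta$ corresponds to $A_j = I - \eta H_j$, where $H_j$ is the Hessian of the training loss w.r.t. the weight at step $j$. Approximating the Hessian as $H_j \approx kI$ with a scalar $k$, we have $A_j \approx (1 - \eta k) I = \gamma I$. Plugging Eq. (7) into Eq. (5) and letting $f_t(w_{i-1}, D_i) = \alpha_t B_i / \|\alpha_t B_i\|$, we have

$$\tilde{\pi}_t(w, D) \approx \hat{\pi}_t(w, D) = \sum_{i=1}^{t} \delta_{t,i} \cdot f_t(w, D)^\top f_t(w_{i-1}, D_i) \tag{8}$$

where $\delta_{t,i} = \gamma^{t-i} \|\alpha_t B_i\| \ge 0$. Instead of maximizing $\tilde{\pi}_t$ directly, we now maximize $\hat{\pi}_t$ w.r.t. $w$ and $D$ as a proxy objective.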

#### Lipschitz continuity assumption.

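Now we are ready to see how to distill the hypergradient direction $w^*_t$ and $D^*_t$ without evaluating $g^{SO}_t$. The important observation is that the maximum of $\hat{\pi}_t$ in Eq. (8) is achieved when $f_t(w, D)$ is well-aligned with the other $f_t(w_{i-1}, D_i)$. This intuition is directly related to the following Lipschitz continuity assumption on $f_t$.

$$\left\| f_t(w, D) - f_t(w_{i-1}, D_i) \right\| \le K \left\| (w, D) - (w_{i-1}, D_i) \right\|_{\mathcal{X}}, \quad \text{for } i = 1, \dots, t. \tag{9}$$

where $K \ge 0$ is the Lipschitz constant. Eq. (9) captures which $(w, D)$ can minimize $\|f_t(w, D) - f_t(w_{i-1}, D_i)\|$ over $i = 1, \dots, t$, which is equivalent to maximizing $f_t(w, D)^\top f_t(w_{i-1}, D_i)$ since $\|f_t(\cdot,\cdot)\| = 1$. For the metric $\|\cdot\|_{\mathcal{X}}$, we let $K^2 \|(w, D) - (w_{i-1}, D_i)\|^2_{\mathcal{X}} = K_1^2 \|w - w_{i-1}\|^2 + K_2^2 \|D - D_i\|^2$, where $K_1, K_2 \ge 0$ are additional constants that we introduce for notational convenience. Squaring both sides of Eq. (9), using $\|f_t(w,D) - f_t(w_{i-1},D_i)\|^2 = 2 - 2 f_t(w,D)^\top f_t(w_{i-1},D_i)$, multiplying by $\delta_{t,i}$, and summing over all $i = 1, \dots, t$, we can easily derive the following lower bound of $2\hat{\pi}_t$ (See Appendix B).

$$2\sum_{i=1}^{t} \delta_{t,i} - K_1^2 \sum_{i=1}^{t} \delta_{t,i} \|w - w_{i-1}\|^2 - K_2^2 \sum_{i=1}^{t} \delta_{t,i} \|D - D_i\|^2 \le 2\hat{\pi}_t(w, D) \tag{10}$$

We now maximize this lower bound instead of directly maximizing $\hat{\pi}_t$. Interestingly, it corresponds to the following simple minimization problems for $w$ and $D$.

$$\min_{w} \sum_{i=1}^{t} \delta_{t,i} \|w - w_{i-1}\|^2, \qquad \min_{D} \sum_{i=1}^{t} \delta_{t,i} \|D - D_i\|^2. \tag{11}$$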
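For the weight, the first problem in Eq. (11) is a weighted least-squares problem with a closed-form solution. As a short worked step (a standard calculation, assuming the coefficients $\delta_{t,i}$ are nonnegative and not all zero), setting the gradient w.r.t. $w$ to zero gives a weighted average of the past weights:

$$\nabla_w \sum_{i=1}^{t} \delta_{t,i} \|w - w_{i-1}\|^2 = 2\sum_{i=1}^{t} \delta_{t,i} (w - w_{i-1}) = 0 \quad\Longrightarrow\quad w = \frac{\sum_{i=1}^{t} \delta_{t,i}\, w_{i-1}}{\sum_{i=1}^{t} \delta_{t,i}}$$

The objective is convex in $w$, so this stationary point is the minimizer.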

#### Efficient sequential update.

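Eq. (11) tells how to determine the distilled $w^*_t$ and $D^*_t$ for each HO step $t$. Since $\delta_{t,i} = \gamma^{t-i}\|\alpha_t B_i\|$ is expensive to compute, we approximate $\delta_{t,i}$ as $\gamma^{t-i}$, yielding the following weighted average as the approximated solution for $w^*_t$.

$$w^*_t \approx \frac{\gamma^{t-1}}{\sum_{i=1}^{t}\gamma^{t-i}} w_0 + \frac{\gamma^{t-2}}{\sum_{i=1}^{t}\gamma^{t-i}} w_1 + \cdots + \frac{\gamma^{0}}{\sum_{i=1}^{t}\gamma^{t-i}} w_{t-1} \tag{12}$$

The following sequential update allows us to efficiently evaluate Eq. (12) for each HO step. Denoting $p_t = \sum_{i=1}^{t-1}\gamma^{t-i} \big/ \sum_{i=1}^{t}\gamma^{t-i}$, we have

$$t = 1:\ w^*_1 \leftarrow w_0, \qquad t \ge 2:\ w^*_t \leftarrow p_t w^*_{t-1} + (1 - p_t) w_{t-1} \tag{13}$$

Note that the online update in Eq. (13) does not require evaluating $g^{SO}_t$. It only requires incorporating the past learning trajectory $w_0, \dots, w_{t-1}$ through the sequential updates. Therefore, the only additional cost is the memory for storing and updating the weighted running average $w^*_t$.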
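The equivalence between the weighted average in Eq. (12) and the sequential update in Eq. (13) can be verified with a small sketch. The mixing coefficient below is derived directly from Eq. (12) as the ratio of the geometric sums, which is an assumption of this example; scalar weights and all names are illustrative.

```python
def mixing_coefficient(t, gamma):
    """p_t = (gamma + ... + gamma^(t-1)) / (1 + gamma + ... + gamma^(t-1))."""
    s_t = sum(gamma ** k for k in range(t))
    return (s_t - 1.0) / s_t

def direct_average(ws, t, gamma):
    """Eq. (12): weight gamma^(t-i) on w_{i-1} for i = 1..t, normalized."""
    coeffs = [gamma ** (t - i) for i in range(1, t + 1)]
    return sum(c * ws[i] for i, c in enumerate(coeffs)) / sum(coeffs)

def sequential_average(ws, gamma):
    """Eq. (13): running average that needs only the previous w_star."""
    w_star = ws[0]                       # t = 1: w*_1 <- w_0
    history = [w_star]
    for t in range(2, len(ws) + 1):
        p_t = mixing_coefficient(t, gamma)
        w_star = p_t * w_star + (1.0 - p_t) * ws[t - 1]
        history.append(w_star)
    return history

ws = [0.0, 1.0, 4.0, 9.0, 16.0]   # toy scalar trajectory w_0, ..., w_4
gamma = 0.5
running = sequential_average(ws, gamma)
```

With `gamma = 0.0` the coefficient is zero at every step and the running average collapses to the most recent weight, so only the last step of the trajectory is used.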

For $D$, we have assumed a Euclidean distance metric as with $w$, but it is not straightforward to think of a Euclidean distance between datasets. Instead, we simply interpret $p_t$ and $1 - p_t$ as probabilities with which we proportionally subsample each dataset.

$$t = 1:\ D^*_1 \leftarrow D_1, \qquad t \ge 2:\ D^*_t \leftarrow \mathrm{SS}(D^*_{t-1}, p_t) \cup \mathrm{SS}(D_t, 1 - p_t) \tag{14}$$

where $\mathrm{SS}(D, p)$ denotes random SubSampling of a fraction $p$ of the instances from $D$. There may be a better distance metric for datasets and a corresponding solution, but we leave it as future work. See Algorithm 3 for the overall description of our algorithm, which we name HyperDistill.
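A minimal sketch of the dataset update in Eq. (14) follows. It assumes that subsampling draws `round(len(D) * p)` instances without replacement and that the union is implemented as list concatenation; both choices, as well as the function names, are assumptions of this example rather than details confirmed by the text.

```python
import random

def subsample(dataset, prob, rng):
    """SS(D, p): draw round(|D| * p) instances from D without replacement."""
    k = round(len(dataset) * prob)
    return rng.sample(dataset, k)

def update_distilled_dataset(d_star_prev, d_t, p_t, rng):
    """Eq. (14): mix the previous distilled batch with the current mini-batch."""
    return subsample(d_star_prev, p_t, rng) + subsample(d_t, 1.0 - p_t, rng)

rng = random.Random(0)
d_prev = list(range(0, 8))       # stand-in for the previous distilled batch
d_t = list(range(100, 108))      # stand-in for the current mini-batch
d_star = update_distilled_dataset(d_prev, d_t, 0.25, rng)
```

When both batches have the same size, the distilled batch keeps that size, so the cost of the single JVP evaluated on it does not grow with the step index.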

#### Role of γ

Note that Eq. (12) tells us the role of $\gamma$ as a decaying factor. The larger the $\gamma$, the longer the past learning trajectory we consider. In this sense, our method is a generalization of the one-step lookahead approximation, i.e. $\gamma = 0$, which yields $w^*_t = w_{t-1}$ and $D^*_t = D_t$, ignoring all information about the past learning trajectory except the last step. $\gamma = 0$ may be too pessimistic in most cases, so we need to carefully find a better performing $\gamma$ for each task.

### 4.3 Distilling the hypergradient size

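Now we need to plug the distilled $w^*_t$ and $D^*_t$ into $\tilde{\pi}_t$ in Eq. (5) to obtain the scaling factor $\pi^*_t = f_t(w^*_t, D^*_t)^\top g^{SO}_t$, for online HO steps $t = 1, \dots, T$. However, whereas evaluating the single JVP $f_t(w^*_t, D^*_t)$ is tolerable, evaluating $g^{SO}_t$ is again problematic as it is the target we aim to approximate. Also, it is not straightforward to apply to $\pi^*_t$ a trick similar to the one we used in Sec. 4.2.

#### Linear estimator.

We thus introduce a linear function $c_\gamma(t; \theta)$ that estimates $\pi^*_t$ by periodically fitting $\theta$, the parameter of the estimator. Then for each HO step $t$ we can use $c_\gamma(t; \theta)$ instead of fully evaluating $\pi^*_t$. Based on the observation that the form of the lower bound in Eq. (10) is roughly proportional to $\sum_{i=1}^{t}\delta_{t,i}$, we conveniently set $c_\gamma(t; \theta)$ as follows:

$$c_\gamma(t; \theta) = \theta \cdot \|v_t\| \cdot \sum_{i=1}^{t} \gamma^{t-i}, \quad \text{where } v_t := \alpha_t \frac{\partial \Phi(w^*_t, \lambda; D^*_t)}{\partial \lambda}. \tag{15}$$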

#### Collecting samples.

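We next see how to collect samples $\{(x_s, y_s)\}_{s=1}^{T}$ for fitting the parameter $\theta$, where $x_s = \|v_s\| \sum_{i=1}^{s}\gamma^{s-i}$ and $y_s = \pi^*_s$. For this, we need to efficiently collect:

1. $g^{SO}_s$, the second-order term computed over the horizon of size $s$.

2. $v_s$, the distilled JVP computed over the horizon of size $s$.

1) $g^{SO}_s$: Note that DrMAD in Algorithm 2 (line 6) sequentially back-propagates $g^{SO}$ for $t = T, \dots, 1$. The important observation is that, at step $t$, this incomplete second-order term $g^{SO} = \sum_{i=t}^{T} \alpha_T \hat{A}_T \hat{A}_{T-1} \cdots \hat{A}_{i+1} \hat{B}_i$ can be seen as the valid second-order term computed over the horizon of size $s = T - t + 1$. This is because the reparameterization $s = T - t + 1$ gives

$$g^{SO}_s = \sum_{i=1}^{s} \alpha_{s+(T-s)} \hat{A}_{s+(T-s)} \hat{A}_{s-1+(T-s)} \cdots \hat{A}_{i+1+(T-s)} \hat{B}_{i+(T-s)} \tag{16}$$

for $s = 1, \dots, T$, which is nothing but shifting the trajectory index by $T - s$ steps so that the last step is always $T$. Therefore, we can efficiently obtain the valid second-order term $g^{SO}_s$ for all $s = 1, \dots, T$ through a single backward pass along the interpolated trajectory (See Figure 1). Each $g^{SO}_s$ requires computing only one or two additional JVPs. Also, as we use DrMAD instead of RMD, we only store $w_0$ and $w_T$, such that the memory cost is constant w.r.t. the total horizon size $T$.

2) $v_s$: For computing the distilled JVP $v_s$, we first compute the distilled $w^*_s$ and $D^*_s$ as below, similarly to Eq. (13) and (14). Denoting $p_s = \sum_{i=1}^{s-1}\gamma^{s-1-i} \big/ \sum_{i=1}^{s}\gamma^{s-i}$ (which differs from $p_t$ in Eq. (13) because the horizon now grows backward from the last step), we have

$$s = 1:\ w^*_1 \leftarrow w_{T-1}, \qquad s \ge 2:\ w^*_s \leftarrow p_s w^*_{s-1} + (1 - p_s) w_{T-s} \tag{17}$$

$$s = 1:\ D^*_1 \leftarrow D_T, \qquad s \ge 2:\ D^*_s \leftarrow \mathrm{SS}(D^*_{s-1}, p_s) \cup \mathrm{SS}(D_{T-s+1}, 1 - p_s) \tag{18}$$

We then compute the unnormalized distilled JVP as $v_s = \alpha_T \, \partial\Phi(w^*_s, \lambda; D^*_s) / \partial\lambda$.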
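The observation behind Eq. (16), that one backward pass yields the second-order term for every horizon size, can be checked on a scalar toy problem. In the sketch below the per-step Jacobians are arbitrary made-up numbers and the function names are hypothetical; in the actual setting each product would be a JVP on the interpolated trajectory.

```python
def backward_partial_terms(alpha_T, A, B):
    """Single backward pass. A[t-1], B[t-1] are the scalar Jacobians of step t.

    After visiting step t, the accumulator equals the second-order term
    over the last s = T - t + 1 steps, so all horizons come for free.
    """
    alpha, g, partial = alpha_T, 0.0, []
    for t in range(len(A), 0, -1):
        g += alpha * B[t - 1]      # add the contribution of step t
        partial.append(g)          # g_s with s = T - t + 1
        alpha *= A[t - 1]          # propagate alpha one step further back
    return partial

def direct_term(alpha_T, A, B, s):
    """Eq. (16) evaluated from scratch for a horizon of size s."""
    shift = len(A) - s
    total = 0.0
    for i in range(1, s + 1):
        prod = 1.0
        for j in range(i + 1, s + 1):
            prod *= A[j + shift - 1]
        total += alpha_T * prod * B[i + shift - 1]
    return total

A = [0.9, 0.8, 1.1, 0.7]
B = [0.1, 0.2, 0.3, 0.4]
alpha_T = 2.0
partial = backward_partial_terms(alpha_T, A, B)
```

The backward pass costs a constant number of products per step, whereas recomputing every horizon from scratch would cost a quadratic number in total.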

#### Estimating θ.

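Now we are ready to estimate $\theta$. For $s = 1, \dots, T$, we have $x_s = \|v_s\| \sum_{i=1}^{s}\gamma^{s-i}$ and collect $x = (x_1, \dots, x_T)$. For $s = 1, \dots, T$, we have $y_s = (v_s / \|v_s\|)^\top g^{SO}_s$ and collect $y = (y_1, \dots, y_T)$. Finally, we estimate $\theta$ by linear regression of $y$ on $x$. See Algorithm 3 and Algorithm 4 for the details. Practically, we run the estimation only periodically (EstimationPeriod in Algorithm 3), once every fixed number of completed inner-optimizations. Thus, the computational cost of LinearEstimation is marginal in terms of wall-clock time (see Table 3).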
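As a rough sketch of the estimation step, the code below fits the single parameter of Eq. (15) by ordinary least squares through the origin, i.e. $\theta = x^\top y / x^\top x$. This particular estimator is an assumption of the example (the model in Eq. (15) has no intercept, which makes it a natural choice), and the sample values and function names are made up.

```python
def fit_theta(xs, ys):
    """Least-squares slope through the origin: theta = (x . y) / (x . x)."""
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs)
    return numerator / denominator

def estimator_input(v_norm, t, gamma):
    """x_t = ||v_t|| * sum_{i=1..t} gamma^(t-i), the regressor in Eq. (15)."""
    return v_norm * sum(gamma ** (t - i) for i in range(1, t + 1))

def predict_scale(theta, v_norm, t, gamma):
    """c_gamma(t; theta) from Eq. (15), used in place of the true scale."""
    return theta * estimator_input(v_norm, t, gamma)

xs = [1.0, 2.0, 3.0]     # made-up regressors
ys = [0.5, 1.0, 1.5]     # made-up targets, exactly linear in xs
theta = fit_theta(xs, ys)
scale = predict_scale(theta, v_norm=2.0, t=3, gamma=0.5)
```

Once $\theta$ is fitted, predicting the scale at each HO step needs only the norm of the single distilled JVP, so no second-order term has to be evaluated online.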

## 5 Experiments

#### Baselines.

We demonstrate the efficacy of our algorithm by comparing to the following baselines.

1) First-Order Approximation (FO). Computationally the most efficient HO algorithm, which completely ignores the second-order term, i.e. $g^{SO}_t = 0$.

2) One-step Look-ahead Approximation (1-step). (Luketina et al., 2016) The short-horizon approximation where only a single step is unrolled to compute each hypergradient.

3) DrMAD. (Fu et al., 2016) An approximation of RMD that linearly interpolates between the initial and the last weight to save memory (see Algorithm 2).

4) Neumann IFT (N.IFT). (Lorraine et al., 2020a) An IFT based method that approximates the inverse-Hessian-vector product by a Neumann series. Note that this method supports online optimization around convergence. Specifically, among the total inner-steps, N.IFT performs online HO only for the last few steps, each with a fixed number of inversion steps. We tune the number of online HO steps and inversion steps jointly, keeping the number of JVPs per inner-optimization roughly fixed.

5) HyperDistill. Our high-dimensional online HO algorithm based on the idea of knowledge distillation. We tune the decaying factor $\gamma$ on a meta-validation set. The linear regression is done periodically, once every fixed number of inner-optimization problems.

#### Target meta-learning models.

We test on the following three meta-learning models.

1) Almost No Inner Loop (ANIL). (Raghu et al., 2020) The intuition of ANIL is that the need for task-specific adaptation diminishes when the task distribution is homogeneous. Following this intuition, based on a typical 4-layer convolutional network (Finn et al., 2017a), we designate the three bottom layers as the high-dimensional hyperparameter and the 4th convolutional layer and the last fully connected layer as the weight, similarly to Javed and White (2019).

2) WarpGrad. (Flennerhag et al., 2020) Secondly, we consider WarpGrad, whose goal is to meta-learn non-linear warp layers that facilitate fast inner-optimization and better generalization. We use 3-layer convolutional network with 32 channels. Every layer is interleaved with two warp layers that do not participate in the inner-optimization, which is the high-dimensional hyperparameter.

3) MetaWeightNet. (Shu et al., 2019) Lastly, we consider solving the label corruption problem with MetaWeightNet, which meta-learns a small MLP that takes a 1D loss as input and outputs a reweighted loss. The parameter of the MLP is considered as a high-dimensional hyperparameter. Labels are independently corrupted to random classes with a fixed probability. Note that we aim to meta-learn the MLP over a task distribution and apply it to diverse unseen tasks, instead of solving a single task. Also, in this meta model the direct gradient is zero, i.e. $g^{FO}_t = 0$. In this case, the scaling factor $\pi^*_t$ in HyperDistill does nothing but rescale the learning rate, so we simply fix it to a constant.

#### Use of Reptile.

Note that for all the above meta-learning models, we meta-learn the weight initialization with Reptile (Nichol et al., 2018) as well, representing a more practical meta-learning scenario than learning from random initialization. We use the Reptile learning rate . Note that in Algorithm 3 and Algorithm 4 denotes the Reptile initialization parameter.

We consider the following two datasets. To generate each task, we randomly sample classes from each dataset and further randomly split into training and test examples. 1) TinyImageNet. (1) This dataset contains classes of general categories. We split them into , , and classes for meta-training, meta-validation, and meta-test. Each class has examples of size . 2) CIFAR100. (Krizhevsky et al., 2009) This dataset contains classes of general categories. We split them into , , and classes for meta-training, meta-validation, and meta-test. Each class has examples of size .

#### Experimental setup.

Meta-training: For inner-optimization of the weights, we use SGD with momentum and set the learning rate for MetaWeightNet and for the others. The number of inner-steps is and batchsize is . We use random cropping and horizontal flipping as data augmentations. For the hyperparameter optimization, we also use SGD with momentum with learning rate for MetaWeightNet and for the others, which we linearly decay toward over total inner-optimizations. We perform parallel meta-learning with meta-batchsize set to . Meta-testing: We solve tasks to measure average performance, with exactly the same inner-optimization setup as meta-training. We repeat this over different meta-training runs and report mean and confidence intervals (see Table 2).

### 5.1 Analysis

We perform the following analysis together with the WarpGrad model and CIFAR100 dataset.

#### HyperDistill provides faster convergence and better generalization.

Figure 2 shows that HyperDistill achieves much faster meta-training convergence than the baselines for all the meta-learning models and datasets we considered. We see that the convergence of an offline method such as DrMAD is significantly worse than that of a simple first-order method, demonstrating the importance of frequent updates via online optimization. HyperDistill shows significantly better convergence than FO and 1-step because it is online and at the same time alleviates the short horizon bias. As a result, Table 2 shows that the meta-test performance of HyperDistill is significantly better than that of the baselines, although it requires a comparable number of JVPs per inner-optimization.

#### HyperDistill is a reasonable approximation of the true hypergradient.

We see from Figure 3 that the hypergradient obtained from HyperDistill is more similar to the exact RMD than those obtained from FO and 1-step, demonstrating that HyperDistill can actually alleviate the short horizon bias. HyperDistill is even comparable to N.IFT, which computes multiple JVPs, whereas HyperDistill computes only a single JVP. Such results indicate that the approximation we used in Eq. (7) and DrMAD in Eq. (16) are accurate enough. Figure 3 shows that with careful tuning of $\gamma$, the direction of the approximated second-order term in Eq. (7) can be much more accurate than the second-order term of 1-step ($\gamma = 0$). In Figure 3, as HyperDistill distills such a good approximation, it can provide a better direction of the second-order term than 1-step. Although the gap may seem marginal, even N.IFT performs similarly, showing that matching the direction of the second-order term without unrolling the full gradient steps is inherently a challenging problem. Figure 4 shows that the samples collected according to Algorithm 4 are largely linear, supporting our choice of Eq. (15). Figure 4 also shows that the fitted $\theta$ is accurate and stable, explaining why we do not have to perform the estimation frequently. Note that the DrMAD approximation (Eq. (16)) is accurate (Figure 3), helping to predict the hypergradient size.

#### HyperDistill is computationally efficient.

Figure 4 shows the superior computational efficiency of HyperDistill in terms of the trade-off between meta-test performance and the amount of JVP computation. Note that wall-clock time is roughly proportional to the number of JVPs per inner-optimization. In Appendix F, we can see that the actual increase in memory cost and wall-clock time is very marginal compared to the 1-step approximation.

## 6 Conclusion

In this work, we proposed a novel HO method, HyperDistill, that can optimize high-dimensional hyperparameters in an online manner. It was done by approximating the exact second-order term with knowledge distillation. We demonstrated that HyperDistill provides faster meta-convergence and better generalization performance based on realistic meta-learning methods and datasets. We also verified that it is thanks to the accurate approximations we proposed.

## References

• [1] Cited by: §5.
• Y. Balaji, S. Sankaranarayanan, and R. Chellappa (2018) MetaReg: towards domain generalization using meta-regularization. In NeurIPS, Cited by: §2, §3.2.
• Y. Bengio (2000) Gradient-based optimization of hyperparameters. Neural Computation 12 (8), pp. 1889–1900. Cited by: §1, §1, §2.
• J. Bergstra and Y. Bengio (2012) Random search for hyper-parameter optimization.. J. Mach. Learn. Res. 13. Cited by: §1, §2.
• J. Domke (2012) Generic methods for optimization-based modeling. In AISTATS, Cited by: §2.
• C. Finn, P. Abbeel, and S. Levine (2017a) Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In ICML, Cited by: §3.2, §5.
• C. Finn, P. Abbeel, and S. Levine (2017b) Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, Cited by: §2.
• S. Flennerhag, A. A. Rusu, R. Pascanu, F. Visin, H. Yin, and R. Hadsell (2020) Meta-learning with warped gradient descent. In ICLR, Cited by: §1, §2, §3.2, §5.
• L. Franceschi, M. Donini, P. Frasconi, and M. Pontil (2017) Forward and reverse gradient-based hyperparameter optimization. In ICML, Cited by: §1, §2, §3.1.
• J. Fu, H. Luo, J. Feng, K. H. Low, and T. Chua (2016) DrMAD: distilling reverse-mode automatic differentiation for optimizing hyperparameters of deep neural networks. In IJCAI, Cited by: §1, §2, §3.2, §5, Algorithm 2.
• E. Grefenstette, B. Amos, D. Yarats, P. M. Htut, A. Molchanov, F. Meier, D. Kiela, K. Cho, and S. Chintala (2020) Generalized inner loop meta-learning. Cited by: §2.
• D. J. Im, C. Savin, and K. Cho (2021) Online hyperparameter optimization by real-time recurrent learning. External Links: 2102.07813 Cited by: §2.
• K. Javed and M. White (2019) Meta-learning representations for continual learning. In NeurIPS, Cited by: §5.
• A. Krizhevsky, G. Hinton, et al. (2009) Learning Multiple Layers of features from Tiny Images. Cited by: §5.
• H. B. Lee, T. Nam, E. Yang, and S. J. Hwang (2020) Meta Dropout: Learning to Perturb Latent Features for Generalization. In ICLR, Cited by: §2.
• Y. Lee and S. Choi (2018) Gradient-based meta-learning with learned layerwise metric and subspace. In ICML, Cited by: §1, §2.
• D. Li, Y. Yang, Y. Song, and T. Hospedales (2018) Learning to generalize: meta-learning for domain generalization. In AAAI, Cited by: §2, §3.2.
• Z. Li, F. Zhou, F. Chen, and H. Li (2017) Meta-sgd: learning to learn quickly for few shot learning. arXiv preprint arXiv:1707.09835. Cited by: §1, §2.
• H. Liu, K. Simonyan, and Y. Yang (2019) DARTS: differentiable architecture search. In ICLR, Cited by: §2, §3.2.
• J. Lorraine, P. Vicol, and D. Duvenaud (2020a) Optimizing millions of hyperparameters by implicit differentiation. In AISTATS, Cited by: §1, §5.
• J. Lorraine, P. Vicol, and D. Duvenaud (2020b) Optimizing millions of hyperparameters by implicit differentiation. In AISTATS, Cited by: §2.
• J. Luketina, M. Berglund, K. Greff, and T. Raiko (2016) Scalable gradient-based tuning of continuous regularization hyperparameters. In ICML, Cited by: Table 1, §1, §2, §2, §3.2, §5.
• D. Maclaurin, D. Duvenaud, and R. Adams (2015) Gradient-based hyperparameter optimization through reversible learning. In ICML, Cited by: §1, §2, §3.2.
• P. Micaelli and A. J. Storkey (2020) Non-greedy gradient-based hyperparameter optimization over long horizons. External Links: 2007.07869 Cited by: §2.
• A. Nichol, J. Achiam, and J. Schulman (2018) On First-Order Meta-Learning Algorithms. arXiv e-prints. Cited by: §5.
• E. Park and J. B. Oliva (2019) Meta-curvature. In NeurIPS, Cited by: §2.
• F. Pedregosa (2016) Hyperparameter optimization with approximate gradient. In ICML, Cited by: §1, §2.
• A. Raghu, M. Raghu, S. Bengio, and O. Vinyals (2020) Rapid learning or feature reuse? towards understanding the effectiveness of maml. In ICLR, Cited by: §1, §2, §5.
• S. Ravi and H. Larochelle (2017) Optimization as a model for few-shot learning. In ICLR, Cited by: §2.
• M. Ren, W. Zeng, B. Yang, and R. Urtasun (2018) Learning to Reweight Examples for Robust Deep Learning. In ICML, Cited by: §2, §3.2.
• J. U. Ryu, J. Shin, H. B. Lee, and S. J. Hwang (2020) MetaPerturb: transferable regularizer for heterogeneous tasks and architectures. In NeurIPS, Cited by: §2, §3.2.
• J. Schmidhuber (1987) Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-… hook. Ph.D. Thesis, Technische Universität München. Cited by: §1, §2.
• A. Shaban, C. Cheng, N. Hatch, and B. Boots (2019) Truncated back-propagation for bilevel optimization. In AISTATS, Cited by: §2.
• J. Shu, Q. Xie, L. Yi, Q. Zhao, S. Zhou, Z. Xu, and D. Meng (2019) Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting. In NeurIPS, Cited by: §2, §3.2, §5.
• J. Snoek, H. Larochelle, and R. P. Adams (2012) Practical bayesian optimization of machine learning algorithms. In NIPS, Cited by: §1, §2.
• S. Thrun and L. Pratt (Eds.) (1998) Learning to learn. Kluwer Academic Publishers, Norwell, MA, USA. External Links: ISBN 0-7923-8047-9 Cited by: §1, §2.
• H. Tseng, H. Lee, J. Huang, and M. Yang (2020) Cross-domain few-shot classification via learned feature-wise transformation. In ICLR, Cited by: §2.
• O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching Networks for One Shot Learning. In NIPS, Cited by: §2.
• P. J. Werbos (1990) Backpropagation through time: what it does and how to do it. Proceedings of the IEEE. Cited by: §1, §3.2.
• R. J. Williams and D. Zipser (1989) A learning algorithm for continually running fully recurrent neural networks. Neural Computation. Cited by: §1.
• Y. Wu, M. Ren, R. Liao, and R. Grosse (2018) Understanding short-horizon bias in stochastic meta-optimization. In ICLR, Cited by: §1, §2, §3.2, §4.1.

## Appendix A Derivation of Equation (6)

Let and for notational simplicity. Note that . Then,

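$$
\begin{aligned}
\tilde{\pi}_t(w, D) &= \arg\min_{\pi} \|\pi f - g\| \\
&= \arg\min_{\pi} \|\pi f - g\|^2 \\
&= \arg\min_{\pi} \; \pi^2 f^\top f - 2\pi f^\top g + g^\top g \qquad (19) \\
&= f^\top g
\end{aligned}
$$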

Plugging this into in Eq. (19) and with the assumption , we have

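$$
\begin{aligned}
w^*, D^* &= \arg\min_{w, D} \; (f^\top g)^2 \cdot f^\top f - 2 (f^\top g) \cdot f^\top g + g^\top g \\
&= \arg\max_{w, D} \; (f^\top g)^2 \\
&= \arg\max_{w, D} \; f^\top g \\
&= \arg\max_{w, D} \; \tilde{\pi}(w, D). \qquad (20)
\end{aligned}
$$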

Note that Eq. (20) results from encoding the closed-form solution already. Therefore, the above is a joint optimization so that we do not have to repeat alternating optimizations between and .
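The closed-form step above can be checked numerically. The sketch below is a minimal illustration in plain Python, under two assumptions that we state explicitly because the derivation relies on them: $f$ has unit norm ($f^\top f = 1$), and the last step from $(f^\top g)^2$ to $f^\top g$ additionally requires $f^\top g \ge 0$. The helper names `dot` and `residual_sq` and the random vectors are hypothetical. The check confirms that $\pi = f^\top g$ minimizes $\|\pi f - g\|^2$ and that the minimum equals $g^\top g - (f^\top g)^2$, so minimizing the residual over $(w, D)$ amounts to maximizing $(f^\top g)^2$.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def residual_sq(pi, f, g):
    """Squared distance ||pi * f - g||^2."""
    return sum((pi * a - b) ** 2 for a, b in zip(f, g))


random.seed(0)
dim = 5

# Unit-norm f (assumption: f^T f = 1) and an arbitrary target g.
raw = [random.gauss(0.0, 1.0) for _ in range(dim)]
scale = math.sqrt(dot(raw, raw))
f = [a / scale for a in raw]
g = [random.gauss(0.0, 1.0) for _ in range(dim)]

# Closed-form minimizer of the quadratic in pi when f^T f = 1.
pi_star = dot(f, g)
best = residual_sq(pi_star, f, g)

# At the minimizer the residual equals g^T g - (f^T g)^2.
identity_gap = abs(best - (dot(g, g) - pi_star ** 2))

# Any perturbation of pi increases the residual (by eps^2 for unit-norm f).
worse = min(residual_sq(pi_star + eps, f, g) for eps in (-0.1, -0.01, 0.01, 0.1))
```

Because the residual at the optimum depends on $(w, D)$ only through $(f^\top g)^2$, the scaling factor never has to be optimized separately.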

## Appendix B Derivation of Equation (10)

Let and for notational simplicity. Note that and we are given the following inequalities.

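$$
\|f - f_i\| \;\le\; K \|(w, D) - (w_{i-1}, D_i)\|_{X}, \quad \text{for } i = 1, \dots, t.
$$

Taking the square of both sides and multiplying by $\delta_{t,i}$,

$$
2\delta_{t,i} - \delta_{t,i} f^\top f_i \;\le\; K_1^2 \delta_{t,i} \|w - w_{i-1}\|^2 + K_2^2 \delta_{t,i} \|D - D_i\|^2, \quad \text{for } i = 1, \dots, t.
$$

Summing the inequalities over all $i = 1, \dots, t$,

$$
2\sum_{i=1}^{t} \delta_{t,i} - \sum_{i=1}^{t} \delta_{t,i} f^\top f_i \;\le\; K_1^2 \sum_{i=1}^{t} \delta_{t,i} \|w - w_{i-1}\|^2 + K_2^2 \sum_{i=1}^{t} \delta_{t,i} \|D - D_i\|^2.
$$

Rearranging the terms,

$$
2\sum_{i=1}^{t} \delta_{t,i} - K_1^2 \sum_{i=1}^{t} \delta_{t,i} \|w - w_{i-1}\|^2 - K_2^2 \sum_{i=1}^{t} \delta_{t,i} \|D - D_i\|^2 \;\le\; \sum_{i=1}^{t} \delta_{t,i} f^\top f_i = \hat{\pi}(w, D)
$$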

## Appendix C Meta-validation Performance

Figure 5 shows the meta-validation performance as meta-training proceeds. HyperDistill converges much faster and generalizes better at convergence than the baselines, which is consistent with the meta-training convergence shown in Figure 2.

## Appendix D Hyper-hyperparameter Analysis

Our algorithm, HyperDistill, has a hyper-hyperparameter that we tune with a meta-validation set in the range . Figure 6 shows that with all the values of and for all the experimental setups we consider, HyperDistill outperforms all the baselines by significant margins. This demonstrates that the performance of HyperDistill is not very sensitive to the value of .

## Appendix E More Details of MetaWeightNet Experiments

For the additional experimental setup, note that we use the architecture recommended by the original paper. We also found that lower bounding the output of the weighting function with can stabilize the training. Figure 7 shows the resulting loss weighting function learned with each algorithm. We see that the weighting function learned with HyperDistill tends to output lower values than those learned with the baselines.

## Appendix F Computational Efficiency

Table 3 shows the computational efficiency measured in actual memory usage and average wall-clock time required to complete a single inner-optimization. Although HyperDistill requires slightly more memory and wall-clock time than the 1-step or Neumann IFT methods, the additional cost is tolerable given the superior meta-test performance shown in Table 2.