1 Introduction
Many successes of deep learning today rely on enormous amounts of labeled data, which is impractical for small-data problems where acquiring enough labeled data is expensive and laborious. Moreover, in many mission-critical applications, such as autonomous vehicles and drones, an agent needs to adapt rapidly to unseen environments. The ability to learn rapidly is a key characteristic of humans that distinguishes them from artificial agents. Humans leverage prior knowledge learned earlier when adapting to a changing task, an ability that can be explained as Bayesian inference (Lake et al. (2015)). Pioneered by Schmidhuber (1987), meta-learning has recently emerged as a field of study for learning from small amounts of data. Meta-learning algorithms learn their base models by sampling many different smaller tasks from a large data source instead of trying to emulate a computationally intractable Bayesian inference. As a result, one might expect the meta-learned model to generalize well to new unseen tasks because of the task-agnostic way of training.
In this paper, we focus on a family of meta-learning algorithms that aim to learn the initialization of a network, which is then fine-tuned at test time on the new task by a few steps of gradient descent. The most prominent work in this family is MAML (Finn et al. (2017)). MAML directly optimizes the performance of the model with respect to the test-time fine-tuning procedure, leading to a better initialization point.
In the parallel or batched version of MAML-style algorithms (e.g., Reptile (Nichol et al. (2018))), each iteration optimizes over a batch of tasks. However, every task is assumed to contribute equally to the model parameter update, i.e., the update rule is merely the average of the task losses. As an example, suppose a batch of tasks is sampled in which most tasks strongly agree on the gradient direction but one task has a large disagreeing gradient. In this case, the disagreeing task can adversely affect the next model parameter update.
The contribution of this paper is an optimization objective, gradient agreement, by which a model can generalize better across tasks. The proposed algorithm and its mathematical analysis are presented in Sections 2 and 3, respectively. In Section 4, we provide empirical insights through evaluations on the miniImageNet (Vinyals et al. (2016)) and Omniglot (Lake et al. (2011)) classification tasks and on a toy regression task.
2 Gradient Agreement
In this work, we consider the optimization problem of MAML, which is finding a good initial set of parameters, $\theta$. The goal is that for a randomly sampled task $\tau$ with corresponding loss $L_\tau$, the model converges to a low-loss point after $k$ gradient steps. This objective can be written as $\min_\theta \mathbb{E}_\tau\left[L_\tau\left(U_\tau^k(\theta)\right)\right]$, where $U_\tau^k$ is a function performing $k$ gradient steps that updates $\theta$ using samples drawn from task $\tau$. Minimizing this objective function requires two steps in MAML-style works: 1) adapting to a new task $\tau$, in which the model parameters $\theta$ become $\tilde{\theta}_\tau$ by $k$ steps of gradient descent on task $\tau$, referred to as inner-loop optimization; 2) updating the model parameters by optimizing the performance of $f_{\tilde{\theta}_\tau}$ with respect to $\theta$ across all sampled tasks, performed as $\theta \leftarrow \theta - \beta \nabla_\theta \sum_\tau L_\tau(f_{\tilde{\theta}_\tau})$ with outer step size $\beta$, and referred to as outer-loop optimization.
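The inner-loop and outer-loop steps can be sketched on a toy problem. The NumPy code below is not the paper's implementation: it assumes each task is a small least-squares problem, it uses the first-order (Reptile-style) approximation in which a task's update vector is the difference between the initial and the adapted parameters rather than a gradient taken through the inner loop, and the names `task_grad`, `inner_loop`, `outer_step`, `alpha`, and `beta` are hypothetical.

```python
import numpy as np

def task_grad(theta, X, y):
    """Gradient of the loss 0.5 * mean((X @ theta - y) ** 2) for one task."""
    return X.T @ (X @ theta - y) / len(y)

def inner_loop(theta, X, y, k=5, alpha=0.1):
    """Run k plain gradient steps on one task, starting from theta."""
    theta_tilde = theta.copy()
    for _ in range(k):
        theta_tilde -= alpha * task_grad(theta_tilde, X, y)
    return theta_tilde

def outer_step(theta, tasks, k=5, alpha=0.1, beta=0.5):
    """One outer-loop step in which every task contributes equally."""
    # Update vector of a task: initial parameters minus adapted parameters.
    updates = [theta - inner_loop(theta, X, y, k, alpha) for X, y in tasks]
    return theta - beta * np.mean(updates, axis=0)

# Toy batch: four regression tasks whose true parameters are close together.
rng = np.random.default_rng(0)
base = np.array([1.0, -2.0, 0.5])
tasks = []
for _ in range(4):
    X = rng.normal(size=(20, 3))
    true_theta = base + 0.1 * rng.normal(size=3)
    tasks.append((X, X @ true_theta))

theta = np.zeros(3)
for _ in range(100):
    theta = outer_step(theta, tasks)
```

On this toy batch the learned initialization ends up near the shared `base` vector, since all tasks pull in roughly the same direction.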
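The assumption in prior work (Finn et al. (2017); Nichol et al. (2018)) is that each task contributes equally to the direction of the outer-loop optimization step. In our approach, by contrast, we associate a weight $w_i$ with the loss of each task $\tau_i$, and the meta-optimization rule becomes $\theta \leftarrow \theta - \beta \nabla_\theta \sum_i w_i L_{\tau_i}(f_{\tilde{\theta}_{\tau_i}})$, where $\beta$ is the outer step size. Denoting the gradient update vector of task $\tau_i$ by $g_i$ and the batch of tasks by $T$, we define $w_i$ as:
$$w_i = \frac{\sum_{j \in T} g_i^\top g_j}{\sum_{k \in T} \left| \sum_{j \in T} g_k^\top g_j \right|} \qquad (1)$$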
The proposed weight $w_i$ is proportional to the inner product of the gradient of a task and the average of the gradients of all tasks in a batch. Therefore, if the gradient of a task is aligned with the average of the gradients of all tasks, it contributes more than other tasks to the next parameter update. With this insight, the proposed optimization method pushes the model parameters towards initial parameters that more tasks agree upon. The full procedure of the proposed optimization is given in Algorithm 1.
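As a minimal sketch of this weighting, the NumPy code below computes one weight per task from the inner products of the per-task update vectors, normalized so that the absolute values of the weights sum to one. The function name `gradient_agreement_weights`, the equal-weight fallback for a degenerate batch, and the toy vectors are assumptions made for illustration.

```python
import numpy as np

def gradient_agreement_weights(grads):
    """One weight per task from the update vectors stacked as rows of `grads`."""
    grads = np.asarray(grads, dtype=float)
    # scores[i] = sum_j g_i . g_j = g_i . (sum of all update vectors)
    scores = grads @ grads.sum(axis=0)
    denom = np.abs(scores).sum()
    if denom == 0.0:
        # Degenerate batch (assumed handling): fall back to equal weights.
        return np.full(len(grads), 1.0 / len(grads))
    return scores / denom

# Three tasks that roughly agree and one task with a large disagreeing update.
grads = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [1.0, -0.1],
    [-2.0, 0.5],
])
w = gradient_agreement_weights(grads)
weighted_update = w @ grads          # weighted combination of update vectors
plain_average = grads.mean(axis=0)   # equally weighted combination
```

In this toy batch the disagreeing task receives a negative weight, so the weighted update points along the direction the other three tasks agree on, whereas the plain average is pulled toward the outlier.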
3 Analysis
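In this section, inspired by Jenni and Favaro (2018), we provide a mathematical analysis of the proposed method. Denoting a batch of sampled tasks as $T$, the outer-loop update in the batched version of MAML with current parameters $\theta_t$ and step size $\epsilon$ is:
$$\theta_{t+1} = \theta_t - \epsilon \sum_{i \in T} \nabla L_i(\theta_t) \qquad (2)$$
Instead, in the gradient agreement approach, we look for a linear combination of task losses that leads to a better approximation of the task errors. We use a weight $w_i$ for each task, which is estimated at each iteration. Therefore, the goal is to find parameters $\theta$ and coefficients $w$ so that the model performs well on the sampled batch of tasks. We thus aim to optimize the following objective function:
$$\hat{w} = \arg\min_{w} \sum_{i \in T} L_i(\theta(w)) + \frac{\mu}{2} \|w\|^2 \qquad (3)$$
$$\text{s.t.} \quad \theta(w) = \arg\min_{\theta} \sum_{i \in T} w_i L_i(\theta) \qquad (4)$$
where $w$ is the vector of all weights $w_i$ associated with each task, and $\mu$ is a regularization parameter for the distribution of weights. In this section we show that the inner-product metric introduced in Equation 1 corresponds to minimizing the objective function of Equation 3. Note that the minimizer $\theta(w)$ does not change if we multiply all the weights $w_i$ by the same positive constant. As a result, we normalize the magnitude of $w$ to one. A typical way to solve the above optimization is the so-called implicit differentiation method (Domke (2012)), which solves a linear system in the second-order derivatives of the loss functions and therefore leads to a very high-dimensional linear system. To avoid the computational complexity of that method, we apply a proximal approximation using Taylor series.
The loss function of the $i$th task at $\theta_t$ can be approximated using Taylor series as:
$$L_i(\theta) \approx L_i(\theta_t) + g_i^\top (\theta - \theta_t) \qquad (5)$$
where $g_i$ is the update vector obtained in the inner-loop optimization for the $i$th task, used in place of the gradient $\nabla L_i(\theta_t)$. By plugging the first-order approximation of (5) into (3) and (4), we obtain the following objective function:
$$\hat{w} = \arg\min_{w} \sum_{i \in T} \left[ L_i(\theta_t) + g_i^\top (\theta(w) - \theta_t) \right] + \frac{\mu}{2} \|w\|^2 \qquad (6)$$
$$\text{s.t.} \quad \theta(w) = \arg\min_{\theta} \sum_{i \in T} w_i \left[ L_i(\theta_t) + g_i^\top (\theta - \theta_t) \right] + \frac{\|\theta - \theta_t\|^2}{2\epsilon} \qquad (7)$$
The closed-form solution to the quadratic problem in (7) is the classical SGD update rule:
$$\theta(w) = \theta_t - \epsilon \sum_{i \in T} w_i g_i \qquad (8)$$
Substituting (8) into (6) yields:
$$\hat{w} = \arg\min_{w} \sum_{i \in T} L_i(\theta_t) - \epsilon \sum_{i \in T} \sum_{j \in T} w_j \, g_i^\top g_j + \frac{\mu}{2} \|w\|^2 \qquad (9)$$
We compute the derivative of (9) with respect to $w_j$ for all $j$ and set it to zero, temporarily ignoring the normalization constraint on $w$:
$$-\epsilon \sum_{i \in T} g_i^\top g_j + \mu w_j = 0 \qquad (10)$$
Therefore, $w_j$ will be:
$$w_j = \frac{\epsilon}{\mu} \sum_{i \in T} g_i^\top g_j \qquad (11)$$
Now, by applying the constraint on the norm of $w$, $\|w\|_1 = 1$, we obtain $\mu$ as:
$$\mu = \epsilon \sum_{k \in T} \left| \sum_{i \in T} g_i^\top g_k \right| \qquad (12)$$
As a result, $w_j$ will be:
$$w_j = \frac{\sum_{i \in T} g_i^\top g_j}{\sum_{k \in T} \left| \sum_{i \in T} g_i^\top g_k \right|} \qquad (13)$$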
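The first-order argument can be checked numerically. The sketch below assumes the linearized outer objective has the form $J(w) = -\epsilon \sum_i \sum_j w_j \, g_i^\top g_j + \frac{\mu}{2}\|w\|^2$ with constant terms dropped, where $\epsilon$ is the outer step size and $\mu$ the regularization parameter; this form, the random update vectors, and the values of `eps` and `mu` are assumptions of the sketch. It checks that the gradient of $J$ vanishes at $w_j = \frac{\epsilon}{\mu}\sum_i g_i^\top g_j$ and that rescaling to unit L1 norm removes the dependence on $\epsilon$ and $\mu$.

```python
import numpy as np

def objective(w, G, eps, mu):
    """Assumed linearized outer objective with constant terms dropped."""
    s = G @ G.sum(axis=0)              # s[j] = sum_i g_i . g_j
    return -eps * (w @ s) + 0.5 * mu * (w @ w)

rng = np.random.default_rng(0)
G = rng.normal(size=(5, 8))            # 5 tasks, 8 parameters (arbitrary sizes)
eps, mu = 0.1, 2.0                     # hypothetical values

s = G @ G.sum(axis=0)
w_star = (eps / mu) * s                # candidate stationary point

# Central finite-difference gradient of the objective at w_star.
h = 1e-6
fd_grad = np.array([
    (objective(w_star + h * e, G, eps, mu)
     - objective(w_star - h * e, G, eps, mu)) / (2 * h)
    for e in np.eye(len(w_star))
])

# After normalization the weights depend only on the inner products.
w_normalized = w_star / np.abs(w_star).sum()
```

Because the assumed objective is quadratic in `w`, the finite-difference gradient is zero up to rounding error, and `w_normalized` equals the inner-product scores divided by the sum of their absolute values.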
4 Experiments and Results
4.1 One-Dimensional Sine Wave Regression
As a simple case study, consider the 1D sine wave regression problem. A task is defined by the amplitude and phase of a sine wave function , and are sampled from and , respectively. The average loss over 1000 unseen sine wave tasks is 0.13 for Reptile and 0.08 for the gradient agreement method.
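A task sampler for this kind of experiment might look like the following sketch. The functional form `amplitude * sin(x + phase)`, the use of uniform sampling, the input range, and the numeric ranges passed in are all placeholders chosen for illustration and are not taken from the text; the constant-zero predictor is only a stand-in for an adapted model.

```python
import math
import random

def sample_sine_task(amp_range, phase_range, rng):
    """Draw one task: a sine wave with random amplitude and phase.

    Uniform sampling and the ranges passed in are assumptions of this sketch.
    """
    amplitude = rng.uniform(*amp_range)
    phase = rng.uniform(*phase_range)

    def target(x):
        return amplitude * math.sin(x + phase)

    return target, amplitude, phase

def mean_squared_error(predict, target, xs):
    """Average squared error of `predict` against `target` on inputs `xs`."""
    return sum((predict(x) - target(x)) ** 2 for x in xs) / len(xs)

rng = random.Random(0)
# Placeholder ranges, chosen only for illustration.
AMP_RANGE, PHASE_RANGE = (0.5, 2.0), (0.0, math.pi)

losses = []
for _ in range(1000):
    target, amplitude, phase = sample_sine_task(AMP_RANGE, PHASE_RANGE, rng)
    xs = [rng.uniform(-5.0, 5.0) for _ in range(10)]
    # Trivial stand-in for an adapted model: always predict zero.
    losses.append(mean_squared_error(lambda x: 0.0, target, xs))
average_loss = sum(losses) / len(losses)
```

Replacing the zero predictor with a model adapted by a few gradient steps on each task would give the kind of average loss over unseen tasks reported above.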
4.2 Few-shot Classification
We evaluate our method on two popular few-shot classification tasks, miniImageNet (Vinyals et al. (2016)) and Omniglot (Lake et al. (2011)). The Omniglot dataset consists of 20 instances of each of 1623 characters from 50 different alphabets. The miniImageNet dataset includes 64 training classes, 12 validation classes, and 24 test classes, and each class contains 600 examples. The problem of N-way classification is set up as follows: we sample N classes from the total C classes in the dataset and then select K examples for each class. For our experiments, we use the same CNN architectures and data preprocessing as Finn et al. (2017) and Nichol et al. (2018). The results for miniImageNet and Omniglot are presented in Tables 1 and 2, respectively. The models learned by our approach compare well to the state-of-the-art results on these two tasks, and they substantially outperform the prior methods on the challenging miniImageNet task.
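The N-way, K-shot sampling described above can be sketched with the Python standard library. The function name `sample_episode`, the dictionary layout of the dataset, and the toy class names are hypothetical; real experiments would draw images from the training, validation, or test classes instead.

```python
import random

def sample_episode(examples_by_class, n_way, k_shot, rng):
    """Sample N classes, then K examples per class, relabeled 0..N-1."""
    classes = rng.sample(sorted(examples_by_class), n_way)
    episode = []
    for label, cls in enumerate(classes):
        for example in rng.sample(examples_by_class[cls], k_shot):
            episode.append((example, label))
    rng.shuffle(episode)
    return episode

# Toy dataset: 7 classes with 20 placeholder examples each.
rng = random.Random(0)
dataset = {
    name: [f"{name}_{i}" for i in range(20)]
    for name in ["a", "b", "c", "d", "e", "f", "g"]
}
episode = sample_episode(dataset, n_way=5, k_shot=1, rng=rng)
```

Relabeling the sampled classes as 0 to N-1 within each episode keeps the classifier head at a fixed size of N outputs regardless of which classes were drawn.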
5 Conclusion and Future Work
In this work, we presented a generalization method for meta-learning algorithms that adjusts the model parameters by introducing a set of weights over the loss functions of the tasks in a batch in order to maximize the dot products between the gradients of different tasks. The higher the inner products between the gradients of different tasks, the more they agree on the model parameter update. We also derived the objective function of this optimization method through a theoretical analysis using a first-order Taylor series approximation. This geometrical interpretation of optimization in meta-learning can be an interesting direction for future work.
References
 Lake et al. [2015] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 2015.
 Schmidhuber [1987] Jürgen Schmidhuber. Evolutionary principles in self-referential learning. Master’s thesis, 1987.
 Finn et al. [2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. CoRR, abs/1703.03400, 2017. URL http://arxiv.org/abs/1703.03400.
 Nichol et al. [2018] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. CoRR, abs/1803.02999, 2018. URL http://arxiv.org/abs/1803.02999.
 Vinyals et al. [2016] Oriol Vinyals, Charles Blundell, Timothy P. Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. CoRR, abs/1606.04080, 2016. URL http://arxiv.org/abs/1606.04080.
 Lake et al. [2011] Brenden M. Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua B. Tenenbaum. One shot learning of simple visual concepts. In CogSci, 2011.
 Jenni and Favaro [2018] Simon Jenni and Paolo Favaro. Deep bilevel learning. CoRR, abs/1809.01465, 2018.

 Domke [2012] Justin Domke. Generic methods for optimization-based modeling. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, 2012.