1 Introduction
A key challenge in metalearning is fast adaptation: learning on previously unseen tasks fast and with little data. In principle, this can be achieved by leveraging knowledge obtained in other, related tasks. However, the best way to do so remains an open question.
A popular recent method for fast adaptation is model agnostic meta learning (MAML) [Finn et al., 2017a], which learns a model initialisation, such that at test time the model can be adapted to solve the new task in only a few gradient steps. MAML has an interleaved training procedure, comprised of inner loop and outer loop updates that operate on a batch of tasks at each iteration. In the inner loop, MAML learns taskspecific parameters by performing one gradient step on a taskspecific loss. Then, in the outer loop, the model parameters from before the inner loop update are updated to reduce the expected loss across tasks after the inner loop update on the individual tasks. Hence, MAML learns a model initialisation that, at test time, can generalise to a new task after only a few gradient updates.
However, while MAML adapts the entire model to the new task, many transfer learning algorithms adapt only a fraction of the model [Kokkinos, 2017], keeping the rest fixed across tasks. For example, representations learned for image classification [He et al., 2016] can be reused for semantic segmentation [He et al., 2017] or tracking [Wojke et al., 2017]. This suggests that some model parameters can be considered task independent and others task specific. Adapting only some model parameters can make learning faster and easier, as well as mitigate overfitting and catastrophic forgetting.
To this end, we propose context adaptation for meta-learning (CAML), a new method for fast adaptation via meta-learning. Like MAML, we learn a model initialisation that can quickly be adapted to new tasks. However, unlike MAML, we adapt only a subset of the model parameters to the new task. While restricting adaptation in this way is straightforward, it raises a key question: how should we decide which parameters to adapt and which to keep fixed? The main insight behind CAML is that, for many fast adaptation problems, the inner loop reduces to a task identification problem, rather than learning how to solve the whole task, which is typically infeasible with only a few gradient updates. Thus, it suffices if the part of the model that varies across tasks is an additional input to the model, and is independent of its other inputs.
These additional inputs, which we call context parameters (see Figure 1), can be interpreted as a task embedding that modulates the behaviour of the model. This embedding is learned via backpropagation during the inner loop of a metalearning procedure similar to MAML, while the rest of the model is updated only in the outer loop. This allows CAML to explicitly optimise the taskindependent parameters for good performance across tasks, while ensuring that the taskspecific context parameters can quickly adapt to new tasks.
This separation of task solver and task embedding has several advantages. First, the size of both components can be chosen appropriately for the task. In particular, the network can be made expressive enough without overfitting to a single task in the inner loop, which we show empirically MAML is prone to. Model design and architecture choice also benefit from this separation, since for many practical problems we have prior knowledge of which aspects vary across tasks and hence how much capacity the context parameter should have. Like MAML, our method is modelagnostic, i.e., it can be applied to any model that is trained via gradient descent. However, CAML is easier to implement: assigning the correct computational graphs for higher order gradients is done only on the level of the context parameters, avoiding manual access and operations on the network weights and biases. Furthermore, parameter copies are not necessary which saves memory writes, a common bottleneck for running on GPUs. CAML can also help distributed machine learning systems, where the same model is deployed to different machines and we wish to learn different contexts concurrently. Network communication is often the bottleneck, which is mitigated by only sharing the (gradients of) context parameters.
We show empirically that CAML outperforms MAML on a regression and classification task and performs similarly on a reinforcement learning problem, while adapting significantly fewer parameters at test time. We observe that CAML is less sensitive to the innerloop learning rate, and can be scaled up to larger networks without overfitting. We also demonstrate that the context parameters represent meaningful embeddings of tasks, confirming that the inner loop acts as a task identification step.
2 Background: Meta-Learning for Fast Adaptation
We consider settings where the goal is to learn models that can quickly adapt to a new task with only little data. To this end, learning on the new task is preceded by metalearning on a set of related tasks. Here we describe the metalearning problem for supervised and reinforcement learning, as well as MAML.
2.1 Problem Setting
In few-shot learning problems, we are given distributions over training tasks $p_{train}(\mathcal{T})$ and test tasks $p_{test}(\mathcal{T})$. Training tasks can be used to learn how to adapt fast to any of the tasks with little per-task data, and evaluation is then done on (previously unseen) test tasks. Unless stated otherwise, we assume that $p_{train} = p_{test}$ and refer to both as $p$. Tasks in $p$ typically share some structure, so that transferring knowledge between tasks speeds up learning. During each meta-training iteration, a batch of $N$ tasks $\mathbf{T} = \{\mathcal{T}_i\}_{i=1}^{N}$ is sampled from $p$.
Supervised Learning.
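In a supervised learning setting, we learn a model $f: x \mapsto \hat{y}$ that maps data points $x \in \mathcal{X}$ that have a true label $y \in \mathcal{Y}$ to predictions $\hat{y} \in \mathcal{Y}$. A task is defined as a tuple $\mathcal{T}_i = (\mathcal{X}, \mathcal{Y}, \mathcal{L}, q)$, where $\mathcal{X}$ is the input space, $\mathcal{Y}$ is the output space, $\mathcal{L}$ is a task-specific loss function, and $q(x, y)$ is a distribution over labelled data points. We assume that all data points are drawn i.i.d. from $q$. Different tasks can be created by changing any element of $\mathcal{T}_i$.
Training in the supervised meta-learning setting proceeds over meta-training iterations, where for every task $\mathcal{T}_i$ from the current batch $\mathbf{T}$, we sample two datasets $\mathcal{D}_i^{train}$ (for training) and $\mathcal{D}_i^{test}$ (for testing) from $q_{\mathcal{T}_i}$:
$$\mathcal{D}_i^{train} = \{(x, y)^{i,m}\}_{m=1}^{M_i^{train}}, \qquad \mathcal{D}_i^{test} = \{(x, y)^{i,m}\}_{m=1}^{M_i^{test}}, \qquad (1)$$
where $(x, y) \sim q_{\mathcal{T}_i}$, and $M_i^{train}$ and $M_i^{test}$ are the number of training and test datapoints, respectively. The training data is used to update $f$, and the test data is then used to evaluate how good this update was, and adjust $f$ or the update rule accordingly.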
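Reinforcement Learning. In a reinforcement learning (RL) setting, we aim to learn a policy $\pi$ that maps states $s \in \mathcal{S}$ to actions $a \in \mathcal{A}$. Each task corresponds to a Markov decision process (MDP): a tuple $\mathcal{T}_i = (\mathcal{S}, \mathcal{A}, r, q, q_0)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $r(s_t, a_t, s_{t+1})$ is a reward function, $q(s_{t+1} \mid s_t, a_t)$ is a transition function, and $q_0(s_0)$ is an initial state distribution. The goal is to maximise the expected cumulative reward $\mathcal{J}$ under $\pi$,
$$\mathcal{J}(\pi) = \mathbb{E}_{q_0, q, \pi}\left[\sum_{t=0}^{H-1} \gamma^{t}\, r(s_t, a_t, s_{t+1})\right], \qquad (2)$$
where $H$ is the horizon and $\gamma$ is the discount factor. Again, different tasks can be created by changing any element of $\mathcal{T}_i$.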
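As a small illustration of the objective in (2), the sketch below computes a discounted return for one episode and averages it over several sampled episodes as a Monte Carlo estimate of the expectation. The function names and the list-of-rewards episode format are assumptions made for this example only; they are not taken from the paper's implementation.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one episode, with t starting at 0."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


def estimate_objective(episodes, gamma):
    """Monte Carlo estimate of the expected return: the mean over episodes.

    `episodes` is a list of reward sequences, each collected by rolling out
    the same policy for at most H steps (hypothetical data format).
    """
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)


if __name__ == "__main__":
    rollouts = [[1.0, 1.0, 1.0], [0.0, 0.0, 4.0]]
    print(estimate_objective(rollouts, gamma=0.5))
```

With a discount factor below one, rewards that arrive later in the episode contribute less to the objective than early ones.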
During each meta-training iteration, for every task $\mathcal{T}_i$ from the current batch $\mathbf{T}$, we first collect a trajectory
$$\tau_i^{train} = \{s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_{M_i^{train}-1}, a_{M_i^{train}-1}, r_{M_i^{train}-1}, s_{M_i^{train}}\}, \qquad (3)$$
where the initial state $s_0$ is sampled from $q_0$, the actions are chosen by the current policy $\pi$, the state transitions according to $q$, and $M_i^{train}$ is the number of environment interactions available. We unify several episodes in this formulation: if the horizon $H$ is reached within the trajectory, the environment is reset using $q_0$. Once the trajectory is collected, this data is used to update the policy. Another trajectory $\tau_i^{test}$ is then collected by rolling out the updated policy for $M_i^{test}$ time steps. This test trajectory is used to evaluate the quality of the update on that task, and to adjust $\pi$ or the update rule accordingly.
Evaluation for both supervised and reinforcement learning problems is done on a new (unseen) set of tasks drawn from $p$ (or $p_{test}$ if the test distribution of tasks is different). For each such task, the model is updated using $\mathcal{L}_{\mathcal{T}}$ or $\mathcal{J}_{\mathcal{T}}$ and only a few datapoints ($\mathcal{D}^{train}$ or $\tau^{train}$). Performance of the updated model is reported on $\mathcal{D}^{test}$ or $\tau^{test}$.
2.2 Model-Agnostic Meta-Learning
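One method for few-shot learning is model-agnostic meta-learning [Finn et al., 2017a, MAML]. Here, we describe the application of MAML to a supervised learning setting. MAML learns an initialisation for the parameters $\theta$ of a model $f_\theta$ such that, given a new task, a good model for that task can be learned with only a small number of gradient steps and data points. In the inner loop, MAML computes new task-specific parameters $\theta_i$ (starting from $\theta$) via one gradient update (we outline the method for one gradient update here, but several gradient steps can be performed at this point as well),
$$\theta_i = \theta - \alpha \nabla_\theta \frac{1}{M_i^{train}} \sum_{(x,y) \in \mathcal{D}_i^{train}} \mathcal{L}_{\mathcal{T}_i}(f_\theta(x), y), \qquad (4)$$
where $\alpha$ is the inner-loop learning rate. For the meta-update in the outer loop, the original model parameters $\theta$ are then updated with respect to the performance after the inner-loop update, i.e.,
$$\theta \leftarrow \theta - \beta \nabla_\theta \frac{1}{N} \sum_{\mathcal{T}_i \in \mathbf{T}} \frac{1}{M_i^{test}} \sum_{(x,y) \in \mathcal{D}_i^{test}} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i}(x), y), \qquad (5)$$
where $\beta$ is the outer-loop learning rate. The result of training is a model initialisation $\theta$ that can be adapted with just a few gradient steps to any new task that we draw from $p$. Since the gradient is taken with respect to the parameters $\theta$ before the inner-loop update (4), the outer-loop update (5) involves higher order derivatives in $\theta$.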
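To make the role of the higher order derivatives concrete, here is a minimal sketch of one MAML meta-step for a deliberately tiny model, f(x) = theta * x with a mean-squared error loss, where all derivatives can be written by hand. The scalar model, the function names and the step sizes are assumptions chosen for illustration; a real implementation would rely on automatic differentiation rather than these closed-form expressions.

```python
def mse(theta, data):
    """Mean squared error of the scalar model f(x) = theta * x."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)


def mse_grad(theta, data):
    """Derivative of mse with respect to theta."""
    return sum(2.0 * x * (theta * x - y) for x, y in data) / len(data)


def inner_update(theta, train, alpha):
    """Inner loop: one gradient step on the task's training data."""
    return theta - alpha * mse_grad(theta, train)


def meta_gradient(theta, train, test, alpha):
    """Derivative of the post-update test loss with respect to the
    pre-update theta, via the chain rule through the inner step."""
    theta_i = inner_update(theta, train, alpha)
    # d(theta_i)/d(theta) = 1 - alpha * (second derivative of the train loss);
    # this factor is where the higher order term enters.
    curvature = sum(2.0 * x * x for x, _ in train) / len(train)
    return mse_grad(theta_i, test) * (1.0 - alpha * curvature)


def maml_step(theta, tasks, alpha, beta):
    """Outer loop: average the meta-gradient over the task batch."""
    g = sum(meta_gradient(theta, tr, te, alpha) for tr, te in tasks) / len(tasks)
    return theta - beta * g
```

Dropping the factor `(1.0 - alpha * curvature)` would give a first-order approximation of the meta-gradient for this toy model.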
3 Fast Context Adaptation via Meta-Learning
We propose to partition the model parameters into two parts: context parameters $\phi$ that are adapted in the inner loop on an individual task, and parameters $\theta$ that are shared across tasks and meta-learned in the outer loop. In the following we describe the training procedure for supervised and reinforcement learning problems. Pseudo-code is provided in Appendix A.
3.1 Supervised Learning
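At every meta-training iteration and for the current batch $\mathbf{T}$ of tasks, we use the training data $\mathcal{D}_i^{train}$ of each task $\mathcal{T}_i \in \mathbf{T}$ as follows. Starting from $\phi_0$, which can either be fixed or meta-learned as well (we typically choose $\phi_0 = 0$; see Section 3.4), we learn task-specific parameters $\phi_i$ via one gradient update:
$$\phi_i = \phi_0 - \alpha \nabla_\phi \frac{1}{M_i^{train}} \sum_{(x,y) \in \mathcal{D}_i^{train}} \mathcal{L}_{\mathcal{T}_i}(f_{\phi_0, \theta}(x), y). \qquad (6)$$
While we only take the gradient with respect to $\phi$, the updated parameter $\phi_i$ is also a function of $\theta$, since during backpropagation, the gradients flow through the model. Once we have collected the updated parameters $\phi_i$ for all sampled tasks, we proceed to the meta-learning step, in which $\theta$ is updated:
$$\theta \leftarrow \theta - \beta \nabla_\theta \frac{1}{N} \sum_{\mathcal{T}_i \in \mathbf{T}} \frac{1}{M_i^{test}} \sum_{(x,y) \in \mathcal{D}_i^{test}} \mathcal{L}_{\mathcal{T}_i}(f_{\phi_i, \theta}(x), y). \qquad (7)$$
This update includes higher order gradients in $\theta$ due to the dependency on (6).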
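The two-level procedure of this section can be sketched on a toy model with a single scalar context parameter. In the example below, the model f(x) = w*x + v*phi + b, the names `adapt_context` and `meta_step`, and the use of central finite differences in place of automatic differentiation are all assumptions made to keep the sketch dependency-free; it is not the paper's implementation (see Appendix A for the authors' pseudo-code).

```python
def predict(theta, phi, x):
    """Toy model: shared parameters theta = [w, v, b], scalar context phi."""
    w, v, b = theta
    return w * x + v * phi + b


def loss(theta, phi, data):
    return sum((predict(theta, phi, x) - y) ** 2 for x, y in data) / len(data)


def adapt_context(theta, train, alpha, phi0=0.0):
    """Inner loop: one gradient step on phi only; theta is not changed."""
    v = theta[1]
    grad_phi = sum(2.0 * v * (predict(theta, phi0, x) - y) for x, y in train) / len(train)
    return phi0 - alpha * grad_phi


def meta_objective(theta, tasks, alpha):
    """Average test loss after adapting phi separately on each task."""
    return sum(loss(theta, adapt_context(theta, tr, alpha), te) for tr, te in tasks) / len(tasks)


def meta_step(theta, tasks, alpha, beta, eps=1e-6):
    """Outer loop: update theta only. The adapted phi depends on theta, so the
    gradient passes through the inner update; central differences stand in
    for backpropagation here."""
    grads = []
    for k in range(len(theta)):
        hi = [t + (eps if j == k else 0.0) for j, t in enumerate(theta)]
        lo = [t - (eps if j == k else 0.0) for j, t in enumerate(theta)]
        grads.append((meta_objective(hi, tasks, alpha) - meta_objective(lo, tasks, alpha)) / (2 * eps))
    return [t - beta * g for t, g in zip(theta, grads)]
```

Note that the context parameter is reset to `phi0` for every task, so nothing task-specific is carried over between tasks; only `theta` accumulates knowledge across the batch.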
3.2 Reinforcement Learning
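During each iteration, for a current batch of MDPs $\mathbf{T} = \{\mathcal{T}_i\}_{i=1}^{N}$, we proceed as follows. Given $\phi_0$ (see Section 3.4), we collect a rollout $\tau_i^{train}$ by executing the policy $\pi_{\phi_0, \theta}$. We then compute task-specific parameters $\phi_i$ via one gradient update:
$$\phi_i = \phi_0 + \alpha \nabla_\phi \mathcal{J}_{\mathcal{T}_i}(\tau_i^{train}, \pi_{\phi_0, \theta}), \qquad (8)$$
where $\mathcal{J}_{\mathcal{T}_i}(\tau, \pi)$ is an objective function given by any gradient-based reinforcement learning method that uses trajectories $\tau$ produced by a parameterised policy $\pi$ to update that policy’s parameters, such as TRPO [Schulman et al., 2015] or DQN [Mnih et al., 2015]. After updating the policy, we collect another trajectory $\tau_i^{test}$ to evaluate the updated policy, where actions are chosen according to the updated policy $\pi_{\phi_i, \theta}$.
After doing this for all tasks in $\mathbf{T}$, we continue with the meta-update step. Here, we update the parameters $\theta$ to maximise the average performance across tasks (after individually updating $\phi$ for them),
$$\theta \leftarrow \theta + \beta \nabla_\theta \frac{1}{N} \sum_{\mathcal{T}_i \in \mathbf{T}} \mathcal{J}_{\mathcal{T}_i}(\tau_i^{test}, \pi_{\phi_i, \theta}). \qquad (9)$$
This update includes higher order gradients in $\theta$ due to the dependency on (8).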
3.3 Conditioning on Context Parameters
Since $\phi$ are independent of the network input, we need to decide where and how to condition the network on them. For an output node $h_i^{(l)}$ at a fully connected layer $l$, this can for example be done by simply concatenating $\phi$ to the inputs to that layer:
$$h_i^{(l)} = g\left(\sum_{j=1}^{J} \theta_{j,i}^{(l,h)}\, h_j^{(l-1)} + \sum_{k=1}^{K} \theta_{k,i}^{(l,\phi)}\, \phi_{0,k} + b\right), \qquad (10)$$
where $g$ is a nonlinear activation function, $b$ is a bias parameter, $\theta_{j,i}^{(l,h)}$ are the weights associated with layer input $h_j^{(l-1)}$, and $\theta_{k,i}^{(l,\phi)}$ are the weights associated with the context parameter $\phi_{0,k}$. This is illustrated in Figure 1. In our experiments, for fully connected networks, we add the context parameters at the first layer, i.e., concatenate them to the input.
Other conditioning methods can be used with CAML as well. For example, for convolutional networks, we use the feature-wise linear modulation (FiLM) method [Perez et al., 2017] for image classification experiments (Section 5.2). FiLM conditions by applying an affine transformation to the feature maps: given context parameters $\phi$ and a convolutional layer that outputs $M$ feature maps $\{h_i\}_{i=1}^{M}$, FiLM applies a linear transformation to each feature map, $FiLM(h_i) = \gamma_i h_i + \beta_i$, where the parameters $\gamma, \beta \in \mathbb{R}^{M}$ are a function of the context parameters. We compute them from $\phi$ using a fully connected layer with the identity function at the output. In our experiments, we found that it helps performance to add the context parameters later than the first layer (in our case, after the third out of four convolutional operations).
3.4 Context Parameter Initialisation
When learning a new task, the context parameters have to be initialised to some value, $\phi_0$. We argue that, instead of meta-learning this initialisation as well, a fixed $\phi_0$ is sufficient: in (10), if both the weights on the context parameters, $\theta^{(l,\phi)}$, and the bias $b$ are meta-learned, the learned initialisation of $\phi$ can be subsumed into the bias parameter $b$, and $\phi_0$ can be set to a fixed value. The same holds for conditioning when using FiLM layers. A key benefit of CAML is therefore that it is easy to implement, since the initialisation of the context parameters does not have to be meta-learned and parameter copies are not required. We set $\phi_0 = 0$ in our implementation, which gives the additional opportunity for visual inspection of the learned context parameters (see Sections 5.1 and 5.3).
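The subsumption argument can be checked numerically on a single fully connected layer that concatenates the context parameters to its inputs, as in Section 3.3. The sketch below is only an illustration: the list-based layer, the ReLU choice and the helper `fold_init_into_bias` are hypothetical names introduced here, not part of the paper's code.

```python
def relu(z):
    return z if z > 0.0 else 0.0


def layer_output(x, phi, W, V, b):
    """One output per row i: relu(W[i] . x + V[i] . phi + b[i]).

    W holds the weights on the layer inputs x, V the weights on the
    context parameters phi (equivalent to concatenating phi to x).
    """
    out = []
    for w_row, v_row, b_i in zip(W, V, b):
        z = sum(w * xj for w, xj in zip(w_row, x))
        z += sum(v * pk for v, pk in zip(v_row, phi))
        out.append(relu(z + b_i))
    return out


def fold_init_into_bias(V, b, phi0):
    """Bias that makes a zero-initialised context behave like phi0."""
    return [b_i + sum(v * p for v, p in zip(v_row, phi0)) for v_row, b_i in zip(V, b)]


if __name__ == "__main__":
    x, W = [1.0, -2.0], [[0.5, 0.25], [1.0, -1.0]]
    V, b, phi0 = [[2.0], [-1.0]], [0.5, 0.0], [0.5]
    b_folded = fold_init_into_bias(V, b, phi0)
    print(layer_output(x, phi0, W, V, b))
    print(layer_output(x, [0.0], W, V, b_folded))
```

Both calls produce the same activations, and the equality continues to hold if the same offset is added to the context in both parameterisations, which is why a learned initialisation adds no expressiveness once the bias is meta-learned.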
3.5 Learning Rate
Since the context parameters are inputs to the model, the gradients at this point are not backpropagated further through any other part of the model. Furthermore, because learning the context parameters $\phi$ and the shared parameters $\theta$ is decoupled, the inner-loop learning rate can effectively be meta-learned by the rest of the model. This makes the method robust to the initial learning rate that is chosen for the inner loop, as we show empirically in Sections 5.1 and 5.3.
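A toy calculation illustrates how the shared weights can absorb the choice of inner-loop learning rate. For a model whose prediction is `base + v * phi` with a scalar context and a mean-squared error loss, one inner step from a zero initialisation shifts the prediction by an amount proportional to `alpha * v**2`, so rescaling `v` cancels a change in `alpha`. In the sketch below the compensating weight is set by hand; in CAML the corresponding adjustment would have to emerge from meta-learning, and the model and function name are assumptions for illustration only.

```python
import math


def context_step(v, alpha, residuals):
    """One inner step on a scalar context (initialised at 0) for the model
    f = base + v * phi under MSE. `residuals` are base predictions minus
    targets. Returns the context gradient and the shift in the prediction."""
    grad_phi = sum(2.0 * v * r for r in residuals) / len(residuals)
    phi = -alpha * grad_phi
    return grad_phi, v * phi


if __name__ == "__main__":
    residuals = [-2.0, -2.5]
    for alpha in (0.001, 0.1, 10.0):
        v = 1.0 / math.sqrt(alpha)  # hand-picked compensating weight
        grad, shift = context_step(v, alpha, residuals)
        print(alpha, abs(grad), shift)
```

In this toy setting, smaller learning rates are paired with larger context gradients while the resulting change in the prediction stays the same.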
4 Related Work
Meta-learning, or learning to learn, has been explored in various ways in the literature. One general approach is to learn the algorithm or update function itself (among others, Schmidhuber [1987], Bengio et al. [1992], Andrychowicz et al. [2016], Ravi and Larochelle [2017]). Another approach is to meta-learn a model initialisation such that the model can perform well on a new task after only a few gradient steps, such as MAML [Finn et al., 2017a]. Other such methods are REPTILE [Nichol and Schulman, 2018], which does not require second order gradient computation, and MetaSGD [Li et al., 2017], which attempts to learn the per-parameter inner loop learning rate. Recent work by Grant et al. [2018] also considers a Bayesian interpretation of MAML. The main difference to our work is that we adapt only a small number of parameters in the inner learning loop, and that these parameters come in the form of input context parameters.
In Finn et al. [2017b] the authors augment the model with additional biases to improve the performance of MAML in a robotic manipulation setting. In contrast, we update only the context parameters in the inner loop, and they are initialised to zero before adaptation to a new task. In the context of neural language models, Rei [2015] also considers adapting only context parameters in the inner loop, but does not consider the benefits of initialising and resetting their value to zero. In addition, that work does not consider the application of the method to the variety of domains covered by this paper (something briefly explored in the appendix of Finn et al. [2017a]).
Other meta-learning methods are also motivated by the fact that learning in a high-dimensional parameter space can pose practical difficulties, and fast adaptation in lower dimensional space can be better (e.g., Sæmundsson et al. [2018], Zhou et al. [2018], Rusu et al. [2018]). Closely related to our method is the work of Lee and Choi [2018], who also update only part of the network at test time. However, they employ a learning mechanism that allows them to also learn which part of the network to update. This is less sensitive to the choice of initial learning rate (something we investigate in our empirical evaluation as well) and can outperform MAML on the MiniImagenet classification task. The idea of dynamic partitioning is attractive; however, it results in a more complex meta-learning algorithm. In this work we consider a simpler, more interpretable alternative. Our experiments include, where applicable, comparisons to the above algorithms. We show that CAML is highly competitive with the above approaches while remaining simple.
Context features as a component of inductive transfer were first introduced by Silver et al. [2008], who use a one-hot encoded task-specifying context as input to the network (which is not learned but predefined). They show that this works better than learning a shared feature extractor and having separate heads for all tasks. Learning a task embedding itself has also been explored, e.g., by Oreshkin et al. [2018] or Garnelo et al. [2018], who use the task’s training set to condition the network. By contrast, we learn the context parameters via backpropagation through the same network that is used to solve the task.
5 Experiments
In this section we empirically evaluate CAML. Our extensive experiments aim to demonstrate three qualities of our method. First, adapting a small number of input parameters during the inner loop is sufficient to yield performance equivalent to or better than MAML in a range of regression, classification and reinforcement learning tasks. Like in MAML, it is possible to continue learning by performing several gradient update steps at test time, even when training using only one gradient step. Second, CAML is robust to the taskspecific learning rate and scales well to more expressive networks without overfitting. Third, an embedding of the task emerges in the context parameters solely via backpropagation through the original inner loss.
5.1 Regression
[Table 1: Regression results for CAML and MAML by number of additional input parameters (0 to 5).]
[Table 2: Regression results for alternative parameter partitions. Parallel partitioning: 1, 5 or 20 nodes per layer. Stacked partitioning: by layer.]
We start with the regression problem of fitting sine curves, using the same setup as Finn et al. [2017a] to allow a direct comparison. A task is defined by the amplitude and phase of the sine curve, and is generated by uniformly sampling the amplitude from and the phase from . For training, ten labelled datapoints (uniformly sampled from ) are given for each task for the inner loop update. Per meta-update we iterate over a batch of tasks and perform gradient descent on a mean-squared error (MSE) loss. We use a neural network with two hidden layers and nodes each and ReLU nonlinearities. During testing we present the model with ten datapoints from 1000 newly sampled tasks and measure MSE over 100 test points.
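The task construction described above can be sketched as follows. The amplitude, phase and input ranges used as defaults here are placeholder assumptions chosen for illustration, not values taken from this text, and the function names and the `sin(x - phase)` parameterisation are likewise hypothetical.

```python
import math
import random


def sample_sine_task(rng, amp_range=(0.1, 5.0), phase_range=(0.0, math.pi)):
    """Draw one regression task: a sine curve with random amplitude and phase.
    The default ranges are illustrative assumptions."""
    amplitude = rng.uniform(*amp_range)
    phase = rng.uniform(*phase_range)
    return lambda x: amplitude * math.sin(x - phase)


def sample_points(task, rng, n, x_range=(-5.0, 5.0)):
    """Draw n labelled datapoints (x, y) from a task (assumed input range)."""
    xs = [rng.uniform(*x_range) for _ in range(n)]
    return [(x, task(x)) for x in xs]


if __name__ == "__main__":
    rng = random.Random(0)
    task = sample_sine_task(rng)
    train = sample_points(task, rng, 10)   # ten points for the inner loop
    test = sample_points(task, rng, 100)   # held-out points for evaluation
    print(len(train), len(test))
```

Each meta-training iteration would draw a fresh batch of such tasks, using the small training set for the inner-loop update and the held-out set to compute the meta-loss.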
CAML uses the same training procedure and architecture but adds context parameters. To allow a fair comparison, we add the same number of additional inputs to MAML, an extension that was also done by Finn et al. [2017b]. These additional parameters are meta-learned together with the rest of the network, which can improve performance due to a more expressive gradient. Our method differs from this formulation in that we update only the context parameters in the inner loop, and reinitialise them to zero for each new task. In the outer loop, we only update the shared parameters.
Table 1 shows that CAML outperforms the original MAML (with no additional inputs) significantly, and MAML with the same network architecture by a small margin. This performance gain is possible even though at test time, CAML adapts only  parameters, instead of around . To test the hypothesis that it suffices to adapt only input parameters per task, we also compare to alternative parameter partitions in Table 2. In parallel partitioning, we choose a strict subset of the nodes of each layer for taskspecific adaptation, and metalearn the rest. In stacked partitioning, we choose one or several layers for taskspecific adaptation, and metalearn the other layers. The results confirm that partitioning on context parameters is key to success: the other variants perform worse, often significantly so. A recent method proposed by Lee and Choi [2018], also partitions the network to adapt only part of it on a specific task – the partitioning mask, however, is learned. They test their method on the regression task as well, but we outperform the numbers they report significantly (not shown, since we believe this might be due to differences in implementation). In the next section we will see that on fewshot classification, this approach achieves comparable performance to our method.
MAML is known to keep learning after several gradient update steps. We test this on our method as well, with the results shown in Figure 3 for up to gradient steps. CAML outperforms MAML even after taking several gradient update steps, and is more stable, as indicated by the size of the confidence intervals and the monotonic learning curve.
As described in Section 3.5, CAML has the freedom to scale the gradients at the context parameters since they are inputs to the model and trained separately. Figure 5 plots the inner learning rate against the norm of the gradient of the context parameters at test time. We can see that the weights are adjusted so that lower learning rates bring about larger context parameter gradients and vice versa. This results in the method being extremely robust to learning rates as confirmed by Figure 5. We plot the performance while varying the learning rate from to . CAML is robust to changes in learning rate while MAML performs well only in a small range. Work by Li et al. [2017] shows that MAML can be improved by learning a parameter-specific learning rate, which, however, introduces a lot of additional parameters.
CAML’s performance on the regression task correlates with how many variables are needed to encode the tasks. In these experiments, two parameters vary between tasks, which is exactly the context parameter dimensionality at which CAML starts to perform well (the optimal encoding is three dimensional, as phase is periodic). This suggests CAML may indeed learn task descriptions in the context parameters. Figure 3 illustrates this by plotting the value of the learned inputs against the amplitude/phase of the task in the case of two context parameters. The model learns a smooth embedding in which interpolation between tasks is possible.
5.2 Classification
learn  5way accuracy  
Method  init.  univ.  1shot  5shot 
Matching Nets [Vinyals et al., 2016]  
Meta LSTM [Ravi and Larochelle, 2017]  
Prototypical Networks [Snell et al., 2017]  
MetaSGD [Li et al., 2017]  ✓  ✓  
REPTILE [Nichol and Schulman, 2018]  ✓  ✓  
PLATIPUS [Finn et al., 2018]  ✓  ✓    
MTNET [Lee and Choi, 2018]  ✓  ✓    
Qiao et al. [2017]  
LEO Rusu et al. [2018]  ✓  
MAML (32) [Finn et al., 2017a]  ✓  ✓  
MAML (64)  ✓  ✓  
CAML (32)  ✓  ✓  
CAML (64)  ✓  ✓  
CAML (128)  ✓  ✓  
CAML (256)  ✓  ✓  
CAML (512)  ✓  ✓  
CAML (256, first order)  ✓  ✓  
CAML (512, first order)  ✓  ✓ 
To evaluate CAML on a more challenging problem, we test it on the competitive few-shot image classification benchmark MiniImagenet [Ravi and Larochelle, 2017]. In $N$-way $K$-shot classification, a task is a random selection of $N$ classes, for each of which the model gets to see $K$ examples. From these it must learn to classify unseen images from the $N$ classes. The MiniImagenet dataset consists of 64 training classes, 12 validation classes, and 24 test classes. During training, we generate a task by selecting $N$ classes at random from the 64 training classes and training the model on $K$ examples of each, i.e., a batch of $N \times K$ images. The meta-update is done on a set of unseen images of the same classes.
On this benchmark, MAML uses a network with four convolutional layers with 32 filters each and one fully connected layer at the output [Finn et al., 2017a]. We use the same network architecture, but with between 32 and 512 filters per layer. We use context parameters and add a FiLM layer (see Section 3.3) that conditions on these after the third convolutional layer. The parameters of the FiLM layer are meta-learned with the rest of the network, i.e., they are part of $\theta$.
All our models were trained with two gradient steps in the inner loop and evaluated with two gradient steps (note: MAML was trained with five innerloop gradient steps and evaluated with ten gradient steps). The inner learning rate was set to . Following Finn et al. [2017a], we ran each experiment for metaiterations and selected the model with the highest validation accuracy for evaluation on the test set.
Table 3 shows our results on MiniImagenet held-out test data for 5-way 1-shot and 5-shot classification. We compare to a number of existing meta-learning approaches that use convolutional neural networks, including MAML. Our largest model (512 filters) clearly outperforms MAML, and outperforms the other methods on the 1-shot classification task. On 5-shot classification, the best results are obtained by prototypical networks [Snell et al., 2017], a method that is specific to few-shot classification and works by computing distances to prototype representations of each class. Our smallest model (32 filters) underperforms MAML (within the confidence intervals). As we can see, CAML benefits from increasing model expressiveness: since we only adapt the context parameters in the inner loop per task, we can substantially increase the network size, without overfitting during the inner loop update. We tested scaling up MAML to a larger network size as well (see Table 3), but found that this hurt accuracy. We also include two state-of-the-art results from methods based on much deeper, residual networks [Qiao et al., 2017, Rusu et al., 2018, greyed out], which are not directly comparable due to much more expressive networks. Our method can be readily applied to deep residual networks as well, and we leave this exploration for future work. Table 3 also shows the first order approximation of our largest models, where the gradient with respect to $\theta$ is not backpropagated through the inner loop update of the context parameters $\phi$. As expected, this results in a lower accuracy (a drop of ), but we are still able to outperform MAML with a first-order version of our largest network.
Thus, CAML can achieve much higher accuracies than MAML by increasing the network size, without overfitting. Our results are obtained by only adjusting parameters at test time, instead of .
5.3 Reinforcement Learning
To demonstrate the versatility of CAML, we also perform preliminary reinforcement learning experiments on a 2D Navigation task, also introduced by Finn et al. [2017a]. In this domain, the agent moves in a 2D world using continuous actions. At each timestep it is given a negative reward proportional to its distance from a predefined goal position. Each task is defined by a new unknown goal position.
We follow the same procedure as Finn et al. [2017a]. Goals are sampled from an interval of . At each step we sample 20 tasks for both the inner and outer loops and testing is performed on 40 new unseen tasks. We perform learning for 500 iterations and the best performing policy during training is then presented with new test tasks and allowed two gradient updates. For each update, the total reward over 20 rollouts per task is measured. We use a twolayer network with 100 units per layer and ReLU nonlinearities to represent the policy and a linear value function approximator. For CAML we use five context parameters at the input layer.
In terms of performance (Figure 5(a)) we can see that the two methods are highly competitive. MAML performs better after the first gradient update, after which it is surpassed by CAML. Figure 5(b), which plots performance for several learning rates, shows that CAML is again less sensitive to the inner loop learning rate. Only when using a learning rate of 0.1 is MAML competitive in performance. Furthermore, CAML adapts parameters whereas MAML adapts around parameters.
As with regression, the optimal task embedding is low dimensional enough to plot. We therefore apply CAML with two context parameters and plot how these correlate with the actual position of the goal for 200 test tasks (this results in slightly worse performance than when using context parameters; comparison not shown). Figure 5(c) shows that the context parameters obtained after two policy gradient updates represent a disentangled embedding of the actual task. Specifically, context parameter 1 appears to encode the position of the goal, while context parameter 2 encodes the position. Hence, CAML can learn compact potentially interpretable task embeddings via backpropagation through the inner loss.
6 Conclusion and Future Work
In this paper we introduced CAML, a meta-learning approach for fast adaptation that introduces context parameters in addition to the model’s parameters. The context parameters are used to modulate the whole network during the inner loop of meta-learning, while the rest of the network parameters are adapted in the outer loop and shared across tasks. On regression, our method outperforms MAML and is superior to naive approaches to partitioning network parameters. We also showed that CAML is highly competitive with state-of-the-art methods on few-shot classification using CNNs. In addition to this, we experimented extensively with some unique properties that specifically arise from the way that our method is formulated, such as robustness to learning rate and the emergence of task embeddings at the context parameters. An interesting extension would be to inspect the context parameter representations learned by CAML on the MiniImagenet benchmark using advanced dimensionality reduction techniques.
In this paper we performed some preliminary RL experiments. We are interested in extending CAML to more challenging problems and exploring its role in allowing for smart exploration in order to identify the task at hand. It would also be interesting to consider probabilistic extensions along the lines of PLATIPUS [Finn et al., 2018] where the context parameters include uncertainty about the task.
Finally, the intriguing empirical properties of CAML detailed in this work will be the base of more theoretical investigations in the future.
Acknowledgements
We thank Wendelin Boehmer and Mark Finean for useful discussions and feedback, and Joost van Amersfoort for support with using PyTorch. We would also like to thank Jackie Loong for their open-sourced MAMLPyTorch implementation, and Chelsea Finn for responding quickly to GitHub issues. The NVIDIA DGX-1 used for this research was donated by the NVIDIA corporation. L. Zintgraf is supported by the Microsoft Research PhD Scholarship Program. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 637713).
References
 Andrychowicz et al. [2016] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
 Bengio et al. [1992] Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, pages 6–8. Univ. of Texas, 1992.
 Finn et al. [2017a] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Modelagnostic metalearning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017a.
 Finn et al. [2017b] Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. Oneshot visual imitation learning via metalearning. arXiv preprint arXiv:1709.04905, 2017b.
 Finn et al. [2018] Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic modelagnostic metalearning. arXiv preprint arXiv:1806.02817, 2018.
 Garnelo et al. [2018] Marta Garnelo, Dan Rosenbaum, Chris J Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo J Rezende, and SM Eslami. Conditional neural processes. arXiv preprint arXiv:1807.01613, 2018.
 Grant et al. [2018] Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. arXiv preprint arXiv:1801.08930, 2018.

 He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
 He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 He et al. [2017] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
 Kokkinos [2017] Iasonas Kokkinos. UberNet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In CVPR, volume 2, page 8, 2017.
 Lee and Choi [2018] Yoonho Lee and Seungjin Choi. Gradient-based meta-learning with learned layerwise metric and subspace. In International Conference on Machine Learning, pages 2933–2942, 2018.
 Li et al. [2017] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.
 Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Nichol and Schulman [2018] Alex Nichol and John Schulman. Reptile: a scalable metalearning algorithm. arXiv preprint arXiv:1803.02999, 2018.
 Oreshkin et al. [2018] Boris N Oreshkin, Alexandre Lacoste, and Pau Rodriguez. TADAM: Task dependent adaptive metric for improved few-shot learning. arXiv preprint arXiv:1805.10123, 2018.
 Perez et al. [2017] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. arXiv preprint arXiv:1709.07871, 2017.
 Qiao et al. [2017] Siyuan Qiao, Chenxi Liu, Wei Shen, and Alan L Yuille. Few-shot image recognition by predicting parameters from activations. CoRR, abs/1706.03466, 1, 2017.
 Ravi and Larochelle [2017] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2017.
 Rei [2015] Marek Rei. Online representation learning in recurrent neural language models. arXiv preprint arXiv:1508.03854, 2015.
 Rusu et al. [2018] Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960, 2018.
 Sæmundsson et al. [2018] Steindór Sæmundsson, Katja Hofmann, and Marc Peter Deisenroth. Meta reinforcement learning with latent variable gaussian processes. arXiv preprint arXiv:1803.07551, 2018.
 Schmidhuber [1987] Jürgen Schmidhuber. Evolutionary Principles in Self-referential Learning: On Learning how to Learn: the Meta-meta-meta-…-hook. 1987.
 Schulman et al. [2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
 Silver et al. [2008] Daniel L Silver, Ryan Poirier, and Duane Currie. Inductive transfer with context-sensitive neural networks. Machine Learning, 73(3):313, 2008.
 Snell et al. [2017] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
 Vinyals et al. [2016] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
 Wojke et al. [2017] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In Image Processing (ICIP), 2017 IEEE International Conference on, pages 3645–3649. IEEE, 2017.
 Zhou et al. [2018] Fengwei Zhou, Bin Wu, and Zhenguo Li. Deep meta-learning: Learning to learn in the concept space. arXiv preprint arXiv:1802.03596, 2018.
Appendix A PseudoCode
Appendix B Experimental Details
B.1 Classification
For MiniImagenet, our model takes as input images of size and has outputs, one for each class. The model has four modules that each consist of: a convolution with a kernel, padding and filters, a batch normalisation layer, a max-pooling operation with kernel size , if applicable a FiLM transformation (only at the third convolution, details below), and a ReLU activation function. The output size of these four blocks is , which we flatten to a vector and feed into one fully connected layer. The FiLM layer itself is a fully connected layer with inputs and a dimensional output and the identity function at the output. The output is divided into and , each of dimension , which are used to transform the filters that the convolutional operation outputs. The context vector is of size (other sizes tested: , ) and is added after the third convolution (other versions tested: at the first, second or fourth convolution).
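The FiLM transformation described above can be sketched in plain Python. This is a minimal illustration, not the implementation used in the experiments: the function names, the convention that the first half of the linear output is the scale (gamma) and the second half the shift (beta), and the tiny sizes in the usage note are all assumptions made for the example. The feature-wise form gamma * h + beta follows Perez et al. [2017].

```python
from typing import List, Tuple

Vector = List[float]
FeatureMaps = List[List[List[float]]]  # indexed as [channel][row][column]


def film_parameters(context: Vector, weights: List[Vector], bias: Vector,
                    num_channels: int) -> Tuple[Vector, Vector]:
    """Fully connected layer with identity output, split into (gamma, beta).

    Assumption: the first num_channels outputs are gamma, the rest are beta.
    """
    out = [sum(w * c for w, c in zip(row, context)) + b
           for row, b in zip(weights, bias)]
    if len(out) != 2 * num_channels:
        raise ValueError("FiLM layer must output 2 * num_channels values")
    return out[:num_channels], out[num_channels:]


def film(features: FeatureMaps, gamma: Vector, beta: Vector) -> FeatureMaps:
    """Scale and shift every entry of channel c by gamma[c] and beta[c]."""
    return [[[g * v + b for v in row] for row in channel]
            for channel, g, b in zip(features, gamma, beta)]
```

For instance, with a hypothetical context vector of size 3 and two feature maps, film_parameters returns two values for gamma and two for beta, and film applies one (gamma, beta) pair per feature map, independent of spatial position.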
The network is initialised using He et al. [2015] initialisation for the weights of the convolutional and fully connected layers (including the FiLM layer weights). The bias parameters are initialised to zero, except at the FiLM layer.
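As a rough illustration of this initialisation scheme, the sketch below draws weights from a zero-mean Gaussian with standard deviation sqrt(2 / fan_in), as proposed by He et al. [2015], and sets biases to zero. The choice of the normal (rather than uniform) variant, the helper names, and the layer sizes are assumptions for the example; the FiLM layer bias, which the text treats separately, is not covered here.

```python
import math
import random
from typing import List


def conv_fan_in(in_channels: int, kernel_height: int, kernel_width: int) -> int:
    """Number of inputs feeding one output unit of a convolution."""
    return in_channels * kernel_height * kernel_width


def he_normal(fan_in: int, fan_out: int, rng: random.Random) -> List[List[float]]:
    """Weight matrix of shape [fan_out][fan_in] with std sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]


def zero_bias(size: int) -> List[float]:
    """Bias vector initialised to zero."""
    return [0.0] * size
```

A fully connected layer with a hypothetical 200 inputs would thus use a standard deviation of 0.1, while a convolution uses conv_fan_in to account for the kernel size.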
We use the Adam optimiser for the meta-update step with an initial learning rate of . This learning rate is annealed every steps by multiplying it by . The inner learning rate is set to (other hyperparameters tested: , ).
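The annealing rule above is a step decay schedule. A minimal sketch follows; the function name and all numeric values in the usage note are placeholders chosen for illustration, since the actual hyperparameter values are not reproduced here.

```python
def annealed_learning_rate(initial_lr: float, step: int,
                           anneal_every: int, decay_factor: float) -> float:
    """Learning rate after `step` meta-updates under step decay.

    The rate is multiplied by `decay_factor` once every `anneal_every` steps.
    """
    if anneal_every <= 0:
        raise ValueError("anneal_every must be positive")
    num_decays = step // anneal_every
    return initial_lr * decay_factor ** num_decays
```

With the placeholder values initial_lr=1.0, anneal_every=10 and decay_factor=0.5, the rate stays at 1.0 for steps 0 to 9, drops to 0.5 at step 10, and to 0.25 at step 20.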
For MiniImagenet, we use a meta batch size of and tasks for shot and shot classification respectively. For the batch norm statistics, we always use the current batch, including during testing. That is, for way shot classification the batch size at test time is , and we use this batch for normalisation.
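To make the normalisation choice concrete, the sketch below normalises a batch using only that batch's own mean and variance, with no running averages, which is what using the current batch at test time amounts to. It is a simplified illustration under stated assumptions: it operates on flat feature vectors rather than convolutional feature maps, omits the learned scale and shift, and the function name and epsilon value are hypothetical.

```python
import math
from typing import List


def batch_norm_current_batch(batch: List[List[float]],
                             eps: float = 1e-5) -> List[List[float]]:
    """Normalise each feature with statistics of the given batch only."""
    n = len(batch)
    num_features = len(batch[0])
    means = [sum(x[j] for x in batch) / n for j in range(num_features)]
    # Biased (population) variance, as is usual for batch normalisation.
    variances = [sum((x[j] - means[j]) ** 2 for x in batch) / n
                 for j in range(num_features)]
    return [[(x[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(num_features)]
            for x in batch]
```

Because no statistics are carried over from training, the normalised output for a test example depends on the other examples in the same test batch.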