
CAML: Fast Context Adaptation via Meta-Learning

by   Luisa M Zintgraf, et al.

We propose CAML, a meta-learning method for fast adaptation that partitions the model parameters into two parts: context parameters that serve as additional input to the model and are adapted on individual tasks, and shared parameters that are meta-trained and shared across tasks. At test time, the context parameters are updated with one or several gradient steps on a task-specific loss that is backpropagated through the shared part of the network. Compared to approaches that adjust all parameters on a new task (e.g., MAML), our method can be scaled up to larger networks without overfitting on a single task, is easier to implement, and saves memory writes during training and network communication at test time for distributed machine learning systems. We show empirically that this approach outperforms MAML, is less sensitive to the task-specific learning rate, can capture meaningful task embeddings with the context parameters, and outperforms alternative partitionings of the parameter vectors.



1 Introduction

A key challenge in meta-learning is fast adaptation: learning on previously unseen tasks fast and with little data. In principle, this can be achieved by leveraging knowledge obtained in other, related tasks. However, the best way to do so remains an open question.

A popular recent method for fast adaptation is model-agnostic meta-learning (MAML) [Finn et al., 2017a], which learns a model initialisation such that at test time the model can be adapted to solve the new task in only a few gradient steps. MAML has an interleaved training procedure, composed of inner-loop and outer-loop updates that operate on a batch of tasks at each iteration. In the inner loop, MAML learns task-specific parameters by performing one gradient step on a task-specific loss. Then, in the outer loop, the model parameters from before the inner-loop update are updated to reduce the expected loss across tasks after the inner-loop update on the individual tasks. Hence, MAML learns a model initialisation that, at test time, can generalise to a new task after only a few gradient updates.

However, while MAML adapts the entire model to the new task, many transfer learning algorithms adapt only a fraction of the model [Kokkinos, 2017], keeping the rest fixed across tasks. For example, representations learned for image classification [He et al., 2016] can be reused for semantic segmentation [He et al., 2017] or tracking [Wojke et al., 2017]. This suggests that some model parameters can be considered task independent and others task specific. Adapting only some model parameters can make learning faster and easier, as well as mitigate overfitting and catastrophic forgetting.

To this end, we propose context adaptation for meta-learning (CAML), a new method for fast adaptation via meta-learning. Like MAML, we learn a model initialisation that can quickly be adapted to new tasks. However, unlike MAML, we adapt only a subset of the model parameters to the new task. While restricting adaptation in this way is straightforward, it raises a key question: how should we decide which parameters to adapt and which to keep fixed? The main insight behind CAML is that, for many fast adaptation problems, the inner loop reduces to a task identification problem, rather than learning how to solve the whole task, which is typically infeasible with only a few gradient updates. Thus, it suffices if the part of the model that varies across tasks is an additional input to the model, and is independent of its other inputs.

These additional inputs, which we call context parameters (see Figure 1), can be interpreted as a task embedding that modulates the behaviour of the model. This embedding is learned via backpropagation during the inner loop of a meta-learning procedure similar to MAML, while the rest of the model is updated only in the outer loop. This allows CAML to explicitly optimise the task-independent parameters for good performance across tasks, while ensuring that the task-specific context parameters can quickly adapt to new tasks.

This separation of task solver and task embedding has several advantages. First, the size of both components can be chosen appropriately for the task. In particular, the network can be made expressive enough without overfitting to a single task in the inner loop, which we show empirically MAML is prone to. Model design and architecture choice also benefit from this separation, since for many practical problems we have prior knowledge of which aspects vary across tasks and hence how much capacity the context parameters should have. Like MAML, our method is model-agnostic, i.e., it can be applied to any model that is trained via gradient descent. However, CAML is easier to implement: assigning the correct computational graphs for higher-order gradients is done only at the level of the context parameters, avoiding manual access to and operations on the network weights and biases. Furthermore, parameter copies are not necessary, which saves memory writes, a common bottleneck when running on GPUs. CAML can also help distributed machine learning systems, where the same model is deployed to different machines and we wish to learn different contexts concurrently. Network communication is often the bottleneck, which is mitigated by sharing only the (gradients of the) context parameters.

We show empirically that CAML outperforms MAML on a regression and classification task and performs similarly on a reinforcement learning problem, while adapting significantly fewer parameters at test time. We observe that CAML is less sensitive to the inner-loop learning rate, and can be scaled up to larger networks without overfitting. We also demonstrate that the context parameters represent meaningful embeddings of tasks, confirming that the inner loop acts as a task identification step.

2 Background: Meta-Learning for Fast Adaptation

We consider settings where the goal is to learn models that can quickly adapt to a new task with only little data. To this end, learning on the new task is preceded by meta-learning on a set of related tasks. Here we describe the meta-learning problem for supervised and reinforcement learning, as well as MAML.

2.1 Problem Setting

In few-shot learning problems, we are given distributions over training tasks $p_{\text{train}}(\mathcal{T})$ and test tasks $p_{\text{test}}(\mathcal{T})$. Training tasks can be used to learn how to adapt fast to any of the tasks with little per-task data, and evaluation is then done on (previously unseen) test tasks. Unless stated otherwise, we assume that $p_{\text{train}} = p_{\text{test}}$ and refer to both as $p$. Tasks in $p$ typically share some structure, so that transferring knowledge between tasks speeds up learning. During each meta-training iteration, a batch of $N$ tasks $\mathbf{T} = \{\mathcal{T}_i\}_{i=1}^{N}$ is sampled from $p$.

Supervised Learning.

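In a supervised learning setting, we learn a model $f: x \mapsto \hat{y}$ that maps data points $x \in \mathcal{X}$ that have a true label $y \in \mathcal{Y}$ to predictions $\hat{y} \in \mathcal{Y}$. A task is defined as a tuple $\mathcal{T}_i = (\mathcal{X}, \mathcal{Y}, \mathcal{L}, q)$, where $\mathcal{X}$ is the input space, $\mathcal{Y}$ is the output space, $\mathcal{L}(\hat{y}, y)$ is a task-specific loss function, and $q(x, y)$ is a distribution over labelled data points. We assume that all data points are drawn i.i.d. from $q$. Different tasks can be created by changing any element of $\mathcal{T}_i$.

Training in the supervised meta-learning setting proceeds over meta-training iterations, where for every task $\mathcal{T}_i \in \mathbf{T}$ from the current batch, we sample two datasets $\mathcal{D}_i^{\text{train}}$ (for training) and $\mathcal{D}_i^{\text{test}}$ (for testing) from $q_{\mathcal{T}_i}$:

$$\mathcal{D}_i^{\text{train}} = \{(x, y)^{i,m}\}_{m=1}^{M_i^{\text{train}}}, \qquad \mathcal{D}_i^{\text{test}} = \{(x, y)^{i,m}\}_{m=1}^{M_i^{\text{test}}}, \tag{1}$$

where $(x, y) \sim q_{\mathcal{T}_i}$, and $M_i^{\text{train}}$ and $M_i^{\text{test}}$ are the number of training and test datapoints, respectively. The training data is used to update $f$, and the test data is then used to evaluate how good this update was, and adjust $f$ or the update rule accordingly.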

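Reinforcement Learning. In a reinforcement learning (RL) setting, we aim to learn a policy $\pi$ that maps states $s \in \mathcal{S}$ to actions $a \in \mathcal{A}$. Each task corresponds to a Markov decision process (MDP): a tuple $\mathcal{T}_i = (\mathcal{S}, \mathcal{A}, r, q, q_0)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $r(s_t, a_t, s_{t+1})$ is a reward function, $q(s_{t+1} \mid s_t, a_t)$ is a transition function, and $q_0(s_0)$ is an initial state distribution. The goal is to maximise the expected cumulative reward $\mathcal{J}$ under $\pi$,

$$\mathcal{J}(\pi) = \mathbb{E}_{q_0, q, \pi}\left[\sum_{t=0}^{H-1} \gamma^{t}\, r(s_t, a_t, s_{t+1})\right], \tag{2}$$

where $H$ is the horizon and $\gamma$ is the discount factor. Again, different tasks can be created by changing any element of $\mathcal{T}_i$.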
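As a small illustration of the objective above, the discounted cumulative reward of a sampled trajectory can be computed as follows. This is a minimal sketch: the function names `discounted_return` and `mean_return` are hypothetical, and rewards are assumed to be given as a plain list indexed from time step 0.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one trajectory, with t starting at 0."""
    total = 0.0
    # Accumulate backwards so that each step needs one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def mean_return(trajectories, gamma):
    """Monte Carlo estimate of the expected return from sampled trajectories."""
    returns = [discounted_return(rewards, gamma) for rewards in trajectories]
    return sum(returns) / len(returns)
```

Averaging over several rollouts of the same policy gives a sample-based estimate of the expectation; a single rollout gives a noisier estimate of the same quantity.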

During each meta-training iteration, for every task $\mathcal{T}_i \in \mathbf{T}$ from the current batch, we first collect a trajectory

$$\tau_i^{\text{train}} = \{s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_{M_i^{\text{train}}-1}, a_{M_i^{\text{train}}-1}, r_{M_i^{\text{train}}-1}, s_{M_i^{\text{train}}}\}, \tag{3}$$

where the initial state $s_0$ is sampled from $q_0$, the actions are chosen by the current policy $\pi$, the state transitions according to $q$, and $M_i^{\text{train}}$ is the number of environment interactions available. We unify several episodes in this formulation: if the horizon $H$ is reached within the trajectory, the environment is reset using $q_0$. Once the trajectory is collected, this data is used to update the policy. Another trajectory $\tau_i^{\text{test}}$ is then collected by rolling out the updated policy for $M_i^{\text{test}}$ time steps. This test trajectory is used to evaluate the quality of the update on that task, and to adjust $\pi$ or the update rule accordingly.

Evaluation for both supervised and reinforcement learning problems is done on a new (unseen) set of tasks drawn from $p$ (or $p_{\text{test}}$ if the test distribution of tasks is different). For each such task, the model is updated using $\mathcal{L}$ or $\mathcal{J}$ and only a few datapoints ($\mathcal{D}^{\text{train}}$ or $\tau^{\text{train}}$). Performance of the updated model is reported on $\mathcal{D}^{\text{test}}$ or $\tau^{\text{test}}$.

2.2 Model-Agnostic Meta-Learning

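One method for few-shot learning is model-agnostic meta-learning [Finn et al., 2017a, MAML]. Here, we describe the application of MAML to a supervised learning setting. MAML learns an initialisation for the parameters $\theta$ of a model $f_\theta$ such that, given a new task, a good model for that task can be learned with only a small number of gradient steps and data points. In the inner loop, MAML computes new task-specific parameters $\theta_i$ (starting from $\theta$) via one gradient update (we outline the method for one gradient update here, but several gradient steps can be performed at this point as well),

$$\theta_i = \theta - \alpha \nabla_\theta \frac{1}{M_i^{\text{train}}} \sum_{(x,y) \in \mathcal{D}_i^{\text{train}}} \mathcal{L}_{\mathcal{T}_i}(f_\theta(x), y), \tag{4}$$

where $\alpha$ is the inner-loop learning rate. For the meta-update in the outer loop, the original model parameters $\theta$ are then updated with respect to the performance after the inner-loop update, i.e.,

$$\theta \leftarrow \theta - \beta \nabla_\theta \frac{1}{N} \sum_{\mathcal{T}_i \in \mathbf{T}} \frac{1}{M_i^{\text{test}}} \sum_{(x,y) \in \mathcal{D}_i^{\text{test}}} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i}(x), y), \tag{5}$$

where $\beta$ is the outer-loop learning rate. The result of training is a model initialisation $\theta$ that can be adapted with just a few gradient steps to any new task that we draw from $p$. Since the gradient is taken with respect to the parameters $\theta$ before the inner-loop update (4), the outer-loop update (5) involves higher-order derivatives in $\theta$.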
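To make the source of the higher-order derivatives explicit, the chain rule can be applied to the post-update loss of a single task. The following is a sketch in assumed shorthand: $\mathcal{L}^{\text{train}}_{\mathcal{T}_i}$ and $\mathcal{L}^{\text{test}}_{\mathcal{T}_i}$ denote the average task loss on the training and test data, $\theta$ the shared initialisation, and $\alpha$ the inner-loop step size.

```latex
\begin{align}
\theta_i &= \theta - \alpha \nabla_\theta \mathcal{L}^{\text{train}}_{\mathcal{T}_i}(\theta), \\
\nabla_\theta \mathcal{L}^{\text{test}}_{\mathcal{T}_i}(\theta_i)
  &= \left( I - \alpha \nabla^2_\theta \mathcal{L}^{\text{train}}_{\mathcal{T}_i}(\theta) \right)
     \nabla_{\theta_i} \mathcal{L}^{\text{test}}_{\mathcal{T}_i}(\theta_i).
\end{align}
```

The Hessian term $\nabla^2_\theta \mathcal{L}^{\text{train}}_{\mathcal{T}_i}(\theta)$ ranges over all model parameters, which is why the meta-update requires second-order information about the whole network.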

3 Fast Context Adaptation via Meta-Learning

We propose to partition the model parameters into two parts: context parameters $\phi$ that are adapted in the inner loop on an individual task, and parameters $\theta$ that are shared across tasks and meta-learned in the outer loop. In the following we describe the training procedure for supervised and reinforcement learning problems. Pseudo-code is provided in Appendix A.

Figure 1: Context adaptation. A network layer is augmented with additional context parameters $\phi$ (red), which are initialised to 0 before each adaptation step. The context parameters are updated by gradient descent during each inner loop and at test time. The network parameters $\theta$ (green) are only updated in the outer loop and shared across tasks. Hence, they stay fixed at test time. By initialising $\phi$ to 0, the network parameters associated with the context parameters (blue) do not affect the output of the layer before adaptation. After the first adaptation step they are used to modulate the rest of the network in order to solve the new task.

3.1 Supervised Learning

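At every meta-training iteration and for the current batch $\mathbf{T}$ of tasks, we use the training data $\mathcal{D}_i^{\text{train}}$ of each task $\mathcal{T}_i \in \mathbf{T}$ as follows. Starting from $\phi_0$, which can either be fixed or meta-learned as well (we typically choose $\phi_0 = 0$; see Section 3.4), we learn task-specific parameters $\phi_i$ via one gradient update:

$$\phi_i = \phi_0 - \alpha \nabla_\phi \frac{1}{M_i^{\text{train}}} \sum_{(x,y) \in \mathcal{D}_i^{\text{train}}} \mathcal{L}_{\mathcal{T}_i}(f_{\phi_0,\theta}(x), y). \tag{6}$$

While we only take the gradient with respect to $\phi$, the updated parameter $\phi_i$ is also a function of $\theta$, since during backpropagation the gradients flow through the model. Once we have collected the updated parameters $\phi_i$ for all sampled tasks, we proceed to the meta-learning step, in which $\theta$ is updated:

$$\theta \leftarrow \theta - \beta \nabla_\theta \frac{1}{N} \sum_{\mathcal{T}_i \in \mathbf{T}} \frac{1}{M_i^{\text{test}}} \sum_{(x,y) \in \mathcal{D}_i^{\text{test}}} \mathcal{L}_{\mathcal{T}_i}(f_{\phi_i,\theta}(x), y). \tag{7}$$

This update includes higher-order gradients in $\theta$ due to the dependency on (6).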
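The inner and outer updates described above can be sketched on a toy problem. The example below is an illustration under strong assumptions, not the implementation used in our experiments: the model is a hypothetical linear function `w * x + b + v * phi` with a single context parameter, tasks are lines with a shared slope and a task-specific offset, and the meta-gradient is approximated by central finite differences in place of automatic differentiation. All function names and hyperparameter values are made up for this sketch.

```python
import random


def predict(theta, phi, x):
    w, b, v = theta              # shared parameters; v weights the context input
    return w * x + b + v * phi


def mse(theta, phi, data):
    return sum((predict(theta, phi, x) - y) ** 2 for x, y in data) / len(data)


def inner_update(theta, train, alpha):
    """One gradient step on the context parameter, starting from phi_0 = 0."""
    v = theta[2]
    # Analytic d(MSE)/d(phi) at phi = 0 for this linear toy model.
    grad_phi = sum(2.0 * (predict(theta, 0.0, x) - y) * v for x, y in train) / len(train)
    return 0.0 - alpha * grad_phi


def post_adaptation_loss(theta, task, alpha):
    train, test = task
    phi_i = inner_update(theta, train, alpha)   # phi_i depends on theta
    return mse(theta, phi_i, test)


def meta_gradient(theta, tasks, alpha, eps=1e-5):
    """Finite-difference gradient of the mean post-adaptation test loss."""
    def objective(params):
        return sum(post_adaptation_loss(params, t, alpha) for t in tasks) / len(tasks)

    grad = []
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += eps
        down[k] -= eps
        grad.append((objective(up) - objective(down)) / (2.0 * eps))
    return grad


def sample_task(rng, n_points=10):
    """A task is a line y = 2x + offset; returns (train, test) datasets."""
    offset = rng.uniform(-1.0, 1.0)

    def draw():
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n_points)]
        return [(x, 2.0 * x + offset) for x in xs]

    return draw(), draw()


def meta_train(steps=300, alpha=0.5, beta=0.05, batch=8, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0, 0.5]      # only theta is updated in the outer loop
    for _ in range(steps):
        tasks = [sample_task(rng) for _ in range(batch)]
        grad = meta_gradient(theta, tasks, alpha)
        theta = [p - beta * g for p, g in zip(theta, grad)]
    return theta
```

Because `post_adaptation_loss` recomputes `phi_i` from the perturbed shared parameters, the finite-difference gradient includes the dependence of the adapted context on the shared parameters, which is the role played by the higher-order terms when automatic differentiation is used. Only `phi` changes in the inner loop and no copy of the shared parameters is made.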

3.2 Reinforcement Learning

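During each iteration, for a current batch of MDPs $\mathbf{T} = \{\mathcal{T}_i\}_{i=1}^{N}$, we proceed as follows. Given $\phi_0$ (see Section 3.4), we collect a rollout $\tau_i^{\text{train}}$ by executing the policy $\pi_{\phi_0,\theta}$. We then compute task-specific parameters $\phi_i$ via one gradient update:

$$\phi_i = \phi_0 + \alpha \nabla_\phi \mathcal{J}_{\mathcal{T}_i}(\tau_i^{\text{train}}, \pi_{\phi_0,\theta}), \tag{8}$$

where $\mathcal{J}_{\mathcal{T}_i}(\tau, \pi)$ is an objective function given by any gradient-based reinforcement learning method that uses trajectories $\tau$ produced by a parameterised policy $\pi$ to update that policy’s parameters, such as TRPO [Schulman et al., 2015] or DQN [Mnih et al., 2015]. After updating the policy, we collect another trajectory $\tau_i^{\text{test}}$ to evaluate the updated policy, where actions are chosen according to the updated policy $\pi_{\phi_i,\theta}$.

After doing this for all tasks in $\mathbf{T}$, we continue with the meta-update step. Here, we update the parameters $\theta$ to maximise the average performance across tasks (after individually updating $\phi$ for them),

$$\theta \leftarrow \theta + \beta \nabla_\theta \frac{1}{N} \sum_{\mathcal{T}_i \in \mathbf{T}} \mathcal{J}_{\mathcal{T}_i}(\tau_i^{\text{test}}, \pi_{\phi_i,\theta}). \tag{9}$$

This update includes higher-order gradients in $\theta$ due to the dependency on (8).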

3.3 Conditioning on Context Parameters

Since $\phi$ are independent of the network input, we need to decide where and how to condition the network on them. For an output node $h_i^{(l)}$ at a fully connected layer $l$, this can for example be done by simply concatenating $\phi$ to the inputs to that layer:

$$h_i^{(l)} = g\left( \sum_{j=1}^{J} \theta_{j,i}^{(l,h)}\, h_j^{(l-1)} + \sum_{k=1}^{K} \theta_{k,i}^{(l,\phi)}\, \phi_k + b \right), \tag{10}$$

where $g$ is a non-linear activation function, $b$ is a bias parameter, $\theta_{j,i}^{(l,h)}$ are the weights associated with layer input $h_j^{(l-1)}$, and $\theta_{k,i}^{(l,\phi)}$ are the weights associated with the context parameter $\phi_k$. This is illustrated in Figure 1. In our experiments, for fully connected networks, we add the context parameters at the first layer, i.e., concatenate them to the input.
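The concatenation described above amounts to one extra weighted sum per output node. The following is a minimal sketch using plain Python lists; the function name `context_layer`, the argument layout, and the choice of ReLU as the non-linearity are assumptions made for illustration.

```python
def relu(z):
    return z if z > 0.0 else 0.0


def context_layer(h_prev, phi, W_h, W_phi, b):
    """Fully connected layer whose input is the concatenation [h_prev, phi].

    W_h[i][j] weights layer input j for output node i, W_phi[i][k] weights
    context parameter k for output node i, and b[i] is the bias of node i.
    """
    out = []
    for i in range(len(b)):
        z = b[i]
        z += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        z += sum(W_phi[i][k] * phi[k] for k in range(len(phi)))
        out.append(relu(z))
    return out
```

With `phi` set to all zeros, the second sum vanishes and the layer behaves exactly like an ordinary fully connected layer, which matches the behaviour described in Figure 1 before adaptation.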

Other conditioning methods can be used with CAML as well. For example, for convolutional networks, we use the feature-wise linear modulation (FiLM) method [Perez et al., 2017] for image classification experiments (Section 5.2). FiLM conditions by applying an affine transformation to the feature maps: given context parameters $\phi$ and a convolutional layer that outputs $M$ feature maps $\{h_i\}_{i=1}^{M}$, FiLM applies a linear transformation to each feature map, $\text{FiLM}(h_i) = \gamma_i h_i + \eta_i$, where the parameters $\gamma$ and $\eta$ are a function of the context parameters. We compute them with a fully connected layer with the identity function at the output. In our experiments, we found that performance improves when the context parameters are not added at the first layer (in our case, we add them after the third out of four convolutional operations).
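A FiLM layer of the kind described above can be sketched as follows. This is an illustrative, dependency-free version: feature maps are nested lists, the scale and shift are produced by a linear map of the context parameters with identity output, and all names (`film`, `W_scale`, `b_scale`, `W_shift`, `b_shift`) are hypothetical.

```python
def film(feature_maps, phi, W_scale, b_scale, W_shift, b_shift):
    """Apply scale_i * h_i + shift_i to every feature map h_i.

    scale and shift are linear functions of the context parameters phi,
    i.e. a fully connected layer with the identity as output activation.
    """
    out = []
    for i, fmap in enumerate(feature_maps):
        scale = b_scale[i] + sum(W_scale[i][k] * phi[k] for k in range(len(phi)))
        shift = b_shift[i] + sum(W_shift[i][k] * phi[k] for k in range(len(phi)))
        out.append([[scale * value + shift for value in row] for row in fmap])
    return out
```

In this sketch, choosing `b_scale[i] = 1` and `b_shift[i] = 0` makes the layer an identity map when `phi` is zero; whether such an initialisation is appropriate is a design choice and is not prescribed by the method.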

3.4 Context Parameter Initialisation

When learning a new task, the context parameters $\phi$ have to be initialised to some value, $\phi_0$. We argue that, instead of meta-learning this initialisation as well, a fixed $\phi_0$ is sufficient: in (10), if both the weights $\theta^{(l,\phi)}$ associated with the context parameters and $\phi_0$ are meta-learned, the learned initialisation of $\phi$ can be subsumed into the bias parameter $b$, and $\phi_0$ can be set to a fixed value. The same holds for conditioning when using FiLM layers. A key benefit of CAML is therefore that it is easy to implement, since the initialisation of the context parameters does not have to be meta-learned and parameter copies are not required. We set $\phi_0 = 0$ in our implementation, which gives the additional opportunity for visual inspection of the learned context parameters (see Sections 5.1 and 5.3).
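The argument that a learned initialisation can be absorbed into the bias can be written out for a single layer. The notation below is assumed for this sketch: $W_h$ and $W_\phi$ are the weight matrices applied to the layer input $h$ and to the context parameters, $b$ is the bias vector, and $z$ is the pre-activation.

```latex
\begin{align}
z &= W_h h + W_\phi \phi_0 + b \\
  &= W_h h + W_\phi \cdot 0 + b', \qquad b' = b + W_\phi \phi_0 .
\end{align}
```

Both parameterisations yield the same pre-activation $z$ and the same gradient $\nabla_\phi \mathcal{L} = W_\phi^{\top}\, \partial \mathcal{L} / \partial z$, so the inner-loop step changes $z$ by the same amount in either case. A model with a meta-learned $\phi_0$ is therefore no more expressive than one with $\phi_0 = 0$ and a meta-learned bias.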

3.5 Learning Rate

Since the context parameters are inputs to the model, the gradients at this point are not backpropagated further through any other part of the model. Furthermore, because learning $\phi$ and $\theta$ is decoupled, the inner-loop learning rate can effectively be meta-learned by the rest of the model. This makes the method robust to the initial learning rate that is chosen for the inner loop, as we show empirically in Sections 5.1 and 5.3.
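One way to see how the shared weights can compensate for the inner-loop learning rate is to look at a layer that receives the context parameters by concatenation. The sketch below assumes the notation $z = W_h h + W_\phi \phi + b$ for the pre-activation, $\delta = \partial \mathcal{L} / \partial z$ evaluated at $\phi_0 = 0$, and a single inner-loop step with learning rate $\alpha$.

```latex
\begin{align}
\nabla_\phi \mathcal{L} &= W_\phi^{\top} \delta, \\
\phi_1 &= -\alpha\, W_\phi^{\top} \delta, \\
\Delta z &= W_\phi \phi_1 = -\alpha\, W_\phi W_\phi^{\top} \delta .
\end{align}
```

Replacing $\alpha$ by $\alpha / c^2$ and $W_\phi$ by $c\, W_\phi$ for any $c > 0$ leaves $\Delta z$ unchanged, because $\delta$ is evaluated at $\phi_0 = 0$ and does not depend on $W_\phi$. At the same time the gradient norm $\lVert \nabla_\phi \mathcal{L} \rVert$ is multiplied by $c$, so a smaller learning rate can be offset by larger context-parameter gradients.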

4 Related Work

Meta-learning, or learning to learn, has been explored in various ways in the literature. One general approach is to learn the algorithm or update function itself (among others, Schmidhuber [1987], Bengio et al. [1992], Andrychowicz et al. [2016], Ravi and Larochelle [2017]). Another approach is to meta-learn a model initialisation such that the model can perform well on a new task after only a few gradient steps, such as MAML [Finn et al., 2017a]. Other such methods are REPTILE [Nichol and Schulman, 2018], which does not require second-order gradient computation, and Meta-SGD [Li et al., 2017], which attempts to learn the per-parameter inner-loop learning rate. Recent work by Grant et al. [2018] also considers a Bayesian interpretation of MAML. The main difference in our work is that we adapt only a small number of parameters in the inner loop, and that these parameters come in the form of input context parameters.

In Finn et al. [2017b] the authors augment the model with additional biases to improve the performance of MAML in a robotic manipulation setting. In contrast, we update only the context parameters in the inner loop, and they are initialised to 0 before adaptation to a new task. In the context of neural language models, Rei [2015] also considers adapting only context parameters in the inner loop, but does not consider the benefits of initialising and resetting their value to 0. In addition, that work does not consider the application of the method to the variety of domains covered by this paper (something briefly explored in the appendix of Finn et al. [2017a]).

Other meta-learning methods are also motivated by the fact that learning in a high-dimensional parameter space can pose practical difficulties, and fast adaptation in a lower-dimensional space can be better (e.g., Sæmundsson et al. [2018], Zhou et al. [2018], Rusu et al. [2018]). Closely related to our method is the work of Lee and Choi [2018], who also update only part of the network at test time. They, however, employ a learning mechanism that allows them to also learn which part of the network to update. This is less sensitive to the choice of initial learning rate (something we investigate in our empirical evaluation as well) and can outperform MAML on the Mini-Imagenet classification task. The idea of dynamic partitioning is attractive; however, it results in a more complex meta-learning algorithm. In this work we consider a simpler, more interpretable alternative. Our experiments include, where applicable, comparisons to the above algorithms. We show that CAML is highly competitive with the above approaches while remaining a simple approach.

Context features as a component of inductive transfer were first introduced by Silver et al. [2008], who use a one-hot encoded task-specifying context as input to the network (which is not learned but predefined). They show that this works better than learning a shared feature extractor and having separate heads for all tasks. Learning a task embedding itself has also been explored, e.g., by Oreshkin et al. [2018] or Garnelo et al. [2018], who use the task’s training set to condition the network. By contrast, we learn the context parameters via backpropagation through the same network that is used to solve the task.

5 Experiments

In this section we empirically evaluate CAML. Our extensive experiments aim to demonstrate three qualities of our method. First, adapting a small number of input parameters during the inner loop is sufficient to yield performance equivalent to or better than MAML in a range of regression, classification and reinforcement learning tasks. Like in MAML, it is possible to continue learning by performing several gradient update steps at test time, even when training using only one gradient step. Second, CAML is robust to the task-specific learning rate and scales well to more expressive networks without overfitting. Third, an embedding of the task emerges in the context parameters solely via backpropagation through the original inner loss.

5.1 Regression

Number of Additional Input Parameters
Method 0 1 2 3 4 5
Table 1: MSE results of CAML and MAML for varying number of input parameters, for shots. Numbers are averages over random sets of tasks, with confidence intervals in brackets.
Parallel Partitioning
Nodes Layer and Layer Layer
Stacked Partitioning
Layer Last Layer
Table 2: Alternative partitioning schemes, MSE on the regression task for shot (averages over random sets of tasks, with confidence intervals in brackets). Labels indicate which parameters are task-specific. The rest of the network is shared across tasks and updated in the outer loop.

We start with the regression problem of fitting sine curves, using the same setup as Finn et al. [2017a] to allow a direct comparison. A task is defined by the amplitude and phase of the sine curve, and is generated by uniformly sampling the amplitude from and the phase from . For training, ten labelled datapoints (uniformly sampled from ) are given for each task for the inner loop update. Per meta-update we iterate over a batch of

tasks and perform gradient descent on a mean-squared error (MSE) loss. We use a neural network with two hidden layers and

nodes each and ReLU non-linearities. During testing we present the model with ten datapoints from 1000 newly sampled tasks and measure MSE over 100 test points.
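The task-generation procedure described above can be sketched as follows. The sampling ranges are passed in as arguments, and the values used in the usage lines are illustrative placeholders rather than the settings of our experiments. The function names and the parameterisation `amplitude * sin(x - phase)` are likewise assumptions made for this sketch.

```python
import math
import random


def sample_sine_task(rng, amp_range, phase_range):
    """Draw one regression task, defined by an amplitude and a phase."""
    amplitude = rng.uniform(*amp_range)
    phase = rng.uniform(*phase_range)
    return amplitude, phase


def sample_points(rng, task, x_range, n_points):
    """Draw labelled datapoints (x, y) from the sine curve of a task."""
    amplitude, phase = task
    xs = [rng.uniform(*x_range) for _ in range(n_points)]
    return [(x, amplitude * math.sin(x - phase)) for x in xs]


# Usage with placeholder ranges (illustrative only).
rng = random.Random(0)
task = sample_sine_task(rng, amp_range=(0.5, 2.0), phase_range=(0.0, math.pi))
train = sample_points(rng, task, x_range=(-3.0, 3.0), n_points=10)   # inner loop
test = sample_points(rng, task, x_range=(-3.0, 3.0), n_points=100)   # evaluation
```

The ten training points are used for the inner-loop update and the larger test set for measuring the MSE after adaptation.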

CAML uses the same training procedure and architecture but adds context parameters. To allow a fair comparison, we add the same number of additional inputs to MAML, an extension that was also done by Finn et al. [2017b]. These additional parameters are meta-learned together with the rest of the network, which can improve performance due to a more expressive gradient. Our method differs from this formulation in that we update only the context parameters in the inner loop, and reinitialise them to zero for each new task. In the outer loop, we only update the shared parameters.

Table 1 shows that CAML outperforms the original MAML (with no additional inputs) significantly, and MAML with the same network architecture by a small margin. This performance gain is possible even though at test time, CAML adapts only - parameters, instead of around . To test the hypothesis that it suffices to adapt only input parameters per task, we also compare to alternative parameter partitions in Table 2. In parallel partitioning, we choose a strict subset of the nodes of each layer for task-specific adaptation, and meta-learn the rest. In stacked partitioning, we choose one or several layers for task-specific adaptation, and meta-learn the other layers. The results confirm that partitioning on context parameters is key to success: the other variants perform worse, often significantly so. A recent method proposed by Lee and Choi [2018], also partitions the network to adapt only part of it on a specific task – the partitioning mask, however, is learned. They test their method on the regression task as well, but we outperform the numbers they report significantly (not shown, since we believe this might be due to differences in implementation). In the next section we will see that on few-shot classification, this approach achieves comparable performance to our method.

MAML is known to keep learning after several gradient update steps. We test this on our method as well, with the results shown in Figure 2 for up to gradient steps. CAML outperforms MAML even after taking several gradient update steps, and is more stable, as indicated by the size of the confidence intervals and the monotonic learning curve.

Figure 2: Performance after several gradient steps (on the same batch) averaged over unseen tasks. The size of the context parameter / additional input to MAML is .
Figure 3: Visualisation of what the context parameters learn given a new task. In this case we have context parameters, and shown is the value they take after gradient update steps on a new task. Each dot is one random task, with its colour indicating the amplitude (left) or phase (right) of that task.
Figure 4: CAML scales the model weights so that the inner learning rate is compensated by the context parameters gradients magnitude.
Figure 5: Measuring performance for different learning rates shows that CAML is more robust to this hyperparameter than MAML.

As described in Section 3.5, CAML has the freedom to scale the gradients at the context parameters since they are inputs to the model and trained separately. Figure 4 plots the inner learning rate against the norm of the gradient of the context parameters at test time. We can see that the weights are adjusted so that lower learning rates bring about larger context parameter gradients and vice versa. This makes the method extremely robust to the learning rate, as confirmed by Figure 5. We plot the performance while varying the learning rate from to . CAML is robust to changes in learning rate while MAML performs well only in a small range. Work by Li et al. [2017] shows that MAML can be improved by learning a parameter-specific learning rate, which, however, introduces a lot of additional parameters.

CAML’s performance on the regression task correlates with how many variables are needed to encode the tasks. In these experiments, two parameters vary between tasks, which is exactly the context parameter dimensionality at which CAML starts to perform well (the optimal encoding is three dimensional, as phase is periodic). This suggests CAML may indeed learn task descriptions in the context parameters. Figure 3 illustrates this by plotting the value of the learned inputs against the amplitude/phase of the task in the case of two context parameters. The model learns a smooth embedding in which interpolation between tasks is possible.

5.2 Classification

learn 5-way accuracy
Method init. univ. 1-shot 5-shot
Matching Nets [Vinyals et al., 2016]
Meta LSTM [Ravi and Larochelle, 2017]
Prototypical Networks [Snell et al., 2017]
Meta-SGD [Li et al., 2017]
REPTILE [Nichol and Schulman, 2018]
PLATIPUS [Finn et al., 2018] -
MT-NET [Lee and Choi, 2018] -
Qiao et al. [2017]
LEO Rusu et al. [2018]
MAML (32) [Finn et al., 2017a]
MAML (64)
CAML (32)
CAML (64)
CAML (128)
CAML (256)
CAML (512)
CAML (256, first order)
CAML (512, first order)
Table 3: Few-shot classification results on the Mini-Imagenet test set (average accuracy with confidence intervals on a random set of tasks). We compare to existing CNN-based methods, and indicate whether they learn an initialisation (column learn init.) and whether they are universally applicable to a range of problems like regression, classification, and RL (column univ.). We include two methods that use much deeper, residual networks, but greyed them out since direct comparison is not possible. For MAML, we show the results reported by Finn et al. [2017a], and the results when using a larger network (obtained with the authors’ open-sourced code and unchanged hyperparameters except the number of filters).

To evaluate CAML on a more challenging problem, we test it on the competitive few-shot image classification benchmark Mini-Imagenet [Ravi and Larochelle, 2017]. In $N$-way $K$-shot classification, a task is a random selection of $N$ classes, for each of which the model gets to see $K$ examples. From these it must learn to classify unseen images from the $N$ classes. The Mini-Imagenet dataset consists of 64 training classes, 12 validation classes, and 24 test classes. During training, we generate a task by selecting $N$ classes at random from the 64 training classes and training the model on $K$ examples of each, i.e., a batch of $N \times K$ images. The meta-update is done on a set of unseen images of the same classes.
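The construction of a single few-shot task from a pool of classes can be sketched as follows. The function name `sample_episode`, the dictionary layout `images_by_class`, and the parameter `n_query` (the number of unseen images per class used for the meta-update) are hypothetical and chosen for illustration.

```python
import random


def sample_episode(rng, images_by_class, n_way, k_shot, n_query):
    """Build one N-way K-shot task with disjoint support and query images."""
    classes = rng.sample(sorted(images_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Sampling without replacement keeps the query images unseen.
        picks = rng.sample(images_by_class[cls], k_shot + n_query)
        support += [(img, label) for img in picks[:k_shot]]
        query += [(img, label) for img in picks[k_shot:]]
    return support, query
```

The support set holds the examples used for the inner-loop update, and the query set holds the unseen images of the same classes on which the meta-update is computed. Labels are re-indexed from 0 for every task, since the class selection changes from task to task.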

On this benchmark, MAML uses a network with four convolutional layers with 32 filters each and one fully connected layer at the output [Finn et al., 2017a]. We use the same network architecture, but with between 32 and 512 filters per layer. We use context parameters and add a FiLM layer (see Section 3.3) that conditions on these after the third convolutional layer. The parameters of the FiLM layer are meta-learned with the rest of the network, i.e., they are part of $\theta$.

All our models were trained with two gradient steps in the inner loop and evaluated with two gradient steps (note: MAML was trained with five inner-loop gradient steps and evaluated with ten gradient steps). The inner learning rate was set to . Following Finn et al. [2017a], we ran each experiment for meta-iterations and selected the model with the highest validation accuracy for evaluation on the test set.

Table 3 shows our results on Mini-Imagenet held-out test data for 5-way 1-shot and 5-shot classification. We compare to a number of existing meta-learning approaches that use convolutional neural networks, including MAML. Our largest model (512 filters) clearly outperforms MAML, and outperforms the other methods on the 1-shot classification task. On 5-shot classification, the best results are obtained by prototypical networks [Snell et al., 2017], a method that is specific to few-shot classification and works by computing distances to prototype representations of each class. Our smallest model (32 filters) under-performs MAML (within the confidence intervals). As we can see, CAML benefits from increasing model expressiveness: since we only adapt the context parameters in the inner loop per task, we can substantially increase the network size without overfitting during the inner-loop update. We tested scaling up MAML to a larger network size as well (see Table 3), but found that this hurt accuracy. We also include two state-of-the-art results from methods based on much deeper, residual networks [Qiao et al., 2017, Rusu et al., 2018, greyed out], which are not directly comparable due to much more expressive networks. Our method can be readily applied to deep residual networks as well, and we leave this exploration for future work. Table 3 also shows the first-order approximation of our largest models, where the gradient with respect to $\theta$ is not backpropagated through the inner-loop update of the context parameters $\phi$. As expected, this results in a lower accuracy (a drop of ), but we are still able to outperform MAML with a first-order version of our largest network.
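The difference between the full meta-gradient and its first-order approximation can be made explicit with the chain rule. The sketch below uses assumed shorthand: $\mathcal{L}^{\text{train}}$ and $\mathcal{L}^{\text{test}}$ are the average losses of one task on its training and test data, $\phi_i$ is the adapted context after one inner-loop step from $\phi_0$ with learning rate $\alpha$, and $\theta$ are the shared parameters.

```latex
\begin{align}
\phi_i(\theta) &= \phi_0 - \alpha \nabla_\phi \mathcal{L}^{\text{train}}(\phi_0, \theta), \\
\frac{d}{d\theta} \mathcal{L}^{\text{test}}(\phi_i(\theta), \theta)
  &= \frac{\partial \mathcal{L}^{\text{test}}}{\partial \theta}
   + \left( \frac{\partial \phi_i}{\partial \theta} \right)^{\!\top}
     \frac{\partial \mathcal{L}^{\text{test}}}{\partial \phi_i},
  \qquad
  \frac{\partial \phi_i}{\partial \theta}
  = -\alpha\, \frac{\partial}{\partial \theta} \nabla_\phi \mathcal{L}^{\text{train}}(\phi_0, \theta), \\
\text{first order:}\quad
\frac{d}{d\theta} \mathcal{L}^{\text{test}}(\phi_i(\theta), \theta)
  &\approx \frac{\partial \mathcal{L}^{\text{test}}}{\partial \theta}.
\end{align}
```

The first-order version treats $\phi_i$ as a constant when differentiating with respect to $\theta$, which removes the mixed second derivative and with it the need to backpropagate through the inner-loop update.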

Thus, CAML can achieve much higher accuracies than MAML by increasing the network size, without overfitting. Our results are obtained by only adjusting parameters at test time, instead of .

5.3 Reinforcement Learning

(a) Performance per gradient update.
(b) Performance per learning rate.
(c) Learned task embedding.
Figure 6: 2D navigation task analysis. Figure 6(a) shows the performance of each method as more gradient updates are performed. Figure 6(b) shows the performance of each method after 2 updates as the inner-loop learning rate is increased. As in the case of regression, CAML is not affected by this parameter. Figure 6(c) shows the goal position of different 2D navigation tasks and the corresponding context parameter activation obtained by performing 2 gradient updates. We can see that the context parameters represent an interpretable embedding of the task at hand. Context parameter 1 seems to encode the y position, while context parameter 2 encodes the x position.

To demonstrate the versatility of CAML, we also perform preliminary reinforcement learning experiments on a 2D Navigation task, also introduced by Finn et al. [2017a]. In this domain, the agent moves in a 2D world using continuous actions. At each timestep it is given a negative reward proportional to its distance from a pre-defined goal position. Each task is defined by a new unknown goal position.

We follow the same procedure as Finn et al. [2017a]. Goals are sampled from an interval of . At each step we sample 20 tasks for both the inner and outer loops and testing is performed on 40 new unseen tasks. We perform learning for 500 iterations and the best performing policy during training is then presented with new test tasks and allowed two gradient updates. For each update, the total reward over 20 rollouts per task is measured. We use a two-layer network with 100 units per layer and ReLU non-linearities to represent the policy and a linear value function approximator. For CAML we use five context parameters at the input layer.

In terms of performance (Figure 5(a)) we can see that the two methods are highly competitive. MAML performs better after the first gradient update, after which it is surpassed by CAML. Figure 5(b), which plots performance for several learning rates, shows that CAML is again less sensitive to the inner loop learning rate. Only when using a learning rate of 0.1 is MAML competitive in performance. Furthermore, CAML adapts parameters whereas MAML adapts around parameters.
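The difference in the number of adapted parameters can be made concrete with a short calculation. The sketch below assumes a two-dimensional state and a two-dimensional action (neither is stated in this section), counts only the weights and biases of the two-hidden-layer policy described above, and ignores any additional policy parameters such as an action variance, so the totals are illustrative rather than the paper's reported figures.

```python
def mlp_parameter_count(layer_sizes):
    """Number of weights and biases in a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


STATE_DIM = 2    # assumption: (x, y) position of the agent
ACTION_DIM = 2   # assumption: (dx, dy) movement
HIDDEN = 100     # units per hidden layer, as described in the text
NUM_CONTEXT = 5  # context parameters at the input layer, as described

# A method that adapts the whole policy updates every weight and bias.
all_parameters = mlp_parameter_count([STATE_DIM, HIDDEN, HIDDEN, ACTION_DIM])

# With context parameters, only the context vector changes per task; the
# (slightly larger) network that consumes it is shared across tasks.
adapted_with_context = NUM_CONTEXT
shared_with_context = mlp_parameter_count(
    [STATE_DIM + NUM_CONTEXT, HIDDEN, HIDDEN, ACTION_DIM])

print(all_parameters, adapted_with_context, shared_with_context)
```

Under these assumptions the per-task update touches a handful of values instead of every weight in the policy.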

As with regression, the optimal task embedding is low dimensional enough to plot. We therefore apply CAML with two context parameters and plot how these correlate with the actual position of the goal for 200 test tasks (this results in slightly worse performance than when using five context parameters; comparison not shown). Figure 5(c) shows that the context parameters obtained after two policy gradient updates represent a disentangled embedding of the actual task. Specifically, context parameter 1 appears to encode the y position of the goal, while context parameter 2 encodes the x position. Hence, CAML can learn compact, potentially interpretable task embeddings via backpropagation through the inner loss.

6 Conclusion and Future Work

In this paper we introduced CAML, a meta-learning approach for fast adaptation that introduces context parameters in addition to the model’s parameters. The context parameters are used to modulate the whole network during the inner loop of meta-learning, while the rest of the network parameters are adapted in the outer loop and shared across tasks. On regression, our method outperforms MAML and is superior to naive approaches to partitioning network parameters. We also showed that CAML is highly competitive with state-of-the-art methods on few-shot classification using CNNs. In addition, we experimented extensively with properties that arise specifically from the way our method is formulated, such as robustness to the learning rate and the emergence of task embeddings in the context parameters. An interesting extension would be to inspect the context parameter representations learned by CAML on the Mini-Imagenet benchmark using advanced dimensionality reduction techniques.

In this paper we performed some preliminary RL experiments. We are interested in extending CAML to more challenging problems and exploring its role in allowing for smart exploration in order to identify the task at hand. It would also be interesting to consider probabilistic extensions along the lines of PLATIPUS [Finn et al., 2018], where the context parameters include uncertainty about the task.

Finally, the intriguing empirical properties of CAML detailed in this work will serve as the basis for more theoretical investigations in the future.


We thank Wendelin Boehmer and Mark Finean for useful discussions and feedback, and Joost van Amersfoort for support with using PyTorch. We would also like to thank Jackie Loong for their open-sourced MAML-PyTorch implementation, and Chelsea Finn for responding quickly to GitHub issues. The NVIDIA DGX-1 used for this research was donated by the NVIDIA Corporation. L. Zintgraf is supported by the Microsoft Research PhD Scholarship Program. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 637713).


Appendix A Pseudo-Code

Algorithm 1 CAML for Supervised Learning
Distribution over tasks
Step sizes and
Initial model with initialised randomly and
while not done do
     Sample batch of tasks where
     for all  do
     end for
end while
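The structure of the supervised algorithm (an inner step on the context parameters only, followed by an outer step on the shared parameters) can be illustrated on a toy problem. The sketch below is not the authors' implementation and rests on several assumptions: a hypothetical linear model f(x) = w·x + b + v·φ with a single context parameter φ that is reset to 0 for every task, tasks of the form y = x + offset, a single inner gradient step, the same data points for the inner and outer losses, and the first-order approximation in which the gradient with respect to the shared parameters is not backpropagated through the inner update.

```python
XS = [-1.0, -0.5, 0.0, 0.5, 1.0]  # fixed inputs shared by all toy tasks


def predict(theta, phi, x):
    w, b, v = theta              # shared parameters
    return w * x + b + v * phi   # phi is the task-specific context parameter


def mse(theta, phi, offset):
    """Mean squared error on the toy task y = x + offset."""
    return sum((predict(theta, phi, x) - (x + offset)) ** 2 for x in XS) / len(XS)


def adapt_context(theta, offset, alpha):
    """Inner loop: one gradient step on phi, starting from phi = 0 (assumed)."""
    v = theta[2]
    grad_phi = sum(2 * (predict(theta, 0.0, x) - (x + offset)) * v
                   for x in XS) / len(XS)
    return 0.0 - alpha * grad_phi


def meta_train(offsets, alpha=0.5, beta=0.05, iterations=300):
    """Outer loop: first-order update of the shared parameters."""
    theta = [0.0, 0.0, 0.5]  # arbitrary initial values for w, b, v
    for _ in range(iterations):
        grad = [0.0, 0.0, 0.0]
        for offset in offsets:
            phi = adapt_context(theta, offset, alpha)  # treated as a constant
            for x in XS:
                err = predict(theta, phi, x) - (x + offset)
                grad[0] += 2 * err * x    # d loss / d w
                grad[1] += 2 * err        # d loss / d b
                grad[2] += 2 * err * phi  # d loss / d v
        n = len(offsets) * len(XS)
        theta = [t - beta * g / n for t, g in zip(theta, grad)]
    return theta


theta = meta_train(offsets=[-2.0, -1.0, 1.0, 2.0])
new_offset = 1.5  # a task not seen during meta-training
phi = adapt_context(theta, new_offset, alpha=0.5)
print("loss before adaptation:", mse(theta, 0.0, new_offset))
print("loss after adaptation: ", mse(theta, phi, new_offset))
```

In this toy setting the shared parameters learn the part of the function common to all tasks, and a single step on the one context parameter is enough to account for the task-specific offset.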
Algorithm 2 CAML for RL
Distribution over tasks
Step sizes and
Initial policy with initialised randomly and
while not done do
     Sample batch of tasks where
     for all  do
         Collect rollout using
         Collect rollout using
     end for
end while

Appendix B Experimental Details

B.1 Classification

For Mini-Imagenet, our model takes as input images of size and has outputs, one for each class. The model has four modules that each consist of: a convolution with a kernel, padding , filters, a batch normalisation layer, a max-pooling operation with kernel size , if applicable a FiLM transformation (only at the third convolution, details below), and a ReLU activation function. The output size of these four blocks is , which we flatten to a vector and feed into one fully connected layer.
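The size of the flattened vector follows from standard convolution and pooling arithmetic. The sketch below shows that calculation; the concrete values in the usage lines (an 84-pixel input, a 3-wide kernel, pooling size 2, 32 filters) are hypothetical placeholders chosen for illustration and are not taken from this section.

```python
def conv_output_size(size, kernel, padding, stride=1):
    """Spatial size after a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_output_size(size, kernel):
    """Spatial size after non-overlapping max-pooling (floor division)."""
    return size // kernel


def feature_map_size(size, num_modules, kernel, padding, pool):
    """Spatial size after a stack of identical conv + max-pool modules."""
    for _ in range(num_modules):
        size = pool_output_size(conv_output_size(size, kernel, padding), pool)
    return size


# Hypothetical values, for illustration only.
side = feature_map_size(size=84, num_modules=4, kernel=3, padding=1, pool=2)
flattened_length = side * side * 32  # 32 filters is also a placeholder
print(side, flattened_length)
```

The flattened length is the input dimension of the final fully connected layer, so it changes whenever the number of filters is scaled up.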

The FiLM layer itself is a fully connected layer with inputs and a -dimensional output and the identity function at the output. The output is divided into and , each of dimension , which are used to transform the filters that the convolutional operation outputs. The context vector is of size (other sizes tested: , ) and is added after the third convolution (other versions tested: at the first, second or fourth convolution).
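A minimal sketch of such a layer is given below, assuming the usual FiLM form in which each feature map is scaled by its own gamma and shifted by its own beta. The function names and the nested-list representation (channels, then rows, then values) are illustrative choices, and the ordering of gamma before beta in the layer output is an assumption.

```python
def film_parameters(context, weights, biases):
    """Fully connected layer with identity output, split into (gamma, beta).

    weights has one row per output unit; the first half of the outputs is
    taken as gamma and the second half as beta (ordering is an assumption).
    """
    outputs = [sum(w * c for w, c in zip(row, context)) + b
               for row, b in zip(weights, biases)]
    half = len(outputs) // 2
    return outputs[:half], outputs[half:]


def film_transform(feature_maps, gamma, beta):
    """Scale and shift every value of each feature map (channel)."""
    return [[[g * value + b for value in row] for row in channel]
            for channel, g, b in zip(feature_maps, gamma, beta)]


# Two feature maps of size 1x2 and a two-dimensional context vector.
features = [[[1.0, 2.0]], [[1.0, 2.0]]]
weights = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
biases = [0.0, 0.0, 0.5, -0.5]
gamma, beta = film_parameters([1.0, 2.0], weights, biases)
print(film_transform(features, gamma, beta))
```

Because gamma and beta are functions of the context vector alone, changing the context parameters modulates every spatial position of a feature map in the same way.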

The network is initialised using He et al. [2015] initialisation for the weights of the convolutional and fully connected layers (including the FiLM layer weights). The bias parameters are initialised to zero, except at the FiLM layer.

We use the Adam optimiser for the meta-update step with an initial learning rate of . This learning rate is annealed every steps by multiplying it by . The inner learning rate is set to (other hyperparameters tested: , ).
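The annealing scheme described above is a step decay. A minimal sketch follows; the function name is illustrative, and the numbers in the usage lines are placeholders rather than the values used in the experiments.

```python
def annealed_learning_rate(initial_lr, decay_factor, decay_every, step):
    """Learning rate after multiplying by decay_factor every decay_every steps."""
    return initial_lr * decay_factor ** (step // decay_every)


# Placeholder values, for illustration only.
for step in (0, 4999, 5000, 10000):
    print(step, annealed_learning_rate(0.001, 0.9, 5000, step))
```

The rate stays constant within each interval and drops by the same factor at every boundary, so it decays geometrically in the number of completed intervals.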

For Mini-Imagenet, we use a meta batch size of and tasks for -shot and -shot classification respectively. For the batch norm statistics, we always use the current batch, also during testing. That is, for -way -shot classification the batch size at test time is , and we use this batch for normalisation.
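Normalising with the statistics of the current batch, rather than with running averages collected during training, can be sketched as follows. This is a simplified illustration on a list of feature vectors: the learnable scale and shift of a full batch normalisation layer are omitted, and the value of `eps` is an assumption.

```python
def batch_norm_current_batch(batch, eps=1e-5):
    """Normalise each feature using the mean and variance of this batch only."""
    n = len(batch)
    num_features = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(num_features)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(num_features)]
    return [[(row[j] - means[j]) / (variances[j] + eps) ** 0.5
             for j in range(num_features)]
            for row in batch]


# Two examples with two features each.
normalised = batch_norm_current_batch([[1.0, 10.0], [3.0, 30.0]])
print(normalised)
```

Since no running statistics are stored, the normalisation at test time depends only on the examples of the task being evaluated.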