MetaFun: Meta-Learning with Iterative Functional Updates

12/05/2019, by Jin Xu et al.

Few-shot supervised learning leverages experience from previous learning tasks to solve new tasks where only a few labelled examples are available. One successful approach to this problem uses an encoder-decoder meta-learning pipeline, whereby labelled data in a task are encoded to produce a task representation, and this representation is used to condition the decoder when making predictions on unlabelled data. We propose an approach that uses this pipeline with two important features. 1) We use infinite-dimensional functional representations of the task rather than fixed-dimensional representations. 2) We iteratively apply functional updates to the representation. We show that our approach can be interpreted as extending functional gradient descent, and delivers performance that is comparable to or outperforms previous state-of-the-art on few-shot classification benchmarks such as miniImageNet and tieredImageNet.

1 Introduction

Humans have a remarkable ability to generalise to new tasks and to use past experience to solve new problems quickly. Traditional machine learning algorithms struggle to do so. In recent years, significant effort has been devoted to addressing these issues under the field of meta-learning, whose goal is to generalise to new tasks drawn from the same task distribution as the training tasks. In supervised learning, a task can be described as making predictions on a set of unlabelled data points (the target) by effectively learning from a set of labelled data points (the context).

Various ideas have been proposed to tackle meta-learning from different perspectives. Andrychowicz et al. (2016) and Ravi and Larochelle (2016) propose to learn, from previous tasks, an optimisation algorithm that can be applied to new tasks. Santoro et al. (2016) demonstrate that Memory-Augmented Neural Networks (MANN) can rapidly integrate new data into memory and use this stored information to make predictions after seeing only a few examples of the new task. MAML (Finn et al., 2017) learns an initialisation of the model parameters, and adapts to a new task by running a few gradient steps from it. Koch (2015); Snell et al. (2017); Vinyals et al. (2016) explore the idea of learning a metric space from previous tasks, in which new data points are compared to each other to make predictions at test time.

In this work, we are particularly interested in another family of meta-learning models that use an encoder-decoder pipeline (Garnelo et al., 2018a, b; Rusu et al., 2019). The encoder is a permutation-invariant function on the context set that summarises the task into a task representation, while the decoder is a predictive model that makes predictions on the target, conditioned on the task representation. The objective of meta-learning is then to learn the encoder and the decoder such that the predictive models generalise well to new tasks.

Previous works such as LEO (Rusu et al., 2019), CNP and NP (Garnelo et al., 2018a, b) all belong to this category. Despite their success on various tasks, NPs tend to underfit the context. ANP (Kim et al., 2019) addresses this issue by modifying the encoder to produce target-specific task summaries, which allows each target to attend to the context points most relevant to it. We show in Section 3.2 that this can be interpreted as representing the task using a function of the target inputs. Moreover, MAML (Finn et al., 2017), which meta-learns an initialisation and at test time runs a few gradient steps on the context set of a new task starting from that initialisation, can be reinterpreted under the encoder-decoder formulation of Section 2.1, with the very high-dimensional model parameters acting as the task representation. This suggests that meta-learning models may benefit from a very high-dimensional (as in MAML) or even infinite-dimensional (as in ANP) task representation.

Generally speaking, designing an iterative update rule is often easier than finding the final solution directly: for example, closed-form solutions do not exist for most non-convex optimisation problems, yet many iterative algorithms can be designed to reach good optima effectively; likewise, it can be challenging to sample directly from a high-dimensional target posterior distribution, but we can design a Markov chain Monte Carlo (MCMC) transition kernel whose equilibrium distribution is the target distribution. In meta-learning, both learning to optimise (Andrychowicz et al., 2016; Ravi and Larochelle, 2016) and MAML can be seen as applying iterative updating procedures, where the updating rule in MAML is given by the gradient.

In this work, we investigate more deeply the idea of summarising tasks using functional representations. Specifically, we focus on developing a model that learns to iteratively update task representations in function space. Recently, Gordon et al. (2019) also considered functional representations in CNP, but they mainly focus on incorporating translation equivariance in the data as an inductive bias. The primary contribution of this work is a meta-learning model that summarises the task into a functional representation and iteratively applies functional updates based on the context set and the current state of the representation. We apply our models to meta-learning problems on both regression and classification tasks, and achieve performance that is comparable to or outperforms previous state-of-the-art on heavily benchmarked datasets such as miniImageNet (Vinyals et al., 2016) and tieredImageNet (Ren et al., 2018). Moreover, we draw a close connection to gradient-based meta-learning methods such as MAML under a unified perspective that encompasses many previous works. Furthermore, we show that our model extends the classical notion of functional gradient descent; from this perspective, it can also be seen as a learned optimiser operating in function space. Finally, we conduct an ablation study to understand the effects of the different components of our model.

2 Meta-Learning under the Encoder-Decoder Formulation

Meta-learning, or learning to learn, leverages past experience in order to quickly adapt to new tasks from the same task distribution. In supervised meta-learning, a task takes the form $(\mathcal{L}, C, T)$, where $\mathcal{L}$ is the loss function to be minimised, $C = \{(x_i, y_i)\}_{i=1}^{N}$ is the context, and $T$ is the target. A meta-learner adapts to a new task by inferring the parameters of a predictive model from the context of the task, and the objective of meta-learning is to build a learning model with parameters $\theta$ such that the total loss on the target under $\mathcal{L}$ is minimised:

$\min_\theta \; \mathbb{E}_{(\mathcal{L}, C, T)}\Big[\sum_{(x, y) \in T} \mathcal{L}\big(g_\theta(x; C),\, y\big)\Big] \qquad (1)$

2.1 Permutation-Invariant Representation

Many previous meta-learning models, e.g., CNP, NP, ANP as well as MAML and its modifications, encode the context into a task representation using a permutation-invariant function. The task representation is then used to obtain a predictive model via a decoding step. Under this framework, the meta-learner consists of an encoder and a decoder; meta-learning corresponds to training the encoder-decoder pipeline, while learning is just a single forward pass through the encoder and the decoder. Formally, we construct the learning model as

$g_\theta(x; C) = d\big(x, r\big), \qquad r = e(C), \qquad (2)$

where $e$ is the encoder, $d$ is the decoder, and r is the task representation.

The encoder in CNP corresponds to a summation of instance-level representations produced by a shared instance encoder $h$:

$r = \sum_{(x_i, y_i) \in C} h(x_i, y_i). \qquad (3)$

NPs, on the other hand, use a probabilistic encoder with the same parametric form as in Equation 3, but produce a distribution over a stochastic representation r. Note that a summation over shared instance-wise encodings is a generic form for representing permutation-invariant functions (Zaheer et al., 2017; Bloem-Reddy and Teh, 2019).
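To make the sum-pooling encoder of Equation 3 concrete, the following is a minimal NumPy sketch (not the authors' TensorFlow code); the instance encoder architecture and its layer sizes are illustrative assumptions.

```python
import numpy as np

def make_mlp(sizes, rng):
    """A tiny MLP with tanh hidden activations; returns a forward function."""
    params = [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
              for m, n in zip(sizes[:-1], sizes[1:])]
    def forward(z):
        for i, (W, b) in enumerate(params):
            z = z @ W + b
            if i < len(params) - 1:
                z = np.tanh(z)
        return z
    return forward

def cnp_encoder(instance_encoder, xs, ys):
    """Sum-pooling encoder of Equation 3: encode each (x, y) pair and sum.
    Reordering the context leaves the representation r unchanged."""
    pairs = np.concatenate([xs, ys], axis=-1)      # (N, dx + dy)
    return instance_encoder(pairs).sum(axis=0)     # (dr,)

rng = np.random.default_rng(0)
h = make_mlp([2, 32, 16], rng)                     # hypothetical layer sizes
xc = rng.uniform(-5.0, 5.0, (10, 1))
yc = np.sin(xc)
r = cnp_encoder(h, xc, yc)
perm = rng.permutation(10)
assert np.allclose(r, cnp_encoder(h, xc[perm], yc[perm]))  # permutation invariance
```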

Interestingly, many gradient-based meta-learning methods can also be cast into this formulation, because a gradient-descent step is itself a valid permutation-invariant function. Specifically, for a model parameterised by r, one step of gradient descent on the context set with loss function $\mathcal{L}$ and learning rate $\eta$ has the following form, where $r_0$ is the initialisation:

$r = r_0 - \eta \sum_{(x_i, y_i) \in C} \nabla_{r_0} \mathcal{L}\big(g(x_i; r_0),\, y_i\big). \qquad (4)$

This corresponds to a special case of Equation 3 in which the instance-wise encoding of $(x_i, y_i)$ is its negated, scaled gradient contribution $-\eta \nabla_{r_0} \mathcal{L}\big(g(x_i; r_0), y_i\big)$, shifted by the initialisation $r_0$. Moreover, multiple gradient-descent steps also result in a permutation-invariant function (a composition of permutation-invariant functions is permutation-invariant). We refer to this as a gradient-based encoder.
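The gradient-based encoder can be illustrated with a toy example. This is a hedged sketch (a linear model with squared loss, not MAML itself) showing that one gradient step on the context is a sum of per-example contributions and therefore permutation-invariant.

```python
import numpy as np

def gradient_step_encoder(r0, per_example_grad, xs, ys, lr=0.01):
    """One gradient step on the context (cf. Equation 4): the update is a sum of
    per-example gradient terms, hence permutation-invariant in the context."""
    g = sum(per_example_grad(r0, x, y) for x, y in zip(xs, ys))
    return r0 - lr * g

# toy linear model with squared loss; gradient of (r.x - y)^2 with respect to r
per_example_grad = lambda r, x, y: 2.0 * (r @ x - y) * x

rng = np.random.default_rng(1)
xs = rng.normal(size=(8, 3))
ys = xs @ np.array([1.0, -2.0, 0.5])
r0 = np.zeros(3)
r1 = gradient_step_encoder(r0, per_example_grad, xs, ys)
perm = rng.permutation(8)
assert np.allclose(r1, gradient_step_encoder(r0, per_example_grad, xs[perm], ys[perm]))
```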

It follows that popular meta-learning methods such as MAML and LEO (Rusu et al., 2019) can be seen as part of this framework. More specifically, in MAML, $r_0$ is the initialisation of the model parameters, and r becomes the task representation (albeit a very high-dimensional one). LEO composes a generic NP encoder, relation networks (Sung et al., 2018; Raposo et al., 2017) and a gradient-based encoder, and therefore also falls into our formulation.

We see that many meta-learning methods use a permutation-invariant function to infer model parameters from the context, and that the differences between these methods come down to the choice of the permutation-invariant function (the encoder) and the dimensionality of the representation it produces. The two main function classes used in these models are the generic, more flexible neural-network-based encoder, and the gradient-based encoder, which is a special case of the former. The success of MAML can be partially explained by the observation that a gradient-based encoder constitutes a strong inductive bias for learning, which is absent in vanilla neural networks. LEO relies on a more flexible generic encoder but combines it with gradient-based updates, thereby enjoying the best of both worlds.

With regard to the dimensionality of the representation, Wagstaff et al. (2019) show that if the dimensionality of the representation is smaller than the number of context points, there exist permutation-invariant functions that cannot be expressed in the form of Equation 3. Kim et al. (2019) further show that a finite-dimensional context representation can be quite limiting in its expressiveness, often resulting in underfitting on regression tasks. MAML circumvents this issue by using the model parameters as a very high-dimensional task representation, while the approach we propose uses a functional task representation, which can be seen as infinite-dimensional.

3 Meta-Learning in Function Space

In this section we consider an approach to meta-learning that uses functions to represent and summarise task-specific context sets. Taking an infinite-dimensional functional approach allows us to bypass the expressiveness issues associated with finite-dimensional task representations. We motivate our approach by starting with a high-level description of classical functional gradient descent (Mason et al., 1999; Y. Guo and Williamson, 2001). We then show how each component can be replaced with more flexible, learnable neural modules, and finally specialise our approach to few-shot regression and classification tasks.

3.1 Functional Gradient Descent

For a supervised learning task with context set $C = \{(x_i, y_i)\}_{i=1}^{N}$, the central object of interest is the prediction function $f$. The idea of functional gradient descent is to learn $f$ by directly computing its gradient and updating it in function space. To ensure that our functions are regularised to be smooth, we work with functions in an RKHS (Aronszajn, 1950; Berlinet and Thomas-Agnan, 2011) defined by a kernel $k$. For our purposes it is sufficient to think of $k(x, x')$ as defining a measure of similarity between two points x and $x'$ in the input space $\mathcal{X}$.

Figure 1: (A) A regression task with Mean Squared Error (MSE) loss. (B) Unregularised functional gradient (strictly, a subgradient). For MSE loss, the unregularised functional gradient is proportional to the difference between predictions and labels at the context points, and undefined elsewhere (see Section A.3). Updating using this unregularised functional gradient would therefore lead to extreme overfitting, because it does not generalise outside the context. (C) Functional gradient in a smoothed RKHS. (D) Functional gradient descent in the RKHS.

Given a function $f$ in the RKHS, we are interested in minimising the supervised loss on the context with respect to $f$. We can do so by computing the functional derivative. This is itself a function of x, and its evaluation at an input point x can be shown to be (Mason et al., 1999; Y. Guo and Williamson, 2001) (see Section A.3 for more details):

$\nabla_f \mathcal{L}(f)(x) = \sum_{(x_i, y_i) \in C} k(x, x_i)\, \partial_1 \ell\big(f(x_i), y_i\big), \qquad (5)$

where $\partial_1 \ell$ is the partial derivative of the loss with respect to its first argument. We can interpret Equation 5 as follows: the derivative at x is a linear combination of the derivatives at the training points (context), $\partial_1 \ell\big(f(x_i), y_i\big)$, weighted by the similarities $k(x, x_i)$.

We can use this functional derivative to update $f$ iteratively:

$f_{t+1} = f_t - \alpha\, \nabla_f \mathcal{L}(f_t), \qquad (6)$

with step size $\alpha$. To gain more intuition, we illustrate functional gradient descent on a simple 1D regression task in Figure 1. Obviously one cannot compute Equation 6 at all inputs $x \in \mathcal{X}$. However, it turns out to be sufficient to compute it only on the context, since the function values outside the context do not affect the next functional update, and hence do not affect the final model after $T$ iterations (see Equation 28 in Section A.3).
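The following NumPy sketch runs functional gradient descent with an RBF kernel on a toy 1D regression task with MSE loss, mirroring Equations 5 and 6 and the setting of Figure 1; the kernel lengthscale, step size, and number of steps are illustrative choices.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """RBF kernel matrix between 1D point sets a (n, 1) and b (m, 1)."""
    return np.exp(-((a - b.T) ** 2) / (2.0 * lengthscale ** 2))

def functional_gd(xc, yc, x_grid, steps=200, lr=0.1):
    """Functional gradient descent for MSE loss (Equations 5 and 6). Only the
    values of f at the context points are needed to form each update; the grid
    values are tracked purely for plotting."""
    f_context = np.zeros_like(yc)
    f_grid = np.zeros_like(x_grid)
    K_cc, K_gc = rbf(xc, xc), rbf(x_grid, xc)
    for _ in range(steps):
        residual = f_context - yc                     # dL/df at the context (MSE loss)
        f_context = f_context - lr * K_cc @ residual  # update at the context points
        f_grid = f_grid - lr * K_gc @ residual        # update anywhere else
    return f_grid

xc = np.linspace(-3.0, 3.0, 10)[:, None]
yc = np.sin(xc)
x_grid = np.linspace(-4.0, 4.0, 200)[:, None]
prediction = functional_gd(xc, yc, x_grid)  # a smooth fit to the context
```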

Figure 2: Updating the functional representation in MetaFun. This figure illustrates one iteration of MetaFun. For classification problems, the local update function has a special inner structure, further illustrated in the top left.

3.2 MetaFun

The updates above have no tunable parameters, except for the step size and the kernel. In this section, we develop MetaFun, a meta-learning framework with an architecture inspired by the functional gradient descent updates above. Specifically, we make the above procedure more flexible by replacing each component with a neural network module:

  1. We can use a latent functional representation r, which is decoded into a prediction function.

  2. We can replace the derivatives at the context points with a learned neural network (the local update function).

  3. We can use a deep kernel (Wilson et al., 2016), parameterised by a neural network, to learn more complex similarity relationships among input points.

  4. We can replace the kernel altogether with attention (Vaswani et al., 2017; Kim et al., 2019).

Using meta-learning, we can train these modules to generalise well from context sets to target sets. Alternatively, we can think of our method as learning an optimiser that operates in function space so as to generalise well. In the rest of this section we elaborate on each of the modifications above.

As in Section 3.1, it is unnecessary to compute the functional representations (or their functional updates) on all input points. Instead we compute them only on the context points and the target points, stacking these evaluations into a matrix whose rows correspond to the context and target inputs. Using a deep kernel parameterised by a neural network input transformation, and replacing the derivative terms with another neural network $u$, which we call the local update function, the kernel-based functional gradient of Equation 5 can be expressed as:

$\Delta = k(Q, K)\, V, \qquad (7)$

where Q is a matrix whose rows are queries (consisting of both contexts and targets), K is a matrix of keys, and V a matrix of values (using terminology from the attention literature). The kernel computes the matrix of kernel/similarity values between the query and key points. Dot-product attention (Vaswani et al., 2017) can alternatively be used in place of a kernel, with the dot product serving as the similarity metric:

$\Delta = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \qquad (8)$

where $d_k$ is the dimension of the query/key vectors. Unlike the kernel, the similarity values in attention are normalised by the softmax. Note that Equation 8 is the same as the query-specific cross-attention module in the deterministic path of attentive neural processes (ANP) (Kim et al., 2019), with the local update function serving as the instance encoder. It is also possible to use other forms of attention, such as Laplace or multihead attention (Kim et al., 2019). We note that attention mechanisms need not correspond to kernels, as they need not be symmetric or positive semi-definite.
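As a concrete illustration of the two pooling choices (deep kernel versus dot-product attention), here is a small NumPy sketch; the feature and representation dimensions are arbitrary, and the learned input transformation is replaced by random features.

```python
import numpy as np

def deep_kernel_pool(q_feat, k_feat, values, lengthscale=1.0):
    """Kernel-weighted pooling of local updates (cf. Equation 7): rows of q_feat
    and k_feat are transformed features of query and context points; values
    holds the local updates u(x_i, y_i, r(x_i))."""
    sq_dists = ((q_feat[:, None, :] - k_feat[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * lengthscale ** 2)) @ values

def attention_pool(q_feat, k_feat, values):
    """Dot-product attention pooling (cf. Equation 8): similarities are
    softmax-normalised, unlike the kernel above."""
    logits = q_feat @ k_feat.T / np.sqrt(q_feat.shape[-1])
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ values

rng = np.random.default_rng(0)
q = rng.normal(size=(7, 8))    # 7 query points (contexts and targets), 8-dim features
k = rng.normal(size=(5, 8))    # 5 context points
v = rng.normal(size=(5, 16))   # local updates, 16-dim representation
print(deep_kernel_pool(q, k, v).shape, attention_pool(q, k, v).shape)  # (7, 16) twice
```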

Having defined functional representations of the updates, we can now write down a procedure to iteratively compute the functional representation of the task. We initialise the representation at $r_0$ and update it a fixed number of times $T$ using:

(9)
(10)
(11)
(12)

where $\alpha$ is the learning rate, and the final representation after $T$ steps is $r_T$. Note that the local update function and the kernel/attention component are shared across iterations. This iterative procedure is illustrated in Figure 2. In practice, the performance of our method is not sensitive to the learning rate; this is expected because the local update function is learned, so its output scale can absorb different values of the learning rate. Furthermore, we found that a zero-initialised $r_0$ works well empirically, even though we also consider constant-initialised and parametric variants during hyperparameter tuning.
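The overall iteration can be sketched as follows, assuming a zero-initialised representation, a shared RBF-kernel pooling step, and toy stand-ins for the learned local update function and decoder; the exact forms of Equations 9-12 are given in the paper, so treat this as illustrative structure rather than the authors' implementation.

```python
import numpy as np

DIM_R = 16  # hypothetical dimensionality of the functional representation's outputs

def rbf_pool(qx, kx, values, lengthscale=1.0):
    """Kernel-weighted pooling of local updates (cf. Equation 7)."""
    k = np.exp(-((qx - kx.T) ** 2) / (2.0 * lengthscale ** 2))
    return k @ values

def metafun_forward(xc, yc, xt, local_update, decode, T=3, alpha=0.1):
    """Sketch of the MetaFun iteration (Equations 9-12): evaluate local updates
    on the context, pool them with a kernel, apply them to the representation at
    both context and target inputs, then decode the final representation."""
    r_c = np.zeros((len(xc), DIM_R))  # zero-initialised r_0 at the context inputs
    r_t = np.zeros((len(xt), DIM_R))  # ... and at the target inputs
    for _ in range(T):
        u = local_update(xc, yc, r_c)            # u(x_i, y_i, r_t(x_i)) on the context
        r_c = r_c - alpha * rbf_pool(xc, xc, u)  # functional update at context points
        r_t = r_t - alpha * rbf_pool(xt, xc, u)  # functional update at target points
    return decode(r_t, xt)

# toy stand-ins for the learned modules (in the paper these are MLPs)
local_update = lambda x, y, r: np.tile(r.mean(axis=1, keepdims=True) - y, (1, DIM_R))
decode = lambda r, x: r.mean(axis=1, keepdims=True)
xc, xt = np.linspace(-3.0, 3.0, 10)[:, None], np.linspace(-4.0, 4.0, 50)[:, None]
pred = metafun_forward(xc, np.sin(xc), xt, local_update, decode)
```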

Our approach consists of three learnable components: the local update function, the kernel/attention component, and the decoder. For a new task, we simply run $T$ iterations of our functional updates (eqs. 9, 10, 11 and 12) to get a task representation $r_T$, where each iteration shares the same local update function followed by the same kernel/attention component. The final representation is then used to condition the decoder, which is parameterised as an MLP or a linear transformation, to make predictions for the new task. During meta-training, we minimise the following objective:

(13)

where $r_T$ is given by Equations 9, 10, 11 and 12.

3.3 MetaFun for Regression and Classification

While the proposed framework can be applied to any supervised learning task, the specific parameterisation of learnable components does affect the model performance. In this section, we specify the parametric forms of our model that work well on regression and classification tasks.

Regression

For regression tasks, we parameterise the local update function using an MLP, which takes as input the concatenation of the input, its label and the current representation at that input, i.e. $[x_i, y_i, r_t(x_i)]$, and outputs functional updates. The input transformation function in the kernel/attention component is parameterised by another MLP in our experiments, even though other architectures are possible in general. The decoder in this case can be an MLP that maps the final representation to the parameters of the predictive model, which is itself an MLP. It is also possible to use other types of decoder, such as an MLP taking the concatenation of the representation and x as input, or feeding the representation to each layer of the MLP. Note that our model can easily be modified to incorporate Gaussian uncertainty by adding an extra output vector for the predictive standard deviation. For architectural details of these MLPs, see Table 6.
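A minimal sketch of such a decoder head with Gaussian uncertainty is shown below; the linear heads and the softplus-based positivity constraint are illustrative assumptions, not necessarily the paper's exact parameterisation.

```python
import numpy as np

def gaussian_decoder(r, x, w_mu, w_sigma):
    """Sketch of a regression decoder: map the final representation r_T(x),
    concatenated with x, to a predictive mean and standard deviation.
    w_mu and w_sigma stand in for learned linear heads."""
    z = np.concatenate([r, x], axis=-1)
    mu = z @ w_mu
    # one common way to keep the standard deviation positive and bounded away
    # from zero (an assumption here, not the paper's stated parameterisation)
    sigma = 0.1 + 0.9 * np.log1p(np.exp(z @ w_sigma))
    return mu, sigma

rng = np.random.default_rng(0)
r, x = rng.normal(size=(50, 16)), rng.normal(size=(50, 1))
mu, sigma = gaussian_decoder(r, x, rng.normal(size=(17, 1)), rng.normal(size=(17, 1)))
```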

Classification

For a $K$-way classification task, the latent functional representation is divided into $K$ parts $r = (r^{(1)}, \ldots, r^{(K)})$, where $r^{(k)}$ corresponds to class $k$. Consequently, the local update function also has $K$ parts, i.e. $u = (u^{(1)}, \ldots, u^{(K)})$. In this case, the label is a one-hot vector describing the class, and the local update function is defined as follows,

(14)

where one component summarises the representations of all classes, and the components are parameterised by separate MLPs. This formulation allows updating the class representation $r^{(k)}$ using one MLP (when the context label matches class $k$) and another (when it does not), and in practice parameterising the local update function in this way is critical for classification tasks. This design of the local update function is also illustrated in Figure 2. It is partly motivated by the updating procedure of functional gradient descent for classification tasks, which we derive in Section A.3. As with regression, the input transformation function in the kernel/attention component is an MLP. The parametric form of the decoder is the same as in LEO (Rusu et al., 2019): the class representation generates softmax weights through an MLP or a linear function, and the final prediction is given by

(15)

where the softmax is taken over the $K$ class scores. Hyperparameters of all components can be found in Appendix B.
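The following is a much-simplified sketch of the structure described above: per-class local updates gated by the one-hot label, and a decoder that turns class representations into softmax weights. The helper names (u_plus, u_minus) and the linear decoder are illustrative assumptions, not the exact parameterisation of Equations 14 and 15.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def classification_local_update(y_onehot, r_class, u_plus, u_minus):
    """Sketch of Equation 14's structure: each class representation is updated
    with u_plus when the context label matches that class and u_minus otherwise.
    r_class has shape (N, K, d); u_plus and u_minus stand in for learned MLPs."""
    m = y_onehot[..., None]                        # (N, K, 1) label mask
    return m * u_plus(r_class) + (1 - m) * u_minus(r_class)

def decode_and_predict(r_class, x_feat, w):
    """Sketch of Equation 15: class representations generate softmax weights
    (here via a shared linear map w), which score the input features."""
    class_weights = r_class @ w                    # (K, d_feat)
    return softmax(x_feat @ class_weights.T)       # (N, K) class probabilities

rng = np.random.default_rng(0)
K, d = 5, 16
u_plus = lambda r: np.tanh(r)                      # toy stand-ins for the MLPs
u_minus = lambda r: -np.tanh(r)
y = np.eye(K)[rng.integers(0, K, size=8)]          # one-hot labels for 8 context points
upd = classification_local_update(y, rng.normal(size=(8, K, d)), u_plus, u_minus)
probs = decode_and_predict(rng.normal(size=(K, d)), rng.normal(size=(8, 32)),
                           rng.normal(size=(d, 32)))
```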

4 Experiments

Figure 3: MetaFun is able to learn smooth updates and recovers the ground-truth function almost perfectly, whereas the updates given by MAML are less smooth, especially for MAML with fewer parameters.
Figure 4: Predictive uncertainties for MetaFun match those of the oracle GP very closely in both the 5-shot and 15-shot cases. The model is trained with randomly varying context sizes.
Model -shot MSE -shot MSE
Original MAML
Large MAML
Very Wide MAML
MetaFun
Table 1: Few-shot regression on sinusoid. MAML can benefit from more parameters, but MetaFun still outperforms all MAML variants despite using fewer parameters than large MAML. We report the mean and standard deviation over independent runs.

We evaluate our proposed model on both few-shot regression and classification tasks. In all experiments that follow, we partition the data into training, validation and test meta-sets, each containing data from disjoint tasks. For quantitative results, we train each model with different random seeds and report the mean and the standard deviation of the test accuracy. For further details on hyperparameter tuning, see Appendix B.

4.1 1-D Function Regression

We first explore a 1D sinusoid regression task where we visualise the updating procedure in function space, providing intuition for the learned functional updates. Then we incorporate Gaussian uncertainty into the model, and compare our predictive uncertainty against that of a GP which generates the data.

Visualisation of functional updates We train a multi-step MetaFun with dot-product attention on the simple sinusoid regression task from Finn et al. (2017), where each task provides a small number of data points from a sine wave whose amplitude and phase vary across tasks and are randomly sampled at training and test time, with the x-coordinates sampled uniformly. Figure 3 shows that our proposed algorithm learns a smooth transition from the initial state to the final prediction. Note that although only context points covering a single phase of the sinusoid are given at test time, the final iteration makes predictions close to the ground truth across the whole period. As a comparison, we use MAML as an example of updating in parameter space. The original MAML can fit the sinusoid quite well after several iterations from the learned initialisation, but the prediction is not as good, particularly on the left side where there are no context points (Figure 3 B). As we increase the model size (large MAML), the updates become much smoother (Figure 3 C) and the predictions are closer to the ground truth. We further experiment with a very wide MAML, but the performance does not improve further (Figure 3 D). In Table 1, we compare the mean squared error averaged across tasks. MetaFun performs much better than all MAML variants, even though it uses fewer parameters than large MAML.

Predictive uncertainties As another simple regression example, we demonstrate that MetaFun, like CNP, can produce good predictive uncertainties. We use synthetic data generated by a GP with an RBF kernel and Gaussian observation noise, and our decoder produces both predictive means and variances. As in Kim et al. (2019), we found that MetaFun-Attention can produce somewhat piecewise-constant mean predictions, which is less appealing in this setting. MetaFun-Kernel (with deep kernels), on the other hand, performed much better, as can be seen in Figure 4. We consider the cases of 5 or 15 context points and compare our predictions to those of the oracle GP. In both cases, our model gives predictions that closely match the oracle.
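For reference, the oracle GP predictions used for comparison can be computed in closed form. The sketch below uses standard RBF-kernel GP regression; the lengthscale and noise level are illustrative placeholders, since the actual values used to generate the data are not stated here.

```python
import numpy as np

def gp_posterior(xc, yc, xt, lengthscale=1.0, noise=0.1):
    """Oracle GP posterior: standard RBF-kernel GP regression with Gaussian
    observation noise. Returns predictive mean and standard deviation at xt."""
    k = lambda a, b: np.exp(-((a - b.T) ** 2) / (2.0 * lengthscale ** 2))
    K = k(xc, xc) + noise ** 2 * np.eye(len(xc))
    Ks, Kss = k(xt, xc), k(xt, xt)
    alpha = np.linalg.solve(K, yc)
    mean = Ks @ alpha
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None) + noise ** 2)
    return mean, std

xc = np.random.default_rng(0).uniform(-2.0, 2.0, (5, 1))
yc = np.sin(xc)
xt = np.linspace(-3.0, 3.0, 100)[:, None]
mu, std = gp_posterior(xc, yc, xt)  # reference mean and uncertainty band
```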


Table 2: Few-shot Classification Test Accuracy
miniImageNet 5-way miniImageNet 5-way
Models 1-shot 5-shot

(Without deep residual networks feature extraction):

Matching networks (Vinyals et al., 2016)
Meta-learner LSTM (Ravi and Larochelle, 2016)
MAML (Finn et al., 2017)
LLAMA (Grant et al., 2018) -
REPTILE (Nichol et al., 2018)
PLATIPUS (Finn et al., 2018) -
(Without data augmentation):
Meta-SGD (Li et al., 2017)
SNAIL (Mishra et al., 2018)
Bauer et al. (2017)
Munkhdalai et al. (2018)
TADAM (Oreshkin et al., 2018)
Qiao et al. (2018)
LEO
MetaFun-Attention
MetaFun-Kernel
(With data augmentation):
Qiao et al. (2018)
LEO
MetaOptNet-SVM (Lee et al., 2019)
MetaFun-Attention
MetaFun-Kernel
tieredImageNet 5-way tieredImageNet 5-way
Models 1-shot 5-shot
(Without deep residual networks feature extraction):
MAML (Finn et al., 2017)
Prototypical Nets (Snell et al., 2017)
Relation Net [in Liu et al. (2019)]
Transductive Prop. Nets (Liu et al., 2019)
(With deep residual networks feature extraction):
Meta-SGD
LEO
MetaOptNet-SVM
MetaFun-Attention
MetaFun-Kernel

4.2 Classification: miniImageNet and tieredImageNet

The miniImageNet dataset (Vinyals et al., 2016) consists of 100 classes selected randomly from the ILSVRC-12 dataset (Russakovsky et al., 2015), and each class contains 600 randomly sampled images. We follow the split in Ravi and Larochelle (2016), where the dataset is divided into training, validation, and test meta-sets with disjoint classes. The tieredImageNet dataset (Ren et al., 2018) contains a larger subset of ILSVRC-12, whose classes are grouped into higher-level nodes; these nodes are then divided into training, validation, and test meta-sets. This dataset is considered more challenging because the split is near the root of the ImageNet hierarchy (Ren et al., 2018). For both datasets, we use the pre-trained features provided by Rusu et al. (2019).

Following the commonly used experimental setting, each few-shot classification task consists of 5 randomly sampled classes from a meta-set. Within each class, we have either 1 example (1-shot) or 5 examples (5-shot) as context, and additional examples as target. For all experiments, hyperparameters are chosen by training on the training meta-set and comparing target accuracy on the validation meta-set. We conduct randomised hyperparameter search (Bergstra and Bengio, 2012); the search space is given in Table 4. With the model configured by the chosen hyperparameters, we then train on the union of the training and validation meta-sets and report final target accuracy on the test meta-set.

In Table 2 we compare our approach to other meta-learning methods. The numbers presented are the mean and standard deviation over independent runs. The table demonstrates that our model outperforms previous state-of-the-art on both 1-shot and 5-shot classification for the more challenging tieredImageNet. For miniImageNet, we note that previous work such as MetaOptNet-SVM (Lee et al., 2019) used significant data augmentation to regularise the model and hence achieved superior results. For a fair comparison, we therefore report accuracy both with and without data augmentation. Note, however, that MetaOptNet-SVM uses a different data augmentation scheme involving horizontal flips, random crops, and colour (brightness, contrast, and saturation) jitter, whereas MetaFun, Qiao et al. (2018) and LEO (Rusu et al., 2019) only use image features that average the representations of different crops and their horizontally mirrored versions. In the 1-shot case, MetaFun matches previous state-of-the-art performance, while in the 5-shot case we obtain significantly better results. Table 2 reports results for both MetaFun-Attention (using dot-product attention) and MetaFun-Kernel (using deep kernels). Although both demonstrate state-of-the-art performance, MetaFun-Kernel generally outperforms MetaFun-Attention on 5-shot problems but performs slightly worse on 1-shot problems.

4.3 Ablation Study


Table 3: Ablation study. We conduct an independent randomised hyperparameter search for each number presented, and report means and standard deviations over 5 independent runs for each.
Attention/ Local update Decoder MiniImageNet tieredImageNet
kernel function 1-shot 5-shot
Attention NN
Deep Kernel NN
Attention Gradient
Deep Kernel Gradient
SE Kernel NN
Deep Kernel Gradient
Figure 5: Accuracy of our approach for a varying number of iterations T, over different few-shot learning problems. For each problem, we use the same configuration of hyperparameters except for the number of iterations and the choice between attention and deep kernels. Error bars (standard deviations) are obtained by training the same model multiple times with different random seeds.

As stated in Section 3.3, our model has three learnable components: the local update function, the kernel/attention, and the decoder. In this section we explore the effects of using different versions of these components. We also investigate how the model performance would change with different numbers of iterations.

Table 3 demonstrates that the neural-network-parameterised local update function described in Section 3.2 consistently outperforms the gradient-based local update function, despite the latter having built-in inductive biases. Interestingly, the choice between attention and deep kernel is problem-dependent. We found that MetaFun with deep kernels usually performs better than MetaFun with attention on 5-shot classification tasks, but worse on 1-shot tasks. We conjecture that the deep kernel is better able than attention to fuse the information across the 5 images per class. In the comparative experiments in Section 4.2 we therefore reported results for both.

In addition, we investigate how a simple Squared Exponential (SE) kernel performs on these few-shot classification tasks. This corresponds to using an identity input transformation function in the deep kernel. Table 3 shows that the SE kernel is consistently worse than deep kernels, indicating that the heavily parameterised deep kernel is necessary for these problems.

Next, we looked into directly applying functional gradient descent with a parameterised deep kernel to these tasks. This corresponds to removing the decoder and using deep kernels with the gradient-based local update function (see Section 3.1). Unsurprisingly, this did not fare as well, given that it has only one trainable component (the deep kernel) and the updates are applied directly to the predictions rather than to a latent functional representation.

Finally, Figure 5 illustrates the effect of using different numbers of iterations T. On all few-shot classification tasks, using multiple iterations (two is often enough) significantly outperforms a single iteration. We also note that this performance gain diminishes as more iterations are added. In Section 4.2 we treated the number of iterations as a hyperparameter.

5 Conclusions and Future Work

In this paper, we propose a novel approach for meta-learning called MetaFun. The proposed approach learns to iteratively update task representations in function space. We evaluate it on both few-shot regression and classification tasks, and demonstrate that it matches or exceeds previous state-of-the-art results on the challenging miniImageNet and tieredImageNet.

Interesting extensions of our work include exploring a stochastic encoder, and hence stochastic functional representations akin to NPs, and not sharing the parameters of the local update function and the kernel/attention component across iterations. The additional flexibility could lead to further performance gains.

Acknowledgements

We would like to thank Jonathan Schwarz for valuable discussion. Jin Xu and Yee Whye Teh acknowledge funding from Tencent AI Lab through the Oxford-Tencent Collaboration on Large Scale Machine Learning project. Jean-Francois Ton is supported by the EPSRC and MRC through the OxWaSP CDT programme (EP/L016710/1).

References

  • Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
  • Andrychowicz et al. (2016) Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in neural information processing systems, pages 3981–3989, 2016.
  • Aronszajn (1950) Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American mathematical society, 68(3):337–404, 1950.
  • Bauer et al. (2017) Matthias Bauer, Mateo Rojas-Carulla, Jakub Bartłomiej Świątkowski, Bernhard Schölkopf, and Richard E Turner. Discriminative k-shot learning using probabilistic models. arXiv preprint arXiv:1706.00326, 2017.
  • Bergstra and Bengio (2012) James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
  • Berlinet and Thomas-Agnan (2011) Alain Berlinet and Christine Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.
  • Bloem-Reddy and Teh (2019) Benjamin Bloem-Reddy and Yee Whye Teh. Probabilistic symmetry and invariant neural networks. arXiv preprint arXiv:1901.06082, 2019.
  • Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR. org, 2017.
  • Finn et al. (2018) Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pages 9516–9527, 2018.
  • Garnelo et al. (2018a) Marta Garnelo, Dan Rosenbaum, Christopher Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo Rezende, and SM Ali Eslami. Conditional neural processes. In International Conference on Machine Learning, pages 1690–1699, 2018a.
  • Garnelo et al. (2018b) Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J Rezende, SM Eslami, and Yee Whye Teh. Neural processes. arXiv preprint arXiv:1807.01622, 2018b.
  • Gordon et al. (2019) Jonathan Gordon, Wessel P Bruinsma, Andrew YK Foong, James Requeima, Yann Dubois, and Richard E Turner. Convolutional conditional neural processes. arXiv preprint arXiv:1910.13556, 2019.
  • Grant et al. (2018) Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting gradient-based meta-learning as hierarchical bayes. In International Conference on Learning Representations, 2018.
  • Kim et al. (2019) Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Eslami, Dan Rosenbaum, Oriol Vinyals, and Yee Whye Teh. Attentive neural processes. In International Conference on Learning Representations, 2019.
  • Koch (2015) Gregory Koch. Siamese neural networks for one-shot image recognition. Master’s thesis, University of Toronto, 2015.
  • Lee et al. (2019) Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10657–10665, 2019.
  • Li et al. (2017) Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-sgd: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.
  • Liu et al. (2019) Yanbin Liu, Juho Lee, Minseop Park, Saehoon Kim, Eunho Yang, Sung Ju Hwang, and Yi Yang. Learning to propagate labels: Transductive propagation network for few-shot learning. In International Conference on Learning Representations, 2019.
  • Mason et al. (1999) Llew Mason, Jonathan Baxter, Peter L Bartlett, Marcus Frean, et al. Functional gradient techniques for combining hypotheses. Advances in Large Margin Classifiers. MIT Press, 1999.
  • Mishra et al. (2018) Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. In International Conference on Learning Representations, 2018.
  • Munkhdalai et al. (2018) Tsendsuren Munkhdalai, Xingdi Yuan, Soroush Mehri, and Adam Trischler. Rapid adaptation with conditionally shifted neurons. In International Conference on Machine Learning, pages 3661–3670, 2018.
  • Nichol et al. (2018) Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
  • Oreshkin et al. (2018) Boris Oreshkin, Pau Rodríguez López, and Alexandre Lacoste. Tadam: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pages 721–731, 2018.
  • Qiao et al. (2018) Siyuan Qiao, Chenxi Liu, Wei Shen, and Alan L Yuille. Few-shot image recognition by predicting parameters from activations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7229–7238, 2018.
  • Raposo et al. (2017) David Raposo, Adam Santoro, David Barrett, Razvan Pascanu, Timothy Lillicrap, and Peter Battaglia. Discovering objects and their relations from entangled scene representations. In Workshops at the International Conference on Learning Representations (ICLR), 2017.
  • Ravi and Larochelle (2016) Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2016.
  • Ren et al. (2018) Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B Tenenbaum, Hugo Larochelle, and Richard S Zemel. Meta-learning for semi-supervised few-shot classification. In International Conference on Learning Representations, 2018.
  • Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
  • Rusu et al. (2019) Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. In International Conference on Learning Representations, 2019.
  • Santoro et al. (2016) Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International conference on machine learning, pages 1842–1850, 2016.
  • Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
  • Sung et al. (2018) Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199–1208, 2018.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638, 2016.
  • Wagstaff et al. (2019) Edward Wagstaff, Fabian B Fuchs, Martin Engelcke, Ingmar Posner, and Michael Osborne. On the limitations of representing functions on sets. arXiv preprint arXiv:1901.09006, 2019.
  • Wilson et al. (2016) Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370–378, 2016.
  • Y. Guo and Williamson (2001) Y. Guo, P. Bartlett, A. Smola, and R. C. Williamson. Norm-based regularization of boosting. Submitted to Journal of Machine Learning Research, 2001.
  • Zaheer et al. (2017) Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R Salakhutdinov, and Alexander J Smola. Deep sets. In Advances in neural information processing systems, pages 3391–3401, 2017.


Appendix A Functional Gradient Descent

Functional gradient descent, just like gradient descent in parameter space, is an iterative optimisation algorithm for finding the minimum of a function. However, the object to be minimised is now itself a function of functions (a functional). Formally, a functional $\mathcal{L}: \mathcal{F} \to \mathbb{R}$ is a mapping from a function space $\mathcal{F}$ to a Euclidean space, and the minimiser of $\mathcal{L}$ in function space is denoted $f^{*} = \arg\min_{f \in \mathcal{F}} \mathcal{L}(f)$.

Just as gradient descent in parameter space takes steps proportional to the negative of the gradient, functional gradient descent updates $f$ by following the gradient in function space (the functional gradient). In this work, we only consider a special function space, an RKHS (Section A.1), and calculate functional gradients in the RKHS (Section A.2). The algorithm is described in Section A.3, with further details for specific loss functions.

A.1 Reproducing Kernel Hilbert Space

A Hilbert space extends the notion of Euclidean space by introducing an inner product, which captures the concept of distance or similarity in the space. An RKHS $\mathcal{H}$ is a Hilbert space of functions with the reproducing property: for every x there exists a unique $k_x \in \mathcal{H}$ such that the evaluation functional can be represented by taking the inner product of an element $f \in \mathcal{H}$ with $k_x$, formally written as:

$f(x) = \langle f, k_x \rangle_{\mathcal{H}}. \qquad (16)$

Since $k_x$ is also a function in $\mathcal{H}$, we can define a kernel function by letting

$k(x, x') = \langle k_x, k_{x'} \rangle_{\mathcal{H}}. \qquad (17)$

Using the properties of the inner product, it is easy to show that the kernel function is symmetric and positive definite. We call this function the reproducing kernel of the Hilbert space $\mathcal{H}$.

A.2 Functional Gradients

The functional derivative describes the rate of change of the output of a functional with respect to its input function. Formally, the functional derivative of $\mathcal{L}$ at $f$ in the direction $g$ is defined as:

$\mathrm{d}\mathcal{L}(f)(g) = \lim_{\epsilon \to 0} \frac{\mathcal{L}(f + \epsilon g) - \mathcal{L}(f)}{\epsilon}, \qquad (18)$

which is a function of $g$. This is known as the Fréchet derivative in a Banach space, of which the Hilbert space is a special case.

The functional gradient, denoted $\nabla_f \mathcal{L}(f)$, is related to the functional derivative by the following equation:

$\mathrm{d}\mathcal{L}(f)(g) = \langle \nabla_f \mathcal{L}(f),\, g \rangle_{\mathcal{H}}. \qquad (19)$

It is straightforward to compute the functional gradient of an evaluation functional $E_x(f) = f(x)$ in the RKHS thanks to the reproducing property (Equation 16):

$\mathrm{d}E_x(f)(g) = \lim_{\epsilon \to 0} \frac{(f + \epsilon g)(x) - f(x)}{\epsilon} = g(x) = \langle k_x, g \rangle_{\mathcal{H}}. \qquad (20)$

Therefore, the functional gradient of an evaluation functional is $k_x$ itself:

$\nabla_f E_x(f) = k_x, \qquad (21)$
$\text{i.e.}\quad \nabla_f f(x) = k(x, \cdot). \qquad (22)$
For a learning task with loss function $\ell$ and a context set $C = \{(x_i, y_i)\}_{i=1}^{N}$, the overall supervised loss on the context set can be written as:

$\mathcal{L}(f) = \sum_{i=1}^{N} \ell\big(f(x_i), y_i\big). \qquad (23)$

In this case, the functional gradient of $\mathcal{L}(f)$ can easily be calculated by applying the chain rule:

$\nabla_f \mathcal{L}(f) = \sum_{i=1}^{N} \partial_1 \ell\big(f(x_i), y_i\big)\, \nabla_f f(x_i) \qquad (24)$
$\phantom{\nabla_f \mathcal{L}(f)} = \sum_{i=1}^{N} \partial_1 \ell\big(f(x_i), y_i\big)\, k(x_i, \cdot). \qquad (25)$

This result matches Equation 5.

A.3 Functional Gradient Descent

To optimise the overall loss on the entire context set in Equation 23, we choose a suitable learning rate $\alpha$ and iteratively update $f$ with:

$f_{t+1} = f_t - \alpha\, \nabla_f \mathcal{L}(f_t) \qquad (26)$
$\phantom{f_{t+1}} = f_t - \alpha \sum_{i=1}^{N} \partial_1 \ell\big(f_t(x_i), y_i\big)\, k(x_i, \cdot). \qquad (27)$

In order to evaluate the final model at iteration $T$, we only need to compute

$f_T(x) = f_0(x) - \alpha \sum_{t=0}^{T-1} \sum_{i=1}^{N} \partial_1 \ell\big(f_t(x_i), y_i\big)\, k(x_i, x), \qquad (28)$

which does not depend on the values of $f_t$ outside the context at previous iterations $t < T$. In the case of a vector-valued function $f$, a matrix-valued kernel should be used. In this work, however, we only consider the simple case where the kernel produces a scalar and the corresponding matrix-valued kernel is that scalar times an identity matrix. It is straightforward to derive the updating rule for a specific loss function $\ell$. Below we consider two common cases: the mean squared error loss for regression tasks, and the cross-entropy loss for classification tasks, which motivates the parametric form of the local update function used in Section 3.3.

Mean Squared Error (MSE) loss for regression When the MSE loss is adopted in a regression task, the loss function is defined as:

$\ell\big(f(x), y\big) = \tfrac{1}{2}\big(f(x) - y\big)^2. \qquad (29)$

Hence for a context point $(x_i, y_i)$,

$\partial_1 \ell\big(f(x_i), y_i\big) = f(x_i) - y_i. \qquad (30)$

Note that this is simply the difference between the prediction and the label, which naturally describes how to change the prediction at $x_i$ in order to match the label $y_i$. In Figure 1 we use a simple 1D regression task with MSE loss to illustrate functional gradient descent.

Cross Entropy (CE) loss for classification When we use the cross-entropy loss for a $K$-way classification problem, the model predicts $K$-dimensional logits $f(x)$. In this case, the cross-entropy loss is

$\ell\big(f(x), y\big) = -\sum_{k=1}^{K} y_k \log \operatorname{softmax}\big(f(x)\big)_k, \qquad (31)$

where $y$ is the one-hot label for x.

Applying the chain rule, the functional gradient of the loss can be calculated as:

$\nabla_f \mathcal{L}(f) = \sum_{i=1}^{N} \Big(\operatorname{softmax}\big(f(x_i)\big) - y_i\Big)\, k(x_i, \cdot). \qquad (32)$

This form has a structure similar to the local update function we use for classification in Equation 14. The connection becomes clear if we let:

(33)

and rewrite Equation 32 as:

(34)

As our approach can be seen as extending functional gradient descent, Equation 34 motivates our design of the local update function for classification problems.
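As a quick sanity check of the closed form in Equation 32, the per-context-point term is the softmax of the logits minus the one-hot label. The snippet below verifies this against finite differences on illustrative values.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_grad_wrt_logits(logits, y_onehot):
    """Gradient of -sum_k y_k log softmax(f)_k with respect to the logits f:
    softmax(f) - y, i.e. the per-point term appearing in Equation 32."""
    return softmax(logits) - y_onehot

# finite-difference check of the closed form on arbitrary values
logits = np.array([0.2, -1.0, 0.7])
y = np.array([0.0, 1.0, 0.0])
ce = lambda f: -np.sum(y * np.log(softmax(f)))
eps = 1e-6
fd = np.array([(ce(logits + eps * np.eye(3)[i]) - ce(logits - eps * np.eye(3)[i]))
               / (2 * eps) for i in range(3)])
assert np.allclose(fd, ce_grad_wrt_logits(logits, y), atol=1e-5)
```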

Appendix B Experimental Details

All experiments in this work are implemented in TensorFlow (Abadi et al., 2015), and the code will be released upon publication. For miniImageNet and tieredImageNet, we conduct randomised hyperparameter search (Bergstra and Bengio, 2012) for hyperparameter tuning. A number of hyperparameter configurations are sampled for each problem, and the best is chosen according to validation performance on the validation meta-set. The considered ranges of hyperparameters are given in Table 4. The hyperparameter configurations chosen to report the final classification accuracies are recorded in Table 5 for reference. For regression tasks, we simply use the hyperparameters presented in Table 6 for both the attention and deep-kernel versions of our approach.
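A sketch of the randomised search loop is given below; the sampled ranges echo the entries of Table 4 where they are legible, and the number of sampled configurations is a placeholder, since the exact count is not shown above.

```python
import numpy as np

def sample_config(rng):
    """One draw of randomised hyperparameter search; ranges follow Table 4
    where stated, otherwise they are illustrative placeholders."""
    return {
        "num_iters": int(rng.integers(2, 7)),          # randint(2, 7), upper bound exclusive
        "nn_layers": int(rng.integers(2, 4)),          # randint(2, 4)
        "dropout_rate": float(rng.uniform(0.0, 0.5)),  # uniform(0.0, 0.5)
        "initial_state": str(rng.choice(["zero", "constant", "parametric"])),
    }

rng = np.random.default_rng(0)
configs = [sample_config(rng) for _ in range(64)]  # number of draws is a placeholder
# train one model per configuration and keep the best on the validation meta-set
```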

[] Components Architecture Shared MLP nn-sizes nn-layers MLP for positive labels nn-sizes nn-layers MLP for negative labels nn-sizes nn-layers Key/query transformation MLP dim embedding-layers Decoder linear with output dimension dim Hyperparameters Considered Range num-iters randint(2, 7) nn-layers randint(2, 4) embedding-layers randint(1, 3) nn-sizes dim-reprs nn-sizes Initial representation [zero, constant, parametric] Outer learning rate Initial inner learning rate Dropout rate uniform(0.0, 0.5) Orthogonality penalty weight L2 penalty weight Label smoothing

Table 4: Considered ranges of hyperparameters. The random generators such as randint or uniform follow numpy.random syntax, so the first argument is inclusive while the second is exclusive. Whenever a list is given, a value is sampled uniformly from the list. The listed MLPs are followed by a linear transformation with an output dimension of dim-reprs.

[]

Table 5: Results of the randomised hyperparameter search. Hyperparameters shown in this table are not guaranteed to be optimal within the considered range, because we conduct randomised hyperparameter search. Models configured with these hyperparameters perform reasonably well, and we used them to report the final results compared to other methods. Dropout is only applied to the inputs. Orthogonality penalty weight and L2 penalty weight are used in exactly the same way as in Rusu et al. (2019) and their released code at https://github.com/deepmind/leo. The inner learning rate is trainable, so only an initial inner learning rate is given in the table.
miniImageNet tieredImageNet
Hyperparameters (MetaFun-Attention) -shot -shot -shot -shot
num-iters
nn-layers
embedding-layers
nn-sizes
Initial state zero constant constant constant
Outer learning rate
Initial inner learning rate
Dropout rate
Orthogonality penalty weight
L2 penalty weight
Label smoothing
miniImageNet tieredImageNet
Hyperparameters (MetaFun-Kernel) -shot -shot -shot -shot
num-iters
nn-layers
embedding-layers
nn-sizes
Initial state zero parametric parametric zero
Outer learning rate
Initial inner learning rate
Dropout rate
Orthogonality penalty weight
L2 penalty weight
Label smoothing

[] Components Architecture Local update function nn-sizes nn-layers Key/query transformation MLP nn-sizes embedding-layers Decoder nn-sizes nn-layers Predictive model nn-sizes (nn-layers-1) Hyperparameters Considered Range num-iters nn-layers embedding-layers nn-sizes dim-reprs nn-sizes Initial representation zero Outer learning rate Initial inner learning rate Dropout rate Orthogonality penalty weight L2 penalty weight

Table 6: Hyperparameters for regression tasks. The local update function and the predictive model are followed by linear transformations with output dimensions of dim-reprs and dim(y), respectively.