
Task-similarity Aware Meta-learning through Nonparametric Kernel Regression

Meta-learning refers to the process of abstracting a learning rule for a class of tasks through a meta-parameter that captures the inductive bias for the class. The meta-parameter is used to achieve fast adaptation to unseen tasks from the class, given a few training samples. While meta-learning implicitly assumes the tasks to be similar, it is generally unclear how this similarity could be quantified. Further, many of the popular meta-learning approaches do not actively use such a task-similarity in solving for the tasks. In this paper, we propose the task-similarity aware nonparametric meta-learning algorithm that explicitly employs similarity/dissimilarity between tasks using nonparametric kernel regression. Our approach models the task-specific parameters to lie in a reproducing kernel Hilbert space, wherein the kernel function captures the similarity across tasks. The proposed algorithm iteratively learns a meta-parameter which is used to assign a task-specific descriptor for every task. The task descriptors are then used to quantify the similarity through the kernel function. We show how our approach generalizes the popular meta-learning approaches of model-agnostic meta-learning (MAML) and meta-stochastic gradient descent (Meta-SGD). Numerical experiments with regression tasks show that our algorithm performs well even in the presence of outlier or dissimilar tasks, validating the proposed approach.



1 Introduction

Meta-learning seeks to abstract a general learning rule to help solve a class of learning problems or tasks, from the knowledge of a set of training tasks Finn and Levine (2018); Denevi et al. (2018). The setting is that the data available for solving each of these tasks is often severely limited, restricting the achievable performance on the tasks when solved individually. By abstracting similarity across tasks, meta-learning aims to perform well not just on the given set of tasks, but on the entire class of tasks to which they belong. This also sets it apart from the transfer learning paradigm, where the focus is to transfer a well-performing network from one domain to another Pan and Yang (2010). Depending on how the meta information is defined and abstracted, meta-learning approaches come under three broad categories: optimization based Ravi and Larochelle (2017); Finn et al. (2017, 2018), metric-learning based Vinyals et al. (2016); Rusu et al. (2019), and model based Santoro et al. (2016); Snell et al. (2017); Mishra et al. (2018). Though meta-learning approaches work on the assumption of the tasks being similar or belonging to a class, task-similarity is typically not explicitly employed in the meta-learning algorithms, particularly in the optimization-based meta-learning algorithms. In many practical applications, it is realistic to assume that not all the tasks are very similar and that outlier or dissimilar tasks are present. In such cases, one expects that incorporating a metric of similarity explicitly would enable meta-learning to better adapt to variations among tasks, especially when the number of tasks available for training is limited. In this work, we address this particular issue by proposing a meta-learning algorithm that explicitly incorporates task-similarity. In this sense, our algorithm becomes a combination of both optimization-based and metric-learning based meta-learning.

Our contribution is a novel meta-learning algorithm called the Task-similarity Aware Nonparametric Meta-Learning (TANML) which:

  • Explicitly employs similarity across the tasks in fast adaptation to tasks. The parameters for a given task are obtained by considering the information from other tasks and weighing them according to their similarity/dissimilarity.

  • Models the task-specific parameters to lie in a reproducing kernel Hilbert space (RKHS) associating the tasks through a nonparametric kernel regression. The meta-parameters perform the role of selecting the RKHS that best describes the observed tasks and gives the regression coefficients to relate them through the kernels.

  • Uses a particular strategy for defining the RKHS by assigning a task-descriptor to every task, which is then used to quantify similarity/dissimilarity among tasks. This is obtained by viewing meta-learning through the lens of linear/kernel regression and then generalizing to nonparametric regression.

  • Offers a fairly general framework that admits many possible variants. Though we use a particular form of the parameterized kernels with task-descriptors, the underlying RKHS task-similarity aware framework is a very general one.

1.1 Mathematical overview of the proposed algorithm

We now give a mathematical overview of TANML. Consider a class of tasks $\mathcal{T}$, where each task $\mathcal{T}_i$ has input-output data in the form of a training set $\mathcal{D}_i^{tr}$ and a test set $\mathcal{D}_i^{te}$. Let $f(x, \theta_i)$ denote the parametric model or the mapping which we wish to learn, assumed to be of the same form for every task in $\mathcal{T}$. Here $\theta_i \in \mathbb{R}^D$ denotes the parameter for the task indexed by $i$, obtained by minimizing a loss function $\mathcal{L}(\theta_i, \mathcal{D}_i^{tr})$. In the case of a neural network, for example, $f$ is the output of the neural network, $\mathcal{L}$ the cross entropy or the mean-squared error, and $\theta_i$ denotes the vector of all the learnt network weights. We further assume we have access to a set $\mathcal{T}_{tr}$ of tasks for meta-training for which both $\mathcal{D}_i^{tr}$ and $\mathcal{D}_i^{te}$ are known. For the unseen tasks, only $\mathcal{D}_i^{tr}$ is known. Then, the TANML approach models the task-specific parameters $\theta_i$ as:

$$\theta_i = \sum_{i' \in \mathcal{T}_{tr}} \Psi_{i'}\, k_{\theta_0}(i, i'), \tag{1}$$

where

  • $k_{\theta_0}(i, i')$ is the parameterized kernel function capturing similarity between the $i$th and $i'$th tasks,

  • $\theta_0$ is the parameter vector that defines the kernel and the associated reproducing kernel Hilbert space (RKHS),

  • $\Psi_{i'}$ denotes the kernel regression coefficients for the $i'$th task, and

  • $\theta_0$ and $\Psi$ are the learnt meta-parameters.

In order that the algorithm is meaningful, the kernel function $k_{\theta_0}$ must be a function of the task losses $\mathcal{L}(\cdot, \mathcal{D}_i^{tr})$ and/or their derivatives. Equation (1) models the task-specific parameters to lie in an RKHS defined by the kernel function: the similarity between the task parameters is measured by the kernel function. Every choice of the meta-parameter $\theta_0$ defines the associated RKHS in which the tasks (their parameters more specifically) lie. Thus, TANML achieves meta-learning in two steps: it selects the RKHS that best describes the given set of training tasks, and predicts the optimal parameters for a new unseen task using the learnt kernel coefficients for the selected RKHS.

The meta-parameters are learnt to minimize the test-loss of the seen tasks in $\mathcal{T}_{tr}$:

$$\min_{\theta_0, \Psi} \sum_{i \in \mathcal{T}_{tr}} \mathcal{L}(\theta_i, \mathcal{D}_i^{te}), \tag{2}$$

where $\theta_i$ is as given in (1). As with MAML, the TANML meta-parameters are computed iteratively through one-step gradient descent with step size $\beta$:

$$[\theta_0, \Psi] \leftarrow [\theta_0, \Psi] - \beta\, \nabla_{[\theta_0, \Psi]} \sum_{i \in \mathcal{T}_{tr}} \mathcal{L}(\theta_i, \mathcal{D}_i^{te}). \tag{3}$$

We note here that while our RKHS based framework is a general one, in this work we take a specific approach obtained by viewing MAML/Meta-SGD from the lens of linear regression. We shall see that such a view directly results in the definition of a task-descriptor used to quantify the similarity/dissimilarity between tasks through kernels. The task descriptor turns out to be a function of the task loss gradient. We wish to emphasize here that the central aim of this work is to propose a meta-learning algorithm and an associated general meta-learning framework for incorporating task-similarities bringing together optimization-based and metric-based meta-learning. The experiments that we consider serve the goal of illustrating the potential of the approach, and are in no way exhaustive.

1.2 A motivating example

Let us consider a class of tasks $\mathcal{T}$, each task $\mathcal{T}_i$ with input $x$ and output $y$ functionally related by an unknown process $y = g_i(x)$, where $g_i$ has the same functional form for every task. We are given a set of training tasks $\mathcal{T}_{tr}$. For every task, we have access to only a very small number of input-output data-points $(x_n, y_n)$, with the inputs $x_n$ drawn randomly from a fixed interval. Given this data, our goal is to learn a neural network (NN) $f(x, \theta_i)$ with weights $\theta_i$ to predict the output, for every task. The NN is trained by minimizing the training error:

$$\mathcal{L}(\theta_i, \mathcal{D}_i^{tr}) = \sum_{(x_n, y_n) \in \mathcal{D}_i^{tr}} \big(y_n - f(x_n, \theta_i)\big)^2.$$

Clearly, training for the tasks individually with a descent algorithm will result in a NN that overfits to $\mathcal{D}_i^{tr}$ and generalizes poorly to $\mathcal{D}_i^{te}$. MAML-type optimization-based approaches Finn et al. (2017) solve this by inferring the information across tasks in $\mathcal{T}_{tr}$ in the form of a good initialization $\theta_0$ for the NN weights, which is specialized/adapted to obtain $\theta_i$ for the task $\mathcal{T}_i$ as

$$\theta_i = \theta_0 - \alpha\, \nabla_{\theta_0} \mathcal{L}(\theta_0, \mathcal{D}_i^{tr}).$$

MAML obtains the meta-parameter $\theta_0$ by iteratively taking a gradient descent step with respect to the test loss on the training tasks, given by $\sum_{i \in \mathcal{T}_{tr}} \mathcal{L}(\theta_i, \mathcal{D}_i^{te})$. The NN weights for a task are obtained without using any information from the other seen tasks directly, except for the learnt initialization $\theta_0$. As a result, though initialized similarly, it is not clear how similar/dissimilar the NN weights for the tasks would be. If a task is less similar to the majority of tasks used for the meta-training, one could expect the NN with predicted weights to perform poorly on test data.

In contrast, we observe from (1) that our approach explicitly uses the similarity between the tasks to predict $\theta_i$ for a new task: we predict the NN weights for a new task by weighting the information $\Psi_{i'}$ from all the training tasks by their task-similarity. TANML can be likened to an interpolation across the tasks, with the kernel function $k_{\theta_0}(i, i')$ as the interpolating basis and $\Psi_{i'}$ as the interpolation coefficients. TANML learns both $\Psi$, the information from the seen/training tasks, and the metric-space in which the NN weights are best described to lie (through $\theta_0$). As a result, we expect the predicted NN for the new task to perform well even when the tasks are not too similar. In our numerical experiments in Section 4, we consider the case of $g_i$ being the sinusoidal function.

1.3 Related work

The structural characterization of tasks and the use of task-dependent knowledge have gained interest in meta-learning recently. In Edwards and Storkey (2017), a variational autoencoder based approach was employed to generate task/dataset statistics used to measure similarity. In Ruder and Plank (2017), domain similarity and diversity measures were considered in the context of transfer learning. The study of how task properties affect catastrophic forgetting in continual learning was pursued in Nguyen et al. (2019). In Lee et al. (2020), the authors proposed a task-adaptive meta-learning approach for classification that adaptively balances meta-learning and task-specific learning differently for every task and class. It was shown in Oreshkin et al. (2018) that the performance of few-shot learning improves significantly with the use of task-dependent metrics. While the use of kernels or similarity metrics is not new in meta-learning, they are typically seen in the context of defining relations between the classes or samples within a given task Vinyals et al. (2016); Snell et al. (2017); Oreshkin et al. (2018); Fortuin and Rätsch (2019); Goo and Niekum (2020). Information-theoretic ideas have also been used in the study of the topology and the geometry of task spaces Nguyen et al. (2019); Achille et al. (2018). In Achille et al. (2019), the authors construct vector representations for tasks using partially trained probe networks, based on which task-similarity metrics are developed. Task descriptors have been of interest especially in vision-related tasks in the context of transfer learning Zamir et al. (2018); Achille et al. (2019); Tran et al. (2019).

2 Review of MAML and Meta-SGD

We first review MAML and Meta-SGD approaches and highlight the relevant aspects necessary for our discussion. We shall then show how these approaches lead to the definition of a generalized meta-SGD and consequently, to the TANML.

2.1 MAML

Model-agnostic meta-learning proceeds in two stages. First is the specialization or adaptation of the meta-parameter $\theta_0$ to obtain $\theta_i$ for task $\mathcal{T}_i$ (referred to as the inner-loop update), achieved by a gradient descent step on the training loss $\mathcal{L}(\theta_0, \mathcal{D}_i^{tr})$. Second is the update of $\theta_0$, achieved by running a gradient descent over $\theta_0$ using the total test-loss $\sum_{i \in \mathcal{T}_{tr}} \mathcal{L}(\theta_i, \mathcal{D}_i^{te})$ (referred to as the outer-loop update). The meta-training phase of MAML is described in the following algorithm:

for # meta-iterations do
       for $i \in \mathcal{T}_{tr}$ do
             [Inner-loop update] $\theta_i = \theta_0 - \alpha\, \nabla_{\theta_0} \mathcal{L}(\theta_0, \mathcal{D}_i^{tr})$
       end for
       [Outer-loop update] $\theta_0 = \theta_0 - \beta\, \nabla_{\theta_0} \sum_{i \in \mathcal{T}_{tr}} \mathcal{L}(\theta_i, \mathcal{D}_i^{te})$
end for
Algorithm 1 Model agnostic meta-learning

The inner-loop update is performed over the training set of the training tasks in $\mathcal{T}_{tr}$. The outer-loop update is performed over the corresponding test datasets by evaluating the loss function at the task parameter values obtained from the inner loop. $\alpha$ and $\beta$ are the learning rates. Thus, MAML performs meta-learning by learning the inductive bias $\theta_0$ which can be used for fast adaptation (through a single gradient step) to a new task. Once the meta-training phase is complete, the parameters for the test task are obtained by applying the inner loop to the training dataset of the test task. We note here that the MAML described in Algorithm 1 is the efficient first-order MAML Finn et al. (2018) as opposed to the general MAML where the inner loop may contain several gradient descent steps. We shall hereafter be referring to the first-order MAML when we talk of MAML in our analysis. A schematic of MAML is presented in Figure 1.
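The two loops of MAML can be illustrated with a minimal sketch in plain Python. Everything task-specific here is an assumption made for illustration and does not come from the experiments of this paper: the model is a two-parameter linear model rather than a neural network, the toy tasks are lines with random slopes, and the step sizes `alpha` and `beta` are arbitrary. Because the model is linear, the outer gradient can be written exactly through the single inner step using the chain rule.

```python
import random

def mse(theta, data):
    """Mean-squared error of the linear model y = theta[0] + theta[1] * x."""
    return sum((theta[0] + theta[1] * x - y) ** 2 for x, y in data) / len(data)

def mse_grad(theta, data):
    """Gradient of the mean-squared error with respect to theta."""
    g0 = g1 = 0.0
    for x, y in data:
        err = theta[0] + theta[1] * x - y
        g0 += 2.0 * err / len(data)
        g1 += 2.0 * err * x / len(data)
    return [g0, g1]

def inner_update(theta0, train, alpha):
    """Inner loop: one gradient step from theta0 on the task's training set."""
    g = mse_grad(theta0, train)
    return [t - alpha * gi for t, gi in zip(theta0, g)]

def outer_gradient(theta0, tasks, alpha):
    """Gradient of the summed test loss w.r.t. theta0, through the inner step."""
    total = [0.0, 0.0]
    for train, test in tasks:
        theta_i = inner_update(theta0, train, alpha)
        g_te = mse_grad(theta_i, test)
        n = len(train)
        # Hessian H of the training loss for this linear model (symmetric 2x2).
        h00 = 2.0
        h01 = 2.0 * sum(x for x, _ in train) / n
        h11 = 2.0 * sum(x * x for x, _ in train) / n
        # Chain rule: d(theta_i)/d(theta0) = I - alpha * H.
        total[0] += (1 - alpha * h00) * g_te[0] - alpha * h01 * g_te[1]
        total[1] += -alpha * h01 * g_te[0] + (1 - alpha * h11) * g_te[1]
    return total

def meta_loss(theta0, tasks, alpha):
    """Total test loss after adapting theta0 to every task."""
    return sum(mse(inner_update(theta0, tr, alpha), te) for tr, te in tasks)

def make_task(rng, slope):
    """Hypothetical toy task: 5 training and 5 test points on a line."""
    pts = [(x, slope * x + 1.0) for x in (rng.uniform(-1, 1) for _ in range(10))]
    return pts[:5], pts[5:]

rng = random.Random(0)
tasks = [make_task(rng, rng.uniform(1.0, 3.0)) for _ in range(8)]
theta0, alpha, beta = [0.0, 0.0], 0.1, 0.02
loss_before = meta_loss(theta0, tasks, alpha)
for _ in range(300):  # outer loop over meta-iterations
    g = outer_gradient(theta0, tasks, alpha)
    theta0 = [t - beta * gi for t, gi in zip(theta0, g)]
loss_after = meta_loss(theta0, tasks, alpha)
```

With a neural network the same structure applies, but the gradients would come from automatic differentiation instead of the closed-form expressions used in this sketch.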

2.2 Meta-SGD

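Meta stochastic gradient descent (Meta-SGD) is a variant of MAML which learns the component-wise step sizes for the inner-loop update along with $\theta_0$. Let $\alpha \in \mathbb{R}^D$ denote the vector of step sizes for the different components of $\theta_0$. Then, the meta-training process for Meta-SGD is given by:

for # meta-iterations do
       for $i \in \mathcal{T}_{tr}$ do
             [Inner-loop update] $\theta_i = \theta_0 - \alpha \odot \nabla_{\theta_0} \mathcal{L}(\theta_0, \mathcal{D}_i^{tr})$
       end for
       [Outer-loop update] $[\theta_0, \alpha] = [\theta_0, \alpha] - \beta\, \nabla_{[\theta_0, \alpha]} \sum_{i \in \mathcal{T}_{tr}} \mathcal{L}(\theta_i, \mathcal{D}_i^{te})$
end for
Algorithm 2 Meta-stochastic gradient descent

where the operator $\odot$ denotes the point-wise vector product. The outer-loop gradient is taken with respect to $[\theta_0, \alpha]$. Notice that the inner-loop update is expressible as

$$\theta_i = \big[\, I_D \;\; \mathrm{diag}(\alpha) \,\big] \begin{bmatrix} \theta_0 \\ -\nabla_{\theta_0} \mathcal{L}(\theta_0, \mathcal{D}_i^{tr}) \end{bmatrix} \triangleq W_\alpha^\top z_i(\theta_0). \tag{4}$$

Thus, the task estimate $\theta_i$ can be viewed as the output of a linear regression which takes $z_i(\theta_0)$ as the input, for every $i$. Once $W_\alpha$ is known or estimated from training data, the parameters for any task may be obtained by computing the corresponding $z_i(\theta_0)$ and then applying the linear regression matrix $W_\alpha$. Consequently, Meta-SGD can be seen as a special case of the more general algorithm given by:

for # meta-iterations do
       for $i \in \mathcal{T}_{tr}$ do
             [Inner-loop update] $\theta_i = W^\top z_i(\theta_0)$
       end for
       [Outer-loop update] $[\theta_0, W] = [\theta_0, W] - \beta\, \nabla_{[\theta_0, W]} \Big( \sum_{i \in \mathcal{T}_{tr}} \mathcal{L}(\theta_i, \mathcal{D}_i^{te}) + \mathcal{R}(W) \Big)$
end for
where $z_i(\theta_0)$ is the vector defined in (4),
Algorithm 3 Generalized Meta-SGD

where setting $W = W_\alpha$ as in (4), with no regularization, results in Meta-SGD. $\mathcal{R}(W)$ is a regularization on the matrix $W \in \mathbb{R}^{2D \times D}$. For example, $\mathcal{R}(W)$ could be $\|W\|_F$, the Frobenius norm of $W$. We refer to this new formulation as the Generalized Meta-SGD. The parameter predicted by the Generalized Meta-SGD for any task is obtained as the output of the linear regression $\theta_i = W^\top z_i(\theta_0)$. In other words, we have established the similarity of the inner-loop update of MAML/Meta-SGD to that of a linear regression which associates the parameter $\theta_i$ as the output to the vector $z_i(\theta_0)$ as input. We shall hereafter refer to $z_i(\theta_0)$ as the task descriptor of the $i$th task.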
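The equivalence between the Meta-SGD inner-loop update and a linear regression on a stacked vector can be checked numerically. The sketch below assumes one particular stacking convention, namely the descriptor $z = [\theta_0; -\nabla\mathcal{L}]$ and the matrix $W_\alpha = [I; \mathrm{diag}(\alpha)]$; the function names and the numeric values are hypothetical and chosen only for illustration.

```python
def task_descriptor(theta0, grad):
    """Stack theta0 and the negative loss gradient into a 2D-dimensional vector."""
    return list(theta0) + [-g for g in grad]

def w_alpha(alpha):
    """Build the (2D x D) matrix [I ; diag(alpha)] as a list of rows."""
    d = len(alpha)
    eye = [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]
    diag = [[alpha[r] if r == c else 0.0 for c in range(d)] for r in range(d)]
    return eye + diag

def linear_regression_update(w, z):
    """Compute theta = W^T z for W stored as a list of rows."""
    d = len(w[0])
    return [sum(w[r][c] * z[r] for r in range(len(z))) for c in range(d)]

theta0 = [0.5, -1.0, 2.0]
grad = [0.2, 0.4, -0.6]    # stands in for the training-loss gradient at theta0
alpha = [0.1, 0.01, 0.5]   # component-wise step sizes

# Direct Meta-SGD inner-loop update: theta0 - alpha (element-wise) grad.
meta_sgd = [t - a * g for t, a, g in zip(theta0, alpha, grad)]

# The same update written as a linear regression on the task descriptor.
regression = linear_regression_update(w_alpha(alpha), task_descriptor(theta0, grad))
```

Replacing the structured matrix by an arbitrary learnable matrix of the same shape gives the generalized form, at the cost of a number of coefficients that grows quadratically with the parameter dimension.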

3 Task-similarity Aware Meta-Learning

It is well known that the expressive power of linear regression is limited due to both its linear nature and the finite dimension of the input. Further, since the dimension of the linear regression matrix $W$ grows quadratically with the dimension of $\theta_0$, a large amount of training data becomes necessary to estimate it. A transformation of linear regression in the form of 'kernel substitution' or the 'kernel trick' results in the more general nonparametric or kernel regression Bishop (2006); Schölkopf and Smola (2002). Kernel regression essentially performs a linear regression in an infinite-dimensional space, making it a nonparametric regression approach. Kernel regression is known to be less prone to overfitting and more robust in capturing variations in the data than its linear counterpart even with limited training samples Bishop (2006). Then, as with the linear regression, by viewing the task parameter $\theta_i$ as the predicted target and the task descriptor $z_i(\theta_0)$ as the input, we propose the following nonparametric or kernel regression model:

$$\theta_i = \sum_{i' \in \mathcal{T}_{tr}} \Psi_{i'}\, k\big(z_i(\theta_0), z_{i'}(\theta_0)\big), \tag{5}$$

where $\Psi_{i'} \in \mathbb{R}^D$, $i' \in \mathcal{T}_{tr}$, are the kernel regression coefficients that must be estimated from data. Kernel regression models the task-specific parameters to be lying in a reproducing kernel Hilbert space defined by the kernel $k$, parametrized through $\theta_0$, which enters the kernel through the task descriptors. On returning to our discussion in the mathematical overview of TANML and comparing (1) and (5), we observe that

$$k_{\theta_0}(i, i') = k\big(z_i(\theta_0), z_{i'}(\theta_0)\big),$$

that is, our particular view of MAML/Meta-SGD and the task descriptor results in TANML where the meta-parameter is given by $\theta_0$, the inductive bias of MAML. Thus, the MAML line of thought helps us arrive at a well-defined choice of the kernel through the task descriptors defined in (4). We delineate the meta-training for TANML in Algorithm 4.

for # meta-iterations do
       for $i \in \mathcal{T}_{tr}$ do
             [Inner-loop update] $\theta_i = \sum_{i' \in \mathcal{T}_{tr}} \Psi_{i'}\, k\big(z_i(\theta_0), z_{i'}(\theta_0)\big)$
       end for
       [Outer-loop update] $[\theta_0, \Psi] = [\theta_0, \Psi] - \beta\, \nabla_{[\theta_0, \Psi]} \Big( \sum_{i \in \mathcal{T}_{tr}} \mathcal{L}(\theta_i, \mathcal{D}_i^{te}) + \mathcal{R}(\Psi) \Big)$
end for
where $z_i(\theta_0) = \big[\theta_0^\top, \; -\nabla_{\theta_0} \mathcal{L}(\theta_0, \mathcal{D}_i^{tr})^\top\big]^\top$.
Algorithm 4 TANML

As with the Generalized Meta-SGD, $\mathcal{R}(\Psi)$ could be some suitable regularization on $\Psi$. In our analysis, we use the commonly used regularization (which is the sum of the squared norms of the task parameters in the RKHS, cf. Schölkopf and Smola (2002); Bishop (2006)) given by:

$$\mathcal{R}(\Psi) = \mathrm{tr}\big(\Psi^\top K \Psi\big),$$

where $\Psi$ is the matrix with rows $\Psi_{i'}^\top$ and $K$ is the kernel matrix for the training tasks such that $K_{i,i'} = k\big(z_i(\theta_0), z_{i'}(\theta_0)\big)$. We wish to reiterate that in general any parametrized kernel may be used for TANML in Algorithm 4 through Equation (5). We use the particular choice of the kernel with task descriptors because it follows directly from the MAML-type analysis. A schematic describing the task-descriptor based TANML and the intuition behind its working is shown in Figure 1.
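The kernel-regression step that predicts task parameters from task descriptors can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: the descriptor is taken to stack $\theta_0$ with the negative training-loss gradient, the cosine kernel is used, the regularizer is taken to be the trace form $\mathrm{tr}(\Psi^\top K \Psi)$, and all numeric values (gradients, coefficients) are made up.

```python
import math

def cosine_kernel(z1, z2):
    """Cosine similarity between two descriptors."""
    dot = sum(a * b for a, b in zip(z1, z2))
    n1 = math.sqrt(sum(a * a for a in z1))
    n2 = math.sqrt(sum(b * b for b in z2))
    return dot / (n1 * n2)

def descriptor(theta0, grad):
    """Assumed task descriptor: theta0 stacked with the negative loss gradient."""
    return list(theta0) + [-g for g in grad]

def tanml_predict(z_new, train_descriptors, psi, kernel):
    """Kernel regression: theta = sum over training tasks of psi_row * k(z_new, z_tr)."""
    d = len(psi[0])
    theta = [0.0] * d
    for z_tr, psi_row in zip(train_descriptors, psi):
        k = kernel(z_new, z_tr)
        for j in range(d):
            theta[j] += k * psi_row[j]
    return theta

def rkhs_regularizer(train_descriptors, psi, kernel):
    """Trace form tr(Psi^T K Psi) = sum_{i,j} K[i][j] * <psi_i, psi_j>."""
    total = 0.0
    for zi, pi in zip(train_descriptors, psi):
        for zj, pj in zip(train_descriptors, psi):
            total += kernel(zi, zj) * sum(a * b for a, b in zip(pi, pj))
    return total

theta0 = [1.0, 0.0]
# Hypothetical training-loss gradients: two similar tasks and one dissimilar task.
train_grads = [[0.2, 0.1], [0.25, 0.05], [-3.0, 2.0]]
train_z = [descriptor(theta0, g) for g in train_grads]
psi = [[0.5, 0.1], [0.4, 0.2], [-1.0, 0.3]]  # one coefficient row per training task

z_new = descriptor(theta0, [0.22, 0.08])     # new task close to the first two
weights = [cosine_kernel(z_new, z) for z in train_z]
theta_new = tanml_predict(z_new, train_z, psi, cosine_kernel)
reg = rkhs_regularizer(train_z, psi, cosine_kernel)
```

In this toy example the new task receives kernel weights close to one for the two similar training tasks and a much smaller weight for the dissimilar one, which is the behaviour the interpolation view describes.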

Figure 1: Left: Schematic of MAML.
Right: Schematic of the Task-similarity Aware Nonparametric Meta-Learning. Only the computation of is shown to keep the diagram uncluttered.

On the choice of kernels and sequential training

While the expressive power of kernels is immense, it is also known that the performance could vary depending on the choice of the kernel function Schölkopf and Smola (2002). The kernel function that works best for a dataset is usually found by trial and error. A possible approach is to use multi-kernel regression where one lets the data decide which of the pre-specified set of kernels are relevant Sonnenburg and Schäfer (2005); Gönen and Alpaydin (2011). Domain-specific knowledge may also be incorporated in the choice of kernels. In our analysis, we use two of the popular kernel functions: the Gaussian or the radial basis function (RBF) kernel, and the cosine kernel.
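The two kernel choices can be written down in a few lines. The sketch below is illustrative only: the scaling of the Gaussian kernel (division by `sigma2` rather than by twice that value) is an assumption, and the example descriptors are arbitrary vectors.

```python
import math

def rbf_kernel(z1, z2, sigma2=1.0):
    """Gaussian (RBF) kernel; the normalization by sigma2 is an assumed convention."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(z1, z2))
    return math.exp(-sq_dist / sigma2)

def cosine_kernel(z1, z2):
    """Cosine kernel: inner product of the two descriptors after normalization."""
    dot = sum(a * b for a, b in zip(z1, z2))
    n1 = math.sqrt(sum(a * a for a in z1))
    n2 = math.sqrt(sum(b * b for b in z2))
    return dot / (n1 * n2)

def kernel_matrix(descriptors, kernel):
    """Pairwise kernel evaluations between all descriptors."""
    return [[kernel(zi, zj) for zj in descriptors] for zi in descriptors]

descriptors = [[1.0, 0.0, -0.2], [1.0, 0.1, -0.3], [-2.0, 4.0, 1.0]]
K_rbf = kernel_matrix(descriptors, rbf_kernel)
K_cos = kernel_matrix(descriptors, cosine_kernel)
```

The cosine kernel has no free parameter and is invariant to the scale of the descriptors, whereas the Gaussian kernel depends on the chosen width `sigma2`.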

We note that since the MAML-type approaches update the inner-loop independently for every task, they naturally admit a sequential or batch based training. Since TANML pursues a nonparametric kernel regression approach, it inherits the limitation of kernel-based approaches that all the training data is used simultaneously. As a result, the task losses and the associated gradients for all the training tasks are used at every meta-training iteration of TANML. One could overcome this limitation through use of online or sequential kernel regression techniques Lu et al. (2016); Sahoo et al. (2019); Vermaak et al. (2003). We are currently working towards achieving this improvement to our algorithm.

On an aspect of privacy

We would like to highlight an aspect of privacy that our algorithm possesses. Many optimization-based meta-learning approaches collect the information from all the training tasks into a single meta-parameter vector that corresponds to an initialization or inductive bias $\theta_0$. Once this parameter is known, any new task from the same class is solved by fast adaptation through gradient descent. Consider a scenario in which the tasks are related to individuals. A malicious agent who has access to this meta-parameter could then easily hack an existing task or a new related task: all that is needed is the knowledge of the meta-parameter and the data of the task. Imagine the case where the individuals are customers of an online banking service that specializes its services based on a base or meta neural network model with weights $\theta_0$. Once the meta-model is hacked into, it becomes possible to gain easy access to any customer's sensitive information, and even to manipulate it. This is because people often think and behave similarly in many spheres of activity, which is also why artificial intelligence has been a success in specialized services.

In contrast, our algorithm abstracts the information in two parts: the first is the meta-parameter $\theta_0$ that governs the similarity space, and the second is the meta-parameter $\Psi$ that carries the information of the previously seen tasks. As a result, the privacy or security is two-layered: a malicious agent must have access to both $\theta_0$ and $\Psi$ in order to be successful at hacking it. Since it is not easy to directly estimate one meta-parameter from the other, we believe that our algorithm is potentially more secure and privacy preserving. Further, unlike the meta-learning approaches where $\theta_0$ is of the same dimension and meaning as the parameter of individual tasks, the kernel-defining meta-parameter could potentially be very different from the task parameter in general, depending on the task descriptor used. For example, the task descriptor or similarity measure could be a probe network Achille et al. (2019) or a variational auto-encoder based statistic Edwards and Storkey (2017). Thus, our algorithm and the nonparametric regression based framework exhibit potential in privacy-preserving learning.

4 Experiments

We consider the application of TANML described in Algorithm 4 for sinusoidal regression tasks using the setup described earlier in Section 1.2. We consider tasks whose input-output data is generated from the sinusoidal model , where the input $x$ is drawn randomly from the interval . We wish to emphasize again that, as discussed in Section 1.2, we do not use the knowledge that the data comes from a sinusoidal function. In every task, the goal is to learn a neural network that predicts the scalar output $y$ for a given input $x$. We are given shots or data-points in training and test datasets, corresponding to randomly sampled input $x$. In the meta-training phase, we consider a set of tasks for which both the training and test/validation data are known. In the meta-test phase, we have access only to the training set of the previously unseen test tasks. In order to illustrate the potential of TANML in using the similarity/dissimilarity among tasks, we consider a fixed fraction of the tasks to be outliers, that is, generated from a non-sinusoidal function. For each task, the predicted output is given by $f(x, \theta_i)$, the output of a fully-connected three-layer feed-forward neural network of hidden units each, with rectified linear unit (ReLU) activation function. It must be noted that the only information that the neural network uses is the available input-output data, and that it is made aware of neither the sinusoidal nature nor the presence of outlier tasks. The parameter $\theta_i$ then corresponds to all the weights and biases in the neural network.

We consider two different regression experiments:

Fixed frequency varying amplitude:

In this experiment, we consider input-output data related by , where the amplitude is uniformly and randomly drawn from . We consider a fixed percentage of outlier tasks generated by the model .

Fixed amplitude varying frequency:

In this experiment, we consider input-output data related by ,where the frequency is uniformly and randomly drawn from . For the outlier tasks, we consider . In both experiments, we consider a fixed fraction of the tasks in both the meta-training and meta-test tasks to be outliers.
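A task set of this kind, with a fixed fraction of outlier tasks, can be generated along the following lines. This is a sketch of the varying-amplitude case only, and every concrete choice in it is a hypothetical placeholder rather than a value from the experiments: the amplitude range, the input interval, the number of shots, and in particular the linear function used for the outlier tasks.

```python
import math
import random

def make_task(rng, outlier, n_shots=4):
    """Create one few-shot regression task with separate training and test sets."""
    amp = rng.uniform(0.1, 5.0)  # hypothetical amplitude range

    def target(x):
        # Hypothetical outlier model: a line instead of a sinusoid.
        return amp * x if outlier else amp * math.sin(x)

    def sample():
        xs = [rng.uniform(-5.0, 5.0) for _ in range(n_shots)]  # hypothetical interval
        return [(x, target(x)) for x in xs]

    return {"outlier": outlier, "train": sample(), "test": sample()}

def make_task_set(n_tasks, outlier_fraction, seed=0):
    """Create n_tasks tasks, a fixed fraction of which are outliers."""
    rng = random.Random(seed)
    n_out = round(n_tasks * outlier_fraction)
    flags = [True] * n_out + [False] * (n_tasks - n_out)
    rng.shuffle(flags)  # outlier tasks are not marked or ordered in any way
    return [make_task(rng, flag) for flag in flags]

tasks = make_task_set(n_tasks=20, outlier_fraction=0.2)
```

The `outlier` flag is kept only for bookkeeping; a meta-learner in this setting would see just the `train` and `test` data-points of each task.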

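We consider the following two kernel functions, the Gaussian (RBF) kernel and the cosine kernel:

$$k(z_i, z_{i'}) = \exp\!\left(-\frac{\|z_i - z_{i'}\|_2^2}{\sigma^2}\right), \qquad k(z_i, z_{i'}) = \frac{z_i^\top z_{i'}}{\|z_i\|_2\, \|z_{i'}\|_2}.$$

While the parameter $\sigma^2$ of the RBF kernel could also be learnt, we do not include it in the meta-training phase and set it to a predetermined constant. In order that the structural similarities are better expressed, we consider kernel regression for the different layers separately. That is, instead of performing the inner-loop update for all the components of $\theta_i$ with a single kernel regression, we perform the update separately for different blocks of components of $\theta_i$. That is, for the parameter components belonging to block $b$, we perform:

$$\theta_i^{(b)} = \sum_{i' \in \mathcal{T}_{tr}} \Psi_{i'}^{(b)}\, k\big(z_i^{(b)}(\theta_0), z_{i'}^{(b)}(\theta_0)\big).$$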
The different blocks correspond to the different weights and biases of the different layers. We believe this helps capture the similarity better, given that different parameters will typically have different dynamic ranges. Taking a very long vector of all parameters, one runs the risk of certain parameters dominating the kernel regression, especially as the dimension of the parameters becomes large. We perform the experiments with the number of meta-training tasks equal to and . The meta-test set consists of tasks different from those in the meta-training set. We train the system with meta iterations. We use an Adam optimizer for learning the meta-parameters for all the approaches. We set the meta-learning rate as it results in stable training. The kernel regularization parameter was set to and the kernel parameter was set to . We refer the reader to the Supplementary material for the specific details of the experiments. We compare the performance of TANML with the RBF and cosine kernels against that of MAML and Meta-SGD using the normalized mean-squared error (NMSE):

The NMSE performance on the meta-test set, obtained by averaging over Monte Carlo realizations of tasks, is reported in Tables 1 and 2. We observe that TANML outperforms MAML in test prediction by a significant margin even when the fraction of the outlier tasks is . This clearly validates our intuition that an explicit awareness or notion of similarity aids in the learning, especially when the number of training tasks is limited. We also observe that on average TANML with the cosine kernel performs better than with the Gaussian kernel. This may perhaps be explained by the Gaussian kernel having an additional hyperparameter that needs to be specified, whereas the cosine kernel does not have any hyperparameters. As a result, the performance of the Gaussian kernel may be sensitive to the choice of the variance hyperparameter $\sigma^2$ and the dataset used. We note that the performance of the approaches in Experiment 1 is better than that in Experiment 2. This is because there is higher variation among the tasks in Experiment 2 (due to the changing frequency) than in Experiment 1 (where only the amplitude varies over tasks). We also observe that the performance improves as the number of meta-training tasks is increased from to .
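To make the reported numbers easier to interpret, the sketch below computes a normalized mean-squared error under one common definition, assumed here: the total squared prediction error over all tasks divided by the total squared magnitude of the true outputs. The function name and the toy values are illustrative.

```python
def nmse(y_true_per_task, y_pred_per_task):
    """Total squared error over all tasks, normalized by the total squared true output."""
    num = 0.0
    den = 0.0
    for y_true, y_pred in zip(y_true_per_task, y_pred_per_task):
        num += sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
        den += sum(a * a for a in y_true)
    return num / den

y_true = [[1.0, 2.0], [3.0]]            # true test outputs for two toy tasks
perfect = nmse(y_true, [[1.0, 2.0], [3.0]])
all_zero = nmse(y_true, [[0.0, 0.0], [0.0]])
```

Under this assumed definition a perfect predictor scores 0 and a predictor that always outputs zero scores 1, so values close to or above 1 indicate predictions no better than the trivial zero output.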

Algorithm        Experiment 1   Experiment 1   Experiment 2   Experiment 2
                 outlier        outlier        outlier        outlier
MAML             0.83           0.75           0.89           0.83
Meta-SGD         0.92           0.81           1.5            1.06
TANML-Gaussian   0.4            0.38           0.76           0.73
TANML-Cosine     0.37           0.30           0.44           0.47
Table 1: NMSE on test tasks with meta-training tasks

Algorithm        Experiment 1   Experiment 1   Experiment 2   Experiment 2
                 outlier        outlier        outlier        outlier
MAML             0.77           0.74           0.81           0.76
Meta-SGD         1.04           0.93           0.92           0.93
TANML-Gaussian   0.41           0.38           0.60           0.58
TANML-Cosine     0.35           0.26           0.38           0.33
Table 2: NMSE on test tasks with meta-training tasks

5 Acknowledgement

This work was partially supported by the Swedish Research Council and by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.

6 Conclusion

We proposed a task-similarity aware meta-learning algorithm that explicitly quantifies and employs the similarity between tasks through nonparametric kernel regression. Our approach models the task-specific parameters to lie in a RKHS and captures the similarity through kernels. The meta-parameters select the best RKHS that describes the tasks and specify the kernel regression coefficients that relate the different tasks. We showed how our approach could be seen as a more general treatment of the popular model-agnostic meta-learning algorithm. Experiments with regression tasks showed that our approach provides a reasonable prediction performance even in the presence of outlier or dissimilar tasks, and limited training data. The aim of the current contribution was to present an algorithm and the associated general framework that meaningfully incorporates task-similarity in the meta-learning process, bringing together optimization-based and metric-based meta-learning. To that end, we wish to reiterate that the study is an ongoing one and the experiments considered in the current work are in no way exhaustive. We will be pursuing the application of our algorithm to other meta-learning problems in the future, particularly to classification tasks in the few-shot learning regime.

We note that while we used a particular form of the task descriptor/parameterized kernel inspired by model-agnostic meta-learning, our RKHS framework for associating tasks is not restricted to this particular choice. Different approaches for obtaining task statistics and measuring task-similarity have been proposed in the literature, and it would be interesting to see how they can be incorporated into our kernel-based framework. An important next step for our approach is also the use of online/sequential kernel regression techniques to perform meta-training in a sequential or batch-based manner. The nonparametric kernel regression framework opens doors to a probabilistic or Bayesian treatment of meta-learning, such as one based on Gaussian processes. Such a treatment would help quantify the uncertainty in the meta-learning as a function of the different tasks available, while being aware of how similar or dissimilar the tasks are to the general learning process. This may also be viewed as a process of task selection: the kernel helps decide which task is more relevant and to what extent, and whether it should be included in or discarded from the learning task-set. We will continue working along these lines in the near future.

Experimental details

We compare four different approaches: MAML, Meta-SGD, TANML-Cosine, and TANML-Gaussian. All the algorithms were trained for 60000 meta-iterations, where each outer-loop update uses the entire set of training tasks (full-batch gradient descent rather than stochastic gradient descent). All the experiments were performed on an NVIDIA Tesla K80 GPU. The NMSE of all four methods is reported here:

Appendix A Hyper-parameters

The hyper-parameters for the four approaches are listed next. The learning-rate parameters were chosen such that the training error converged without instability.

a.1 Maml

  • Inner-loop learning rate: : 0.01

  • Outer-loop learning rate:

  • Total NN layers: 4 with, 2 hidden layers

  • Non-linearity: ReLU

  • Optimizer: Adam

a.2 Meta-SGD

  • Inner-loop learning rate : learnt, initialized with values randomly drawn from

  • Outer-loop learning rate for :

  • Outer-loop learning rate for : (Note that the learning rates for and are different)

  • Total NN layers: 4 with, 2 hidden layers

  • Non-linearity: ReLU

  • Optimizer: Adam

A.3 TANML-Gaussian

  • Outer-loop learning rate for :

  • Outer-loop learning rate for : (Note that the learning rates for and are different)

  • Total NN layers: 4, with 2 hidden layers

  • Non-linearity: ReLU

  • Optimizer: Adam

A.4 TANML-Cosine

  • Outer-loop learning rate for :

  • Outer-loop learning rate for : (Note that the learning rates for and are different)

  • Total NN layers: 4, with 2 hidden layers

  • Non-linearity: ReLU

  • Optimizer: Adam

