1 Introduction
Meta-learning seeks to abstract a general learning rule that helps solve a class of learning problems or tasks, from the knowledge of a set of training tasks (Finn and Levine, 2018; Denevi et al., 2018). The setting is that the data available for solving each of these tasks is often severely limited, restricting the achievable performance on the tasks when solved individually. By abstracting similarity across tasks, meta-learning aims to perform well not just on the given set of tasks, but on the entire class of tasks to which they belong. This also sets it apart from the transfer learning paradigm, where the focus is to transfer a well-performing network from one domain to another (Pan and Yang, 2010). Depending on how the meta information is defined and abstracted, meta-learning approaches fall under three broad categories: optimization based (Ravi and Larochelle, 2017; Finn et al., 2017, 2018), metric-learning based (Vinyals et al., 2016; Rusu et al., 2019), and model based (Santoro et al., 2016; Snell et al., 2017; Mishra et al., 2018). Though meta-learning approaches work on the assumption of the tasks being similar or belonging to a class, task-similarity is typically not explicitly employed in the meta-learning algorithms, particularly in the optimization-based ones. In many practical applications, it is realistic to assume that not all the tasks are very similar and that outlier or dissimilar tasks are present. In such cases, one expects that incorporating a metric of similarity explicitly would enable meta-learning to better adapt to variations among tasks, especially when the number of tasks available for training is limited. In this work, we address this particular issue by proposing a meta-learning algorithm that explicitly incorporates task-similarity. In this sense, our algorithm becomes a combination of both optimization-based and metric-learning based meta-learning.

Our contribution is a novel meta-learning algorithm called Task-similarity Aware Nonparametric Meta-Learning (TANML), which:

Explicitly employs similarity across the tasks in fast adaptation to tasks. The parameters for a given task are obtained by considering the information from other tasks and weighing them according to their similarity/dissimilarity.

Models the task-specific parameters to lie in a reproducing kernel Hilbert space (RKHS), associating the tasks through a nonparametric kernel regression. The meta-parameters perform the role of selecting the RKHS that best describes the observed tasks and give the regression coefficients that relate them through the kernels.

Uses a particular strategy for defining the RKHS by assigning a task descriptor to every task, which is then used to quantify similarity/dissimilarity among tasks. This is obtained by viewing meta-learning through the lens of linear/kernel regression and then generalizing to nonparametric regression.

Offers a fairly general framework that admits many possible variants. Though we use a particular form of parameterized kernels with task descriptors, the underlying RKHS task-similarity aware framework is a very general one.
1.1 Mathematical overview of the proposed algorithm
We now give a mathematical overview of TANML. Consider a class of tasks 𝒯, where each task t_i has input-output data in the form of a training set D_i^tr and a test set D_i^test. Let f_θ denote the parametric model or mapping which we wish to learn, assumed to be of the same form for every task in 𝒯; θ_i denotes the parameter for the task indexed by i, obtained by minimizing a loss function L_i(θ). In the case of a neural network, for example, f_θ is the output of the neural network, L_i the cross-entropy or the mean-squared error, and θ_i denotes the vector of all the learnt network weights. We further assume we have access to a set 𝒯_tr of N tasks for meta-training, for which both D_i^tr and D_i^test are known. For the unseen tasks, only D_i^tr is known. The TANML approach then models the task-specific parameters as:

θ_i = Σ_{j=1}^{N} κ_α(t_i, t_j) w_j,    (1)
where

κ_α(t_i, t_j) is the parameterized kernel function capturing similarity between the ith and jth tasks,

α is the parameter vector that defines the kernel and the associated reproducing kernel Hilbert space (RKHS),

w_j denotes the kernel regression coefficients for the jth task, and

α and {w_j}_{j=1}^{N} are the learnt meta-parameters.
In order that the algorithm is meaningful, the kernel function must be a function of the task losses L_i and/or their derivatives. Equation (1) models the task-specific parameters to lie in an RKHS defined by the kernel function; the distance between the task parameters is given by the kernel function. Every choice of the meta-parameter α defines the associated RKHS in which the tasks (their parameters, more specifically) lie. Thus, TANML achieves meta-learning in two steps: it selects the RKHS that best describes the given set of training tasks, and predicts the optimal parameters for a new, unseen task using the learnt kernel coefficients for the selected RKHS.
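As a rough sketch of the prediction step in Eq. (1), the following hypothetical snippet weighs per-task coefficient vectors by kernel similarities computed from task descriptors. The descriptor dimension, parameter dimension, and Gaussian kernel used here are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two task descriptors."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def predict_task_params(new_descriptor, train_descriptors, W, gamma=1.0):
    """Eq. (1)-style prediction: weigh the learnt coefficient vectors W[j]
    by the kernel similarity between the new task and each training task."""
    k = np.array([rbf_kernel(new_descriptor, t, gamma) for t in train_descriptors])
    return W.T @ k  # (param_dim,) = (n_tasks, param_dim).T @ (n_tasks,)

rng = np.random.default_rng(0)
train_desc = rng.normal(size=(5, 3))  # 5 training tasks, descriptor dim 3 (assumed)
W = rng.normal(size=(5, 4))           # kernel regression coefficients, param dim 4 (assumed)
theta = predict_task_params(train_desc[0], train_desc, W)
```

A new task thus never receives its parameters in isolation: every training task contributes, in proportion to its kernel similarity.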
The meta-parameters are learnt to minimize the test loss of the seen tasks in 𝒯_tr:

min_{α, {w_j}} Σ_{i ∈ 𝒯_tr} L_i^test(θ_i),    (2)

where θ_i is as given in (1). As with MAML, the TANML meta-parameters are computed iteratively through one-step gradient descent:

α ← α − η ∇_α Σ_{i ∈ 𝒯_tr} L_i^test(θ_i),    w_j ← w_j − η ∇_{w_j} Σ_{i ∈ 𝒯_tr} L_i^test(θ_i).    (3)
We note here that while our RKHS-based framework is a general one, in this work we take a specific approach obtained by viewing MAML/Meta-SGD through the lens of linear regression. We shall see that such a view directly results in the definition of a task descriptor used to quantify the similarity/dissimilarity between tasks through kernels. The task descriptor turns out to be a function of the task-loss gradient. We wish to emphasize that the central aim of this work is to propose a meta-learning algorithm and an associated general meta-learning framework for incorporating task-similarities, bringing together optimization-based and metric-based meta-learning. The experiments that we consider serve the goal of illustrating the potential of the approach, and are in no way exhaustive.
1.2 A motivating example
Let us consider a class of tasks 𝒯, each task with input x and output y functionally related by an unknown process y = g(x), where g has the same functional form for every task. We are given a set of training tasks 𝒯_tr. For every task, we have access to only a very small number of input-output datapoints (x, y), for x randomly drawn from the input domain. Given this data, our goal is to learn a neural network (NN) with weights θ_i to predict the output for every task. The NN is trained by minimizing the training error L_i^tr(θ).

Clearly, training for the tasks individually with a descent algorithm will result in a NN that overfits to D_i^tr and generalizes poorly to D_i^test. MAML-type optimization-based approaches (Finn et al., 2017) solve this by inferring the information across the tasks in 𝒯_tr in the form of a good initialization θ₀ for the NN weights, specialized/adapted to obtain θ_i for the ith task through a gradient step.

MAML obtains the meta-parameter θ₀ by iteratively taking gradient-descent steps with respect to the test loss on the training tasks. The NN weights for a task are obtained without using any information from the other seen tasks directly, except for the learnt initialization θ₀. As a result, though initialized similarly, it is not clear how similar or dissimilar the NN weights for the tasks would be. If a task is less similar to the majority of the tasks used for meta-training, one could expect the NN with the predicted weights to perform poorly on test data.
In contrast, we observe from (1) that our approach explicitly uses the similarity between the tasks to predict the parameters for a new task: we predict the NN weights for a new task by weighing the information from all the training tasks by their task-similarity. TANML can be likened to an interpolation across the tasks: the interpolating basis being the kernel function κ_α, and the {w_j} being the interpolation coefficients. TANML learns both {w_j}, the information from the seen/training tasks, and the metric space in which the NN weights are best described to lie (through α). As a result, we expect the predicted NN for a new task to perform well even when the tasks are not too similar. In our numerical experiments in Section 4, we consider the case of g being a sinusoidal function.

1.3 Related work
The structural characterization of tasks and the use of task-dependent knowledge have gained interest in meta-learning recently. In Edwards and Storkey (2017), a variational autoencoder based approach was employed to generate task/dataset statistics used to measure similarity. In Ruder and Plank (2017), domain similarity and diversity measures were considered in the context of transfer learning. The study of how task properties affect catastrophic forgetting in continual learning was pursued in Nguyen et al. (2019). In Lee et al. (2020), the authors proposed a task-adaptive meta-learning approach for classification that adaptively balances meta-learning and task-specific learning differently for every task and class. It was shown in Oreshkin et al. (2018) that the performance of few-shot learning improves significantly with the use of task-dependent metrics. While the use of kernels or similarity metrics is not new in meta-learning, they are typically seen in the context of defining relations between the classes or samples within a given task (Vinyals et al., 2016; Snell et al., 2017; Oreshkin et al., 2018; Fortuin and Rätsch, 2019; Goo and Niekum, 2020). Information-theoretic ideas have also been used in the study of the topology and geometry of task spaces (Nguyen et al., 2019; Achille et al., 2018). In Achille et al. (2019), the authors construct vector representations for tasks using partially trained probe networks, based on which task-similarity metrics are developed. Task descriptors have been of particular interest in vision-related tasks in the context of transfer learning (Zamir et al., 2018; Achille et al., 2019; Tran et al., 2019).

2 Review of MAML and Meta-SGD
We first review the MAML and Meta-SGD approaches and highlight the aspects relevant to our discussion. We shall then show how these approaches lead to the definition of a generalized Meta-SGD and, consequently, to TANML.
2.1 MAML
Model-agnostic meta-learning proceeds in two stages. First is the specialization or adaptation of the meta-parameter θ₀ to obtain θ_i for task t_i (referred to as the inner-loop update), achieved by a gradient step with respect to the training loss L_i^tr. Second is the update of θ₀, achieved by running gradient descent using the total test loss (referred to as the outer-loop update). The meta-training phase of MAML is described in the following algorithm:

The inner-loop update is performed over the training sets of the training tasks in 𝒯_tr. The outer-loop update is performed over the corresponding test datasets by evaluating the loss function at the task parameter values obtained from the inner loop; the inner- and outer-loop learning rates are hyperparameters. Thus, MAML performs meta-learning by learning the inductive bias θ₀, which can be used for fast adaptation (through a single gradient step) to a new task. Once the meta-training phase is complete, the parameters for a test task are obtained by applying the inner loop to the training dataset of the test task. We note here that the MAML described in Algorithm 1 is the efficient first-order MAML (Finn et al., 2018), as opposed to the general MAML, where the inner loop may contain several gradient-descent steps. We shall hereafter refer to the first-order MAML whenever we speak of MAML in our analysis. A schematic of MAML is presented in Figure 1.
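To make the two-stage structure concrete, here is a minimal first-order MAML sketch on a toy quadratic loss. The quadratic per-task losses, the number of tasks, and the learning rates are illustrative stand-ins, not the paper's neural-network setting:

```python
import numpy as np

# Toy setup: each "task" is a quadratic loss with its own optimum.
rng = np.random.default_rng(1)
targets = rng.normal(size=(8, 2))              # one optimum per task
loss = lambda th, tgt: 0.5 * np.sum((th - tgt) ** 2)
grad = lambda th, tgt: th - tgt

theta = np.zeros(2)                            # meta-parameter (inductive bias)
alpha, beta = 0.1, 0.05                        # inner / outer learning rates (assumed)

for _ in range(200):
    outer_grad = np.zeros_like(theta)
    for tgt in targets:
        adapted = theta - alpha * grad(theta, tgt)   # inner-loop (adaptation) step
        # first-order MAML: the gradient of the adapted parameters w.r.t.
        # theta is treated as the identity, so we just accumulate the
        # test-loss gradient at the adapted point.
        outer_grad += grad(adapted, tgt)
    theta -= beta * outer_grad / len(targets)        # outer-loop step
```

For these quadratic tasks the learnt initialization converges toward the mean of the task optima, which matches the intuition that θ₀ encodes what the tasks have in common.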
2.2 Meta-SGD
Meta stochastic gradient descent (Meta-SGD) is a variant of MAML which learns the component-wise step sizes for the inner-loop update along with the initialization θ₀. Let β denote the vector of step sizes for the different components of θ. Then, the meta-training process for Meta-SGD is given by:

where the operator ∘ denotes the pointwise (Hadamard) vector product. The outer-loop gradient is taken with respect to both θ₀ and β. Notice that the inner-loop update is expressible as

θ_i = θ₀ − β ∘ ∇_θ L_i^tr(θ₀) = W z_i,  where z_i = [θ₀ᵀ, ∇_θ L_i^tr(θ₀)ᵀ]ᵀ and W = [I, −diag(β)].    (4)

Thus, the task estimate θ_i can be viewed as the output of a linear regression which takes z_i as the input for every task: once W is known or estimated from training data, the parameters for any task may be obtained by computing the corresponding z_i and then applying the linear regression matrix W. Consequently, Meta-SGD can be seen as a special case of a more general algorithm in which a full matrix W is learnt by minimizing the total test loss together with a regularization R(W), where setting W as in (4) recovers Meta-SGD. R(W) is a regularization on the matrix W; for example, R(W) could be the Frobenius norm of W. We refer to this new formulation as the Generalized Meta-SGD. The parameter predicted by the Generalized Meta-SGD for any task is obtained as the output of the linear regression θ_i = W z_i. In other words, we have established the similarity of the inner-loop update of MAML/Meta-SGD to that of a linear regression which associates the parameter θ_i as the output to the vector z_i as the input. We shall hereafter refer to z_i as the task descriptor of the ith task.
3 Task-similarity Aware Meta-Learning
It is well known that the expressive power of linear regression is limited, due both to its linear nature and to the finite dimension of the input. Further, since the number of entries of the linear regression matrix W grows quadratically with the dimension of θ, a large amount of training data becomes necessary to estimate it. A transformation of linear regression in the form of the 'kernel substitution' or 'kernel trick' results in the more general nonparametric or kernel regression (Bishop, 2006; Schölkopf and Smola, 2002). Kernel regression essentially performs linear regression in an infinite-dimensional space, making it a nonparametric regression approach. Kernel regression is known to be less prone to overfitting and more robust in capturing variations in the data than its linear counterpart, even with limited training samples (Bishop, 2006). Then, as with linear regression, by viewing the task parameter θ_i as the predicted target and the task descriptor z_i as the input, we propose the following nonparametric or kernel regression model:

θ_i = Σ_{j=1}^{N} κ_α(z_i, z_j) w_j,    (5)

where the {w_j} are the kernel regression coefficients that must be estimated from data. Kernel regression models the task-specific parameters as lying in a reproducing kernel Hilbert space defined by the kernel κ, parametrized through α, which enters the kernel through the task descriptors. Returning to our discussion in the mathematical overview of TANML and comparing (1) and (5), we observe that κ_α(t_i, t_j) = κ(z_i, z_j); that is, our particular view of MAML/Meta-SGD and the task descriptor results in a TANML whose meta-parameter α is given by θ₀, the inductive bias of MAML. Thus, the MAML line of thought helps us arrive at a well-defined choice of the kernel through the task descriptors defined in (4). We delineate the meta-training for TANML in Algorithm 4.

As with the Generalized Meta-SGD, R could be some suitable regularization on the coefficients {w_j}. In our analysis, we use the commonly used regularization given by the quadratic form wᵀKw (which is the sum of the squared norms of the task parameters in the RKHS, cf. Schölkopf and Smola (2002); Bishop (2006)), where K is the kernel matrix for the training tasks such that K_{ij} = κ_α(z_i, z_j). We wish to reiterate that, in general, any parametrized kernel may be used for TANML in Algorithm 4 through Equation (5). We use the particular choice of the kernel with task descriptors because it follows directly from the MAML-type analysis. A schematic describing the task-descriptor based TANML and the intuition behind its working is shown in Figure 1.
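A small sketch of the RKHS-norm regularizer over the training-task kernel matrix: the descriptors, kernel width, and coefficient shapes below are illustrative assumptions, with one coefficient column per parameter dimension so that the regularizer sums wᵀKw over dimensions:

```python
import numpy as np

rng = np.random.default_rng(4)
n_tasks = 6
Z = rng.normal(size=(n_tasks, 3))                        # task descriptors (assumed dim 3)
gamma = 0.5                                              # RBF width (assumed)

# Kernel matrix over training tasks: K[i, j] = kappa(z_i, z_j).
sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
K = np.exp(-gamma * sq)

W = rng.normal(size=(n_tasks, 4))                        # coefficients, param dim 4 (assumed)

# RKHS-norm regularizer: sum over parameter dimensions of w_d^T K w_d.
reg = float(np.trace(W.T @ K @ W))
```

Since K is positive semidefinite, the regularizer is nonnegative and penalizes coefficient vectors that are large in the RKHS norm induced by the kernel.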
On the choice of kernels and sequential training
While the expressive power of kernels is immense, it is also known that performance can vary depending on the choice of the kernel function (Schölkopf and Smola, 2002). The kernel function that works best for a dataset is usually found by trial and error. A possible approach is to use multiple-kernel regression, where one lets the data decide which of a prespecified set of kernels are relevant (Sonnenburg and Schäfer, 2005; Gönen and Alpaydın, 2011). Domain-specific knowledge may also be incorporated in the choice of kernels. In our analysis, we use two popular kernel functions: the Gaussian or radial basis function (RBF) kernel, and the cosine kernel.
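A minimal sketch of the two kernels, written over task-descriptor vectors; the RBF width gamma is a hyperparameter fixed beforehand, and its value here is only a placeholder:

```python
import numpy as np

def rbf_kernel(z1, z2, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * ||z1 - z2||^2)."""
    return np.exp(-gamma * np.sum((z1 - z2) ** 2))

def cosine_kernel(z1, z2, eps=1e-12):
    """Cosine kernel: normalized inner product of the descriptors."""
    return np.dot(z1, z2) / (np.linalg.norm(z1) * np.linalg.norm(z2) + eps)
```

The RBF kernel depends on the Euclidean distance between descriptors, while the cosine kernel depends only on their direction, which makes it free of scale hyperparameters.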
We note that since MAML-type approaches update the inner loop independently for every task, they naturally admit sequential or batch-based training. Since TANML pursues a nonparametric kernel regression approach, it inherits the limitation of kernel-based approaches that all the training data are used simultaneously. As a result, the task losses and the associated gradients for all the training tasks are used at every meta-training iteration of TANML. One could overcome this limitation through the use of online or sequential kernel regression techniques (Lu et al., 2016; Sahoo et al., 2019; Vermaak et al., 2003). We are currently working towards achieving this improvement to our algorithm.
On an aspect of privacy
We would like to highlight an aspect of privacy that our algorithm possesses. Many optimization-based meta-learning approaches collect the information from all the training tasks into a single meta-parameter vector θ₀ that corresponds to an initialization or inductive bias. Once this parameter is known, any new task from the same class can be solved by fast adaptation through gradient descent. Returning to the scenario of tasks being related to individuals, this would mean that a malicious agent who has access to this meta-parameter could easily hack an existing task or a new related task; all that is needed is knowledge of the meta-parameter and the data of the task. Imagine the case where the agents are customers of an online bank that specializes its services based on a base or meta neural network model with weights θ₀. Once the meta-model is hacked, it becomes possible to gain easy access to any customer's sensitive information, and even to manipulate it. This is because people often think and behave similarly in many spheres of activity, which is also why artificial intelligence has been a success in specialized services.

In contrast, our algorithm abstracts the information in two parts: first, the meta-parameter α that governs the similarity space, and second, the meta-parameters {w_j} that carry the information of the a priori seen tasks. As a result, the privacy or security is two-layered: a malicious agent must have access to both α and {w_j} in order to succeed. Since it is not easy to directly estimate one meta-parameter from the other, we believe that our algorithm is potentially more secure and privacy-preserving. Further, unlike the meta-learning approaches where θ₀ has the same dimension and meaning as the parameters of the individual tasks, α could in general be very different from the task parameter, depending on the task descriptor used. For example, the task descriptor or similarity measure could be a probe network (Achille et al., 2019) or a variational autoencoder based statistic (Edwards and Storkey, 2017). Thus, our algorithm and the nonparametric regression based framework show potential for privacy-preserving learning.
4 Experiments
We consider the application of TANML described in Algorithm 4 to sinusoidal regression tasks using the setup described earlier in Section 1.2. We consider tasks whose input-output data are generated from a sinusoidal model whose parameter is drawn randomly from a fixed interval. We wish to emphasize again that, as discussed in Section 1.2, we do not use the knowledge that the data come from a sinusoidal function. In every task, the goal is to learn a neural network that predicts the scalar output y for a given input x. We are given a small number of shots or datapoints in the training and test datasets, corresponding to randomly sampled inputs. In the meta-training phase, we consider a set of tasks for which both the training and test/validation data are known. In the meta-test phase, we have access only to the training sets of the previously unseen test tasks. In order to illustrate the potential of TANML in using the similarity/dissimilarity among tasks, we consider a fixed fraction of the tasks to be outliers, that is, generated from a non-sinusoidal function. For each task, the predicted output is given by the output of a fully connected three-layer feedforward neural network with a fixed number of hidden units per layer and rectified linear unit (ReLU) activation functions. It must be noted that the only information the neural network uses is the available input-output data; it is made aware of neither the sinusoidal nature of the data nor the presence of outlier tasks. The parameter θ then corresponds to all the weights and biases in the neural network. We consider two different regression experiments:
Fixed frequency, varying amplitude:
In this experiment, we consider input-output data related by a fixed-frequency sinusoid y = A sin(x), where the amplitude A is drawn uniformly at random from a fixed interval. We consider a fixed percentage of outlier tasks generated by a different, non-sinusoidal model.
Fixed amplitude, varying frequency:
In this experiment, we consider input-output data related by a fixed-amplitude sinusoid y = sin(ωx), where the frequency ω is drawn uniformly at random from a fixed interval. For the outlier tasks, we again consider a different, non-sinusoidal model. In both experiments, a fixed fraction of the tasks in both the meta-training and meta-test sets are outliers.
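A hypothetical generator for such few-shot sinusoidal tasks with an outlier fraction might look as follows. The amplitude range, input domain, and the particular outlier model are placeholder assumptions, since the exact values are part of the experimental setup:

```python
import numpy as np

def make_tasks(n_tasks, shots, outlier_frac=0.2, seed=0):
    """Generate few-shot regression tasks; a fixed fraction are outliers
    produced by a non-sinusoidal model."""
    rng = np.random.default_rng(seed)
    n_outliers = int(outlier_frac * n_tasks)
    tasks = []
    for i in range(n_tasks):
        a = rng.uniform(0.5, 5.0)            # per-task amplitude (assumed range)
        x = rng.uniform(-5.0, 5.0, size=shots)
        if i < n_outliers:
            y = a * np.abs(x) / 5.0          # hypothetical non-sinusoidal outlier model
        else:
            y = a * np.sin(x)                # sinusoidal task
        tasks.append((x, y))
    return tasks

tasks = make_tasks(n_tasks=25, shots=10)
```

Each element of `tasks` is an (inputs, outputs) pair that would be split into training and test shots for the meta-learner.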
We consider the following two kernel functions:
While the parameter of the RBF kernel could also be learnt, we do not include it in the meta-training phase and set it to a predetermined constant. In order that the structural similarities are better expressed, we perform the kernel regression for the different layers separately. That is, instead of performing the inner-loop update for all the components of θ with a single kernel regression, we perform the update separately for the different blocks of components of θ, running one kernel regression per block.

The different blocks correspond to the weights and biases of the different layers. We believe this helps capture the similarity better, given that different parameters will typically have different dynamic ranges; taking a single very long vector of all parameters runs the risk of certain parameters dominating the kernel regression, especially as the dimension of the parameters becomes large. We perform the experiments with two different numbers of meta-training tasks. The meta-test set consists of tasks different from those in the meta-training set. We use the Adam optimizer for learning the meta-parameters for all the approaches, and set the meta-learning rate to a value that results in stable training. The kernel regularization parameter and the kernel parameter were set to predetermined constants. We refer the reader to the supplementary material for the specific details of the experiments. We compare the performance of TANML with the RBF and cosine kernels against that of MAML and Meta-SGD using the normalized mean-squared error (NMSE):
The NMSE performance on the meta-test set, obtained by averaging over Monte Carlo realizations of tasks, is reported in Tables 1 and 2. We observe that TANML outperforms MAML in test prediction by a significant margin even in the presence of outlier tasks. This clearly validates our intuition that an explicit awareness or notion of similarity aids learning, especially when the number of training tasks is limited. We also observe that, on average, TANML with the cosine kernel performs better than with the Gaussian kernel. This may perhaps be explained by the Gaussian kernel having an additional hyperparameter that needs to be specified, whereas the cosine kernel has none; as a result, the performance of the Gaussian kernel may be sensitive to the choice of the variance hyperparameter and the dataset used. We note that the performance of the approaches in Experiment 1 is better than that in Experiment 2. This is because there is higher variation among the tasks in Experiment 2 (due to the changing frequency) than in Experiment 1 (where only the amplitude varies over tasks). We also observe that the performance improves as the number of meta-training tasks is increased.

Table 1: NMSE on the meta-test set (smaller meta-training set).

| Algorithm | Experiment 1 (% outliers) | Experiment 1 (% outliers) | Experiment 2 (% outliers) | Experiment 2 (% outliers) |
| MAML | 0.83 | 0.75 | 0.89 | 0.83 |
| Meta-SGD | 0.92 | 0.81 | 1.5 | 1.06 |
| TANML-Gaussian | 0.4 | 0.38 | 0.76 | 0.73 |
| TANML-Cosine | 0.37 | 0.30 | 0.44 | 0.47 |
Table 2: NMSE on the meta-test set (larger meta-training set).

| Algorithm | Experiment 1 (% outliers) | Experiment 1 (% outliers) | Experiment 2 (% outliers) | Experiment 2 (% outliers) |
| MAML | 0.77 | 0.74 | 0.81 | 0.76 |
| Meta-SGD | 1.04 | 0.93 | 0.92 | 0.93 |
| TANML-Gaussian | 0.41 | 0.38 | 0.60 | 0.58 |
| TANML-Cosine | 0.35 | 0.26 | 0.38 | 0.33 |
5 Acknowledgements
This work was partially supported by the Swedish Research Council and by the Wallenberg AI, Autonomous Systems and Software Program (WASP), funded by the Knut and Alice Wallenberg Foundation.
6 Conclusion
We proposed a task-similarity aware meta-learning algorithm that explicitly quantifies and employs the similarity between tasks through nonparametric kernel regression. Our approach models the task-specific parameters to lie in an RKHS and captures the similarity through kernels. The meta-parameters select the RKHS that best describes the tasks and specify the kernel regression coefficients that relate the different tasks. We showed how our approach can be seen as a more general treatment of the popular model-agnostic meta-learning algorithm. Experiments with regression tasks showed that our approach provides reasonable prediction performance even in the presence of outlier or dissimilar tasks and limited training data. The aim of the current contribution was to present an algorithm and an associated general framework that meaningfully incorporates task-similarity in the meta-learning process, bringing together optimization-based and metric-based meta-learning. To that end, we wish to reiterate that the study is an ongoing one and that the experiments considered in the current work are in no way exhaustive. We will pursue the application of our algorithm to other meta-learning problems in the future, particularly to classification tasks in the few-shot learning regime.
We note that while we used a particular form of the task descriptor/parameterized kernel inspired by model-agnostic meta-learning, our RKHS framework for associating tasks is not restricted to this particular choice. Different approaches for obtaining task statistics and measuring task-similarity have been proposed in the literature, and it would be interesting to see how they can be incorporated into our kernel-based framework. An important next step for our approach is the use of online/sequential kernel regression techniques to perform meta-training in a sequential or batch-based manner. The nonparametric kernel regression framework also opens doors to a probabilistic or Bayesian treatment of meta-learning, such as with Gaussian processes. Such a treatment would help quantify the uncertainty in the meta-learning as a function of the different tasks available, while being aware of how similar or dissimilar the tasks are to the general learning process. This may also be viewed as a process of task selection: the kernel helps decide which task is more relevant and to what extent, and whether it should be included in or discarded from the learning task set. We will continue working along these lines in the near future.
Experimental details
We compare four different approaches: MAML, Meta-SGD, TANML-Cosine, and TANML-Gaussian. All the algorithms were trained for 60000 meta-iterations, where each outer-loop meta-iteration uses the entire set of training tasks rather than stochastic gradient descent over task batches. All the experiments were performed on NVIDIA Tesla K80 GPUs. The NMSE of all four methods is reported here:
Appendix A Hyperparameters
The hyperparameters for the four approaches are listed next. The learning-rate parameters were chosen such that the training error converged without instability.
A.1 MAML

Inner-loop learning rate: 0.01

Outer-loop learning rate:

Total NN layers: 4, with 2 hidden layers

Nonlinearity: ReLU

Optimizer: Adam
A.2 Meta-SGD

Inner-loop learning rate β: learnt, initialized with values randomly drawn from a fixed interval

Outer-loop learning rate for θ₀:

Outer-loop learning rate for β: (note that the learning rates for θ₀ and β are different)

Total NN layers: 4, with 2 hidden layers

Nonlinearity: ReLU

Optimizer: Adam
A.3 TANML-Gaussian

Outer-loop learning rate for α:

Outer-loop learning rate for {w_j}: (note that the learning rates for α and {w_j} are different)

Total NN layers: 4, with 2 hidden layers

Nonlinearity: ReLU

Optimizer: Adam
A.4 TANML-Cosine

Outer-loop learning rate for α:

Outer-loop learning rate for {w_j}: (note that the learning rates for α and {w_j} are different)

Total NN layers: 4, with 2 hidden layers

Nonlinearity: ReLU

Optimizer: Adam
References

Achille et al. (2019). Task2Vec: task embedding for meta-learning. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV 2019), pp. 6429–6438.

Achille et al. (2018). Dynamics and reachability of learning tasks. arXiv:1810.02440.

Bishop (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, New York.

Denevi et al. (2018). Learning to learn around a common mean. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pp. 10190–10200.

Edwards and Storkey (2017). Towards a neural statistician. In 5th International Conference on Learning Representations (ICLR 2017).

Finn et al. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), pp. 1126–1135.

Finn and Levine (2018). Meta-learning and universality: deep representations and gradient descent can approximate any learning algorithm. In 6th International Conference on Learning Representations (ICLR 2018).

Finn et al. (2018). Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pp. 9537–9548.

Fortuin and Rätsch (2019). Deep mean functions for meta-learning in Gaussian processes. arXiv:1901.08098.

Gönen and Alpaydın (2011). Multiple kernel learning algorithms. Journal of Machine Learning Research 12, pp. 2211–2268.

Goo and Niekum (2020). Local nonparametric meta-learning. arXiv:2002.03272.

Lee et al. (2020). Learning to balance: Bayesian meta-learning for imbalanced and out-of-distribution tasks. In 8th International Conference on Learning Representations (ICLR 2020).

Lu et al. (2016). Large scale online kernel learning. Journal of Machine Learning Research 17, pp. 47:1–47:43.

Mishra et al. (2018). A simple neural attentive meta-learner. In 6th International Conference on Learning Representations (ICLR 2018).

Nguyen et al. (2019). Toward understanding catastrophic forgetting in continual learning. arXiv:1908.01091.

Oreshkin et al. (2018). TADAM: task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pp. 719–729.

Pan and Yang (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22(10), pp. 1345–1359.

Ravi and Larochelle (2017). Optimization as a model for few-shot learning. In 5th International Conference on Learning Representations (ICLR 2017).

Ruder and Plank (2017). Learning to select data for transfer learning with Bayesian optimization. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), pp. 372–382.

Rusu et al. (2019). Meta-learning with latent embedding optimization. In 7th International Conference on Learning Representations (ICLR 2019).

Sahoo et al. (2019). Large scale online multiple kernel regression with application to time-series prediction. ACM Transactions on Knowledge Discovery from Data 13(1), pp. 9:1–9:33.

Santoro et al. (2016). Meta-learning with memory-augmented neural networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), pp. 1842–1850.

Schölkopf and Smola (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.

Snell et al. (2017). Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), pp. 4077–4087.

Sonnenburg and Schäfer (2005). A general and efficient multiple kernel learning algorithm. In Proceedings of the International Conference on Neural Information Processing Systems, pp. 1273–1280.

Tran et al. (2019). Transferability and hardness of supervised classification tasks. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV 2019), pp. 1395–1405.

Vermaak et al. (2003). Sequential Bayesian kernel regression. In Advances in Neural Information Processing Systems 16 (NIPS 2003), pp. 113–120.

Vinyals et al. (2016). Matching networks for one shot learning. In Advances in Neural Information Processing Systems 29 (NeurIPS 2016), pp. 3630–3638.

Zamir et al. (2018). Taskonomy: disentangling task transfer learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018).