1 Introduction
Learning quickly or with very few samples has been a longterm goal of the machine learning community. The field of metalearning has recently made significant strides towards achieving that goal. Metalearning
(nichol2018first; ravi2016optimization; finn2017model) comprises of a set of algorithms designed to exploit prior experiences from multiple tasks (drawn from a task distribution) for improving sampleefficiency on a new but related task from the same distribution. Given the increasing cost of getting annotated samples on an everincreasing variety of related tasks, the practical scope of these algorithms is immense.Most metalearning methods can be classified into two broad categories (i)
gradientbased (ravi2018amortized; denevi2019learning; finn2017model) approaches that metalearn parameters of optimization algorithms (like initialization and learning rate) in a way that the metalearner (optimizer) is amenable to quickly adapt on a new task by performing gradient descent on a very small number of labeled samples, and (ii) amortizedinference (snell2017prototypical; lee2019meta; bertinetto2018meta)based approaches that directly infer the optimal parameters of a new task without performing any gradient based optimization. In general, such algorithms learn to adapt the parameters of a complex neural network using only a few samples from an unseen task in a way that the
adapted network generalizes well (wang2019generalizing). In this work, although we focus on improving gradientbased methods, we believe that our core idea can be adapted to the latter as well. Recent works (finn2018probabilistic), (ravi2018amortized), (kim2018bayesian) have used a Bayesian framework to learn a suitable prior over the network parameters by leveraging the inherent structure of the task distribution. By viewing the parameters of a metalearner through a Bayesian lens we can use the predictive posterior (gal2016dropout) to estimate the uncertainty of the adapted parameters for each task (ravi2018amortized).In a Bayesian metalearner (kim2018bayesian; finn2018probabilistic; ravi2018amortized), the posterior over the adapted network parameters for a new task is typically inferred using a few samples from the task along with a metalearned prior. In this work, we hypothesize that the covariate distribution of a task can also influence the posterior over the adapted network parameters. To the best of our knowledge none of the existing metalearning algorithms like bertinetto2018meta; rajeswaran2019meta; ravi2018amortized; finn2018probabilistic explicitly utilize the information present in the covariates to improve the estimate of the adapted parameters. This is done by modeling the latent factors of the covariate distribution. We define a prior not only on the network parameters (which determine the conditional ) but also on the covariate distribution . Our metalearning objective involves maximizing the joint likelihood as opposed to just which leads to metaparameters sharing information about the covariates across tasks, in addition to the optimal network parameters. This way the latent factors of the covariate distribution of a new task can be quickly inferred from very few covariates. Finally, the inferred latent covariate factors are used to infer the posterior over the adapted network parameters.
For simplicity, we motivate the need to model covariates via a synthetic example in Fig. 1. The metadistribution consists of tasks with optimal hypothesis that can be classified into four hypothesis classes: sinusoidal (), linear (), tanh () and quadratic (). We note that the support of the input distribution is vastly different for each of the four hypothesis classes. Thus, intuitively we can see that inferring properties of the covariate distribution of a task can be helpful in adjusting the posterior over the network parameters. For example, for a given task if we only observe negative covariates in the range we can adapt the posterior to have a higher measure for (sinusoidal hypothesis).
We can generalize the above intuition to realworld metalearning problems as well, especially for highdimensional data like images. Particularly, in the limited availability of labeled data, semisupervised methods
(chapelle2009semi) tend to leverage unlabeled data (and hence covariate distribution) to attain generalizable models (berthelot2019mixmatch) On similar lines, modeling the metadistribution over the covariates can help inferring the taskspecific covariate distribution which can then better inform taskspecific features for image classification. In fewshot classification, images for different tasks can lie on different manifolds (saul2003think) of varying complexities. Information about the covariates (by modeling the unlabeled data) can help us better estimate the required complexity of the discriminative features for a given task. For example, distinguishing species of plants can be considered harder than classifying mammals since the features of plant images may be cluttered on a lowdimensional manifold as opposed to the possibly well separated features in the case of mammals. Thus the former may require having complex (nonlinear) decision boundaries as opposed to simpler linear classifiers in the case of the latter.The main contributions of our work are as follows: (1) we identify the need to model the latent structure present in the covariate distributions for a sequence of tasks (2) to the best of our knowledge we are the first to propose a Bayesian framework which exploits this latent information to better infer the posterior over the adapted network parameters (that define ) (3) we propose a gradient based modelagnostic metalearning algorithm that is an instantiation of our probabilistic theory and demonstrate its benefits on synthetic regression datasets.
2 Related Work
Our methodology is complementary to most existing works in the probabilistic metalearning literature. We borrow the basic hierarchical Bayes framework from ravi2018amortized; finn2018probabilistic and extend it to model Bayesian variables that generate the covariate distribution for a task. This enables our method to be modelagnostic while having the ability to benefit from the latent relationship between the task covariates and the optimal parameters as mentioned in Sec. 1. In the nonBayesian setting, the mmaml algorithm proposed by vuorio2019multimodal is mildly similar to our approach in the sense that they learn task specific initializations instead of a single one as originally introduced by finn2017model. mmaml uses the labeled samples to infer an initialization for a given task and hence one can view the covariate distribution as being used indirectly. But they fail to explicitly model the mutual information between the covariates and adapted parameters. On the other hand, our approach is more direct since it first infers the posterior over the latent factors of the covariate distribution via a maximum likelihood objective and then uses the inferred posterior to improve the adaptation of network parameters. Additionally, our framework is capable of modeling the uncertainty of the adaptation which can prove to be be critical in the few shot scenario.
3 Methodology
We begin by introducing some notations for the metalearning setup used in the rest of the paper followed by the proposed probabilistic framework which explicitly exploits (i) the structure of the covariate distributions across tasks (ii) the relation between the covariate distribution and optimal hypothesis for a given task. We then derive the Maximum Likelihood Estimation (mle) objectives for the observed variables in our model and show how the mle derivations can inform a novel metalearning objective. Finally, we discuss a specific metalearning algorithm that can efficiently optimize the proposed objective. We do this via an instantiation of the generic approach obtained by making certain simplifying assumptions in the original framework.
Notations We are given a sequence of tasks with each task having labeled samples given by the dataset where and . Following the definitions introduced by finn2018probabilistic we split the dataset into support () and query sets respectively with . Each sample in
is drawn from the joint distribution
over with the marginals given by and .The probabilistic model we consider in our work is summarized in Fig. 2. Without making any assumptions on the nature of , we assume the existence of metaparameters that govern the common structure shared across the set of joint distributions . Within each task the generative model for the input
involves a random variable
which we shall refer to as the latent factors of the covariate distribution. Also, each task has an additional latent variablewhich plays a role in the generative model for the response variable
given the input . In most settings, a naive assumption of independence is made over the latent factors and . On the other hand, we refrain from making such assumptions and instead exploit the information present in the covariates to better infer the posterior over the latent variable which influences the conditional .3.1 Formal Derivations
In this section, we derive a lower bound for the likelihood of the observed data using hierarchical variational inference. This gives us the metalearning objective that can be optimized using standard gradientbased approaches.
(1)  
In the above equation, the distribution is a variational approximation (with parameters ) for the true posterior over the metaparameter . For the derivations henceforth we shall drop the notations and when understood from context. The loglikelihood of the dataset given by , can be written as an integral over the factors , and .
To lower bound the log of the above objective we introduce two variational approximations (i) with parameters for the true posterior and (ii) with parameters for the true posterior .
(2) 
Since is the latent factor in the generative model for and is the corresponding latent variable for , we arrive at the independence: and . Based on this, we finally arrive at the following Evidence Lower Bound (elbo) for which we shall refer to as .
(3)  
(4) 
Therefore, the elbo on the likelihood of the dataset for task is a function of the taskspecific variational parameters and the variational metaparameter . For each task, the optimal variational parameters that approximate the true posteriors are distinct. Hence, need to be adapted for each task individually. Given we can rewrite the lower bound in Eq. 1 as:
(5)  
3.2 Algorithm
The primary aim of any metalearning algorithm is to optimize for the metaparameter given the datasets from the corresponding sequence of tasks. This is generally a two step process where stepI involves identifying the optimal taskspecific parameters using and the support set . In stepII, based on the taskspecific adapted parameters from stepI the metaparameter is optimized over the query set . Within our framework, since both the metaparameter and the taskspecific parameters are Bayesian random variables with variational parameters given by respectively, we instead define an algorithm to optimize the elbo in Eq. 5.
(6)  
(7) 
In order to optimize the objectives in Eqs. 6, 7 we introduce certain simplifying assumptions over each of the variational approximations in Sec. 3.1. ^{1}^{1}1We note that the assumptions we consider are merely to induce a computationally feasible metalearner and the theory involving a metadistribution over covariates (in Sec. 3.1) is barely undermined by it.
Assumption 1.
The variational approximation follows a distribution given by and the prior is given by . We note that the assumptions taken are effective in arriving at a computationally feasible algorithm and are
Assumption 2.
The variational approximation
follows a normal distribution with mean, diagonal covariance matrix given by
. The prior is taken as . On the other hand, the distribution is given by a distribution: .Using Assms. 1, 2 the Eqs. 6, 7 can be rewritten as a two layer loglikelihood objective with an regularization term.
(8)  
(9)  
(10) 
To be concrete, the taskspecific random variable represents the parameters of the neural network for the task, that takes as input and outputs a prediction . On the other hand, as stated previously represents the latent variable for the covariate distribution . Furthermore, the optimal variational parameters for the distribution is given by which consists of the mean and std. deviation of a normal distribution. Here, represent neural networks which take as input the support set and output . It is important to note that even though the parameters of are taskagnostic, the variational parameter is still different for each task and is determined using the covariates in .
Having identified we now describe the optimization algorithm for in Eq. 9. Notice that to obtain it is sufficient to only minimize the objective . In Eq. 3.1 the kl term acts as a regularizer in the optimization objective for
. Since the most common algorithm for optimization is Stochastic Gradient Descent (
sgd) many metalearning algorithms avoid the kl term by choosing a regularization specific to sgd. In most works (ravi2018amortized; finn2018probabilistic; kim2018bayesian), the kl term is a function of only the metaparameter (or given Assm. 1). Hence the regularization is induced by letting the initialization for the optimization of (given by ) be determined by . In our framework, we realize that the kl term is a function of both and the latent variable for the task specific covariate distribution. Hence we model the initialization using a neuralnetwork whose parameters are subsumed in and thus without loss of expressivity . Thus, the optimal parameters of the variational approximation () would be given by performing steps of sgd on the mle objective in Eq. 3.1 with the initialization given by .(11)  
The expectation in Eq 11 is computed using montecarlo approximation. We find that sampling a single value of is sufficient to optimize for .
Finally, we note that the metaparameter constitutes the parameters of the network which determines the initialization as well as the parameters of which output . Thus, is optimized to jointly maximize the likelihood of the covariates of a sequence of tasks as well as for learning to choose covariate dependent initializations suitable for fewshot adaptation. For the optimization objective in Eq. 8, we use the standard reparameterization trick ( step of the outer forloop in Algorithm 1) commonly used in Variational AutoEncoders (vaes). This is done so as to be able to differentiate through the expectation over in Eqs. 3, 11. The stepbystep procedure for the metatraining and metatesting phases are given by Algorithm 1 and Algorithm 2 respectively.
4 Experiments and Results
In order to first litmus test our approach on a simpler task we begin by evaluating it on a synthetic regression dataset borrowed from vuorio2019multimodal and defer further experimentation on more complicated real world datasets to future work. Most metalearning algorithms have been tested on regression datasets where the covariate distribution is same across all tasks. We thus describe how we suitably modify the original dataset so that there exists a structure over the set of covariate distributions across tasks. This enables us to fairly evaluate our method against the baselines in the proposed setting.
We compare our algorithm with the popular gradient based metalearning approach maml introduced by finn2017model. Additionally, we also chose as baselines Amortized maml (ravi2018amortized) and mmaml (vuorio2019multimodal) which are exemplars of the Bayesian and taskspecific initialization type approaches respectively. These methods either model the taskparameters as Bayesian random variables (former) or adapt the parameters of the optimizer based on the input dataset (latter) and hence warrant a close comparison with our proposed methodology.
Model  sine  sinequadlinear  five 

maml  0.05  1.27  1.69 
Amortized maml  0.07  1.39  1.13 
mmaml  0.04  0.59  0.93 
Ours  0.008  0.39  0.89 
Model  sine  sinequadlinear  five 

maml  0.04  1.15  1.73 
Amortized maml  0.07  1.41  1.08 
mmaml  0.03  0.51  0.88 
Ours  0.04  0.51  0.84 
Dataset
The covariate distribution for each task is given by a normal whose parameters are sampled from a discrete distribution over pairs . The pairs are fixed in the beginning once they are sampled from a pair of independent uniform priors, . The parameters of the discrete distribution are sampled from a Dirichlet prior . Following (vuorio2019multimodal), the optimal hypothesis for each task is sampled from one of many modalities or hypothesis classes. The hypothesis classes considered are : sine, linear, quad, transformedL and tanh
. For each task, having chosen a hypothesis class, the parameters of the optimal hypothesis (like slope of a linear functions) is chosen based on uniform distributions over the following
^{3}^{3}3Taken as is from (vuorio2019multimodal). Reiterated for the sake of completion.:
sine: , with and .

quad: with and .

linear: , .

transformedL: , with and .

tanh: with .
For each of the true hypothesis classes above the final value of is generated by adding an independent error term sampled from a normal distribution with mean
and standard deviation of
i.e .We consider two cases, first where there exists a relation between the parameters (mean, variance) of the covariate distribution and the optimal hypothesis class chosen for a task and second when the optimal hypothesis class is chosen independent of the mean, variance of the covariate distribution. We highlight the results in each case separately.
CaseI: With a specific relation
This setting conforms to the case when in Fig. 2. We experiment with three different metadistributions which are of different complexities owing to the variety (number) of hypothesis classes each of them span. The datasets are listed as follows:

sine: true hypothesis for each class is given by a sinusoidal function with parameters sampled from distributions mentioned previously. The range for each of the parameters is split into disjoint sets, each one corresponding to a specific covariate distribution.

sinequadlinear: each of the three hypothesis classes are mapped to a specific covariate distribution. Thus, once the parameters of the covariate distribution are sampled based on the discrete prior, the corresponding hypothesis class is also chosen and a task is sampled from it.

five: similar to the previous case with the distinction that all five hypothesis classes are considered in this dataset i.e .
Table 1 highlights the Mean Squared Errors (mse) achieved by our method and the baselines on the three regression datasets described above. We can see that when there exists a relation between the true hypothesis class and the covariate distribution our approach performs significantly better than other stateoftheart approaches for regression. Interestingly, improvements are also observed for the five dataset which was specifically introduced by vuorio2019multimodal for evaluating mmaml. The Bayesian model Amortized maml performs poorly since unlike our approach it fails to acknowledge the latent relationship between the covariate distribution and the posterior over the five hypothesis classes.
CaseII: Independent
This setting conforms to the case when in Fig. 2. In Table 2 we demonstrate that the performance of our approach is no worse (if not better) than other methods which assume the independence by default. Once again we experiment with the same three types of metadistribution mentioned in the previous case.
Implementation
The conditional is modeled using a threelayered neural network with hidden sizes . The networks which take as input the sequence of labeled samples in a dataset
are modeled using a common Recurrent Neural Network
rnn backbone with output of dimension . This can be referred to as the taskembedding (vuorio2019multimodal). The mapping of taskembedding to , given by is modeled similar to mmaml as well as the modulation over which gives us the final initialization of the network parameters ^{4}^{4}4The code for our implementation was in part borrowed from https://github.com/vuoristo/MMAMLRegression. Following prior work (finn2017meta), a biastransformation of size was appended to the inputs. The total number of training tasks used were with samples for the support and query sets for each task. Adam optimizer with an initial learning rate of was used to train the metalearner. Additionally, all kl terms were reweighted with a weight of .5 Conclusion
Cognizant of the fact that the generalization performance of fewshot algorithms depends on a varying number of factors ranging from sample size, hypothesis class complexity to the optimization algorithm, input distribution; in this work we focus our efforts on improving metalearning algorithms by using the covariate distribution to infer the adapted parameters via a principled Bayesian approach. We begin by deriving elbo bounds for the hierarchical Bayes formulation and follow it up with a metalearning algorithm to infer the posterior over the network parameters. Finally, we show some preliminary results on a synthetic regression dataset designed to test the usefulness of our method.
Motivated by the empirical gains observed, we plan to extend our work to more challenging fewshot image classification benchmarks like miniimagenet
(ravi2016optimization) and FC100 (oreshkin2018tadam). We also acknowledge that the proposed framework warrants a more rigorous theoretical analysis to understand exactly how inferring the covariate distribution can impact regret bounds (khodak2019adaptive; khodak2019provable) in the online fewshot setting or even generalization error bounds in the classical one.