Learning quickly or with very few samples has been a long-term goal of the machine learning community. The field of meta-learning has recently made significant strides towards achieving that goal. Meta-learning (nichol2018first; ravi2016optimization; finn2017model) comprises a set of algorithms designed to exploit prior experience from multiple tasks (drawn from a task distribution) to improve sample efficiency on a new but related task from the same distribution. Given the increasing cost of obtaining annotated samples for an ever-increasing variety of related tasks, the practical scope of these algorithms is immense.
Most meta-learning methods can be classified into two broad categories: (i) gradient-based approaches (ravi2018amortized; denevi2019learning; finn2017model) that meta-learn the parameters of an optimization algorithm (such as the initialization and learning rate) so that the meta-learner (optimizer) can quickly adapt to a new task by performing gradient descent on a very small number of labeled samples, and (ii) amortized-inference based approaches (snell2017prototypical; lee2019meta; bertinetto2018meta) that directly infer the optimal parameters for a new task without performing any gradient-based optimization. In general, such algorithms learn to adapt the parameters of a complex neural network using only a few samples from an unseen task in a way that the adapted network generalizes well (wang2019generalizing). In this work, although we focus on improving gradient-based methods, we believe that our core idea can be adapted to the latter as well. Recent works (finn2018probabilistic; ravi2018amortized; kim2018bayesian) have used a Bayesian framework to learn a suitable prior over the network parameters by leveraging the inherent structure of the task distribution. By viewing the parameters of a meta-learner through a Bayesian lens, we can use the predictive posterior (gal2016dropout) to estimate the uncertainty of the adapted parameters for each task (ravi2018amortized).
In a Bayesian meta-learner (kim2018bayesian; finn2018probabilistic; ravi2018amortized), the posterior over the adapted network parameters for a new task is typically inferred using a few samples from the task along with a meta-learned prior. In this work, we hypothesize that the covariate distribution of a task can also influence the posterior over the adapted network parameters. To the best of our knowledge, none of the existing meta-learning algorithms (e.g., bertinetto2018meta; rajeswaran2019meta; ravi2018amortized; finn2018probabilistic) explicitly utilize the information present in the covariates to improve the estimate of the adapted parameters. We do so by modeling the latent factors of the covariate distribution. We define a prior not only on the network parameters (which determine the conditional $p(y \mid x)$) but also on the covariate distribution $p(x)$. Our meta-learning objective maximizes the joint likelihood $p(x, y)$ as opposed to just $p(y \mid x)$, which leads to meta-parameters that share information about the covariates across tasks, in addition to the optimal network parameters. This way, the latent factors of the covariate distribution of a new task can be quickly inferred from very few covariates. Finally, the inferred latent covariate factors are used to infer the posterior over the adapted network parameters.
For simplicity, we motivate the need to model covariates via a synthetic example in Fig. 1. The meta-distribution consists of tasks whose optimal hypotheses fall into four hypothesis classes: sinusoidal, linear, tanh and quadratic. We note that the support of the input distribution is vastly different for each of the four hypothesis classes. Thus, intuitively, inferring properties of the covariate distribution of a task can help adjust the posterior over the network parameters. For example, if for a given task we observe only negative covariates within the range associated with the sinusoidal class, we can adapt the posterior to place a higher measure on the sinusoidal hypothesis.
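The intuition above can be made concrete with a small Bayes-rule computation. The following is a minimal sketch, assuming each hypothesis class comes with its own Gaussian covariate distribution; the (mean, std) values and the uniform class prior are illustrative assumptions and are not taken from Fig. 1.

```python
import numpy as np

# Hypothetical covariate distributions, one per hypothesis class.
# The (mean, std) pairs are illustrative assumptions, not values from the paper.
CLASS_COVARIATES = {
    "sinusoidal": (-4.0, 1.0),
    "linear": (-1.0, 1.0),
    "tanh": (1.0, 1.0),
    "quadratic": (4.0, 1.0),
}

def log_normal_pdf(x, mean, std):
    """Elementwise log density of N(mean, std^2)."""
    return -0.5 * np.log(2.0 * np.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

def class_posterior(covariates):
    """Posterior over hypothesis classes given unlabeled covariates (uniform prior)."""
    x = np.asarray(covariates, dtype=float)
    names = list(CLASS_COVARIATES)
    log_lik = np.array(
        [log_normal_pdf(x, *CLASS_COVARIATES[n]).sum() for n in names]
    )
    weights = np.exp(log_lik - log_lik.max())  # subtract max for numerical stability
    return dict(zip(names, weights / weights.sum()))

# Three strongly negative covariates, no labels needed.
posterior = class_posterior([-4.5, -3.2, -5.1])
```

Even with three unlabeled points, most of the posterior mass moves to the class whose covariate distribution best explains them, which is the effect the framework aims to exploit.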
We can generalize the above intuition to real-world meta-learning problems as well, especially for high-dimensional data like images. In particular, when labeled data is limited, semi-supervised methods (chapelle2009semi) tend to leverage unlabeled data (and hence the covariate distribution) to attain generalizable models (berthelot2019mixmatch). Along similar lines, modeling the meta-distribution over the covariates can help infer the task-specific covariate distribution, which can then better inform task-specific features for image classification. In few-shot classification, images for different tasks can lie on different manifolds (saul2003think) of varying complexities. Information about the covariates (obtained by modeling the unlabeled data) can help us better estimate the required complexity of the discriminative features for a given task. For example, distinguishing species of plants can be considered harder than classifying mammals, since the features of plant images may be cluttered on a low-dimensional manifold as opposed to the possibly well-separated features in the case of mammals. Thus the former may require complex (non-linear) decision boundaries, whereas simpler linear classifiers may suffice for the latter.
The main contributions of our work are as follows: (1) we identify the need to model the latent structure present in the covariate distributions for a sequence of tasks; (2) to the best of our knowledge, we are the first to propose a Bayesian framework that exploits this latent information to better infer the posterior over the adapted network parameters (that define $p(y \mid x)$); and (3) we propose a gradient-based model-agnostic meta-learning algorithm that is an instantiation of our probabilistic theory and demonstrate its benefits on synthetic regression datasets.
2 Related Work
Our methodology is complementary to most existing works in the probabilistic meta-learning literature. We borrow the basic hierarchical Bayes framework from ravi2018amortized; finn2018probabilistic and extend it to model Bayesian variables that generate the covariate distribution for a task. This enables our method to be model-agnostic while being able to benefit from the latent relationship between the task covariates and the optimal parameters, as mentioned in Sec. 1. In the non-Bayesian setting, the m-maml algorithm proposed by vuorio2019multimodal is somewhat similar to our approach in the sense that it learns task-specific initializations instead of a single one as originally introduced by finn2017model. m-maml uses the labeled samples to infer an initialization for a given task, and hence one can view the covariate distribution as being used indirectly. However, it does not explicitly model the mutual information between the covariates and the adapted parameters. In contrast, our approach is more direct, since it first infers the posterior over the latent factors of the covariate distribution via a maximum likelihood objective and then uses the inferred posterior to improve the adaptation of network parameters. Additionally, our framework is capable of modeling the uncertainty of the adaptation, which can prove to be critical in the few-shot scenario.
We begin by introducing some notation for the meta-learning setup used in the rest of the paper, followed by the proposed probabilistic framework, which explicitly exploits (i) the structure of the covariate distributions across tasks and (ii) the relation between the covariate distribution and the optimal hypothesis for a given task. We then derive the Maximum Likelihood Estimation (mle) objectives for the observed variables in our model and show how the mle derivations can inform a novel meta-learning objective. Finally, we discuss a specific meta-learning algorithm that can efficiently optimize the proposed objective. We do this via an instantiation of the generic approach obtained by making certain simplifying assumptions in the original framework.
Notations. We are given a sequence of tasks with each task having labeled samples given by the dataset where and . Following the definitions introduced by finn2018probabilistic we split the dataset into support () and query sets respectively with . Each sample in is drawn from the joint distribution over with the marginals given by and .
The probabilistic model we consider in our work is summarized in Fig. 2. Without making any assumptions on the nature of , we assume the existence of meta-parameters that govern the common structure shared across the set of joint distributions . Within each task the generative model for the input involves a random variable which we shall refer to as the latent factors of the covariate distribution. Also, each task has an additional latent variable which plays a role in the generative model for the response variable given the input . In most settings, a naive assumption of independence is made over the latent factors and . On the other hand, we refrain from making such assumptions and instead exploit the information present in the covariates to better infer the posterior over the latent variable which influences the conditional .
3.1 Formal Derivations
In this section, we derive a lower bound for the likelihood of the observed data using hierarchical variational inference. This gives us the meta-learning objective that can be optimized using standard gradient-based approaches.
In the above equation, the distribution is a variational approximation (with parameters ) for the true posterior over the meta-parameter . For the derivations henceforth we shall drop the notations and when understood from context. The log-likelihood of the dataset given by , can be written as an integral over the factors , and .
To lower bound the log of the above objective we introduce two variational approximations (i) with parameters for the true posterior and (ii) with parameters for the true posterior .
Since is the latent factor in the generative model for and is the corresponding latent variable for , we arrive at the independence: and . Based on this, we finally arrive at the following Evidence Lower Bound (elbo) for which we shall refer to as .
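For orientation, a bound of this kind generally takes the shape sketched below. This is a sketch under assumed notation, not the exact equation of this section: $\theta$ stands for the meta-parameter (treated as fixed here, whereas the framework places a variational distribution over it), $z$ for the latent factors of the covariate distribution, $w$ for the latent variable behind the conditional, and $q_\lambda$, $q_\gamma$ for the two variational approximations. All of these symbols are introduced purely for illustration. The sketch further assumes the per-task factorization $p(x, y, z, w \mid \theta) = p(z \mid \theta)\, p(w \mid z, \theta)\, p(x \mid z)\, p(y \mid x, w)$, and the inequality follows from Jensen's inequality.

```latex
\begin{aligned}
\log p(x, y \mid \theta)
  &= \log \int\!\!\int p(x \mid z)\, p(y \mid x, w)\,
       p(w \mid z, \theta)\, p(z \mid \theta)\, dw\, dz \\
  &\ge \mathbb{E}_{q_\lambda(z)}\big[\log p(x \mid z)\big]
     + \mathbb{E}_{q_\lambda(z)\, q_\gamma(w \mid z)}\big[\log p(y \mid x, w)\big] \\
  &\quad - \mathrm{KL}\big(q_\lambda(z) \,\|\, p(z \mid \theta)\big)
     - \mathbb{E}_{q_\lambda(z)}\Big[\mathrm{KL}\big(q_\gamma(w \mid z) \,\|\, p(w \mid z, \theta)\big)\Big]
\end{aligned}
```

The first expectation rewards explaining the covariates, the second rewards explaining the responses, and the two KL terms keep the task-specific posteriors close to the meta-learned priors.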
Therefore, the elbo on the likelihood of the dataset for task is a function of the task-specific variational parameters and the variational meta-parameter . For each task, the optimal variational parameters that approximate the true posteriors are distinct. Hence, need to be adapted for each task individually. Given we can re-write the lower bound in Eq. 1 as:
The primary aim of any meta-learning algorithm is to optimize for the meta-parameter given the datasets from the corresponding sequence of tasks. This is generally a two step process where step-I involves identifying the optimal task-specific parameters using and the support set . In step-II, based on the task-specific adapted parameters from step-I the meta-parameter is optimized over the query set . Within our framework, since both the meta-parameter and the task-specific parameters are Bayesian random variables with variational parameters given by respectively, we instead define an algorithm to optimize the elbo in Eq. 5.
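The two-step structure described above can be illustrated with a deliberately simple, non-Bayesian sketch. It assumes a scalar linear model, a first-order approximation of the meta-gradient, and hand-picked learning rates; none of these choices come from the paper, and the function names (`adapt`, `meta_step`, `make_task`) are hypothetical.

```python
import numpy as np

def mse_grad(w, x, y):
    """Gradient of the mean squared error for the model y_hat = w * x."""
    return 2.0 * np.mean((w * x - y) * x)

def adapt(w_meta, x_s, y_s, inner_lr=0.1, steps=5):
    """Step I: adapt the task parameter on the support set, starting from w_meta."""
    w = w_meta
    for _ in range(steps):
        w = w - inner_lr * mse_grad(w, x_s, y_s)
    return w

def meta_step(w_meta, tasks, outer_lr=0.05):
    """Step II: update w_meta with query-set gradients taken at the adapted
    parameters (first-order approximation: the inner loop is not differentiated)."""
    grads = [mse_grad(adapt(w_meta, xs, ys), xq, yq) for xs, ys, xq, yq in tasks]
    return w_meta - outer_lr * np.mean(grads)

rng = np.random.default_rng(0)

def make_task():
    """Toy task: y = slope * x with a task-specific slope in [1, 3]."""
    slope = rng.uniform(1.0, 3.0)
    x = rng.normal(size=10)
    return x[:5], slope * x[:5], x[5:], slope * x[5:]  # support / query split

w_meta = 0.0
for _ in range(200):
    w_meta = meta_step(w_meta, [make_task() for _ in range(4)])
```

In the Bayesian setting of this section, the point estimates `w` and `w_meta` would be replaced by variational parameters and the squared error by the corresponding lower bound.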
In order to optimize the objectives in Eqs. 6, 7 we introduce certain simplifying assumptions over each of the variational approximations in Sec. 3.1. We note that these assumptions serve merely to induce a computationally feasible meta-learner; the theory involving a meta-distribution over covariates (in Sec. 3.1) is largely unaffected by them.
The variational approximation follows a distribution given by and the prior is given by .
To be concrete, the task-specific random variable represents the parameters of the neural network for the task, which takes as input and outputs a prediction . On the other hand, as stated previously, represents the latent variable for the covariate distribution . Furthermore, the optimal variational parameters for the distribution are given by , which consists of the mean and standard deviation of a normal distribution. Here, represent neural networks which take as input the support set and output . It is important to note that even though the parameters of are task-agnostic, the variational parameter is still different for each task and is determined using the covariates in .
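The amortization described above, where shared networks map a support set to task-specific variational parameters, can be sketched as follows. This is an illustration under assumptions: the layer sizes, the mean pooling over samples, and all names (`params`, `encode_covariates`) are hypothetical and stand in for whatever architecture is actually used.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, LATENT = 16, 2  # illustrative sizes, not taken from the paper

# Task-agnostic encoder weights (these would belong to the meta-parameters).
params = {
    "W1": rng.normal(scale=0.5, size=(1, HIDDEN)),
    "b1": np.zeros(HIDDEN),
    "W_mu": rng.normal(scale=0.5, size=(HIDDEN, LATENT)),
    "b_mu": np.zeros(LATENT),
    "W_ls": rng.normal(scale=0.5, size=(HIDDEN, LATENT)),
    "b_ls": np.zeros(LATENT),
}

def encode_covariates(x_support, p):
    """Map support covariates to the (mean, std) of a diagonal normal over the latent."""
    h = np.tanh(x_support.reshape(-1, 1) @ p["W1"] + p["b1"])  # per-sample features
    pooled = h.mean(axis=0)                                    # order-invariant pooling
    mu = pooled @ p["W_mu"] + p["b_mu"]
    sigma = np.exp(pooled @ p["W_ls"] + p["b_ls"])             # exp keeps std positive
    return mu, sigma

# Same encoder weights, two tasks with different covariate distributions.
mu_a, sigma_a = encode_covariates(rng.normal(-3.0, 0.5, size=5), params)
mu_b, sigma_b = encode_covariates(rng.normal(3.0, 0.5, size=5), params)
```

The weights in `params` are shared across tasks, yet the outputs differ per task because they are computed from that task's covariates.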
Having identified we now describe the optimization algorithm for in Eq. 9. Notice that to obtain it is sufficient to only minimize the objective . In Eq. 3.1 the kl term acts as a regularizer in the optimization objective for . Since the most common algorithm for optimization is Stochastic Gradient Descent (sgd), many meta-learning algorithms avoid the kl term by choosing a regularization specific to sgd. In most works (ravi2018amortized; finn2018probabilistic; kim2018bayesian), the kl term is a function of only the meta-parameter (or given Assm. 1). Hence the regularization is induced by letting the initialization for the optimization of (given by ) be determined by . In our framework, we realize that the kl term is a function of both and the latent variable for the task-specific covariate distribution. Hence we model the initialization using a neural network whose parameters are subsumed in and thus without loss of expressivity . Thus, the optimal parameters of the variational approximation () would be given by performing steps of sgd on the mle objective in Eq. 3.1 with the initialization given by .
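The role of a covariate-dependent initialization followed by a few gradient steps can be sketched as below. This is a toy illustration under assumptions: the linear modulation in `init_from_latent`, the two-parameter linear model, the step size, and the number of steps are all hypothetical, and full-batch gradient descent stands in for sgd.

```python
import numpy as np

def init_from_latent(z, base_init, W_mod):
    """Hypothetical modulation: shift a shared initialization by a linear map of z."""
    return base_init + W_mod @ z

def sgd_adapt(w0, x, y, lr=0.05, k=10):
    """Run k gradient steps on squared error for the model y_hat = w[0] * x + w[1]."""
    w = w0.copy()
    feats = np.stack([x, np.ones_like(x)], axis=1)
    for _ in range(k):
        resid = feats @ w - y
        w -= lr * 2.0 * (feats.T @ resid) / len(x)
    return w

def loss(w, x, y):
    """Mean squared error of the linear model on (x, y)."""
    return np.mean((w[0] * x + w[1] - y) ** 2)

rng = np.random.default_rng(1)
base_init = np.zeros(2)
W_mod = np.eye(2)             # would be meta-learned; fixed here for illustration
z = np.array([2.0, -1.0])     # stand-in for a sample of the inferred covariate latent
x = rng.normal(size=8)
y = 2.5 * x - 1.0             # true task parameters: slope 2.5, intercept -1.0

w_plain = sgd_adapt(base_init, x, y)
w_modulated = sgd_adapt(init_from_latent(z, base_init, W_mod), x, y)
```

With the same small budget of steps, the run that starts from the latent-dependent initialization ends closer to the task optimum, which is the sense in which the initialization acts as a regularizer toward task-appropriate parameters.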
The expectation in Eq. 11 is computed using a Monte Carlo approximation. We find that sampling a single value of is sufficient to optimize for .
Finally, we note that the meta-parameter constitutes the parameters of the network which determines the initialization as well as the parameters of which output . Thus, is optimized jointly to maximize the likelihood of the covariates of a sequence of tasks and to learn to choose covariate-dependent initializations suitable for few-shot adaptation. For the optimization objective in Eq. 8, we use the standard re-parameterization trick ( step of the outer for-loop in Algorithm 1) commonly used in Variational Auto-Encoders (vaes). This is done so as to be able to differentiate through the expectation over in Eqs. 3, 11. The step-by-step procedures for the meta-training and meta-testing phases are given by Algorithm 1 and Algorithm 2 respectively.
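The re-parameterization trick and the single-sample Monte Carlo estimate mentioned above can be illustrated on a toy integrand. In this sketch the function `f`, the values of `mu` and `sigma`, and the sample counts are assumptions chosen so that the exact gradients are known in closed form.

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Draw z ~ N(mu, sigma^2) as a deterministic function of (mu, sigma) and noise."""
    eps = rng.standard_normal(np.shape(mu))
    return mu + sigma * eps, eps

def f(z):
    """Toy integrand standing in for a log-likelihood term."""
    return np.sum(z ** 2)

rng = np.random.default_rng(0)
mu, sigma = np.array([1.0, -2.0]), np.array([0.5, 0.3])

# Single-sample estimate of E[f(z)] and its pathwise gradients.
z, eps = reparameterize(mu, sigma, rng)
estimate = f(z)
grad_mu = 2.0 * z            # df/dz * dz/dmu, with dz/dmu = 1
grad_sigma = 2.0 * z * eps   # df/dz * dz/dsigma, with dz/dsigma = eps

# Averaging many single-sample gradients approaches the analytic gradient:
# E[sum z^2] = sum(mu^2 + sigma^2), so d/dmu = 2 * mu and d/dsigma = 2 * sigma.
samples = [reparameterize(mu, sigma, rng) for _ in range(20000)]
avg_grad_mu = np.mean([2.0 * z_i for z_i, _ in samples], axis=0)
avg_grad_sigma = np.mean([2.0 * z_i * e_i for z_i, e_i in samples], axis=0)
```

Each single-sample gradient is noisy but unbiased, which is why one sample per update can suffice when the updates are averaged over many optimization steps.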
4 Experiments and Results
As an initial litmus test on a simpler task, we begin by evaluating our approach on a synthetic regression dataset borrowed from vuorio2019multimodal and defer further experimentation on more complicated real-world datasets to future work. Most meta-learning algorithms have been tested on regression datasets where the covariate distribution is the same across all tasks. We thus describe how we suitably modify the original dataset so that there exists a structure over the set of covariate distributions across tasks. This enables us to fairly evaluate our method against the baselines in the proposed setting.
We compare our algorithm with the popular gradient based meta-learning approach maml introduced by finn2017model. Additionally, we also chose as baselines Amortized maml (ravi2018amortized) and m-maml (vuorio2019multimodal) which are exemplars of the Bayesian and task-specific initialization type approaches respectively. These methods either model the task-parameters as Bayesian random variables (former) or adapt the parameters of the optimizer based on the input dataset (latter) and hence warrant a close comparison with our proposed methodology.
The covariate distribution for each task is given by a normal whose parameters are sampled from a discrete distribution over pairs . The pairs are fixed in the beginning once they are sampled from a pair of independent uniform priors, . The parameters of the discrete distribution are sampled from a Dirichlet prior . Following vuorio2019multimodal, the optimal hypothesis for each task is sampled from one of many modalities or hypothesis classes. The hypothesis classes considered are: sine, linear, quad, transformed-L and tanh. For each task, having chosen a hypothesis class, the parameters of the optimal hypothesis (like the slope of a linear function) are chosen based on uniform distributions over the following (taken as is from vuorio2019multimodal and reiterated here for completeness):
sine: , with and .
quad: with and .
linear: , .
transformed-L: , with and .
tanh: with .
For each of the true hypothesis classes above the final value of is generated by adding an independent error term sampled from a normal distribution with mean and standard deviation of , i.e. .
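The generative process for a task described above can be sketched as follows. All numeric values here (number of pairs, uniform ranges, Dirichlet concentration, noise level, and the sine parameter ranges) are placeholders chosen for illustration, not the values used in the experiments, and only the sine class is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_PAIRS = 3  # number of (mean, std) pairs; illustrative

# Fixed once: (mean, std) pairs drawn from independent uniform priors (ranges assumed).
means = rng.uniform(-5.0, 5.0, size=NUM_PAIRS)
stds = rng.uniform(0.5, 1.5, size=NUM_PAIRS)

# Parameters of the discrete distribution over pairs, drawn from a Dirichlet prior
# (a symmetric concentration of 1 is an assumption).
weights = rng.dirichlet(np.ones(NUM_PAIRS))

def sample_task(num_samples=10, noise_std=0.3):
    """Sample one regression task: covariates, noisy targets, and the chosen pair index."""
    k = rng.choice(NUM_PAIRS, p=weights)
    x = rng.normal(means[k], stds[k], size=num_samples)
    # Sine hypothesis with assumed amplitude and phase ranges.
    amplitude, phase = rng.uniform(0.1, 5.0), rng.uniform(0.0, np.pi)
    noise = rng.normal(0.0, noise_std, size=num_samples)  # zero-mean noise is assumed
    y = amplitude * np.sin(x + phase) + noise
    return x, y, k

x, y, k = sample_task()
```

Because the pairs and mixing weights are drawn once and then held fixed, every task shares the same small set of candidate covariate distributions, which is the structure a meta-learner can pick up on.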
We consider two cases: the first, where there exists a relation between the parameters (mean, variance) of the covariate distribution and the optimal hypothesis class chosen for a task, and the second, where the optimal hypothesis class is chosen independently of the mean and variance of the covariate distribution. We highlight the results in each case separately.
Case-I: With a specific relation
This setting conforms to the case when in Fig. 2. We experiment with three different meta-distributions which are of different complexities owing to the variety (number) of hypothesis classes each of them span. The datasets are listed as follows:
sine: true hypothesis for each class is given by a sinusoidal function with parameters sampled from distributions mentioned previously. The range for each of the parameters is split into disjoint sets, each one corresponding to a specific covariate distribution.
sine-quad-linear: each of the three hypothesis classes is mapped to a specific covariate distribution. Thus, once the parameters of the covariate distribution are sampled based on the discrete prior, the corresponding hypothesis class is also chosen and a task is sampled from it.
five: similar to the previous case with the distinction that all five hypothesis classes are considered in this dataset, i.e., sine, linear, quad, transformed-L and tanh.
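The difference between tying the hypothesis class to the covariate distribution and choosing it independently can be sketched in a few lines. The functional forms, parameter ranges, and covariate parameters below are assumptions made for illustration, and `sample_task` with its `coupled` flag is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical functional forms; the actual parameter ranges are not reproduced here.
HYPOTHESES = {
    "sine": lambda x, a, b: a * np.sin(x + b),
    "quad": lambda x, a, b: a * x ** 2 + b,
    "linear": lambda x, a, b: a * x + b,
}
CLASS_NAMES = list(HYPOTHESES)

# One covariate distribution (mean, std) per class; values are assumed.
COVARIATE_PARAMS = [(-4.0, 1.0), (0.0, 1.0), (4.0, 1.0)]

def sample_task(coupled, n=10):
    """Sample a task; coupled=True ties the hypothesis class to the covariate distribution."""
    k = rng.integers(len(COVARIATE_PARAMS))
    mean, std = COVARIATE_PARAMS[k]
    x = rng.normal(mean, std, size=n)
    c = k if coupled else rng.integers(len(CLASS_NAMES))
    name = CLASS_NAMES[c]
    a, b = rng.uniform(0.5, 2.0, size=2)
    return x, HYPOTHESES[name](x, a, b), k, name

coupled_tasks = [sample_task(coupled=True) for _ in range(200)]
independent_tasks = [sample_task(coupled=False) for _ in range(200)]
```

In the coupled setting the covariates alone identify the hypothesis class, whereas in the independent setting they carry no information about it.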
Table 1 highlights the Mean Squared Errors (mse) achieved by our method and the baselines on the three regression datasets described above. We can see that when there exists a relation between the true hypothesis class and the covariate distribution our approach performs significantly better than other state-of-the-art approaches for regression. Interestingly, improvements are also observed for the five dataset which was specifically introduced by vuorio2019multimodal for evaluating m-maml. The Bayesian model Amortized maml performs poorly since unlike our approach it fails to acknowledge the latent relationship between the covariate distribution and the posterior over the five hypothesis classes.
The second setting, in which the hypothesis class is chosen independently of the covariate distribution, conforms to the case when in Fig. 2. In Table 2 we demonstrate that the performance of our approach is no worse (if not better) than that of other methods which assume the independence by default. Once again, we experiment with the same three types of meta-distribution mentioned in the previous case.
The conditional is modeled using a three-layered neural network with hidden sizes . The networks which take as input the sequence of labeled samples in a dataset are modeled using a common Recurrent Neural Network (rnn) backbone with output of dimension . This can be referred to as the task-embedding (vuorio2019multimodal). The mapping of the task-embedding to , given by , is modeled similarly to m-maml, as is the modulation over which gives us the final initialization of the network parameters (the code for our implementation was in part borrowed from https://github.com/vuoristo/MMAML-Regression). Following prior work (finn2017meta), a bias-transformation of size was appended to the inputs. The total number of training tasks used was with samples for the support and query sets for each task. The Adam optimizer with an initial learning rate of was used to train the meta-learner. Additionally, all kl terms were re-weighted with a weight of .
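The re-weighting of kl terms mentioned above can be sketched with the closed-form kl divergence between diagonal normals. The weight value, the standard normal prior, and the helper names (`kl_diag_normal`, `weighted_objective`) are assumptions for illustration only.

```python
import numpy as np

def kl_diag_normal(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ), summed over dimensions."""
    var_q, var_p = sigma_q ** 2, sigma_p ** 2
    per_dim = (
        np.log(sigma_p / sigma_q)
        + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p)
        - 0.5
    )
    return np.sum(per_dim)

def weighted_objective(pred, target, kl_terms, kl_weight=0.1):
    """Mean squared error plus re-weighted KL regularizers (kl_weight is assumed)."""
    return np.mean((pred - target) ** 2) + kl_weight * sum(kl_terms)

mu_q, sigma_q = np.array([0.5, -0.5]), np.array([0.8, 1.2])
kl = kl_diag_normal(mu_q, sigma_q, np.zeros(2), np.ones(2))  # against a standard normal
loss = weighted_objective(np.array([1.0, 2.0]), np.array([1.5, 1.5]), [kl])
```

A weight below one lets the data-fit term dominate early in training while still keeping the task-specific posteriors anchored to their priors.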
Cognizant of the fact that the generalization performance of few-shot algorithms depends on a number of factors, ranging from sample size and hypothesis class complexity to the optimization algorithm and input distribution, in this work we focus our efforts on improving meta-learning algorithms by using the covariate distribution to infer the adapted parameters via a principled Bayesian approach. We begin by deriving elbos for the hierarchical Bayes formulation and follow this with a meta-learning algorithm to infer the posterior over the network parameters. Finally, we show some preliminary results on a synthetic regression dataset designed to test the usefulness of our method.
Motivated by the empirical gains observed, we plan to extend our work to more challenging few-shot image classification benchmarks like mini-imagenet (ravi2016optimization) and FC100 (oreshkin2018tadam). We also acknowledge that the proposed framework warrants a more rigorous theoretical analysis to understand exactly how inferring the covariate distribution can impact regret bounds (khodak2019adaptive; khodak2019provable) in the online few-shot setting or even generalization error bounds in the classical one.