1 Introduction
Learning to learn, or meta-learning (Schmidhuber, 1992; Thrun & Pratt, 2012), offers a promising tool for few-shot learning (Andrychowicz et al., 2016; Ravi & Larochelle, 2017; Finn et al., 2017) and has recently gained increasing popularity in machine learning. The crux of meta-learning for few-shot learning is to extract prior knowledge from related tasks to enable fast adaptation to a new task with a limited amount of data. Generally speaking, existing meta-learning algorithms (Ravi & Larochelle, 2017; Bertinetto et al., 2019) design the meta-learner to extract meta-knowledge that improves the performance of the base-learner on individual tasks. Meta-knowledge, such as a good parameter initialization (Finn et al., 2017) or an efficient optimization update rule shared across tasks (Andrychowicz et al., 2016; Ravi & Larochelle, 2017), has been extensively explored in general learning frameworks, but how to define and use it in few-shot learning remains an open question.
An effective base-learner should be powerful enough to solve individual tasks and able to absorb information provided by the meta-learner to improve its own performance. While potentially strong base-learners, kernels (Hofmann et al., 2008) have not yet been studied in the meta-learning scenario for few-shot learning. Learning adaptive kernels (Bach et al., 2004) in a data-driven way via random features (Rahimi & Recht, 2007) has demonstrated great success in regular learning tasks and remains of broad interest in machine learning (Sinha & Duchi, 2016; Hensman et al., 2017; Carratino et al., 2018; Bullins et al., 2018; Li et al., 2019). However, due to the limited availability of data, it is challenging for few-shot learning to establish informative and discriminant kernels. We thus explore the relatedness among distinctive but relevant tasks to generate rich random features to build strong kernels for base-learners, while still maintaining their ability to adapt quickly to individual tasks.
In this paper, we make three important contributions. First, we propose meta variational random features (MetaVRF), integrating, for the first time, kernel learning with random features and variational inference into the meta-learning framework for few-shot learning. We develop MetaVRF in a latent variable model by treating the random Fourier basis of translation-invariant kernels as the latent variable. Second, we formulate the optimization of MetaVRF as a variational inference problem by deriving a new evidence lower bound (ELBO) in the meta-learning setting, where the posterior over the random feature basis corresponds to the spectral distribution associated with the kernel. This probabilistic formulation provides a principled way of learning data-driven kernels with random Fourier features and, more importantly, fits well in the meta-learning framework for few-shot learning, allowing us to flexibly customize the variational posterior to leverage the meta-knowledge for inference. Third, we propose a context inference, which puts the inference of random feature bases of the current task into the context of all previous, related tasks. The context inference provides a generalized way to integrate context information of related tasks with task-specific information for the inference of random feature bases. To establish the context inference, we introduce a recurrent LSTM architecture (Hochreiter & Schmidhuber, 1997), leveraging its innate capability of learning long-term dependencies, which can be adopted to explore shared meta-knowledge from a large set of previous tasks. The LSTM-based inference connects knowledge from previous tasks to the current task, gradually collecting and refreshing the knowledge across the course of learning. The learning process with an LSTM-based inference network is illustrated in Figure 1. Once learning ceases, the final LSTM state holds the meta-knowledge accumulated from the related tasks it has experienced, which enables fast adaptation to new tasks.
We demonstrate the effectiveness of the proposed MetaVRF by extensive experiments on a variety of few-shot regression and classification tasks. Results show that our MetaVRF achieves better, or at least competitive, performance compared to previous methods. Moreover, we conduct further analysis on MetaVRF to demonstrate its ability to be integrated with deeper architectures and its efficiency with relatively low sampling rates. We also apply MetaVRF to versatile and challenging settings with inconsistent training and test conditions, and it can still deliver promising results, which further demonstrates its strong learning ability.
2 Method
We first describe the base-learner based on kernel ridge regression in meta-learning for few-shot learning, and then introduce kernel learning with random features, on which our meta variational random features are built.
2.1 Meta-Learning with Kernels
We adopt the episodic training strategy commonly used for few-shot classification in meta-learning (Ravi & Larochelle, 2017), which involves meta-training and meta-testing stages. In the meta-training stage, a meta-learner is trained to enhance the performance of a base-learner on a meta-training set with a batch of few-shot learning tasks, where a task is usually referred to as an episode (Ravi & Larochelle, 2017). In the meta-testing stage, the base-learner is evaluated on a meta-testing set whose classes of data samples differ from those in the meta-training set.
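For the few-shot classification problem, we sample $C$-way $k$-shot classification tasks from the meta-training set, where $k$ is the number of labelled examples for each of the $C$ classes. Given the $t$-th task with a support set $\mathcal{S}^t = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{C \times k}$ and query set $\mathcal{Q}^t = \{(\tilde{\mathbf{x}}_i, \tilde{\mathbf{y}}_i)\}_{i=1}^{m}$, we learn the parameters $\alpha^t$ of the predictor $f_{\alpha^t}$ using a standard learning algorithm with kernel trick, $\alpha^t = \Lambda(\Phi(X), Y)$, where $\mathcal{S}^t = \{X, Y\}$. Here, $\Lambda$ is the base-learner and $\Phi$ is a mapping function from the input space $\mathcal{X}$ to a dot product space $\mathcal{H}$. The similarity measure $k(\mathbf{x}, \mathbf{x}') = \langle \Phi(\mathbf{x}), \Phi(\mathbf{x}') \rangle$ is usually called a kernel (Hofmann et al., 2008).
As in traditional supervised learning problems, the base-learner for the $t$-th single task can use a predefined kernel, e.g., the radial basis function (RBF) kernel, to map the input into a dot product space for efficient learning. Once the base-learner is obtained on the support set, its performance is evaluated on the query set by the following loss function:
$$\sum_{(\tilde{\mathbf{x}}, \tilde{\mathbf{y}}) \in \mathcal{Q}^t} L\big(f_{\alpha^t}(\Phi(\tilde{\mathbf{x}})), \tilde{\mathbf{y}}\big), \tag{1}$$
where $L$ can be any differentiable function, e.g., cross-entropy loss. In the meta-learning setting for few-shot learning, we usually consider a batch of $T$ tasks. Thus, the meta-learner is trained by optimizing the following objective function w.r.t. the empirical loss on $T$ tasks:
$$\min_{\Phi} \sum_{t=1}^{T} \sum_{(\tilde{\mathbf{x}}, \tilde{\mathbf{y}}) \in \mathcal{Q}^t} L\big(f_{\alpha^t}(\Phi^t(\tilde{\mathbf{x}})), \tilde{\mathbf{y}}\big), \quad \text{s.t.} \;\; \alpha^t = \Lambda\big(\Phi^t(X), Y\big), \tag{2}$$
where $\Phi^t$ is the feature mapping function which can be obtained by learning a task-specific kernel for each task with data-driven random Fourier features.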
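The episodic sampling described above can be sketched as follows. This is a minimal illustration, not the authors' data pipeline: the function name `sample_episode`, its arguments, and the choice to draw a fixed number of query samples per class are assumptions made for the example.

```python
import numpy as np

def sample_episode(labels, n_way, k_shot, n_query, rng):
    """Draw one n_way-way k_shot-shot episode from an array of class labels.

    Returns index arrays for the support and query sets plus the chosen
    classes. All names here are hypothetical, for illustration only.
    """
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support_idx, query_idx = [], []
    for c in classes:
        # Shuffle the indices of class c, then split into support / query.
        idx = rng.permutation(np.flatnonzero(labels == c))
        support_idx.extend(idx[:k_shot])
        query_idx.extend(idx[k_shot:k_shot + n_query])
    return np.array(support_idx), np.array(query_idx), classes
```

Splitting a single shuffled index list per class keeps the support and query sets of an episode disjoint, which is what makes the query loss a held-out measure of the base-learner.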
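In this work, we employ kernel ridge regression (KRR), which has an efficient closed-form solution, as the base-learner for few-shot learning. The kernel value in the Gram matrix $K \in \mathbb{R}^{Ck \times Ck}$ can be computed as $k(\mathbf{x}, \mathbf{x}') = \Phi(\mathbf{x}) \Phi(\mathbf{x}')^{\top}$, where "$\top$" is the transpose operation. The base-learner $\Lambda$ for a single task can be obtained by solving the following objective w.r.t. the support set of this task,
$$\Lambda = \arg\min_{\alpha} \operatorname{Tr}\big[(Y - \alpha K)(Y - \alpha K)^{\top}\big] + \lambda \operatorname{Tr}\big[\alpha K \alpha^{\top}\big], \tag{3}$$
which admits a closed-form solution
$$\alpha = Y(\lambda I + K)^{-1}. \tag{4}$$
The learned predictor is then applied to samples in the query set $\tilde{X}$:
$$\hat{Y} = f_{\alpha}(\tilde{X}) = \alpha \tilde{K}. \tag{5}$$
Here, $\tilde{K} = \Phi(X) \Phi(\tilde{X})^{\top} \in \mathbb{R}^{Ck \times m}$, with each element as $k(\mathbf{x}, \tilde{\mathbf{x}})$ between the samples from the support and query sets, and $m$ is the number of query samples. Note that we also treat $\lambda$ in (3) as a trainable parameter by leveraging the meta-learning setting, and all these parameters are learned by the meta-learner.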
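As a hedged illustration of the kernel ridge regression base-learner, the sketch below fits the closed-form solution on support features and predicts on query features. It assumes features are stacked row-wise and labels are stored as a (classes x samples) one-hot matrix; the function names `krr_fit` and `krr_predict` are hypothetical and not taken from the paper.

```python
import numpy as np

def krr_fit(phi_support, y_support, lam):
    """Closed-form KRR weights alpha = Y (lam * I + K)^{-1}.

    phi_support: (n, D) feature maps of the n support samples.
    y_support:   (C, n) one-hot labels (an assumed layout).
    """
    K = phi_support @ phi_support.T              # (n, n) Gram matrix
    A = K + lam * np.eye(K.shape[0])             # symmetric positive definite
    # Solve A alpha^T = Y^T instead of forming the inverse explicitly.
    return np.linalg.solve(A, y_support.T).T     # (C, n)

def krr_predict(alpha, phi_support, phi_query):
    """Predictions Y_hat = alpha K_tilde, with K_tilde = Phi(X) Phi(X_tilde)^T."""
    K_tilde = phi_support @ phi_query.T          # (n, m)
    return alpha @ K_tilde                       # (C, m)
```

Using a linear solve rather than an explicit matrix inverse is a numerical design choice of this sketch; the result is the same closed-form solution.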
Rather than using predefined kernels, we consider learning adaptive kernels with random Fourier features in a data-driven way. Moreover, we leverage the shared knowledge by exploring dependencies among related tasks to learn rich features for building up informative kernels.
2.2 Random Fourier Features
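Random Fourier features (RFFs) were proposed to construct approximate translation-invariant kernels using explicit feature maps (Rahimi & Recht, 2007), based on Bochner's theorem (Rudin, 1962).
Theorem 1 (Bochner's theorem) (Rudin, 1962) A continuous, real-valued, symmetric and shift-invariant function $k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x} - \mathbf{x}')$ on $\mathbb{R}^d$ is a positive definite kernel if and only if it is the Fourier transform of a positive finite measure $p(\omega)$ such that
$$k(\mathbf{x}, \mathbf{x}') = \int_{\mathbb{R}^d} e^{i \omega^{\top} (\mathbf{x} - \mathbf{x}')} \, dp(\omega) = \mathbb{E}_{\omega}\big[\zeta_{\omega}(\mathbf{x}) \zeta_{\omega}(\mathbf{x}')^{*}\big], \quad \zeta_{\omega}(\mathbf{x}) = e^{i \omega^{\top} \mathbf{x}}. \tag{6}$$
It is guaranteed that $\zeta_{\omega}(\mathbf{x}) \zeta_{\omega}(\mathbf{x}')^{*}$ is an unbiased estimate of $k(\mathbf{x}, \mathbf{x}')$ with sufficient RFF bases $\{\omega\}$ drawn from $p(\omega)$ (Rahimi & Recht, 2007).
For a predefined kernel, e.g., the radial basis function (RBF) kernel, we use Monte Carlo sampling to draw bases from the spectral distribution, which gives rise to the explicit feature map:
$$\mathbf{z}(\mathbf{x}) = \sqrt{\tfrac{2}{D}} \big[\cos(\omega_1^{\top} \mathbf{x} + b_1), \ldots, \cos(\omega_D^{\top} \mathbf{x} + b_D)\big], \tag{7}$$
where $\{\omega_1, \ldots, \omega_D\}$ are the random bases sampled from $p(\omega)$, and $\{b_1, \ldots, b_D\}$ are $D$ biases sampled from a uniform distribution with a range of $[0, 2\pi]$. Finally, the kernel values $k(\mathbf{x}, \mathbf{x}') = \mathbf{z}(\mathbf{x}) \mathbf{z}(\mathbf{x}')^{\top}$ in $K$ are computed as the dot product of their random feature maps with the same bases.

3 Meta Variational Random Features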
We introduce our MetaVRF using a latent variable model in which we treat random Fourier bases as latent variables inferred from data. Learning kernels with random Fourier features is tantamount to finding the posterior distribution over random bases in a data-driven way. It is naturally cast into a variational inference problem, where the optimization objective is derived from an evidence lower bound (ELBO) under the meta-learning framework.
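For reference, the fixed (non-learned) random Fourier feature construction of Section 2.2, which MetaVRF replaces with inferred bases, can be sketched as below. This is a minimal sketch under stated assumptions: it targets the RBF kernel $\exp(-\|\mathbf{x} - \mathbf{x}'\|^2 / 2\sigma^2)$, whose spectral distribution is Gaussian, uses the $\sqrt{2/D}$ scaling of Rahimi & Recht (2007), and the names `sample_rbf_bases` and `rff_map` are hypothetical.

```python
import numpy as np

def sample_rbf_bases(dim, n_bases, sigma, rng):
    """Draw bases from the spectral distribution of an RBF kernel.

    For k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) the spectral
    distribution is N(0, sigma^{-2} I); biases are uniform on [0, 2 pi].
    """
    omega = rng.standard_normal((n_bases, dim)) / sigma
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_bases)
    return omega, b

def rff_map(x, omega, b):
    """Explicit feature map z(x) = sqrt(2 / D) * cos(omega x + b).

    x: (n, dim) inputs; returns (n, D) random features.
    """
    n_bases = omega.shape[0]
    return np.sqrt(2.0 / n_bases) * np.cos(x @ omega.T + b)
```

With both inputs mapped through the same bases, `rff_map(x, ...) @ rff_map(y, ...).T` is a Monte Carlo estimate of the kernel value, and its error shrinks as the number of bases grows.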
3.1 Evidence Lower Bound
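From a probabilistic perspective, under the meta-learning setting for few-shot learning, the random feature basis can be obtained by maximizing the conditional predictive log-likelihood of samples $(\mathbf{x}, \mathbf{y})$ from the query set $\mathcal{Q}$:
$$\max_{p} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{Q}} \log p(\mathbf{y} \mid \mathbf{x}, \mathcal{S}) \tag{8}$$
$$= \max_{p} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{Q}} \log \int p(\mathbf{y} \mid \mathbf{x}, \mathcal{S}, \omega) \, p(\omega \mid \mathbf{x}, \mathcal{S}) \, d\omega. \tag{9}$$
We adopt a conditional prior distribution $p(\omega \mid \mathbf{x}, \mathcal{S})$ over the base $\omega$, as in the conditional variational auto-encoder (CVAE) (Sohn et al., 2015), rather than an uninformative prior (Kingma & Welling, 2013; Rezende et al., 2014). By depending on the input $\mathbf{x}$, we infer the bases that can specifically represent the data, while leveraging the context of the current task by conditioning on the support set $\mathcal{S}$.
In order to infer the posterior $p(\omega \mid \mathbf{y}, \mathbf{x}, \mathcal{S})$ over $\omega$, which is generally intractable, we resort to using a variational distribution $q_{\phi}(\omega \mid \mathcal{S})$ to approximate it, where the base is conditioned on the support set $\mathcal{S}$ by leveraging meta-learning. We can obtain the variational distribution by minimizing the Kullback-Leibler (KL) divergence
$$D_{\mathrm{KL}}\big[q_{\phi}(\omega \mid \mathcal{S}) \,\|\, p(\omega \mid \mathbf{y}, \mathbf{x}, \mathcal{S})\big]. \tag{10}$$
By applying Bayes' rule to the posterior $p(\omega \mid \mathbf{y}, \mathbf{x}, \mathcal{S})$, we can derive the ELBO as
$$\log p(\mathbf{y} \mid \mathbf{x}, \mathcal{S}) \geq \mathbb{E}_{q_{\phi}(\omega \mid \mathcal{S})} \log p(\mathbf{y} \mid \mathbf{x}, \mathcal{S}, \omega) - D_{\mathrm{KL}}\big[q_{\phi}(\omega \mid \mathcal{S}) \,\|\, p(\omega \mid \mathbf{x}, \mathcal{S})\big]. \tag{11}$$
The first term of the ELBO is the predictive log-likelihood conditioned on the observation $\mathbf{x}$, the support set $\mathcal{S}$, and the inferred RFF bases $\omega$. Maximizing it enables us to make an accurate prediction for the query set by utilizing the inferred bases from the support set. The second term in the ELBO minimizes the discrepancy between the meta variational distribution $q_{\phi}(\omega \mid \mathcal{S})$ and the meta prior $p(\omega \mid \mathbf{x}, \mathcal{S})$, which encourages samples from the support and query sets to share the same random Fourier bases. The full derivation of the ELBO is provided in the supplementary material.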
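We now obtain the objective by maximizing the ELBO with respect to a batch of $T$ tasks:
$$\mathcal{L} = \frac{1}{T} \sum_{t=1}^{T} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{Q}^t} \Big[ \mathbb{E}_{q_{\phi}(\omega^t \mid \mathcal{S}^t)} \log p(\mathbf{y} \mid \mathbf{x}, \mathcal{S}^t, \omega^t) - D_{\mathrm{KL}}\big[q_{\phi}(\omega^t \mid \mathcal{S}^t) \,\|\, p(\omega^t \mid \mathbf{x}, \mathcal{S}^t)\big] \Big], \tag{12}$$
where $\mathcal{S}^t$ is the support set of the $t$-th task associated with its specific bases $\omega^t$ and $(\mathbf{x}, \mathbf{y})$ is the sample from the query set $\mathcal{Q}^t$ of the $t$-th task. Directly optimizing the above objective does not take into account the task dependency. Thus, we introduce context inference by conditioning the posterior on both the support set of the current task and the shared knowledge extracted from previous tasks.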
3.2 Context Inference
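We propose a context inference which puts the inference of random feature bases for the current task in the context of related tasks. We replace the variational distribution in (10) with a conditional distribution $q_{\phi}(\omega^t \mid \mathcal{S}^t, \mathcal{C})$ that makes the bases $\omega^t$ of the current $t$-th task conditioned also on the context $\mathcal{C}$ of related tasks.
The context inference gives rise to a new ELBO, as follows:
$$\log p(\mathbf{y} \mid \mathbf{x}, \mathcal{S}^t) \geq \mathbb{E}_{q_{\phi}(\omega^t \mid \mathcal{S}^t, \mathcal{C})} \log p(\mathbf{y} \mid \mathbf{x}, \mathcal{S}^t, \omega^t) - D_{\mathrm{KL}}\big[q_{\phi}(\omega^t \mid \mathcal{S}^t, \mathcal{C}) \,\|\, p(\omega^t \mid \mathbf{x}, \mathcal{S}^t)\big], \tag{13}$$
which can be represented in a directed graphical model as shown in Figure 2. In a practical sense, the KL term in (13) encourages the model to extract useful information from previous tasks for inferring the spectral distribution associated with each individual sample $\mathbf{x}$ of the query set in the current task.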
The context inference integrates the knowledge shared across tasks with the task-specific knowledge to build up adaptive kernels for individual tasks. The inferred random features are highly informative due to the absorbed information from prior knowledge of experienced tasks. The base-learner built on the inferred kernel with the informative random features can effectively solve the current task.
However, since there is usually a huge number of related tasks, it is nontrivial to model all these tasks simultaneously. We consider using recurrent neural networks to gradually accumulate information episodically along with the learning process by organizing tasks in a sequence. We propose an LSTM-based inference network by leveraging its innate capability of remembering long-term information (Gers & Schmidhuber, 2000). The LSTM offers a well-suited structure to implement the context inference. The cell state $\mathbf{c}$ can store and accrue the meta-knowledge shared among related tasks, which can also be updated when experiencing a new task in each episode during the course of learning; the output $\mathbf{h}$ is used to adapt to each specific task.

To be more specific, we model the variational posterior $q_{\phi}(\omega^t \mid \mathcal{S}^t, \mathcal{C})$ through $q_{\phi}(\omega \mid \mathbf{h}^t)$, which is parameterized as a multi-layer perceptron (MLP) $\phi(\mathbf{h}^t)$. Note that $\mathbf{h}^t$ is the output from an LSTM that takes $\mathcal{S}^t$ and $\mathcal{C}$ as inputs. We implement the inference network with both vanilla and bidirectional LSTMs (Schuster & Paliwal, 1997; Graves & Schmidhuber, 2005). For a vanilla LSTM, we have
$$[\mathbf{h}^t, \mathbf{c}^t] = g_{\mathrm{LSTM}}(\bar{\mathcal{S}}^t, \mathbf{h}^{t-1}, \mathbf{c}^{t-1}), \tag{14}$$
where $g_{\mathrm{LSTM}}(\cdot)$ is a vanilla LSTM network that takes the current support set, the output $\mathbf{h}^{t-1}$ and the cell state $\mathbf{c}^{t-1}$ as the input. $\bar{\mathcal{S}}^t$ is the average over the feature representation vectors of samples in the support set (Zaheer et al., 2017). The feature representation is obtained by a shared convolutional network $\psi(\cdot)$. To incorporate more context information, we also implement the inference with a bidirectional LSTM, and we have $\mathbf{h}^t = [\overrightarrow{\mathbf{h}}^t, \overleftarrow{\mathbf{h}}^t]$, where $\overrightarrow{\mathbf{h}}^t$ and $\overleftarrow{\mathbf{h}}^t$ are the outputs from forward and backward LSTMs, respectively, and $[\cdot, \cdot]$ indicates a concatenation operation.

Therefore, the optimization objective with the context inference is:
$$\mathcal{L} = \frac{1}{T} \sum_{t=1}^{T} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{Q}^t} \Big[ \mathbb{E}_{q_{\phi}(\omega^t \mid \mathbf{h}^t)} \log p(\mathbf{y} \mid \mathbf{x}, \mathcal{S}^t, \omega^t) - D_{\mathrm{KL}}\big[q_{\phi}(\omega^t \mid \mathbf{h}^t) \,\|\, p(\omega^t \mid \mathbf{x}, \mathcal{S}^t)\big] \Big], \tag{15}$$
where the variational approximate posterior $q_{\phi}(\omega^t \mid \mathbf{h}^t)$ is taken as a multivariate Gaussian with a diagonal covariance. Given the support set as input, the mean $\mu$ and standard deviation $\sigma$ are output from the inference network $\phi(\cdot)$. The conditional prior $p(\omega^t \mid \mathbf{x}, \mathcal{S}^t)$ is implemented with a prior network which takes an aggregated representation by using the cross attention (Kim et al., 2019) between $\mathbf{x}$ and $\mathcal{S}^t$. The details of the prior network are provided in the supplementary material. To enable back-propagation with the sampling operation during training, we adopt the reparametrization trick (Rezende et al., 2014; Kingma & Welling, 2013) as $\omega = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$.

During the course of learning, the LSTMs accumulate knowledge in the cell state by updating their cells using information extracted from each task. For the current task $t$, the knowledge stored in the cell is combined with the task-specific information from the support set to infer the spectral distribution for this task. To accrue the information across all the tasks in the meta-training set, the output and the cell state of LSTMs are passed down across batches. As a result, the final cell state contains the distilled prior knowledge from all those experienced tasks in the meta-training set.
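The pieces of the context inference described above (one LSTM step on the mean-pooled support features, a reparameterized draw of bases, and the KL term between two diagonal Gaussians) can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation: the function names, the gate ordering inside `lstm_step`, and the assumption that the prior is also a diagonal Gaussian are choices made for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(s_bar, h_prev, c_prev, W, b):
    """One vanilla LSTM step on the mean-pooled support feature s_bar.

    W has shape (4H, F + H) and b has shape (4H,); the gate order
    (input, forget, output, candidate) is an assumption of this sketch.
    """
    H = h_prev.shape[0]
    z = W @ np.concatenate([s_bar, h_prev]) + b
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2 * H]), sigmoid(z[2 * H:3 * H])
    g = np.tanh(z[3 * H:])
    c = f * c_prev + i * g          # cell state accrues shared knowledge
    h = o * np.tanh(c)              # output used to adapt to the current task
    return h, c

def sample_bases(mu, log_var, n_bases, rng):
    """Reparameterized draw: omega = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal((n_bases, mu.shape[0]))
    return mu + np.exp(0.5 * log_var) * eps

def kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p):
    """KL[N(mu_q, diag(var_q)) || N(mu_p, diag(var_p))] in closed form."""
    var_q, var_p = np.exp(log_var_q), np.exp(log_var_p)
    return 0.5 * np.sum(
        log_var_p - log_var_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )
```

In a full model, `mu` and `log_var` would come from an MLP applied to `h`; that network is omitted here to keep the sketch short.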
Fast Adaptation. Once meta-training ceases, the output $\mathbf{h}$ and the cell state $\mathbf{c}$ are directly used for a new incoming task in the meta-test set to achieve fast adaptation with a simple feed-forward computation operation. To be more specific, for a task with the support set $\mathcal{S}$ in the meta-test set, we draw $D$ samples $\{\omega^{(l)}\}_{l=1}^{D}$ as the bases: $\omega^{(l)} \sim q(\omega \mid \mathbf{h})$, where $\mathbf{h}$ is output from either a vanilla LSTM or a bidirectional LSTM, depending on which is used during the meta-training stage. The bases are adopted to compute the kernels on the support set and construct the classifier of the base-learner for the task, using (4). The classifier is then used to make predictions of samples in the query set for performance evaluation.

4 Related Work
Meta-learning, or learning to learn, endows machine learning models with the ability to improve their performance by leveraging knowledge extracted from a number of prior tasks. It has received increasing research interest with breakthroughs in many directions (Finn et al., 2017; Rusu et al., 2019; Gordon et al., 2019; Aravind Rajeswaran, 2019). Gradient-based methods (e.g., MAML (Finn et al., 2017)) learn an appropriate initialization of model parameters and adapt it to new tasks with only a few gradient steps (Finn & Levine, 2018; Zintgraf et al., 2019; Rusu et al., 2019). Learning a shared optimization algorithm has also been explored in order to quickly learn new tasks (Ravi & Larochelle, 2017; Andrychowicz et al., 2016; Chen et al., 2017).
Metric learning has been widely studied with great success for few-shot learning (Vinyals et al., 2016; Snell et al., 2017; Satorras & Estrach, 2018; Oreshkin et al., 2018; Allen et al., 2019). The basic assumption is that a common metric space is shared across related tasks. Snell et al. (2017) constructed a prototype for each class by averaging feature representations of samples from the class in the metric space. The query images are matched to prototypes by computing their distances to conduct classification. To enhance the prototype representation, Allen et al. (2019) proposed an infinite mixture of prototypes (IMP) to adaptively represent the data distribution of each class, instead of using a single vector. In addition, Oreshkin et al. (2018) proposed a task-dependent adaptive metric for improved few-shot learning. They established prototypes of classes conditioned on a task representation encoded by a task embedding network.
While these meta-learning algorithms have made great progress in few-shot learning tasks, exploring prior knowledge from previous tasks remains an open challenge (Titsias et al., 2019). In this work, we introduce kernels based on random features as the base-learners, which enables us to acquire shared knowledge across tasks by modeling their dependency via the random feature basis of kernels.
Kernel learning with random Fourier features is a versatile and powerful tool in machine learning (Bishop, 2006; Hofmann et al., 2008; Shervashidze et al., 2011). Pioneering works (Bach et al., 2004; Gönen & Alpaydın, 2011; Duvenaud et al., 2013) learn to combine predefined kernels in a multi-kernel learning manner. Kernel approximation by random Fourier features (RFFs) (Rahimi & Recht, 2007) is an effective technique for efficient kernel learning (Gärtner et al., 2002), which has recently become increasingly popular (Sinha & Duchi, 2016; Carratino et al., 2018). Recent works (Wilson & Adams, 2013) learn kernels in the frequency domain by modeling the spectral distribution as a mixture of Gaussians and computing its optimal linear combination. Instead of modeling the spectral distribution with explicit density functions, other works focus on optimizing the random base sampling strategy (Yang et al., 2015; Sinha & Duchi, 2016). Nonetheless, it has been shown that accurate approximation of kernels does not necessarily result in high classification performance (Avron et al., 2016; Chang et al., 2017). This suggests that learning adaptive kernels with random features by data-driven sampling strategies (Sinha & Duchi, 2016) can improve the performance, even with a low sampling rate compared to using universal random features (Avron et al., 2016; Chang et al., 2017).

Our MetaVRF is the first work to introduce kernel learning with random features to the meta-learning framework for few-shot learning. The optimization of MetaVRF is naturally cast as a variational inference problem, and the context inference offers a principled way to incorporate prior knowledge and achieve informative and adaptive kernels.
5 Experiments
We evaluate our MetaVRF on several few-shot learning problems for both regression and classification. We demonstrate the benefit of exploring task dependency by implementing a baseline MetaVRF (12) without using the LSTM, which infers the random bases solely from the support set. We also conduct further analysis to validate the effectiveness of our MetaVRF by showing its performance with deep embedding architectures, different numbers of bases, and under versatile and challenging settings with inconsistent training and test conditions.
5.1 Few-Shot Regression
We conduct regression tasks with different numbers of shots $k$, and compare our MetaVRF with MAML (Finn et al., 2017), a representative meta-learning algorithm. We follow the MAML work (Finn et al., 2017) to fit a target sine function $y(x) = A \sin(wx + b)$, with only a few annotated samples. $A$, $w$, and $b$ denote the amplitude, frequency, and phase, respectively, which follow a uniform distribution within the corresponding interval. The goal is to estimate the target sine function given only $k$ randomly sampled data points. In our experiments, we consider the input in the range of , and conduct three tests under the conditions of . For a fair comparison, we compute the feature embedding using a small multi-layer perceptron (MLP) with two hidden layers of size , following the same settings used in MAML.
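A sine regression task of this kind can be generated as in the sketch below. The function name `sample_sine_task` is hypothetical, and every default interval is a placeholder assumption chosen only to make the example runnable; the intervals used in the experiments are not recoverable from the text above.

```python
import numpy as np

def sample_sine_task(k_shot, rng,
                     amp_range=(0.1, 5.0),       # placeholder assumption
                     freq_range=(0.8, 1.2),      # placeholder assumption
                     phase_range=(0.0, np.pi),   # placeholder assumption
                     x_range=(-5.0, 5.0)):       # placeholder assumption
    """Sample one few-shot regression task y = A * sin(w * x + b).

    Amplitude A, frequency w and phase b are drawn uniformly from their
    intervals; k_shot inputs are drawn uniformly from x_range.
    """
    A = rng.uniform(*amp_range)
    w = rng.uniform(*freq_range)
    b = rng.uniform(*phase_range)
    x = rng.uniform(x_range[0], x_range[1], size=k_shot)
    y = A * np.sin(w * x + b)
    return x, y, (A, w, b)
```

Calling `sample_sine_task(k, np.random.default_rng(seed))` once per episode yields a fresh target function and its $k$ annotated points.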
Table 1: Performance on miniImageNet and CIFAR-FS (5-way).

| Method | miniImageNet, 1-shot | miniImageNet, 5-shot | CIFAR-FS, 1-shot | CIFAR-FS, 5-shot |
| --- | --- | --- | --- | --- |
| Matching net (Vinyals et al., 2016) | 44.2 | 57 | - | - |
| MAML (Finn et al., 2017) | 48.7±1.8 | 63.1±0.9 | 58.9±1.9 | 71.5±1.0 |
| MAML (C) | 46.7±1.7 | 61.1±0.1 | 58.9±1.8 | 71.5±1.1 |
| Meta-LSTM (Ravi & Larochelle, 2017) | 43.4±0.8 | 60.6±0.7 | - | - |
| Proto net (Snell et al., 2017) | 47.4±0.6 | 65.4±0.5 | 55.5±0.7 | 72.0±0.6 |
| Relation net (Sung et al., 2018) | 50.4±0.8 | 65.3±0.7 | 55.0±1.0 | 69.3±0.8 |
| SNAIL (32C) by (Bertinetto et al., 2019) | 45.1 | 55.2 | - | - |
| GNN (Garcia & Bruna, 2018) | 50.3 | 66.4 | 61.9 | 75.3 |
| PLATIPUS (Finn et al., 2018) | 50.1±1.9 | - | - | - |
| VERSA (Gordon et al., 2019) | 53.3±1.8 | 67.3±0.9 | 62.5±1.7 | 75.1±0.9 |
| R2D2 (C) (Bertinetto et al., 2019) | 49.5±0.2 | 65.4±0.2 | 62.3±0.2 | 77.4±0.2 |
| R2D2 (Devos et al., 2019) | 51.7±1.8 | 63.3±0.9 | 60.2±1.8 | 70.9±0.9 |
| CAVIA (Zintgraf et al., 2019) | 51.8±0.7 | 65.6±0.6 | - | - |
| iMAML (Aravind Rajeswaran, 2019) | 49.3±1.9 | - | - | - |
| RFFs (2048d) | 52.8±0.9 | 65.4±0.9 | 61.1±0.8 | 74.7±0.9 |
| MetaVRF (w/o LSTM, 780d) | 51.3±0.8 | 66.1±0.7 | 61.1±0.7 | 74.3±0.9 |
| MetaVRF (vanilla LSTM, 780d) | 53.1±0.9 | 66.8±0.7 | 62.1±0.8 | 76.0±0.8 |
| MetaVRF (bi-LSTM, 780d) | 54.2±0.8 | 67.8±0.7 | 63.1±0.7 | 76.5±0.9 |

Table 2: Performance on Omniglot (5-way and 20-way).

| Method | 5-way, 1-shot | 5-way, 5-shot | 20-way, 1-shot | 20-way, 5-shot |
| --- | --- | --- | --- | --- |
| Siamese net (Koch, 2015) | 96.7 | 98.4 | 88 | 96.5 |
| Matching net (Vinyals et al., 2016) | 98.1 | 98.9 | 93.8 | 98.5 |
| MAML (Finn et al., 2017) | 98.7 | 99.9±0.1 | 95.8±0.3 | 98.9±0.2 |
| Proto net (Snell et al., 2017) | 98.5±0.2 | 99.5±0.1 | 95.3±0.2 | 98.7±0.1 |
| SNAIL (Mishra et al., 2018) | 99.1±0.2 | 99.8±0.1 | 97.6±0.3 | 99.4±0.2 |
| GNN (Garcia & Bruna, 2018) | 99.2 | 99.7 | 97.4 | 99.0 |
| VERSA (Gordon et al., 2019) | 99.7±0.2 | 99.8±0.1 | 97.7±0.3 | 98.8±0.2 |
| R2D2 (Bertinetto et al., 2019) | 98.6 | 99.7 | 94.7 | 98.9 |
| IMP (Allen et al., 2019) | 98.4±0.3 | 99.5±0.1 | 95.0±0.1 | 98.6±0.1 |
| RFFs (2048d) | 99.5±0.2 | 99.5±0.2 | 97.2±0.3 | 98.3±0.2 |
| MetaVRF (w/o LSTM, 780d) | 99.6±0.2 | 99.6±0.2 | 97.0±0.3 | 98.4±0.2 |
| MetaVRF (vanilla LSTM, 780d) | 99.7±0.2 | 99.8±0.1 | 97.5±0.3 | 99.0±0.2 |
| MetaVRF (bi-LSTM, 780d) | 99.8±0.1 | 99.9±0.1 | 97.8±0.3 | 99.2±0.2 |
The results in Figure 3 show that our MetaVRF fits the function well with only three shots but performs better with an increasing number of shots, almost entirely fitting the target function with ten shots. Moreover, the results demonstrate the advantage of exploring task dependency by LSTM-based inference. MetaVRF with a bi-LSTM performs better than with a vanilla LSTM, since the bi-LSTM incorporates more context tasks. In addition, we observe that MetaVRF performs better than MAML for all three settings with varying numbers of shots. We provide more results on few-shot regression tasks in the supplementary material.
5.2 Few-Shot Classification
The classification experiments are conducted on three commonly used benchmark datasets, i.e., Omniglot (Lake et al., 2015), miniImageNet (Vinyals et al., 2016) and CIFAR-FS (Krizhevsky et al., 2009); for more details, please refer to the supplementary material. We extract image features using a shallow convolutional neural network with the same architecture as in (Gordon et al., 2019). We do not use any fully connected layers for these CNNs. The dimension of all feature vectors is . We also evaluate the baseline method, random Fourier features (RFFs), to approximate the common Gaussian kernel. The inference network is a three-layer MLP with units in the hidden layers and rectifier nonlinearity where input sizes are and for the vanilla and bidirectional LSTMs, respectively.

The key hyperparameter for the number of bases $D$ in (7) is set to $D = 780$ for MetaVRF in all experiments, while we use RFFs with $D = 2048$ as this produces the best performance. The sampling rate in our MetaVRF is much lower than in previous works using RFFs, in which $D$ is usually set to be to times the dimension of the input features (Yu et al., 2016; Rahimi & Recht, 2007). We adopt a similar meta-testing protocol as (Gordon et al., 2019; Finn et al., 2017), but we test on episodes rather than and present the results with confidence intervals. All reported results are produced by models trained from scratch. We compare with previous methods that use the same training procedures and similar shallow conventional CNN architectures as ours. The comparison results on three benchmark datasets are reported in Tables 1 and 2.

On all benchmark datasets, MetaVRF delivers state-of-the-art performance. Even with a relatively low sampling rate, MetaVRF produces consistently better performance compared with the baseline RFFs. MetaVRF with a bi-LSTM outperforms the one with a vanilla LSTM since it can leverage more information. It is worth mentioning that MetaVRF with a bi-LSTM achieves good performance () under the way shot setting on the miniImageNet dataset, surpassing the second best model by . The MetaVRFs with a bi-LSTM and a vanilla LSTM consistently outperform the one without the LSTM, which demonstrates the effectiveness of using an LSTM to explore task dependency. Note that on Omniglot, the performance of existing methods saturates and MetaVRF with a bi-LSTM achieves the best performance for most settings, including 5-way 1-shot, 5-way 5-shot, and 20-way 1-shot. It is also competitive under the 20-way 5-shot setting, falling within the error bars of the state of the art. Note that, in Table 1, we also implement a MAML () with channels in each convolutional layer. However, while it obtains modest performance, we believe the increased model size leads to overfitting.
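Reporting per-episode accuracy with a confidence interval, as in the meta-testing protocol above, can be sketched as follows. The function name `mean_and_ci` is hypothetical, and the default `z=1.96` (a 95% interval under a normal approximation) is an assumption of this sketch, since the confidence level is not recoverable from the text.

```python
import numpy as np

def mean_and_ci(accuracies, z=1.96):
    """Mean accuracy over test episodes and the half-width of its interval.

    Uses the normal approximation: half_width = z * std / sqrt(n), with
    the sample standard deviation (ddof=1). z = 1.96 is an assumed default.
    """
    acc = np.asarray(accuracies, dtype=float)
    mean = acc.mean()
    half_width = z * acc.std(ddof=1) / np.sqrt(acc.size)
    return mean, half_width
```

The two returned values correspond to the "mean ± half-width" entries of a results table; evaluating on more episodes narrows the interval at a rate of one over the square root of the episode count.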
Since in the original SNAIL, a very deep ResNet12 network is used for embedding, we cite the result of SNAIL reported in Bertinetto et al. (2019) using similar shallow networks as ours. We cite the original results of R2D2 (Bertinetto et al., 2019) using channels for fair comparison.
5.3 Further Analysis
Deep embedding. Our MetaVRF is independent of the convolutional architectures for feature extraction and can work with deeper embeddings, either pre-trained or trained from scratch. In general, the performance improves with more powerful feature extraction architectures. We evaluate our method using pre-trained embeddings in order to compare with existing methods using deep embedding architectures. To benchmark with those methods, we adopt the pre-trained embeddings from a 28-layer wide residual network (WRN-28-10) (Zagoruyko & Komodakis, 2016), in a similar fashion to (Rusu et al., 2019; Bauer et al., 2017; Qiao et al., 2018). We choose activations in the 21st layer, with average pooling over spatial dimensions, as feature embeddings. The dimension of pre-trained embeddings is . We show the comparison results on the miniImageNet dataset for 5-way 1-shot and 5-shot settings in Table 3. Our MetaVRF with a bi-LSTM achieves the best performance under both settings and largely surpasses LEO, a recently proposed meta-learning method, especially on the challenging 5-way 1-shot setting. Note that the MetaVRF with a vanilla LSTM and without an LSTM also produce competitive performance.

Table 3: Performance on miniImageNet (5-way) with pre-trained deep embeddings.

| Method | 1-shot | 5-shot |
| --- | --- | --- |
| Meta-SGD (Li et al., 2017) | 54.24±0.03 | 70.86±0.04 |
| (Gidaris & Komodakis, 2018) | 56.20±0.86 | 73.00±0.64 |
| (Bauer et al., 2017) | 56.30±0.40 | 73.90±0.30 |
| (Munkhdalai et al., 2017) | 57.10±0.70 | 70.04±0.63 |
| (Qiao et al., 2018) | 59.60±0.41 | 73.54±0.19 |
| LEO (Rusu et al., 2019) | 61.76±0.08 | 77.59±0.12 |
| SNAIL (Mishra et al., 2018) | 55.71±0.99 | 68.88±0.92 |
| TADAM (Oreshkin et al., 2018) | 58.50±0.30 | 76.70±0.30 |
| MetaVRF (w/o LSTM, 780d) | 62.12±0.07 | 77.05±0.28 |
| MetaVRF (vanilla LSTM, 780d) | 63.21±0.06 | 77.83±0.28 |
| MetaVRF (bi-LSTM, 780d) | 63.80±0.05 | 77.97±0.28 |
Efficiency. Regular random Fourier features (RFFs) usually require high sampling rates to achieve satisfactory performance. In contrast, our MetaVRF achieves high performance with a relatively low sampling rate, which makes it efficient. In Figure 4, we compare fully trained models using regular RFFs and our MetaVRF with a bi-LSTM under different numbers of bases, for the way shot setting. MetaVRF with a bi-LSTM consistently yields higher performance than regular RFFs with the same number of sampled bases. The results verify the efficiency of our MetaVRF in learning adaptive kernels and its effectiveness in improving performance by exploring dependencies of related tasks.
Versatility. In contrast to most existing meta-learning methods, our MetaVRF can be used for versatile settings. We evaluate the performance of MetaVRF on more challenging scenarios where the numbers of ways and shots are inconsistent between training and testing. Specifically, we test the performance of MetaVRF on tasks with varied $C$ and $k$, when it is trained on one particular $C$-way $k$-shot task. As shown in Figure 5, the results demonstrate that the trained model can still produce good performance, even on the challenging condition with a far higher number of ways. In particular, the model trained on the wayshot task can retain a high accuracy of on the way setting, as shown in Figure 5(a). The results also indicate that our MetaVRF exhibits considerable robustness and flexibility to a great variety of testing conditions.
6 Conclusion
In this paper, we introduce kernel approximation based on random Fourier features into the meta-learning framework for few-shot learning. We propose meta variational random features (MetaVRF), which leverage variational inference and meta-learning to infer the spectral distribution of random Fourier features in a data-driven way. MetaVRF generates random Fourier features of high representational power with a relatively low spectral sampling rate by using an LSTM-based inference network to explore the shared knowledge. In practice, our LSTM-based inference network demonstrates a great ability to quickly adapt to specific tasks for improved performance. Experimental results on both regression and classification tasks demonstrate the effectiveness of MetaVRF for few-shot learning.
Acknowledgements
This research was supported in part by Natural Science Foundation of China (No. 61976060, 61871016, 61876098).
References
 Allen et al. (2019) Allen, K. R., Shelhamer, E., Shin, H., and Tenenbaum, J. B. Infinite mixture prototypes for fewshot learning. In Proceedings of the 36th International Conference on Machine Learning, pp. 232–241, 2019.
 Andrychowicz et al. (2016) Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., Shillingford, B., and de Freitas, N. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, 2016.
 Aravind Rajeswaran (2019) Aravind Rajeswaran, Chelsea Finn, S. K. S. L. Metalearning with implicit gradients. arXiv preprint arXiv:1909.04630, 2019.
 Avron et al. (2016) Avron, H., Sindhwani, V., Yang, J., and Mahoney, M. W. Quasi-Monte Carlo feature maps for shift-invariant kernels. The Journal of Machine Learning Research, 17(1):4096–4133, 2016.
 Bach et al. (2004) Bach, F. R., Lanckriet, G. R., and Jordan, M. I. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 6, 2004.
 Bauer et al. (2017) Bauer, M., Rojas-Carulla, M., Świątkowski, J. B., Schölkopf, B., and Turner, R. E. Discriminative k-shot learning using probabilistic models. arXiv preprint arXiv:1706.00326, 2017.
 Bertinetto et al. (2019) Bertinetto, L., Henriques, J. F., Torr, P. H., and Vedaldi, A. Meta-learning with differentiable closed-form solvers. In International Conference on Learning Representations, 2019.
 Bishop (2006) Bishop, C. M. Pattern recognition and machine learning. Springer, 2006.
 Bullins et al. (2018) Bullins, B., Zhang, C., and Zhang, Y. Not-so-random features. In International Conference on Learning Representations, 2018.
 Carratino et al. (2018) Carratino, L., Rudi, A., and Rosasco, L. Learning with SGD and random features. In Advances in Neural Information Processing Systems, pp. 10192–10203, 2018.
 Chang et al. (2017) Chang, W.-C., Li, C.-L., Yang, Y., and Poczos, B. Data-driven random Fourier features using Stein effect. arXiv preprint arXiv:1705.08525, 2017.
 Chen et al. (2017) Chen, Y., Hoffman, M. W., Colmenarejo, S. G., Denil, M., Lillicrap, T. P., Botvinick, M., and De Freitas, N. Learning to learn without gradient descent by gradient descent. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 748–756. JMLR.org, 2017.
 Devos et al. (2019) Devos, A., Chatel, S., and Grossglauser, M. Reproducing meta-learning with differentiable closed-form solvers. In ICLR Workshop, 2019.
 Duvenaud et al. (2013) Duvenaud, D., Lloyd, J. R., Grosse, R., Tenenbaum, J. B., and Ghahramani, Z. Structure discovery in nonparametric regression through compositional kernel search. arXiv preprint arXiv:1302.4922, 2013.
 Finn & Levine (2018) Finn, C. and Levine, S. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. In International Conference on Learning Representations, 2018.
 Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126–1135. JMLR.org, 2017.
 Finn et al. (2018) Finn, C., Xu, K., and Levine, S. Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pp. 9516–9527, 2018.
 Garcia & Bruna (2018) Garcia, V. and Bruna, J. Few-shot learning with graph neural networks. In International Conference on Learning Representations, 2018.
 Gärtner et al. (2002) Gärtner, T., Flach, P. A., Kowalczyk, A., and Smola, A. J. Multi-instance kernels. In International Conference on Machine Learning, 2002.
 Gers & Schmidhuber (2000) Gers, F. A. and Schmidhuber, J. Recurrent nets that time and count. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, volume 3, pp. 189–194. IEEE, 2000.

 Gidaris & Komodakis (2018) Gidaris, S. and Komodakis, N. Dynamic few-shot visual learning without forgetting. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4367–4375, 2018.
 Gönen & Alpaydın (2011) Gönen, M. and Alpaydın, E. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12(Jul):2211–2268, 2011.
 Gordon et al. (2019) Gordon, J., Bronskill, J., Bauer, M., Nowozin, S., and Turner, R. E. Meta-learning probabilistic inference for prediction. In International Conference on Learning Representations, 2019.
 Graves & Schmidhuber (2005) Graves, A. and Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610, 2005.
 Hensman et al. (2017) Hensman, J., Durrande, N., and Solin, A. Variational Fourier features for Gaussian processes. The Journal of Machine Learning Research, 18(1):5537–5588, 2017.
 Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 Hofmann et al. (2008) Hofmann, T., Schölkopf, B., and Smola, A. J. Kernel methods in machine learning. The annals of statistics, pp. 1171–1220, 2008.
 Kim et al. (2019) Kim, H., Mnih, A., Schwarz, J., Garnelo, M., Eslami, A., Rosenbaum, D., Vinyals, O., and Teh, Y. W. Attentive neural processes. In International Conference on Learning Representations, 2019.
 Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Koch (2015) Koch, G. Siamese neural networks for one-shot image recognition. In ICML Workshop, 2015.
 Krizhevsky et al. (2009) Krizhevsky, A. et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 Lake et al. (2015) Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

 Li et al. (2019) Li, C.-L., Chang, W.-C., Mroueh, Y., Yang, Y., and Poczos, B. Implicit kernel learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2007–2016, 2019.
 Li et al. (2017) Li, Z., Zhou, F., Chen, F., and Li, H. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.
 Mishra et al. (2018) Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. A simple neural attentive meta-learner. In International Conference on Learning Representations, 2018.
 Munkhdalai et al. (2017) Munkhdalai, T., Yuan, X., Mehri, S., and Trischler, A. Rapid adaptation with conditionally shifted neurons. arXiv preprint arXiv:1712.09926, 2017.
 Oreshkin et al. (2018) Oreshkin, B., López, P. R., and Lacoste, A. TADAM: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pp. 721–731, 2018.
 Qiao et al. (2018) Qiao, S., Liu, C., Shen, W., and Yuille, A. L. Few-shot image recognition by predicting parameters from activations. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 7229–7238, 2018.
 Rahimi & Recht (2007) Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177–1184, 2007.
 Ravi & Larochelle (2017) Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017.
 Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 Rudin (1962) Rudin, W. Fourier analysis on groups, volume 121967. Wiley Online Library, 1962.
 Rusu et al. (2019) Rusu, A. A., Rao, D., Sygnowski, J., Vinyals, O., Pascanu, R., Osindero, S., and Hadsell, R. Meta-learning with latent embedding optimization. In International Conference on Learning Representations, 2019.
 Satorras & Estrach (2018) Satorras, V. G. and Estrach, J. B. Few-shot learning with graph neural networks. In International Conference on Learning Representations, 2018.
 Schmidhuber (1992) Schmidhuber, J. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.
 Schuster & Paliwal (1997) Schuster, M. and Paliwal, K. K. Bidirectional recurrent neural networks. IEEE transactions on Signal Processing, 45(11):2673–2681, 1997.
 Shervashidze et al. (2011) Shervashidze, N., Schweitzer, P., Leeuwen, E. J. v., Mehlhorn, K., and Borgwardt, K. M. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.
 Sinha & Duchi (2016) Sinha, A. and Duchi, J. C. Learning kernels with random features. In Advances in Neural Information Processing Systems, pp. 1298–1306, 2016.
 Snell et al. (2017) Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087, 2017.
 Sohn et al. (2015) Sohn, K., Lee, H., and Yan, X. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pp. 3483–3491, 2015.
 Sung et al. (2018) Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H., and Hospedales, T. M. Learning to compare: Relation network for few-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208, 2018.
 Thrun & Pratt (2012) Thrun, S. and Pratt, L. Learning to learn. Springer Science & Business Media, 2012.
 Titsias et al. (2019) Titsias, M. K., Schwarz, J., Matthews, A. G. d. G., Pascanu, R., and Teh, Y. W. Functional regularisation for continual learning using Gaussian processes. arXiv preprint arXiv:1901.11356, 2019.
 Vinyals et al. (2016) Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638, 2016.
 Wilson & Adams (2013) Wilson, A. and Adams, R. Gaussian process kernels for pattern discovery and extrapolation. In International Conference on Machine Learning, pp. 1067–1075, 2013.
 Yang et al. (2015) Yang, Z., Wilson, A., Smola, A., and Song, L. A la carte–learning fast kernels. In Artificial Intelligence and Statistics, pp. 1098–1106, 2015.
 Yu et al. (2016) Yu, F. X. X., Suresh, A. T., Choromanski, K. M., Holtmann-Rice, D. N., and Kumar, S. Orthogonal random features. In Advances in Neural Information Processing Systems, pp. 1975–1983, 2016.
 Zagoruyko & Komodakis (2016) Zagoruyko, S. and Komodakis, N. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
 Zaheer et al. (2017) Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. In Advances in Neural Information Processing Systems, pp. 3391–3401, 2017.
 Zintgraf et al. (2019) Zintgraf, L., Shiarli, K., Kurin, V., Hofmann, K., and Whiteson, S. Fast context adaptation via meta-learning. In International Conference on Machine Learning, pp. 7693–7702, 2019.