
Learning to Learn Kernels with Variational Random Features

by Xiantong Zhen, et al.

In this work, we introduce kernels with random Fourier features in the meta-learning framework to leverage their strong few-shot learning ability. We propose meta variational random features (MetaVRF) to learn adaptive kernels for the base-learner, which is developed in a latent variable model by treating the random feature basis as the latent variable. We formulate the optimization of MetaVRF as a variational inference problem by deriving an evidence lower bound under the meta-learning framework. To incorporate shared knowledge from related tasks, we propose a context inference of the posterior, which is established by an LSTM architecture. The LSTM-based inference network can effectively integrate the context information of previous tasks with task-specific information, generating informative and adaptive features. The learned MetaVRF can produce kernels of high representational power with a relatively low spectral sampling rate and also enables fast adaptation to new tasks. Experimental results on a variety of few-shot regression and classification tasks demonstrate that MetaVRF delivers much better, or at least competitive, performance compared to existing meta-learning alternatives.



1 Introduction

Learning to learn, or meta-learning (Schmidhuber, 1992; Thrun & Pratt, 2012), offers a promising tool for few-shot learning (Andrychowicz et al., 2016; Ravi & Larochelle, 2017; Finn et al., 2017) and has recently gained increasing popularity in machine learning. The crux of meta-learning for few-shot learning is to extract prior knowledge from related tasks to enable fast adaptation to a new task with a limited amount of data. Generally speaking, existing meta-learning algorithms (Ravi & Larochelle, 2017; Bertinetto et al., 2019) design the meta-learner to extract meta-knowledge that improves the performance of the base-learner on individual tasks. Meta-knowledge, like a good parameter initialization (Finn et al., 2017) or an efficient optimization update rule shared across tasks (Andrychowicz et al., 2016; Ravi & Larochelle, 2017), has been extensively explored in the general learning framework, but how to define and use it in few-shot learning remains an open question.

Figure 1: The learning framework of our meta variational random features (MetaVRF). The meta-learner employs an LSTM-based inference network to infer the spectral distribution over the random basis ω of the kernel from the support set S_t of the current task, and the output h and cell state c of the previous task. During the learning process, the cell state in the LSTM is deployed to accumulate the shared knowledge through experiencing a set of prior tasks. The remember and forget gates in the LSTM episodically refine the cell state by absorbing information from each experienced task. For each individual task, the task-specific information extracted from its support set is combined with distilled information from the previous tasks to infer the adaptive spectral distribution of the kernels.

An effective base-learner should be powerful enough to solve individual tasks and able to absorb information provided by the meta-learner to improve its own performance. While potentially strong base-learners, kernels (Hofmann et al., 2008) have not yet been studied in the meta-learning scenario for few-shot learning. Learning adaptive kernels (Bach et al., 2004) in a data-driven way via random features (Rahimi & Recht, 2007) has demonstrated great success in regular learning tasks and remains of broad interest in machine learning (Sinha & Duchi, 2016; Hensman et al., 2017; Carratino et al., 2018; Bullins et al., 2018; Li et al., 2019). However, due to the limited availability of data, it is challenging for few-shot learning to establish informative and discriminant kernels. We thus explore the relatedness among distinctive but relevant tasks to generate rich random features to build strong kernels for base-learners, while still maintaining their ability to adapt quickly to individual tasks.

In this paper, we make three important contributions. First, we propose meta variational random features (MetaVRF), integrating, for the first time, kernel learning with random features and variational inference into the meta-learning framework for few-shot learning. We develop MetaVRF in a latent variable model by treating the random Fourier basis of translation-invariant kernels as the latent variable. Second, we formulate the optimization of MetaVRF as a variational inference problem by deriving a new evidence lower bound (ELBO) in the meta-learning setting, where the posterior over the random feature basis corresponds to the spectral distribution associated with the kernel. This formulation under probabilistic modeling provides a principled way of learning data-driven kernels with random Fourier features and, more importantly, fits well in the meta-learning framework for few-shot learning, allowing us to flexibly customize the variational posterior to leverage the meta-knowledge for inference. As the third contribution, we propose a context inference which puts the inference of random feature bases of the current task into the context of all previous, related tasks. The context inference provides a generalized way to integrate context information of related tasks with task-specific information for the inference of random feature bases. To establish the context inference, we introduce a recurrent LSTM architecture (Hochreiter & Schmidhuber, 1997), leveraging its innate capability of learning long-term dependencies, which can be adopted to explore shared meta-knowledge from a large set of previous tasks. The LSTM-based inference connects knowledge from previous tasks to the current task, gradually collecting and refreshing the knowledge across the course of learning. The learning process with an LSTM-based inference network is illustrated in Figure 1.
Once learning ceases, the ultimate LSTM state gains meta-knowledge from related experienced tasks, which enables fast adaptation to new tasks.

We demonstrate the effectiveness of the proposed MetaVRF by extensive experiments on a variety of few-shot regression and classification tasks. Results show that our MetaVRF achieves better, or at least competitive, performance compared to previous methods. Moreover, we conduct further analysis on MetaVRF to demonstrate its ability to be integrated with deeper architectures and its efficiency with relatively low sampling rates. We also apply MetaVRF to versatile and challenging settings with inconsistent training and test conditions, and it can still deliver promising results, which further demonstrates its strong learning ability.

2 Method

We first describe the base-learner based on kernel ridge regression in meta-learning for few-shot learning, and then introduce kernel learning with random features, based on which our meta variational random features are developed.

2.1 Meta-Learning with Kernels

We adopt the episodic training strategy commonly used for few-shot classification in meta-learning (Ravi & Larochelle, 2017), which involves meta-training and meta-testing stages. In the meta-training stage, a meta-learner is trained to enhance the performance of a base-learner on a meta-training set with a batch of few-shot learning tasks, where a task is usually referred to as an episode (Ravi & Larochelle, 2017). In the meta-testing stage, the base-learner is evaluated on a meta-testing set whose classes are disjoint from those in the meta-training set.

For the few-shot classification problem, we sample N-way k-shot classification tasks from the meta-training set, where k is the number of labelled examples for each of the N classes. Given the t-th task with a support set S_t and query set Q_t (S_t ∩ Q_t = ∅), we learn the parameters of the predictor f using a standard learning algorithm with the kernel trick k(x, x′) = ⟨ψ(x), ψ(x′)⟩, where x, x′ ∈ X. Here, f is the base-learner and ψ is a mapping function from the input space X to a dot product space H. The similarity measure k is usually called a kernel (Hofmann et al., 2008).

As in traditional supervised learning problems, the base-learner for the t-th single task can use a predefined kernel, e.g., the radial basis function, to map the input into a dot product space for efficient learning. Once the base-learner is obtained on the support set, its performance is evaluated on the query set Q_t by the following loss function:

L_t = Σ_{(x̃, ỹ) ∈ Q_t} ℓ( f(ψ(x̃)), ỹ ),
where ℓ can be any differentiable function, e.g., the cross-entropy loss. In the meta-learning setting for few-shot learning, we usually consider a batch of tasks. Thus, the meta-learner is trained by optimizing the following objective function w.r.t. the empirical loss on T tasks:

min Σ_{t=1}^{T} Σ_{(x̃, ỹ) ∈ Q_t} ℓ( f_t(ψ_t(x̃)), ỹ ),

where ψ_t is the feature mapping function, which can be obtained by learning a task-specific kernel for each task with data-driven random Fourier features.

In this work, we employ kernel ridge regression (KRR), which has an efficient closed-form solution, as the base-learner for few-shot learning. The kernel value k(x_i, x_j) in the Gram matrix K can be computed as ψ(x_i)^⊤ψ(x_j), where “⊤” is the transpose operation. The base-learner for a single task can be obtained by solving the following objective w.r.t. the support set of this task,

α = argmin_α ‖Y − αK‖² + λ tr(αKα^⊤),    (3)

which admits a closed-form solution

α = Y(K + λI)^{−1}.    (4)

The learned predictor f is then applied to samples x̃ in the query set Q:

Ŷ = αK̃.

Here, K̃ is the kernel matrix with each element k(x, x̃) computed between the samples from the support and query sets. Note that we also treat λ in (3) as a trainable parameter by leveraging the meta-learning setting, and all these parameters are learned by the meta-learner.
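As a concrete sketch, the closed-form base-learner α = Y(K + λI)^{−1} and its query-set prediction can be written in a few lines of NumPy. The function names and shapes below are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def krr_base_learner(Phi_s, Y_s, lam=0.1):
    """Fit the KRR base-learner in closed form: alpha = Y (K + lam*I)^{-1}.

    Phi_s: (n_s, D) feature map of the support set; Y_s: (n_s, C) one-hot labels.
    """
    K = Phi_s @ Phi_s.T                              # Gram matrix of the support set
    alpha = Y_s.T @ np.linalg.inv(K + lam * np.eye(K.shape[0]))
    return alpha                                     # (C, n_s)

def krr_predict(alpha, Phi_s, Phi_q):
    """Apply the learned predictor to the query set via the cross-kernel."""
    K_tilde = Phi_s @ Phi_q.T                        # kernel between support and query
    return (alpha @ K_tilde).T                       # (n_q, C) class scores
```

With a very small λ, the predictor essentially interpolates the support labels; λ trades off fit against regularization, which is why it is useful to treat it as a meta-learned parameter.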

Rather than using pre-defined kernels, we consider learning adaptive kernels with random Fourier features in a data-driven way. Moreover, we leverage the shared knowledge by exploring dependencies among related tasks to learn rich features for building up informative kernels.

2.2 Random Fourier Features

Random Fourier features (RFFs) were proposed to construct approximate translation-invariant kernels using explicit feature maps (Rahimi & Recht, 2007), based on Bochner’s theorem (Rudin, 1962).

Theorem 1 (Bochner's theorem) (Rudin, 1962). A continuous, real-valued, symmetric and shift-invariant function k(x, x′) = k(x − x′) on ℝ^d is a positive definite kernel if and only if it is the Fourier transform of a positive finite measure p(ω), such that

k(x − x′) = ∫_{ℝ^d} p(ω) e^{iω^⊤(x − x′)} dω = E_ω[ e^{iω^⊤x} (e^{iω^⊤x′})* ].
It is guaranteed that z(x)^⊤z(x′) is an unbiased estimation of k(x, x′) with sufficient RFF bases ω drawn from p(ω) (Rahimi & Recht, 2007).

For a predefined kernel, e.g., the radial basis function (RBF), we use Monte Carlo sampling to draw D bases from the spectral distribution p(ω), which gives rise to the explicit feature map:

z(x) = √(2/D) [ cos(ω₁^⊤x + b₁), …, cos(ω_D^⊤x + b_D) ]^⊤,    (7)

where ω₁, …, ω_D are the random bases sampled from p(ω), and b₁, …, b_D are biases sampled from a uniform distribution with a range of [0, 2π]. Finally, the kernel values in K are computed as the dot product of their random feature maps with the same bases: k(x, x′) ≈ z(x)^⊤z(x′).
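The feature map in (7) can be sketched directly in NumPy. For the RBF kernel exp(−γ‖x − x′‖²), Bochner's theorem gives the Gaussian spectral distribution p(ω) = N(0, 2γI); the bandwidth parameterization and all names below are our illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def rff_map(X, omegas, biases):
    """z(x) = sqrt(2/D) * cos(omega^T x + b), stacked over D sampled bases."""
    D = omegas.shape[0]
    return np.sqrt(2.0 / D) * np.cos(X @ omegas.T + biases)

d, D, gamma = 3, 10000, 0.5
# For k(x, x') = exp(-gamma * ||x - x'||^2), p(omega) = N(0, 2*gamma*I).
omegas = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(D, d))
biases = rng.uniform(0.0, 2.0 * np.pi, size=D)

x, y = rng.normal(size=(1, d)), rng.normal(size=(1, d))
exact = np.exp(-gamma * np.sum((x - y) ** 2))
approx = (rff_map(x, omegas, biases) @ rff_map(y, omegas, biases).T).item()
```

As D grows, the Monte Carlo estimate z(x)^⊤z(x′) concentrates around the exact kernel value at the usual O(1/√D) rate, which is the sampling-rate trade-off discussed later in the paper.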

3 Meta Variational Random Features

We introduce our MetaVRF using a latent variable model in which we treat random Fourier bases as latent variables inferred from data. Learning kernels with random Fourier features is tantamount to finding the posterior distribution over random bases in a data-driven way. It is naturally cast into a variational inference problem, where the optimization objective is derived from an evidence lower bound (ELBO) under the meta-learning framework.

3.1 Evidence Lower Bound

From a probabilistic perspective, under the meta-learning setting for few-shot learning, the random feature basis ω can be obtained by maximizing the conditional predictive log-likelihood of samples from the query set Q:

log p(ỹ | x̃, S) = log ∫ p(ỹ | x̃, S, ω) p(ω | x̃, S) dω.
We adopt a conditional prior distribution p(ω | x̃, S) over the base ω as in the conditional variational auto-encoder (CVAE) (Sohn et al., 2015), rather than an uninformative prior (Kingma & Welling, 2013; Rezende et al., 2014). By depending on the input x̃, we infer the bases that can specifically represent the data, while leveraging the context of the current task by conditioning on the support set S.

In order to infer the posterior over ω, which is generally intractable, we resort to using a variational distribution q(ω | S) to approximate it, where the base ω is conditioned on the support set S by leveraging meta-learning. We can obtain the variational distribution by minimizing the Kullback-Leibler (KL) divergence

q*(ω | S) = argmin_q D_KL[ q(ω | S) ‖ p(ω | x̃, ỹ, S) ].
By applying Bayes' rule to the posterior p(ω | x̃, ỹ, S), we can derive the ELBO as

log p(ỹ | x̃, S) ≥ E_{q(ω|S)}[ log p(ỹ | x̃, S, ω) ] − D_KL[ q(ω | S) ‖ p(ω | x̃, S) ].    (10)

The first term of the ELBO is the predictive log-likelihood conditioned on the observation x̃ and the inferred RFF bases ω. Maximizing it enables us to make an accurate prediction for the query set by utilizing the inferred bases from the support set. The second term in the ELBO minimizes the discrepancy between the meta variational distribution q(ω | S) and the meta prior p(ω | x̃, S), which encourages samples from the support and query sets to share the same random Fourier bases. The full derivation of the ELBO is provided in the supplementary material.
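For reference, the key step of the derivation is Jensen's inequality applied to the marginal predictive likelihood (a condensed version of the supplementary derivation):

```latex
\log p(\tilde{y}\mid\tilde{x},\mathcal{S})
  = \log \int p(\tilde{y}\mid\tilde{x},\mathcal{S},\boldsymbol{\omega})\,
      p(\boldsymbol{\omega}\mid\tilde{x},\mathcal{S})\, d\boldsymbol{\omega}
  = \log \mathbb{E}_{q(\boldsymbol{\omega}\mid\mathcal{S})}\!
      \left[\frac{p(\tilde{y}\mid\tilde{x},\mathcal{S},\boldsymbol{\omega})\,
                  p(\boldsymbol{\omega}\mid\tilde{x},\mathcal{S})}
                 {q(\boldsymbol{\omega}\mid\mathcal{S})}\right]
  \ge \mathbb{E}_{q(\boldsymbol{\omega}\mid\mathcal{S})}
      \big[\log p(\tilde{y}\mid\tilde{x},\mathcal{S},\boldsymbol{\omega})\big]
    - D_{\mathrm{KL}}\big[q(\boldsymbol{\omega}\mid\mathcal{S})\,\|\,
      p(\boldsymbol{\omega}\mid\tilde{x},\mathcal{S})\big].
```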

We now obtain the objective by maximizing the ELBO with respect to a batch of T tasks:

max Σ_{t=1}^{T} { E_{q(ω_t|S_t)}[ log p(ỹ_t | x̃_t, S_t, ω_t) ] − D_KL[ q(ω_t | S_t) ‖ p(ω_t | x̃_t, S_t) ] },    (12)

where S_t is the support set of the t-th task associated with its specific bases ω_t, and (x̃_t, ỹ_t) is a sample from the query set of the t-th task. Directly optimizing the above objective does not take into account the task dependency. Thus, we introduce context inference by conditioning the posterior on both the support set of the current task and the shared knowledge extracted from previous tasks.

Figure 2: Illustration of MetaVRF in a directed graphical model, where (x̃, ỹ) is a sample in the query set Q. The base ω_t of the t-th task is dependent on the support set S_t of the current task and the context of related tasks.

3.2 Context Inference

We propose a context inference which puts the inference of random feature bases for the current task in the context of related tasks. We replace the variational distribution q(ω_t | S_t) in (10) with a conditional distribution that makes the bases ω_t of the current t-th task conditioned also on the context of related tasks.

The context inference gives rise to a new ELBO, as follows:

Σ_{t=1}^{T} { E_{q(ω_t|S_t, C_t)}[ log p(ỹ_t | x̃_t, S_t, ω_t) ] − D_KL[ q(ω_t | S_t, C_t) ‖ p(ω_t | x̃_t, S_t) ] },    (13)

where C_t denotes the context of previously experienced tasks. The model can be represented in a directed graphical model as shown in Figure 2. In a practical sense, the KL term in (13) encourages the model to extract useful information from previous tasks for inferring the spectral distribution associated with each individual sample of the query set in the current task.

The context inference integrates the knowledge shared across tasks with the task-specific knowledge to build up adaptive kernels for individual tasks. The inferred random features are highly informative due to the absorbed information from prior knowledge of experienced tasks. The base-learner built on the inferred kernel with the informative random features can effectively solve the current task.

However, since there is usually a huge number of related tasks, it is non-trivial to model all these tasks simultaneously. We consider using recurrent neural networks to gradually accumulate information episodically along with the learning process by organizing tasks in a sequence. We propose an LSTM-based inference network by leveraging its innate capability of remembering long-term information (Gers & Schmidhuber, 2000). The LSTM offers a well-suited structure to implement the context inference. The cell state can store and accrue the meta knowledge shared among related tasks, which can also be updated when experiencing a new task in each episode during the course of learning; the output is used to adapt to each specific task.

To be more specific, we model the variational posterior through q(ω_t | h_t), which is parameterized as a multi-layer perceptron (MLP). Note that h_t is the output from an LSTM that takes the support set S_t and the previous states h_{t−1} and c_{t−1} as inputs. We implement the inference network with both vanilla and bidirectional LSTMs (Schuster & Paliwal, 1997; Graves & Schmidhuber, 2005). For a vanilla LSTM, we have

[h_t, c_t] = LSTM( s_t, h_{t−1}, c_{t−1} ),

where LSTM(·) is a vanilla LSTM network that takes the representation s_t of the current support set, the output h_{t−1} and the cell state c_{t−1} as the input. s_t is the average over the feature representation vectors of samples in the support set S_t (Zaheer et al., 2017). The feature representation is obtained by a shared convolutional network. To incorporate more context information, we also implement the inference with a bidirectional LSTM, and we have h_t = [h_t^→ ; h_t^←], where h_t^→ and h_t^← are the outputs from the forward and backward LSTMs, respectively, and [· ; ·] indicates a concatenation operation.
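The recurrent update above can be sketched with a minimal NumPy LSTM cell. The weights here are random placeholders standing in for the trained inference network, and the class and function names are our own:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class VanillaLSTMCell:
    """Minimal LSTM cell; weights are illustrative placeholders, not learned."""
    def __init__(self, in_dim, hid_dim):
        scale = 1.0 / np.sqrt(in_dim + hid_dim)
        self.W = rng.normal(0.0, scale, size=(4 * hid_dim, in_dim + hid_dim))
        self.b = np.zeros(4 * hid_dim)

    def step(self, s_t, h_prev, c_prev):
        z = self.W @ np.concatenate([s_t, h_prev]) + self.b
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget, output gates
        c_t = f * c_prev + i * np.tanh(g)              # cell state accrues shared knowledge
        h_t = o * np.tanh(c_t)                         # task-adaptive output
        return h_t, c_t

def episode_step(cell, support_features, h_prev, c_prev):
    """One episode: average the support features (a permutation-invariant set
    encoding) and refine the running cell state with the current task."""
    s_t = support_features.mean(axis=0)
    return cell.step(s_t, h_prev, c_prev)
```

Averaging the support features before the LSTM step keeps the set representation invariant to the ordering of the support samples, which is the role of the deep-set style encoding (Zaheer et al., 2017) cited above.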

Therefore, the optimization objective with the context inference is:

max Σ_{t=1}^{T} { E_{q(ω_t|S_t, h_{t−1}, c_{t−1})}[ log p(ỹ_t | x̃_t, S_t, ω_t) ] − D_KL[ q(ω_t | S_t, h_{t−1}, c_{t−1}) ‖ p(ω_t | x̃_t, S_t) ] },

where the variational approximate posterior q is taken as a multivariate Gaussian with a diagonal covariance. Given the support set as input, the mean μ and standard deviation σ are output from the inference network. The conditional prior p(ω_t | x̃_t, S_t) is implemented with a prior network, which takes an aggregated representation obtained by using the cross attention (Kim et al., 2019) between x̃_t and S_t. The details of the prior network are provided in the supplementary material. To enable back-propagation with the sampling operation during training, we adopt the reparameterization trick (Rezende et al., 2014; Kingma & Welling, 2013) as ω = μ + σ ⊙ ε, where ε ∼ N(0, I).

During the course of learning, the LSTMs accumulate knowledge in the cell state by updating their cells using information extracted from each task. For the current task t, the knowledge stored in the cell is combined with the task-specific information from the support set to infer the spectral distribution for this task. To accrue the information across all the tasks in the meta-training set, the output and the cell state of the LSTMs are passed down across batches. As a result, the final cell state contains the distilled prior knowledge from all the experienced tasks in the meta-training set.

Fast Adaptation. Once meta-training ceases, the output and the cell state are directly used for a new incoming task in the meta-test set to achieve fast adaptation with a simple feed-forward computation operation. To be more specific, for a task with the support set S in the meta-test set, we draw samples as the bases: ω ∼ q(ω | S, h, c), where h is the output from either a vanilla LSTM or a bidirectional LSTM, depending on which is used during the meta-training stage. The bases are adopted to compute the kernels on the support set and construct the classifier of the base-learner for the task, using (4). The classifier is then used to make predictions of samples in the query set for performance evaluation.
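Pulling the pieces together, fast adaptation is a single feed-forward pass: sample bases from the inferred Gaussian posterior via the reparameterization trick, build the random feature map, and solve the KRR base-learner in closed form. The posterior parameters μ and log σ below are random placeholders standing in for the LSTM inference network's outputs:

```python
import numpy as np

rng = np.random.default_rng(1)

def rff(X, omegas, biases):
    D = omegas.shape[0]
    return np.sqrt(2.0 / D) * np.cos(X @ omegas.T + biases)

def adapt_and_predict(X_s, Y_s, X_q, mu, log_sigma, lam=0.1):
    """One feed-forward adaptation step: no gradient updates on the new task."""
    eps = rng.normal(size=mu.shape)
    omegas = mu + np.exp(log_sigma) * eps            # reparameterized bases, (D, d)
    biases = rng.uniform(0.0, 2.0 * np.pi, size=omegas.shape[0])
    Phi_s, Phi_q = rff(X_s, omegas, biases), rff(X_q, omegas, biases)
    K = Phi_s @ Phi_s.T                              # support-set Gram matrix
    alpha = Y_s.T @ np.linalg.inv(K + lam * np.eye(K.shape[0]))
    return (alpha @ (Phi_s @ Phi_q.T)).T             # (n_q, C) query scores
```

Because adaptation involves only sampling and a small matrix inversion, its cost is dominated by the n_s × n_s inverse, which is trivial in the few-shot regime where n_s is tiny.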

4 Related Work

Meta-learning, or learning to learn, endows machine learning models with the ability to improve their performance by leveraging knowledge extracted from a number of prior tasks. It has received increasing research interest with breakthroughs in many directions (Finn et al., 2017; Rusu et al., 2019; Gordon et al., 2019; Aravind Rajeswaran, 2019). Gradient-based methods (e.g., MAML (Finn et al., 2017)) learn an appropriate initialization of model parameters and adapt it to new tasks with only a few gradient steps (Finn & Levine, 2018; Zintgraf et al., 2019; Rusu et al., 2019). Learning a shared optimization algorithm has also been explored in order to quickly learn new tasks (Ravi & Larochelle, 2017; Andrychowicz et al., 2016; Chen et al., 2017).

Metric learning has been widely studied with great success for few-shot learning (Vinyals et al., 2016; Snell et al., 2017; Satorras & Estrach, 2018; Oreshkin et al., 2018; Allen et al., 2019). The basic assumption is that a common metric space is shared across related tasks. Snell et al. (Snell et al., 2017) constructed a prototype for each class by averaging feature representations of samples from the class in the metric space. The query images are matched to prototypes by computing their distances to conduct classification. To enhance the prototype representation, Allen et al. (Allen et al., 2019) proposed an infinite mixture of prototypes (IMP) to adaptively represent the data distribution of each class instead of using a single vector. In addition, Oreshkin et al. (Oreshkin et al., 2018) proposed a task-dependent adaptive metric for improved few-shot learning. They established prototypes of classes conditioned on a task representation encoded by a task embedding network.

While these meta-learning algorithms have made great progress in few-shot learning tasks, exploring prior knowledge from previous tasks remains an open challenge (Titsias et al., 2019). In this work, we introduce kernels based on random features as the base-learners, which enables us to acquire shared knowledge across tasks by modeling their dependency via the random feature basis of kernels.

Kernel learning with random Fourier features is a versatile and powerful tool in machine learning (Bishop, 2006; Hofmann et al., 2008; Shervashidze et al., 2011). Pioneering works (Bach et al., 2004; Gönen & Alpaydın, 2011; Duvenaud et al., 2013) learn to combine predefined kernels in a multi-kernel learning manner. Kernel approximation by random Fourier features (RFFs) (Rahimi & Recht, 2007) is an effective technique for efficient kernel learning (Gärtner et al., 2002), which has recently become increasingly popular (Sinha & Duchi, 2016; Carratino et al., 2018). Recent works (Wilson & Adams, 2013) learn kernels in the frequency domain by modeling the spectral distribution as a mixture of Gaussians and computing its optimal linear combination. Instead of modeling the spectral distribution with explicit density functions, other works focus on optimizing the random base sampling strategy (Yang et al., 2015; Sinha & Duchi, 2016). Nonetheless, it has been shown that accurate approximation of kernels does not necessarily result in high classification performance (Avron et al., 2016; Chang et al., 2017). This suggests that learning adaptive kernels with random features by data-driven sampling strategies (Sinha & Duchi, 2016) can improve the performance, even with a low sampling rate compared to using universal random features (Avron et al., 2016; Chang et al., 2017).

Our MetaVRF is the first work to introduce kernel learning with random features to the meta-learning framework for few-shot learning. The optimization of MetaVRF is naturally cast as a variational inference and the context inference offers a principled way to incorporate prior knowledge and achieve informative and adaptive kernels.

5 Experiments

We evaluate our MetaVRF on several few-shot learning problems for both regression and classification. We demonstrate the benefit of exploring task dependency by implementing a baseline MetaVRF (12) without using the LSTM, which infers the random bases solely from the support set of each task. We also conduct further analysis to validate the effectiveness of our MetaVRF by showing its performance with deep embedding architectures, different numbers of bases, and under versatile and challenging settings with inconsistent training and test conditions.

5.1 Few-Shot Regression

We conduct regression tasks with different numbers of shots k, and compare our MetaVRF with MAML (Finn et al., 2017), a representative meta-learning algorithm. We follow the MAML work (Finn et al., 2017) to fit a target sine function y = A sin(wx + b), with only a few annotated samples. A, w, and b denote the amplitude, frequency, and phase, respectively, which follow a uniform distribution within the corresponding interval. The goal is to estimate the target sine function given only k randomly sampled data points. In our experiments, we consider the input x in the range of [−5, 5], and conduct three tests under increasing numbers of shots k. For a fair comparison, we compute the feature embedding using a small multi-layer perceptron (MLP) with two hidden layers, following the same settings used in MAML.
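The episodic task generator for this experiment can be sketched as follows; the sampling intervals for amplitude, frequency, and phase are illustrative assumptions, not necessarily the exact intervals used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sine_task(k_shot, n_query=50):
    """Sample one few-shot regression task y = A * sin(w * x + b)."""
    A = rng.uniform(0.1, 5.0)            # amplitude  (assumed interval)
    w = rng.uniform(0.5, 2.0)            # frequency  (assumed interval)
    b = rng.uniform(0.0, np.pi)          # phase      (assumed interval)
    target = lambda x: A * np.sin(w * x + b)
    x_s = rng.uniform(-5.0, 5.0, size=(k_shot, 1))   # support inputs
    x_q = rng.uniform(-5.0, 5.0, size=(n_query, 1))  # query inputs
    return (x_s, target(x_s)), (x_q, target(x_q)), (A, w, b)
```

Each call yields one episode: the k support points are all the learner sees of the new function, and the query points measure how well the adapted predictor recovers it.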

Figure 3: Performance (MSE) comparison for few-shot regression. Our MetaVRF fits the target function well, even with only three shots, and consistently outperforms regular RFFs and the counterpart MAML. (Legend: MetaVRF with bi-lstm; MetaVRF with lstm; MetaVRF w/o lstm; MAML; ground truth; support samples.)
Method | miniImageNet 1-shot | miniImageNet 5-shot | cifar-fs 1-shot | cifar-fs 5-shot
Matching net (Vinyals et al., 2016) | 44.2 | 57 | – | –
MAML (Finn et al., 2017) | 48.7±1.8 | 63.1±0.9 | 58.9±1.9 | 71.5±1.0
MAML (C) | 46.7±1.7 | 61.1±0.1 | 58.9±1.8 | 71.5±1.1
Meta-LSTM (Ravi & Larochelle, 2017) | 43.4±0.8 | 60.6±0.7 | – | –
Proto net (Snell et al., 2017) | 47.4±0.6 | 65.4±0.5 | 55.5±0.7 | 72.0±0.6
Relation net (Sung et al., 2018) | 50.4±0.8 | 65.3±0.7 | 55.0±1.0 | 69.3±0.8
SNAIL (32C) by (Bertinetto et al., 2019) | 45.1 | 55.2 | – | –
GNN (Garcia & Bruna, 2018) | 50.3 | 66.4 | 61.9 | 75.3
PLATIPUS (Finn et al., 2018) | 50.1±1.9 | – | – | –
VERSA (Gordon et al., 2019) | 53.3±1.8 | 67.3±0.9 | 62.5±1.7 | 75.1±0.9
R2-D2 (C) (Bertinetto et al., 2019) | 49.5±0.2 | 65.4±0.2 | 62.3±0.2 | 77.4±0.2
R2-D2 (Devos et al., 2019) | 51.7±1.8 | 63.3±0.9 | 60.2±1.8 | 70.9±0.9
CAVIA (Zintgraf et al., 2019) | 51.8±0.7 | 65.6±0.6 | – | –
iMAML (Aravind Rajeswaran, 2019) | 49.3±1.9 | – | – | –
RFFs (2048d) | 52.8±0.9 | 65.4±0.9 | 61.1±0.8 | 74.7±0.9
MetaVRF (w/o lstm, 780d) | 51.3±0.8 | 66.1±0.7 | 61.1±0.7 | 74.3±0.9
MetaVRF (vanilla lstm, 780d) | 53.1±0.9 | 66.8±0.7 | 62.1±0.8 | 76.0±0.8
MetaVRF (bi-lstm, 780d) | 54.2±0.8 | 67.8±0.7 | 63.1±0.7 | 76.5±0.9
Table 1: Performance (%) on miniImageNet and cifar-fs (5-way).
Method | Omniglot 5-way 1-shot | Omniglot 5-way 5-shot | Omniglot 20-way 1-shot | Omniglot 20-way 5-shot
Siamese net (Koch, 2015) | 96.7 | 98.4 | 88 | 96.5
Matching net (Vinyals et al., 2016) | 98.1 | 98.9 | 93.8 | 98.5
MAML (Finn et al., 2017) | 98.7 | 99.9±0.1 | 95.8±0.3 | 98.9±0.2
Proto net (Snell et al., 2017) | 98.5±0.2 | 99.5±0.1 | 95.3±0.2 | 98.7±0.1
SNAIL (Mishra et al., 2018) | 99.1±0.2 | 99.8±0.1 | 97.6±0.3 | 99.4±0.2
GNN (Garcia & Bruna, 2018) | 99.2 | 99.7 | 97.4 | 99.0
VERSA (Gordon et al., 2019) | 99.7±0.2 | 99.8±0.1 | 97.7±0.3 | 98.8±0.2
R2-D2 (Bertinetto et al., 2019) | 98.6 | 99.7 | 94.7 | 98.9
IMP (Allen et al., 2019) | 98.4±0.3 | 99.5±0.1 | 95.0±0.1 | 98.6±0.1
RFFs (2048d) | 99.5±0.2 | 99.5±0.2 | 97.2±0.3 | 98.3±0.2
MetaVRF (w/o lstm, 780d) | 99.6±0.2 | 99.6±0.2 | 97.0±0.3 | 98.4±0.2
MetaVRF (vanilla lstm, 780d) | 99.7±0.2 | 99.8±0.1 | 97.5±0.3 | 99.0±0.2
MetaVRF (bi-lstm, 780d) | 99.8±0.1 | 99.9±0.1 | 97.8±0.3 | 99.2±0.2
Table 2: Performance (%) on Omniglot.

The results in Figure 3 show that our MetaVRF fits the function well with only three shots and performs better with an increasing number of shots, almost entirely fitting the target function with ten shots. Moreover, the results demonstrate the advantage of exploring task dependency by LSTM-based inference. MetaVRF with a bi-lstm performs better than with a vanilla lstm since more context tasks are incorporated by the bi-lstm. In addition, we observe that MetaVRF performs better than MAML for all three settings with varying numbers of shots. We provide more results on few-shot regression tasks in the supplementary material.

5.2 Few-Shot Classification

The classification experiments are conducted on three commonly-used benchmark datasets, i.e., Omniglot (Lake et al., 2015), miniImageNet (Vinyals et al., 2016) and CIFAR-FS (Krizhevsky et al., 2009); for more details, please refer to the supplementary material. We extract image features using a shallow convolutional neural network with the same architecture as in (Gordon et al., 2019). We do not use any fully connected layers for these CNNs. All feature vectors share the same dimension. We also evaluate the baseline method, random Fourier features (RFFs), to approximate the common Gaussian kernel. The inference network is a three-layer MLP with rectifier non-linearities in the hidden layers, whose input size differs for the vanilla and bidirectional LSTMs.

The key hyperparameter, the number of bases D in (7), is set to 780 for MetaVRF in all experiments, while we use D = 2048 for RFFs as this produces the best performance. The sampling rate in our MetaVRF is thus much lower than in previous works using RFFs, in which D is usually set to several times the dimension of the input features (Yu et al., 2016; Rahimi & Recht, 2007). We adopt a similar meta-testing protocol as (Gordon et al., 2019; Finn et al., 2017), but we test on a larger number of episodes than is standard and present the results with confidence intervals. All reported results are produced by models trained from scratch. We compare with previous methods that use the same training procedures and similar shallow conventional CNN architectures as ours. The comparison results on the three benchmark datasets are reported in Tables 1 and 2.

On all benchmark datasets, MetaVRF delivers state-of-the-art performance. Even with a relatively low sampling rate, MetaVRF produces consistently better performance than the baseline RFFs. MetaVRF with a bi-lstm outperforms the one with a vanilla lstm since it can leverage more information. It is worth mentioning that MetaVRF with a bi-lstm achieves good performance (54.2%) under the 5-way 1-shot setting on the miniImageNet dataset, surpassing the second best model by 0.9%. The MetaVRFs with bi-lstm and vanilla lstm consistently outperform the one without the lstm, which demonstrates the effectiveness of using an lstm to explore task dependency. Note that on Omniglot, the performance of existing methods saturates, and MetaVRF with bi-lstm achieves the best performance for most settings, including 5-way 1-shot, 5-way 5-shot, and 20-way 1-shot. It is also competitive under the 20-way 5-shot setting, falling within the error bars of the state of the art. Note that, in Table 1, we also implement MAML (C) with more channels in each convolutional layer. However, it obtains only modest performance; we believe the increased model size leads to overfitting. Since in the original SNAIL, a very deep ResNet-12 network is used for embedding, we cite the result of SNAIL reported in Bertinetto et al. (2019) using similar shallow networks as ours. We also cite the original results of R2-D2 (Bertinetto et al., 2019) for fair comparison.

5.3 Further Analysis

Deep embedding.

Our MetaVRF is independent of the convolutional architectures for feature extraction and can work with deeper embeddings, either pre-trained or trained from scratch. In general, the performance improves with more powerful feature extraction architectures. We evaluate our method using pre-trained embeddings in order to compare with existing methods using deep embedding architectures. To benchmark with those methods, we adopt the pre-trained embeddings from a 28-layer wide residual network (WRN-28-10) (Zagoruyko & Komodakis, 2016), in a similar fashion to (Rusu et al., 2019; Bauer et al., 2017; Qiao et al., 2018). We choose activations in the 21st layer, with average pooling over spatial dimensions, as feature embeddings. We show the comparison results on the miniImageNet dataset for the 5-way 1-shot and 5-shot settings in Table 3. Our MetaVRF with bi-lstm achieves the best performance under both settings and largely surpasses LEO, a recently proposed meta-learning method, especially on the challenging 5-way 1-shot setting. Note that MetaVRF with a vanilla lstm and without an lstm also produce competitive performance.

Method | 1-shot | 5-shot
Meta-SGD (Li et al., 2017) | 54.24±0.03 | 70.86±0.04
(Gidaris & Komodakis, 2018) | 56.20±0.86 | 73.00±0.64
(Bauer et al., 2017) | 56.30±0.40 | 73.90±0.30
(Munkhdalai et al., 2017) | 57.10±0.70 | 70.04±0.63
(Qiao et al., 2018) | 59.60±0.41 | 73.54±0.19
LEO (Rusu et al., 2019) | 61.76±0.08 | 77.59±0.12
SNAIL (Mishra et al., 2018) | 55.71±0.99 | 68.88±0.92
TADAM (Oreshkin et al., 2018) | 58.50±0.30 | 76.70±0.30
MetaVRF (w/o lstm, 780d) | 62.12±0.07 | 77.05±0.28
MetaVRF (vanilla lstm, 780d) | 63.21±0.06 | 77.83±0.28
MetaVRF (bi-lstm, 780d) | 63.80±0.05 | 77.97±0.28
Table 3: Performance (%) on miniImageNet (5-way).

Efficiency. Regular random Fourier features (RFFs) usually require high sampling rates to achieve satisfactory performance. In contrast, our MetaVRF achieves high performance with a relatively low sampling rate, which guarantees its efficiency. In Figure 4, we compare the performance of fully trained models using regular RFFs and our MetaVRF with bi-lstm under different numbers of bases, for the -way -shot setting. MetaVRF with bi-lstm consistently yields higher performance than regular RFFs with the same number of sampled bases. The results verify the efficiency of our MetaVRF in learning adaptive kernels and its effectiveness in improving performance by exploring dependencies among related tasks.
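For background, the regular RFF baseline compared against here is the classic construction of Rahimi & Recht (2007): sample D frequencies from the kernel's spectral density and approximate the kernel by an inner product of cosine features. A minimal NumPy sketch (for an RBF kernel; the bandwidth and D are illustrative, not the paper's settings):

```python
import numpy as np

def rff_features(X, omega, b):
    """Map rows of X to D random Fourier features:
    z(x) = sqrt(2/D) * cos(omega @ x + b)."""
    D = omega.shape[0]
    return np.sqrt(2.0 / D) * np.cos(X @ omega.T + b)

rng = np.random.default_rng(0)
d, D, gamma = 5, 5000, 0.1  # input dim, number of bases, RBF bandwidth

# For k(x, y) = exp(-gamma * ||x - y||^2), the spectral density is
# a Gaussian N(0, 2 * gamma * I), so we sample frequencies from it.
omega = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

x, y = rng.normal(size=d), rng.normal(size=d)
k_true = np.exp(-gamma * np.sum((x - y) ** 2))
z_x, z_y = rff_features(x[None, :], omega, b), rff_features(y[None, :], omega, b)
k_approx = float(z_x @ z_y.T)  # converges to k_true as D grows
```

The approximation error shrinks at roughly O(1/sqrt(D)), which is why regular RFFs need many bases; MetaVRF instead learns where to place the spectral mass so that fewer bases suffice.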

Figure 4: Performance with different numbers of bases. Our MetaVRF consistently achieves better performance than regular RFFs, especially with relatively low sampling rates.
Figure 5: Performance with varied ways and shots on Omniglot.

Versatility. In contrast to most existing meta-learning methods, our MetaVRF can be used in versatile settings. We evaluate MetaVRF in the more challenging scenario where the numbers of ways and shots are inconsistent between training and testing. Specifically, we test the performance of MetaVRF on tasks with varied numbers of ways and shots when it is trained on one particular -way -shot task. As shown in Figure 5, the trained model can still produce good performance, even in the challenging condition with a far higher number of ways. In particular, the model trained on the -way -shot task retains a high accuracy of on the -way setting, as shown in Figure 5(a). The results indicate that MetaVRF exhibits considerable robustness and flexibility across a great variety of testing conditions.

6 Conclusion

In this paper, we introduce kernel approximation based on random Fourier features into the meta-learning framework for few-shot learning. We propose meta variational random features (MetaVRF), which leverage variational inference and meta-learning to infer the spectral distribution of random Fourier features in a data-driven way. MetaVRF generates random Fourier features of high representational power at a relatively low spectral sampling rate by using an LSTM-based inference network to explore shared knowledge across tasks. In practice, our LSTM-based inference network demonstrates a great ability to quickly adapt to specific tasks for improved performance. Experimental results on both regression and classification tasks demonstrate the effectiveness of MetaVRF for few-shot learning.


Acknowledgements

This research was supported in part by the Natural Science Foundation of China (No. 61976060, 61871016, 61876098).


References

  • Allen et al. (2019) Allen, K. R., Shelhamer, E., Shin, H., and Tenenbaum, J. B. Infinite mixture prototypes for few-shot learning. In Proceedings of the 36th International Conference on Machine Learning, pp. 232–241, 2019.
  • Andrychowicz et al. (2016) Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., Shillingford, B., and de Freitas, N. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, 2016.
  • Rajeswaran et al. (2019) Rajeswaran, A., Finn, C., Kakade, S. M., and Levine, S. Meta-learning with implicit gradients. arXiv preprint arXiv:1909.04630, 2019.
  • Avron et al. (2016) Avron, H., Sindhwani, V., Yang, J., and Mahoney, M. W. Quasi-monte carlo feature maps for shift-invariant kernels. The Journal of Machine Learning Research, 17(1):4096–4133, 2016.
  • Bach et al. (2004) Bach, F. R., Lanckriet, G. R., and Jordan, M. I. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 6, 2004.
  • Bauer et al. (2017) Bauer, M., Rojas-Carulla, M., Świątkowski, J. B., Schölkopf, B., and Turner, R. E. Discriminative k-shot learning using probabilistic models. arXiv preprint arXiv:1706.00326, 2017.
  • Bertinetto et al. (2019) Bertinetto, L., Henriques, J. F., Torr, P. H., and Vedaldi, A. Meta-learning with differentiable closed-form solvers. In International Conference on Learning Representations, 2019.
  • Bishop (2006) Bishop, C. M. Pattern recognition and machine learning. Springer, 2006.
  • Bullins et al. (2018) Bullins, B., Zhang, C., and Zhang, Y. Not-so-random features. In International Conference on Learning Representations, 2018.
  • Carratino et al. (2018) Carratino, L., Rudi, A., and Rosasco, L. Learning with sgd and random features. In Advances in Neural Information Processing Systems, pp. 10192–10203, 2018.
  • Chang et al. (2017) Chang, W.-C., Li, C.-L., Yang, Y., and Poczos, B. Data-driven random fourier features using stein effect. arXiv preprint arXiv:1705.08525, 2017.
  • Chen et al. (2017) Chen, Y., Hoffman, M. W., Colmenarejo, S. G., Denil, M., Lillicrap, T. P., Botvinick, M., and De Freitas, N. Learning to learn without gradient descent by gradient descent. In Proceedings of the 34th International Conference on Machine Learning, pp. 748–756. JMLR.org, 2017.
  • Devos et al. (2019) Devos, A., Chatel, S., and Grossglauser, M. Reproducing meta-learning with differentiable closed-form solvers. In ICLR Workshop, 2019.
  • Duvenaud et al. (2013) Duvenaud, D., Lloyd, J. R., Grosse, R., Tenenbaum, J. B., and Ghahramani, Z. Structure discovery in nonparametric regression through compositional kernel search. arXiv preprint arXiv:1302.4922, 2013.
  • Finn & Levine (2018) Finn, C. and Levine, S. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. In International Conference on Learning Representations, 2018.
  • Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126–1135. JMLR.org, 2017.
  • Finn et al. (2018) Finn, C., Xu, K., and Levine, S. Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pp. 9516–9527, 2018.
  • Garcia & Bruna (2018) Garcia, V. and Bruna, J. Few-shot learning with graph neural networks. In International Conference on Learning Representations, 2018.
  • Gärtner et al. (2002) Gärtner, T., Flach, P. A., Kowalczyk, A., and Smola, A. J. Multi-instance kernels. In International Conference on Machine Learning, 2002.
  • Gers & Schmidhuber (2000) Gers, F. A. and Schmidhuber, J. Recurrent nets that time and count. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, volume 3, pp. 189–194. IEEE, 2000.
  • Gidaris & Komodakis (2018) Gidaris, S. and Komodakis, N. Dynamic few-shot visual learning without forgetting. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4367–4375, 2018.
  • Gönen & Alpaydın (2011) Gönen, M. and Alpaydın, E. Multiple kernel learning algorithms. Journal of machine learning research, 12(Jul):2211–2268, 2011.
  • Gordon et al. (2019) Gordon, J., Bronskill, J., Bauer, M., Nowozin, S., and Turner, R. E. Meta-learning probabilistic inference for prediction. In International Conference on Learning Representations, 2019.
  • Graves & Schmidhuber (2005) Graves, A. and Schmidhuber, J. Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural networks, 18(5-6):602–610, 2005.
  • Hensman et al. (2017) Hensman, J., Durrande, N., and Solin, A. Variational fourier features for gaussian processes. The Journal of Machine Learning Research, 18(1):5537–5588, 2017.
  • Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Hofmann et al. (2008) Hofmann, T., Schölkopf, B., and Smola, A. J. Kernel methods in machine learning. The annals of statistics, pp. 1171–1220, 2008.
  • Kim et al. (2019) Kim, H., Mnih, A., Schwarz, J., Garnelo, M., Eslami, A., Rosenbaum, D., Vinyals, O., and Teh, Y. W. Attentive neural processes. In International Conference on Learning Representations, 2019.
  • Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Koch (2015) Koch, G. Siamese neural networks for one-shot image recognition. In ICML Workshop, 2015.
  • Krizhevsky et al. (2009) Krizhevsky, A. et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • Lake et al. (2015) Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
  • Li et al. (2019) Li, C.-L., Chang, W.-C., Mroueh, Y., Yang, Y., and Poczos, B. Implicit kernel learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2007–2016, 2019.
  • Li et al. (2017) Li, Z., Zhou, F., Chen, F., and Li, H. Meta-sgd: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.
  • Mishra et al. (2018) Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. A simple neural attentive meta-learner. In International Conference on Learning Representations, 2018.
  • Munkhdalai et al. (2017) Munkhdalai, T., Yuan, X., Mehri, S., and Trischler, A. Rapid adaptation with conditionally shifted neurons. arXiv preprint arXiv:1712.09926, 2017.
  • Oreshkin et al. (2018) Oreshkin, B., López, P. R., and Lacoste, A. Tadam: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pp. 721–731, 2018.
  • Qiao et al. (2018) Qiao, S., Liu, C., Shen, W., and Yuille, A. L. Few-shot image recognition by predicting parameters from activations. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 7229–7238, 2018.
  • Rahimi & Recht (2007) Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177–1184, 2007.
  • Ravi & Larochelle (2017) Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017.
  • Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
  • Rudin (1962) Rudin, W. Fourier analysis on groups. Wiley Online Library, 1962.
  • Rusu et al. (2019) Rusu, A. A., Rao, D., Sygnowski, J., Vinyals, O., Pascanu, R., Osindero, S., and Hadsell, R. Meta-learning with latent embedding optimization. In International Conference on Learning Representations, 2019.
  • Satorras & Estrach (2018) Satorras, V. G. and Estrach, J. B. Few-shot learning with graph neural networks. In International Conference on Learning Representations, 2018.
  • Schmidhuber (1992) Schmidhuber, J. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.
  • Schuster & Paliwal (1997) Schuster, M. and Paliwal, K. K. Bidirectional recurrent neural networks. IEEE transactions on Signal Processing, 45(11):2673–2681, 1997.
  • Shervashidze et al. (2011) Shervashidze, N., Schweitzer, P., Leeuwen, E. J. v., Mehlhorn, K., and Borgwardt, K. M. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.
  • Sinha & Duchi (2016) Sinha, A. and Duchi, J. C. Learning kernels with random features. In Advances in Neural Information Processing Systems, pp. 1298–1306, 2016.
  • Snell et al. (2017) Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087, 2017.
  • Sohn et al. (2015) Sohn, K., Lee, H., and Yan, X. Learning structured output representation using deep conditional generative models. In Advances in neural information processing systems, pp. 3483–3491, 2015.
  • Sung et al. (2018) Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H., and Hospedales, T. M. Learning to compare: Relation network for few-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208, 2018.
  • Thrun & Pratt (2012) Thrun, S. and Pratt, L. Learning to learn. Springer Science & Business Media, 2012.
  • Titsias et al. (2019) Titsias, M. K., Schwarz, J., Matthews, A. G. d. G., Pascanu, R., and Teh, Y. W. Functional regularisation for continual learning using gaussian processes. arXiv preprint arXiv:1901.11356, 2019.
  • Vinyals et al. (2016) Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638, 2016.
  • Wilson & Adams (2013) Wilson, A. and Adams, R. Gaussian process kernels for pattern discovery and extrapolation. In International Conference on Machine Learning, pp. 1067–1075, 2013.
  • Yang et al. (2015) Yang, Z., Wilson, A., Smola, A., and Song, L. A la carte–learning fast kernels. In Artificial Intelligence and Statistics, pp. 1098–1106, 2015.
  • Yu et al. (2016) Yu, F. X. X., Suresh, A. T., Choromanski, K. M., Holtmann-Rice, D. N., and Kumar, S. Orthogonal random features. In Advances in Neural Information Processing Systems, pp. 1975–1983, 2016.
  • Zagoruyko & Komodakis (2016) Zagoruyko, S. and Komodakis, N. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
  • Zaheer et al. (2017) Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. In Advances in Neural Information Processing Systems, pp. 3391–3401, 2017.
  • Zintgraf et al. (2019) Zintgraf, L., Shiarli, K., Kurin, V., Hofmann, K., and Whiteson, S. Fast context adaptation via meta-learning. In International Conference on Machine Learning, pp. 7693–7702, 2019.