A Hybrid Approach with Optimization and Metric-based Meta-Learner for Few-Shot Learning

04/04/2019 ∙ by Duo Wang, et al. ∙ IBM ∙ Microsoft ∙ Tsinghua University

Few-shot learning aims to learn classifiers for new classes with only a few training examples per class. Most existing few-shot learning approaches belong to either the metric-based or the optimization-based meta-learning category, both of which have achieved success in the simplified "k-shot N-way" image classification setting. Specifically, the optimization-based approaches train a meta-learner to predict the parameters of the task-specific classifiers. The task-specific classifiers are required to be homogeneous-structured to ease the parameter prediction, so these approaches can only handle few-shot learning problems where the tasks share a uniform number of classes. The metric-based approaches learn one task-invariant metric for all the tasks. Even though the metric-learning approaches allow different numbers of classes, they require all tasks to come from a similar domain such that there exists a uniform metric that works across tasks. In this work, we propose a hybrid meta-learning model called Meta-Metric-Learner which combines the merits of both optimization- and metric-based approaches. Our meta-metric-learning approach consists of two components: a task-specific metric-based learner as a base model, and a meta-learner that learns and specifies the base model. Thus our model is able to handle flexible numbers of classes as well as generate more generalized metrics for classification across tasks. We test our approach in the standard "k-shot N-way" few-shot learning setting following previous works and in a new, more realistic few-shot setting with flexible class numbers in both single-source and multi-source forms. Experiments show that our approach obtains superior performance in all settings.




I Introduction

Supervised deep learning methods have been widely used in visual classification tasks and have achieved great success [1, 2, 3, 4]. In practice, those methods usually require a large amount of labeled data for model training in order to make the learned model generalize well. However, collecting a sufficient amount of training data for each task requires a lot of human work, and the process is time-consuming or even infeasible for rare classes or classes that are hard to observe.

To solve this issue, few-shot learning (FSL) [5] was proposed, which aims to learn classifiers for new classes with only a few training examples. Generally, two key ideas of few-shot learning are data aggregation and knowledge sharing. First, though each single learning task lacks sufficient annotated data, the union of all the tasks provides a significantly larger amount of labeled data for model training. Therefore the model for a newly arriving task can benefit from all the previous tasks. Second, the experience of learning model parameters for various tasks in the past assists the learning process of the incoming new task. In recent years, deep learning techniques have been successfully exploited for FSL via learning meta-models from a large number of meta-training tasks. Related methods that have been proposed include: 1) learning a metric/similarity from multiple few-shot learning tasks with deep networks (metric-based meta-learning) [6, 7]; and 2) learning a meta-model on multiple few-shot learning tasks, which can then be used to predict model weights given a new few-shot learning task (optimization-based meta-learning) [8, 9, 10].

The aforementioned deep few-shot learning models are usually applied to the so-called “$k$-shot, $N$-way” scenario, in which each few-shot learning task has the same number $N$ of class labels and each label has $k$ training instances. However, such an “$N$-way” simplification is not realistic in real-world few-shot learning applications, because different tasks usually do not have the same number of classes. Existing optimization-based approaches build on the “$N$-way” simplification to let the meta-learner predict weights of homogeneous-structured task-specific networks. If we allow different tasks to have different numbers of labels, the task-specific networks become heterogeneous, which complicates the weight prediction of the meta-learner. To the best of our knowledge, none of the existing optimization-based few-shot learning approaches can resolve this issue. Although the metric-based approaches can cope with variations in the number of class labels, they suffer from limited model expressiveness: these methods usually learn a single task-invariant metric for all the few-shot learning tasks. However, because of the variety of tasks, the optimal metric also varies across tasks. The learned task-invariant metric may not generalize well if the tasks diverge.

Moreover, in real-world applications, the few-shot learning tasks usually come from different domains or different data sources. For example, for classification of hand-written letter images, we could have images from different languages or alphabets. In such a few-shot learning scenario, the two aforementioned issues of existing few-shot learning approaches become more serious: the numbers and meanings of class labels may vary a lot among different tasks, so it is hard for a meta-learner to learn how to predict weights for heterogeneous neural networks given the few-shot labeled data; and different tasks are not guaranteed to be even closely related to each other, so there is unlikely to exist a uniform metric suitable for all the tasks from different domains or data sources.

We propose a hybrid approach which takes the benefits of both metric-based and optimization-based approaches. The model consists of two main learning modules. The meta-learner, which operates across tasks, uses an optimization-based meta-model to discover good initial parameters and gradient-descent updates for the task-specific base learners. The base learner exploits a metric-based model and parameterizes the task metrics using the weight prediction from the optimization-based meta-learner. Metric-based models are essentially non-parametric models, so they are not sensitive to the number of classes. Thus the proposed model is able to handle unbalanced classes in the meta-train and meta-test sets thanks to the use of metric-based learners, and to generate task-specific metrics by leveraging the weight prediction of the meta-learner given task instances, so that the metrics adapt better to different tasks. Our model uses a metric-learning-based model to perform classification but learns it via training an optimization-based meta-model, so we call it Meta-Metric-Learner. In this work, we propose two types of Meta-Metric-Learner, i.e. Meta Matching Network (MMN) and Meta Prototypical Network (MPN), which use the same model as the meta-learner but two different metric-based models (Matching Network and Prototypical Network respectively) as the base learner.

We make the following contributions: 1) we design the meta-metric-learning method that is able to learn task-specific metrics via training a meta-learner; 2) we propose training methods for the meta-metric-learner in both single-source and multi-source settings; 3) we evaluate our model on several benchmark datasets against various baselines. These contributions make our approach better suited to the few-shot learning problem. We test our approach in both the classic “$k$-shot $N$-way” few-shot learning setting following previous work and a new but more realistic few-shot setting with flexible label numbers using single-source and multi-source training data. Experiments show that our approach attains superior performance in all of the settings.

This work is an extension of [11] in several ways. First, we use different kinds of models in both of the two component learning modules in our approach. Specifically, we use Meta-SGD as the meta-learner to predict the parameters of metric-based learners, whereas an LSTM-based model is used in [11]. Both Matching Networks and Prototypical Networks are selected as metric-based learners in our experiments. Second, we conduct all of the experiments with image datasets rather than text ones, and the experiment settings are different. In addition, a more detailed presentation of our approach and a discussion of the most recent related works are given in this paper.

The rest of the paper is organized as follows. In Section 2, related works and literature associated with few-shot learning are reviewed. Section 3 describes the proposed model in detail. A series of experiments are conducted in Section 4. Finally, the conclusion and future work are discussed in Section 5.

II Background

Few-shot learning [5, 12] aims to learn classifiers for new classes with only a few training examples per class. Bayesian Program Induction [13], which can be seen as the pioneering work of recent years, represents concepts as simple programs that best explain observed examples under a Bayesian criterion and reaches human-level error on few-shot alphabet recognition tasks. This work is a successful instance of meta-learning, which has a long history [14, 15]. The key idea of meta-learning is framing the learning problem at two levels: the lower level is the quick acquisition of knowledge from each separate task presented, and the higher level is accumulating knowledge to learn the similarities and differences across all tasks. Since then, great progress has been made in few-shot learning. Most existing few-shot learning approaches belong to either the metric-based or the optimization-based meta-learning category. For metric-based works, [16] applies the Neural Turing Machine (NTM) [17], a well-known memory-augmented neural network, to the few-shot learning problem and introduces a new attention-based memory accessing method to rapidly assimilate new data used for accurate predictions about new classes. Siamese neural networks rank similarity between inputs [6]. Matching Networks [7] introduce a trainable k-nearest neighbors algorithm to map a small labeled supporting set and an unlabeled example to its label, obviating the need for fine-tuning to adapt to new class types. Prototypical Networks [18] perform classification by computing Euclidean distances to prototype representations of each class. Generative Adversarial Residual Pairwise Networks [19] exploit deep residual modules in the pairwise network and regularize it with an adversarial training strategy. [20] For optimization-based works, a recent approach [21] casts the hand-designed optimization algorithm as a learning problem, and trains an LSTM-based meta-learner to predict model parameters. The LSTM-based meta-learner is then applied to few-shot learning [9] by training over a number of hand-designed few-shot learning tasks. Model-Agnostic Meta-Learning (MAML) [22] explicitly trains the initial parameters of a CNN model such that a small number of gradient update steps with a small amount of training data from a new task will produce good generalization performance on that task. Meta-SGD [23] extends the idea of [22] by learning not just the learner initialization, but also the learner update direction and learning rate. TAML [24] presents an entropy-based approach to avoid a biased meta-learner and improve its generalizability to new tasks. [25] and [26] extend MAML to probabilistic forms.

In addition, Deep Meta-Learning [27] proposes to perform few-shot learning in a high-level representation space rather than the instance space. [28] considers few-shot learning as a supervised message passing task and generalizes several proposed models with graph-based neural networks. Some researchers try to alleviate the scarcity of training data by data augmentation. [29] proposes to augment instance semantic features using a novel auto-encoder network, dual TriNet. [30] proposes to scale the distance metric with a learnable parameter. They also define a dynamic feature extractor with parameters predicted from a task representation and a task embedding network. [31] tries to learn a data-dependent latent generative representation of model parameters and performs gradient-based meta-learning in this low-dimensional latent space to tackle the high-dimensionality problem in optimization-based meta-learning methods.

Our idea is similar to that of [32]. Both works propose to generate a task-specific metric that is better adapted to new few-shot learning tasks. However, [32] uses basic fine-tuning to adjust the metric-based models, while our paper uses a meta-learner, Meta-SGD, to calculate the parameters of the metric-learner. [32] is simple and easy to implement since it does not use additional models to adapt the metric. Our method is more flexible and general. Meta-SGD can be seen as a trainable form of fine-tuning, as the initial parameters and the learning rate are meta-learned rather than set by hand. We may use other, more sophisticated optimization-based meta-learners such as probabilistic MAML [25] [26] and LEO [31], which is one of our future extensions of this work. In this sense, the few-shot learning method in [32] can be seen as a special case of ours.

Among the few-shot learning methods, Matching Networks, Prototypical Networks, MAML, and Meta-SGD are closely related to our method. The remainder of this section provides more details on these few-shot learning approaches.

II-A Few-shot Learning Problem Definition

In the typical machine learning problem setting, a classification task $\mathcal{T}$ contains a supporting dataset (or training dataset) $D_{train}$ to optimize model parameters and a test dataset $D_{test}$ to evaluate model performance. For a $k$-shot, $N$-class classification task, the supporting dataset consists of $k$ labeled samples for each of the $N$ classes, i.e. there are $k \times N$ samples in total in the supporting dataset, and the test dataset contains a number of samples of the same $N$ classes for evaluation. In the few-shot learning setting, $k$ is a very small value (we consider $k$ less than 5 in this work), meaning each supporting dataset in classification tasks will contain few labeled examples.

Recently proposed methods formulate few-shot learning as a meta-learning problem. Under this view, we have different meta-sets for meta-training, meta-validation, and meta-testing ($\mathcal{D}_{\text{meta-train}}$, $\mathcal{D}_{\text{meta-val}}$ and $\mathcal{D}_{\text{meta-test}}$, respectively), each of which contains a certain number of few-shot learning tasks as described above, all drawn from a task distribution $p(\mathcal{T})$. With $\mathcal{D}_{\text{meta-train}}$, we train a meta-learner to generate good task-specific metrics for few-shot learning tasks and evaluate its generalization performance on $\mathcal{D}_{\text{meta-test}}$. $\mathcal{D}_{\text{meta-val}}$ is used to select good hyper-parameters. Note that in this work, tasks in different meta-sets may contain different numbers of classes, following real few-shot learning scenarios.
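As an illustration of the task structure defined above, the following minimal sketch samples one few-shot task (a supporting set and a test set) from a pool of labeled examples. The function name `sample_episode`, the toy data, and the per-class test-set size `n_query` are assumptions made for this example and do not come from the paper's implementation.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Sample one few-shot task: a supporting set and a test set over n_way classes."""
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Draw disjoint supporting and test samples for this class.
        items = rng.sample(data_by_class[cls], k_shot + n_query)
        support += [(x, label) for x in items[:k_shot]]
        query += [(x, label) for x in items[k_shot:]]
    return support, query

# Toy pool (hypothetical): 10 classes with 20 placeholder "images" each.
toy = {f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
support, query = sample_episode(toy, n_way=5, k_shot=2, n_query=15, rng=random.Random(0))
print(len(support), len(query))  # 10 75
```

A meta-set is then simply a collection of such tasks, and tasks in different meta-sets may be sampled with different values of `n_way`.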

II-B Matching Network (MN) and Prototypical Network (PN)

Matching Networks [7] consist of a neural network as embedding function and an augmented memory. The embedding function, $f_\theta$, maps an input $x$ to an $m$-length vector, i.e., $f_\theta(x) \in \mathbb{R}^m$. The augmented memory stores a supporting set $S = \{(x_i, y_i)\}_{i=1}^{|S|}$, where $x_i$ is a supporting instance and $y_i$ is its corresponding one-hot label. The Matching Networks explicitly define a classifier conditioned on the supporting set. For any new data $\hat{x}$, the Matching Networks predict its label via a similarity function $a(\cdot, \cdot)$ between the instance $\hat{x}$ and the supporting set $S$:

$$\hat{y} = P(y \mid \hat{x}, S) = \sum_{i=1}^{|S|} a(\hat{x}, x_i; \theta)\, y_i \qquad (1)$$

Specifically, we define the similarity function to be a softmax distribution given some kind of distance between the testing instance $\hat{x}$ and the supporting instance $x_i$, i.e., $a(\hat{x}, x_i; \theta) = \exp(-d(f_\theta(\hat{x}), f_\theta(x_i))) / \sum_{j=1}^{|S|} \exp(-d(f_\theta(\hat{x}), f_\theta(x_j)))$, where $\theta$ are the parameters of the embedding function $f_\theta$ and $d$ is the distance function. Thus, $\hat{y}$ is a valid distribution over the supporting set’s labels $\{y_i\}_{i=1}^{|S|}$. Here $f_\theta$ is parameterized as a deep convolutional neural network for image tasks, and cosine distance is adopted as the distance function.
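The classifier described above can be sketched in a few lines of plain Python. This is a minimal illustration that assumes the embeddings have already been computed (the 2-dimensional vectors below are made up) and that uses integer class indices instead of one-hot labels; the function names are hypothetical.

```python
import math

def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def matching_predict(query_emb, support_embs, support_labels, n_classes):
    """Softmax over negative distances to the supporting set, summed per class."""
    scores = [-cosine_distance(query_emb, s) for s in support_embs]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    probs = [0.0] * n_classes
    for w, label in zip(weights, support_labels):
        probs[label] += w / total
    return probs

# Made-up embeddings: two supporting samples for each of two classes.
support_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
support_labels = [0, 0, 1, 1]
probs = matching_predict([1.0, 0.2], support_embs, support_labels, n_classes=2)
print(probs)  # class 0 receives the larger probability
```

Because the prediction only sums attention weights per label, the same function works unchanged for any number of classes in the supporting set.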

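For the training of Matching Networks, we first sample a few-shot learning task $\mathcal{T}$ with a supporting set $D_{train}$ and a test set $D_{test}$ from $\mathcal{D}_{\text{meta-train}}$. The objective function to optimize the embedding parameters is to minimize the prediction error of the testing samples given the supporting set as follows:

$$\mathcal{L}(\theta) = -\sum_{(\hat{x}, y) \in D_{test}} \log P(y \mid \hat{x}, D_{train}; \theta) \qquad (2)$$

The parameters of the embedding function, $\theta$, are optimized via stochastic gradient descent methods.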

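Prototypical Networks [18] can be seen as a variation of Matching Networks, which perform classification in a different way from Eq. (1):

$$P(y = c \mid \hat{x}, S) = \frac{\exp(-d(f_\theta(\hat{x}), \mu_c))}{\sum_{c'} \exp(-d(f_\theta(\hat{x}), \mu_{c'}))} \qquad (3)$$

where $f_\theta$ is the embedding function and $d$ is a distance function. Here $\mu_c$ is the mean embedded vector of the supporting samples belonging to class $c$:

$$\mu_c = \frac{1}{|S_c|} \sum_{(x_i, y_i) \in S_c} f_\theta(x_i) \qquad (4)$$

$S_c$ denotes the set of examples labeled with class $c$ in the given supporting set. $\mu_c$ is called the prototype and can be considered as a representation of the class it belongs to. We choose Euclidean distance as the distance function $d$, as it works better for image few-shot learning tasks [18].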
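The prototype-based classifier can be sketched as follows. This is an illustrative example on made-up, precomputed embeddings, using the plain Euclidean distance named in the text; the function names are hypothetical.

```python
import math

def class_prototypes(support_embs, support_labels):
    """Prototype of each class: the mean of its supporting embeddings."""
    sums, counts = {}, {}
    for emb, label in zip(support_embs, support_labels):
        acc = sums.setdefault(label, [0.0] * len(emb))
        for i, value in enumerate(emb):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in sums[c]] for c in sums}

def proto_predict(query_emb, protos):
    """Softmax over negative Euclidean distances to the class prototypes."""
    dists = {c: math.dist(query_emb, mu) for c, mu in protos.items()}
    nearest = min(dists.values())  # shift distances for numerical stability
    weights = {c: math.exp(-(d - nearest)) for c, d in dists.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

# Made-up embeddings: two supporting samples for each of two classes.
protos = class_prototypes([[0.0, 0.0], [2.0, 0.0], [0.0, 4.0], [2.0, 4.0]], [0, 0, 1, 1])
proto_probs = proto_predict([1.0, 1.0], protos)
print(protos)       # {0: [1.0, 0.0], 1: [1.0, 4.0]}
print(proto_probs)  # class 0 is closer, so it receives the larger probability
```

As with Matching Networks, nothing in the prediction step fixes the number of classes, which is the property the proposed model relies on.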

II-C MAML and Meta-SGD

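MAML [22] does not use an explicit learnable model to update the learner’s weights, unlike [21] and [9]. It is simply based on gradient descent. The key idea is to train the learner’s initial parameters such that the learner has maximal performance on a new few-shot learning task after the parameters have been updated through one or more gradient steps computed with a small amount of supporting samples from that new task. Assume that the learner can be represented by a parametrized function $f_\theta$ with initial parameters $\theta$. Given a new task $\mathcal{T}_i$ with a supporting set $D^{i}_{train}$ and a testing set $D^{i}_{test}$, the initial parameters $\theta$ are updated to $\theta'_i$ using one or more gradient descent steps calculated on $D^{i}_{train}$ to adapt to the new task. Take one update step as an example:

$$\theta'_i = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta; D^{i}_{train}) \qquad (5)$$

where $\alpha$ is the step size. The initial parameters $\theta$ are trained so that the learners with updated parameters $\theta'_i$ will have maximal performance across several new tasks sampled from $p(\mathcal{T})$. The meta-training objective is as follows:

$$\min_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta'_i}; D^{i}_{test}) = \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}\big(f_{\theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta; D^{i}_{train})}; D^{i}_{test}\big) \qquad (6)$$

Training is implemented using SGD as follows:

$$\theta \leftarrow \theta - \beta \nabla_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta'_i}; D^{i}_{test}) \qquad (7)$$

where $\beta$ is the meta step size.

Meta-SGD [23] extends the idea of MAML slightly. It vectorizes the step size $\alpha$ in MAML so that it has the same dimension as the learner’s parameters $\theta$ and makes it trainable as well. So Meta-SGD learns not only the initial parameters but also the update direction and the update rate. The meta-training objective is:

$$\min_{\theta, \alpha} \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}\big(f_{\theta - \alpha \circ \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta; D^{i}_{train})}; D^{i}_{test}\big) \qquad (8)$$

where $\circ$ denotes the element-wise product. In the above, we only consider one gradient step, but it is a straightforward extension to use multiple steps in experiments.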
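To make the inner-loop update concrete, here is a minimal sketch of the task-adaptation step with a per-parameter step size, as in Meta-SGD. It uses a toy linear regression learner with a hand-derived gradient instead of a CNN, and it only shows the adaptation step: the outer meta-update of the initial parameters and step sizes is not implemented. All names and numbers are assumptions for illustration.

```python
def mse_grad(theta, data):
    """Gradient of the mean squared error of the linear model y = theta[0] * x + theta[1]."""
    g_w = g_b = 0.0
    for x, y in data:
        err = theta[0] * x + theta[1] - y
        g_w += 2.0 * err * x / len(data)
        g_b += 2.0 * err / len(data)
    return [g_w, g_b]

def meta_sgd_adapt(theta, alpha, data, steps=1):
    """Inner-loop update theta <- theta - alpha * grad, with one step size per parameter."""
    for _ in range(steps):
        grad = mse_grad(theta, data)
        theta = [t - a * g for t, a, g in zip(theta, alpha, grad)]
    return theta

theta0 = [0.0, 0.0]                 # initial parameters (meta-learned in Meta-SGD)
alpha = [0.1, 0.05]                 # per-parameter step sizes (also meta-learned)
train = [(1.0, 2.0), (2.0, 4.0)]    # toy supporting set of (x, y) pairs
adapted = meta_sgd_adapt(theta0, alpha, train)
print(adapted)  # approximately [1.0, 0.3]
```

Setting every entry of `alpha` to the same fixed scalar recovers the MAML-style update of a single shared step size.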

Figure 1: Illustration of the training procedure for our model. ($x_{train}$, $y_{train}$) are samples of the supporting set (or training set) $D_{train}$ and ($x_{test}$, $y_{test}$) are samples of the testing set $D_{test}$ in a few-shot learning task. ($x'$, $y'$) and ($x''$, $y''$) are samples of two subsets of $D_{train}$ respectively, used for the forward pass. The blue rounded rectangles indicate feature extractors parameterized as deep convolutional neural networks (CNN) with parameters $\theta$, and $f_\theta(x)$ is the output feature. The green rounded rectangles are metric-based classifiers, which can be k-nearest neighbor classifiers or prototype-based classifiers in our model. $\hat{y}$ is the prediction of the classifier. CE means cross-entropy loss. The red rectangles are Meta-SGD modules which take the gradient as input to update the model parameters. The black arrows indicate the forward pass and the blue dashed arrows indicate the backward pass. We use ($x'$, $y'$), ($x''$, $y''$) and the Meta-SGD module to update the parameters of the feature extractor for $T$ steps and then evaluate on ($x_{test}$, $y_{test}$) of $D_{test}$. The Meta-SGD module is trained to minimize the cross-entropy loss between $\hat{y}_{test}$ and $y_{test}$ (should be viewed in color).

III Meta-Metric-Learner for Few-Shot Learning

In this section, we provide the details of our Meta-Metric-Learner and its training objective. We first describe the Meta-Metric-Learner in a single-source setting. After that, we show it is easy to generalize the model in a multi-source learning setting, which relates to retrieving auxiliary sets from other sources/tasks.


Following the setting in previous few-shot learning works, we construct three meta-datasets, i.e. $\mathcal{D}_{\text{meta-train}}$, $\mathcal{D}_{\text{meta-val}}$ and $\mathcal{D}_{\text{meta-test}}$. Each meta-dataset consists of a number of few-shot learning tasks. In these previous works, the few-shot learning tasks in the three meta-datasets all have the same number of classes. In our experiment setting, the number of classes in $\mathcal{D}_{\text{meta-val}}$ is the same as that in $\mathcal{D}_{\text{meta-test}}$ but could be different from that in $\mathcal{D}_{\text{meta-train}}$, taking real scenarios into consideration. Although the CNN base learner used in [9], [22] and [23] is powerful for modeling image data, it cannot handle unbalanced classes in train and test datasets in a straightforward way. On the other hand, metric-learning-based models, as trainable non-parametric algorithms by nature, generalize easily to new datasets that contain samples from different numbers of classes. Hence, we apply the Meta-SGD of [23] to few-shot learning tasks, but replace the CNN with a metric-learning-based classification model as the base learner, so that it can tackle class-unbalanced few-shot learning problems. Here, we propose two types of Meta-Metric-Learner, i.e. Meta Matching Network (MMN) and Meta Prototypical Network (MPN), which use two different kinds of metric-based models (Matching Network and Prototypical Network respectively) as the base learner. Suppose we have a Meta-Metric-Learner containing a metric-learner $M_\theta$ with parameters $\theta$ as the base learner, and the initial parameters are $\theta_0$. The Meta-SGD model can be denoted as $(\theta_0, \alpha)$, where $\alpha$ is the update step size.

Meta-Metric-Learner of Single-Source Form

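We use the meta-training set $\mathcal{D}_{\text{meta-train}}$ to train our Meta-Metric-Learner. Specifically, we first sample a few-shot learning task $\mathcal{T}$ from $\mathcal{D}_{\text{meta-train}}$, which contains a supporting dataset $D_{train}$ and a testing dataset $D_{test}$, with all sample labels known. At gradient step $t$, the base metric-learner with parameters $\theta_{t-1}$ takes $D_{train}$ as input to calculate the classification loss $\mathcal{L}_t$ and its gradient w.r.t. $\theta_{t-1}$. However, the original optimization-based meta-learning approach cannot be applied directly in our model because our base learner is metric-based. A metric-learner predicts labels of query samples by exploiting their similarity with labeled supporting samples, i.e. the metric-learner itself needs two separate datasets for the learning procedure. To tackle this problem, we propose to divide $D_{train}$ into two subsets, denoted as $D'_{train}$ and $D''_{train}$, and use them as query set and supporting set respectively for the metric-learner. Thus, the classification loss can be expressed as:

$$\mathcal{L}_t = -\sum_{(x', y') \in D'_{train}} \log P(y' \mid x', D''_{train}; \theta_{t-1}) \qquad (9)$$

($x'$, $y'$) and ($x''$, $y''$) are samples of the two subsets $D'_{train}$ and $D''_{train}$ respectively. Then the Meta-SGD updates the metric learner’s parameters using the basic gradient-descent method:

$$\theta_t = \theta_{t-1} - \alpha \circ \nabla_{\theta_{t-1}} \mathcal{L}_t \qquad (10)$$

After this procedure is repeated for $T$ steps, Meta-SGD updates the base metric learner parameters to $\theta_T$. We make predictions about samples in $D_{test}$ with the updated metric learner and supporting set $D_{train}$ and get the evaluation loss:

$$\mathcal{L}_{test} = -\sum_{(x_{test}, y_{test}) \in D_{test}} \log P(y_{test} \mid x_{test}, D_{train}; \theta_T) \qquad (11)$$

The evaluation loss across all few-shot learning tasks from $\mathcal{D}_{\text{meta-train}}$ is minimized to optimize the parameters of the Meta-SGD:

$$\min_{\theta_0, \alpha} \sum_{\mathcal{T} \in \mathcal{D}_{\text{meta-train}}} \mathcal{L}_{test}(\mathcal{T}) \qquad (12)$$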
The overall architecture of our Meta-Metric-Learner is shown in Figure 1 and the meta-training procedure is given in Algorithm 1.

Note that each of the two subsets, $D'_{train}$ and $D''_{train}$, needs to contain different samples of every class in the few-shot learning task. This means that for each class we need more than one labeled sample from $D_{train}$, i.e. the method can only directly handle $k$-shot problems with $k \geq 2$.
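The class-wise split of the supporting set can be sketched as below. This is an assumed implementation of the "equal split" described in the text, written for illustration only; the function name and the toy data are hypothetical, and the check on the class size mirrors the requirement of more than one labeled sample per class.

```python
import random
from collections import defaultdict

def split_support(support, rng):
    """Split a supporting set into a query half and a supporting half, class by class."""
    by_class = defaultdict(list)
    for x, y in support:
        by_class[y].append((x, y))
    query_half, support_half = [], []
    for label in sorted(by_class):
        items = by_class[label][:]
        if len(items) < 2:
            # Each half must contain every class, so one-shot tasks cannot be split.
            raise ValueError("each class needs at least 2 samples")
        rng.shuffle(items)
        mid = len(items) // 2
        query_half += items[:mid]
        support_half += items[mid:]
    return query_half, support_half

# Toy 4-shot, 3-class supporting set with placeholder samples.
toy_support = [(f"img_{c}_{i}", c) for c in range(3) for i in range(4)]
query_half, support_half = split_support(toy_support, random.Random(0))
print(len(query_half), len(support_half))  # 6 6
```

With an odd number of samples per class, this sketch puts the extra sample in the supporting half; the text does not say how such a case is handled.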

Multi-Source Form with Auxiliary Data

In order to directly handle one-shot learning problems in our model, we propose to borrow instances from other data sources to augment the original meta-training dataset. This essentially extends our method to a multi-source learning process. Specifically, we construct an auxiliary meta-set $\mathcal{D}_{aux}$ which contains a number of learning tasks and use it to calculate the classification loss and update the parameters of the base metric learner. Given a one-shot learning task $\mathcal{T}$ sampled from $\mathcal{D}_{\text{meta-train}}$, we randomly choose an auxiliary learning task $\mathcal{T}_{aux}$ from $\mathcal{D}_{aux}$. $\mathcal{T}_{aux}$ consists of a supporting set $D^{aux}_{train}$ and a testing set $D^{aux}_{test}$. The classification loss now is different from Eq. (9):

$$\mathcal{L}_t = -\sum_{(x', y') \in D^{aux}_{test}} \log P(y' \mid x', D^{aux}_{train}; \theta_{t-1}) \qquad (13)$$

The update of the metric-learner’s parameters and the training of the meta-model are similar to the single-source setting described above. Algorithm 2 shows the details of multi-source meta-metric-learning. In fact, this idea is partly motivated by transfer learning, where data in a source domain are used to acquire knowledge to facilitate learning in a related target domain. Here we apply the idea of transfer learning in the meta-learning setting. The meta-learner is trained to extract cross-source knowledge from the auxiliary meta-set that is transferable to the target meta-set.

Next, we discuss how to construct the auxiliary set $\mathcal{D}_{aux}$. The intuition is that, in many real-world applications, we can get few-shot learning data from multiple data sources, such as images of hand-written symbols from different alphabets. Such data from multiple sources could increase the training data for our few-shot learning method. However, there is rarely a guarantee that these data sources are related to each other. When the auxiliary data are from an unrelated source, it is difficult for few-shot learning methods to learn a good metric or a good meta-learner. In this case, adding more significantly unrelated auxiliary data may decrease performance. To overcome this difficulty, given a target data source for few-shot learning, we use the following approach to select related data sources, similar to [33].

Consider a list of data sources (such as a list of alphabets in hand-writing recognition) $\{S_1, S_2, \dots, S_K\}$. From each data source $S_i$ we can sample a meta-dataset $\mathcal{D}_i$ containing a number of few-shot learning tasks. Because datasets in few-shot learning tasks are too small to reflect any statistical relatedness among them, our approach deals with the problem at the task level with the following steps: (1) For each data source $S_i$, we use the sampled meta-dataset $\mathcal{D}_i$ to train a metric learner $M_i$ on it. (2) For the target data source $S_t$, we also sample a group of tasks and apply each model $M_i$ to get the classification accuracy $acc_i$. Note that the accuracy scores are usually low, but their relative magnitudes can reflect the relatedness between the different sources and $S_t$. (3) Finally, we select the top $m$ sources with the highest scores to construct the auxiliary set $\mathcal{D}_{aux}$.
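Step (3) of the selection procedure amounts to ranking sources by the accuracy their metric learners reach on tasks from the target source. The sketch below assumes those accuracies have already been measured; the source names and numbers are invented for illustration and are not results from the paper.

```python
def select_auxiliary_sources(accuracy_by_source, top_m):
    """Return the top_m source names ranked by accuracy on the target source's tasks."""
    ranked = sorted(accuracy_by_source.items(), key=lambda item: item[1], reverse=True)
    return [source for source, _ in ranked[:top_m]]

# Hypothetical accuracies of per-source metric learners evaluated on target tasks.
accuracies = {"alphabet_A": 0.41, "alphabet_B": 0.28, "alphabet_C": 0.55, "alphabet_D": 0.33}
selected = select_auxiliary_sources(accuracies, top_m=2)
print(selected)  # ['alphabet_C', 'alphabet_A']
```

Only the relative order of the scores matters here, which matches the remark that the absolute accuracies are usually low.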

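Input: Meta-train set $\mathcal{D}_{\text{meta-train}}$, Metric learner $M$ with parameters $\theta$, Meta-Learner Meta-SGD with parameters $(\theta_0, \alpha)$
$\theta_0, \alpha \leftarrow$ random initialization
while not done do
   for all $\mathcal{T}$ in $\mathcal{D}_{\text{meta-train}}$ do
      $D_{train}, D_{test} \leftarrow$ supporting dataset, testing dataset of task $\mathcal{T}$
      $D'_{train}, D''_{train} \leftarrow$ equally split $D_{train}$ into two subsets
      for $t = 1, \dots, T$ do
         $(x', y') \leftarrow$ sampled from $D'_{train}$
         $(x'', y'') \leftarrow$ sampled from $D''_{train}$
         $\mathcal{L}_t \leftarrow$ classification loss of $(x', y')$ given $(x'', y'')$ under $\theta_{t-1}$
         $\theta_t \leftarrow \theta_{t-1} - \alpha \circ \nabla_{\theta_{t-1}} \mathcal{L}_t$
      end for
      $(x_{train}, y_{train}) \leftarrow$ all samples from $D_{train}$
      $(x_{test}, y_{test}) \leftarrow$ all samples from $D_{test}$
      $\mathcal{L}_{test} \leftarrow$ evaluation loss of $(x_{test}, y_{test})$ given $(x_{train}, y_{train})$ under $\theta_T$
   end for
   Updating $\theta_0$ and $\alpha$ to minimize $\sum_{\mathcal{T}} \mathcal{L}_{test}$
end while
Algorithm 1 Meta-Metric-Learner Meta-Training in Single-Source Setting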
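Input: Meta-train set $\mathcal{D}_{\text{meta-train}}$, auxiliary dataset $\mathcal{D}_{aux}$, Metric learner $M$ with parameters $\theta$, Meta-Learner Meta-SGD with parameters $(\theta_0, \alpha)$
$\theta_0, \alpha \leftarrow$ random initialization
while not done do
   for all $\mathcal{T}$ in $\mathcal{D}_{\text{meta-train}}$ do
      $D_{train}, D_{test} \leftarrow$ supporting dataset, testing dataset of task $\mathcal{T}$
      $D^{aux}_{train}, D^{aux}_{test} \leftarrow$ task sampled from $\mathcal{D}_{aux}$
      for $t = 1, \dots, T$ do
         $(x', y') \leftarrow$ sampled from $D^{aux}_{test}$
         $(x'', y'') \leftarrow$ sampled from $D^{aux}_{train}$
         $\mathcal{L}_t \leftarrow$ classification loss of $(x', y')$ given $(x'', y'')$ under $\theta_{t-1}$
         $\theta_t \leftarrow \theta_{t-1} - \alpha \circ \nabla_{\theta_{t-1}} \mathcal{L}_t$
      end for
      $(x_{train}, y_{train}) \leftarrow$ all samples from $D_{train}$
      $(x_{test}, y_{test}) \leftarrow$ all samples from $D_{test}$
      $\mathcal{L}_{test} \leftarrow$ evaluation loss of $(x_{test}, y_{test})$ given $(x_{train}, y_{train})$ under $\theta_T$
   end for
   Updating $\theta_0$ and $\alpha$ to minimize $\sum_{\mathcal{T}} \mathcal{L}_{test}$
end while
Algorithm 2 Meta-Metric-Learner Meta-Training in Multi-Source Setting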

IV Experiments and Results

In this section, we conduct experiments with $k$-shot learning in both single-source and multi-source settings. The experiments are conducted on five popular image datasets, comparing Meta-Metric-Learner against several baselines. We first describe the datasets, experimental settings, and baseline models.


The five datasets are MiniImagenet, Caltech-256, Cifar-100, Cub-200, and Omniglot. We use the first four datasets in the single-source setting and Omniglot in the multi-source setting.

1) MiniImagenet: The MiniImagenet dataset, first used in [7], consists of 60,000 color images of 100 classes, with 600 images per class. For our experiments, we use the same splits as [9] to enable the comparison with previous methods. Their splits use a different set of 100 classes, which are divided into three disjoint subsets: 64 classes for meta-training, 16 classes for meta-validation, and 20 classes for meta-testing.

2) Caltech-256: The Caltech-256 dataset [34] is a successor to the well-known Caltech-101 dataset. It contains a total of 30,607 color images of 256 classes. We split it into three subsets of 150, 56, and 50 classes for meta-training, meta-validation, and meta-testing, respectively, following [27].

3) Cifar-100: The CIFAR-100 dataset [35] contains 60,000 color images of size 32x32 in 100 fine-grained categories, which are grouped into 20 coarse-level categories. We use 64, 16, and 20 fine-grained categories for meta-training, meta-validation, and meta-testing, respectively.

4) Cub-200: The CUB-200 dataset [36] contains 11,788 color images of 200 different bird species. We use 140 classes for meta-training, 20 classes for meta-validation, and test on the remaining 40 classes. In this fine-grained dataset, the differences between images of very similar classes are usually so subtle that they can hardly be recognized even by humans.

5) Omniglot: The data comes with a standard split of 30 training alphabets with 964 classes and 20 evaluation alphabets with 659 classes. Each of these was hand drawn by 20 different people. Each data source corresponds to an alphabet here.

For Cifar-100, we use the images of the original size, i.e. 32x32. For Omniglot, we resize the images to the size of 28x28. For the other three datasets, we resize the images to 84x84.
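The class splits and input sizes listed above can be gathered into a small lookup table, which also allows a quick consistency check that each split covers the whole dataset. The dictionary names below are illustrative and not taken from any released code; the numbers are the ones stated in the text.

```python
# Class splits (meta-train, meta-validation, meta-test) as stated in the text.
CLASS_SPLITS = {
    "MiniImagenet": (64, 16, 20),
    "Caltech-256": (150, 56, 50),
    "Cifar-100": (64, 16, 20),
    "Cub-200": (140, 20, 40),
}
# Total number of classes per dataset, also as stated in the text.
TOTAL_CLASSES = {"MiniImagenet": 100, "Caltech-256": 256, "Cifar-100": 100, "Cub-200": 200}
# Side length (in pixels) of the square input images used for each dataset.
INPUT_SIZE = {"MiniImagenet": 84, "Caltech-256": 84, "Cifar-100": 32, "Cub-200": 84, "Omniglot": 28}

def check_splits():
    """Check that the three class subsets of each dataset add up to its class count."""
    return {name: sum(split) == TOTAL_CLASSES[name] for name, split in CLASS_SPLITS.items()}

print(check_splits())  # every dataset maps to True
```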

| Model | Model Type | 5 class, 2-shot | 5 class, 4-shot | 5 vs. 3, 2-shot | 5 vs. 3, 4-shot | 3 vs. 5, 2-shot | 3 vs. 5, 4-shot |
|---|---|---|---|---|---|---|---|
| Meta-SGD | - | 49.89 ± 0.73 | 56.28 ± 0.68 | - | - | - | - |
| Matching Network | Basic (No FCE) | 53.16 ± 0.69 | 59.66 ± 0.69 | 66.78 ± 1.07 | 72.53 ± 0.84 | 50.27 ± 0.69 | 57.76 ± 0.66 |
| Matching Network fine-tune | Basic (No FCE) | 53.92 ± 0.75 | 60.82 ± 0.66 | 69.21 ± 0.91 | 74.1 ± 0.84 | 50.71 ± 0.73 | 58.45 ± 0.65 |
| Meta Matching Network | Basic (No FCE) | 53.92 ± 0.68 | 60.94 ± 0.67 | 67.14 ± 0.97 | 73.91 ± 0.83 | 50.99 ± 0.68 | 59.31 ± 0.65 |
| Prototypical Network | Euclid. | 50.89 ± 0.75 | 57.87 ± 0.70 | 64.56 ± 0.97 | 71.76 ± 0.89 | 48.02 ± 0.76 | 55.57 ± 0.72 |
| Prototypical Network fine-tune | Euclid. | 52.47 ± 0.71 | 60.68 ± 0.83 | 66.48 ± 0.98 | 72.52 ± 0.85 | 49.51 ± 0.71 | 57.64 ± 0.70 |
| Meta Prototypical Network | Euclid. | 51.95 ± 0.68 | 58.44 ± 0.71 | 65.61 ± 0.96 | 72.47 ± 0.84 | 49.16 ± 0.72 | 56.31 ± 0.71 |

Table I: Average classification accuracies on miniImageNet with different approaches in single-source form.

| Model | Model Type | 5 class, 2-shot | 5 class, 4-shot | 5 vs. 3, 2-shot | 5 vs. 3, 4-shot | 3 vs. 5, 2-shot | 3 vs. 5, 4-shot |
|---|---|---|---|---|---|---|---|
| Meta-SGD | - | 58.88 ± 0.92 | 65.47 ± 0.83 | - | - | - | - |
| Matching Network | Basic (No FCE) | 61.83 ± 0.89 | 67.43 ± 0.83 | 73.88 ± 1.13 | 77.69 ± 0.99 | 57.61 ± 0.89 | 64.06 ± 0.84 |
| Matching Network fine-tune | Basic (No FCE) | 62.38 ± 0.91 | 67.78 ± 0.79 | 75.97 ± 1.05 | 79.63 ± 0.93 | 59.59 ± 0.89 | 66.42 ± 0.81 |
| Meta Matching Network | Basic (No FCE) | 63.07 ± 0.87 | 68.55 ± 0.83 | 74.77 ± 1.11 | 79.37 ± 0.95 | 59.09 ± 0.93 | 65.91 ± 0.82 |
| Prototypical Network | Euclid. | 56.89 ± 0.86 | 65.17 ± 0.85 | 70.65 ± 1.14 | 75.13 ± 1.02 | 55.48 ± 0.95 | 62.94 ± 0.85 |
| Prototypical Network fine-tune | Euclid. | 59.25 ± 0.88 | 65.01 ± 0.81 | 72.85 ± 1.11 | 77.17 ± 0.91 | 57.70 ± 0.88 | 64.22 ± 0.82 |
| Meta Prototypical Network | Euclid. | 59.21 ± 0.91 | 66.23 ± 0.81 | 70.94 ± 1.14 | 76.29 ± 0.96 | 55.11 ± 0.93 | 62.74 ± 0.83 |

Table II: Average classification accuracies on cifar-100 with different approaches in single-source form.

| Model | Model Type | 5 class, 2-shot | 5 class, 4-shot | 5 vs. 3, 2-shot | 5 vs. 3, 4-shot | 3 vs. 5, 2-shot | 3 vs. 5, 4-shot |
|---|---|---|---|---|---|---|---|
| Meta-SGD | - | 58.60 ± 0.81 | 66.49 ± 0.72 | - | - | - | - |
| Matching Network | Basic (No FCE) | 62.40 ± 0.83 | 68.63 ± 0.72 | 73.01 ± 0.93 | 78.93 ± 0.83 | 59.49 ± 0.82 | 66.99 ± 0.73 |
| Matching Network fine-tune | Basic (No FCE) | 63.73 ± 0.82 | 69.42 ± 0.70 | 75.50 ± 0.90 | 80.65 ± 0.82 | 61.05 ± 0.83 | 67.56 ± 0.71 |
| Meta Matching Network | Basic (No FCE) | 63.32 ± 0.84 | 70.13 ± 0.72 | 74.14 ± 0.95 | 80.13 ± 0.82 | 61.06 ± 0.80 | 67.65 ± 0.71 |
| Prototypical Network | Euclid. | 58.93 ± 0.83 | 67.90 ± 0.73 | 70.45 ± 1.02 | 76.99 ± 0.84 | 56.52 ± 0.84 | 64.29 ± 0.77 |
| Prototypical Network fine-tune | Euclid. | 59.62 ± 0.80 | 68.92 ± 0.71 | 73.14 ± 0.99 | 79.42 ± 0.82 | 59.78 ± 0.82 | 65.32 ± 0.75 |
| Meta Prototypical Network | Euclid. | 60.28 ± 0.81 | 68.99 ± 0.71 | 71.10 ± 1.00 | 78.77 ± 0.84 | 57.82 ± 0.84 | 65.99 ± 0.76 |

Table III: Average classification accuracies on caltech-256 with different approaches in single-source form.

| Model | Model Type | 5 class, 2-shot | 5 class, 4-shot | 5 vs. 3, 2-shot | 5 vs. 3, 4-shot | 3 vs. 5, 2-shot | 3 vs. 5, 4-shot |
|---|---|---|---|---|---|---|---|
| Meta-SGD | - | 57.18 ± 0.81 | 62.55 ± 0.76 | - | - | - | - |
| Matching Network | Basic (No FCE) | 56.92 ± 0.81 | 61.99 ± 0.77 | 70.90 ± 1.06 | 75.43 ± 0.98 | 56.18 ± 0.82 | 61.55 ± 0.77 |
| Matching Network fine-tune | Basic (No FCE) | 58.73 ± 0.78 | 64.14 ± 0.72 | 71.69 ± 0.98 | 76.37 ± 0.86 | 57.70 ± 0.81 | 62.45 ± 0.73 |
| Meta Matching Network | Basic (No FCE) | 58.14 ± 0.82 | 62.98 ± 0.74 | 70.95 ± 1.02 | 75.38 ± 0.92 | 56.50 ± 0.79 | 61.63 ± 0.74 |
| Prototypical Network | Euclid. | 55.77 ± 0.86 | 63.13 ± 0.75 | 69.66 ± 1.05 | 76.05 ± 0.96 | 54.20 ± 0.83 | 60.30 ± 0.74 |
| Prototypical Network fine-tune | Euclid. | 57.21 ± 0.81 | 64.01 ± 0.77 | 72.47 ± 0.99 | 76.74 ± 0.89 | 54.34 ± 0.80 | 62.99 ± 0.73 |
| Meta Prototypical Network | Euclid. | 55.58 ± 0.83 | 63.49 ± 0.73 | 68.73 ± 1.05 | 77.04 ± 0.90 | 54.50 ± 0.81 | 60.91 ± 0.72 |

Table IV: Average classification accuracies on cub-200 with different approaches in single-source form.

Baseline Models

There are three baseline models in our experiments: Matching Network, Prototypical Network, and Meta-SGD with CNN as the base model. For Matching Network and Prototypical Network, we implement our own versions. We only implement Matching Network without fully-conditional embedding (FCE). We choose Euclidean distance in Prototypical Network as is suggested in [18]. For Meta-SGD, we extend the version of [22] to support all of the four datasets.
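As a rough illustration of the Euclidean-distance Prototypical Network baseline, the sketch below averages the support embeddings of each class into a prototype and classifies a query by a softmax over negative squared distances. It assumes the embeddings have already been produced by the CNN; the function names and the toy two-dimensional vectors are hypothetical and are not taken from the authors' implementation.

```python
import math

def prototypes(support):
    """Average the embeddings of each class to obtain one prototype per class.

    support: dict mapping a class label to a list of embedding vectors.
    """
    protos = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        protos[label] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return protos

def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(query, protos):
    """Return class probabilities: softmax over negative squared distances."""
    labels = list(protos)
    logits = [-squared_euclidean(query, protos[label]) for label in labels]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}

# Toy 2-way 2-shot task with hypothetical, precomputed 2-D embeddings.
support = {"cat": [[0.0, 0.0], [0.0, 2.0]], "dog": [[4.0, 4.0], [6.0, 4.0]]}
protos = prototypes(support)
probs = classify([1.0, 1.0], protos)
prediction = max(probs, key=probs.get)
```

Because the prototypes are computed per task, the same code handles any number of classes, which is the property the flexible-class-number experiments rely on.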

CNN architectures

The CNN architecture in [7] and [18, 37] is used, which consists of 4 modules, each with a 3×3 convolution with 64 filters followed by batch normalization, a ReLU non-linearity, and 2×2 max-pooling. In [7] and [18], dropout is not used. Here we use dropout with a small rate of 0.1 in our Meta-Metric-Learner to reduce over-fitting. For all models, the loss function is the classification cross-entropy between the predicted and true classes.
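To make the size of the resulting embedding concrete, the following sketch tracks the spatial resolution through the four modules. It assumes 3×3 convolutions with stride 1 and padding 1, followed by non-overlapping 2×2 max-pooling with floor rounding, as in the commonly used four-module embedding network; the input resolutions are hypothetical, since the section does not state the image sizes used for each dataset.

```python
def conv_module_output(size, kernel=3, padding=1, pool=2):
    """Spatial size after one module: conv (stride 1) -> BN -> ReLU -> max-pool.

    Batch normalization and ReLU do not change the spatial size.
    """
    after_conv = size + 2 * padding - kernel + 1  # stride-1 convolution
    return after_conv // pool                     # non-overlapping pooling, floor

def embedding_dim(input_size, modules=4, filters=64):
    """Length of the flattened embedding for a square input image."""
    size = input_size
    for _ in range(modules):
        size = conv_module_output(size)
    return filters * size * size

# Hypothetical input resolutions (assumptions, not taken from the paper).
dim_32 = embedding_dim(32)  # 32 -> 16 -> 8 -> 4 -> 2
dim_84 = embedding_dim(84)  # 84 -> 42 -> 21 -> 10 -> 5
```

Under these assumptions, the embedding length depends only on the input resolution and not on the number of classes in a task.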


Our Meta-Metric-Learner and the baseline models require several hyper-parameters, including the dropout rate, the learning rate of the meta-learner, and the number of gradient steps. All of them are tuned on the meta-validation set.

IV-A Experiments in Single-Source Form

To demonstrate the effectiveness of our Meta-Metric-Learner, we first run experiments for a single task/source on all four image datasets, in which no auxiliary set is available from other tasks. Thus we need to perform k-shot learning with k ≥ 2: for each class, we split its samples equally into two parts, which are used as query samples and supporting samples, respectively, to calculate the gradient of the base learner. We test our approach in the classic "k-shot N-way" few-shot learning setting following previous works and in a new, more realistic few-shot setting with flexible class numbers. We randomly construct 800 few-shot learning tasks for meta-training, 600 tasks for meta-validation, and 600 tasks for meta-testing (performance evaluation).

For the classic "k-shot N-way" setting, each task contains images of 5 different classes, each with 2 or 4 samples in the supporting set and 15 samples in the testing set. Because the numbers of classes in the meta-training and meta-testing tasks are the same, the original Meta-SGD method can be employed. So in this setting, our baseline models are Meta-SGD, Matching Network, and Prototypical Network.

For the second setting, we test our model in two modes: 1) 5 classes for meta-training and 3 classes for meta-testing, and 2) 3 classes for meta-training and 5 classes for meta-testing, similarly with 2 or 4 samples in the supporting set and 15 samples in the testing set from each class. Note that 2) is the more challenging mode, since the number of classes in the meta-training tasks is smaller than that in the meta-testing tasks. The Meta-SGD method cannot be applied directly in this setting, so our baseline models are only Matching Network and Prototypical Network. For all of the settings, tasks in the meta-validation set have the same number of classes as those in the meta-testing set. We use the meta-validation set to adjust the hyper-parameters. For a fair comparison, all the baselines are trained with 2 or 4 samples per class according to their own recipes.

We test two kinds of Meta-Metric-Learner, i.e., the Meta Matching Network (MMN) model and the Meta Prototypical Network (MPN) model. Since the idea of [32] is quite similar to ours, we also implement and test their model in our own setting, named Matching Network fine-tune and Prototypical Network fine-tune in our experiments. To fine-tune the metric model, we split the supporting set equally, in the same way as in our MMN and MPN models.
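The task construction described in this subsection can be sketched as follows: sample a number of classes, draw the supporting and testing samples for each class, and split the supporting samples equally into the two halves used to compute the base learner's gradient. This is a minimal sketch under stated assumptions; the function name, the dictionary keys, and the toy data are hypothetical, and which half plays which role is an arbitrary choice here.

```python
import random

def sample_task(class_to_images, n_way, k_shot, n_test=15, rng=random):
    """Draw one few-shot task with n_way classes.

    Each class gets k_shot supporting samples and n_test testing samples.
    The supporting samples are split equally into two halves, one used as
    supporting samples and one as query samples for the gradient computation.
    """
    classes = rng.sample(sorted(class_to_images), n_way)
    task = {}
    for c in classes:
        images = rng.sample(class_to_images[c], k_shot + n_test)
        support, test = images[:k_shot], images[k_shot:]
        half = k_shot // 2
        task[c] = {
            "inner_support": support[:half],
            "inner_query": support[half:],
            "test": test,
        }
    return task

# Toy data: 10 hypothetical classes with 30 image identifiers each.
rng = random.Random(0)
data = {f"class_{i}": [f"img_{i}_{j}" for j in range(30)] for i in range(10)}
task = sample_task(data, n_way=5, k_shot=4, rng=rng)
```

Calling the same function with `n_way=5` for meta-training and `n_way=3` for meta-testing (or the reverse) would produce tasks for the flexible-class-number modes.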

In our experiments, we find that increasing the number of gradient steps within a certain range can improve model performance. However, the gain diminishes when the number of gradient steps is set too large, and a large number of gradient steps also increases the computational cost of our model. We set the number of gradient steps for training and testing to 5 for the MMN model and 7 for the MPN model. Both models are trained with a task batch size of 4, and the learning rate of the meta-learner parameters is set to 0.001. All models are trained for 30000 iterations on a single NVIDIA GeForce GTX 1080 Ti GPU. We evaluate every 500 training iterations and record the best testing accuracy during the training procedure as the final result.
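The role of the number of gradient steps can be illustrated with a Meta-SGD style inner loop, in which the base-learner parameters are updated with element-wise learning rates. The sketch below is an assumption-laden toy: it uses a quadratic loss in place of the metric learner's classification loss, fixed values for the initialization and learning rates that a meta-learner would normally learn, and it omits the outer meta-update entirely. All names are hypothetical.

```python
def adapt(theta, alpha, grad_fn, steps):
    """Apply `steps` updates theta <- theta - alpha * grad, element-wise."""
    theta = list(theta)
    for _ in range(steps):
        grad = grad_fn(theta)
        theta = [t - a * g for t, a, g in zip(theta, alpha, grad)]
    return theta

# Toy task loss L(theta) = sum_i (theta_i - target_i)^2, a stand-in for the
# classification loss of the metric-based base learner.
target = [1.0, -2.0]

def toy_loss(theta):
    return sum((t - s) ** 2 for t, s in zip(theta, target))

def toy_grad(theta):
    return [2.0 * (t - s) for t, s in zip(theta, target)]

theta0 = [0.0, 0.0]   # initialization (learned by the meta-learner in Meta-SGD)
alpha = [0.1, 0.25]   # per-parameter learning rates (also learned in Meta-SGD)
theta5 = adapt(theta0, alpha, toy_grad, steps=5)
```

Each additional step requires another gradient evaluation on the task, which is why a large number of steps raises the computational cost even when the accuracy gain has flattened out.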

The results are shown in Tables I to IV. All the results are averaged over the 600 meta-testing tasks and reported with confidence intervals. From these tables, we can see that Matching Network performs better than Prototypical Network. Perhaps this is because the numbers of classes in the meta-training and meta-testing tasks are fixed beforehand, rather than tuned on a held-out validation set as in [18]. When there is only a finite number of meta-training tasks, the original Meta-SGD method seems to perform worse than the plain Matching Network. Moreover, the 3 vs. 5 split is clearly the more challenging task: its accuracy is around 15 percentage points lower than that of the 5 vs. 3 split. In both settings, task-specific metric models, whether fine-tuned or meta-learned, generally outperform the baseline models. When comparing our method to [32], neither consistently dominates: each performs better in some scenarios, which suggests that our method is competitive with, and in some scenarios more effective than, fine-tuning.
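The "mean ± interval" entries in the tables can be computed from per-task accuracies as in the sketch below. It assumes a normal-approximation interval with z = 1.96 (a 95% level); the confidence level is an assumption, since it is not stated in this section, and the accuracy values are made up for illustration.

```python
import math
import statistics

def mean_with_ci(accuracies, z=1.96):
    """Return (mean, half-width) of a normal-approximation confidence interval.

    z = 1.96 corresponds to a 95% interval, which is an assumption here.
    """
    mean = statistics.fmean(accuracies)
    stderr = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, z * stderr

# Hypothetical per-task accuracies; the paper averages over 600 tasks.
accs = [0.60, 0.64, 0.56, 0.62, 0.58]
mean, half_width = mean_with_ci(accs)
report = f"{100 * mean:.2f}±{100 * half_width:.2f}"
```

With 600 tasks the half-width shrinks roughly with the square root of the number of tasks, which is consistent with the sub-percent to roughly one-percent intervals reported in the tables.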

Figure 2 shows the relationship between test accuracy and training iteration steps in the setting of unbalanced numbers of classes. We can see that our methods converge faster and reach higher accuracy than the corresponding baseline models in most cases. In the remaining cases, our methods still achieve better test accuracy, although their training curves fluctuate more than those of the baseline models.

(a) Training procedure of models related to Matching Network
(b) Training procedure of models related to Prototypical Network
Figure 2: The relationship between test accuracy and training iteration steps in the case of unbalanced numbers of classes. Different colors represent different experiment settings. Solid lines with dot markers show the results of our models and dashed lines with cross markers show the results of the corresponding baseline models.

IV-B Experiments in Multi-Source Form

In this section, we show the results of our Meta-Metric-Learner when an auxiliary set is available. We test this setting on the Omniglot dataset because it naturally forms a multi-source setting. Here each data source corresponds to an alphabet, and the motivation of the multi-source setting is to exploit cross-alphabet knowledge sharing to boost the performance on a target alphabet.

| Model | Model Type | 30% classes, 1-shot | 30% vs 50%, 1-shot |
|---|---|---|---|
| Meta-SGD | - | 81.75% | - |
| Matching Network | Basic (No FCE) | 84.92% | 76.69% |
| Meta Matching Network | Basic (No FCE) | 85.43% | 77.30% |
| Prototypical Network | Euclid. | 86.58% | 78.83% |
| Meta Prototypical Network | Euclid. | 86.84% | 79.25% |

Table V: Average classification accuracies on Omniglot with different approaches in multi-source form.

In the experiment, the total number of data sources (i.e., alphabets) is 50. We randomly choose 10 alphabets as target data sources and 30 alphabets as auxiliary data sources. We use the strategy introduced above to find the most related sources (the number of selected sources differs across target sources; 1 or 2 works well in our experiments) and train the Meta-Metric-Learner. We again test our model in both the identical class number setting and the flexible class number setting. In the first setting, every meta-dataset contains 30% of the classes in the data source. In the second setting, we split the classes in the data source in a 3:2:5 ratio into meta-train, meta-validate, and meta-test. We only conduct experiments in 1-shot learning mode. There are 10 examples per class for evaluation in the testing set of each task. We follow the procedure of [7] by augmenting the characters with rotations in multiples of 90 degrees. The average classification accuracies over the 10 target data sources are shown in Table V. In all of the settings, our models achieve better classification accuracy than the corresponding baselines.
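The rotation augmentation mentioned above can be sketched in a few lines. The example treats a character image as a small grid of pixel values and produces its rotations by 0, 90, 180, and 270 degrees; the function names and the toy 2×2 glyph are hypothetical, and how the rotated copies are used during training (as extra samples or as extra classes) is not specified in this section.

```python
def rotate90(image):
    """Rotate a 2-D grid (a list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def rotations(image):
    """Return the image rotated by 0, 90, 180 and 270 degrees."""
    out = [[list(row) for row in image]]  # copy of the original
    for _ in range(3):
        out.append(rotate90(out[-1]))
    return out

# Toy 2x2 "character" with hypothetical pixel values.
glyph = [[1, 0],
         [1, 1]]
augmented = rotations(glyph)
```

A fourth rotation returns the original grid, so the four outputs cover every multiple of 90 degrees.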

V Conclusion

In this paper, we propose the Meta-Metric-Learner for few-shot learning, which combines a Meta-SGD meta-learner with a base metric classifier, and we design two specific versions of it. The proposed method has several advantages, such as handling unbalanced numbers of classes and generating task-specific metrics. Moreover, as shown in the results, using the meta-learner to guide gradient optimization in metric learners seems to be a promising direction. We evaluate our model on several datasets, in both single-source and multi-source settings. The experiments demonstrate that our approach is effective and competitive for few-shot learning problems.

There are several directions for future work. Firstly, we will try to exploit more sophisticated optimization-based meta-learning models, such as probabilistic MAML and LEO, to adjust the basic metric-based model so that the learned metric generalizes better. Secondly, we would like to focus on selecting data from more related domains/sources to support the training of Meta-Metric-Learners. Thirdly, it would be interesting to propose an end-to-end framework for the Meta-Metric-Learner that leverages data from different domains/sources/tasks for training, instead of the current two-stage procedure. Finally, we would like to apply the current framework to other applications, such as language modeling and text classification.