Meta-learning, or learning to learn, is a sub-field of machine learning that attempts to search for the best learning strategy as learning experience increases Vilalta and Drissi (2002); Thrun and Pratt (2012); Lemke et al. (2015); Vanschoren (2018); Finn (2018). Recent years have witnessed an abundance of new approaches to meta-learning Finn and Levine (2018); Finn et al. (2017a); Vinyals et al. (2016); Rusu et al. (2019); Triantafillou et al. (2020); Rajeswaran et al. (2019); Lee et al. (2019); Mishra et al. (2018); Munkhdalai and Yu (2017); Ritter et al. (2018); Santoro et al. (2016); Nichol et al. (2018), developed in various areas including few-shot learning Ravi and Larochelle (2017); Snell et al. (2017); Wang and Hebert (2016); Ye et al. (2020a, b); Sung et al. (2018); Zhang et al. (2018a); Wang et al. (2018b), optimization Andrychowicz et al. (2016); Wichrowska et al. (2017); Li and Malik (2017); Bello et al. (2017), reinforcement and imitation learning Stadie et al. (2018); Frans et al. (2018); Wang et al. (2017a); Duan et al. (2016, 2017); Finn et al. (2017b); Yu et al. (2018); Metz et al. (2019); Edwards and Storkey (2017); Reed et al. (2018), continual learning Riemer et al. (2019); Kaiser et al. (2017); Al-Shedivat et al. (2018); Clavera et al. (2019), transfer and multi-task learning Motiian et al. (2017); Balaji et al. (2018); Ying et al. (2018); Zhang et al. (2018b); Li et al. (2019, 2018); Sharma et al. (2018); Bachman et al. (2017); Pang et al. (2018), etc. Specifically, meta-learning has demonstrated the capability to generalize learned knowledge to novel tasks, which greatly reduces the need for training data and optimization time.
Model-agnostic meta-learning (MAML) Finn et al. (2017a) is one of the most widely studied and applied meta-learning algorithms, thanks to its “model-agnostic” nature and its elegant formulation. Concretely, MAML aims to learn a good model initialization (through the outer loop optimization), which can then be quickly adapted to novel tasks using few data and few gradient updates (through the inner loop optimization). However, in few-shot classification Vinyals et al. (2016); Snell et al. (2017), to which many meta-learning algorithms are dedicated, MAML’s performance has been shown to fall far behind Wang et al. (2019); Chen et al. (2019); Triantafillou et al. (2020), even when paired with a stronger backbone, e.g., ResNet He et al. (2016) pre-trained on the meta-training set Ye et al. (2020a); Chao et al. (2020); Rusu et al. (2019); Qiao et al. (2018). (Footnote 1: The original MAML Finn and Levine (2018) uses a simple 4-layer convolutional network (ConvNet) without pre-training.)
In this paper, we take a closer look at MAML on the few-shot classification problem. The standard setup of few-shot classification using meta-learning involves a meta-training and a meta-testing phase. For example, MAML learns the initialization during meta-training and applies it during meta-testing. In both phases, a meta-learning algorithm receives multiple $N$-way $K$-shot tasks. Each task is an $N$-class classification problem provided with $K$ labeled support images per class. After the (temporal) inner loop optimization using the labeled support images, the resulting model is then evaluated on the query images of the same $N$ classes. In the meta-training phase, the loss calculated on the query images is the driving force to optimize the meta-parameters (e.g., the initialization in MAML). We note that the classes seen in the meta-training and meta-testing phases are disjoint.
For an $N$-way problem, what MAML learns is the initialization of an $N$-class classifier. Without loss of generality, we denote it by $\hat{y} = \arg\max_{c \in [N]} w_c^\top f_\phi(x)$, where $f_\phi$ is the feature extractor on image $x$, and $\{w_c\}_{c=1}^N$ are the linear classifiers. We use $\theta$ to represent the collection of meta-parameters $\{\phi, w_1, \cdots, w_N\}$. For simplicity and fair comparisons, we use (a) a pre-trained ResNet-12 backbone Lee et al. (2019) released by an existing algorithm Ye et al. (2020a), (b) the first-order approximation in calculating outer loop gradients, and (c) the same number of inner loop steps in meta-training and meta-testing.
Our first observation is that MAML needs a large number of inner loop gradient steps. For example, on the benchmark MiniImageNet Vinyals et al. (2016) and TieredImageNet Ren et al. (2018a) datasets, MAML’s performance improves as the number of steps increases and achieves the highest accuracy around steps, which is much larger than in the conventional usage of MAML Antoniou et al. (2019).
Our second observation is that MAML is inherently sensitive to the permutation of label assignments of each $N$-way $K$-shot task. Concretely, when a new task arrives, MAML pairs the learned initialization of $w_c$ with the corresponding class label $c \in [N]$ of the task. The issue, however, resides in the “meaning” of $c$ in a task. In the standard setup, each $N$-way task is created by drawing $N$ classes from a bigger pool of semantic classes (e.g., “dog”, “cat”, “bird”, etc.), followed by an arbitrary label re-assignment into $c \in \{1, \cdots, N\}$. In other words, the same set of $N$ semantic classes can be re-labeled totally differently into $\{1, \cdots, N\}$ and thus be paired with $\{w_c\}_{c=1}^N$ differently. Taking a five-way task for example, there are $5! = 120$ ways (permutations) to pair the same five semantic classes with the linear classifiers. In some of them, a class “dog” may be assigned to $c = 1$; in some others, it will be assigned to $c \neq 1$. In our experiment, we find that different permutations indeed lead to different meta-testing accuracy. Specifically, if we cherry-pick the best permutation for each meta-testing task, the resulting accuracy over five-way one-shot tasks can be higher on both datasets.
Building upon this observation, we investigate permutation-invariant treatments during meta-testing. First, we search for the best permutation for each task; we explore using the loss or accuracy on the support set as the signal to determine a permutation. Second, we explore ensembling Breiman (1996); Zhou (2012); Dietterich (2000) over (a subset of) all possible permutations, which inevitably needs more computation. Third, we employ forced permutation invariance: during meta-testing, we initialize each $w_c$ by their average, $\frac{1}{N}\sum_{c=1}^{N} w_c$. Overall, we found that (a) it is challenging to find the best permutation per task: the strategies we explore hardly improve accuracy; (b) ensembling helps, even if we just pick a subset of permutations; (c) using the averaged initialization does not hurt and can sometimes improve accuracy.
We further investigate permutation-invariant treatments during meta-training. (The corresponding treatments are applied in meta-testing as well.) First, we again explore using the loss or accuracy on the support set to decide a permutation. Second, we investigate a fixed order in assigning the semantic classes into $\{1, \cdots, N\}$. Third, we employ the forced permutation invariance; i.e., we meta-train only a single $w$. We duplicate $w$ into $\{w_c\}_{c=1}^N$ at the beginning of the inner loop optimization and aggregate the outer loop gradients to optimize $w$ (see Figure 1). We found that the first two treatments hurt MAML, suggesting that the permutations in meta-training may be beneficial, e.g., to prevent the initialization from over-fitting meta-training tasks. To our surprise, the third treatment, learning and testing with a single $w$, consistently improves MAML. On few-shot tasks in both benchmarks, this approach, which we name unicorn-MAML, is on a par with or outperforms state-of-the-art algorithms, without any extra network modules or learning strategies. We provide further analysis on unicorn-MAML and show that, even with a strong backbone, it is still crucial to perform inner loop updates to the feature extractor in meta-testing, which matches the claims by Arnold and Sha (2021): “Embedding adaptation is still needed for few-shot learning.”
2 Related Work
Training a model under data budgets is important in machine learning, computer vision, and many other application fields, since the costs of collecting data Li and Zhou (2015) and labeling them Huang et al. (2014) are by no means negligible. This is especially the case for deep learning models in visual recognition He et al. (2016); Dosovitskiy et al. (2021); Simonyan and Zisserman (2015); Szegedy et al. (2015); Krizhevsky et al. (2012); Huang et al. (2017), which usually need thousands of, millions of, or even more images to train Russakovsky et al. (2015); Deng et al. (2009); Guo et al. (2016); Zhou et al. (2017); Thomee et al. (2016); Mahajan et al. (2018); Joulin et al. (2016) in a conventional supervised manner. Different from training a model to predict at the instance level, meta-learning attempts to learn the inductive bias across training tasks Baxter (2000); Vilalta and Drissi (2002). A “meta-model” summarizes the common characteristics of tasks and generalizes them to novel but related tasks Maurer (2009); Maurer et al. (2016); Denevi et al. (2018). Meta-learning has been applied in various fields, including imbalance learning Wang et al. (2017c); Ren et al. (2018b), data compression Wang et al. (2018a), architecture search Elsken et al. (2019), recommendation systems Vartak et al. (2017), data augmentation Ratner et al. (2017), teaching Fan et al. (2018), and hyper-parameter tuning Franceschi et al. (2017); Probst et al. (2019).
In few-shot learning (FSL), meta-learning is applied to learn the ability of “how to build a classifier using limited data” that can be generalized across tasks. Such an inductive bias is first learned over few-shot tasks composed of “base” classes, and then evaluated on tasks composed of “novel” classes. For example, few-shot classification can be implemented in a non-parametric way with soft nearest neighbor Vinyals et al. (2016) or nearest center classifiers Snell et al. (2017), so that the feature extractor is learned and acts at the task level. The learned features pull similar instances together and push dissimilar ones far away, such that a test instance can be classified even with a few labeled training examples Koch et al. (2015). Considering the complexity of a hypothesis class, the model training configurations (i.e., hyper-parameters) also serve as a type of inductive bias. Andrychowicz et al. (2016); Ravi and Larochelle (2017) meta-learn the optimization strategy for each task, including the learning rate and update directions. Other kinds of inductive biases are also explored. Hariharan and Girshick (2017); Wang et al. (2018b) learn a data generation prior to augment examples given few images; Dai et al. (2017) extract logical derivations from related tasks; Wang et al. (2017b); Shyam et al. (2017) learn a prior for attending to images.
Model-agnostic meta-learning (MAML) Finn et al. (2017a) proposes another inductive bias, i.e., the model initialization. After the model initialization shared among tasks is meta-trained, the classifier of a new few-shot task can be fine-tuned with several steps of gradient descent from that initial point. The universality of these MAML-type updates is proved in Finn and Levine (2018). MAML has been applied in various scenarios, such as uncertainty estimation Finn et al. (2018), robotics control Yu et al. (2018); Clavera et al. (2018), neural translation Gu et al. (2018), language generation Huang et al. (2018), etc. Despite the success, there are still problems with MAML. Nichol et al. (2018) handle the computational burden by presenting a family of approaches using first-order approximations; Antoniou et al. (2019) provide a set of tricks to train and stabilize the MAML framework; Bernacchia (2021) points out that negative rates of gradient updates help in some scenarios.
Since MAML applies a uniform initialization to all the tasks (i.e., the same set of $\{w_c\}_{c=1}^N$ and $\phi$), recent methods explore ways to better incorporate task characteristics. Lee et al. (2019); Bertinetto et al. (2019); Rajeswaran et al. (2019) optimize the linear classifiers (not the feature extractor $f_\phi$) till convergence in the inner loop; Triantafillou et al. (2020) construct the linear classifiers from class prototypes (i.e., aggregated features per class) so they are task-aware and need no inner loop updates. Another direction is to enable task-specific initialization Requeima et al. (2019); Vuorio et al. (2019); Yao et al. (2019); Ye et al. (2020b), which often needs additional sub-networks.
Our work is complementary to the above improvements of MAML: we find an inherent permutation issue and conduct a detailed analysis. We then build upon it to improve MAML. We note that some of the above methods can be invariant to the permutations. For example, LEO Rusu et al. (2019) and ProtoMAML Triantafillou et al. (2020) compute class prototypes to represent each semantic class. However, they need to either introduce additional sub-networks or modify the training objective.
3 MAML for Few-Shot Classification
3.1 Problem definition
Following the literature Vinyals et al. (2016), we define an $N$-way $K$-shot task as an $N$-class classification problem with $K$ labeled support examples per class. The value of $K$ is small, e.g., $K = 1$ or $K = 5$. The goal of few-shot learning (FSL) is to construct a classifier using the support set $S = \{(x_i, y_i)\}_{i=1}^{N \times K}$ of $N \times K$ examples. Each $(x_i, y_i)$ is a pair of the instance and label, where $y_i \in \{1, \cdots, N\} = [N]$. To evaluate the quality of the resulting classifier, each task is usually associated with a set of query examples $Q$, which is composed of examples of the same $N$ classes. The challenge of FSL is the potential over-fitting or poor generalization problem.
The core idea of meta-learning for FSL is to sample few-shot tasks from a set of “base” classes, for which we have ample data per class, and learn how to build a classifier using limited data from these tasks. After this so-called “meta-training” phase, we then proceed to the “meta-testing” phase to tackle the true few-shot tasks composed of “novel” classes that are disjoint from the “base” classes. It is worth noting that the total number of “base” (and “novel”) classes is usually larger than $N$ (see subsection 3.3). Thus, to construct an $N$-way $K$-shot task in both phases, one usually first samples $N$ classes randomly from the corresponding set of classes, and re-labels each sampled class by an index $c \in [N]$. Throughout the paper, we will use base and meta-training classes interchangeably, as well as novel and meta-testing classes.
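The task construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' data pipeline: the function name `sample_task`, the toy class pool, and the 0-based labels are assumptions made for the example.

```python
import random

def sample_task(class_to_images, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot task with an arbitrary label re-assignment."""
    # The order in which classes are drawn decides which index each class gets.
    classes = rng.sample(sorted(class_to_images), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):  # labels are 0-based here
        images = rng.sample(class_to_images[cls], k_shot + n_query)
        support += [(img, label) for img in images[:k_shot]]
        query += [(img, label) for img in images[k_shot:]]
    return classes, support, query

# Hypothetical pool: 8 classes with 20 image ids each.
pool = {f"class_{i}": [f"class_{i}/img_{j}" for j in range(20)] for i in range(8)}
classes, support, query = sample_task(pool, n_way=5, k_shot=1, n_query=3,
                                      rng=random.Random(0))
```

Because the class order is random, sampling the same five classes twice will generally give them different indices, which is the arbitrariness studied later in the paper.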
3.2 Model-Agnostic Meta-Learning (MAML)
As mentioned in section 1 and section 2, MAML aims to learn an initialization of an $N$-way classifier, such that when provided with the support set $S$ of an $N$-way $K$-shot task, the classifier can quickly update to perform well for the task (i.e., classify the query set $Q$ well). Let us denote the classifier initialization by $\hat{y} = h_\theta(x) = \arg\max_{c \in [N]} w_c^\top f_\phi(x)$, where $f_\phi$ is the feature extractor on $x$, $\{w_c\}_{c=1}^N$ are the linear classifiers, and $\theta$ represents the collection of both. MAML evaluates $h_\theta$ on $S$ and uses the gradient to update $\theta$ into $\theta'$, a classifier that is ready for $Q$. This procedure is called the inner loop optimization, which usually takes $M$ gradient steps:

$$\theta' \leftarrow \theta; \quad \text{for } m \in [M]: \quad \theta' \leftarrow \theta' - \alpha \nabla_{\theta'} \mathcal{L}(S, \theta'), \qquad (1)$$

where $\mathcal{L}(S, \theta')$ is the loss computed on instances of $S$ and $\alpha$ is the learning rate (or step size). The cross-entropy loss is commonly used for $\mathcal{L}$. As suggested in the original MAML paper Finn et al. (2017a) and Antoniou et al. (2019), $M$ is considered as a small integer (e.g., ). For ease of notation, let us denote the output $\theta'$ after $M$ gradient steps by $\theta'_S$.

MAML learns such an initialization $\theta$ using the few-shot tasks sampled from the base classes. Let us denote by $p(\mathcal{T})$ the distribution of tasks from the base classes, where each task is a pair of support and query sets $(S, Q)$. MAML aims to minimize the following meta-learning objective w.r.t. $\theta$:

$$\min_{\theta} \sum_{(S, Q) \sim p(\mathcal{T})} \mathcal{L}(Q, \theta'_S). \qquad (2)$$

That is, MAML aims to find a shared $\theta$ among tasks, which, after the $M$-step inner loop optimization using $S$, can lead to a small classification loss on the query set $Q$. (We add the subscript $S$ to $\theta'$ to indicate that $\theta'$ depends on $S$.) To optimize Equation 2, MAML applies stochastic gradient descent (SGD) but at the task level; i.e., at every iteration it samples a task $(S, Q)$ and computes the gradient w.r.t. $\theta$:

$$\nabla_{\theta} \mathcal{L}(Q, \theta'_S). \qquad (3)$$

In practice, one may sample a mini-batch of tasks and compute the mini-batch task gradient w.r.t. $\theta$ for learning $\theta$. This SGD for $\theta$ is known as the outer loop optimization for MAML. It is worth noting that calculating the gradient in Equation 3 can be computation and memory heavy since it involves a gradient through a gradient (along the inner loop but in a backward order) Finn et al. (2017a). Thus in practice, it is common to apply the first-order approximation Finn et al. (2017a); Nichol et al. (2018), i.e., $\nabla_{\theta} \mathcal{L}(Q, \theta'_S) \approx \nabla_{\theta'_S} \mathcal{L}(Q, \theta'_S)$.
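To make the inner loop, the outer loop, and the first-order approximation concrete, here is a minimal NumPy sketch. It is not the authors' implementation: it assumes a linear softmax classifier on fixed features (so the meta-parameters are only the rows of `W`, with no feature extractor), and the names `fo_maml_step` and `toy_task`, the Gaussian toy data, and all hyper-parameter values are illustrative assumptions.

```python
import numpy as np

def loss_and_grad(W, X, y):
    """Cross-entropy loss of a linear softmax classifier and its gradient w.r.t. W."""
    logits = X @ W.T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    n = X.shape[0]
    loss = -np.log(probs[np.arange(n), y]).mean()
    probs[np.arange(n), y] -= 1.0      # now holds n * d(loss)/d(logits)
    return loss, probs.T @ X / n       # gradient has the same shape as W

def inner_loop(W, X_s, y_s, alpha, M):
    """M gradient steps on the support set, starting from the initialization W."""
    W_task = W.copy()
    for _ in range(M):
        _, grad = loss_and_grad(W_task, X_s, y_s)
        W_task -= alpha * grad
    return W_task

def fo_maml_step(W, tasks, alpha, eta, M):
    """One outer-loop SGD step over a mini-batch of tasks (first-order approximation)."""
    grads = []
    for X_s, y_s, X_q, y_q in tasks:
        W_task = inner_loop(W, X_s, y_s, alpha, M)
        # First-order: use the query gradient at the adapted weights directly.
        _, grad_q = loss_and_grad(W_task, X_q, y_q)
        grads.append(grad_q)
    return W - eta * np.mean(grads, axis=0)

def toy_task(rng, N=5, K=1, n_query=15, D=16):
    """Hypothetical task: Gaussian blobs stand in for the features of N classes."""
    means = rng.normal(size=(N, D))
    y_s, y_q = np.repeat(np.arange(N), K), np.repeat(np.arange(N), n_query)
    X_s = means[y_s] + 0.1 * rng.normal(size=(N * K, D))
    X_q = means[y_q] + 0.1 * rng.normal(size=(N * n_query, D))
    return X_s, y_s, X_q, y_q

rng = np.random.default_rng(0)
W0 = np.zeros((5, 16))
tasks = [toy_task(rng) for _ in range(4)]
W1 = fo_maml_step(W0, tasks, alpha=0.01, eta=0.1, M=15)
```

The full (second-order) gradient would additionally back-propagate through the `M` inner steps; the sketch skips that on purpose, which is exactly what the first-order approximation does.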
3.3 Experimental setup
As our paper is heavily driven by empirical observations, we first introduce the two main datasets we experiment with, the neural network architecture we use, and the implementation details.
Datasets. We evaluate on MiniImageNet Vinyals et al. (2016) and TieredImageNet Ren et al. (2018a). MiniImageNet contains 100 semantic classes, each with 600 images. Following Ravi and Larochelle (2017), the classes are split into meta-training/validation/testing sets with 64/16/20 (non-overlapped) classes, respectively. That is, there are 64 base classes and 20 novel classes; the other 16 classes are used for hyper-parameter tuning. In TieredImageNet Ren et al. (2018a), there are in total 608 classes, which are split into meta-training/validation/testing sets with 351/97/160 (non-overlapped) classes, respectively. On average, each class has images. All images are resized to , following Lee et al. (2019); Ye et al. (2020a).
Training and evaluation. During meta-training, meta-testing, and meta-validation, we sample $N$-way $K$-shot tasks from the corresponding classes and images. We follow the literature Snell et al. (2017); Vinyals et al. (2016) to study the five-way one-shot and five-way five-shot tasks. As mentioned in subsection 3.1, every time we sample five distinct classes, we randomly assign each of them an index $c \in [N]$. During meta-testing, we follow the evaluation protocol in Zhang et al. (2020); Rusu et al. (2019); Ye et al. (2020a) to sample tasks. In each task, the query set contains images per class. We report the mean accuracy (in %) and the confidence interval.
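As a small illustration of how such a summary can be computed from per-task accuracies, the sketch below returns the mean and the half-width of a normal-approximation confidence interval. The helper name `mean_and_ci` is hypothetical, and the default `z=1.96` (a 95% interval) is an assumption, since the confidence level is not stated in the text above.

```python
import numpy as np

def mean_and_ci(task_accuracies, z=1.96):
    """Mean accuracy and half-width of a normal-approximation confidence interval.

    z=1.96 corresponds to a 95% interval; this level is an assumption.
    """
    acc = np.asarray(task_accuracies, dtype=float)
    half_width = z * acc.std(ddof=1) / np.sqrt(acc.size)
    return acc.mean(), half_width

# Four made-up per-task accuracies (in %), only to show the call.
mean, half = mean_and_ci([60.0, 70.0, 80.0, 90.0])
```

With many sampled tasks the half-width shrinks with the square root of the task count, which is why results of this kind are reported over a large number of meta-testing tasks.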
Model architecture. We follow Lee et al. (2019) to use a ResNet-12 architecture He et al. (2016) for $f_\phi$ (cf. subsection 3.2), which has wider layers and uses the DropBlock module Ghiasi et al. (2018) to avoid over-fitting. More specifically, we initialize $f_\phi$ with the weights released by Ye et al. (2020a), which are pre-trained on the entire meta-training set, following the recent practice Ye et al. (2020a); Chao et al. (2020); Rusu et al. (2019); Qiao et al. (2018). We randomly initialize $\{w_c\}_{c=1}^N$.
Implementation details. MAML has several hyper-parameters and we select them on the meta-validation set. Specifically for the outer loop, we learn with at most tasks: we group every tasks into an epoch. We apply SGD with momentum and weight decay . We start with an outer loop learning rate for and for . Both are decayed by after every epochs. For the inner loop, we have to set the number of gradient steps $M$ and the learning rate $\alpha$ (cf. Equation 1). We provide more discussions in the next section.
4 MAML Needs a Large Number of Inner Loop Gradient Steps
While hyper-parameter tuning is a common practice in machine learning, we find that for MAML’s inner loop, the number of gradient updates $M$ (cf. Equation 1) is usually searched in a small range close to , e.g., Antoniou et al. (2019). This makes sense according to the motivation of MAML Finn et al. (2017a): with a small number of gradient steps, the resulting model will have a good generalization performance.
In our experiment, however, we find that it is crucial to explore a larger $M$. (Footnote 2: For simplicity, we apply the same $M$ in meta-training and meta-testing.) Specifically, we consider along with . We plot the meta-testing accuracy of five-way one-shot tasks on both datasets in Figure 2. (Footnote 3: We tune hyper-parameters on the meta-validation set and find that the accuracy there reflects the meta-testing accuracy well. We show the meta-testing accuracy here simply for a direct comparison to results in the literature.) We find that MAML achieves higher and much more stable results (w.r.t. the learning rate) when $M$ is larger, e.g., larger than . For MiniImageNet, the highest accuracy is obtained with , higher than with ; for TieredImageNet, the highest accuracy is obtained with , higher than with . As will be seen in Table 4 and Table 5, these results with a larger $M$ are comparable with many existing algorithms.
To analyze how such a large value of $M$ helps MAML, we plot the change of meta-testing accuracy along with the inner loop updates in Figure 3. Specifically, we meta-train with , but during meta-testing we show the accuracy from to inner loop updates. In general, the more inner loop updates we perform in meta-testing (even more than the number in meta-training), the higher the accuracy is. This observation matches the few-shot regression study in Finn et al. (2017a).
Also from Figure 3, we find that before any inner loop update, the learned initialization has a $20\%$ accuracy, i.e., the accuracy of random classification on five-way tasks. While this may explain why a larger number of inner loop updates is needed, the chance-level accuracy is a bit surprising and hard to explain at first glance. MAML does learn a set of linear classifier initializations $\{w_c\}_{c=1}^N$. How could they perform like random?
5 MAML is Sensitive to the Few-Shot Task Label Assignment
To understand the above observation, we revisit how a few-shot task is generated. According to subsection 3.1 and subsection 3.3, each class index $c \in [N]$ in an $N$-way task can be paired with any of the base classes in meta-training or any of the novel classes in meta-testing. For a few-shot task of a specific set of $N$ semantic classes (e.g., “dog”, “cat”, $\cdots$, “bird”), such an arbitrary nature can indeed turn it into a totally different task from MAML’s perspective. That is, the class “dog” may be assigned to $c = 1$ and paired with $w_1$ at the current task, but to $c = 2$ and paired with $w_2$ when it is sampled again. We note that, for a standard five-way task, the same set of five semantic classes can be assigned to $\{1, \cdots, 5\}$ in $120$ (i.e., $5!$) different ways.
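The count of label assignments is easy to verify by enumeration. In the sketch below, the class names beyond “dog”, “cat”, and “bird” are placeholders chosen for the example.

```python
from itertools import permutations

# Five semantic classes; "fish" and "horse" are placeholder names.
classes = ["dog", "cat", "bird", "fish", "horse"]

# Each permutation assigns the indices 1..5 to the same five classes.
assignments = [dict(zip(order, range(1, 6))) for order in permutations(classes)]

num_assignments = len(assignments)                              # 5! = 120
dog_is_first = sum(a["dog"] == 1 for a in assignments)          # 4! = 24
```

So “dog” is paired with the first linear classifier in only 24 of the 120 assignments, and with one of the other four classifiers in the remaining 96.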
This permutation in class label assignments explains why we obtain $20\%$ accuracy using the learned initialization of MAML directly without inner loop updates. On the one hand, we may sample the same set of semantic classes but in different permutations so their accuracy cancels out. On the other hand, since the permutation also occurs in meta-training, each $w_c$ will be discouraged from learning specific knowledge about any semantic class. Indeed, we find that the learned initialization also has a $20\%$ accuracy on the meta-training set (please see the supplementary material).
This observation leads to two further questions:
Are the learned $\{w_c\}_{c=1}^N$ useless, and can they be replaced by any random vectors?
Do different permutations lead to different meta-testing results after inner loop updates?
|Select the permutation by||MiniImageNet 1-Shot||MiniImageNet 5-Shot||TieredImageNet 1-Shot||TieredImageNet 5-Shot|
|Initial Support Acc||64.42 ± 0.20||83.95 ± 0.13||65.06 ± 0.20||84.32 ± 0.16|
|Initial Support Loss||64.42 ± 0.20||83.91 ± 0.13||65.42 ± 0.20||84.23 ± 0.16|
|Updated Support Acc||64.42 ± 0.20||83.95 ± 0.13||65.01 ± 0.20||84.37 ± 0.16|
|Updated Support Loss||64.67 ± 0.20||84.05 ± 0.13||65.43 ± 0.20||84.22 ± 0.16|
We answer the first question in Figure 4: the learned initialization results in higher accuracy than randomized ones. For the second question, we conduct a detailed experiment as follows.
We evaluate a learned MAML over sampled meta-testing tasks.
In each task, we consider all the $120$ permutations of class label assignments, followed by inner loop gradient steps. We then apply the updated models to the corresponding permutations, obtain $120$ meta-testing accuracy values for each task, and sort them in descending order.
We summarize the tasks by taking the average over each task’s accuracy at the same rank. Namely, we obtain $120$ averaged accuracy values, each corresponding to a specific rank in each task.
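The rank-wise summary in the steps above can be sketched as follows. The function name `rankwise_mean` and the tiny accuracy matrix (two tasks, three permutations) are illustrative assumptions; in the experiment each row would hold one task's accuracy under every permutation.

```python
import numpy as np

def rankwise_mean(acc):
    """acc[t, p] is the accuracy of task t under permutation p.

    Sort each task's accuracies in descending order, then average
    across tasks at every rank.
    """
    sorted_acc = np.sort(acc, axis=1)[:, ::-1]  # descending within each task
    return sorted_acc.mean(axis=0)

# Made-up numbers: 2 tasks, 3 permutations each.
acc = np.array([[0.6, 0.8, 0.7],
                [0.5, 0.4, 0.9]])
summary = rankwise_mean(acc)  # rank 1 is the per-task best permutation
```

Note that rank 1 in the summary does not correspond to one fixed permutation: the best permutation can differ from task to task, which is why it behaves like a cherry-picked upper bound.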
We show the histogram of the average meta-testing accuracy in Figure 5. There exists a huge variance. Specifically, the best permutation can be higher than the worst in one/five-shot tasks. The best permutation is also much higher than vanilla MAML’s results (from section 5), which are 64.42%/83.44%/65.72%/84.37%, corresponding to the four sub-figures from left to right in Figure 5. Moreover, the best permutation can easily achieve state-of-the-art accuracy (see Table 4).
Of course, so far we find the best permutation through cherry-picking (by looking at the meta-testing accuracy), so it serves as an upper bound. However, if we can (a) develop a strategy to find the best permutation without looking at the query sets’ labels, (b) leverage the variance among permutations, or (c) make MAML permutation-invariant, we may be able to practically improve MAML.
6 Making MAML Permutation-Invariant in the Meta-Testing Phase
In this section, we investigate ways to make MAML permutation-invariant during meta-testing. That is, we take the same learned MAML as in section 5 without changing the meta-training phase.
The first direction we investigate is to search for the best permutation for each task. As we cannot access query sets’ labels in advance, we explore using the support sets’ data as a proxy. Specifically, we consider choosing the best permutation under which the learned initialization (before inner loop updates) leads to (a) the largest support set accuracy or (b) the smallest support set loss. We further consider two less practical ways, which are to perform inner loop updates for each permutation and re-evaluate (a) and (b). Table 1 summarizes the results: none of the above leads to consistent gains.
The second direction we investigate is to perform ensembling Breiman (1996); Zhou (2012); Dietterich (2000) over the updated model of each permutation. For this direction, instead of permuting the class label assignment of a task, we permute $\{w_c\}_{c=1}^N$, which is equivalent to the former but easier for aggregating the models. We study two versions: (a) full permutation (i.e., $120$ of them in five-way tasks), which is intractable for a larger $N$; (b) rotated permutation, which is to rotate the index $c$ in $w_c$ (Footnote 4: That is, we consider re-assigning $w_c$ to $w_{(c + \gamma) \bmod N}$, where $\gamma \in [N]$.), leading to $N$ permutations. Table 2 summarizes the results: ensembling can consistently improve vanilla MAML. Importantly, even with the rotated version that has exponentially fewer base models than the full version, the improvements are comparable. We note that this ensemble is quite different from the common practice that performs augmentation to the test data or learns multiple meta-learners.
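A rotated-permutation ensemble can be sketched as below. This is an illustration under simplifying assumptions, not the authors' code: the classifier is a linear softmax head on fixed features, the ensemble averages class probabilities (one of several possible aggregation rules), and the names `adapt` and `rotated_ensemble`, the random data, and the step size are made up for the example.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def adapt(W, X_s, y_s, alpha, M):
    """M inner-loop gradient steps of a linear softmax classifier on the support set."""
    W = W.copy()
    onehot = np.eye(W.shape[0])[y_s]
    for _ in range(M):
        W -= alpha * (softmax(X_s @ W.T) - onehot).T @ X_s / len(y_s)
    return W

def rotated_ensemble(W, X_s, y_s, X_q, alpha, M):
    """Average query probabilities over the N rotations of the classifier rows."""
    N = W.shape[0]
    probs = np.zeros((X_q.shape[0], N))
    for gamma in range(N):
        W_rot = np.roll(W, -gamma, axis=0)  # row c now holds w_{(c + gamma) mod N}
        probs += softmax(X_q @ adapt(W_rot, X_s, y_s, alpha, M).T)
    return probs / N

rng = np.random.default_rng(0)
N, D = 5, 8
W = rng.normal(size=(N, D))        # stand-in for a learned initialization
X_s, y_s = rng.normal(size=(N, D)), np.arange(N)
X_q = rng.normal(size=(10, D))
p = rotated_ensemble(W, X_s, y_s, X_q, alpha=0.05, M=5)
```

Because the classifiers are permuted rather than the labels, column `c` of every base model's output refers to the same class, so the probabilities can be averaged without un-permuting anything. The ensemble is also unchanged if `W` itself is rotated first, since the set of rotations is the same.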
The third direction is forced permutation invariance. That is, we initialize each $w_c$ by the average: $\frac{1}{N}\sum_{c=1}^{N} w_c$. By doing so, no matter which permutation we perform, the resulting inner loop update is the same. At first glance, this approach makes less sense as the resulting $\{w_c\}_{c=1}^N$ simply classifies by chance. However, please note that according to Figure 3, even the original initialization has an averaged chance accuracy. This approach is further inspired by viewing the permutation in meta-training as a special form of dropout Srivastava et al. (2014). That is, in meta-training, we receive a task with an arbitrary permutation, which can be understood as drawing a permutation at random for the task. In meta-testing, we then take the expectation over the distribution, which essentially leads to the averaged $w_c$. Table 3 shows the results, which improve vanilla MAML (see Table 2) in three of four results.
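The two properties used in this argument, that the averaged initialization is unchanged by any permutation and that it predicts at chance before adaptation, can be checked numerically. The sketch assumes a linear softmax head on random stand-in features; all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 8
W = rng.normal(size=(N, D))               # stand-in for learned w_1..w_N

# Forced permutation invariance: every w_c is replaced by the average.
W_avg = np.tile(W.mean(axis=0), (N, 1))

# Permuting identical rows changes nothing, so every permutation starts
# the inner loop from the same point.
perm = rng.permutation(N)
same_init = np.array_equal(W_avg[perm], W_avg)

# Before any inner-loop update all logits of an input are equal,
# so the softmax output is uniform (1/N per class).
x = rng.normal(size=(3, D))
logits = x @ W_avg.T
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
```

Here `probs` is 0.2 for every class, matching the chance-level starting accuracy on five-way tasks discussed earlier.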
|Method (MiniImageNet)||1-Shot||5-Shot|
|ProtoMAML Triantafillou et al. (2020)||62.62 ± 0.20||79.24 ± 0.20|
|MetaOptNet Lee et al. (2019)||62.64 ± 0.35||78.63 ± 0.68|
|MTL+E3BM Sun et al. (2019)||63.80 ± 0.40||80.10 ± 0.30|
|RFS-Distill Tian et al. (2020)||64.82 ± 0.60||82.14 ± 0.43|
|DeepEMD Zhang et al. (2020)||65.91 ± 0.82||82.41 ± 0.56|
|MATE+MetaOptNet Chen et al. (2020)||62.08 ± 0.64||78.64 ± 0.46|
|TRAML+AM3 Li et al. (2020a)||67.10 ± 0.54||79.54 ± 0.60|
|DSN-MR Simon et al. (2020)||64.60 ± 0.72||79.51 ± 0.50|
|FEAT Ye et al. (2020b)||66.78 ± 0.20||82.05 ± 0.14|
|MAML (Our reimplementation)||64.42 ± 0.20||83.44 ± 0.13|
|Method (TieredImageNet)||1-Shot||5-Shot|
|ProtoNet Snell et al. (2017)||68.23 ± 0.23||84.03 ± 0.16|
|ProtoMAML Triantafillou et al. (2020)||67.10 ± 0.23||81.18 ± 0.16|
|MetaOptNet Lee et al. (2019)||65.99 ± 0.72||81.56 ± 0.53|
|MTL+E3BM Sun et al. (2019)||71.20 ± 0.40||85.30 ± 0.30|
|RFS-Distill Tian et al. (2020)||69.74 ± 0.72||84.41 ± 0.55|
|DeepEMD Zhang et al. (2020)||71.52 ± 0.69||86.03 ± 0.49|
|MATE+MetaOptNet Chen et al. (2020)||71.16 ± 0.87||86.03 ± 0.58|
|DSN-MR Simon et al. (2020)||67.39 ± 0.82||82.85 ± 0.56|
|FEAT Ye et al. (2020b)||70.80 ± 0.23||84.79 ± 0.16|
|MAML (Our reimplementation)||65.72 ± 0.20||84.37 ± 0.16|
7 Making MAML Permutation-Invariant in the Meta-Training Phase
We further investigate making both the meta-training and meta-testing phases permutation-invariant. The first direction is again to search for the best permutation for each task. Specifically, we select the order with the minimum initial support set loss in both phases. The second direction we investigate is to make the label assignment deterministic. Concretely, we give each meta-training/validation/testing class an integer index. Whenever we sample $N$ classes, we sort them by the indices and then label them with $1, \cdots, N$.
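The deterministic (fixed-order) label assignment amounts to sorting the sampled classes by a global index before labeling them. A minimal sketch, with a hypothetical helper name and made-up class indices:

```python
def fixed_order_labels(sampled_class_ids):
    """Map each sampled class id to a task label 1..N by sorting on the global id."""
    return {cls: label
            for label, cls in enumerate(sorted(sampled_class_ids), start=1)}

# The same five classes receive the same labels regardless of sampling order.
labels_a = fixed_order_labels([42, 7, 19, 3, 64])
labels_b = fixed_order_labels([64, 3, 42, 19, 7])
```

Under this scheme a given set of classes always maps to one labeling, although a single class can still receive different labels in different tasks, depending on which other classes are sampled with it.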
Our third direction is to apply forced permutation invariance, but this time to meta-training as well. That is, we explore meta-training a single $w$ rather than $\{w_c\}_{c=1}^N$ (we name this method unicorn-MAML). We use this learned initialization of $w$ to initialize each $w_c$ at the beginning of the inner loop gradient updates. In meta-training, we then aggregate the gradients w.r.t. $\{w_c\}_{c=1}^N$ to update $w$.
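The duplication step, and the permutation invariance it buys, can be illustrated with a short NumPy sketch. It only covers the inner loop under simplifying assumptions (a linear softmax head on fixed random features, a random vector standing in for the meta-learned `w`); the outer-loop gradient aggregation is not shown, and the helper name `adapt` is hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def adapt(W, X_s, y_s, alpha, M):
    """M inner-loop gradient steps of a linear softmax classifier on the support set."""
    W = W.copy()
    onehot = np.eye(W.shape[0])[y_s]
    for _ in range(M):
        W -= alpha * (softmax(X_s @ W.T) - onehot).T @ X_s / len(y_s)
    return W

rng = np.random.default_rng(0)
N, D = 5, 8
w = rng.normal(size=D)                   # single vector (random stand-in)
W0 = np.tile(w, (N, 1))                  # duplicate w into w_1..w_N

X_s, y_s = rng.normal(size=(N, D)), np.arange(N)
perm = rng.permutation(N)                # class c is re-labeled as perm[c]

W_a = adapt(W0, X_s, y_s, alpha=0.1, M=10)        # original labels
W_b = adapt(W0, X_s, perm[y_s], alpha=0.1, M=10)  # permuted labels

# The adapted classifier of each semantic class is the same in both runs;
# it just sits in a different row.
invariant = np.allclose(W_b[perm], W_a)
```

Starting from identical rows, a label permutation only moves the adapted classifiers between rows, so the predictions for each semantic class do not depend on how the task was labeled.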
Table 4 and Table 5 summarize the results, together with those by existing few-shot learning algorithms. The first two methods (i.e., MAML-PM and MAML-FO) hurt MAML, suggesting that the permutations in meta-training may be beneficial, e.g., to prevent the initialization from over-fitting meta-training tasks. To our surprise, the third method, learning and testing with a single $w$, consistently improves MAML. Specifically, on MiniImageNet, unicorn-MAML has a gain on one-shot tasks and gain on five-shot tasks. The latter already achieves the state-of-the-art accuracy. On TieredImageNet, unicorn-MAML has significant improvements ( gain on one-shot tasks and gain on five-shot tasks). The latter, again, already achieves the state-of-the-art accuracy. In particular, unicorn-MAML notably outperforms ProtoMAML and MetaOptNet, which are both permutation-invariant. We note that, similar to permutations in meta-training, learning a single $w$ prevents any $w_c$ from over-fitting a specific semantic class.
Embedding adaptation is needed. We analyze unicorn-MAML in terms of its inner loop updates during meta-testing, similar to Figure 3. This time, we also investigate updating or freezing the feature extractor $f_\phi$. Figure 6 shows the results on five-way one- and five-shot tasks on both datasets. unicorn-MAML’s accuracy again begins with $20\%$ but rapidly increases along with the inner loop updates. In three out of four cases, adapting the feature extractor $f_\phi$ is necessary for reaching a higher accuracy, even if the backbone has been well pre-trained, which matches the recent claim by Arnold and Sha (2021).
We further evaluate unicorn-MAML on CUB Wah et al. (2011), where unicorn-MAML also achieves promising improvements. (See the supplementary material.)
We provide a series of analyses and observations of MAML for few-shot classification, in terms of hyper-parameter tuning and its sensitivity to the inherent permutations in few-shot task generation. With a large number of inner loop gradient steps (in both meta-training and meta-testing), MAML can achieve comparable results to many existing algorithms. By further incorporating a forced permutation-invariant treatment, we present unicorn-MAML, which achieves state-of-the-art accuracy on five-shot tasks, without introducing any extra sub-networks. We hope that unicorn-MAML could serve as a strong baseline for future work in few-shot classification.
This research is supported by the National Key R&D Program of China (2020AAA0109401), NSFC (61773198, 61921006, 62006112), the NSFC-NRF Joint Research Project under Grant 61861146001, the Collaborative Innovation Center of Novel Software Technology and Industrialization, the NSF of Jiangsu Province (BK20200313), and the OSU GI Development funds. We are thankful for the generous support of computational resources by the Ohio Supercomputer Center and AWS Cloud Credits for Research. We thank Sébastien M.R. Arnold (USC) for helpful discussions.
- Associative alignment for few-shot image classification. In ECCV, pp. 18–35. Cited by: Table F.
- Continuous adaptation via meta-learning in nonstationary and competitive environments. In ICLR, Cited by: §1.
- Learning to learn by gradient descent by gradient descent. In NIPS, pp. 3981–3989. Cited by: §1, §2.
- How to train your MAML. In ICLR, Cited by: §1, §2, §3.2, §4.
- Embedding adaptation is still needed for few-shot learning. CoRR abs/2104.07255. Cited by: §1, §7.
- Learning algorithms for active learning. In ICML, Cited by: §1.
- MetaReg: towards domain generalization using meta-regularization. In NeurIPS, Cited by: §1.
- A model of inductive bias learning. JAIR 12, pp. 149–198. Cited by: §2.
- Neural optimizer search with reinforcement learning. In ICML, Cited by: §1.
- Meta-learning with negative learning rates. In ICLR, Cited by: §2.
- Meta-learning with differentiable closed-form solvers. In ICLR, Cited by: §2.
- Bagging predictors. Machine learning. Cited by: §1, §6.
- Revisiting meta-learning as supervised learning. CoRR abs/2002.00573. Cited by: §1, §3.3.
- A closer look at few-shot classification. In ICLR, Cited by: Table F, §1.
- MATE: plugging in model awareness to task embedding for meta learning. In NeurIPS, Cited by: Table 4, Table 5.
- Learning to adapt: meta-learning for model-based control. CoRR abs/1803.11347. Cited by: §2.
- Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. In ICLR, Cited by: §1.
- Logical vision: one-shot meta-interpretive learning from real images. In ILP, pp. 46–62. Cited by: §2.
- Learning to learn around a common mean. In NeurIPS, pp. 10190–10200. Cited by: §2.
- Imagenet: a large-scale hierarchical image database. In CVPR, Cited by: §2.
- Ensemble methods in machine learning. In International workshop on multiple classifier systems, Cited by: §1, §6.
- An image is worth 16x16 words: transformers for image recognition at scale. In ICLR, Cited by: §2.
- One-shot imitation learning. In NIPS, Cited by: §1.
- RL$^2$: fast reinforcement learning via slow reinforcement learning. CoRR abs/1611.02779. Cited by: §1.
- Towards a neural statistician. In ICLR, Cited by: §1.
- Neural architecture search: A survey. JMLR 20, pp. 55:1–55:21. Cited by: §2.
- Learning to teach. In ICLR, Cited by: §2.
- Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pp. 1126–1135. Cited by: §1, §1, §2, §3.2, §3.2, §4, §4.
- Meta-learning and universality: deep representations and gradient descent can approximate any learning algorithm. In ICLR, Cited by: §1, §2, footnote 1.
- Probabilistic model-agnostic meta-learning. In NeurIPS, pp. 9537–9548. Cited by: §2.
- One-shot visual imitation learning via meta-learning. In CoRL, Cited by: §1.
- Learning to learn with gradients. Ph.D. Thesis, UC Berkeley. Cited by: §1.
- A bridge between hyperparameter optimization and larning-to-learn. CoRR abs/1712.06283. Cited by: §2.
- Meta learning shared hierarchies. In ICLR, Cited by: §1.
- Supervising unsupervised learning. In NeurIPS, pp. 4996–5006. Cited by: §1.
- DropBlock: A regularization method for convolutional networks. In NeurIPS, pp. 10750–10760. Cited by: §3.3.
- Meta-learning for low-resource neural machine translation. In EMNLP, pp. 3622–3631. Cited by: §2.
- MS-celeb-1m: A dataset and benchmark for large-scale face recognition. In ECCV, pp. 87–102. Cited by: §2.
- Low-shot visual recognition by shrinking and hallucinating features. In ICCV, pp. 3037–3046. Cited by: §2.
- Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §1, §2, §3.3.
- Densely connected convolutional networks. In CVPR, pp. 2261–2269. Cited by: §2.
- Natural language to structured query generation via meta-learning. In ACL, pp. 732–738. Cited by: §2.
- Active learning by querying informative and representative examples. TPAMI 36 (10), pp. 1936–1949. Cited by: §2.
- Learning visual features from large weakly supervised data. In ECCV, pp. 67–84. Cited by: §2.
- Learning to remember rare events. In ICLR, Cited by: §1.
- Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, Vol. 2. Cited by: §2.
- Imagenet classification with deep convolutional neural networks. In NIPS, pp. 1106–1114. Cited by: §2.
- Meta-learning with differentiable convex optimization. In CVPR, pp. 10657–10665. Cited by: Appendix E, §1, §1, §2, §3.3, §3.3, Table 4, Table 5.
- Metalearning: a survey of trends and technologies. Artificial intelligence review 44 (1), pp. 117–130. Cited by: §1.
- Boosting few-shot learning with adaptive margin loss. In CVPR, pp. 12573–12581. Cited by: Table 4.
- Learning to generalize: meta-learning for domain generalization. In AAAI, pp. 3490–3497. Cited by: §1.
- Adversarial feature hallucination networks for few-shot learning. In CVPR, pp. 13470–13479. Cited by: Table F.
- Learning to optimize. In ICLR, Cited by: §1.
- Feature-critic networks for heterogeneous domain generalization. In ICML, pp. 3915–3924. Cited by: §1.
- Towards making unlabeled data never hurt. TPAMI 37 (1), pp. 175–188. Cited by: §2.
- Negative margin matters: understanding margin in few-shot classification. In ECCV, pp. 438–455. Cited by: Table F.
- Exploring the limits of weakly supervised pretraining. In ECCV, pp. 181–196. Cited by: §2.
- The benefit of multitask representation learning. JMLR 17, pp. 81:1–81:32. Cited by: §2.
- Transfer bounds for linear feature learning. Machine Learning 75 (3), pp. 327–350. Cited by: §2.
- Meta-learning update rules for unsupervised representation learning. In ICLR, Cited by: §1.
- A simple neural attentive meta-learner. In ICLR, Cited by: §1.
- Few-shot adversarial domain adaptation. In NIPS, pp. 6673–6683. Cited by: §1.
- Meta networks. In ICML, Cited by: §1.
- On first-order meta-learning algorithms. CoRR abs/1803.02999. Cited by: §1, §2, §3.2.
- Meta-learning transferable active learning policies by deep reinforcement learning. CoRR abs/1806.04798. Cited by: §1.
- Tunability: importance of hyperparameters of machine learning algorithms. JMLR 20, pp. 53:1–53:32. Cited by: §2.
- Few-shot image recognition by predicting parameters from activations. In CVPR, pp. 7229–7238. Cited by: §1, §3.3.
- Meta-learning with implicit gradients. In NeurIPS, pp. 113–124. Cited by: §1, §2.
- Learning to compose domain-specific transformations for data augmentation. In NIPS, pp. 3236–3246. Cited by: §2.
- Optimization as a model for few-shot learning. In ICLR, Cited by: §1, §2, §3.3.
- Meta-learning for batch mode active learning. In ICLR Workshop, Cited by: §1.
- Few-shot autoregressive density estimation: towards learning to learn distributions. In ICLR, Cited by: §1.
- Meta-learning for semi-supervised few-shot classification. In ICLR, Cited by: §1, §3.3.
- Learning to reweight examples for robust deep learning. In ICML, pp. 4331–4340. Cited by: §2.
- Fast and flexible multi-task classification using conditional neural adaptive processes. In NeurIPS, pp. 7957–7968. Cited by: Appendix E, §2.
- Learning to learn without forgetting by maximizing transfer and minimizing interference. In ICLR, Cited by: §1.
- Been there, done that: meta-learning with episodic recall. In ICML, pp. 4351–4360. Cited by: §1.
- ImageNet large scale visual recognition challenge. IJCV 115 (3), pp. 211–252. Cited by: §2.
- Meta-learning with latent embedding optimization. In ICLR, Cited by: §1, §1, §2, §3.3, §3.3.
- Meta-learning with memory-augmented neural networks. In ICML, pp. 1842–1850. Cited by: §1.
- Learning to multi-task by active sampling. In ICLR, Cited by: §1.
- Attentive recurrent comparators. In ICML, pp. 3173–3181. Cited by: §2.
- Adaptive subspaces for few-shot learning. In CVPR, pp. 4135–4144. Cited by: Table 4, Table 5.
- Very deep convolutional networks for large-scale image recognition. In ICLR, Cited by: §2.
- Prototypical networks for few-shot learning. In NIPS, pp. 4080–4090. Cited by: Table F, §1, §1, §2, §3.3, Table 5.
- Dropout: a simple way to prevent neural networks from overfitting. JMLR 15 (1), pp. 1929–1958. Cited by: Appendix D, §6.
- The importance of sampling in meta-reinforcement learning. In NeurIPS, pp. 9300–9310. Cited by: §1.
- Meta-transfer learning through hard tasks. CoRR abs/1910.03648. Cited by: Table 4, Table 5.
- Learning to compare: relation network for few-shot learning. In CVPR, pp. 1199–1208. Cited by: §1.
- Going deeper with convolutions. In CVPR, pp. 1–9. Cited by: §2.
- YFCC100M: the new data in multimedia research. Communications of ACM 59 (2), pp. 64–73. Cited by: §2.
- Learning to learn. Springer Science & Business Media. Cited by: §1.
- Rethinking few-shot image classification: A good embedding is all you need?. In ECCV, pp. 266–282. Cited by: Table 4, Table 5.
- Meta-dataset: A dataset of datasets for learning to learn from few examples. In ICLR, Cited by: Appendix E, §1, §1, §2, §2, Table 4, Table 5.
- Meta-learning: a survey. CoRR abs/1810.03548. Cited by: §1.
- A meta-learning perspective on cold-start recommendations for items. In NIPS, pp. 6907–6917. Cited by: §2.
- A perspective view and survey of meta-learning. Artificial Intelligence Review 18 (2), pp. 77–95. Cited by: §1, §2.
- Matching networks for one shot learning. In NIPS, pp. 3630–3638. Cited by: Table F, §1, §1, §1, §2, §3.1, §3.3, §3.3.
- Multimodal model-agnostic meta-learning via task-aware modulation. In NeurIPS, pp. 1–12. Cited by: Appendix E, §2.
- The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology. Cited by: 3rd item, Appendix C, §7.
- Learning to reinforcement learn. In CogSci, Cited by: §1.
- Multi-attention network for one shot learning. In CVPR, pp. 6212–6220. Cited by: §2.
- Dataset distillation. CoRR abs/1811.10959. Cited by: §2.
- Simpleshot: revisiting nearest-neighbor classification for few-shot learning. CoRR abs/1911.04623. Cited by: §1.
- Low-shot learning from imaginary data. In CVPR, pp. 7278–7286. Cited by: §1, §2.
- Learning to learn: model regression networks for easy small sample learning. In ECCV, pp. 616–634. Cited by: §1.
- Learning to model the tail. In NIPS, pp. 7032–7042. Cited by: §2.
- Learned optimizers that scale and generalize. In ICML, pp. 3751–3760. Cited by: §1.
- Hierarchically structured meta-learning. In ICML, pp. 7045–7054. Cited by: Appendix E, §2.
- Few-shot learning via embedding adaptation with set-to-set functions. In CVPR, pp. 8805–8814. Cited by: Appendix C, §1, §1, §1, §3.3, §3.3, §3.3.
- Few-shot learning with adaptively initialized task optimizer: a practical meta-learning approach. Machine Learning 109 (3), pp. 643–664. Cited by: Appendix E, §1, §2, Table 4, Table 5.
- Transfer learning via learning to transfer. In ICML, pp. 5072–5081. Cited by: §1.
- One-shot imitation from observing humans via domain-adaptive meta-learning. In Robotics: Science and Systems, Cited by: §1, §2.
- DeepEMD: few-shot image classification with differentiable earth mover’s distance and structured classifiers. In CVPR, pp. 12200–12210. Cited by: Table F, §3.3, Table 4, Table 5.
- MetaGAN: an adversarial approach to few-shot learning. In NeurIPS, pp. 2371–2380. Cited by: §1.
- Learning to multitask. In NeurIPS, pp. 5776–5787. Cited by: §1.
- Places: a 10 million image database for scene recognition. TPAMI 40 (6), pp. 1452–1464. Cited by: §2.
- Ensemble methods: foundations and algorithms. Chapman and Hall/CRC. Cited by: §1, §6.
Appendix A Permutations of Class Label Assignments
We provide more details and discussions on the permutation issue in the class label assignment. As illustrated in Figure G, few-shot tasks of the same set of semantic classes (e.g., “unicorn”, “bee”, etc.) can be associated with different label assignments (i.e., different orderings of the class indices) and are thus paired with the learned initialization of MAML differently.
In section 5 and Table 5, we study how the permutations affect the meta-testing accuracy (after inner loop optimization on the support set). (For five-way tasks, there are $5! = 120$ permutations.) We see a high variance among the permutations. We note that the inner loop optimization updates not only the linear classifiers $\{\mathbf{w}_c\}_{c=1}^{N}$ but also the parameters $\phi$ of the feature extractor. Different permutations can therefore lead to different feature extractors.
Here, we further sample a five-way one-shot meta-testing task and study the change of accuracy along with the inner loop updates (using a MAML trained with a fixed number of inner loop updates). Specifically, we plot both the support set and query set accuracy for each permutation. As shown in Figure H, there is a high variance in query set accuracy among permutations after inner loop optimization. This is, however, not the case for the support set. (Only three curves appear for the support set because there are only five examples, and all the permutations reach 100% support set accuracy within five inner loop steps.) Interestingly, for all the permutations, the initialized accuracy (i.e., before inner loop optimization) is 20%. After an investigation, we find that the meta-learned $\{\mathbf{w}_c\}_{c=1}^{N}$ (initialization) is dominated by one of them; i.e., all the support or query examples are classified into one class. While this may not always be the case for other few-shot tasks or if we re-train MAML, for the task we sampled, it explains why we obtain 20% for all permutations. We note that, even with an initial accuracy of 20%, the learned initialization can quickly be updated to attain high classification accuracy.
We further compare the change of support and query set accuracy along with the inner loop optimization. We find that, while both accuracies increase, the support set accuracy converges quickly and has a smaller variance among permutations, so it is difficult to use it to determine which permutation leads to the highest query set accuracy. This makes sense since the support set is few-shot: its accuracy cannot robustly reflect the query set accuracy. This explains why the methods studied in Table 1 cannot determine the best permutation for the query set.
We further study whether the poor initialization accuracy using the learned initialization also occurs on meta-training tasks, which are sampled from the base classes seen during the meta-training phase. Figure I shows the results: even on meta-training tasks, the initialization gives almost random accuracy. We provide a simple mathematical explanation as follows. Let us assume we have a five-way one-shot task with five semantic classes and the best permutation follows the ascending order; i.e., class indices $1, 2, 3, 4, 5$ for the five classes in order. Let us assume this permutation gives a 100% initialized support set accuracy. Since there are in total $5! = 120$ possible permutations, there will be 10 of them with 60% accuracy (i.e., by switching the indices of two classes), 20 of them with 40% accuracy (i.e., by shuffling the indices of three classes such that none keeps its original index), 45 of them with 20% accuracy, and 44 of them with 0% accuracy. Taking an average over these permutations gives a 20% accuracy. In other words, even if one of the permutations performs well, on average the accuracy will be close to random.
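The counting argument can be checked by brute force. The following is a minimal sketch, assuming (as in the explanation above) one support example per class and that an example is classified correctly exactly when its class keeps the index it has under the best permutation. The function name `accuracy_histogram` is hypothetical and used only for illustration.

```python
from collections import Counter
from itertools import permutations


def accuracy_histogram(n_way=5):
    """Count permutations by the number of classes that keep their best index."""
    best = tuple(range(n_way))  # assumed best assignment: ascending order
    counts = Counter()
    for perm in permutations(best):
        # A support example is assumed correct iff its class index is unchanged.
        n_correct = sum(1 for b, p in zip(best, perm) if b == p)
        counts[n_correct] += 1
    return counts


counts = accuracy_histogram(5)
total = sum(counts.values())
# Average support set accuracy over all permutations (fraction in [0, 1]).
mean_accuracy = sum(k * v for k, v in counts.items()) / (5 * total)
```

Enumerating all 120 permutations gives 1, 10, 20, 45, and 44 permutations with 5, 3, 2, 1, and 0 correctly classified support examples, respectively, so the average accuracy is 120/600 = 20%.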
Appendix B unicorn-MAML
We provide some further details of unicorn-MAML. The meta-parameters learned by unicorn-MAML are $\phi$ for feature extraction and a single linear classifier $\mathbf{w}$. In the inner loop optimization, $\mathbf{w}$ is first duplicated into $\{\mathbf{w}_c = \mathbf{w}\}_{c=1}^{N}$, which then undergo the same inner loop optimization process as vanilla MAML. Let us denote the resulting model by $\theta'$. To perform the outer loop optimization during meta-training, we need to collect the gradients derived from the query set of a task. Let us denote by $\nabla_{\mathbf{w}_c}$ the gradient w.r.t. the initialization $\mathbf{w}_c$ (please refer to subsection 3.2). Since $\{\mathbf{w}_c\}_{c=1}^{N}$ are duplicated from $\mathbf{w}$, we obtain the gradient w.r.t. the single classifier $\mathbf{w}$ by $\nabla_{\mathbf{w}} = \sum_{c=1}^{N} \nabla_{\mathbf{w}_c}$.
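The gradient w.r.t. the single classifier follows from the chain rule. As a short worked derivation, under the assumed notation that $\mathcal{L}$ is the query set loss viewed as a function of the $N$ duplicated initializations $\mathbf{w}_1, \dots, \mathbf{w}_N$, with $\mathbf{w}_c = \mathbf{w}$ for every $c$ so that each Jacobian $\partial \mathbf{w}_c / \partial \mathbf{w}$ is the identity matrix $\mathbf{I}$:

```latex
\nabla_{\mathbf{w}} \mathcal{L}
  = \sum_{c=1}^{N} \left( \frac{\partial \mathbf{w}_c}{\partial \mathbf{w}} \right)^{\top} \nabla_{\mathbf{w}_c} \mathcal{L}
  = \sum_{c=1}^{N} \mathbf{I}^{\top} \nabla_{\mathbf{w}_c} \mathcal{L}
  = \sum_{c=1}^{N} \nabla_{\mathbf{w}_c} \mathcal{L}.
```

In other words, the per-class gradients that vanilla MAML would apply to each classifier separately are accumulated into one update of the shared vector.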
Appendix C Experimental Results on the CUB dataset
We further evaluate unicorn-MAML on the CUB dataset Wah et al. (2011), following the split proposed by Ye et al. (2020a): there are 100/50/50 meta-training/evaluation/testing classes. All images are resized to a common resolution. Table F shows the results: unicorn-MAML outperforms the existing methods.
| Method | 1-shot | 5-shot |
|---|---|---|
| MatchNet Vinyals et al. (2016) | 66.09 ± 0.92 | 82.50 ± 0.58 |
| ProtoNet Snell et al. (2017) | 71.87 ± 0.85 | 85.08 ± 0.57 |
| DeepEMD Zhang et al. (2020) | 75.65 ± 0.83 | 88.69 ± 0.50 |
| Baseline++ Chen et al. (2019) | 67.02 ± 0.90 | 83.58 ± 0.54 |
| AFHN Li et al. (2020b) | 70.53 ± 1.01 | 83.95 ± 0.63 |
| Neg-Cosine Liu et al. (2020) | 72.66 ± 0.85 | 89.40 ± 0.43 |
| Align Afrasiyabi et al. (2020) | 74.22 ± 1.09 | 88.65 ± 0.55 |
| MAML (Our reimplementation) | 77.67 ± 0.20 | 90.35 ± 0.16 |
Appendix D Additional Explanations of Our Studied Methods
We provide some more explanations of the ensemble and forced permutation invariance methods introduced in section 6. For the ensemble method, given a few-shot task, we can permute $\{\mathbf{w}_c\}_{c=1}^{N}$ to pair them differently with the task. We can then perform a separate inner loop optimization for each permutation to obtain a set of five-class classifiers, over which we can perform ensemble. In the main text, we average the posterior probabilities of these five-class classifiers to make the final predictions.
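The ensemble over permutations can be illustrated with a toy example. The sketch below is not the implementation used in the experiments: it assumes fixed feature vectors (so only the linear classifiers are adapted, whereas MAML also updates the feature extractor), a three-way one-shot task for brevity, and made-up numbers. All function names (`inner_loop`, `ensemble_predict`, etc.) are hypothetical.

```python
import itertools
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def logits(W, x):
    # One logit per class: inner product of each class vector with the feature x.
    return [sum(wj * xj for wj, xj in zip(w, x)) for w in W]


def inner_loop(W, support, lr=0.5, steps=5):
    # Plain gradient descent on the cross-entropy loss over the support set.
    # Assumption: features are fixed, so only the linear classifiers are updated.
    W = [list(w) for w in W]
    for _ in range(steps):
        grads = [[0.0] * len(w) for w in W]
        for x, y in support:
            p = softmax(logits(W, x))
            for c in range(len(W)):
                coeff = (p[c] - (1.0 if c == y else 0.0)) / len(support)
                for j, xj in enumerate(x):
                    grads[c][j] += coeff * xj
        W = [[wj - lr * gj for wj, gj in zip(w, g)] for w, g in zip(W, grads)]
    return W


def ensemble_predict(W_init, support, query_x):
    n = len(W_init)
    perms = list(itertools.permutations(range(n)))
    avg = [[0.0] * n for _ in query_x]
    for perm in perms:
        # Pair the learned class vectors with the task's classes in a new order.
        W_perm = [W_init[perm[c]] for c in range(n)]
        W_adapted = inner_loop(W_perm, support)
        for i, x in enumerate(query_x):
            p = softmax(logits(W_adapted, x))
            for c in range(n):
                avg[i][c] += p[c] / len(perms)  # average posterior probabilities
    return avg


# Toy three-way one-shot task with two-dimensional features (made-up numbers).
W_init = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.0]]
support = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([-1.0, -1.0], 2)]
query_x = [[0.9, 0.1], [0.1, 0.8]]
probs = ensemble_predict(W_init, support, query_x)
```

Because the average runs over every permutation, the ensembled posterior does not depend on the order in which the initial class vectors are stored. The cost is one inner loop optimization per permutation, which grows factorially with the number of classes.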
Since the permutation affects the meta-training phase as well, we can interpret the meta-training phase as follows. Every time we sample a few-shot task, we also sample a permutation $\pi$ to re-label the classes. (We note that this is implicitly done when few-shot tasks are sampled.) We then take the re-labeled task to optimize in the inner loop. That is, in meta-training, the objective function in Equation 2 can indeed be re-written as
where $p(\pi)$ is a uniform distribution over all possible permutations. Equation D can be equivalently re-written as
where the initialization of the linear classifiers is permuted by $\pi$, and the updated model is the one obtained from this permuted initialization. This additional sampling process of $\pi$ is reminiscent of dropout Srivastava et al. (2014), which randomly masks out a neural network’s neurons or edges to prevent an over-parameterized neural network from over-fitting. During testing, dropout takes expectation over the masks. We also investigate a similar idea, by taking expectation (i.e., average) over the permutations on the linear classifiers, which results in a new initialization during the meta-testing phase: each $\mathbf{w}_c$ is replaced by $\frac{1}{N}\sum_{c'=1}^{N}\mathbf{w}_{c'}$.
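To see why averaging over permutations yields one shared vector for every class, the expectation can be computed explicitly on a small example. This is a minimal sketch with made-up numbers and a hypothetical helper `average_over_permutations`. It assumes a uniform distribution over permutations, as stated above.

```python
import itertools


def average_over_permutations(W):
    """Average the permuted classifier initializations over all permutations."""
    n, d = len(W), len(W[0])
    perms = list(itertools.permutations(range(n)))
    avg = [[0.0] * d for _ in range(n)]
    for perm in perms:
        for c in range(n):
            for j in range(d):
                # Under this permutation, class position c holds the vector W[perm[c]].
                avg[c][j] += W[perm[c]][j] / len(perms)
    return avg


# Three class vectors in two dimensions (illustrative values only).
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W_avg = average_over_permutations(W)
# Plain mean of the class vectors, for comparison.
mean = [sum(w[j] for w in W) / len(W) for j in range(len(W[0]))]
```

Every class position receives the mean of all class vectors (up to floating-point error), so the resulting initialization no longer depends on the order of the class labels.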
Appendix E Additional Comparison to Related Works
As mentioned in section 2, there are several follow-up works that improve MAML (not specifically for the permutation issue). Requeima et al. (2019); Vuorio et al. (2019); Yao et al. (2019); Ye et al. (2020b) enable task-specific initialization with additional task embedding sub-networks. However, since they take an average of the feature embeddings (over classes) to represent a task, their methods cannot resolve the permutation issue. MetaOptNet Lee et al. (2019) performs inner loop optimization only on the linear classifiers $\{\mathbf{w}_c\}_{c=1}^{N}$ (till convergence), making it a convex problem that is not sensitive to the initialization and hence to the permutations. This method, however, has a high computational burden and needs careful hyper-parameter tuning for the additionally introduced regularizers. Triantafillou et al. (2020); Ye et al. (2020b) match the classifier with the prototypes (i.e., the averaged feature embedding per class), which could be permutation-invariant, but cannot achieve accuracy as high as our unicorn-MAML (except for Ye et al. (2020b) on MiniImageNet one-shot tasks).