Unraveling Meta-Learning: Understanding Feature Representations for Few-Shot Tasks

02/17/2020 ∙ by Micah Goldblum, et al. ∙ University of Maryland

Meta-learning algorithms produce feature extractors which achieve state-of-the-art performance on few-shot classification. While the literature is rich with meta-learning methods, little is known about why the resulting feature extractors perform so well. We develop a better understanding of the underlying mechanics of meta-learning and the difference between models trained using meta-learning and models which are trained classically. In doing so, we develop several hypotheses for why meta-learned models perform better. In addition to visualizations, we design several regularizers inspired by our hypotheses which improve performance on few-shot classification.




1 Introduction

Training neural networks from scratch requires large amounts of labeled data, making it impractical in many settings. When data is expensive or time consuming to obtain, training from scratch may be cost prohibitive Altae-Tran et al. (2017). In other scenarios, models must adapt efficiently to changing environments before enough time has passed to amass a large and diverse data corpus Nagabandi et al. (2018). In both of these cases, massive state-of-the-art networks would overfit to the tiny training sets available. To overcome this problem, practitioners pre-train on large auxiliary datasets and then fine-tune the resulting models on the target task. For example, ImageNet pre-training of large ResNets has become an industry standard for transfer learning Kornblith et al. (2019). Unfortunately, transfer learning from classically trained models often yields sub-par performance in the extremely data-scarce regime, or breaks down entirely when only a few data samples are available in the target domain.

Recently, a number of few-shot benchmarks have been rapidly improved using meta-learning methods Lee et al. (2019); Song et al. (2019). Unlike classical transfer learning, which uses a base model pre-trained on a different task, meta-learning algorithms produce a base network that is specifically designed for quick adaptation to new tasks using few-shot data. Furthermore, meta-learning is still effective when applied to small, lightweight base models that can be fine-tuned with relatively few computations.

The ability of meta-learned networks to rapidly adapt to new domains suggests that the feature representations learned by meta-learning must be fundamentally different than feature representations learned through conventional training. Because of the good performance that meta-learning offers, many researchers have been content to use these features without considering how or why they differ from conventional representations. As a result, little is known about the fundamental differences between meta-learned feature extractors and those which result from classical training. Training routines are often treated like a black box in which high performance is celebrated, but a deeper understanding of the phenomenon remains elusive. To further complicate matters, a myriad of meta-learning strategies exist that may exploit different mechanisms.

In this paper, we delve into the differences between features learned by meta-learning and classical training. We explore and visualize the behaviors of different methods and identify two different mechanisms by which meta-learned representations can improve few-shot learning. In the case of meta-learning strategies that fine-tune only the last (classification) layer of a network, such as MetaOptNet Lee et al. (2019) and R2-D2 Bertinetto et al. (2018), we find that meta-learning tends to cluster object classes more tightly in feature space. As a result, the classification boundaries learned during fine-tuning are less sensitive to the choice of few-shot samples. In the second case, we hypothesize that meta-learning strategies that use end-to-end fine-tuning, such as MAML Finn et al. (2017) and Reptile Nichol and Schulman (2018), search for meta-parameters that lie close in weight space to a wide range of task-specific minima. In this case, a small number of SGD steps can transport the parameters to a good minimum for a specific task.

Inspired by these observations, we propose simple regularizers that improve feature space clustering and parameter-space proximity. These regularizers boost few-shot performance without the dramatic increase in optimization cost that comes from conventional meta-learning.

2 Problem Setting

2.1 The Meta-Learning Framework

The objective of meta-learning algorithms is to produce a network that quickly adapts to new classes using little data. Concretely stated, meta-learning algorithms find parameters that can be fine-tuned in few optimization steps and on few data points in order to achieve good generalization on a task $\mathcal{T}_i$, consisting of a small number of data samples from a distribution and label space that was not seen during training. The task is characterized as n-way, k-shot if the meta-learning algorithm must adapt to classify data from $\mathcal{T}_i$ after seeing $k$ examples from each of the $n$ classes in $\mathcal{T}_i$.

Meta-learning schemes typically rely on bi-level optimization problems with an inner loop and an outer loop. An iteration of the outer loop involves first sampling a “task,” which comprises two sets of labeled data: the support data, $\mathcal{T}_i^s$, and the query data, $\mathcal{T}_i^q$. Then, in the inner loop, the model being trained is fine-tuned using the support data. Finally, the routine moves back to the outer loop, where the meta-learning algorithm minimizes loss on the query data with respect to the pre-fine-tuned weights. This minimization is executed by differentiating through the inner loop computation and updating the network parameters to make the inner loop fine-tuning as effective as possible. Note that, in contrast to standard transfer learning (which uses classical training and simple first-order gradient information to update parameters), meta-learning algorithms differentiate through the entire fine-tuning loop. A formal description of this process can be found in Algorithm 1, as seen in Goldblum et al. (2019).
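The task-sampling step can be illustrated with a short sketch. It assumes a dictionary that maps each class to a list of examples; the function name `sample_episode` and the `n_query` parameter are hypothetical names chosen for illustration, not part of any particular library or of the original work.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Sample one n-way, k-shot task as (support, query) lists of (x, label)."""
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Draw k_shot + n_query distinct examples so support and query never overlap.
        examples = rng.sample(data_by_class[cls], k_shot + n_query)
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query

# Toy pool: 6 classes with 10 integer "images" each.
pool = {c: [100 * c + i for i in range(10)] for c in range(6)}
support, query = sample_episode(pool, n_way=5, k_shot=1, n_query=3,
                                rng=random.Random(0))
```

Labels are re-indexed from 0 within each episode, so the classifier head only ever sees `n_way` outputs regardless of how many classes the pool contains.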

2.2 Meta-Learning Algorithms

A variety of meta-learning algorithms exist, mostly differing in how they fine-tune on support data during the inner loop. Early meta-learning approaches, such as MAML, update all network parameters using gradient descent during fine-tuning Finn et al. (2017). Because differentiating through the inner loop is memory and computationally intensive, the fine-tuning process consists of only a few (sometimes just 1) SGD steps.

Reptile, which functions as a zeroth-order approximation to MAML, avoids unrolling the inner loop and differentiating through the SGD steps. Instead, after fine-tuning on support data, Reptile moves the central parameter vector in the direction of the fine-tuned parameters during the outer loop Nichol and Schulman (2018). In many cases, Reptile achieves better performance than MAML without having to differentiate through the fine-tuning process.

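  Require: Base model, $F_\theta$, fine-tuning algorithm, $A$, learning rate, $\gamma$, and distribution over tasks, $p(\mathcal{T})$.
  Initialize $\theta$, the weights of $F$;
  while not done do
     Sample batch of tasks, $\{\mathcal{T}_i\}_{i=1}^n$, where $\mathcal{T}_i \sim p(\mathcal{T})$ and $\mathcal{T}_i = (\mathcal{T}_i^s, \mathcal{T}_i^q)$.
     for $i = 1, \dots, n$ do
        Fine-tune model on $\mathcal{T}_i^s$ (inner loop). New network parameters are written $\theta_i = A(\theta, \mathcal{T}_i^s)$.
        Compute gradient $g_i = \nabla_\theta \mathcal{L}(F_{\theta_i}, \mathcal{T}_i^q)$
     end for
     Update base model parameters (outer loop): $\theta \leftarrow \theta - \frac{\gamma}{n} \sum_i g_i$
  end while
Algorithm 1 The meta-learning framework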
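As a minimal sketch of the bi-level loop, consider a scalar model with a toy quadratic loss, where the derivative through a single inner SGD step can be written by hand. The loss, the function names, and the hyperparameter values are assumptions made for illustration; real meta-learners differentiate through the inner loop with automatic differentiation.

```python
def inner_step(theta, support_target, inner_lr):
    """One SGD step on the support loss 0.5 * (theta - support_target) ** 2."""
    return theta - inner_lr * (theta - support_target)

def meta_gradient(theta, task, inner_lr):
    """Gradient of the query loss w.r.t. the pre-fine-tuning weight theta."""
    support_target, query_target = task
    theta_i = inner_step(theta, support_target, inner_lr)
    # Chain rule through the inner step: d(theta_i)/d(theta) = 1 - inner_lr.
    return (theta_i - query_target) * (1.0 - inner_lr)

def meta_train(theta, tasks, inner_lr=0.5, outer_lr=0.5, steps=200):
    for _ in range(steps):
        grads = [meta_gradient(theta, task, inner_lr) for task in tasks]
        theta -= outer_lr * sum(grads) / len(grads)  # outer-loop update
    return theta

tasks = [(1.0, 1.2), (3.0, 2.8), (5.0, 5.0)]  # (support target, query target)
theta_star = meta_train(0.0, tasks)
```

The factor `1.0 - inner_lr` is what separates this from a first-order method: it carries the sensitivity of the fine-tuned weight to the starting weight back into the outer update.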

A newer class of algorithms freezes the feature extraction layers during the inner loop; only the linear classifier layer is trained during fine-tuning. Such models include R2-D2 and MetaOptNet Bertinetto et al. (2018); Lee et al. (2019). The advantage of this approach is that the fine-tuning problem is now a convex optimization problem. Unlike MAML, which simulates the fine-tuning process using only a few gradient updates, last-layer meta-learning methods can use differentiable optimizers to exactly minimize the fine-tuning objective and then differentiate the solution with respect to feature inputs. Moreover, differentiating through these solvers is computationally cheap compared to MAML’s differentiation through SGD steps on the whole network. While MetaOptNet relies on an SVM loss, R2-D2 simplifies the process even further by using a quadratic objective with a closed-form solution. R2-D2 and MetaOptNet achieve stronger performance than MAML and are able to harness larger architectures without overfitting.

Model          SVM            RR             ProtoNet       MAML
MetaOptNet-M   62.64 ± 0.31   60.50 ± 0.30   51.99 ± 0.33   55.77 ± 0.32
MetaOptNet-C   56.18 ± 0.31   55.09 ± 0.30   41.89 ± 0.32   46.39 ± 0.28
R2-D2-M        51.80 ± 0.20   55.89 ± 0.31   47.89 ± 0.32   53.72 ± 0.33
R2-D2-C        48.39 ± 0.29   48.29 ± 0.29   28.77 ± 0.24   44.31 ± 0.28

Table 1: Comparison of meta-learning and classical transfer learning models with various fine-tuning algorithms on 1-shot mini-ImageNet (accuracy in %). “MetaOptNet-M” and “MetaOptNet-C” denote models with MetaOptNet backbone trained with MetaOptNet-SVM and classical training. Similarly, “R2-D2-M” and “R2-D2-C” denote models with R2-D2 backbone trained with ridge regression (RR) and classical training. Column headers denote the fine-tuning algorithm used for evaluation, and the radius of confidence intervals is one standard error.

Another last-layer method, ProtoNet, classifies examples in its inner loop by the proximity of their features to class centroids, a metric learning approach Snell et al. (2017). Again, the feature extractor’s parameters are frozen in the inner loop and used to create class centroids, which then determine the network’s class boundaries. Because calculating class centroids is mathematically simple, the algorithm is able to efficiently backpropagate through this calculation to adjust the feature extractor.
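The centroid rule can be sketched in a few lines. The example assumes features have already been extracted and uses squared Euclidean distance; the function names are hypothetical, and the feature values are made up for illustration.

```python
def centroid(vectors):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def prototype_classify(support, query_feature):
    """support maps class label -> list of feature vectors (already extracted)."""
    prototypes = {label: centroid(feats) for label, feats in support.items()}
    # Predict the class whose centroid is nearest in squared Euclidean distance.
    return min(prototypes,
               key=lambda label: squared_distance(prototypes[label], query_feature))

support = {"cat": [[0.0, 1.0], [0.2, 0.8]], "dog": [[1.0, 0.0], [0.8, 0.2]]}
prediction = prototype_classify(support, [0.1, 0.7])
```

Because the centroid is just an average, gradients with respect to the support features are trivial to compute, which is why backpropagating through this inner loop is cheap.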

In this work, “classically trained” models are trained using SGD on all classes simultaneously, and the feature extractors are adapted to new tasks using the same fine-tuning procedures as the meta-learned models for fair comparison. This approach represents the industry-standard method of transfer learning using pre-trained feature extractors.

2.3 Few-Shot Datasets

Several datasets have been developed for few-shot learning. We focus our attention on two datasets: mini-ImageNet and CIFAR-FS. Mini-ImageNet is a pruned and downsized version of the ImageNet classification dataset, consisting of 60,000 84×84 RGB color images from 100 classes Vinyals et al. (2016). These 100 classes are split into 64, 16, and 20 classes for training, validation, and testing sets, respectively. The CIFAR-FS dataset samples images from CIFAR-100 Bertinetto et al. (2018). CIFAR-FS is split in the same way as mini-ImageNet with 60,000 32×32 RGB color images from 100 classes divided into 64, 16, and 20 classes for training, validation, and testing sets, respectively.

2.4 Related Work

In addition to introducing new methods for few-shot learning, recent work has increased our understanding of why some models perform better than others at few-shot tasks. One such exploration performs baseline testing and discovers that network size has a large effect on the success of meta-learning algorithms. Specifically, as network depth increases, the performance of transfer learning approaches that of some meta-learning algorithms Chen et al. (2019).

Yet other work finds that, with a pre-trained model, features generated by data from classes absent from training are entangled, but the logits of the unseen data tend to be clustered Frosst et al. (2019). Meta-learning performance can then be improved by using a transductive regularizer during training, which uses information from unlabeled query data to narrow the hypothesis space for meta-training Dhillon et al. (2019).

On the transfer learning side, recent work has found that feature extractors trained on large complex tasks can be more effectively deployed in a transfer learning setting by distilling knowledge about only important features for the transfer task Wang et al. (2020).

While improvements have been made to meta-learning algorithms and transfer learning approaches to few-shot learning, little work has been done on understanding the underlying mechanisms that cause meta-learning routines to perform better than classically trained models in data scarce settings.

3 Are Meta-Learned Features Fundamentally Better for Few-Shot Learning?

It has been said that meta-learned models “learn to learn” Finn et al. (2017), but one might ask if they instead learn to optimize; their features could simply be well-adapted for the specific fine-tuning optimizers on which they are trained. We dispel the latter notion in this section.

In Table 1, we test the performance of meta-learned feature extractors not only with their own fine-tuning algorithm, but with a variety of fine-tuning algorithms. We find that in all cases, the meta-learned feature extractors outperform classically trained models of the same architecture. See Appendix A.1 for results from additional experiments.

This performance advantage across the board suggests that meta-learned features are qualitatively different than conventional features and fundamentally superior for few-shot learning. The remainder of this work will explore the characteristics of meta-learned models.

4 Class Clustering in Feature Space

Methods such as ProtoNet, MetaOptNet, and R2-D2 fix their feature extractor during fine-tuning. For this reason, they must learn to embed features in a way that enables few-shot classification. For example, MetaOptNet and R2-D2 require that classes are linearly separable in feature space, but mere linear separability is not a sufficient condition for good few-shot performance. The feature representations of randomly sampled few-shot data from a given class must not vary so much as to cause classification performance to be sample-dependent. In this section, we examine clustering in feature space, and we find that meta-learned models separate features differently than classically trained networks.

4.1 Measuring Clustering in Feature Space

We begin by measuring how well different training methods cluster feature representations. To measure feature clustering (FC), we consider the intra-class to inter-class variance ratio

$$\frac{\sigma^2_{\text{within}}}{\sigma^2_{\text{between}}} = \frac{C \sum_{i,j} \|\phi_{i,j} - \mu_i\|_2^2}{N \sum_{i} \|\mu_i - \mu\|_2^2},$$

where $\phi_{i,j}$ is a feature vector in class $i$, $\mu_i$ is the mean of feature vectors in class $i$, $\mu$ is the mean across all feature vectors, $C$ is the number of classes, and $N$ is the number of data points per class. Low values of this fraction correspond to collections of features such that classes are well-separated and a hyperplane formed by choosing a point from each of two classes does not vary dramatically with the choice of samples.
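The ratio can be computed directly from a set of extracted features. The sketch below assumes equally sized classes and normalizes the within-class sum by the number of points per class and the between-class sum by the number of classes; that normalization, the function names, and the toy feature values are assumptions of this example.

```python
def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def variance_ratio(features_by_class):
    """Intra-class to inter-class variance ratio for equally sized classes.

    features_by_class: list of classes, each a list of feature vectors.
    """
    num_classes = len(features_by_class)
    per_class = len(features_by_class[0])
    class_means = [mean_vector(feats) for feats in features_by_class]
    global_mean = mean_vector([f for feats in features_by_class for f in feats])
    within = sum(squared_distance(f, mu)
                 for feats, mu in zip(features_by_class, class_means)
                 for f in feats)
    between = sum(squared_distance(mu, global_mean) for mu in class_means)
    return num_classes * within / (per_class * between)

# Two classes centered at x = 0 and x = 4, with small and large spread.
tight = [[[0.0, 0.1], [0.0, -0.1]], [[4.0, 0.1], [4.0, -0.1]]]
spread = [[[0.0, 2.0], [0.0, -2.0]], [[4.0, 2.0], [4.0, -2.0]]]
```

Both toy sets are linearly separable with the same class centers, yet `variance_ratio(spread)` is several hundred times larger than `variance_ratio(tight)`, which is the distinction the metric is meant to capture.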

In Table 2, we highlight the superior class separation of meta-learning methods. We compute two quantities, $R_{FC}$ and $R_{HV}$, for MetaOptNet and R2-D2 as well as classical transfer learning baselines of the same architectures. These quantities measure the intra-class to inter-class variance ratio and invariance of separating hyperplanes to data sampling. In both cases, lower values correspond to better class separation. On both CIFAR-FS and mini-ImageNet, the meta-learned models attain lower values, indicating that feature space clustering plays a role in the effectiveness of meta-learning.

Training       Dataset         $R_{FC}$   $R_{HV}$
R2-D2-M        CIFAR-FS        1.29       0.95
R2-D2-C        CIFAR-FS        2.92       1.69
MetaOptNet-M   CIFAR-FS        0.99       0.75
MetaOptNet-C   CIFAR-FS        1.84       1.25
R2-D2-M        mini-ImageNet   2.60       1.57
R2-D2-C        mini-ImageNet   3.58       1.90
MetaOptNet-M   mini-ImageNet   1.29       0.95
MetaOptNet-C   mini-ImageNet   3.13       1.75

Table 2: Comparison of class separation metrics for feature extractors trained by classical and meta-learning routines. $R_{FC}$ and $R_{HV}$ are measurements of feature clustering and hyperplane variation, respectively, and we formalize these measurements below. In both cases, lower values correspond to better class separation. We pair together models according to dataset and backbone architecture. “-C” and “-M” respectively denote classical training and meta-learning. See Sections 4.4 and 4.5 for more details.

4.2 Why is Clustering Important?

To demonstrate why linear separability is insufficient for few-shot learning, consider Figure 3. As features in a class become spread out and the classes are brought closer together, the classification boundaries formed by sampling one-shot data often misclassify large regions. In contrast, as features in a class are compacted and classes move far apart from each other, the intra-class to inter-class variance ratio drops, and dependence of the class boundary on the choice of one-shot samples becomes weaker.

Figure 3: a) When class variation is high relative to the variation between classes, decision boundaries formed by one-shot learning are inaccurate, even though classes are linearly separable. b) As classes move farther apart relative to the class variation, one-shot learning yields better decision boundaries.

This intuitive argument is formalized in the following result.

Theorem 1

Consider two random variables,

representing class and representing class Let be the random variable equal to

with probability

and with probability Assume the variance ratio bound

holds for sufficiently small

Draw random one-shot data, and and a test point Consider the linear classifier

This classifier assigns the correct label to with probability at least

Note that the linear classifier in the theorem is simply the maximum-margin linear classifier that separates the two training points. In plain words, Theorem 1 guarantees that one-shot learning performance is effective when the variance ratio is small, with classification becoming asymptotically perfect as the ratio approaches zero. A proof is provided in Appendix B.
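For concreteness, the maximum-margin classifier for two training points is the perpendicular bisector of the segment joining them. The sketch below builds that classifier; the function name, the class labels 1 and 2, and the toy points are assumptions made for illustration.

```python
def one_shot_classifier(x, y):
    """Return f(z) -> 1 or 2 from one training point per class.

    The boundary is the perpendicular bisector of the segment joining x and y,
    which is the maximum-margin hyperplane for two points.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    w = [a - b for a, b in zip(x, y)]        # normal direction x - y
    offset = 0.5 * (dot(x, x) - dot(y, y))   # places the boundary at the midpoint
    return lambda z: 1 if dot(w, z) - offset >= 0 else 2

classify = one_shot_classifier(x=[0.0, 0.0], y=[4.0, 0.0])
```

With these two points the boundary is the vertical line through 2.0 on the first axis, so `classify([1.0, 3.0])` returns 1 and `classify([3.0, -2.0])` returns 2. Redrawing `x` and `y` from widely spread classes would tilt and shift this line, which is the sample dependence the theorem bounds.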

4.3 Comparing Feature Representations of Meta-Learning and Classically Trained Models

We begin our investigation into the feature space of meta-learned models by visualizing features. Figure 7 contains a visual comparison of ProtoNet and MAML with a classically trained model of the same architecture on mini-ImageNet. Three classes are randomly chosen from the test set, and samples are taken from each class. The samples are then passed through the feature extractor, and the resulting vectors are plotted. Because feature space is high-dimensional, we perform a linear projection into $\mathbb{R}^2$. We project onto the first two component vectors determined by LDA. Linear discriminant analysis (LDA) projects data onto directions that minimize the intra-class to inter-class variance ratio Mika et al. (1999), and so is ideal for visualizing class separation phenomena.

Figure 7: Features extracted from mini-ImageNet test data by a) MAML, b) ProtoNet, and c) classically trained models with identical architectures (4 convolutional layers). The meta-learned networks produce better class separation.

In the plots, we see that relative to the size of the point clusters, the classically trained model mashes features together, while the meta-learned models draw the classes farther apart. While visually separate class features may be neither a necessary nor sufficient condition for few-shot performance, we take these plots as inspiration for our regularizer in the following section. Surprisingly, MAML, which updates all network parameters during fine-tuning, exhibits almost as good class separation as ProtoNet.

4.4 Feature Space Clustering Improves the Few-Shot Performance of Transfer Learning

We now further test the feature clustering hypothesis by promoting the same behavior in classically trained models. Consider a network with feature extractor $f_\theta$ and fully-connected layer $g_w$. Then, denoting training data in class $i$ by $\{x_{i,j}\}$, we formulate the feature clustering regularizer by

$$R_{FC}(\theta, \{x_{i,j}\}) = \frac{C}{N} \cdot \frac{\sum_{i,j} \|f_\theta(x_{i,j}) - \mu_i\|_2^2}{\sum_{i} \|\mu_i - \mu\|_2^2},$$

where $f_\theta(x_{i,j})$ is a feature vector corresponding to a data point in class $i$, $\mu_i$ is the mean of feature vectors in class $i$, $\mu$ is the mean across all feature vectors, $C$ is the number of classes, and $N$ is the number of data points per class. When this regularizer has value zero, classes are represented by distinct point masses in feature space, and thus the class boundary is invariant to the choice of few-shot data.

mini-ImageNet CIFAR-FS
Training Backbone 1-shot 5-shot 1-shot 5-shot
R2-D2 R2-D2 % % % %
Classical R2-D2 % % % %
Classical w/ R2-D2 % % % %
Classical w/ R2-D2 % %
MetaOptNet-SVM MetaOptNet % % % %
Classical MetaOptNet % % % %
Classical w/ MetaOptNet % % % %
Classical w/ MetaOptNet % % %

Table 3: Comparison of methods on 1-shot and 5-shot CIFAR-FS and mini-ImageNet 5-way classification. The top accuracy for each backbone/task is in bold. Confidence intervals have radius equal to one standard error. Few-shot fine-tuning is performed with SVM except for R2-D2 in which we report numbers from the original paper.

We incorporate this regularizer into a standard training routine by sampling two images per class in each mini-batch so that we can compute a within-class variance estimate. Then, the total loss function becomes the sum of cross-entropy and $R_{FC}$. We train the R2-D2 and MetaOptNet backbones in this fashion on the mini-ImageNet and CIFAR-FS datasets, and we test these networks on both 1-shot and 5-shot tasks. In all experiments, feature clustering improves the performance of transfer learning and sometimes even achieves higher performance than meta-learning. Furthermore, the regularizer does not appreciably slow down classical training, which, without the expense of differentiating through an inner loop, runs 2-5 times faster than the corresponding meta-learning routine. See Table 3 for numerical results, and Appendix A.2 for experimental details including training times.
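The paired sampling step might look like the following sketch, which draws exactly two examples from each class selected for a mini-batch. The function name `sample_paired_batch` and its parameters are hypothetical and do not come from the original training code.

```python
import random

def sample_paired_batch(labels, classes_per_batch, rng):
    """Return dataset indices containing exactly two examples per sampled class."""
    indices_by_class = {}
    for index, label in enumerate(labels):
        indices_by_class.setdefault(label, []).append(index)
    # Only classes with at least two examples can yield a within-class variance estimate.
    eligible = sorted(c for c, idx in indices_by_class.items() if len(idx) >= 2)
    batch = []
    for cls in rng.sample(eligible, classes_per_batch):
        batch.extend(rng.sample(indices_by_class[cls], 2))
    return batch

labels = [i % 8 for i in range(80)]  # toy dataset: 8 classes, 10 examples each
batch = sample_paired_batch(labels, classes_per_batch=4, rng=random.Random(0))
```

Each batch then contains one pair per class, which is the smallest configuration for which both a class mean and a deviation from that mean are defined.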

4.5 Connecting Feature Clustering with Hyperplane Invariance

For further validation of the connection between feature clustering and invariance of separating hyperplanes to data sampling, we replace the feature clustering regularizer with one that penalizes variations in the maximum-margin hyperplane separating feature vectors in opposite classes. Consider data points $x_1, x_2$ in class $A$, data points $y_1, y_2$ in class $B$, and feature extractor $f_\theta$. The difference vector $f_\theta(x_1) - f_\theta(y_1)$ determines the direction of the maximum margin hyperplane separating the two points in feature space. To penalize the variation in hyperplanes, we introduce the hyperplane variation regularizer,

$$R_{HV}\big(f_\theta(x_1), f_\theta(x_2), f_\theta(y_1), f_\theta(y_2)\big) = \frac{\|(f_\theta(x_1) - f_\theta(y_1)) - (f_\theta(x_2) - f_\theta(y_2))\|_2}{\|f_\theta(x_1) - f_\theta(y_1)\|_2 + \|f_\theta(x_2) - f_\theta(y_2)\|_2}.$$

This function measures the distance between the difference vectors $f_\theta(x_1) - f_\theta(y_1)$ and $f_\theta(x_2) - f_\theta(y_2)$ relative to their size. In practice, during a batch of training, we sample many pairs of classes and two samples from each class. Then, we compute $R_{HV}$ on all class pairs and add these terms to the cross-entropy loss. We find that this regularizer performs almost as well as $R_{FC}$, and conclusively outperforms non-regularized classical training. We include these results in Table 3. See Appendix A.2 for more details on these experiments, including training times (which, as indicated in Section 4.4, are significantly lower than those needed for meta-learning).
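A minimal sketch of such a penalty on pre-extracted features follows. It normalizes the distance between the two difference vectors by the sum of their norms; that particular normalization, the function names, and the toy inputs are assumptions of this example.

```python
import math

def difference(u, v):
    return [a - b for a, b in zip(u, v)]

def norm(u):
    return math.sqrt(sum(a * a for a in u))

def hyperplane_variation(fx1, fx2, fy1, fy2):
    """Distance between the two one-shot difference vectors, relative to their size.

    fx1, fx2: features of two samples from one class; fy1, fy2: from another class.
    Normalizing by the sum of the two norms is an assumption of this sketch.
    """
    d1 = difference(fx1, fy1)
    d2 = difference(fx2, fy2)
    return norm(difference(d1, d2)) / (norm(d1) + norm(d2))

stable = hyperplane_variation([0.0, 0.0], [0.0, 0.0], [4.0, 0.0], [4.0, 0.0])
unstable = hyperplane_variation([0.0, 1.0], [0.0, -1.0], [1.0, -1.0], [1.0, 1.0])
```

By the triangle inequality the value lies between 0 and 1. It is 0 when both sample pairs induce the same hyperplane direction and grows as the direction becomes more dependent on which samples were drawn.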

4.6 MAML Does Not Have the Same Feature Separation Properties

We saw in feature space plots that the first two LDA components generated by MAML features visually appear to separate classes. We now quantify MAML’s class separation compared to transfer learning by computing our regularizer values for a pre-trained MAML model as well as a classically trained model of the same architecture. We find that, in fact, MAML exhibits even worse feature separation than a classically trained model of the same architecture. See Table 4 for numerical results.

Model    $R_{FC}$   $R_{HV}$
MAML-1   3.9406     1.9434
MAML-5   3.7044     1.8901
MAML-C   3.3487     1.8113

Table 4: Comparison of regularizer values for 1-shot and 5-shot MAML models (MAML-1 and MAML-5) as well as MAML-C, a classically trained model of the same architecture, on mini-ImageNet training data. The lowest value of each regularizer is in bold.

5 Finding Clusters of Local Minima for Task Losses in Parameter Space

Since Reptile does not fix the feature extractor during fine-tuning, it must find parameters that adapt easily to new tasks. One way Reptile might achieve this is by finding parameters that can reach a task-specific minimum by traversing a smooth, nearly linear region of the loss landscape. In this case, even a single SGD update would move parameters in a useful direction. Unlike MAML, however, Reptile does not backpropagate through optimization steps, and thus lacks information about the loss surface geometry when performing parameter updates. Instead, we hypothesize that Reptile finds parameters that lie very close to good minima for many tasks and is therefore able to perform well.

This hypothesis is further motivated by the close relationship between Reptile and consensus optimization Boyd et al. (2011). In a consensus method, a number of models are independently optimized with their own task-specific parameters, and the tasks communicate via a penalty that encourages all the individual solutions to converge around a common value. Reptile can be interpreted as approximately minimizing the consensus formulation

$$\frac{1}{m} \sum_{p=1}^{m} \left[ \mathcal{L}_{\mathcal{T}_p}(\tilde{\theta}_p) + \frac{\gamma}{2} \|\tilde{\theta}_p - \theta\|^2 \right],$$

where $\mathcal{L}_{\mathcal{T}_p}(\tilde{\theta}_p)$ is the loss for task $\mathcal{T}_p$, $\{\tilde{\theta}_p\}$ are task-specific parameters, and the quadratic penalty on the right encourages the parameters to cluster around a “consensus value” $\theta$. A stochastic optimizer for this loss would proceed by alternately selecting a random task/term index $p$, minimizing the loss with respect to $\tilde{\theta}_p$, and then taking a gradient step to approximately minimize the loss for $\theta$.

Reptile diverges from a traditional consensus optimizer only in that it does not explicitly consider the quadratic penalty term when minimizing for $\tilde{\theta}_p$. However, it implicitly considers this penalty by initializing the optimizer for the task-specific loss using the current value of the consensus variables $\theta$, which encourages the task-specific parameters to stay near the consensus parameters. In the next section, we replace the standard Reptile algorithm with one that explicitly minimizes a consensus formulation.
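The connection to Reptile's outer update can be made explicit with a one-line derivation. Writing the penalty weight as $\gamma$ and the step size for the consensus variable as $\eta$ (both symbols are assumptions of this sketch), a gradient step on the quadratic penalty with the task-specific parameters $\tilde{\theta}_p$ held fixed gives:

```latex
\nabla_{\theta}\,\frac{\gamma}{2}\,\lVert \tilde{\theta}_p - \theta \rVert^2
  = -\gamma\,(\tilde{\theta}_p - \theta),
\qquad
\theta \;\leftarrow\; \theta - \eta\,\nabla_{\theta}\,\frac{\gamma}{2}\,\lVert \tilde{\theta}_p - \theta \rVert^2
  = \theta + \eta\gamma\,(\tilde{\theta}_p - \theta).
```

The consensus variable therefore moves a fraction $\eta\gamma$ of the way toward the fine-tuned parameters, which has the same form as Reptile's outer-loop update.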

5.1 Consensus Optimization Improves Reptile

To validate the weight-space clustering hypothesis, we modify Reptile to explicitly enforce parameter clustering around a consensus value. We find that directly optimizing the consensus formulation leads to improved performance. To this end, during each inner loop update step in Reptile, we penalize the squared distance from the parameters for the current task to the average of the parameters across all tasks in the current batch. Namely, we let:

$$R_i\big(\{\tilde{\theta}_p\}_{p=1}^{m}\big) = d\Big(\tilde{\theta}_i, \frac{1}{m} \sum_{p=1}^{m} \tilde{\theta}_p\Big)^2,$$

where $\tilde{\theta}_p$ are the network parameters on task $p$ and $d$ is the filter normalized distance (see Note 1). Note that as parameters shrink towards the origin, the distances between minima shrink as well. Thus, we employ filter normalization to ensure that our calculation is invariant to scaling Li et al. (2018). See below for a description of filter normalization. This regularizer guides optimization to a location where many task-specific minima lie in close proximity. A detailed description is given in Algorithm 2, which is equivalent to the original Reptile when the regularization coefficient $\alpha$ is zero. We call this method “Weight-Clustering.”

Note 1

Consider that a perturbation to the parameters of a network is more impactful when the network has small parameters. While previous work has used layer normalization or even more coarse normalization schemes, the authors of Li et al. (2018) note that since the output of networks with batch normalization is invariant to filter scaling as long as the batch statistics are updated accordingly, we can normalize every filter of such a network independently. The latter work suggests that this scheme, “filter normalization”, correlates better with properties of the optimization landscape. Thus, we measure distance in our regularizer using filter normalization, and we find that this technique prevents parameters from shrinking towards the origin.

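  Require: Initial parameter vector, $\theta$, outer learning rate, $\gamma$, inner learning rate, $\eta$, regularization coefficient, $\alpha$, and distribution over tasks, $p(\mathcal{T})$.
  for meta-step $= 1, \dots, n$ do
     Sample batch of tasks, $\{\mathcal{T}_i\}_{i=1}^{m}$ from $p(\mathcal{T})$
     Initialize parameter vectors $\tilde{\theta}_i^0 = \theta$ for each task
     for $j = 1, 2, \dots, k$ do
        for $i = 1, 2, \dots, m$ do
           Calculate $L = \mathcal{L}^j_{\mathcal{T}_i} + \alpha R_i\big(\{\tilde{\theta}_p^{j-1}\}_{p=1}^{m}\big)$
           Update $\tilde{\theta}_i^j = \tilde{\theta}_i^{j-1} - \eta \nabla_{\tilde{\theta}_i} L$
        end for
     end for
     Compute difference vectors $\{g_i = \tilde{\theta}_i^k - \tilde{\theta}_i^0\}_{i=1}^{m}$
     Update $\theta \leftarrow \theta + \frac{\gamma}{m} \sum_i g_i$
  end for
Algorithm 2 Reptile with Weight-Clustering Regularization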
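The following toy sketch illustrates the idea on scalar parameters. It makes several simplifying assumptions that differ from the experiments described here: each task has a quadratic loss, plain squared Euclidean distance stands in for the filter normalized distance, and the batch mean is held constant when differentiating the penalty. All names and hyperparameter values are hypothetical.

```python
def weight_clustered_reptile(theta, task_targets, alpha, inner_lr=0.1,
                             outer_lr=0.5, inner_steps=5, meta_steps=100):
    """Reptile on scalar toy tasks with loss 0.5 * (w - target) ** 2 per task.

    alpha weights the penalty (w_i - batch_mean) ** 2; alpha = 0 is plain Reptile.
    """
    m = len(task_targets)
    params = [theta] * m
    for _ in range(meta_steps):
        params = [theta] * m  # every task starts from the shared parameters
        for _ in range(inner_steps):
            batch_mean = sum(params) / m  # treated as a constant within this step
            params = [
                w - inner_lr * ((w - target) + 2.0 * alpha * (w - batch_mean))
                for w, target in zip(params, task_targets)
            ]
        # Outer update: move theta along the average fine-tuning displacement.
        theta += outer_lr * sum(w - theta for w in params) / m
    return theta, params

def spread(values):
    return max(values) - min(values)

targets = [-1.0, 0.0, 4.0]
theta_plain, params_plain = weight_clustered_reptile(0.0, targets, alpha=0.0)
theta_reg, params_reg = weight_clustered_reptile(0.0, targets, alpha=1.0)
```

In this toy setting both runs settle on the same shared parameter, but the fine-tuned task parameters in `params_reg` stay closer together than those in `params_plain`, mirroring the qualitative effect the regularizer is designed to have.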

We compare the performance of our regularized Reptile algorithm to that of the original Reptile method, as well as first-order MAML (FOMAML) and a classically trained model of the same architecture. We test these methods on a sample of 100,000 5-way 1-shot and 5-shot mini-ImageNet tasks and find that in both cases, Reptile with Weight-Clustering achieves higher performance than the original algorithm and significantly better performance than FOMAML and the classically trained models. These results are summarized in Table 5.

Framework 1-shot 5-shot
Reptile % %

Table 5: Comparison of methods on 1-shot and 5-shot mini-ImageNet 5-way classification. The top accuracy for each setup is in bold. Confidence intervals have width equal to one standard error. W-Clustering denotes the Weight-Clustering regularizer.

We note that the best-performing result was attained when the product of the constant term collected from the gradient of the regularizer and the regularization coefficient was , but that a range of values up to ten times larger and smaller also produced improvements over the original algorithm. Experimental details, as well as results for other values of this coefficient, can be found in Appendix A.3.

In addition to these performance gains, we found that the parameters of networks trained using our regularized version of Reptile do not travel as far during fine-tuning as those trained using vanilla Reptile. Figure 10 depicts histograms of filter normalized distance traveled by both networks fine-tuning on samples of 1,000 1-shot and 5-shot mini-ImageNet tasks. From these, we conclude that our regularizer does indeed move model parameters toward a consensus which is near good minima for many tasks.

Figure 10: Histogram of filter normalized distance traveled during fine-tuning on a) 1-shot and b) 5-shot mini-ImageNet tasks by models trained using Reptile (red) and weight-clustered Reptile (blue).

6 Discussion

In this work, we shed light on two key differences between meta-learned networks and their classically trained counterparts. We find evidence that meta-learning algorithms minimize the variation between feature vectors within a class relative to the variation between classes. Moreover, we design two regularizers for transfer learning inspired by this principle, and our regularizers consistently improve few-shot performance. The success of our method helps to confirm the hypothesis that minimizing within-class feature variation is critical for few-shot performance.

We further notice that Reptile resembles a consensus optimization algorithm, and we enhance the method by designing yet another regularizer, which we apply to Reptile in order to find clusters of local minima in the loss landscape of tasks. We find in our experiments that this regularizer improves both one-shot and five-shot performance of Reptile on mini-ImageNet.


  • H. Altae-Tran, B. Ramsundar, A. S. Pappu, and V. Pande (2017) Low data drug discovery with one-shot learning. ACS central science 3 (4), pp. 283–293. Cited by: §1.
  • L. Bertinetto, J. F. Henriques, P. H. Torr, and A. Vedaldi (2018) Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136. Cited by: §A.2, §1, §2.2, §2.3.
  • S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al. (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning 3 (1), pp. 1–122. Cited by: §5.
  • W. Chen, Y. Liu, Z. Kira, Y. F. Wang, and J. Huang (2019) A closer look at few-shot classification. arXiv preprint arXiv:1904.04232. Cited by: §2.4.
  • G. S. Dhillon, P. Chaudhari, A. Ravichandran, and S. Soatto (2019) A baseline for few-shot image classification. arXiv preprint arXiv:1909.02729. Cited by: §2.4.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. Cited by: §1, §2.2, §3.
  • N. Frosst, N. Papernot, and G. Hinton (2019) Analyzing and improving representations with the soft nearest neighbor loss. arXiv preprint arXiv:1902.01889. Cited by: §2.4.
  • M. Goldblum, L. Fowl, and T. Goldstein (2019) Robust few-shot learning with adversarially queried meta-learners. arXiv preprint arXiv:1910.00982. Cited by: §2.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §A.2.
  • S. Kornblith, J. Shlens, and Q. V. Le (2019) Do better imagenet models transfer better?. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2661–2671. Cited by: §1.
  • K. Lee, S. Maji, A. Ravichandran, and S. Soatto (2019) Meta-learning with differentiable convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10657–10665. Cited by: §A.2, §A.2, §1, §1, §2.2.
  • H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein (2018) Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pp. 6389–6399. Cited by: §5.1, Note 1.
  • S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K. Mullers (1999) Fisher discriminant analysis with kernels. In Neural networks for signal processing IX: Proceedings of the 1999 IEEE signal processing society workshop (cat. no. 98th8468), pp. 41–48. Cited by: §4.3.
  • A. Nagabandi, I. Clavera, S. Liu, R. S. Fearing, P. Abbeel, S. Levine, and C. Finn (2018) Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. arXiv preprint arXiv:1803.11347. Cited by: §1.
  • A. Nichol and J. Schulman (2018) Reptile: a scalable metalearning algorithm. arXiv preprint arXiv:1803.02999. Cited by: §A.3, Table 9, §1, §2.2.
  • B. Oreshkin, P. R. López, and A. Lacoste (2018) Tadam: task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pp. 721–731. Cited by: §A.2.
  • J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087. Cited by: §2.2.
  • L. Song, J. Liu, and Y. Qin (2019) Fast and generalized adaptation for few-shot learning. arXiv preprint arXiv:1911.10807. Cited by: §1.
  • O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630–3638. Cited by: §2.3.
  • K. Wang, X. Gao, Y. Zhao, X. Li, D. Dou, and C. Xu (2020) Pay attention to features, transfer learn faster CNNs. In International Conference on Learning Representations. Cited by: §2.4.

Appendix A Experimental Details

A.1 Mixing Meta-Learned Models and Fine-Tuning Procedures: Additional Experiments

Model          SVM              RR               ProtoNet         MAML
MetaOptNet-M   78.63 ± 0.25 %   76.96 ± 0.23 %   76.17 ± 0.23 %   70.14 ± 0.27 %
MetaOptNet-C   76.72 ± 0.24 %   74.48 ± 0.24 %   73.37 ± 0.24 %   71.32 ± 0.26 %
R2-D2-M        68.40 ± 0.20 %   72.09 ± 0.25 %   70.74 ± 0.25 %   71.43 ± 0.27 %
R2-D2-C        68.24 ± 0.26 %   67.04 ± 0.26 %   60.93 ± 0.29 %   65.30 ± 0.27 %
Table 6: Comparison of meta-learning and transfer learning models with various fine-tuning algorithms on 5-shot mini-ImageNet. “MetaOptNet-M” and “MetaOptNet-C” denote models with the MetaOptNet backbone trained with MetaOptNet-SVM and with classical training, respectively. Similarly, “R2-D2-M” and “R2-D2-C” denote models with the R2-D2 backbone trained with ridge regression (RR) and with classical training, respectively. Column headers denote the fine-tuning algorithm used for evaluation, and the radius of each confidence interval is one standard error.

A.2 Transfer Learning and Feature Space Clustering

We evaluate the proposed regularizers and the classically trained baseline on two backbone architectures: a 4-layer convolutional neural network with 96-192-384-512 filters per layer, originally used for R2-D2 Bertinetto et al. (2018), and ResNet-12 He et al. (2016); Oreshkin et al. (2018); Lee et al. (2019). We run experiments on the mini-ImageNet and CIFAR-FS datasets.

When training the backbone feature extractors, we use SGD with a batch-size of 128 for CIFAR-FS and 256 for mini-ImageNet, Nesterov momentum set to 0.9 and weight decay of

. For training on CIFAR-FS, we set the initial learning rate to 0.1 for the first 100 epochs and reduce it by a factor of 10 every 50 epochs. To avoid gradient explosion, we use 15 warm-up epochs with learning rate 0.01 on mini-ImageNet. We train all classically trained networks for a total of 300 epochs. We employ data parallelism across 2 Nvidia RTX 2080 Ti GPUs when training on mini-ImageNet, and we use only one GPU for each CIFAR-FS experiment. For few-shot testing, we train two classification heads, a linear NN layer and an SVM Lee et al. (2019), on top of the pre-trained feature extractors. The evaluation results of these models are given in Table 8. Table 7 shows the running time per training epoch as well as the total training time on both datasets and backbone architectures to achieve the results in Table 3. Training with the proposed regularizers is nearly as fast as classical transfer learning and up to almost 12 times faster than meta-learning methods. For meta-learning methods, we follow the training hyperparameters from Lee et al. (2019).
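The CIFAR-FS step schedule described above can be sketched as a small function. The reading used here, in which the first reduction happens at epoch 100 and further reductions follow every 50 epochs, is an assumption about the intended schedule, and the name `cifar_fs_learning_rate` and its parameters are hypothetical.

```python
def cifar_fs_learning_rate(epoch, base_lr=0.1, first_drop=100, drop_every=50, factor=10.0):
    """Step schedule: constant for the first `first_drop` epochs, then divided
    by `factor` at epoch `first_drop` and every `drop_every` epochs after
    (assumed interpretation of the text)."""
    if epoch < first_drop:
        return base_lr
    num_drops = 1 + (epoch - first_drop) // drop_every
    return base_lr / factor ** num_drops


# Learning rate at the start of each phase of a 300-epoch run.
phase_starts = [0, 100, 150, 200, 250]
schedule = {epoch: cifar_fs_learning_rate(epoch) for epoch in phase_starts}
```

Under this reading, a 300-epoch run passes through five learning-rate levels, from 0.1 down to roughly 1e-5 in the final 50 epochs.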

Training         Backbone     mini-ImageNet runtime   CIFAR-FS runtime
R2-D2            R2-D2        16m/16.8h               44s/45m
Classical        R2-D2        20s/1.7h                4s/22m
Classical w/     R2-D2        20s/1.7h                4s/24m
Classical w/     R2-D2        20s/1.7h                4s/23m
MetaOptNet-SVM   MetaOptNet   1.5h/88.0h              4m/4.5h
Classical        MetaOptNet   1.4m/7.0h               14s/1.2h
Classical w/     MetaOptNet   1.5m/7.4h               15s/1.3h
Classical w/     MetaOptNet   1.3m/7.2h               16s/1.4h

Table 7: GPU time (training time per epoch/total training time) of each method on CIFAR-FS and mini-ImageNet 5-way classification on a single GPU.
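The speed-up figures implied by Table 7 can be checked with a few lines of arithmetic. The sketch below uses the total training times from the table, converted to hours, and compares each meta-learning method against the slower of the two regularized classical runs on the same backbone and dataset; the dictionary layout and key names are illustrative choices.

```python
# Total training times read from Table 7, converted to hours.
# "regularized" is the slower of the two "Classical w/" rows in each group.
total_hours = {
    ("R2-D2", "mini-ImageNet"): {"meta": 16.8, "regularized": 1.7},
    ("R2-D2", "CIFAR-FS"): {"meta": 45 / 60, "regularized": 24 / 60},
    ("MetaOptNet", "mini-ImageNet"): {"meta": 88.0, "regularized": 7.4},
    ("MetaOptNet", "CIFAR-FS"): {"meta": 4.5, "regularized": 1.4},
}

# Ratio of meta-learning time to regularized classical training time.
speedups = {key: t["meta"] / t["regularized"] for key, t in total_hours.items()}

for (backbone, dataset), s in sorted(speedups.items()):
    print(f"{backbone:10s} {dataset:14s} {s:5.1f}x")
```

The largest ratio comes from the MetaOptNet backbone on mini-ImageNet (88.0h against 7.4h), which is where the figure of almost 12 times originates.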
mini-ImageNet CIFAR-FS
Backbone Regularizer Coeff Head 1-shot 5-shot 1-shot 5-shot
R2-D2 0.02 NN % % % %
0.05 NN % % % %
0.1 NN % % % %
0.02 NN % % % %
0.05 NN % % % %
0.1 NN % % % %
0.02 SVM % % % %
0.05 SVM % % % %
0.1 SVM % % % %
0.02 SVM % % % %
0.05 SVM % % % %
0.1 SVM % % %
ResNet-12 0.02 NN % % % %
0.05 NN % % % %
0.1 NN % % % %
0.02 NN % % % %
0.05 NN % % % %
0.1 NN % % %
0.02 SVM % % % %
0.05 SVM % % % %
0.1 SVM % % % %
0.02 SVM % % % %
0.05 SVM % % % %
0.1 SVM % % % %

Table 8: Hyperparameter tuning of the feature clustering and hyperplane invariance regularizers for different backbone architectures and classification heads on 1-shot and 5-shot CIFAR-FS and mini-ImageNet 5-way classification. Regularizer coefficients include the factor.

A.3 Reptile Weight Clustering

We train models via our weight-clustering Reptile algorithm with a range of coefficients for the regularization term. The model architecture and all other hyperparameters were chosen to match those specified for Reptile training and evaluation on 1-shot and 5-shot mini-ImageNet in Nichol and Schulman (2018). The evaluation results of these models are given in Table 9. All models were trained on Nvidia RTX 2080 Ti GPUs.

Coefficient 1-shot 5-shot
(Reptile) % %

Table 9: Comparison of test accuracy for models trained with the weight-clustering Reptile algorithm with various regularization coefficients evaluated on 1-shot and 5-shot mini-ImageNet tasks. The results for vanilla Reptile are those given in Nichol and Schulman (2018).

Appendix B Proof of Theorem 1

Consider the three conditions

where and is the expected value of Under these conditions,


Combining the above yields

We can now write

and so is classified correctly if our three conditions hold. From the Chebyshev bound, these conditions hold with probability at least


where we have twice applied the identity which holds for (this also requires , but this can be guaranteed by choosing a sufficiently small as in the statement of the theorem).
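For reference, the general form of the Chebyshev bound for a random vector is sketched here. The symbols $X$, $\mu$, $\sigma^2$ and $t$ are introduced for illustration only and are not the notation of the theorem; the inequality follows from applying Markov's inequality to $\|X - \mu\|^2$.

```latex
% Chebyshev's inequality for a random vector X with mean \mu = E[X]
% and total variance \sigma^2 = E\|X - \mu\|^2 (illustrative symbols).
\[
  \Pr\bigl(\|X - \mu\| \ge t\bigr) \;\le\; \frac{\sigma^2}{t^2},
  \qquad t > 0,
\]
% Equivalently, \|X - \mu\| < t holds with probability at least 1 - \sigma^2 / t^2.
```

A bound of this type, applied to each condition that involves a deviation from an expected value, gives the probability with which that condition holds.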

Finally, we have the variation ratio bound

And so

Plugging this into (1) we get the final probability bound