1 Introduction
Training neural networks from scratch requires large amounts of labeled data, making it impractical in many settings. When data is expensive or time-consuming to obtain, training from scratch may be cost-prohibitive (Altae-Tran et al., 2017). In other scenarios, models must adapt efficiently to changing environments before enough time has passed to amass a large and diverse data corpus (Nagabandi et al., 2018). In both of these cases, massive state-of-the-art networks would overfit to the tiny training sets available. To overcome this problem, practitioners pre-train on large auxiliary datasets and then fine-tune the resulting models on the target task. For example, ImageNet pre-training of large ResNets has become an industry standard for transfer learning (Kornblith et al., 2019). Unfortunately, transfer learning from classically trained models often yields subpar performance in the extremely data-scarce regime, or breaks down entirely when only a few data samples are available in the target domain.

Recently, a number of few-shot benchmarks have been rapidly improved using meta-learning methods (Lee et al., 2019; Song et al., 2019). Unlike classical transfer learning, which uses a base model pre-trained on a different task, meta-learning algorithms produce a base network that is specifically designed for quick adaptation to new tasks using few-shot data. Furthermore, meta-learning is still effective when applied to small, lightweight base models that can be fine-tuned with relatively few computations.
The ability of meta-learned networks to rapidly adapt to new domains suggests that the feature representations learned by meta-learning must be fundamentally different from those learned through conventional training. Because of the good performance that meta-learning offers, many researchers have been content to use these features without considering how or why they differ from conventional representations. As a result, little is known about the fundamental differences between meta-learned feature extractors and those which result from classical training. Training routines are often treated like a black box in which high performance is celebrated, but a deeper understanding of the phenomenon remains elusive. To further complicate matters, a myriad of meta-learning strategies exist that may exploit different mechanisms.
In this paper, we delve into the differences between features learned by meta-learning and classical training. We explore and visualize the behaviors of different methods and identify two different mechanisms by which meta-learned representations can improve few-shot learning. In the case of meta-learning strategies that fine-tune only the last (classification) layer of a network, such as MetaOptNet (Lee et al., 2019) and R2D2 (Bertinetto et al., 2018), we find that meta-learning tends to cluster object classes more tightly in feature space. As a result, the classification boundaries learned during fine-tuning are less sensitive to the choice of few-shot samples. In the second case, we hypothesize that meta-learning strategies that use end-to-end fine-tuning, such as MAML (Finn et al., 2017) and Reptile (Nichol and Schulman, 2018), search for meta-parameters that lie close in weight space to a wide range of task-specific minima. In this case, a small number of SGD steps can transport the parameters to a good minimum for a specific task.
Inspired by these observations, we propose simple regularizers that improve feature-space clustering and parameter-space proximity. These regularizers boost few-shot performance without the dramatic increase in optimization cost that comes with conventional meta-learning.
2 Problem Setting
2.1 The Meta-Learning Framework
The objective of meta-learning algorithms is to produce a network that quickly adapts to new classes using little data. Concretely, meta-learning algorithms find parameters that can be fine-tuned in a few optimization steps, and on few data points, in order to achieve good generalization on a task $\mathcal{T}$ consisting of a small number of data samples from a distribution and label space that were not seen during training. The task is characterized as $n$-way, $k$-shot if the meta-learning algorithm must adapt to classify data from the new label space after seeing $k$ examples from each of its $n$ classes.

Meta-learning schemes typically rely on bi-level optimization problems with an inner loop and an outer loop. An iteration of the outer loop involves first sampling a "task," which comprises two sets of labeled data: the support data and the query data. Then, in the inner loop, the model being trained is fine-tuned using the support data. Finally, the routine moves back to the outer loop, where the meta-learning algorithm minimizes loss on the query data with respect to the pre-fine-tuned weights. This minimization is executed by differentiating through the inner loop computation and updating the network parameters to make the inner loop fine-tuning as effective as possible. Note that, in contrast to standard transfer learning (which uses classical training and simple first-order gradient information to update parameters), meta-learning algorithms differentiate through the entire fine-tuning loop. A formal description of this process can be found in Algorithm 1, as seen in Goldblum et al. (2019).
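As a rough illustration of this bi-level structure, the following toy sketch runs an outer loop that samples tasks and an inner loop that fine-tunes on them. To stay self-contained it uses a first-order (Reptile-style) outer update on scalar quadratic tasks; every name and number here is illustrative, not the paper's Algorithm 1:

```python
import random

def inner_finetune(w, task_a, steps=5, lr=0.3):
    # Inner loop: a few SGD steps on a toy support loss L_t(w) = (w - a)^2.
    for _ in range(steps):
        grad = 2 * (w - task_a)          # dL/dw
        w = w - lr * grad
    return w

def meta_train(tasks, outer_steps=2000, outer_lr=0.1, seed=0):
    # Outer loop: sample a task, fine-tune on it in the inner loop, then move
    # the meta-parameters toward the fine-tuned solution (first-order update,
    # in place of differentiating through the inner loop).
    rng = random.Random(seed)
    w = 5.0                               # meta-parameter initialization
    for _ in range(outer_steps):
        a = rng.choice(tasks)             # sample a "task"
        w_task = inner_finetune(w, a)     # inner-loop fine-tuning on support data
        w = w + outer_lr * (w_task - w)   # outer-loop meta-update
    return w

w_meta = meta_train(tasks=[-2.0, 0.0, 2.0])
# The meta-parameters settle near the centroid of the task optima, from which
# each task's minimum is reachable in a few inner steps.
```

A full meta-learner would instead backpropagate the query loss through the inner-loop computation; the sketch only shows where the two loops sit.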
2.2 Meta-Learning Algorithms
A variety of meta-learning algorithms exist, differing mostly in how they fine-tune on support data during the inner loop. Early meta-learning approaches, such as MAML, update all network parameters using gradient descent during fine-tuning (Finn et al., 2017). Because differentiating through the inner loop is memory- and compute-intensive, the fine-tuning process consists of only a few (sometimes just one) SGD steps.
Reptile, which functions as a zeroth-order approximation to MAML, avoids unrolling the inner loop and differentiating through the SGD steps. Instead, after fine-tuning on support data, Reptile moves the central parameter vector in the direction of the fine-tuned parameters during the outer loop (Nichol and Schulman, 2018). In many cases, Reptile achieves better performance than MAML without having to differentiate through the fine-tuning process.

A newer class of algorithms freezes the feature extraction layers during the inner loop; only the linear classifier layer is trained during fine-tuning. Such models include R2D2 and MetaOptNet (Bertinetto et al., 2018; Lee et al., 2019). The advantage of this approach is that the fine-tuning problem is now a convex optimization problem. Unlike MAML, which simulates the fine-tuning process using only a few gradient updates, last-layer meta-learning methods can use differentiable optimizers to exactly minimize the fine-tuning objective and then differentiate the solution with respect to feature inputs. Moreover, differentiating through these solvers is computationally cheap compared to MAML's differentiation through SGD steps on the whole network. While MetaOptNet relies on an SVM loss, R2D2 simplifies the process even further by using a quadratic objective with a closed-form solution. R2D2 and MetaOptNet achieve stronger performance than MAML and are able to harness larger architectures without overfitting.

Model          SVM           RR            ProtoNet      MAML
MetaOptNet-M   62.64 ± 0.31  60.50 ± 0.30  51.99 ± 0.33  55.77 ± 0.32
MetaOptNet-C   56.18 ± 0.31  55.09 ± 0.30  41.89 ± 0.32  46.39 ± 0.28
R2D2-M         51.80 ± 0.20  55.89 ± 0.31  47.89 ± 0.32  53.72 ± 0.33
R2D2-C         48.39 ± 0.29  48.29 ± 0.29  28.77 ± 0.24  44.31 ± 0.28

Table 1: Comparison of meta-learning and classical transfer learning models with various fine-tuning algorithms on 1-shot miniImageNet. "MetaOptNet-M" and "MetaOptNet-C" denote models with the MetaOptNet backbone trained with MetaOptNet-SVM and classical training, respectively. Similarly, "R2D2-M" and "R2D2-C" denote models with the R2D2 backbone trained with ridge regression (RR) and classical training. Column headers denote the fine-tuning algorithm used for evaluation, and the radius of each confidence interval is one standard error.
Another last-layer method, ProtoNet, classifies examples in its inner loop by the proximity of their features to class centroids, a metric learning approach (Snell et al., 2017). Again, the feature extractor's parameters are frozen in the inner loop and used to create class centroids, which then determine the network's class boundaries. Because calculating class centroids is mathematically simple, the algorithm is able to efficiently backpropagate through this calculation to adjust the feature extractor.
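The centroid rule itself is a few lines. The sketch below uses made-up toy features and labels (not the paper's code) to show nearest-centroid classification:

```python
def class_centroids(support):
    # support: dict mapping class label -> list of feature vectors
    centroids = {}
    for label, feats in support.items():
        dim = len(feats[0])
        centroids[label] = [sum(f[i] for f in feats) / len(feats) for i in range(dim)]
    return centroids

def classify(query_feat, centroids):
    # Assign the class whose centroid is nearest in feature space.
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(centroids, key=lambda c: sq_dist(query_feat, centroids[c]))

support = {"cat": [[0.9, 0.1], [1.1, -0.1]], "dog": [[-1.0, 0.0], [-0.8, 0.2]]}
cents = class_centroids(support)
print(classify([0.7, 0.0], cents))  # prints "cat"
```

During meta-training, the centroid computation and the distance comparison are both differentiable, which is what lets gradients flow back into the feature extractor.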
In this work, "classically trained" models are trained using SGD on all classes simultaneously, and their feature extractors are adapted to new tasks using the same fine-tuning procedures as the meta-learned models for fair comparison. This approach represents the industry-standard method of transfer learning using pre-trained feature extractors.
2.3 Few-Shot Datasets
Several datasets have been developed for few-shot learning. We focus our attention on two: miniImageNet and CIFAR-FS. MiniImageNet is a pruned and downsized version of the ImageNet classification dataset, consisting of 60,000 84×84 RGB color images from 100 classes (Vinyals et al., 2016). These 100 classes are split into 64, 16, and 20 classes for the training, validation, and testing sets, respectively. The CIFAR-FS dataset samples images from CIFAR-100 (Bertinetto et al., 2018). CIFAR-FS is split in the same way as miniImageNet, with 60,000 32×32 RGB color images from 100 classes divided into 64, 16, and 20 classes for the training, validation, and testing sets, respectively.
2.4 Related Work
In addition to introducing new methods for few-shot learning, recent work has increased our understanding of why some models perform better than others at few-shot tasks. One such exploration performs baseline testing and discovers that network size has a large effect on the success of meta-learning algorithms. Specifically, as network depth increases, the performance of transfer learning approaches that of some meta-learning algorithms (Chen et al., 2019). Other work finds that, with a pre-trained model, features generated by data from classes absent from training are entangled, but the logits of the unseen data tend to be clustered (Frosst et al., 2019). Meta-learning performance can then be improved by using a transductive regularizer during training, which uses information from unlabeled query data to narrow the hypothesis space for meta-training (Dhillon et al., 2019).

On the transfer learning side, recent work has found that feature extractors trained on large, complex tasks can be deployed more effectively in a transfer learning setting by distilling knowledge about only the features important to the transfer task (Wang et al., 2020).

While improvements have been made to both meta-learning algorithms and transfer learning approaches for few-shot learning, little work has been done on understanding the underlying mechanisms that cause meta-learning routines to outperform classically trained models in data-scarce settings.
3 Are Meta-Learned Features Fundamentally Better for Few-Shot Learning?
It has been said that meta-learned models "learn to learn" (Finn et al., 2017), but one might ask whether they instead learn to optimize; their features could simply be well-adapted to the specific fine-tuning optimizers on which they are trained. We dispel the latter notion in this section.
In Table 1, we test the performance of meta-learned feature extractors not only with their own fine-tuning algorithm but also with a variety of other fine-tuning algorithms. We find that in all cases, the meta-learned feature extractors outperform classically trained models of the same architecture. See Appendix A.1 for results from additional experiments.
This across-the-board performance advantage suggests that meta-learned features are qualitatively different from conventional features and fundamentally superior for few-shot learning. The remainder of this work explores the characteristics of meta-learned models.
4 Class Clustering in Feature Space
Methods such as ProtoNet, MetaOptNet, and R2D2 fix their feature extractor during fine-tuning. For this reason, they must learn to embed features in a way that enables few-shot classification. For example, MetaOptNet and R2D2 require that classes be linearly separable in feature space, but mere linear separability is not a sufficient condition for good few-shot performance. The feature representations of randomly sampled few-shot data from a given class must not vary so much that classification performance becomes sample-dependent. In this section, we examine clustering in feature space, and we find that meta-learned models separate features differently than classically trained networks do.
4.1 Measuring Clustering in Feature Space
We begin by measuring how well different training methods cluster feature representations. To measure feature clustering (FC), we consider the intra-class to inter-class variance ratio
\[ \mathcal{R}_{FC} \;=\; \frac{\frac{1}{CN}\sum_{c=1}^{C}\sum_{i=1}^{N} \| f_{c,i} - \mu_c \|^2}{\frac{1}{C}\sum_{c=1}^{C} \| \mu_c - \mu \|^2}, \]
where $f_{c,i}$ is a feature vector in class $c$, $\mu_c$ is the mean of feature vectors in class $c$, $\mu$ is the mean across all feature vectors, $C$ is the number of classes, and $N$ is the number of data points per class. Low values of this ratio correspond to collections of features in which classes are well-separated, so that a hyperplane formed by choosing a point from each of two classes does not vary dramatically with the choice of samples.
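On toy 2-D features the ratio can be computed directly; a minimal sketch (the variable names and example features are ours):

```python
def mean(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def variance_ratio(features_by_class):
    # features_by_class: list of C classes, each a list of N feature vectors.
    C = len(features_by_class)
    N = len(features_by_class[0])
    class_means = [mean(feats) for feats in features_by_class]
    overall = mean([v for feats in features_by_class for v in feats])
    # Intra-class scatter, averaged over all CN points.
    intra = sum(sq_dist(f, m)
                for feats, m in zip(features_by_class, class_means)
                for f in feats) / (C * N)
    # Inter-class scatter of the class means around the overall mean.
    inter = sum(sq_dist(m, overall) for m in class_means) / C
    return intra / inter

tight = [[[0.0, 0.0], [0.1, 0.0]], [[3.0, 0.0], [3.1, 0.0]]]   # well-clustered
loose = [[[0.0, 0.0], [2.0, 0.0]], [[3.0, 0.0], [5.0, 0.0]]]   # spread out
# variance_ratio(tight) is orders of magnitude smaller than variance_ratio(loose).
```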
In Table 2, we highlight the superior class separation of meta-learning methods. We compute two quantities, $\mathcal{R}_{FC}$ and $\mathcal{R}_{HV}$, for MetaOptNet and R2D2 as well as classical transfer learning baselines of the same architectures. These quantities measure the intra-class to inter-class variance ratio and the invariance of separating hyperplanes to data sampling, respectively. In both cases, lower values correspond to better class separation. On both CIFAR-FS and miniImageNet, the meta-learned models attain lower values, indicating that feature space clustering plays a role in the effectiveness of meta-learning.
Training       Dataset        R_FC   R_HV
R2D2-M         CIFAR-FS       1.29   0.95
R2D2-C         CIFAR-FS       2.92   1.69
MetaOptNet-M   CIFAR-FS       0.99   0.75
MetaOptNet-C   CIFAR-FS       1.84   1.25
R2D2-M         miniImageNet   2.60   1.57
R2D2-C         miniImageNet   3.58   1.90
MetaOptNet-M   miniImageNet   1.29   0.95
MetaOptNet-C   miniImageNet   3.13   1.75

Table 2: Class separation measures (lower is better) for meta-learned and classically trained models.
4.2 Why is Clustering Important?
To demonstrate why linear separability is insufficient for few-shot learning, consider Figure 3. As features within a class spread out and classes move closer together, the classification boundaries formed by sampling one-shot data often misclassify large regions. In contrast, as features within a class are compacted and classes move farther apart, the intra-class to inter-class variance ratio drops, and the dependence of the class boundary on the choice of one-shot samples becomes weaker.
This intuitive argument is formalized in the following result.
Theorem 1
Consider two random variables, $x_a$ representing class $a$ and $x_b$ representing class $b$, with means $\mu_a = \mathbb{E}[x_a]$ and $\mu_b = \mathbb{E}[x_b]$. Let $x$ be the random variable equal to $x_a$ with probability $1/2$ and to $x_b$ with probability $1/2$. Assume the variance ratio bound
\[ \frac{\mathbb{E}\|x_a - \mu_a\|^2 + \mathbb{E}\|x_b - \mu_b\|^2}{\|\mu_a - \mu_b\|^2} < \varepsilon \]
holds for sufficiently small $\varepsilon > 0$. Draw random one-shot data, $\hat{x}_a$ and $\hat{x}_b$, and a test point $x$. Consider the linear classifier that assigns $x$ to class $a$ when
\[ \left\langle x - \tfrac{1}{2}(\hat{x}_a + \hat{x}_b),\; \hat{x}_a - \hat{x}_b \right\rangle > 0 \]
and to class $b$ otherwise. This classifier assigns the correct label to $x$ with probability at least $1 - c\varepsilon$ for an absolute constant $c$.
Note that the linear classifier in the theorem is simply the maximum-margin linear classifier that separates the two training points. In plain words, Theorem 1 guarantees that one-shot learning is effective when the variance ratio is small, with classification becoming asymptotically perfect as the ratio approaches zero. A proof is provided in Appendix B.
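The theorem's trend is easy to check with a small Monte-Carlo simulation (ours, not from the paper): one-shot nearest-sample accuracy between two 1-D Gaussian classes improves as the within-class spread shrinks relative to the class separation:

```python
import random

def one_shot_accuracy(sigma, trials=4000, seed=1):
    # Classes a and b are 1-D Gaussians with means -1 and +1 and std sigma.
    # Draw one sample per class, then classify a fresh test point by which
    # one-shot sample is nearer (the max-margin rule for two points).
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        xa = rng.gauss(-1.0, sigma)
        xb = rng.gauss(1.0, sigma)
        label = rng.choice((-1, 1))
        x = rng.gauss(float(label), sigma)
        pred = -1 if abs(x - xa) < abs(x - xb) else 1
        if pred == label:
            correct += 1
    return correct / trials

# Small variance ratio -> near-perfect one-shot accuracy;
# large variance ratio -> accuracy degrades substantially.
acc_small = one_shot_accuracy(sigma=0.1)
acc_large = one_shot_accuracy(sigma=1.0)
```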
4.3 Comparing Feature Representations of Meta-Learning and Classically Trained Models
We begin our investigation into the feature space of meta-learned models by visualizing features. Figure 7 contains a visual comparison of ProtoNet and MAML with a classically trained model of the same architecture on miniImageNet. Three classes are randomly chosen from the test set, and samples are taken from each class. The samples are then passed through the feature extractor, and the resulting vectors are plotted. Because feature space is high-dimensional, we perform a linear projection into two dimensions, projecting onto the first two component vectors determined by LDA. Linear discriminant analysis (LDA) projects data onto directions that minimize the intra-class to inter-class variance ratio (Mika et al., 1999), and so is ideal for visualizing class separation.

In the plots, we see that relative to the size of the point clusters, the classically trained model mashes features together, while the meta-learned models draw the classes farther apart. While visually separate class features may be neither a necessary nor a sufficient condition for few-shot performance, we take these plots as inspiration for our regularizer in the following section. Surprisingly, MAML, which updates all network parameters during fine-tuning, exhibits almost as good class separation as ProtoNet.
4.4 Feature Space Clustering Improves the Few-Shot Performance of Transfer Learning
We now further test the feature clustering hypothesis by promoting the same behavior in classically trained models. Consider a network with feature extractor $f_\theta$ and fully-connected classification layer $g$. Then, denoting the $i$-th training input in class $c$ by $x_{c,i}$, we formulate the feature clustering regularizer as
\[ \mathcal{R}_{FC}(\theta) \;=\; \frac{\frac{1}{CN}\sum_{c=1}^{C}\sum_{i=1}^{N} \| f_\theta(x_{c,i}) - \mu_c \|^2}{\frac{1}{C}\sum_{c=1}^{C} \| \mu_c - \mu \|^2}, \]
where $f_\theta(x_{c,i})$ is a feature vector corresponding to a data point in class $c$, $\mu_c$ is the mean of feature vectors in class $c$, and $\mu$ is the mean across all feature vectors. When this regularizer has value zero, classes are represented by distinct point masses in feature space, and thus the class boundary is invariant to the choice of few-shot data.
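With only two feature vectors per class in a minibatch (the sampling scheme used in training, described below), the ratio can be estimated cheaply. This pure-Python sketch is our own simplification, not the training code:

```python
def fc_penalty(pairs):
    # pairs: dict mapping class label -> a pair (f1, f2) of feature vectors
    # sampled from that class in the current minibatch.
    def mean(vectors):
        dim = len(vectors[0])
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    class_means = {c: mean(list(fs)) for c, fs in pairs.items()}
    overall = mean(list(class_means.values()))
    # ||f1 - f2||^2 / 2 is an unbiased estimate of the within-class variance.
    intra = sum(sq_dist(f1, f2) / 2 for (f1, f2) in pairs.values()) / len(pairs)
    # Scatter of the class means around the overall mean.
    inter = sum(sq_dist(m, overall) for m in class_means.values()) / len(pairs)
    return intra / inter

# During training, this value would be added to the cross-entropy loss.
```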
                                 miniImageNet        CIFAR-FS
Training            Backbone     1-shot   5-shot     1-shot   5-shot
R2D2                R2D2         %        %          %        %
Classical           R2D2         %        %          %        %
Classical w/ R_FC   R2D2         %        %          %        %
Classical w/ R_HV   R2D2         %        %          %        %
MetaOptNet-SVM      MetaOptNet   %        %          %        %
Classical           MetaOptNet   %        %          %        %
Classical w/ R_FC   MetaOptNet   %        %          %        %
Classical w/ R_HV   MetaOptNet   %        %          %        %
We incorporate this regularizer into a standard training routine by sampling two images per class in each minibatch so that we can compute a within-class variance estimate. The total loss function then becomes the sum of cross-entropy and $\mathcal{R}_{FC}$. We train the R2D2 and MetaOptNet backbones in this fashion on the miniImageNet and CIFAR-FS datasets, and we test these networks on both 1-shot and 5-shot tasks. In all experiments, feature clustering improves the performance of transfer learning and sometimes even achieves higher performance than meta-learning. Furthermore, the regularizer does not appreciably slow down classical training, which, without the expense of differentiating through an inner loop, runs 25 times faster than the corresponding meta-learning routine. See Table 3 for numerical results, and Appendix A.2 for experimental details, including training times.

4.5 Connecting Feature Clustering with Hyperplane Invariance
For further validation of the connection between feature clustering and the invariance of separating hyperplanes to data sampling, we replace the feature clustering regularizer with one that penalizes variation in the maximum-margin hyperplane separating feature vectors from opposite classes. Consider data points $x_1, x_2$ in class $a$, data points $x_3, x_4$ in class $b$, and feature extractor $f_\theta$. The difference vector $v_{i,j} = f_\theta(x_i) - f_\theta(x_j)$ determines the direction of the maximum-margin hyperplane separating the two points in feature space. To penalize variation in these hyperplanes, we introduce the hyperplane variation regularizer,
\[ \mathcal{R}_{HV} \;=\; \frac{\| v_{1,3} - v_{2,4} \|}{\tfrac{1}{2}\left( \| v_{1,3} \| + \| v_{2,4} \| \right)}. \]
This function measures the distance between the difference vectors $v_{1,3}$ and $v_{2,4}$ relative to their size. In practice, during each batch of training, we sample many pairs of classes and two samples from each class. Then, we compute $\mathcal{R}_{HV}$ on all class pairs and add these terms to the cross-entropy loss. We find that this regularizer performs almost as well as $\mathcal{R}_{FC}$ and conclusively outperforms non-regularized classical training. We include these results in Table 3. See Appendix A.2 for more details on these experiments, including training times (which, as indicated in Section 4.4, are significantly lower than those needed for meta-learning).
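A sketch of this penalty for a single class pair follows; normalizing by the average norm of the two difference vectors is our reading of "relative to their size," so treat that detail as an assumption:

```python
def hv_penalty(x1, x2, x3, x4):
    # x1, x2: feature vectors from class a; x3, x4: feature vectors from class b.
    # v1 and v2 each give a max-margin hyperplane direction between the classes;
    # the penalty is their disagreement relative to their size.
    def norm(v):
        return sum(x * x for x in v) ** 0.5
    v1 = [a - b for a, b in zip(x1, x3)]
    v2 = [a - b for a, b in zip(x2, x4)]
    diff = norm([a - b for a, b in zip(v1, v2)])
    return diff / (0.5 * (norm(v1) + norm(v2)))

# Identical difference vectors give zero penalty; rotated ones are penalized.
```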
4.6 MAML Does Not Have the Same Feature Separation Properties
We saw in the feature space plots that the first two LDA components of MAML features visually appear to separate classes. We now quantify MAML's class separation compared to transfer learning by computing our regularizer values for a pre-trained MAML model as well as a classically trained model of the same architecture. We find that, in fact, MAML exhibits even worse feature separation than a classically trained model of the same architecture. See Table 4 for numerical results.
Model    R_FC     R_HV
MAML-1   3.9406   1.9434
MAML-5   3.7044   1.8901
MAML-C   3.3487   1.8113
5 Finding Clusters of Local Minima for Task Losses in Parameter Space
Since Reptile does not fix the feature extractor during fine-tuning, it must find parameters that adapt easily to new tasks. One way Reptile might achieve this is by finding parameters that can reach a task-specific minimum by traversing a smooth, nearly linear region of the loss landscape. In this case, even a single SGD update would move parameters in a useful direction. Unlike MAML, however, Reptile does not backpropagate through optimization steps, and thus lacks information about the loss surface geometry when performing parameter updates. Instead, we hypothesize that Reptile finds parameters that lie very close to good minima for many tasks and is therefore able to perform well.
This hypothesis is further motivated by the close relationship between Reptile and consensus optimization (Boyd et al., 2011). In a consensus method, a number of models are independently optimized with their own task-specific parameters, and the tasks communicate via a penalty that encourages all the individual solutions to converge around a common value. Reptile can be interpreted as approximately minimizing the consensus formulation
\[ \min_{\theta,\,\{\tilde{\theta}_t\}} \; \sum_{t} \left( \mathcal{L}_t(\tilde{\theta}_t) + \frac{\gamma}{2} \| \tilde{\theta}_t - \theta \|^2 \right), \]
where $\mathcal{L}_t$ is the loss for task $t$, $\tilde{\theta}_t$ are task-specific parameters, and the quadratic penalty on the right encourages the parameters to cluster around a "consensus value" $\theta$. A stochastic optimizer for this loss would proceed by alternately selecting a random task/term index $t$, minimizing the loss with respect to $\tilde{\theta}_t$, and then taking a gradient step to approximately minimize the loss for $\theta$.

Reptile diverges from a traditional consensus optimizer only in that it does not explicitly consider the quadratic penalty term when minimizing for $\tilde{\theta}_t$. However, it implicitly considers this penalty by initializing the optimizer for the task-specific loss using the current value of the consensus variable $\theta$, which encourages the task-specific parameters to stay near the consensus parameters. In the next section, we replace the standard Reptile algorithm with one that explicitly minimizes a consensus formulation.
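The alternating scheme described above can be sketched on scalar quadratic task losses. Everything here (task losses, step sizes, the exact update order) is a toy illustration of consensus optimization, not the paper's algorithm:

```python
import random

def consensus_optimize(task_opts, gamma=1.0, rounds=500, lr=0.1, seed=0):
    # Minimize sum_t [ L_t(w_t) + (gamma/2) * (w_t - w)^2 ],
    # with toy task losses L_t(w_t) = (w_t - a_t)^2 for scalar optima a_t.
    rng = random.Random(seed)
    w = 0.0                                       # consensus parameters
    w_t = {t: 0.0 for t in range(len(task_opts))} # task-specific parameters
    for _ in range(rounds):
        t = rng.randrange(len(task_opts))         # pick a random task/term
        a = task_opts[t]
        # Exact minimizer of L_t(w_t) + (gamma/2)(w_t - w)^2 in w_t:
        w_t[t] = (2 * a + gamma * w) / (2 + gamma)
        # Stochastic gradient step on the quadratic penalty w.r.t. w:
        w -= lr * gamma * (w - w_t[t])
    return w, w_t

w, per_task = consensus_optimize([-1.0, 0.0, 1.0])
# The consensus parameters settle between the task optima, while each task's
# parameters sit partway between its own optimum and the consensus value.
```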
5.1 Consensus Optimization Improves Reptile
To validate the weight-space clustering hypothesis, we modify Reptile to explicitly enforce parameter clustering around a consensus value, and we find that directly optimizing the consensus formulation leads to improved performance. To this end, during each inner-loop update step in Reptile, we penalize the squared distance from the parameters for the current task to the average of the parameters across all tasks in the current batch. Namely, we let
\[ \mathcal{R}_{WC}(\tilde{\theta}_i) \;=\; d\Big( \tilde{\theta}_i,\; \tfrac{1}{n}\sum_{j=1}^{n} \tilde{\theta}_j \Big)^2, \]
where $\tilde{\theta}_i$ are the network parameters on task $i$, $n$ is the number of tasks in the batch, and $d$ is the filter-normalized distance (see Note 1). Note that as parameters shrink towards the origin, the distances between minima shrink as well. Thus, we employ filter normalization to ensure that our calculation is invariant to scaling (Li et al., 2018); see below for a description of filter normalization. This regularizer guides optimization to a location where many task-specific minima lie in close proximity. A detailed description is given in Algorithm 2, which is equivalent to the original Reptile when the regularization coefficient is zero. We call this method "Weight-Clustering."
Note 1
Consider that a perturbation to the parameters of a network is more impactful when the network has small parameters. While previous work has used layer normalization or even coarser normalization schemes, the authors of Li et al. (2018) note that, since the output of a network with batch normalization is invariant to filter scaling as long as the batch statistics are updated accordingly, every filter of such a network can be normalized independently. That work suggests that this scheme, "filter normalization," correlates better with properties of the optimization landscape. Thus, we measure distance in our regularizer using filter normalization, and we find that this technique prevents parameters from shrinking towards the origin.
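One simple way such a filter-normalized distance could be realized is sketched below: normalize each filter to unit norm before comparing, so per-filter rescaling contributes nothing to the distance. This is a sketch under our own assumptions, not necessarily the exact scheme of Li et al. (2018):

```python
def filter_norm_distance(params_a, params_b):
    # params: dict mapping filter name -> flat list of that filter's weights.
    # Each filter is normalized independently before comparison, making the
    # distance invariant to per-filter rescaling.
    def unit(v):
        n = sum(x * x for x in v) ** 0.5
        return [x / n for x in v]
    total = 0.0
    for name in params_a:
        fa, fb = unit(params_a[name]), unit(params_b[name])
        total += sum((a - b) ** 2 for a, b in zip(fa, fb))
    return total ** 0.5

a = {"conv1.f0": [1.0, 0.0], "conv1.f1": [0.0, 2.0]}
scaled = {"conv1.f0": [10.0, 0.0], "conv1.f1": [0.0, 0.4]}
assert filter_norm_distance(a, scaled) < 1e-9   # pure rescaling: zero distance
```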
We compare the performance of our regularized Reptile algorithm to that of the original Reptile method, as well as to first-order MAML (FOMAML) and a classically trained model of the same architecture. We test these methods on a sample of 100,000 5-way 1-shot and 5-shot miniImageNet tasks and find that, in both cases, Reptile with Weight-Clustering achieves higher performance than the original algorithm and significantly better performance than FOMAML and the classically trained models. These results are summarized in Table 5.
Framework           1-shot   5-shot
Classical           %        %
FOMAML              %        %
Reptile             %        %
Weight-Clustering   %        %
We note that the best-performing result was attained at a particular value of the product of the constant term collected from the gradient of the regularizer and the regularization coefficient, but a range of values up to ten times larger and smaller also produced improvements over the original algorithm. Experimental details, as well as results for other values of this coefficient, can be found in Appendix A.3.
In addition to these performance gains, we found that the parameters of networks trained using our regularized version of Reptile do not travel as far during fine-tuning as those trained using vanilla Reptile. Figure 10 depicts histograms of the filter-normalized distance traveled by both networks while fine-tuning on samples of 1,000 1-shot and 5-shot miniImageNet tasks. From these, we conclude that our regularizer does indeed move model parameters toward a consensus that is near good minima for many tasks.
6 Discussion
In this work, we shed light on two key differences between meta-learned networks and their classically trained counterparts. We find evidence that meta-learning algorithms minimize the variation between feature vectors within a class relative to the variation between classes. Moreover, we design two regularizers for transfer learning inspired by this principle, and our regularizers consistently improve few-shot performance. The success of our method helps to confirm the hypothesis that minimizing within-class feature variation is critical for few-shot performance.
We further notice that Reptile resembles a consensus optimization algorithm, and we enhance the method by designing yet another regularizer, which we apply to Reptile in order to find clusters of local minima in the loss landscapes of tasks. Our experiments show that this regularizer improves both one-shot and five-shot performance of Reptile on miniImageNet.
References
Altae-Tran et al. (2017). Low data drug discovery with one-shot learning. ACS Central Science 3(4), 283–293.
Bertinetto et al. (2018). Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136.
Boyd et al. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1), 1–122.
Chen et al. (2019). A closer look at few-shot classification. arXiv preprint arXiv:1904.04232.
Dhillon et al. (2019). A baseline for few-shot image classification. arXiv preprint arXiv:1909.02729.
Finn et al. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, 1126–1135.
Frosst et al. (2019). Analyzing and improving representations with the soft nearest neighbor loss. arXiv preprint arXiv:1902.01889.
Goldblum et al. (2019). Robust few-shot learning with adversarially queried meta-learners. arXiv preprint arXiv:1910.00982.
He et al. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
Kornblith et al. (2019). Do better ImageNet models transfer better? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2661–2671.
Lee et al. (2019). Meta-learning with differentiable convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 10657–10665.
Li et al. (2018). Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, 6389–6399.
Mika et al. (1999). Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop, 41–48.
Nagabandi et al. (2018). Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. arXiv preprint arXiv:1803.11347.
Nichol and Schulman (2018). Reptile: a scalable meta-learning algorithm. arXiv preprint arXiv:1803.02999.
Oreshkin et al. (2018). TADAM: task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, 721–731.
Snell et al. (2017). Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 4077–4087.
Song et al. (2019). Fast and generalized adaptation for few-shot learning. arXiv preprint arXiv:1911.10807.
Vinyals et al. (2016). Matching networks for one shot learning. In Advances in Neural Information Processing Systems, 3630–3638.
Wang et al. (2020). Pay attention to features, transfer learn faster CNNs. In International Conference on Learning Representations.
Appendix A Experimental Details
The miniImageNet and CIFAR-FS datasets can be found at https://github.com/yaoyao-liu/mini-imagenet-tools and https://github.com/ArnoutDevos/maml-cifarfs, respectively.
A.1 Mixing Meta-Learned Models and Fine-Tuning Procedures: Additional Experiments
Model          SVM             RR              ProtoNet        MAML
MetaOptNet-M   78.63 ± 0.25%   76.96 ± 0.23%   76.17 ± 0.23%   70.14 ± 0.27%
MetaOptNet-C   76.72 ± 0.24%   74.48 ± 0.24%   73.37 ± 0.24%   71.32 ± 0.26%
R2D2-M         68.40 ± 0.20%   72.09 ± 0.25%   70.74 ± 0.25%   71.43 ± 0.27%
R2D2-C         68.24 ± 0.26%   67.04 ± 0.26%   60.93 ± 0.29%   65.30 ± 0.27%
A.2 Transfer Learning and Feature Space Clustering
We evaluate the proposed regularizers and classically trained baselines on two backbone architectures: a 4-layer convolutional neural network with 96-192-384-512 filters per layer, originally used for R2D2 (Bertinetto et al., 2018), and ResNet-12 (He et al., 2016; Oreshkin et al., 2018; Lee et al., 2019). We run experiments on the miniImageNet and CIFAR-FS datasets.

When training the backbone feature extractors, we use SGD with a batch size of 128 for CIFAR-FS and 256 for miniImageNet, Nesterov momentum set to 0.9, and weight decay. For training on CIFAR-FS, we set the initial learning rate to 0.1 for the first 100 epochs and reduce it by a factor of 10 every 50 epochs. To avoid gradient explosion problems, we use 15 warm-up epochs for miniImageNet with learning rate 0.01. We train all classically trained networks for a total of 300 epochs. We employ data parallelism across 2 Nvidia RTX 2080 Ti GPUs when training on miniImageNet, and we use only one GPU for each CIFAR-FS experiment. For few-shot testing, we train two classification heads, a linear NN layer and an SVM (Lee et al., 2019), on top of the pre-trained feature extractors. The evaluation results of these models are given in Table 8. Table 7 shows the running time per training epoch as well as the total training time on both datasets and backbone architectures to achieve the results in Table 3. The training speed of the proposed regularizers is nearly as fast as classical transfer learning and up to almost 12 times faster than meta-learning methods. For meta-learning methods, we follow the training hyperparameters from Lee et al. (2019).

Training            Backbone     miniImageNet runtime   CIFAR-FS runtime
R2D2                R2D2         16m/16.8h              44s/45m
Classical           R2D2         20s/1.7h               4s/22m
Classical w/ R_FC   R2D2         20s/1.7h               4s/24m
Classical w/ R_HV   R2D2         20s/1.7h               4s/23m
MetaOptNet-SVM      MetaOptNet   1.5h/88.0h             4m/4.5h
Classical           MetaOptNet   1.4m/7.0h              14s/1.2h
Classical w/ R_FC   MetaOptNet   1.5m/7.4h              15s/1.3h
Classical w/ R_HV   MetaOptNet   1.3m/7.2h              16s/1.4h
                                         miniImageNet        CIFAR-FS
Backbone    Regularizer   Coeff   Head   1-shot   5-shot     1-shot   5-shot
R2D2        R_FC          0.02    NN     %        %          %        %
R2D2        R_FC          0.05    NN     %        %          %        %
R2D2        R_FC          0.1     NN     %        %          %        %
R2D2        R_HV          0.02    NN     %        %          %        %
R2D2        R_HV          0.05    NN     %        %          %        %
R2D2        R_HV          0.1     NN     %        %          %        %
R2D2        R_FC          0.02    SVM    %        %          %        %
R2D2        R_FC          0.05    SVM    %        %          %        %
R2D2        R_FC          0.1     SVM    %        %          %        %
R2D2        R_HV          0.02    SVM    %        %          %        %
R2D2        R_HV          0.05    SVM    %        %          %        %
R2D2        R_HV          0.1     SVM    %        %          %        %
ResNet-12   R_FC          0.02    NN     %        %          %        %
ResNet-12   R_FC          0.05    NN     %        %          %        %
ResNet-12   R_FC          0.1     NN     %        %          %        %
ResNet-12   R_HV          0.02    NN     %        %          %        %
ResNet-12   R_HV          0.05    NN     %        %          %        %
ResNet-12   R_HV          0.1     NN     %        %          %        %
ResNet-12   R_FC          0.02    SVM    %        %          %        %
ResNet-12   R_FC          0.05    SVM    %        %          %        %
ResNet-12   R_FC          0.1     SVM    %        %          %        %
ResNet-12   R_HV          0.02    SVM    %        %          %        %
ResNet-12   R_HV          0.05    SVM    %        %          %        %
ResNet-12   R_HV          0.1     SVM    %        %          %        %

A.3 Reptile Weight Clustering

We train models via our Weight-Clustering Reptile algorithm with a range of coefficients for the regularization term. The model architecture and all other hyperparameters were chosen to match those specified for Reptile training and evaluation on 1-shot and 5-shot miniImageNet in Nichol and Schulman (2018). The evaluation results of these models are given in Table 9. All models were trained on Nvidia RTX 2080 Ti GPUs.

Coefficient    1-shot   5-shot
0 (Reptile)    %        %
Appendix B Proof of Theorem 1

Let $R = \|\mu_a - \mu_b\|$ and $\delta = R/8$, and assume without loss of generality that the test point $x$ is drawn from class $a$. Consider the three conditions
\[ \|\hat{x}_a - \mu_a\| < \delta, \qquad \|\hat{x}_b - \mu_b\| < \delta, \qquad \|x - \mu_a\| < \delta, \]
where $\mu_a$ is the expected value of $x_a$. Under these conditions,
\[ x - \tfrac{1}{2}(\hat{x}_a + \hat{x}_b) = \tfrac{1}{2}(\mu_a - \mu_b) + e_1, \qquad \|e_1\| \le 2\delta, \]
and
\[ \hat{x}_a - \hat{x}_b = (\mu_a - \mu_b) + e_2, \qquad \|e_2\| \le 2\delta. \]
Combining the above yields
\[ \left\langle x - \tfrac{1}{2}(\hat{x}_a + \hat{x}_b),\; \hat{x}_a - \hat{x}_b \right\rangle \;\ge\; \tfrac{1}{2}R^2 - 2\delta R - \delta R - 4\delta^2 \;=\; \tfrac{1}{16}R^2 \;>\; 0, \]
and so $x$ is classified correctly if our three conditions hold. From the Chebyshev bound, each condition fails with probability at most $\mathbb{E}\|x_a - \mu_a\|^2/\delta^2$ or $\mathbb{E}\|x_b - \mu_b\|^2/\delta^2$, so all three conditions hold simultaneously with probability at least
\[ 1 - \frac{2\left( \mathbb{E}\|x_a - \mu_a\|^2 + \mathbb{E}\|x_b - \mu_b\|^2 \right)}{\delta^2}. \tag{1} \]
Finally, the variance ratio bound gives
\[ \mathbb{E}\|x_a - \mu_a\|^2 + \mathbb{E}\|x_b - \mu_b\|^2 < \varepsilon R^2, \]
and so, since $\delta^2 = R^2/64$,
\[ \frac{2\left( \mathbb{E}\|x_a - \mu_a\|^2 + \mathbb{E}\|x_b - \mu_b\|^2 \right)}{\delta^2} < 128\,\varepsilon. \]
Plugging this into (1), we get the final probability bound
\[ \Pr[\text{correct}] \;\ge\; 1 - 128\,\varepsilon, \tag{2} \]
which proves the theorem with $c = 128$.