1 Introduction
Few-shot learning aims to learn new concepts from a handful of training examples, e.g. from 1 or 5 training images [30, 11, 54]. Humans handle this well, whereas it remains challenging for machine learning models, which typically require a significant amount of training data for good performance [26]. For instance on the CIFAR100 dataset, a classification model trained in the fully supervised mode achieves accuracy for the 100-class setting [9], while the best-performing 1-shot model achieves only on average for the much simpler 5-class setting [54]. On the other hand, many real-world applications lack significant amounts of training data, e.g. in the medical domain. It is thus desirable to improve machine learning models to handle few-shot settings.
The nature of few-shot learning with very scarce training data makes it difficult to train powerful machine learning models for new concepts. Meta-learning methods aim to tackle this problem by transferring experience from similar few-shot learning tasks [7]. There are different meta strategies, among which the optimization-based methods are particularly promising for today’s neural networks [11, 12, 17, 13, 29, 63, 54, 2]. These methods follow a unified training process that contains two loops. The inner loop learns a base-learner for an individual task, and the outer loop then uses the validation performance of the learned base-learner to optimize the meta-learner. In previous work [11, 12, 2], the task of the meta-learner is to effectively initialize the base-learner.
In this work we address two shortcomings of previous work. First, the learning process of a base-learner for few-shot tasks is quite unstable [2], and often results in low performance. An intuitive solution is to train an ensemble of models and use the combined prediction, which should be more robust [6, 41, 24]. However, it is not obvious how to obtain and combine an ensemble of base-learners when only a few training samples are available. Rather than learning multiple independent base-learners, we propose to use the sequence of base-learners obtained while training a single base-learner as the ensemble, and also to learn automatically how to weigh them for best performance. Second, it is well known that the values of various hyperparameters are critical for best performance, which is particularly important in few-shot learning settings. We thus propose to also meta-learn two important hyperparameters, namely the learning rate and the regularization weight. We call the resulting novel meta-learning approach LCC. LCC explicitly Learns to Customize multiple base-learners and learns to Combine their prediction results. Our “multiple base-learners” are different models since each one of them results from a specific training epoch and is trained with a specific set of hyperparameter values. LCC sets these hyperparameters to be fine-grained, e.g. layer-wise learning rates, in order to enable more efficient model exploration. During test, LCC combines multiple base-learners’ predictions using soft weights in order to produce more robust results. Overall, the used hyperparameters and soft weights are also meta-learning targets of LCC. For meta-training we leverage meta gradient descent methods that have been shown to be effective [11, 54, 2, 12, 45].
Importantly, fast model adaptation is an objective of meta-learning. In the adaptation process, the most active adapting behaviors happen in the early epochs, after which the model converges to, and may even overfit, the training data in later epochs. Related works use a single base-learner (usually from the last epoch), so their meta-learners learn only partial adaptation experience [11, 54, 12]. By contrast, our LCC leverages an ensemble modeling strategy that adapts base-learners at different training epochs with optimized hyperparameters. Its meta-learner thus obtains the optimized combinational experience. Figure 1 shows that our approach improves the generalization ability substantially over the baseline approach that uses a single base-learner with standard hyperparameters [11].
Our overall contribution is thus threefold. (1) We propose the novel meta-learning approach LCC that learns to combine an ensemble of base-learners for few-shot learning. LCC learns both how to combine an ensemble of base-learners and how to train these models automatically with fine-grained hyperparameters, e.g. layer-wise learning rates and regularization weights. (2) Extensive experiments on two challenging few-shot benchmarks, miniImageNet [59] and Fewshot-CIFAR100 (FC100) [40]. (3) In-depth analysis of the learning process of LCC. We report several interesting observations for automatic adaptation. For example, the learning rate of the later-epoch base-learner is often slightly higher, which is opposite to the common schedule, i.e. monotonically decreasing the learning rate, of large-scale network training [18, 55].
2 Related works
Few-shot learning & meta-learning. Research literature on few-shot learning paradigms exhibits a high diversity, from using data augmentation techniques [60, 62] over sharing feature representation [3, 61] to meta-learning [16, 58]. In this paper, we focus on the meta-learning paradigm that leverages few-shot learning experiences from similar tasks, based on the episodic formulation (see Section 3.1). Related work can be roughly divided into three categories: (1) metric learning methods [59, 49, 57] aim to learn a similarity space in which learning should be efficient for few-shot examples; (2) memory network methods [37, 46, 40, 35] aim to learn training “experience” from seen tasks and then generalize to the learning of unseen ones; and (3) gradient descent based methods [11, 12, 2, 43, 29, 17, 63, 54] usually employ a meta-learner that learns to quickly adapt a NN base-learner to a new task within a few iterations. State-of-the-art models are MAML [11] and its recent improved version MAML++ [2]. Their meta-learners learn to effectively initialize the parameters of a NN base-learner for a new task. Our approach is closely related to MAML-related methods [11, 2]. An important difference is that we learn how to customize and how to combine an ensemble of base-learners for robust model prediction, while MAML [11] and MAML++ [2] use a single base-learner.
Hyperparameter optimization. Building a model for a new task is a process of exploration-exploitation. Exploring suitable architectures and hyperparameters is important before training. Traditional methods are model-free, e.g. grid search. Bergstra and Bengio [5] advocated using random search over grid search. Li et al. [31] improved random search by adaptively allocating resources to promising configurations. Jaderberg et al. [23] scheduled a population of networks in parallel, and periodically replaced the weights of underperforming networks by better ones. These methods require multiple full training trials and are thus costly. Model-based hyperparameter optimization methods are adaptive but sophisticated, e.g. using random forests [20], Gaussian processes [50] and input warped Gaussian processes [52] or scalable Bayesian optimization [51]. In our approach, we meta-learn hyperparameters by a simple and elegant gradient descent method, without additional manual labor. Related methods using gradient descent mostly work for single network training [4, 10, 33, 32, 13, 34]. In contrast, we aim to learn a sequence of hyperparameters for multiple base-learners.
Ensemble modeling. It is a strategy that aims to improve machine learning performance using multiple algorithms, and has proved to effectively reduce problems related to overfitting [27, 53]. Mitchell [36] provided a theoretical explanation for it. Boosting is one classical way to build an ensemble by training new models with an emphasis on hard samples, e.g. AdaBoost [14] and Gradient Tree Boosting [15]. Stacking combines multiple models by learning a combiner like a logistic regression model. It applies to both supervised learning tasks [6, 41, 24][48]. Bootstrap aggregating (bagging) builds an ensemble using models generated in parallel to reduce the variance [6], e.g. Random Forests [19]. In few-shot settings, it is hard to train plenty of different models in parallel. Our approach makes use of the ensemble of training epochs to obtain different models. Ensembling models in a temporal way [28] and utilizing features extracted by an ensemble of attribute models [56] are also related works. Compared to them, our difference lies in that our multiple models are customized with optimized hyperparameters and combined with learned weights, automatically.
3 Preliminary
This section first introduces the unified episodic formulation of few-shot learning, following related works [59, 43, 11, 40, 54, 45]. Then, we briefly introduce the meta gradient descent of a meta-learner based on a single base model, which is commonly used in related works [11, 12, 54, 2].
3.1 Episodic formulation
The episodic formulation was first proposed for few-shot learning in [59]. The problem definition of few-shot learning differs from traditional image classification in three aspects: (1) the main phases are not train and test but meta-train and meta-test, each of which includes training and testing; (2) the samples in meta-train and meta-test are not datapoints but episodes, i.e. few-shot classification tasks; and (3) the objective is not to classify unseen datapoints but to quickly adapt the meta-learned experience or knowledge to the learning of a new few-shot classification task.
Given a dataset for meta-train, we first sample few-shot episodes (tasks) from a task distribution such that each episode contains few samples of few classes, e.g. 5 classes and 1 shot per class. Each episode includes a training split to optimize a specific base-learning network, and a test split to compute a generalization loss used to optimize a global meta-learner. For meta-test, given an unseen dataset, we sample a test task to have the same-size training/test splits. “Unseen” means there is no overlap of image classes between meta-test and meta-train tasks. We first initiate a new model with meta-learned network parameters (ours with additional hyperparameters), then train this model on the training split. We finally evaluate the performance on the test split. If we have multiple unseen tasks for meta-test, we report average accuracy as the final result.
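As an illustration of the episodic sampling described above, the following minimal Python sketch draws one N-class, K-shot episode from a class-indexed dataset. The function name `sample_episode`, the toy dataset, and the number of test samples per class (`n_query=15`) are assumptions made for this example only; they are not taken from the experimental setup.

```python
import random


def sample_episode(class_to_items, n_way=5, k_shot=1, n_query=15, rng=None):
    """Draw one n_way-class, k_shot episode as (train_split, test_split)."""
    rng = rng or random.Random()
    # Pick the classes of this episode; sorting keeps a seeded draw reproducible.
    classes = rng.sample(sorted(class_to_items), n_way)
    train_split, test_split = [], []
    for label, cls in enumerate(classes):
        # Draw disjoint training and test samples for this class.
        items = rng.sample(class_to_items[cls], k_shot + n_query)
        train_split += [(item, label) for item in items[:k_shot]]
        test_split += [(item, label) for item in items[k_shot:]]
    return train_split, test_split


# Toy dataset (hypothetical): 20 classes with 30 items each.
toy_data = {f"class_{c}": [f"img_{c}_{i}" for i in range(30)] for c in range(20)}
train_split, test_split = sample_episode(toy_data, rng=random.Random(0))
print(len(train_split), len(test_split))  # 5 75
```

Sampling meta-test episodes from classes that never occur in the meta-train dictionary would give the “unseen” condition stated above.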
3.2 Meta gradient descent
Meta gradient descent is a classical way of outer-loop optimization [58, 47, 39]. MAML [11] first applied this to supervised meta-learning and reinforcement learning. It optimizes meta parameters (meta-learner) that are used to initialize a specific model (base-learner) for fast adaptation to a new task [11]. It trains a single base-learner for prediction in each episode.
Given an episode , we initialize the base-learner parameters as , then adapt them by gradient descent using the loss on the training datapoints ,
(1) 
where is the penalty with a fixed hyperparameter , is a fixed learning rate and the epoch number. Each basetraining contains epochs. After epochs, a validation loss of is computed based on . The corresponding gradient on is called meta gradient, and it unrolls through the entire base adaptation procedure from to (the itself). The update of is thus to apply a meta gradient descent computation as follows,
(2) 
where is the meta-learning rate. This meta gradient update involves a gradient through a gradient. It requires an additional backward pass through the base-learner to compute Hessian-vector products [11], which is supported by standard libraries such as TensorFlow [1]. In the following, we show how to leverage meta gradient descent within our approach.
4 Learning to Customize and Combine (LCC)
As shown in Figure 2, our LCC both learns a sequence of base-learners and learns to combine their prediction scores during test for best performance. Hyperparameters are learned by meta gradients automatically.
4.1 Initiate the sequence of baselearners
We use the sequence of base-learners obtained from training a single base-learner as the ensemble. We can thus formulate the initiation of these base-learners in a sequential manner. Our “initiation” here includes the initialization of neural network parameters, i.e., weights and biases (the initialization of the 1st base-learner is the same as for MAML [11]), as well as the configuration of specific hyperparameters, for the sequence of base-learners.
Given an episode , let corresponds to the parameters of the baselearner working at epoch (w.r.t. the th baselearner or BL), with . First, we initiate BL with the initialization parameters (network weights and bias), as well as with specific hyperparameters, i.e. learning rates and regularization weights . We then adapt BL using gradient descent on the training split , and its updated weights and bias are then used to initialize the parameters of BL. We formulate the general process as follows,
(3) 
(4) 
where denotes the learning rate specified for BL, and is the training loss. Note that is introduced to make the notation consistent. If we use to initialize function BL mapping the inputs to the prediction scores, the training loss of can be unfolded as,
(5) 
where is the softmax cross entropy loss, and is the regularization of network weights and is the regularization weight specified for BL. The meta optimization on hyperparameters and is given in Section 4.2.
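The sequential initiation in Eqs. 3 to 5 can be illustrated on a toy problem. The sketch below is a simplification under stated assumptions: a scalar linear model stands in for the network, a squared error stands in for the softmax cross entropy loss, the regularizer is assumed to be an L2 penalty, and all names (`adapt_sequence`, `alphas`, `lambdas`) are hypothetical.

```python
def adapt_sequence(theta0, alphas, lambdas, xs, ys):
    """Return the parameters of every base-learner for the toy model y = theta * x.

    Each base-learner starts from the parameters of the previous one and takes
    a single gradient step with its own learning rate and regularization weight.
    """
    thetas, theta = [], theta0
    for alpha, lam in zip(alphas, lambdas):
        # Gradient of the mean squared error on the training split ...
        data_grad = sum(2 * (theta * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # ... plus the gradient of the assumed penalty (lam / 2) * theta ** 2.
        grad = data_grad + lam * theta
        theta = theta - alpha * grad
        thetas.append(theta)
    return thetas


# Toy training split generated by y = 2x (hypothetical data).
xs, ys = [1.0, 2.0], [2.0, 4.0]
thetas = adapt_sequence(0.0, alphas=[0.1, 0.1, 0.1], lambdas=[0.0, 0.0, 0.0], xs=xs, ys=ys)
print(thetas)  # one parameter value per base-learner, moving towards 2.0
```

Keeping every intermediate parameter value, rather than only the last one, is what turns a single training run into the ensemble used by LCC.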
4.2 Learn to customize baselearners
As introduced in Section 4.1, the specific learning rate and regularization weight are used to configure the th baselearner. It is well known that finegrained hyperparameters, e.g. layerwise learning rates, are more efficient, but exponentially expensive to set by hand [21, 5]. Our LCC does not have this problem and can learn finegrained hyperparameters without additional labour. Therefore, we use layerwise learning rates and regularization weights as and , where is the layer number. When plugging and into Eq. 4, we get all baselearners with finegrained customization.
Our LCC automatically optimizes and by meta gradient descent. First, it computes the validation loss on the test split as,
(6) 
which is based on the sequence of baselearners. denotes the combination of their predictions, and its detailed computation is given in Section 4.3.
Then, it uses to compute meta gradients of or , which unrolls the entire adaptation process on the sequence of baselearners back to the initiation step. Thus, the sequence of involved hyperparameters or can be updated as,
(7) 
(8) 
where and are metalearning rates determining the update stepsize of hyperparameter values.
The meta updates in Eq. 7 and Eq. 8 involve the backward pass through BL to BL. Derivatives are backpropagated through the unfolded inner loop (of every baselearner) which contains all convolutional and fullyconnected layers. The corresponding layerwise learning rates and regularization weights thus all get updated.
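To make the meta update of the hyperparameters more concrete, the following sketch applies one meta step to the per-epoch learning rates of a toy scalar model. It is not our implementation: the meta gradient is approximated by central finite differences instead of backpropagation through the unrolled inner loop, a squared error replaces the cross entropy loss, regularization is omitted, and the names `adapt`, `validation_loss` and `meta_update` are hypothetical.

```python
def adapt(theta0, alphas, xs, ys):
    """Inner loop: one gradient step per base-learner, each with its own learning rate."""
    thetas, theta = [], theta0
    for alpha in alphas:
        grad = sum(2 * (theta * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        theta = theta - alpha * grad
        thetas.append(theta)
    return thetas


def validation_loss(alphas, weights, theta0, train, test):
    """Squared error of the weighted combination of all base-learner predictions."""
    thetas = adapt(theta0, alphas, *train)
    xs, ys = test
    total = 0.0
    for x, y in zip(xs, ys):
        pred = sum(w * t * x for w, t in zip(weights, thetas))
        total += (pred - y) ** 2
    return total / len(xs)


def meta_update(alphas, weights, theta0, train, test, meta_lr=1e-4, eps=1e-6):
    """One meta step on every learning rate (finite-difference stand-in for the meta gradient)."""
    updated = []
    for m in range(len(alphas)):
        up, down = list(alphas), list(alphas)
        up[m] += eps
        down[m] -= eps
        grad = (validation_loss(up, weights, theta0, train, test)
                - validation_loss(down, weights, theta0, train, test)) / (2 * eps)
        updated.append(alphas[m] - meta_lr * grad)
    return updated


train = ([1.0, 2.0], [2.0, 4.0])  # toy training split generated by y = 2x
test = ([3.0], [6.0])             # toy test split
alphas = [0.05, 0.05, 0.05]
weights = [1 / 3] * 3
new_alphas = meta_update(alphas, weights, 0.0, train, test)
print(new_alphas)
```

In this toy setting all three learning rates increase, because larger early steps bring every base-learner closer to the target before the predictions are combined.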
4.3 Learn to combine baselearners
As introduced in Sections 4.1 and 4.2, our LCC optimizes the parameters as well as hyperparameters for the sequence of base-learners. For prediction, it uses the weighted sum of the sequence of prediction scores (from all base-learners). It optimizes the combination weights by meta gradient descent.
First, we formulate the prediction scores of a single baselearner as:
(9) 
For multiple baselearners, we define the combination weights as , and thus compute the combination as follows,
(10) 
Similar to the meta updates on and , we update as,
(11) 
where is the stepsize of this update. is the validation loss as follows,
(12) 
which uses the weighted sum of all model predictions, and is also the expanded version of Eq. 6.
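The combination step can be sketched for a single test sample as follows. This is a minimal illustration under assumptions: the weights are applied to the raw prediction scores before the softmax, they are not normalized, the weight gradient is written out by hand, and the names `combine`, `val_loss` and `update_weights` are hypothetical.

```python
import math


def combine(scores_per_learner, weights):
    """Weighted sum of the per-class prediction scores of all base-learners."""
    n_classes = len(scores_per_learner[0])
    return [sum(w * s[c] for w, s in zip(weights, scores_per_learner))
            for c in range(n_classes)]


def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def val_loss(scores_per_learner, weights, label):
    """Softmax cross entropy of the combined scores for one test sample."""
    return -math.log(softmax(combine(scores_per_learner, weights))[label])


def update_weights(scores_per_learner, weights, label, step=0.1):
    """One gradient step on the combination weights."""
    probs = softmax(combine(scores_per_learner, weights))
    new_weights = []
    for w, s in zip(weights, scores_per_learner):
        # d loss / d w_m = sum_c (p_c - 1[c == label]) * s_m[c]
        grad = sum((p - (c == label)) * s[c] for c, p in enumerate(probs))
        new_weights.append(w - step * grad)
    return new_weights


# Three base-learners, two classes; the first learner favours the wrong class.
scores = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
weights = [1 / 3] * 3  # initialized with the reciprocal of the base-learner number
new_weights = update_weights(scores, weights, label=1)
print(new_weights)
```

After one step the weight of the base-learner that favours the wrong class shrinks and the weight of the most confident correct base-learner grows, which lowers the validation loss on this sample.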
4.4 Overall optimization and algorithm
When including the initialization parameters [11] (for the initialization of st baselearner), we have the overall formulation of metaparameterization as:
(13) 
where and is the stepsize for updating . For the computation of , we apply the metabatch strategy in the iteration of episode training, following [11]. At each iteration, we sample a batch of episodes and then compute the average validation loss as,
(14) 
5 Experiments
We evaluate and analyze the proposed LCC approach in terms of its overall performance and the effects of its two components, i.e. using multiple base-learners and meta-learning hyperparameters. We first describe the datasets and detailed settings, then compare the results to state-of-the-art methods and conduct an ablation study.
5.1 Datasets and implementation details
We conduct few-shot learning experiments on two benchmarks: miniImageNet [59] and Fewshot-CIFAR100 (FC100) [40]. The former is widely used in related works [11, 43, 17, 13, 38, 54], and the latter is more challenging due to lower image resolution and harder training-test splits [40, 54].
miniImageNet was proposed by Vinyals et al. [59] for evaluation of few-shot learning. It is complex because it uses ImageNet images, but requires fewer resources and infrastructure than running models on full ImageNet [44]. There are classes with samples of color images per class. Classes are divided into , , and classes respectively for sampling tasks for meta-training, meta-validation and meta-test, following related works [11, 43, 17, 13, 38].
Fewshot-CIFAR100 (FC100) is based on the popular object classification dataset CIFAR100 [25]. The splits were proposed by [40], see details in the supplementary. It offers a more challenging scenario with lower image resolution and more challenging meta-train/meta-test splits (separated according to the superclasses of objects) than miniImageNet. It contains object classes and each class has samples of color images per class. The classes belong to superclasses. Meta-train data are from classes belonging to superclasses. Meta-validation and meta-test data are from the other two classes belonging to superclasses, respectively. These splits according to superclasses minimize the information overlap between meta-train and meta-test (meta-validation) tasks.
The following settings are shared for both datasets. We use the same task sampling used in related works [11, 43, 12, 2]. Specifically, we consider the 5-class classification and sample 5-class, 1-shot (5-shot or 10-shot) episodes to contain 1 (5 or 10) samples as episode train data, and (a uniform number) samples as episode test data. In total, we sample tasks for meta-training, and respectively sample random tasks for meta-validation and meta-test.
The base architecture is 4CONV, which is commonly used in related works [43, 11, 49, 57, 17, 2]. 4CONV consists of layers with convolutions and filters, followed by batch normalization (BN) [22], a ReLU nonlinearity, a max-pooling layer, and a fully-connected layer.
The configuration of meta-learners. The network initialization parameters have the same architecture as the base-learner, except that the BN, nonlinear and max-pooling layers are removed. The architectures of and depend on both the number and architecture of the base-learner. In our default setting, base-learners with 4CONV architecture are learned in the ensemble, so and consist of (for weights and biases) and (only for weights) different variables, respectively. The architecture of the combination weights is related to the number of base-learners in the ensemble, so it has variables.
The initialization of meta-learners. is initialized randomly, which is the same as MAML [11]. All weights of and are initialized with and respectively. All the weights of are initialized with the reciprocal of the base-learner number, i.e. .
The hyperparameters of meta-learners. The meta iteration number is set to k and k for MAML and MAML++ respectively. The meta batch size is , and the meta learning rate for the initialization parameters is (). All the above settings exactly follow [11] and [2]. For the newly added meta-learners, the meta learning rates are set to , , and for , , and respectively.
The most related methods. MAML [11] is commonly used as a baseline, and MAML++ [2] is the most recently published state-of-the-art method that also uses 4CONV as the base architecture. MAML++ introduced six training tips which contribute to a stable and efficient meta-training process. Our approach is called LCC. If we use the training tips of MAML++, we obtain an improved version called LCC++. Note that LCC++ and MAML++ overlap in learning layer-wise learning rates. For this part, we use our implementation as we can set flexible stepsizes for the meta update. Therefore, LCC++ actually uses the other five training tips of MAML++.
5.2 Results and analyses
We conduct extensive few-shot learning experiments. In Table 1 and Table 2, we present our results compared to the state-of-the-art on the miniImageNet and FC100 datasets, respectively. In Table 3, we provide an ablation study for several components of our approach on miniImageNet. In Figure 4, we show the specific changes in the recognition accuracies in different ablative settings. In Figure 3, we plot the weight changes of multiple base-learners during meta-learning in (a), and show the resulting performance boost compared to baseline settings in (b), as “multiple base-learners” is one of our main contributions. For the other contribution of “meta-learning hyperparameters”, we plot extensive curves in Figure 5 and Figure 6.
Table 1: Few-shot classification accuracy (%) on miniImageNet.
Method              Arch.      1-shot  5-shot
TADAM [40] †        ResNet-12  58.5    76.7
MTL [54] †          ResNet-12  61.2    75.5
LEO [45] †          WRN-28     61.76   77.59
PFA [42] †          WRN-28     59.60   73.74
MatchingNets [59]   4CONV      43.44   55.31
ProtoNets [49]      4CONV      49.42   68.20
MetaLSTM [43]       4CONV      43.56   60.60
Bilevel [13]        4CONV      50.54   64.53
CompareNets [57]    4CONV      50.44   65.32
LLAMA [17]          4CONV      49.40   –
Baseline++ [8]      4CONV      48.24   66.43
MAML [11]           4CONV      48.70   63.11
MAML++ [2]          4CONV      52.15   68.32
LCC (Ours)          4CONV      54.0    65.8
LCC++ (Ours)        4CONV      54.6    71.1
† Pre-trained on many-shot classification task.
Overview on miniImageNet. In Table 1, we can see that our LCC++ achieves the best performance in both the 1-shot (54.6%) and 5-shot (71.1%) settings, compared to the methods with the same 4CONV architecture. Only methods [40, 54, 45, 42] that use deeper neural networks with expensive pre-training as an important preprocessing step obtain higher performance. We expect further gains of our approach using similar pre-training strategies.
Table 2: Few-shot classification accuracy (%) on FC100.
Method          1-shot  5-shot  10-shot
TADAM [40] †    40.1    56.1    61.5
MTL [54] †      45.1    57.6    63.4
MAML [11] ‡     38.1    50.4    56.2
MAML++ [2] §    38.7    52.9    58.8
LCC (Ours)      40.6    52.7    56.9
LCC++ (Ours)    39.7    55.2    60.9
‡ Reported in [54].
§ Our implementation using the public code.
† Pre-trained on many-shot classification task.
Overview on FC100. In Table 2, we present the results of TADAM [40] and MTL [54] using their reported numbers. We note that the numbers of MAML are from [54], and those of MAML++ are our results using the public code. When comparing methods using the same base learning architecture 4CONV, that is LCC vs. MAML and LCC++ vs. MAML++, we can see that our approach LCC (LCC++) obtains better performance. For example, LCC++ achieves 1.0%, 2.3%, and 2.1% improvement on 1-shot, 5-shot, and 10-shot respectively over MAML++. Quite interestingly, on this more challenging dataset, our approach (4CONV) achieves comparable results to TADAM, which uses a pre-trained and deeper network (ResNet-12).
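Table 3: Ablation study on miniImageNet (accuracy, %).
No.         Meta-learned     Meta-learned            Combination  1-shot  5-shot
            learning rates   regularization weights
1           –                –                       E            47.0    62.0
2           –                –                       S            48.0    62.4
3           ✓                –                       S            49.7    64.4
4           –                ✓                       S            49.0    63.4
5           ✓                ✓                       S            49.0    65.0
6           –                –                       L            49.7    65.4
7           ✓                –                       L            52.9    65.6
8           –                ✓                       L            48.6    64.7
LCC (Ours)  ✓                ✓                       L            54.0    65.8
“oracle”    –                –                       O            52.4    64.7
Combination: S, a single base-learner; E, multiple base-learners with fixed average weights; L, multiple base-learners with learnable combination weights; O, the “Optimal” fixed weights described in the text.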
Multiple base-learners with learnable weights . In Table 3, we can see that with fixed and , our approach using multiple base-learners with a learnable weighting scheme (No.6) performs better than a single base-learner (No.2) as well as multiple base-learners with fixed average weights (No.1). Please note that No.2 essentially corresponds to the setting of MAML, but the results here are slightly lower than reported in Table 1 (48.70, 63.11). This is because the original MAML ran meta-train with epochs and ran meta-test with epochs. Here, we report the results of using meta-test with epochs (for fair comparison with our approach which also uses epochs). The last row of Table 3 shows the “oracle” results by assuming the “Optimal” values of have been learned by LCC and are fixed during training. They are clearly higher than the results of any arbitrary (No.1 or No.2), especially in the 1-shot setting.
For No.1, 2, 6, the validation accuracies during meta-train are shown in Figure 3(b). No.1 gives BL1 with weights fixed to which causes stronger fluctuations in later iterations (red curve). By contrast, our method automatically adjusts this weight to close to when other learners become mature, see Figure 3(a). With automatic tuning, our approach performs the best during the entire meta-train.
From Figure 3(a) we can observe that the weights of the base-learners are initialized as and then adapted over time. Intuitively, an increase relates to the fact that a base-learner becomes more mature in later iterations. Interestingly, base-learners working at later epochs gain relatively higher weights, but the base-learner working at the initial epoch (the BL) tends to be disabled when the meta-train process converges after around iterations.
Meta-learning hyperparameters and . The fine-grained hyperparameters, i.e. layer-wise learning rates and regularization weights , can be automatically learned by our LCC approach. In Table 3, we have two blocks to present the ablative results of using a single base-learner (S) and using multiple base-learners with learnable combination weights (L) in the miniImageNet 1-shot and 5-shot settings. In Figure 4, we show the validation curves (the curve smooth rate is ) of the whole meta-train processes. They clearly show that our approach with meta-learned hyperparameters achieves top performance.
About . In Table 3, comparing No.3 to No.2 and No.7 to No.6, we can conclude that meta-learning layer-wise consistently improves the model performance, e.g. it gains 3.2% for the case of using multiple base-learners in the 1-shot setting. We can observe the change of layer-wise in Figure 5(a), and the change of a single learning rate for each base-learner in Figure 5(b). Note that (a) shows the results of using multiple base-learners, for which the curve of a specific layer is obtained by averaging over the same layers of all base-learners. In this “averaged base-learner”, we can observe that higher-level conv layers learn to increase their learning rates. Notably, conv4 explores a much bigger learning step, around times higher than its initial value of . In (b), by contrast, when each base-learner has a single learning rate to learn, the global change of this rate is in a small range, e.g. the biggest jump in BL(purple) is from to around .
It also shows in (b) that base-learners working at later epochs tend to get higher learning rates. This is opposite to the common schedule, i.e. monotonically decreasing the learning rate, of traditional large-scale network training [18]. In our few-shot case, this increasing phenomenon can be interpreted as follows: our LCC learns to update more on the base-learners which have both more mature patterns and higher combination weights (i.e. values), and in turn gets greater feedback from them for meta optimization.
About . In Table 3, comparing No.4 to No.3 and No.8 to No.7, we can see that meta-learning does not help as much as meta-learning . However, meta-learning them together (i.e. L ) makes consistent improvements. Figure 4 gives more detailed results. The superiority of learning and is significant in the 5-shot case (b).
In Figure 6, we present the curves of meta-learned . In (a), the value of an individual layer is the average of those of base-learners. We can see that high-level layers learn to have higher values. We believe this is a collaborating behavior with the simultaneously meta-learned which also gets increased in higher-level layers (Figure 5(a)). An intuitive interpretation is that heavily penalizing peaky weights is needed when the weights are updated with large steps. Another interesting point in Figure 6(a) is that the values of of conv1 become negative after iterations. A possible explanation is that the gradient vanishing problem probably happens during training with very scarce samples. Our LCC learns to use negative to penalize such vanishing. Figure 6(b) shows that LCC can learn to adapt the values of for multiple base-learners. For convenient visualization, layer-wise values of each base-learner are averaged.
6 Conclusions
We propose a novel LCC approach that learns to customize a sequence of base-learners and learns to combine their prediction results. It addresses shortcomings of previous meta-learning approaches by meta-learning hyperparameters both layer-wise and over time, and allows the use of an ensemble of base-learners. Following the meta-learning paradigm, the method achieves top performance in comparison to related work. The design of our approach is independent of a specific base-learning model, i.e. base-learner architecture, and can also be generalized to pre-trained and deeper networks.
Acknowledgments
This research is part of NExT++ research supported by the National Research Foundation, Prime Minister’s Office, Singapore under its IRC@SG Funding Initiative. It is also partially supported by German Research Foundation (DFG CRC 1223), and National Natural Science Foundation of China (61772359).
References
 [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Józefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. A. Tucker, V. Vanhoucke, V. Vasudevan, F. B. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. Tensorflow: Largescale machine learning on heterogeneous distributed systems. arXiv, 1603.04467, 2016.
 [2] A. Antoniou, H. Edwards, and A. Storkey. How to train your maml. In ICLR, 2019.
 [3] E. Bart and S. Ullman. Crossgeneralization: Learning novel classes from a single example by feature replacement. In CVPR, 2005.
 [4] Y. Bengio. Gradientbased optimization of hyperparameters. Neural Computation, 12(8):1889–1900, 2000.
 [5] J. Bergstra and Y. Bengio. Random search for hyperparameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.
 [6] L. Breiman. Stacked regressions. Machine learning, 24(1):49–64, 1996.

[7]
R. Caruana.
Learning many related tasks at the same time with backpropagation.
In NIPS, 1994.  [8] W.Y. Chen, Y.C. Liu, Z. Kira, Y.C. Wang, and J.B. Huang. A closer look at fewshot classification. In ICLR, 2019.
 [9] D. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). In ICLR, 2016.
 [10] J. Domke. Generic methods for optimizationbased modeling. In AISTATS, 2012.
 [11] C. Finn, P. Abbeel, and S. Levine. Modelagnostic metalearning for fast adaptation of deep networks. In ICML, 2017.
 [12] C. Finn, K. Xu, and S. Levine. Probabilistic modelagnostic metalearning. In NeurIPS, 2018.
 [13] L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil. Bilevel programming for hyperparameter optimization and metalearning. In ICML, 2018.
 [14] Y. Freund and R. E. Schapire. A decisiontheoretic generalization of online learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.

[15]
J. H. Friedman.
Stochastic gradient boosting.
Computational statistics & data analysis, 38(4):367–378, 2002.  [16] H. E. Geoffrey and P. C. David. Using fast weights to deblur old memories. In CogSci, 1987.
 [17] E. Grant, C. Finn, S. Levine, T. Darrell, and T. L. Griffiths. Recasting gradientbased metalearning as hierarchical bayes. In ICLR, 2018.

 [18] T. He, Z. Zhang, H. Zhang, Z. Zhang, J. Xie, and M. Li. Bag of tricks for image classification with convolutional neural networks. arXiv, 2018.
 [19] T. K. Ho. Random decision forests. In ICDAR, 1995.
 [20] F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In LION, 2011.
 [21] F. Hutter, J. Lücke, and L. Schmidt-Thieme. Beyond manual tuning of hyperparameters. KI, 29(4):329–337, 2015.
 [22] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
 [23] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, C. Fernando, and K. Kavukcuoglu. Population based training of neural networks. arXiv, 1711.09846, 2017.
 [24] C. Ju, A. Bibaut, and M. J. van der Laan. The relative performance of ensemble methods with deep convolutional neural networks for image classification. arXiv, 1704.01664, 2017.
 [25] A. Krizhevsky. Learning multiple layers of features from tiny images. University of Toronto, 2009.
 [26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
 [27] L. I. Kuncheva and C. J. Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 51(2):181–207, 2003.

 [28] S. Laine and T. Aila. Temporal ensembling for semi-supervised learning. In ICLR, 2017.
 [29] Y. Lee and S. Choi. Gradient-based meta-learning with learned layerwise metric and subspace. In ICML, 2018.
 [30] F. Li, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell., 28(4):594–611, 2006.
 [31] L. Li, K. G. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18:185:1–185:52, 2017.
 [32] J. Luketina, T. Raiko, M. Berglund, and K. Greff. Scalable gradient-based tuning of continuous regularization hyperparameters. In ICML, 2016.
 [33] D. Maclaurin, D. K. Duvenaud, and R. P. Adams. Gradient-based hyperparameter optimization through reversible learning. In ICML, 2015.
 [34] L. Metz, N. Maheswaranathan, B. Cheung, and J. Sohl-Dickstein. Meta-learning update rules for unsupervised representation learning. In ICLR, 2019.
 [35] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. Snail: A simple neural attentive meta-learner. In ICLR, 2018.
 [36] T. Mitchell. Machine learning. McGraw-Hill Higher Education, New York, 1997.
 [37] T. Munkhdalai and H. Yu. Meta networks. In ICML, 2017.

 [38] T. Munkhdalai, X. Yuan, S. Mehri, and A. Trischler. Rapid adaptation with conditionally shifted neurons. In ICML, 2018.
 [39] D. K. Naik and R. Mammone. Meta-neural networks that learn by learning. In IJCNN, 1992.
 [40] B. N. Oreshkin, P. Rodríguez, and A. Lacoste. TADAM: task dependent adaptive metric for improved few-shot learning. In NeurIPS, 2018.
 [41] M. Ozay and F. T. Y. Vural. A new fuzzy stacked generalization technique and analysis of its performance. arXiv, 1204.0171, 2012.
 [42] S. Qiao, C. Liu, W. Shen, and A. L. Yuille. Few-shot image recognition by predicting parameters from activations. In CVPR, 2018.
 [43] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.

 [44] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [45] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell. Meta-learning with latent embedding optimization. In ICLR, 2019.
 [46] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. P. Lillicrap. Meta-learning with memory-augmented neural networks. In ICML, 2016.
 [47] J. Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-… hook. PhD thesis, Technische Universität München, 1987.

 [48] P. Smyth and D. Wolpert. Linearly combining density estimators via stacking. Machine Learning, 36(1-2):59–83, 1999.
 [49] J. Snell, K. Swersky, and R. S. Zemel. Prototypical networks for few-shot learning. In NIPS, 2017.
 [50] J. Snoek, H. Larochelle, and R. P. Adams. Practical bayesian optimization of machine learning algorithms. In NIPS, 2012.
 [51] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. M. A. Patwary, Prabhat, and R. P. Adams. Scalable bayesian optimization using deep neural networks. In ICML, 2015.
 [52] J. Snoek, K. Swersky, R. S. Zemel, and R. P. Adams. Input warping for bayesian optimization of nonstationary functions. In ICML, 2014.
 [53] P. Sollich and A. Krogh. Learning with ensembles: How overfitting can be useful. In NIPS, 1995.

 [54] Q. Sun, Y. Liu, T.-S. Chua, and B. Schiele. Meta-transfer learning for few-shot learning. In CVPR, 2019.
 [55] Q. Sun, L. Ma, S. Joon Oh, L. Van Gool, B. Schiele, and M. Fritz. Natural and effective obfuscation by head inpainting. In CVPR, 2018.
 [56] Q. Sun, B. Schiele, and M. Fritz. A domain based approach to social relation recognition. In CVPR, 2017.
 [57] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, 2018.
 [58] S. Thrun and L. Pratt. Learning to learn: Introduction and overview. In Learning to learn, pages 3–17. Springer, 1998.
 [59] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. In NIPS, 2016.
 [60] Y. Wang, R. B. Girshick, M. Hebert, and B. Hariharan. Low-shot learning from imaginary data. In CVPR, 2018.
 [61] Y.-X. Wang and M. Hebert. Learning from small sample sets by combining unsupervised meta-training with cnns. In NIPS, 2016.
 [62] Y. Xian, S. Sharma, B. Schiele, and Z. Akata. f-VAEGAN-D2: A feature generating framework for any-shot learning. In CVPR, 2019.
 [63] R. Zhang, T. Che, Z. Ghahramani, Y. Bengio, and Y. Song. Metagan: An adversarial approach to few-shot learning. In NeurIPS, 2018.