Deep neural networks trained on a large-scale dataset with appropriate regularization techniques generalize to novel instances with consistent performance, even though they are highly over-parameterized. Many attempts have been made to analyze their generalization performance in terms of the sharpness of local minima (Keskar et al., 2017), which offers an explanation of why deep networks can generalize even if the number of parameters is larger than the number of training instances. Yet complex deep networks learned from few examples tend to over-fit the training set easily, and this is hardly alleviated by regularization from Bayesian learning (Fei-Fei et al., 2003, 2006; Salakhutdinov et al., 2012).
The primary interest of this paper is few-shot classification: the objective is to learn a mapping function that assigns each instance in a query set $\mathcal{Q}$ to one of the few-shot classes defined by a support set $\mathcal{S}$, which is composed of a set of few instances in $N$ classes. Under this problem setting, meta-learning (Schmidhuber, 1987; Thrun and Pratt, 1998) generalizes to a novel task by learning from a series of tasks $\{\mathcal{T}_i\}_{i=1}^{T}$, where $\mathcal{T}_i$ is the $i$th episode that consists of a tuple of $\mathcal{S}$ and $\mathcal{Q}$, and $T$ is the number of training elements. A common practice for training a meta-learner has a major limitation: tasks in the meta-training and meta-test phases are sampled from the same dataset. Similar to supervised learning, in which training and test distributions are typically matched, meta-learning implicitly assumes that tasks from meta-training and meta-test share similar high-level concepts. As a result, the learner performs poorly on a novel task that does not share common attributes with the tasks in meta-training.
Rather than learning from a single dataset, we expect meta-learners to be trained over datasets from diverse visual domains and to perform robustly on a novel task by extending their knowledge from the tasks most similar to it. Yet naïve training on multiple datasets does not perform well because the model considers many irrelevant tasks. To alleviate this effect, an obvious approach is to retrieve datasets similar to the novel task and to train a meta-learner on them. This may perform well enough on the target task, but we observe some limitations: (i) learning an appropriate metric between datasets is quite challenging and (ii) a meta-learner must be trained from scratch whenever a new task is given. Instead of selecting similar datasets (Kim et al., 2017), it is better to keep multiple meta-learners, each trained on its own dataset, and determine how to aggregate them effectively. This encourages us to build a mixture of meta-learners where the mixture coefficients are adaptively determined whenever a novel task is given.
The major concern in building such a model is that it has to determine the coefficients after only a glimpse of the target task, in contrast to regular supervised learning, which has abundant validation examples with which to evaluate the base learners. To this end, we train the model to learn how to combine the meta-learners given an episode. More specifically, our mixture of meta-learners is established by putting more weight on the base learner that is expected to perform well on the tasks in the test phase. To evaluate a base learner given a small number of instances, we employ a weight prediction network (WPN) that predicts the performance of each base learner by observing its latent embeddings of the given task. Since WPN estimates the performance of a meta-learner from its output, it can be viewed as similar in spirit to a meta-recognition system (Scheirer et al., 2012), which analyzes and predicts the performance of a recognition system. Hence, learning to evaluate meta-learners can be considered as two layers of meta-learning.
Our contribution is two-fold: (i) we point out a major limitation of the conventional approaches and propose MxML as a solution, in which the mixture coefficients on the base meta-learners are determined task-adaptively; (ii) we observe that our model achieves the best performance among the compared state-of-the-art algorithms, both when the task is sampled from a novel distribution (out-of-distribution) and when the task shares similar attributes with the training tasks (in-distribution).
This section introduces the problem setting and notable works on meta-learning for few-shot classification.
2.1 Problem Setting
We follow the conventional definition of few-shot classification as in Vinyals et al. (2016) and Snell et al. (2017). The objective of few-shot classification is to estimate a function $f_\theta$, parameterized by $\theta$, that maps an instance of a query set $\mathcal{Q}$ into a label set $\mathcal{Y}$. Specifically, $N$-way $K$-shot classification is formally defined as the task that assigns a query to one of $N$ classes in the support set $\mathcal{S}$, composed of $NK$ examples and their associated labels: $\mathcal{S} = \{(x_i, y_i)\}_{i=1}^{NK}$, where an example $x_i \in \mathcal{X}$ and the associated label $y_i \in \{1, \ldots, N\}$. Note that the number of examples with the same label is $K$ and $\mathcal{X}$ represents the input space. Similarly, $\mathcal{Q} = \{(\tilde{x}_j, \tilde{y}_j)\}_{j=1}^{M}$, where $M$ is the number of queries and the associated labels are given only in meta-training.
Meta-learning (Schmidhuber, 1987; Thrun and Pratt, 1998) for few-shot classification introduces an episode, a tuple of a support set $\mathcal{S}$ and a query set $\mathcal{Q}$ sampled from a dataset, that is used to learn the parameters of a model in an episodic training strategy (Vinyals et al., 2016; Snell et al., 2017). Episodic training effectively prevents the over-fitting that occurs when a model is trained solely on a single task. In the subsequent section, we briefly summarize representative meta-learning methods for few-shot classification.
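The episodic setup described above can be illustrated with a minimal sketch. The function name `sample_episode`, the toy dataset, and the choice of drawing disjoint support and query examples per class are assumptions made for illustration, not details taken from the original implementation.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot episode (support set, query set).

    data_by_class maps a class name to a list of examples; every chosen
    class must hold at least k_shot + n_query examples.
    """
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Draw disjoint support and query examples for this class.
        picked = rng.sample(data_by_class[cls], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query

# Toy dataset (hypothetical): 10 classes with 20 placeholder examples each.
toy = {f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
support, query = sample_episode(toy, n_way=5, k_shot=1, n_query=3,
                                rng=random.Random(0))
print(len(support), len(query))  # 5 support and 15 query examples
```

Labels are re-indexed from 0 to N-1 inside each episode, so the same class can receive a different label in different episodes.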
2.2 Related Work
We categorize previous works by whether they adapt to the few labeled examples of a task in the test phase.
2.2.1 Meta-Learning without Adaptation
Learning appropriate metrics is a key step in solving few-shot learning. Along this direction, the matching network (Vinyals et al., 2016) proposes a differentiable nearest-neighbor classifier that is learned to minimize the empirical risk computed in the meta-training phase. Given a set of few-shot classes, the matching network learns a mapping function from a test instance into one of the few-shot classes, which is formulated by a bi-directional LSTM with an attention mechanism. Prototypical networks (Snell et al., 2017) simply learn a representative vector for each few-shot class, instead of learning complex neural networks with an attention mechanism. They are also trained by a series of episodes, in which the embeddings are learned so that instances lie close to the prototype of their own class. Moreover, a simple extension that learns covariance structures is also available (Fort, 2017). Among early works in this direction, the Siamese neural network (Koch et al., 2015) is used to learn a metric that preserves semantic similarities between instances.
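As a rough illustration of the prototype idea, the sketch below classifies a query by its distance to per-class mean embeddings. It assumes the embeddings have already been computed by some network, and it uses squared Euclidean distance with a softmax; the function names and toy vectors are hypothetical.

```python
import math

def prototypes(support):
    """Average the embeddings of each class; support is [(vector, label)]."""
    sums, counts = {}, {}
    for vec, label in support:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for d, v in enumerate(vec):
            acc[d] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def class_probabilities(query_vec, protos):
    """Softmax over negative squared Euclidean distances to the prototypes."""
    logits = {c: -sum((q - p) ** 2 for q, p in zip(query_vec, proto))
              for c, proto in protos.items()}
    top = max(logits.values())
    exps = {c: math.exp(l - top) for c, l in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

# Toy 2-way 2-shot support set of already-embedded instances.
support = [([0.0, 0.0], 0), ([0.2, 0.0], 0), ([1.0, 1.0], 1), ([1.2, 1.0], 1)]
protos = prototypes(support)
probs = class_probabilities([0.1, 0.1], protos)
predicted = max(probs, key=probs.get)  # the query lies next to class 0
```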
2.2.2 Meta-Learning with Adaptation
Learning models that quickly adapt to few examples is critical for few-shot learning. Along this direction, model-agnostic meta-learning (MAML) explicitly trains a meta-learner such that a few updates with labeled instances are enough to achieve high generalization performance on a new task (Finn et al., 2017). The original implementation of MAML requires second-order derivatives with respect to the parameters of a deep neural network; it can be accelerated by a first-order approximation (Nichol et al., 2018). Similarly, Ravi and Larochelle (2017) propose a meta-learning framework in which an LSTM is trained to learn an update rule for few-shot learning. To further advance this direction, Lee and Choi (2018) explicitly split a meta-learner into task-specific and task-general components, where each component is updated more effectively than when they are updated simultaneously.
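To make the adapt-then-evaluate structure concrete, here is a minimal sketch of a first-order MAML update on a one-parameter regression model $y = wx$. The model, the learning rates, and the two toy tasks are assumptions chosen for illustration; the second-order variant would additionally differentiate through the inner update.

```python
def grad(w, data):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * x * (w * x - y) for x, y in data) / len(data)

def fomaml_step(w, tasks, inner_lr=0.1, outer_lr=0.05):
    """One first-order MAML update over a batch of (support, query) tasks."""
    meta_grad = 0.0
    for support, query in tasks:
        adapted = w - inner_lr * grad(w, support)   # one inner-loop update
        meta_grad += grad(adapted, query)           # first-order approximation
    return w - outer_lr * meta_grad / len(tasks)

def make_task(slope):
    """A toy task whose support and query sets share the same two points."""
    points = [(1.0, slope), (2.0, 2.0 * slope)]
    return points, points

tasks = [make_task(1.0), make_task(3.0)]
w = 0.0
for _ in range(100):
    w = fomaml_step(w, tasks)
# w settles between the two task slopes, from where one step adapts to either.
```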
2.2.3 Set-Input Network
Neural networks that are invariant to permutation and can deal with variable-length inputs have recently gained a lot of attention for learning semantic representations of sets. Zaheer et al. (2017) provide a theoretical justification for a unique structure of neural networks that is invariant to permutation, which can be deployed in many interesting applications such as multiple instance learning and point-cloud classification. Edwards and Storkey (2017) develop a generative process of a set given a context vector, which is inferred by a statistic network that takes into account the exchangeability of a dataset. Moreover, Lee et al. (2018) propose a feed-forward neural network with self-attention that still holds the permutation-invariant property.
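The simplest permutation-invariant set encoder is mean pooling, sketched below. It is only the pooling step, without the learned transformations that the cited architectures place before and after it; the function name is hypothetical.

```python
def mean_pool(vectors):
    """Permutation-invariant encoding of a set: the element-wise mean."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]

a = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]])
b = mean_pool([[5.0, 9.0], [1.0, 2.0], [3.0, 4.0]])  # same set, new order
c = mean_pool([[1.0, 2.0], [3.0, 4.0]])              # a smaller set also works
```

Reordering the inputs leaves the encoding unchanged, and sets of different sizes map to vectors of the same dimension.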
3 Main Algorithm
In this section, we introduce our main algorithm, which trains a mixture of meta-learners in a task-adaptive fashion.
3.1 Mixture of Meta-Learners (MxML)
Mixture of meta-learners (MxML) task-adaptively aggregates base meta-learners, in which the mixture coefficients are determined by a weight prediction network (WPN). Figure 2 introduces the overall structure of MxML, which generates representations of an episode so that WPN can determine the weight proportion of the meta-learners. Specifically, an episode composed of $(\mathcal{S}, \mathcal{Q})$ is transformed by the $k$th base meta-learner $f_{\phi_k}$ as follows:

$$h_i^{(k)} = f_{\phi_k}(x_i), \qquad \tilde{h}_i^{(k)} = f_{\phi_k}(\tilde{x}_i), \tag{1}$$

where $h_i^{(k)}$, $\tilde{h}_i^{(k)}$ are the hidden representations of the $i$th instance in the support and query set, respectively. To ease exploiting the label information, we collect instances that belong to the same label and denote their hidden representations as $H_c^{(k)} = \{h_i^{(k)} : (x_i, y_i) \in \mathcal{S}_c\}$, where $\mathcal{S}_c$ means the subset of the support set that only contains the data labeled as $c$. Hence, we denote $H^{(k)} = \{H_c^{(k)}\}_{c=1}^{N}$.
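Then, the final prediction of MxML is established by combining the predictions of the meta-learners with mixing coefficients as follows:

$$p(\tilde{y} \mid \tilde{x}, \mathcal{S}) = \sum_{k=1}^{B} \sigma(w)_k \, p_k(\tilde{y} \mid \tilde{x}, \mathcal{S}), \tag{2}$$

where $\sigma(w)_k$ means the importance of the $k$th meta-learner, computed from the output $w = (w_1, \ldots, w_B)$ of WPN parameterized by $\psi$, $p_k(\tilde{y} \mid \tilde{x}, \mathcal{S})$ represents the prediction of $\tilde{x}$ by the $k$th base meta-learner, $B$ is the number of base meta-learners, and $\sigma$ is a softmax function. Details on WPN are described in the subsequent section.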
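The combination step can be sketched in a few lines: raw weights are passed through a softmax and used to average the class probabilities of the base learners. The raw weights and probability vectors below are made-up values, and in this sketch the weights are given directly rather than predicted by a network.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix_predictions(weights, learner_probs):
    """Combine per-learner class probabilities with softmax(weights).

    learner_probs[k][c] is the probability that base learner k assigns
    to class c for one query.
    """
    coeffs = softmax(weights)
    n_classes = len(learner_probs[0])
    return [sum(a * p[c] for a, p in zip(coeffs, learner_probs))
            for c in range(n_classes)]

# Two hypothetical base learners on a 3-way query; the first gets more weight.
mixed = mix_predictions([2.0, 0.0], [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
```

Because the coefficients are non-negative and sum to one, the mixture is again a valid probability distribution over the classes.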
MxML requires a two-step training procedure for the base learners and WPN, in which $B$ datasets are given. In the first step, each meta-learner is trained on its associated dataset and is fixed throughout the next step. MxML allows us to choose any type of meta-learner, including prototypical networks (Snell et al., 2017) and MAML (Finn et al., 2017). In the second step, WPN is trained by a series of episodes sampled from diverse datasets. In this step, the WPN parameters $\psi$ are trained by minimizing the cross-entropy between the weighted prediction (2) and the associated labels, while the base meta-learner parameters $\phi$ are fixed after training the base meta-learners. The objective function is as follows:

$$\min_{\psi} \; \mathbb{E}_{\mathcal{D} \sim p(\mathcal{D})} \, \mathbb{E}_{(\mathcal{S}, \mathcal{Q}) \sim \mathcal{D}} \Big[ -\sum_{(\tilde{x}, \tilde{y}) \in \mathcal{Q}} \log p(\tilde{y} \mid \tilde{x}, \mathcal{S}) \Big], \tag{3}$$

where $p(\mathcal{D})$ refers to a dataset distribution defined over the space in which all of the datasets exist, $\mathcal{D}$ means a single dataset sampled from $p(\mathcal{D})$, and $(\mathcal{S}, \mathcal{Q})$ represents an episode from the selected dataset $\mathcal{D}$.
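For a single episode, the training signal for the mixture weights can be sketched as the negative log-likelihood of the mixed prediction on the queries, with the base-learner outputs held fixed. Averaging over queries (rather than summing) and the toy probabilities are assumptions of this sketch.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def episode_loss(weights, learner_probs, labels):
    """Mean negative log-likelihood of the mixture over the queries.

    learner_probs[k][j][c]: probability from frozen base learner k that
    query j belongs to class c; labels[j] is the true class of query j.
    """
    coeffs = softmax(weights)
    loss = 0.0
    for j, y in enumerate(labels):
        p_true = sum(a * probs[j][y] for a, probs in zip(coeffs, learner_probs))
        loss -= math.log(p_true)
    return loss / len(labels)

# Hypothetical episode: learner 0 is accurate, learner 1 is not.
learner_probs = [
    [[0.9, 0.1], [0.2, 0.8]],   # learner 0, two queries
    [[0.4, 0.6], [0.7, 0.3]],   # learner 1, two queries
]
labels = [0, 1]
uniform = episode_loss([0.0, 0.0], learner_probs, labels)
skewed = episode_loss([3.0, 0.0], learner_probs, labels)
```

Shifting weight toward the accurate learner lowers the loss (`skewed < uniform`), which is the behavior that training on many episodes is meant to encourage.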
3.2 Weight Prediction Network (WPN)
To measure the importance of the $k$th learner, WPN introduces a distribution that encodes into a vector, denoted as , which is parameterized by $\psi$. We expect that a base meta-learner performs reasonably well when the class-wise distributions are separable and the predicted class of a query is highly concentrated on a specific class. In this sense, the weight prediction on the $k$th learner is defined as follows:
where is a distribution over the latent vectors that belong to the $c$th class in the support set (referred to as the class-specific distribution), refers to the latent vector of the $j$th query, and is the smooth max function, which returns the maximum component of a set in a differentiable way. We assume that are given by the neural network with input . Since is a set of embeddings, a set encoding network is needed. A bi-directional LSTM is used in Vinyals et al. (2016), and an average-pooling-based set representation is used in Zaheer et al. (2017) and Edwards and Storkey (2017). Likewise, we use average vectors for encoding because of their simplicity. Exact implementation details are shown in Table 2.
The first part of the right-hand side of (4) measures the distance between class-specific distributions. We assume that the value of this term increases when the classes in the task are more separable than under other meta-learners. In the second part of (4), is obtained from the neural network with input . The probability density at the point under the nearest distribution is multiplied over all queries. This term is cumulative, so using more queries makes more dependent on the second part. Figure 3 visualizes each component.
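The two ingredients described above, separability of the class-specific distributions and confidence of each query under its nearest class, can be illustrated with a small sketch. Every concrete choice here is an assumption made for illustration and is not confirmed by the text: diagonal Gaussians as class-specific distributions, KL divergence as the distance, log-sum-exp as the smooth max, and a sum of log densities over queries in place of a product of densities.

```python
import math

def log_density(z, mean, log_var):
    """Log density of a diagonal Gaussian at point z."""
    return sum(-0.5 * (math.log(2 * math.pi) + lv + (x - m) ** 2 / math.exp(lv))
               for x, m, lv in zip(z, mean, log_var))

def kl_divergence(mean_p, log_var_p, mean_q, log_var_q):
    """KL(p || q) between two diagonal Gaussians."""
    return sum(0.5 * (lq - lp + (math.exp(lp) + (mp - mq) ** 2) / math.exp(lq) - 1.0)
               for mp, lp, mq, lq in zip(mean_p, log_var_p, mean_q, log_var_q))

def smooth_max(values):
    """Log-sum-exp, a differentiable upper bound on max(values)."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def learner_score(class_dists, query_latents):
    """Separability of the class distributions plus query confidence.

    class_dists is a list of (mean, log_var) pairs, one per class.
    """
    separability = sum(kl_divergence(*p, *q)
                       for i, p in enumerate(class_dists)
                       for j, q in enumerate(class_dists) if i != j)
    confidence = sum(smooth_max([log_density(z, m, lv) for m, lv in class_dists])
                     for z in query_latents)
    return separability + confidence

# Hypothetical 1-D latents: one learner separates the two classes, one does not.
separated = learner_score([([0.0], [0.0]), ([4.0], [0.0])], [[0.1], [3.9]])
overlapping = learner_score([([0.0], [0.0]), ([0.5], [0.0])], [[0.1], [0.4]])
```

In this toy case the learner with well-separated classes receives the larger score, so a softmax over such scores would give it the larger mixing coefficient.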
WPN is trained by optimizing its parameters to minimize the cross-entropy loss between the prediction of MxML and the true labels (see Algorithm 1). This formulation explicitly considers training WPN on diverse datasets so that it generalizes to a novel one.
We use a parameterized WPN and train it to predict model performance because the evaluation of a model by (4) is not a perfect metric. In fact, it is not easy to validate a model without any similar instances from the target. Since the target task is given only as an episode, we can only roughly estimate performance from the structure of the task embeddings. Building a structured weight inference, instead of a model that takes a set of embedding vectors as input and outputs a single weight prediction, keeps WPN from over-fitting to a simple selection of a single meta-learner.
To handle a novel few-shot learning task, the model needs to be trained over a similar set of tasks. But when no similar set of tasks is available, or more specifically, when there are no other training classes containing plenty of instances in the same visual domain as the target, the model should be trained on diverse domains to generalize to a novel one. We mainly focus on this problem setting, which we consider more realistic.
In ensemble methods for classical supervised learning, the base learners should be as accurate as possible. They are evaluated with validation examples that have the same distributional properties as the test examples, and only the good ones are combined (otherwise, the model degenerates). The situation is similar in meta-learning: if we assume that the target tasks come from a known domain and that similar (but not identical) tasks can be collected to validate the models, then we can obtain good ones.
When the target task comes from a previously unseen distribution, however, there are no validation examples available for the base meta-learners, since the task contains only few-shot training examples ($\mathcal{S}$) with some test instances ($\mathcal{Q}$). Our main proposal is to build a model that evaluates each meta-learner given an episode and selects the best-performing one depending on the distributional properties of the task, so that the model is capable of solving any kind of task. Since the episode is composed of two sets $(\mathcal{S}, \mathcal{Q})$, we formulate this as a set-based problem. Because $\mathcal{S}$ sometimes contains too few instances to evaluate the model, we found that utilizing query instances helps a lot (i.e., a transductive setting).
Table 1: Statistics and experiment settings of each dataset (columns: Type, Dataset, Split, Average of (H, W)).
$k$th meta-learner and WPN. ConvBlock is composed of a convolutional layer, batch normalization, and a ReLU activation function in series. Each convolutional layer uses a kernel of size (3, 3), with 64 channels for the prototypical network and 32 for MAML, as originally implemented. Avg refers to the mean vector of all instances in the set (see equation 1), which is fed to a dense layer to produce the means and log diagonal variances of the class-specific distributions.
In this section, we show that our method outperforms existing methods. First, we introduce the datasets used in this paper and the detailed settings of the experiments.
We use various image datasets to train and test our network. Five datasets are used to train the base meta-learners and WPN, and five distinct datasets are used for the meta-test phase. For a fair comparison, we collect datasets on the basis of four categories:
Gray-scale low/high resolution images. We use Omniglot (Lake et al., 2015) and MNIST (LeCun and Cortes, 1998) in this category. Omniglot is used for training the base learners and MxML, and MNIST is used for the meta-test phase because its number of classes is not large enough for training meta-learners.
Colored low-resolution images. We use CIFAR-10 and CIFAR-100 (Krizhevsky, 2009), where the former is used for the meta-test phase and the latter is used for training meta-learners.
Colored high-resolution animal images. We employ AwA2 (Xian et al., 2017) and CUB200 (Welinder et al., 2010): the former consists of high-resolution animal images crawled from web sites and the latter is composed of images from 200 species of birds. CUB200 is originally designed for fine-grained classification, so it is relatively difficult to achieve high performance on it.
Table 1 summarizes the statistics and experiment settings of each dataset: which dataset is used for the meta-training or meta-test phase, and the number of classes used for training the meta-learners and WPN. All datasets are pre-processed in the same fashion: images are resized to a common resolution and gray-scale images are converted into 3-channel images.
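A minimal sketch of this kind of pre-processing is given below, with images stored as nested lists. The nearest-neighbor interpolation and the 4x4 target size are placeholders chosen for the example; the interpolation method and the actual target resolution are not specified here.

```python
def to_three_channels(gray):
    """Replicate a gray-scale image (rows of pixel values) into 3 channels."""
    return [[(p, p, p) for p in row] for row in gray]

def resize_nearest(image, height, width):
    """Nearest-neighbor resize of an image stored as a list of rows."""
    src_h, src_w = len(image), len(image[0])
    return [[image[r * src_h // height][c * src_w // width]
             for c in range(width)]
            for r in range(height)]

# A 2x2 gray-scale toy image, converted to 3 channels and upsampled to 4x4.
gray = [[0, 255], [128, 64]]
rgb = resize_nearest(to_three_channels(gray), height=4, width=4)
```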
4.2 Experimental Details
We set parameters of WPN as , , with fixed learning rate using the Adam optimizer (Kingma and Ba, 2015). During training and testing, 15 queries are given with a single meta-batch. The parameters of the prototypical network are set similarly to the original setting, except for the learning rate, which starts from and decreases to when training approaches epoch 70 out of 100 epochs. We use second-order MAML with inner-loop learning rate , and the same learning rate used in the prototypical network. During training, the meta-batch size is fixed to 2 for all out-of-distribution tasks and the 1-shot in-distribution task, and to 4 for the 5-shot in-distribution task.
In addition, we empirically observe that normalized features are effective to train WPN. Thus, in this paper all of the features extracted from base meta-learners are normalized as:
where $h$ and $\tilde{h}$ are $d$-dimensional representations of support and query vectors, respectively. We assume that the normalization helps to standardize the scale of the embedding spaces learned from different datasets.
We design the out-of-distribution task to verify that our model performs robustly on few-shot classification tasks sampled from a previously unseen distribution. We use 5 datasets to train the model (i.e., AwA2, CIFAR-100, Omniglot, VOC2012, and Caltech256) and evaluate it on 5 separate datasets (i.e., MNIST, CUB200, CIFAR-10, Caltech101, and miniImageNet). The classes of each meta-train dataset are randomly split into 2 subsets. One (80% of the classes) is used to train the dataset-specific meta-learner, and the other (20% of the classes) is used to train WPN, which learns from a diverse task distribution.
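The 80/20 class split can be sketched as follows. The function name, the fixed seed, and the rounding rule for datasets whose class count is not a multiple of five are assumptions made for this example.

```python
import random

def split_classes(class_names, learner_fraction=0.8, seed=0):
    """Randomly split classes into a base-learner part and a WPN part."""
    shuffled = sorted(class_names)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(learner_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical dataset with 100 classes.
classes = [f"class_{i}" for i in range(100)]
learner_classes, wpn_classes = split_classes(classes)
```

Splitting by class rather than by image keeps the WPN training episodes disjoint from the classes that the base meta-learner has already seen.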
The results of the experiments are shown in Table 3 (with the prototypical network as the base meta-learner) and Table 4 (with MAML as the base meta-learner). Dataset-specific models are trained only within their associated dataset, and their classification results on each meta-test dataset are listed. When datasets share a common source (i.e., the same visual domain, so that they have similar image resolution) or share similar tasks (e.g., classifying animals or general objects), the dataset-specific models tend to perform well on the target dataset. However, this is not always true, as in the case of Omniglot to MNIST. One might expect the model trained on Omniglot to perform best on MNIST because both consist of gray-scale character images, but the meta-learner trained on Caltech256 is the stronger classifier. This supports building MxML rather than manually choosing a similar dataset, because it is difficult to know in advance which model will perform best.
The single model has a structure identical to that of a dataset-specific model but is trained on multiple datasets. In many cases, we observe that its performance is worse than that of a well-performing dataset-specific learner because of many irrelevant tasks from unrelated domains. The uniform averaging model is a mixture of the dataset-specific models with identical mixture coefficients. In this case as well, performance degrades in some cases.
Figure 4 shows the averages and standard deviations of the mixing coefficients on the base learners associated with each dataset when CUB200 or CIFAR10 is given as the meta-test dataset. When a dataset relevant to the target task exists, MxML assigns a relatively high coefficient to it. We observe that the weight on the base learner from CIFAR100 is much higher than those from other datasets when the tasks are drawn from CIFAR10. In contrast, MxML assigns large coefficients to AwA2 and Caltech256 when the tasks are generated from CUB200, since all of them contain tasks of classifying animals.
In the in-distribution task, we show that our model also performs well when the task is sampled from an already-seen distribution. We make a slight change to the out-of-distribution experiment setting. Starting from the setting in Table 1, the miniImageNet (denoted as mImgNet) dataset is split into 3 subsets as in Vinyals et al. (2016), with 64/16/20 classes (meta-train/validation/test). One base meta-learner trained on miniImageNet is added compared with the previous setting, and the model is tested only on miniImageNet. The results show that our model consistently improves the performance in the conventional $N$-way $K$-shot classification problem with the additional meta-learners trained on other datasets (Table 5). Figure 5 shows the mixing coefficients of the base meta-learners, in which MxML puts more weight on miniImageNet and Caltech256 in both the 1-shot and 5-shot experiments.
|Model||1-shot||5-shot|
|Dataset-specific model||50.741 (0.764)||67.781 (0.664)|
|Single model||47.432 (0.750)||65.729 (0.641)|
|Uniform Averaging||44.618 (0.756)||66.007 (0.667)|
|MxML||51.393 (0.765)||69.338 (0.642)|
Table 5: Accuracy with 95% confidence interval of MxML with the prototypical network as the base meta-learner in 5-way classification tasks sampled from miniImageNet, of which non-overlapping classes are included in meta-training.
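Numbers of the form "mean (interval)" are commonly obtained from per-episode accuracies. The sketch below assumes a normal approximation with the factor 1.96 and uses made-up episode accuracies; the exact procedure behind the table is not stated here.

```python
import math
import statistics

def mean_with_ci95(accuracies):
    """Mean accuracy and 95% half-width under a normal approximation."""
    mean = statistics.fmean(accuracies)
    std_err = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, 1.96 * std_err

# Hypothetical accuracies of eight test episodes.
episode_acc = [0.52, 0.48, 0.55, 0.50, 0.45, 0.53, 0.49, 0.51]
mean, half_width = mean_with_ci95(episode_acc)
print(f"{100 * mean:.3f} ({100 * half_width:.3f})")
```

The half-width shrinks with the square root of the number of test episodes, so reporting it alongside the mean indicates how many episodes stand behind each entry.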
In this paper, we propose a task-adaptive mixture of meta-learners for few-shot classification, referred to as MxML. We observe that a common practice in meta-learning has a major limitation: tasks used in the meta-training and meta-test phases are sampled from similar task distributions. To address this issue, we tackle a challenging problem in which a test task is sampled from a novel dataset. We then propose an ensemble network that learns how to adaptively aggregate base meta-learners for the given task. Extensive experiments on diverse datasets show that MxML outperforms the other baselines.
- Edwards and Storkey  H. Edwards and A. Storkey. Towards a neural statistician. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
- Everingham et al.  M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 2010.
- Fei-Fei et al.  L. Fei-Fei, R. Fergus, and P. Perona. A bayesian approach to unsupervised one-shot learning of object categories. In Proceedings of the International Conference on Computer Vision (ICCV), 2003.
- Fei-Fei et al.  L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 2004.
- Fei-Fei et al.  L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006.
- Finn et al.  C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning (ICML), 2017.
- Fort  S. Fort. Gaussian prototypical networks for few-shot learning on omniglot. arXiv e-prints, arXiv:1708.02735, 2017.
- Griffin et al.  G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical report, California Institute of Technology, 2007.
- Keskar et al.  N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
- Kim et al.  J. Kim, S. Kim, and S. Choi. Learning to warm-start Bayesian hyperparameter optimization. arXiv e-prints, arXiv:1710.06219, 2017.
- Kingma and Ba  D. K. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
- Koch et al.  G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In Proceedings of the International Conference on Machine Learning (ICML), 2015.
- Krizhevsky  A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
- Lake et al.  B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 2015.
- LeCun and Cortes  Y. LeCun and C. Cortes. The MNIST database of handwritten digits, 1998. http://yann.lecun.com/exdb/mnist/.
- Lee et al.  J. Lee, Y. Lee, J. Kim, A. R. Kosiorek, S. Choi, and Y. W. Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. arXiv e-prints, arXiv:1810.00825, 2018.
- Lee and Choi  Y. Lee and S. Choi. Gradient-based meta-learning with learned layerwise metric and subspace. In Proceedings of the International Conference on Machine Learning (ICML), 2018.
- Nichol et al.  A. Nichol, J. Achiam, and J. Schulman. On first-order meta-learning algorithms. arXiv e-prints, arXiv:1803.02999, 2018.
- Ravi and Larochelle  S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
- Salakhutdinov et al.  R. Salakhutdinov, J. Tenenbaum, and A. Torralba. One-shot learning with a hierarchical nonparametric Bayesian model. In Proceedings of the ICML Workshop on Unsupervised and Transfer Learning, 2012.
- Scheirer et al.  W. J. Scheirer, A. Rocha, R. J. Micheals, and T. E. Boult. Meta-recognition: The theory and practice of recognition score analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.
- Schmidhuber  J. Schmidhuber. Evolutionary Principles in Self-Referential Learning. PhD thesis, Technical University of Munich, 1987.
- Snell et al.  J. Snell, K. Swersky, and R. S. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
- Thrun and Pratt  S. Thrun and L. Pratt. Learning to learn. Kluwer Academic Publishers Norwell, 1998.
- Vinyals et al.  O. Vinyals, C. Blundell, T. P. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
- Welinder et al.  P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical report, California Institute of Technology, 2010.
- Xian et al.  Y. Xian, C. Lampert, B. Schiele, and Z. Akata. Zero-shot learning - A comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
- Zaheer et al.  M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. In Advances in Neural Information Processing Systems (NeurIPS), 2017.