Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples

by Eleni Triantafillou, et al.

Few-shot classification refers to learning a classifier for new classes given only a few examples. While a plethora of models have recently emerged to tackle this problem, we find the current procedure and datasets used to systematically assess progress in this setting lacking. To address this, we propose Meta-Dataset: a new benchmark for training and evaluating few-shot classifiers that is large-scale, consists of multiple datasets, and presents more natural and realistic tasks. The aim is to measure the ability of state-of-the-art models to leverage diverse sources of data to achieve higher generalization, and to evaluate that generalization ability in a more challenging setting. We additionally measure the robustness of current methods to variations in the number of available examples and the number of classes. Finally, our extensive empirical evaluation leads us to identify weaknesses in Prototypical Networks and MAML, two popular few-shot classification methods, and to propose a new method, Proto-MAML, which achieves improved performance on our benchmark.





1 Introduction

Few-shot learning refers to learning new concepts from few examples, an ability that humans naturally possess, but machines desperately lack. Improving on this aspect would lead to more efficient algorithms that can flexibly expand their knowledge as necessary without requiring large labeled datasets. We focus on few-shot classification: classifying unseen examples into one of $N$ new ‘test’ classes, given only a few reference examples of each new class. Recent progress in this direction has been made by considering a meta-problem: though we are not interested in learning about any training class in particular, we can still exploit the training classes for the purpose of learning to learn new classes from few examples. The acquired learning procedure can then be directly applied to few-shot learning problems on new classes.

This intuition has inspired numerous models of increasing complexity for this problem (see the Related Work for many examples). However, we believe that the commonly used setup for measuring success in this direction is lacking. Notably, the typical approach is to train a model on a subset of classes from a given dataset and then subject it to classification tasks formed from the remaining set of classes from the same dataset. However, to be practically useful, meta-learners must generalize to truly different classes sampled from a different data distribution altogether. Furthermore, the evaluation tasks are artificially constrained to have training sets that are perfectly class-balanced, and any two classes are equally likely to co-appear in the same task. It is not clear therefore to what extent the performance on these tasks approximates the performance in the significantly more structured and imbalanced real world.


Meta-Dataset directly addresses the aforementioned limitations. In particular: 1) it is significantly larger-scale than previous benchmarks and comprises multiple datasets with different data distributions, 2) its task creation is informed by class structure for ImageNet and Omniglot, 3) it introduces realistic class imbalance, and 4) it varies the number of classes in each task and the size of the training set, enabling us to examine the robustness of models across a spectrum of tasks: from very-low-shot learning onwards.

The main contribution of this work is therefore to offer a more realistic and challenging environment for training and evaluating meta-learners for few-shot classification. By evaluating various baselines and meta-learners on Meta-Dataset, we are able to expose weaknesses of two popular meta-learners: Prototypical Networks and MAML. Finally, in light of these findings, we propose a novel hybrid of these two approaches, which we demonstrate captures complementary desired aspects of both and achieves state-of-the-art performance on Meta-Dataset.

2 Background

Task Formulation

The end goal of few-shot classification is to produce a model which, given a new learning episode with $N$ classes and a few labeled examples ($k_c$ per class, $c \in \{1, \dots, N\}$), is able to generalize to unseen examples for that episode. In other words, the model learns from a training (support) set $\mathcal{S} = \{(x_1, y_1), \dots, (x_K, y_K)\}$ (with $K = \sum_c k_c$) and is evaluated on a held-out test (query) set $\mathcal{Q} = \{(x^*_1, y^*_1), \dots, (x^*_T, y^*_T)\}$. Each example $(x, y)$ is formed of an input vector $x \in \mathbb{R}^D$ and a class label $y \in \{1, \dots, N\}$. Episodes with balanced training sets (i.e., $k_c = k$ for all $c$) are usually described as ‘$N$-way, $k$-shot’ episodes.

These evaluation episodes are constructed by sampling their $N$ classes from a larger set $\mathcal{C}_{test}$ of classes and sampling the desired number of examples per class. A disjoint set $\mathcal{C}_{train}$ of classes is used to train the model; note that this notion of training is distinct from the training that occurs within a few-shot learning episode.

Few-shot learning does not prescribe a specific training procedure, but a common approach involves matching the conditions in which the model is trained and evaluated (Vinyals et al., 2016). In other words, training often (but not always) proceeds in an episodic fashion. Some authors use training and testing to refer to what happens within any given episode, and use the terms meta-training and meta-testing to refer to using $\mathcal{C}_{train}$ to turn the model into a learner capable of fast adaptation and $\mathcal{C}_{test}$ to evaluate how well it learns from few shots. This nomenclature highlights the meta-learning perspective alluded to earlier, but to avoid confusion we will adopt another common nomenclature and refer to the training and test sets of an episode as the support and query sets, and to the process of learning from $\mathcal{C}_{train}$ simply as training.

Standard Datasets

Two datasets have emerged as de facto benchmarks for few-shot learning. Omniglot (Lake et al., 2015) is a dataset of 1623 handwritten characters from 50 different alphabets and contains 20 examples per class (character). Most recent methods obtain very high accuracy on various meta-learning problems formulated on Omniglot (using various numbers of ways and shots), rendering the comparisons between new few-shot learning methods unreliable.

The second benchmark, miniImageNet (Vinyals et al., 2016), is formed out of 100 ImageNet (Russakovsky et al., 2015) classes (64/16/20 for train/validation/test) and contains 600 examples per class. miniImageNet, albeit harder than Omniglot, has the same property: most recent methods trained on it reach similar accuracy when controlling for model capacity, and we believe the dataset is approaching its limit in terms of allowing us to discriminate between the merits of competing approaches. We hypothesize that this is due to an artificially constrained setup. In particular, current benchmarks:

  • Consider a fixed number of shots and ways. In contrast, real-life episodes are heterogeneous: they vary in terms of their number of classes and examples per class, and are unbalanced.

  • Measure only within-dataset generalization. However, realistic applications often involve generalization across datasets.

  • Ignore the relationships between classes when forming episodes. The coarse-grained classification of dogs and chairs may present different difficulties than the fine-grained classification of dog breeds, and current benchmarks do not establish a distinction between the two.

3 Approaches to Few-Shot Classification

In this section we review common baseline and meta-learning models that we evaluate on our benchmark, and introduce a novel meta-learner that achieves the state-of-the-art on our benchmark.

Non-episodic Baselines

Before diving into meta-learning, it is important to explore non-episodic solutions. Consider a deep neural network trained on a classification task. A natural non-episodic approach would exploit the meta-training data by simply using it to train a classifier over all of the meta-training classes $\mathcal{C}_{train}$. Consider the embedding function $g(x)$, defined by all the layers of the classification network except the final (output) layer. The hope of the non-episodic baselines resides in the possibility that this embedding function produces ‘meaningful’ representations even for examples of previously-unseen classes, thus enabling few-shot classification. It then remains to define an algorithm for using these representations for few-shot classification. We consider two choices for this algorithm, yielding the ‘$k$-NN’ and ‘Finetune’ variants of this baseline.

The ‘$k$-NN’ baseline classifies each query example as the class that its ‘closest’ support example belongs to. Closeness is measured by either Euclidean or cosine distance in the learned embedding space $g(x)$. We treat this choice of metric as a hyperparameter. On the other hand, the ‘Finetune’ baseline exploits the support set of each new meta-test task to train a new ‘output layer’ on top of the embedding function $g$ for the purpose of classifying between the $N$ new classes of the given task.
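To make the nearest-neighbour baseline concrete, the following is a minimal pure-Python sketch. It assumes embeddings have already been computed by some fixed embedding function; the function names and the toy support set are illustrative assumptions, not the benchmark's code.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_neighbor_predict(query, support, distance=euclidean):
    """Return the label of the support embedding closest to the query.

    `support` is a list of (embedding, label) pairs; the metric is a
    hyperparameter, mirroring the Euclidean-or-cosine choice in the text.
    """
    _, label = min(support, key=lambda pair: distance(query, pair[0]))
    return label


# Hypothetical two-dimensional embeddings for a toy episode.
support = [([0.0, 1.0], "cat"), ([1.0, 0.0], "dog"), ([0.9, 0.2], "dog")]
prediction = nearest_neighbor_predict([0.1, 0.8], support)
```

Swapping `distance=cosine_distance` into the call changes only the metric, which is how the sketch exposes that choice as a hyperparameter.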

Episodic Models

In the episodic setting, models are trained end-to-end for the purpose of learning to build classifiers from a few examples. We choose to experiment with Matching Networks (Vinyals et al., 2016), Prototypical Networks (Snell et al., 2017) and Model Agnostic Meta-Learning (MAML) (Finn et al., 2017) since we believe that these three cover a diverse set of approaches to few-shot learning. We also introduce a novel meta-learner which is inspired by the last two models.

In each training episode, episodic models compute, for each query example $x^* \in \mathcal{Q}$, the distribution $p(y^* \mid x^*, \mathcal{S})$ for its label conditioned on the support set $\mathcal{S}$, and allow training this differentiably-parameterized conditional distribution end-to-end via gradient descent. The different models are distinguished by the manner in which this conditioning on the support set is realized. In all cases, the performance on the query set drives the update of the meta-learner’s weights, which include (and sometimes consist only of) the embedding weights. We briefly describe each method below.

Prototypical Networks

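Prototypical Networks construct a prototype for each class and then classify each query example as the class whose prototype is ‘nearest’ to it under Euclidean distance. More concretely, the probability that a query example $x^*$ belongs to class $k$ is defined as:

$$p(y^* = k \mid x^*, \mathcal{S}) = \frac{\exp\left(-\lVert g(x^*) - c_k \rVert_2^2\right)}{\sum_{k'} \exp\left(-\lVert g(x^*) - c_{k'} \rVert_2^2\right)}$$

where $c_k$ is the prototype for class $k$, computed by averaging the embeddings of class $k$’s support examples.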
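As a hedged illustration of this rule, the sketch below averages support embeddings into prototypes and applies a softmax over negative squared Euclidean distances. The embeddings are assumed to be precomputed, and all names and toy values are hypothetical.

```python
import math


def prototypes(support):
    """Average the embeddings of each class's support examples."""
    sums, counts = {}, {}
    for embedding, label in support:
        acc = sums.setdefault(label, [0.0] * len(embedding))
        for i, value in enumerate(embedding):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def class_probabilities(query, protos):
    """Softmax over negative squared Euclidean distances to each prototype."""
    logits = {
        label: -sum((q - c) ** 2 for q, c in zip(query, proto))
        for label, proto in protos.items()
    }
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - top) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Toy episode: class "a" has two support examples, class "b" has one.
support = [([0.0, 0.0], "a"), ([2.0, 0.0], "a"), ([0.0, 4.0], "b")]
protos = prototypes(support)
probs = class_probabilities([1.0, 1.0], protos)
```

Because prototypes are simple means, classes with different numbers of support examples are handled without any change to the code.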

Matching Networks

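Matching Networks (in their simplest form) label each query example as a weighted linear combination of the support labels, where a support label is weighted more heavily the ‘closer’ the corresponding support embedding is to the query in question. Specifically, the probability distribution over the label $y^*$ of the query $x^*$ is given by:

$$p(y^* = k \mid x^*, \mathcal{S}) = \sum_{i=1}^{|\mathcal{S}|} \alpha(x^*, x_i)\, \mathbb{1}_{y_i = k}$$

where $\mathbb{1}_A$ equals 1 if $A$ is true and 0 otherwise, and

$$\alpha(x^*, x_i) = \frac{\exp\left[\cos(g(x^*), g(x_i))\right]}{\sum_{j} \exp\left[\cos(g(x^*), g(x_j))\right]}$$

with $\cos(\cdot, \cdot)$ denoting the cosine similarity.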
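A minimal sketch of this weighting scheme follows, assuming precomputed embeddings and attention weights given by a softmax over cosine similarities to the support examples. Function names and toy values are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def matching_probabilities(query, support):
    """Label distribution as an attention-weighted sum of support labels."""
    sims = [cosine(query, embedding) for embedding, _ in support]
    top = max(sims)
    weights = [math.exp(s - top) for s in sims]  # softmax numerators
    total = sum(weights)
    probs = {}
    for weight, (_, label) in zip(weights, support):
        probs[label] = probs.get(label, 0.0) + weight / total
    return probs


# Toy episode with two support examples of class 0 and one of class 1.
support = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([0.9, 0.1], 0)]
probs = matching_probabilities([1.0, 0.05], support)
```

Unlike the prototype rule, every individual support example contributes its own term, so a class with more support examples accumulates more attention mass.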


Model Agnostic Meta-Learning (MAML)

Let us now expand the notation for the embedding function to $g(x; \theta)$, exposing its parameters $\theta$. MAML for few-shot classification starts by assuming a linear classifier, parametrized by a bias vector $b$ and a weight matrix $W$, applied in the embedding space. It then classifies a query example based on

$$p(y^* \mid x^*, \mathcal{S}) = \mathrm{softmax}\left(b' + W' g(x^*; \theta')\right)$$

where the support set $\mathcal{S}$ is used to perform a small number of within-episode training steps for adjusting parameters $(b, W, \theta)$ and produce fine-tuned parameters $(b', W', \theta')$. Training in MAML is made possible by unrolling the within-episode gradient descent steps performed on $(b, W, \theta)$ and optimizing the prediction of the query set labels with respect to the initial $(b, W, \theta)$ by backpropagation. This normally requires computing second-order gradients, which can be expensive to obtain (both in terms of time and memory). For this reason, an approximation is often used whereby gradients of the within-episode descent steps are ignored. This variant is referred to as first-order MAML (fo-MAML) and was used in our experiments. We did attempt to use the full second-order version, but found it to be impractically expensive (e.g., it caused frequent out-of-memory problems).

Moreover, in our setting, since the number of ways varies between episodes, we do not learn $b, W$ and instead set them to zero (i.e., $b', W'$ are the result of within-episode gradient descent initialized at 0), thus only training $\theta$. In other words, MAML focuses on learning the within-episode initialization $\theta$ of the embedding network so that it can be rapidly adapted for a new task.
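The first-order approximation can be illustrated on a deliberately tiny problem. The sketch below is not the classification setup used in the paper: it assumes a scalar parameter and a one-dimensional regression loss, purely to show that the meta-gradient is taken as the query-loss gradient at the adapted parameter, ignoring how the adapted parameter depends on the initial one. All names, learning rates and data are hypothetical.

```python
def grad_mse(theta, data):
    """Gradient w.r.t. theta of the mean of (theta * x - y)^2 over (x, y) pairs."""
    return sum(2.0 * (theta * x - y) * x for x, y in data) / len(data)


def fo_maml_step(theta, support, query, inner_lr=0.1, inner_steps=3, meta_lr=0.05):
    """One first-order MAML update on a single episode.

    Returns the updated initialization and the within-episode adapted value.
    """
    adapted = theta
    for _ in range(inner_steps):
        # Within-episode training on the support set.
        adapted = adapted - inner_lr * grad_mse(adapted, support)
    # First-order approximation: evaluate the query gradient at the adapted
    # parameter and apply it directly to the initialization, skipping the
    # second-order terms from differentiating through the inner loop.
    meta_grad = grad_mse(adapted, query)
    return theta - meta_lr * meta_grad, adapted


# Toy episode drawn from the task y = 2x.
support = [(1.0, 2.0), (2.0, 4.0)]
query = [(3.0, 6.0)]
new_theta, adapted = fo_maml_step(0.0, support, query)
```

In this toy run the inner loop moves the parameter from 0 toward the task's slope, and the meta-update nudges the initialization in the same direction, which is the behaviour the first-order variant relies on.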

Introducing Proto-MAML

We introduce a novel meta-learner which we argue captures the best of both Prototypical Networks and MAML. In particular, the former exploits a very simple inductive bias that was demonstrated to be effective for reasoning about new classes from very few examples. However, it lacks a mechanism for within-task adaptation. On the other hand, the latter adopts a simple procedure for task adaptation using only a few steps. We view Proto-MAML as the marriage of these two complementary strengths.

As explained in Snell et al. (2017), Prototypical Networks can be re-interpreted as a linear classifier applied to a learned representation $g(x; \theta)$. In particular, due to using the squared Euclidean distance metric on top of the learned embeddings, the probability of a query example belonging to the different classes of the episode under the formulation of the Prototypical Network can be viewed as the output of a linear layer with a particular parameterization. Specifically, let $x$ denote a query example, $g(\cdot\,; \theta)$ the trainable embedding function, and $c_k$ the prototype for class $k$. Then, the ‘logit’ for $x$ belonging to class $k$ is:

$$-\lVert g(x; \theta) - c_k \rVert^2 = -g(x; \theta)^\top g(x; \theta) + 2 c_k^\top g(x; \theta) - c_k^\top c_k = 2 c_k^\top g(x; \theta) - \lVert c_k \rVert^2 + \text{constant}$$

where the scalar constant captures the term that does not relate to class $k$ and will not affect the softmax probabilities. The $k$’th unit of the equivalent linear layer therefore has weights $W_{k,\cdot} = 2 c_k$ and bias $b_k = -\lVert c_k \rVert^2$. It is worth mentioning that since $c_k$ is a function of $g(\cdot\,; \theta)$, $W$ and $b$ are thus differentiable with respect to $\theta$.

We refer to Proto-MAML as the (fo-)MAML model where the task-specific linear layer of each episode is initialized from the Prototypical Network-equivalent weights and bias defined above and subsequently optimized as usual on the given support set. When computing the meta update for $\theta$, we allow gradients to flow through the Prototypical Network-equivalent linear layer initialization. We show that this simple modification significantly helps the optimization of this model and outperforms vanilla fo-MAML by a large margin on Meta-Dataset.
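The equivalence behind this initialization can be checked numerically. The sketch below assumes a linear layer whose k-th row is twice the class prototype and whose k-th bias is the negative squared norm of that prototype, and compares its softmax output with that of the negative squared distance rule. The helper names and toy vectors are assumptions for illustration.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proto_logits(embedding, protos):
    """Negative squared Euclidean distance to each prototype."""
    return [-sum((e - c) ** 2 for e, c in zip(embedding, proto)) for proto in protos]


def linear_init_from_prototypes(protos):
    """Weights 2 * c_k and biases -||c_k||^2, one row per class."""
    weights = [[2.0 * c for c in proto] for proto in protos]
    biases = [-sum(c * c for c in proto) for proto in protos]
    return weights, biases


def linear_logits(embedding, weights, biases):
    return [
        sum(w * e for w, e in zip(row, embedding)) + bias
        for row, bias in zip(weights, biases)
    ]


protos = [[1.0, 0.0], [0.0, 4.0], [-2.0, 1.0]]
embedding = [0.5, 1.5]
weights, biases = linear_init_from_prototypes(protos)
p_proto = softmax(proto_logits(embedding, protos))
p_linear = softmax(linear_logits(embedding, weights, biases))
```

The two logit vectors differ only by the class-independent term, the negative squared norm of the embedding, so the two probability vectors agree up to floating-point error. In Proto-MAML this linear layer is only the starting point, which the within-episode steps then adapt.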

4 Meta-Dataset: A New Few-Shot Classification Benchmark

Meta-Dataset aims to offer an environment for measuring progress in realistic few-shot classification tasks. In particular, we argue that it provides both more realistic data and a more realistic formulation of the task. Our approach is therefore twofold: 1) changing the data and 2) changing the formulation of the task (i.e., how episodes are generated).

We accomplish the former by incorporating multiple diverse data sources, and the latter by introducing a new sampling procedure for episodes that more closely resembles realistic learning scenarios. The following sections describe these modifications in detail. The code is open source and publicly available.

4.1 Meta-Dataset’s Data

The data we propose to use is much larger in size than any previous benchmark, and is comprised of multiple different existing datasets. This invites research into how diverse sources of data can be exploited by a meta-learner, and allows us to evaluate a more challenging generalization problem, to new datasets altogether. Specifically, Meta-Dataset leverages data from the following 10 datasets: ILSVRC-2012 (ImageNet) (Russakovsky et al., 2015), Omniglot (Lake et al., 2015), Aircraft (Maji et al., 2013), CUB-200-2011 (Birds) (Wah et al., 2011), Describable Textures (Cimpoi et al., 2014), Quick Draw (Jongejan et al., 2016), Fungi (Schroeder & Cui, 2018), VGG Flower (Nilsback & Zisserman, 2008), Traffic Signs (Houben et al., 2013) and MSCOCO (Lin et al., 2014). These datasets were chosen because they are free and easy to obtain, span a variety of visual concepts (natural and human-made) and vary in how fine-grained the class definition is. More information about each of these datasets is provided in Appendix A.

However, to ensure that episodes correspond to realistic classification problems, all episodes generated in Meta-Dataset use classes from a single dataset at a time only. Moreover, two of these datasets, Traffic Signs and MSCOCO, are fully reserved for evaluation, meaning that no classes from them participate in the training set. The remaining ones contribute some classes to each of the training, validation and test splits of classes, roughly with 70% / 15% / 15% proportions. Two of these datasets, ImageNet and Omniglot, possess a class hierarchy that we exploit in Meta-Dataset. These are described below.


ImageNet

While ImageNet is commonly used, we define a new class split for meta-learning on it and a novel procedure for sampling classes from it during episode creation. Both of these are informed by its class hierarchy, which we describe below.

ImageNet is a dataset comprised of 82,115 ‘synsets’, i.e., concepts of the WordNet ontology, and it provides ‘is-a’ relationships for its synsets, thus defining a DAG over them. In this benchmark, we only use the 1000 synsets that were chosen for the ILSVRC 2012 classification challenge as classes that can appear in our episodes. However, we leverage the ontology DAG for defining a sampling procedure that determines which of these 1000 classes should co-occur in each episode.

For this purpose, we consider a sub-graph of the overall DAG that consists of only the 1000 synsets of ILSVRC-2012 and their ancestors, so these 1000 synsets are all and only the leaves of the DAG. We then further ‘cut’ this sub-graph into three pieces, for the training, validation, and test splits, such that there is no overlap between the leaves of any of these pieces. We selected the synsets ‘carnivore’ and ‘device’ as the roots of the validation and test sub-graphs, respectively. The leaves that are reachable from ‘carnivore’ and ‘device’ form the sets of the validation and test classes, respectively. All of the remaining leaves constitute the training classes. This method of splitting ensures that the training classes (e.g., non-carnivore animals) are substantially semantically different from the test classes (inanimate devices such as various tools and instruments). We end up with 712 training, 158 validation and 130 test classes, which together account for all 1000 leaves and roughly adhere to the standard 70 / 15 / 15 (%) splits.
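The splitting rule can be sketched on a toy hierarchy. The code below assumes the DAG is given as a mapping from each node to its children, collects the leaves reachable from the validation and test roots, and assigns every other leaf to training. The toy ontology, the function names, and the tie-breaking choice of removing validation leaves from the test set (should a leaf be reachable from both roots) are assumptions, not details from the paper.

```python
def reachable_leaves(children, root):
    """Return the set of leaves reachable from `root` in a DAG."""
    stack, seen, leaves = [root], set(), set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        kids = children.get(node, [])
        if not kids:
            leaves.add(node)
        stack.extend(kids)
    return leaves


def split_leaves(children, top, valid_root, test_root):
    """Leaves under the validation/test roots form those splits; the rest train."""
    valid = reachable_leaves(children, valid_root)
    test = reachable_leaves(children, test_root) - valid  # assumed tie-break
    train = reachable_leaves(children, top) - valid - test
    return train, valid, test


# Hypothetical miniature ontology standing in for the WordNet sub-graph.
children = {
    "entity": ["animal", "artifact"],
    "animal": ["carnivore", "bird"],
    "carnivore": ["dog", "cat"],
    "bird": ["robin"],
    "artifact": ["device", "furniture"],
    "device": ["laptop", "hammer"],
    "furniture": ["desk"],
}
train, valid, test = split_leaves(children, "entity", "carnivore", "device")
```

On this toy graph the three sets are disjoint and together cover every leaf, which is the property the real split is designed to have.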


Omniglot

Omniglot is one of the established benchmarks for few-shot classification, as mentioned earlier. The commonly used setup, however, flattens and ignores its two-level hierarchy of alphabets and characters. Instead, we take advantage of it to influence how we sample classes for creating episodes, yielding finer-grained within-alphabet classification problems. We also use the original splits proposed in (Lake et al., 2015): (all characters of) the ‘background’ and ‘evaluation’ alphabets are used for training and testing, respectively. However, we removed the 5 smallest alphabets from the ‘background’ set to reserve them for validation.

4.2 Episode Sampling

In this section we outline our proposed sampling algorithm for creating more realistic episodes.

Firstly, for ImageNet and Omniglot whose classes are hierarchically organized, we depart from the usual random uniform class selection process by incorporating this additional knowledge into the episode creation. Exploiting class structure can lead to more realistic tasks as, for example, it is arguably fairly unusual to classify frogs from laptops. A more natural task would instead be to classify desks from laptops, since these are more often observed simultaneously.

Further, we allow classes to have different ‘shots’, i.e., numbers of examples in the support set, therefore allowing for imbalanced episodes. Indeed, class imbalance is an inherent property of the world, so it is desirable to examine and improve upon the ability of few-shot learners to cope with it. Consider for example the classes of cat and a very specific species of alligator. There are plausibly more cats in the world than alligators of that specific species, so there are more opportunities to learn about the former class than the latter. To obtain realistic imbalance ratios in episodes, we sample the number of examples of each class from a distribution derived from the relative class frequencies in the original dataset for these classes, as outlined later. Additionally, unlike current benchmarks, we allow the support sets to vary in size, both in their number of classes and in their total number of examples.

More concretely, our algorithm for sampling an episode from a given split of a dataset can be broken down into two steps: sampling a set of classes from the given split and dataset, and sampling support and query sets of examples from those classes.

Sampling the episode’s class set

This procedure differs depending on which dataset is chosen. For datasets without a known class organization, we sample the ‘way’ uniformly from the range $[5, \text{MAX-CLASSES}]$, where MAX-CLASSES is either 50 or as many as there are available. Then we sample ‘way’ many classes uniformly at random from the requested class split (train, validation or test) of the given dataset. For ImageNet and Omniglot we employ a class-structure-aware procedure, outlined below.

ImageNet class sampling

We adopt a hierarchy-aware sampling algorithm for ImageNet, as follows. First, we sample a node uniformly at random from the set of ‘eligible’ nodes of the DAG structure corresponding to the specified split (train, validation or test). An internal node is ‘eligible’ for this selection if it spans at least 5 leaves, but no more than 392 leaves. The number 392 was chosen because it is the smallest number so that, collectively, all eligible internal nodes span all leaves in the DAG.

Once an eligible node is selected, some of the leaves that it spans will constitute the classes of the episode. Specifically, if the number of those leaves is no greater than 50, we use all of them. Otherwise, we randomly choose 50 of them.

This procedure enables the creation of tasks of varying degrees of fine-grainedness. For instance, if the sampled internal node has a small height, the leaf classes that it spans will represent semantically-related concepts, thus posing a fine-grained classification task. As the height of the sampled node increases, we ‘zoom out’ to consider a broader scope from which we sample classes and the resulting episodes are more coarse-grained.
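A minimal sketch of this hierarchy-aware procedure follows, assuming the split's DAG is available as a mapping from each node to its children. The thresholds default to the values quoted above (5, 392 and 50), while the function names and the toy graph are hypothetical.

```python
import random


def leaves_under(children, node):
    """Return the set of leaves spanned by `node` in a DAG."""
    stack, seen, leaves = [node], set(), set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        kids = children.get(current, [])
        if not kids:
            leaves.add(current)
        stack.extend(kids)
    return leaves


def sample_imagenet_classes(children, rng, min_span=5, max_span=392, max_way=50):
    """Pick an eligible internal node uniformly, then up to `max_way` of its leaves."""
    internal = [node for node, kids in children.items() if kids]
    spans = {node: sorted(leaves_under(children, node)) for node in internal}
    eligible = sorted(n for n in internal if min_span <= len(spans[n]) <= max_span)
    node = rng.choice(eligible)
    leaves = spans[node]
    if len(leaves) > max_way:
        leaves = rng.sample(leaves, max_way)
    return node, leaves


# Toy graph; the small thresholds are chosen only so the example is non-trivial.
children = {"root": ["A", "B"], "A": ["a1", "a2", "a3"], "B": ["b1", "b2"]}
node, classes = sample_imagenet_classes(
    children, random.Random(0), min_span=2, max_span=3, max_way=2
)
```

Sampling a node close to the leaves yields a fine-grained episode, while a node higher up yields a coarser one, matching the ‘zoom out’ behaviour described above.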

Omniglot class sampling

We sample classes from Omniglot by first sampling an alphabet uniformly at random from the chosen split of alphabets (train, validation or test). Then, the ‘way’ of the episode is sampled uniformly at random using the same restrictions as for the rest of the datasets, but taking care not to sample a larger number than the number of characters that belong to the chosen alphabet. Finally, the prescribed number of characters of that alphabet are randomly sampled. This ensures that each episode presents a within-alphabet fine-grained classification.
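The within-alphabet procedure can be sketched as follows. The bounds of 5 and 50 on the ‘way’ are assumptions made for this illustration, as are the function names and the toy alphabets; the sketch also assumes every alphabet has at least `min_way` characters.

```python
import random


def sample_way(num_available, rng, min_way=5, max_way=50):
    """Sample the number of classes uniformly, capped by what is available."""
    return rng.randint(min_way, min(max_way, num_available))


def sample_omniglot_classes(alphabets, rng):
    """Pick one alphabet uniformly, then sample `way` of its characters.

    `alphabets` maps an alphabet name to the list of its characters.
    """
    name = rng.choice(sorted(alphabets))
    characters = alphabets[name]
    way = sample_way(len(characters), rng)
    return name, rng.sample(characters, way)


# Hypothetical alphabets with 20 and 14 characters.
alphabets = {
    "Alpha": [f"a{i}" for i in range(20)],
    "Beta": [f"b{i}" for i in range(14)],
}
name, characters = sample_omniglot_classes(alphabets, random.Random(1))
```

Since all sampled characters come from a single alphabet, every episode produced this way is a fine-grained, within-alphabet task.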

Sampling the episode’s examples

We first sample the query set size. The query set is class-balanced, reflecting the fact that we care equally about performing well on all classes of an episode. The number of query images per class is computed as:

$$q = \min\left\{10, \; \min_{c \in \mathcal{C}} \left\lfloor 0.5 \cdot |Im(c)| \right\rfloor \right\}$$

where $\mathcal{C}$ is the set of selected classes and $Im(c)$ denotes the set of images belonging to class $c$. The min over classes ensures that each class has at least $q$ images to add to the query set, thus allowing it to be class-balanced. The $0.5$ multiplier ensures that enough images of each class will be available to add to the support set, and the minimum with $10$ prevents the query set from being too large.

Then, we compute the total support set size:

$$|\mathcal{S}| = \min\left\{500, \; \sum_{c \in \mathcal{C}} \left\lceil \beta \cdot \min\{100, |Im(c)| - q\} \right\rceil \right\}$$

where $\beta$ is a scalar sampled uniformly from the interval $(0, 1]$. Intuitively, each class on average contributes either all its remaining examples (after placing $q$ of them in the query set) if there are fewer than $100$, or $100$ otherwise, to avoid having too large support sets. The multiplication with $\beta$ enables the potential generation of smaller support sets even when multiple images are available, since we are also interested in examining the very-low-shot end of the spectrum. The ‘ceiling’ operation ensures that each selected class will have at least one image in the support set. Finally, we cap the total support set size to $500$.

We are now ready to compute the ‘shot’ of each class. Specifically, the proportion of the support set that will be devoted to class $c$ is computed as:

$$R_c = \frac{\exp(\alpha_c)\, |Im(c)|}{\sum_{c' \in \mathcal{C}} \exp(\alpha_{c'})\, |Im(c')|}$$

where $\alpha_c$ is sampled uniformly from the interval $[\log(0.5), \log(2))$. Intuitively, the un-normalized proportion of the support set that will be occupied by class $c$ is a noisy version of the total number of images of that class in the dataset, $|Im(c)|$. This design choice is made in the hope of obtaining realistic class ratios, under the hypothesis that the dataset class statistics are a reasonable approximation of the real-world statistics of appearances of the corresponding classes. The shot of a class is then set to:

$$k_c = \min\left\{ \left\lfloor R_c \cdot (|\mathcal{S}| - |\mathcal{C}|) \right\rfloor + 1, \; |Im(c)| - q \right\}$$

which ensures that at least one example is selected for each class, with additional examples selected proportionally to $R_c$, if enough are available.

After these steps, we complete the episode creation process by choosing the prescribed number of examples of each chosen class uniformly at random to populate the support and query sets.
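The size computations of this section can be sketched in a few lines of Python. The constants used as defaults (at most 10 query images per class, a 0.5 multiplier, at most 100 support images contributed per class, a cap of 500 on the support set, and noise drawn log-uniformly between 0.5 and 2) reflect our reading of the procedure and should be treated as assumptions, as should all function and variable names.

```python
import math
import random


def sample_episode_sizes(images_per_class, rng, max_query=10,
                         max_support_per_class=100, max_support=500):
    """Return (query size per class, total support size, shot per class).

    `images_per_class` maps each selected class to its number of images.
    """
    sizes = images_per_class
    # Class-balanced query size: at most half of the smallest class.
    q = min(max_query, min(n // 2 for n in sizes.values()))
    # beta lies in (0, 1] because rng.random() lies in [0, 1).
    beta = 1.0 - rng.random()
    support_size = min(
        max_support,
        sum(math.ceil(beta * min(max_support_per_class, n - q)) for n in sizes.values()),
    )
    # Noisy, un-normalized class proportions based on dataset class sizes.
    unnormalized = {
        c: math.exp(rng.uniform(math.log(0.5), math.log(2.0))) * n
        for c, n in sizes.items()
    }
    total = sum(unnormalized.values())
    shots = {
        c: min(
            math.floor(unnormalized[c] / total * (support_size - len(sizes))) + 1,
            sizes[c] - q,
        )
        for c in sizes
    }
    return q, support_size, shots


# Hypothetical class sizes for a three-way episode.
q, support_size, shots = sample_episode_sizes(
    {"a": 20, "b": 200, "c": 7}, random.Random(0)
)
```

Every class receives at least one support example, and no class is asked for more images than remain after its query images are set aside. Note that the per-class shots need not sum exactly to the sampled support size, since each is floored and capped independently.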

5 Related Work

In our experiments, we focused on the evaluation of three meta-learning methods: Matching Networks, Prototypical Networks and fo-MAML. Indeed, they are some of the first from the meta-learning family to have been proposed and are regularly used as the meta-learning baselines to beat in few-shot learning research. That said, in the past two years, several other methods have been proposed for few-shot learning. Some bear similarity with MAML and correspond to a meta-learner trained to quickly adapt its parameters to various novel tasks (Ravi & Larochelle, 2017; Munkhdalai & Yu, 2017; Rusu et al., 2019; Yoon et al., 2018). Others relate to Prototypical Networks by learning a data representation as well as a compact representation for a classifier of data under that representation (Bertinetto et al., 2019; Gidaris & Komodakis, 2018; Oreshkin et al., 2018). Methods similar to Matching Networks, in how they classify a novel example by performing comparisons with each individual support set example, were also proposed. These were based on graph neural networks (Satorras & Estrach, 2018), relational networks (Sung et al., 2018) and attentional networks (Mishra et al., 2018). Hence, we believe that the three meta-learning methods evaluated in this work are reasonably representative of current few-shot learning research. That said, there are other methods that relate less directly to the methods above, such as the work of Santoro et al. (2016) based on a memory-augmented recurrent network (one of the earliest meta-learning approaches to few-shot learning). We therefore look forward to future work evaluating more alternative methods on Meta-Dataset.

Similarly to our work, Antoniou et al. (2019) have also proposed improvements to MAML. Coined MAML++, their method is a collection of adjustments, including the use of multiple meta-trained inner loop learning rates, derivative-order annealing, and more. In comparison, our Proto-MAML variant simply changes the expression for the initial output weights in the inner loop and could easily be combined with the recommendations made by MAML++.

Finally, Meta-Dataset bears similarity to the CVPR 2017 Visual Domain Decathlon Challenge, in which contestants were tasked to train a joint system that can perform well on 10 different datasets, many of which are included in our benchmark. At test time, the submitted system classifies examples from the same classes as those available for training. This is unlike Meta-Dataset, which is designed for the few-shot learning scenario where generalization must be achieved for examples of never-before-seen classes.

6 Experiments

Meta-Dataset does not prescribe a procedure for learning from the training data. In fact, we believe that meta-learning from multiple heterogeneous sources of training classes is an open research problem. In these experiments though, keeping with the spirit of matching training and testing conditions, we trained each model through a series of training episodes that were sampled using the same algorithm as we used for Meta-Dataset’s evaluation episodes, described above. The choice of the dataset from which to sample the next episode was also random uniform. The non-episodic baseline is trained to solve the large classification problem that results from ‘concatenating’ the training classes of all datasets.

Further, we decided to perform validation on (the validation split of) ImageNet only, ignoring the validation sets of the other datasets. The rationale behind this choice is that the performance on ImageNet has been known to be a good proxy for the performance on different datasets. Notably, a common procedure for dealing with a new classification dataset is to finetune ImageNet-pre-trained weights on it, instead of learning new weights from scratch. However, we believe that the choice of the validation procedure used in this setup could benefit from additional research.

| Test Source | k-NN | Finetune | MatchingNet | ProtoNet | fo-MAML | Proto-MAML |
| --- | --- | --- | --- | --- | --- | --- |
| ILSVRC | 38.16 ± 1.01 | 47.47 ± 1.10 | 43.89 ± 1.05 | 43.43 ± 1.07 | 29.22 ± 1.00 | 50.23 ± 1.13 |
| Omniglot | 59.40 ± 1.31 | 62.97 ± 1.39 | 62.44 ± 1.25 | 60.41 ± 1.35 | 45.42 ± 1.61 | 60.65 ± 1.40 |
| Aircraft | 44.41 ± 0.92 | 56.35 ± 1.03 | 50.64 ± 0.95 | 48.60 ± 0.88 | 33.81 ± 0.91 | 54.53 ± 0.95 |
| Birds | 45.75 ± 0.98 | 61.63 ± 1.03 | 56.36 ± 1.03 | 63.73 ± 1.00 | 39.04 ± 1.17 | 69.71 ± 1.04 |
| Textures | 61.53 ± 0.75 | 67.82 ± 0.86 | 65.55 ± 0.76 | 62.17 ± 0.77 | 50.60 ± 0.74 | 66.68 ± 0.80 |
| Quick Draw | 46.42 ± 1.10 | 50.89 ± 1.15 | 50.24 ± 1.12 | 50.53 ± 0.97 | 24.33 ± 1.39 | 49.03 ± 1.12 |
| Fungi | 29.91 ± 0.93 | 33.01 ± 1.06 | 33.66 ± 1.00 | 35.95 ± 1.09 | 16.36 ± 0.86 | 39.04 ± 1.03 |
| VGG Flower | 77.23 ± 0.74 | 82.30 ± 0.85 | 80.21 ± 0.74 | 79.47 ± 0.81 | 56.01 ± 1.22 | 85.78 ± 0.80 |
| Traffic Signs | 58.42 ± 1.28 | 55.67 ± 1.19 | 59.64 ± 1.20 | 46.93 ± 1.11 | 23.53 ± 1.17 | 47.83 ± 1.03 |
| MSCOCO | 31.46 ± 1.00 | 33.77 ± 1.37 | 29.83 ± 1.15 | 35.24 ± 1.11 | 13.47 ± 1.04 | 38.06 ± 1.17 |
| Avg. rank | 3.20 | 1.90 | 2.40 | 2.30 | 4.50 | 1.20 |

Table 1: Few-shot classification results on Meta-Dataset using models trained on ILSVRC-2012 only. Entries are accuracy (%) ± confidence (%) for each method.

| Test Source | k-NN | Finetune | MatchingNet | ProtoNet | fo-MAML | Proto-MAML |
| --- | --- | --- | --- | --- | --- | --- |
| ILSVRC | 28.46 ± 0.83 | 39.68 ± 1.02 | 40.81 ± 1.02 | 41.82 ± 1.06 | 22.41 ± 0.80 | 45.48 ± 1.02 |
| Omniglot | 88.42 ± 0.63 | 85.57 ± 0.89 | 75.62 ± 1.09 | 78.61 ± 1.10 | 68.14 ± 1.35 | 86.26 ± 0.85 |
| Aircraft | 70.10 ± 0.73 | 69.81 ± 0.93 | 60.68 ± 0.87 | 66.57 ± 0.92 | 44.48 ± 0.91 | 79.15 ± 0.67 |
| Birds | 47.34 ± 0.97 | 54.07 ± 1.08 | 57.09 ± 0.95 | 63.57 ± 1.02 | 36.70 ± 1.13 | 72.67 ± 0.96 |
| Textures | 56.39 ± 0.74 | 62.66 ± 0.81 | 64.65 ± 0.77 | 66.60 ± 0.80 | 45.79 ± 0.67 | 66.69 ± 0.77 |
| Quick Draw | 66.12 ± 0.91 | 73.88 ± 0.81 | 58.86 ± 1.01 | 63.55 ± 0.92 | 41.27 ± 1.46 | 67.83 ± 0.90 |
| Fungi | 38.35 ± 1.08 | 31.85 ± 1.08 | 34.38 ± 1.01 | 37.97 ± 1.07 | 14.21 ± 0.81 | 44.58 ± 1.19 |
| VGG Flower | 73.21 ± 0.75 | 77.55 ± 0.94 | 82.60 ± 0.66 | 84.43 ± 0.69 | 61.10 ± 1.11 | 88.21 ± 0.68 |
| Traffic Signs | 49.84 ± 1.23 | 53.07 ± 1.13 | 57.90 ± 1.16 | 50.60 ± 1.02 | 24.03 ± 1.08 | 46.38 ± 1.03 |
| MSCOCO | 24.29 ± 0.92 | 27.71 ± 1.20 | 30.20 ± 1.13 | 37.58 ± 1.14 | 13.63 ± 0.96 | 35.12 ± 1.20 |
| Avg. rank | 3.2 | 2.8 | 2.8 | 2.2 | 5.2 | 1.5 |

Table 2: Few-shot classification results on Meta-Dataset using models trained on All datasets. Entries are accuracy (%) ± confidence (%) for each method.

This common ImageNet-pretraining procedure also inspired us to train variants of each meta-learner in which the embedding function is initialized from the representation to which the baseline model trained on ImageNet converged. We find that this initialization helps meta-learners substantially.

We experimented with two architectures: a four-layer convolutional network that is commonly used for few-shot learning, and an 18-layer residual network. All models performed best with the latter. We also tried two different image resolutions: 84, which is the image resolution of the commonly-used ‘mini-ImageNet’ benchmark, and 126. All models performed better with the larger images, except for fo-MAML. Finally, we tuned the learning rate schedule, weight decay, and model-specific hyperparameters. We used Adam to train all models.

Notably, for fo-MAML and Proto-MAML, we tuned the learning rate of the within-episode training, the number of within-episode training steps, and the number of additional such steps to perform in evaluation episodes (sampled from the validation or test splits). Our best-performing fo-MAML variant used 6 training steps, with a learning rate of 0.01 and no additional steps on evaluation episodes. Interestingly, Proto-MAML preferred the lower learning rate of 0.0001 but took 10 steps to adapt to each training task, and an additional 5 steps (totalling 15) in each validation or test episode. All other experimental details are included in the source code.
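For reference, the within-episode settings reported above can be collected in a small configuration sketch. The class and field names are hypothetical and are not taken from the released source code; only the numeric values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdaptationConfig:
    inner_learning_rate: float  # learning rate of the within-episode training
    train_steps: int            # within-episode steps on training episodes
    extra_eval_steps: int       # additional steps on validation/test episodes

    @property
    def eval_steps(self):
        """Total within-episode steps taken in a validation or test episode."""
        return self.train_steps + self.extra_eval_steps


# Values as reported in the text for the best-performing variants.
FO_MAML = AdaptationConfig(inner_learning_rate=0.01, train_steps=6, extra_eval_steps=0)
PROTO_MAML = AdaptationConfig(inner_learning_rate=0.0001, train_steps=10, extra_eval_steps=5)
```

Under this reading, `PROTO_MAML.eval_steps` is 15 and `FO_MAML.eval_steps` is 6.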

Results on Meta-Dataset

Figure 1: Analysis of performance as a function of the episode’s way, shots, and degree of fine-grainedness. Panels: (a) Ways Analysis; (b) Shots Analysis; (c) Fine-grainedness Analysis (on ImageNet’s train graph).

Table 1 and Table 2 present the results of the evaluation on the test set of each of the 10 datasets. The difference between them is the training source, i.e., the data that the models were trained on, which is (the training classes of) ILSVRC only and all datasets, respectively. No classes from Traffic Signs and MSCOCO are used during training, since these datasets have no training split and are reserved for evaluation only. We propose to use the average (over the datasets) rank of each method as our metric for comparison, where smaller is better. A method receives rank 1 if it has the highest accuracy, rank 2 if it has the second highest, and so on. When two methods are ‘tied’ for a position, they both receive the corresponding rank. Both tables demonstrate the superiority of Proto-MAML over the remaining models in Meta-Dataset’s evaluation tasks. The Finetune baseline notably presents a worthy opponent, while fo-MAML, to our surprise, performs quite poorly on Meta-Dataset.
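The average-rank metric can be sketched as follows. The text does not specify how a ‘tie’ is detected, so this sketch assumes that only exactly equal accuracies tie; it is therefore not guaranteed to reproduce the ‘Avg. rank’ rows of the tables. The function name is hypothetical, and the two example rows are copied from the results tables above.

```python
def average_ranks(accuracies):
    """Return the mean rank of each method over datasets (smaller is better).

    `accuracies` maps dataset name -> {method name: accuracy}.
    """
    methods = sorted(next(iter(accuracies.values())))
    totals = {m: 0 for m in methods}
    for row in accuracies.values():
        for m in methods:
            # Rank = 1 + number of methods with strictly higher accuracy,
            # so exactly tied methods share the same rank (an assumption).
            totals[m] += 1 + sum(row[other] > row[m] for other in methods)
    return {m: totals[m] / len(accuracies) for m in methods}


# Two rows of accuracies taken from the results tables.
accuracies = {
    "ILSVRC": {"k-NN": 38.16, "Finetune": 47.47, "MatchingNet": 43.89,
               "ProtoNet": 43.43, "fo-MAML": 29.22, "Proto-MAML": 50.23},
    "Omniglot": {"k-NN": 59.40, "Finetune": 62.97, "MatchingNet": 62.44,
                 "ProtoNet": 60.41, "fo-MAML": 45.42, "Proto-MAML": 60.65},
}
ranks = average_ranks(accuracies)
```

On these two rows alone, Finetune obtains an average rank of 1.5 and Proto-MAML of 2.0; averaging over all ten test sources is what the tables report.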

We also recorded the performance of the different models for the various ways and shots that they encountered during their evaluation on test episodes of Meta-Dataset. This enables us to examine their robustness to these different settings. We show the evaluation results of the methods whose training source was (the training classes of) all datasets. Their ImageNet-only-trained counterparts exhibit the same trends, and we included them in the Appendix instead.

Figure 1(a) plots the accuracy as a function of the episode’s ‘way’. These results reflect what we expected: the more classes a task has, the harder it is. Perhaps more interestingly, Figure 1(b) illustrates the ability of the different models to benefit from larger shots. In particular, for every ‘shot’ of a class in a test episode, we plot the percentage of query examples of that class that are correctly classified (we refer to this as the ‘precision’ of the class). The general trend is not surprising: the more support examples a class has, the easier it is to correctly classify its query examples. However, this plot sheds light on some interesting trade-offs between the different models. In the very-low-shot end of the spectrum, Prototypical Networks and Proto-MAML outshine the other models. However, Prototypical Networks are evidently less capable of improving given more ‘shots’. On the other hand, the Finetune baseline, Matching Networks and fo-MAML improve at a faster rate given more data. Further, we argue that Proto-MAML indeed constitutes a step towards a ‘best of all worlds’ model, since it is the top-performer in the truly few-shot setting, and yet improves upon Prototypical Networks’ ability to benefit from more data. We think that taking additional steps in this direction is an interesting research problem.
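A minimal sketch of the per-shot ‘precision’ computation described above is given below. The input format and function name are hypothetical, and averaging the per-class precision over all classes that share a shot is an assumption about how the curve is aggregated.

```python
from collections import defaultdict


def precision_by_shot(episodes):
    """Average per-class precision, grouped by the class's shot.

    Each episode is a list of (shot, num_correct_queries, num_queries)
    tuples, one tuple per class in that episode.
    """
    buckets = defaultdict(list)
    for episode in episodes:
        for shot, correct, total in episode:
            if total > 0:
                # 'Precision' of a class: fraction of its query examples
                # that were classified correctly.
                buckets[shot].append(correct / total)
    return {shot: sum(v) / len(v) for shot, v in sorted(buckets.items())}


# Toy episodes with variable shots per class (made-up numbers).
episodes = [
    [(1, 4, 10), (5, 8, 10)],
    [(1, 6, 10), (5, 9, 10), (20, 10, 10)],
]
curve = precision_by_shot(episodes)
```

Here `curve` maps each shot (1, 5, 20) to the mean precision of the classes observed with that many support examples.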

Fine-grainedness analysis

We had hypothesized that finer-grained tasks (e.g., between dog breeds) are more challenging than coarse-grained ones (e.g., frogs versus laptops). To investigate this, we created binary ImageNet episodes whose two classes are chosen uniformly at random from the DAG’s set of leaves. We then define the degree of coarse-grainedness of a task as the height of the lowest common ancestor of the two chosen leaves, where the height is defined as the length of the longest path from the lowest common ancestor to one of the selected leaves. Larger heights then correspond to coarser-grained tasks. Surprisingly, we did not detect any trend when performing this analysis on the test DAG. The results on the training DAG, though, do seem to indicate that our hypothesis holds to some extent. These results are shown in Figure 1(c). We conjecture that this may be due to the richer structure of the training DAG, but we encourage further investigation.
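The coarse-grainedness measure can be sketched on a toy hierarchy as follows. In a DAG two leaves may have several common ancestors; this sketch assumes the ‘lowest’ one is the common ancestor with the smallest height, which is one possible reading of the definition above. The hierarchy and function names are hypothetical.

```python
def longest_up_distances(leaf, parents):
    """Map `leaf` and each of its ancestors to the longest path length down to `leaf`."""
    dist = {leaf: 0}
    stack = [leaf]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            # Keep the longest path found so far; revisit if it improves.
            if dist.get(parent, -1) < dist[node] + 1:
                dist[parent] = dist[node] + 1
                stack.append(parent)
    return dist


def coarseness(leaf_a, leaf_b, parents):
    """Height of the lowest common ancestor of two leaves."""
    dist_a = longest_up_distances(leaf_a, parents)
    dist_b = longest_up_distances(leaf_b, parents)
    common = set(dist_a) & set(dist_b)
    if not common:
        raise ValueError("the two leaves share no ancestor")
    # Height of a common ancestor: longest path to either selected leaf.
    return min(max(dist_a[n], dist_b[n]) for n in common)


# Toy hierarchy given as child -> list of parents.
PARENTS = {
    "beagle": ["dog"], "poodle": ["dog"], "dog": ["animal"],
    "frog": ["animal"], "animal": ["entity"],
    "laptop": ["artifact"], "artifact": ["entity"],
}
fine = coarseness("beagle", "poodle", PARENTS)
medium = coarseness("beagle", "frog", PARENTS)
coarse = coarseness("beagle", "laptop", PARENTS)
```

On this toy hierarchy the two dog breeds yield height 1, while a dog breed versus a laptop yields height 3, matching the intuition that larger heights correspond to coarser-grained tasks.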

7 Conclusion

We have introduced a new large-scale, diverse, and realistic environment for training and testing meta-learners for the task of few-shot classification. We believe that our exploration of various models on Meta-Dataset has uncovered certain weaknesses of the current state-of-the-art meta-learning methods which allow us to pinpoint interesting directions for future research.

In particular, we view our experiments as the first attempt to meta-learn from a diverse set of sources, and we feel there is plenty of room for improvement. Notably, we don’t always observe a generalization gain from training on all datasets over training on ImageNet only; in fact, in some cases the performance drops. This suggests that our method of consuming training data from different datasets can be improved. Further, through our analysis of the performance as a function of the shots, we discovered that different models perform well on different ends of this spectrum. We argue that our new Proto-MAML variant is a first step towards more robust meta-learners, but we believe that more work is needed on this front.

Generally, this benchmark opens the door to the use of multiple data sources for few-shot learning. Despite having only 10 datasets, developing this benchmark allowed us to explore and identify good practices for a codebase that supports this setting. In the longer term, we thus view Meta-Dataset as only a first step towards the establishment of more challenging benchmarks for few-shot learning research, with increasingly many dataset sources. To move in this direction, future work will likely require considering other domains beyond natural or man-made images (e.g., from the medical domain or from computer graphics simulations). It will also rely on the ability and willingness of the community to continue to release new freely available image classification datasets.

Author Contributions

Eleni, Hugo, and Kevin came up with the benchmark idea and requirements. Eleni developed the core of the project, and worked on the experiment design and management with Tyler and Kevin, as well as experiment analysis. Carles, Ross, Kelvin, Pascal, Vincent, and Tyler helped extend the benchmark by adding datasets. Eleni and Vincent contributed the Prototypical Networks and Matching Networks implementations, respectively. Tyler implemented baselines, MAML (with Kevin) and Proto-MAML models, and updated the backbones to support them. Writing was mostly led by Eleni, with contributions by Hugo, Vincent, and Kevin and help from Tyler and Pascal for visualizations. Pascal and Pierre-Antoine worked on code organization, efficiency, and open-sourcing; Pascal and Vincent optimized the efficiency of the data input pipeline. Pierre-Antoine supervised the code development process and reviewed most of the changes; Hugo and Kevin supervised the overall direction of the research.


Acknowledgements

The authors would like to thank Chelsea Finn for fruitful discussions and advice on tuning fo-MAML and ensuring the correctness of implementation, as well as Zack Nado and Dan Moldovan for the initial dataset code that was adapted.


References

  • Antoniou et al. (2019) Antoniou, A., Edwards, H., and Storkey, A. How to train your MAML. In International Conference on Learning Representations, 2019.
  • Bertinetto et al. (2019) Bertinetto, L., Henriques, J. F., Torr, P., and Vedaldi, A. Meta-learning with differentiable closed-form solvers. In International Conference on Learning Representations, 2019.
  • Cimpoi et al. (2014) Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., and Vedaldi, A. Describing textures in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  • Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017.
  • Gidaris & Komodakis (2018) Gidaris, S. and Komodakis, N. Dynamic few-shot visual learning without forgetting. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • Ha & Eck (2017) Ha, D. and Eck, D. A neural representation of sketch drawings. arXiv, abs/1704.03477, 2017.
  • Houben et al. (2013) Houben, S., Stallkamp, J., Salmen, J., Schlipsing, M., and Igel, C. Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark. In International Joint Conference on Neural Networks, 2013.
  • Jongejan et al. (2016) Jongejan, J., Rowley, H., Kawashima, T., Kim, J., and Fox-Gieg, N. The Quick, Draw! – A.I. experiment, 2016.
  • Lake et al. (2015) Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
  • Lin et al. (2014) Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In European conference on computer vision, pp. 740–755. Springer, 2014.
  • Maji et al. (2013) Maji, S., Rahtu, E., Kannala, J., Blaschko, M., and Vedaldi, A. Fine-grained visual classification of aircraft. arXiv, abs/1306.5151, 2013.
  • Mishra et al. (2018) Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. A simple neural attentive meta-learner. In International Conference on Learning Representations, 2018.
  • Munkhdalai & Yu (2017) Munkhdalai, T. and Yu, H. Meta networks. In International Conference on Machine Learning, pp. 2554–2563, 2017.
  • Nilsback & Zisserman (2008) Nilsback, M.-E. and Zisserman, A. Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, 2008.
  • Oreshkin et al. (2018) Oreshkin, B. N., Rodriguez, P., and Lacoste, A. TADAM: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pp. 719–729, 2018.
  • Ravi & Larochelle (2017) Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017.
  • Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • Rusu et al. (2019) Rusu, A. A., Rao, D., Sygnowski, J., Vinyals, O., Pascanu, R., Osindero, S., and Hadsell, R. Meta-learning with latent embedding optimization. In International Conference on Learning Representations, 2019.
  • Santoro et al. (2016) Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pp. 1842–1850, 2016.
  • Satorras & Estrach (2018) Satorras, V. G. and Estrach, J. B. Few-shot learning with graph neural networks. In International Conference on Learning Representations, 2018.
  • Schroeder & Cui (2018) Schroeder, B. and Cui, Y. FGVCx fungi classification challenge 2018, 2018.
  • Snell et al. (2017) Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087, 2017.
  • Stallkamp et al. (2011) Stallkamp, J., Schlipsing, M., Salmen, J., and Igel, C. The German Traffic Sign Recognition Benchmark: A multi-class classification competition. In IEEE International Joint Conference on Neural Networks, pp. 1453–1460, 2011.
  • Sung et al. (2018) Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H., and Hospedales, T. M. Learning to compare: Relation network for few-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • Vinyals et al. (2016) Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638, 2016.
  • Wah et al. (2011) Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
  • Yoon et al. (2018) Yoon, J., Kim, T., Dia, O., Kim, S., Bengio, Y., and Ahn, S. Bayesian model-agnostic meta-learning. In Advances in Neural Information Processing Systems, 2018.

Appendix A Datasets

Meta-Dataset is formed of data originating from 10 different image datasets. A complete list of the datasets we use is the following.

Figure 2: Training examples taken from the various datasets forming Meta-Dataset. Panels: (a) ImageNet; (b) Omniglot; (c) Aircraft; (d) Birds; (e) DTD; (f) Quick Draw; (g) Fungi; (h) VGG Flower; (i) Traffic Signs.

ILSVRC-2012 (ImageNet) (Russakovsky et al., 2015)

A dataset of natural images from 1000 categories (Figure 2(a)). We removed some images that were duplicates of images in another dataset in Meta-Dataset (43 images that were also part of Birds) or other standard datasets of interest (92 from Caltech-101 and 286 from Caltech-256). The complete list of duplicates is part of the source code release.

Omniglot (Lake et al., 2015)

A dataset of images of 1623 handwritten characters from 50 different alphabets, with 20 examples per class (Figure 2(b)). While Vinyals et al. (2016) recently proposed a new split for this dataset, we instead use the originally intended split (Lake et al., 2015), which poses a more challenging generalization problem since the split is at the level of alphabets (30 training alphabets and 20 evaluation alphabets) rather than characters from those alphabets. Out of the 30 training alphabets, we hold out the 5 smallest ones (i.e., those with the fewest character classes) to form our validation set, and use the remaining 25 for training.
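The validation hold-out described above can be sketched as follows. The alphabet names and sizes are made up, the function name is hypothetical, and breaking ties between equally sized alphabets by name is an assumption.

```python
def split_train_valid(alphabet_sizes, num_valid=5):
    """Hold out the `num_valid` alphabets with the fewest character classes.

    `alphabet_sizes` maps alphabet name -> number of character classes.
    Ties in size are broken by name (an assumption).
    """
    ranked = sorted(alphabet_sizes, key=lambda name: (alphabet_sizes[name], name))
    return ranked[num_valid:], ranked[:num_valid]


# Toy example with 8 made-up alphabets instead of the 30 training alphabets.
sizes = {
    "alpha_1": 14, "alpha_2": 55, "alpha_3": 20, "alpha_4": 26,
    "alpha_5": 16, "alpha_6": 40, "alpha_7": 17, "alpha_8": 23,
}
train_alphabets, valid_alphabets = split_train_valid(sizes)
```

With the 30 training alphabets of the original split as input, the same call would leave 25 alphabets for training and 5 for validation.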

Aircraft (Maji et al., 2013)

A dataset of images of aircraft spanning 102 model variants, with 100 images per class (Figure 2(c)).

CUB-200-2011 (Birds) (Wah et al., 2011)

A dataset for fine-grained classification of 200 different bird species (Figure 2(d)).

Describable Textures (DTD) (Cimpoi et al., 2014)

A texture database consisting of 5640 images, organized according to a list of 47 terms (categories) inspired by human perception (Figure 2(e)).

Quick Draw (Ha & Eck, 2017)

A dataset of 50 million black-and-white drawings across 345 categories, contributed by players of the game Quick, Draw! (Figure 2(f)).

Fungi (Schroeder & Cui, 2018)

A large dataset of approximately 100K images of nearly 1,500 wild mushroom species (Figure 2(g)).

VGG Flower (Nilsback & Zisserman, 2008)

A dataset of natural images of 102 flower categories. The flowers were chosen to be ones commonly occurring in the United Kingdom. Each class consists of between 40 and 258 images (Figure 2(h)).

Traffic Signs (Stallkamp et al., 2011)

A dataset of 50,000 images of German road signs in 43 classes (Figure 2(i)).

MSCOCO (Lin et al., 2014)

A dataset of images collected from Flickr with 1.5 million object instances belonging to 80 classes, labelled and localized using bounding boxes. We choose the train2017 split and create image crops from the original images using each object instance’s ground-truth bounding box (Figure 2(j)).
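The cropping step can be illustrated with a minimal sketch. It assumes the COCO annotation convention in which a bounding box is given as (x, y, width, height) with (x, y) the top-left corner. Representing the image as a nested list, rounding outwards to whole pixels, and clamping to the image borders are choices made for this example only, and the function name is hypothetical.

```python
import math


def crop_instance(image, bbox):
    """Crop one object instance out of an image.

    `image` is a list of rows, each row a list of pixels.
    `bbox` is (x, y, width, height) with (x, y) the top-left corner.
    """
    x, y, w, h = bbox
    img_h, img_w = len(image), len(image[0])
    # Round outwards to whole pixels and clamp to the image borders.
    left, top = max(0, math.floor(x)), max(0, math.floor(y))
    right, bottom = min(img_w, math.ceil(x + w)), min(img_h, math.ceil(y + h))
    return [row[left:right] for row in image[top:bottom]]


# Toy 4x6 'image' whose pixels record their own (row, column) position.
image = [[(r, c) for c in range(6)] for r in range(4)]
crop = crop_instance(image, (1.0, 1.0, 3.0, 2.0))
```

The resulting crop has 2 rows and 3 columns, and its top-left pixel is the one at row 1, column 1 of the original image.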

Appendix B Analysis of Performance Across Shots and Ways

For completeness, we show the results of the analysis of the robustness to different ways and shots for the variants of the models that were trained on ImageNet only. We observe the same trends as discussed in our Experiments section for the variants of the models that were trained on all datasets.

Figure 3: Analysis of performance as a function of the episode’s way and shots for models whose training source is ImageNet only. Panels: (a) Ways Analysis; (b) Shots Analysis.
Figure 4: Analysis of performance as a function of the degree of fine-grainedness. Larger heights correspond to coarser-grained tasks. Panels: (a) Fine-grainedness Analysis (on ImageNet’s train graph); (b) Fine-grainedness Analysis (on ImageNet’s test graph).

Appendix C Training on more datasets than ILSVRC-2012

To observe more clearly whether training on all datasets leads to improved generalization over training on ImageNet only, Figure 5 displays the data of Tables 1 and 2, showing side by side the performance of each model trained on ILSVRC only vs. all datasets.

We also computed the ‘element-wise’ difference between the results in Table 2 and Table 1. These differences are shown in the following table, as well as in Figure 6. A positive entry indicates that the test performance on the corresponding dataset improved when using the variant of the corresponding model that was trained on all training sources.

Figure 5: Accuracy (%) on the test datasets, for each model. The difference between the plain-colored and hatched bars shows the effect of training the model on ILSVRC-2012 only vs. all the datasets.
Method: Accuracy (%) ± confidence (%)
Test Source | k-NN | Finetune | MatchingNet | ProtoNet | fo-MAML | Proto-MAML
ILSVRC | -9.71 ± 1.31 | -7.79 ± 1.50 | -3.08 ± 1.47 | -1.61 ± 1.50 | -6.81 ± 1.28 | -4.75 ± 1.52
Omniglot | 29.01 ± 1.46 | 22.60 ± 1.65 | 13.18 ± 1.66 | 18.20 ± 1.74 | 22.72 ± 2.10 | 25.62 ± 1.64
Aircraft | 25.70 ± 1.17 | 13.46 ± 1.39 | 10.04 ± 1.29 | 17.97 ± 1.28 | 10.67 ± 1.29 | 24.61 ± 1.16
Birds | 1.59 ± 1.38 | -7.56 ± 1.49 | 0.73 ± 1.40 | -0.16 ± 1.43 | -2.34 ± 1.63 | 2.97 ± 1.41
Textures | -5.13 ± 1.05 | -5.16 ± 1.19 | -0.90 ± 1.08 | 4.43 ± 1.11 | -4.81 ± 1.00 | 0.01 ± 1.12
Quick Draw | 19.70 ± 1.43 | 22.99 ± 1.40 | 8.62 ± 1.51 | 13.02 ± 1.33 | 16.94 ± 2.02 | 18.80 ± 1.44
Fungi | 8.44 ± 1.43 | -1.15 ± 1.51 | 0.72 ± 1.42 | 2.02 ± 1.53 | -2.15 ± 1.18 | 5.54 ± 1.57
VGG Flower | -4.02 ± 1.05 | -4.75 ± 1.27 | 2.39 ± 0.99 | 4.95 ± 1.06 | 5.10 ± 1.65 | 2.43 ± 1.05
Traffic Signs | -8.59 ± 1.78 | -2.59 ± 1.64 | -1.74 ± 1.67 | 3.67 ± 1.51 | 0.50 ± 1.60 | -1.45 ± 1.46
MSCOCO | -7.16 ± 1.36 | -6.07 ± 1.82 | 0.37 ± 1.61 | 2.34 ± 1.59 | 0.16 ± 1.42 | -2.94 ± 1.68
Table 3: The improvement on Meta-Dataset obtained by training on All Datasets instead of ILSVRC only.
Figure 6: Accuracy improvement on the test datasets, when training on all datasets instead of ILSVRC only.

This table shows that we do not always observe a clear generalization advantage in training from a wider collection of image datasets. While some of the datasets that were added to the meta-training phase did see an improvement across all models, in particular for Omniglot and Quick Draw, this was not true across the board. In fact, in certain cases the performance dropped. We believe that more successfully leveraging diverse sources of data is an interesting open research problem.
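The element-wise difference can be illustrated with a short sketch. The text does not state how the confidence values of the differences were obtained; combining the two intervals in quadrature is an assumption, although it agrees with the reported values to within rounding for the entry checked here. The example uses the two k-NN entries reported for the ILSVRC test source (38.16 ± 1.01 and 28.46 ± 0.83), ordered so that the difference has the sign shown in Table 3. The function and argument names are hypothetical.

```python
import math


def improvement(acc_all, ci_all, acc_ilsvrc, ci_ilsvrc):
    """Accuracy gain from training on all datasets, with a combined interval.

    The interval half-widths are combined in quadrature, which assumes the
    two estimates are independent (an assumption, not stated in the text).
    """
    diff = acc_all - acc_ilsvrc
    ci = math.sqrt(ci_all ** 2 + ci_ilsvrc ** 2)
    return diff, ci


# k-NN on the ILSVRC test source.
diff, ci = improvement(acc_all=28.46, ci_all=0.83, acc_ilsvrc=38.16, ci_ilsvrc=1.01)
```

This gives a difference of about -9.70 with a combined interval of about 1.31, against the reported -9.71 ± 1.31; the small discrepancy in the difference is presumably due to the rounding of the tabulated accuracies.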