Meta-Learning Requires Meta-Augmentation

by Janarthanan Rajendran, et al.
University of Michigan

Meta-learning algorithms aim to learn two components: a model that predicts targets for a task, and a base learner that quickly updates that model when given examples from a new task. This additional level of learning can be powerful, but it also creates another potential source of overfitting, since we can now overfit in either the model or the base learner. We describe both of these forms of meta-learning overfitting, and demonstrate that they appear experimentally in common meta-learning benchmarks. We then use an information-theoretic framework to discuss meta-augmentation, a way to add randomness that discourages the base learner and model from learning trivial solutions that do not generalize to new tasks. We demonstrate that meta-augmentation produces large complementary benefits to recently proposed meta-regularization techniques.



1 Introduction

In several areas of machine learning, data augmentation is critical to achieving state-of-the-art performance. In computer vision Shorten and Khoshgoftaar (2019), speech recognition Ko et al. (2015), and natural language processing Fadaee et al. (2017), augmentation strategies effectively increase the support of the training distribution, improving generalization. Although data augmentation is often easy to implement, it has a large effect on performance. For a ResNet-50 model He et al. (2016) applied to the ILSVRC2012 object classification benchmark Russakovsky et al. (2015), removing randomly cropped images and color distorted images from the training data results in a 7% reduction in accuracy (see Appendix A). Recent work in reinforcement learning Kostrikov et al. (2020); Laskin et al. (2020) also demonstrates large increases in reward from applying image augmentations.

Meta-learning has emerged in recent years as a popular framework for learning new tasks in a sample-efficient way. The goal is to learn how to learn new tasks from a small number of examples, by leveraging knowledge from learning other tasks Thrun (1998); Ellis (1965); Hochreiter et al. (2001); Schmidhuber et al. (1996); Bengio et al. (1991). Given that meta-learning adds an additional level of model complexity to the learning problem, it is natural to suspect that data augmentation plays an equally important, if not greater, role in helping meta-learners generalize to new tasks. In classical machine learning, data augmentation turns one example into several examples. In meta-learning, meta-augmentation turns one task into several tasks. Our key observation is that given labeled examples $(x, y)$, classical data augmentation adds noise to inputs $x$ without changing $y$, but for meta-learning we must do the opposite: add noise to labels $y$, without changing inputs $x$.

The main contributions of our paper are as follows: first, we present a unified framework for meta-data augmentation and an information theoretic view on how it prevents overfitting. Under this framework, we interpret existing augmentation strategies and propose modifications to handle two forms of overfitting: memorization overfitting, in which the model is able to overfit to the training set without relying on the learner, and learner overfitting, in which the learner overfits to the training set and does not generalize to the test set. Finally, we show the importance of meta-augmentation on a variety of benchmarks and meta-learning algorithms.

Figure 1: Meta-learning problems provide support inputs $(x_s, y_s)$ to a base learner, which applies an update to a model. Once applied, the model is given query input $x_q$, and must learn to predict query target $y_q$. (a) Memorization overfitting occurs when the base learner and $(x_s, y_s)$ do not impact the model’s prediction of $y_q$. (b) Learner overfitting occurs when the model and base learner leverage both $(x_s, y_s)$ and $x_q$ to predict $y_q$, but fail to generalize to the meta-test set. (c) Yin et al. (2020) propose an information bottleneck constraint on the model capacity to reduce memorization overfitting. (d) To tackle both forms of overfitting, we view meta-data augmentation as widening the task distribution, by encoding additional random bits $\epsilon$ in $(x_s, y_s)$ that must be decoded by the base learner and model in order to predict a transformed $y_q$.

2 Background and definitions

A standard supervised machine learning problem considers a set of training data $(x^i, y^i)$ indexed by $i$ and sampled from a task $\mathcal{T}$, where the goal is to learn a function $x \mapsto \hat{y}$. In meta-learning, we have a set of tasks $\{\mathcal{T}^i\}$, where each task $\mathcal{T}^i$ is made of a support set $(x_s, y_s)$, and a query set $(x_q, y_q)$. The grouped support and query sets are referred to as an episode. The training and test sets of examples are replaced by meta-training and meta-test sets of tasks, each of which consists of episodes. The goal is to learn a base learner that first observes support data $(x_s, y_s)$ for a new task, then outputs a model which yields correct predictions $\hat{y}_q$ for $x_q$. When applied to classification, this is commonly described as $K$-shot, $N$-way classification, indicating $K$ examples in the support set, with class labels $y_s, y_q \in [1, N]$. Following Triantafillou et al. (2020), we rely mostly on meta-learning centric nomenclature but borrow the terms “support”, “query”, and “episode” from the few-shot learning literature.

Meta-learning has two levels of optimization: an inner-loop optimization and an outer-loop optimization. The inner-loop optimization (the base learner) updates a model using $(x_s, y_s)$ such that it better predicts $y_q$ from $x_q$, while the outer-loop optimization updates the base learner in order to improve how the inner-loop update is carried out. While many meta-learning benchmarks design the support set $(x_s, y_s)$ to have the same data type as the query set $(x_q, y_q)$, the support need not match the data type of the query and can be any arbitrary data. Some meta-learners utilize an inner-loop objective that bears no resemblance to the objective computed on the query set Finn et al. (2017); Metz et al. (2018). Contextual meta-learners replace explicit inner-loop optimization with information extraction using a neural network, which is then learned jointly in the outer-loop optimization Santoro et al. (2016); Garnelo et al. (2018). As shown in Figure 1, meta-learners can be treated as black-box functions mapping two inputs to one output, $\hat{y}_q = f((x_s, y_s), x_q)$. We denote our support and query sets as $(x_s, y_s)$ and $(x_q, y_q)$, as all of our tasks have matching inner and outer-loop data types.
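The two levels of optimization can be made concrete with a deliberately tiny sketch. It assumes a scalar linear model $\hat{y} = w x$, a mean-squared-error loss, and a single inner gradient step; the function names and the learning rate are illustrative choices, not taken from the paper or from any MAML codebase. The inner step plays the role of the base learner, and the outer gradient differentiates the query loss through that step.

```python
import numpy as np


def inner_update(w, x_s, y_s, inner_lr):
    """Base learner: one gradient step on the support-set MSE of y_hat = w * x."""
    grad = 2.0 * np.mean(x_s * (w * x_s - y_s))
    return w - inner_lr * grad


def query_loss(w, x_s, y_s, x_q, y_q, inner_lr):
    """Query-set MSE after the model has been adapted on the support set."""
    w_adapted = inner_update(w, x_s, y_s, inner_lr)
    return np.mean((w_adapted * x_q - y_q) ** 2)


def outer_gradient(w, x_s, y_s, x_q, y_q, inner_lr):
    """Gradient of the query loss w.r.t. the initial weight w.

    Uses the chain rule through the inner step:
    d(loss)/dw = d(loss)/d(w_adapted) * d(w_adapted)/dw.
    """
    w_adapted = inner_update(w, x_s, y_s, inner_lr)
    dloss_dw_adapted = 2.0 * np.mean(x_q * (w_adapted * x_q - y_q))
    dw_adapted_dw = 1.0 - 2.0 * inner_lr * np.mean(x_s ** 2)
    return dloss_dw_adapted * dw_adapted_dw
```

An outer loop would repeatedly sample an episode, call `outer_gradient`, and take a gradient step on `w`. If the query loss can be driven down without the inner step changing `w` much, the system is in the memorization regime described in this section.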

Overfitting in the classical machine learning sense occurs when the model has memorized the specific instances of the training set at the expense of generalizing to new instances of the test set. Meta-learning has two levels of learning, so perhaps unsurprisingly, either of these learning loops can overfit at training time. We first discuss how the inner-loop can overfit.

A set of tasks is said to be mutually-exclusive if a single model cannot solve them all at once. For example, if task $\mathcal{T}^1$ is “output 0 if the image is a dog, and 1 if it is a cat”, and task $\mathcal{T}^2$ is “output 0 if the image is a cat, and 1 if it is a dog”, then $\{\mathcal{T}^1, \mathcal{T}^2\}$ are mutually-exclusive. Such tasks are easier to meta-learn, since the model must use $(x_s, y_s)$ to learn how to adapt to each task. A set of tasks is non-mutually-exclusive if the opposite is true: one model can solve all tasks at once. Yin et al. (2020) identify the memorization problem in meta-learning that can happen when the set of tasks is non-mutually-exclusive. A non-mutually-exclusive setting potentially leads to complete meta-learning memorization, where the model memorizes the support set and predicts $y_q$ using only $x_q$, without relying on the base learner (shown in Figure 1a). This is acceptable for training performance, but results in poor generalization on the meta-test set, because the memorized model does not have knowledge of test-time tasks and does not know how to use the base learner to learn the new task. We refer to this as “memorization overfitting” for the remainder of the paper.

In a $K$-shot, $N$-way classification problem, we repeatedly create random tasks by sampling $N$ classes from the pool of all classes. Viewing this as a meta-augmentation, we note this creates a mutually-exclusive set of tasks, since class labels will conflict across tasks. For example, if we sampled $\mathcal{T}^1$ and $\mathcal{T}^2$ from the earlier example, the label for dog images is either $0$ or $1$, depending on the task. Yin et al. (2020) argue it is not straightforward to replicate this recipe in tasks like regression. Yin et al. (2020) propose applying weight decay and an information bottleneck to the model (Figure 1c), which restricts the information flow from $x_q$ and the model to $\hat{y}_q$. They construct meta-regularized (MR) variants of contextual (CNP Garnelo et al. (2018)) and gradient-based (MAML Finn et al. (2017)) meta-learners and demonstrate their effectiveness on a novel non-mutually-exclusive pose regression dataset. In the following sections, we will show that an alternative randomized augmentation can be applied to this dataset, and other such non-mutually-exclusive meta-learning problems.

While overfitting of the inner-loop optimization leads to memorization overfitting, overfitting of the outer-loop optimization leads to what we refer to as learner overfitting. Learner overfitting (shown in Figure 1b) happens when the base learner overfits to the training set tasks and does not generalize to the test set tasks. The learned base learner is able to leverage $(x_s, y_s)$ successfully to help the model predict $y_q$ for each episode in the meta-training set, coming from the different training set tasks. However, the base learner is unable to do the same for novel episodes from the meta-test set. Learner overfitting can happen both in non-mutually-exclusive and mutually-exclusive task settings, and can be thought of as the meta-learning version of classical machine learning overfitting.

3 Meta-augmentation


Let $X, Y$ be random variables representing data from a task, from which we sample examples $(x, y)$. An augmentation is a process where a source of random bits $\epsilon$, and a mapping $f: \epsilon, X, Y \to X, Y$, are combined to create new data $X', Y' = f(\epsilon, X, Y)$. We assume all $(x', y')$ are also in $X, Y$. An example is where $\epsilon \sim U[0, 2\pi)$, and $f$ rotates image $x$ by $\epsilon$, giving $X' = \mathrm{rotate}(X, \epsilon)$, $Y' = Y$. We define an augmentation to be CE-preserving (conditional entropy preserving) if conditional entropy $H(Y'|X') = H(Y|X)$ is conserved; for instance, the rotation augmentation is CE-preserving because rotations in $X'$ do not affect the predictiveness of the original or rotated image to the class label. CE-preserving augmentations are commonly used in image-based problems Triantafillou et al. (2020); Liu et al. (2020); Shorten and Khoshgoftaar (2019). Conversely, an augmentation is CE-increasing if it increases conditional entropy, $H(Y'|X') > H(Y|X)$. For example, if $Y$ is continuous and $\epsilon \sim U[-1, 1]$, then $f(\epsilon, x, y) = (x, y + \epsilon)$ is CE-increasing, since $(X', Y')$ will have two examples $(x, y_1), (x, y_2)$ with shared $x$ and different $y$, increasing $H(Y'|X')$. These are shown in Figure 2.
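The contrast between the two kinds of augmentation can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: the rotation is restricted to multiples of 90 degrees so that `np.rot90` suffices, the label noise is uniform with a hypothetical `scale` parameter, and both function names are made up for this example.

```python
import numpy as np


def ce_preserving_rotation(rng, image, label):
    """Rotate the input by a random multiple of 90 degrees; the label is untouched.

    The class stays equally predictable from the rotated image, so the
    conditional entropy of the label given the input is unchanged.
    """
    k = int(rng.integers(0, 4))  # number of quarter turns (assumption: 90-degree steps only)
    return np.rot90(image, k), label


def ce_increasing_label_noise(rng, x, y, scale=1.0):
    """Leave the input untouched and shift a continuous target by uniform noise.

    The same x can now appear with different targets, which raises the
    conditional entropy of the target given the input.
    """
    eps = rng.uniform(-scale, scale)
    return x, y + eps
```

Calling `ce_increasing_label_noise` twice on the same `(x, y)` pair produces two examples that share `x` but disagree on the target, which is exactly the situation described above.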

Figure 2: Meta-augmentation: We introduce the notion of CE-preserving and CE-increasing augmentations to explain why meta-augmentation differs from standard data augmentation. Given random variables $X, Y$ and an external source of random bits $\epsilon$, we augment with a mapping $f(\epsilon, X, Y) = (X', Y')$. Left: An augmentation is CE-preserving if it preserves conditional entropy between $X, Y$. Center: A CE-increasing augmentation increases $H(Y'|X')$. Right: Invertible CE-increasing augmentations can be used to combat memorization overfitting: the model must rely on the base learner to implicitly recover $\epsilon$ from $(x_s, y_s')$ in order to restore predictiveness between the input and label.

The information capacity of a learned channel is given by $I(X; \hat{Y} \mid \theta, \mathcal{D})$, the conditional mutual information given model parameters $\theta$ and meta-training dataset $\mathcal{D}$. Yin et al. (2020) propose using a Variational Information Bottleneck Alemi et al. (2017) to regularize the model by restricting the information flow between $x_q$ and $y_q$. Correctly balancing regularization to prevent overfitting or underfitting can be challenging, as the relationship between constraints on model capacity and generalization is hard to predict. For example, overparameterized networks have been empirically shown to have lower generalization error Novak et al. (2018). To corroborate the difficulty of understanding this relationship, in Section 5.2 we show that weight decay limited the baseline performance of MAML on the Pascal3D pose regression task, and when disabled it performs substantially better.

Rather than crippling the model by limiting its access to $x_q$, we want to instead use data augmentation to encourage the model to pay more attention to $(x_s, y_s)$. A naive approach would augment in the same way as classical machine learning methods, by applying a CE-preserving augmentation to each task. However, the overfitting problem in meta-learning requires different augmentation. We wish to couple together $(x_s, y_s)$ and $(x_q, y_q)$ such that the model cannot minimize training loss using $x_q$ alone. This can be done through CE-increasing augmentation. Labels $y_s, y_q$ are encrypted to $y_s', y_q'$ with the same random key $\epsilon$, in a way such that the base learner can only recover $\epsilon$ by associating $x_s \to y_s'$, and doing so is necessary to associate $x_q \to y_q'$. See Figure 1d and Figure 2 for a diagram.

Although CE-increasing augmentations may use any $f$, within our experiments we only consider CE-increasing augmentations where $f(\epsilon, x, y) = (x, g(\epsilon, y))$ and $g: \epsilon, Y \to Y$ is invertible. We do this because our aim is to increase task diversity. Given a single task $(x, y)$, such CE-increasing augmentations create a distribution of tasks indexed by $\epsilon$, and assuming a fixed noise source $\epsilon$, said distribution is widest when $g$ is invertible. CE-increasing augmentations of this form raise $H(Y_q' \mid X_q)$ by $H(\epsilon)$. In order to lower $H(Y_q' \mid X_q)$ to a level where the task can be solved, the learner must extract at least $H(\epsilon)$ bits from $(x_s, y_s')$. This reduces memorization overfitting, since it guarantees $(x_s, y_s')$ has some information required to predict $y_q'$, even given the model and $x_q$.
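The claim that an invertible label augmentation raises conditional entropy by the entropy of the noise source can be checked exactly on a small discrete example. The sketch below assumes deterministic class labels, uniformly distributed inputs, and a cyclic label shift $y' = (y + \epsilon) \bmod N$ with $\epsilon$ uniform on $\{0, \dots, N-1\}$. The cyclic shift is a stand-in chosen for simplicity (it is invertible given $\epsilon$); the function names are hypothetical.

```python
import numpy as np


def conditional_entropy(joint):
    """H(Y|X) in bits for a joint probability table indexed as joint[x, y]."""
    p_x = joint.sum(axis=1, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        # Where joint is zero the term contributes nothing, so substitute 1 (log2(1) = 0).
        cond = np.where(joint > 0, joint / p_x, 1.0)
    return float(-(joint * np.log2(cond)).sum())


def clean_joint(labels, num_classes):
    """Joint table when each input x has one fixed label and inputs are uniform."""
    joint = np.zeros((len(labels), num_classes))
    joint[np.arange(len(labels)), labels] = 1.0 / len(labels)
    return joint


def label_shift_joint(labels, num_classes):
    """Joint table of (x, y') when y' = (y + eps) mod N and eps is uniform on {0..N-1}."""
    num_inputs = len(labels)
    joint = np.zeros((num_inputs, num_classes))
    for x, y in enumerate(labels):
        for eps in range(num_classes):
            joint[x, (y + eps) % num_classes] += 1.0 / (num_inputs * num_classes)
    return joint
```

With four classes, `conditional_entropy(clean_joint(labels, 4))` is zero because the labels are deterministic, while `conditional_entropy(label_shift_joint(labels, 4))` equals $\log_2 4 = 2$ bits, which is $H(\epsilon)$ for this noise source. Those two bits are what the learner has to recover from the support set.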

CE-increasing augmentations move the task setting from non-mutually-exclusive to mutually-exclusive, as once the data is augmented, no single model can solve all tasks at once. By adding new and different varieties of tasks to the meta-train set, CE-increasing augmentations also help avoid learner overfitting and help the base learner generalize to test set tasks. This effect is similar to the effect that data augmentation has in classical machine learning to avoid overfitting.

Few-shot classification benchmarks such as Mini-ImageNet Vinyals et al. (2016) have meta-augmentation in them by default. New tasks created by shuffling the class index of previous tasks are added to the training set. This form of augmentation is CE-increasing and hence moves the task setting to mutually-exclusive, thereby avoiding memorization overfitting. This is accompanied by creation of new tasks through combining classes from different tasks, adding more variation to the meta-train set. These added tasks help avoid learner overfitting, and in Section 5.1 we analyze the size of this effect.

Our formulation generalizes to other forms of CE-increasing augmentation besides permuting class labels for classification. For multivariate regression tasks where the support set contains a regression target, the dimensions of $y_s, y_q$ can be treated as class logits to be permuted. This reduces to an identical setup to the classification case. For scalar meta-learning regression tasks and situations where output dimensions cannot be permuted, we show that the CE-increasing augmentation of adding uniform noise to the regression targets $y_s, y_q$ and then clipping to valid bounds generates enough new tasks to help reduce overfitting.
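Both regression variants mentioned here fit in a short sketch. It assumes that a single random key is drawn per task and applied identically to support and query targets, in line with the "same random key" description in this section; the function names, the noise scale, and the bounds are illustrative parameters rather than values from the paper.

```python
import numpy as np


def add_noise_and_clip(rng, y_support, y_query, noise_scale, low, high):
    """Shift all targets of one task by a shared uniform offset, then clip to [low, high]."""
    eps = rng.uniform(-noise_scale, noise_scale)  # one key per task
    y_s_aug = np.clip(y_support + eps, low, high)
    y_q_aug = np.clip(y_query + eps, low, high)
    return y_s_aug, y_q_aug, eps


def permute_target_dims(rng, y_support, y_query):
    """Apply one shared random permutation to the output dimensions of a task."""
    perm = rng.permutation(y_support.shape[-1])
    return y_support[..., perm], y_query[..., perm], perm
```

Because the key is shared within a task, the base learner can in principle infer it from the support pairs and apply it when predicting the query targets. Clipping is not invertible at the bounds, so a large `noise_scale` relative to the target range discards some of that information.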

4 Related work

Data augmentation has been applied to several domains with strong results, including image classification Krizhevsky et al. (2012), speech recognition Ko et al. (2015), reinforcement learning Kostrikov et al. (2020), and language learning Zhang et al. (2015). Within meta-learning, augmentation has been applied in several ways. Mehrotra and Dukkipati (2017) train a generator to generate new examples for one-shot learning problems, and Liu et al. (2020) applied task augmentation to few-shot image classification by rotating images, defining each rotation as a new task. These augmentations add more data and tasks, but do not turn non-mutually-exclusive problems into mutually-exclusive ones, since the pairing between $x_q, y_q$ is still consistent across meta-learning episodes, leaving open the possibility of memorization overfitting. Antoniou and Storkey (2019) and Khodadadeh et al. (2019) generate tasks by randomly selecting $x_s$ from an unsupervised dataset, using data augmentation on $x_s$ to generate more examples for the random task. We instead create a mutually-exclusive task setting by modifying $y_s, y_q$ to create more tasks with shared $x_s, x_q$.

The large interest in the field has spurred the creation of meta-learning benchmarks Lake et al. (2015); Vinyals et al. (2016); Triantafillou et al. (2020); Yu et al. (2020); Ravi and Larochelle (2016), investigations into tuning few-shot models Antoniou et al. (2018), and analysis of what these models learn Raghu et al. (2020). For overfitting in MAML in particular, regularization has been done by encouraging the model’s output to be uniform before the base learner updates the model Jamal and Qi (2019), limiting the updateable parameters Zintgraf et al. (2019), or regularizing the gradient update based on cosine similarity or dropout Guiroy et al. (2019); Tseng et al. (2020). Our work is most closely related to Yin et al. (2020), which identifies the non-mutually-exclusive tasks problem, and proposes an information bottleneck to address memorization overfitting. Our paper tackles this problem through appropriately designed meta-augmentation.

5 Experiments

We ablate meta-augmentation across few-shot image classification benchmarks in Section 5.1, demonstrating the drop in performance when these benchmarks are made non-mutually-exclusive by removing the meta-augmentation present in them. Section 5.2 then demonstrates gains achievable from using CE-increasing augmentations on regression datasets. [1]

[1] Code and data available at

5.1 Few-shot image classification (Omniglot, Mini-ImageNet, D’Claw, Meta-Dataset)

We carry out our experiments on $K$-shot $N$-way classification tasks where $K = 1$. Common few-shot image classification benchmarks, like Omniglot and Mini-ImageNet, are already mutually-exclusive by default through meta-augmentation. In order to study the effect of meta-augmentation using task shuffling on various datasets, we turn these mutually-exclusive benchmarks into non-mutually-exclusive versions of themselves by partitioning the classes into groups of $N$ classes without overlap. These groups form the meta-train tasks, and over all of training, class order is never changed.


The Omniglot dataset Lake et al. (2015) is a collection of 1623 different handwritten characters from different alphabets. The task is to identify new characters given a few examples of that character. We train a 1-shot, 5-way classification model using MAML Finn et al. (2017). When early stopping is applied, all our MAML models reached 98% test-set accuracy, including on non-mutually-exclusive task sets. This suggests that although non-mutually-exclusive meta-learning problems allow for memorization overfitting, it is still possible to learn models which do not overfit and which generalize to novel test-time tasks if there is sufficient training data. We therefore moved to more challenging datasets.


Mini-ImageNet Vinyals et al. (2016) is a more complex few-shot dataset, based on the ILSVRC object classification dataset Deng et al. (2009). There are 100 classes in total, with 600 samples each. We train 1-shot, 5-way classification models using MAML. We ran experiments in three variations, shown in Figure 3. In non-mutually-exclusive, classes are partitioned into ordered tasks that keep the class index consistent across epochs. In intershuffle, tasks are formed by sampling classes randomly, where previously used classes can be resampled in later tasks. This is the default protocol for Mini-ImageNet. Doing so ensures that over the course of training, a given image $x$ will be assigned multiple different class indices $y$, making the task setting mutually-exclusive. This also creates more diverse groups of classes / tasks compared to the previous partitioning method. To ablate whether diversity or mutual exclusivity is more important, we devise the intrashuffle scheme, which shuffles class label order within the partitions from the non-mutually-exclusive setup. This uses the same classes within each task as non-mutually-exclusive, and the only difference is varying class order within those tasks. This again gives a mutually-exclusive setting, but with less variety compared to intershuffle.
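The three task-construction schemes can be sketched as follows. This is a minimal illustration, not the benchmark code: classes are represented by integer ids, a task is an array of class ids whose position gives the class index seen by the model, and the function names and scheme strings are hypothetical.

```python
import numpy as np


def partition_tasks(num_classes, n_way):
    """Fixed disjoint groups of n_way classes, in a fixed order."""
    classes = np.arange(num_classes)
    usable = num_classes - num_classes % n_way  # drop leftover classes, if any
    return [classes[i:i + n_way] for i in range(0, usable, n_way)]


def sample_task(rng, partitions, num_classes, n_way, scheme):
    """Return the class ids of one task; position in the array is the class index."""
    if scheme == "non_mutually_exclusive":
        # Same groups, same order, every time.
        return partitions[rng.integers(len(partitions))].copy()
    if scheme == "intrashuffle":
        # Same groups, but the class order inside a group is reshuffled.
        return rng.permutation(partitions[rng.integers(len(partitions))])
    if scheme == "intershuffle":
        # Any n_way distinct classes, in random order.
        return rng.choice(num_classes, size=n_way, replace=False)
    raise ValueError(f"unknown scheme: {scheme}")
```

With 4 classes and 2-way tasks, `non_mutually_exclusive` only ever yields `[0, 1]` or `[2, 3]`, `intrashuffle` adds `[1, 0]` and `[3, 2]`, and `intershuffle` can also produce cross-group tasks such as `[0, 2]`.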

Table 1 (Mini-ImageNet (MAML)) shows that non-mutually-exclusive does worst, followed by intrashuffle, then intershuffle. There is an increase in accuracy from 30% to 43% between non-mutually-exclusive and intrashuffle, and only 43% to 46% between intrashuffle and intershuffle, which indicates shuffled task order is the more important factor. The gap between intrashuffle and intershuffle also indicates that fully randomizing the batches of tasks performs slightly better, as each class is compared to more diverse sets of negative classes. We emphasize that if shuffling is applicable to the problem, intershuffle is the recommended approach. Interestingly, Figure 4 shows that although all non-mutually-exclusive runs overfit at training time, whether they exhibit memorization overfitting or learner overfitting depends on random seed. Both forms of overfitting are avoided by meta-augmentation through intershuffle.

Figure 3: Non-mutually-exclusive, intrashuffle, and intershuffle.

In this example, the dataset has 4 classes, and the model is a 2-way classifier. In non-mutually-exclusive, the model always sees one of 2 tasks. In intrashuffle, the model sees permutations of the classes in the non-mutually-exclusive tasks, which changes class order. In intershuffle, the model sees

Figure 4: Mini-ImageNet results with MAML. Left: In a non-mutually-exclusive setting, this model exhibits memorization overfitting. Train-time performance is high, even before the base learner updates the model based on $(x_s, y_s)$, indicating the model pays little attention to $(x_s, y_s)$. The model fails to generalize to the held-out validation set. Center: This model exhibits learner overfitting. The gap between train pre-update and train post-update indicates the model does pay attention to $(x_s, y_s)$, but the entire system overfits and does poorly on the validation set. The only difference between the left and center plots is the random seed. Right: With intershuffle augmentation, the gap between train pre-update and train post-update indicates the model pays attention to $(x_s, y_s)$, and higher train time performance lines up with better validation set performance, indicating less overfitting.

D’Claw dataset

So far, we have only examined CE-increasing augmentations in existing datasets. To verify the phenomenon applies to a newly generated dataset, we collected a small robotics-inspired image classification dataset using hardware designs from ROBEL Ahn et al. (2019). Figure 5 shows a D’Claw mounted inside a D’Lantern. An object is placed between the fingers of the claw, and the task is to classify whether that object is in the proper orientation or not. A classifier learned on this dataset could be used to automatically annotate whether an autonomous robot has correctly manipulated an object. Labels are $y \in \{0, 1\}$, where $0$ means wrong orientation and $1$ means correct orientation. The dataset has 20 different classes of 35 images each, and the problem is set up as a 1-shot, 2-way classification problem. We see a change from 72.5% to 83.1% for MAML between non-mutually-exclusive and intershuffle (Table 1), a gain consistent with the Mini-ImageNet results.

Figure 5: Example images from the D’Claw classification task. An object is placed between the fingers of the claw. Each object has a target orientation, and the model must classify whether the object is in the correct orientation or not.

| Problem setting              | Non-mutually-exclusive accuracy | Intrashuffle accuracy | Intershuffle accuracy |
|------------------------------|---------------------------------|-----------------------|-----------------------|
| Omniglot                     | 98.1%                           | 98.5%                 | 98.7%                 |
| Mini-ImageNet (MAML)         | 30.2%                           | 42.7%                 | 46.0%                 |
| Mini-ImageNet (Prototypical) | 32.5%                           | 32.5%                 | 37.2%                 |
| Mini-ImageNet (Matching)     | 33.8%                           | 33.8%                 | 39.8%                 |
| D’Claw                       | 72.5%                           | 79.8%                 | 83.1%                 |

Table 1: Few-shot image classification test set results. Results use MAML unless otherwise stated. All results are in 1-shot 5-way classification, except for D’Claw which is 1-shot 2-way.


The problem of non-mutually-exclusive tasks is not just a problem for MAML – it is a general problem for any algorithm that trains a learning mechanism and model end-to-end. Applying the same non-mutually-exclusive Mini-ImageNet transform, we evaluate baselines from Meta-Dataset Triantafillou et al. (2020), and show similar drops in performance for their implementation of Matching Networks Vinyals et al. (2016) and Prototypical Networks Snell et al. (2017) (shown in Table 1). For these methods, performance is identical between non-mutually-exclusive and intrashuffle because they are nearest-neighbor based models, which are permutation invariant by construction.
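The permutation invariance of nearest-neighbor style models can be seen in a small sketch. It assumes a prototype classifier that works directly on raw feature vectors with squared Euclidean distance; real Prototypical Networks apply a learned embedding first, which is omitted here, and the function name is hypothetical.

```python
import numpy as np


def prototype_predict(x_support, y_support, x_query, n_way):
    """Assign each query point the class index of the nearest class-mean prototype."""
    prototypes = np.stack(
        [x_support[y_support == c].mean(axis=0) for c in range(n_way)]
    )
    # Squared Euclidean distance from every query point to every prototype.
    dists = ((x_query[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)
```

If the support labels are relabeled with a permutation `perm` (so class `c` becomes `perm[c]`), the predictions are relabeled by the same permutation and accuracy against the equally relabeled query targets does not change. Shuffling class order within a fixed group of classes therefore gives such a model nothing new to learn from, which matches the identical non-mutually-exclusive and intrashuffle numbers in Table 1.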

5.2 Regression tasks (Sinusoid, Pascal3D Pose Regression)

Our regression experiments use tasks with a scalar output, which are not well suited to shuffling. For these tasks, we experiment with the CE-increasing augmentation of simply adding randomly sampled noise to regression targets $y_s, y_q$.


We start with the toy 1D sine wave regression problem, where $y = A \sin(x - \phi)$. Amplitude $A$ and phase $\phi$ are sampled randomly for each task, and inputs $x$ are sampled from a bounded domain. To create a non-mutually-exclusive version, the domain is subdivided into 10 disjoint intervals. Each interval is assigned a different task $(A, \phi)$, and $(x, y)$ is generated based on the sine wave for the interval $x$ lies in. The 10 tasks are non-mutually-exclusive because there exists a continuous function that covers each piecewise component.

To augment MAML, for each task and the corresponding interval, new tasks $(x, y + \epsilon)$, where $\epsilon$ is randomly sampled noise, are added to the meta-train dataset. This is CE-increasing, since even given $x$ and the interval it lies in, the model can no longer uniquely identify the task, thereby increasing $H(Y|X)$. Figure 6(a) compares the performance of models that are 1) trained with supervised regression on all the tasks, 2) trained using MAML, and 3) trained using MAML + augmentation. We observe that MAML + augmentation performs best. Since the tasks are non-mutually-exclusive, the baseline MAML model memorizes all the different training tasks (sine functions). The 1D sine regression problem is simple enough that MAML can learn even test tasks with a few gradient steps, but MAML + augmentation adapts more quickly because of less memorization.
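A sketch of the non-mutually-exclusive sinusoid setup and its augmentation is given below. All numeric ranges (amplitude in [0.1, 5.0], phase in [0, π], inputs in [-5, 5], uniform noise) are assumptions borrowed from common sinusoid meta-learning setups, not values stated in this text, and the intervals here are adjacent rather than separated by gaps. The function names are hypothetical.

```python
import numpy as np


def make_interval_tasks(rng, num_tasks=10, x_low=-5.0, x_high=5.0):
    """Give each input interval its own (amplitude, phase); ranges are assumed values."""
    edges = np.linspace(x_low, x_high, num_tasks + 1)
    amplitudes = rng.uniform(0.1, 5.0, size=num_tasks)
    phases = rng.uniform(0.0, np.pi, size=num_tasks)
    return [(amplitudes[i], phases[i], edges[i], edges[i + 1]) for i in range(num_tasks)]


def sample_episode(rng, task, num_points, noise_scale=0.0):
    """Sample (x, y) for one task, with one shared target offset eps per episode."""
    amplitude, phase, lo, hi = task
    x = rng.uniform(lo, hi, size=num_points)
    eps = rng.uniform(-noise_scale, noise_scale)  # noise_scale=0 gives the unaugmented task
    y = amplitude * np.sin(x - phase) + eps
    return x, y, eps
```

Without augmentation the interval that `x` falls in identifies the task, so a single function can fit all ten pieces. With `noise_scale > 0` the offset differs from episode to episode, and the only way to predict the query targets is to read the offset off the support pairs.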

Figure 6: Panels (a) Sinusoid, (b) CNP, (c) MAML. (a) We add CE-increasing noise to $y_s, y_q$ such that the noise must be inferred from the support set, and show that it improves generalization on a sinusoid regression task. (b) Post-update MSE on the meta-test set for MR-CNP, sweeping combinations of model regularization strength $\beta$ and meta-data augmentation strength. Lower MSE is better. (c) Meta-augmentation with noise sampled from a discrete set provides additional improvements on top of MR-MAML.

Pascal3D Pose Regression

We show that for the regression problem introduced by Yin et al. (2020), it is still possible to reduce overfitting via meta-augmentation. We extended the open-source implementation provided by Yin et al. (2020) of their proposed methods, MR-MAML and MR-CNP, and show that meta-augmentation provides large improvements on top of IB regularization. Each task is to take a 128x128 grayscale image of an object from the Pascal 3D dataset and predict its angular orientation (normalized between 0 and 10) about the Z-axis, with respect to some unobserved canonical pose specific to each object. The set of tasks is non-mutually-exclusive because each object is visually distinct, allowing the model to overfit to the poses of training objects, neglecting the base learner and $(x_s, y_s)$. Unregularized models have poor task performance at test time, because the model does not know the canonical poses of novel objects.

We use two different augmentation schemes. For CNP models, we added random uniform noise, with a range set by the noise scale, to both $y_s$ and $y_q$, with angular wraparound to the valid range of normalized orientations. To study the interaction effects between meta-augmentation and the information bottleneck regularizer, we did a grid search over noise scale and regularization strength $\beta$. Figure 6(b) shows that data augmentation’s benefits are complementary to model regularization. Adding uniform noise to MAML and MR-MAML resulted in underfitting the data (Appendix B), so we chose a simpler CE-increasing augmentation where $\epsilon$ is randomly sampled from a discrete set. This fixed the problem of underfitting while still providing regularization benefits. Best results are obtained by applying augmentation to MR-MAML without the information bottleneck ($\beta = 0$; weights are still sampled stochastically).
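The pose augmentation can be sketched as a shared angular shift with wraparound. The sketch assumes orientations normalized to [0, 10), as described for this dataset, and draws the offset from a caller-supplied discrete set. The example offsets below are placeholders, not the set used in the experiments, and applying wraparound in the discrete-offset case is an assumption made for this illustration.

```python
import numpy as np


def shift_pose_targets(rng, y_support, y_query, offsets, period=10.0):
    """Shift support and query angles by one shared offset, wrapping into [0, period)."""
    eps = rng.choice(offsets)  # one key per task, drawn from a discrete set
    y_s_aug = (y_support + eps) % period
    y_q_aug = (y_query + eps) % period
    return y_s_aug, y_q_aug, eps


# Hypothetical usage: four evenly spaced offsets on the normalized angle range.
example_offsets = np.array([0.0, 2.5, 5.0, 7.5])
```

Wraparound keeps the mapping invertible given the offset, unlike clipping, so no target information is destroyed. A uniform-noise variant would replace `rng.choice(offsets)` with `rng.uniform(-scale, scale)` for some chosen `scale`.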

Table 2 reports meta-test MSE scores from Yin et al. (2020) along with our results, aggregated over 5 independent training runs. The baseline was underfitting due to excessive regularization; the existing MAML implementation is improved by removing the weight decay (WD) penalty.

Method | MAML (WD=1e-3) | MAML (WD=0) | MR-MAML ($\beta$=0.001) | MR-MAML ($\beta$=0) | CNP | MR-CNP
No Aug

Table 2: Pascal3D pose prediction error (MSE) means and standard deviations. Removing weight decay (WD) improves the MAML baseline and augmentation improves the MAML, MR-MAML, CNP, MR-CNP results. Bracketed numbers copied from Yin et al. (2020).

6 Discussion and future work

Setting aside how “task” and “example” are defined for a moment, the most general definition of a meta-learner is a black box function $f: ((x_s, y_s), x_q) \mapsto \hat{y}_q$. In this light, memorization overfitting is just a classical machine learning problem in disguise: a function approximator pays too much attention to one input $x_q$, and not enough to the other input $(x_s, y_s)$, when the former is sufficient to solve the task at training time. The two inputs could take many forms, such as different subsets of pixels within the same image.

By a similar analogy, learner overfitting corresponds to correct function approximation on input examples from the training set, and a systematic failure to generalize from those to the test set. Although we have demonstrated that meta-augmentation is helpful, it is important to emphasize it has its limitations. Distribution mismatch between train-time tasks and test-time tasks can be lessened through augmentation, but augmentation may not entirely remove it.

One crucial distinction between data augmentation in classical machine learning and meta-augmentation in meta-learning is the importance of CE-increasing augmentations. For data augmentation in classical machine learning, the aim is to generate more varied examples, within a single task. Meta-augmentation has the exact opposite aim: we wish to generate more varied tasks, for a single example, to force the learner to quickly learn a new task from feedback. Our experiments exclusively focused on changing outputs $y$ without changing inputs $x$, as these augmentations were simpler and easier to study. However, given the usefulness of CE-preserving augmentations in classical setups, it is likely that a combination of CE-preserving noise on $x$ and CE-increasing noise on $y$ will do best. We leave this for future work.

Broader Impact

Our work discusses methods of improving meta-learning through meta-augmentation at the task level, and does not target any specific application of meta-learning. The learning algorithms that meta-learning generates are learned from the data in train-time tasks. It is possible that these approaches inject bias not only into how models perform after training, but also into the training procedure itself. Biased learners can result in biased model outcomes even when the support set presented at test time is unbiased, and this may not be straightforward to detect because the behavior of the learner occurs upstream of actual model predictions. We believe our work is a positive step towards mitigating bias in meta-learning algorithms, by helping avoid overfitting to certain parts of the inputs.

We thank Mingzhang Yin and George Tucker for discussion and help in reproducing experimental results for pose regression experiments. We thank Chelsea Finn and Honglak Lee for early stage project discussion, Erin Grant for help on the meta-dataset codebase, and Luke Metz, Jonathan Tompson, Chelsea Finn, and Vincent Vanhoucke for reviewing publication drafts of the work.

During his internship at Google, Jana devised, implemented and ran MAML experiments for Omniglot, mini-Imagenet and Sinusoid regression. Along with Alex, he collected the ROBEL classification datasets. Alex advised Jana’s internship and re-ran Jana’s code post-internship to include validation with early stopping. Eric advised Jana’s internship and ran experiments on pose prediction and meta-dataset. All three authors interpreted results and contributed to paper writing.

References


  • M. Ahn, H. Zhu, K. Hartikainen, H. Ponte, A. Gupta, S. Levine, and V. Kumar (2019) ROBEL: RObotics BEnchmarks for Learning with low-cost robots. In Conference on Robot Learning (CoRL), Cited by: §5.1.
  • A. Alemi, I. Fischer, J. Dillon, and K. Murphy (2017) Deep variational information bottleneck. In ICLR, External Links: Link Cited by: §3.
  • A. Antoniou, H. Edwards, and A. Storkey (2018) How to train your maml. In ICLR, Cited by: §4.
  • A. Antoniou and A. Storkey (2019) Assume, augment and learn: unsupervised few-shot meta-learning via random labels and data augmentation. arXiv preprint arXiv:1902.09884. Cited by: §4.
  • Y. Bengio, S. Bengio, and J. Cloutier (1991) Learning a synaptic learning rule. In IJCNN-91-Seattle International Joint Conference on Neural Networks, Vol. ii, pp. 969 vol.2–. Cited by: §1.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, Cited by: §5.1.
  • H. C. Ellis (1965) The transfer of learning. Cited by: §1.
  • M. Fadaee, A. Bisazza, and C. Monz (2017) Data augmentation for low-resource neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, Canada, pp. 567–573. External Links: Document Cited by: §1.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. Cited by: §2, §5.1.
  • C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine (2017) One-shot visual imitation learning via meta-learning. In Proceedings of the 1st Annual Conference on Robot Learning, S. Levine, V. Vanhoucke, and K. Goldberg (Eds.), Proceedings of Machine Learning Research, Vol. 78, pp. 357–368. External Links: Link Cited by: §2.
  • M. Garnelo, D. Rosenbaum, C. Maddison, T. Ramalho, D. Saxton, M. Shanahan, Y. W. Teh, D. Rezende, and S. M. A. Eslami (2018) Conditional neural processes. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 1704–1713. External Links: Link Cited by: §2, §2.
  • S. Guiroy, V. Verma, and C. Pal (2019) Towards understanding generalization in gradient-based meta-learning. arXiv preprint arXiv:1907.07287. Cited by: §4.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
  • S. Hochreiter, A. S. Younger, and P. R. Conwell (2001) Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pp. 87–94. Cited by: §1.
  • M. A. Jamal and G. Qi (2019) Task agnostic meta-learning for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11719–11727. Cited by: §4.
  • S. Khodadadeh, L. Boloni, and M. Shah (2019) Unsupervised meta-learning for few-shot image classification. In Advances in Neural Information Processing Systems, pp. 10132–10142. Cited by: §4.
  • T. Ko, V. Peddinti, D. Povey, and S. Khudanpur (2015) Audio augmentation for speech recognition. In Sixteenth Annual Conference of the International Speech Communication Association, Cited by: §1, §4.
  • I. Kostrikov, D. Yarats, and R. Fergus (2020) Image augmentation is all you need: regularizing deep reinforcement learning from pixels. arXiv preprint arXiv:2004.13649. Cited by: §1, §4.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §4.
  • B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum (2015) Human-level concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338. Cited by: §4, §5.1.
  • M. Laskin, K. Lee, A. Stooke, L. Pinto, P. Abbeel, and A. Srinivas (2020) Reinforcement learning with augmented data. arXiv preprint arXiv:2004.14990. Cited by: §1.
  • J. Liu, F. Chao, and C. Lin (2020) Task augmentation by rotating for meta-learning. arXiv preprint arXiv:2003.00804. Cited by: §3, §4.
  • A. Mehrotra and A. Dukkipati (2017) Generative adversarial residual pairwise networks for one shot learning. arXiv preprint arXiv:1703.08033. Cited by: §4.
  • L. Metz, N. Maheswaranathan, B. Cheung, and J. Sohl-Dickstein (2018) Meta-learning update rules for unsupervised representation learning. arXiv preprint arXiv:1804.00222. Cited by: §2.
  • R. Novak, Y. Bahri, D. A. Abolafia, J. Pennington, and J. Sohl-Dickstein (2018) Sensitivity and generalization in neural networks: an empirical study. arXiv preprint arXiv:1802.08760. Cited by: §3.
  • A. Raghu, M. Raghu, S. Bengio, and O. Vinyals (2020) Rapid learning or feature reuse? towards understanding the effectiveness of maml. In International Conference on Learning Representations, Cited by: §4.
  • S. Ravi and H. Larochelle (2016) Optimization as a model for few-shot learning. In International Conference on Learning Representations, Cited by: §4.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Document Cited by: §1.
  • A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap (2016) Meta-learning with memory-augmented neural networks. In International conference on machine learning, pp. 1842–1850. Cited by: §2.
  • J. Schmidhuber, J. Zhao, and M. Wiering (1996) Simple principles of metalearning. Technical report Istituto Dalle Molle Di Studi Sull Intelligenza Artificiale (IDSIA) 69, pp. 1–23. Cited by: §1.
  • C. Shorten and T. M. Khoshgoftaar (2019) A survey on image data augmentation for deep learning. Journal of Big Data 6 (1), pp. 60. Cited by: §1, §3.
  • J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in neural information processing systems, pp. 4077–4087. Cited by: §5.1.
  • S. Thrun (1998) Lifelong learning algorithms. In Learning to learn, pp. 181–209. Cited by: §1.
  • E. Triantafillou, T. Zhu, V. Dumoulin, P. Lamblin, U. Evci, K. Xu, R. Goroshin, C. Gelada, K. Swersky, P. Manzagol, and H. Larochelle (2020) Meta-dataset: a dataset of datasets for learning to learn from few examples. In International Conference on Learning Representations, Cited by: §2, §3, §4, §5.1.
  • H. Tseng, Y. Chen, Y. Tsai, S. Liu, Y. Lin, and M. Yang (2020) Regularizing meta-learning via gradient dropout. External Links: 2004.05859 Cited by: §4.
  • O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630–3638. Cited by: §3, §4, §5.1, §5.1.
  • M. Yin, G. Tucker, M. Zhou, S. Levine, and C. Finn (2020) Meta-learning without memorization. In International Conference on Learning Representations, Cited by: Appendix B, Figure 1, §2, §2, §3, §4, §5.2, §5.2, Table 2.
  • T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine (2020) Meta-world: a benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pp. 1094–1100. Cited by: §4.
  • X. Zhang, J. Zhao, and Y. LeCun (2015) Character-level convolutional networks for text classification. In Advances in neural information processing systems, pp. 649–657. Cited by: §4.
  • L. Zintgraf, K. Shiarli, V. Kurin, K. Hofmann, and S. Whiteson (2019) Fast context adaptation via meta-learning. In International Conference on Machine Learning, pp. 7693–7702. Cited by: §4.


Appendix A ImageNet augmentation ablation

ILSVRC2012 classification experiments were trained on a TPU v2-128 using publicly available source code. This codebase applies random crops and random left-right flips to the training images. Training a ResNet-50 baseline achieves 76% top-1 accuracy after 60 epochs. The same model with augmentations removed achieves 69% accuracy, a drop of 7 percentage points that quantifies the performance increase data augmentation provides.

Appendix B Pose Regression

We modified the publicly available open-source implementation. Instead of re-generating a validation set from scratch using the pose_data code, we used 10% of the training dataset as a validation dataset to determine the optimal noise hyperparameters and outer learning rate, while keeping the remaining default hyperparameters fixed. Afterward, the validation set was merged back into the training set.

We encountered difficulty reproducing the experimental results for MR-MAML (2.26 (0.09)) from Yin et al. (2020), despite using publicly available code and IB hyperparameters suggested by the authors via email correspondence. It is possible that the discrepancy was due to another hyperparameter (e.g. outer and inner loop learning rates) not being set properly. We followed the same evaluation protocol as the authors, confirmed via personal correspondence: for each independent trial, we trained until convergence (the default number of iterations specified) and report the average post-update test error for the last 1000 training iterations. Early stopping on validation error followed by evaluation on the test set may result in better performance across all methods, although we did not do this, for the sake of consistency in comparing with prior work.
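The evaluation protocol described above can be sketched as follows: average the post-update test error over the final 1000 logged iterations of each trial, then summarize across trials. The helper names and the toy error logs are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def mean_of_last(values, window=1000):
    """Average the final `window` entries (or all of them if fewer exist)."""
    if not values:
        raise ValueError("no values recorded")
    tail = values[-window:]
    return sum(tail) / len(tail)


def summarize_trials(per_trial_errors, window=1000):
    """Return (mean, sample standard deviation) of per-trial final errors."""
    finals = [mean_of_last(errors, window) for errors in per_trial_errors]
    spread = stdev(finals) if len(finals) > 1 else 0.0
    return mean(finals), spread


# Toy logs, one post-update test error per training iteration (not real data).
trial_logs = [
    [3.0] * 500 + [2.0] * 1000,
    [3.0] * 500 + [4.0] * 1000,
]
avg_error, error_std = summarize_trials(trial_logs)
```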

Surprisingly, adding uniformly sampled noise seems to hurt MR-MAML performance. We note that this augmentation successfully reduces the gap between training and test error, indicating that it fixes overfitting. However, it seems to result in underfitting on the training set, as indicated by lower training set performance. This is shown in Figure 7. We hypothesize that gradients become too noisy under the current architecture, batch size, and learning rate hyperparameters set by the baseline when too much noise is added. Sampling from a discrete set of 4 noise values provides just enough task-level augmentation to combat overfitting without underfitting the training data.

Figure 7: (b) Uniform augmentation: adding data augmentation via continuous uniform noise to MR-MAML decreases the train-test gap, but seems to underfit the data. We hypothesize that this might be addressed by increasing the size of the model or using an alternate architecture to aid optimization, though we leave model architecture changes out of the scope of this work. (c) Discrete augmentation: sampling augmentations from a discrete set of noise values reduces both training and test error.

To investigate this further, Figure 8(a) displays test loss as a function of the number of additional noise values added to $y$. As the number of discrete noise values becomes large, this becomes approximately equivalent to the uniform noise scheme. We find that MR-MAML achieves the lowest error at 4 augmentations, while MAML achieves it at 2 augmentations. Interestingly, as the number of noise values increases further, performance gets worse for all methods. MR-MAML exhibits less underfitting than MAML, suggesting that there are complementary benefits to IB regularization and augmentation when more noise is added. Figure 8(b) shows single-trial test error over combinations of the number of discrete noise values and the IB strength parameter $\beta$. Our best model combines noise with $\beta = 0$, which removes the IB constraint but preserves the stochastic weights used for MAML.
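To make the two noise schemes concrete, here is a small sketch of task-level output augmentation for regression: a single offset is drawn per task and added to every target in that task, either from a discrete set of k values or from a continuous uniform distribution. The evenly spaced values in [0, 1), the noise range, and the function names are all assumptions made for illustration. The text above does not specify them.

```python
import random


def discrete_noise_values(k):
    """k evenly spaced offsets in [0, 1). The spacing is an assumption."""
    return [i / k for i in range(k)]


def augment_task(support_y, query_y, rng, k=None):
    """Shift all targets of one task by a single shared offset.

    With k=None the offset is drawn uniformly from [0, 1); otherwise it is
    drawn from a discrete set of k values. Support and query targets share
    the offset, so the offset can only be inferred from the support set.
    """
    if k is None:
        eps = rng.random()
    else:
        eps = rng.choice(discrete_noise_values(k))
    new_support = [y + eps for y in support_y]
    new_query = [y + eps for y in query_y]
    return new_support, new_query, eps


rng = random.Random(0)
# Discrete scheme with 4 noise values, then the continuous uniform scheme.
s_disc, q_disc, eps_disc = augment_task([0.1, 0.2], [0.3], rng, k=4)
s_unif, q_unif, eps_unif = augment_task([0.1, 0.2], [0.3], rng)
```

As k grows, the discrete set covers [0, 1) more densely and the two branches behave increasingly alike, which mirrors the comparison made in the paragraph above.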

Figure 8: (a) Test performance as a function of number of discrete noise augmentations. Shaded regions correspond to standard deviation over 5 independent trials. (b) Figure 5(b), but for MR-MAML (single-trial).

Appendix C D’Claw data collection

The D’Claw dataset contains images from 10 different objects, placed between the fingers of the claw. Each object had 2 classes, corresponding to whether the object was in the correct natural orientation or not. This task is representative of defining a success detector that might be used as reward in a reinforcement learning context, although we purely treat it as a supervised image classification problem. The task is made more difficult because the claw occludes the view of the object.

The smartphone camera of a Google Pixel 2 XL was used to collect 35 example images for each class. All images were taken from the same point of view in a vertical orientation, but there is small variation in camera location and larger variation in lighting conditions. The images were resized to 84x84 pixels before training.

Appendix D Few-shot image classification

All few-shot image classification experiments were run on a cloud machine with 4 NVIDIA Tesla K80s and 32 Intel Broadwell CPUs. To maximize resource usage, four jobs were run at once.

Each experiment was trained for 60000 steps. The validation set accuracy was evaluated every 1000 steps, and the model with the best validation set performance was then run on the test set. The code is a modified version of the MAML codebase, and uses the same hyperparameters, network architecture, and datasets. These are briefly repeated below.
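The model-selection procedure above (train for a fixed number of steps, evaluate on the validation set at a fixed interval, keep the best checkpoint) can be sketched as below. The callables `train_step` and `evaluate` are hypothetical placeholders for the actual training and validation code, and the toy versions exist only to make the sketch runnable.

```python
def select_best_checkpoint(train_step, evaluate, total_steps=60000, eval_every=1000):
    """Train for `total_steps`, keeping the state with the best validation score.

    `train_step(state)` returns the next state and `evaluate(state)` returns a
    validation score where higher is better. With mutable model state, a copy
    should be stored instead of a reference.
    """
    state = None
    best_state, best_step, best_score = None, None, float("-inf")
    for step in range(1, total_steps + 1):
        state = train_step(state)
        if step % eval_every == 0:
            score = evaluate(state)
            if score > best_score:
                best_state, best_step, best_score = state, step, score
    return best_state, best_step, best_score


# Toy stand-ins: the "state" is just a step counter, and the validation
# score peaks at step 3000 before degrading.
def toy_train_step(state):
    return (state or 0) + 1


def toy_evaluate(state):
    return -abs(state - 3000)


best_state, best_step, best_score = select_best_checkpoint(
    toy_train_step, toy_evaluate, total_steps=5000, eval_every=1000
)
```

The selected state would then be evaluated once on the test set.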


For Omniglot, the MAML model was trained with a meta batch size of 32 tasks, using a convolutional model, 1 gradient step in the inner loop, and an inner learning rate of . The problem was set up as a 1-shot 5-way classification problem. The training set contained 1200 characters, the validation set contained 100 characters, and the test set was the remaining 323 characters. Each character had 20 examples.
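For readers unfamiliar with the 1-shot 5-way setup, the following sketch samples one such task from a pool of classes. The function name, the string placeholders standing in for images, and the number of query examples per class (`n_query`) are assumptions for illustration, not values taken from the experiments.

```python
import random


def sample_episode(class_to_examples, rng, n_way=5, k_shot=1, n_query=1):
    """Sample an n-way k-shot task with disjoint support and query examples.

    Class order is random, so the label index given to a class differs
    from episode to episode.
    """
    classes = rng.sample(sorted(class_to_examples), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        picks = rng.sample(class_to_examples[cls], k_shot + n_query)
        support += [(x, label) for x in picks[:k_shot]]
        query += [(x, label) for x in picks[k_shot:]]
    return support, query


# Toy stand-in for the training split: 1200 characters, 20 examples each.
rng = random.Random(0)
data = {c: [f"char{c}_img{i}" for i in range(20)] for c in range(1200)}
support, query = sample_episode(data, rng)
```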


For Mini-ImageNet, the MAML model was trained with a meta batch size of 4 tasks, using a convolutional model, 5 gradient steps in the inner loop, and an inner learning rate of . The problem was set up as a 1-shot 5-way classification problem. There are 64 training classes, 16 validation classes, and 20 test classes. Each class had 600 examples.


Mini-ImageNet experiments run with the Meta-Dataset codebase used the default hyperparameters for Matching Networks and Prototypical Networks.


The experiments on D’Claw used the same model architecture as Mini-ImageNet, with a meta batch size of 4 tasks, 5 gradient steps in the inner loop, and an inner learning rate of . Unlike the Mini-ImageNet experiments, the problem was set up as a 1-shot 2-way classification problem. For D’Claw experiments, the train set has 6 objects (12 classes), the validation set has 2 objects (4 classes), and the test set has 2 objects (4 classes). Each class had 35 examples. The train, validation, and test splits were generated 3 times. Models were trained on each dataset, and the final reported performance is the average of the 3 test-set performances.
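The D’Claw split procedure can be sketched as follows: the 10 objects are shuffled and divided 6/2/2 into train, validation, and test objects, and this is repeated 3 times. The function names, integer object identifiers, and random seed are hypothetical. Only the split sizes, the 2 classes per object, and the 3 repetitions come from the text.

```python
import random


def split_objects(object_ids, rng, n_train=6, n_val=2, n_test=2):
    """Randomly partition objects into train/val/test groups."""
    ids = list(object_ids)
    if len(ids) != n_train + n_val + n_test:
        raise ValueError("split sizes must cover every object exactly once")
    rng.shuffle(ids)
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }


def classes_for(objects, classes_per_object=2):
    """Each object contributes 2 classes (correct orientation or not)."""
    return [(obj, c) for obj in objects for c in range(classes_per_object)]


rng = random.Random(0)
splits = [split_objects(range(10), rng) for _ in range(3)]
train_classes = classes_for(splits[0]["train"])  # 6 objects -> 12 classes
```

A model would be trained once per split, and the reported number is the mean of the 3 resulting test-set scores.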