1 Introduction
The ability to learn new concepts and skills with small amounts of data is a critical aspect of intelligence that many machine learning systems lack. Meta-learning [29] has emerged as a promising approach for enabling systems to quickly learn new tasks by building upon experience from previous related tasks [32, 19, 28, 27, 8]. Meta-learning accomplishes this by explicitly optimizing for few-shot generalization across a set of meta-training tasks. The meta-learner is trained such that, after being presented with a small task training set, it can accurately make predictions on test datapoints for that meta-training task. While these methods have shown promising results, current methods require careful design of the meta-training tasks to prevent a subtle form of task overfitting
, distinct from standard overfitting in supervised learning. If the task can be accurately inferred from the test input alone, then the task training data can be ignored while still achieving low meta-training loss. In effect, the model will collapse to one that makes zero-shot decisions. This presents an opportunity for overfitting in which the meta-learner generalizes on meta-training tasks but fails to adapt when presented with training data from novel tasks. We call this form of overfitting the memorization problem in meta-learning because the meta-learner memorizes a function that solves all of the meta-training tasks, rather than learning to adapt.

Existing meta-learning algorithms implicitly resolve this problem by carefully designing the meta-training tasks such that no single model can solve all tasks zero-shot; we call tasks constructed in this way mutually exclusive. For example, for $N$-way classification, each task consists of examples from $N$ randomly sampled classes. The classes are labeled from 1 to $N$, and critically, for each task, we randomize the assignment of classes to labels (visualized in Appendix Figure 4). This ensures that the task-specific class-to-label assignment cannot be inferred from a test input alone. However, the mutually exclusive tasks requirement places a substantial burden on the user to cleverly design the meta-training setup (e.g., by shuffling labels or omitting goal information). While shuffling labels provides a reasonable mechanism to force tasks to be mutually exclusive with standard few-shot image classification datasets such as MiniImageNet [27], this solution cannot be applied to all domains where we would like to utilize meta-learning. For example, consider meta-learning a pose predictor that can adapt to different objects: even if different objects are used for meta-training, a powerful model can simply learn to ignore the training set for each task and directly learn to predict the pose of each of the objects. However, such a model would not be able to adapt to new objects at meta-test time.
The primary contributions of this work are: 1) to identify and formalize the memorization problem in meta-learning, and 2) to propose a meta-regularizer (MR) based on information theory as a general approach for mitigating this problem without placing restrictions on the task distribution. We formally differentiate the meta-learning memorization problem from the overfitting problem in conventional supervised learning, and empirically show that naïve applications of standard regularization techniques do not solve the memorization problem in meta-learning. The key insight of our meta-regularization approach is that the model acquired when memorizing tasks is more complex than the model that results from task-specific adaptation, because the memorization model is a single model that simultaneously performs well on all tasks: it must contain in its weights all of the information needed to do well on test points without looking at training points. We would therefore expect the information content of the weights of a memorization model to be larger, and hence the model to be more complex. As a result, we propose an objective that regularizes the information complexity of the meta-learned function class (motivated by Alemi et al. [2], Achille & Soatto [1]). Furthermore, we show that meta-regularization in MAML can be rigorously motivated by a PAC-Bayes bound on generalization. In a series of experiments on non-mutually-exclusive task distributions entailing both few-shot regression and classification, we find that memorization poses a significant challenge for both gradient-based [8] and contextual [11] meta-learning methods, resulting in near-random performance on test tasks in some cases. Our meta-regularization approach enables both of these methods to achieve efficient adaptation and generalization, leading to substantial performance gains across the board on non-mutually-exclusive tasks.
2 Preliminaries
We focus on the standard supervised meta-learning problem (see, e.g., Finn et al. [8]). Briefly, we assume tasks $\mathcal{T}_i$ are sampled from a task distribution $p(\mathcal{T})$. During meta-training, for each task, we observe a set of training data $\mathcal{D}_i = (x_i, y_i)$ and a set of test data $\mathcal{D}^*_i = (x^*_i, y^*_i)$, with $x_i, y_i$ sampled from $p(x, y \mid \mathcal{T}_i)$, and similarly for $\mathcal{D}^*_i$. We denote the entire meta-training set as $\mathcal{M} = \{\mathcal{D}_i, \mathcal{D}^*_i\}_{i=1}^{N}$. The goal of meta-training is to learn a model for a new task $\mathcal{T}$ by leveraging what is learned during meta-training and a small amount of training data $\mathcal{D}$ for the new task. We use $\theta$ to denote the meta-parameters learned during meta-training and $\phi$ to denote the task-specific parameters that are computed based on the task training data.
Following Grant et al. [14] and Gordon et al. [13], given a meta-training set $\mathcal{M}$, we consider meta-learning algorithms that maximize the conditional likelihood $q(\hat{y} = y^* \mid x^*, \theta, \mathcal{D})$, which is composed of three distributions: $q(\theta \mid \mathcal{M})$, which summarizes the meta-training data into a distribution on meta-parameters; $q(\phi \mid \mathcal{D}, \theta)$, which summarizes the per-task training set into a distribution on task-specific parameters; and $q(\hat{y} \mid x^*, \phi, \theta)$, the predictive distribution. These distributions are learned to minimize
$$-\,\mathbb{E}_{q(\theta \mid \mathcal{M})}\Big[\frac{1}{N}\sum_{i} \mathbb{E}_{q(\phi_i \mid \mathcal{D}_i, \theta)}\big[\log q(y^*_i \mid x^*_i, \phi_i, \theta)\big]\Big]. \qquad (1)$$
For example, in MAML [8], $\theta$ and $\phi$ are the weights of a predictor network, $q(\theta \mid \mathcal{M})$ is a delta function learned over the meta-training data, $q(\phi \mid \mathcal{D}, \theta)$ is a delta function centered at a point defined by gradient optimization, and $\phi$ parameterizes the predictor network [14]. In particular, to determine the task-specific parameters $\phi$, the task training data $\mathcal{D}$ and $\theta$ are used in the predictor model via a gradient step on the task training log-likelihood, $\phi = \theta + \alpha \nabla_\theta \sum_{(x, y) \in \mathcal{D}} \log q(y \mid x, \theta)$.
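As a structural sketch of this gradient-based adaptation step, the snippet below performs MAML-style inner-loop adaptation on a toy scalar linear model with squared error; the model, the step size `alpha`, and the number of `steps` are illustrative assumptions, not the paper's architecture or hyperparameters.

```python
def adapt(theta, task_train, alpha=0.1, steps=1):
    """MAML-style adaptation: take gradient steps on the task training loss,
    starting from the meta-learned initialization theta.

    Illustrated with a scalar linear model y = phi * x under squared error;
    alpha and steps are hypothetical hyperparameters.
    """
    phi = theta
    for _ in range(steps):
        # d/dphi of mean_i (phi * x_i - y_i)^2
        grad = sum(2.0 * (phi * x - y) * x for x, y in task_train) / len(task_train)
        phi = phi - alpha * grad  # gradient descent on the task training loss
    return phi
```

With zero steps, `adapt` returns the initialization unchanged, which is exactly the degenerate non-adaptive behavior that memorization induces.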
Another family of meta-learning algorithms are contextual methods [28], such as conditional neural processes (CNP) [12, 11]. CNP instead defines $q(\phi \mid \mathcal{D}, \theta)$ as a mapping from $\mathcal{D}$ to a summary statistic $\phi$ (parameterized by $\theta$). In particular, $\phi$ is the output of an aggregator applied to features extracted from the task training data. Then $\theta$ parameterizes a predictor network that takes $\phi$ and $x^*$ as input and produces a predictive distribution $q(\hat{y} \mid x^*, \phi, \theta)$.
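The contextual pipeline just described (encode each training pair, aggregate into a summary statistic, predict from the test input plus the summary) can be sketched as follows. The "encoder" here is a hand-written toy feature rather than a learned network, so this shows only the structure, not a real CNP.

```python
def cnp_predict(task_train, x_star):
    """CNP-style contextual prediction: encode each training pair, aggregate
    the features into a summary statistic phi, and predict from (x_star, phi).

    The encoder below is a toy hand-picked feature (the slope y/x), standing
    in for a learned feature network.
    """
    feats = [y / x for x, y in task_train if x != 0.0]  # per-pair features
    phi = sum(feats) / len(feats)   # permutation-invariant mean aggregator
    return phi * x_star             # predictor uses both x_star and phi
```

A memorizing CNP corresponds to a predictor that ignores `phi` entirely and maps `x_star` straight to an output.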
In the following sections, we describe a common pitfall for a variety of meta-learning algorithms, including MAML and CNP, and a general meta-regularization approach to prevent this pitfall.
3 The Memorization Problem in Meta-Learning
The ideal meta-learning algorithm will learn $q(\theta \mid \mathcal{M})$ in such a way that the resulting model generalizes to novel tasks. However, we find that unless tasks are carefully designed, current meta-learning algorithms can overfit to the tasks and end up ignoring the task training data (i.e., either $q(\phi \mid \mathcal{D}, \theta)$ does not depend on $\mathcal{D}$, or $q(\hat{y} \mid x^*, \phi, \theta)$ does not depend on $\phi$, as shown in Figure 1), which can lead to poor generalization. This memorization phenomenon is best understood through examples.
Consider a 3D object pose prediction problem (illustrated in Figure 1), where each object has a fixed canonical pose. The $(x, y)$ pairs for the task are 2D greyscale images of the rotated object ($x$) and the rotation angle relative to the fixed canonical pose for that object ($y$). In the most extreme case, for an unseen object, the task is impossible without using $\mathcal{D}$ because the canonical pose of the unseen object is unknown. The number of objects in the meta-training dataset is small, so it is straightforward for a single network to memorize the canonical pose of each training object and to infer the object from the input image (i.e., task overfitting), thus achieving a low training error without using $\mathcal{D}$. However, by construction, this solution will necessarily generalize poorly to test tasks with unseen objects.
As another example, imagine an automated medical prescription system that suggests medication prescriptions to doctors based on patient symptoms and the patient's previous record of prescription responses (i.e., medical history) for adaptation. In the meta-learning framework, each patient represents a separate task. Here, symptoms and prescriptions have a close relationship, so we cannot assign random prescriptions to symptoms, in contrast to classification tasks where we can randomly shuffle the labels to make the tasks mutually exclusive. For this non-mutually-exclusive task distribution, a standard meta-learning system can memorize the patients' identity information during training, leading it to ignore the medical history and rely only on the symptoms combined with the memorized information. As a result, it may issue highly accurate prescriptions on the meta-training set but fail to adapt to new patients effectively. While such a system would achieve a baseline level of accuracy for new patients, it would be no better than a standard supervised learning method applied to the pooled data.
We formally define (complete) memorization as:
Definition 1 (Complete Meta-Learning Memorization).
Complete memorization in meta-learning is when the learned model ignores the task training data, such that $q(\hat{y} \mid x^*, \theta, \mathcal{D}) = q(\hat{y} \mid x^*, \theta)$ (i.e., $I(\hat{y}; \mathcal{D} \mid x^*, \theta) = 0$).
Memorization describes an issue with overfitting the meta-training tasks, but it does not preclude the network from generalizing to unseen $(x, y)$ pairs on tasks similar to the training tasks. Memorization becomes an undesirable problem for generalization to new tasks when $I(y^*; \mathcal{D} \mid x^*) \gg 0$ (i.e., when the task training data is necessary to make accurate predictions, even with exact inference under the data-generating distribution).
A model with the memorization problem may generalize to new datapoints in training tasks but cannot generalize to novel tasks, which distinguishes it from typical overfitting in supervised learning. In practice, we find that MAML and CNP frequently converge to this memorization solution (Table 2). For MAML, memorization can occur when a particular setting of $\theta$ that does not adapt to the task training data can achieve comparable meta-training error to a solution that adapts $\phi$. For example, if a single setting of $\theta$ can solve all of the meta-training tasks (i.e., the predictive error is close to zero for every task in $\mathcal{M}$ without relying on $\mathcal{D}$), the optimization may converge to a stationary point of the MAML objective where minimal adaptation occurs based on the task training set (i.e., $\phi \approx \theta$). For a novel task where it is necessary to use the task training data, MAML can in principle still leverage the task training data because the adaptation step is based on gradient descent. However, in practice, the poor initialization of $\theta$ can affect the model's ability to generalize from a small amount of data. For CNP, memorization can occur when the predictive distribution network can achieve low training error without using the task training summary statistic $\phi$. On a novel task, the network is not trained to use $\phi$, so it is unable to use the information extracted from the task training set to generalize effectively.
In some problem domains, the memorization problem can be avoided by carefully constructing the tasks. For example, for $N$-way classification, each task consists of examples from $N$ randomly sampled classes. If the classes are assigned to a random permutation of $\{1, \ldots, N\}$ for each task, this ensures that the task-specific class-to-label assignment cannot be inferred from the test inputs alone. As a result, a model that ignores the task training data cannot achieve low training error, preventing convergence to the memorization solution. We refer to tasks constructed in this way as mutually exclusive. However, the mutually exclusive tasks requirement places a substantial burden on the user to cleverly design the meta-training setup (e.g., by shuffling labels or omitting goal information) and cannot be applied to all domains where we would like to utilize meta-learning.
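The per-task label shuffling described above can be made concrete in a few lines. This is a minimal sketch of the task-construction mechanism, not the paper's data pipeline; the class pool and sampler are placeholders.

```python
import random

def make_task(class_pool, n_way, rng):
    """Sample a mutually exclusive N-way task: draw n_way classes and assign
    them to labels 0..n_way-1 in a per-task random order.

    The shuffle is the crucial step: because the class-to-label assignment is
    re-randomized for every task, a label cannot be inferred from a test input
    alone, so the task training set must be used.
    """
    classes = rng.sample(sorted(class_pool), n_way)
    rng.shuffle(classes)  # per-task randomization of the class-to-label map
    return {c: label for label, c in enumerate(classes)}
```

Removing the `rng.shuffle` call (and fixing the class order globally) recovers the non-mutually-exclusive construction studied later in Section 6.3.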
4 Meta Regularization Using Information Theory
At a high level, the information in the predictive distribution comes from three sources: the input, the meta-parameters, and the task training data. The memorization problem occurs when the model encodes task information in the predictive network that is readily available from the task training set (i.e., it memorizes the task information for each meta-training task). We could resolve this problem by encouraging the model to minimize the training error while relying on the task training dataset as much as possible for the prediction of $y^*$ (i.e., by maximizing $I(\hat{y}; \mathcal{D} \mid x^*, \theta)$). Explicitly maximizing this mutual information requires an intractable marginalization over task training sets to compute the marginal $q(\hat{y} \mid x^*, \theta)$. Instead, we can implicitly encourage it by restricting the information flow from the other sources ($x^*$ and $\theta$) to $\hat{y}$. To achieve both low error and low mutual information between $\hat{y}$ and these other sources, the model must use the task training data to make predictions, hence increasing the mutual information $I(\hat{y}; \mathcal{D} \mid x^*, \theta)$ and reducing memorization. In this section, we describe two tractable ways to achieve this.
4.1 Meta Regularization on Activations
Given $\theta$, the statistical dependency between $x^*$ and $\hat{y}$ is controlled by the direct path from $x^*$ to $\hat{y}$ and the indirect path through $\mathcal{D}$ (see Figure 1), where the latter is desirable because it leverages the task training data. We can control the information flow between $x^*$ and $\hat{y}$ by introducing an intermediate stochastic bottleneck variable $z^*$ such that $q(\hat{y} \mid x^*, \theta) = \int q(\hat{y} \mid z^*, \theta)\, q(z^* \mid x^*, \theta)\, dz^*$ [2], as shown in Figure 5. Now, we would like to maximize $I(\hat{y}; \mathcal{D} \mid z^*, \theta)$ to prevent memorization. We can lower bound this mutual information by
$$I(\hat{y}; \mathcal{D} \mid z^*, \theta) \;\ge\; I(\hat{y}; x^* \mid \theta, \mathcal{D}) - I(x^*; z^* \mid \theta) \;\ge\; I(\hat{y}; x^* \mid \theta, \mathcal{D}) - \mathbb{E}\big[D_{\mathrm{KL}}\big(q(z^* \mid x^*, \theta)\,\|\,r(z^*)\big)\big], \qquad (2)$$
where $r(z^*)$ is a variational approximation to the marginal, the first inequality follows from the statistical dependencies in our model (see Figure 5 and Appendix A.2 for the proof), and we use the fact that $z^*$ is conditionally independent of $\mathcal{D}$ given $x^*$ and $\theta$. By simultaneously minimizing $\mathbb{E}[D_{\mathrm{KL}}(q(z^* \mid x^*, \theta)\,\|\,r(z^*))]$ and maximizing the mutual information $I(\hat{y}; x^* \mid \theta, \mathcal{D})$, we can implicitly encourage the model to use the task training data $\mathcal{D}$.
For non-mutually-exclusive problems, the true label $y^*$ is dependent on $x^*$. Hence, if $I(\hat{y}; x^* \mid \theta, \mathcal{D}) = 0$ (i.e., the prediction is independent of $x^*$ given the task training data and $\theta$), the predictive likelihood will be low. This suggests replacing the maximization of $I(\hat{y}; x^* \mid \theta, \mathcal{D})$ in Eq. (2) with minimization of the training loss, resulting in the following regularized training objective
$$\theta^* = \arg\min_\theta\; \frac{1}{N}\sum_i \mathbb{E}_{q(z^*_i \mid x^*_i, \theta)}\big[-\log q(y^*_i \mid z^*_i, \mathcal{D}_i, \theta)\big] + \beta\, D_{\mathrm{KL}}\big(q(z^*_i \mid x^*_i, \theta)\,\|\,r(z^*_i)\big), \qquad (3)$$
where $\beta$ modulates the regularizer and we set $r(z^*) = \mathcal{N}(z^*; 0, I)$. We refer to this regularizer as meta-regularization (MR) on the activations.
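When the encoder $q(z^* \mid x^*, \theta)$ is a diagonal Gaussian and $r(z^*) = \mathcal{N}(0, I)$, the KL penalty in Eq. (3) has a well-known closed form. The sketch below computes that penalty and the resulting per-example objective; `beta` is an arbitrary illustrative value, and `nll` stands in for the negative log-likelihood produced by the predictor network.

```python
import math

def kl_diag_gaussian_vs_standard(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ): the activation
    regularizer in Eq. (3) for a diagonal Gaussian encoder and r(z*) = N(0, I).
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def mr_activation_loss(nll, mu, log_var, beta=1e-3):
    """Per-example objective: negative log-likelihood plus the weighted KL
    penalty (beta is a hypothetical value, tuned by validation in practice)."""
    return nll + beta * kl_diag_gaussian_vs_standard(mu, log_var)
```

The penalty is zero exactly when the bottleneck distribution equals the prior, i.e., when $z^*$ carries no information about $x^*$.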
As we demonstrate in Section 6, we find that this regularizer performs well, but that in some cases it can fail to prevent the memorization problem. Our hypothesis is that in these cases, the network can sidestep the information constraint by storing the prediction of $y^*$ in a part of $z^*$, which incurs only a small penalty.
4.2 Meta Regularization on Weights
Alternatively, we can penalize the task information stored in the meta-parameters $\theta$. Here, we provide an informal argument and give the complete argument in Appendix A.3. Analogous to the supervised setting [1], given the meta-training dataset $\mathcal{M}$, we treat $\theta$ as a random variable, where the randomness can be introduced by training stochasticity. We model the stochasticity over $\theta$ with a Gaussian distribution $\mathcal{N}(\theta; \theta_\mu, \theta_\sigma)$ with learned mean and variance parameters per dimension [4, 1]. By penalizing $I(y^*_{1:N}, \mathcal{D}_{1:N}; \theta \mid x^*_{1:N})$, we can limit the information about the training tasks stored in the meta-parameters $\theta$ and thus require the network to use the task training data to make accurate predictions. We can tractably upper bound it by

$$I(y^*_{1:N}, \mathcal{D}_{1:N}; \theta \mid x^*_{1:N}) \;\le\; \mathbb{E}\big[D_{\mathrm{KL}}\big(q(\theta; \theta_\mu, \theta_\sigma)\,\|\,r(\theta)\big)\big], \qquad (4)$$
where $r(\theta)$ is a variational approximation to the marginal, which we set to $\mathcal{N}(\theta; 0, I)$. In practice, we apply meta-regularization to the meta-parameters $\theta$ that are not used to adapt to the task training data, and we denote the other parameters as $\tilde{\theta}$. In this way, we control the complexity of the network that can predict the test labels without using the task training data, but we do not limit the complexity of the network that processes the task training data. Our final meta-regularized objective can be written as
$$\min_{\theta_\mu, \theta_\sigma, \tilde{\theta}}\; \mathbb{E}_{q(\theta; \theta_\mu, \theta_\sigma)}\Big[\frac{1}{N}\sum_i \mathbb{E}_{q(\phi_i \mid \mathcal{D}_i, \theta, \tilde{\theta})}\big[-\log q(y^*_i \mid x^*_i, \phi_i, \theta, \tilde{\theta})\big]\Big] + \beta\, D_{\mathrm{KL}}\big(q(\theta; \theta_\mu, \theta_\sigma)\,\|\,r(\theta)\big). \qquad (5)$$
4.3 Does Meta Regularization Lead to Better Generalization?
Now that we have derived meta-regularization approaches for mitigating the memorization problem, we theoretically analyze whether meta-regularization leads to better generalization via a PAC-Bayes bound. In particular, we study meta-regularization (MR) on the weights (W) of MAML, i.e., MR-MAML (W), as a case study.
Meta-regularization on the weights of MAML uses a Gaussian distribution $Q = q(\theta; \theta_\mu, \theta_\sigma)$ to model the stochasticity in the weights. Given a task $\mathcal{T}$ and task training data $\mathcal{D}$, the expected error is given by

$$er(Q; \mathcal{T}, \mathcal{D}) = \mathbb{E}_{\theta \sim Q}\, \mathbb{E}_{(x^*, y^*) \sim p(x, y \mid \mathcal{T})}\, \mathbb{E}_{\phi \sim q(\phi \mid \mathcal{D}, \theta)}\big[\mathcal{L}(\hat{y}(x^*, \phi, \theta), y^*)\big], \qquad (6)$$

where the prediction loss $\mathcal{L}$ is bounded.^{1} Then, we would like to minimize the error on novel tasks,

^{1}In practice, $\mathcal{L}$ is MSE on a bounded target space, or the classification error. We optimize the negative log-likelihood as a bound on the 0-1 loss.
$$er(Q) = \mathbb{E}_{\mathcal{T}, \mathcal{D}}\big[er(Q; \mathcal{T}, \mathcal{D})\big]. \qquad (7)$$
We only have a finite sample of training tasks, so computing $er(Q)$ is intractable, but we can form an empirical estimate:

$$\widehat{er}(Q) = \frac{1}{nK}\sum_{i=1}^{n}\sum_{j=1}^{K} \mathbb{E}_{\theta \sim Q}\, \mathbb{E}_{\phi \sim q(\phi \mid \mathcal{D}_i, \theta)}\big[\mathcal{L}(\hat{y}(x^*_{ij}, \phi, \theta), y^*_{ij})\big], \qquad (8)$$
where for exposition we have assumed that the number of per-task validation datapoints $K = |\mathcal{D}^*_i|$ is the same for all tasks. We would like to relate $er(Q)$ and $\widehat{er}(Q)$, but the challenge is that $\theta$ and $\phi$ are derived from the meta-training tasks. There are two sources of generalization error: (i) error due to the finite number of observed tasks and (ii) error due to the finite number of examples observed per task. Closely following the arguments in [3], we apply a standard PAC-Bayes bound to each of these and combine the results with a union bound, resulting in the following theorem.
Theorem 1.
Let $P$ be an arbitrary prior distribution over $\theta$ that does not depend on the meta-training data. Then for any $\delta \in (0, 1]$, with probability at least $1 - \delta$, the following inequality holds uniformly for all choices of $Q$ and $\tilde{\theta}$:

$$er(Q) \;\le\; \widehat{er}(Q) + \sqrt{\frac{D_{\mathrm{KL}}(Q \,\|\, P) + \log\frac{2n}{\delta}}{2(n-1)}} + \sqrt{\frac{D_{\mathrm{KL}}(Q \,\|\, P) + \log\frac{2nK}{\delta}}{2(nK-1)}}, \qquad (9)$$
where $n$ is the number of meta-training tasks and $K$ is the number of per-task validation datapoints.
We defer the proof to Appendix A.4. The key difference from the result in [3] is that we leverage the fact that the per-task data is split into a training set and a validation set.
In practice, we set the prior $P$ to $r(\theta) = \mathcal{N}(\theta; 0, I)$. If we can achieve a low value for the bound, then with high probability, our test error will also be low. As shown in Appendix A.4, by a first-order Taylor expansion of the second term of the RHS of Eq. (9) and setting the coefficient of the KL term to $\beta$, we recover the MR-MAML (W) objective (Eq. (5)). Here $\beta$ trades off the tightness of the generalization bound against the probability that it holds. The result of this bound suggests that the proposed meta-regularization on the weights does indeed improve generalization on the meta-test set.
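To get a feel for how complexity terms of this kind behave, the helper below evaluates one PAC-Bayes term of the standard form $\sqrt{(D_{\mathrm{KL}}(Q\|P) + \log(2n/\delta)) / (2(n-1))}$. It is only an illustration of the scaling with the sample count and the KL divergence; the exact constants in Theorem 1 may differ.

```python
import math

def pac_bayes_gap(kl, n, delta):
    """One PAC-Bayes complexity term of the standard form
    sqrt((KL(Q||P) + log(2n/delta)) / (2(n-1))).

    kl: KL divergence between the posterior Q and prior P
    n: number of samples at this level (tasks, or task-validation points)
    delta: failure probability budget for this term
    """
    return math.sqrt((kl + math.log(2.0 * n / delta)) / (2.0 * (n - 1)))
```

Evaluating the term at both levels (n tasks, then nK validation points) shows why the per-datapoint term in Eq. (9) is usually much smaller than the per-task term.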
5 Related Work
Previous works have developed approaches for mitigating various forms of overfitting in meta-learning. These approaches aim to improve generalization in several ways: by reducing the number of parameters that are adapted in MAML [39], by compressing the task embedding [22], through data augmentation from a GAN [38], by using an auxiliary objective on task gradients [15], and via an entropy regularization objective [17]. These methods all focus on the setting with mutually exclusive task distributions. We instead recognize and formalize the memorization problem, a particular form of overfitting that manifests itself with non-mutually-exclusive tasks, and offer a general and principled solution. Unlike prior methods, our approach is applicable to both contextual and gradient-based meta-learning methods. We additionally validate that prior regularization approaches, namely TAML [17], are not effective in this problem setting.
Our derivation uses a Bayesian interpretation of meta-learning [31, 7, 6, 14, 13, 9, 18, 16]. Some Bayesian meta-learning approaches place a distributional loss on the inferred task variables to constrain them to a prior distribution [12, 13, 26], which amounts to an information bottleneck on the latent task variables. Similarly, Zintgraf et al. [39], Lee et al. [22], and Guiroy et al. [15] aim to produce simpler or more compressed task adaptation processes. Our approach does the opposite: it penalizes the information from the inputs and the meta-parameters, encouraging the task-specific variables to carry more information derived from the per-task data.
We use PAC-Bayes theory to study the generalization error of meta-learning and meta-regularization. Pentina & Lampert [25] extend the single-task PAC-Bayes bound [23] to the multi-task setting, quantifying the gap between the empirical error on training tasks and the expected error on new tasks. More recent work shows that, with tightened generalization bounds as the training objective, such algorithms can reduce the test error for mutually exclusive tasks [10, 3]. Our analysis differs from these prior works in that we only include the pre-update meta-parameters in the generalization bound, rather than both pre-update and post-update parameters. In the derivation, we also explicitly account for the split of the data into a task training set and a task validation set, which is aligned with the practical setting.
The memorization problem differs from overfitting in conventional supervised learning in several respects. First, memorization occurs at the task level rather than the datapoint level: the model memorizes functions rather than labels. In particular, within a training task, the model can generalize to new datapoints, but it fails to generalize to new tasks. Second, the source of information for achieving generalization is different. In meta-learning, the information comes from both the meta-training data and the new task's training data, whereas in the standard supervised setting it comes only from the training data. Finally, the aim of regularization is different. In the conventional supervised setting, regularization methods such as weight decay [20], dropout [30], the information bottleneck [34, 33], and Bayes-by-Backprop [4] are used to balance the network complexity against the information in the data. Meta-regularization instead governs the model complexity so as to avoid one complex model solving all tasks, while still allowing the model's dependency on the task data to be complex. We further empirically validate this difference, finding that standard regularization techniques do not solve the memorization problem.
6 Experiments
In the experimental evaluation, we aim to answer the following questions: (1) How prevalent is the memorization problem across different algorithms and domains? (2) How does the memorization problem affect the performance of algorithms on non-mutually-exclusive task distributions? (3) Is our meta-regularization approach effective for mitigating the problem, and is it compatible with multiple types of meta-learning algorithms? (4) Is the problem of memorization empirically distinct from the standard overfitting problem?
To answer these questions, we propose several meta-learning problems involving non-mutually-exclusive task distributions, including two problems that are adapted from prior benchmarks with mutually exclusive task distributions. We consider model-agnostic meta-learning (MAML) and conditional neural processes (CNP) as representative meta-learning algorithms, and study both variants of our method in combination with each. When comparing meta-learning algorithms with and without meta-regularization, we use the same neural network architecture, while other hyperparameters are tuned via cross-validation per problem.
6.1 Sinusoid Regression
First, we consider a toy sinusoid regression problem that is non-mutually-exclusive. The data for each task is created in the following way: the amplitude $A$ of the sinusoid is uniformly sampled from a fixed set of equally spaced values; the input $x$ is sampled uniformly from a fixed interval, and $y$ is generated from the sinusoid with amplitude $A$ plus observation noise. We provide both $x$ and the amplitude $A$ (as a one-hot vector) as input. At test time, we expand the range of the tasks by sampling the data-generating amplitude uniformly from the continuous interval spanned by the training amplitudes and using a random one-hot vector as the amplitude input to the network; the meta-training tasks are thus a proper subset of the meta-test tasks.

Without the additional amplitude input, both MAML and CNP can easily solve the task and generalize to the meta-test tasks. However, once we add the amplitude input, which reveals the task identity, we find that both MAML and CNP converge to the complete memorization solution and fail to generalize well to test data (Table 1 and Appendix Figures 7 and 8). Meta-regularized MAML (MR-MAML) and meta-regularized CNP (MR-CNP) instead converge to a solution that adapts to the data and, as a result, greatly outperform the unregularized methods.
Methods | MAML | MR-MAML (A) | MR-MAML (W) | CNP | MR-CNP (A) | MR-CNP (W)
5 shot | 0.46 (0.04) | 0.17 (0.03) | 0.16 (0.04) | 0.91 (0.10) | 0.10 (0.01) | 0.11 (0.02)
10 shot | 0.13 (0.01) | 0.07 (0.02) | 0.06 (0.01) | 0.92 (0.05) | 0.09 (0.01) | 0.09 (0.01)

Table 1: Test MSE for the non-mutually-exclusive sinusoid regression problem. We compare MAML and CNP against meta-regularized MAML (MR-MAML) and meta-regularized CNP (MR-CNP), where regularization is either on the activations (A) or the weights (W). We report the mean over 5 trials with the standard deviation in parentheses.
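The non-mutually-exclusive task construction above can be sketched as a small data generator. The amplitude grid, input range, and noise level below are illustrative placeholders, not the values used in the experiments; the point is the structure: the one-hot amplitude indicator appended to the input reveals the task identity, so a memorizing learner can ignore the task training data.

```python
import math
import random

# Hypothetical task parameters (placeholders, not the paper's values).
AMPLITUDES = [0.1 + 0.2 * i for i in range(20)]  # equally spaced amplitude set

def sample_task(rng, n_points=5):
    """One non-mutually-exclusive sinusoid task: each datapoint's input is
    [x] + one_hot(amplitude index), and the label is a noisy sinusoid value."""
    idx = rng.randrange(len(AMPLITUDES))
    amp = AMPLITUDES[idx]
    one_hot = [1.0 if j == idx else 0.0 for j in range(len(AMPLITUDES))]
    task = []
    for _ in range(n_points):
        x = rng.uniform(-5.0, 5.0)
        y = amp * math.sin(x) + rng.gauss(0.0, 0.1)  # noisy sinusoid label
        task.append(([x] + one_hot, y))
    return task
```

Dropping the `one_hot` suffix from the input recovers the mutually exclusive variant of the benchmark, which both MAML and CNP solve without regularization.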
6.2 Pose Prediction
To illustrate the memorization problem on a more realistic task, we create a multi-task regression dataset based on the Pascal 3D data [37] (see Appendix A.5.1 for a complete description). We randomly select 50 objects for meta-training and the other 15 objects for meta-testing. For each object, we use MuJoCo [35] to render images with random orientations of the instance on a table, visualized in Figure 1. For the meta-learning algorithm, the observation $x$ is the grayscale image and the label $y$ is the orientation relative to a fixed canonical pose. Because the number of objects in the meta-training dataset is small, it is straightforward for a single network to memorize the canonical pose of each training object and to infer the orientation from the input image, thus achieving a low meta-training error without using $\mathcal{D}$. However, this solution performs poorly on the meta-test set because it has not seen the novel objects and their canonical poses.
Optimization modes and hyperparameter sensitivity. We tune the learning rate and the meta-regularization strength $\beta$ for each method and report the results with the best hyperparameters (as measured on the meta-validation set). In this domain, we find that the convergence point of the meta-learning algorithm is determined both by the optimization landscape of the objective and by the training dynamics, which vary due to stochastic gradients and the random initialization. In particular, we observe two modes of the objective: one that corresponds to complete memorization and one that corresponds to successful adaptation to the task data. As illustrated in the Appendix, models that converge to a memorization solution have lower training error than solutions that use the task training data, indicating a clear need for meta-regularization. When the meta-regularization is on the activations, the solution that the algorithm converges to depends on the learning rate, whereas MR on the weights consistently converges to the adaptation solution (see Appendix Figure 9 for the sensitivity analysis). This suggests that MR on the activations is not always successful at preventing memorization. Our hypothesis is that there exists a solution in which the bottlenecked activations $z^*$ encode only the predicted label and discard all other information. Such a solution can achieve both low training MSE and low regularization loss without using the task training data, particularly if the predicted label contains a small number of bits (because the activations then have low information complexity). This solution does not, however, achieve low regularization error when MR is applied to the weights, because the function needed to produce the predicted label does not have low information complexity. As a result, meta-regularization on the weights does not suffer from this pathology and is robust to different learning rates.
We therefore use regularization on the weights as the proposed method in the remaining experiments; the full algorithms are given in Appendix A.1.
Quantitative results. We compare MAML and CNP with their meta-regularized versions (Table 2). We additionally include a fine-tuning baseline, which trains a single network on all of the instances jointly and then fine-tunes on the task training data. Meta-learning with meta-regularization (on the weights) outperforms all competing methods by a large margin. We show the test error as a function of the meta-regularization coefficient $\beta$ in Appendix Figure 3. The curve reflects the trade-off in the amount of information contained in the weights: $\beta$ provides a knob with which we can tune the degree to which the model adapts using the data versus relying on the prior.
Method | MAML | MR-MAML (W) | CNP | MR-CNP (W) | FT | FT + Weight Decay
MSE | 5.39 (1.31) | 2.26 (0.09) | 8.48 (0.12) | 2.89 (0.18) | 7.33 (0.35) | 6.16 (0.12)

Table 2: Test MSE for the pose prediction problem (standard deviation in parentheses).
Comparison to standard regularization. We compare our meta-regularization with standard regularization techniques, weight decay [20] and Bayes-by-Backprop [4], in Table 3. We observe that simply applying standard regularization to all of the weights, as in conventional supervised learning, does not solve the memorization problem, which validates that the memorization problem differs from the standard overfitting problem.
Methods | CNP | CNP + Weight Decay | CNP + BbB | MR-CNP (W) (ours)
MSE | 8.48 (0.12) | 6.86 (0.27) | 7.73 (0.82) | 2.89 (0.18)

Table 3: Test MSE for the pose prediction problem with standard regularization applied to CNP.
6.3 Omniglot and MiniImagenet Classification
Next, we study memorization in the few-shot classification problem by adapting the few-shot Omniglot [21] and MiniImagenet [27, 36] benchmarks to the non-mutually-exclusive setting. In the non-mutually-exclusive N-way K-shot classification problem, each class is (randomly) assigned a fixed classification label from 1 to N. For each task, we randomly select a corresponding class for each of the N classification labels and sample K task training datapoints and task test datapoints from that class.^{2} This ensures that each class is always paired with the same classification label across tasks, making different tasks non-mutually-exclusive (see Appendix A.5.2 for details).

^{2}We assume that the number of classes in the meta-training set is larger than N.
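The construction above can be sketched as follows. The example ids are stand-ins for images, and the class-to-label map and pools are illustrative assumptions; the key property is that the assignment of classes to labels is fixed globally rather than shuffled per task.

```python
import random

def make_nme_task(class_to_label, n_way, k_shot, examples_per_class, rng):
    """Sample a non-mutually-exclusive N-way K-shot task.

    class_to_label is a fixed, global assignment of every class to a label in
    {0, ..., n_way - 1}; because it never changes across tasks, the label can
    in principle be inferred from an input alone. examples_per_class maps each
    class to its pool of example ids (stand-ins for images here).
    """
    task = []
    for label in range(n_way):
        candidates = [c for c, l in class_to_label.items() if l == label]
        cls = rng.choice(candidates)                  # one class per label slot
        support = rng.sample(examples_per_class[cls], k_shot)
        task.extend((ex, label) for ex in support)
    return task
```

Contrast this with the mutually exclusive construction of Section 3, where the class-to-label map is re-randomized on every task.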
We evaluate MAML, TAML [17], MR-MAML (ours), fine-tuning, and a nearest-neighbor baseline on non-mutually-exclusive classification tasks (Table 4). We find that MR-MAML significantly outperforms the previous methods on all of these tasks. To better understand the problem, for the MAML variants we calculate the pre-update accuracy (before adaptation on the task training data) on the meta-training data in Appendix Table 5. The high pre-update meta-training accuracy and low meta-test accuracy are evidence of the memorization problem for MAML and TAML, indicating that they learn a model that ignores the task data. In contrast, MR-MAML successfully keeps the pre-update accuracy near chance and encourages the learner to use the task training data to achieve low meta-training error, resulting in good performance at meta-test time.
Finally, we verify that meta-regularization does not degrade performance on the standard mutually exclusive task. We evaluate performance as a function of the regularization strength $\beta$ on the standard 20-way 1-shot Omniglot task (Appendix Figure 10), and we find that small values of $\beta$ lead to slight improvements over MAML. This indicates that meta-regularization substantially improves performance in the non-mutually-exclusive setting without degrading performance in other settings.
7 Conclusion and Discussion
Meta-learning has achieved remarkable success in few-shot learning problems. However, we identify a pitfall of current algorithms: the need to create task distributions that are mutually exclusive. This requirement restricts the domains to which meta-learning can be applied. We formalize the failure mode that results from training on non-mutually-exclusive tasks, i.e., the memorization problem, and distinguish it as function-level overfitting, in contrast to the standard label-level overfitting in supervised learning.
We illustrate the memorization problem with different meta-learning algorithms on a number of domains. To address the problem, we propose an algorithm-agnostic meta-regularization (MR) approach that leverages an information-theoretic perspective of the problem. The key idea is that by placing a soft restriction on the information flow from the meta-parameters to the test-set predictions, we can encourage the meta-learner to use the task training data during meta-training. We achieve this by controlling the complexity of the model prior to task adaptation.
The memorization issue is quite broad and is likely to occur in a wide range of real-world applications, for example, personalized speech recognition systems, learning robots that can adapt to different environments [24], and learning goal-conditioned manipulation skills using trial-and-error data. Further, this challenge may also be prevalent in other conditional prediction problems beyond meta-learning, which is an interesting direction for future study. By both recognizing the challenge of memorization and developing a general and lightweight approach for solving it, we believe that this work represents an important step towards making meta-learning algorithms applicable to and effective on any problem domain.
References
 Achille & Soatto [2018] Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research, 19(1):1947–1980, 2018.
 Alemi et al. [2016] Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.
 Amit & Meir [2018] Ron Amit and Ron Meir. Meta-learning by adjusting priors based on extended PAC-Bayes theory. In International Conference on Machine Learning, pp. 205–214, 2018.
 Blundell et al. [2015] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
 Cover & Thomas [2012] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
 Edwards & Storkey [2016] Harrison Edwards and Amos Storkey. Towards a neural statistician. arXiv preprint arXiv:1606.02185, 2016.

 Fei-Fei et al. [2003] Li Fei-Fei et al. A Bayesian approach to unsupervised one-shot learning of object categories. In Proceedings Ninth IEEE International Conference on Computer Vision, pp. 1134–1141. IEEE, 2003.
 Finn et al. [2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1126–1135. JMLR.org, 2017.
 Finn et al. [2018] Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic modelagnostic metalearning. In Advances in Neural Information Processing Systems, pp. 9516–9527, 2018.

 Galanti et al. [2016] Tomer Galanti, Lior Wolf, and Tamir Hazan. A theoretical framework for deep transfer learning. Information and Inference: A Journal of the IMA, 5(2):159–209, 2016.
 Garnelo et al. [2018a] Marta Garnelo, Dan Rosenbaum, Chris J Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo J Rezende, and SM Eslami. Conditional neural processes. arXiv preprint arXiv:1807.01613, 2018a.
 Garnelo et al. [2018b] Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J Rezende, SM Eslami, and Yee Whye Teh. Neural processes. arXiv preprint, 2018b.
 Gordon et al. [2018] Jonathan Gordon, John Bronskill, Matthias Bauer, Sebastian Nowozin, and Richard E Turner. Metalearning probabilistic inference for prediction. arXiv preprint arXiv:1805.09921, 2018.
 Grant et al. [2018] Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting gradientbased metalearning as hierarchical bayes. arXiv preprint arXiv:1801.08930, 2018.
 Guiroy et al. [2019] Simon Guiroy, Vikas Verma, and Christopher Pal. Towards understanding generalization in gradientbased metalearning. arXiv preprint arXiv:1907.07287, 2019.
 Harrison et al. [2018] James Harrison, Apoorva Sharma, and Marco Pavone. Metalearning priors for efficient online bayesian regression. arXiv preprint arXiv:1807.08912, 2018.

 Jamal & Qi [2019] Muhammad Abdullah Jamal and Guo-Jun Qi. Task agnostic meta-learning for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11719–11727, 2019.
 Kim et al. [2018] Taesup Kim, Jaesik Yoon, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. Bayesian model-agnostic meta-learning. arXiv preprint arXiv:1806.03836, 2018.
 Koch et al. [2015] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for oneshot image recognition. In ICML deep learning workshop, volume 2, 2015.
 Krogh & Hertz [1992] Anders Krogh and John A Hertz. A simple weight decay can improve generalization. In Advances in neural information processing systems, pp. 950–957, 1992.
 Lake et al. [2011] Brenden Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua Tenenbaum. One shot learning of simple visual concepts. In Proceedings of the annual meeting of the cognitive science society, volume 33, 2011.
 Lee et al. [2019] Yoonho Lee, Wonjae Kim, and Seungjin Choi. Discrete infomax codes for metalearning. arXiv preprint arXiv:1905.11656, 2019.
 McAllester [1999] David A McAllester. PAC-Bayesian model averaging. In COLT, volume 99, pp. 164–170. Citeseer, 1999.
 Nagabandi et al. [2018] Anusha Nagabandi, Ignasi Clavera, Simin Liu, Ronald S Fearing, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Learning to adapt in dynamic, realworld environments through metareinforcement learning. arXiv preprint arXiv:1803.11347, 2018.
 Pentina & Lampert [2014] Anastasia Pentina and Christoph Lampert. A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning, pp. 991–999, 2014.
 Rakelly et al. [2019] Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn, and Sergey Levine. Efficient offpolicy metareinforcement learning via probabilistic context variables. arXiv preprint arXiv:1903.08254, 2019.
 Ravi & Larochelle [2016] Sachin Ravi and Hugo Larochelle. Optimization as a model for fewshot learning. In ICLR 2016, 2016.
 Santoro et al. [2016] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Metalearning with memoryaugmented neural networks. In International conference on machine learning, pp. 1842–1850, 2016.
 Schmidhuber [1987] Jurgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: The meta-meta-… hook. Diploma thesis, Institut f. Informatik, Tech. Univ. Munich, 1:2, 1987.
 Srivastava et al. [2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
 Tenenbaum [1999] Joshua Brett Tenenbaum. A Bayesian framework for concept learning. PhD thesis, Massachusetts Institute of Technology, 1999.
 Thrun & Pratt [2012] Sebastian Thrun and Lorien Pratt. Learning to learn. Springer Science & Business Media, 2012.
 Tishby & Zaslavsky [2015] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pp. 1–5. IEEE, 2015.
 Tishby et al. [2000] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
 Todorov et al. [2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for modelbased control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE, 2012.
 Vinyals et al. [2016] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630–3638, 2016.
 Xiang et al. [2014] Yu Xiang, Roozbeh Mottaghi, and Silvio Savarese. Beyond pascal: A benchmark for 3d object detection in the wild. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2014.
 Zhang et al. [2018] Ruixiang Zhang, Tong Che, Zoubin Ghahramani, Yoshua Bengio, and Yangqiu Song. Metagan: An adversarial approach to fewshot learning. In Advances in Neural Information Processing Systems, pp. 2365–2374, 2018.
 Zintgraf et al. [2019] Luisa M Zintgraf, Kyriacos Shiarlis, Vitaly Kurin, Katja Hofmann, and Shimon Whiteson. Fast context adaptation via metalearning. In Thirtysixth International Conference on Machine Learning (ICML 2019), 2019.
Appendix A Appendix
A.1 Algorithm
We present the detailed algorithm for meta-regularization on weights with conditional neural processes (CNP) in Algorithm 1 and with model-agnostic meta-learning (MAML) in Algorithm 2. For CNP, we add the regularization on the weights of the encoder and leave the other weights unrestricted. For MAML, we similarly regularize the weights from the input to an intermediate hidden layer and leave the weights used for adaptation unregularized. In this way, we restrict the complexity of the pre-adaptation model, not the post-adaptation model.
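As a toy illustration of this split (our own simplification, not the paper's implementation), the sketch below applies the weight regularizer to a scalar linear model y = w*x + b: the pre-adaptation weight w is sampled from a Gaussian posterior whose KL divergence to a standard normal prior is penalized with strength beta, while the bias b is adapted on each task's training set without regularization.

```python
import math
import random

def kl_gauss(mu, log_sigma):
    """Analytic KL( N(mu, sigma^2) || N(0, 1) )."""
    var = math.exp(2.0 * log_sigma)
    return 0.5 * (mu ** 2 + var - 2.0 * log_sigma - 1.0)

def meta_objective(tasks, mu, log_sigma, b0, beta, inner_lr=0.1, seed=0):
    """Meta-training objective for the toy model y = w*x + b.

    w is the regularized pre-adaptation weight (sampled from its posterior),
    b is the unregularized weight adapted on each task's training set.
    """
    rng = random.Random(seed)
    total = 0.0
    for task_train, task_test in tasks:
        # sample the pre-adaptation weight via the reparameterization trick
        w = mu + math.exp(log_sigma) * rng.gauss(0.0, 1.0)
        # one gradient step on the task training data, adapting only b
        grad_b = sum(2.0 * (w * x + b0 - y) for x, y in task_train) / len(task_train)
        b = b0 - inner_lr * grad_b
        # post-adaptation squared error on the task test data
        total += sum((w * x + b - y) ** 2 for x, y in task_test) / len(task_test)
    # meta-regularization: limit information stored in the pre-adaptation weight
    return total / len(tasks) + beta * kl_gauss(mu, log_sigma)
```

Setting beta = 0 recovers the unregularized objective; increasing beta pushes the posterior over w toward the prior, forcing the adapted parameter b to carry the task-specific information.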
A.2 Meta Regularization on Activations
A.3 Meta Regularization on Weights
Similar to [1], we use $\theta^*$ to denote the unknown parameters of the true data-generating distribution. This defines a joint distribution $p(\theta^*, \mathcal{M}_{1:N}, \theta) = p(\theta^*)\, p(\mathcal{M}_{1:N} \mid \theta^*)\, q(\theta \mid \mathcal{M}_{1:N})$, where $\mathcal{M}_{1:N}$ denotes the meta-training data. Furthermore, we have a predictive distribution $q(\hat{y} \mid \hat{x}, \mathcal{D}, \theta)$. The meta-training loss in Eq. 1 is an upper bound for the cross entropy $H_{p,q}(y^*_{1:N} \mid x^*_{1:N}, \mathcal{D}_{1:N}, \theta)$. Using an information decomposition of the cross entropy [1], we have

$$H_{p,q}(y^*_{1:N} \mid x^*_{1:N}, \mathcal{D}_{1:N}, \theta) = H(y^*_{1:N} \mid x^*_{1:N}, \mathcal{D}_{1:N}, \theta^*) + I(y^*_{1:N}; \theta^* \mid x^*_{1:N}, \mathcal{D}_{1:N}, \theta) + \mathbb{E}\left[ D_{\mathrm{KL}}\left( p(y^*_{1:N} \mid x^*_{1:N}, \mathcal{D}_{1:N}, \theta) \,\|\, q(y^*_{1:N} \mid x^*_{1:N}, \mathcal{D}_{1:N}, \theta) \right) \right] - I(y^*_{1:N}; \theta \mid x^*_{1:N}, \mathcal{D}_{1:N}, \theta^*) \quad (11)$$

Here the only negative term is $I(y^*_{1:N}; \theta \mid x^*_{1:N}, \mathcal{D}_{1:N}, \theta^*)$, which quantifies the information that the meta-parameters contain about the meta-training data beyond what can be inferred from the data-generating parameters (i.e., memorization). Without proper regularization, the cross-entropy loss can be minimized by maximizing this term. We can control its value by upper bounding it:

$$I(y^*_{1:N}; \theta \mid x^*_{1:N}, \mathcal{D}_{1:N}, \theta^*) \le I(\mathcal{M}_{1:N}; \theta \mid \theta^*) = \mathbb{E}\left[ D_{\mathrm{KL}}\left( q(\theta \mid \mathcal{M}_{1:N}, \theta^*) \,\|\, q(\theta \mid \theta^*) \right) \right] = \mathbb{E}\left[ D_{\mathrm{KL}}\left( q(\theta \mid \mathcal{M}_{1:N}) \,\|\, q(\theta \mid \theta^*) \right) \right] \le \mathbb{E}\left[ D_{\mathrm{KL}}\left( q(\theta \mid \mathcal{M}_{1:N}) \,\|\, r(\theta) \right) \right]$$

where the second equality follows because $\theta$ and $\theta^*$ are conditionally independent given $\mathcal{M}_{1:N}$, and the final inequality holds for any variational distribution $r(\theta)$. This gives the regularization in Section 4.2.
A.4 Proof of the PAC-Bayes Generalization Bound
First, we prove a more general result and then specialize it. The goal of the meta-learner is to extract information from the meta-training tasks and the test-task training data that can serve as a prior for test examples from the novel task. This information is expressed as a distribution over possible models. When learning a new task, the meta-learner uses the task training data $\mathcal{D}$ and a model parameterized by $\theta$ (sampled from $Q(\theta)$) and outputs a distribution $q(\phi \mid \mathcal{D}, \theta)$ over task-specific models. Our goal is to learn $Q$ such that it performs well on novel tasks.
To formalize this, define the error of $Q$ on a task $\mathcal{T}$ with task training data $\mathcal{D}$ as

$$er(Q, \mathcal{T}, \mathcal{D}) = \mathbb{E}_{\theta \sim Q}\, \mathbb{E}_{\phi \sim q(\phi \mid \mathcal{D}, \theta)}\, \mathbb{E}_{(x, y) \sim \mathcal{T}}\left[ L(f_\phi(x), y) \right] \quad (12)$$

where $L$ is a bounded loss in $[0, 1]$. Then, we would like to minimize the error on novel tasks

$$er(Q) = \mathbb{E}_{\mathcal{T}, \mathcal{D}}\left[ er(Q, \mathcal{T}, \mathcal{D}) \right] \quad (13)$$

Because we only have a finite training set, computing $er(Q)$ is intractable, but we can form an empirical estimate from the $n$ meta-training tasks:

$$\widehat{er}(Q) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{m} \sum_{j=1}^{m} \mathbb{E}_{\theta \sim Q}\, \mathbb{E}_{\phi \sim q(\phi \mid \mathcal{D}_i, \theta)}\left[ L(f_\phi(x^*_{ij}), y^*_{ij}) \right] \quad (14)$$
where for exposition we assume that the number of task test points $m$ is the same for all tasks. We would like to relate $\widehat{er}(Q)$ and $er(Q)$, but the challenge is that $Q$ may depend on the meta-training data due to the learning algorithm. There are two sources of generalization error: (i) error due to the finite number of observed tasks and (ii) error due to the finite number of examples observed per task. Closely following the arguments in [3], we apply a standard PAC-Bayes bound to each of these and combine the results with a union bound.
Theorem.
Let $Q(\theta)$ be a distribution over parameters and let $P(\theta)$ be a prior distribution. Then for any $\delta \in (0, 1]$, with probability at least $1 - \delta$, the following inequality holds uniformly for all distributions $Q$,

$$er(Q) \le \widehat{er}(Q) + \sqrt{\frac{D_{\mathrm{KL}}(Q \,\|\, P) + \log\frac{2n}{\delta}}{2(n-1)}} + \sqrt{\frac{D_{\mathrm{KL}}(Q \,\|\, P) + \log\frac{2nm}{\delta}}{2(m-1)}} \quad (15)$$
Proof.
To start, we state a classical PAC-Bayes bound and use it to derive bounds on task-level and datapoint-level generalization, respectively.
Theorem 2.
Let $\mathcal{Z}$ be a sample space (i.e. a space of possible datapoints). Let $\mathcal{P}$ be a distribution over $\mathcal{Z}$ (i.e. a data distribution). Let $\mathcal{H}$ be a hypothesis space. Given a "loss function" $l(h, z): \mathcal{H} \times \mathcal{Z} \to [0, 1]$ and a collection of $K$ i.i.d. random variables $z_1, \ldots, z_K$ sampled from $\mathcal{P}$, let $P$ be a prior distribution over hypotheses in $\mathcal{H}$ that does not depend on the samples but may depend on the data distribution $\mathcal{P}$. Then, for any $\delta \in (0, 1]$, with probability at least $1 - \delta$, the following bound holds uniformly for all posterior distributions $Q$ over $\mathcal{H}$:

$$\mathbb{E}_{z \sim \mathcal{P}}\, \mathbb{E}_{h \sim Q}\left[ l(h, z) \right] \le \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}_{h \sim Q}\left[ l(h, z_k) \right] + \sqrt{\frac{D_{\mathrm{KL}}(Q \,\|\, P) + \log\frac{K}{\delta}}{2(K-1)}} \quad (16)$$
Meta-level generalization. First, we bound the task-level generalization, that is, we relate $er(Q)$ to $\frac{1}{n}\sum_{i=1}^{n} er(Q, \mathcal{T}_i, \mathcal{D}_i)$. Letting the samples be the tasks $z_i = (\mathcal{T}_i, \mathcal{D}_i)$, the hypotheses be $\theta$, and the loss be $l(\theta, z_i) = \mathbb{E}_{\phi \sim q(\phi \mid \mathcal{D}_i, \theta)}\, \mathbb{E}_{(x, y) \sim \mathcal{T}_i}\left[ L(f_\phi(x), y) \right]$, Theorem 2 says that for any $\delta_0 \in (0, 1]$, with probability at least $1 - \delta_0$,

$$er(Q) \le \frac{1}{n} \sum_{i=1}^{n} er(Q, \mathcal{T}_i, \mathcal{D}_i) + \sqrt{\frac{D_{\mathrm{KL}}(Q \,\|\, P) + \log\frac{n}{\delta_0}}{2(n-1)}} \quad (17)$$

where $P$ is a prior over $\theta$.
Within-task generalization. Next, we relate $er(Q, \mathcal{T}_i, \mathcal{D}_i)$ to its empirical estimate $\widehat{er}_i(Q) = \frac{1}{m}\sum_{j=1}^{m} \mathbb{E}_{\theta \sim Q}\, \mathbb{E}_{\phi \sim q(\phi \mid \mathcal{D}_i, \theta)}\left[ L(f_\phi(x^*_{ij}), y^*_{ij}) \right]$ via the PAC-Bayes bound. For a fixed task $\mathcal{T}_i$, task training data $\mathcal{D}_i$, a prior $P(\phi)$ that only depends on the training data, and any $\delta_i \in (0, 1]$, we have that with probability at least $1 - \delta_i$,

$$er(Q, \mathcal{T}_i, \mathcal{D}_i) \le \widehat{er}_i(Q) + \sqrt{\frac{D_{\mathrm{KL}}(Q(\phi) \,\|\, P(\phi)) + \log\frac{m}{\delta_i}}{2(m-1)}}.$$

Now, we choose $P(\phi)$ to be $\int q(\phi \mid \mathcal{D}_i, \theta) P(\theta)\, d\theta$ and restrict $Q(\phi)$ to be of the form $\int q(\phi \mid \mathcal{D}_i, \theta) Q(\theta)\, d\theta$ for any $Q(\theta)$. While $Q(\phi)$ and $P(\phi)$ may be complicated distributions (especially if they are defined implicitly), we know that with this choice of $P(\phi)$ and $Q(\phi)$, $D_{\mathrm{KL}}(Q(\phi) \,\|\, P(\phi)) \le D_{\mathrm{KL}}(Q(\theta) \,\|\, P(\theta))$ by the data-processing inequality for the KL divergence [5]; hence, we have

$$er(Q, \mathcal{T}_i, \mathcal{D}_i) \le \widehat{er}_i(Q) + \sqrt{\frac{D_{\mathrm{KL}}(Q(\theta) \,\|\, P(\theta)) + \log\frac{m}{\delta_i}}{2(m-1)}}.$$