1 Introduction
Obtaining a model that generalizes well is a fundamental problem in machine learning, and is becoming even more important in the deep learning era, where models may have millions of parameters. A model that generalizes well should obtain low error on unseen test examples, but this is difficult since the distribution of the test data is unknown during training. Thus, many approaches resort to variance reduction methods that reduce the model's variance with respect to changes in the input, since test examples can be thought of as perturbations of training examples. These approaches include controlling the model complexity exploring_generalization , reducing information from inputs information_bottleneck , obtaining a smoother loss surface on_large_batch ; exploring_generalization ; entropysgd , injecting Bernoulli or Gaussian noise dropout ; fast_dropout ; variational_dropout , or training for multiple tasks with multitask learning caruana97 and meta-learning thrun98 .
A more straightforward and direct way to achieve generalization is to simulate the test examples by perturbing the training examples during training. Some regularization methods, such as mixup mixup , follow this approach and perturb each example in the direction of other training examples to simulate test examples. This procedure can also be applied in the latent feature space, which has been shown to obtain even better performance manifold_mixup . However, these approaches are all limited in that they do not explicitly aim to lower the generalization error on the test examples. How can we then perturb the training instances such that the perturbed instances will actually be helpful in lowering the test loss? Enforcing this generalization objective is not possible in a standard learning framework, since the test data is unobservable.
To solve this seemingly impossible problem, we resort to meta-learning thrun98 , which aims to learn a model that generalizes over a distribution of tasks, rather than a distribution of data instances from a single task. Generally, a meta-learner is trained on a series of tasks with random training and test splits. While learning to solve diverse tasks, it accumulates meta-knowledge that is not specific to a single task but generic across all tasks, which is later leveraged when learning a novel task. During this meta-training step, we observe both the training and test data. That is, we can explicitly learn to perturb the training instances to obtain low test loss within a meta-learning framework. This learned noise generator can then be used to perturb instances for generalization at meta-test time.
Yet, learning how much, and in which feature directions, to perturb a training instance is difficult for two reasons. First, meaningful directions of perturbation may differ from one instance to another, and from one task to another. Second, a single training instance may need to cover largely different test instances with its perturbation, since we do not know which test instances will be given at test time. To handle this problem, we propose to learn input-dependent stochastic noise; that is, we want to learn a distribution of noise, or perturbation, that is meaningful for a given training instance. Specifically, we learn a noise generator for the features at each layer of the main network, given the lower-layer features as input. We refer to this meta-noise generator as Metadropout, a novel framework for learning to regularize.
Our Metadropout model can be considered a way to capture data-intrinsic variance, known as heteroscedastic aleatoric uncertainty what_uncertainty , which is beneficial in preventing the noise from affecting the mean function (i.e., the main model). Since in meta-learning a training example should account for combinatorially many different test examples during training, there exists inherent ambiguity as to which test example the model should cover. This introduces instability into the training process, since the same model should be optimized for completely different objectives across two different training episodes. Our stochastic noise generation effectively handles this problem by introducing uncertainty into the features, which disentangles the model variance from the model mean. More importantly, the distribution of learned noise is also a type of transferable knowledge, which is especially useful in the few-shot learning setting where only a limited number of examples are given to solve a task, as the learned noise generator will generate meaningful perturbations in order to simulate the test examples. In Figure 1, the noise generator perturbs each input instance to help the model predict better decision boundaries which obtain low errors on the test examples.
In the remaining sections, we explain our model in the context of existing work, propose the learning framework for Metadropout, and further show that it can be understood as meta-learning the variational inference framework for the graphical model in Figure 3. Moreover, we explain its connection to existing regularizers such as the Information Bottleneck ib_principle . Finally, we validate our work on multiple benchmark datasets for few-shot classification, with three gradient-based meta-learning models, namely Model-Agnostic Meta-Learning (MAML) maml , MetaSGD meta_sgd and Amortized Bayesian Meta-Learning (ABML) abml .
Our contribution is threefold.

We propose a novel regularization method, Metadropout, which generates stochastic input-dependent perturbations to regularize few-shot learning models, and propose a meta-learning framework to train it.

We provide a probabilistic interpretation of our approach, showing that it can be understood as meta-learning of the variational inference framework for the graphical model in Figure 3, and also draw its connection to existing regularizers such as the Information Bottleneck method information_bottleneck .

We validate our meta-regularizer with MAML, MetaSGD and ABML on multiple benchmark datasets for few-shot classification, and show that our regularizer not only obtains significant improvements in generalization performance over the base models, but also expedites their convergence and stabilizes training.
2 Related Work
Meta learning
While the literature on meta-learning thrun98 is vast, here we discuss a few works relevant to few-shot classification. One of the most popular approaches for few-shot classification is metric-based meta-learning, which learns a shared metric space siamese ; matchingnet ; protonet ; tadam ; snail over randomly sampled few-shot classification problems, so that instances are embedded close to their correct class representations under some distance measure. The most popular models among them are Matching Networks matchingnet , which leverage a cosine distance measure, and Prototypical Networks protonet , which make use of Euclidean distance. On the other hand, gradient-based meta-learning approaches, such as Model-Agnostic Meta-Learning (MAML) maml , tackle few-shot prediction by learning a shared initialization parameter from which each task can reach its optimum with only a few gradient steps. Meta-learning of regularizers has also been addressed by Balaji et al. metareg , who proposed to meta-learn a regularizer for domain generalization. However, our model more explicitly targets generalization via perturbation, rather than learning the parameters of a generic regularizer.
Regularization Methods
Dropout dropout is a regularization technique that randomly drops neurons during training. In addition to decorrelating features and providing an ensemble effect, dropout can also be interpreted as a variational approximation to posterior inference dropout_as_bayesian , in which case we can even learn the dropout probability with stochastic gradient variational Bayes variational_dropout ; concrete_dropout . Dropout regularization can be viewed as a noise injection process, where in the case of standard dropout the multiplicative noise follows a Bernoulli distribution. However, we could instead apply Gaussian multiplicative noise fast_dropout ; variational_dropout . Variational dropout variational_dropout further learns the variance of the Gaussian noise using variational inference. It is also possible to learn the dropout probability in an input-dependent manner, as done with Adaptive Dropout (Standout) adaptive_dropout . Adaptive dropout can also be interpreted as variational inference on input-dependent latent variables cvae ; show_attend_tell ; ua ; dropmax . Metadropout resembles a probabilistic version of adaptive dropout; however, the critical difference is that, rather than resorting to the training data and a prior distribution for posterior inference, it makes use of previously obtained meta-knowledge in posterior variational inference. Metadropout is also related to the Information Bottleneck (IB) method information_bottleneck , which assumes a bottleneck variable and aims to improve generalization by minimizing the amount of information the bottleneck variable retains about the inputs, while keeping sufficient information about the target variables. Recently, variational approximations to the IB method have been proposed deep_vib ; information_dropout , which inject input-dependent noise during training to forget the inputs. Metadropout has a similar effect to IB, but its objective directly aims to generalize to test examples, and as a result it arrives at an optimal noise distribution that effectively forgets the inputs. Metadropout is also related to Mixup regularization mixup , which at each training step randomly pairs training instances and generates interpolated examples to be used as additional training data. Mixup simulates unseen data points to improve generalization, as ours does, and the same strategy can be applied to hidden layers as well manifold_mixup . However, whereas Mixup and its variants rely on heuristic strategies whose generated instances may or may not improve generalization, our model explicitly learns to perturb each instance to minimize the test loss via meta-learning.
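To make the noise-injection view of dropout concrete, here is a minimal numpy sketch (illustrative, not from the paper) contrasting standard Bernoulli dropout with its variance-matched Gaussian counterpart:

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli_dropout(h, p, rng):
    """Standard dropout: multiplicative Bernoulli noise, rescaled so E[out] = h."""
    mask = rng.random(h.shape) > p           # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)

def gaussian_dropout(h, p, rng):
    """Gaussian dropout: multiplicative N(1, p/(1-p)) noise with matched variance."""
    sigma2 = p / (1.0 - p)
    return h * rng.normal(1.0, np.sqrt(sigma2), size=h.shape)

h = np.ones((100000, 4))
p = 0.5
hb = bernoulli_dropout(h, p, rng)
hg = gaussian_dropout(h, p, rng)
print(hb.mean(), hg.mean())   # both ≈ 1.0: the activation stays unbiased in expectation
```

Both variants inject multiplicative noise with the same first two moments; the Gaussian form is the continuous relaxation mentioned above fast_dropout ; variational_dropout .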
3 Learning to Perturb Features for Generalization
We now describe our problem setting and the meta-learning framework that learns to perturb training instances for better generalization. The goal of meta-learning is to learn a model that generalizes over a task distribution $p(\tau)$. This is usually done by training the model over a large number of tasks (or episodes) sampled from $p(\tau)$, each of which consists of a training set $\mathcal{D}^{\text{tr}}$ and a test set $\mathcal{D}^{\text{te}}$.

Suppose that we are given such a split of $\mathcal{D}^{\text{tr}}$ and $\mathcal{D}^{\text{te}}$. Denoting the initial model parameter of an arbitrary neural network as $\theta$, Model Agnostic Meta Learning (MAML) maml aims to infer a task-specific model parameter $\theta_\tau$ with one or a few gradient steps on the training set $\mathcal{D}^{\text{tr}}$, such that $\theta_\tau$ can quickly generalize to $\mathcal{D}^{\text{te}}$ as well. Let $\alpha$ denote the inner-gradient step size, and capital $\mathbf{X}$ and $\mathbf{Y}$ the concatenation of inputs and labels, respectively, for both the training and test set. Then, we have

$$\min_{\theta}\; \mathbb{E}_{p(\tau)}\Big[-\log p\big(\mathbf{Y}^{\text{te}} \mid \mathbf{X}^{\text{te}};\, \theta_\tau\big)\Big], \qquad \theta_\tau = \theta - \alpha \nabla_{\theta}\big(-\log p(\mathbf{Y}^{\text{tr}} \mid \mathbf{X}^{\text{tr}};\, \theta)\big) \tag{1}$$

Optimizing the objective in Eq. (1) is repeated over many random splits of $\mathcal{D}^{\text{tr}}$ and $\mathcal{D}^{\text{te}}$, such that the initial model parameter $\theta$ captures the most generic information over the task distribution.
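The bi-level structure of Eq. (1) can be illustrated with a toy sketch; the one-parameter linear model, the slope-based task distribution, and the finite-difference meta-gradient below are all hypothetical stand-ins for the actual setup:

```python
import numpy as np

def inner_step(theta, x_tr, y_tr, alpha):
    """One inner gradient step on the squared loss of a linear model y = theta * x."""
    grad = np.mean(2 * (theta * x_tr - y_tr) * x_tr)
    return theta - alpha * grad

def meta_grad(theta, task, alpha, eps=1e-5):
    """Meta-gradient of the test loss w.r.t. the initialization theta, computed
    by central finite differences (a stand-in for second-order backprop)."""
    def test_loss(th):
        th_task = inner_step(th, task["x_tr"], task["y_tr"], alpha)
        return np.mean((th_task * task["x_te"] - task["y_te"]) ** 2)
    return (test_loss(theta + eps) - test_loss(theta - eps)) / (2 * eps)

rng = np.random.default_rng(0)
theta, alpha, meta_lr = 0.0, 0.1, 0.05
for _ in range(500):                        # meta-training over sampled tasks
    a = rng.uniform(0.5, 1.5)               # a task is a slope of the true function
    x = rng.normal(size=10)
    task = {"x_tr": x[:5], "y_tr": a * x[:5], "x_te": x[5:], "y_te": a * x[5:]}
    theta -= meta_lr * meta_grad(theta, task, alpha)
print(theta)   # settles near the middle of the slope range [0.5, 1.5]
```

The outer loop never fits any single task exactly; it moves the initialization to a point from which one inner step reaches any sampled task well, which is the shared-initialization idea behind Eq. (1).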
3.1 Metadropout
However, it is evident that the single initial model parameter $\theta$ in Eq. (1) alone is insufficient to account for the combinatorially many tasks at test time, whose optimal parameters may vary largely as well. Thus, to obtain optimal parameters for unknown tasks at test time, we propose to learn an input-dependent noise distribution $p(\mathbf{z} \mid \mathbf{x};\, \phi)$ for each input $\mathbf{x}$, where $\phi$ is the parameter of the noise generator shared across the training examples (see Figure 2). Such conditional modeling allows the noise generator to allocate a varying degree of variance to each feature across input points what_uncertainty . This input-dependent variance modeling is used to handle intrinsic ambiguity, where the model has largely varying outcomes for the same input. In our case, this ambiguity comes from the lack of knowledge of the test samples.
The final prediction is obtained by marginalizing over the noise distribution, incorporating all plausible perturbations for each instance:

$$p(y \mid \mathbf{x};\, \theta, \phi) = \int p(y \mid \mathbf{x}, \mathbf{z};\, \theta)\, p(\mathbf{z} \mid \mathbf{x};\, \phi)\, d\mathbf{z} \tag{2}$$
By collecting such local noise information for each training instance, we obtain the joint predictive distribution that incorporates the noise over all training examples. Maximizing the training likelihood therefore accounts for the plausible perturbations of the given training set $\mathcal{D}^{\text{tr}}$, which allows us to better explain nearby test examples. Toward this goal, we first construct the inner-gradient step with perturbations as follows:

$$\theta_\tau = \theta - \alpha \nabla_{\theta}\, \mathbb{E}_{p(\mathbf{Z} \mid \mathbf{X}^{\text{tr}};\, \phi)}\big[-\log p(\mathbf{Y}^{\text{tr}} \mid \mathbf{X}^{\text{tr}}, \mathbf{Z};\, \theta)\big] \tag{3}$$

$$\theta_\tau \approx \theta - \alpha \nabla_{\theta}\, \frac{1}{S} \sum_{s=1}^{S} \big(-\log p(\mathbf{Y}^{\text{tr}} \mid \mathbf{X}^{\text{tr}}, \mathbf{Z}^{(s)};\, \theta)\big), \qquad \mathbf{Z}^{(s)} \sim p(\mathbf{Z} \mid \mathbf{X}^{\text{tr}};\, \phi) \tag{4}$$

where the expectation (over the negative log-likelihood) is approximated with Monte-Carlo integration with sample size $S$. We implicitly assume the reparameterization trick for $\mathbf{Z}^{(s)}$, with the associated samples parameterized by $\theta$ and $\phi$, following Kingma and Welling vae . We then optimize the parameter $\phi$ of the noise generator, as well as the initial model parameter $\theta$, over the task distribution in a meta-learning fashion, to obtain a more generalizable final task-specific parameter $\theta_\tau$:

$$\min_{\theta, \phi}\; \mathbb{E}_{p(\tau)}\Big[-\log p\big(\mathbf{Y}^{\text{te}} \mid \mathbf{X}^{\text{te}};\, \theta_\tau\big)\Big] \tag{5}$$
We let the sample size $S = 1$ for meta-training for computational efficiency; for meta-testing, we set $S$ to a sufficiently large number to accurately marginalize over the noise distribution, since the sampling cost at evaluation time is manageable for few-shot learning. Note that for both meta-training and meta-testing, we do not apply noise to the test examples (i.e., the expectation is taken only inside the inner-gradient steps), since the test examples are the targets we aim to generalize to by perturbing the training examples.
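A minimal sketch of the perturbed inner step in Eqs. (3)-(4) with sample size $S=1$; the linear model and the `noise_std` generator below are hypothetical placeholders for the paper's networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_std(x, phi):
    """Hypothetical input-dependent noise scale: larger inputs get larger noise."""
    return phi * np.abs(x)

def perturbed_inner_step(theta, phi, x_tr, y_tr, alpha, S=1):
    """Inner step of Eqs. (3)-(4): Monte-Carlo average of the training-loss
    gradient under reparameterized noise z = noise_std(x; phi) * eps."""
    grads = []
    for _ in range(S):
        eps = rng.normal(size=x_tr.shape)
        x_pert = x_tr + noise_std(x_tr, phi) * eps   # reparameterization trick
        grads.append(np.mean(2 * (theta * x_pert - y_tr) * x_pert))
    return theta - alpha * np.mean(grads)

theta_new = perturbed_inner_step(theta=0.0, phi=0.1,
                                 x_tr=rng.normal(size=5), y_tr=np.ones(5), alpha=0.1)
```

Because the samples are reparameterized, the step stays differentiable w.r.t. both $\theta$ and $\phi$; with `phi = 0` the step reduces exactly to the unperturbed MAML inner step.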
Extension of our noise-learning framework to more than one inner-gradient step is also straightforward: at each inner-gradient step, we perform Monte-Carlo integration to estimate the next-step model parameter, and repeat this process until we obtain the final $\theta_\tau$. Thus at each gradient step, we perturb the training examples with the learned noise generator, in order to obtain a final predictor with better decision boundaries.

3.2 Form of the noise
We apply input-dependent noise to the latent features at every layer of the network, similarly to Adaptive dropout adaptive_dropout (see Figure 2). Here we suppress explicit dependencies on $\theta$ and $\phi$ for better readability. Note that one of the following two types of noise is applied, according to the characteristics of the data.
Additive noise
One of the simplest forms for the noise distribution is a zero-mean diagonal Gaussian added to the pre-activation features at each layer. Although it cannot capture correlations between elements within the same layer, the upper-layer features will account for these correlations and nonlinearly transform them, yielding a more complicated noise distribution:

$$\tilde{\mathbf{h}} = \mathbf{h} + \mathbf{z}, \qquad \mathbf{z} \sim \mathcal{N}\big(\mathbf{0},\; \gamma \cdot \operatorname{diag}(\boldsymbol{\sigma}^2(\mathbf{h}))\big) \tag{6}$$

where $\mathbf{h}$ is the pre-activation, $\boldsymbol{\sigma}^2(\cdot)$ is the input-dependent variance produced by the noise generator, and $\gamma$ is a hyperparameter controlling how far each noise variable can spread out to cover nearby test examples. The reparameterization trick is applied to obtain stable and unbiased gradient estimates w.r.t. the mean and variance of the Gaussian distribution, following Kingma and Welling vae .

Multiplicative noise
We also consider nonnegative noise multiplied with the pre-activation features at each layer, which is useful when the input to the noise generator itself contains a large amount of noise (e.g., the complicated backgrounds of real-world images). In this case, the noise generator can first attend to relevant parts of the input and perturb only those attended features. We propose a simple ReLU transformation of a Gaussian distribution that resembles the log-normal distribution, which allows us to explicitly sparsify the generated noise; we empirically verified that this works well:

$$\tilde{\mathbf{h}} = \mathbf{h} \odot \operatorname{ReLU}(\boldsymbol{\epsilon}), \qquad \boldsymbol{\epsilon} \sim \mathcal{N}\big(\boldsymbol{\mu}(\mathbf{h}),\; \mathbf{I}\big) \tag{7}$$

Note that not only the variance but also the input-dependent mean $\boldsymbol{\mu}(\mathbf{h})$ jointly determines the amount of noise in this multiplicative case. For the multiplicative case, we experimentally found that input-dependent modeling of the variance does not improve the results.
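The two noise forms can be sketched as follows, where `sigma` and `mu` stand in for the learned noise-generator networks and the scale `gamma` is an illustrative hyperparameter:

```python
import numpy as np

rng = np.random.default_rng(0)

def additive_noise(h, sigma, gamma, rng):
    """Additive form (Eq. (6)): zero-mean diagonal Gaussian on the pre-activations,
    with input-dependent std sigma(h) and a scale hyperparameter gamma."""
    return h + np.sqrt(gamma) * sigma(h) * rng.normal(size=h.shape)

def multiplicative_noise(h, mu, rng):
    """Multiplicative form (Eq. (7)): ReLU-transformed Gaussian, which both
    sparsifies the noise and attends to a subset of features."""
    z = np.maximum(rng.normal(loc=mu(h), scale=1.0), 0.0)   # ReLU(eps), eps ~ N(mu(h), I)
    return h * z

h = rng.normal(size=(4, 8))
sigma = lambda h: 0.1 * np.abs(h) + 0.01   # toy input-dependent std
mu = lambda h: h                           # toy input-dependent mean
out_add = additive_noise(h, sigma, 0.5, rng)
out_mul = multiplicative_noise(h, mu, rng)
```

In the multiplicative case, features whose generated mean is strongly negative are clipped to zero by the ReLU, which is the explicit sparsification mechanism described above.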
3.3 Locality of perturbation
During the meta-learning phase, we need to compute the gradients of Eq. (5) with respect to the meta-parameters $\theta$ and $\phi$, which involves second-order derivatives. While we can optionally ignore the second-order term for $\theta$ as suggested in maml , such an approximation is not valid for $\phi$, since zeroing the second-order term would make the whole gradient vanish. In fact, this second-order derivative plays a crucial role in training and gives useful intuition on the relationship between perturbation and test loss. Define $\mathcal{L}^{\text{tr}}_i$ as the loss of the $i$-th training example and $\mathcal{L}^{\text{te}}_j$ as the loss of the $j$-th test example. Then, we have

$$\frac{\partial \mathcal{L}^{\text{te}}_j(\theta_\tau)}{\partial \mathbf{z}_i} = -\alpha \left(\frac{\partial}{\partial \mathbf{z}_i} \nabla_{\theta} \mathcal{L}^{\text{tr}}_i(\theta)\right)^{\!\top} \nabla_{\theta_\tau} \mathcal{L}^{\text{te}}_j(\theta_\tau) \tag{8}$$

which is a component of the full gradient w.r.t. $\phi$. We see that an infinitesimal change of the perturbation $\mathbf{z}_i$ neither increases nor decreases the test loss $\mathcal{L}^{\text{te}}_j$ when the collection of directions for reducing the training loss (obtained by varying each dimension of $\mathbf{z}_i$) is orthogonal to the direction for reducing the test loss. This explains why we need attention-like multiplicative noise: attention mechanisms essentially encourage unrelated examples to activate different subnetworks, such that a change of perturbation in one subnetwork does not significantly affect unrelated examples that use a different subnetwork. This locality of perturbation allows only relevant training-test instance pairs to interact, although the task distribution generates completely random training-test pairs.
3.4 Metalearning variational inference framework
Lastly, we explain the connection of our model to a meta-learning variational inference framework, described by the graphical model in Figure 3. Suppose we are given an observation of a training set $\mathcal{D}^{\text{tr}}$, where for each instance the generative process involves the latent noise $\mathbf{z}$ conditioned on the input $\mathbf{x}$. Note that the global parameter $\phi$ is fixed and only $\theta$ is learnable (as in our model during the inner-gradient steps).
Based on this specific context, we see that the inner-gradient steps of Metadropout in Eq. (3) essentially perform posterior variational inference on $\mathbf{z}$ (the latent input-dependent noise variable) by maximizing the following evidence lower bound:

$$\mathcal{L}(\theta) = \mathbb{E}_{q(\mathbf{z} \mid \mathbf{x};\, \phi)}\big[\log p(y \mid \mathbf{x}, \mathbf{z};\, \theta)\big] - D_{\mathrm{KL}}\big(q(\mathbf{z} \mid \mathbf{x};\, \phi)\,\big\|\, p(\mathbf{z} \mid \mathbf{x};\, \phi)\big) \tag{9}$$

Note that this special type of ELBO in Eq. (9) lets the approximate posterior share its form with the conditional prior (hence the KL divergence between them is zero), a practice frequently used in the literature for its practicality cvae ; show_attend_tell ; ua .
At the end of maximizing the bound, the fixed $\phi$ effectively constrains the form of the approximate posterior, yielding a more accurate predictive distribution on a novel instance. Although Bayesian inference in general aims to improve generalization by incorporating a prior, this meta-learning process builds on it to further improve generalization.
4 Experiments
Baselines and our models
We first introduce baseline metalearning models and our models.
1) MAML. Model-Agnostic Meta-Learning by Finn et al. maml . We do not use the first-order approximation, for fair comparison against the baselines that use second-order derivatives.
2) MetaSGD.
A variant of MAML whose learning-rate vector is learned elementwise for the inner-gradient steps meta_sgd .
3) ABML. Amortized Bayesian Meta-Learning by Ravi and Beatson abml . This model reformulates MAML under a hierarchical Bayesian framework, such that the global latent includes initial model weights and variances, and the task-specific latent parameter is generated from them.
4) MAML + Gaussian dropout. MAML with zero-mean constant-variance Gaussian noise independently sampled for each dimension and added to the layer pre-activations fast_dropout .
5) MAML + indep. noise. MAML with input-independent noise variables applied to each layer. We apply additive zero-mean diagonal Gaussian noise to the pre-activations of each channel, and meta-learn them similarly to ours. The noise scaling hyperparameter is fixed across all experiments.
6) Metadropout. MAML, MetaSGD or ABML with our Metadropout regularization. We fix the noise scaling hyperparameter for all our experiments without tuning (except in Figure 6 for qualitative analysis), which we empirically found to work well in general. Note that we use additive noise for the Omniglot experiments and multiplicative noise for the MiniImageNet experiments.
Datasets and Experimental Setups
We validate our method on Omniglot and MiniImageNet, the two most popular benchmark datasets for fewshot classification.
1) Omniglot: This grayscale handwritten character dataset consists of 1,623 classes with 20 examples per class. Following the experimental setup of Vinyals et al. matchingnet , we use 1,200 classes for meta-training and the remaining classes for meta-testing. We further augment the classes by rotating each character in multiples of 90 degrees, which multiplies the total number of classes by four. Base network: The network consists of 4 convolutional layers, each with a 3x3 kernel ("same" padding) and 64 channels, followed by batch normalization, ReLU, and max pooling ("valid" padding).
Experimental setup: We mostly follow the settings from Finn et al. maml for the 20-way 1-shot and 20-way 5-shot classification settings, including the meta batch-size and the inner-gradient step size. The meta-learning rate is decayed once after an initial number of iterations.
2) MiniImageNet: This dataset consists of color images from 100 classes, with 600 examples per class. We split the classes into meta-training and meta-testing sets; we do not use a meta-validation set. Base network: The network is identical to the one used for Omniglot, except that 32 channels are used for each conv layer. Experimental setup: 5-way 1-shot and 5-way 5-shot classification are considered, with the meta batch-size and related hyperparameters following maml . The meta-learning rate is again decayed once during training.
For both datasets, we use the same number of inner-gradient steps for meta-training and meta-testing, and use 15 test (or query) examples per class. We use the Adam optimizer adam with gradient clipping. We used TensorFlow tensorflow for all our implementations.
4.1 Quantitative analysis
Table 1 shows the few-shot classification results on the Omniglot and MiniImageNet datasets. We first reproduce MAML, MetaSGD and ABML, obtaining accuracies similar to or better than those reported in the original papers, and then add our Metadropout to the training of each model.
Omniglot 20-way  MiniImageNet 5-way

Models  1-shot  5-shot  1-shot  5-shot
MAML (ours)  95.23 ± 0.17  98.38 ± 0.07  49.58 ± 0.65  64.55 ± 0.52
MAML + Metadropout  96.55 ± 0.14  99.04 ± 0.05  50.92 ± 0.66  65.49 ± 0.55
MetaSGD (ours)  96.16 ± 0.14  98.54 ± 0.07  48.30 ± 0.64  65.55 ± 0.56
MetaSGD + Metadropout  97.37 ± 0.12  99.02 ± 0.06  49.03 ± 0.63  66.79 ± 0.52
ABML (ours)  95.69 ± 0.15  98.88 ± 0.06  44.55 ± 0.61  63.49 ± 0.56
ABML + Metadropout  96.72 ± 0.13  98.91 ± 0.06  51.03 ± 0.69  66.56 ± 0.53
All reported results are average performances over 1000 randomly selected episodes, with standard errors forming 95% confidence intervals over tasks.
Models  Omniglot 20-way

(MAML + )  1-shot  5-shot
No noise  95.23 ± 0.17  98.38 ± 0.07
Gaussian dropout  95.34 ± 0.16  98.55 ± 0.07
Indep. noise  94.61 ± 0.17  98.45 ± 0.07
Metadropout  96.55 ± 0.14  99.04 ± 0.05
The models with Metadropout outperform baselines with significant margins. Especially, there is a huge gap between ABML and ABML + Metadropout model for MiniImageNet dataset. This could be explained in terms of epistemic and aleatoric uncertainty introduced in Gal et al. what_uncertainty . Note that the innergradient steps of ABML maximize ELBO of Bayes by backprop bayes_by_backprop given the learned weight prior. This learns the uncertainty in the weights, or epistemic uncertainty, which in this case is a taskspecific model’s uncertainty coming from lack of observed data. However, this is not the only source of uncertainty in the metalearning. As mentioned previously, in metatraining, a single training example should be able to account for combinatorially many test sets, and thus the decision boundaries each training example should explain can differ from one test set to another. Thus, there is an inherent ambiguity in which direction the noise should be generated, which corresponds to aleatoric uncertainty that captures inherent ambiguity in the task distribution.
The superiority of Metadropout over ABML demonstrates that modeling aleatoric uncertainty over the task distribution is more important than considering epistemic uncertainty for each taskspecific learner in solving fewshot classification. This is because for each taskspecific learner, incorporating prior can sometimes limit the expressivity of the posterior, while Metadropout does the opposite, effectively increasing the expressivity by transferring the learned knowledge on how to stochastically perturb each training example to explain large variance in test examples. This helps with the metatraining, which explains why metatraining losses of Metadropout in Figure 4(a) and 4(b) drop much faster and to lower points compared to those of ABML. Finally, ABML + Metadropout works the best, since it captures two different types of uncertainty at the same time.
To further examine the importance of input-dependent modeling of the noise, we conduct an ablation study against parameter-free Gaussian dropout and an input-independent version of Metadropout (see Table 2). The results show that Metadropout improves upon the base models, while the baselines do not. This is expected, since we need to know the task and the instance in order to know in which directions to perturb.
4.2 Qualitative analysis
Figure 5 shows the visualizations of the learned additive noise for some of the channels in the 2nd layer of the base network trained on the Omniglot dataset. Metadropout seems to generate diverse perturbations such as noise on backgrounds, character contours, and some specific parts of the characters.
Figure 6 (left) shows the 1st-layer post-activations from 4 different channels; the first and second images for each channel show the activation without and with perturbation, respectively. Whereas channels 1 and 2 show little perturbation of the activations, channel 3 shows stronger noise. In channel 4, the noise is dominant, preventing any meaningful information from flowing to the next layer. This behavior is similar to that of the information bottleneck (IB) regularizer ib_principle , which aims to improve generalization by minimizing the mutual information between the input and the bottleneck variable.
Figure 6 (right) shows the visualization of the 3rd layer features, obtained using a separately trained deconvolutional decoder visualization . The reconstructed inputs show both background noise and semantic perturbation, which seems reasonable in terms of data augmentation.
Finally, in Figure 7, we visualize the decision boundaries of MAML and our model, respectively, in the space of the last convolutional layer features. Comparing the decision boundaries of MAML (Figure 7(a)) and Metadropout (Figure 7(b)), we see that Metadropout yields significantly better decision boundaries thanks to the perturbations, whose directions align well with the locations of test examples, effectively broadening the area covered by each class. Note that the noise samples do not necessarily correspond to the test examples, since the only goal of the perturbation is to regularize the model's decision boundary, not to estimate the manifold distribution (see Figure 1).
5 Conclusion
We proposed a novel method that learns to regularize for generalization, by learning to perturb the latent features of training examples. To this end, we proposed a meta-learning framework in which the input-adaptive noise distribution explicitly tries to minimize the test loss during meta-training. We provided a probabilistic interpretation of our model as meta-learning of the variational inference framework for a specific graphical model, and also empirically showed its connection to the Information Bottleneck. Experimental results with three gradient-based meta-learning models on benchmark datasets show that our model significantly improves the generalization performance of the target meta-learner, while also making it converge considerably faster with more stable training. As future work, we will explore ways to further generalize Metadropout to diverse network architectures and tasks under standard learning scenarios.
References
 [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. Tensorflow: Largescale Machine Learning on Heterogeneous Distributed Systems. arXiv:1603.04467, 2016.
 [2] A. Achille and S. Soatto. Information Dropout: Learning Optimal Representations Through Noisy Computation. In PAMI, 2018.
 [3] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy. Deep Variational Information Bottleneck. ICLR, 2017.
 [4] J. Ba and B. Frey. Adaptive dropout for training deep neural networks. In NIPS. 2013.
 [5] Y. Balaji, S. Sankaranarayanan, and R. Chellappa. Metareg: Towards domain generalization using metaregularization. In Advances in Neural Information Processing Systems 31. 2018.
 [6] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight Uncertainty in Neural Networks. In ICML, 2015.
 [7] R. Caruana. Multitask Learning. Machine Learning, 1997.
 [8] P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina. EntropySGD: Biasing Gradient Descent Into Wide Valleys. ICLR, 2017.
 [9] A. Dosovitskiy and T. Brox. Inverting Visual Representations with Convolutional Networks. arXiv eprints, Jun 2015.
 [10] C. Finn, P. Abbeel, and S. Levine. Modelagnostic metalearning for fast adaptation of deep networks. In ICML, 2017.
 [11] Y. Gal and Z. Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. ArXiv eprints, June 2015.
 [12] Y. Gal, J. Hron, and A. Kendall. Concrete Dropout. ArXiv eprints, May 2017.
 [13] J. Heo, H. B. Lee, S. Kim, J. Lee, K. J. Kim, E. Yang, and S. J. Hwang. UncertaintyAware Attention for Reliable Interpretation and Prediction. ArXiv eprints, May 2018.

 [14] A. Kendall and Y. Gal. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? ArXiv e-prints, Mar. 2017.
 [15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2014.
 [16] D. P. Kingma, T. Salimans, and M. Welling. Variational Dropout and the Local Reparameterization Trick. ArXiv eprints, June 2015.
 [17] D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In ICLR, 2014.
 [18] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for oneshot image recognition. In International Conference on Machine Learning, 2015.
 [19] H. B. Lee, J. Lee, S. Kim, E. Yang, and S. J. Hwang. Dropmax: Adaptive variational softmax. In NeurIPS, 2018.
 [20] Z. Li, F. Zhou, F. Chen, and H. Li. Metasgd: Learning to learn quickly for few shot learning. arXiv preprint arXiv:1707.09835, 2017.
 [21] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A simple neural attentive metalearner. In International Conference on Learning Representations, 2018.
 [22] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. Exploring Generalization in Deep Learning. NIPS, 2017.
 [23] B. Oreshkin, P. R. López, and A. Lacoste. Tadam: Task dependent adaptive metric for improved fewshot learning. In NeurIPS, 2018.
 [24] S. Ravi and A. Beatson. Amortized bayesian metalearning. In International Conference on Learning Representations, 2019.
 [25] N. Shirish Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On LargeBatch Training for Deep Learning: Generalization Gap and Sharp Minima. ICLR, 2017.
 [26] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for fewshot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
 [27] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In NIPS. NIPS, 2015.
 [28] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
 [29] S. Thrun and L. Pratt, editors. Learning to Learn. Kluwer Academic Publishers, Norwell, MA, USA, 1998.
 [30] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In Annual Allerton Conference on Communication, Control and Computing, 1999.
 [31] N. Tishby and N. Zaslavsky. Deep Learning and the Information Bottleneck Principle. In In IEEE Information Theory Workshop, 2015.
 [32] V. Verma, A. Lamb, C. Beckham, A. Najafi, A. Courville, I. Mitliagkas, and Y. Bengio. Manifold Mixup: Learning Better Representations by Interpolating Hidden States. arXiv eprints, June 2018.
 [33] O. Vinyals, C. Blundell, T. P. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. 2016.
 [34] S. Wang and C. Manning. Fast dropout training. In ICML, pages 118–126, 2013.
 [35] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ArXiv eprints, Feb. 2015.
 [36] H. Zhang, M. Cisse, Y. N. Dauphin, and D. LopezPaz. mixup: Beyond empirical risk minimization. In ICLR, 2017.
Appendix A Algorithm
We provide the pseudocode of the Meta-dropout model for meta-training and meta-testing, respectively.
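As a rough illustration of the structure of that pseudocode (and not the paper's actual implementation), the sketch below adapts only the model parameters theta in the inner loop while keeping the noise-generator parameters phi fixed, and then updates phi on the query-set loss in the outer loop. The linear model, softplus noise, finite-difference gradients, and all hyperparameters here are illustrative assumptions.

```python
import numpy as np

def softplus(a):
    return np.log1p(np.exp(a))

def noise(X, phi, seed=0):
    # Input-dependent multiplicative noise z = softplus(X @ phi + eps).
    # A fixed seed stands in for one reparameterized sample of eps.
    eps = np.random.default_rng(seed).standard_normal(X.shape[0])
    return softplus(X @ phi + eps)

def loss(theta, phi, X, y):
    # Perturb each instance with the learned noise, then score a linear
    # model with squared error (an illustrative likelihood choice).
    return np.mean(((X * noise(X, phi)[:, None]) @ theta - y) ** 2)

def fd_grad(f, w, eps=1e-5):
    # Finite-difference gradient; keeps the sketch dependency-free.
    g = np.zeros_like(w)
    for i in range(w.size):
        d = np.zeros_like(w); d[i] = eps
        g[i] = (f(w + d) - f(w - d)) / (2 * eps)
    return g

def inner_adapt(theta, phi, X_tr, y_tr, steps=5, lr=0.05):
    # Inner loop: only theta is updated; phi (the noise generator) is fixed.
    for _ in range(steps):
        theta = theta - lr * fd_grad(lambda t: loss(t, phi, X_tr, y_tr), theta)
    return theta

def meta_step(theta0, phi, task, lr_outer=0.05):
    # Outer loop: adapt on the support set, then move phi to lower the
    # *query* loss of the adapted parameters.
    X_tr, y_tr, X_te, y_te = task
    query = lambda p: loss(inner_adapt(theta0, p, X_tr, y_tr), p, X_te, y_te)
    return phi - lr_outer * fd_grad(query, phi)

# Toy task: a linear regression problem with synthetic data.
rng = np.random.default_rng(1)
X_tr, X_te = rng.standard_normal((8, 3)), rng.standard_normal((8, 3))
w_true = np.array([1.0, -2.0, 0.5])
y_tr, y_te = X_tr @ w_true, X_te @ w_true
theta0, phi = np.zeros(3), np.zeros(3)
adapted = inner_adapt(theta0, phi, X_tr, y_tr)
```

The separation of roles matches the pseudocode: theta changes in the inner loop, while phi is meta-learned across tasks in the outer loop only.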
Appendix B Graphical models for the inner-gradient steps
In this section, we explain in more detail which graphical model each baseline optimizes in its inner-gradient steps, along with the corresponding variational framework under the learned parameters (or learned prior).
B.1 Meta-dropout
The inner-gradient steps of Meta-dropout perform posterior inference under the graphical model in Figure 8(a) (note that the noise-generator parameter \phi is constant during these steps). Let q(z \mid x, y) denote the approximate posterior over the latent noise z. The standard variational inference framework maximizes the following evidence lower bound (ELBO) [27]:

\log p(y \mid x; \theta, \phi) \geq \mathbb{E}_{q(z \mid x, y)}\left[\log p(y \mid x, z; \theta)\right] - D_{\mathrm{KL}}\left(q(z \mid x, y) \,\|\, p(z \mid x; \phi)\right).

However, the KL divergence in the last equation easily becomes negligible (close to zero), mainly because both q(z \mid x, y) and p(z \mid x; \phi) are conditioned instance-wise. Therefore, one common practice is to let the approximate posterior share its form with the conditional prior (i.e., q(z \mid x, y) = p(z \mid x; \phi)), resulting in the following simpler form in which the KL term vanishes [27, 35]:

\log p(y \mid x; \theta, \phi) \geq \mathbb{E}_{p(z \mid x; \phi)}\left[\log p(y \mid x, z; \theta)\right],   (10)

which corresponds to the ELBO that Meta-dropout maximizes in its inner-gradient steps (see the main paper). The learned \phi regularizes the final solution for the approximate posterior (or, equivalently, the conditional prior).
B.2 ABML
This is the model proposed by Ravi and Beatson [24]. They assume the global initial model parameter \theta to be Gaussian, with mean and variance given by \mu and \sigma^2, respectively. The inner-gradient steps of ABML thus maximize the ELBO for the graphical model in Figure 8(b) (e.g., via Bayes by Backprop [6]). They further assume \mu and \sigma to be latent, but point-estimate them, so here we simply treat them as deterministic for simpler analysis. The lower bound is given as follows:

\log p(Y \mid X; \mu, \sigma) \geq \mathbb{E}_{q(\theta \mid \mathcal{D})}\left[\log p(Y \mid X, \theta)\right] - D_{\mathrm{KL}}\left(q(\theta \mid \mathcal{D}) \,\|\, p(\theta; \mu, \sigma)\right),

where q(\theta \mid \mathcal{D}) is the approximate posterior. The dependency on the whole dataset \mathcal{D} is achieved by initializing the variational parameters at \mu and \sigma and performing gradient steps from there. Therefore, unlike our model, in which \phi does not participate in the inner-gradient steps, in ABML both parameters do.
B.3 ABML + Meta-dropout
Lastly, if we apply Meta-dropout to ABML, the inner-gradient steps of the combined model optimize the ELBO for the combined graphical model in Figure 8(c). The lower bound is as follows:

\log p(Y \mid X; \mu, \sigma, \phi) \geq \mathbb{E}_{q(\theta \mid \mathcal{D})}\,\mathbb{E}_{p(z \mid x; \phi)}\left[\log p(Y \mid X, z, \theta)\right] - D_{\mathrm{KL}}\left(q(\theta \mid \mathcal{D}) \,\|\, p(\theta; \mu, \sigma)\right),

where, for the expectation on the r.h.s., we first sample \theta \sim q(\theta \mid \mathcal{D}) and then sample z \sim p(z \mid x; \phi). This shows that ABML and Meta-dropout are fully compatible within a single variational inference framework, given the learned optimal parameter \phi for the noise generator and the learned weight prior (\mu, \sigma). They also effectively complement each other by capturing different types of uncertainty: ABML accounts for the epistemic uncertainty of each task-specific parameter, treated as a latent variable, while Meta-dropout accounts for the aleatoric (and heteroscedastic) uncertainty of the global task distribution.
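The nested sampling in this expectation can be estimated by plain Monte Carlo: draw the model parameters first, then the input-conditioned noise. The sketch below is an illustrative toy with a Gaussian likelihood, a linear model, and softplus noise; the KL term is omitted, and none of these choices are taken from the paper.

```python
import numpy as np

def mc_expected_loglik(q_mu, q_sigma, phi, X, y, n_samples=20, seed=0):
    # Monte Carlo estimate of the expectation term in the combined bound:
    # first sample theta ~ q = N(q_mu, q_sigma^2) (ABML's epistemic part),
    # then sample z ~ p(z | x; phi) (Meta-dropout's aleatoric part), and
    # average the log-likelihood. The KL(q || p) term is omitted here.
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        theta = q_mu + q_sigma * rng.standard_normal(q_mu.shape)
        z = np.log1p(np.exp(X @ phi + rng.standard_normal(X.shape[0])))
        pred = (X * z[:, None]) @ theta
        total += -0.5 * np.sum((pred - y) ** 2)  # log N(y | pred, 1), up to a constant
    return total / n_samples
```

Averaging over both sources of randomness mirrors the sampling order stated above: theta first, then z conditioned on each input.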
Appendix C MAML + Manifold Mixup experiments
We ran additional experiments with Manifold Mixup [32], where at each step we randomly select one hidden layer (possibly including the input layer) and linearly interpolate pairs of hidden representations of the training examples, as well as the corresponding one-hot encoded labels, with a mixing ratio sampled from Beta(\alpha, \alpha). We used the values of \alpha reported to work well in the original paper [32]. Similarly to the other stochastic baselines, we train with a single interpolation per inner-gradient step at meta-training time, and sample multiple interpolations (each with an independent sample of the layer index and mixing ratio) per step at meta-testing time to accurately evaluate the final task-specific model parameters. In Table 3, we see that Manifold Mixup does not improve on the base MAML. This is because, with only a few training instances, there are few meaningful ways to interpolate them; in the 1-shot scenario there are no other examples from the same class at all, which degrades the quality of interpolation.
Models                         Omniglot 20-way             MiniImageNet 5-way
                               1-shot        5-shot        1-shot        5-shot
MAML (ours)                    95.23 ± 0.17  98.38 ± 0.07  49.58 ± 0.65  64.55 ± 0.52
MAML + Manifold Mixup ()       89.78 ± 0.25  97.86 ± 0.08  48.62 ± 0.66  63.86 ± 0.53
MAML + Manifold Mixup ()       87.26 ± 0.28  97.14 ± 0.17  48.42 ± 0.64  62.56 ± 0.55
MAML + Meta-dropout            96.55 ± 0.14  99.04 ± 0.05  50.92 ± 0.66  65.49 ± 0.55
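For reference, the interpolation this baseline performs can be sketched as follows; this is a minimal version assuming NumPy, with an illustrative alpha (the paper's exact settings are not reproduced here).

```python
import numpy as np

def manifold_mixup(h, y_onehot, alpha=2.0, rng=None):
    # Mix random pairs of hidden representations (or raw inputs) and their
    # one-hot labels with a single ratio lam ~ Beta(alpha, alpha).
    # `alpha=2.0` is an illustrative default, not the value from the paper.
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(h.shape[0])
    h_mix = lam * h + (1 - lam) * h[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return h_mix, y_mix
```

With a 1-shot support set there is exactly one example per class, so the permuted partner is almost always from a different class, which matches the degradation discussed above.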
Appendix D Additional visualization of decision boundaries
In this section, we provide additional decision boundary visualizations¹ from the Omniglot 20-way 1-shot examples. Starting from the initial parameters, we perform the inner-gradient steps with two examples for each class and visualize two randomly selected classes. Figure 9 shows how the decision boundaries change over the course of the inner-gradient steps (steps 0–2). We omit steps 3–5 since the models roughly converge around step 2.

¹We used: https://github.com/tmadl/highdimensionaldecisionboundaryplot

In Figure 10, we draw the final-step decision boundaries estimated from different task samples (Tasks 1–4). We see that the decision boundary of Meta-dropout is simpler and more accurate than that of the base MAML.