Meta Dropout: Learning to Perturb Features for Generalization

05/30/2019 ∙ by Hae Beom Lee, et al. ∙ KAIST Department of Mathematical Sciences

A machine learning model that generalizes well should obtain low errors on unseen test examples. Test examples can be understood as perturbations of training examples, which means that if we knew how to optimally perturb training examples to simulate test examples, we could achieve better generalization at test time. However, obtaining such perturbations is not possible in standard machine learning frameworks, as the distribution of the test data is unknown. To tackle this challenge, we propose a meta-learning framework that learns to perturb the latent features of training examples for generalization. Specifically, we meta-learn a noise generator that outputs the optimal noise distribution for latent features across all network layers, in an input-dependent manner, so as to obtain low error on the test instances. The learned noise generator then perturbs the training examples of unseen tasks at meta-test time. We show that our method, Meta-dropout, can also be understood as meta-learning of the variational inference framework for a specific graphical model, and describe its connection to existing regularizers. Finally, we validate Meta-dropout on multiple benchmark datasets for few-shot classification, and the results show that it not only significantly improves the generalization performance of meta-learners but also allows them to converge quickly.


1 Introduction

Obtaining a model that generalizes well is a fundamental problem in machine learning, and is becoming even more important in the deep learning era where models may have millions of parameters. Basically, a model that generalizes well should obtain low error on unseen test examples, but this is difficult since the distribution of the test data is unknown during training. Thus, many approaches resort to variance reduction methods that reduce the model variance with respect to changes in the input, since test examples can be thought of as perturbations of training examples. These approaches include controlling the model complexity exploring_generalization, reducing information from inputs information_bottleneck, obtaining a smoother loss surface on_large_batch; exploring_generalization; entropy-sgd, injecting Bernoulli or Gaussian noise dropout; fast_dropout; variational_dropout, or training for multiple tasks with multi-task caruana97 and meta-learning thrun98.

A more straightforward and direct way to achieve generalization is to simulate the test examples by perturbing the training examples during training. Some regularization methods such as mixup mixup follow this approach, perturbing each example in the direction of other training examples to simulate test examples. This procedure can also be applied in the latent feature space, which has been shown to obtain even better performance manifold_mixup. However, these approaches are all limited in that they do not explicitly aim to lower the generalization error on the test examples. How, then, can we perturb the training instances such that the perturbed instances are actually helpful in lowering the test loss? Enforcing this generalization objective is not possible in a standard learning framework since the test data is unobservable.

Figure 1: Concepts. In some feature space, each training instance is stochastically perturbed so that the resulting decision boundary (red line) explains the test examples well. Note that the noise distribution does not have to cover the test instances directly.

To solve this seemingly impossible problem, we resort to meta-learning thrun98, which aims to learn a model that generalizes over a distribution of tasks, rather than over a distribution of data instances from a single task. Generally, a meta-learner is trained on a series of tasks with random training and test splits. While learning to solve diverse tasks, it accumulates meta-knowledge that is not specific to a single task but is generic across all tasks, which is later leveraged when learning a novel task. During this meta-training step, we observe both the training and the test data. That is, in a meta-learning framework we can explicitly learn to perturb the training instances so as to obtain low test loss. This learned noise generator can then be used to perturb instances for generalization at meta-test time.

Yet, learning how much, and in which features, to perturb a training instance is difficult for two reasons. First, meaningful directions of perturbation may differ from one instance to another, and from one task to another. Second, a single training instance may need to cover largely different test instances with its perturbation, since we do not know which test instances will be given at test time. To handle this problem, we propose to learn input-dependent stochastic noise; that is, we want to learn a distribution of noise, or perturbation, that is meaningful for a given training instance. Specifically, we learn a noise generator for the features at each layer of the main network, given the lower-layer features as input. We refer to this meta-noise generator as Meta-dropout, which is a novel framework for learning to regularize.

Our Meta-dropout model can be considered a way to capture data-intrinsic variance, known as heteroscedastic aleatoric uncertainty what_uncertainty, which is beneficial in preventing the noise from affecting the mean function (or the main model). Since in meta-learning a training example should account for combinatorially many different test examples during training, there exists inherent ambiguity in which test example the model should cover. This brings instability into the training process, since the same model must be optimized for completely different objectives across two different training episodes. Our stochastic noise generation effectively handles this problem by introducing uncertainty in the features, which disentangles the model variance from the model mean. More importantly, the distribution of learned noise is also a type of transferable knowledge, which is especially useful in the few-shot learning setting where only a limited number of examples are given to solve a task, as the learned noise generator will generate meaningful perturbations in order to simulate the test examples. In Figure 1, the noise generator perturbs each input instance to help the model predict better decision boundaries that obtain low errors on the test examples.

In the remaining sections, we will explain our model in the context of existing work, propose the learning framework for Meta-dropout, and further show that it can be understood as meta-learning the variational inference framework for the graphical model in Figure 3. Moreover, we explain its connection to existing regularizers such as Information Bottleneck ib_principle . Finally, we validate our work on multiple benchmark datasets for few-shot classification, with three gradient-based meta-learning models, namely Model-Agnostic Meta-Learning (MAML) maml , Meta-SGD meta_sgd and Amortized Bayesian Meta Learning (ABML) abml .

Our contribution is threefold.

  • We propose a novel regularization method called Meta-dropout that generates stochastic input-dependent perturbations to regularize few-shot learning models, and propose a meta-learning framework to train it.

  • We provide a probabilistic interpretation of our approach, showing that it can be understood as meta-learning of the variational inference framework for the graphical model in Figure 3, and also describe its connection to existing regularizers such as the Information Bottleneck information_bottleneck method.

  • We validate our meta-regularizer with MAML, Meta-SGD and ABML on multiple benchmark datasets for few-shot classification, and show that our regularizer not only obtains significant improvements in generalization performance over the base models, but also expedites their convergence and stabilizes training.

2 Related Work

Meta learning

While the literature on meta-learning thrun98 is vast, here we discuss a few works relevant to few-shot classification. One of the most popular approaches for few-shot classification is metric-based meta-learning, which learns a shared metric space siamese; matchingnet; protonet; tadam; snail over randomly sampled few-shot classification problems, so that instances are embedded close to their correct class representations under some distance measure, regardless of which classes are given. The most popular models among them are Matching networks matchingnet, which leverage a cosine distance measure, and Prototypical networks protonet, which make use of the Euclidean distance. On the other hand, gradient-based meta-learning approaches such as Model Agnostic Meta Learning (MAML) maml try to solve few-shot prediction by learning a shared initialization, from which each task can reach its optimum with only a few gradient steps. Meta-learning of regularizers has also been addressed by Balaji et al. metareg, who proposed to meta-learn a regularizer for domain generalization. However, our model targets generalization more explicitly via perturbation, rather than learning the parameters of a generic regularizer.

Regularization Methods

Dropout dropout is a regularization technique that randomly drops neurons during training. In addition to the decorrelation of features and the ensemble effect, it is also possible to interpret dropout as a variational approximation to posterior inference dropout_as_bayesian, in which case we can even learn the dropout probability with stochastic gradient variational Bayes variational_dropout; concrete_dropout. Dropout regularization can be viewed as a noise injection process, where in the case of standard dropout the multiplicative noise follows a Bernoulli distribution. However, we can instead apply Gaussian multiplicative noise fast_dropout; variational_dropout. Variational dropout variational_dropout further learns the variance of the Gaussian noise using variational inference. It is also possible to learn the dropout probability in an input-dependent manner, as done with Adaptive Dropout (Standout) adaptive_dropout. Adaptive dropout can also be interpreted as variational inference on input-dependent latent variables cvae; show_attend_tell; ua; dropmax. Meta-dropout resembles a probabilistic version of adaptive dropout; the critical difference, however, is that rather than resorting to the training data and a prior distribution for posterior inference, it makes use of previously obtained meta-knowledge in the posterior variational inference. Meta-dropout is also related to the Information Bottleneck (IB) method information_bottleneck, which assumes a bottleneck variable and aims to improve generalization by minimizing the amount of information the bottleneck variable retains about the inputs, while keeping sufficient information about the target variables. Recently, variational approximations to the IB method have been proposed deep_vib; information_dropout, which inject input-dependent noise during training to forget the inputs. Meta-dropout has a similar effect to IB, but its objective directly aims at generalizing to test examples, and as a result it arrives at an optimal noise distribution that effectively forgets the inputs. Meta-dropout is also related to the Mixup regularization mixup, which at each training step randomly pairs training instances and generates examples that interpolate the two, to be used as additional training examples. Mixup simulates unseen data points to improve generalization as ours does, and the same strategy can be applied to hidden layers as well manifold_mixup. However, whereas Mixup and its variants rely on heuristic strategies whose generated instances may or may not improve generalization, our model explicitly learns to perturb each instance to minimize the test loss via meta-learning.

3 Learning to Perturb Features for Generalization

We now describe our problem setting and the meta-learning framework that learns to perturb training instances for better generalization. The goal of meta-learning is to learn a model that generalizes over a task distribution $p(\tau)$. This is usually done by training the model over a large number of tasks (or episodes) sampled from $p(\tau)$, each of which consists of a training set $\mathcal{D}^{\mathrm{tr}} = (\mathbf{X}^{\mathrm{tr}}, \mathbf{Y}^{\mathrm{tr}})$ and a test set $\mathcal{D}^{\mathrm{te}} = (\mathbf{X}^{\mathrm{te}}, \mathbf{Y}^{\mathrm{te}})$.

Suppose that we are given such a split into $\mathcal{D}^{\mathrm{tr}}$ and $\mathcal{D}^{\mathrm{te}}$. Denoting the initial model parameter of an arbitrary neural network as $\theta$, Model Agnostic Meta Learning (MAML) maml aims to infer a task-specific model parameter $\theta^{\tau}$ with one or a few gradient steps on the training set $\mathcal{D}^{\mathrm{tr}}$, such that $\theta^{\tau}$ quickly generalizes to $\mathcal{D}^{\mathrm{te}}$ as well. Let $\alpha$ denote the inner-gradient step size, and capital $\mathbf{X}$ and $\mathbf{Y}$ the concatenations of inputs and labels, respectively, for both the training and the test set. Then we have

$$\theta^{\tau} = \theta - \alpha \nabla_{\theta}\, \mathcal{L}(\theta; \mathbf{X}^{\mathrm{tr}}, \mathbf{Y}^{\mathrm{tr}}), \qquad \min_{\theta}\; \mathcal{L}(\theta^{\tau}; \mathbf{X}^{\mathrm{te}}, \mathbf{Y}^{\mathrm{te}}) \qquad (1)$$

Optimizing the objective (1) is repeated over many random splits into $\mathcal{D}^{\mathrm{tr}}$ and $\mathcal{D}^{\mathrm{te}}$, such that the initial model parameter $\theta$ captures the most generic information over the task distribution.
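To make the objective concrete, below is a minimal single-inner-step sketch of Eq. (1) written in JAX; this is our illustration rather than the authors' implementation, and loss_fn, the parameter pytree theta, and the data arrays are assumed placeholders.

import jax

def inner_step(theta, x_tr, y_tr, loss_fn, alpha=0.1):
    # One inner-gradient step on the training set: theta' = theta - alpha * grad_theta L(theta; X^tr, Y^tr)
    grads = jax.grad(loss_fn)(theta, x_tr, y_tr)
    return jax.tree_util.tree_map(lambda p, g: p - alpha * g, theta, grads)

def maml_objective(theta, x_tr, y_tr, x_te, y_te, loss_fn, alpha=0.1):
    # Evaluate the adapted parameters on the task's test set, as in Eq. (1)
    theta_task = inner_step(theta, x_tr, y_tr, loss_fn, alpha)
    return loss_fn(theta_task, x_te, y_te)

# The meta-gradient w.r.t. the initial parameters flows through the inner step
# and therefore contains second-order terms.
meta_grad_fn = jax.grad(maml_objective)

Averaging this meta-gradient over a batch of sampled tasks and applying an outer optimizer gives the episodic training loop described above.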

3.1 Meta-dropout

Figure 2: Model architecture. At each layer, the features from the layer below generate the noise distribution, from which noise is sampled and applied back to the current layer.

However, it is evident that the single initial model parameter $\theta$ in Eq. (1) alone is insufficient to account for the combinatorially many tasks at test time, whose optimal parameters may vary widely as well. Thus, to obtain optimal parameters for unknown tasks at test time, we propose to learn an input-dependent noise distribution $p(\mathbf{z} \mid \mathbf{x}; \phi)$ for each input $\mathbf{x}$, where $\phi$ is the parameter of the noise generator shared across the training examples (see Figure 2). Such conditional modeling allows the noise generator to allocate a varying degree of variance to each feature across input points what_uncertainty. This input-dependent variance modeling is used to handle intrinsic ambiguity, where the model has largely varying outcomes for the same input. In our case, this ambiguity comes from the lack of knowledge of the test samples.

The final prediction is obtained by marginalizing over the noise distribution, incorporating all plausible perturbations of each instance:

$$p(\mathbf{y} \mid \mathbf{x}; \theta, \phi) = \mathbb{E}_{p(\mathbf{z} \mid \mathbf{x}; \phi)}\!\left[\, p(\mathbf{y} \mid \mathbf{x}, \mathbf{z}; \theta) \,\right] \qquad (2)$$

By collecting such local noise information for each training instance, we obtain the joint predictive distribution $p(\mathbf{Y}^{\mathrm{tr}} \mid \mathbf{X}^{\mathrm{tr}}; \theta, \phi) = \prod_i p(\mathbf{y}_i \mid \mathbf{x}_i; \theta, \phi)$ that incorporates the noise information over all training examples. Therefore, maximizing the training likelihood amounts to accounting for the plausible perturbations of the given training set $(\mathbf{X}^{\mathrm{tr}}, \mathbf{Y}^{\mathrm{tr}})$, which allows us to better explain nearby test examples. Toward this goal, we first construct the inner-gradient step with perturbations as follows:

$$\theta^{\tau} = \theta - \alpha \nabla_{\theta}\, \mathbb{E}_{p(\mathbf{Z} \mid \mathbf{X}^{\mathrm{tr}}; \phi)}\!\left[ -\log p(\mathbf{Y}^{\mathrm{tr}} \mid \mathbf{X}^{\mathrm{tr}}, \mathbf{Z}; \theta) \right] \qquad (3)$$
$$\;\approx \theta - \alpha \nabla_{\theta}\, \frac{1}{S} \sum_{s=1}^{S} \left[ -\log p(\mathbf{Y}^{\mathrm{tr}} \mid \mathbf{X}^{\mathrm{tr}}, \mathbf{Z}^{(s)}; \theta) \right], \quad \mathbf{Z}^{(s)} \sim p(\mathbf{Z} \mid \mathbf{X}^{\mathrm{tr}}; \phi) \qquad (4)$$

where the expectation (over the negative log-likelihood) is approximated with Monte-Carlo integration with sample size $S$. We implicitly assume the reparameterization trick for $p(\mathbf{z} \mid \mathbf{x}; \phi)$, so that the samples $\mathbf{Z}^{(s)}$ are differentiable with respect to $\phi$, following Kingma and Welling vae. We then optimize the parameter $\phi$ of the noise generator as well as the initial model parameter $\theta$ over the task distribution in a meta-learning fashion, to obtain a more generalizable final task-specific parameter $\theta^{\tau}$:

$$\min_{\theta, \phi}\; \mathbb{E}_{p(\tau)}\!\left[ -\log p(\mathbf{Y}^{\mathrm{te}} \mid \mathbf{X}^{\mathrm{te}}; \theta^{\tau}) \right], \quad \text{where } \theta^{\tau} \text{ is obtained from Eq. (3)--(4)} \qquad (5)$$

We let the sample size $S = 1$ for meta-training for computational efficiency; for meta-testing, we set $S$ to a sufficiently large number to accurately marginalize over the noise distribution, since the sampling cost at evaluation time is manageable in few-shot learning. Note that for both meta-training and meta-testing, we do not apply noise to the test examples (i.e., by taking the expectation inside), since they are the targets we aim to generalize to by perturbing the training examples.

Extending our noise learning framework to more than one inner-gradient step is also straightforward: at each inner-gradient step, we perform Monte-Carlo integration to estimate the next-step model parameter, and repeat this process until we obtain the final $\theta^{\tau}$. Thus at each gradient step, we perturb the training examples with the learned noise generator, in order to obtain a final predictor with better decision boundaries.
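As a rough illustration of Eqs. (3)-(5), the sketch below (ours, not the released code) perturbs only the training examples with $S$ Monte-Carlo noise samples inside the inner step and evaluates the unperturbed test loss outside; noisy_loss_fn and clean_loss_fn are hypothetical stand-ins for the perturbed and clean negative log-likelihoods of the main network.

import jax
import jax.numpy as jnp

def inner_step(theta, phi, x_tr, y_tr, key, noisy_loss_fn, alpha=0.1, S=1):
    keys = jax.random.split(key, S)
    def mc_loss(theta_):
        # Monte-Carlo estimate of E_{p(z|x;phi)}[-log p(y|x,z;theta)] with S samples (Eq. (4))
        return jnp.mean(jnp.stack([noisy_loss_fn(theta_, phi, x_tr, y_tr, k) for k in keys]))
    grads = jax.grad(mc_loss)(theta)
    return jax.tree_util.tree_map(lambda p, g: p - alpha * g, theta, grads)

def meta_objective(theta, phi, task, key, noisy_loss_fn, clean_loss_fn, alpha=0.1, S=1):
    x_tr, y_tr, x_te, y_te = task
    theta_task = inner_step(theta, phi, x_tr, y_tr, key, noisy_loss_fn, alpha, S)
    # Test examples receive no noise; phi influences the objective only through the inner step (Eq. (5))
    return clean_loss_fn(theta_task, x_te, y_te)

# Meta-gradients for both the initialization theta and the noise generator phi:
meta_grads_fn = jax.grad(meta_objective, argnums=(0, 1))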

3.2 Form of the noise

We apply input-dependent noise to all latent features at every layer of the network, similarly to Adaptive dropout adaptive_dropout (see Figure 2). Here we suppress the layer index and the dependency on $\mathbf{x}$ for readability. Either of the following two types of noise is applied, according to the characteristics of the data.

Additive noise

One of the simplest forms for the noise distribution is a zero-mean diagonal Gaussian, which can be added to the pre-activation features $\mathbf{h}$ at each layer. Although it cannot capture the correlation between elements within the same layer, the upper-layer features will consider their correlations and non-linearly transform them to yield a complicated noise distribution:

$$\tilde{\mathbf{h}} = \mathbf{h} + \mathbf{z}, \qquad \mathbf{z} = \gamma\, \boldsymbol{\sigma}_{\phi}(\mathbf{h}) \odot \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \qquad (6)$$

where $\gamma$ is a hyperparameter controlling how far each noise variable can spread out to cover nearby test examples, and $\boldsymbol{\sigma}_{\phi}(\mathbf{h})$ is the input-dependent standard deviation produced by the noise generator. The reparameterization trick is applied to obtain stable and unbiased gradient estimates w.r.t. the mean and variance of the Gaussian distribution, following Kingma and Welling vae.
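A minimal sketch of the additive noise of Eq. (6), assuming a hypothetical single-layer noise generator with parameters phi = (W, b) acting on the pre-activation h; the exact architecture of the generator is an assumption on our part, not the released implementation.

import jax
import jax.numpy as jnp

def additive_noise(phi, h, key, gamma=1.0):
    W, b = phi
    sigma = jax.nn.softplus(h @ W + b)      # input-dependent per-dimension std (assumed parameterization)
    eps = jax.random.normal(key, h.shape)   # reparameterization trick: z = gamma * sigma * eps
    return h + gamma * sigma * eps          # zero-mean diagonal Gaussian added to the pre-activation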

Multiplicative noise

We also consider nonnegative noise multiplied onto the pre-activation features of each layer, which is useful when the input to the noise generator itself contains a large amount of noise (e.g., the complicated backgrounds of real-world images). In this case, the noise generator can first attend to relevant parts of the input, and perturb only those attended features. We propose a simple ReLU transformation of a Gaussian distribution that resembles a log-normal distribution, which allows us to explicitly sparsify the generated noise; we empirically verified that this works well:

$$\tilde{\mathbf{h}} = \mathbf{h} \odot \mathbf{z}, \qquad \mathbf{z} = \mathrm{ReLU}\!\left(\boldsymbol{\mu}_{\phi}(\mathbf{h}) + \gamma\, \boldsymbol{\epsilon}\right), \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \qquad (7)$$

Note that not only the variance but also the input-dependent mean $\boldsymbol{\mu}_{\phi}(\mathbf{h})$ jointly determines the amount of noise in this multiplicative case. For the multiplicative case, we experimentally found that input-dependent modeling of the variance does not improve the results.
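A corresponding sketch of the multiplicative noise of Eq. (7), under the same assumed one-layer generator and imports as the additive sketch above; the generator predicts an input-dependent mean, Gaussian noise of fixed scale gamma is added, and a ReLU keeps the multiplicative noise nonnegative (and possibly exactly zero, which sparsifies it).

def multiplicative_noise(phi, h, key, gamma=1.0):
    W, b = phi
    mu = h @ W + b                          # input-dependent noise mean (attention-like)
    eps = jax.random.normal(key, h.shape)
    z = jax.nn.relu(mu + gamma * eps)       # nonnegative noise resembling a log-normal
    return h * z                            # multiply onto the pre-activation features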

3.3 Locality of perturbation

During the meta-learning phase, we need to compute the gradients of Eq. (5) with respect to the meta-parameters $\phi$ as well as $\theta$, which involve second-order derivatives. While we can optionally ignore the second-order term in the case of $\theta$, as suggested in maml, such an approximation is not valid for $\phi$, since setting the second-order term to zero would make the whole gradient vanish. In fact, this second-order derivative plays a crucial role in training and gives us useful intuition about the relationship between the perturbation and the test loss. Define $\mathcal{L}^{\mathrm{tr}}_i$ as the $i$-th training example loss and $\mathcal{L}^{\mathrm{te}}_j$ as the $j$-th test example loss. Then we have

$$\frac{\partial \mathcal{L}^{\mathrm{te}}_j(\theta^{\tau})}{\partial \mathbf{z}_i} = -\alpha \left( \frac{\partial}{\partial \mathbf{z}_i} \nabla_{\theta}\, \mathcal{L}^{\mathrm{tr}}_i(\theta, \mathbf{z}_i) \right)^{\!\top} \nabla_{\theta^{\tau}}\, \mathcal{L}^{\mathrm{te}}_j(\theta^{\tau}) \qquad (8)$$

which is a component of the full gradient $\nabla_{\phi} \sum_j \mathcal{L}^{\mathrm{te}}_j$ via the chain rule through $\mathbf{z}_i$. We see that an infinitesimal change of the perturbation $\mathbf{z}_i$ neither increases nor decreases the test loss $\mathcal{L}^{\mathrm{te}}_j$ when the collection of directions for reducing the training loss (obtained by varying each dimension of $\mathbf{z}_i$) is orthogonal to the direction for reducing the test loss. This explains why we need attention-like multiplicative noise: the attention mechanism essentially encourages unrelated examples to activate different sub-networks, such that a change of the perturbation in one sub-network does not significantly affect unrelated examples that use a different sub-network. This locality of perturbation allows only relevant training-test instance pairs to interact, even though the task distribution generates completely random training-test pairs.
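To see where Eq. (8) comes from, consider the single inner-step case with the training losses summed over examples; the following is a sketch of the chain-rule derivation in our notation:

$$\theta^{\tau} = \theta - \alpha \nabla_{\theta} \sum_{i} \mathcal{L}^{\mathrm{tr}}_i(\theta, \mathbf{z}_i)
\;\;\Longrightarrow\;\;
\frac{\partial \mathcal{L}^{\mathrm{te}}_j(\theta^{\tau})}{\partial \mathbf{z}_i}
= \left(\frac{\partial \theta^{\tau}}{\partial \mathbf{z}_i}\right)^{\!\top} \nabla_{\theta^{\tau}} \mathcal{L}^{\mathrm{te}}_j(\theta^{\tau})
= -\alpha \left(\frac{\partial}{\partial \mathbf{z}_i} \nabla_{\theta}\, \mathcal{L}^{\mathrm{tr}}_i(\theta, \mathbf{z}_i)\right)^{\!\top} \nabla_{\theta^{\tau}} \mathcal{L}^{\mathrm{te}}_j(\theta^{\tau})$$

The right-hand side is zero exactly when every column of the mixed second derivative (one direction per dimension of $\mathbf{z}_i$) is orthogonal to the test-loss gradient $\nabla_{\theta^{\tau}} \mathcal{L}^{\mathrm{te}}_j$.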

3.4 Meta-learning variational inference framework

Figure 3: Graphical model

Lastly, we explain the connection of our model to a meta-learning variational inference framework, described by the graphical model in Figure 3. Suppose we are given an observed training set $(\mathbf{X}^{\mathrm{tr}}, \mathbf{Y}^{\mathrm{tr}})$, where for each instance the generative process involves a latent variable $\mathbf{z}$ conditioned on $\mathbf{x}$. Note that the global parameter $\phi$ is fixed and only $\theta$ is learnable (as in our model during the inner-gradient steps).

In this specific context, we see that the inner-gradient steps of Meta-dropout in Eq. (3) essentially perform posterior variational inference on $\mathbf{z}$ (the latent input-dependent noise variable) by maximizing the following evidence lower bound:

$$\log p(\mathbf{Y}^{\mathrm{tr}} \mid \mathbf{X}^{\mathrm{tr}}; \theta, \phi) \;\geq\; \mathbb{E}_{p(\mathbf{Z} \mid \mathbf{X}^{\mathrm{tr}}; \phi)}\!\left[ \log p(\mathbf{Y}^{\mathrm{tr}} \mid \mathbf{X}^{\mathrm{tr}}, \mathbf{Z}; \theta) \right] \qquad (9)$$

Note that this special type of ELBO in Eq. (9) lets the approximate posterior share its form with the conditional prior (hence the KL divergence between them is zero), a practice frequently used in the literature for its practicality cvae; show_attend_tell; ua.

At the end of maximizing the ELBO, the fixed $\phi$ effectively constrains the form of the approximate posterior so as to obtain a more accurate predictive distribution on a novel instance. Although Bayesian inference in general aims at improving generalization by incorporating a prior, this meta-learning process can add to it to further improve generalization.

4 Experiments

Baselines and our models

We first introduce baseline meta-learning models and our models.

1) MAML. Model Agnostic Meta Learning by Finn et al. maml. We do not use the first-order approximation, for fair comparison against the baselines that use second-order derivatives.

2) Meta-SGD. A variant of MAML whose learning-rate vector is learned element-wise for the inner-gradient steps meta_sgd.

3) ABML. Amortized Bayesian Meta Learning by Ravi and Beatson abml. This model reformulates MAML under a hierarchical Bayesian framework, such that the global latent variables include the initial model weights and variances, and the task-specific latent parameters are generated from them.

4) MAML + Gaussian dropout. MAML with zero-mean, constant-variance Gaussian noise independently sampled for each dimension and added to the layer pre-activations fast_dropout.

5) MAML + indep. noise. MAML with input-independent noise variables applied to each layer. We apply additive zero-mean diagonal Gaussian noise to the pre-activations of each channel, and meta-learn them similarly to ours. The noise scaling hyperparameter is fixed to the same value for all experiments.

6) Meta-dropout. MAML, Meta-SGD or ABML with our Meta-dropout regularization. We fix $\gamma$ for all our experiments without hyperparameter tuning (except in Figure 6 for qualitative analysis), which we empirically found to work well in general. Note that we use additive noise for Omniglot and multiplicative noise for the Mini-ImageNet experiments.

Datasets and Experimental Setups

We validate our method on Omniglot and Mini-ImageNet, the two most popular benchmark datasets for few-shot classification.

1) Omniglot: This gray-scale hand-written character dataset consists of 1,623 classes with 20 examples per class. Following the experimental setup of Vinyals et al. matchingnet, we use 1,200 classes for meta-training and the remaining 423 classes for meta-testing. We further augment the classes by rotating in multiples of 90 degrees, such that the total number of classes is 6,492. Base network: The network consists of 4 convolutional layers. 3 × 3 kernels ("same" padding) and 64 channels are used for each layer, followed by batch normalization, ReLU, and max pooling ("valid" padding).

Experimental setup: We mostly follow the settings of Finn et al. maml. The meta batch size and the inner-gradient step size $\alpha$ are set separately for the 20-way 1-shot and 20-way 5-shot conditions, and the meta-learning rate is decayed once after an initial number of training iterations.

2) Mini-ImageNet: This dataset consists of 84 × 84 color images from 100 classes, with 600 examples per class. We split the classes into disjoint meta-training and meta-testing sets, and do not use a meta-validation set. Base network: The network is identical to the one used for Omniglot except that 32 channels are used for each conv layer. Experimental setup: 5-way 1-shot and 5-way 5-shot classification are considered. Following maml, the meta batch size is set separately for the two conditions, and the meta-learning rate is decayed once during training.

For both datasets, we set the number of inner-gradient steps to 5 for both meta-training and meta-testing, and use a fixed number of test (or query) examples per class. We use the Adam optimizer adam with gradients clipped to a fixed range, and implement all models in TensorFlow tensorflow.

4.1 Quantitative analysis

Table 1 shows the few-shot classification results on the Omniglot and Mini-ImageNet datasets. We first reproduce MAML, Meta-SGD and ABML, which perform similarly to or better than the accuracies reported in their original papers, and then add our Meta-dropout to the training of each model.

Models                     Omniglot 20-way               Mini-ImageNet 5-way
                           1-shot          5-shot        1-shot          5-shot
MAML (ours)                95.23 ± 0.17    98.38 ± 0.07  49.58 ± 0.65    64.55 ± 0.52
MAML + Meta-dropout        96.55 ± 0.14    99.04 ± 0.05  50.92 ± 0.66    65.49 ± 0.55
Meta-SGD (ours)            96.16 ± 0.14    98.54 ± 0.07  48.30 ± 0.64    65.55 ± 0.56
Meta-SGD + Meta-dropout    97.37 ± 0.12    99.02 ± 0.06  49.03 ± 0.63    66.79 ± 0.52
ABML (ours)                95.69 ± 0.15    98.88 ± 0.06  44.55 ± 0.61    63.49 ± 0.56
ABML + Meta-dropout        96.72 ± 0.13    98.91 ± 0.06  51.03 ± 0.69    66.56 ± 0.53
Table 1: Few-shot classification performance. All reported results are average performances over 1000 randomly selected episodes, with standard errors for the 95% confidence interval over tasks.

(a) Omniglot 20-way 1-shot
(b) Mini-ImageNet 5-way 1-shot
(c) Omniglot 20-way
Figure 4: (a,b) Convergence plots with loss. Dotted transparent lines denote meta-training loss and bold lines show meta-testing loss. (c) Wall-clock convergence plots with meta-testing accuracy.
Models (MAML + )     Omniglot 20-way
                     1-shot          5-shot
No noise             95.23 ± 0.17    98.38 ± 0.07
Gaussian dropout     95.34 ± 0.16    98.55 ± 0.07
Indep. noise         94.61 ± 0.17    98.45 ± 0.07
Meta-dropout         96.55 ± 0.14    99.04 ± 0.05
Table 2: Ablation study on the noise type.

The models with Meta-dropout outperform the baselines by significant margins. In particular, there is a large gap between ABML and ABML + Meta-dropout on the Mini-ImageNet dataset. This can be explained in terms of the epistemic and aleatoric uncertainty introduced by Kendall and Gal what_uncertainty. Note that the inner-gradient steps of ABML maximize the ELBO of Bayes by backprop bayes_by_backprop given the learned weight prior. This captures the uncertainty in the weights, or epistemic uncertainty, which in this case is the task-specific model's uncertainty coming from the lack of observed data. However, this is not the only source of uncertainty in meta-learning. As mentioned previously, in meta-training a single training example should be able to account for combinatorially many test sets, and thus the decision boundaries each training example should explain can differ from one test set to another. There is therefore an inherent ambiguity in which direction the noise should be generated, which corresponds to aleatoric uncertainty capturing the inherent ambiguity in the task distribution.

The superiority of Meta-dropout over ABML demonstrates that modeling aleatoric uncertainty over the task distribution is more important than considering epistemic uncertainty for each task-specific learner when solving few-shot classification. This is because, for each task-specific learner, incorporating the prior can sometimes limit the expressivity of the posterior, while Meta-dropout does the opposite: it effectively increases the expressivity by transferring the learned knowledge of how to stochastically perturb each training example so as to explain the large variance in the test examples. This helps meta-training, which explains why the meta-training losses of Meta-dropout in Figures 4(a) and 4(b) drop much faster and to lower values than those of ABML. Finally, ABML + Meta-dropout works best, since it captures the two different types of uncertainty at the same time.

To further examine the importance of input-dependent noise modeling, we conduct an ablation study against parameter-free Gaussian dropout and an input-independent version of Meta-dropout (see Table 2). The results show that Meta-dropout improves upon the base model, while the other noise types do not. This is expected, since we need to know the task and the instance in order to know in which directions to perturb.

4.2 Qualitative analysis

Figure 5: Visualization of the learned additive noise at the 2nd layer (z in Eq. (6)) on the Omniglot dataset.
Figure 6: (Left) Visualization of the 1st layer mean and sampled post-activations for each channel. (Right) The reconstructed inputs from the 3rd layer features with and without noise (from the model trained with a different $\gamma$ for qualitative analysis).

Figure 5 shows the visualizations of the learned additive noise for some of the channels in the 2nd layer of the base network trained on the Omniglot dataset. Meta-dropout seems to generate diverse perturbations such as noise on backgrounds, character contours, and some specific parts of the characters.

(a) MAML
(b) Meta-dropout
Figure 7: Decision boundary visualization (Omniglot 1-shot)

Figure 6 (left) shows the 1st layer post-activations from 4 different channels. The first and second images for each channel show the activation without and with perturbation, respectively. Whereas channels 1 and 2 show little perturbation of the activations, channel 3 shows stronger noise. In channel 4, the noise is dominant, such that it prevents any meaningful information from flowing to the next layer. This behavior is similar to that of the information bottleneck (IB) regularizer ib_principle, which aims to improve generalization by minimizing the mutual information between the input and the bottleneck variable.

Figure 6 (right) shows the visualization of the 3rd layer features, obtained using a separately trained deconvolutional decoder visualization . The reconstructed inputs show both background noise and semantic perturbation, which seems reasonable in terms of data augmentation.

Finally, in Figure 7 we visualize the decision boundaries of MAML and of our model, in the space of the last convolutional layer features. Comparing the decision boundaries of MAML (Figure 7(a)) and Meta-dropout (Figure 7(b)), we see that Meta-dropout yields significantly better decision boundaries thanks to the perturbations, whose directions align well with the locations of the test examples, effectively broadening the area covered by each class. Note that the noise samples do not necessarily correspond to the test examples, since the only goal of the perturbation is to regularize the model's decision boundary, not to estimate the manifold distribution (see Figure 1).

5 Conclusion

We proposed a novel method that learns to regularize for generalization, by learning to perturb the latent features of training examples. To this end, we proposed a meta-learning framework in which the input-adaptive noise distribution explicitly tries to minimize the test loss during meta-training. We provided a probabilistic interpretation of our model as meta-learning of the variational inference framework for a specific graphical model, and also empirically showed its connection to the Information Bottleneck. The experimental results with three gradient-based meta-learning models on benchmark datasets show that our method significantly improves the generalization performance of the target meta-learner, while also allowing it to converge considerably faster with more stable training. As future work, we will explore ways to further generalize Meta-dropout to diverse network architectures and tasks under standard learning scenarios.

References

  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. Tensorflow: Large-scale Machine Learning on Heterogeneous Distributed Systems. arXiv:1603.04467, 2016.
  • [2] A. Achille and S. Soatto. Information Dropout: Learning Optimal Representations Through Noisy Computation. In PAMI, 2018.
  • [3] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy. Deep Variational Information Bottleneck. ICLR, 2017.
  • [4] J. Ba and B. Frey. Adaptive dropout for training deep neural networks. In NIPS. 2013.
  • [5] Y. Balaji, S. Sankaranarayanan, and R. Chellappa. Metareg: Towards domain generalization using meta-regularization. In Advances in Neural Information Processing Systems 31. 2018.
  • [6] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight Uncertainty in Neural Networks. In ICML, 2015.
  • [7] R. Caruana. Multitask Learning. Machine Learning, 1997.
  • [8] P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina. Entropy-SGD: Biasing Gradient Descent Into Wide Valleys. ICLR, 2017.
  • [9] A. Dosovitskiy and T. Brox. Inverting Visual Representations with Convolutional Networks. arXiv e-prints, Jun 2015.
  • [10] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
  • [11] Y. Gal and Z. Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. ArXiv e-prints, June 2015.
  • [12] Y. Gal, J. Hron, and A. Kendall. Concrete Dropout. ArXiv e-prints, May 2017.
  • [13] J. Heo, H. B. Lee, S. Kim, J. Lee, K. J. Kim, E. Yang, and S. J. Hwang. Uncertainty-Aware Attention for Reliable Interpretation and Prediction. ArXiv e-prints, May 2018.
  • [14] A. Kendall and Y. Gal.

    What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?

    ArXiv e-prints, Mar. 2017.
  • [15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2014.
  • [16] D. P. Kingma, T. Salimans, and M. Welling. Variational Dropout and the Local Reparameterization Trick. ArXiv e-prints, June 2015.
  • [17] D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In ICLR, 2014.
  • [18] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In International Conference on Machine Learning, 2015.
  • [19] H. B. Lee, J. Lee, S. Kim, E. Yang, and S. J. Hwang. Dropmax: Adaptive variational softmax. In NeurIPS, 2018.
  • [20] Z. Li, F. Zhou, F. Chen, and H. Li. Meta-sgd: Learning to learn quickly for few shot learning. arXiv preprint arXiv:1707.09835, 2017.
  • [21] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A simple neural attentive meta-learner. In International Conference on Learning Representations, 2018.
  • [22] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. Exploring Generalization in Deep Learning. NIPS, 2017.
  • [23] B. Oreshkin, P. R. López, and A. Lacoste. Tadam: Task dependent adaptive metric for improved few-shot learning. In NeurIPS, 2018.
  • [24] S. Ravi and A. Beatson. Amortized bayesian meta-learning. In International Conference on Learning Representations, 2019.
  • [25] N. Shirish Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. ICLR, 2017.
  • [26] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
  • [27] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In NIPS. NIPS, 2015.
  • [28] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
  • [29] S. Thrun and L. Pratt, editors. Learning to Learn. Kluwer Academic Publishers, Norwell, MA, USA, 1998.
  • [30] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In Annual Allerton Conference on Communication, Control and Computing, 1999.
  • [31] N. Tishby and N. Zaslavsky. Deep Learning and the Information Bottleneck Principle. In In IEEE Information Theory Workshop, 2015.
  • [32] V. Verma, A. Lamb, C. Beckham, A. Najafi, A. Courville, I. Mitliagkas, and Y. Bengio. Manifold Mixup: Learning Better Representations by Interpolating Hidden States. arXiv e-prints, June 2018.
  • [33] O. Vinyals, C. Blundell, T. P. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. In NIPS, 2016.
  • [34] S. Wang and C. Manning. Fast dropout training. In ICML, pages 118–126, 2013.
  • [35] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ArXiv e-prints, Feb. 2015.
  • [36] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2017.

Appendix A Algorithm

We provide the pseudocode of the Meta-dropout model for meta-training and meta-testing, respectively.

1: Input: Task distribution p(τ), number of inner steps K, inner step size α, outer step size β.
2: while not converged do
3:     Sample a task (X^tr, Y^tr, X^te, Y^te) ~ p(τ)
4:     θ_0 ← θ
5:     for k = 0 to K − 1 do
6:         Sample z_i ~ p(z_i | x_i; φ) for each training example i
7:         L^tr(θ_k) ← −Σ_i log p(y_i | x_i, z_i; θ_k) for the training examples
8:         θ_{k+1} ← θ_k − α ∇_{θ_k} L^tr(θ_k)
9:     end for
10:    θ^τ ← θ_K
11:    L^te(θ^τ) ← −Σ_j log p(y_j | x_j; θ^τ) for the test examples
12:    θ ← θ − β ∇_θ L^te(θ^τ)
13:    φ ← φ − β ∇_φ L^te(θ^τ)
14: end while
Algorithm 1 Meta-training
1: Input: Number of inner steps K, inner step size α, MC sample size S.
2: Input: Learned parameters θ and φ from Algorithm 1.
3: Input: Meta-test dataset (X^tr, Y^tr, X^te, Y^te).
4: θ_0 ← θ
5: for k = 0 to K − 1 do
6:     Sample z_i^(s) ~ p(z_i | x_i; φ) for each training example i and s = 1, ..., S
7:     L^tr(θ_k) ← −(1/S) Σ_i Σ_s log p(y_i | x_i, z_i^(s); θ_k) for the training examples
8:     θ_{k+1} ← θ_k − α ∇_{θ_k} L^tr(θ_k)
9: end for
10: θ^τ ← θ_K
11: Evaluate −log p(y_j | x_j; θ^τ) for the test examples
Algorithm 2 Meta-testing

Appendix B Graphical model for the inner-gradient steps

(a) Meta-dropout      
(b) ABML             
(c) ABML + Meta-dropout
Figure 8: Graphical model for inner-gradient steps.

In this section, we provide a more detailed explanation of the graphical model each baseline optimizes in its inner-gradient steps, and the corresponding variational frameworks with the learned parameters (or learned priors).

B.1 Meta-dropout

The inner-gradient steps of Meta-dropout perform posterior inference based on the graphical model in Figure 8(a) (note that $\phi$ is constant). Let $q(\mathbf{Z} \mid \mathbf{X}^{\mathrm{tr}}, \mathbf{Y}^{\mathrm{tr}})$ denote the approximate posterior. The standard variational inference framework maximizes the following evidence lower bound (ELBO) [27]:

$$\log p(\mathbf{Y}^{\mathrm{tr}} \mid \mathbf{X}^{\mathrm{tr}}; \theta, \phi) \;\geq\; \mathbb{E}_{q(\mathbf{Z} \mid \mathbf{X}^{\mathrm{tr}}, \mathbf{Y}^{\mathrm{tr}})}\!\left[ \log p(\mathbf{Y}^{\mathrm{tr}} \mid \mathbf{X}^{\mathrm{tr}}, \mathbf{Z}; \theta) \right] - \mathrm{KL}\!\left[ q(\mathbf{Z} \mid \mathbf{X}^{\mathrm{tr}}, \mathbf{Y}^{\mathrm{tr}}) \,\|\, p(\mathbf{Z} \mid \mathbf{X}^{\mathrm{tr}}; \phi) \right]$$

However, the KL divergence in this bound easily becomes negligible (close to zero), mainly because both the conditional prior $p(\mathbf{Z} \mid \mathbf{X}^{\mathrm{tr}}; \phi)$ and the approximate posterior are conditioned instance-wise. Therefore, a usual practice is to let the approximate posterior share its form with the conditional prior (i.e., $q(\mathbf{Z} \mid \mathbf{X}^{\mathrm{tr}}, \mathbf{Y}^{\mathrm{tr}}) = p(\mathbf{Z} \mid \mathbf{X}^{\mathrm{tr}}; \phi)$), resulting in the following simpler form with the KL term vanishing to zero [27, 35]:

$$\log p(\mathbf{Y}^{\mathrm{tr}} \mid \mathbf{X}^{\mathrm{tr}}; \theta, \phi) \;\geq\; \mathbb{E}_{p(\mathbf{Z} \mid \mathbf{X}^{\mathrm{tr}}; \phi)}\!\left[ \log p(\mathbf{Y}^{\mathrm{tr}} \mid \mathbf{X}^{\mathrm{tr}}, \mathbf{Z}; \theta) \right] \qquad (10)$$

which corresponds to the ELBO that Meta-dropout maximizes in its inner-gradient steps (see the main paper). The learned $\phi$ regularizes the final solution for the approximate posterior (or, equivalently, the conditional prior).

B.2 ABML

This is the model proposed by Ravi and Beatson [24]. They assume the global initial model parameter to be Gaussian, whose mean and variance are given as learnable meta-parameters. The inner-gradient steps of ABML thus maximize the ELBO for the graphical model in Figure 8(b) (e.g., Bayes by backprop [6]). They further assume the mean and variance to be latent, but they use point estimates for them, so here we simply treat them as deterministic for a simpler analysis. The lower bound is given as follows:

$$\log p(\mathbf{Y}^{\mathrm{tr}} \mid \mathbf{X}^{\mathrm{tr}}) \;\geq\; \mathbb{E}_{q(\theta)}\!\left[ \log p(\mathbf{Y}^{\mathrm{tr}} \mid \mathbf{X}^{\mathrm{tr}}, \theta) \right] - \mathrm{KL}\!\left[ q(\theta) \,\|\, p(\theta) \right]$$

where $q(\theta)$ is the approximate posterior. The dependency on the whole dataset is achieved by performing gradient steps on $q(\theta)$ from the initial mean and variance. Therefore, unlike our model, where $\phi$ does not participate in the inner-gradient steps, in ABML both parameters do.

B.3 ABML + Meta-dropout

Lastly, if we apply Meta-dropout to ABML, the inner-gradient steps of the resulting model optimize the ELBO for the combined graphical model in Figure 8(c). The lower bound is as follows:

$$\log p(\mathbf{Y}^{\mathrm{tr}} \mid \mathbf{X}^{\mathrm{tr}}; \phi) \;\geq\; \mathbb{E}_{q(\theta)\, p(\mathbf{Z} \mid \mathbf{X}^{\mathrm{tr}}; \phi)}\!\left[ \log p(\mathbf{Y}^{\mathrm{tr}} \mid \mathbf{X}^{\mathrm{tr}}, \mathbf{Z}, \theta) \right] - \mathrm{KL}\!\left[ q(\theta) \,\|\, p(\theta) \right]$$

where, for the expectation on the r.h.s., we first sample the weights $\theta$ and then sample the noise variables $\mathbf{Z}$. This shows that ABML and Meta-dropout are compatible within a single variational inference framework, given the learned optimal parameter $\phi$ for the noise generator and the learned weight prior. They also effectively complement each other in terms of the different types of uncertainty: ABML accounts for the epistemic uncertainty of each task-specific parameter as a latent variable, and Meta-dropout accounts for the aleatoric (and also heteroscedastic) uncertainty over the global task distribution.

Appendix C MAML + Manifold Mixup experiments

We ran additional experiments with Manifold Mixup [32], where at each step we randomly select one hidden layer (possibly including the inputs) and interpolate pairs of hidden representations of the training examples as well as the corresponding one-hot encoded labels. The interpolation is linear, with a random mixing ratio sampled from a Beta distribution. We used a value of the Beta parameter that is reported to work well in the paper [32]. Similarly to the other stochastic baselines, we train with a single interpolation per inner-gradient step at meta-training time, and sample multiple interpolations (each with an independent sample of the layer index and the mixing ratio) per step at meta-testing time to accurately evaluate the final task-specific model parameter.
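A rough sketch of one such interpolation step (our reading of [32]; the pairing strategy and the Beta parameter value are assumptions):

import jax

def manifold_mixup(key, hiddens, y_onehot, beta_a=2.0):
    # hiddens: list of per-layer representations, index 0 being the inputs; beta_a is an assumed value
    k1, k2, k3 = jax.random.split(key, 3)
    layer = int(jax.random.randint(k1, (), 0, len(hiddens)))   # randomly chosen layer to mix at
    lam = jax.random.beta(k2, beta_a, beta_a)                  # random mixing ratio
    perm = jax.random.permutation(k3, y_onehot.shape[0])       # random pairing of examples
    h = hiddens[layer]
    h_mix = lam * h + (1.0 - lam) * h[perm]                    # interpolate hidden representations
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]      # interpolate one-hot labels
    return layer, h_mix, y_mix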

In Table 3, we see that Manifold Mixup does not help improve the base MAML. This is because there are only a few ways to interpolate when only a few training instances are available. In the 1-shot scenario there are no other examples from the same class at all, which degrades the quality of the interpolation.

Models                        Omniglot 20-way               Mini-ImageNet 5-way
                              1-shot          5-shot        1-shot          5-shot
MAML (ours)                   95.23 ± 0.17    98.38 ± 0.07  49.58 ± 0.65    64.55 ± 0.52
MAML + Manifold Mixup ()      89.78 ± 0.25    97.86 ± 0.08  48.62 ± 0.66    63.86 ± 0.53
MAML + Manifold Mixup ()      87.26 ± 0.28    97.14 ± 0.17  48.42 ± 0.64    62.56 ± 0.55
MAML + Meta-dropout           96.55 ± 0.14    99.04 ± 0.05  50.92 ± 0.66    65.49 ± 0.55
Table 3: Few-shot classification performance. All reported results are average performances over 1000 randomly selected episodes, with standard errors for the 95% confidence interval over tasks.

Appendix D Additional visualization of decision boundaries

In this section, we provide additional decision boundary visualizations (we referred to https://github.com/tmadl/highdimensional-decision-boundary-plot) on Omniglot 20-way 1-shot examples. From the initial parameters, we perform the inner-gradient steps with two examples per class and visualize two randomly selected classes. Figure 9 shows how the decision boundaries change over the course of the inner-gradient steps (steps 0-2). We omit steps 3-5, since the models roughly converge around step 2.

(a) MAML (Step 0)
(b) MAML (Step 1)
(c) MAML (Step 2)
(d) Meta-dropout (Step 0)
(e) Meta-dropout (Step 1)
(f) Meta-dropout (Step 2)
Figure 9: Given the same sampled task, we draw decision boundaries over the course of inner-gradient steps with MAML (upper row) and Meta-dropout (bottom row), respectively.

In Figure 10, we draw the final-step decision boundaries estimated from different task samples (Tasks 1-4). We see that the decision boundaries of Meta-dropout are much simpler and more accurate than those of the base MAML.

(a) MAML (Task 1)
(b) Meta-dropout (Task 1)
(c) MAML (Task 2)
(d) Meta-dropout (Task 2)
(e) MAML (Task 3)
(f) Meta-dropout (Task 3)
(g) MAML (Task 4)
(h) Meta-dropout (Task 4)
Figure 10: Visualization of the final-step decision boundaries.

Appendix E Visualization with Mini-ImageNet examples

(a) Inputs
(b) Layer 2 channel 1
(c) Layer 2 channel 2
(d) Layer 2 channel 3
(e) Layer 2 channel 4
(f) Layer 2 channel 5
Figure 11: The learned multiplicative noise strength on the 2nd layer pre-activations

(a) Inputs
(b) Layer 4 channel 1
(c) Layer 4 channel 2
(d) Layer 4 channel 3
(e) Layer 4 channel 4
(f) Layer 4 channel 5
(g) Layer 4 channel 6
(h) Layer 4 channel 7
Figure 12: The learned multiplicative noise strength on the 4th layer pre-activations.

(a) Inputs
(b) Perturbation 1
(c) Perturbation 2
(d) Perturbation 3
Figure 13: Perturbations of the 2nd layer post-activations