MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and Architectures

06/13/2020 ∙ by Jeongun Ryu, et al. ∙ 0

Regularization and transfer learning are two popular techniques to enhance generalization on unseen data, which is a fundamental problem of machine learning. Regularization techniques are versatile, as they are task- and architecture-agnostic, but they do not exploit a large amount of data available. Transfer learning methods learn to transfer knowledge from one domain to another, but may not generalize across tasks and architectures, and may introduce new training cost for adapting to the target task. To bridge the gap between the two, we propose a transferable perturbation, MetaPerturb, which is meta-learned to improve generalization performance on unseen data. MetaPerturb is implemented as a set-based lightweight network that is agnostic to the size and the order of the input, which is shared across the layers. Then, we propose a meta-learning framework, to jointly train the perturbation function over heterogeneous tasks in parallel. As MetaPerturb is a set-function trained over diverse distributions across layers and tasks, it can generalize to heterogeneous tasks and architectures. We validate the efficacy and generality of MetaPerturb trained on a specific source domain and architecture, by applying it to the training of diverse neural architectures on heterogeneous target datasets against various regularizers and fine-tuning. The results show that the networks trained with MetaPerturb significantly outperform the baselines on most of the tasks and architectures, with a negligible increase in the parameter size and no hyperparameters to tune.




1 Introduction

The success of Deep Neural Networks (DNNs) largely owes to their ability to accurately represent arbitrarily complex functions. However, at the same time, the excessive number of parameters, which enabled such expressive power, renders them susceptible to overfitting especially when we do not have a sufficient amount of data to ensure generalization. There are two popular techniques that can help with generalization of deep neural networks: transfer learning and regularization.

Transfer learning Tan et al. (2018) methods aim to overcome this data scarcity problem by transferring knowledge obtained from a source dataset to effectively guide the learning on the target task. Although existing transfer learning methods have proven very effective, they also have limitations. Firstly, their performance gain depends heavily on the similarity between the source and target domains, and knowledge transfer across different domains may be ineffective or may even degrade performance on the target task. Secondly, many transfer learning methods require the neural architectures for the source and the target tasks to be the same, as in the case of fine-tuning. Moreover, transfer learning methods usually require additional memory and computational cost for knowledge transfer. Many require storing the entire set of parameters of the source network (e.g. fine-tuning, LwF Li and Hoiem (2017), attention transfer Zagoruyko and Komodakis (2017)), and some require extra training to transfer the source knowledge to the target task Jang et al. (2019). Such restrictions make transfer learning unappealing, and thus few of these methods are used in practice except for simple fine-tuning of networks pre-trained on large datasets (e.g. convolutional networks pre-trained on ImageNet Russakovsky et al. (2015), BERT Devlin et al. (2019) trained on Wikipedia).

On the other hand, regularization techniques, which leverage human prior knowledge on the learning task to help with generalization, are more versatile as they are domain- and architecture-agnostic. Penalizing the norm of the weights Neyshabur et al. (2017a), dropping out random units or filters Srivastava et al. (2014); Ghiasi et al. (2018), normalizing the distribution of latent features at each input Ioffe and Szegedy (2015); Ulyanov et al. (2016); Wu and He (2018), and randomly mixing or perturbing samples Zhang et al. (2018); Verma et al. (2019) are instances of such domain-agnostic regularizations. They are favored in practice over transfer learning since they work with any architecture and do not incur the extra memory or computational overhead that many advanced transfer learning techniques require. However, regularization techniques are limited in that they do not exploit the rich information in the large amount of data available.

These limitations of transfer learning and regularization techniques motivate us to come up with a transferable regularization technique that can bridge the gap between the two approaches for enhancing generalization. Such a transferable regularizer should learn useful knowledge from the source task for regularization, while generalizing across different domains and architectures, with minimal extra cost. A recent work, Lee et al. (2020), proposes to meta-learn a noise generator for few-shot learning, to improve generalization on unseen tasks. Yet, the proposed gradient-based meta-learning scheme cannot scale to the standard learning setting, which requires a large number of steps to converge to good solutions, and it is inapplicable to architectures that differ from the source network architecture.

To overcome these difficulties, we propose a novel lightweight, scalable perturbation function that is meta-learned to improve generalization on unseen tasks and architectures for standard training (See Figure 1 for the concept). Our model generates regularizing perturbations to latent features, given the set of original latent features at each layer. Since it is implemented as an order-equivariant set function, it can be shared across layers and networks learned with different initializations. We meta-learn our perturbation function by a simple joint training over multiple subsets of the source dataset in parallel, which largely reduces the computational cost of meta-learning.

We validate the efficacy and efficiency of our transferable regularizer MetaPerturb by training it on a specific source dataset and applying the learned function to the training of heterogeneous architectures on a large number of datasets with varying degrees of task similarity. The results show that networks trained with our meta regularizer outperform recent regularization techniques and fine-tuning, and obtain substantially improved performance even on very different tasks on which fine-tuning fails. Also, since the optimal amount of perturbation is automatically learned at each layer, MetaPerturb does not have any hyperparameters, unlike most existing regularizers. Such effectiveness, efficiency, and versatility make our method an appealing transferable regularization technique that can replace or accompany fine-tuning and conventional regularization techniques.

Figure 1: Concepts. We learn our perturbation function at meta-training stage and use it to solve diverse meta-testing tasks that come with diverse network architectures.

The contribution of this paper is threefold:

  • We propose a lightweight and versatile perturbation function that can transfer the knowledge of a source task to heterogeneous target tasks and architectures.

  • We propose a novel meta-learning framework in the form of joint training, which allows meta-learning to be performed efficiently on large-scale datasets in the standard learning framework.

  • We validate our perturbation function on a large number of datasets and architectures, on which it successfully outperforms existing regularizers and finetuning.

2 Related Work

Transfer Learning

Transfer learning Tan et al. (2018) is one of the popular tools in deep learning to solve the data scarcity problem. The most widely used method in transfer learning is fine-tuning Sharif Razavian et al. (2014), which first trains parameters in the source domain and then uses them as the initial weights when learning for the target domain. ImageNet Russakovsky et al. (2015) pre-trained network weights are widely used for fine-tuning, achieving impressive performance on various computer vision tasks (e.g. semantic segmentation Long et al. (2015), object detection Girshick et al. (2014)). However, if the source and target domains are semantically different, fine-tuning may result in negative transfer Yosinski et al. (2014). Further, it is inapplicable when the target network architecture is different from that of the source network. Transfer learning frameworks often require extensive hyperparameter tuning (e.g. up to which layer to transfer, whether to fine-tune or not, etc.). Recently, Jang et al. (2019) proposed a framework to overcome this limitation, which can automatically learn what knowledge to transfer from the source network and between which layers to perform knowledge transfer. However, it requires a large amount of additional training for knowledge transfer, which limits its practicality. Most existing transfer learning methods aim to transfer the features themselves, which may result in negative or zero transfer when the source and the target domains are dissimilar. Contrary to existing frameworks, our framework transfers how to perturb the features in the latent space, which can yield performance gains even when the domains are dissimilar.

Regularization methods

Training with our input-dependent perturbation function is reminiscent of some existing input-dependent regularizers. Specifically, information bottleneck methods Tishby et al. (1999) with variational inference have an input-dependent form of perturbation function applied to both training and testing examples, as with ours. Variational Information Bottleneck Alemi et al. (2017) introduces additive noise, whereas Information Dropout Achille and Soatto (2018) applies multiplicative noise, as with ours. The critical difference from those existing regularizers is that our perturbation function is meta-learned, while they do not involve such knowledge transfer. A recently proposed meta-regularizer, Meta Dropout Lee et al. (2020), is relevant to ours as it learns to perturb the latent features of training examples for generalization. However, it specifically targets meta-level generalization in few-shot meta-learning, and does not scale to standard learning frameworks with a large number of inner gradient steps, as it runs on the MAML framework Finn et al. (2017). Meta Dropout also requires the noise generator to have the same architecture as the source network, which limits its practicality for large networks and makes it impossible to generalize over heterogeneous architectures.

Meta Learning

Our regularizer is meta-learned to generalize over heterogeneous tasks and architectures. Meta-learning Ioffe and Szegedy (2015) aims to learn common knowledge that can be shared over a distribution of tasks, such that the model can generalize to unseen tasks. While the literature on meta-learning is vast, we name a few works that are most relevant to ours. Finn et al. (2017) proposed the model-agnostic meta-learning (MAML) framework to find a shared initialization parameter that can be fine-tuned to obtain good performance on an unseen target task within a few gradient steps. The main difficulty in extending it to standard learning is that the number of inner-gradient steps becomes excessively large compared to few-shot learning problems. This led follow-up works to focus on reducing the computational cost of extending the inner-gradient steps Nichol et al. (2018); Flennerhag et al. (2019); Rajeswaran et al. (2019); Andrychowicz et al. (2016), but they still assume that we take at most hundreds of gradient steps from a shared initialization. On the other hand, Ren et al. (2018) and its variant Shu et al. (2019) propose to use an online approximation of the full inner-gradient steps, such that we look ahead only a single gradient step and the meta-parameter is optimized together with the main network parameters in an online manner. While effective for standard learning, they are still computationally inefficient due to the expensive bi-level optimization. In contrast, by resorting to simple joint training on fixed subsets of the dataset, we efficiently extend the meta-learning framework from few-shot learning to the standard learning framework for transfer learning.

3 Approach

In this section, we introduce our perturbation function that is applicable to any convolutional network architectures and to any image datasets. We then further explain our meta-learning framework for efficiently learning the proposed perturbation function in the standard learning framework.

3.1 Dataset and Network agnostic perturbation function

The conventional transfer learning method transfers the entire set or a subset of the main network parameters $\theta$. However, such parameter transfer may become ineffective when we transfer knowledge across a dissimilar pair of source and target tasks. Further, if we need to use a different neural architecture for the target task, it becomes simply inapplicable. Thus, we propose to focus on transferring another set of parameters $\phi$, which is disjoint from $\theta$ and is extremely lightweight. In this work, we let $\phi$ be the parameter of the perturbation function, which is learned to regularize latent features of convolutional neural networks. The important assumption here is that even if a disjoint pair of source and target tasks requires different feature extractors, there may exist some general rule of perturbation that can effectively regularize both feature extractors at the same time.

Another property that we want to impose upon our perturbation function is its general applicability to any convolutional neural network architectures. The perturbation function should be applicable to:

  • Neural networks with an arbitrary number of convolutional layers. We can solve this problem by allowing the function to be shared across the convolutional layers.

  • Convolutional layers with an arbitrary number of channels. We can tackle this problem either by sharing the function across channels or by using permutation-equivariant set encodings.

Figure 2: Left: The architecture of channel-wise permutation equivariant operation. Right: The architecture of channel-wise scaling function taking a batch of instances as an input.

3.2 MetaPerturb

We now describe our novel perturbation function, MetaPerturb that satisfies the above requirements. It consists of the following two components: input-dependent stochastic noise generator and batch-dependent scaling function.

Input-dependent stochastic noise generator

The first component is an input-dependent stochastic noise generator, which has been empirically shown by Lee et al. (2020) to often outperform its input-independent counterparts. To make the noise applicable to any convolutional layer, we propose to use a permutation equivariant set-encoding Zaheer et al. (2017) across the channels. This allows the generator to consider interactions between the feature maps at each layer while making the generated perturbations invariant to the re-orderings of channels caused by random initializations.

Zaheer et al. (2017) showed that a linear transformation $\mathbf{x} \mapsto W\mathbf{x}$ parameterized by a matrix $W \in \mathbb{R}^{C \times C}$ is permutation equivariant to the $C$ input elements iff the diagonal elements of $W$ are equal and the off-diagonal elements of $W$ are also equal, i.e. $W = \lambda I + \gamma \mathbf{1}\mathbf{1}^\top$ with $\lambda, \gamma \in \mathbb{R}$ and $\mathbf{1} = [1, \dots, 1]^\top$. The diagonal elements map each of the input elements to themselves, whereas the off-diagonal elements capture the interactions between the input elements.
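As a quick numerical check of this property, the sketch below applies a linear map whose diagonal entries are all equal and whose off-diagonal entries are all equal, without materializing the matrix, and compares "transform then permute" against "permute then transform". The names `equivariant_linear`, `lam`, and `gamma` are illustrative choices, not identifiers from the paper.

```python
def equivariant_linear(x, lam, gamma):
    """Apply W = lam * I + gamma * (all-ones matrix) to x without building W."""
    total = sum(x)  # the all-ones part sees only the sum of the inputs
    return [lam * xi + gamma * total for xi in x]


def permute(x, perm):
    """Reorder the elements of x according to the index list perm."""
    return [x[i] for i in perm]


x = [0.5, -1.0, 2.0, 3.5]
perm = [2, 0, 3, 1]
lam, gamma = 1.5, -0.25

# Equivariance: permuting the output equals transforming the permuted input.
out_then_perm = permute(equivariant_linear(x, lam, gamma), perm)
perm_then_out = equivariant_linear(permute(x, perm), lam, gamma)
```

Because the map depends on the other elements only through their sum, the same two scalars can be reused for inputs of any length, which is what lets such a function be shared across layers with different channel counts.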

Here, we propose an equivalent form for the convolution operation, such that the output feature maps are equivariant to channel-wise permutations of the input feature maps. We assume that the parameter of this operation consists of two types of kernels: one for the self-to-self convolution operation and one for the all-to-self convolution operation. We then combine the two kernels, in the same way as the diagonal and off-diagonal elements above, to produce a convolutional weight tensor for the full set of input and output channels (See Figure 2 (left)). Zaheer et al. (2017) also showed that a stack of multiple permutation equivariant operations is also permutation equivariant. Thus we stack two such layers with different parameters and a ReLU nonlinearity in between them in order to increase the flexibility of the noise generator (See Figure 2 (left)).
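The channel-wise equivariant convolution can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the 3×3 kernel size, zero padding, and the names `conv2d_same`, `equivariant_conv`, and `stacked_equivariant` are hypothetical, and the code is not the authors' implementation. Each output channel is the self-to-self kernel applied to its own input channel plus the all-to-self kernel applied to the sum of all channels.

```python
import numpy as np


def conv2d_same(x, k):
    """Single-channel 2-D cross-correlation with zero padding (output size = input size)."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros(x.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * xp[i:i + x.shape[0], j:j + x.shape[1]]
    return out


def equivariant_conv(h, k_self, k_all):
    """h: (C, H, W). Output channel c = k_self * h[c] + k_all * (sum over all channels)."""
    pooled = conv2d_same(h.sum(axis=0), k_all)
    return np.stack([conv2d_same(h[c], k_self) + pooled for c in range(h.shape[0])])


def stacked_equivariant(h, params):
    """Two equivariant convolutions with different kernels and a ReLU in between."""
    (k_self1, k_all1), (k_self2, k_all2) = params
    hidden = np.maximum(equivariant_conv(h, k_self1, k_all1), 0.0)
    return equivariant_conv(hidden, k_self2, k_all2)


rng = np.random.default_rng(0)
params = [(rng.normal(size=(3, 3)), rng.normal(size=(3, 3))) for _ in range(2)]
h = rng.normal(size=(5, 8, 8))
perm = rng.permutation(5)
```

The parameter count is independent of the number of channels, so the same `params` can be applied to a 5-channel and a 16-channel feature map alike.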

Finally, we sample the input-dependent stochastic noise from a distribution whose variance we fix following Lee et al. (2020), which seems to work well.

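Algorithm 1 Meta-training
Input: Learning rate
Randomly initialize the network parameters of each task and the shared perturbation parameter
while not converged do
     for each source task do
          Sample a training mini-batch and a test mini-batch.
          Compute the training loss w/ perturbation.
          Update the network parameters of the task with the gradient of the training loss.
          Compute the test loss w/ perturbation.
     end for
     Update the perturbation parameter with the gradient of the test losses.
end while

Algorithm 2 Meta-testing
Input: Learning rate
Randomly initialize the network parameters of the target task
while not converged do
     Sample a training mini-batch.
     Compute the training loss w/ perturbation.
     Update the network parameters with the gradient of the training loss.
end while
Evaluate the test examples with MC approximation and the learned network parameters.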
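The MC approximation in the last step of meta-testing can be read as averaging class probabilities over several stochastic forward passes. The sketch below illustrates that reading only: `mc_predict`, `num_samples`, and the toy two-class `toy_forward` model are hypothetical stand-ins, and the number of samples is an assumption.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mc_predict(stochastic_forward, x, num_samples=30):
    """Average class probabilities over num_samples noisy forward passes."""
    probs = None
    for _ in range(num_samples):
        p = stochastic_forward(x)
        probs = p if probs is None else [a + b for a, b in zip(probs, p)]
    return [v / num_samples for v in probs]


# Toy stand-in for a perturbed network: positive multiplicative noise on two features.
rng = random.Random(0)


def toy_forward(x):
    noisy = [xi * math.log1p(math.exp(rng.gauss(0.0, 1.0))) for xi in x]
    return softmax([noisy[0] - noisy[1], noisy[1] - noisy[0]])


avg_probs = mc_predict(toy_forward, [2.0, 0.5])
```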

Batch-dependent scaling function

The next component is the batch-dependent scaling function, which scales each channel to a different value between 0 and 1 for the given batch of examples. The assumption here is that the optimal amount of parameter usage for each channel should be controlled differently for each dataset by using a soft multiplicative gating mechanism. In Figure 2 (right), at training time, we first collect the examples in a batch, apply convolution, and then apply global average pooling (GAP) for each channel to extract a vector representation of the channel. We then compute statistics of these representations, such as the mean and diagonal covariance over the batch, and further concatenate layer information, such as the number of channels and the width (or equivalently, the height), to the statistics. We finally generate the scale of each channel with a shared affine transformation and a sigmoid function, and collect the scales into a single vector. At testing time, instead of using batch-wise scales, we use global scales accumulated by a moving average at training time, similarly to batch normalization Ioffe and Szegedy (2015).
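A minimal NumPy sketch of such a scaling function is given below. Several details are assumptions made for illustration and are not taken from the paper: the 3×3 filters applied to each channel separately, the descriptor size `D`, feeding the raw channel count and width as layer information, and the momentum value. The names `channel_scales` and `update_running_scales` are hypothetical.

```python
import numpy as np


def channel_scales(h, filters, w, b):
    """Map feature maps h of shape (B, C, H, W) to C scales in (0, 1).

    filters: (D, 3, 3) kernels applied to every channel separately (assumed shape).
    w: (2 * D + 2,) weights and b: scalar bias of the affine map shared by all channels.
    """
    B, C, H, W = h.shape
    D = filters.shape[0]
    padded = np.pad(h, ((0, 0), (0, 0), (1, 1), (1, 1)))
    # GAP of a convolution equals a kernel-weighted sum of shifted-window means.
    feats = np.zeros((B, C, D))
    for i in range(3):
        for j in range(3):
            window_mean = padded[:, :, i:i + H, j:j + W].mean(axis=(2, 3))
            feats += window_mean[:, :, None] * filters[:, i, j]
    # Batch statistics per channel, concatenated with layer information.
    stats = np.concatenate(
        [feats.mean(axis=0), feats.var(axis=0),
         np.full((C, 1), float(C)), np.full((C, 1), float(W))], axis=1)
    return 1.0 / (1.0 + np.exp(-(stats @ w + b)))


def update_running_scales(running, batch_scales, momentum=0.9):
    """Moving average of the scales, used in place of batch-wise scales at test time."""
    return momentum * running + (1.0 - momentum) * batch_scales


rng = np.random.default_rng(0)
filters = rng.normal(size=(4, 3, 3))
w = 0.01 * rng.normal(size=(10,))
s16 = channel_scales(rng.normal(size=(8, 16, 8, 8)), filters, w, 0.0)
s32 = channel_scales(rng.normal(size=(8, 32, 4, 4)), filters, w, 0.0)
```

The same `filters`, `w`, and `b` produce 16 scales for a 16-channel layer and 32 scales for a 32-channel layer, since every channel is processed by the same small function.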

Figure 3: The architecture of our perturbation function applicable to any convolutional neural networks (e.g. ResNet)

Final form

We lastly combine the noise $z$ and the scales $s$ to obtain the following form of the perturbation $g_\phi(h)$:

$$g_\phi(h) = s \circ z$$

where $\circ$ denotes channel-wise multiplication. We then multiply $g_\phi(h)$ back to the input feature maps $h$, at every layer (every block for ResNet He et al. (2016)) of the network (See Figure 3). Note that the cost of knowledge transfer is marginal thanks to the small dimensionality of the perturbation parameter $\phi$ (e.g. 82). Further, there is no hyperparameter to tune, since the optimal amount of the two perturbations is meta-learned and automatically decided for each layer and channel.
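How the two components are combined and applied to a feature map can be sketched as below. The noise distribution here is an assumption chosen for illustration (a unit-variance Gaussian passed through a softplus so that the multiplicative noise stays positive); `apply_perturbation` and its argument names are hypothetical.

```python
import numpy as np


def softplus(a):
    """Numerically stable log(1 + exp(a))."""
    return np.logaddexp(0.0, a)


def apply_perturbation(h, mean, scales, rng):
    """h, mean: (B, C, H, W) feature maps and noise means; scales: (C,) values in [0, 1]."""
    a = mean + rng.standard_normal(mean.shape)   # assumed: Gaussian with fixed unit variance
    z = softplus(a)                              # assumed: positive multiplicative noise
    g = scales[None, :, None, None] * z          # channel-wise multiplication of scale and noise
    return h * g                                 # multiply the perturbation back to the features


rng = np.random.default_rng(0)
h = rng.normal(size=(2, 3, 4, 4))
out = apply_perturbation(h, np.zeros_like(h), np.array([1.0, 0.5, 0.0]), rng)
```

A scale of zero switches a channel off entirely, while a scale of one leaves only the stochastic noise acting on it.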

3.3 Meta-learning framework

The next important question is how to efficiently meta-learn the parameter of the perturbation function. There are two challenges: 1) Because of the large size of each source task, it is costly to sequentially alternate between the tasks within a single GPU, unlike few-shot learning where each task is sufficiently small. 2) The computational cost of the lookahead operation and the second-order derivative in the online approximation proposed by Ren et al. (2018) is still too expensive.

Distributed meta-learning

To solve the first problem, we class-wisely divide the source dataset to generate $T$ tasks with fixed samples and distribute them across multiple GPUs for parallel learning of the tasks. Then, throughout the entire meta-training phase, we only need to share the low-dimensional (e.g. 82-dimensional) meta parameter between the GPUs, without sequentially alternating training over the tasks. This way of meta-learning is simple yet novel, and scales with the number of tasks given a sufficient number of GPUs.
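A class-wise split of this kind can be sketched in a few lines. The function name `classwise_split`, the round-robin assignment of shuffled classes, and the toy label list are assumptions for illustration; the paper does not specify how classes are assigned to tasks.

```python
import random


def classwise_split(labels, num_tasks, seed=0):
    """Partition class ids into num_tasks disjoint groups.

    Returns (groups, tasks): the class ids of each task and the sample indices of each task.
    """
    classes = sorted(set(labels))
    random.Random(seed).shuffle(classes)
    groups = [classes[t::num_tasks] for t in range(num_tasks)]
    class_to_task = {c: t for t, group in enumerate(groups) for c in group}
    tasks = [[] for _ in range(num_tasks)]
    for idx, y in enumerate(labels):
        tasks[class_to_task[y]].append(idx)
    return groups, tasks


# Toy dataset: 200 classes with 5 samples each, split into 10 tasks.
labels = [i % 200 for i in range(1000)]
groups, tasks = classwise_split(labels, num_tasks=10)
```

Each task then owns a disjoint set of classes and can be trained on its own device, with only the shared perturbation parameter exchanged between devices.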

Knowledge transfer at the limit of convergence

To solve the second problem, we propose to further approximate the online approximation Ren et al. (2018) by simply ignoring the bi-level optimization and the corresponding second-order derivative. This means we focus on knowledge transfer across the tasks only at the limit of the convergence of the tasks. Toward this goal, we propose to perform a joint optimization of the main network parameters $\theta_1, \dots, \theta_T$ of the $T$ tasks and the perturbation parameter $\phi$, which maximize the log likelihood of the training dataset $(\mathbf{X}_t^{\text{tr}}, \mathbf{y}_t^{\text{tr}})$ and the test dataset $(\mathbf{X}_t^{\text{te}}, \mathbf{y}_t^{\text{te}})$ of each task $t$, respectively:

$$\phi^*, \theta_1^*, \dots, \theta_T^* = \arg\max_{\phi, \theta_1, \dots, \theta_T} \sum_{t=1}^{T} \Big\{ \log p\big(\mathbf{y}_t^{\text{te}} \mid \mathbf{X}_t^{\text{te}}; \text{StopGrad}(\theta_t), \phi\big) + \log p\big(\mathbf{y}_t^{\text{tr}} \mid \mathbf{X}_t^{\text{tr}}; \theta_t, \text{StopGrad}(\phi)\big) \Big\}$$

where $\text{StopGrad}(x)$ denotes that we do not compute the gradient and consider $x$ as constant. See Algorithms 1 and 2 for meta-training and meta-testing, respectively. The intuition is that, even with this naive approximation, the final $\phi$ will be transferable if we confine the limit of transfer to around the convergence, since we know that $\phi$ has already satisfied the desired property at the end of the convergence of the multiple meta-training tasks. It is natural to expect a similar consequence at meta-test time if we let the novel task jointly converge with the meta-learned $\phi$. We empirically verified that gradually increasing the strength of our perturbation function performs much better than training without such annealing, which suggests that the knowledge transfer may be less effective at the early stage of training but becomes more effective at later steps, i.e. near convergence. This naive approximation largely reduces the computational cost of meta-training.
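The joint update with stopped gradients can be illustrated on a deliberately tiny problem. Everything in the sketch below is a toy assumption: each task is a scalar regression, the "perturbation" is a shared sigmoid gate on the input, gradients are written by hand, and full batches replace mini-batches. It only shows the update pattern: each task parameter follows the gradient of its training loss with the shared parameter held constant, and the shared parameter follows the averaged gradient of the test losses with the task parameters held constant, with no second-order terms.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def mse(theta, phi, data):
    """Mean squared error of the gated model: prediction = theta * sigmoid(phi) * x."""
    s = sigmoid(phi)
    return sum((theta * s * x - y) ** 2 for x, y in data) / len(data)


def grad_theta(theta, phi, data):
    """Gradient w.r.t. theta, treating phi as a constant (StopGrad on phi)."""
    s = sigmoid(phi)
    return sum(2 * (theta * s * x - y) * s * x for x, y in data) / len(data)


def grad_phi(theta, phi, data):
    """Gradient w.r.t. phi, treating theta as a constant (StopGrad on theta)."""
    s = sigmoid(phi)
    d_s = sum(2 * (theta * s * x - y) * theta * x for x, y in data) / len(data)
    return d_s * s * (1 - s)


def meta_train(tasks, lr=0.1, steps=500):
    thetas, phi = [0.0] * len(tasks), 0.0
    for _ in range(steps):
        phi_grad = 0.0
        for t, (train, test) in enumerate(tasks):
            thetas[t] -= lr * grad_theta(thetas[t], phi, train)  # task update on training data
            phi_grad += grad_phi(thetas[t], phi, test)           # shared gradient on test data
        phi -= lr * phi_grad / len(tasks)                        # one shared update per step
    return thetas, phi


rng = random.Random(0)


def make_task(w, n=32):
    """Noise-free toy task y = w * x, split into training and test halves."""
    data = [(x, w * x) for x in (rng.uniform(-1, 1) for _ in range(2 * n))]
    return data[:n], data[n:]


tasks = [make_task(w) for w in (-2.0, 0.5, 1.0, 3.0)]
initial = [mse(0.0, 0.0, train) for train, _ in tasks]
thetas, phi = meta_train(tasks)
final = [mse(th, phi, train) for (train, _), th in zip(tasks, thetas)]
```

In a distributed setting, the inner loop over tasks would run in parallel and only `phi_grad` would need to be communicated.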

4 Experiments

Model                                    # Transfer params  Source dataset  STL10       s-CIFAR100  Dogs        Cars        Aircraft    CUB
Base                                     0                  None            66.78±0.59  31.79±0.24  34.65±1.05  44.34±1.10  59.23±0.95  30.63±0.66
Info. Dropout Achille and Soatto (2018)  0                  None            67.46±0.17  32.32±0.33  34.63±0.68  43.13±2.31  58.59±0.90  30.83±0.79
DropBlock Ghiasi et al. (2018)           0                  None            68.51±0.67  32.74±0.36  34.59±0.87  45.11±1.47  59.76±1.38  30.55±0.26
Manifold Mixup Verma et al. (2019)       0                  None            72.83±0.69  39.06±0.73  36.29±0.70  48.97±1.69  64.35±1.23  37.80±0.53
MetaPerturb                              82                 TIN             69.79±0.60  34.47±0.45  38.55±0.51  62.49±0.96  66.12±0.70  39.94±1.30
Finetuning (FT)                          0.3M               TIN             77.16±0.41  43.69±0.22  40.09±0.31  58.61±1.16  66.03±0.85  34.89±0.30
FT + Info. Dropout                       0.3M + 0           TIN             77.41±0.13  43.92±0.44  40.04±0.46  58.07±0.57  65.47±0.27  35.55±0.81
FT + DropBlock                           0.3M + 0           TIN             78.32±0.31  44.84±0.37  40.54±0.56  61.08±0.61  66.30±0.84  34.61±0.54
FT + Manif. Mixup                        0.3M + 0           TIN             79.60±0.27  47.92±0.79  42.54±0.70  64.81±0.97  71.53±0.80  43.07±0.83
FT + MetaPerturb                         0.3M + 82          TIN             78.40±0.18  46.60±0.32  45.24±0.22  72.48±0.08  73.00±0.66  46.90±0.49

Table 1: Transfer to multiple datasets. Columns STL10 through CUB are the target datasets. Source and target networks are ResNet20. TIN: Tiny ImageNet.
Figure 4: Convergence plots on Aircraft Maji et al. (2013) and Stanford Cars Krause et al. (2013) datasets.

We next validate our method on realistic learning scenarios where the target task can come with arbitrary image datasets and arbitrary convolutional network architectures. For the base regularizations, we apply weight decay, random cropping, and horizontal flipping in all our experiments.

4.1 Transfer to multiple datasets

We first validate if our meta-learned perturbation function can generalize to multiple target datasets.


We use Tiny ImageNet [1] as the source dataset, which is a subset of the ImageNet Russakovsky et al. (2015) dataset. It consists of 64×64 images from 200 classes, with 500 training images for each class. We class-wisely split the dataset into multiple splits to produce heterogeneous task samples. We then transfer our perturbation function to the following target tasks: STL10 Coates et al. (2011), CIFAR-100 Krizhevsky et al. (2009), Stanford Dogs Khosla et al. (2011), Stanford Cars Krause et al. (2013), Aircraft Maji et al. (2013), and CUB Wah et al. (2011). STL10 and CIFAR-100 are benchmark classification datasets of general categories, which are similar to the source dataset. The other datasets are for fine-grained classification, and thus quite dissimilar from the source dataset. We resize the images of the fine-grained classification datasets. Lastly, for CIFAR-100, we sub-sample images from the original training set in order to simulate a data-scarce scenario (hence the prefix s-). See the Appendix for more detailed information on the datasets.


We compare our model with the following well-known stochastic regularizers. We carefully tuned the hyperparameters of each baseline with a holdout validation set for each dataset. Note that MetaPerturb does not have any hyperparameters.

  • Information Dropout: This model Achille and Soatto (2018) is an instance of the Information Bottleneck (IB) method Tishby et al. (1999), where the bottleneck variable is defined as a multiplicative perturbation, as with ours.

  • DropBlock: This model Ghiasi et al. (2018) is a type of structured dropout Srivastava et al. (2014) specifically developed for convolutional networks, which randomly drops out units in a contiguous region of a feature map together.

  • Manifold Mixup: A recently introduced stochastic regularizer Verma et al. (2019) that randomly pairs training examples and linearly interpolates between their latent features.

We also compare with Base and Finetuning, which have no regularizer added.


Table 1 shows that our MetaPerturb regularizer significantly outperforms the baselines on most of the datasets with only 82 dimensions of parameters transferred. MetaPerturb is especially effective on the fine-grained datasets. This is because the generated perturbations help the network focus on the relevant parts of the input by injecting noise into, or downweighting the scale of, the distracting parts of the input. Our model also outperforms the baselines by significant margins on most datasets when used along with finetuning from the source dataset (Tiny ImageNet). All these results demonstrate that our model can effectively regularize networks trained on unseen tasks from heterogeneous task distributions. Figure 4 shows that MetaPerturb converges better than the baselines in terms of test loss and accuracy.

Model                Source network  Conv4       Conv6       VGG9        ResNet20    ResNet44    WRN-28-2
Base                 None            83.93±0.20  86.14±0.23  88.44±0.29  87.96±0.30  88.94±0.41  88.95±0.44
Information Dropout  None            84.91±0.34  87.23±0.26  88.29±1.18  88.46±0.65  89.33±0.20  89.51±0.29
DropBlock            None            84.29±0.24  86.22±0.26  88.68±0.35  89.43±0.26  90.14±0.18  90.55±0.25
Finetuning           Same            84.00±0.27  86.56±0.23  88.17±0.18  88.77±0.26  89.62±0.05  89.85±0.31
MetaPerturb          ResNet20        86.59±0.29  88.79±0.11  90.20±0.11  90.42±0.27  91.41±0.13  90.90±0.24

Table 2: Transfer to multiple networks. Columns Conv4 through WRN-28-2 are the target networks. Source dataset is Tiny ImageNet and target dataset is small-SVHN. For the Finetuning baseline, we match the source and target networks since it cannot be applied to different networks.
Figure 5: (a-c) Adversarial robustness against PGD attack with varying radius. (d) Calibration plot.

4.2 Transfer to multiple networks

We next validate if our meta-learned perturbation can generalize to multiple network architectures.

Dataset and Networks

We use a small version of the SVHN dataset Netzer et al. (2011). We use networks with 4 or 6 convolutional layers (Conv4 Vinyals et al. (2016) and Conv6), VGG9 (a small version of VGG Simonyan and Zisserman (2015) used in Simonyan et al. (2014)), ResNet20, ResNet44 He et al. (2016) and Wide ResNet 28-2 Zagoruyko and Komodakis (2016).


Table 2 shows that our MetaPerturb regularizer significantly outperforms the baselines on all the network architectures we considered. Note that although the source network is fixed as ResNet20 during meta-training, the statistics of the layers are very diverse, such that the shared perturbation function learns to generalize over diverse input statistics. We conjecture that such sharing across layers is the reason MetaPerturb effectively generalizes to diverse target networks.

Group                            Variant                        s-CIFAR100  Aircraft    CUB
                                 Base                           31.79±0.24  59.23±0.95  30.63±0.66
(a) Components of perturbation   w/o channel-wise scaling       33.71±0.46  61.74±0.76  31.46±0.44
(a) Components of perturbation   w/o stochastic noise           20.22±0.93  45.82±2.69  14.86±2.60
(b) Location of perturbation     Only before pooling            32.92±0.33  59.30±0.96  33.52±0.61
(b) Location of perturbation     Only at top layers             32.54±0.19  53.42±0.79  27.70±0.68
(b) Location of perturbation     Only at bottom layers          31.75±0.97  61.93±0.86  31.40±0.24
(c) Meta-training strategy       Homogeneous task distribution  34.16±0.77  61.26±0.24  33.04±0.85
                                 MetaPerturb                    34.47±0.45  66.12±0.70  39.94±1.30

Table 3: Ablation study.
Figure 6: The scale at each block of ResNet20.

4.3 Adversarial robustness and calibration performance


Figure 5(a-c) shows that, unlike the typical adversarial training methods based on PGD attack Madry et al. (2018) (adversarial baselines in Figure 5(a-c)), MetaPerturb improves both clean accuracy and adversarial robustness against all three attacks, without explicit adversarial training. Figure 5(d) shows that our MetaPerturb also improves the calibration performance in terms of the expected calibration error (ECE Naeini et al. (2015)) and the calibration plot, while other regularizers do not.
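For reference, the expected calibration error is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below follows that common definition; the equal-width binning, the default of 10 bins, and the function name are assumptions, since the binning used for Figure 5(d) is not stated here.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE with equal-width confidence bins.

    confidences: predicted probability of the chosen class for each example.
    correct: 1 if the prediction was right, 0 otherwise.
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)  # confidence 1.0 goes to the last bin
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


calibrated = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
overconfident = expected_calibration_error([1.0, 1.0], [1, 0])
```

A model that is right three times out of four at 75% confidence has zero error, whereas one that is right half the time at 100% confidence has an error of 0.5.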

Figure 7: Visualization of training loss surface Li et al. (2018) (CUB, ResNet20). Panels: (a) Base, (b) DropBlock, (c) Manifold Mixup, (d) MetaPerturb.

Qualitative analysis

Figure 6 shows the learned scale across the layers for each dataset. We see that the scales for each channel and layer are generated differently for each dataset according to what has been learned in the meta-training stage. Whereas the amount of penalization at the lower layers is nearly constant across the datasets, the amount of perturbation at the upper layers is highly variable; for example, the fine-grained datasets (e.g. Aircraft and CUB) do not penalize the upper-layer feature activations much. Figure 7 shows that the MetaPerturb and Manifold Mixup models have flatter loss surfaces than the baselines. It is known that a flatter loss surface is closely related to generalization performance Keskar et al. (2017); Neyshabur et al. (2017b), which partly explains why our model generalizes well.

Ablation study

(a) Components of the perturbation function: In Table 3(a), we can see that both components of our perturbation function, the input-dependent stochastic noise and the channel-wise scaling jointly contribute to the good performance of our MetaPerturb regularizer.
(b) Location of the perturbation function: Also, in order to find appropriate location of the perturbation function, we tried applying it to various parts of the networks in Table 3(b) (e.g. only before pooling layers or only at top/bottom layers). We can see that applying the function to a smaller subset of layers largely underperforms applying it to all the ResNet blocks as done with MetaPerturb.
(c) Source task distribution: Lastly, in order to verify the importance of a heterogeneous task distribution, we compare with a homogeneous task distribution obtained by splitting the source dataset across the instances, rather than across the classes as done with MetaPerturb. We see that this results in large performance degradation on the fine-grained classification datasets, since the lack of diversity prevents the perturbation function from effectively extrapolating to the fine granularity of the target tasks.

5 Conclusion

We proposed a lightweight perturbation function that can transfer the knowledge of a source task to any convolutional architecture and image dataset, bridging the gap between regularization methods and transfer learning. This is done by implementing the noise generator as a permutation-equivariant set function that is shared across different layers of deep neural networks, and meta-learning it. To scale up meta-learning to standard learning frameworks, we proposed a simple yet effective meta-learning approach, which divides the dataset into multiple subsets and trains the noise generator jointly over the subsets, to regularize networks with different initializations. With extensive experimental validation on multiple architectures and tasks, we show that MetaPerturb trained on a single source task and architecture significantly improves the generalization of unseen architectures on unseen tasks, largely outperforming advanced regularization techniques and fine-tuning. MetaPerturb is highly practical as it requires a negligible increase in the parameter size, with no adaptation cost and no hyperparameter tuning. We believe that with such effectiveness, versatility and practicality, our regularizer has the potential to become a standard tool for regularization.

Broader Impact

Our MetaPerturb regularizer effectively eliminates the need for retraining on the source task because it can generalize to any convolutional neural architecture and to any image dataset. This versatility is extremely helpful for lowering the energy consumption and training time required in transfer learning, because in the real world there exist extremely diverse learning scenarios that we have to deal with. Previous transfer learning or meta-learning methods have not been flexible and versatile enough to solve those diverse large-scale problems simultaneously, but our model can efficiently improve the performance with a single meta-learned regularizer. Also, MetaPerturb efficiently extends previous meta-learning to standard learning frameworks by avoiding the expensive bi-level optimization. This reduces the computational cost of meta-training, which in turn further reduces the energy consumption and training time.

References


  • [1] Note: Cited by: §C.1, §4.1.
  • [2] A. Achille and S. Soatto (2018) Information Dropout: Learning Optimal Representations Through Noisy Computation. In TPAMI, Cited by: Table 4, §2, §4.1, Table 1.
  • [3] A. Alemi, I. Fischer, J. Dillon, and K. Murphy (2017) Deep Variational Information Bottleneck. In ICLR, Cited by: §2.
  • [4] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas (2016) Learning to learn by gradient descent by gradient descent. In NIPS, Cited by: §2.
  • [5] A. Coates, A. Ng, and H. Lee (2011) An Analysis of Single-Layer Networks in Unsupervised Feature Learning. In AISTATS, Cited by: Appendix A, §C.2, §4.1.
  • [6] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In ACL, Cited by: §1.
  • [7] C. Finn, P. Abbeel, and S. Levine (2017) Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In ICML, Cited by: §2, §2.
  • [8] S. Flennerhag, P. G. Moreno, N. Lawrence, and A. Damianou (2019) Transferring Knowledge across Learning Processes. In ICLR, Cited by: §2.
  • [9] G. Ghiasi, T. Lin, and Q. V. Le (2018) Dropblock: A regularization method for convolutional networks. In NIPS, Cited by: Table 4, §1, §4.1, Table 1.
  • [10] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In CVPR, Cited by: §2.
  • [11] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In ICML, Cited by: Appendix A.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep Residual Learning for Image Recognition. In CVPR, Cited by: §C.3, §C.3, §3.2, §4.2.
  • [13] S. Ioffe and C. Szegedy (2015) Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML, Cited by: §1, §2, §3.2.
  • [14] Y. Jang, H. Lee, S. J. Hwang, and J. Shin (2019) Learning What and Where to Transfer. In ICML, Cited by: §1, §2.
  • [15] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang (2017) On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. In ICLR, Cited by: §4.3.
  • [16] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei (2011) Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, CVPR, Cited by: §C.2, §4.1.
  • [17] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §C.4.
  • [18] J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013) 3D object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Cited by: §C.2, Figure 4, §4.1.
  • [19] A. Krizhevsky, G. Hinton, et al. (2009) Learning Multiple Layers of features from Tiny Images. Cited by: §C.2, §4.1.
  • [20] H. Lee, T. Nam, E. Yang, and S. J. Hwang (2020) Meta Dropout: Learning to Perturb Latent Features for Generalization. In ICLR, Cited by: §1, §2, §3.2, §3.2.
  • [21] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein (2018) Visualizing the Loss Landscape of Neural Nets. In NIPS, Cited by: Figure 7.
  • [22] Z. Li and D. Hoiem (2017) Learning without Forgetting. In TPAMI, Cited by: §1.
  • [23] J. Long, E. Shelhamer, and T. Darrell (2015) Fully Convolutional Networks for Semantic Segmentation. In CVPR, Cited by: §2.
  • [24] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards Deep Learning Models Resistant to Adversarial Attacks. In ICLR, Cited by: Figure 8, §4.3.
  • [25] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi (2013) Fine-Grained Visual Classification of Aircraft. arXiv preprint arXiv:1306.5151. Cited by: §C.2, Figure 4, §4.1.
  • [26] M. P. Naeini, G. Cooper, and M. Hauskrecht (2015) Obtaining Well Calibrated Probabilities Using Bayesian Binning. In AAAI, Cited by: Appendix A, §4.3.
  • [27] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading Digits in Natural Images with Unsupervised Feature Learning. Cited by: §C.2, §4.2.
  • [28] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro (2017) Exploring Generalization in Deep Learning. In NIPS, Cited by: §1.
  • [29] B. Neyshabur, S. Bhojanapalli, D. Mcallester, and N. Srebro (2017) Exploring generalization in deep learning. In NIPS, Cited by: §4.3.
  • [30] A. Nichol, J. Achiam, and J. Schulman (2018) On First-Order Meta-Learning Algorithms. arXiv e-prints. Cited by: §2.
  • [31] A. Rajeswaran, C. Finn, S. M. Kakade, and S. Levine (2019) Meta-Learning with Implicit Gradients. In NeurIPS, Cited by: §2.
  • [32] M. Ren, W. Zeng, B. Yang, and R. Urtasun (2018) Learning to Reweight Examples for Robust Deep Learning. ICML. Cited by: §2, §3.3, §3.3.
  • [33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) ImageNet Large Scale Visual Recognition Challenge. IJCV. Cited by: §C.1, §1, §2, §4.1.
  • [34] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson (2014) CNN Features off-the-shelf: an Astounding Baseline for Recognition. In CVPR, Cited by: §2.
  • [35] J. Shu, Q. Xie, L. Yi, Q. Zhao, S. Zhou, Z. Xu, and D. Meng (2019) Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting. In NeurIPS, Cited by: §2.
  • [36] K. Simonyan, A. Vedaldi, and A. Zisserman (2014) Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. In ICLR Workshop, Cited by: §4.2.
  • [37] K. Simonyan and A. Zisserman (2015) Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, Cited by: §C.3, §C.3, §4.2.
  • [38] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR 15, pp. 1929–1958. Cited by: §1, §4.1.
  • [39] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu (2018) A Survey on Deep Transfer Learning. In ICANN, Cited by: §1, §2.
  • [40] N. Tishby, F. C. Pereira, and W. Bialek (1999) The Information Bottleneck Method. In Annual Allerton Conference on Communication, Control and Computing, Cited by: §2, §4.1.
  • [41] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv preprint arXiv:1607.08022. Cited by: §1.
  • [42] V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, D. Lopez-Paz, and Y. Bengio (2019) Manifold Mixup: Better Representations by Interpolating Hidden States. In ICML, Cited by: Table 4, §1, §4.1, Table 1.
  • [43] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching Networks for One Shot Learning. In NIPS, Cited by: §C.3, §4.2.
  • [44] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The Caltech-UCSD Birds-200-2011 Dataset. Technical report Technical Report CNS-TR-2011-001, California Institute of Technology. Cited by: §C.2, §4.1.
  • [45] Y. Wu and K. He (2018) Group Normalization. In ECCV, Cited by: §1.
  • [46] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson (2014) How transferable are features in deep neural networks?. In NIPS, Cited by: §2.
  • [47] S. Zagoruyko and N. Komodakis (2016) Wide Residual Networks. In BMVC, Cited by: §C.3, §4.2.
  • [48] S. Zagoruyko and N. Komodakis (2017) Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer. In ICLR, Cited by: §1.
  • [49] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola (2017) Deep Sets. In NIPS, Cited by: §3.2, §3.2, §3.2.
  • [50] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2018) mixup: Beyond Empirical Risk Minimization. In ICLR, Cited by: §1.

Appendix A More Results and Analysis on Robustness and Calibration


In Figure 8, we measure the adversarial robustness with the additional dataset, STL10 [5]. We use PGD attack of steps with some range of and the inner-learning rate is set to for and attack and for attack. We observe that the baseline regularizers are not as robust against PGD attacks as our method, meaning that it is not easy to defend against PGD attacks without explicit adversarial training. However, our MetaPerturb provides an efficient way of doing so. We also compare with adversarial training baselines, which take projected gradient descent steps during training. See Figure 8 for the value used for adversarial training for each dataset. We can see that whereas adversarial training is beneficial for the adversarial accuracies, it largely degrades the clean accuracies. On the other hand, our MetaPerturb regularizer improves both clean accuracy and adversarial robustness over the base model, even without explicit adversarial training.
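For readers unfamiliar with the attack, the projected gradient descent (PGD) loop of [24] can be sketched as below. This is only an illustration under stated assumptions: it attacks a toy logistic-regression model instead of a deep network, it uses an L-infinity bound chosen purely for the example, and the values of `eps`, `step_size` (the inner learning rate) and `num_steps` are placeholders, not the settings used in the experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(x, y, w, b):
    """Mean binary cross-entropy of a logistic-regression model."""
    p = sigmoid(x @ w + b)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def pgd_linf(x, y, w, b, eps, step_size, num_steps):
    """L-infinity PGD: repeated signed-gradient ascent, projected onto the eps-ball."""
    x_adv = x.copy()
    for _ in range(num_steps):
        p = sigmoid(x_adv @ w + b)
        grad = (p - y)[:, None] * w[None, :]        # gradient of the loss w.r.t. the input
        x_adv = x_adv + step_size * np.sign(grad)   # ascent step on the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)    # projection back into the eps-ball
    return x_adv

# Toy data and model (assumed values).
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))
y = (rng.random(8) > 0.5).astype(float)
w = rng.normal(size=5)
b = 0.1

eps = 0.1
x_adv = pgd_linf(x, y, w, b, eps=eps, step_size=0.025, num_steps=10)
print("clean loss:      ", cross_entropy(x, y, w, b))
print("adversarial loss:", cross_entropy(x_adv, y, w, b))
```

Adversarial training, the baseline mentioned above, would run such a loop inside every training step and fit the model on `x_adv`, which is what makes it expensive.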


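In the main paper, we showed that the predictions with the MetaPerturb regularizer are better calibrated than those of the baselines. In this section, we provide more results and analysis of calibration on various datasets. Calibration performance is frequently quantified with the Expected Calibration Error (ECE) [26]. ECE is computed by dividing the confidence values into multiple bins and averaging, over all the bins, the gap between the actual accuracy and the confidence value. Formally, it is defined as

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|,$$

where $B_m$ is the set of test examples whose confidence falls into the $m$-th of $M$ bins, $N$ is the total number of test examples, and $\mathrm{acc}(B_m)$ and $\mathrm{conf}(B_m)$ are the accuracy and the mean confidence within $B_m$.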
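The binning-and-averaging procedure described above can be sketched in NumPy as follows. The function name `expected_calibration_error`, the use of equal-width bins, and the default of 15 bins are assumptions made for illustration; the binning actually used for Table 4 is not stated in this section.

```python
import numpy as np

def expected_calibration_error(confidences, correct, num_bins=15):
    """ECE with equal-width confidence bins, each weighted by its share of samples."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    # Bin m covers (edges[m], edges[m + 1]]; a confidence of exactly 0 falls into bin 0.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for m in range(num_bins):
        in_bin = bin_ids == m
        if in_bin.any():
            accuracy = correct[in_bin].mean()
            confidence = confidences[in_bin].mean()
            ece += in_bin.mean() * abs(accuracy - confidence)
    return ece

# Toy predictions (assumed): the maximum softmax probability and whether it was right.
conf = [0.9, 0.9, 0.6, 0.6]
hit = [1, 0, 1, 1]
print(expected_calibration_error(conf, hit, num_bins=10))
```

In this toy case the two 0.9-confidence predictions are right only half of the time and the two 0.6-confidence predictions are always right, so both occupied bins have a gap of 0.4 and the weighted average is 0.4 up to floating-point rounding.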


Table 4 and Figure 9 show that MetaPerturb produces better-calibrated confidence scores than the baselines on most of the datasets. We conjecture that this is because the parameter of the perturbation function has been meta-learned to lower the negative log-likelihood (NLL) of the test set, similarly to temperature scaling [11] or other popular calibration methods. In other words, we argue that the meta-learning objective is inherently good for calibration, because it learns to lower the test NLL.

Appendix B Visualizations of Perturbation Function

In this section, we visualize the feature maps of various datasets before and after passing them through the perturbation function. We use a ResNet20 network for visualization. We visualize the feature maps from the top to the bottom layers in order to cover the different levels of the network. Although the results are not straightforward to interpret, we can roughly observe that the activation strengths are suppressed by the scale , and see how the stochastic noise transforms the original feature maps.

Model                 # Transfer  Source   Target dataset
                      params      dataset  STL10       s-CIFAR100  Dogs        Cars        Aircraft    CUB
Base                  0           None     23.36±1.10  33.09±0.50  8.40±0.66   9.78±0.72   10.37±0.92  21.77±0.80
Finetuning            0.3M        TIN      15.68±0.40  29.78±0.33  11.41±0.18  7.00±0.84   8.04±0.65   23.05±0.31
Info. Dropout [2]     0           None     22.87±0.28  32.78±0.21  8.27±0.80   8.84±0.77   9.99±1.15   20.41±0.34
DropBlock [9]         0           None     19.65±0.50  28.70±0.17  5.89±0.71   5.83±1.02   7.26±1.55   18.64±0.40
Manifold Mixup [42]   0           None     5.41±0.25   2.26±0.52   5.82±0.42   17.00±0.79  19.80±0.45  9.95±0.50
MetaPerturb           82          TIN      4.80±0.63   14.41±0.65  2.05±0.31   2.82±0.46   2.96±0.37   15.62±1.10
Table 4: ECE of multiple datasets. Source and target networks are ResNet20. TIN: Tiny ImageNet.
Figure 9: Calibration plot on STL10, s-CIFAR100, Stanford Dogs, Stanford Cars, Aircraft and CUB datasets using ResNet20.
Figure 10: (a) Original image. (b-e) Left: feature map before passing the perturbation. Center: generated noise. Right: feature map after passing the perturbation. Each panel is labeled with its layer and the value reported for that layer.
(a) Dogs: (b) Layer 1, 0.6463; (c) Layer 3, 0.7324; (d) Layer 9, 0.4963; (e) Layer 8, 0.8078
(f) Cars: (g) Layer 3, 0.6693; (h) Layer 2, 0.7129; (i) Layer 7, 0.5854; (j) Layer 7, 0.8546
(k) Aircraft: (l) Layer 3, 0.6601; (m) Layer 1, 0.7942; (n) Layer 9, 0.5110; (o) Layer 8, 0.8824

Appendix C Experimental Setup

C.1 Meta-training Dataset

Tiny ImageNet

This dataset [1] is a subset of the ImageNet [33] dataset, consisting of size images from classes. There are , , and images for training, validation, and test dataset, respectively. We use the training dataset for the source training, by resizing images to size and dividing the dataset into class-wise splits to produce heterogeneous task samples.

C.2 Meta-testing Datasets


STL10

This dataset [5] consists of classes of general objects such as airplane, bird, and car, which is similar to the CIFAR-10 dataset but has a higher resolution of . There are and examples per class for training and test set, respectively. We resized the images to size.

small CIFAR-100

This dataset [19] consists of classes of general objects such as beaver, aquarium fish, and cloud. The image size is and there are and examples for training and test set, respectively. In order to demonstrate that our model performs well on small datasets, we randomly sample examples from the whole training set and use this smaller set for meta-testing.

Stanford Dogs

This dataset [16] is for fine-grained image categorization and contains images from breeds of dogs from around the world. It has total and images for training and testing, respectively. We resized the images to size.

Stanford Cars

This dataset [18] is also for fine-grained classification, classifying between the Makes, Models, Years of various cars, e.g. 2012 Tesla Model S or 2012 BMW M3 coupe. It contains images from classes of cars, where and images are assigned for training and test set, respectively. We resized the images to size.


Aircraft

This dataset [25] consists of images from different aircraft model variants (most of them are airplanes). There are images for each class and we use examples for training and examples for testing. We resized the images to size.


CUB

This dataset [44] consists of bird classes such as Black Tern, Blue Jay, and Palm Warbler. It has training images and test images, and we did not use bounding box information for our experiments. We resized the images to size.

small SVHN

The original dataset [27] consists of color images from digit classes. The image size is . In our experiments, we use only subsampled examples for training in order to simulate a data-scarce scenario. There are examples for testing.

C.3 Networks

We use 6 networks (Conv4 [43], Conv6, VGG9 [37], ResNet20 [12], ResNet44, and Wide ResNet 28-2 [47]) in our experiments. For Conv4, Conv6, and VGG9, we add our perturbation function in every convolution block, before the activation. For ResNet architectures, we add our perturbation function in every residual block, before the last activation.

To describe the networks concisely, let Ck denote a convolutional layer with k channels followed by batch normalization and a ReLU activation, M denote a max pooling with a stride of , and FC denote a fully-connected layer. We provide an implementation of the networks in our code.


Conv4

This network is frequently used in the few-shot classification literature. This model can be described with C64-M-C64-M-C64-M-C64-M-FC.


Conv6

This network is similar to the Conv4 network, except that we increase the depth by adding two more convolutional layers. This model can be described with C64-M-C64-M-C64-C64-M-C64-C64-M-FC.


VGG9

This network is a small version of VGG [37] with a single fully-connected layer at the end. This model can be described with C64-M-C128-M-C256-C256-M-C512-C512-M-C512-C512-M-FC.


ResNet20

This network is used for the CIFAR-10 classification task in [12]. The network consists of three residual block layers, each made up of multiple residual blocks, where each residual block consists of two convolution layers. Down-sampling is performed by stride pooling in the first convolution layer in a residual block layer and is used at the second and the third residual block layers. Let ResBlk(n,k) denote a residual block layer with n residual blocks of channel k, and GAP denote a global average pooling. Then, the network can be described with C16-ResBlk(3,16)-ResBlk(3,32)-ResBlk(3,64)-GAP-FC.


ResNet44

This network is similar to the ResNet20 network, but with more residual blocks in each residual block layer. The network can be described with C16-ResBlk(7,16)-ResBlk(7,32)-ResBlk(7,64)-GAP-FC.
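As a sanity check on the shorthand used above, the following sketch expands a description string into a flat list of layers and counts the weighted layers. It is an illustrative parser, not part of the released code: the names `expand_spec`, `count_layers` and `SPECS` are hypothetical, and any projection convolutions on residual shortcuts are ignored.

```python
import re

def expand_spec(spec):
    """Expand e.g. 'C64-M-C64-M-FC' into a list of (layer_type, channels) tuples."""
    layers = []
    for token in (t.strip() for t in spec.split("-")):
        resblk = re.fullmatch(r"ResBlk\((\d+),(\d+)\)", token)
        if resblk:
            num_blocks, channels = int(resblk.group(1)), int(resblk.group(2))
            # Each residual block consists of two convolution layers.
            layers += [("conv", channels)] * (2 * num_blocks)
        elif re.fullmatch(r"C\d+", token):
            layers.append(("conv", int(token[1:])))
        elif token == "M":
            layers.append(("maxpool", None))
        elif token == "GAP":
            layers.append(("gap", None))
        elif token == "FC":
            layers.append(("fc", None))
        else:
            raise ValueError(f"unknown token: {token!r}")
    return layers

def count_layers(spec, kinds):
    """Count the layers in a spec whose type is one of `kinds`."""
    return sum(1 for kind, _ in expand_spec(spec) if kind in kinds)

SPECS = {
    "Conv4": "C64-M-C64-M-C64-M-C64-M-FC",
    "Conv6": "C64-M-C64-M-C64-C64-M-C64-C64-M-FC",
    "VGG9": "C64-M-C128-M-C256-C256-M-C512-C512-M-C512-C512-M-FC",
    "ResNet20": "C16-ResBlk(3,16)-ResBlk(3,32)-ResBlk(3,64)-GAP-FC",
    "ResNet44": "C16-ResBlk(7,16)-ResBlk(7,32)-ResBlk(7,64)-GAP-FC",
}

for name, spec in SPECS.items():
    convs = count_layers(spec, ("conv",))
    weighted = count_layers(spec, ("conv", "fc"))
    print(f"{name}: {convs} conv layers, {weighted} weighted layers")
```

The counts line up with the network names: Conv4 and Conv6 are named after their convolution layers, while VGG9, ResNet20 and ResNet44 are named after their convolution plus fully-connected layers.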

Wide ResNet 28-2

This network is a variant of ResNet, which decreases the depth and increases the width of the conventional ResNet architecture. We use Wide ResNet 28-2, which has depth 28 and widening factor 2.

C.4 Experimental Details


Meta-training

We use an Adam optimizer [17] and train the model for steps. We use an initial learning rate of and decay the learning rate by at , , and steps. We set the mini-batch size to . Lastly, for the base regularizations during training, we use weight decay of and simple data augmentations such as random resizing & cropping and random horizontal flipping. In order to efficiently train on multiple tasks, we distribute the tasks to multiple processing units, and each process has its own main-model parameters and perturbation function parameter . After one gradient step of the whole model, we share only the perturbation function parameters across the processes.
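The training loop described above can be sketched as follows. This is a toy illustration under explicit assumptions: sharing is implemented here as a simple average of the per-process perturbation parameters (the text only says they are shared), the gradients are random placeholders, plain gradient descent stands in for Adam, and every numeric value (learning rate, decay factor, milestones, number of processes) is hypothetical.

```python
import numpy as np

def step_decay_lr(step, base_lr, decay_factor, milestones):
    """Learning rate after multiplying by decay_factor at every milestone reached."""
    num_decays = sum(step >= m for m in milestones)
    return base_lr * decay_factor ** num_decays

def share_perturbation_params(phis):
    """Assumed sharing rule: replace every process's copy by the average."""
    mean_phi = np.mean(phis, axis=0)
    return [mean_phi.copy() for _ in phis]

rng = np.random.default_rng(0)
num_processes = 3
thetas = [rng.normal(size=4) for _ in range(num_processes)]  # per-process main-model parameters
phis = [np.zeros(2) for _ in range(num_processes)]           # per-process perturbation parameters

for step in range(4):
    lr = step_decay_lr(step, base_lr=0.1, decay_factor=0.5, milestones=[2])
    for k in range(num_processes):
        # Placeholder gradients; a real run would backpropagate the loss of task k.
        thetas[k] = thetas[k] - lr * rng.normal(size=4)
        phis[k] = phis[k] - lr * rng.normal(size=2)
    # Only the perturbation parameters are synchronized; main models stay task-specific.
    phis = share_perturbation_params(phis)

print("shared phi:", phis[0])
```

After every step all processes hold the same perturbation parameters, while their main-model parameters keep diverging with their own tasks.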


Meta-testing

We use the same configurations as the meta-training stage. After the meta-training is done, only the perturbation function parameter is transferred to the meta-testing stage. Note that is not updated in the meta-testing stage.