Maximum-Entropy Adversarial Data Augmentation for Improved Generalization and Robustness

10/15/2020 ∙ by Long Zhao, et al. ∙ Google ∙ University of Delaware ∙ Rutgers University

Adversarial data augmentation has shown promise for training robust deep neural networks against unforeseen data shifts or corruptions. However, it is difficult to define heuristics to generate effective fictitious target distributions containing "hard" adversarial perturbations that are largely different from the source distribution. In this paper, we propose a novel and effective regularization term for adversarial data augmentation. We theoretically derive it from the information bottleneck principle, which results in a maximum-entropy formulation. Intuitively, this regularization term encourages perturbing the underlying source distribution to enlarge predictive uncertainty of the current model, so that the generated "hard" adversarial perturbations can improve the model robustness during training. Experimental results on three standard benchmarks demonstrate that our method consistently outperforms the existing state of the art by a statistically significant margin.


1 Introduction

Deep neural networks can achieve good performance on the condition that the training and testing data are drawn from the same distribution. However, this condition might not hold in practice. Data shifts caused by mismatches between training and testing domains Balaji et al. (2018); Li et al. (2019); Sinha et al. (2018); Volpi et al. (2018); Zhao et al. (2019), small corruptions to data distributions Hendrycks and Dietterich (2019); Zhao et al. (2020), or adversarial attacks Goodfellow et al. (2015); Kurakin et al. (2017) are often inevitable in real-world applications and lead to significant performance degradation of deep learning models. Recently, adversarial data augmentation Ganin et al. (2016); Sinha et al. (2018); Volpi et al. (2018) has emerged as a strong baseline, where fictitious target distributions are generated by an adversarial loss to resemble unforeseen data shifts and are used to improve model robustness through training. The adversarial loss is leveraged to produce perturbations that fool the current model. However, as shown in Qiao et al. (2020), this heuristic loss function is insufficient to synthesize large data shifts, i.e., "hard" adversarial perturbations from the source domain, which leaves the model vulnerable to severely shifted or corrupted testing data.

To mitigate this issue, we propose a regularization technique for adversarial data augmentation from an information theory perspective, using the Information Bottleneck (IB) principle Tishby et al. (1999). The IB principle encourages the model to learn an optimal representation by diminishing the irrelevant parts of the input variable that do not contribute to the prediction. Recently, there has been a surge of interest in combining the IB method with the training of deep neural networks Achille and Soatto (2018b); Amjad and Geiger (2019); Elad et al. (2019); Kolchinsky et al. (2019); Ozair et al. (2019); Tschannen et al. (2020), but its effectiveness for adversarial data augmentation remains unclear.

In the IB context, a neural network often fails to generalize to out-of-domain data when the information of the input cannot be well compressed by the model, i.e., when the mutual information between the input and its associated latent representation is high Shamir et al. (2010); Tishby and Zaslavsky (2015). Motivated by this observation, we aim to regularize adversarial data augmentation by maximizing the IB function. Specifically, we produce "hard" fictitious target domains that are largely shifted from the source domain by enlarging the mutual information between the input and the latent distribution of the current model. However, mutual information is known to be intractable in general Belghazi et al. (2018); Paninski (2003); Song and Ermon (2020), so directly optimizing this objective is challenging.

In this paper, we develop an efficient maximum-entropy regularizer that achieves the same goal, making the following contributions: (i) to the best of our knowledge, this is the first work to investigate adversarial data augmentation from an information theory perspective and to address the problem of generating "hard" adversarial perturbations from the IB principle, which has not been studied yet; (ii) we theoretically show that the IB principle can be bounded by a maximum-entropy regularization term in the maximization phase of adversarial data augmentation, which results in a notable improvement over Volpi et al. (2018); (iii) we also show that our formulation holds in an approximate sense under certain non-deterministic conditions (e.g., when the neural network is stochastic or contains Dropout Srivastava et al. (2014) layers). Notably, our maximum-entropy regularizer can be implemented with one line of code at minor computational cost, while it consistently and statistically significantly improves the existing state of the art on three standard benchmarks.

2 Background and Related Work

Information Bottleneck Principle. We begin by summarizing the concept of the information bottleneck and, along the way, introduce the notation. The Information Bottleneck (IB) Tishby et al. (1999) is a principled way to seek a latent representation $Z$ that captures the information an input variable $X$ contains about an output $Y$. Let $I(X;Z)$ be the mutual information of $X$ and $Z$, i.e., $I(X;Z) = D_{KL}(p(X,Z) \,\|\, p(X)p(Z))$, where $D_{KL}$ denotes the KL-divergence Kullback and Leibler (1951). Intuitively, $I(X;Z)$ measures how much the uncertainty in $X$ is reduced given $Z$. The representation $Z$ can be quantified by two terms: $I(X;Z)$, which reflects how much $Z$ compresses $X$, and $I(Z;Y)$, which reflects how well $Z$ predicts $Y$. In practice, the IB principle is explored by minimizing the following IB Lagrangian:

$$\min \; \beta I(X;Z) - I(Z;Y), \qquad (1)$$

where $\beta > 0$ is a parameter that controls the trade-off between compression and prediction. By controlling the amount of compression within the representation via the compression term $I(X;Z)$, we can tune desired characteristics of trained models such as robustness to adversarial samples Alemi et al. (2017), generalization error Amjad and Geiger (2019); Cheng et al. (2019); Kolchinsky et al. (2019); Shamir et al. (2010); Tishby and Zaslavsky (2015); Vera et al. (2018), and detection of out-of-distribution data Alemi et al. (2018).

Domain Generalization. Domain adaptation Ganin and Lempitsky (2015); Tzeng et al. (2017) transfers models from source domains to related target domains with different distributions during the training procedure. On the other hand, domain generalization Balaji et al. (2018); Bousmalis et al. (2016); Li et al. (2017, 2018, 2019); Mancini et al. (2018) aims to learn features that perform well when transferred to unseen domains during evaluation. This paper studies a more challenging setting named single domain generalization Qiao et al. (2020), where networks are learned using one single source domain, in contrast to conventional domain generalization, which requires multiple training source domains. Recently, adversarial data augmentation Volpi et al. (2018) has proven to be a promising solution; it synthesizes virtual target domains during training so that the generalization and robustness of the learned networks to unseen domains can be improved. Our approach improves it by proposing an efficient regularizer.

Adversarial Data Augmentation. We are interested in the problem of training deep neural networks on a single source domain and deploying them to unforeseen domains following different underlying distributions. Let $(X, Y)$ be random data points with associated labels $Y \in \mathcal{Y}$ ($\mathcal{Y}$ is finite) drawn from the source distribution $P_0$. We consider the following worst-case problem around $P_0$:

$$\min_{\theta} \; \sup_{P:\, D_\theta(P, P_0) \le \rho} \; \mathbb{E}_{P}\big[\mathcal{L}(\theta; (X, Y))\big], \qquad (2)$$

where $\theta$ denotes the network parameters, $\mathcal{L}$ is the loss function, and $D_\theta(P, P_0)$ measures the distance between the two distributions $P$ and $P_0$. We denote $\theta = \{\theta_f, \theta_c\}$, where $\theta_c$ represents the parameters of the final prediction layer and $\theta_f$ represents the parameters of the rest of the network. Letting $z = g(x; \theta_f)$ be the latent representation of input $x$, we feed it into a $K$-way classifier such that, using the softmax activation, the probability $p(y = i \mid x; \theta)$ of the $i$-th class is:

$$p(y = i \mid x; \theta) = \frac{\exp\big(\theta_c^{(i)\top} z\big)}{\sum_{j=1}^{K} \exp\big(\theta_c^{(j)\top} z\big)}, \qquad (3)$$

where $\theta_c^{(i)}$ is the parameters for the $i$-th class. In the classification setting, we minimize the cross-entropy loss over each sample in the training domain: $\mathcal{L}(\theta; (x, y)) = -\log p(y \mid x; \theta)$. Moreover, in order to preserve the semantics of the input samples, the metric $D_\theta$ is defined in the latent space. Let $c_\theta$ denote the transportation cost of moving mass from $(x, y)$ to $(x', y')$: $c_\theta((x, y), (x', y')) = \|z - z'\|_2^2 + \infty \cdot \mathbf{1}\{y \ne y'\}$, where $z$ and $z'$ are the latent representations of $x$ and $x'$. For probability measures $P$ and $Q$ supported on $\mathcal{X} \times \mathcal{Y}$, let $\Pi(P, Q)$ be their couplings. Then, we use the Wasserstein metric defined by $D_\theta(P, Q) = \inf_{M \in \Pi(P, Q)} \mathbb{E}_{M}\big[c_\theta((X, Y), (X', Y'))\big]$. The solution to the worst-case problem (2) ensures good performance (robustness) against any data distribution $P$ that is at most distance $\rho$ away from the source domain $P_0$. However, for deep neural networks, this formulation is intractable with arbitrary $\rho$. Instead, following the reformulation of Sinha et al. (2018); Volpi et al. (2018), we consider its Lagrangian relaxation for a fixed penalty parameter $\gamma \ge 0$:

$$\min_{\theta} \; \sup_{P} \; \big\{\mathbb{E}_{P}\big[\mathcal{L}(\theta; (X, Y))\big] - \gamma D_\theta(P, P_0)\big\}. \qquad (4)$$
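The following is a minimal PyTorch sketch of the two ingredients just defined, the latent-space transportation cost and the softmax classifier of Eq. (3); the function names and tensor shapes are our own illustrative choices, not the authors' released code:

```python
import torch
import torch.nn.functional as F

def transport_cost(z, z_prime):
    # Squared Euclidean distance in the latent space. Moving mass across
    # different class labels carries infinite cost, so in practice
    # perturbations always keep the label fixed.
    return ((z - z_prime) ** 2).sum(dim=1)

def class_probs(z, theta_c):
    # K-way softmax classifier of Eq. (3): p(y = i | z) is proportional to
    # exp(theta_c[i] @ z). theta_c: (K, latent_dim); z: (batch, latent_dim).
    return F.softmax(z @ theta_c.t(), dim=1)
```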

3 Methodology

In this paper, our main idea is to incorporate the IB principle into adversarial data augmentation so as to improve model robustness to large domain shifts. We start by adapting the IB Lagrangian (1) to supervised-learning scenarios so that the latent representation $Z$ can be leveraged for classification purposes. To this end, we modify the IB Lagrangian (1) following Achille and Soatto (2018a, b); Amjad and Geiger (2019) to $\mathcal{L}_{IB}(\theta; (X, Y)) = \mathcal{L}(\theta; (X, Y)) + \beta I(X; Z)$, where the constraint on $I(Z; Y)$ is replaced with the risk associated with the prediction according to the loss function $\mathcal{L}$. We can see that $\mathcal{L}_{IB}$ appears as a standard cross-entropy loss augmented with a regularizer promoting minimality of the representation. Then, we rewrite Eq. (4) to leverage the newly defined loss function:

$$\min_{\theta} \; \sup_{P} \; \big\{\mathbb{E}_{P}\big[\mathcal{L}_{IB}(\theta; (X, Y))\big] - \gamma D_\theta(P, P_0)\big\}. \qquad (5)$$

As discussed in Sinha et al. (2018); Volpi et al. (2018), the worst-case setting of Eq. (5) can be formalized as a minimax optimization problem. It is solved by an iterative training procedure in which two phases alternate. In the maximization phase, new data points are produced by computing the inner maximization problem to mimic fictitious target distributions that satisfy the distance constraint $D_\theta(P, P_0) \le \rho$. In the minimization phase, the network parameters are updated by the loss function evaluated on the adversarial examples generated in the maximization phase.

The main challenge in optimizing Eq. (5) is that exact computation of the compression term $I(X; Z)$ in $\mathcal{L}_{IB}$ is almost impossible due to the high dimensionality of the data. The way of approximating this term in the minimization phase has been widely studied in recent years, and we follow Elad et al. (2019); Gal and Ghahramani (2015) to express $I(X; Z)$ by an $\ell_2$ penalty (also known as weight decay Krogh and Hertz (1992)), as illustrated in the sketch below. We then discuss how to effectively implement $I(X; Z)$ in the maximization phase for adversarial data augmentation. The full algorithm is summarized in Algorithm 1.
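In a framework such as PyTorch, the $\ell_2$ surrogate for the compression term amounts to the optimizer's weight-decay argument; the network and the decay value below are illustrative placeholders, not settings from the paper:

```python
import torch

model = torch.nn.Linear(128, 10)  # placeholder network
# Weight decay realizes the L2 penalty standing in for I(X; Z)
# during the minimization phase (0.0005 is an illustrative value).
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
```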

Input: source dataset $\mathcal{D}_0$ and initialized network weights $\theta_0$
Output: learned network weights $\theta$
1: Initialize $\mathcal{D} \leftarrow \mathcal{D}_0$, $\theta \leftarrow \theta_0$
2: for $k = 1, \dots, K$ do ▷ Run the minimax procedure $K$ times
3:     for $t = 1, \dots, T_{\min}$ do ▷ Run the minimization phase $T_{\min}$ times
4:         Sample $(x, y)$ uniformly from dataset $\mathcal{D}$
5:         $\theta \leftarrow \theta - \eta \nabla_{\theta} \mathcal{L}(\theta; (x, y))$
6:     for all $(x, y)$ in dataset $\mathcal{D}$ do
7:         $x^{0} \leftarrow x$
8:         for $t = 1, \dots, T_{\max}$ do ▷ Run the maximization phase $T_{\max}$ times
9:             $x^{t} \leftarrow x^{t-1} + \eta' \nabla_{x} \big\{\mathcal{L}(\theta; (x^{t-1}, y)) + \beta H(\hat{y} \mid x^{t-1}; \theta) - \gamma c_\theta((x^{t-1}, y), (x^{0}, y))\big\}$
10:        Append $(x^{T_{\max}}, y)$ to dataset $\mathcal{D}$
11: while maximum steps not reached do
12:     Sample $(x, y)$ uniformly from dataset $\mathcal{D}$
13:     $\theta \leftarrow \theta - \eta \nabla_{\theta} \mathcal{L}(\theta; (x, y))$

Algorithm 1: Maximum-Entropy Adversarial Data Augmentation (ME-ADA)
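To make the procedure concrete, here is a condensed PyTorch sketch of the maximization phase (lines 6-10 of Algorithm 1). The `model.features`/`model.classifier` split mirrors $\theta = \{\theta_f, \theta_c\}$; all names and hyper-parameters are our own placeholders rather than the authors' released implementation:

```python
import torch
import torch.nn.functional as F

def prediction_entropy(logits):
    # Entropy of the softmax output: the maximum-entropy term.
    log_p = F.log_softmax(logits, dim=1)
    return -(log_p.exp() * log_p).sum(dim=1)

def maximization_phase(model, x, y, beta, gamma, lr_adv, t_max):
    # Gradient ascent on loss + beta * entropy, penalized by the
    # latent-space transport cost to the starting point x^0.
    z0 = model.features(x).detach()
    x_adv = x.clone().detach().requires_grad_(True)
    for _ in range(t_max):
        z = model.features(x_adv)
        logits = model.classifier(z)
        obj = (F.cross_entropy(logits, y)
               + beta * prediction_entropy(logits).mean()
               - gamma * ((z - z0) ** 2).sum(dim=1).mean())
        grad, = torch.autograd.grad(obj, x_adv)
        x_adv = (x_adv + lr_adv * grad).detach().requires_grad_(True)
    return x_adv.detach()
```

The returned samples are appended to the dataset, and the minimization phase then trains on them as usual.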

3.1 Regularizing Maximization Phase via Maximum Entropy

Intuitively, maximizing the mutual information $I(X; Z)$ in the maximization phase encourages adversarial perturbations that cannot be effectively "compressed" by the current model. From the information theory perspective, such perturbations usually imply large domain shifts and thus can potentially benefit model generalization and robustness. However, since $Z$ is high dimensional, maximizing $I(X; Z)$ directly is intractable. One of our key results is that, when we restrict ourselves to classification scenarios, we can efficiently approximate and maximize $I(X; Z)$ during adversarial data augmentation. As we will show, this can be effectively implemented by maximizing the entropy of the network predictions, which is a tractable lower bound of $I(X; Z)$.

To set the stage, we let $\hat{Y}$ denote the predicted class label given the input $X$. As described in Amjad and Geiger (2019); Tishby et al. (1999), deep neural networks can be considered as a Markov chain of successive representations of the input, where information flows obeying the structure $Y \to X \to Z \to \hat{Y}$. By the Data Processing Inequality Cover and Thomas (2012), we have $I(X; Z) \ge I(X; \hat{Y})$. On the other hand, when performing data augmentation during each maximization phase, the model parameters are fixed, and $\hat{Y}$ is a deterministic function of $X$, i.e., any given input is mapped to a single class. Consequently, it holds that $I(X; \hat{Y}) = H(\hat{Y}) - H(\hat{Y} \mid X) = H(\hat{Y})$, where $H(\cdot)$ is the Shannon entropy. Combining all these together, we have Proposition 1.

Proposition 1.

Consider a deterministic neural network, the parameters of which are fixed. Given the input $X$, let $\hat{Y}$ be the network prediction and $Z$ be the latent representation of $X$. Then, the mutual information $I(X; Z)$ is lower bounded by the entropy $H(\hat{Y})$, i.e., we have that,

$$I(X; Z) \ge I(X; \hat{Y}) = H(\hat{Y}). \qquad (6)$$

Note that Eq. (6) does not mean that calculating $H(\hat{Y})$ does not need $X$, since $\hat{Y}$ is generated by feeding $X$ into the network. Two important benefits of our formulation (6) deserve discussion. First, it provides a method to maximize $I(X; Z)$ that does not depend on the input dimensionality; thus, even for high dimensional images, we can still maximize the mutual information this way. Second, our formulation is closely related to the Deterministic Information Bottleneck Strouse and Schwab (2017), where $I(X; Z)$ is approximated by the entropy $H(Z)$. However, $H(Z)$ is still intractable in general, whereas $H(\hat{Y})$ can be directly computed from the softmax output of a classification network, as we will show later. Next, we modify Eq. (5) by replacing $I(X; Z)$ with $H(\hat{Y})$, which becomes a relaxed worst-case problem:

$$\min_{\theta} \; \sup_{P} \; \big\{\mathbb{E}_{P}\big[\mathcal{L}(\theta; (X, Y)) + \beta H(\hat{Y})\big] - \gamma D_\theta(P, P_0)\big\}. \qquad (7)$$

From a Bayesian perspective, the prediction entropy $H(\hat{Y})$ can be viewed as the predictive uncertainty Kendall and Gal (2017); Snoek et al. (2019) of a model. Therefore, our maximum-entropy formulation (7) is equivalent to perturbing the underlying data distribution so that the predictive uncertainty of the current model is enlarged in the maximization phase. This motivates us to extend our approach to stochastic neural networks to better capture the model uncertainty, as we will show in the experiments.

Empirical Estimation. Now, Eq. (7) involves the expected prediction entropy over the data distribution. However, during training we only have sample access to the data distribution, which we use as a surrogate for empirical estimation. Given an observed input $x_n$ sampled from the source distribution $P_0$, we start by defining the prediction entropy of its corresponding output by:

$$H(\hat{y} \mid x_n; \theta) = -\sum_{i=1}^{K} p(\hat{y} = i \mid x_n; \theta) \log p(\hat{y} = i \mid x_n; \theta). \qquad (8)$$

Then, by averaging the prediction entropies over all $N$ observations contained in the source dataset $\mathcal{D}_0 = \{(x_n, y_n)\}_{n=1}^{N}$, we obtain the empirical estimation of $H(\hat{Y})$:

$$\hat{H}(\hat{Y}) = \frac{1}{N} \sum_{n=1}^{N} H(\hat{y}_n \mid x_n; \theta), \qquad (9)$$

where $\hat{H}$ denotes the empirical entropy, and $\hat{y}_n$ is the network prediction for $x_n$.

After combining Eq. (9) with the relaxed worst-case problem (7), we have the empirical counterpart of the objective, which is defined by $\mathcal{L}_{ME}(\theta; (X, Y)) \triangleq \mathcal{L}(\theta; (X, Y)) + \beta H(\hat{y} \mid X; \theta)$. Taking the dual reformulation of the penalty problem $\sup_{P} \{\mathbb{E}_{P}[\mathcal{L}_{ME}(\theta; (X, Y))] - \gamma D_\theta(P, P_0)\}$, we can obtain an efficient solution procedure. The following result is a minor adaptation of Lemma 1 in Volpi et al. (2018):

Proposition 2.

Let $\mathcal{L}_{ME}: \Theta \times (\mathcal{X} \times \mathcal{Y}) \to \mathbb{R}$ and $c_\theta$ be continuous. Let $\phi_\gamma(\theta; (x, y))$ denote the robust surrogate loss. Then, for any distribution $Q$ and any $\gamma \ge 0$, we have that,

$$\sup_{P} \; \big\{\mathbb{E}_{P}\big[\mathcal{L}_{ME}(\theta; (X, Y))\big] - \gamma D_\theta(P, Q)\big\} = \mathbb{E}_{Q}\big[\phi_\gamma(\theta; (X, Y))\big], \qquad (10)$$
$$\phi_\gamma(\theta; (x, y)) = \sup_{x' \in \mathcal{X}} \; \big\{\mathcal{L}_{ME}(\theta; (x', y)) - \gamma c_\theta((x', y), (x, y))\big\}. \qquad (11)$$

To solve the penalty problem of Eq. (5), in the minimization phase of the iterative training procedure, we can perform Stochastic Gradient Descent (SGD) on the robust surrogate loss $\phi_\gamma$. Specifically, under suitable conditions Boyd and Vandenberghe (2004), we have that $\nabla_\theta \phi_\gamma(\theta; (x, y)) = \nabla_\theta \mathcal{L}_{ME}(\theta; (x^{\star}, y))$, where $x^{\star}$ is an adversarial perturbation of $x$ at the current model $\theta$. In the maximization phase, on the other hand, we solve the maximization problem (11) by Maximum-Entropy Adversarial Data Augmentation (ME-ADA). Concretely, in the $k$-th maximization phase, we compute adversarially perturbed samples at the current model $\theta$:

$$x_n^{k} \leftarrow \arg\max_{x \in \mathcal{X}} \; \big\{\mathcal{L}(\theta; (x, y_n)) + \beta H(\hat{y} \mid x; \theta) - \gamma c_\theta((x, y_n), (x_n^{k-1}, y_n))\big\}. \qquad (12)$$

Note that the entropy term can be efficiently calculated from the softmax output of a model and implemented with one line of code in modern deep learning frameworks; as we will show in the experiments, it achieves substantial performance improvements.
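As an illustration of this point, the following PyTorch sketch adds the maximum-entropy term to the cross-entropy objective used in the maximization phase (the function and its arguments are our own placeholders, not the paper's released code):

```python
import torch.nn.functional as F

def me_objective(logits, y, beta):
    # Cross-entropy plus the one-line maximum-entropy term of Eq. (12):
    # the entropy of the softmax output, weighted by beta.
    entropy = -(F.softmax(logits, dim=1) *
                F.log_softmax(logits, dim=1)).sum(dim=1)
    return F.cross_entropy(logits, y) + beta * entropy.mean()
```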

Theoretic Bound. It is essential to guarantee that the empirical estimate $\hat{H}(\hat{Y})$ of the entropy (from a training set containing $N$ samples) is an accurate estimate of the true expected entropy $H(\hat{Y})$. The next proposition ensures that for large $N$, in a classification problem, the sample estimate of the average entropy is close to the true expected entropy.

Proposition 3.

Let $\hat{Y}$ be a fixed probabilistic function of $X$ into an arbitrary finite target space $\hat{\mathcal{Y}}$, determined by a fixed and known conditional probability distribution $p(\hat{y} \mid x)$, and let $S$ be a sample set of size $N$ drawn from the joint probability distribution $p(x, \hat{y})$. For any $\delta \in (0, 1)$, with probability of at least $1 - \delta$ over the sample set $S$, we have,

$$\big| H(\hat{Y}) - \hat{H}(\hat{Y}) \big| \le \log N \sqrt{\frac{2 \log(2/\delta)}{N}} + \frac{|\hat{\mathcal{Y}}| - 1}{N}. \qquad (13)$$

We prove Proposition 3 in the supplementary material. The proof adapts the setting in Shamir et al. (2010): we bound the deviation of the entropy estimate from its expectation and then use a bound on the expected bias of entropy estimation. Two important properties of this bound are worth discussing. First, we note that Proposition 3 holds for any fixed probabilistic function. Compared with prior studies on the plug-in estimate of discrete entropy over a finite-size alphabet Valiant and Valiant (2011); Wu and Yang (2016), we focus on bounds for non-optimal estimators. In particular, this proposition holds for any $\theta$, even if $\theta$ is not a globally optimal solution of Eq. (7). This is the case for models in the maximization phase, which thus ensures the effectiveness of our formulation across the whole iterative training procedure. Second, the bound does not depend on the dimensionality of $X$. In addition, the complexity of the bound is mainly controlled by the cardinality $|\hat{\mathcal{Y}}|$. By constraining $|\hat{\mathcal{Y}}|$ to be small, a tight bound can be achieved. This assumption usually holds for the setting of training classification models, i.e., $|\hat{\mathcal{Y}}| = K \ll N$.

3.2 Maximum Entropy in Non-Deterministic Conditions

It is important to note that not all models are deterministic, e.g., when deep neural networks are stochastic Florensa et al. (2017); Tang and Salakhutdinov (2013) or contain Dropout layers Gal and Ghahramani (2016); Srivastava et al. (2014). The mapping from $X$ to $\hat{Y}$ may be intrinsically noisy or non-deterministic. Here, we show that when $\hat{Y}$ is a small perturbation away from being a deterministic function of $X$, our maximum-entropy formulation (7) still applies in an approximate sense. We now consider the case when the joint distribution of $X$ and $\hat{Y}$ is $\epsilon$-close to having $\hat{Y}$ be a deterministic function of $X$. The next result is a minor adaptation of Kolchinsky et al. (2019) (Theorem 1), and it shows that the conditional entropy $H(\hat{Y} \mid X)$ is only a small distance, controlled by $\epsilon$, away from being zero.

Corollary 1.

Let $X$ be a random variable and $\hat{Y}$ be a random variable with a finite set of outcomes $\hat{\mathcal{Y}}$. Let $q(x, \hat{y})$ be a joint distribution over $X$ and $\hat{Y}$ under which $\hat{Y} = f(X)$ for a deterministic function $f$. Let $p(x, \hat{y})$ be a joint distribution over $X$ and $\hat{Y}$ which has the same marginal over $X$ as $q$, i.e., $p(x) = q(x)$, and obeys $\mathbb{E}_{p(x)}\big[\mathrm{TV}\big(p(\hat{y} \mid x), q(\hat{y} \mid x)\big)\big] \le \epsilon$, where $\mathrm{TV}(\cdot, \cdot)$ denotes the total variation distance. Then, we have that,

$$H_p(\hat{Y} \mid X) \le h_b(\epsilon) + \epsilon \log\big(|\hat{\mathcal{Y}}| - 1\big), \qquad (14)$$

where $h_b(\cdot)$ is the binary entropy function.

As this corollary shows, even if the relationship between $X$ and $\hat{Y}$ is not perfectly deterministic but close to being so, i.e., it is $\epsilon$-close to a deterministic function, we have $H(\hat{Y} \mid X) \le h_b(\epsilon) + \epsilon \log(|\hat{\mathcal{Y}}| - 1)$, which vanishes as $\epsilon \to 0$. Hence, in this case, Proposition 1 and our maximum-entropy adversarial data augmentation formulation (7) still hold in an approximate sense.

4 Experiments

In this section, we evaluate our approach over a variety of settings. We first test on MNIST under the setting of large domain shifts, then on the more challenging PACS dataset under the domain generalization setting, and further on CIFAR-10-C and CIFAR-100-C, which are standard benchmarks for evaluating model robustness to common corruptions. We compare the proposed Maximum-Entropy Adversarial Data Augmentation (ME-ADA) with the previous state of the art when available. We note that Adversarial Data Augmentation (ADA) Volpi et al. (2018) is our main competitor, since our method reduces to Volpi et al. (2018) when the maximum-entropy term is discarded.

Datasets. The MNIST dataset LeCun et al. (1998) consists of handwritten digits with 60,000 training examples and 10,000 testing examples. Other digit datasets, including SVHN Netzer et al. (2011), MNIST-M Ganin and Lempitsky (2015), SYN Ganin and Lempitsky (2015), and USPS Denker et al. (1989), are leveraged for evaluating model performance. These four datasets exhibit large domain shifts from MNIST in terms of backgrounds, shapes, and textures. PACS Li et al. (2017) is a recent dataset with different object style depictions and a more challenging domain shift than the MNIST experiment. It contains four domains (art, cartoon, photo, and sketch) and shares seven common object categories (dog, elephant, giraffe, guitar, house, horse, and person) across these domains. It is made up of 9,991 images at a resolution of 227×227. For fair comparison, we follow the protocol in Li et al. (2017), including the recommended train, validation, and test splits.

CIFAR-10 and CIFAR-100 Krizhevsky and Hinton (2009) are two datasets containing small natural RGB images, both with 50,000 training images and 10,000 testing images. CIFAR-10 has 10 categories, and CIFAR-100 has 100 object classes. In order to measure the resilience of a model to common corruptions, we evaluate on the CIFAR-10-C and CIFAR-100-C datasets Hendrycks and Dietterich (2019). These two datasets are constructed by corrupting the original CIFAR test sets with a total of fifteen corruption types, spanning noise, blur, weather, and digital corruptions, each appearing at five severity levels or intensities. We do not tune on the validation corruptions, so we report the average performance over all corruptions and intensities.

4.1 MNIST with Domain Shifts

Experiment Setup. We follow the setup of Volpi et al. (2018) in experimenting with the MNIST dataset. We use 10,000 samples from MNIST for training and evaluate prediction accuracy on the respective test sets of SVHN, MNIST-M, SYN, and USPS. In order to work with comparable datasets, we resize all images to 32×32 and treat images from MNIST and USPS as RGB. We use LeNet LeCun et al. (1989) as the base model with a batch size of 32. We use Adam Kingma and Ba (2014) for minimization and SGD for maximization. We compare our method against ERM Vapnik (1998), ADA Volpi et al. (2018), and PAR Wang et al. (2019a).

We also implement a variant of our method through Bayesian Neural Networks (BNNs) Blundell et al. (2015); Gal and Ghahramani (2016); Lakshminarayanan et al. (2017) to demonstrate our compatibility with stochastic neural networks. BNNs learn a distribution over network parameters and are currently the state of the art for estimating predictive uncertainty Ebrahimi et al. (2020); Neal (2012). We follow Blundell et al. (2015) to implement the BNN via variational inference. During training, in each maximization phase, a set of network parameters $\{\mathbf{w}^{(m)}\}_{m=1}^{M}$ is drawn from the variational posterior, and the predictive uncertainty is redefined as the expectation of the prediction entropies: $\frac{1}{M} \sum_{m=1}^{M} H(\hat{y} \mid x; \mathbf{w}^{(m)})$. We refer to the supplementary material for more details of this BNN variant.
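A hedged sketch of this Monte-Carlo estimate of predictive uncertainty is given below; we assume `bnn(x)` draws a fresh set of weights from the variational posterior on each forward pass, as in typical Bayes-by-Backprop implementations:

```python
import torch
import torch.nn.functional as F

def mc_predictive_entropy(bnn, x, num_samples=5):
    # Average the prediction entropy over several weight draws from the
    # variational posterior; this replaces the deterministic entropy term
    # in the maximization phase of the BNN variant.
    entropies = []
    for _ in range(num_samples):
        log_p = F.log_softmax(bnn(x), dim=1)
        entropies.append(-(log_p.exp() * log_p).sum(dim=1))
    return torch.stack(entropies, dim=0).mean(dim=0)
```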

Results. Table 1 shows the classification accuracy and standard deviation of each model averaged over ten runs. Our model with the maximum-entropy formulation achieves the best performance, while the improvement on USPS is not as significant as those on the other domains due to its high similarity with MNIST. We also observe that engaging the BNN further improves our performance; intuitively, we believe this is because the BNN provides a better estimation of the predictive uncertainty in the maximization phase. We are also interested in analyzing the behavior of our method as the number of minimax iterations $K$ increases. Figure 1 shows the results of our method and other baselines obtained by varying $K$ while fixing $T_{\min}$ and $T_{\max}$. We observe that our method improves performance on SVHN, MNIST-M, and SYN, outperforming both ERM and Volpi et al. (2018) by statistically significant margins across iterations. This demonstrates that the improvements obtained by our method are consistent.

| Method | SVHN Netzer et al. (2011) | MNIST-M Ganin and Lempitsky (2015) | SYN Ganin and Lempitsky (2015) | USPS Denker et al. (1989) | Average |
| Standard (ERM Vapnik (1998)) | 31.95 ± 1.91 | 55.96 ± 1.39 | 43.85 ± 1.27 | 79.92 ± 0.98 | 52.92 ± 0.98 |
| PAR Wang et al. (2019a) | 36.08 ± 1.27 | 61.16 ± 0.21 | 45.48 ± 0.35 | 79.95 ± 1.18 | 55.67 ± 0.33 |
| Adv. Augment (ADA) Volpi et al. (2018) | 35.70 ± 2.00 | 58.65 ± 1.72 | 47.18 ± 0.61 | 80.40 ± 1.70 | 55.48 ± 0.74 |
| + Max Entropy (ME-ADA) | 42.00 ± 1.74 | 63.98 ± 1.82 | 49.80 ± 1.74 | 79.10 ± 1.03 | 58.72 ± 1.12 |
| + Max Entropy w/ BNN | 42.56 ± 1.45 | 63.27 ± 2.09 | 50.39 ± 1.29 | 81.04 ± 0.98 | 59.32 ± 0.82 |

Table 1: Average classification accuracy (%) and standard deviation of models trained on MNIST LeCun et al. (1998) and evaluated on SVHN Netzer et al. (2011), MNIST-M Ganin and Lempitsky (2015), SYN Ganin and Lempitsky (2015) and USPS Denker et al. (1989). The results are averaged over ten runs. Best performances are highlighted in bold. The results of PAR are obtained from Xu et al. (2020).


Figure 1: Test accuracy of models trained on 10,000 MNIST LeCun et al. (1998) samples and tested on SVHN Netzer et al. (2011), MNIST-M Ganin and Lempitsky (2015), SYN Ganin and Lempitsky (2015), and USPS Denker et al. (1989). We compare our method (blue) to Volpi et al. (2018) (orange) for different numbers of iterations $K$, and to Empirical Risk Minimization (ERM) Vapnik (1998) (red line). The results are averaged over ten runs; black bars indicate the range of accuracies spanned.

4.2 PACS

| Domain | DSN | L-CNN | MLDG | Fusion | MetaReg | Epi-FCR | AGG | HEX | PAR | ADA | ME-ADA |
| Uses domain ID | yes | yes | yes | yes | yes | yes | no | no | no | no | no |
| Art | 61.1 | 62.9 | 66.2 | 64.1 | 69.8 | 64.7 | 63.4 | 66.8 | 66.9 | 64.3 | 67.1 |
| Cartoon | 66.5 | 67.0 | 66.9 | 66.8 | 70.4 | 72.3 | 66.1 | 69.7 | 67.1 | 69.8 | 69.9 |
| Photo | 83.3 | 89.5 | 88.0 | 90.2 | 91.1 | 86.1 | 88.5 | 87.9 | 88.6 | 85.1 | 88.6 |
| Sketch | 58.6 | 57.5 | 59.0 | 60.1 | 59.2 | 65.0 | 56.6 | 56.3 | 62.6 | 60.4 | 63.0 |
| Average | 67.4 | 69.2 | 70.0 | 70.3 | 72.6 | 72.0 | 68.7 | 70.2 | 71.3 | 69.9 | 72.2 |

Table 2: Classification accuracy (%) of our approach on PACS dataset Li et al. (2017) in comparison with the previously reported state-of-the-art results. Bold numbers indicate the best performance (two sets, one for each scenario engaging or forgoing domain identifications, respectively).

Experiment Setup. We continue with the PACS dataset, which consists of image collections over four domains. Each time, one domain is selected as the test domain, and the remaining three are used for training. Following Li et al. (2017), we use the ImageNet-pretrained AlexNet Krizhevsky et al. (2012) as the base network. We compare with the recently reported state of the art engaging domain identifications, including DSN Bousmalis et al. (2016), L-CNN Li et al. (2017), MLDG Li et al. (2018), Fusion Mancini et al. (2018), MetaReg Balaji et al. (2018), and Epi-FCR Li et al. (2019), as well as methods forgoing domain identifications, including AGG Li et al. (2019), HEX Wang et al. (2019b), and PAR Wang et al. (2019a). The former methods often obtain better results because they utilize domain identifications; our method belongs to the latter category. Other training details are provided in the supplementary material.

Results. We report the results in Table 2. Our method achieves the best performance among techniques forgoing domain identifications. More impressively, without using domain identifications, our method is only slightly shy of MetaReg Balaji et al. (2018) in overall performance, even though MetaReg takes advantage of domain identifications. It is also worth mentioning that our method improves upon previous methods by a relatively large margin when "sketch" is the testing domain. This is notable because "sketch" is the only colorless domain and exhibits the largest domain shift among the four domains in PACS. Our method handles this extreme case by producing larger data shifts from the source domain with the proposed maximum-entropy term during data augmentation.

4.3 CIFAR-10 and CIFAR-100 with Corruptions

Experiment Setup. In the following experiments, we show that our approach endows robustness to various architectures, including the All Convolutional Network (AllConvNet) Salimans and Kingma (2016); Springenberg et al. (2014), DenseNet-BC Huang et al. (2017) (with $k = 12$ and $d = 100$), WideResNet (40-2) Zagoruyko and Komodakis (2016), and ResNeXt-29 Xie et al. (2017). We train all networks with an initial learning rate of 0.1 optimized by SGD with Nesterov momentum, and the learning rate decays following a cosine annealing schedule Loshchilov and Hutter (2016). All input images are pre-processed with standard random left-right flipping and cropping in the minimization phase. We train AllConvNet and WideResNet for 100 epochs; DenseNet and ResNeXt require 200 epochs for convergence. Following the setting of Hendrycks et al. (2020), we use a weight decay of 0.0001 for DenseNet and 0.0005 otherwise. Due to space limitations, we refer the reader to the supplementary material for the detailed training parameters of the different architectures.

Baselines. To demonstrate the utility of our approach, we compare to several state-of-the-art techniques designed for robustness to image corruptions. These baselines include (i) the standard data augmentation baseline and Mixup Zhang et al. (2018); (ii) two regional regularization strategies for images, i.e., Cutout DeVries and Taylor (2017) and CutMix Yun et al. (2019); (iii) AutoAugment Cubuk et al. (2019), which searches over data augmentation policies to find a high-performing one via reinforcement learning; and (iv) Adversarial Training Kang et al. (2019) for model robustness against unforeseen adversaries, and Adversarial Data Augmentation Volpi et al. (2018), which generates adversarial perturbations using Wasserstein distances.

Results. The results are shown in Table 3. Our method achieves the best performance and improves upon the previous state of the art by a large margin (5% accuracy on CIFAR-10-C and 4% on CIFAR-100-C). More importantly, these gains are achieved across different architectures and on both datasets. Figure 2 shows more detailed comparisons over all corruptions. Our substantial gains in robustness are spread across a wide variety of corruptions, with a small performance drop in only three corruption types: fog, brightness, and contrast. In particular, for glass blur and for Gaussian, shot, and impulse noise, accuracies are improved by 25%. From the Fourier perspective Yin et al. (2019), the performance gains from our adversarial perturbations lie primarily in the high-frequency domain, where commonly occurring image corruptions concentrate. These results demonstrate that the maximum-entropy term can regularize networks to be more robust to common image corruptions.

| | Standard | Cutout | CutMix | AutoDA | Mixup | AdvTrain | ADA | ME-ADA |

CIFAR-10-C:
| AllConvNet | 69.2 | 67.1 | 68.7 | 70.8 | 75.4 | 71.9 | 73.0 | 78.2 |
| DenseNet | 69.3 | 67.9 | 66.5 | 73.4 | 75.4 | 72.4 | 69.8 | 76.9 |
| WideResNet | 73.1 | 73.2 | 72.9 | 76.1 | 77.7 | 73.8 | 79.7 | 83.3 |
| ResNeXt | 72.5 | 71.1 | 70.5 | 75.8 | 77.4 | 73.0 | 78.0 | 83.4 |
| Average | 71.0 | 69.8 | 69.7 | 74.0 | 76.5 | 72.8 | 75.1 | 80.5 |

CIFAR-100-C:
| AllConvNet | 43.6 | 43.2 | 44.0 | 44.9 | 46.6 | 44.0 | 45.3 | 51.2 |
| DenseNet | 40.7 | 40.4 | 40.8 | 46.1 | 44.6 | 44.8 | 45.2 | 47.8 |
| WideResNet | 46.7 | 46.5 | 47.1 | 50.4 | 49.6 | 44.9 | 50.4 | 52.8 |
| ResNeXt | 46.6 | 45.4 | 45.9 | 48.7 | 48.6 | 45.6 | 53.4 | 57.3 |
| Average | 44.4 | 43.9 | 44.5 | 47.5 | 47.4 | 44.8 | 48.6 | 52.3 |

Table 3: Average classification accuracy (%). Across several architectures, our approach obtains CIFAR-10-C and CIFAR-100-C corruption robustness that exceeds the previous state of the art by a large margin. Best performances are highlighted in bold.


Figure 2: Test accuracy of the Empirical Risk Minimization (ERM) Vapnik (1998) principle compared to our approach on the fifteen CIFAR-10-C Hendrycks and Dietterich (2019) corruptions using WideResNet (40-2) Zagoruyko and Komodakis (2016). Each bar represents an average over all five corruption strengths for a given corruption type.

5 Conclusion

In this work, we introduced a maximum-entropy technique that regularizes adversarial data augmentation. It encourages the model to learn with fictitious target distributions by producing "hard" adversarial perturbations that enlarge the predictive uncertainty of the current model. As a result, the learned model achieves improved robustness to large domain shifts or corruptions encountered during deployment. We demonstrate that our technique obtains state-of-the-art performance on MNIST, PACS, and CIFAR-10/100-C, and is extremely simple to implement. One major limitation of our method is that it cannot be directly applied to regression problems, since the maximum-entropy lower bound is still difficult to compute in that case. Future work might consider alternative measurements of information Ozair et al. (2019); Tschannen et al. (2020) that are more suited to general machine learning applications.

Broader Impact

The proposed method will be used to train perception systems that can robustly and reliably classify object instances. For example, such a system can be used in many fundamental real-world applications in which a user desires to classify object instances from a product database, such as products found in local supermarkets or online stores. Like most deep learning applications that learn from data and therefore run the risk of producing biased or offensive content reflecting the training data, our work, which learns a data-driven classification model, is no exception. Our method moderates this issue by producing fictitious target domains that are largely shifted from the source training dataset, so that models trained on these adversarial domains are less biased. A downside of this moderation, however, is the introduction of new hyper-parameters to be tuned for different tasks. Compared with other methods that obtain the same robustness but have to be trained on larger datasets, the proposed research can significantly reduce the data collection from different domains required to train classification models, thereby reducing system development time and lowering related costs.

Appendix A Supplementary Materials

A.1 Proofs

A.1.1 Proof of Proposition 3

Here, we follow the guidance of Shamir et al. (2010) to prove Proposition 3. Let $S$ be a sample set of size $N$, and let $\hat{Y}$ be a probabilistic function of $X$ into an arbitrary finite target space $\hat{\mathcal{Y}}$, defined by the conditional probability $p(\hat{y} \mid x)$ for all $x \in \mathcal{X}$ and $\hat{y} \in \hat{\mathcal{Y}}$. To prove Proposition 3, we bound the deviation of the entropy estimate from its expectation, $|\hat{H}(\hat{Y}) - \mathbb{E}[\hat{H}(\hat{Y})]|$, and then use a bound on the expected bias of entropy estimation, $|\mathbb{E}[\hat{H}(\hat{Y})] - H(\hat{Y})|$.

To bound the deviation of the entropy estimate, we use McDiarmid's inequality McDiarmid (1989), in a manner similar to Antos and Kontoyiannis (2001). For this, we must bound the change in the value of the entropy estimate when a single instance in $S$ is arbitrarily changed. A useful and easily proven inequality in that regard is the following: for any natural $m$ and for any $p, q \in [0, 1]$ with $|p - q| \le 1/m$,

$$\big| p \log p - q \log q \big| \le \frac{\log m}{m}. \qquad (15)$$

With this inequality, a careful application of McDiarmid's inequality leads to the following lemma.

Lemma 1.

For any $\delta \in (0, 1)$, with probability of at least $1 - \delta$ over the sample set, we have that,

$$\big| \hat{H}(\hat{Y}) - \mathbb{E}\big[\hat{H}(\hat{Y})\big] \big| \le \log N \sqrt{\frac{2 \log(2/\delta)}{N}}. \qquad (16)$$
Proof.

First, we bound the change caused by a single replacement in $S$. The plug-in estimate is

$$\hat{H}(\hat{Y}) = -\sum_{\hat{y} \in \hat{\mathcal{Y}}} \hat{p}(\hat{y}) \log \hat{p}(\hat{y}), \qquad (17)$$

where $\hat{p}(\hat{y})$ denotes the empirical probability of $\hat{y}$ in $S$. If we change a single instance in $S$, then there exist two pairs $(x, \hat{y})$ and $(x', \hat{y}')$ such that $\hat{p}(x, \hat{y})$ increases by $1/N$ and $\hat{p}(x', \hat{y}')$ decreases by $1/N$. This means that $\hat{p}(\hat{y})$ and $\hat{p}(\hat{y}')$ also change by at most $1/N$, while all other values in the distribution remain the same. Therefore, for each of the two affected outcomes, $\hat{p}(\hat{y}) \log \hat{p}(\hat{y})$ changes by at most $\frac{\log N}{N}$.

Based on this and Eq. (15), $\hat{H}(\hat{Y})$ changes by at most $\frac{2 \log N}{N}$. Applying McDiarmid's inequality, we get Eq. (16). We have thus proven Lemma 1. ∎

Lemma 1 bounds the deviation of the estimate $\hat{H}(\hat{Y})$ from its expected value. In order to relate this to the true value of the entropy $H(\hat{Y})$, we use the following bias bound from Paninski (2003) and Shamir et al. (2010).

Lemma 2 (Paninski Paninski (2003); Shamir et al. Shamir et al. (2010), Lemma 9).

For a random variable $\hat{Y}$ with the plug-in estimate $\hat{H}(\hat{Y})$ of its entropy, based on an i.i.d. sample set of size $N$, we have that,

$$\big| \mathbb{E}\big[\hat{H}(\hat{Y})\big] - H(\hat{Y}) \big| \le \log\Big(1 + \frac{|\hat{\mathcal{Y}}| - 1}{N}\Big) \le \frac{|\hat{\mathcal{Y}}| - 1}{N}. \qquad (18)$$

From this lemma, the quantity $|\mathbb{E}[\hat{H}(\hat{Y})] - H(\hat{Y})|$ is upper bounded by $\frac{|\hat{\mathcal{Y}}| - 1}{N}$. Combining it with Eq. (16), we get the bound in Proposition 3.

Proposition 4 (Proposition 3 restated).

Let $\hat{Y}$ be a fixed probabilistic function of $X$ into an arbitrary finite target space $\hat{\mathcal{Y}}$, determined by a fixed and known conditional probability distribution $p(\hat{y} \mid x)$, and let $S$ be a sample set of size $N$ drawn from the joint probability distribution $p(x, \hat{y})$. For any $\delta \in (0, 1)$, with probability of at least $1 - \delta$ over the sample set $S$, we have,

$$\big| H(\hat{Y}) - \hat{H}(\hat{Y}) \big| \le \log N \sqrt{\frac{2 \log(2/\delta)}{N}} + \frac{|\hat{\mathcal{Y}}| - 1}{N}. \qquad (19)$$
Proof.

To prove the proposition, we start by using the triangle inequality to write,

$$\big| H(\hat{Y}) - \hat{H}(\hat{Y}) \big| \le \big| \hat{H}(\hat{Y}) - \mathbb{E}\big[\hat{H}(\hat{Y})\big] \big| + \big| \mathbb{E}\big[\hat{H}(\hat{Y})\big] - H(\hat{Y}) \big|. \qquad (20)$$

Because $H(\hat{Y})$ is constant, we have:

$$\mathbb{E}\big[H(\hat{Y})\big] = H(\hat{Y}). \qquad (21)$$

By the linearity of expectation, we have:

$$\mathbb{E}\big[\hat{H}(\hat{Y}) - H(\hat{Y})\big] = \mathbb{E}\big[\hat{H}(\hat{Y})\big] - H(\hat{Y}). \qquad (22)$$

Combining these with Lemmas 1 and 2, we get the bound in Proposition 3. ∎

A.1.2 Proof of Corollary 1

The proof of Corollary 1 is based on the following bound proposed by Kolchinsky et al. (2019).

Lemma 3 (Kolchinsky et al. Kolchinsky et al. (2019), Theorem 1).

Let $X$ be a random variable (continuous or discrete), and $\hat{Y}$ be a random variable with a finite set of outcomes $\hat{\mathcal{Y}}$. Consider two joint distributions over $X$ and $\hat{Y}$, $p(x, \hat{y})$ and $q(x, \hat{y})$, which have the same marginal over $X$, $p(x) = q(x)$, and obey $\mathbb{E}_{p(x)}\big[\mathrm{TV}\big(p(\hat{y} \mid x), q(\hat{y} \mid x)\big)\big] \le \epsilon$. Then,

$$\big| H_p(\hat{Y} \mid X) - H_q(\hat{Y} \mid X) \big| \le h_b(\epsilon) + \epsilon \log\big(|\hat{\mathcal{Y}}| - 1\big). \qquad (23)$$

This lemma upper bounds the quantity $|H_p(\hat{Y} \mid X) - H_q(\hat{Y} \mid X)|$ by $h_b(\epsilon) + \epsilon \log(|\hat{\mathcal{Y}}| - 1)$. Specializing it to the case when $\hat{Y}$ is a deterministic function of $X$ under $q$, we get the bound in Corollary 1.

Corollary 2 (Corollary 1 restated).

Let $X$ be a random variable and $\hat{Y}$ be a random variable with a finite set of outcomes $\hat{\mathcal{Y}}$. Let $q(x, \hat{y})$ be a joint distribution over $X$ and $\hat{Y}$ under which $\hat{Y} = f(X)$ for a deterministic function $f$. Let $p(x, \hat{y})$ be a joint distribution over $X$ and $\hat{Y}$ which has the same marginal over $X$ as $q$, i.e., $p(x) = q(x)$, and obeys $\mathbb{E}_{p(x)}\big[\mathrm{TV}\big(p(\hat{y} \mid x), q(\hat{y} \mid x)\big)\big] \le \epsilon$. Then, we have that,

$$H_p(\hat{Y} \mid X) \le h_b(\epsilon) + \epsilon \log\big(|\hat{\mathcal{Y}}| - 1\big). \qquad (24)$$
Proof.

Since $\hat{Y}$ is a deterministic function of $X$ under $q$, i.e., $\hat{Y} = f(X)$, we have $H_q(\hat{Y} \mid X) = 0$. Combining this with Eq. (23), we prove the bound in Corollary 1. ∎

A.2 Implementation Details

A.2.1 BNN Variant

We follow Blundell et al. (2015) to implement the BNN variant of our method. Let $x$ be the observed input variable and $\mathbf{w}$ be the set of latent variables (the network parameters). Deep neural networks can be viewed as a probabilistic model $p(y \mid x, \mathbf{w})$, where $\mathcal{D}$ is a set of training examples and $y$ is the network output, which belongs to a set of object categories. Variational inference aims to calculate the conditional probability distribution over the latent variables (network parameters), $p(\mathbf{w} \mid \mathcal{D})$, by finding the closest proxy to the exact posterior through solving an optimization problem.

Following the guidance of Blundell et al. (2015), we first assume a family of probability densities over the latent variables $\mathbf{w}$, parameterized by $\vartheta$, i.e., $q(\mathbf{w} \mid \vartheta)$. We then find the closest member of this family to the true conditional probability $p(\mathbf{w} \mid \mathcal{D})$ by minimizing the KL-divergence between them, which is equivalent to minimizing the following variational free energy:

$$\mathcal{F}(\mathcal{D}, \vartheta) = D_{KL}\big(q(\mathbf{w} \mid \vartheta) \,\|\, p(\mathbf{w})\big) - \mathbb{E}_{q(\mathbf{w} \mid \vartheta)}\big[\log p(\mathcal{D} \mid \mathbf{w})\big]. \qquad (25)$$

This objective function can be approximated using $M$ Monte Carlo samples $\mathbf{w}^{(m)}$ from the variational posterior Blundell et al. (2015):

$$\mathcal{F}(\mathcal{D}, \vartheta) \approx \sum_{m=1}^{M} \Big[ \log q\big(\mathbf{w}^{(m)} \mid \vartheta\big) - \log p\big(\mathbf{w}^{(m)}\big) - \log p\big(\mathcal{D} \mid \mathbf{w}^{(m)}\big) \Big]. \qquad (26)$$

We assume $q(\mathbf{w} \mid \vartheta)$ has a Gaussian probability density function with diagonal covariance, parameterized by $\vartheta = (\mu, \rho)$. A sample weight of the variational posterior can be obtained by the reparameterization trick Kingma and Welling (2014): we sample $\epsilon$ from a unit Gaussian and parameterize $\mathbf{w} = \mu + \log(1 + \exp(\rho)) \circ \epsilon$, where $\epsilon$ is the noise drawn from the unit Gaussian and $\circ$ is the point-wise multiplication. For the prior, as suggested by Blundell et al. (2015), a scale mixture of two Gaussian probability density functions is chosen: both are zero-centered but have two different variances $\sigma_1^2$ and $\sigma_2^2$, mixed with ratio $\pi$; the specific values of $\sigma_1$, $\sigma_2$, and $\pi$ are chosen from the ranges recommended by Blundell et al. (2015). Then, the optimizing objective of adversarial perturbations in the maximization phase of our method is redefined by:

$$x_n^{k} \leftarrow \arg\max_{x \in \mathcal{X}} \; \Big\{\mathcal{L}(\theta; (x, y_n)) + \frac{\beta}{M} \sum_{m=1}^{M} H\big(\hat{y} \mid x; \mathbf{w}^{(m)}\big) - \gamma c_\theta\big((x, y_n), (x_n^{k-1}, y_n)\big)\Big\}, \qquad (27)$$

where $\mathbf{w}^{(m)}$ is sampled $M$ times from the learned variational posterior.
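A minimal sketch of the reparameterized weight sample used above (the softplus form $\log(1 + \exp(\rho))$ follows Blundell et al. (2015); the variable names are ours):

```python
import torch
import torch.nn.functional as F

def sample_weight(mu, rho):
    # Reparameterization trick for the diagonal-Gaussian variational
    # posterior: w = mu + softplus(rho) * eps, with eps ~ N(0, I).
    # softplus keeps the standard deviation strictly positive.
    eps = torch.randn_like(mu)
    return mu + F.softplus(rho) * eps
```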

A.2.2 PACS

Previous state-of-the-art methods on this dataset follow two streams. The first stream, including DSN Bousmalis et al. (2016), L-CNN Li et al. (2017), MLDG Li et al. (2018), Fusion Mancini et al. (2018), MetaReg Balaji et al. (2018), and Epi-FCR Li et al. (2019), engages domain identifications, which means that when training the model, each source domain is regarded as a separate domain. The second stream, containing AGG Li et al. (2019), HEX Wang et al. (2019b), and PAR Wang et al. (2019a), does not leverage domain identifications and combines all source domains into a single one during the training procedure. The first stream leverages more information, i.e., the domain identifications, during network training, and thus often yields better performance than the second stream. Our work belongs to the latter stream.

| Target Domain | $K$ | Total iter. | $T_{\min}$ (iter.) | $T_{\max}$ (iter.) | lr (min. phase) | lr (max. phase) | $\gamma$ | $\beta$ |
| Art | 1 | 45,000 | 100 | 50 | 0.001 | 50.0 | 10.0 | 1.0 |
| Cartoon | 1 | 45,000 | 100 | 50 | 0.001 | 50.0 | 10.0 | 100.0 |
| Photo | 1 | 45,000 | 100 | 50 | 0.001 | 50.0 | 10.0 | 1.0 |
| Sketch | 1 | 45,000 | 100 | 50 | 0.001 | 50.0 | 10.0 | 100.0 |

Table 4: The settings of different target domains on PACS.

We follow the setup of Li et al. (2017) for network training. To align with the previous methods, the ImageNet-pretrained AlexNet Krizhevsky et al. (2012) is employed as the baseline network. We set the batch size to 32. We use SGD with a learning rate of 0.001 (decaying according to a cosine annealing schedule Loshchilov and Hutter (2016)), a momentum of 0.9, and a weight decay of 0.00005 for minimization, while we use SGD with a learning rate of 50.0 for maximization. Table 4 shows the detailed settings of all parameters under the four different target domains.

A.2.3 CIFAR-10 and CIFAR-100

| | $K$ | Total (epoch) | $T_{\min}$ (epoch) | $T_{\max}$ (loop) | lr (min. phase) | lr (max. phase) | $\gamma$ | $\beta$ |

CIFAR-10-C:
| AllConvNet | 2 | 100 | 10 | 15 | 0.1 | 20.0 | 0.1 | 10.0 |
| DenseNet | 2 | 200 | 10 | 15 | 0.1 | 20.0 | 1.0 | 100.0 |
| WideResNet | 2 | 100 | 10 | 15 | 0.1 | 20.0 | 1.0 | 10.0 |
| ResNeXt | 2 | 200 | 10 | 15 | 0.1 | 20.0 | 1.0 | 10.0 |

CIFAR-100-C:
| AllConvNet | 2 | 100 | 10 | 15 | 0.1 | 20.0 | 0.1 | 10.0 |
| DenseNet | 2 | 200 | 10 | 15 | 0.1 | 20.0 | 10.0 | 10.0 |
| WideResNet | 2 | 100 | 10 | 15 | 0.1 | 20.0 | 1.0 | 10.0 |
| ResNeXt | 2 | 200 | 10 | 15 | 0.1 | 20.0 | 10.0 | 10.0 |

Table 5: The settings of different network architectures on CIFAR-10-C and CIFAR-100-C.

The experimental settings follow the setups in Hendrycks et al. (2020). We use SGD for both minimization and maximization. In Table 5, we report the detailed settings of all parameters under different network architectures on CIFAR-10-C and CIFAR-100-C. Note that the total training length and $T_{\min}$ are measured in training epochs, while $T_{\max}$ is measured in iterations. We do not compare our method with Hendrycks et al. (2020), since its design depends on a set of pre-defined image corruptions and thus has a different research target from our method.

References

  • [1] A. Achille and S. Soatto (2018) Emergence of invariance and disentanglement in deep representations. Journal of Machine Learning Research 19 (1), pp. 1947–1980. Cited by: §3.
  • [2] A. Achille and S. Soatto (2018) Information dropout: learning optimal representations through noisy computation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 40 (12), pp. 2897–2905. Cited by: §1, §3.
  • [3] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy (2017) Deep variational information bottleneck. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §2.
  • [4] A. A. Alemi, I. Fischer, and J. V. Dillon (2018) Uncertainty in the variational information bottleneck. In Proceedings of the Conference on Uncertainty in Artificial Intelligence Workshops, Cited by: §2.
  • [5] R. A. Amjad and B. C. Geiger (2019) Learning representations for neural network-based classification using the information bottleneck principle. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Cited by: §1, §2, §3.1, §3.
  • [6] A. Antos and I. Kontoyiannis (2001) Convergence properties of functional estimates for discrete distributions. Random Structures & Algorithms 19 (3-4), pp. 163–193. Cited by: §A.1.1.
  • [7] Y. Balaji, S. Sankaranarayanan, and R. Chellappa (2018) Metareg: towards domain generalization using meta-regularization. In Advances in Neural Information Processing Systems (NeurIPS), pp. 998–1008. Cited by: §A.2.2, §1, §2, §4.2, §4.2.
  • [8] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm (2018) Mutual information neural estimation. In Proceedings of the International Conference on Machine Learning (ICML), pp. 531–540. Cited by: §1.
  • [9] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015) Weight uncertainty in neural network. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1613–1622. Cited by: §A.2.1, §A.2.1, §A.2.1, §4.1.
  • [10] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan (2016) Domain separation networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 343–351. Cited by: §A.2.2, §2, §4.2.
  • [11] S. Boyd and L. Vandenberghe (2004) Convex optimization. Cambridge University Press. Cited by: §3.1.
  • [12] H. Cheng, D. Lian, S. Gao, and Y. Geng (2019) Utilizing information bottleneck to evaluate the capability of deep neural networks for image classification. Entropy 21 (5), pp. 456. Cited by: §2.
  • [13] T. M. Cover and J. A. Thomas (2012) Elements of information theory. John Wiley & Sons. Cited by: §3.1.
  • [14] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2019) AutoAugment: learning augmentation strategies from data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 113–123. Cited by: §4.3.
  • [15] J. S. Denker, W. R. Gardner, H. P. Graf, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, H. S. Baird, and I. Guyon (1989) Neural network recognizer for hand-written zip code digits. In Advances in Neural Information Processing Systems (NeurIPS), pp. 323–331. Cited by: Figure 1, Table 1, §4.
  • [16] T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with Cutout. arXiv preprint arXiv:1708.04552. Cited by: §4.3.
  • [17] S. Ebrahimi, M. Elhoseiny, T. Darrell, and M. Rohrbach (2020) Uncertainty-guided continual learning with bayesian neural networks. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §4.1.
  • [18] A. Elad, D. Haviv, Y. Blau, and T. Michaeli (2019) Direct validation of the information bottleneck principle for deep nets. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Cited by: §1, §3.
  • [19] C. Florensa, Y. Duan, and P. Abbeel (2017) Stochastic neural networks for hierarchical reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §3.2.
  • [20] Y. Gal and Z. Ghahramani (2015) Bayesian convolutional neural networks with bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158. Cited by: §3.
  • [21] Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1050–1059. Cited by: §3.2, §4.1.
  • [22] Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1180–1189. Cited by: §2, Figure 1, Table 1, §4.
  • [23] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030. Cited by: §1.
  • [24] I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §1.
  • [25] D. Hendrycks and T. Dietterich (2019) Benchmarking neural network robustness to common corruptions and perturbations. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §1, Figure 2, §4.
  • [26] D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan (2020) AugMix: a simple data processing method to improve robustness and uncertainty. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §A.2.3, §4.3.
  • [27] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4700–4708. Cited by: §4.3.
  • [28] D. Kang, Y. Sun, D. Hendrycks, T. Brown, and J. Steinhardt (2019) Testing robustness against unforeseen adversaries. arXiv preprint arXiv:1908.08016. Cited by: §4.3.
  • [29] A. Kendall and Y. Gal (2017) What uncertainties do we need in bayesian deep learning for computer vision?. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5574–5584. Cited by: §3.1.
  • [30] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §4.1.
  • [31] D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §A.2.1.
  • [32] A. Kolchinsky, B. D. Tracey, and S. Van Kuyk (2019) Caveats for information bottleneck in deterministic scenarios. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §A.1.2, §1, §2, §3.2, Lemma 3.
  • [33] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Cited by: §4.
  • [34] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1097–1105. Cited by: §A.2.2, §4.2.
  • [35] A. Krogh and J. A. Hertz (1992) A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems (NeurIPS), pp. 950–957. Cited by: §3.
  • [36] S. Kullback and R. A. Leibler (1951) On information and sufficiency. Annals of Mathematical Statistics 22 (1), pp. 79–86. Cited by: §2.
  • [37] A. Kurakin, I. J. Goodfellow, and S. Bengio (2017) Adversarial machine learning at scale. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §1.
  • [38] B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems (NeurIPS), pp. 6402–6413. Cited by: §4.1.
  • [39] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel (1989) Backpropagation applied to handwritten zip code recognition. Neural Computation 1 (4), pp. 541–551. Cited by: §4.1.
  • [40] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: Figure 1, Table 1, §4.
  • [41] D. Li, Y. Yang, Y. Song, and T. M. Hospedales (2017) Deeper, broader and artier domain generalization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5542–5550. Cited by: §A.2.2, §A.2.2, §2, §4.2, Table 2, §4.
  • [42] D. Li, Y. Yang, Y. Song, and T. M. Hospedales (2018) Learning to generalize: meta-learning for domain generalization. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Cited by: §A.2.2, §2, §4.2.
  • [43] D. Li, J. Zhang, Y. Yang, C. Liu, Y. Song, and T. M. Hospedales (2019) Episodic training for domain generalization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1446–1455. Cited by: §A.2.2, §1, §2, §4.2.
  • [44] I. Loshchilov and F. Hutter (2016) SGDR: stochastic gradient descent with warm restarts. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §4.3.
  • [45] M. Mancini, S. R. Bulò, B. Caputo, and E. Ricci (2018) Best sources forward: domain generalization through source-specific nets. In Proceedings of the IEEE International Conference on Image Processing (ICIP), pp. 1353–1357. Cited by: §A.2.2, §2, §4.2.
  • [46] C. McDiarmid (1989) On the method of bounded differences. In Surveys in Combinatorics, 1989: Invited Papers at the Twelfth British Combinatorial Conference, London Mathematical Society Lecture Note Series, pp. 148–188. Cited by: §A.1.1.
  • [47] R. M. Neal (2012) Bayesian learning for neural networks. Vol. 118, Springer Science & Business Media. Cited by: §4.1.
  • [48] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, Cited by: Figure 1, Table 1, §4.
  • [49] S. Ozair, C. Lynch, Y. Bengio, A. Van den Oord, S. Levine, and P. Sermanet (2019) Wasserstein dependency measure for representation learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 15578–15588. Cited by: §1, §5.
  • [50] L. Paninski (2003) Estimation of entropy and mutual information. Neural Computation 15 (6), pp. 1191–1253. Cited by: §A.1.1, §1, Lemma 2.
  • [51] F. Qiao, L. Zhao, and X. Peng (2020) Learning to learn single domain generalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12556–12565. Cited by: §1, §2.
  • [52] T. Salimans and D. P. Kingma (2016) Weight normalization: a simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 901–909. Cited by: §4.3.
  • [53] O. Shamir, S. Sabato, and N. Tishby (2010) Learning and generalization with the information bottleneck. Theoretical Computer Science 411 (29-30), pp. 2696–2711. Cited by: §A.1.1, §A.1.1, §1, §2, §3.1, Lemma 2.
  • [54] A. Sinha, H. Namkoong, and J. Duchi (2018) Certifying some distributional robustness with principled adversarial training. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §1, §2, §3.
  • [55] J. Snoek, Y. Ovadia, E. Fertig, B. Lakshminarayanan, S. Nowozin, D. Sculley, J. Dillon, J. Ren, and Z. Nado (2019) Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems (NeurIPS), pp. 13969–13980. Cited by: §3.1.
  • [56] J. Song and S. Ermon (2020) Understanding the limitations of variational mutual information estimators. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §1.
  • [57] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller (2014) Striving for simplicity: the all convolutional net. In Proceedings of the International Conference on Learning Representations Workshops, Cited by: §4.3.
  • [58] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §1, §3.2.
  • [59] D. Strouse and D. J. Schwab (2017) The deterministic information bottleneck. Neural Computation 29 (6), pp. 1611–1630. Cited by: §3.1.
  • [60] C. Tang and R. R. Salakhutdinov (2013) Learning stochastic feedforward neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 530–538. Cited by: §3.2.
  • [61] N. Tishby, F. C. Pereira, and W. Bialek (1999) The information bottleneck method. In Proceedings of the Annual Allerton Conference on Communication, Control, and Computing, pp. 368–377. Cited by: §1, §2, §3.1.
  • [62] N. Tishby and N. Zaslavsky (2015) Deep learning and the information bottleneck principle. In Proceedings of the IEEE Information Theory Workshop (ITW), pp. 1–5. Cited by: §1, §2.
  • [63] M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lucic (2020) On mutual information maximization for representation learning. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §1, §5.
  • [64] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7167–7176. Cited by: §2.
  • [65] G. Valiant and P. Valiant (2011) Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the Annual ACM Symposium on Theory of Computing (STOC), pp. 685–694. Cited by: §3.1.
  • [66] V. N. Vapnik (1998) Statistical learning theory. Wiley. Cited by: Figure 1, Figure 2, §4.1, Table 1.
  • [67] M. Vera, P. Piantanida, and L. R. Vega (2018) The role of the information bottleneck in representation learning. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), pp. 1580–1584. Cited by: §2.
  • [68] R. Volpi, H. Namkoong, O. Sener, J. Duchi, V. Murino, and S. Savarese (2018) Generalizing to unseen domains via adversarial data augmentation. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5339–5349. Cited by: §1, §1, §2, §2, §3.1, §3, Figure 1, §4.1, §4.1, §4.3, Table 1, §4.
  • [69] H. Wang, S. Ge, Z. Lipton, and E. P. Xing (2019) Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems (NeurIPS), pp. 10506–10518. Cited by: §A.2.2, §4.1, §4.2, Table 1.
  • [70] H. Wang, Z. He, Z. C. Lipton, and E. P. Xing (2019) Learning robust representations by projecting superficial statistics out. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §A.2.2, §4.2.
  • [71] Y. Wu and P. Yang (2016) Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory 62 (6), pp. 3702–3720. Cited by: §3.1.
  • [72] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1492–1500. Cited by: §4.3.
  • [73] Z. Xu, D. Liu, J. Yang, and M. Niethammer (2020) Robust and generalizable visual representation learning via random convolutions. arXiv preprint arXiv:2007.13003. Cited by: Table 1.
  • [74] D. Yin, R. G. Lopes, J. Shlens, E. D. Cubuk, and J. Gilmer (2019) A fourier perspective on model robustness in computer vision. In Advances in Neural Information Processing Systems (NeurIPS), pp. 13255–13265. Cited by: §4.3.
  • [75] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo (2019) Cutmix: regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 6023–6032. Cited by: §4.3.
  • [76] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. In Proceedings of the British Machine Vision Conference (BMVC), Cited by: §A.2.2, Figure 2, §4.3.
  • [77] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2018) Mixup: beyond empirical risk minimization. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §4.3.
  • [78] L. Zhao, X. Peng, Y. Chen, M. Kapadia, and D. N. Metaxas (2020) Knowledge as priors: cross-modal knowledge generalization for datasets without superior knowledge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6528–6537. Cited by: §1.
  • [79] L. Zhao, X. Peng, Y. Tian, M. Kapadia, and D. N. Metaxas (2019) Semantic graph convolutional networks for 3D human pose regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3425–3435. Cited by: §1.