"Maximum-Entropy Adversarial Data Augmentation for Improved Generalization and Robustness" (NeurIPS 2020).
Adversarial data augmentation has shown promise for training robust deep neural networks against unforeseen data shifts or corruptions. However, it is difficult to define heuristics to generate effective fictitious target distributions containing "hard" adversarial perturbations that are largely different from the source distribution. In this paper, we propose a novel and effective regularization term for adversarial data augmentation. We theoretically derive it from the information bottleneck principle, which results in a maximum-entropy formulation. Intuitively, this regularization term encourages perturbing the underlying source distribution to enlarge predictive uncertainty of the current model, so that the generated "hard" adversarial perturbations can improve the model robustness during training. Experimental results on three standard benchmarks demonstrate that our method consistently outperforms the existing state of the art by a statistically significant margin.
Deep neural networks can achieve good performance on the condition that the training and testing data are drawn from the same distribution. However, this condition might not hold true in practice. Data shifts caused by mismatches between training and testing domains Balaji et al. (2018); Li et al. (2019); Sinha et al. (2018); Volpi et al. (2018); Zhao et al. (2019), small corruptions to data distributions Hendrycks and Dietterich (2019); Zhao et al. (2020), or adversarial attacks Goodfellow et al. (2015); Kurakin et al. (2017) are often inevitable in real-world applications, and lead to significant performance degradation of deep learning models. Recently, adversarial data augmentation Ganin et al. (2016); Sinha et al. (2018); Volpi et al. (2018) has emerged as a strong baseline where fictitious target distributions are generated by an adversarial loss to resemble unforeseen data shifts, and used to improve model robustness through training. The adversarial loss is leveraged to produce perturbations that fool the current model. However, as shown in Qiao et al. (2020), this heuristic loss function is insufficient to synthesize large data shifts, i.e., “hard” adversarial perturbations from the source domain, which leaves the model vulnerable to severely shifted or corrupted testing data.
To mitigate this issue, we propose a regularization technique for adversarial data augmentation from an information theory perspective using the Information Bottleneck (IB) Tishby et al. (1999) principle. The IB principle encourages the model to learn an optimal representation by diminishing the irrelevant parts of the input variable that do not contribute to the prediction. Recently, there has been a surge of interest in combining the IB method with training of deep neural networks Achille and Soatto (2018b); Amjad and Geiger (2019); Elad et al. (2019); Kolchinsky et al. (2019); Ozair et al. (2019); Tschannen et al. (2020), while its effectiveness for adversarial data augmentation still remains unclear.
In the IB context, a neural network often fails to generalize to out-of-domain data when the information of the input cannot be well compressed by the model, i.e., the mutual information of the input and its associated latent representation is high Shamir et al. (2010); Tishby and Zaslavsky (2015). Motivated by this conceptual observation, we aim to regularize adversarial data augmentation through maximizing the IB function. Specifically, we produce “hard” fictitious target domains that are largely shifted from the source domain by enlarging the mutual information of the input and latent distribution within the current model. However, mutual information has been shown to be intractable in the literature Belghazi et al. (2018); Paninski (2003); Song and Ermon (2020), and therefore directly optimizing this objective is challenging.
In this paper, we develop an efficient maximum-entropy regularizer to achieve the same goal by making the following contributions: (i) to the best of our knowledge, this is the first work to investigate adversarial data augmentation from an information theory perspective, and to address the problem of generating “hard” adversarial perturbations from the IB principle, which has not been studied yet; (ii) we theoretically show that the IB principle can be bounded by a maximum-entropy regularization term in the maximization phase of adversarial data augmentation, which results in a notable improvement over Volpi et al. (2018); (iii) we also show that our formulation holds in an approximate sense under certain non-deterministic conditions (e.g., when the neural network is stochastic or contains Dropout Srivastava et al. (2014) layers). Note that our maximum-entropy regularizer can be implemented by one line of code with minor computational cost, while it consistently and statistically significantly improves the existing state of the art on three standard benchmarks.
Information Bottleneck Principle. We begin by summarizing the concept of information bottleneck and, along the way, introduce the notation. The Information Bottleneck (IB) Tishby et al. (1999) is a principled way to seek a latent representation $Z$ that an input variable $X$ contains about an output $Y$. Let $I(X;Z)$ be the mutual information of $X$ and $Z$, i.e., $I(X;Z) = D_{\mathrm{KL}}(p_{XZ} \,\|\, p_X p_Z)$, where $D_{\mathrm{KL}}$ denotes the KL-divergence Kullback and Leibler (1951). Intuitively, $I(X;Z)$ measures how much the uncertainty in $X$ is reduced given $Z$. The representation $Z$ can be quantified by two terms: $I(X;Z)$, which reflects how much $Z$ compresses $X$, and $I(Y;Z)$, which reflects how well $Z$ predicts $Y$. In practice, this IB principle is explored by minimizing the following IB Lagrangian:
$$\min \; \big\{ I(X;Z) - \lambda\, I(Y;Z) \big\}, \tag{1}$$
where $\lambda$ is a positive parameter that controls the trade-off between compression and prediction. By controlling the amount of compression within the representation $Z$ via the compression term $I(X;Z)$, we can tune desired characteristics of trained models such as robustness to adversarial samples Alemi et al. (2017), generalization error Amjad and Geiger (2019); Cheng et al. (2019); Kolchinsky et al. (2019); Shamir et al. (2010); Tishby and Zaslavsky (2015); Vera et al. (2018), and detection of out-of-distribution data Alemi et al. (2018).
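As a small illustration of the two terms in the IB Lagrangian, the following sketch computes mutual information for discrete toy distributions and combines the compression and prediction terms. The function names, the trade-off argument `lam`, and the toy joint tables are hypothetical and chosen only for illustration; real inputs are high dimensional, which is exactly why this direct computation is not feasible in practice.

```python
import math


def mutual_information(joint):
    """Return I(A;B) in nats for a joint table where joint[a][b] = p(a, b)."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    total = 0.0
    for a, row in enumerate(joint):
        for b, p_ab in enumerate(row):
            if p_ab > 0.0:  # zero-probability cells contribute nothing
                total += p_ab * math.log(p_ab / (p_a[a] * p_b[b]))
    return total


def ib_lagrangian(joint_xz, joint_yz, lam):
    """Compression term I(X;Z) minus lam times the prediction term I(Y;Z)."""
    return mutual_information(joint_xz) - lam * mutual_information(joint_yz)


# Toy cases: Z independent of X carries no information about X,
# while Z copying a fair binary X carries log(2) nats.
independent = [[0.25, 0.25], [0.25, 0.25]]
copy = [[0.5, 0.0], [0.0, 0.5]]
i_independent = mutual_information(independent)
i_copy = mutual_information(copy)
print(i_independent, i_copy)
```

With `lam` larger than one, the copying representation is preferred over the independent one in this toy setting, since the gain in the prediction term outweighs the compression cost.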
Domain Generalization. Domain adaptation Ganin and Lempitsky (2015); Tzeng et al. (2017) transfers models in source domains to related target domains with different distributions during the training procedure. On the other hand, domain generalization Balaji et al. (2018); Bousmalis et al. (2016); Li et al. (2017, 2018, 2019); Mancini et al. (2018) aims to learn features that perform well when transferred to unseen domains during evaluation. This paper further studies a more challenging setting named single domain generalization Qiao et al. (2020), where networks are learned using one single source domain, in contrast to conventional domain generalization, which requires multiple training source domains. Recently, adversarial data augmentation Volpi et al. (2018) has proven to be a promising solution that synthesizes virtual target domains during training so that the generalization and robustness of the learned networks to unseen domains can be improved. Our approach improves it by proposing an efficient regularizer.
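Adversarial Data Augmentation. We are interested in the problem of training deep neural networks in a single source domain and deploying them to unforeseen domains following different underlying distributions. Let $X \in \mathcal{X}$ be random data points with associated labels $Y \in \mathcal{Y}$ ($|\mathcal{Y}|$ is finite) drawn from the source distribution $P_0$. We consider the following worst-case problem around $P_0$:
$$\min_{\theta \in \Theta} \Big\{ \sup_{P} \big\{ \mathbb{E}_{P}[\mathcal{L}(\theta; X, Y)] : D_\theta(P, P_0) \le \rho \big\} \Big\}, \tag{2}$$
where $\theta \in \Theta$ is the network parameters, $\mathcal{L}$ is the loss function, and $D_\theta$ measures the distance between two distributions $P$ and $P_0$. We denote $\theta = \{\theta_f, w\}$, where $w$ represents the parameters of the final prediction layer and $\theta_f$ represents the parameters of the rest of the network. Letting $f(\theta_f; x)$ be the latent representation of input $x$, we feed it into a $|\mathcal{Y}|$-way classifier such that, using the softmax activation, the probability $p^{(i)}$ of the $i$-th class is:
$$p^{(i)}(\theta; x) = \frac{\exp\big(w_i^\top f(\theta_f; x)\big)}{\sum_{j=1}^{|\mathcal{Y}|} \exp\big(w_j^\top f(\theta_f; x)\big)}, \tag{3}$$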
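where $w_i$ is the parameters for the $i$-th class. In the classification setting, we minimize the cross-entropy loss over each sample $(x, y)$ in the training domain: $\mathcal{L}_{\mathrm{CE}}(\theta; x, y) := -\log p^{(y)}(\theta; x)$. Moreover, in order to preserve the semantics of the input samples, the metric $D_\theta$ is defined in the latent space $\mathcal{Z}$. Let $c_\theta$ denote the transportation cost of moving mass from $(z_0, y_0)$ to $(z, y)$: $c_\theta((z_0, y_0), (z, y)) := \frac{1}{2}\|z_0 - z\|_2^2 + \infty \cdot \mathbf{1}\{y_0 \neq y\}$, where $z = f(\theta_f; x)$. For probability measures $P$ and $P_0$ supported on $\mathcal{Z}$, let $\Pi(P, P_0)$ be their couplings. Then, we use the Wasserstein metric $D_\theta$ defined by $D_\theta(P, P_0) := \inf_{M \in \Pi(P, P_0)} \mathbb{E}_{M}[c_\theta(Z, Z_0)]$. The solution to the worst-case problem (2) ensures good performance (robustness) against any data distribution $P$ that is $\rho$ distance away from the source domain $P_0$. However, for deep neural networks, this formulation is intractable with arbitrary $\rho$. Instead, following the reformulation of Sinha et al. (2018); Volpi et al. (2018), we consider its Lagrangian relaxation for a fixed penalty parameter $\gamma \ge 0$:
$$\min_{\theta \in \Theta} \Big\{ \sup_{P} \big\{ \mathbb{E}_{P}[\mathcal{L}(\theta; X, Y)] - \gamma D_\theta(P, P_0) \big\} \Big\}. \tag{4}$$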
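The building blocks of this setup can be sketched in a few lines: softmax class probabilities computed from a latent representation, the cross-entropy loss, and a transportation cost measured in latent space. This is a minimal sketch under stated assumptions: the function names are hypothetical, the final layer is taken to be linear without bias, and the cost is assumed to be half the squared Euclidean distance between latent vectors with an infinite penalty when the label changes.

```python
import math


def softmax_probs(weights, z):
    """Class probabilities from final-layer weights (one row per class) and latent z."""
    logits = [sum(w_k * z_k for w_k, z_k in zip(w, z)) for w in weights]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, y):
    """Negative log-probability assigned to the true class y."""
    return -math.log(probs[y])


def transport_cost(z0, y0, z, y):
    """Assumed cost: half the squared latent distance, infinite if the label changes."""
    if y0 != y:
        return math.inf
    return 0.5 * sum((a - b) ** 2 for a, b in zip(z0, z))


weights = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical two-class final layer
probs = softmax_probs(weights, [0.0, 0.0])
print(probs, cross_entropy(probs, 0))
print(transport_cost([0.0, 0.0], 1, [3.0, 4.0], 1))
```

The infinite cost for label changes means perturbed samples keep the label of the source sample they were generated from.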
In this paper, our main idea is to incorporate the IB principle into adversarial data augmentation so as to improve model robustness to large domain shifts. We start by adapting the IB Lagrangian (1) to supervised-learning scenarios so that the latent representation $Z$ can be leveraged for classification purposes. To this end, we modify the IB Lagrangian (1) following Achille and Soatto (2018a, b); Amjad and Geiger (2019) to $\mathcal{L}_{\mathrm{IB}}(\theta; X, Y) := \mathcal{L}_{\mathrm{CE}}(\theta; X, Y) + \beta I(X;Z)$, where the constraint on $I(Y;Z)$ is replaced with the risk associated to the prediction according to the loss function $\mathcal{L}_{\mathrm{CE}}$. We can see that $\mathcal{L}_{\mathrm{IB}}$ appears as a standard cross-entropy loss augmented with a regularizer $I(X;Z)$ promoting minimality of the representation. Then, we rewrite Eq. (4) to leverage the newly defined $\mathcal{L}_{\mathrm{IB}}$ loss function:
$$\min_{\theta \in \Theta} \Big\{ \sup_{P} \big\{ \mathbb{E}_{P}[\mathcal{L}_{\mathrm{IB}}(\theta; X, Y)] - \gamma D_\theta(P, P_0) \big\} \Big\}. \tag{5}$$
As discussed in Sinha et al. (2018); Volpi et al. (2018), the worst-case setting of Eq. (5) can be formalized as a minimax optimization problem. It is solved by an iterative training procedure where two phases are alternated in $K$ iterations. In the maximization phase, new data points are produced by computing the inner maximization problem to mimic fictitious target distributions $P$ that satisfy the constraint $D_\theta(P, P_0) \le \rho$. In the minimization phase, the network parameters are updated by the loss function $\mathcal{L}_{\mathrm{IB}}$ evaluated on the adversarial examples generated from the maximization phase.
The main challenge in optimizing Eq. (5) is that exact computation of the compression term $I(X;Z)$ in $\mathcal{L}_{\mathrm{IB}}$ is almost impossible due to the high dimensionality of the data. The way of approximating this term in the minimization phase has been widely studied in recent years, and we follow Elad et al. (2019); Gal and Ghahramani (2015) to express $I(X;Z)$ by an $\ell_2$ penalty (also known as weight decay Krogh and Hertz (1992)). Below, we discuss how to effectively implement $I(X;Z)$ in the maximization phase for adversarial data augmentation. The full algorithm is summarized in Algorithm 1.
Intuitively, regularizing the mutual information $I(X;Z)$ in the maximization phase encourages adversarial perturbations that cannot be effectively “compressed” by the current model. From the information theory perspective, these perturbations usually imply large domain shifts, and thus can potentially benefit model generalization and robustness. However, since $X$ is high dimensional, maximizing $I(X;Z)$ is intractable. One of our key results is that, when we restrict to classification scenarios, we can efficiently approximate and maximize $I(X;Z)$ during adversarial data augmentation. As we will show, this process can be effectively implemented through maximizing the entropy of network predictions, which is a tractable lower bound of $I(X;Z)$.
Deep neural networks can be considered as a Markov chain of successive representations of the input where information flows obeying the structure $X \to Z \to \hat{Y}$. By the Data Processing Inequality Cover and Thomas (2012), we have $I(X;Z) \ge I(X;\hat{Y})$. On the other hand, when performing data augmentation during each maximization phase, the model parameters $\theta$ are fixed, and $\hat{Y}$ is a deterministic function of $X$, i.e., any given input is mapped to a single class. Consequently, $H(\hat{Y} \mid X) = 0$, and it holds that $I(X;\hat{Y}) = H(\hat{Y}) - H(\hat{Y} \mid X) = H(\hat{Y})$, where $H(\cdot)$ is the Shannon entropy. Combining these, we have Proposition 1.
Proposition 1. Consider a deterministic neural network, the parameters $\theta$ of which are fixed. Given the input $X$, let $\hat{Y}$ be the network prediction and $Z$ be the latent representation of $X$. Then, the mutual information $I(X;Z)$ is lower bounded by $H(\hat{Y})$, i.e., we have that,
$$I(X;Z) \ge I(X;\hat{Y}) = H(\hat{Y}). \tag{6}$$
Note that Eq. (6) does not mean that calculating $H(\hat{Y})$ does not need $Z$, since $\hat{Y}$ is generated by inputting $Z$ into the network. There are two important benefits of our formulation (6) to be discussed. First, it provides a method to maximize $I(X;Z)$ that is not related to the input dimensionality. Thus, for high dimensional images, we can still maximize mutual information this way. Second, our formulation is closely related to the Deterministic Information Bottleneck Strouse and Schwab (2017), where $I(X;Z)$ is approximated by $H(Z)$. However, $H(Z)$ is still intractable in general. Instead, $H(\hat{Y})$ can be directly computed from the softmax output of a classification network as we will show later. Next, we modify Eq. (5) by replacing $I(X;Z)$ with $H(\hat{Y})$, which becomes a relaxed worst-case problem:
$$\min_{\theta \in \Theta} \Big\{ \sup_{P} \big\{ \mathbb{E}_{P}[\mathcal{L}_{\mathrm{CE}}(\theta; X, Y)] + \beta H(\hat{Y}) - \gamma D_\theta(P, P_0) \big\} \Big\}. \tag{7}$$
From a Bayesian perspective, the prediction entropy can be viewed as the predictive uncertainty Kendall and Gal (2017); Snoek et al. (2019) of a model. Therefore, our maximum-entropy formulation (7) is equivalent to perturbing the underlying data distribution so that the predictive uncertainty of the current model is enlarged in the maximization phase. This motivates us to extend our approach to stochastic neural networks for better capturing the model uncertainty as we will show in the experiment.
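The lower bound in Proposition 1 can be checked numerically on a small discrete example. The sketch below builds a deterministic chain from an input to a latent code to a predicted class and compares the resulting quantities. The input distribution and the two lookup tables `encode` and `classify` are hypothetical toy choices, not anything taken from the experiments.

```python
import math


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mutual_information(joint):
    """I(A;B) in nats for a joint table where joint[a][b] = p(a, b)."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    return sum(p_ab * math.log(p_ab / (p_a[a] * p_b[b]))
               for a, row in enumerate(joint)
               for b, p_ab in enumerate(row) if p_ab > 0.0)


p_x = [0.1, 0.2, 0.3, 0.4]   # toy input distribution over four inputs
encode = [0, 1, 1, 2]        # deterministic latent code: z = encode[x]
classify = [0, 0, 1]         # deterministic prediction: y_hat = classify[z]

joint_xz = [[p if encode[x] == z else 0.0 for z in range(3)]
            for x, p in enumerate(p_x)]
joint_xy = [[p if classify[encode[x]] == y else 0.0 for y in range(2)]
            for x, p in enumerate(p_x)]

i_xz = mutual_information(joint_xz)
i_xy = mutual_information(joint_xy)
h_y = entropy([sum(col) for col in zip(*joint_xy)])
print(i_xz, i_xy, h_y)  # I(X;Z) >= I(X;Y_hat) = H(Y_hat)
```

In this toy chain the prediction merges two latent codes into one class, so the entropy of the prediction is strictly smaller than the mutual information between input and latent code.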
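Empirical Estimation. Now, Eq. (7) involves the expected prediction entropy $H(\hat{Y})$ over the data distribution. However, during training we only have sample access to the data distribution, which we can use as a surrogate for empirical estimation. Given an observed input $x$ sampled from the source distribution $P_0$, we start by defining the prediction entropy of its corresponding output by:
$$h(\theta; x) := -\sum_{i=1}^{|\mathcal{Y}|} p^{(i)}(\theta; x) \log p^{(i)}(\theta; x). \tag{8}$$
Then, through calculating the expectation over the prediction entropies of all possible observations $x$ contained in the source dataset $S$, we can obtain the empirical estimation of $H(\hat{Y})$:
$$\hat{H}(\hat{Y}) := \mathbb{E}_{x \in S}\big[h(\theta; x)\big], \tag{9}$$
where $\hat{H}(\cdot)$ denotes the empirical entropy, and $\hat{Y}$ is the network prediction of $X$. After combining Eq. (9) with the relaxed worst-case problem (7), we will have the empirical counterpart of $\mathcal{L}_{\mathrm{IB}}$, which is defined by $\mathcal{L}_{\mathrm{ME}}(\theta; X, Y) := \mathcal{L}_{\mathrm{CE}}(\theta; X, Y) + \beta h(\theta; X)$. Taking the dual reformulation of the penalty problem $\sup_{P} \{ \mathbb{E}_{P}[\mathcal{L}_{\mathrm{ME}}(\theta; X, Y)] - \gamma D_\theta(P, P_0) \}$, we can obtain an efficient solution procedure. The following result is a minor adaptation of Volpi et al. (2018) (Lemma 1):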
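Let $\mathcal{L}_{\mathrm{ME}}$ and $c_\theta$ be continuous. Let $\phi_\gamma(\theta; x_0, y_0) := \sup_{x \in \mathcal{X}} \{ \mathcal{L}_{\mathrm{ME}}(\theta; x, y_0) - \gamma c_\theta((z, y_0), (z_0, y_0)) \}$ denote the robust surrogate loss. Then, for any distribution $P_0$ and any $\gamma \ge 0$, we have that,
$$\sup_{P} \big\{ \mathbb{E}_{P}[\mathcal{L}_{\mathrm{ME}}(\theta; X, Y)] - \gamma D_\theta(P, P_0) \big\} = \mathbb{E}_{P_0}\big[\phi_\gamma(\theta; X, Y)\big]. \tag{10}$$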
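To solve the penalty problem of Eq. (5), in the minimization phase of the iterative training procedure, we can perform Stochastic Gradient Descent (SGD) on the robust surrogate loss $\phi_\gamma$. To be specific, under suitable conditions Boyd and Vandenberghe (2004), we have that $\nabla_\theta \phi_\gamma(\theta; x_0, y_0) = \nabla_\theta \mathcal{L}_{\mathrm{ME}}(\theta; x_\gamma^\star, y_0)$, where
$$x_\gamma^\star = \arg\max_{x \in \mathcal{X}} \big\{ \mathcal{L}_{\mathrm{ME}}(\theta; x, y_0) - \gamma c_\theta((z, y_0), (z_0, y_0)) \big\} \tag{11}$$
is an adversarial perturbation of $x_0$ at the current model $\theta$. On the other hand, in the maximization phase, we solve the maximization problem (11) by Maximum-Entropy Adversarial Data Augmentation (ME-ADA) in this work. Concretely, in the $k$-th maximization phase, we compute adversarially perturbed samples at the current model $\theta$:
$$X_i^k \in \arg\max_{x \in \mathcal{X}} \big\{ \mathcal{L}_{\mathrm{CE}}(\theta; x, Y_i) + \beta h(\theta; x) - \gamma c_\theta((z, Y_i), (z_i^{k-1}, Y_i)) \big\}. \tag{12}$$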
Note that the entropy term is efficient to compute from the softmax output of a model and can be implemented with one line of code in modern deep learning frameworks, while it yields substantial performance improvement, as we will show in the experiments.
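A maximization phase of this kind can be sketched on a toy model. The code below is only an illustration under strong simplifying assumptions: the classifier is a linear softmax layer, the latent representation is taken to be the input itself, gradients are approximated by central finite differences instead of backpropagation, and the transport cost is the plain squared Euclidean distance (any constant factor is absorbed into `gamma`). All names and hyperparameter values are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(probs):
    """Shannon entropy (nats) of a softmax output; this is the added regularizer."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def objective(x, x0, y, weights, beta, gamma):
    """Cross-entropy + beta * entropy - gamma * squared distance to the source point."""
    probs = softmax([sum(w_k * x_k for w_k, x_k in zip(w, x)) for w in weights])
    distance = sum((a - b) ** 2 for a, b in zip(x, x0))
    return -math.log(probs[y]) + beta * prediction_entropy(probs) - gamma * distance


def maximization_phase(x0, y, weights, beta=1.0, gamma=1.0, lr=0.1, steps=15, eps=1e-5):
    """Gradient ascent on the objective, using central finite differences."""
    x = list(x0)
    for _ in range(steps):
        grad = []
        for k in range(len(x)):
            up = x[:k] + [x[k] + eps] + x[k + 1:]
            down = x[:k] + [x[k] - eps] + x[k + 1:]
            grad.append((objective(up, x0, y, weights, beta, gamma)
                         - objective(down, x0, y, weights, beta, gamma)) / (2.0 * eps))
        x = [x_k + lr * g for x_k, g in zip(x, grad)]
    return x


weights = [[1.0, -1.0], [-1.0, 1.0]]  # toy two-class linear classifier
x0, y = [1.0, 0.0], 0                 # a confidently classified source point
x_adv = maximization_phase(x0, y, weights)
before = objective(x0, x0, y, weights, 1.0, 1.0)
after = objective(x_adv, x0, y, weights, 1.0, 1.0)
print(x_adv, before, after)
```

In this toy run the perturbed point moves toward the decision boundary, where the prediction entropy is largest, while the distance penalty keeps it near the source point. Setting `beta` to zero recovers an ascent on the cross-entropy and distance terms alone.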
Theoretic Bound. It is essential to guarantee that the empirical estimate of the entropy (from a training set containing samples) is an accurate estimate of the true expected entropy . The next proposition ensures that for large , in a classification problem, the sample estimate of average entropy is close to the expected entropy.
Proposition 3. Let be a fixed probabilistic function of into an arbitrary finite target space , determined by a fixed and known conditional probability distribution , and be a sample set of size drawn from the joint probability distribution . For any , with probability of at least over the sample set , we have,
We prove Proposition 3 in the supplementary material. The proof adapts the setting in Shamir et al. (2010), where we bound the deviations of the information estimations from their expectation and then use the bound on the expected bias of entropy estimation. Here, it is also worth discussing two important properties of this bound. First, we note that Proposition 3 holds for any fixed probabilistic function. Compared with prior studies on the plug-in estimate of discrete entropy over a finite size alphabet Valiant and Valiant (2011); Wu and Yang (2016), we focus on the bound of non-optimal estimators. In particular, this proposition holds for any , even if is not a globally optimal solution for in Eq. (7). This is the case of models in the maximization phase, which thus ensures the effectiveness of our formulation across the whole iterative training procedure. Second, the bound does not depend on . In addition, the complexity of the bound is mainly controlled by . By constraining to be small, a tight bound can be achieved. This assumption usually holds for the setting of training classification models, i.e., .
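The qualitative message of this bound, that a sample-based entropy estimate approaches the true entropy as the sample grows, can be illustrated with a plug-in estimator on a small categorical distribution. This sketch does not reproduce the bound itself; the distribution `probs`, the sample sizes, and the seed are hypothetical choices.

```python
import math
import random
from collections import Counter


def plug_in_entropy(samples):
    """Entropy (nats) of the empirical distribution of a list of discrete outcomes."""
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in Counter(samples).values())


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


rng = random.Random(0)   # fixed seed so the run is repeatable
probs = [0.5, 0.3, 0.2]  # hypothetical distribution over three classes
gaps = {}
for n in (10, 100, 10000):
    samples = rng.choices(range(len(probs)), weights=probs, k=n)
    gaps[n] = abs(plug_in_entropy(samples) - entropy(probs))
    print(n, gaps[n])
```

With only a handful of classes, a few thousand samples are typically enough for the estimate to be close to the true value.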
It is important to note that not all models are deterministic, e.g., when deep neural networks are stochastic Florensa et al. (2017); Tang and Salakhutdinov (2013) or contain Dropout layers Gal and Ghahramani (2016); Srivastava et al. (2014). The mapping from to may be intrinsically noisy or non-deterministic. Here, we show that when is a small perturbation away from being a deterministic function of , our maximum-entropy formulation (7) still applies in an approximate sense. We now consider the case when the joint distribution of and is -close to having be a deterministic function of . The next result is a minor adaptation of Kolchinsky et al. (2019) (Theorem 1) and it shows that the conditional entropy is away from being zero.
Corollary 1. Let be a random variable and be a random variable with a finite set of outcomes . Let be a joint distribution over and under which . Let be a joint distribution over and which has the same marginal over as , i.e., , and obey . Then, we have that,
As we show in this corollary, even if the relationship between and is not perfectly deterministic but close to being so, i.e., it is -close to a deterministic function, then we have . Hence, in this case, the proposed Proposition 1 and our maximum-entropy adversarial data augmentation formulation (7) still hold in an approximate sense.
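The near-deterministic case can be illustrated numerically: when each input is mapped to its class with probability close to one, the conditional entropy of the prediction given the input is small, so the mutual information stays close to the prediction entropy. The sketch below uses a hypothetical four-input, two-class toy distribution with a noise level `eps`; it illustrates the qualitative behavior only and does not reproduce the bound of the corollary.

```python
import math


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def noisy_joint(eps):
    """Joint p(x, y_hat): four equally likely inputs, each mapped to its class w.p. 1 - eps."""
    labels = [0, 0, 1, 1]
    return [[0.25 * ((1.0 - eps) if y == labels[x] else eps) for y in range(2)]
            for x in range(4)]


def conditional_entropy(joint):
    """H(Y_hat | X) = sum_x p(x) * H(Y_hat | X = x)."""
    total = 0.0
    for row in joint:
        p_x = sum(row)
        total += p_x * entropy([p / p_x for p in row])
    return total


def marginal_entropy(joint):
    return entropy([sum(col) for col in zip(*joint)])


for eps in (0.0, 0.01, 0.1):
    joint = noisy_joint(eps)
    h_cond = conditional_entropy(joint)
    h_marg = marginal_entropy(joint)
    print(eps, h_cond, h_marg - h_cond)  # last value is I(X; Y_hat)
```

As `eps` shrinks, the gap between the mutual information and the prediction entropy shrinks with it, and it vanishes in the deterministic case.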
In this section, we evaluate our approach over a variety of settings. We first test with MNIST under the setting of large domain shifts, and then test on a more challenging dataset, PACS, under the domain generalization setting. Further, we test on CIFAR-10-C and CIFAR-100-C, which are standard benchmarks for evaluating model robustness to common corruptions. We compare the proposed Maximum-Entropy Adversarial Data Augmentation (ME-ADA) with the previous state of the art when available. We note that Adversarial Data Augmentation (ADA) Volpi et al. (2018) is our main competitor, since our method reduces to Volpi et al. (2018) when the maximum-entropy term is discarded.
Datasets. The MNIST dataset LeCun et al. (1998) consists of handwritten digits with 60,000 training examples and 10,000 testing examples. Other digit datasets, including SVHN Netzer et al. (2011), MNIST-M Ganin and Lempitsky (2015), SYN Ganin and Lempitsky (2015) and USPS Denker et al. (1989), are leveraged for evaluating model performance. These four datasets contain large domain shifts from MNIST in terms of backgrounds, shapes and textures. PACS Li et al. (2017) is a recent dataset with different object style depictions and a more challenging domain shift than the MNIST experiment. This dataset contains four domains (art, cartoon, photo and sketch), and shares seven common object categories (dog, elephant, giraffe, guitar, house, horse and person) across these domains. It is made up of 9,991 images with the resolution of . For fair comparison, we follow the protocol in Li et al. (2017) including the recommended train, validation and test split.
CIFAR-10 and CIFAR-100 are two datasets Krizhevsky and Hinton (2009) containing small natural RGB images, both with 50,000 training images and 10,000 testing images. CIFAR-10 has 10 categories, and CIFAR-100 has 100 object classes. In order to measure the resilience of a model to common corruptions, we evaluate on the CIFAR-10-C and CIFAR-100-C datasets Hendrycks and Dietterich (2019). These two datasets are constructed by corrupting the original CIFAR test sets. For each dataset, there are a total of fifteen noise, blur, weather, and digital corruption types, each of which appears at five severity levels or intensities. We do not tune on the validation corruptions, so we report the average performance over all corruptions and intensities.
Experiment Setup. We follow the setup of Volpi et al. (2018) in experimenting with MNIST dataset. We use 10,000 samples from MNIST for training and evaluate prediction accuracy on the respective test sets of SVHN, MNIST-M, SYN and USPS. In order to work with comparable datasets, we resize all the images to , and treat images from MNIST and USPS as RGB images. We use LeNet LeCun et al. (1989) as a base model and the batch size is 32. We use Adam Kingma and Ba (2014) with for minimization and SGD with for maximization. We set , , , and . We compare our method against ERM Vapnik (1998), ADA Volpi et al. (2018), and PAR Wang et al. (2019a).
We also implement a variant of our method through Bayesian Neural Networks (BNNs) Blundell et al. (2015); Gal and Ghahramani (2016); Lakshminarayanan et al. (2017) to demonstrate our compatibility with stochastic neural networks. BNNs learn a distribution over network parameters and are currently the state of the art for estimating predictive uncertainty Ebrahimi et al. (2020); Neal (2012). We follow Blundell et al. (2015) to implement the BNN via variational inference. During the training procedure, in each maximization phase, a set of network parameters are drawn from the variational posterior , and then the predictive uncertainty is redefined by the expectation of all prediction entropies: . We refer to the supplementary material for more details of this BNN variant.
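The redefined predictive uncertainty, an average of prediction entropies over parameters drawn from the variational posterior, can be sketched as follows. This is a minimal illustration under stated assumptions: the classifier is a toy linear softmax layer, the posterior is a diagonal Gaussian described directly by a mean table `mu` and a standard-deviation table `sigma`, and all names and values are hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sample_weights(mu, sigma, rng):
    """Draw one weight table from a diagonal Gaussian: w = mu + sigma * noise."""
    return [[m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu_row, sigma_row)]
            for mu_row, sigma_row in zip(mu, sigma)]


def expected_prediction_entropy(x, mu, sigma, num_samples, rng):
    """Monte Carlo average of prediction entropies over sampled weights."""
    total = 0.0
    for _ in range(num_samples):
        w = sample_weights(mu, sigma, rng)
        logits = [sum(w_k * x_k for w_k, x_k in zip(row, x)) for row in w]
        total += prediction_entropy(softmax(logits))
    return total / num_samples


rng = random.Random(0)
mu = [[1.0, -1.0], [-1.0, 1.0]]
sigma = [[0.5, 0.5], [0.5, 0.5]]
zero_sigma = [[0.0, 0.0], [0.0, 0.0]]
x = [1.0, 0.0]
h_point = expected_prediction_entropy(x, mu, zero_sigma, 5, rng)
h_bayes = expected_prediction_entropy(x, mu, sigma, 200, rng)
print(h_point, h_bayes)
```

With zero standard deviation the estimate coincides with the entropy of a single deterministic prediction, which is the non-Bayesian case.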
Results. Table 1 shows the classification accuracy and standard deviation of each model averaged over ten runs. We can see that our model with the maximum-entropy formulation achieves the best performance, while the improvement on USPS is not as significant as those on other domains due to its high similarity with MNIST. We then notice that, after engaging the BNN, our performance is further improved. Intuitively, we believe this is because the BNN provides a better estimation of the predictive uncertainty in the maximization phase. We are also interested in analyzing the behavior of our method when $K$ is increased. Figure 1 shows the results of our method and other baselines by varying the number of iterations $K$ while fixing and . We observe that our method improves performance on SVHN, MNIST-M and SYN, outperforming both ERM and Volpi et al. (2018) by a statistically significant margin across different numbers of iterations. This demonstrates that the improvements obtained by our method are consistent.
Experiment Setup. We continue to experiment on the PACS dataset, which consists of collections of images over four domains. Each time, one domain is selected as the test domain, and the remaining three are used for training. Following Li et al. (2017), we use the ImageNet-pretrained AlexNet Krizhevsky et al. (2012) as a base network. We compare with recently reported state of the art engaging domain identifications, including DSN Bousmalis et al. (2016), L-CNN Li et al. (2017), MLDG Li et al. (2018), Fusion Mancini et al. (2018), MetaReg Balaji et al. (2018) and Epi-FCR Li et al. (2019), as well as methods forgoing domain identifications, including AGG Li et al. (2019), HEX Wang et al. (2019b), and PAR Wang et al. (2019a). The former methods often obtain better results because they utilize domain identifications. Our method belongs to the latter category. Other training details are provided in the supplementary material.
Results. We report the results in Table 2. We note that our method achieves the best performance among techniques forgoing domain identifications. More impressively, our method, without using domain identifications, is only slightly shy of MetaReg Balaji et al. (2018) in terms of overall performance, which takes advantage of domain identifications. It is also worth mentioning that our method improves over previous methods by a relatively large margin when “sketch” is the testing domain. This is notable because “sketch” is the only colorless domain and has the largest domain shift out of the four domains in PACS. Our method handles this extreme case by producing larger data shifts from the source domain with the proposed maximum-entropy term during data augmentation.
Experiment Setup. In the following experiments, we show that our approach endows robustness to various architectures including All Convolutional Network (AllConvNet) Salimans and Kingma (2016); Springenberg et al. (2014), DenseNet-BC Huang et al. (2017) (with and ), WideResNet (40-2) Zagoruyko and Komodakis (2016), and ResNeXt-29 () Xie et al. (2017). We train all networks with an initial learning rate of 0.1 optimized by SGD using Nesterov momentum, and the learning rate decays following a cosine annealing schedule Loshchilov and Hutter (2016). All input images are pre-processed with standard random left-right flipping and cropping in the minimization phase. We train AllConvNet and WideResNet for 100 epochs; DenseNet and ResNeXt require 200 epochs for convergence. Following the setting of Hendrycks et al. (2020), we use a weight decay of 0.0001 for DenseNet and 0.0005 otherwise. Due to space limitations, we refer readers to the supplementary material for detailed settings of our training parameters for different architectures.
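For reference, a cosine annealing schedule without restarts can be written as a one-line formula. The sketch below assumes the learning rate decays from the initial value of 0.1 to a floor `min_lr` over the whole run; the function name, the `min_lr` argument, and the per-epoch granularity are assumptions made for illustration and are not taken from the training setup.

```python
import math


def cosine_annealing_lr(step, total_steps, initial_lr=0.1, min_lr=0.0):
    """Learning rate after `step` of `total_steps`, following half a cosine period."""
    progress = step / total_steps
    return min_lr + 0.5 * (initial_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, middle, and end of a 100-epoch run.
schedule = [cosine_annealing_lr(epoch, 100) for epoch in (0, 50, 100)]
print(schedule)
```

The rate starts at the initial value, passes through half of it at the midpoint, and reaches the floor at the end of training.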
Baselines. To demonstrate the utility of our approach, we compare to many state-of-the-art techniques designed for robustness to image corruptions. These baseline techniques include (i) the standard data augmentation baseline and Mixup Zhang et al. (2018); (ii) two regional regularization strategies for images, i.e., Cutout DeVries and Taylor (2017) and Cutmix Yun et al. (2019); (iii) AutoAugment Cubuk et al. (2019), which searches over data augmentation policies to find a high-performing data augmentation policy via reinforcement learning; (iv) Adversarial Training Kang et al. (2019) for model robustness against unforeseen adversaries, and Adversarial Data Augmentation Volpi et al. (2018), which generates adversarial perturbations using Wasserstein distances.
Results. The results are shown in Table 3. Our method achieves the best performance and improves on the previous state of the art by a large margin (5% of accuracy on CIFAR-10-C and 4% on CIFAR-100-C). More importantly, these gains are achieved across different architectures and on both datasets. Figure 2 shows more detailed comparisons over all corruptions. We find that our substantial gains in robustness are spread across a wide variety of corruptions, with a small drop of performance in only three corruption types: fog, brightness and contrast. In particular, for glass blur, Gaussian, shot and impulse noises, accuracies are significantly improved by 25%. From the Fourier perspective Yin et al. (2019), the performance gains from our adversarial perturbations lie primarily in high frequency domains, which are commonly occurring image corruptions. These results demonstrate that the maximum-entropy term can regularize networks to be more robust to common image corruptions.
In this work, we introduced a maximum-entropy technique that regularizes adversarial data augmentation. It encourages the model to learn with fictitious target distributions by producing “hard” adversarial perturbations that enlarge predictive uncertainty of the current model. As a result, the learned model is able to achieve improved robustness to large domain shifts or corruptions encountered during deployment. We demonstrate that our technique obtains state-of-the-art performance on MNIST, PACS, and CIFAR-10/100-C, and is extremely simple to implement. One major limitation of our method is that it cannot be directly applied to regression problems since the maximum-entropy lower bound is still difficult to compute in this case. Our future work might consider alternative measurements of information Ozair et al. (2019); Tschannen et al. (2020) that are more suited for general machine learning applications.
The proposed method will be used to train a perception system that can robustly and reliably classify object instances. For example, this system can be used in many fundamental real-world applications in which a user desires to classify object instances from a product database, such as products found in local supermarkets or online stores. Like most deep learning applications that learn from data and run the risk of producing biased or offensive content reflecting the training data, our work, which learns a data-driven classification model, is no exception. Our method moderates this issue by producing efficient fictitious target domains that are largely shifted from the source training dataset, so that models trained on these adversarial domains are less biased. However, a downside of this moderation is the introduction of new hyper-parameters to be tuned for different tasks. Compared with other methods that obtain the same robustness but have to be trained on larger datasets, the proposed research can significantly reduce the data collection from different domains needed to train classification models, thereby reducing the system development time and lowering related costs.
Here, we follow the guidance of Shamir et al. (2010) to prove Proposition 3. Let be a sample set of size , and let be a probabilistic function of into an arbitrary finite target space, defined by for all and . To prove Proposition 3, we bound the deviations of the entropy estimations from its expectation: , and then use a bound on the expected bias of entropy estimation.
To bound the deviation of the entropy estimates, we use McDiarmid’s inequality McDiarmid (1989), in a manner similar to Antos and Kontoyiannis (2001). For this, we must bound the change in value of each of the entropy estimations when a single instance in is arbitrarily changed. A useful and easily proven inequality in that regard is the following: for any natural and for any and ,
With this inequality, a careful application of McDiarmid’s inequality leads to the following lemma.
For any , with probability of at least over the sample set, we have that,
First, we bound the change caused by a single replacement in . We have that,
If we change a single instance in , then there exist two pairs and such that increases by , and decreases by . This means that and also change by at most , while all other values in the distribution remain the same. Therefore, for each , changes by at most .
Lemma 1 provides bounds on the deviation of the from their expected values. In order to relate these to the true values of the entropy , we use the following bias bound from Paninski (2003) and Shamir et al. (2010).
For a random variable , with the plug-in estimation on its entropy, based on an i.i.d. sample set of size , we have that,
Let be a fixed probabilistic function of into an arbitrary finite target space , determined by a fixed and known conditional probability distribution , and be a sample set of size drawn from the joint probability distribution . For any , with probability of at least over the sample set , we have,
Let be a random variable (continuous or discrete), and be a random variable with a finite set of outcomes . Consider two joint distributions over and , and , which have the same marginal over , , and obey . Then,
This lemma upper bounds the quantity by . After extending it to the case when is a deterministic function of , we get the bound in Corollary 1.
Let be a random variable and be a random variable with a finite set of outcomes . Let be a joint distribution over and under which . Let be a joint distribution over and which has the same marginal over as , i.e., , and obey . Then, we have that,
We follow Blundell et al. (2015) to implement the BNN variant of our method. Let be the observed input variable and be a set of latent variables. A deep neural network can be viewed as a probabilistic model , where is a set of training examples and is the network output, which belongs to a set of object categories and is computed using the network parameters . Variational inference aims to calculate this conditional probability distribution over the latent variables (network parameters) by solving an optimization problem that finds the closest proxy to the exact posterior.
Following the guidance of Blundell et al. (2015), we first assume a family of probability densities over the latent variables parameterized by , i.e., . We then find the closest member of this family to the true conditional probability by minimizing the KL-divergence between and , which is equivalent to minimizing the following variational free energy:
This objective function can be approximated using Monte Carlo samples from the variational posterior Blundell et al. (2015):
The variational posterior is assumed to have a Gaussian probability density function with diagonal covariance, parameterized by . A sample weight of the variational posterior can be obtained by the reparameterization trick Kingma and Welling (2014): we draw noise from a unit Gaussian and compute the weight as , where is the noise drawn from the unit Gaussian and is the point-wise multiplication. For the prior, as suggested by Blundell et al. (2015), a scale mixture of two Gaussian probability density functions is chosen: they are zero-centered but have two different variances of and , with the ratio of . In this work, we let , , and . Then, the optimizing objective of adversarial perturbations in the maximization phase of our method is redefined by:
where is sampled times from the learned variational posterior.
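The sampling steps described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' implementation: the model is a hypothetical linear classifier, the standard deviation is taken as `softplus(rho)` following the parameterization in Blundell et al. (2015), all numeric values are placeholders, and averaging the class probabilities before taking the entropy is one common choice that may not match the elided objective.

```python
import math
import random

def softplus(x):
    """Numerically stable log(1 + exp(x)); keeps the standard deviation positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sample_weights(mu, rho, rng):
    """Reparameterized draw: w = mu + softplus(rho) * eps, with eps ~ N(0, 1)."""
    return [[m + softplus(r) * rng.gauss(0.0, 1.0) for m, r in zip(mu_row, rho_row)]
            for mu_row, rho_row in zip(mu, rho)]

def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(x, weights):
    """Class probabilities of a linear classifier; weights[k] is the row for class k."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])

def mc_predictive_entropy(x, mu, rho, num_samples, rng):
    """Entropy (nats) of the class distribution averaged over sampled weights."""
    mean_probs = [0.0] * len(mu)
    for _ in range(num_samples):
        probs = predict(x, sample_weights(mu, rho, rng))
        mean_probs = [a + p / num_samples for a, p in zip(mean_probs, probs)]
    return -sum(p * math.log(p) for p in mean_probs if p > 0.0)

def log_gauss(value, sigma):
    """Log density of a zero-centered Gaussian with standard deviation sigma."""
    return -0.5 * math.log(2 * math.pi * sigma * sigma) - value * value / (2 * sigma * sigma)

def log_scale_mixture_prior(w, pi, sigma1, sigma2):
    """Log density of pi * N(w; 0, sigma1^2) + (1 - pi) * N(w; 0, sigma2^2), 0 < pi < 1."""
    a = math.log(pi) + log_gauss(w, sigma1)
    b = math.log1p(-pi) + log_gauss(w, sigma2)
    top = max(a, b)
    return top + math.log(math.exp(a - top) + math.exp(b - top))

# Placeholder setup: 3 classes, 2 input features (all values are illustrative).
rng = random.Random(0)
mu = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
rho = [[-3.0, -3.0] for _ in mu]
entropy = mc_predictive_entropy([1.0, 2.0], mu, rho, num_samples=10, rng=rng)
print(f"Monte Carlo predictive entropy: {entropy:.4f} nats")
```

Writing the draw as `mu + softplus(rho) * eps` keeps the randomness in `eps`, so gradients with respect to `mu` and `rho` remain well defined when the same computation is expressed in an automatic differentiation framework.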
Previous state-of-the-art methods on this dataset follow two streams. The first stream, including DSN Bousmalis et al. (2016), L-CNN Li et al. (2017), MLDG Li et al. (2018), Fusion Mancini et al. (2018), MetaReg Balaji et al. (2018) and Epi-FCR Li et al. (2019), uses domain identifications, meaning that each source domain is regarded as a separate domain when training the model. The second stream, containing AGG Li et al. (2019), HEX Wang et al. (2019b), and PAR Wang et al. (2019a), does not leverage domain identifications and combines all source domains into a single one during training. The first stream leverages more information during network training, namely the domain identifications, and thus often yields better performance than the second stream. Our work belongs to the latter stream.
We follow the setup of Li et al. (2017) for network training. To align with the previous methods, the ImageNet-pretrained AlexNet Krizhevsky et al. (2012) is employed as the baseline network. For network training, we set the batch size to 32. For minimization, we use SGD with a learning rate of 0.001 (decayed following a cosine annealing schedule Zagoruyko and Komodakis (2016)), a momentum of 0.9, and a weight decay of 0.00005; for maximization, we use SGD with a learning rate of 50.0. Table 4 shows the detailed settings of all parameters under the four different target domains.
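The minimization-phase optimizer settings can be illustrated with a short sketch. It assumes a half-cosine decay from the base learning rate to zero and a common SGD convention in which weight decay is added to the gradient before the momentum update; neither detail is stated in the text, and `total_steps`, `cosine_annealing_lr`, and `sgd_momentum_step` are hypothetical names introduced for illustration.

```python
import math

def cosine_annealing_lr(step, total_steps, base_lr=0.001, min_lr=0.0):
    """Learning rate at `step` under a half-cosine decay from base_lr to min_lr."""
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def sgd_momentum_step(weights, grads, velocity, lr, momentum=0.9, weight_decay=0.00005):
    """One SGD update; weight decay is folded into the gradient (assumed convention)."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + weight_decay * w   # L2 penalty contribution
        v = momentum * v + g       # momentum buffer
        new_velocity.append(v)
        new_weights.append(w - lr * v)
    return new_weights, new_velocity

# Example: learning rate at the start, middle, and end of a hypothetical 1000-step run.
total_steps = 1000
schedule = [cosine_annealing_lr(s, total_steps) for s in (0, 500, 1000)]
print(schedule)
```

The schedule starts at the stated learning rate of 0.001 and decays smoothly; the run length is a placeholder that would need to be set to the actual number of training iterations.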
The experimental settings follow the setups in Hendrycks et al. (2020). We use SGD for both minimization and maximization. In Table 5, we report the detailed settings of all parameters under different network architectures on CIFAR-10-C and CIFAR-100-C. Note that and are measured in training epochs, while is measured in iterations. In this work, we do not compare our method with Hendrycks et al. (2020), since its design depends on a set of pre-defined image corruptions and therefore targets a different research goal than our method.
Proceedings of the Conference on Uncertainty in Artificial Intelligence Workshops.
Improved regularization of convolutional neural networks with Cutout. arXiv preprint arXiv:1708.04552.
Unsupervised domain adaptation by backpropagation. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1180–1189.
Proceedings of the Annual ACM Symposium on Theory of Computing (STOC), pp. 685–694.