Deep learning has shown notable empirical success in various application areas. In the typical over-parametrized setting with a highly non-convex loss surface, classical learning theory [vapnik2013nature] predicts that deep neural networks should have a high out-of-sample error because the solution is likely to get stuck at a local minimum. Nonetheless, deep neural networks appear to generalize well even in small data regimes. Numerous recent works have sought to explain generalization in neural networks. Zhang et al. [zhang2016understanding] showed that neural networks can fit random noise and random labels, thus refuting the finite sample expressivity argument. Another view [keskar2016large] studies the loss surface geometry around the learned parameters and shows that sharper minima tend to generalize poorly compared to flatter minima, a claim that was contested by Dinh et al. [dinh2017sharp]. Some recent research [keskar2017improving, wilson2017marginal] also demonstrates that vanilla SGD optimization has better generalization ability than adaptive optimization methods.
Our method is similar to the Adversarial Distribution Shift (ADS) presented in [jacobsen2018excessive], where benign perturbations are added to the training data, causing neural networks to learn task-irrelevant features. Specifically, [jacobsen2018excessive] studied the effect of single-pixel perturbations of MNIST training images on clean test performance. Data poisoning attacks [biggio2012poisoning, shafahi2018poison, steinhardt2017certified] are also related to this approach: the adversary injects a few malicious samples into the training data to cause incorrect (typically targeted) classification during inference. Tanay et al. [tanay2018built] showed that neural network models can be made almost arbitrarily sensitive to a single pixel while maintaining identical test performance between models. However, poisoning methods [munoz2017towards, shafahi2018poison, koh2017understanding] usually modify some part of the decision boundary by adding malicious training samples for targeted misclassification, which differs from our approach of optimal ADS. Moreover, our motivation in this work is to analyze how optimization methods, specifically adaptive and non-adaptive algorithms, contribute to generalization robustness, which differs from the typical objective of data poisoning methods.
In this paper, we find optimal training ADS that cause a high generalization gap between corrupted training images and clean images during inference, while limiting the attack to a few pixels only. The overview of our method is shown in Figure 1. Our contribution in this paper is two-fold. Firstly, we propose a novel fitness function for the CMA-ES algorithm, based on domain adaptation theory, to find the optimal pixel disturbance. Our method outperforms the previous heuristic ADS method presented in [jacobsen2018excessive]. Secondly, our analysis reveals that the choice of optimization technique plays an important role in generalization robustness. Specifically, vanilla SGD is found to be surprisingly resilient against training sample perturbations compared to adaptive optimization methods like ADAM, which calls into question the effectiveness of such popular adaptive optimization methods with respect to generalization robustness.
2 Problem Setup
We consider a multi-class classification task with input space $\mathcal{X}$ and label space $\mathcal{Y}$. The true data distribution is given as
$$D = \{(x_i, y_i)\}_{i=1}^{n}, \quad x_i \in \mathcal{X},\; y_i \in \mathcal{Y}.$$
Our goal is to train a classifier on a perturbed version of the true data samples such that the empirical risk (or test error) on the true uncorrupted samples is maximized. For each sample $(x_i, y_i)$ in $D$, we draw a class-wise input perturbation $\delta_{y_i}$ from a noise distribution with parameters $\theta = (\mu, \Sigma)$, where $\mu$ is the mean and $\Sigma$ is the covariance matrix, and add it to the true sample,
$$x'_i = x_i + \delta_{y_i},$$
so that noise encoding the class information is added to each training image. The joint distribution of the perturbed data, constructed by assigning the labels of the true samples to the corresponding perturbed samples, is given as $D'_{\theta} = \{(x'_i, y_i)\}_{i=1}^{n}$. In this paper, we work with image inputs and perturb a few pixels to analyze generalization sensitivity to small changes in training inputs.
Let us define a classifier function $f: \mathcal{X} \to \mathcal{Y}$ from a hypothesis space $\mathcal{H}$. The corresponding empirical risk on samples drawn from a distribution $D$ is defined as $R_{D}(f) = \frac{1}{n} \sum_{i=1}^{n} I[f(x_i) \neq y_i]$, where $I[\cdot]$ is the indicator function, which signifies the error on the samples drawn from $D$. Our objective is to find the optimal perturbation parameters $\theta^{*}$, given as
$$\theta^{*} = \arg\max_{\theta} \; R_{D}(f^{*}_{\theta}) \quad \text{subject to} \quad f^{*}_{\theta} = \arg\min_{f \in \mathcal{H}} \; R_{D'_{\theta}}(f). \qquad (1)$$
The above objective finds the perturbation parameters that increase the empirical risk on the clean samples while minimizing it on the corrupted samples, thus compromising generalization in neural networks.
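To make the construction of the corrupted training set concrete, the following is a minimal sketch of adding a class-specific perturbation to a few pixels of each training image. The function name, the array layout, and the choice of representing a perturbation by explicit pixel positions and additive values are assumptions made for illustration; the paper does not specify this parameterization.

```python
import numpy as np

def perturb_pixels(images, labels, positions, values):
    """Add a class-specific perturbation to a few pixels of each image.

    Hypothetical layout (an assumption, not taken from the paper):
      images:    float array (n, H, W) with intensities in [0, 1]
      labels:    int array (n,) of class indices
      positions: int array (num_classes, k, 2), (row, col) of k pixels per class
      values:    float array (num_classes, k), amount added at each position
    """
    out = images.copy()  # leave the clean images untouched
    for c in range(positions.shape[0]):
        idx = np.where(labels == c)[0]  # all images of class c
        rows, cols = positions[c, :, 0], positions[c, :, 1]
        # Broadcast (m, 1) image indices against (1, k) pixel coordinates.
        out[idx[:, None], rows[None, :], cols[None, :]] += values[c][None, :]
    return np.clip(out, 0.0, 1.0)  # keep intensities in the valid range
```

The labels of the clean samples are reused unchanged for the perturbed samples, so the only difference between the two sets is the handful of pixels that encode the class.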
3 Maximum Domain Divergence based Evolutionary Strategy (MDD-ES)
The objective function in Equation 1 requires a nested minimization for classifier training and an empirical risk maximization for the optimal noise search. This makes it difficult to use standard gradient-based optimization methods to search for the optimal pixel perturbations. Therefore, we use a black-box optimization technique, specifically the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [hansen2016cma], which has been shown to work well in high-dimensional problems [ha2018recurrent]. However, simply using the empirical risk (generalization gap) on clean samples as a fitness score might require many generations for convergence, and each generation of CMA-ES is computationally expensive because it involves multiple CNN training rounds. Therefore, we propose a novel fitness score inspired by the domain divergence literature that provides an additional signal for convergence, leading to better noise optimization in fewer generations.
3.1 Measuring Domain-Divergence
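Consider a domain $\mathcal{X}$ and a collection $\mathcal{A}$ of subsets of $\mathcal{X}$. Given two domain distributions $\mathcal{D}_S$ and $\mathcal{D}_T$ over $\mathcal{X}$, and a hypothesis class $\mathcal{H}$, Ben-David et al. [ben2007analysis, ganin2016domain] showed that the domain divergence ($\mathcal{H}$-divergence) for the hypothesis space of linear classifiers can be approximately computed by the empirical $\mathcal{H}$-divergence from samples $S \sim (\mathcal{D}_S)^{n}$ and $T \sim (\mathcal{D}_T)^{n'}$ as
$$\hat{d}_{\mathcal{H}}(S, T) = 2 \left( 1 - \min_{\eta \in \mathcal{H}} \left[ \frac{1}{n} \sum_{i=1}^{n} I[\eta(x_i) = 0] + \frac{1}{n'} \sum_{i=n+1}^{n+n'} I[\eta(x_i) = 1] \right] \right),$$
where $n$ samples are drawn from the source domain, $n'$ samples are drawn from the target domain, and $I[\cdot]$ is the indicator function. The proxy $\mathcal{A}$-distance is computed as $\hat{d}_{\mathcal{A}} = 2(1 - 2\epsilon)$ according to [ben2007analysis], where $\epsilon$ is the discriminator error.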
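The proxy distance can be estimated in a few lines. The sketch below is illustrative only: it assumes flattened feature vectors, uses a plain logistic-regression discriminator trained by gradient descent, and measures the discriminator error on a held-out half of the pooled samples. The function name and all hyperparameters are hypothetical choices, not the setup used in the paper.

```python
import numpy as np

def proxy_a_distance(source, target, epochs=200, lr=0.1, seed=0):
    """Estimate 2 * (1 - 2 * err) with a linear domain discriminator."""
    rng = np.random.default_rng(seed)
    x = np.vstack([source, target]).astype(float)
    # Domain labels: 0 for source samples, 1 for target samples.
    y = np.concatenate([np.zeros(len(source)), np.ones(len(target))])
    perm = rng.permutation(len(x))
    x, y = x[perm], y[perm]
    half = len(x) // 2
    x_tr, y_tr, x_te, y_te = x[:half], y[:half], x[half:], y[half:]
    # Standardize with training statistics only.
    mu, sd = x_tr.mean(axis=0), x_tr.std(axis=0) + 1e-8
    x_tr, x_te = (x_tr - mu) / sd, (x_te - mu) / sd
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(epochs):
        logits = np.clip(x_tr @ w + b, -30.0, 30.0)
        p = 1.0 / (1.0 + np.exp(-logits))
        grad = p - y_tr  # gradient of the logistic loss w.r.t. the logits
        w -= lr * (x_tr.T @ grad) / len(y_tr)
        b -= lr * grad.mean()
    # Held-out discriminator error.
    err = np.mean(((x_te @ w + b) > 0) != (y_te == 1))
    return 2.0 * (1.0 - 2.0 * err)
```

A value near 2 means the discriminator separates the two domains almost perfectly, while a value near 0 means the domains are indistinguishable to a linear classifier.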
3.2 Bound on Target Risk
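We are interested in a bound on the target risk obtained by learning a classifier on the source samples. Ben-David et al. [ben2007analysis, ben2010theory], and later Ganin et al. [ganin2016domain], showed that the target risk can be bounded in terms of the empirical $\mathcal{H}$-divergence $\hat{d}_{\mathcal{H}}(S, T)$, which the proxy $\mathcal{A}$-distance defined above approximates, as follows.
Let $\mathcal{H}$ be a hypothesis class of VC dimension $d$. For samples $S \sim (\mathcal{D}_S)^{n}$ and
$T \sim (\mathcal{D}_T^{X})^{n}$, with probability $1 - \delta$ over the choice of samples, for every $\eta \in \mathcal{H}$:
$$R_{\mathcal{D}_T}(\eta) \le R_S(\eta) + \sqrt{\frac{4}{n} \left( d \log \frac{2 e n}{d} + \log \frac{4}{\delta} \right)} + \hat{d}_{\mathcal{H}}(S, T) + 4 \sqrt{\frac{1}{n} \left( d \log \frac{2 n}{d} + \log \frac{4}{\delta} \right)} + \beta,$$
with $\beta \ge \inf_{\eta^{*} \in \mathcal{H}} \left[ R_{\mathcal{D}_S}(\eta^{*}) + R_{\mathcal{D}_T}(\eta^{*}) \right]$, and $R_S(\eta) = \frac{1}{n} \sum_{i=1}^{n} I[\eta(x_i) \neq y_i]$ is the empirical source risk.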
Given a fixed hypothesis space, we observe that increasing the $\mathcal{H}$-divergence between the two domains loosens the above bound. Since we are interested in maximizing the target risk, pixel perturbations that increase the $\mathcal{H}$-divergence between corrupted and clean data are more likely to fool the neural network. We use this insight to craft a fitness score that favors solutions with high domain divergence between the clean and perturbed distributions.
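As a small numerical illustration of this observation, the sketch below evaluates the right-hand side of the target risk bound in the form stated by Ganin et al. [ganin2016domain]: the empirical source risk, two VC-dimension complexity terms, the empirical divergence, and the constant $\beta$. The function name and the example values are hypothetical; the point is only that the divergence enters the bound additively.

```python
import math

def target_risk_bound(source_risk, h_divergence, n, vc_dim, delta=0.05, beta=0.0):
    """Right-hand side of the target risk bound (assumes 2 * n > vc_dim)."""
    log_term = math.log(4.0 / delta)
    # First complexity term: sqrt(4/n * (d * log(2en/d) + log(4/delta))).
    c1 = math.sqrt(4.0 / n * (vc_dim * math.log(2.0 * math.e * n / vc_dim) + log_term))
    # Second complexity term: 4 * sqrt(1/n * (d * log(2n/d) + log(4/delta))).
    c2 = 4.0 * math.sqrt(1.0 / n * (vc_dim * math.log(2.0 * n / vc_dim) + log_term))
    return source_risk + c1 + h_divergence + c2 + beta

# With everything else fixed, a larger divergence shifts the bound up by the same amount.
low = target_risk_bound(0.05, 0.2, n=10000, vc_dim=50)
high = target_risk_bound(0.05, 1.0, n=10000, vc_dim=50)
```

Because the bound is only an upper limit, a larger divergence permits a higher target risk but does not force it, which is why the fitness score also needs a direct measure of the loss on clean samples.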
3.3 Proposed Fitness Score based CMA-ES Optimization
Using the insights developed in the previous section, we propose the MDD-ES algorithm, which uses a fitness score that measures (i) a semantic mismatch score and (ii) a domain divergence score. Given the training data and the initial CMA-ES parameters, we sample a population of noise candidates in each generation. For each candidate in the current generation, we obtain the optimal weights by training a CNN from scratch on the corrupted training samples. We compute the semantic mismatch score for the noise candidate using the cross-entropy loss. This score encourages a large generalization gap between clean and corrupted samples drawn from the training distribution. To obtain the domain divergence score, we train a discriminator to separate corrupted samples from clean samples, with each group assigned its own binary label. The domain divergence score is computed as the proxy $\mathcal{A}$-distance $2(1 - 2\epsilon)$, where $\epsilon$ is the error of the trained discriminator. The overall fitness score for the CMA-ES algorithm is computed as the combination of the two scores above. After each generation, the sampling parameters are updated by the CMA-ES algorithm to favor the pixel perturbations with the top-performing fitness scores. We refer the reader to the original paper [hansen2016cma] for details on the CMA-ES update algorithm. The perturbation with the best fitness score across all generations is chosen as the optimal pixel perturbation. Note that no samples from the testing data were used in the training phase for optimizing the noise generator parameters. During testing, we train with the optimally corrupted training data and perform inference on clean test data.
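The generation loop described above can be sketched as follows. This is not CMA-ES itself: for brevity it uses a simplified Gaussian evolution strategy that refits the mean and covariance to the top-scoring candidates, which is closer to the cross-entropy method and omits the step-size control and evolution paths of [hansen2016cma]. The function name, the default settings, and the `fitness_fn` callable are hypothetical; in the setting of this section, `fitness_fn` would train a CNN on the corrupted data and a discriminator, then return the combined score.

```python
import numpy as np

def evolve(fitness_fn, dim, generations=30, pop_size=16, elite_frac=0.25,
           sigma0=0.5, seed=0):
    """Simplified Gaussian evolution strategy (a stand-in for CMA-ES)."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)
    cov = (sigma0 ** 2) * np.eye(dim)
    n_elite = max(2, int(pop_size * elite_frac))
    best_x, best_f = None, -np.inf
    for _ in range(generations):
        # Sample a population of candidate perturbation parameters.
        pop = rng.multivariate_normal(mean, cov, size=pop_size)
        scores = np.array([fitness_fn(x) for x in pop])
        # Track the best candidate seen across all generations.
        top = int(scores.argmax())
        if scores[top] > best_f:
            best_f, best_x = float(scores[top]), pop[top].copy()
        # Refit the sampling distribution to the top-performing candidates.
        elite = pop[np.argsort(scores)[::-1][:n_elite]]
        mean = elite.mean(axis=0)
        cov = np.atleast_2d(np.cov(elite, rowvar=False)) + 1e-6 * np.eye(dim)
    return best_x, best_f
```

Since every fitness evaluation involves training networks from scratch, the number of generations and the population size dominate the cost, which is the motivation for a fitness score that converges in fewer generations.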
4 Experimental Results
We evaluated our method on four datasets: MNIST, Fashion-MNIST, SVHN cropped images, and CIFAR10 images. The perturbed MNIST images are shown in Figure 2 (a). Learning perturbations by evolution involves multiple training rounds in each generation. We used two custom CNN models as the underlying models in the evolutionary learning stage: GrayNet (24C3-P-48C3-P-256FC-10S) for MNIST and Fashion-MNIST, and ColorNet (32C3-32C3-P-64C3-64C3-P-128C3-128C3-P-512FC-10S) for CIFAR10 and SVHN. We use four settings for the number of perturbed pixels.
4.1 Learning Curves for Perturbation Optimization
We examine the test error over successive generations of our proposed algorithm, as shown in Figure 2 (c) and Figure 2 (d) for the MNIST and Fashion-MNIST datasets respectively. The test error grows as the evolutionary optimization advances, indicating the soundness of our proposed optimization strategy. Additionally, we visualize the mean GradCAM distribution of images per class from the testing dataset, corresponding to the true class label, for the MNIST dataset in Figure 2 (b). It reveals that the CAM distribution shifts its density to non-salient background regions of the image, so the network learns non-discriminative features that do not generalize well. This might explain the drop in testing accuracy with increasing generations.
4.2 Comparison to Prior Methods
As a baseline for our task, we choose a uniformly sampled spatial distribution of pixel perturbations, which is the starting point of the CMA-ES algorithm. Our method consistently outperforms both the baseline method and Jacobsen et al. [jacobsen2018excessive] in terms of test error on the clean test set for all the datasets, as shown in Table 1. Our method performs better than [jacobsen2018excessive] because we optimize to search for the best corruption pattern, whereas [jacobsen2018excessive] uses heuristic pixel perturbations on the left-most column of the input image to encode class-specific information. The baseline method outperforms Jacobsen et al. [jacobsen2018excessive] due to data augmentation.
4.3 Adaptivity can Overfit to Training Perturbations
High out-of-sample error is generally attributed to poor convergence of the neural network parameters to an unfavorable local minimum. By examining the robustness of well-known optimization strategies to our proposed pixel perturbation algorithm, we wish to study whether a certain algorithm is more prone to memorizing small perturbations while ignoring other salient statistical patterns in the training data. To this end, we trained CNN models on our proposed optimal ADS data using ADAM [kingma2014adam], SGD, RMSProp [tieleman2017divide], and Adabound [luo2019adaptive] optimization. The results are shown in Figure 3.
Wilson et al. [wilson2017marginal] showed that adaptive methods are affected by spurious features that do not contribute to out-of-sample generalization, by crafting an artificial linear regression example. Our method extends this line of work by automatically creating spurious examples, using evolutionary strategies, in a way that scales to arbitrarily sized datasets. Figure 3 reveals that ADAM and RMSProp show very low testing accuracy in all cases, while vanilla SGD is surprisingly resilient to such perturbations, consistently showing better out-of-sample performance on all the datasets. Adabound uses strategies from both SGD and ADAM, and thus shows intermediate performance. Thus, adaptive methods overfit to training perturbations, while vanilla SGD is considerably more robust to such changes.
Due to the input data corruption, the loss manifold changes to favor solutions that overfit to the spurious perturbation features. Our intuition is that adaptive methods adjust the algorithm to the geometry of the data [wilson2017marginal] and thus overfit to such spurious features. In contrast, SGD's optimization strategy does not depend on the data but uses the geometry inherent to the parameter space. Thus, it performs better than adaptive optimization algorithms.
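One way to see the difference in update geometry is to compare a single step of each rule. The sketch below implements the standard SGD and ADAM [kingma2014adam] updates on a toy gradient with one large and one small coordinate. The toy gradient is a hypothetical example, and whether this per-coordinate rescaling fully accounts for the behavior in Figure 3 is an intuition rather than something this snippet establishes.

```python
import numpy as np

def sgd_step(w, g, lr=0.01):
    """Vanilla SGD: the step is proportional to the raw gradient."""
    return w - lr * g

def adam_step(w, g, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One ADAM step with bias correction; t is the 1-based step count."""
    m = b1 * m + (1 - b1) * g        # first-moment estimate
    v = b2 * v + (1 - b2) * g * g    # second-moment estimate
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy gradient: a strong coordinate and a weak one (hypothetical values).
g = np.array([1.0, 1e-3])
w0 = np.zeros(2)
w_sgd = sgd_step(w0, g)
w_adam, _, _ = adam_step(w0, g, np.zeros(2), np.zeros(2), t=1)
```

On the first step, SGD moves the weak coordinate a thousand times less than the strong one, whereas ADAM normalizes each coordinate by its own gradient magnitude and moves both by roughly the learning rate. A weight tied to a weakly expressed input, such as a few perturbed pixels, can therefore receive comparatively large updates under an adaptive rule.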
Loss surface: Keskar et al. [keskar2016large, hochreiter1997flat] claimed that flatter minima generalize better than their sharper counterparts. To investigate this phenomenon, we visualize the loss surface around the learned parameters by interpolating between the weights obtained from SGD and ADAM optimization, following the strategy of Goodfellow et al. [goodfellow2014qualitatively]. We plot the loss function values and train/test accuracies at intermediate points along this interpolation path, as shown in Figure 4. Interestingly, we find that SGD finds sharper minima where both test and train loss are low, compared to ADAM, where the train loss exhibits a flatter geometry. This pattern is visible across all datasets, suggesting that flatness of a minimum does not guarantee a solution with better generalization robustness to training perturbations, which is in line with the argument of Dinh et al. [dinh2017sharp].
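The interpolation experiment can be sketched as below, following the linear path of Goodfellow et al. [goodfellow2014qualitatively]. The helper names are hypothetical, and the convention that the interpolation coefficient 0 corresponds to the first set of weights and 1 to the second is an assumption of this sketch, not necessarily the convention used in Figure 4.

```python
import numpy as np

def interpolate_weights(w_a, w_b, alphas):
    """Return one list of layer weights per alpha: (1 - alpha) * w_a + alpha * w_b."""
    return [[(1.0 - a) * pa + a * pb for pa, pb in zip(w_a, w_b)] for a in alphas]

def loss_along_path(loss_fn, w_a, w_b, num=11):
    """Evaluate loss_fn at evenly spaced points on the line from w_a to w_b."""
    alphas = np.linspace(0.0, 1.0, num)
    losses = np.array([loss_fn(w) for w in interpolate_weights(w_a, w_b, alphas)])
    return alphas, losses
```

Here `w_a` and `w_b` would be the per-layer weight arrays of the SGD-trained and ADAM-trained networks, and `loss_fn` would load the interpolated weights into the model and return the train or test loss. How quickly the loss rises around each endpoint gives a one-dimensional view of how sharp the corresponding minimum is.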
We present a population-based evolutionary strategy that uses a novel fitness score to search for pixel perturbations that explicitly maximize domain divergence and the generalization gap. Our method fools the neural networks more with each passing generation, suggesting the existence of certain vulnerable spatial locations in input images. Our analysis reveals that a proper choice of neural network optimizer is paramount to good generalization. We find that vanilla SGD performs significantly better than adaptive optimization methods at ignoring spurious training features that do not contribute to out-of-sample generalization. Our analysis of the loss surface reveals that, in spite of its good generalization performance, SGD finds sharper minima than ADAM. It might be tempting to conclude that sharper minima are more robust to overfitting on input perturbations; however, more analysis is required in this direction. We believe this work will fuel further research into understanding the generalization properties of deep learning optimization in the presence of input noise.