Despite the enormous progress in training neural networks to solve hard tasks, they remain surprisingly and stubbornly sensitive to imperceptibly small worst-case perturbations known as adversarial examples. This lack of robustness has sparked many theories (gilmer2018adversarial; mahloujifar2019curse; tanay2016boundary; ford2019adversarial; fawzi2018adversarial; bubeck2018adversarial; goodfellow2014explaining; schmidt2018adversarially) but together they fail to explain many perplexing observations. Compelling recent work (ilyas2019adversarial) has illuminated the situation greatly by proposing the ‘Features-Not-Bugs’ (FNB) hypothesis, which states that adversarial sensitivity is a simple consequence of state-of-the-art models learning well-generalizing features in the dataset. From the FNB perspective, since models are trained only for maximizing accuracy, they have the freedom to choose useful but non-robust features that humans find non-intuitive.
What is the nature of these features? Recent work (yin2019fourier) proposes the High Frequency (HF) hypothesis, which states that state-of-the-art classifiers use low-amplitude, high-frequency features in natural images. Given the strong theoretical connection between adversarial features and perturbations in linear models (goh2019a), it stands to reason that we should expect a similarly strong relationship in nonlinear models. In this vein, (yin2019fourier) also finds that adversarial perturbations of naturally trained models tend to be higher frequency, whereas those of adversarially trained models tend to be lower frequency, explaining several otherwise perplexing tradeoffs seen in corruption-based data augmentation studies (yin2019fourier; rusak2020increasing). Though not all adversarial features are high frequency (yin2019fourier), understanding the origin of high frequency features is an important step towards building robust, interpretable models.
In this paper we propose the Implicit Fourier Regularization (IFR) hypothesis, which claims that:
Implicit regularization in the frequency domain, directly caused by the translation invariance of the convolution operation, is necessary for the evolution of high frequency adversarial perturbations.
Through systematic experiments examining the learning dynamics of deep linear and nonlinear models, we provide strong empirical support for this claim, along with theoretical support in the case of deep linear models.
Main Contributions. Our main contributions are as follows. First, by exploring learning dynamics, we confirm the “Features not Bugs” hypothesis. Second, based on extensive, systematic experiments, we propose the Implicit Fourier Regularization hypothesis: that the frequency-driven nature of adversarial examples for real-world deep networks originates from an implicit bias towards sparsity in the Fourier domain, which in turn is caused by the translation invariance of convolutions. We extend the theory of (gunasekar2018implicit) to bounded-width convolutions in this regard, and we find that the radial distribution of energy in the frequency spectrum of the adversarial perturbation reveals the different implicit biases induced by various architectures and weight initializations.
2 Related Work
Explanations of Adversarial Examples. Several papers have tried to understand the conditions that are necessary for sensitivity to adversarial noise. Some work has focused on statistical properties of the data distribution (gilmer2018adversarial; mahloujifar2019curse; tanay2016boundary; ford2019adversarial; fawzi2018adversarial), whereas others have studied overfitting or underfitting as the key property (bubeck2018adversarial; goodfellow2014explaining; schmidt2018adversarially). See (ilyas2019adversarial) for a comprehensive review. However, to our knowledge no existing work has used learning dynamics or implicit regularization theory as a way to understand adversarial development.
Fourier Analysis of Input Perturbations. Recent work on robustness to adversarial and noise perturbations has used a Fourier perspective to understand how various data augmentation techniques can make different models more robust (ilyas2019adversarial; rusak2020increasing). In (ilyas2019adversarial) the authors found that adversarial training and other forms of noise perturbations produce models which are robust to high frequency noise. With this knowledge, they designed better noise generation and data augmentation techniques. Our work tries to explain why the adversarial examples of convolutional neural networks evolve differently in frequency space compared to those of other models.
Relationship between Performance and Adversarial Attacks. Some papers have studied the relationship between performance and adversarial development. In (shafahi2018adversarial) the authors established that for certain loss functions, adversarial examples are inescapable and that input complexity could affect the robustness of the model. In addition, (ilyas2019adversarial; xie2019adversarial; tsipras2018robustness; nakkiran2019adversarial) have found that for classification problems adversarial examples improve the performance of the classifier, and that adversarial robustness is at odds with accuracy.
Theories of Learning Dynamics & Implicit Regularization. This line of theoretical work aims to understand how overparameterized neural nets with more parameters than training data can possibly generalize (williams2019gradient; gunasekar2018implicit; gidel2019implicit; woodworth2020kernel). Early work revealed the surprising phenomenon of implicit regularization: when a loss surface possesses many global minima of equal value, which is common in overparametrized models, the specific global minimum that a learning algorithm converges to can depend greatly on seemingly idiosyncratic details such as the choice of parameter initialization scheme, specific learning algorithm, or architecture. Recent studies have also shown that such implicit bias/regularization is responsible for much if not most of the generalization performance of state-of-the-art image classifiers (li2019enhanced; arora2019exact). Our paper specifically utilizes recent work (gunasekar2018implicit) showing that overparametrized deep linear models with and without convolutions induce dramatically different implicit biases, the former yielding a bias towards sparsity in the Fourier domain. In Section 3.3.1 we hypothesize that such implicit Fourier regularization, combined with useful features in the dataset, is necessary for the emergence of sparse high frequency adversarial examples during learning.
3 Empirical Study of Learning Dynamics
3.1 Learning Dynamics Support the Features-Not-Bugs Hypothesis
Breakpoints migrate to Adversarial Directions at a far higher rate than Random Directions.
To further understand and test the FNB hypothesis, we used function space theory developed for ReLU neural networks that re-parametrizes the latter as continuous piecewise linear (CPWL) spline functions with breakpoints/breakplanes (where the slope changes discontinuously) and delta-slope parameters (the magnitude of the change in slope along a particular direction) (williams2019gradient; sahs2020a). From the spline perspective, modeling curvature in the target function necessitates the flow of breakpoints from their initial locations (determined by the weight initialization) to regions of high curvature. The theory has been successful in explaining several otherwise perplexing phenomena, including the need for overparametrization in training, the structure of the loss surface, the Hessian spectrum’s correlation with smoothing, and implicit regularization.
Motivated by this theory, our interest in adversarial examples, and the "features not bugs" (FNB) hypothesis, a natural question is: To what degree do adversarial attacks take advantage of input directions that already have a preexisting high density of breakpoints? Or do breakpoints only move during training to functionally relevant directions that are then utilized by the adversarial attack?
In order to test this, we trained a ResNet18 model on the CIFAR-10 dataset. After training, we used the Foolbox package (rauber2017foolbox) to generate adversarial attacks for every example in the test set via a Projected Gradient Descent attack (kurakin2016adversarial) (see Sec. LABEL:sec:supp-attack-configs). Once we had the adversarial examples, we passed both the original and adversarial images through the intermediate models (saved during training) and measured two observables along the line connecting the original and adversarial image: (1) changes in binary ReLU states (active to inactive or vice versa); and (2) the change in curvature in the logits for the correct and adversarial categories. We then compared both of these metrics to those obtained for a noise-perturbed image whose perturbation has the same norm as the adversarial perturbation but points in a random direction.
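The first observable can be sketched in a few lines. The snippet below is a minimal illustration on a toy fully connected ReLU network, not the ResNet18 pipeline used in our experiments; the helper names (`relu_states`, `fraction_of_relu_changes`, `random_direction_like`), the number of interpolation steps, and the convention of counting a unit once if its state flips anywhere along the segment are all assumptions made for illustration.

```python
import numpy as np

def relu_states(weights, biases, x):
    """Binary on/off pattern of every hidden ReLU unit for input x."""
    states = []
    h = x
    for W, b in zip(weights, biases):
        pre = W @ h + b
        states.append(pre > 0)
        h = np.maximum(pre, 0.0)
    return np.concatenate(states)

def fraction_of_relu_changes(weights, biases, x, delta, n_steps=50):
    """Fraction of units whose state differs from the state at x
    for at least one sampled point on the segment x -> x + delta."""
    ref = relu_states(weights, biases, x)
    flipped = np.zeros_like(ref, dtype=bool)
    for t in np.linspace(0.0, 1.0, n_steps + 1)[1:]:
        flipped |= relu_states(weights, biases, x + t * delta) != ref
    return flipped.mean()

def random_direction_like(delta, rng):
    """Random perturbation with the same Euclidean norm as delta."""
    r = rng.standard_normal(delta.shape)
    return r * (np.linalg.norm(delta) / np.linalg.norm(r))

# Toy usage: a random two-layer network and a stand-in perturbation
# (hypothetical; a real delta_adv would come from the attack).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((16, 8)), rng.standard_normal((4, 16))]
biases = [rng.standard_normal(16), rng.standard_normal(4)]
x = rng.standard_normal(8)
delta_adv = 0.1 * rng.standard_normal(8)
delta_rand = random_direction_like(delta_adv, rng)
frac_adv = fraction_of_relu_changes(weights, biases, x, delta_adv)
frac_rand = fraction_of_relu_changes(weights, biases, x, delta_rand)
```

Matching the norm of the random perturbation to that of the adversarial one is what makes the two fractions comparable: any difference then reflects direction rather than step size.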
In Figure 1 (first and second panels), we can see that the distribution of the fraction of ReLU changes for both the random and adversarial directions is the same at the beginning of training but diverges significantly by the end. Furthermore, in Figure 1 (third and fourth panels), we can see that the distribution of roughness (measured as the deviation from the normal vector) per example along the adversarial direction increases through time and diverges for adversarial vs. random perturbations. This can also be seen for intermediate models (Supp. Figure LABEL:fig:roughness_complete).
Success of Adversarial Attacks early in training is correlated with Classification Accuracy. To further test the FNB hypothesis, we ask: Does the development of adversarial examples during training correlate with gains in accuracy? To answer this we select model snapshots from intermediate times during training and perform adversarial attacks on them. We considered an adversarial attack to be successful if the original image was correctly classified by the model and if the adversarial example was within the Euclidean distance bound defined by the attack (Supp. Sec. LABEL:sec:supp-attack-configs).
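The success criterion above can be written down directly. The following is a minimal sketch, assuming predictions and images are already available as arrays; the function name `adversarial_success_rate`, the argument names, and the bound `eps` are hypothetical, and the additional requirement that the adversarial prediction differs from the true label is our assumption about what counts as a found adversarial example.

```python
import numpy as np

def adversarial_success_rate(y_true, y_clean, y_adv, x, x_adv, eps):
    """Percentage of examples with a successful attack.

    An attack counts as successful when the clean prediction is correct,
    the adversarial prediction is wrong (assumption), and the Euclidean
    distance between x_adv and x is at most eps.
    """
    y_true, y_clean, y_adv = map(np.asarray, (y_true, y_clean, y_adv))
    diff = (np.asarray(x_adv) - np.asarray(x)).reshape(len(y_true), -1)
    dist = np.linalg.norm(diff, axis=1)
    success = (y_clean == y_true) & (y_adv != y_true) & (dist <= eps)
    return 100.0 * success.mean()

# Toy usage with four two-pixel "images".
x = np.zeros((4, 2))
x_adv = np.array([[0.5, 0.0], [0.5, 0.0], [0.0, 0.5], [2.0, 0.0]])
rate = adversarial_success_rate(
    y_true=[0, 1, 2, 1], y_clean=[0, 1, 0, 1], y_adv=[1, 1, 1, 0],
    x=x, x_adv=x_adv, eps=1.0,
)
```

Note that the denominator is the full set of examples, so the rate cannot exceed the clean accuracy of the snapshot being attacked.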
Figure 2 shows that the adversarial success rate, defined as the percentage of examples for which a successful adversarial attack was found, increased by the same amount as the training/test accuracy. This suggests that adversarial examples develop in the model as performance increases, further supporting the FNB hypothesis. However, we also see that the adversarial examples computed for the final network after training lag behind in performance compared to the adversarial attack computed for intermediate models (Figure 2). This indicates that the final network’s adversarial attacks are not fully transferable to the intermediate networks. This raises the question: Does the nature of the adversarial attacks change during the course of learning?
Minimum Distance Adversarial Features Change throughout Learning. One of the most important defining properties of adversarial examples is that they have low norm and are thus typically imperceptible by eye. Are adversarial attacks low norm throughout training? We measured the size of the adversarial perturbation during learning under two different norms. Figure 3 (Top) shows that both norms decrease during training, indicating that low-norm perturbations emerge only once the model is well trained.
3.2 Dynamics of Adversarial Examples in Frequency Domain
Energy in Adversarial Perturbations flows from Low to Medium/High Frequencies. Are adversarial perturbations simply changing in norm, or do they also change in direction during training? To test this, we took inspiration from (yin2019fourier) and adopted a Fourier perspective. Their work suggests that networks trained for adversarial robustness tend to be robust to high frequency perturbations, and so we decided to study the learning dynamics of adversarial perturbations in the frequency domain.
As with previous experiments, we performed adversarial attacks on intermediate models during training, and measured the 2-D discrete Fourier spectrum of the perturbation. In Figure 3 (Bottom), we see that early in training the adversarial perturbations tend to contain low frequencies, and that they evolve towards higher frequencies as training progresses. Thus, over the course of training, adversarial perturbations come to contain more and higher frequencies with lower amplitudes (their norm decreases over time). This can also be observed in Figures 1(b) and 1(c), where we show the marginal radial and angular distributions of the energy spectrum, expressed in polar coordinates in the frequency domain. We observe that (i) the radial energy distribution shifts from low to high frequency, and (ii) the angular energy distribution becomes more uniform, corresponding to the observed ring structure.
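The radial and angular marginals of the energy spectrum can be computed as follows. This is a minimal NumPy sketch for a single-channel perturbation; the function name `radial_angular_energy`, the bin counts, and the normalization to unit total energy are assumptions chosen for illustration rather than the exact settings behind the figures.

```python
import numpy as np

def radial_angular_energy(delta, n_radial=8, n_angular=8):
    """Marginal radial and angular energy distributions of a 2-D perturbation.

    delta: 2-D array (single channel) with nonzero energy.
    Returns two arrays that each sum to 1.
    """
    spectrum = np.fft.fftshift(np.fft.fft2(delta))
    energy = np.abs(spectrum) ** 2

    # Frequency coordinates in cycles per pixel, centered like the spectrum.
    h, w = delta.shape
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    r = np.sqrt(fx ** 2 + fy ** 2)      # radial frequency
    theta = np.arctan2(fy, fx)          # angle in [-pi, pi]

    # Energy-weighted histograms over radius and angle.
    r_edges = np.linspace(0.0, r.max(), n_radial + 1)
    radial, _ = np.histogram(r, bins=r_edges, weights=energy)
    angular, _ = np.histogram(theta, bins=n_angular,
                              range=(-np.pi, np.pi), weights=energy)
    total = energy.sum()
    return radial / total, angular / total

# Toy usage: a low-frequency and a high-frequency horizontal grating.
n = 32
cols = np.arange(n)[None, :] * np.ones((n, 1))
low = np.cos(2 * np.pi * 1 * cols / n)
high = np.cos(2 * np.pi * 15 * cols / n)
radial_low, _ = radial_angular_energy(low)
radial_high, _ = radial_angular_energy(high)
```

In this toy example the radial mass of `low` sits in the first bin while that of `high` sits in a much later one, which is the kind of shift described above for perturbations computed early versus late in training.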
3.3 Implicit Regularization in Frequency Domain impacts Adversarial Perturbations
Recent theoretical work (gunasekar2018implicit; williams2019gradient; sahs2020a) has shown that learning dynamics, influenced by the choice of model parametrization, weight initialization scheme, and gradient descent-based optimization algorithm, play a pivotal role in the generalization performance of deep networks (arora2019exact) by inducing an implicit model bias in the form of regularization. This implicit bias depends critically on the choice of parametrization, sparking much recent work attempting to characterize the implicit bias/regularizer for various popular (or analytically tractable) architectures and learning algorithms. In particular, (gunasekar2018implicit) shows that shallow linear convnets with a single hidden layer (1 full-width circular convolutional linear layer followed by 1 fully connected linear layer) induce an implicit sparsity-promoting regularizer in the Fourier domain, i.e., an $\ell_1$ penalty $\|\hat{\beta}\|_1$, where $\hat{\beta}$ is the discrete Fourier transform of the end-to-end linear predictor $\beta$ represented by the linear network. Furthermore, the sparsity promotion intensifies with increasing depth, yielding an implicit $\ell_p$ penalty on $\hat{\beta}$ with $p < 1$, where $p$ decreases as the number of hidden convolutional layers grows.
In stark contrast, they find that a fully connected linear network with a single hidden layer (2 fully connected linear layers) induces a ridge regularizer in the pixel or space domain. Furthermore, this regularization does not change with depth, i.e., it remains an $\ell_2$ penalty $\|\beta\|_2$ on the end-to-end weights $\beta$. Based on this we state our main hypothesis: that convolutions induce an implicit Fourier regularization that strongly impacts the frequency spectra of adversarial perturbations.
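To make the contrast between the two implicit regularizers concrete, the sketch below evaluates an $\ell_1$ penalty in the Fourier domain and an $\ell_2$ penalty in the pixel domain on two unit-norm 1-D weight vectors. The function names, the 1-D setting, and the unitary DFT normalization are assumptions made for illustration; this is not the derivation of (gunasekar2018implicit), only a numerical picture of which solutions each penalty prefers.

```python
import numpy as np

def fourier_l1_penalty(beta):
    """l1 norm of the unitary DFT of beta (promotes sparsity in frequency)."""
    return np.abs(np.fft.fft(beta, norm="ortho")).sum()

def pixel_l2_penalty(beta):
    """l2 (ridge-style) norm of beta in the pixel/space domain."""
    return np.linalg.norm(beta)

n = 64
t = np.arange(n)

# A pure sinusoid: dense in space, two nonzero Fourier coefficients.
sinusoid = np.cos(2 * np.pi * 4 * t / n)
sinusoid /= np.linalg.norm(sinusoid)

# A single spike: sparse in space, flat (dense) Fourier spectrum.
spike = np.zeros(n)
spike[0] = 1.0

l2_sin, l2_spike = pixel_l2_penalty(sinusoid), pixel_l2_penalty(spike)
l1_sin, l1_spike = fourier_l1_penalty(sinusoid), fourier_l1_penalty(spike)
```

Both vectors have the same pixel-domain $\ell_2$ penalty, so a ridge-type bias is indifferent between them, whereas the Fourier-domain $\ell_1$ penalty is far smaller for the sinusoid (about $\sqrt{2}$ versus $\sqrt{n} = 8$ here) and therefore favors the frequency-sparse solution.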
Recent theoretical work in this space focuses mostly on linear models and so our first step is to test this hypothesis in linear models. Furthermore, given the complexity of modern architectures and learning algorithms, we also ask: What are the minimal conditions (architecture type, hyperparameters, dataset, etc.) that are required for different kinds of biases in adversarial examples to form? This approach enables us to use linear learning dynamics theory to better understand the origin of adversarial features and their dependence on model parameterization. Armed with our linear understanding, we can then empirically explore nonlinear models like the ResNet18 model used for our earlier analysis.
3.3.1 Exploring Linear Models: Testing the Implicit Regularization Theory and expanding to Bounded-Width Convolutions
Fully Connected vs Full-Width Convolutional Models show different implicit bias in the frequency domain, as predicted by Implicit Fourier Regularization Theory (gunasekar2018implicit). In order to test the implicit Fourier regularization (IFR) theory, we first trained a 1 hidden layer linear fully connected model and a 1 hidden layer linear 3-channel (circular full-width) convolutional model on CIFAR-10. Both models achieved similar test accuracies (Supp. Table LABEL:tbl:accuracy).
In Figure LABEL:fig:vis-lineara, we visualize the linear classifier (end-to-end weights) for the chosen class, the adversarial perturbation, and their respective Fourier spectra for the fully connected and convolutional linear models. We observe that the Fourier spectrum for the full-width convolutional model is sparser than that of the fully connected model, which is also reflected in its norm (Figure LABEL:fig:bar_norm) and in the radial distribution of energy (Figure LABEL:fig:vis-linearb). This indicates that both models conform to the predictions of the IFR theory in (gunasekar2018implicit), and that there is a direct relationship between the spectrum of the classifier weights and that of the adversarial perturbation. Furthermore, the IFR theory states that this difference should increase with depth. In Figure LABEL:fig:bar_norm, we see that a deep full-width convolutional model yields a smaller norm than the shallow full-width convolutional model, but that no such depth dependence exists for a 3 hidden layer vs. a 1 hidden layer fully connected model. Finally, Supp. Figure LABEL:fig:average_delta_linear shows that this behavior is not specific to one image but is indeed present in the average Fourier spectrum of the adversarial perturbations for both fully connected and full-width convolutional models.
Increasing the Number of Channels and restricting to Bounded-Width Convolutions Concentrates Energy in Higher Frequencies. Given that the IFR theory (gunasekar2018implicit) applies only to full-width circular convolutions with an equal number of hidden units as their input, we asked: does a similar bias occur in bounded-width and multi-channel convolutions? In Figure LABEL:fig:bar_norm, we observe that a 3-channel kernel linear convolutional model is not less dense than its full-width counterpart, but does show more concentration in higher frequencies than a fully connected model (Supp. Figures LABEL:fig:average_delta_linear, LABEL:fig:vis-linearb). This is likely due to the additional spatial constraints imposed on the bounded-width convolutions which, via the Fourier Uncertainty Principle (a space-limited kernel cannot be band-limited in the Fourier domain), drive frequency dispersion in the Fourier domain. This frequency dispersion is in tension with the sparsity promotion predicted by the IFR theory, resulting in a compromise sparsity that is lower than that of the full-width convolutional model. This intuition can be made rigorous for a 3-channel bounded-width linear convolutional model, extending the results of (gunasekar2018implicit), as follows.
For full-width convolutional networks, let the parametrization map be