Deep learning has achieved many recent advances in predictive modeling across various tasks, yet the community has become alarmed by the unintuitive generalization behaviors of neural networks, such as the capacity to memorize label-shuffled data (Zhang et al., 2017) and the vulnerability to adversarial examples (Szegedy et al., 2013; Goodfellow et al., 2015).
To explain the generalization behaviors of neural networks, many theoretical advances have been made, including studies of the properties of stochastic gradient descent (Keskar et al., 2017), different complexity measures (Neyshabur et al., 2017), generalization gaps (Schmidt et al., 2018), and many more from different model or algorithm perspectives (Kawaguchi et al., 2017; Mahloujifar et al., 2018; Bubeck et al., 2018; Shamir et al., 2019).
In this paper, inspired by previous findings that convolutional neural networks (CNN) can learn from confounding signals (Wang et al., 2017) and superficial signals (Jo and Bengio, 2017; Geirhos et al., 2019; Wang et al., 2019), we investigate the generalization behaviors of CNN from a data perspective. Concurrently with Ilyas et al. (2019), we suggest that the unintuitive generalization behaviors of CNN are a direct outcome of the perceptual disparity between humans and models:
CNN can view the data at a much higher granularity than a human can.
However, unlike Ilyas et al. (2019), we provide an interpretation of this high granularity of the model’s perception:
CNN can exploit the high-frequency image components that are not perceivable to humans.
For example, Figure 1 shows the prediction results of three selected testing samples from the MNIST data set and three testing samples from the CIFAR10 data set, together with the prediction results of their high-frequency and low-frequency component counterparts. For these examples, the prediction outcomes are almost entirely determined by the high-frequency components of the image, which are barely perceivable to humans. On the other hand, the low-frequency components, which look almost identical to the original image to a human, are classified by the model as something distinctly different.
Motivated by the above empirical observations, in this paper we further investigate the generalization behaviors of CNN and attempt to explain such behaviors via differential responses to the image frequency spectrum of the inputs (Remark 1). Our main contributions are summarized as follows:
- We reveal the existing trade-off between CNN’s accuracy and robustness by offering examples of how CNN exploits the high-frequency components of images to trade robustness for accuracy (Corollary 1).
- We propose defense methods that can help improve the adversarial robustness of CNN without training or fine-tuning the model.
- We introduce a new black-box attack method that simply perturbs the high-frequency components of an image.
The remainder of the paper is organized as follows. In Section 2, we introduce related discussions. In Section 3, we present our main contributions, including a formal discussion of how CNN can exploit high-frequency components, which naturally leads to the trade-off between adversarial robustness and accuracy, and new defense and attack methods. These contributions are then verified by experiments in Section 4. We briefly discuss some related topics in Section 5 before we conclude the paper in Section 6.
2 Related Work
Ever since Szegedy et al. (2013) demonstrated the phenomenon of adversarial examples (i.e., samples with small perturbations that are imperceptible to humans can deceive the model, altering predictions dramatically), the study of this unintuitive generalization behavior of neural networks has led to various results. For example, one branch of the study focuses on developing methods, namely attack methods, to generate adversarial examples. These attack methods mostly assume full knowledge of the model, e.g., parameters and gradients (white-box methods), such as FGSM, which moves a single step along the direction of the gradient (Goodfellow et al., 2015), DeepFool, which iteratively searches for a minimum-norm perturbation (Moosavi-Dezfooli et al., 2016), and the C&W attack, which seeks a perturbation with minimum distance to the original image (Carlini and Wagner, 2017); some other attacks assume limited knowledge of the model (black-box methods), such as the SinglePixel attack or the Salt&Pepper attack (Rauber et al., 2017). Another branch focuses on defense methods, such as distillation (Papernot et al., 2016), Parseval networks (Cisse et al., 2017), and adversarial training (Madry et al., 2018b). These are but a few highlights among a long history of proposed attack and defense methods. One can refer to comprehensive reviews for detailed discussions (Akhtar and Mian, 2018; Chakraborty et al., 2018), yet should be aware that newer works are being introduced rapidly (e.g., Ilyas et al., 2018; Yan et al., 2018; Guo et al., 2018b; Alaifari et al., 2019; Xiao et al., 2019; Song et al., 2019; Farnia et al., 2019; Wong et al., 2019).
While producing a torrent of research, the arms race between attack and defense methods has also led the community to consider whether adversarial examples are inevitable. Shafahi et al. (2019) offered an affirmative answer with a series of analytic studies and empirical results. With designed data distributions, Tsipras et al. (2019) demonstrated the existence of a trade-off between a model’s robustness and accuracy. Related theoretical discussions have also been offered (Zhang et al., 2019).
Key Differences: This paper studies the related topics from a data perspective. We extend previous discussions by offering interpretable observations connected to the phenomenon of adversarial examples and the trade-off between robustness and accuracy; we then make methodological contributions based on these observations. To the best of our knowledge, only concurrent studies focus on the data perspective (Ilyas et al., 2019), and our paper materializes the discussion of Ilyas et al. (2019) by offering explanations of the imperceptible features.
3 The Frequency Spectrum and CNN’s Generalization Behavior
We first set up the basic notations used in this paper: $\langle x, y \rangle$ denotes a data sample (the image and the corresponding label). $f(\cdot; \theta)$ denotes a convolutional neural network whose parameters are denoted as $\theta$. We use $\mathcal{H}$ to denote a human model, and as a result, $f(\cdot; \mathcal{H})$ denotes how a human will classify the data $\cdot$.
$l(\cdot, \cdot)$ denotes a generic loss function (e.g., cross entropy loss). $\alpha(\cdot, \cdot)$ denotes a function evaluating prediction accuracy (for every sample, this function yields $1$ if the sample is correctly classified, $0$ otherwise).
$d(\cdot, \cdot)$ denotes a function evaluating the distance between two vectors.
$\mathcal{F}(\cdot)$ denotes the Fourier transform; thus, $\mathcal{F}^{-1}(\cdot)$ denotes the inverse Fourier transform. We use $z$ to denote the frequency component of a sample. Therefore, we have $z = \mathcal{F}(x)$ and $x = \mathcal{F}^{-1}(z)$.
3.1 CNN Exploit High-frequency Components
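We decompose the raw data as $x = \{x_l, x_h\}$, where $x_l$ and $x_h$ denote the low-frequency component and high-frequency component of $x$. We have the following four equations:
$$z = \mathcal{F}(x), \qquad z_l, z_h = t(z; r), \qquad x_l = \mathcal{F}^{-1}(z_l), \qquad x_h = \mathcal{F}^{-1}(z_h),$$
where $t(\cdot; r)$ denotes a thresholding function that separates the low- and high-frequency components from $z$ according to a hyperparameter, the radius $r$.
To define $t(\cdot; r)$ formally, we first consider a grayscale (one-channel) image of size $n \times n$ with $\mathcal{N}$ possible pixel values (in other words, $x \in \mathcal{N}^{n \times n}$); then we have $z \in \mathbb{C}^{n \times n}$, where $\mathbb{C}$ denotes the complex numbers. We use $z(i, j)$ to index the value of $z$ at position $(i, j)$, and we use $(c_i, c_j)$ to denote the centroid. We have the equation $z_l, z_h = t(z; r)$ formally defined as:
$$z_l(i, j) = \begin{cases} z(i, j), & \text{if } d((i, j), (c_i, c_j)) \le r \\ 0, & \text{otherwise} \end{cases} \qquad z_h(i, j) = \begin{cases} 0, & \text{if } d((i, j), (c_i, c_j)) \le r \\ z(i, j), & \text{otherwise} \end{cases}$$
We consider $d(\cdot, \cdot)$ in $t(\cdot; r)$ to be the Euclidean distance in this paper. If $x$ has more than one channel, then the Fourier transform, the thresholding, and the inverse transform operate on every channel of pixels independently.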
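The decomposition above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not necessarily the implementation used in the experiments: the function names `radial_mask` and `decompose` are hypothetical, the spectrum is shifted so that the zero frequency sits at the centroid, and the centroid is assumed to be the integer midpoint of the image.

```python
import numpy as np

def radial_mask(n_rows, n_cols, radius):
    """True where the Euclidean distance to the centroid is at most `radius`."""
    ci, cj = n_rows // 2, n_cols // 2  # assumed centroid of the shifted spectrum
    i, j = np.ogrid[:n_rows, :n_cols]
    return (i - ci) ** 2 + (j - cj) ** 2 <= radius ** 2

def decompose(x, radius):
    """Split a single-channel image into low- and high-frequency components."""
    # Shift the spectrum so the zero frequency is at the centroid.
    z = np.fft.fftshift(np.fft.fft2(x))
    mask = radial_mask(x.shape[0], x.shape[1], radius)
    z_l = np.where(mask, z, 0)  # frequencies inside the radius
    z_h = np.where(mask, 0, z)  # frequencies outside the radius
    x_l = np.fft.ifft2(np.fft.ifftshift(z_l)).real
    x_h = np.fft.ifft2(np.fft.ifftshift(z_h)).real
    return x_l, x_h
```

Because the Fourier transform is linear and the two masks are complementary, the two components sum back to the original image; for a multi-channel image, `decompose` would be called once per channel.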
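Remark 1. With an assumption (referred to as A1) that presumes “only $x_l$ is perceivable to human, but both $x_l$ and $x_h$ are perceivable to a CNN,” we have:
$$y := f(x; \mathcal{H}) = f(x_l; \mathcal{H}),$$
but when a CNN is trained with
$$\arg\min_{\theta} \, l(f(x; \theta), y), \quad \text{which is equivalent to} \quad \arg\min_{\theta} \, l(f(\{x_l, x_h\}; \theta), y),$$
CNN may learn to exploit $x_h$ to minimize the loss. As a result, CNN’s generalization behavior appears unintuitive to a human. ∎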
Notice that “CNN may learn to exploit $x_h$” differs from “CNN overfits” because $x_h$ can contain more information than sample-specific idiosyncrasy, and this additional information can generalize across training, validation, and testing sets while remaining imperceptible to a human.
As Assumption A1 has been demonstrated to hold in some cases (e.g., in Figure 1), we believe Remark 1 can serve as one of the explanations of CNN’s generalization behavior. For example, adversarial examples (Szegedy et al., 2013; Goodfellow et al., 2015) can be generated by perturbing $x_h$; the capacity of CNN to reduce training error to zero over label-shuffled data (Zhang et al., 2017) can be seen as a result of exploiting $x_h$ and overfitting sample-specific idiosyncrasy; and the observation that CNN does not directly memorize the data when labels are not shuffled may be explained by the tendency of CNN towards learning simple, low-frequency functions (Rahaman et al., 2018), although this tendency is not within the scope of this paper.
3.2 Trade-off between Robustness and Accuracy
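We continue with Remark 1 and discuss CNN’s trade-off between robustness and accuracy given $\theta$ from the image frequency perspective. We first formally state the accuracy of $\theta$ as:
$$\mathbb{E}_{(x, y)} \, \alpha(f(x; \theta), y),$$
and the robustness of $\theta$ as (following conventions, e.g., Carlini et al., 2019):
$$\mathbb{E}_{(x, y)} \min_{x' : d(x', x) \le \epsilon} \alpha(f(x'; \theta), y),$$
where $\epsilon$ is the upper bound of the perturbation allowed.
With another assumption (referred to as A2): “for model $\theta$, there exists a sample $\langle x, y \rangle$ such that $f(x; \theta) \neq f(x_l; \theta)$,”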
we can extend our main argument (Remark 1) to a formal statement:
The proof is a direct outcome of the previous discussion and is thus omitted. Assumption A2 can also be verified empirically (e.g., in Figure 1); therefore, we can safely state that Corollary 1 can serve as one of the explanations of the trade-off between CNN’s robustness and accuracy.
Corollary 1 does not necessarily share the goal of other theoretical trade-off discussions, which focus on the statistical properties of the model; instead, it expands along a new dimension concerning the nature of the data, especially the frequency spectrum of the images. One advantage of our direction is that we can conveniently offer real-world interpretable examples as justifications of our discussion. Also, intuitive explanations can immediately inspire us to contribute to the rich collection of adversarial attack and defense methods.
3.3 Improving the Adversarial Robustness by Focusing on Low-Frequency Components
Within the scope of this paper, we are only interested in improving the adversarial robustness of clean-data trained CNNs without extra learning or even fine-tuning procedure. We focus on trained CNNs because of 1) the practical value of improving the adversarial robustness of deployed CNNs in industry and 2) the research value of verifying our previous discussions by eliminating the influence from other factors during further learning or fine-tuning.
Remark 1 straightforwardly leads to a solution to improve CNN’s robustness towards human level: cleaning the high-frequency components of an input image. Thus, we have our first defense method:
D1: During testing, for any input image $x$, transform it into the corresponding low-frequency component $x_l$ with radius $r$.
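A minimal sketch of this first defense is given below, assuming a model exposed as a plain Python callable. The names `low_pass` and `d1_predict` are hypothetical, and the filter follows the radial thresholding of Section 3.1 with the zero frequency shifted to the centre of the spectrum; it illustrates the idea rather than reproducing the code used in the experiments.

```python
import numpy as np

def low_pass(x, radius):
    """Low-frequency component of an image of shape (H, W) or (H, W, C)."""
    if x.ndim == 3:
        # Multi-channel input: filter every channel independently.
        channels = [low_pass(x[..., c], radius) for c in range(x.shape[-1])]
        return np.stack(channels, axis=-1)
    h, w = x.shape
    i, j = np.ogrid[:h, :w]
    mask = (i - h // 2) ** 2 + (j - w // 2) ** 2 <= radius ** 2
    z = np.fft.fftshift(np.fft.fft2(x))
    return np.fft.ifft2(np.fft.ifftshift(np.where(mask, z, 0))).real

def d1_predict(predict_fn, x, radius):
    """Filter the input at test time, then query the unchanged model."""
    return predict_fn(low_pass(x, radius))
```

Here `predict_fn` stands for any trained classifier; the model itself is never retrained or modified, only its input is filtered.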
However, processing every input image can be computationally expensive, so we propose another method that only operates on the model space: instead of transforming $x$ to $x_l$, we transform the model $\theta$ to $\theta_l$:
D2: For model $\theta$, we use $w$ to denote the first convolutional layer, and $w_k$ denotes the $k$th convolutional kernel. For every $w_k$, we compute its low-frequency component $w_{k,l}$ with radius $r$, following the same decomposition used for images in Section 3.1,
and get $w_l$, whose $k$th convolutional kernel is $w_{k,l}$. We replace $w$ with $w_l$, and the remaining parameters of $\theta$ stay unchanged. We use $\theta_l$ to denote the resulting model. Images are predicted via $f(x; \theta_l)$.
The rationale of D2 is shown with (the proof is in Appendix A):
given a fixed radius .
While Theorem 1 justifies D2, the method may be less than ideal in practice. One of the reasons is that when we apply the inverse Fourier transform to the thresholded frequency domain, the resulting kernel usually has both a real part and an imaginary part, but we can only replace the original kernel with the real part of the result due to the structure of CNN. To reduce this inconvenience, we also explore D2 extensions that perturb the weights of more layers.
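The kernel-side filtering of D2 can be sketched as follows. This is an illustration under assumptions rather than the exact procedure of the experiments: the function name `low_pass_kernels` is hypothetical, the weights are assumed to be stored as an array of shape (kernel height, kernel width, input channels, output channels), and only the real part of the inverse transform is kept, as discussed above.

```python
import numpy as np

def low_pass_kernels(w, radius):
    """Low-pass filter every 2-D kernel slice of a convolutional weight array.

    w: array of shape (kh, kw, in_channels, out_channels) (assumed layout).
    Returns an array of the same shape and dtype.
    """
    kh, kw = w.shape[:2]
    i, j = np.ogrid[:kh, :kw]
    mask = (i - kh // 2) ** 2 + (j - kw // 2) ** 2 <= radius ** 2
    mask = mask[:, :, None, None]  # broadcast over both channel axes
    # Transform only the two spatial axes of the kernel.
    z = np.fft.fftshift(np.fft.fft2(w, axes=(0, 1)), axes=(0, 1))
    z_l = np.where(mask, z, 0)
    w_l = np.fft.ifft2(np.fft.ifftshift(z_l, axes=(0, 1)), axes=(0, 1))
    # Only the real part can be written back into the network.
    return w_l.real.astype(w.dtype)
```

The filtered array would then replace the first-layer weights of the trained model, leaving all other parameters untouched, so no per-image processing is needed at test time.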
3.4 High Frequency Attack
Finally, Remark 1 directly leads to an intuitive attack method, which we consider an additional contribution of this paper. We refer to our attack method as the High Frequency (HF) attack. HF does not require the gradient information of the model but needs to query the model’s prediction to confirm the success of the attack, similar to other (semi) black-box attacks.
The heuristic algorithm of HF is straightforward: given an image $x$, HF starts with a large radius and gradually reduces the radius $r$ at every iteration. For every radius $r$, HF maps $x$ to the corresponding $x_l$ and tests the model with $x_l$. If $x_l$ deceives the model, then HF terminates and returns $x_l$ as an adversarial example of $x$. If HF cannot find an adversarial example along the way, it also needs to terminate before the radius gets too small and the filtering starts to discard information that is meaningful to a human. In practice, we set HF to terminate once the radius falls below a fixed threshold.
Due to the discrete nature of the image data, can be set in the following way to avoid unnecessary computational load. With two pseudo-indices and , we use the tuple to replace , and . Then is replaced by either or depending on whether or is greater, where (an integer at least ) is the step size. With this manner, we need iterations. HF is guaranteed to test all the possibilities if .
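The HF loop described in this section can be sketched as below. The sketch rests on several assumptions that are not taken from the text: the names `low_pass` and `hf_attack` are hypothetical, the radius shrinks by a fixed `step` instead of following the pseudo-index schedule, the starting radius is the image diagonal, the attack is counted as successful when the prediction on the filtered image differs from the prediction on the original image, and the stopping radius `min_radius` is left to the caller.

```python
import numpy as np

def low_pass(x, radius):
    """Low-frequency component of a single-channel image of shape (H, W)."""
    h, w = x.shape
    i, j = np.ogrid[:h, :w]
    mask = (i - h // 2) ** 2 + (j - w // 2) ** 2 <= radius ** 2
    z = np.fft.fftshift(np.fft.fft2(x))
    return np.fft.ifft2(np.fft.ifftshift(np.where(mask, z, 0))).real

def hf_attack(predict_fn, x, min_radius, step=1.0):
    """Shrink the radius until the filtered image changes the prediction.

    Returns (adversarial_image, radius), or (None, None) when no radius
    down to `min_radius` changes the prediction.
    """
    original = predict_fn(x)
    radius = float(np.hypot(x.shape[0], x.shape[1]))  # keeps every frequency
    while radius >= min_radius:
        x_l = low_pass(x, radius)
        if predict_fn(x_l) != original:  # assumed success criterion
            return x_l, radius
        radius -= step
    return None, None
```

Only the model’s predictions are queried, so `predict_fn` can wrap any classifier without access to its gradients.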
4 Experiments
We verify our discussions through experiments. We first test how the defense methods help improve the robustness of the original model against either previously generated adversarial examples or new attacks. Then, we demonstrate the effectiveness of our High Frequency attack method.
Notice that we do not intend to claim that our defense and attack methods are superior to all existing ones, and we believe our contributions are valid without such superiority for the following reasons: first, our methods, as direct outcomes of Remark 1, can validate the previous discussions, which, if validated, help explain CNN’s unintuitive generalization behaviors; second, our defense methods improve the robustness of existing models without any training or fine-tuning, and thus are helpful for deployed models. D2 can even achieve this goal without introducing extra input-processing computations.
4.1 Experiment Setup
We experiment on four data sets: MNIST (LeCun, 1998), FashionMNIST (Xiao et al., 2017), CIFAR10 (Krizhevsky and Hinton, 2009), and ImageNet (Deng et al., 2009). We use a simple five-layer CNN architecture for the MNIST and FashionMNIST data sets, and use the AlexNet architecture (Krizhevsky et al., 2012) for the CIFAR10 and ImageNet data sets. The prediction accuracies on clean testing images are 99.2% (MNIST), 92.9% (FashionMNIST), 83.1% (CIFAR10), and 53.4% (ImageNet). We consider three attack methods: FGSM (Goodfellow et al., 2015), DeepFool (Moosavi-Dezfooli et al., 2016), and C&W (Carlini and Wagner, 2017). We use the default parameters in Foolbox (Rauber et al., 2017), and our experiments suggest that these default parameters are effective enough in most cases. For the ImageNet experiment, we use pretrained weights (https://www.cs.toronto.edu/~guerzhoy/tf_alexnet/) and only consider 100 classes (1st, 11th, …, 991st class) from the ImageNet validation set.
4.2 Improving Robustness Against Attacks
Against Attacks Towards Original Models: We first consider how D1 and D2 can help improve the robustness of models against adversarial attacks towards the original models: we first attack the trained models to generate adversarial examples, then improve the model with D1 or D2, and test the performance of improved models over the generated adversarial examples.
The results of D1 and D2 are shown in Figure 2 and Figure 3, respectively. These figures show the curve of prediction accuracy of adversarial examples (Y-axis) over the maximum norm perturbation allowed between adversarial examples and the original image (X-axis). For some data sets (FashionMNIST, CIFAR10, and ImageNet), we divide the calculated distance by 255, so that the maximum norm perturbation allowed for every adversarial sample is 1.0.
In Figure 2, we test D1 with three different radii, depending on the dimension of the testing images: for images (MNIST, FashionMNIST), for images (CIFAR10), and for images (ImageNet). We report these radius choices because images processed with larger radii behave almost identically to the original images, and images processed with much smaller radii are almost unrecognizable, so the prediction results become less interesting. Except in one case (FGSM attacking the FashionMNIST model), the default parameters of Foolbox are effective enough to deceive the models. As we can see, the baseline curves drop dramatically as the perturbation bound increases. However, models with D1 processing the images are almost uninfluenced by the adversarial perturbations. For MNIST and FashionMNIST, D1 is particularly effective, as most choices of the radius raise the curve significantly. For CIFAR10 and ImageNet, the curve behaviors are intuitively connected to the choice of radius: processing the images with a larger radius is not helpful because the processing cannot necessarily erase the perturbations introduced through high-frequency components; on the other hand, while processing with a smaller radius leads to a more robust model, the prediction accuracy drops.
In Figure 3, we report D2 with four different choices, depending on the CNN architecture. For the simpler architecture for MNIST and FashionMNIST, we experiment with two radius choices () with two strategies: only perturbing the first layer () and perturbing the first two layers (). For AlexNet, we also experiment with two radius choices () and with two different strategies: only perturbing the first layer () and also perturbing the second layer with radius and the third layer with radius (denoted as ). These parameter choices may look arbitrary, but we notice that the performance of the method varies smoothly as a function of these parameters. Therefore, we only report some representative choices to evaluate the method qualitatively, not quantitatively. In other words, the exact choice of the parameters does not matter within this paper. As we can see, Figure 3 tells a similar story to Figure 2. This similarity is consistent with Theorem 1: perturbing the weights is effectively the same as perturbing the images.
Defense Against Attacks Towards Improved Models: We also experiment with the more challenging case: we first apply D2 onto the CNN, then attack the improved models and evaluate the robustness in comparison to the baseline. We do not consider D1 because D1 effectively masks the gradient of the model, so attack methods such as FGSM, DeepFool, and C&W will be unlikely to succeed. We test the same settings of D2 as the previous section. As this new case is more challenging than the previous one, we only observe marginal improvements of our method.
We conjecture that the improvement is marginal because these powerful attacks can still deceive the model through perturbations on low-frequency signals after we block the high-frequency signals. To verify this hypothesis, we plot some adversarial MNIST images generated by C&W to deceive the improved models. In comparison, the adversarial images for the baseline look almost identical to the clean images to a human, but the adversarial images for the D2-improved models tend to have some clustered perturbations. These perturbations, although still blurry, can at least indicate to a human which digit the model will misclassify the image as. More discussions are in Appendix B.
In conclusion, the performance of these methods validates the main argument (Remark 1) of this paper: CNN can exploit high-frequency components of images, which is one of the reasons for the unintuitive generalization behaviors, and discarding the high-frequency components can help the CNN view the data as a human does, which in turn helps improve robustness.
4.3 High Frequency Attack
Finally, we experiment with our High Frequency attack method. As the HF attack does not utilize the gradients of the model, but only queries the model for whether an adversarial example has been successfully generated, we compare it with attacks of similar mechanisms, such as SinglePixel attack and Salt&Pepper attack (Rauber et al., 2017). Similarly, we use the default parameters in Foolbox.
We attack the baseline models with these methods and plot the accuracy-perturbation curve, as shown in Figure 4. We can see that these attack methods are much less effective than the attack methods that consider the gradient of the model. HF tends to be more effective than the competing methods, as the area under its accuracy-perturbation curve is clearly smaller. However, despite this advantage, we do not intend to claim that HF is guaranteed to be superior to other competing attack methods. Instead, these results suggest that HF is reasonably effective and thus validate Remark 1, which offers explanations of the unintuitive generalization behaviors of CNN. Some adversarial examples are shown in Appendix C.
5 Discussion
Are high-frequency components just noise?
As we have demonstrated that CNN can exploit the high-frequency components of an image, a follow-up question is whether the signals we have noticed are indeed related to the frequency of an image, or are just some “noise.” This question is important because masking out the high-frequency components is one of the conventional methods for denoising images (e.g., see Dabov et al., 2007). To answer this question, we experiment with another frequently used image denoising method: truncated singular value decomposition (SVD) (e.g., see Lu, 1997). We first decompose the image with SVD; then, instead of separating the image into low-frequency and high-frequency components, we separate the image into one part reconstructed with dominant singular values (the ones with larger values) and one part reconstructed with trailing singular values (the ones with smaller values). With this set-up, and with up to 20 trailing singular values (there are 28 in total), we do not find any images that align with the story in Figure 1 (i.e., samples that are predicted correctly from the noise components, but wrongly from the clean components) for MNIST, and we find far fewer images supporting this story for FashionMNIST. Our observations suggest that the signal CNN exploits is more than just random “noise.”
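The SVD-based split used in this comparison can be sketched with NumPy as follows. It is a minimal illustration; the function name `svd_split` and the argument `k_trailing` are hypothetical, and the image is assumed to be a single-channel 2-D array.

```python
import numpy as np

def svd_split(x, k_trailing):
    """Split an image into dominant and trailing singular-value reconstructions."""
    # NumPy returns the singular values in descending order.
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    k_dominant = len(s) - k_trailing
    dominant = (u[:, :k_dominant] * s[:k_dominant]) @ vt[:k_dominant]
    trailing = (u[:, k_dominant:] * s[k_dominant:]) @ vt[k_dominant:]
    return dominant, trailing
```

For a 28 by 28 image, `svd_split(x, 20)` keeps the 8 largest singular values in the first output and the remaining 20 in the second, and the two outputs sum back to `x`.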
Does a more robust CNN also capture high-frequency signals?
As we have only experimented with CNNs optimized to minimize the empirical loss on clean training data, another follow-up question is whether more robust CNNs also exploit the high-frequency components. To answer this question, we obtain adversarially trained models with PGD (Madry et al., 2018a) with the same architecture on MNIST and FashionMNIST and repeat the experiments of Figure 1. We also test the models with the HF attack. For the MNIST data set, we cannot find any samples that tell the story of Figure 1, and the HF attack is less effective, although it can still deceive the model. This observation correlates well with the understanding that more robust models tend to perceive the data more similarly to how a human does (e.g., Kim et al., 2019). However, for the FashionMNIST data set, we unfortunately notice more samples that tell the story of Figure 1, and the HF attack is even more effective.
Other related CNN-analysis works adopting the Fourier transform technique:
This work is inspired by empirical observations showing that a CNN has a tendency to learn superficial statistics (Jo and Bengio, 2017; Wang et al., 2019). Jo and Bengio (2017) showed that, for the same model, there is a large generalization gap between images with different frequency-domain perturbations. Guo et al. (2018a) showed that the adversarial attack is particularly effective if the perturbations are constrained to the low-frequency space, which was further analyzed by Sharma et al. (2019), who also showed that the low-frequency perturbations are perceivable to humans. The major difference between our paper and these works (Guo et al., 2018a; Sharma et al., 2019) is that we connect the model’s behavior on original images to that on high-frequency components and explain the model’s generalization behavior based on this connection. On the function space, Rahaman et al. (2018) and Xu et al. (2019) showed through Fourier transform analysis that networks tend to learn simple low-frequency functions.
6 Conclusion
We started with an empirical observation suggesting that CNN can exploit high-frequency components of an image, which led to Remark 1, stating that the unintuitive generalization behavior is partially due to CNN perceiving the data at a much higher granularity than a human does. We consider Remark 1 the main contribution of this paper. Building around this main contribution, we discussed the trade-off between robustness and accuracy and then proposed defense and attack methods. The experimental results suggest that Remark 1, although initially based on observations from specific examples, is one of the reasons accounting for CNN’s unintuitive generalization behaviors.
Finally, we need to clarify that Remark 1 is probably only one of the explanations of the unintuitive generalization behavior of CNN, and there may exist other features that are also captured by the CNN and imperceptible to a human. However, to the best of our knowledge, this is the first work to go beyond statistical arguments and offer some physical explanations of the generalization behavior of CNN. Hopefully, this work has emphasized the importance of understanding the nature of data and can inspire new branches of studies focusing on the data perspective.
References
- Akhtar and Mian (2018) N. Akhtar and A. Mian. Threat of adversarial attacks on deep learning in computer vision: A survey. IEEE Access, 6:14410–14430, 2018. ISSN 2169-3536. doi: 10.1109/access.2018.2807385.
- Alaifari et al. (2019) R. Alaifari, G. S. Alberti, and T. Gauksson. ADef: an iterative algorithm to construct adversarial deformations. In International Conference on Learning Representations, 2019.
- Bracewell (1986) R. N. Bracewell. The Fourier transform and its applications, volume 31999. McGraw-Hill New York, 1986.
- Bubeck et al. (2018) S. Bubeck, E. Price, and I. Razenshteyn. Adversarial examples from computational constraints. arXiv preprint arXiv:1805.10204, 2018.
- Carlini and Wagner (2017) N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.
- Carlini et al. (2019) N. Carlini, A. Athalye, N. Papernot, W. Brendel, J. Rauber, D. Tsipras, I. Goodfellow, A. Madry, and A. Kurakin. On evaluating adversarial robustness, 2019.
- Chakraborty et al. (2018) A. Chakraborty, M. Alam, V. Dey, A. Chattopadhyay, and D. Mukhopadhyay. Adversarial attacks and defences: A survey, 2018.
- Cisse et al. (2017) M. Cisse, Y. Adi, N. Neverova, and J. Keshet. Houdini: Fooling deep structured prediction models. arXiv preprint arXiv:1707.05373, 2017.
- Dabov et al. (2007) K. Dabov, A. Foi, and K. Egiazarian. Video denoising by sparse 3d transform-domain collaborative filtering. In 2007 15th European Signal Processing Conference, pages 145–149. IEEE, 2007.
- Deng et al. (2009) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
- Farnia et al. (2019) F. Farnia, J. Zhang, and D. Tse. Generalizable adversarial training via spectral normalization. In International Conference on Learning Representations, 2019.
- Geirhos et al. (2019) R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. Imagenet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.
- Goodfellow et al. (2015) I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples (2014). In International Conference on Learning Representations, 2015.
- Guo et al. (2018a) C. Guo, J. S. Frank, and K. Q. Weinberger. Low frequency adversarial perturbation, 2018a.
- Guo et al. (2018b) Y. Guo, C. Zhang, C. Zhang, and Y. Chen. Sparse dnns with improved adversarial robustness. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 240–249. Curran Associates, Inc., 2018b.
- Ilyas et al. (2018) A. Ilyas, L. Engstrom, A. Athalye, and J. Lin. Black-box adversarial attacks with limited queries and information. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2137–2146, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
- Ilyas et al. (2019) A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry. Adversarial examples are not bugs, they are features. arXiv preprint arXiv:1905.02175, 2019.
- Jo and Bengio (2017) J. Jo and Y. Bengio. Measuring the tendency of cnns to learn surface statistical regularities. arXiv preprint arXiv:1711.11561, 2017.
- Kawaguchi et al. (2017) K. Kawaguchi, L. P. Kaelbling, and Y. Bengio. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017.
- Keskar et al. (2017) N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017.
- Kim et al. (2019) B. Kim, J. Seo, and T. Jeon. Bridging adversarial robustness and gradient interpretability, 2019.
- Krizhevsky and Hinton (2009) A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
- Krizhevsky et al. (2012) A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- LeCun (1998) Y. LeCun. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
- Lu (1997) W.-S. Lu. Wavelet approaches to still image denoising. In Conference Record of the Thirty-First Asilomar Conference on Signals, Systems and Computers (Cat. No. 97CB36136), volume 2, pages 1705–1709. IEEE, 1997.
- Madry et al. (2018a) A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018a.
- Madry et al. (2018b) A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018b.
- Mahloujifar et al. (2018) S. Mahloujifar, D. I. Diochnos, and M. Mahmoody. The curse of concentration in robust learning: Evasion and poisoning attacks from concentration of measure, 2018.
- Moosavi-Dezfooli et al. (2016) S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582, 2016.
- Neyshabur et al. (2017) B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017.
- Papernot et al. (2016) N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pages 582–597. IEEE, 2016.
- Rahaman et al. (2018) N. Rahaman, D. Arpit, A. Baratin, F. Draxler, M. Lin, F. A. Hamprecht, Y. Bengio, and A. Courville. On the spectral bias of deep neural networks. arXiv preprint arXiv:1806.08734, 2018.
- Rauber et al. (2017) J. Rauber, W. Brendel, and M. Bethge. Foolbox: A python toolbox to benchmark the robustness of machine learning models, 2017.
- Schmidt et al. (2018) L. Schmidt, S. Santurkar, D. Tsipras, K. Talwar, and A. Mądry. Adversarially robust generalization requires more data, 2018.
- Shafahi et al. (2019) A. Shafahi, W. R. Huang, C. Studer, S. Feizi, and T. Goldstein. Are adversarial examples inevitable? In International Conference on Learning Representations, 2019.
- Shamir et al. (2019) A. Shamir, I. Safran, E. Ronen, and O. Dunkelman. A simple explanation for the existence of adversarial examples with small hamming distance, 2019.
- Sharma et al. (2019) Y. Sharma, G. W. Ding, and M. Brubaker. On the effectiveness of low frequency perturbations, 2019.
- Song et al. (2019) C. Song, K. He, L. Wang, and J. E. Hopcroft. Improving the generalization of adversarial training with domain adaptation. In International Conference on Learning Representations, 2019.
- Szegedy et al. (2013) C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
- Tsipras et al. (2019) D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry. Robustness may be at odds with accuracy. In International Conference on Learning Representations, 2019.
- Wang et al. (2017) H. Wang, A. Meghawat, L.-P. Morency, and E. P. Xing. Select-additive learning: Improving generalization in multimodal sentiment analysis. In 2017 IEEE International Conference on Multimedia and Expo (ICME), pages 949–954. IEEE, 2017.
- Wang et al. (2019) H. Wang, Z. He, and E. P. Xing. Learning robust representations by projecting superficial statistics out. In International Conference on Learning Representations, 2019.
- Wong et al. (2019) E. Wong, F. R. Schmidt, and J. Z. Kolter. Wasserstein adversarial examples via projected sinkhorn iterations, 2019.
- Xiao et al. (2017) H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
- Xiao et al. (2019) K. Y. Xiao, V. Tjeng, N. M. M. Shafiullah, and A. Madry. Training for faster adversarial robustness verification via inducing reLU stability. In International Conference on Learning Representations, 2019.
- Xu et al. (2019) Z.-Q. J. Xu, Y. Zhang, T. Luo, Y. Xiao, and Z. Ma. Frequency principle: Fourier analysis sheds light on deep neural networks, 2019.
- Yan et al. (2018) Z. Yan, Y. Guo, and C. Zhang. Deep defense: Training dnns with improved adversarial robustness. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 419–428. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7324-deep-defense-training-dnns-with-improved-adversarial-robustness.pdf.
- Zhang et al. (2017) C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
- Zhang et al. (2019) H. Zhang, Y. Yu, J. Jiao, E. P. Xing, L. E. Ghaoui, and M. I. Jordan. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019.
Appendix A Proof of Theorem 1
Since the only differences between and are the convolutional kernels at the first layer, to show , we only need to show
where denotes the convolution operation.
With Convolution Theorem (Bracewell, 1986), we have:
Recall the definition of and we know that with a fixed radius , and will have the same sparsity pattern (i.e., values that are away from the centroid will be masked as zeros). The result of the dot product of the frequency domain will follow the same sparsity pattern, thus . Therefore, we can have and apply on both sides, we showed and thus proved the theorem. ∎
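The argument above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes circular (periodic) convolution, a first-layer kernel zero-padded to the image size, and a binary radial mask applied to the centred spectrum. The names `radial_mask`, `low_pass`, `circular_conv`, the image size, and the radius are all illustrative choices.

```python
import numpy as np

def radial_mask(n, r):
    """Binary n x n mask that keeps centred-spectrum entries within distance r of the centre."""
    c = n // 2
    yy, xx = np.ogrid[:n, :n]
    return (np.sqrt((yy - c) ** 2 + (xx - c) ** 2) <= r).astype(float)

def low_pass(a, mask):
    """Zero out every frequency of `a` that lies outside the mask."""
    spec = np.fft.fftshift(np.fft.fft2(a))
    return np.real(np.fft.ifft2(np.fft.ifftshift(spec * mask)))

def circular_conv(a, b):
    """Circular 2-D convolution computed through the Convolution Theorem."""
    return np.real(np.fft.ifft2(np.fft.fft2(a) * np.fft.fft2(b)))

rng = np.random.default_rng(0)
n, r = 16, 4                      # illustrative image size and radius
x = rng.standard_normal((n, n))   # stand-in for an input image
w_pad = np.zeros((n, n))
w_pad[:3, :3] = rng.standard_normal((3, 3))  # 3x3 kernel, zero-padded

mask = radial_mask(n, r)
x_low = low_pass(x, mask)         # low-frequency component of the image
w_low = low_pass(w_pad, mask)     # low-frequency component of the kernel

lhs = circular_conv(x_low, w_pad)  # filtered image, original kernel
rhs = circular_conv(x, w_low)      # original image, filtered kernel
print(np.max(np.abs(lhs - rhs)))
```

Because the mask is binary, multiplying either spectrum by it yields the same element-wise product, so the two sides agree up to floating-point error. Convolution layers in common frameworks use zero-padded cross-correlation rather than circular convolution, so this sketch illustrates the identity used in the proof rather than a specific layer implementation.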
Appendix B Defense Against Attacks Towards Improved Models
We now proceed to a more challenging case: we first apply D2 to the CNN, then apply the attack methods, and evaluate the robustness of the improved model in comparison to the baseline. We do not consider D1 because D1 effectively masks the gradient of the model, so attack methods such as FGSM, DeepFool, and C&W are unlikely to succeed. We test the same settings of D2 as in the previous section, and the results are shown in Figure 5.
As this new case is more challenging, we do not observe significant advantages of our method. However, some interesting patterns emerge. For simpler data sets, our methods are superior to the baseline under DeepFool and C&W attacks, but inferior or equivalent under FGSM attacks. For more challenging data sets, we only observe an advantage of our method on the CIFAR10 data set under FGSM attacks. In the other settings, our method behaves almost the same as the baseline.
Although the AUC does not seem to differentiate our improved models from the baseline model, a closer inspection of the generated adversarial examples helps verify that our method works as expected. Figure 6 shows five MNIST testing images, together with the adversarial examples generated for the baseline model and for our improved models. As we can see, the adversarial examples generated for the improved models tend to shift the “semantics” of the image. The most striking examples are for Model (f) (D2()), where some images have to be perturbed dramatically to deceive the model (3rd and 5th examples of the (f) column). However, D2() achieves this level of robustness at the cost of prediction accuracy, because D2() fails to classify some of the images, as marked by a square at the corner. For other models, we also observe patterns in the perturbed images. For example, several images for the improved models in the 4th row seem to turn a digit 4 into a digit 9. Similar patterns turn some images in the 3rd row into a digit 7 and some images in the 5th row into a digit 5. Therefore, although the numerical evaluation suggests that our improved models barely improve upon the baseline model, this detailed inspection suggests that the adversarial examples generated to attack the improved models can be better perceived by humans. In contrast, because they perturb imperceptible high-frequency components, the adversarial examples generated to attack the original model (column (b)) do not show such patterns, so a human cannot tell which class these images have deceived the model into predicting. As a result, we believe our methods improve the model in such a way that, when it is attacked, at least the differences induced by adversarial examples can be perceived and understood by humans.
It is also worth mentioning that we achieved this level of perception without any training or fine-tuning, so the method can be applied directly to deployed models.
Appendix C HF Adversarial Examples
We show some adversarial examples for which HF, SinglePixel, and Salt&Pepper all successfully attack the model: Figure 7 for MNIST, Figure 8 for FashionMNIST, Figure 9 for CIFAR10, and Figure 10 for ImageNet. Overall, some of the adversarial images generated by HF are blurry.