Project page for our paper: Interpreting Adversarially Trained Convolutional Neural Networks
We attempt to interpret how adversarially trained convolutional neural networks (AT-CNNs) recognize objects. We design systematic approaches to interpret AT-CNNs in both qualitative and quantitative ways and compare them with normally trained models. Surprisingly, we find that adversarial training alleviates the texture bias of standard CNNs when trained on object recognition tasks, and helps CNNs learn a more shape-biased representation. We validate our hypothesis from two aspects. First, we compare the salience maps of AT-CNNs and standard CNNs on clean images and images under different transformations. The comparison visually shows that the predictions of the two types of CNNs are sensitive to dramatically different types of features. Second, to achieve quantitative verification, we construct additional test datasets that destroy either textures or shapes, such as style-transferred versions of clean data, saturated images and patch-shuffled ones, and then evaluate the classification accuracy of AT-CNNs and normal CNNs on these datasets. Our findings shed some light on why AT-CNNs are more robust than normally trained ones and contribute to a better understanding of adversarial training over CNNs from an interpretation perspective.
Convolutional neural networks (CNNs) have achieved great success in a variety of visual recognition tasks (Krizhevsky et al., 2012; Girshick et al., 2014; Long et al., 2015) with their stacked local connections. A crucial issue is to understand what is being learned after training over thousands or even millions of images. This involves interpreting CNNs.
Along this line, some recent works showed that standard CNNs trained on ImageNet base their predictions on local textures rather than on long-range dependencies encoded in the shape of objects (Geirhos et al., 2019; Brendel & Bethge, 2019; Ballester & de Araújo, 2016). Consequently, this texture bias prevents the trained CNNs from generalizing well to images with distorted textures but preserved shape information. Geirhos et al. (2019) also showed that training on a combination of Stylized-ImageNet and ImageNet can alleviate the texture bias of standard CNNs. This naturally raises an intriguing question:
Are there any other trained CNNs that are more biased towards shapes?
To defend against adversarial examples, adversarial training was proposed; that is, instead of minimizing the loss function over clean examples, it minimizes an almost worst-case loss over slightly perturbed examples (Madry et al., 2018). We call these adversarially trained networks AT-CNNs. They have been extensively shown to enhance robustness, i.e., to improve the classification accuracy on adversarial examples. This raises a second question:
What is learned by adversarially trained CNNs that makes them more robust?
In this work, to explore the answers to the above questions, we systematically design various experiments to interpret AT-CNNs and compare them with normally trained models. We find that AT-CNNs are better at capturing long-range correlations such as shapes, and are less biased towards textures than normally trained CNNs on popular object recognition datasets. This finding partially explains why AT-CNNs tend to be more robust than standard CNNs.
We validate our hypothesis from two aspects. First, we compare the salience maps of AT-CNNs and standard CNNs on clean images and on images under different transformations. The comparison visually shows that the predictions of the two CNNs are sensitive to dramatically different types of features. Second, we construct additional test datasets that destroy either textures or shapes, such as the style-transferred version of clean data, saturated images and patch-shuffled images, and then evaluate the classification accuracy of AT-CNNs and normal CNNs on these datasets. These carefully designed experiments provide a quantitative comparison between the two CNNs and demonstrate their biases when making predictions.
To the best of our knowledge, we are the first to carry out a systematic investigation on interpreting adversarially trained CNNs, both visually and quantitatively. Our findings shed some light on why AT-CNNs are more robust than normally trained ones and also contribute to a better understanding of adversarial training over CNNs from an interpretation perspective. Our code is available at https://github.com/PKUAI26/AT-CNN.
The remainder of the paper is structured as follows. We introduce background knowledge on adversarial training and salience methods in Section 2. The methods for interpreting AT-CNNs are described in Section 3. Then we present the experimental results to support our findings in Section 4. The related works and discussions are presented in Section 5. Section 6 concludes the paper.
Adversarial training was first proposed by Goodfellow et al. (2014) and is so far the most successful approach for building models that are robust to adversarial examples (Madry et al., 2018; Sinha et al., 2018; Athalye et al., 2018; Zhang et al., 2019b, a). It can be formulated as solving a robust optimization problem (Shaham et al., 2015):
$$\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \Big[ \max_{\delta \in S} \ell\big(f(x + \delta; \theta), y\big) \Big] \qquad (1)$$
where $f(\cdot\,; \theta)$ represents the neural network parameterized by weights $\theta$; the input-output pair $(x, y)$ is sampled from the training set $\mathcal{D}$; $\delta$ denotes the adversarial perturbation and $\ell(\cdot, \cdot)$ is the chosen loss function, e.g. the cross-entropy loss. $S$ denotes a certain norm constraint, such as $\ell_\infty$ or $\ell_2$.
The inner maximization is approximated by adversarial examples generated by various attack methods. Training against a projected gradient descent (PGD; Madry et al., 2018) adversary leads to state-of-the-art white-box robustness. We use PGD-based adversarial training with bounded $\ell_\infty$ and $\ell_2$ norm constraints. We also investigate FGSM-based (Goodfellow et al., 2014) adversarial training.
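As a rough illustration of how the inner maximization can be approximated, the following sketch runs an $\ell_\infty$-bounded PGD attack in NumPy. It is not the authors' implementation: the linear softmax classifier (`W`, `b`), the function names and the zero initialization of the perturbation are assumptions made to keep the example self-contained, and Madry et al. (2018) additionally use a random start.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def loss_and_input_grad(W, b, x, y):
    """Cross-entropy loss of a (hypothetical) linear softmax classifier
    and its gradient with respect to the input x."""
    p = softmax(W @ x + b)
    loss = -np.log(p[y])
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    grad_x = W.T @ (p - onehot)          # d loss / d x for softmax + cross-entropy
    return loss, grad_x

def pgd_linf(W, b, x, y, eps, step_size, n_steps):
    """Approximate max over ||delta||_inf <= eps of the loss by projected
    gradient ascent, keeping x + delta inside the pixel range [0, 1]."""
    delta = np.zeros_like(x)             # assumption: start from the clean input
    for _ in range(n_steps):
        _, g = loss_and_input_grad(W, b, x + delta, y)
        # signed gradient ascent step, then projection onto the eps-ball
        delta = np.clip(delta + step_size * np.sign(g), -eps, eps)
        # projection onto the valid pixel range
        delta = np.clip(x + delta, 0.0, 1.0) - x
    return x + delta
```

In adversarial training, the examples returned by such an attack replace the clean inputs in the outer minimization over the network weights.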
Given a trained neural network, visualizing the salience maps aims at assigning a sensitivity value, sometimes also called “attribution”, to show the sensitivity of the output to each pixel of an input image. Salience methods can mainly be divided (Ancona et al., 2018) into perturbation-based methods (Zeiler & Fergus, 2014; Zintgraf et al., 2017) and gradient-based methods (Erhan et al., 2009; Simonyan et al., 2013; Shrikumar et al., 2017; Sundararajan et al., 2017; Selvaraju et al., 2017; Zhou et al., 2016; Smilkov et al., 2017; Bach et al., 2015). Recently, Adebayo et al. (2018) carried out a systematic test of many gradient-based salience methods, and only variants of Grad and GradCAM (Selvaraju et al., 2017) passed the proposed sanity checks. We thus choose Grad and its smoothed version SmoothGrad (Smilkov et al., 2017) for visualization.
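Formally, let $x \in \mathbb{R}^d$ denote the input image. A trained network is a function $f: \mathbb{R}^d \to \mathbb{R}^K$, where $K$ is the total number of classes. Let $S_c: \mathbb{R}^d \to \mathbb{R}$ denote the class activation function for each class $c$. We seek to obtain a salience map $E \in \mathbb{R}^d$. The Grad explanation is the gradient of the class activation $S_c$ with respect to the input image $x$,
$$E = \frac{\partial S_c(x)}{\partial x}. \qquad (2)$$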
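SmoothGrad (Smilkov et al., 2017) was proposed to alleviate noise in the gradient explanation by averaging the gradients over noisy copies of an input. Thus, for an input $x$, the smoothed variant of Grad, SmoothGrad, can be written as
$$E = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial S_c(x_i)}{\partial x_i}, \qquad (3)$$
where $x_i = x + g_i$, and the $g_i \sim \mathcal{N}(0, \sigma^2)$ are drawn i.i.d. from a normal distribution. In all our experiments, the number of samples $n$ and the noise level $\sigma$ are kept fixed. The class activation $S_c(x)$ is chosen as a function of $p_c(x)$, where $p_c(x)$ is the probability of class $c$ assigned by a classifier to input $x$.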
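The averaging behind SmoothGrad can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: `grad_fn` is a hypothetical callable that returns the gradient of the class activation with respect to its input (in practice it would come from automatic differentiation), and `n_samples` and `noise_std` are placeholders for the number of noisy copies and the noise standard deviation.

```python
import numpy as np

def smoothgrad(grad_fn, x, n_samples, noise_std, rng=None):
    """Average the gradient explanation over Gaussian-perturbed copies of x.

    grad_fn   : callable mapping an input array to dS_c/dx at that input
                (hypothetical; any differentiable model can provide it)
    n_samples : number of noisy copies to average over
    noise_std : standard deviation of the Gaussian noise
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=np.float64)
    total = np.zeros_like(x)
    for _ in range(n_samples):
        noisy = x + rng.normal(0.0, noise_std, size=x.shape)
        total += grad_fn(noisy)          # Grad explanation at the noisy input
    return total / n_samples

# Toy class activation S_c(x) = sum(x ** 2), whose gradient is 2 * x.
salience = smoothgrad(lambda z: 2.0 * z, np.array([0.5, -1.0, 2.0]),
                      n_samples=50, noise_std=0.1)
```

With `noise_std = 0` the function reduces to the plain Grad explanation evaluated at `x`.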
In this section, we elaborate on our method for interpreting adversarially trained CNNs and comparing them with normally trained ones. Three image datasets are considered: Tiny ImageNet (https://tiny-imagenet.herokuapp.com/), Caltech-256 (Griffin et al., 2007) and CIFAR-10.
We first visualize the salience maps of AT-CNNs and normal CNNs to demonstrate that models trained in these two different ways are sensitive to different kinds of features. Besides this qualitative comparison, we also test the two kinds of CNNs on differently transformed datasets to distinguish their preferred features.
Figure 1: (a) Original, (b) Stylized, (c) Saturated 8, (d) Saturated 1024, (e) Patch-shuffle 2, (f) Patch-shuffle 4.
A straightforward way of investigating the difference between AT-CNNs and CNNs is to visualize which group of pixels the network outputs are most sensitive to. Salience maps generated by Grad and its smoothed variant SmoothGrad are good candidates to show what features a model is sensitive to. We compare the salience maps between AT-CNNs and CNNs on clean images, and on images under texture-preserving and shape-preserving distortions. Extensive results can be seen in Section 4.1.
As pointed out by Smilkov et al. (2017), sensitivity maps based on the Grad method are often visually noisy, highlighting some pixels that, to a human eye, seem randomly selected. SmoothGrad in Eq. (3), on the other hand, can reduce visual noise by averaging the gradient over Gaussian-perturbed images. Thus, we mainly report the salience maps produced by SmoothGrad, and the Grad visualization results are provided in the appendix. Note that the two visualization methods lead to a consistent conclusion on the difference between the two trained CNNs.
Besides visual inspection of sensitivity maps, we propose to measure the sensitivity of AT-CNNs and CNNs to different features by evaluating the performance degradation under several distortions that preserve either shapes or textures. Intuitively, if a model relies heavily on textures, its performance will degrade severely if we destroy most of the textures while preserving other information, such as shapes and other features. However, a perfect disentanglement of texture, shape and other feature information is impossible (Gatys et al., 2015). In this work, we mainly construct three kinds of image transformations to achieve the shape or texture distortion: style transfer, saturation and patch-shuffling. Some image samples are shown in Figure 1. We also add three Fourier-filtered test sets in the appendix. We now describe each of these transformations and their properties.
Note that we conduct normal training or adversarial training on the original training sets, and then evaluate their generalizability over the transformed data. During the training, we never use the transformed datasets.
Stylizing. Geirhos et al. (2019) utilized style transfer (Huang & Belongie, 2017) to generate images with conflicting shape and texture information to demonstrate the texture bias of ImageNet-trained standard CNNs. Following the same rationale, we utilize style transfer to destroy most of the textures while preserving the global shape structures in images, and build a stylized test dataset. Therefore, with similar generalization error, models capturing shapes better should also perform better on stylized test images than those biased towards textures. The style-transferred image samples are shown in Figure 1(b).
Saturation. Similar to Ding et al. (2019), we denote the saturation of the image $x$ by $x^{(p)}$, where $p$ indicates the saturation level ranging from $0$ to $\infty$. When $p = 2$, the saturation operation does not change the image. When $p > 2$, increasing the saturation level pushes the pixel values towards binarized ones, and $p = \infty$ leads to pure binarization. Specifically, for each pixel of image $x$ with value $v \in [0, 1]$, its corresponding saturated pixel of $x^{(p)}$ is defined as
$$v^{(p)} = \operatorname{sign}(2v - 1)\, |2v - 1|^{2/p} \times 0.5 + 0.5.$$
One can observe from Figure 1(c) and (d) that increasing the saturation level gradually destroys some texture information while preserving most of the contour structures.
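The saturation operation can be sketched as follows. This is a minimal NumPy illustration that assumes pixel values in [0, 1] and the saturation rule of Ding et al. (2019), $\operatorname{sign}(2v - 1)|2v - 1|^{2/p} \times 0.5 + 0.5$; the function name `saturate` is hypothetical.

```python
import numpy as np

def saturate(image, p):
    """Saturate an image whose pixel values lie in [0, 1].

    p = 2 leaves the image unchanged, larger p pushes pixels towards 0 or 1,
    and p = np.inf gives pure binarization (pixels at exactly 0.5 stay at 0.5).
    p is assumed to be strictly positive.
    """
    image = np.asarray(image, dtype=np.float64)
    centered = 2.0 * image - 1.0                   # map [0, 1] to [-1, 1]
    exponent = 0.0 if np.isinf(p) else 2.0 / p
    magnitude = np.abs(centered) ** exponent       # shrink or expand the contrast
    return np.sign(centered) * magnitude * 0.5 + 0.5

pixels = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
unchanged = saturate(pixels, 2)        # identical to the input
binarized = saturate(pixels, np.inf)   # every pixel except 0.5 becomes 0 or 1
```

Values of `p` below 2 move every pixel towards 0.5, which matches the gray images obtained when the saturation level is decreased.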
Patch-Shuffling. To destroy long-range shape information, we split images into $k \times k$ small patches and randomly rearrange the order of these patches, with $k \in \{2, 4, 8\}$. Favorably, this operation preserves most of the texture information and destroys most of the shape information. The patch-shuffled image samples are shown in Figure 1(e), (f). Note that as $k$ increases, more information of the original image is lost, especially for images with low resolution.
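Patch-shuffling can be sketched with a pair of reshapes. The snippet below is an illustrative NumPy version, not the authors' code; it assumes an image of shape (H, W, C) whose height and width are divisible by the number of splits `k`, and the function name is hypothetical.

```python
import numpy as np

def patch_shuffle(image, k, rng=None):
    """Split an (H, W, C) image into a k x k grid of patches and
    rearrange the patches in a random order."""
    rng = np.random.default_rng() if rng is None else rng
    h, w, c = image.shape
    if h % k or w % k:
        raise ValueError("image height and width must be divisible by k")
    ph, pw = h // k, w // k
    # (H, W, C) -> (k, ph, k, pw, C) -> (k, k, ph, pw, C) -> (k*k, ph, pw, C)
    patches = (image.reshape(k, ph, k, pw, c)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(k * k, ph, pw, c))
    patches = patches[rng.permutation(k * k)]      # shuffle the patch order
    # Undo the reshapes to stitch the shuffled patches back into an image.
    return (patches.reshape(k, k, ph, pw, c)
                   .transpose(0, 2, 1, 3, 4)
                   .reshape(h, w, c))
```

Each patch is moved as a whole, so the pixel statistics inside a patch (the local texture) are untouched while the global arrangement is destroyed.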
We describe the experimental setup used to evaluate the performance of AT-CNNs and standard CNNs on data distributions manipulated by the above-mentioned operations. We conduct experiments on three datasets: CIFAR-10, Tiny ImageNet and Caltech-256 (Griffin et al., 2007). Note that we do not create style-transferred and patch-shuffled test sets for CIFAR-10 due to its limited resolution.
For data augmentation on CIFAR-10, we perform zero padding with width 4, horizontal flip and random crop.
Tiny ImageNet has 200 classes of objects. Each class has 500 training images, 50 validation images, and 50 test images. All images from Tiny ImageNet are of size $64 \times 64$. We re-scale them and perform random horizontal flip and per-image standardization as data augmentation.
[Figure: (a) Images from Caltech-256, (b) Images from Tiny ImageNet]
Caltech-256 (Griffin et al., 2007) consists of 257 object categories containing a total of 30607 images. The resolution of images from Caltech-256 is much higher than that of the above two datasets. We manually hold out part of the images as the test set. We perform re-scaling and random cropping following He et al. (2016a). For both Tiny ImageNet and Caltech-256, we use a ResNet-18 model as the network architecture.
For all three datasets, we train three types of AT-CNNs, which mainly differ in the way adversarial examples are generated: FGSM, PGD with bounded $\ell_\infty$ norm and PGD with bounded $\ell_2$ norm. For each attack method we train several models under different attack strengths. Details are listed in the appendix. To understand whether the difference in performance degradation between AT-CNNs and standard CNNs is due to the poor generalization (Schmidt et al., 2018; Tsipras et al., 2018) of adversarial training, we also compare the AT-CNNs with an underfitting CNN (trained on clean data) whose generalization performance is similar to that of the AT-CNNs. We train 11 models on each dataset. Their generalization performance on clean data and their robustness measured by PGD attack are shown in Table 1.
To investigate what features of an input image AT-CNNs and normal CNNs are most sensitive to, we generate sensitivity maps using SmoothGrad (Smilkov et al., 2017) on clean images, saturated images, and stylized images. The visualization results are presented in Figure 2.
We can easily observe that the salience maps of AT-CNNs are much sparser and mainly focus on the contours of each object on all kinds of images, including the clean, saturated and stylized ones. In contrast, sensitivity maps of standard CNNs are noisier and less biased towards the shapes of objects. This is consistent with the findings of Geirhos et al. (2019).
In particular, in the second row of Figure 2, sensitivity maps of normal CNNs for the “dog” class are still noisy even when the saturated input images are nearly binarized. On the other hand, after adversarial training, the models successfully capture the shape information of the object, providing a more interpretable prediction.
For the stylized images shown in the third row of Figure 2, even with dramatically changed textures after style transfer, AT-CNNs are still able to focus on the shapes of the original objects, while standard CNNs fail completely.
Due to limited space, we provide more visualization results (including the sensitivity maps generated by the Grad method) in the appendix.
In this part, we mainly show the generalization performance of AT-CNNs and normal CNNs on distorted image datasets that preserve either shapes or textures. This helps us understand, in a quantitative way, how differently the two types of models are biased.
For all experimental results below, besides the top-1 accuracy, we also report an “accuracy on correctly classified images”. This accuracy is measured by first selecting the images from the clean test set that are correctly classified, and then measuring the accuracy on the transformed versions of these correctly classified images.
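The metric described above can be computed as in the following sketch. It is an illustrative NumPy helper (the function and argument names are hypothetical) that takes the predicted labels on the clean images, the predicted labels on their transformed versions and the ground-truth labels, all aligned by index.

```python
import numpy as np

def accuracy_on_correctly_classified(clean_pred, transformed_pred, labels):
    """Accuracy on transformed images, restricted to the images whose
    clean version is classified correctly."""
    clean_pred = np.asarray(clean_pred)
    transformed_pred = np.asarray(transformed_pred)
    labels = np.asarray(labels)
    correct_on_clean = clean_pred == labels        # select correctly classified clean images
    if not correct_on_clean.any():
        return float("nan")                        # undefined when nothing is correct
    still_correct = transformed_pred[correct_on_clean] == labels[correct_on_clean]
    return float(still_correct.mean())

# Three of four clean images are correct; two of those three survive the transformation.
acc = accuracy_on_correctly_classified([0, 1, 2, 0], [0, 1, 0, 3], [0, 1, 2, 3])
```

Restricting the evaluation to these images separates the effect of the distortion from the difference in clean accuracy between models.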
Table 2 (columns): dataset, Caltech-256, Stylized Caltech-256, Tiny ImageNet, Stylized Tiny ImageNet.
Following Geirhos et al. (2019), we generate stylized versions of the test sets of Caltech-256 and Tiny ImageNet.
We report the “accuracy on correctly classified images” of all the trained models on the stylized test sets in Table 2. Compared with standard CNNs, AT-CNNs achieve higher accuracy on stylized images, whose textures are dramatically changed, despite having lower accuracy on the original test images. The comparison quantitatively shows that AT-CNNs tend to be more invariant with respect to local textures.
[Figure: (a) Caltech-256, (b) Tiny ImageNet]
[Figure 6: (a) Original Image, (b) Patch-Shuffle 2, (c) Patch-Shuffle 4, (d) Patch-Shuffle 8]
[Figure: (a) Caltech-256, (b) Tiny ImageNet]
We use the saturation operation to manipulate the images, and show how increasing saturation levels affects the accuracy of models trained in different ways.
In Figure 4, we visualize images with varying saturation levels. It can easily be observed that increasing the saturation level makes images more “binarized”: some textures are wiped out, while edges become sharper and shape information is preserved. When the saturation level is smaller than $2$ (the value corresponding to the clean image), the operation pushes all pixels towards $0.5$; nearly all the information is lost, which leads to a totally gray image with constant pixel value.
We measure the “accuracy on correctly classified images” for all the trained models, and show the results in Figure 5. We can observe that with increasing saturation level, more texture information is lost. Favorably, adversarially trained models exhibit much less sensitivity to this texture loss and still obtain a high classification accuracy. The results indicate that AT-CNNs are more robust to “saturation” or “binarizing” operations, which suggests that the prediction capability of AT-CNNs relies less on textures and more on shapes. Results on CIFAR-10 tell the same story and are presented in the appendix due to limited space.
Additionally, in our experiments, for each adversarial training approach, either PGD- or FGSM-based, AT-CNNs with higher robustness towards the PGD adversary are more invariant to increasing saturation levels and texture loss. On the other hand, adversarial training with higher robustness typically hurts generalization on the clean dataset. Our finding also supports the claim that “robustness may be at odds with accuracy” (Tsipras et al., 2018).
When the saturation level is decreased, all models show a similar degree of performance degradation, indicating that AT-CNNs are not robust to all kinds of image distortions. They tend to be more robust to specific types of distortions. We leave further investigation of this issue as future work.
Stylizing and saturation operations aim at changing or removing the texture information of the original images while preserving the features of shapes and edges. To test the different biases of AT-CNNs and standard CNNs the other way around, we shatter the shape and edge information by splitting the images into $k \times k$ patches and then randomly shuffling them. This operation still maintains the local textures if $k$ is not too large.
Figure 6 shows one example of patch-shuffled images under different numbers of splits. The first row shows the probabilities assigned by different models to the ground-truth class of the original image. Clearly, after random shuffling, the shape and edge features are dramatically destroyed, and the prediction probability of the adversarially trained CNNs drops significantly, while the normal CNNs still maintain a high confidence in the ground-truth class. This reveals that AT-CNNs are more biased towards shapes and edges than normally trained ones.
Moreover, Figure 7 depicts the “accuracy on correctly classified images” for all the models measured on the “Patch-shuffled” test sets with an increasing number of splits. AT-CNNs, especially those trained against a stronger attack, are more sensitive to “Patch-shuffling” operations in most of our experiments.
Note that under the “Patch-shuffle 8” operation, all models have a similar “accuracy on correctly classified images”, which is largely due to the severe information loss. Also note that this accuracy of all models on Tiny ImageNet shown in 7(a) is much lower than that on Caltech-256 in 7(b). That is, under “Patch-shuffle 1”, a normally trained CNN reaches a considerably higher accuracy on Caltech-256 than on Tiny ImageNet. This mainly stems from the limited resolution of Tiny ImageNet, since the “Patch-Shuffle” operation destroys more useful features on low-resolution images than on images with higher resolution.
Interpreting AT-CNNs. Recently, some relevant findings have indicated that AT-CNNs learn fundamentally different feature representations than standard classifiers. Tsipras et al. (2018) showed that sensitivity maps of AT-CNNs in the input space align well with human perception. Additionally, by visualizing large-$\epsilon$ adversarial examples against AT-CNNs, it can be observed that the adversarial examples capture salient data characteristics of a different class and appear semantically similar to images of that class. Dong et al. (2017) leveraged adversarial training to produce a more interpretable representation by visualizing active neurons. Compared with Tsipras et al. (2018) and Dong et al. (2017), we have conducted a more systematic investigation for interpreting AT-CNNs. We construct three types of image transformation that either largely change the textures while preserving shape information (i.e. stylizing and saturation), or shatter the shape/edge features while keeping the local textures (i.e. patch-shuffling). Evaluating the generalization of AT-CNNs on these designed datasets provides a quantitative way to verify and interpret their strong shape bias compared with normal CNNs.
Insights for defending against adversarial examples. Based on our investigation of AT-CNNs, we find that robustness towards adversarial examples is correlated with the capability of capturing long-range features like shapes or contours. This naturally raises the question: could other models that capture more global features or have more texture invariance be more robust to adversarial examples, even without adversarial training? This might provide some insights for designing new network architectures or new strategies for enhancing the bias towards long-range features. Some recent works partially answer this question. Xie et al. (2018) enhanced standard CNNs with non-local blocks inspired by Wang et al. (2018) and Vaswani et al. (2017), which capture long-range dependencies in a data-dependent manner; when combined with adversarial training, their networks achieved state-of-the-art adversarial robustness on ImageNet. Luo et al. (2018) destroyed some of the local connections of standard CNNs by randomly selecting a set of neurons and removing them from the network before training, thus forcing the CNNs to focus less on local texture features. With this design, they achieved improved black-box robustness.
Adversarial training with other types of attacks. In this work, we mainly interpret AT-CNNs based on norm-constrained perturbations of the original images. It is worth noting that the difference between normally trained and adversarially trained CNNs may depend heavily on the type of adversary. Models trained against a spatially-transformed adversary (Xiao et al., 2018), denoted as ST-ST-CNNs, have robustness towards the PGD attack similar to that of standard models, and their salience maps are still quite different, as shown in Figure 8. Also, the average distance between salience maps is close to that of the standard CNN, which is much higher than that of the PGD-AT-CNN. There exists a variety of generalized types of parameterized attacks, such as spatially transformed (Xiao et al., 2018) and GAN-based adversarial examples (Song et al., 2018). We leave interpreting AT-CNNs based on these generalized types of attacks as future work.
From both qualitative and quantitative perspectives, we have carried out a systematic study on interpreting adversarially trained convolutional neural networks. By constructing distorted test sets that preserve either shapes or local textures, we compare the sensitivity maps of AT-CNNs and normal CNNs on clean, stylized and saturated images, which visually demonstrates that AT-CNNs are more biased towards global structures, such as shapes and edges. More importantly, we evaluate the generalization performance of the two models on the three constructed datasets: stylized, saturated and patch-shuffled ones. The results clearly indicate that AT-CNNs are less sensitive to texture distortion and focus more on shape information, while normally trained CNNs behave the other way around.
Understanding what a model has learned is an essential topic in both machine learning and computer vision. The strategies we propose can also be extended to interpret other neural networks, such as models for object detection and semantic segmentation.
This work is supported by National Natural Science Foundation of China (No.61806009), Beijing Natural Science Foundation (No.4184090), Beijing Academy of Artificial Intelligence (BAAI) and Intelligent Manufacturing Action Plan of Industrial Solid Foundation Program (No.JCKY2018204C004). We also appreciate insightful discussions with Dinghuai Zhang and Dr. Lei Wu.
Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.
Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
Automatic differentiation in PyTorch. 2017.
Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929, 2016.
CIFAR-10. We train a standard ResNet-18 (He et al., 2016a) architecture, which has 4 groups of residual layers with filter sizes (64, 128, 256, 512) and 2 residual units per group.
We evaluate the robustness of all our models using a projected gradient descent adversary with a fixed perturbation bound $\epsilon$, step size 2 and 40 iterations.
We perform 9 types of adversarial training on each dataset. 7 of the 9 kinds of adversarial training are against a projected gradient descent (PGD) adversary (Madry et al., 2018); the other 2 are against an FGSM adversary (Goodfellow et al., 2014).
We list the values of $\epsilon$ for adversarial training for each dataset and $\ell_p$-norm. In all settings, PGD runs 20 iterations.
$\ell_\infty$-norm bounded adversary. For all three datasets, pixel values range from 0 to 1. We train 4 adversarially trained CNNs with $\epsilon \in \{1/255, 2/255, 4/255, 8/255\}$; these four models are denoted as PGD-inf: 1, 2, 4, 8 respectively, with step sizes 1/255, 1/255, 2/255, 4/255.
$\ell_2$-norm bounded adversary. For Caltech-256 & Tiny ImageNet, we train three adversarially trained CNNs with different values of $\epsilon$; these three models are denoted as PGD-l2: 4, 8, 12 respectively. Step sizes for these three models are 2/255, 4/255, 6/255. For CIFAR-10, where images are of size $32 \times 32$, the three adversarially trained CNNs use different values of $\epsilon$, but they are denoted in the same way and have the same step sizes as those for Caltech-256 & Tiny ImageNet.
FGSM adversary. The two FGSM-based adversarially trained CNNs use different values of $\epsilon$ and are denoted as FGSM 4, 8 respectively.
Following Geirhos et al. (2019), we construct stylized test sets for Caltech-256 and Tiny ImageNet by applying AdaIN style transfer (Huang & Belongie, 2017) with a fixed stylization coefficient to every test image, using the style of a randomly selected painting from Kaggle’s Painter by Numbers dataset (https://www.kaggle.com/c/painter-by-numbers/). We used the source code provided by Geirhos et al. (2019).
Jo & Bengio (2017) showed that deep neural networks tend to learn surface statistical regularities as opposed to high-level abstractions. Following them, we test the performance of differently trained CNNs on high-pass and low-pass filtered datasets to show their tendencies.
Following Jo & Bengio (2017), we construct three types of Fourier-filtered versions of the test set.
The low frequency filtered version. We use a radial mask in the Fourier domain to set higher frequency modes to zero (low-pass filtering).
The high frequency filtered version. We use a radial mask in the Fourier domain to preserve only the higher frequency modes (high-pass filtering).
The random filtered version. We use a random mask in the Fourier domain that sets each mode to 0 independently with a fixed probability. The random mask is generated on the fly during testing.
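The three filters can be sketched with NumPy's FFT routines. This is an illustrative version for a single 2-D (grayscale) array, not the authors' code; the cut-off `radius`, the drop probability `drop_prob` and all function names are assumptions, and for a colour image each channel would be filtered separately.

```python
import numpy as np

def radial_mask(h, w, radius):
    """Boolean mask that is True within `radius` of the zero-frequency
    component, which sits at (h // 2, w // 2) after fftshift."""
    yy, xx = np.ogrid[:h, :w]
    return (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2

def fourier_filter(image, mask):
    """Multiply the centred spectrum of a 2-D image by `mask` and transform back.
    Only the real part is returned; a non-symmetric mask (such as a random one)
    leaves a small imaginary part that is discarded here."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    return np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real

def low_pass(image, radius):
    return fourier_filter(image, radial_mask(*image.shape, radius))

def high_pass(image, radius):
    return fourier_filter(image, ~radial_mask(*image.shape, radius))

def random_filter(image, drop_prob, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(image.shape) >= drop_prob    # each mode is zeroed with prob. drop_prob
    return fourier_filter(image, keep)
```

Because the low-pass and high-pass masks are complementary, the two filtered images add up to the original image.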
We measure the generalization performance (accuracy on correctly classified images) of each model on these three filtered datasets from Caltech-256; results are listed in Table 3. AT-CNNs perform better on the low-pass filtered dataset and worse on the high-pass filtered dataset. The results indicate that the predictions of AT-CNNs depend more on low-frequency information. This finding is consistent with our conclusions, since local features such as textures are often considered high-frequency information, while shapes and contours are more low-frequency.
Table 3 (columns): data set, the low frequency filtered version, the high frequency filtered version, the random filtered version.
We list the detailed results of our quantitative experiments here. Tables 4, 5 and 6 show the results of each model on the test sets with different saturation levels. Tables 7 and 8 list the results of each model on the test sets after different patch-shuffling operations.