Interpreting Adversarial Examples by Activation Promotion and Suppression

04/03/2019
by   Kaidi Xu, et al.

It is widely known that convolutional neural networks (CNNs) are vulnerable to adversarial examples: crafted images with imperceptible perturbations. However, the interpretability of these perturbations is less explored in the literature. This work aims to better understand the roles of adversarial perturbations and to provide visual explanations from the pixel, image, and network perspectives. We show that adversarial perturbations exert a promotion and suppression effect (PSE) on neurons' activations and can be primarily categorized into three types: 1) suppression-dominated perturbations that mainly reduce the classification score of the true label, 2) promotion-dominated perturbations that focus on boosting the confidence of the target label, and 3) balanced perturbations that play a dual role in suppression and promotion. Further, we provide image-level interpretability of adversarial examples, which links the PSE of pixel-level perturbations to class-specific discriminative image regions localized by class activation mapping. Lastly, we analyze the effect of adversarial examples through network dissection, which offers concept-level interpretability of hidden units. We show that there exists a tight connection between the sensitivity (against attacks) of the internal response of units and their interpretability on semantic concepts.


1 Introduction

Adversarial examples are crafted inputs intended to fool machine learning models [3, 4, 5, 6, 7, 8]. Many existing works have shown that deep neural networks (e.g., CNNs) are vulnerable to adversarial examples with human-imperceptible pixel-level perturbations. The study of how to generate and defend against adversarial examples has received a lot of recent attention. Different types of adversarial attacks have been proposed to mislead image classifiers with high success rates. However, understanding these attacks and further interpreting their adversarial effects remain little explored. While many questions remain open with regard to how adversarial examples work, we endeavor to study some fundamental ones: a) How can we interpret the mechanism of adversarial perturbations at the pixel and image levels? b) Beyond attack generation, how can we quantify the effectiveness of different adversarial attacks in a unified way? c) How can we explore the effect of adversarial examples on the internal response of neural networks? In this paper, we provide a comprehensive and deep understanding of adversarial examples from the pixel, image, and network perspectives.

Figure 1: Explanation of adversarial perturbations produced by the C&W attack [9]. The first column shows the original image (with true label ‘Japanese spaniel’) and its adversarial example (with target label ‘bullfrog’). The second column shows the CAM of the original image for the true label and that of the adversarial example for the target label. In the third column, perturbation patterns discovered by our approach, i.e., suppression-dominated (white, corresponding to the face of the spaniel), promotion-dominated (black, corresponding to the face of the bullfrog), and balance-dominated adversaries (gray), are overlaid on the CAMs.

1.1 Our contributions

First, we study the sensitivity and functionality of pixel-level perturbations in image classification. Unlike adversarial saliency maps (ASMs) [22], our proposed sensitivity measure, despite its simplicity, takes into account the dependency among pixels that contribute simultaneously to the classification confidence. We uncover the promotion and suppression effect (PSE) of adversarial examples. We group the adversaries into three types: a) suppression-dominated perturbations that mainly reduce the classification score of the true label, b) promotion-dominated perturbations that focus on boosting the confidence of the target label, and c) balance-dominated perturbations that play a dual role in suppression and promotion.

Second, we associate the PSE of pixel-level perturbations with class activation map (CAM) [1] based image-level interpretability. We show that the adversarial pattern can be interpreted via the class-specific discriminative image regions. Figure 1 presents an example of the C&W adversarial attack [9], where the suppression- and promotion-dominated perturbations are matched to the discriminative region of the natural image (with respect to the true label ‘Japanese spaniel’) and that of the adversarial example (with respect to the target label ‘bullfrog’), respectively. We also show that the CAM-based image-level interpretability provides a means to evaluate the efficacy of attack generation methods. Although some works [10, 11, 12] attempted to connect adversarial examples with CAM, their analysis was preliminary and focused on the visualization of adversarial examples.

Third, we present the first attempt to analyze the effect of adversarial examples on the internal representations of CNNs using the network dissection technique [2]. We show a tight connection between the sensitivity of hidden units of CNNs and their interpretability on semantic concepts, which also aligns with the PSE. Furthermore, we analyze how the internal representations of CNNs evolve with respect to adversarial inputs under both naturally and robustly trained models [13].

1.2 Related Works

Adversarial attack generation has been extensively studied [3, 14, 9, 15, 12], where the effectiveness of adversarial attacks is commonly measured by the attack success rate as well as the $\ell_p$-norm distortion between natural and adversarial examples. Some works [16, 17] generated adversarial attacks by adding noise patches, which differ from norm-ball constrained attacks and lead to higher noise visibility. Rather than attack generation, the goal of this paper is to understand and explain the effect of imperceptible perturbations; here we focus on norm-ball constrained adversarial attacks. Many defense methods have been developed against adversarial attacks. Examples include defensive distillation [18], random mask [19], training with a Lipschitz-regularized loss function [20], and robust adversarial training using min-max optimization [13, 21], where the latter is computationally intensive but is commonly regarded as the strongest defense mechanism.

Although the study of attack generation and defense has attracted an increasing amount of attention, the interpretability of adversarial examples is less explored in the literature. Some preliminary works [22, 23] evaluated the impact of pixel-level adversarial perturbations on changing the classification results. In [22], a Jacobian-based adversarial saliency map (ASM) was introduced to greedily perturb pixels that significantly contribute to the likelihood of the target classification. However, ASM requires evaluating the CNN's forward derivative, which implicitly ignores the coupling effect of pixel-level perturbations, and it becomes less effective when an image has multiple color channels (e.g., RGB), since each color channel is treated independently. As an extension of [22], the work [23] proposed an adversarial saliency prediction (ASP) method, which characterizes the divergence between the ASM distribution and the distribution of perturbations.

Both ASM [22] and ASP [23] have helped humans understand how changes made to inputs affect the outputs of neural networks; however, it remains difficult to visually explain the mechanism of adversarial examples, given that the pixel-level perturbations are small and imperceptible to humans. The works [10, 11] adopted CAM to visualize the change of attention regions between natural and adversarial images, but their use of CAM is preliminary and its connection with the interpretability of pixel-level perturbations is missing. The most relevant work to ours is [12], which proposed an interpretability score via ASM and CAM. However, it focuses on generating structure-driven adversarial attacks by promoting the group sparsity of perturbations. In contrast, our CAM-based analysis applies to different adversarial attacks and associates the class-specific discriminative image regions with pixel-level perturbations. We also show that the CAM-based interpretability provides a means to examine the effectiveness of perturbation patterns.

From the network perspective, the work [24] investigated the effect of an ensemble attack on neurons' activations. Although the ensemble-based attack generation algorithm used there enhances attack transferability, it fails to distinguish the effectiveness of various norm-ball constrained adversarial attacks. In [25], the activation atlas was proposed to show feature visualizations of basis neurons as well as common combinations of neurons, and it was applied to visualizing the effect of adversarial patches (rather than norm-ball constrained adversarial perturbations). Different from [24, 25], we adopt the technique of network dissection [2, 26] to peer into the effect of adversarial examples on the concept-level interpretability of hidden units. Briefly, our work provides a deep understanding of the mechanism of adversarial attacks at the pixel, image, and network levels.

2 Preliminaries: Attack, Dataset, and Model

Let $x_0$ denote the natural image, and let $\boldsymbol{\delta}$ denote the adversarial perturbations to be designed. Unless specified otherwise, the vector representation of an image is used. The adversarial example is then given by $x' = x_0 + \boldsymbol{\delta}$. By setting the input of the CNN to $x_0$ and $x'$, the classifier predicts the true label $t_0$ and the target label $t$ ($t \neq t_0$), respectively. To find minimal adversarial perturbations $\boldsymbol{\delta}$ that are sufficient to predict the target label $t$, a so-called norm-ball constrained attack technique is commonly used; examples considered in this paper include the FGSM [3], C&W [27], EAD [15], and Str attacks [12]. We refer readers to Appendix 1 for more details on attack generation.

Our work attempts to interpret adversarial examples from the pixel (Sec. 3), image (Sec. 4), and network (Sec. 5) perspectives. At the pixel and image levels, we generate adversarial examples from ImageNet under the network models Resnet_v2_101 [28] and Inception_v3 [29] via the aforementioned attack generation methods. At the network level, we generate adversarial examples from the Broadly and Densely Labeled Dataset (Broden) [2], which contains examples with pixel-level concept annotations covering multiple concept categories, including color, material, texture, part, scene, and object. The considered network model is ResNet_152 [28]. When adversarial training [13] is considered, we focus on CIFAR-10 [30] using the ResNet modified by [13] for untargeted attacks.

3 Effects of Pixel-level Perturbations

We are interested in quantifying how much impact a perturbation produces on the classification results with respect to (w.r.t.) the correct and target labels. We use the change of logit scores to measure the effect of a perturbation on a class label. Different from the C&W attack loss [27] that considers all pixels' perturbations $\boldsymbol{\delta}$, we focus on grids of these perturbations, where a ‘grid’ corresponds to a group of pixels, namely, a local region of an image. We then build a grid-level sensitivity measure that, as will be evident later, can be extended to perform image-level sensitivity analysis.

Recall that $x_0$ gives the natural image and $x' = x_0 + \boldsymbol{\delta}$ corresponds to the adversarial example. We divide an image into $N$ grids $\{\mathcal{G}_i\}_{i=1}^N$, where each grid $\mathcal{G}_i$ contains a group of pixels corresponding to a local region of the image, and $\cup_{i=1}^N \mathcal{G}_i$ is the overall set of pixels; in our experiments we fix the grid size for ImageNet. The grids can be obtained by applying a sliding mask with a given stride to the image [12]. We then introduce $\boldsymbol{\delta}^{(i)}$ to characterize the perturbation at grid $\mathcal{G}_i$, where $[\boldsymbol{\delta}^{(i)}]_j = [\boldsymbol{\delta}]_j$ if $j \in \mathcal{G}_i$, and $[\boldsymbol{\delta}^{(i)}]_j = 0$ otherwise. Here $[\mathbf{a}]_j$ denotes the $j$th element of the vector $\mathbf{a}$. In Definition 1, we measure the effect of $\boldsymbol{\delta}^{(i)}$ through its induced logit change with respect to the true label $t_0$ and the target label $t$, respectively.

Definition 1

(Sensitivity measure of perturbations): The impact of the grid-level perturbation $\boldsymbol{\delta}^{(i)}$ on image classification is measured from two aspects: a) the logit change $s_i^-$ with respect to the true label $t_0$, and b) the logit change $s_i^+$ with respect to the target label $t$. That is,

$s_i^- = \max\{ Z_{t_0}(x' - \boldsymbol{\delta}^{(i)}) - Z_{t_0}(x'), \, \epsilon \}$,  (1)
$s_i^+ = \max\{ Z_{t}(x') - Z_{t}(x' - \boldsymbol{\delta}^{(i)}), \, \epsilon \}$,  (2)
$s_i = \| (s_i^-, s_i^+) \|_2$,  (3)

for $i \in [N]$, where $Z_c(\cdot)$ gives the logit score with respect to class $c$, and $\epsilon$ is a small positive number.

In Definition 1, $s_i^-$ measures how much the logit score (with respect to $t_0$) changes if the perturbation at $\mathcal{G}_i$ is eliminated. Clearly, a large $s_i^-$ implies a more significant role of $\boldsymbol{\delta}^{(i)}$ in suppressing the classification result away from $t_0$. By contrast, $s_i^+$ measures the effect of $\boldsymbol{\delta}^{(i)}$ on promoting the targeted classification result. The overall adversarial significance of $\boldsymbol{\delta}^{(i)}$ is the combined effect of $s_i^-$ and $s_i^+$ through their $\ell_2$ norm. Thus, grids with small values of $s_i$ play a less significant role in misleading image classifiers. In (1)–(2), we take the maximum with $\epsilon$ to get rid of negative values of $s_i^-$ and $s_i^+$, namely, the insignificant case. Despite the apparent simplicity of Definition 1, it takes into account the coupling effect of grid-level perturbations on the logit change, which is characterized relative to the fully perturbed input $x'$ rather than relative to the natural input $x_0$ as in [22].

As an application of $s_i^-$ and $s_i^+$, we can define a promotion-suppression ratio (PSR)

$\mathrm{PSR}_i = \log\!\left( s_i^+ / s_i^- \right)$,  (4)

which describes the mechanism of $\boldsymbol{\delta}^{(i)}$ in misclassification. In (4), the logarithm is taken for ease of studying PSR under different regimes, e.g., $\mathrm{PSR}_i > 0$ implies that $s_i^+ > s_i^-$. Here we categorize the effect of $\boldsymbol{\delta}^{(i)}$ into three types relative to a threshold $\tau > 0$. If $\mathrm{PSR}_i \le -\tau$, we call $\boldsymbol{\delta}^{(i)}$ a suppression-dominated perturbation, which is mainly used to reduce the classification logit of the true label. If $\mathrm{PSR}_i \ge \tau$, we call $\boldsymbol{\delta}^{(i)}$ a promotion-dominated perturbation, which is mainly used to boost the classification logit of the target label. If $-\tau < \mathrm{PSR}_i < \tau$, we call $\boldsymbol{\delta}^{(i)}$ a balance-dominated perturbation that plays a dual role in suppression and promotion. Although different threshold values $\tau$ can be used, we fix a single value for ease of analysis and visualization.
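To make the grid-level sensitivity measure concrete, the sketch below computes $s_i^-$, $s_i^+$, $s_i$, and $\mathrm{PSR}_i$ for every grid, assuming access to a `logits(x)` function (e.g., a CNN forward pass on a flattened image) and a list of pixel-index sets defining the grids. The function names, default `eps`, and the threshold `tau` are illustrative placeholders rather than the paper's exact implementation.

```python
import numpy as np

def grid_sensitivity(logits, x0, delta, grids, t0, t, eps=1e-6):
    """Compute suppression/promotion sensitivities and PSR per grid.

    logits: callable mapping a flattened image to a vector of class logits.
    x0:     flattened natural image; delta: flattened perturbation.
    grids:  list of index arrays, one per grid (local image region).
    t0, t:  true and target labels. eps: small positive floor (Definition 1).
    """
    x_adv = x0 + delta
    z_adv = logits(x_adv)
    s_minus, s_plus = [], []
    for idx in grids:
        x_wo = x_adv.copy()
        x_wo[idx] = x0[idx]                                  # remove the perturbation on this grid only
        z_wo = logits(x_wo)
        s_minus.append(max(z_wo[t0] - z_adv[t0], eps))       # Eq. (1): true-label logit recovers
        s_plus.append(max(z_adv[t] - z_wo[t], eps))          # Eq. (2): target-label logit boost
    s_minus, s_plus = np.array(s_minus), np.array(s_plus)
    s = np.sqrt(s_minus ** 2 + s_plus ** 2)                  # Eq. (3): overall significance
    psr = np.log(s_plus / s_minus)                           # Eq. (4): promotion-suppression ratio
    return s_minus, s_plus, s, psr

def categorize(psr, tau=1.0):
    """Label each grid: -1 suppression-, 0 balance-, +1 promotion-dominated (tau is illustrative)."""
    return np.where(psr <= -tau, -1, np.where(psr >= tau, 1, 0))
```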

Figure 2: Sensitivity measure of pixel-level perturbations (generated by the C&W attack) on the ‘badger’-to-‘computer’ example. Here the true label is ‘badger’ and the target label is ‘computer’. The first row shows the adversarial example and the heat map of the norm distortion at each grid, i.e., $\|\boldsymbol{\delta}^{(i)}\|$. The second row presents the grid-level distortion versus the grid index. The third row demonstrates the sensitivity measures $s_i^-$ and $s_i^+$. The fourth row shows the sorted promotion-suppression ratio (PSR), where the dashed lines correspond to $\mathrm{PSR} = \tau$ and $\mathrm{PSR} = -\tau$.

In Figure 2, we show an adversarial example together with its adversarial perturbations generated by the C&W attack [9]. Through this example, we demonstrate more insights into the sensitivity measures (1)-(4). As we can see, both $s_i^-$ and $s_i^+$ have a strong correlation with the strength of the adversarial perturbation at each grid (in terms of the norms of $\boldsymbol{\delta}^{(i)}$). As suggested by PSR, in this example most of the grid-level perturbations contribute to promoting the output classification towards the target class $t$.

For a more comprehensive quantitative analysis, we consider adversarial examples crafted by attacking images randomly selected from ImageNet. Table 1 shows the correlation between the proposed sensitivity scores and the strength of adversarial perturbations. We observe that, except for IFGSM, the studied attacks maintain a good correlation between perturbation strength and adversarial sensitivity. This suggests that IFGSM is far from an optimal attack, which should perturb the pixels most sensitive to the logit change. We also observe that the Str-attack has the highest correlation, which verifies its efficacy in exploiting the group structure of images [12]. In the next section, we visually explain the universal rule behind these attacks by leveraging class activation mapping.

attack   model        corr. with $s_i^-$   corr. with $s_i^+$   corr. with $s_i$
IFGSM    Resnet       0.303                0.480                0.522
         Inception    0.170                0.220                0.248
C&W      Resnet       0.507                0.538                0.609
         Inception    0.545                0.517                0.620
EAD      Resnet       0.602                0.630                0.710
         Inception    0.639                0.655                0.697
Str      Resnet       0.625                0.720                0.783
         Inception    0.643                0.614                0.702

  • The first numeric column reports the correlation between $s_i^-$ and the grid-level distortion $\|\boldsymbol{\delta}^{(i)}\|$; the same rule holds for the last two columns (with $s_i^+$ and $s_i$, respectively).

Table 1: Correlation between the sensitivity measures ($s_i^-$, $s_i^+$, or $s_i$) and the norm distortion of the grid-level perturbation $\|\boldsymbol{\delta}^{(i)}\|$.

4 Interpreting Adversarial Perturbations via Class Activation Map (CAM)

CAM [1] and other similar techniques such as GradCAM [10] and GradCAM++ [31] can build a localizable deep representation, which exposes the implicit attention of CNNs on a labelled image [1]. It is worth mentioning that both CAM and GradCAM generate consistent class-specific discriminative image regions in our work since the considered network architectures ResNet_v2_101 and Inception_v3 perform global average pooling over convolutional maps prior to prediction [10, Appendix A]. Although GradCAM++ handles images with multiple object instances better than CAM, our experiments show that its contribution to interpreting adversarial attacks is quite similar to that of using CAM. Thus, our analysis will be restricted to CAM but can readily be extended to GradCAM++.

Let $\mathrm{CAM}(x, c)$ denote the CAM of image $x$ with respect to class label $c$. The strength of a spatial element in $\mathrm{CAM}(x, c)$ characterizes the importance of the activation at that spatial location in contributing to the predicted label $c$. From the perspective of adversarial examples, one may wonder about the relationship between adversarial examples and the discriminative regions localized by CAM. Given the natural and adversarial examples as well as the correct and target labels, all CAMs of interest are given by $\mathrm{CAM}(x_0, t_0)$, $\mathrm{CAM}(x', t_0)$, $\mathrm{CAM}(x_0, t)$, and $\mathrm{CAM}(x', t)$; see Figure 3 for an example. More results can be found in Figure A1. Compared with $\mathrm{CAM}(x_0, t_0)$, the most discriminative region w.r.t. $t_0$ is suppressed in $\mathrm{CAM}(x', t_0)$ as the adversarial perturbations are added to $x_0$. By contrast, the difference between $\mathrm{CAM}(x_0, t)$ and $\mathrm{CAM}(x', t)$ implies that the discriminative region w.r.t. $t$ is enhanced after injecting $\boldsymbol{\delta}$.
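As a reminder of how CAM itself is formed for architectures that end with global average pooling, the sketch below follows the standard formulation of [1]: the class activation map is a weighted sum of the final convolutional feature maps, with weights taken from the classifier's fully connected layer. The array shapes, variable names, and max-normalization (used only for visualization, as in Figure 3) are illustrative assumptions, not the paper's code.

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, c):
    """CAM for class c, following the GAP-based formulation of Zhou et al. [1].

    feature_maps: array of shape (K, H, W), last conv-layer activations for one image.
    fc_weights:   array of shape (num_classes, K), weights of the final FC layer
                  applied after global average pooling.
    c:            class index (e.g., the true label t0 or the target label t).
    Returns an (H, W) map; larger values mark more discriminative regions.
    """
    cam = np.tensordot(fc_weights[c], feature_maps, axes=1)  # sum_k w_{c,k} * A_k
    if cam.max() > 0:
        cam = cam / cam.max()                                # normalize for comparability
    return cam
```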

($t_0$: catamaran, $t$: container ship)
Figure 3: Visualizing CAMs of natural image and its adversarial example (generated by C&W attack) w.r.t. the true label ‘catamaran’ and the target label ‘container ship’, respectively. The heat map color from blue to red represents the least and the most discriminative region localized by CAM, respectively. Here the values of CAMs are normalized w.r.t. the same baseline (their maximum value) so that they are comparable.

What can CAM offer to interpret adversarial examples? As suggested by Figure 3, the effect of adversarial perturbations can be visually explained through the class-specific discriminative image regions localized by CAM. The two most informative CAMs are $\mathrm{CAM}(x_0, t_0)$ and $\mathrm{CAM}(x', t)$, since the other two, $\mathrm{CAM}(x', t_0)$ and $\mathrm{CAM}(x_0, t)$, usually exhibit small activation scores, namely, non-dominant discriminative image regions compared to those w.r.t. $(x_0, t_0)$ and $(x', t)$. We highlight that $\mathrm{CAM}(x_0, t_0)$ characterizes the discriminative regions that adversarial perturbations would suppress, while $\mathrm{CAM}(x', t)$ reveals the image regions in which the adversary acts to enhance the likelihood of the target class. As will be evident later, the aforementioned suppression/promotion analysis is consistent with the effects of pixel-level perturbations characterized by PSR; however, CAM offers a visual explanation at the image level.

Spurred by [12], we quantify the interpretability of adversarial perturbations through the CAM-based interpretability score (IS). Given a vector representation of the CAM $\mathrm{CAM}(x, c)$, let $B(x, c)$ denote the Boolean map that encodes the most discriminative region localized by CAM,

$[B(x, c)]_j = 1$ if $[\mathrm{CAM}(x, c)]_j \ge \nu$, and $[B(x, c)]_j = 0$ otherwise,  (5)

where $\nu$ is a given threshold to highlight the most class-specific discriminative region, and $[\cdot]_j$ is the $j$th element of a vector. The IS of adversarial perturbations $\boldsymbol{\delta}$ w.r.t. $(x, c)$ is defined by

$\mathrm{IS}(\boldsymbol{\delta}, x, c) = \dfrac{\| B(x, c) \odot \boldsymbol{\delta} \|_2}{\| \boldsymbol{\delta} \|_2}$,  (6)

where $\odot$ is the element-wise product. For the threshold $\nu$ in (5), we set it as a top quantile of the CAM values. The threshold is chosen appropriately (neither too large nor too small) to highlight the class-specific image regions, and our results are insensitive to moderate shifts of this threshold.
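A minimal sketch of the interpretability score in (5)-(6) is given below. It assumes the CAM has already been upsampled to the input resolution and flattened to match the perturbation vector, and the quantile level `q` is an illustrative placeholder rather than the paper's chosen threshold.

```python
import numpy as np

def interpretability_score(cam, delta, q=0.8):
    """CAM-based IS of Eq. (6): fraction of perturbation energy falling inside
    the most discriminative region encoded by the Boolean map of Eq. (5).

    cam:   flattened CAM values at input resolution (same length as delta).
    delta: flattened adversarial perturbation.
    q:     quantile level defining the CAM threshold nu (placeholder value).
    """
    nu = np.quantile(cam, q)                     # threshold for the Boolean map B
    B = (cam >= nu).astype(delta.dtype)          # Eq. (5)
    return np.linalg.norm(B * delta) / np.linalg.norm(delta)   # Eq. (6)
```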

In (6), $\mathrm{IS} \approx 1$ if the discriminative region perfectly predicts the locations of adversarial perturbations. By contrast, if $\mathrm{IS} \approx 0$, then adversarial perturbations cannot be interpreted by CAM. We highlight that although IS was introduced in [12], only a preliminary comparison was made there between the C&W and Str-attacks. Beyond [12], Table 2 shows IS for the studied adversarial attacks on two network models. As we can see, the Str-attack yields the best interpretability in terms of the CAMs $\mathrm{CAM}(x_0, t_0)$ and $\mathrm{CAM}(x', t)$. By contrast, the interpretability of IFGSM is the worst. This is not surprising, since the Str-attack is able to extract important local structures of images by penalizing the group sparsity of adversarial perturbations [12]. In Figure A2, we provide a concrete ‘plow’-to-‘bathtub’ adversarial example to demonstrate the mechanism of the adversaries via CAM under different attack methods. We note that there exist common and essential perturbed pixels, which contribute significantly to suppressing the discriminative regions of the true label. Meanwhile, different attacks focus on different effective regions to boost the belief of the target label.

attack   model    IS w.r.t. $(x_0, t_0)$   IS w.r.t. $(x', t)$
IFGSM    Resnet   0.594                    0.604
         Incep.   0.634                    0.655
C&W      Resnet   0.610                    0.823
         Incep.   0.654                    0.784
EAD      Resnet   0.625                    0.880
         Incep.   0.630                    0.865
Str      Resnet   0.626                    0.941
         Incep.   0.767                    0.983

Table 2: IS under different attacks & network models, averaged over natural/adversarial examples on ImageNet.
[Figure 4 panels (adversarial examples and PSRs): suppression-dominated effect ($t_0$: rhinoceros beetle, $t$: ambulance); promotion-dominated effect ($t_0$: howler monkey, $t$: paper towel); balance-dominated effect ($t_0$: catamaran, $t$: container ship).]
Figure 4: Interpreting adversarial perturbations via CAM and PSR. Image examples are from Figures 3 and A1. For PSR, only the top-ranked most significant perturbed grids (ranked by (3)) are shown. The white and black colors represent the suppression-dominated regions ($\mathrm{PSR}_i \le -\tau$) and the promotion-dominated regions ($\mathrm{PSR}_i \ge \tau$), respectively. The gray color corresponds to balance-dominated perturbations ($-\tau < \mathrm{PSR}_i < \tau$). The red box marks the dominant adversarial effect.

Towards a finer analysis, we combine PSR in (4) with CAM to explain the effect of grid-level perturbations at the image level. We recall that PSR categorizes $\boldsymbol{\delta}^{(i)}$ into three types: suppression-dominated perturbations for $\mathrm{PSR}_i \le -\tau$, promotion-dominated perturbations for $\mathrm{PSR}_i \ge \tau$, and balanced perturbations for $-\tau < \mathrm{PSR}_i < \tau$. In Figure 4, we revisit the examples of Figures 3 and A1, now overlaid with PSR values at the perturbed grids. As we can see, the three studied examples demonstrate the suppression-, promotion-, and balance-dominated roles of the perturbations, respectively. The locations of adversarial perturbations are well matched to the discriminative regions of $\mathrm{CAM}(x_0, t_0)$ and/or $\mathrm{CAM}(x', t)$. In particular, if there exists a large overlap between the discriminative regions of the true and target labels, then balanced perturbations are desirable, since perturbing a single pixel can then play a dual role in suppression and promotion; see the ‘catamaran’-to-‘container ship’ example in Figure 4.

In order to gain more insight into the root cause of adversarial perturbations, we investigate how the adversary behaves when attacking a single image with multiple target labels (Figure 5) as well as when attacking multiple images with the same source and target labels (Figure A3). As we can see, the promotion-suppression effect follows a similar pattern across multiple adversarial examples. First, when attacking a single image with multiple target labels, the suppression-dominated perturbations remain consistent across the adversarial examples in Figure 5, whereas the promotion-dominated perturbations adapt to the change of the target label. Second, the same source-target label pair enforces a similar effect of adversarial perturbations when attacking different images. For example, only the face of the ‘eagle’ is perturbed among the three adversarial examples with the target label ‘hen’ in Figure A3. More adversarial examples of images with complex backgrounds are provided in Figures A4-A5.


Figure 5: Interpreting adversarial examples of the original image ‘Japanese spaniel’ in Figure 1 w.r.t. different target labels ‘acoustic guitar’ and ‘desktop computer’ using CAM and PSR.

We have previously shown that CAM can be used to localize class-specific discriminative image regions. In what follows, we analyze how the resulting adversarial patterns constrain the effectiveness of adversarial attacks. To reveal the significance of the perturbation patterns provided by an adversarial attack, we perform two types of operations to refine these perturbations: (a) removing less significant perturbations as quantified by $s_i$ in (3), and (b) enforcing perturbations to lie in the most discriminative region with respect to the true label, namely, $B(x_0, t_0)$ in (5). We represent the refinement operations (a) and (b) through the constraint sets of pixels $\mathcal{S}_a = \{ j : j \in \mathcal{G}_i, \, s_i \ge \eta \}$ for a positive threshold $\eta$ and $\mathcal{S}_b = \{ j : [B(x_0, t_0)]_j = 1 \}$, where $\eta$ is set to filter out perturbations that contribute only a small fraction of the cumulative strength. The refined adversarial examples can then be generated by performing the existing attack methods with an additional projection onto the sparse constraints given by $\mathcal{S}_a$ and $\mathcal{S}_b$. We refer readers to Appendix 3 for more details.

As shown in Table A1 and Figure A6, it is possible to obtain a more effective attack by perturbing fewer but ‘right’ pixels (i.e., pixels with better correspondence to the discriminative image regions) under $\mathcal{S}_a$. For attacks refined under $\mathcal{S}_b$, Figure 6 shows that perturbing pixels under only a suppression-dominated adversarial pattern is not optimal. As we can see, the original perturbation contributes to boosting the likelihood of the target label at locations outside the most discriminative region w.r.t. the true label (i.e., outside $\mathcal{S}_b$). If we restrict the perturbations to $\mathcal{S}_b$, then the refined attack requires a much larger distortion. Thus, the seemingly random perturbations are actually effective: they serve to promote the confidence of the target label.

   
[Figure 6 layout. First row: original image; CAM + PSRs w.r.t. the adversarial image and target label; perturbations (Str-attack). Second row: CAM w.r.t. the original image and true label; CAM + PSRs w.r.t. the refined attack and true label; perturbations (refined Str-attack).]

Figure 6: The ‘flatworm’-to-‘knot’ adversarial example (generated by the Str-attack) with and without refinement under $\mathcal{S}_b$. The first row presents the original image, the PSRs overlaid on the CAM of the adversarial example w.r.t. the target label ‘knot’, and the norm distortion of the adversarial perturbations. The second row presents $B(x_0, t_0)$ given by the CAM of the original image w.r.t. the true label ‘flatworm’, and the refined attack under $\mathcal{S}_b$. Note that this refinement leads to a much larger distortion than the unrefined attack; compare the maximum values in the third column.

5 Seeing Effects of Adversarial Perturbations from Network Dissection

We examine the promotion and suppression effect of adversarial perturbations on the internal response of CNNs by leveraging network dissection [2]. We show that there exists a connection between the sensitivity of units (a unit refers to a channel-wise feature map) to attacks and their concept-level interpretability.

We begin by reviewing the main idea of network dissection; see [2] for more details. Interpretability measured by network dissection refers to the alignment between individual hidden units and a set of semantic concepts provided by the broadly and densely labeled dataset Broden. Different from other datasets, examples in Broden contain pixel-level concept annotations, ranging from low-level concepts such as color and texture to higher-level concepts such as material, part, object, and scene. Network dissection builds a correspondence between a hidden unit's activation and its interpretability on semantic concepts. More formally, the interpretability of unit $u$ (IoU) w.r.t. the concept $c$ is defined by [2]

$\mathrm{IoU}_{u,c} = \dfrac{\sum_{x \in \mathcal{D}} | M_u(x) \cap L_c(x) |}{\sum_{x \in \mathcal{D}} | M_u(x) \cup L_c(x) |}$,  (7)

where $\mathcal{D}$ denotes Broden, and $|\cdot|$ is the cardinality of a set. In (7), $M_u(x)$ is a binary segmentation of the activation map of unit $u$, which gives the representative region of unit $u$ at input $x$. Here the activation map is scaled up to the input resolution using bilinear interpolation, denoted by $S_u(x)$, and then truncated using a top-quantile (dataset-level) threshold $T_u$. That is, $M_u(x) = \mathbb{1}[ S_u(x) \ge T_u ]$. In (7), $L_c(x)$ is the input-resolution annotation mask, provided by Broden, for the concept $c$ w.r.t. the input $x$. Since one unit might be able to detect multiple concepts, the interpretability of a unit is summarized as $\mathrm{IoU}_u = \max_c \mathrm{IoU}_{u,c}$, where $c$ ranges over the set of concept labels.

We next investigate the effect of adversarial perturbations on the internal response of CNNs by leveraging network dissection. We produce adversarial examples from Broden using the projected gradient descent (PGD) untargeted attack [13]. Given the adversarial examples corresponding to the natural inputs, we characterize the sensitivity of unit $u$ (to adversarial perturbations) via the change of its activation segmentation,

$\nu_u = \mathbb{E}_{(x_0, x')} \big[\, \| M_u(x') - M_u(x_0) \| \,\big]$,  (8)

where $(x_0, x')$ is a pair of natural and adversarial examples, and the expectation is taken over a distribution of interest, e.g., the entire dataset or data with a fixed source-target label pair. In (8), we adopt the activation segmentation rather than the raw activation map, since the former highlights the representative region of an activation map without inducing a layer-wise magnitude bias.
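A minimal sketch of the unit-sensitivity measure in (8) is given below, under the assumption that the segmentation masks for natural and adversarial inputs are available and that the change is summarized by the number of flipped segmentation entries (the specific norm is not pinned down in the text above).

```python
import numpy as np

def unit_sensitivity(nat_segs, adv_segs):
    """Sensitivity of a unit (Eq. (8)): expected change of its activation
    segmentation between natural and adversarial inputs.

    nat_segs, adv_segs: lists of 2D boolean arrays M_u(x0) and M_u(x'),
                        paired per image.
    """
    changes = [np.abs(a.astype(float) - n.astype(float)).sum()
               for n, a in zip(nat_segs, adv_segs)]
    return float(np.mean(changes))               # empirical expectation over image pairs
```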

Given the per-unit sensitivity measure $\nu_u$ and interpretability measure $\mathrm{IoU}_u$, we may ask whether or not the units sensitive to adversarial perturbations exhibit strong interpretability. To answer this question, we conduct tests of statistical significance (in terms of the $p$-value) by contrasting the IoU of the top $k$ ranked sensitive units with the IoU distribution of randomly selected units. Formally, the $p$-value is the probability of observing an interpretability at least as large as that of the top $k$ sensitive units (ranked by $\nu_u$) under the background IoU distribution obtained when units are randomly picked. The smaller the $p$-value, the more significant the connection between sensitivity and interpretability.
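One hedged way to carry out this significance test is a simple resampling procedure: compare the mean IoU of the top-$k$ sensitive units against the distribution of mean IoUs of randomly drawn subsets of the same size. The sketch below is an interpretation of the described test, not the authors' exact protocol; `k`, `n_resamples`, and `seed` are illustrative defaults.

```python
import numpy as np

def sensitivity_interpretability_pvalue(iou, nu, k=50, n_resamples=10000, seed=0):
    """Empirical p-value: probability that k randomly chosen units are at least
    as interpretable (in mean IoU) as the k most attack-sensitive units.

    iou: per-unit interpretability scores; nu: per-unit sensitivity (Eq. (8)).
    """
    rng = np.random.default_rng(seed)
    iou, nu = np.asarray(iou), np.asarray(nu)
    top_k = np.argsort(nu)[-k:]                      # indices of the most sensitive units
    observed = iou[top_k].mean()
    background = np.array([iou[rng.choice(len(iou), k, replace=False)].mean()
                           for _ in range(n_resamples)])
    return float(np.mean(background >= observed))
```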

We present the significance test of the interpretability of the top $k$ sensitive units against the layer index of ResNet_152 (Figure 7-a). We also show the number of concept detectors (a concept detector refers to a unit whose top-ranked concept has a sufficiently large IoU [2]) among the top $k$ sensitive units versus layers for every concept category (Figure 7-b). Here we denote by conv$i$_$j$ the last convolutional layer of the $j$th building block at the $i$th stage of ResNet_152 [28]. It is seen from Figure 7-a that there exists a strong connection between the sensitivity of units and their interpretability, since the $p$-value is small in most cases. For a fixed layer, this connection becomes more significant as $k$ increases. This suggests that even if the most interpretable units are not precisely the most sensitive units, they still maintain high sensitivity with a top ranking. For a fixed $k$, we observe that deep layers (conv4_36 and conv5_3) exhibit a stronger connection between sensitivity and interpretability than shallow layers (conv2_3 and conv3_8). That is because the change of activation induced by adversarial attacks at shallow layers can be subtle and is less detectable in terms of interpretability. Indeed, Figure 7-b shows that more high-level concept detectors (e.g., object and part) emerge at conv4_36 and conv5_3, while low-level concepts (e.g., color and texture) dominate at lower layers.


Figure 7: Sensitivity and interpretability. (a) $p$-value of the interpretability of the top $k$ sensitive units to adversarial attacks in ResNet_152, where the presented layers include conv2_3 (256 units), conv3_8 (512 units), conv4_36 (1024 units), and conv5_3 (2048 units). (b) Number of concept detectors among the top $k$ sensitive units per layer for each concept category.

[Figure 8 layout. Columns: conv2_3, conv3_8, conv4_36, conv5_3. Top example (Ori: table lamp, Adv: studio couch): unit193 (orange, color), unit358 (flecked, texture), unit457 (shade, part), unit1716 (lamp, object), unit123 (sofa, object). Bottom example (Ori: airliner, Adv: seashore): unit84 (blue, color), unit445 (banded, texture), unit2 (stern, part), unit781 (airplane, object), unit782 (beach, scene).]

Figure 8: Visualizing the impact of original (Ori) and adversarial (Adv) examples on the response of concept detectors identified by network dissection at representative layers of ResNet. (top) Attack ‘table lamp’-to-‘studio couch, day bed’; (bottom) attack ‘airliner’-to-‘seashore, coast, seacoast, sea-coast’. In both the top and bottom sub-figures, the first row presents unit indices together with their top-ranked concept labels and categories (in the format ‘concept label’-‘concept category’). The last two rows present the response of the concept detectors visualized by the segmented input image, where the segmentation is given by $M_u(x)$ corresponding to the top-ranked concept of each unit. For each unit, two images and their adversarial examples are presented.

To peer into the impact of adversarial perturbations on individual images, we examine how the representations of concept detectors change when facing adversarial examples crafted by attacking images from the same true class to the same target class. Here the representation of a concept detector is visualized by the segmented input image, where $M_u(x)$ determines the segmentation corresponding to the top-ranked concept. In Figure 8, we show two examples of attacks: ‘table lamp’-to-‘studio couch, day bed’ and ‘airliner’-to-‘seashore, coast, seacoast, sea-coast’. We first note that most low-level concepts (e.g., color and texture) are detected at shallow layers, consistent with Figure 7-b. In the attack ‘table lamp’-to-‘studio couch, day bed’, the color ‘orange’ detected at conv2_3 is less expressed in the adversarial image than in the natural image. This aligns with human perception, since ‘orange’ is related to ‘light’ and thus to ‘table lamp’. By contrast, in the attack ‘airliner’-to-‘seashore, coast, seacoast, sea-coast’, the color ‘blue’ is well detected in both the natural and adversarial images, since ‘blue’ is associated with both the ‘sky’ for ‘airliner’ and the ‘sea’ for ‘seashore’. We also note that high-level concepts (e.g., part and object) dominate at deeper layers. At conv5_3, the expression of the object concepts relevant to the true label (e.g., lamp and airplane) is suppressed, while the expression of the object concepts relevant to the target label (e.g., sofa and beach) is promoted. This precisely reflects the activation promotion and suppression effect induced by adversarial perturbations. In Figure A7, we connect the images in Figure 8 to the PSR- and CAM-based visual explanations.

Figure 9: Visualization of neurons' activations at selected layers of the natural and robust models for the natural image with true label ‘deer’ and its corresponding untargeted adversarial example, respectively.

Lastly, we examine the internal representations of robustly trained CNNs [13] against adversarial examples. Since robust adversarial training (via robust optimization) is not scalable to ImageNet, we focus on the CIFAR-10 dataset, in the absence of network dissection. Figure 9 shows the activation maps of the naturally and robustly trained ResNet provided by [13] for natural and adversarial examples. As we can see, robust training introduces a model-based correction so that the internal response tends to retain finer local features of the input image at the early layers; see more examples in Figure A9. Moreover, the internal response of the network exhibits a sharp transition towards misclassification only at deep layers; see more results in Figure A8. This is also explainable from concept-level interpretability: deeper layers involve detectors of higher-level concepts that play a crucial role in the final classification.

6 Conclusions

In this work, we made a significant effort to understand the mechanism of adversarial attacks and provided explanations at the pixel, image, and network levels. We showed that adversarial attacks play a significant role in activation promotion and suppression. The promotion and suppression effect is strongly associated with class-specific discriminative image regions. We also demonstrated that the interpretable adversarial pattern constrains the effectiveness of adversarial attacks. We further provided the first analysis of adversarial examples through network dissection, which builds a connection between units' sensitivity to imperceptible perturbations and their interpretability on semantic concepts. Future work will apply our analysis to designing effective defense methods, e.g., speeding up adversarial training under interpretability priors.

References

Appendix

1 Attack Generation

The IFGSM attack [3, 14] crafts adversarial examples by performing the iterative fast gradient sign method (IFGSM), followed by clipping onto an $\ell_\infty$ ball. IFGSM attacks are designed to be fast rather than optimal.
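For concreteness, a minimal PyTorch-style sketch of a targeted IFGSM variant is given below; the step size, number of iterations, valid pixel range, and loss choice are illustrative assumptions rather than the exact settings used in this paper.

```python
import torch
import torch.nn.functional as F

def ifgsm_targeted(model, x0, target, eps=8 / 255, alpha=2 / 255, steps=10):
    """Targeted IFGSM sketch: iterative signed-gradient steps that decrease the
    cross-entropy w.r.t. the target label, with l_inf-ball clipping around x0.

    model: CNN returning logits; x0: input batch in [0, 1]; target: target labels.
    eps, alpha, steps: illustrative hyperparameters.
    """
    x = x0.clone().detach()
    for _ in range(steps):
        x.requires_grad_(True)
        loss = F.cross_entropy(model(x), target)
        grad = torch.autograd.grad(loss, x)[0]
        with torch.no_grad():
            x = x - alpha * grad.sign()                        # move toward the target label
            x = torch.min(torch.max(x, x0 - eps), x0 + eps)    # l_inf-ball clipping
            x = torch.clamp(x, 0.0, 1.0)                       # keep a valid image
        x = x.detach()
    return x
```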

The C&W [27], EAD [15], and Str-attacks [12] can be unified in the following optimization framework,

$\underset{\boldsymbol{\delta}}{\text{minimize}} \;\; f(x_0 + \boldsymbol{\delta}, t) + \gamma \, g(\boldsymbol{\delta}) \quad \text{subject to} \;\; \boldsymbol{\delta} \in \mathcal{X}$,  (9)

where $f$ denotes a loss function for targeted misclassification, $g$ is a regularization function that penalizes the norm of adversarial perturbations, $\gamma > 0$ is a regularization parameter, and $\mathcal{X}$ optionally places hard constraints on $\boldsymbol{\delta}$. The C&W, EAD, and Str-attacks all share a similar loss function

$f(x_0 + \boldsymbol{\delta}, t) = \max\big\{ \max_{j \neq t} Z_j(x_0 + \boldsymbol{\delta}) - Z_t(x_0 + \boldsymbol{\delta}), \, -\kappa \big\}$,  (10)

where $Z_j(\cdot)$ is the $j$th element of the logits, namely, the output before the last softmax layer of the CNN, and $\kappa \ge 0$ is a confidence parameter. Clearly, as $\kappa$ increases, the minimization of $f$ drives the prediction to the target label with higher confidence. In this paper, we keep the default setting of $\kappa$. It is worth mentioning that problem (9) can be efficiently solved via the alternating direction method of multipliers (ADMM) [32, 12], regardless of whether or not $g$ is differentiable.
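A small sketch of the attack loss in (10) and the regularized objective in (9) is given below, assuming a `logits` function and an $\ell_2$-style regularizer as used by the C&W attack; the variable names and default values are illustrative.

```python
import numpy as np

def cw_style_loss(z, target, kappa=0.0):
    """Targeted attack loss of Eq. (10) on a single logit vector z."""
    z = np.asarray(z, dtype=float)
    z_other = np.delete(z, target).max()          # max_{j != t} Z_j
    return max(z_other - z[target], -kappa)

def attack_objective(logits, x0, delta, target, gamma=1.0, kappa=0.0):
    """Objective of Eq. (9) with the squared l2 regularizer g(delta) = ||delta||_2^2
    commonly used in practice by the C&W attack."""
    f = cw_style_loss(logits(x0 + delta), target, kappa)
    g = float(np.sum(delta ** 2))
    return f + gamma * g
```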

The C&W attack [27] adopts the $\ell_2$ norm to penalize the strength of adversarial perturbations, namely, $g(\boldsymbol{\delta}) = \|\boldsymbol{\delta}\|_2$ in (9). In practice, the squared $\ell_2$ norm is commonly used.

The EAD attack [15] specifies the regularization term $g(\boldsymbol{\delta})$ in (9) as an elastic-net regularizer, i.e., a combination of the $\ell_1$ and $\ell_2$ penalties on $\boldsymbol{\delta}$. It has been empirically shown that the use of the elastic-net regularizer improves the transferability of adversarial examples.

The Str-attack [12] takes into account the group-level sparsity of adversarial perturbations by choosing $g$ as the group Lasso penalty [33]. In the meantime, it constrains the pixel-level perturbation through $\mathcal{X} = \{ \boldsymbol{\delta} : \|\boldsymbol{\delta}\|_\infty \le \tau_0 \}$ for a tolerance $\tau_0$.

2 Adversarial Examples Meet CAM

In Figure A1, we demonstrate more examples of class-specific discriminative regions visualized by CAM, namely, the image together with $\mathrm{CAM}(x_0, t_0)$, $\mathrm{CAM}(x', t_0)$, $\mathrm{CAM}(x_0, t)$, and $\mathrm{CAM}(x', t)$. In Figure A2, we fix the original image together with its true and target labels to visualize the differences among attack methods through CAM. In Figure A3, we present adversarial attacks on multiple images with a fixed source-target label pair. As we can see, the balance-dominated perturbation pattern appears at the discriminative region of the ‘eagle’. In Figure A4, we present the ‘hamster’-to-‘cup’ example, where objects of the original label and the target label exist simultaneously. We observe that the adversary suppresses the discriminative region of the original label and promotes the discriminative region of the target label. Compared to the C&W attack, the Str-attack is more effective in both suppression and promotion, since it perturbs only a few grids. In Figure A5, the images involve more heterogeneous and complex backgrounds. As we can see, an effective adversarial attack (e.g., the Str-attack) perturbs fewer but more meaningful pixels, which have a better correspondence with the discriminative image regions of the original and target classes. In Figure A6, we present a ‘hippopotamus’-to-‘streetcar’ example with refined attacks under $\mathcal{S}_a$. As we can see, it is possible to obtain a more effective attack by perturbing fewer but ‘right’ pixels (i.e., pixels with better correspondence to the discriminative image regions).

3 Effectiveness of Refined Adversarial Pattern

We consider the following unified optimization problem to refine adversarial attacks,

$\underset{\boldsymbol{\delta}}{\text{minimize}} \;\; f(x_0 + \boldsymbol{\delta}, t) + \gamma \, g(\boldsymbol{\delta}) \quad \text{subject to} \;\; \boldsymbol{\delta} \in \mathcal{X}, \;\; [\boldsymbol{\delta}]_j = 0 \;\; \forall j \notin \mathcal{S}$,  (11)

where we represent the refinement operations (a) and (b) through the constraint sets $\mathcal{S} = \mathcal{S}_a = \{ j : j \in \mathcal{G}_i, \, s_i \ge \eta \}$ for a positive threshold $\eta$ (we sort the $s_i$ in ascending order and set $\eta$ to the smallest value at which the retained grids account for the prescribed fraction of the cumulative perturbation strength, thereby filtering out the less significant perturbations) and $\mathcal{S} = \mathcal{S}_b = \{ j : [B(x_0, t_0)]_j = 1 \}$. In $\mathcal{S}_a$, $s_i$ defined by (3) characterizes the strength of the adversarial pattern. In $\mathcal{S}_b$, $B(x_0, t_0)$ defined by (5) localizes the pixels corresponding to the most discriminative region associated with the true label. Problem (11) can be solved in the same way as (9), with an additional projection onto the sparse constraints given by $\mathcal{S}_a$ and $\mathcal{S}_b$.
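The sketch below builds the two constraint sets and the corresponding projection used to refine a perturbation; the cumulative-strength fraction `keep_fraction` is an illustrative placeholder, and the grid bookkeeping is simplified to index arrays over a flattened image.

```python
import numpy as np

def refine_mask_by_strength(grids, s, n_pixels, keep_fraction=0.9):
    """S_a: keep the pixels of grids whose sensitivity s_i (Eq. (3)) survives the
    cumulative-strength filter; keep_fraction is a placeholder value."""
    order = np.argsort(s)                                  # ascending, weakest grids first
    csum = np.cumsum(s[order]) / s.sum()
    dropped = set(order[csum <= (1.0 - keep_fraction)])    # weakest tail to remove
    keep = np.zeros(n_pixels, dtype=bool)
    for i, idx in enumerate(grids):
        if i not in dropped:
            keep[idx] = True
    return keep

def refine_mask_by_cam(boolean_map):
    """S_b: restrict perturbations to the most discriminative region B(x0, t0) of Eq. (5)."""
    return boolean_map.astype(bool)

def project(delta, mask):
    """Projection onto the sparse constraint of Eq. (11): zero out perturbations outside the set."""
    return delta * mask
```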

We present the effectiveness of attacks with refinement under $\mathcal{S}_a$ in Table A1. Here the effectiveness of an attack is characterized by its attack success rate (ASR) as well as its $\ell_p$-norm distortions. We find that many pixel-level adversarial perturbations are redundant: as shown by the reduction in the $\ell_0$ norm (the number of nonzero elements) of $\boldsymbol{\delta}$, they can be removed without losing effectiveness in terms of the attack success rate and the $\ell_p$-norm distortions.

4 Internal Response of CNNs against Adversarial Examples

In Figure A7, we connect the images in Figure 8 to the PSR- and CAM-based visual explanations. For example, the suppressed image region identified by PSR (white color) corresponds to the interpretable activation of the object concept ‘airplane’ in Figure 8, while the promoted image region identified by PSR (black color) corresponds to the interpretable activation of the scene concept ‘beach’. In Figure A8, we present the activation bias, defined as the Euclidean distance between neurons' activations w.r.t. the original and adversarial inputs, under the two natural models Resnet_101 and Inception_v3. We note that the behavior of the network exhibits a sharp transition towards misclassification only at deep layers. In Figure A9, we visualize the internal representations of the natural and robust ResNet models modified by [13] against original and adversarial inputs, respectively.

attack   model    $\ell_0$ orig.   $\ell_0$ refine   $\ell_1$ orig.   $\ell_1$ refine   $\ell_2$ orig.   $\ell_2$ refine   $\ell_\infty$ orig.   $\ell_\infty$ refine   ASR refine
IFGSM    Resnet   266031           61055             1122.56          176.08            2.625            1.87              0.017                 0.035                  96.7%
         Incep.   266026           59881             812.94           155.89            1.926            1.22              0.019                 0.033                  100%
C&W      Resnet   268117           21103             183.65           134.26            0.697            0.727             0.028                 0.029                  100%
         Incep.   268123           22495             144.94           96.75             0.650            0.673             0.028                 0.034                  100%
EAD      Resnet   66584            20147             42.57            63.28             1.520            1.233             0.234                 0.096                  100%
         Incep.   69677            18855             30.17            45.88             1.289            1.107             0.229                 0.083                  100%
Str      Resnet   30823            18744             119.76           110.54            1.250            1.132             0.105                 0.087                  100%
         Incep.   27873            15967             86.55            82.33             1.174            0.985             0.103                 0.072                  100%

Table A1: Attack performance of adversarial perturbations with and without refinement under $\mathcal{S}_a$, averaged over the tested ImageNet images.
Figure A1: CAMs of two natural/adversarial examples (in rows), generated by the C&W attack, where the image and the four CAMs of interest ($\mathrm{CAM}(x_0, t_0)$, $\mathrm{CAM}(x', t_0)$, $\mathrm{CAM}(x_0, t)$, $\mathrm{CAM}(x', t)$) are shown from left to right in each row.
Figure A2: Four adversarial examples with CAM visualization, generated by the C&W, EAD, Str, and IFGSM attacks, respectively. Left to right: the adversarial example, its CAMs w.r.t. the true and target labels, and PSR overlaid on the CAM at the locations of the top-ranked most significant perturbed grids (ranked by $s_i$). Here the CAMs in each row are normalized with respect to their maximum value.

Figure A3: Multiple images with a fixed source-target label pair: CAM with respect to the source label ‘bald eagle’ (first column), the original image ‘bald eagle’ (second column), and CAM with respect to the target label ‘hen’ together with the C&W perturbation pattern (third column), measured by the promotion-suppression ratio (PSR), i.e., suppression- (white), promotion- (black), and balance-dominated adversaries (gray).

[Figure A4 layout. Columns: original, C&W attack, Str-attack. Rows: PSR over perturbed grids; CAM w.r.t. $t_0$; CAM w.r.t. $t$.]

Figure A4: Visual explanation of the ‘hamster’-to-‘cup’ example crafted by the C&W and Str-attacks, where the true label is ‘hamster’ and the target label is ‘cup’. The first row shows the natural image and the PSRs over the perturbed grids. The second (third) row shows the CAM with respect to the column-wise natural/adversarial example and the row-wise label.

Figure A5: Attacking images with complex backgrounds under the C&W, EAD, Str-, and IFGSM attacks.
[Figure A6 layout. Columns: original, C&W, EAD, Str, IFGSM. Rows: without refinement; with refinement.]

Figure A6: The ‘hippopotamus’-to-‘streetcar’ adversarial example with and without refinement under . Here the left-bottom subplot shows CAM of the original image w.r.t. the true label ‘hippopotamus’, and the right subplots present PSRs of unrefined and refined grid-level perturbations overlaid on CAMs of adversarial examples w.r.t. the target label ‘streetcar’.
[Figure A7 panels (adversarial examples and PSRs): $t_0$: table lamp, $t$: studio couch; and $t_0$: airliner, $t$: seashore.]
Figure A7: Interpreting adversarial perturbations via CAM and PSR. Image examples are from Figure 8. For PSR, only the top-ranked most significant perturbed grids (ranked by (3)) are shown. The white and black colors represent the suppression-dominated regions ($\mathrm{PSR}_i \le -\tau$) and the promotion-dominated regions ($\mathrm{PSR}_i \ge \tau$), respectively. The gray color corresponds to balance-dominated perturbations ($-\tau < \mathrm{PSR}_i < \tau$).

Figure A8: The activation bias at each layer between original and adversarial inputs for the two models ResNet_101 and Inception_v3. The activation bias is reported for four attack methods (IFGSM, C&W, EAD, and Str-attack) over images from the ImageNet dataset.


Figure A9: Visualization of neurons' activations at the early layers w.r.t. the original and adversarial images under the natural and robust models. Panels a)-d) present four examples of the responses of different models to different inputs. In each example, the first two rows correspond to the response of the natural model w.r.t. the original and adversarial images, and the last two rows correspond to the response of the robust model w.r.t. the original and adversarial images.