Interpreting Adversarial Examples with Attributes

by Sadaf Gulshad, et al.

The vulnerability of deep computer vision systems to imperceptible, carefully crafted noise has raised questions regarding the robustness of their decisions. We take a step back and approach this problem from an orthogonal direction. We propose to enable black-box neural networks to justify their reasoning both for clean and for adversarial examples by leveraging attributes, i.e. visually discriminative properties of objects. We rank attributes based on their class relevance, i.e. how the classification decision changes when the input is visually slightly perturbed, as well as image relevance, i.e. how well the attributes can be localized on both clean and perturbed images. We present comprehensive experiments for attribute prediction, adversarial example generation, adversarially robust learning, and their qualitative and quantitative analysis using predicted attributes on three benchmark datasets.



1 Introduction

Deep neural networks, despite their good performance in classification [21, 33, 17, 37], can be easily fooled by adversarial examples, i.e. inputs with added noise that is imperceptible to humans [41, 7, 5, 39]. Understanding why this happens is of major interest. Previous research has provided insights about deep learning frameworks [39], the geometry of their class boundaries [41, 30, 15] and the geometry of the data manifold [14], which have led to a number of detection and defense methods [26, 27, 44]. However, none of them has yet fully succeeded in rectifying, detecting, or defending against adversarial examples. In this work, we take a step back and propose to understand why a deep network is fooled by adversarial examples.

Figure 1: Our interpretable attribute prediction-grounding framework provides visual evidence for a clean image that gets embedded close to the correct blue class (painted bunting) because of “red belly” and “blue head” attributes, and for an adversarial image that gets embedded close to the incorrect red class (herring gull) as the network thought there were “white belly” and “white head” attributes.

Interpreting deep neural network decisions helps in understanding their internal functioning and could be used for detecting and creating defenses against adversarial attacks [43, 10]. This indirectly provides a way to revisit the decision maker in its failure mode [45]. Previously, instance-level visual interpretations, obtained e.g. by adding perturbations to the input or by taking the gradient of the output with respect to the input [35, 12], have been used to introspect deep neural networks. However, [49, 1] showed that these interpretations do not faithfully reflect changes in the input generation process or in the model (see a visual example in Figure 2).

In this paper, we propose an alternative visual interpretation technique using visually discriminative properties of the objects, i.e. attributes, that are predicted and grounded on clean and adversarial examples. To predict attributes, we learn a mapping from image feature space into class attribute space. Thanks to the ranking-based learning, we observe that clean images get mapped close to the correct class while adversarial images get mapped closer to a wrong class embedding. For instance, as shown in Figure 1, “blue head” and “red belly”, associated with the class “painted bunting”, are predicted correctly for the clean image. On the other hand, because the attributes are incorrectly predicted as “white belly” and “white head”, the adversarial image gets misclassified as “herring gull”. Note that we consider adversarial examples that are generated to fool only the classifier and not the interpretation mechanism. To ground attributes, we adapt a state-of-the-art deep object and object-part detector, i.e. Faster-RCNN, to detect bounding boxes around the visual evidence of our predicted attributes. Finally, our analysis involves studying adversarially robust models, i.e. using adversarial training as a defense technique against adversarial attacks.

Our main contributions are as follows: (1) We propose to understand the neural network decisions for adversarial examples by learning to predict visually discriminative class-specific attributes. (2) We visualize the predicted attributes by grounding them on their respective images, i.e. drawing bounding boxes around their visual evidence on the image. (3) We interpret adversarial examples of standard and adversarially robust frameworks on three benchmark attribute datasets with varying size and granularity.

2 Related Work

In this section, we discuss related works on adversarial examples and interpretability research prior to ours.

Adversarial Examples. Small carefully crafted perturbations, i.e. adversarial perturbations, added to the inputs of deep neural networks, i.e. adversarial examples, can easily fool the classifiers trained using deep learning [41]. Such attacks include the iterative fast gradient sign method [22], Jacobian-based saliency map attacks [31], one pixel attacks [40], Carlini and Wagner attacks [6] and universal attacks [29], designed not only for classification but also for object detection [47], segmentation [11], auto encoders [42], generative models [20], and reinforcement learning [24]. Most of these perturbations are transferable between different networks and do not require access to the network’s architecture or parameters, i.e. black box attacks.

Concurrently, many attempts have been made to understand, detect and defend against these attacks. The reason behind adversarial examples may be the linearity in neural networks [41] or low probability adversarial pockets in image space [15]. Neural networks respond to recurrent discriminative patches [7], whereas adversarial examples lie in a different region on the data manifold [14]. Hence, several methods have been proposed to detect adversarial examples [26, 27]. On the other hand, ensemble adversarial training [44], deep contractive networks [16], defensive distillation [31], and protection against adversarial attacks using generative models [34, 36] focus on the defense against adversarial attacks. In this work, our aim is to understand the sources and causes of misclassification when the neural network is presented with an adversarial example.

Interpretability. Explaining the output of a decision maker is necessary to build user trust before deploying it in a real-world environment, e.g. in applications like finance, autonomous vehicles, and medical imaging. Previous work is broadly grouped into two: 1) model interpretation, i.e. understanding a model by observing the structure, parameters and neuronal activities of the network, and 2) instance-level interpretation or prediction explanation, i.e. showing the causal relationship between the input and the specific output [9]. De-convolutional neural networks [48] and activation maximization [38] fall under the first group. On the other hand, visualizing the evidence for classification [51], and adding perturbation in the optimization framework and learning the perturbation mask to understand the contribution of features [12], lie in the second group. As an alternative to visualizations, text-based class discriminative explanations [18, 32] and text-based interpretation with semantic information [8] have also been proposed to explain network decisions. In this work, we use attributes as a means of prediction explanation.

Interpretability of Adversarial Examples. After analyzing neuronal activations of the networks for adversarial examples, [7] concluded that the networks learn recurrent discriminative parts of objects instead of semantic meaning. In [19], the authors proposed a datapath visualization module consisting of layer-level, feature-level, and neuron-level visualizations of the network for clean as well as adversarial images. Finally, in [43], the authors proposed an attribute-steered classification model and compared its output with a standard classifier. If the outputs were inconsistent, the image was detected as an adversarial image. They further argued that interpretation is closely entangled with detection. Saliency-based model interpretations have been shown to be fragile when interpreting adversarial examples [13], i.e. although the output of the neural network for two inputs is different, the saliency maps are identical. Similarly, in [49], the authors proposed ACID attacks, which change the output of saliency maps without changing the output of the classifier. In [1], the authors performed sanity checks on saliency-based methods using randomization tests and found that they do not vary with changes in the data generation process and the model. In our work, we propose to ground class-discriminative attributes via bounding boxes that explain class predictions for clean as well as adversarial examples.

3 Predicting and Grounding Attributes Model

Figure 2: Adversarial images are difficult to explain: when the answer is wrong, often saliency based methods (left) fail to detect what went wrong. Instead, attributes (right) provide intuitive and effective visual and textual explanations.

Instance-level interpretations such as saliency maps [35] are often weak in justifying classification decisions for fine-grained adversarial images. In Figure 2, for example, the saliency map of a clean image classified into the correct class, “red winged blackbird”, and the saliency map of a misclassified adversarial image look quite similar. Instead, we propose to predict and ground attributes for both clean and adversarial images to provide visual as well as attribute-based interpretations. In fact, the attributes predicted for clean and adversarial images look quite different. By grounding the predicted attributes one can infer that “orange wing” is important for “red winged blackbird” while “red head” is important for “red faced cormorant”. Indeed, when the attribute value for “orange wing” decreases and that for “red head” increases, the image gets misclassified.

In this section, we detail our two-step framework for interpreting adversarial examples. First, we perturb the images using two different untargeted/targeted adversarial attack methods and robustify the classifiers via adversarial training. Second, we predict class-specific attributes and visually ground them on the image to provide an intuitive justification of why an image is classified as a certain class.

Figure 3: Our interpretable attribute prediction-grounding model. After adversarial attack or adversarial training step, image features of both clean and adversarial images are extracted using Resnet and mapped into attribute space by learning the compatibility function between image features and class attributes. Finally, attributes predicted by SJE are grounded by matching them with attributes predicted by Faster RCNN for clean and adversarial images.

3.1 Adversarial Attacks

We study both untargeted and targeted attacks. Given an original input $x_n$ and its respective correct class $y_n$ predicted by a model $f(x_n)$, an untargeted adversarial attack model generates an image $\hat{x}_n$ for which the predicted class is $y \neq y_n$. In targeted attacks, for every image $x_n$, the adversary aims at letting the model predict a specific target class $y_t \neq y_n$. In the following, we detail an adversarial attack method fooling a softmax classifier and an adversarial training technique that robustifies it.

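IFGSM. The iterative fast gradient sign method [22] is a modification of the fast gradient sign method (FGSM) [15]. In IFGSM, FGSM is applied iteratively to produce adversarial examples:

$$x^{0} = x_n, \qquad x^{i+1} = \mathrm{Clip}_{\epsilon}\left\{ x^{i} + \alpha \, \mathrm{sign}\left( \nabla_{x^{i}} \mathcal{L}(x^{i}, y_n) \right) \right\}$$

where $x_n$ is the clean image with correct class $y_n$, $\nabla_{x^{i}} \mathcal{L}$ represents the gradient of the cost function w.r.t. the perturbed image $x^{i}$ at step $i$, $\alpha$ determines the step size taken in the direction of the sign of the gradient and, finally, the result is clipped so that it stays within $\epsilon$ of $x_n$.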
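As a minimal sketch of the iterative update described above, the following assumes a toy linear softmax classifier so that the input gradient has a closed form. The paper attacks a fine-tuned Resnet-152; the function names, the toy weights and the sign flip used for the targeted case are illustrative assumptions, not the authors' implementation.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(W, x):
    # W holds one weight row per class (toy linear model, an assumption).
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def predict(W, x):
    z = logits(W, x)
    return z.index(max(z))

def input_gradient(W, x, label):
    """Gradient of the cross-entropy loss w.r.t. the input x."""
    p = softmax(logits(W, x))
    err = [pk - (1.0 if k == label else 0.0) for k, pk in enumerate(p)]
    return [sum(err[k] * W[k][j] for k in range(len(W))) for j in range(len(x))]

def sign(v):
    return (v > 0) - (v < 0)

def ifgsm(W, x, label, eps, alpha, steps, targeted=False):
    """Iterative FGSM: signed gradient steps, clipped to an eps-ball around x.

    Untargeted: ascend the loss of the correct label.
    Targeted:   descend the loss of the target label (passed as `label`).
    """
    direction = -1.0 if targeted else 1.0
    x_adv = list(x)
    for _ in range(steps):
        g = input_gradient(W, x_adv, label)
        x_adv = [xa + direction * alpha * sign(gj) for xa, gj in zip(x_adv, g)]
        # Keep every coordinate within eps of the clean input.
        x_adv = [min(max(xa, x0 - eps), x0 + eps) for xa, x0 in zip(x_adv, x)]
    return x_adv
```

Calling `ifgsm(W, x, label=y, eps=..., alpha=..., steps=...)` returns a perturbed copy of `x`; passing `targeted=True` with a target label instead moves the input towards that class.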

Adversarial Training. As a defense against adversarial attacks [15], adversarial training minimizes the objective:

$$\mathcal{L}_{adv}(x_n, y_n) = \lambda \, \mathcal{L}(x_n, y_n) + (1 - \lambda) \, \mathcal{L}(\hat{x}_n, y_n)$$

where $x_n$ are the input image features, $\mathcal{L}(x_n, y_n)$ is the classification loss for clean images, $\mathcal{L}(\hat{x}_n, y_n)$ is the loss for adversarial images and $\lambda$ regulates the loss to be minimized. The model finds the worst-case perturbations and fine-tunes the network parameters to reduce the loss on perturbed inputs. Hence, the classification accuracy on adversarial images increases; however, there is a trade-off between the accuracy of the predictions on clean and on adversarial images. Adversarial training helps in learning more robust classifiers by suppressing the perturbations from adversarial images [45]. Further, it is also considered a regularization technique [28].
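The weighted objective described above can be sketched as follows. This is a hedged illustration that assumes a cross-entropy classification loss and takes the class probabilities for the clean and the adversarial input as given; the names `adversarial_training_loss` and `weight` are hypothetical.

```python
import math

def cross_entropy(probs, label):
    """Negative log-probability assigned to the correct label."""
    return -math.log(probs[label])

def adversarial_training_loss(probs_clean, probs_adv, label, weight=0.5):
    """Weighted sum of the loss on the clean and on the adversarial input.

    `weight` plays the role of the coefficient that regulates the two terms:
    weight=1 trains on clean inputs only, weight=0 on adversarial inputs only.
    """
    clean_term = cross_entropy(probs_clean, label)
    adv_term = cross_entropy(probs_adv, label)
    return weight * clean_term + (1.0 - weight) * adv_term
```

A lower `weight` puts more emphasis on the adversarial term, which is one way to picture the trade-off between clean and adversarial accuracy mentioned in the text.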

3.2 Attribute Prediction and Grounding

Our attribute prediction and grounding model uses attributes as side information to define a joint embedding space that the images are mapped to; in this space, the attributes serve to interpret the classification decision. As shown in Figure 3, during training our model maps clean training images close to their respective class attributes, e.g. “painted bunting” with attributes “red belly, blue head, black bill”, whereas adversarial images get mapped close to a wrong class, e.g. “herring gull” with attributes “white belly, white head, yellow bill”. Finally, we visualize the predicted attributes for clean and adversarial images using a pre-trained Faster RCNN model.

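Attribute prediction. We employ structured joint embeddings (SJE) [2] to predict attributes in an image. Given input image features $\theta(x_n)$ and output class embeddings $\phi(y_n)$ for the sample set $\mathcal{S} = \{(x_n, y_n),\ n = 1, \dots, N\}$ with $x_n \in \mathcal{X}$ and $y_n \in \mathcal{Y}$, SJE learns a mapping $f: \mathcal{X} \to \mathcal{Y}$ by minimizing the empirical risk of the form $\frac{1}{N}\sum_{n=1}^{N} \Delta(y_n, f(x_n))$, where $\Delta: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ estimates the cost of predicting $f(x_n)$ when the true label is $y_n$.

A compatibility function $F: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is defined between input and output space:

$$F(x_n, y_n; W) = \theta(x_n)^{\top} W \phi(y_n)$$

where $W$ is a matrix of dimension $D \times E$, where $D$ is the dimension of the input embedding and $E$ is the dimension of the output embedding. It denotes the model parameters to be learned by ranking the correct class higher than the other classes:

$$\frac{1}{N} \sum_{n=1}^{N} \max_{y \in \mathcal{Y}} \{0, \ell(x_n, y_n, y)\}$$

where $\ell$ is the pairwise ranking loss:

$$\ell(x_n, y_n, y) = \Delta(y_n, y) + \theta(x_n)^{\top} W \phi(y) - \theta(x_n)^{\top} W \phi(y_n)$$

We optimize with SGD by sampling $(x_n, y_n)$ and searching for the highest ranked class $y$. If the highest ranked class is not the correct label, i.e. $y \neq y_n$, then the weights are updated using:

$$W^{(t)} = W^{(t-1)} + \eta_t \, \theta(x_n) \left[\phi(y_n) - \phi(y)\right]^{\top}$$

where $\eta_t$ is the learning rate and $\theta(x_n)^{\top} W$ gives the predicted attributes for image $x_n$. The image is assigned to the label of the nearest per-class output embedding $\phi(y)$.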
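To make the ranking-based training concrete, here is a small sketch of one SGD step of a structured joint embedding on plain Python lists. It assumes a 0/1 cost between labels, uses the highest compatibility score for classification, and only updates when the ranking loss is positive; the function names and the toy data layout are illustrative, not taken from the paper's code.

```python
def predicted_attributes(theta, W):
    """Project D-dim image features into the E-dim attribute space."""
    return [sum(theta[d] * W[d][e] for d in range(len(theta)))
            for e in range(len(W[0]))]

def compatibility(theta, W, phi):
    """Bilinear score between image features and a class attribute vector."""
    return sum(a * p for a, p in zip(predicted_attributes(theta, W), phi))

def sje_step(theta, label, W, class_embeddings, lr):
    """One SGD step of the pairwise ranking objective (0/1 cost assumed)."""
    def loss(y):
        cost = 0.0 if y == label else 1.0
        return (cost + compatibility(theta, W, class_embeddings[y])
                - compatibility(theta, W, class_embeddings[label]))

    worst = max(range(len(class_embeddings)), key=loss)
    if worst != label and loss(worst) > 0:
        diff = [a - b for a, b in
                zip(class_embeddings[label], class_embeddings[worst])]
        # Rank-one update: W += lr * theta (phi_correct - phi_worst)^T
        for d in range(len(theta)):
            for e in range(len(diff)):
                W[d][e] += lr * theta[d] * diff[e]
    return W

def classify(theta, W, class_embeddings):
    """Assign the class whose attribute vector is most compatible."""
    return max(range(len(class_embeddings)),
               key=lambda y: compatibility(theta, W, class_embeddings[y]))
```

After training, `predicted_attributes(theta, W)` is the attribute vector used for interpretation, and `classify` returns the index of the best matching class embedding.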

Attribute grounding. In our final step, we ground the predicted attributes onto the input images using a pre-trained Faster RCNN network and visualize them as in [4]. The pre-trained Faster RCNN model predicts a set of bounding boxes and, for each object bounding box, both the object class and an attribute [3].

The most discriminative attributes predicted by SJE are selected as those that change the most when the image is perturbed with adversarial noise. We then look up these attributes among the attributes predicted by Faster RCNN for each bounding box; when an attribute predicted by SJE matches an attribute predicted by Faster RCNN, we ground it on the respective clean or adversarial image.
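The selection and matching steps can be sketched as below. The detector output format (a list of dictionaries with `attribute`, `object` and `box` keys) and exact string matching of attribute names are assumptions made for illustration; a real Faster RCNN pipeline would return its own structures.

```python
def top_changed_attributes(clean_attrs, adv_attrs, names, k):
    """Names of the k attributes whose predicted value changes the most."""
    order = sorted(range(len(names)),
                   key=lambda i: abs(clean_attrs[i] - adv_attrs[i]),
                   reverse=True)
    return [names[i] for i in order[:k]]

def ground_attributes(selected, detections):
    """Keep the boxes whose detected attribute matches a selected attribute.

    `detections` is a hypothetical list of dicts, one per bounding box.
    Selected attributes without a matching box stay ungrounded.
    """
    wanted = set(selected)
    return {d["attribute"]: d["box"] for d in detections
            if d["attribute"] in wanted}
```

Attributes returned by `top_changed_attributes` that are absent from the result of `ground_attributes` correspond to the attributes shown in gray in the qualitative figures, i.e. those for which no visual evidence was found.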

4 Experiments

Figure 4: Comparing the accuracy of the non-explainable Softmax classifier and the explainable SJE classifier for clean and adversarially perturbed samples. We evaluate both classifiers on clean and adversarial images without adversarial training, and on the same images with adversarial training (AT).

In this section, we perform experiments on three different datasets and analyze model performance for clean as well as adversarial images. Finally, we present quantitative as well as qualitative analysis using attributes for both targeted and untargeted attacks.

Datasets. We experiment on three datasets, i.e. Animals with Attributes 2 (AwA) [23], Large attribute (LAD) [50] and Caltech UCSD Birds (CUB) [46]. AwA contains 37322 images (22206 train / 5599 val / 9517 test) with 50 classes and 85 attributes per class. LAD has 78017 images (40957 train / 13653 val / 23407 test) with 230 classes and 359 attributes per class. CUB consists of 11788 images (5395 train / 599 val / 5794 test) belonging to 200 fine-grained categories of birds with 312 attributes per class.

Image Features and Adversarial Examples. We extract image features and generate adversarial images using a fine-tuned Resnet-152. Our untargeted and targeted attacks use the iterative fast gradient sign method at three epsilon values, with a norm as the similarity measure between the clean input and the generated adversarial example. We performed targeted attacks under the average-case scenario, where the target class is selected randomly from the labels [6].
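The average-case target selection can be illustrated with a few lines. This sketch assumes the target is drawn uniformly among all labels other than the correct one; the function name and the optional `rng` argument are hypothetical.

```python
import random

def sample_target_class(true_label, num_classes, rng=random):
    """Pick a target label uniformly at random among the wrong classes."""
    candidates = [c for c in range(num_classes) if c != true_label]
    return rng.choice(candidates)
```

Passing a seeded `random.Random` instance as `rng` makes the sampled targets reproducible across runs.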

As for adversarial training, we repeatedly computed adversarial examples during training and fine-tuned the Resnet-152 to minimize the loss on these examples. We generated the adversarial examples using the projected gradient descent method, a multi-step variant of FGSM, at three epsilon values, as in [25].

Attribute Prediction and Grounding.

Our per-class attribute vectors come with the dataset and are annotated manually. At test time the image features are projected onto the attribute space and the image is assigned the label of the nearest ground truth attribute vector.
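The test-time assignment described above amounts to a nearest-neighbour lookup in attribute space. A minimal sketch, assuming Euclidean distance and a hypothetical dictionary from class name to its ground truth attribute vector:

```python
import math

def nearest_class(pred_attrs, class_attrs):
    """Return the class whose attribute vector is closest to the prediction."""
    def distance(name):
        return math.sqrt(sum((a - b) ** 2
                             for a, b in zip(pred_attrs, class_attrs[name])))
    return min(class_attrs, key=distance)
```

For example, with two toy classes described by three attributes each, a predicted vector that is high on the first two attributes is assigned to the class that has them.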

The predicted attributes are grounded using a Faster-RCNN pre-trained on the Visual Genome dataset, since we do not have ground truth part bounding boxes for any of our datasets. The Faster-RCNN model extracts the bounding boxes using 1600 object and 400 attribute annotations. Each bounding box is associated with an attribute followed by the object, e.g. a brown bird.

4.1 Comparing Softmax and SJE for Classification

Here, we evaluate the Softmax and SJE classifiers in terms of classification accuracy on both clean images and adversarial images generated with untargeted and targeted attacks, for all three datasets. Since the SJE model is a more explainable classifier than softmax, i.e. it predicts attributes whereas softmax directly predicts the class label, it is important to see if there is any significant drop in accuracy. Note that we do not attack the SJE network directly; the attacks act as black-box attacks on SJE. Similarly, adversarial training is performed on the Softmax classifier, and the features extracted from this model are then used for training SJE.

We observe from our results with targeted and untargeted IFGSM attacks in Figure 4 that SJE and Softmax accuracies are on par for clean images on the AWA dataset, while SJE accuracy is slightly higher on LAD and slightly lower on CUB (red curves). With untargeted adversarial attacks, SJE works slightly better on AWA and LAD and significantly better on CUB (blue curves). With targeted attacks, SJE accuracy is slightly, but not significantly, lower than Softmax on AWA and LAD, and significantly better on CUB (blue curves). This shows that while the softmax classifier works slightly better on clean images, SJE works significantly better when the perturbation is small and the dataset is fine-grained with well-defined attributes. Hence, when the image is perturbed, predicting attributes not only provides an explanation to the user but also yields more accurate class predictions.

In addition, on all three datasets the accuracy under targeted attacks does not decrease as much as under untargeted attacks. The reason is that in targeted attacks we randomly select a wrong target class, which can be very far from the ground truth; it is therefore harder to push the image into the target class than in untargeted attacks, where the image gets misclassified into the nearest wrong class. Although the drop in accuracy is larger for untargeted than for targeted attacks (blue curves), the improvement in accuracy is also larger for untargeted attacks, which leads to almost the same adversarially robust accuracy for both (purple curves).

Our evaluation with and without adversarial training shows that the classification accuracy on adversarial images improves when adversarial training is used, for example on AWA under untargeted attack. However, the accuracy on clean images drops, e.g. on AWA under untargeted attack (green curves). Overall we observe with both targeted and untargeted attacks that SJE is more robust to adversarial attacks than Softmax (dotted blue curves). Moreover, with adversarial training, SJE results on adversarial examples improve significantly compared to Softmax (dotted purple curves).

4.2 Quantifying Effect of Predicted Attributes

Figure 5: Attribute distance plots for standard and robust learning frameworks. Standard learning framework plots are shown for clean and adversarial image attributes and robust learning framework plots are shown only for adversarial image attributes but for adversarial images misclassified with standard features and correctly classified with robust features.

Our aim is to analyze (1) the predicted attributes of clean images that are classified correctly and of adversarial images that are misclassified without adversarial training, and (2) the predicted attributes of adversarial images that are classified correctly and of those classified incorrectly with adversarial training. Throughout, we distinguish the ground truth attribute vector of the correct class from the ground truth attribute vectors of incorrect classes.

We select the attributes whose values change the most under adversarial perturbation, considering the distances between predicted attributes of clean and adversarial images when they are correctly and incorrectly classified.

We contrast the Euclidean distance between the predicted attributes of (correctly classified) clean and (incorrectly classified) adversarial samples with the Euclidean distance between the ground truth attribute vectors of the correct and incorrect classes, and show the results in Figure 5 (a). We observe that for the AWA and LAD datasets the distances between the predicted attributes of adversarial and clean images are smaller than the distances between the ground truth attributes of the corresponding correct and adversarial classes. This result shows that only a minimal change in attribute values towards the wrong class can cause a misclassification. On the other hand, the fine-grained CUB dataset behaves differently. The overlap between the two distance distributions shows that images from fine-grained classes are more susceptible to adversarial attacks, and hence their attributes change significantly compared to images of coarse categories.

Contrasting the distance between the predicted attributes of the adversarial image and the ground truth attributes of the adversarial class with the distance between the predicted attributes of the adversarial image and the ground truth attributes of the correct class, we obtain the results in Figure 5 (c). We observe that, for most of the images, the distance between the adversarial image attributes and the correct class attributes is higher than the distance between the adversarial image attributes and the attributes of the wrongly predicted class. This result shows that adversarial images are misclassified because their attributes are close to the incorrect class attributes and far away from the correct class attributes.
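The distance comparisons in this subsection can be reproduced for a single image with a short helper. This is a sketch with hypothetical names: it takes the predicted attribute vectors of the clean and adversarial image and the ground truth attribute vectors of the correct and the wrong class, and returns the four Euclidean distances that are being contrasted.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def attribute_distances(pred_clean, pred_adv, gt_correct, gt_wrong):
    """Distances contrasted in the quantitative analysis, for one image."""
    return {
        # predicted clean vs. predicted adversarial attributes
        "pred_clean_vs_pred_adv": euclidean(pred_clean, pred_adv),
        # ground truth attributes of the correct vs. the wrong class
        "gt_correct_vs_gt_wrong": euclidean(gt_correct, gt_wrong),
        # adversarial prediction vs. each class's ground truth attributes
        "pred_adv_vs_gt_wrong": euclidean(pred_adv, gt_wrong),
        "pred_adv_vs_gt_correct": euclidean(pred_adv, gt_correct),
    }
```

Collecting these values over a test set and plotting their distributions would give plots of the kind summarized in Figure 5; restricting the vectors to the most changed attributes first is a straightforward extension.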

Figure 6: Qualitative analysis for untargeted/targeted attacks and adversarial training on CUB. The attributes ranked by importance for the classification decision are shown below the images. The grounded attributes are color coded for visibility (the ones in gray could not be grounded). The attributes for clean images (and adversarial images with adversarial training) are related to correct classes whereas the ones predicted for adversarial images change towards incorrect classes.

We compare the distances between the predicted attributes of adversarial images that are classified correctly with the help of adversarial training and of those classified incorrectly without adversarial training with the distances between the ground truth target class attributes and the ground truth wrong class attributes; the results are shown in Figure 5 (b). We observe that the overall behavior of the predicted attributes for adversarial images with and without adversarial training is similar to the behavior seen in Figure 5 (a) for clean and adversarial images. This shows that adversarial images with adversarial training behave like clean images, i.e. the predicted attributes of adversarial images with adversarial training become closer to those of their ground truth correct class.

We compare the distance between the predicted attributes of the incorrectly classified adversarial image and the ground truth attributes of the adversarial class with the distance between the predicted attributes of the adversarial image and the ground truth attributes of the correct class, when the classifier is trained with adversarial training. In the results in Figure 5 (d) we observe behavior similar to that in Figure 5 (c). This shows that adversarial images misclassified with adversarial training behave like adversarial images misclassified without it.

4.3 Grounding Predicted Attributes

To qualitatively analyse the predicted attributes, we ground them on clean and adversarial images. We select our images among the ones that are correctly classified when clean and incorrectly classified when adversarially perturbed. For clean images (or adversarial images with adversarial training) and for adversarial images, we select the most discriminative attributes separately, in each case keeping the attributes whose values change the most. We evaluate a fixed number of such attributes for each of the CUB, AWA and LAD datasets, and match the selected attributes with the attributes predicted by Faster RCNN to ground them on the images.

Figure 7: Qualitative analysis for untargeted attacks on AWA and LAD. The attributes are ranked by importance for the classification decision, the grounded attributes are color coded for visibility (the ones in gray could not be grounded).

Qualitative Results in CUB. We perform an analysis with untargeted and targeted IFGSM attacks as well as adversarial training on the fine-grained CUB dataset.

In untargeted attacks (results in the top row of Figure 6), the image gets misclassified into the nearest incorrect class. We observe that the most important attributes for the clean images are localized accurately, whereas the adversarial images are misclassified. Attributes that are common to both the clean and the adversarial class are localized correctly on the adversarial images; however, attributes that are related only to the wrong class cannot be grounded, as there is no visual evidence that supports their presence. For example, the “brown wing, long wing, long tail” attributes are common to both classes; hence, they are present both in the clean image and in the adversarial image. On the other hand, a brown color and a multicolored breast are not evident in the adversarial image; hence, they cannot be grounded. Similarly, in the second example none of the attributes is grounded, because the evidence for those attributes is not present in the image. In the third example, common attributes are localized but “brown throat, spotted wing” are not localized for the same reason.

In targeted attacks (results in the middle row of Figure 6) the images are forced to be misclassified into a randomly selected class. So, the images get misclassified into classes that do not share many common attributes. Our first visualization shows that none of the attributes of the adversarial class were visible in the adversarial example; hence, those attributes could not be grounded. In other words, the predicted adversarial image attributes are in accordance with the wrong class attributes but different from the clean image, so none of the attributes got localized. For the second image, “black tail” is a common property between the clean image class and the adversarial class; however, it is not the most discriminating property. One of the most discriminating properties, “solid back”, did not get localized since there is no visual evidence that supports the presence of this attribute in the clean image. Similarly, in the third example, we observe that the most discriminating property is “striped wing”, but it did not get localized in the adversarial image for the same reason.

Finally, our analysis of images that are correctly classified thanks to adversarial training shows that adversarial images with adversarial training also behave like clean images visually. In the last row of Figure 6, we observe that the attributes of the adversarial image without adversarial training are closer to the adversarial class attributes. However, the grounded attributes of the adversarial image with adversarial training are closer to its ground truth class. For instance, the first example contains a “blue head” and a “black wing”. “Blue head”, one of the most discriminating properties of the correct class, is not relevant to the adversarial class; hence, this attribute is not predicted as the most relevant by our model and our attribute grounder does not ground it.

Qualitative Results in AWA and LAD. Due to restricted space, we provide results on AWA and LAD only for images perturbed with untargeted attacks. Our results in Figure 7 show that the grounded attributes on clean images confirm the classification into the correct class, while the attributes grounded on adversarial images are those common to the clean and adversarial classes. For instance, in the first example of AWA the “is black” attribute is common to both classes, so it is grounded on both images, whereas “has claws” is an important attribute for the adversarial class. As it is not present in the correct class, it is not grounded.

On the other hand, compared to misclassifications caused by adversarial perturbations on CUB, as AWA and LAD are coarse-grained datasets, images do not necessarily get misclassified into the most similar class. Therefore, there is less overlap of attributes between correct and adversarial classes, which is in accordance with our quantitative results. Furthermore, the attributes for both datasets are not highly structured, as different objects can be distinguished from each other with only a small number of attributes. Our method grounds the common attributes. The second example for LAD in Figure 7 shows that attributes such as “red” and “green” are distinguishing for “strawberry” and are correctly predicted and grounded. On the other hand, “has nutlets” and “is big” are distinguishing attributes for “Mango”; hence, they cannot be grounded on the adversarially perturbed strawberry image.

5 Conclusion

In this work, we proposed an attribute prediction and grounding framework to explain why adversarial perturbations cause misclassifications. Our model predicts class-specific properties of objects by ranking relevant class attributes higher than irrelevant ones, and it grounds these attributes on their respective images. Our analysis involved images generated by targeted and untargeted attacks as well as adversarial training. We showed quantitatively and qualitatively that the predicted attributes are relevant to the wrong class for adversarial images and to the correct class for clean images, which justifies why adversarial images get misclassified. We visually grounded these predicted attributes on three benchmark datasets to show the visible and missing evidence when a misclassification occurs.


References

  • [1] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim. Sanity checks for saliency maps. In NeurIPS, 2018.
  • [2] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In CVPR. IEEE, 2015.
  • [3] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.
  • [4] L. Anne Hendricks, R. Hu, T. Darrell, and Z. Akata. Grounding visual explanations. In ECCV, 2018.
  • [5] N. Carlini, A. Athalye, N. Papernot, W. Brendel, J. Rauber, D. Tsipras, I. Goodfellow, A. Madry, and A. Kurakin. On evaluating adversarial robustness. ICLR, 2019.
  • [6] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In SP. IEEE, 2017.
  • [7] Y. Dong, H. Su, J. Zhu, and F. Bao. Towards interpretable deep neural networks by leveraging adversarial examples. arXiv, 2017.
  • [8] Y. Dong, H. Su, J. Zhu, and B. Zhang. Improving interpretability of deep neural networks with semantic information. In CVPR, 2017.
  • [9] M. Du, N. Liu, and X. Hu. Techniques for interpretable machine learning. arXiv, 2018.
  • [10] M. Du, N. Liu, Q. Song, and X. Hu. Towards explanation of dnn-based prediction with guided feature inversion. In SIGKDD. ACM, 2018.
  • [11] V. Fischer, M. C. Kumar, J. H. Metzen, and T. Brox. Adversarial examples for semantic image segmentation. ICLR, 2017.
  • [12] R. C. Fong and A. Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. arXiv, 2017.
  • [13] A. Ghorbani, A. Abid, and J. Zou. Interpretation of neural networks is fragile. arXiv, 2017.
  • [14] J. Gilmer, L. Metz, F. Faghri, S. S. Schoenholz, M. Raghu, M. Wattenberg, and I. Goodfellow. Adversarial spheres. arXiv, 2018.
  • [15] I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.
  • [16] S. Gu and L. Rigazio. Towards deep neural network architectures robust to adversarial examples. arXiv, 2014.
  • [17] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In ICCV, 2017.
  • [18] L. A. Hendricks, Z. Akata, M. Rohrbach, J. Donahue, B. Schiele, and T. Darrell. Generating visual explanations. In ECCV. Springer, 2016.
  • [19] L. Jiang, S. Liu, and C. Chen. Recent research advances on interactive machine learning. Journal of Visualization, 2018.
  • [20] J. Kos, I. Fischer, and D. Song. Adversarial examples for generative models. In SPW. IEEE, 2018.
  • [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NeurIPS, 2012.
  • [22] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world. ICLR workshop, 2017.
  • [23] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR. IEEE, 2009.
  • [24] Y.-C. Lin, Z.-W. Hong, Y.-H. Liao, M.-L. Shih, M.-Y. Liu, and M. Sun. Tactics of adversarial attack on deep reinforcement learning agents. IJCAI, 2017.
  • [25] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. ICLR, 2018.
  • [26] D. Meng and H. Chen. Magnet: a two-pronged defense against adversarial examples. In SIGSAC. ACM, 2017.
  • [27] J. H. Metzen, T. Genewein, V. Fischer, and B. Bischoff. On detecting adversarial perturbations. ICLR, 2017.
  • [28] T. Miyato, S.-i. Maeda, M. Koyama, K. Nakae, and S. Ishii. Distributional smoothing with virtual adversarial training. ICLR, 2016.
  • [29] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In CVPR, 2016.
  • [30] S. M. Moosavi Dezfooli, A. Fawzi, O. Fawzi, P. Frossard, and S. Soatto. Robustness of classifiers to universal perturbations: A geometric perspective. In ICLR, 2018.
  • [31] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami. The limitations of deep learning in adversarial settings. In EuroS&P. IEEE, 2016.
  • [32] D. H. Park, L. A. Hendricks, Z. Akata, B. Schiele, T. Darrell, and M. Rohrbach. Multimodal explanations: Justifying decisions and pointing to the evidence. In CVPR, 2018.
  • [33] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
  • [34] P. Samangouei, M. Kabkab, and R. Chellappa. Defense-gan: Protecting classifiers against adversarial attacks using generative models. ICLR, 2018.
  • [35] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
  • [36] S. Shen, G. Jin, K. Gao, and Y. Zhang. Ae-gan: adversarial eliminating with gan. arXiv, 2017.
  • [37] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676), 2017.
  • [38] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv, 2013.
  • [39] D. Su, H. Zhang, H. Chen, J. Yi, P.-Y. Chen, and Y. Gao. Is robustness the cost of accuracy?–a comprehensive study on the robustness of 18 deep image classification models. In ECCV, 2018.
  • [40] J. Su, D. V. Vargas, and K. Sakurai. One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation, 2019.
  • [41] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. ICLR, 2013.
  • [42] P. Tabacof, J. Tavares, and E. Valle. Adversarial images for variational autoencoders. arXiv, 2016.
  • [43] G. Tao, S. Ma, Y. Liu, and X. Zhang. Attacks meet interpretability: Attribute-steered detection of adversarial samples. In NeurIPS, 2018.
  • [44] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel. Ensemble adversarial training: Attacks and defenses. ICLR, 2018.
  • [45] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry. Robustness may be at odds with accuracy. arXiv, 2018.
  • [46] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
  • [47] C. Xie, J. Wang, Z. Zhang, Y. Zhou, L. Xie, and A. Yuille. Adversarial examples for semantic segmentation and object detection. In CVPR, 2017.
  • [48] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV. Springer, 2014.
  • [49] X. Zhang, N. Wang, S. Ji, H. Shen, and T. Wang. Interpretable deep learning under fire. arXiv, 2018.
  • [50] B. Zhao, Y. Fu, R. Liang, J. Wu, Y. Wang, and Y. Wang. A large-scale attribute dataset for zero-shot learning. arXiv, 2018.
  • [51] L. M. Zintgraf, T. S. Cohen, T. Adel, and M. Welling. Visualizing deep neural network decisions: Prediction difference analysis. ICLR, 2017.