Deep neural networks (DNNs) have demonstrated significant success in a wide spectrum of applications [3, 5, 6, 9]. However, they have been found to be highly vulnerable to adversarial attacks: typically small, human-imperceptible perturbations to inputs that fool DNNs into making incorrect predictions [2, 4, 8, 11]. This jeopardizes many DNN-based technologies, especially in security and safety-critical applications such as autonomous driving and data-driven healthcare. To make deep learning more robust against such malicious attacks, it is essential to understand how the attacks permeate DNN models [12, 14]. Interpreting, and ultimately defending against, adversarial attacks is nontrivial due to a number of challenges:
Entangled features and connections between benign and attacked inputs. A natural approach for understanding attacks is to compare a model’s operations on benign and attacked inputs, which could help people understand where and why predictions within a model diverge. However, designing an effective comparison can be challenging because the features contributing to the differences may correlate with one or more features in both the benign and attacked classes. For example, Figure 1 shows that a feature representing “ivory face with dark eyes and nose” (center purple node) is important for both the benign panda class and the attacked armadillo class. This shared feature correlates with a feature for panda (e.g., “black & white patches”), while also correlating with a few features for armadillo (e.g., “scales”, “crossed pattern”).
Diverse feature vulnerability. Given an adversarial attack, some learned features may be more easily manipulated and vulnerable than others. For example, manipulating a “basketball” feature into an “orange” feature is easier (similar shape and color) than changing it into a “truck” feature. As features exhibit a spectrum of vulnerability, enabling users to effectively visualize and understand an attack under varying levels of severity could help them design stronger countermeasures.
To address the aforementioned challenges, we are developing , an interactive visualization tool for interpreting adversarial attacks on deep learning models. Our ongoing work presents the following contributions:
Novel graph-based comparison. To discover the features and connections activated or suppressed by an attack, we adapt the recently proposed attribution graph  in a novel way to visualize, summarize, and compare a model’s response to benign and attacked data. The original attribution graph aims to highlight how a model’s learned features interact to make predictions for a single class, by representing highly activated neurons as vertices and their most influential connections as edges. Our main idea for is to generate and integrate two attribution graphs: one for the benign data and another for the attacked data, as illustrated in Figure 1. The aggregated graph helps us understand which features are shared by both benign and attacked data (e.g., purple, center feature in Figure 1), which are solely activated by the benign data (blue, far left), and which are solely activated by the attacked data (red, far right). Importantly, also helps users more easily discover where a prediction starts to “diverge”, homing in on the critical parts of the model that the attack is exploiting.
Fractionation of neurons based on vulnerability. To help users prioritize their inspection of neurons, we develop a new way to sort and group them based on their vulnerability, i.e., “how easily can a neuron be activated or suppressed by an attack.” Our main idea is to vary an attack’s strength (or severity) and record all neuron activations. Neurons that are easily activated or suppressed by even a weak attack may warrant focused inspection since they can be easily manipulated with little effort.
2 System Design and Implementation
Nodes: Features activated (or suppressed) by attacks.
In attribution graphs, nodes represent DNN neurons which are trained to detect particular features in input data. To interpret what feature a neuron detects, represents each neuron with its feature visualization: a synthesized image that maximizes the neuron’s activation [10]. Users can hover over any neuron’s feature visualization to also display example image patches from the dataset that most activate that neuron. For example, as seen in Figure 1.1, the feature visualization (left) and dataset examples (right) describe a neuron that detects a dotted pattern in scales.
We divide an attribution graph’s neurons into three groups according to their attack response. First, suppressed neurons are highly activated by benign inputs but become suppressed by adversarial inputs. These represent crucial features for the benign class, but the model fails to detect them when exposed to the attack. Second, emphasized neurons are not noticeably activated by benign inputs but become highly activated by adversarial inputs. These represent features that are typically not important for the benign class, but the model detects them as important features of the attacked class. Third, shared neurons are highly activated by and important to both benign and adversarial inputs.
We visually distinguish these three neuron groups with different colors and positions in the attribution graph view (Figure 2B). Suppressed neurons are colored blue and positioned on the left (Figure 2B.1). Emphasized neurons are colored orange and positioned on the right (Figure 2B.3). Shared neurons are colored purple and positioned in the middle between suppressed neurons and emphasized neurons (Figure 2B.2). The result is a visualization that disentangles and compares the DNN features and connections from the benign and attacked data.
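The three-way grouping can be illustrated with a minimal sketch. It assumes each neuron is summarized by one activation value per condition and that a single hypothetical cutoff (`threshold`) decides whether a neuron counts as highly activated; the text does not state the actual criterion, so both the cutoff and the activation values below are illustrative only.

```python
from typing import Dict


def classify_neurons(benign: Dict[str, float],
                     attacked: Dict[str, float],
                     threshold: float = 0.5) -> Dict[str, str]:
    """Label neurons as suppressed, emphasized, or shared.

    `threshold` is a hypothetical cutoff for "highly activated".
    Neurons below the cutoff in both conditions are left out.
    """
    groups: Dict[str, str] = {}
    for neuron in benign.keys() | attacked.keys():
        on_benign = benign.get(neuron, 0.0) >= threshold
        on_attacked = attacked.get(neuron, 0.0) >= threshold
        if on_benign and on_attacked:
            groups[neuron] = "shared"        # important to both inputs
        elif on_benign:
            groups[neuron] = "suppressed"    # lost under attack
        elif on_attacked:
            groups[neuron] = "emphasized"    # gained under attack
    return groups


# Made-up activations for the panda-to-armadillo example.
benign_acts = {"black & white patches": 0.9, "ivory face": 0.8, "scales": 0.1}
attacked_acts = {"black & white patches": 0.2, "ivory face": 0.7, "scales": 0.9}
neuron_groups = classify_neurons(benign_acts, attacked_acts)
```

In this toy input, “black & white patches” falls in the suppressed group, “scales” in the emphasized group, and “ivory face” in the shared group, mirroring the blue, orange, and purple columns described above.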
Edges: Explaining why features are activated (or suppressed).
In attribution graphs, edges represent influential connections between neurons that most interact with each other to represent a particular class [7]. These connections can explain why a feature is detected for the class by attribution: a neuron’s feature is considered important because the neurons connected to it in the previous layer are highly activated. This process is then repeated from the output layer to the input.
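As a rough illustration of one attribution step, the sketch below ranks previous-layer neurons by a simplified influence score, taken here to be activation times connection weight. This scoring rule, the function name `top_influences`, and all numeric values are assumptions made for the example; the text does not specify how influence is computed.

```python
from typing import Dict, List, Tuple


def top_influences(prev_activations: Dict[str, float],
                   weights: Dict[str, float],
                   k: int = 2) -> List[Tuple[str, float]]:
    """Return the k previous-layer neurons with the largest influence.

    Influence is approximated as activation * weight (an assumption).
    """
    influence = {name: act * weights.get(name, 0.0)
                 for name, act in prev_activations.items()}
    ranked = sorted(influence.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical previous-layer activations and weights into one neuron.
prev = {"black circle": 0.9, "spider legs": 0.8,
        "granular texture": 0.7, "tree bark": 0.1}
w = {"black circle": 0.5, "spider legs": 1.0,
     "granular texture": 1.0, "tree bark": 1.0}
edges = top_influences(prev, w, k=2)
```

Repeating this step for each selected neuron, layer by layer from the output toward the input, would yield the edges of a graph of this kind.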
With , users can interactively visualize attribution graphs, e.g., drilling down into specific subgraphs by hovering over a neuron and automatically highlighting its previous connections (Figure 2D). In combination with the position of the benign and attacked features, this helps users understand why an important feature becomes activated by an attack. When inspecting a particular emphasized neuron, the highlighted connected neurons from the previous layer will be either shared neurons or other emphasized neurons, as seen in Figure 2A. Users can observe that the connected emphasized neurons from the previous layer cause the particular neuron to be activated. Similarly, to understand why a feature becomes suppressed by an attack, inspecting a particular suppressed neuron highlights its connected neurons from the previous layer; these will be either shared neurons or suppressed neurons, and the connected suppressed neurons provide evidence for why the inspected neuron is less activated. Altogether, this helps discover the features for which a model’s prediction diverges when input data is attacked.
(A) above shows a part of an attribution graph from Figure 2D. An emphasized neuron is connected to one shared neuron (purple outline) and three emphasized neurons (orange outline) from the previous layer. (B) shows how the same part of the attribution graph looks different for benign images. The three emphasized neurons in the previous layer are not activated by the benign inputs, which causes the emphasized neuron in the current layer to be less activated.
Neuron groups: Characterizing features’ attack vulnerability.
By increasing the strength of an attack, we can separate neurons into groups based on their vulnerability. A neuron is considered more vulnerable if its activation changes greatly under weaker attacks. We distinguish levels of neuron vulnerability using color and position. More vulnerable neurons are colored more similarly to the shared neurons and positioned closer to them, since they sit on the border between the benign and attacked classes and cause misclassification under weaker attacks.
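One way to make this ordering concrete is sketched below. It assumes activations are recorded at a list of increasing attack strengths (index 0 being the benign input) and scores a neuron by the weakest strength at which its activation has moved by at least a hypothetical margin `min_change`. The margin, the function names, and the numbers are illustrative assumptions, not the measure used by the system.

```python
from typing import Dict, List, Optional


def first_flip_strength(strengths: List[float],
                        activations: List[float],
                        min_change: float = 0.5) -> Optional[float]:
    """Weakest attack strength that shifts the activation by min_change."""
    baseline = activations[0]  # activation on the benign input
    for strength, act in zip(strengths[1:], activations[1:]):
        if abs(act - baseline) >= min_change:
            return strength
    return None  # never changed enough within the tested range


def rank_by_vulnerability(strengths: List[float],
                          recorded: Dict[str, List[float]]) -> List[str]:
    """Most vulnerable first; unresponsive neurons are dropped."""
    flips = {name: first_flip_strength(strengths, acts)
             for name, acts in recorded.items()}
    responsive = [name for name, s in flips.items() if s is not None]
    return sorted(responsive, key=lambda name: flips[name])


# Made-up activations of three emphasized neurons at five strengths.
strengths = [0.0, 0.5, 1.5, 2.5, 3.5]
recorded = {
    "bumpy texture": [0.1, 0.1, 0.3, 0.7, 0.9],
    "scales pattern": [0.1, 0.7, 0.9, 1.0, 1.0],
    "baskets": [0.1, 0.2, 0.8, 0.9, 1.0],
}
ranking = rank_by_vulnerability(strengths, recorded)
```

With these values the ranking places “scales pattern” first, then “baskets”, then “bumpy texture”; the same rule applies to suppressed neurons because it uses the absolute change.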
3 Preliminary Results
We present usage scenarios showing how can help users better understand adversarial attacks on deep learning models. Our user Hailey is studying a targeted version of the Fast Gradient Method [4] applied to the InceptionV1 model [13]. The model is trained on the ImageNet dataset, which contains over 1.2 million images across 1,000 classes. Using the control panel (Figure 2A), she selects “giant panda” as the benign class and “armadillo” as the target class. She sets the maximum attack strength to 3.5.
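For readers unfamiliar with the attack, the sketch below shows a targeted fast gradient step on a tiny linear softmax classifier rather than on InceptionV1. It assumes the sign-based (L-infinity) variant, in which the input moves by `eps` per coordinate in the direction that lowers the cross-entropy loss toward the target class. The toy weights, input, and `eps` value are made up for illustration.

```python
import math
from typing import List


def predict(W: List[List[float]], x: List[float]) -> List[float]:
    """Softmax probabilities of a linear classifier with logits W @ x."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sign(value: float) -> int:
    return (value > 0) - (value < 0)


def targeted_fgm(W: List[List[float]], x: List[float],
                 target: int, eps: float) -> List[float]:
    """One targeted step: x - eps * sign(grad_x loss(x, target))."""
    probs = predict(W, x)
    # Gradient of cross-entropy w.r.t. logits: p_c - 1[c == target].
    dlogits = [p - (1.0 if c == target else 0.0)
               for c, p in enumerate(probs)]
    # Chain rule through logits = W @ x gives the input gradient.
    grad_x = [sum(dlogits[c] * W[c][i] for c in range(len(W)))
              for i in range(len(x))]
    return [xi - eps * sign(g) for xi, g in zip(x, grad_x)]


W = [[1.0, -1.0], [-1.0, 1.0]]   # class 0 = benign, class 1 = target
x = [1.0, 0.0]                   # classified as class 0
x_adv = targeted_fgm(W, x, target=1, eps=0.75)
p_before, p_after = predict(W, x)[1], predict(W, x_adv)[1]
```

Here the step raises the target-class probability above one half, so the toy model's prediction flips to the target class; larger `eps` values correspond to stronger attacks.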
Which neurons are attacked?
Hailey starts by finding out which specific neurons are attacked to narrow down the part of the model to investigate. In the attribution graph view (Figure 2B), she hovers over the suppressed neurons (Figure 2B.1) and the emphasized neurons (Figure 2B.3). She can easily see which features are emphasized and suppressed using the neuron feature visualization and dataset example patches. Exploring these features, she finds the mixed5a layer interesting because three emphasized neurons (223, 698, and 128) in mixed5a look related to armadillos’ skin patterns (Figure 2C). She decides to focus on these emphasized neurons in mixed5a.
Which neurons are easily attacked?
To devise a defense efficiently, Hailey wants to prioritize the neurons and investigate them in order. She knows that fractionates the neurons according to how easily they are attacked, so she checks how the emphasized neurons in mixed5a are separated (Figure 2C). By hovering over the neurons from left to right, she observes that “scales pattern” is the most vulnerable, and that “baskets” is more vulnerable than “bumpy texture”. She decides to explore the neurons in this order, since she thinks it is more efficient to protect the more vulnerable neurons first.
Why are these neurons attacked?
Hailey now wants to know how to protect the attacked neurons related to armadillos’ skin. She sequentially performs the attribution process on the “scales pattern”, “baskets”, and “bumpy texture” neurons. When she applies the attribution process on the “bumpy texture” neuron (Figure 2D), shows that four neurons in the previous layer are highly interacting with the “bumpy texture” feature: a shared neuron representing “black circle,” and emphasized neurons representing “spider legs”, “granular texture”, and “a white hairy dog’s face.” As the three emphasized neurons in the previous layer can be major reasons behind the detection of “bumpy texture”, she decides to investigate these neurons more using .
4 Ongoing Work
Interactive neuron editing.
currently visualizes the neurons that are activated or suppressed by an attack under varying degrees of severity. We are working on extending ’s interactivity by allowing real-time neuron editing, e.g., deletion. This would allow a user to actively identify vulnerable neurons using our visualization and interactively remove them from the DNN to observe the effect in real time. Neuron deletion would mask the activations of a particular neuron, potentially preventing the malicious effect of a targeted attack from propagating deeper into the network. This would enable a user to preemptively edit a DNN to enhance its robustness to adversarial attacks. For example, a user may identify and choose to delete a shared neuron that only feeds into emphasized neurons, preventing adversarially activated neurons from having any effect in the subsequent layers of the network and thus thwarting the targeted attack.
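The masking idea can be sketched on a toy two-layer network. The example assumes that deleting a neuron means forcing its activation to zero during the forward pass; the network, its weights, and the `deleted` parameter are hypothetical and stand in for a real DNN and editing interface.

```python
from typing import FrozenSet, List


def forward(x: List[float],
            w_hidden: List[List[float]],
            w_out: List[List[float]],
            deleted: FrozenSet[int] = frozenset()) -> List[float]:
    """Forward pass with ReLU hidden units; deleted units output zero."""
    hidden = []
    for index, row in enumerate(w_hidden):
        pre_activation = sum(w * xi for w, xi in zip(row, x))
        activation = max(0.0, pre_activation)
        # Masking: a deleted neuron contributes nothing downstream.
        hidden.append(0.0 if index in deleted else activation)
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]


w_hidden = [[1.0, 0.0], [0.0, 1.0]]
w_out = [[1.0, 0.0],   # output 0: benign class
         [0.0, 2.0]]   # output 1: attacked class, driven by hidden unit 1
x_attacked = [1.0, 1.0]

scores_before = forward(x_attacked, w_hidden, w_out)
scores_after = forward(x_attacked, w_hidden, w_out, deleted=frozenset({1}))
```

In this toy setting, masking hidden unit 1 removes the only path to the attacked class, so the benign class regains the highest score.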
We plan to evaluate the effectiveness of our visualization tool coupled with interactive neuron editing through in-lab user studies where participants seek to increase the robustness of a large-scale, pretrained DNN model. We will recruit students with basic knowledge of deep learning models. All participants will be asked to edit the DNN for different benign-attacked class pairs and will be evaluated on the basis of the reduction in targeted attack success rate. We will also conduct pre-test and post-test surveys to evaluate whether gave any deeper insights into the failure modes of the studied DNN and what factors the participants considered while editing the DNN to increase its robustness to adversarial attacks.
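The evaluation measure can be written down as a short sketch. It assumes the targeted attack success rate is the fraction of attacked inputs that the model assigns to the target class, and that the reduction is the difference between the rates before and after editing; the prediction lists are invented for illustration.

```python
from typing import List


def targeted_success_rate(predictions: List[str], target: str) -> float:
    """Fraction of attacked inputs predicted as the target class."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    return sum(p == target for p in predictions) / len(predictions)


def success_rate_reduction(before: List[str], after: List[str],
                           target: str) -> float:
    """Drop in targeted success rate achieved by editing the model."""
    return (targeted_success_rate(before, target)
            - targeted_success_rate(after, target))


# Hypothetical predictions on four attacked panda images.
before_edit = ["armadillo", "armadillo", "armadillo", "giant panda"]
after_edit = ["giant panda", "armadillo", "giant panda", "giant panda"]
reduction = success_rate_reduction(before_edit, after_edit, "armadillo")
```

A positive value means the edited model is fooled into the target class less often than the original model on the same attacked inputs.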
We present , an interactive system we are developing that visualizes how adversarial attacks permeate DNN models and cause misclassification. generates and visualizes multiple attribution graphs as a summary of what features are important for a particular class (e.g., benign or attacked class) and how the features are related. enables flexible comparison between benign and attacked attribution graphs, highlighting where and why the attribution graphs start to diverge, ultimately helping people better interpret complex deep learning models, their vulnerabilities, and how to best construct defenses.
This work was supported in part by NSF grants IIS-1563816 and CNS-1704701, NASA NSTRF, and gifts from Intel (ISTC-ARSA), NVIDIA, Google, Symantec, Yahoo! Labs, eBay, and Amazon.
[1] (2019) Activation atlas. Distill 4 (3), pp. e15. Cited by: item C1.
[2] ShapeShifter: robust physical adversarial attack on Faster R-CNN object detector. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 52–68. Cited by: §1.
[3] (2019) A guide to deep learning in healthcare. Nature Medicine 25 (1), pp. 24. Cited by: §1.
[4] (2014) Explaining and harnessing adversarial examples. CoRR abs/1412.6572. Cited by: §1, §3.
[5] (2019) A survey of deep learning techniques for autonomous driving. Journal of Field Robotics. Cited by: §1.
[6] A survey on deep learning based face recognition. Computer Vision and Image Understanding 189, pp. 102805. Cited by: §1.
[7] (2019) Summit: scaling deep learning interpretability by visualizing activation and attribution summarizations. IEEE VIS. Cited by: item C1, item 1, §2.
[8] (2016) Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533. Cited by: §1.
[9] (2019) Speech recognition using deep neural networks: a systematic review. IEEE Access 7, pp. 19143–19165. Cited by: §1.
[10] (2017) Feature visualization. Distill. https://distill.pub/2017/feature-visualization. Cited by: §2.
[11] Imperceptible, robust, and targeted adversarial examples for automatic speech recognition. arXiv preprint arXiv:1903.10346. Cited by: §1.
[12] Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §1.
[13] Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9. Cited by: §3.
[14] (2018) Attacks meet interpretability: attribute-steered detection of adversarial samples. In Advances in Neural Information Processing Systems, pp. 7717–7728. Cited by: §1.