SentiNet: Detecting Physical Attacks Against Deep Learning Systems

by Edward Chou, et al.
Stanford University

SentiNet is a novel detection framework for physical attacks on neural networks, a class of attacks that constrains an adversarial region to a visible portion of an image. Physical attacks have been shown to be robust and flexible techniques suited for deployment in real-world scenarios. Unlike most other adversarial detection works, SentiNet does not require training a model or prior knowledge of an attack. This attack-agnostic approach is appealing because an attack-specific defense would have to consider a large number of possible attack mechanisms and vectors. By leveraging the neural network's susceptibility to attacks and by using techniques from model interpretability and object detection as detection mechanisms, SentiNet turns a weakness of a model into a strength. We demonstrate the effectiveness of SentiNet on three different attacks (adversarial examples, data poisoning attacks, and trojaned networks) that have large variations in deployment mechanisms, and show that our defense achieves very competitive performance metrics for all three threats, even against strong adaptive adversaries with full knowledge of SentiNet.




I Introduction

Deep neural networks are susceptible to adversarial attacks aimed at causing misclassifications [43, 18, 4]. As deep learning models are also often re-used (e.g., via transfer learning [3]), model vulnerabilities, whether inherent or purposefully inserted by a malicious party [30], can easily affect a large number of systems. This has severe implications for the trustworthiness of deep learning models in security-critical decision making. Defending against these attacks is challenging due to the wide variety of possible attack mechanisms and vectors, especially for models operating in the visual domain. In this work, we explore physical attacks on visual classifiers and introduce SentiNet, a robust defense which detects adversarial inputs without requiring any specific model re-training or prior knowledge of the attack.

We focus on physical attacks which are localized, i.e., that constrain the adversarial region to a small contiguous portion of an image, such as the “adversarial patch” attacks in [4, 30]. This localization constraint has been helpful in designing robust and physically realizable attacks that take the form of an adversarial object or sticker placed inside a visual scene [4, 30, 11, 12]. In turn, these classes of attacks typically use unbounded perturbations (i.e., without the specific norm constraints used in most digital attacks [43, 18]) to ensure that the attacks are robust to changes in viewpoint, lighting and other physical artifacts. Several such physical attacks, aimed at causing misclassifications when applied to arbitrary images with different class labels [4, 30], have been demonstrated. A drawback of localized physical attacks is that they are generally visible and detectable by the human eye, but there are many situations where attacks can be deployed in autonomous settings or carefully disguised [4].

Fig. 1: Physical attacks are deployed in real-world settings using physical patterns and objects rather than modifying a digital image.

A prospective defender must first consider that the model being protected may have been compromised prior to being deployed. An attack can originate from the source of the network provider, as is the case with data poisoning attacks [20], or the model can be intercepted and modified, as with trojaning attacks [30]. Even if a network is integrity protected, attackers can still generate physical attacks that will affect the model at test time, via adversarial examples [4]. Furthermore, there are countless permutations of positioning and behavior that these attacks could exhibit, all of which are unknown to the defense. Altogether, this creates an extremely difficult security setting where vulnerabilities are easily distributed, where attackers target properties inherent to neural network systems that cannot be removed, and where attacks might be too diverse in appearance for a signature-based scheme.

Our goal is to create a defense that is attack-agnostic. To this end, we analyze the unifying, necessary features of a successful localized physical adversarial attack, and develop SentiNet, a technique that exploits these attack behaviors to detect them. We start from the observation that physical attacks are designed to be robust to a variety of physical artifacts, while generalizing to a large distribution of inputs (e.g., the adversarial patch of [4] is designed to work when applied to any input image). Our first insight is that a physical attack’s success relies on the use of “salient” features that strongly affect the model’s classification on many different inputs. We thus consider techniques from model interpretability and object detection to discover highly salient contiguous regions of an input image. As we show, these techniques uncover adversarial image regions, as well as benign ones that strongly affect classification. In a second step, we exploit adversarial patches’ strong robustness and generalization properties to distinguish them from benign patches with high saliency. Specifically, we apply extracted image patches to a large number of held-out benign images, and test how often a patch results in a misclassification. Successful adversarial patches are much more likely than benign patches to generate misclassifications, and are thus detected by SentiNet. As we show in our thorough evaluation of SentiNet, mounting an attack that evades detection requires lowering an adversarial region’s saliency to a point where the region no longer works with high probability, even for a strong adaptive adversary with full knowledge of the defense.

Contributions—To summarize, this paper makes the following contributions:

  • We propose SentiNet, a new architecture that protects a neural network by using the same model to detect physical attacks.

  • To the best of our knowledge, SentiNet is the first attack-agnostic architecture that can defend from distinct families of physical attacks without previously seeing or having knowledge of the attack.

  • SentiNet uses a novel approach to detect a potential attack region using techniques developed for model visualization and object detection, and feeds the attack deployed on multiple test images back to the network to perform attack classification.

  • We evaluate SentiNet to protect three pre-existing compromised and uncompromised networks against three known attacks, i.e., adversarial examples, poisoned networks, and trojaned networks. We show that SentiNet can protect neural networks successfully, with an average true positive rate of 94.26% and an average true negative rate of 93.96%.

  • We further evaluate SentiNet against an adaptive attacker and present six attacks against SentiNet. We show that SentiNet is resistant even to strong adversaries by demonstrating the robustness of each individual component.

  • SentiNet requires no special training and can be instantiated with off-the-shelf neural networks, while introducing a 3x inference overhead on multi-GPU systems.

Paper Organization—Before presenting our approach and experiments, we summarize the structure of this paper. First, in Section II, we present the architecture of deep learning systems and the threat model. Then, in Section III, we present SentiNet and its architecture. In Section IV, we perform an extensive evaluation of SentiNet covering physical attacks taken from the literature, i.e., adversarial examples, data poisoning and trojaned networks, and adaptive attackers trying to fool SentiNet. Next, in Section V, we discuss our results. Finally, in Section VI, we cover related works, and in Section VII, we provide our conclusions.

II Background

II-A Neural Networks

Deep learning is a branch of machine learning which focuses on multi-layered artificial neural networks [29]. A neural network can be defined as a standard machine learning function $f_m$, which given an input $x$ returns a prediction $y$ and a prediction confidence $conf$; i.e., $f_m(x) = (y, conf)$.
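To make the "prediction plus confidence" interface concrete, here is a minimal sketch. The function name `f_m` and the softmax-over-logits design are illustrative assumptions; the toy takes precomputed class scores in place of an image, which a real network would compute itself.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def f_m(logits):
    """Toy stand-in for a classifier: returns (predicted class, confidence)."""
    probs = softmax(logits)
    y = max(range(len(probs)), key=probs.__getitem__)
    return y, probs[y]

# Hypothetical scores for three classes; class 1 has the largest score.
y, conf = f_m([1.0, 3.0, 0.5])
```

The later steps of the defense only rely on this pair: the predicted class and how confident the model is in it.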

Convolutional Neural Networks (CNN) [28] are a specific architecture of neural network primarily targeted at computer vision tasks. Common CNN architectures include VGG-16 [41], ResNets [21] and Inception models [42].

II-B Deep Learning Systems

Deep learning (DL) systems are computer systems that rely on neural networks to perform specific tasks. At its essence, a DL system takes in an input, computes an output value, and makes a prediction based on the output. The analysis performed by a DL system is often a classification task, such as face-based user authentication, voice recognition, or voice-to-text. Recently, DL systems have been proposed for autonomous vehicles and digital assistants [48].

In this paper, we focus on DL systems that process physical scenes, where data is acquired via sensors. Sensors are devices that transform physical signals originating in a scene into a stream of digital data. The output of a sensor is then processed by a neural network to make predictions. The prediction of the neural network is used to determine an appropriate action. Figure 1 shows an example of a DL system implementing a face recognition system to unlock a mobile device or to let a user into a building. The scene includes the user’s face and other background objects. The sensor can be a CCD sensor of a camera that returns a digital image of the scene. The image is processed by a face classifier which predicts the user identity. If the user identity is valid, the actuator will unlock the device or open the gate.

II-C Threat Model

In this work, we assume a scenario where a DL system uses a deep CNN model to classify sensor data. The attacker’s goal is to create a physical attack that will hijack the prediction of the model. We allow the attacker to have white-box access to the model, to have full control of the training process and to modify the model in any way, and to be considered a trusted party by the model user. The attacker intends to create a general targeted attack that redirects a large distribution of inputs to a specific targeted output. More specifically, we consider three types of attack that we review below: physical adversarial examples [40, 27, 4, 11, 12, 2], data poisoning attacks [20], and trojaned models [25, 30].

Training a large neural network from scratch requires a huge amount of data and computation, and developers tend to reuse pre-trained networks [3]. Reusing a pre-trained neural network exposes a DL system to several attacks. In particular, we consider two attacks that arise when the origin of the model is untrusted. One method to compromise a neural network is via data poisoning attacks. In a data poisoning attack, the adversary inserts malicious inputs, i.e., backdoors, to compromise the model during training [20]. Instead of training a model with malicious data points, the adversary could also trojan a neural network by modifying the weights of selected neurons [25, 30] to respond to a specific trigger. For both trojaned and poisoned networks, the adversary can modify a network to respond to specific objects.

Another threat to DL systems is adversarial examples [43, 18]. Adversarial examples are maliciously crafted inputs that cause a neural network to make an incorrect prediction. Most of the existing literature on adversarial examples requires the attacker to control the neural network input at the byte level. In a physical DL system, this means that the attacker controls either the sensor, to generate the desired stream of bytes, or the communication channel between the sensor and the model, to modify the data. However, the sensor and the communication channel of a DL system may not be accessible to an attacker. As a result, the attacker can only influence the stream of data values to the model by showing the sensor physical objects. Recent works have shown that adversaries can create malicious physical objects, e.g., printed patches [40, 4, 12, 11] or 3D objects [2], that can fool a model under real-world conditions such as lighting, sensor noise, and rotation. As opposed to the more traditional perturbations, physical perturbations are localized in a region of the input.

Fig. 2: Overview of the SentiNet architecture. The output and class proposals of an input are used to generate masks, which are then fed back into the model to generate values for boundary analysis and attack classification.

III SentiNet

In this section, we present SentiNet. The goal of SentiNet is to identify adversarial inputs that will hijack the prediction of the model. Specifically, SentiNet intends to protect networks against adversarial examples, trigger trojans, and backdoors without assuming knowledge of what the attack will be beforehand. The core insight of SentiNet is to use this very behavior of adversarial misclassification to detect an attack. First, SentiNet uses techniques from model interpretability and object detection to extract from an input scene those regions that most highly influence the model prediction (Section III-A). These regions likely contain the malicious object (if present) as well as benign salient regions. Then, SentiNet applies these extracted regions on a set of benign test inputs and observes the behavior of the model. Finally, SentiNet uses fuzzing techniques to compare these synthetic behaviors with the known behavior of the model on benign inputs, to detect prediction hijacking (Section III-B).

III-A Adversarial Object Localization

The first step of our approach localizes the regions of the given input that might contain malicious objects. The idea is to identify the parts of the input $x$ that contribute to the model prediction $y$. Because the physical attack is small and localized, we can hope to recover the true class of input $x$ if we evaluate the model on a segmented input that contains no part of the attack. In the following, we look into the details of each step. First, we present a segmentation-based approach to propose classes. Then, starting from the proposed classes and the given input, we generate a mask for $x$ that may contain the malicious object.

Class Proposal via Segmentation. The detection of the attack begins with the identification of a set of possible classes that may be predicted by the model $f_m$. The first of these classes is the actual prediction, i.e., $y = f_m(x)$. The other classes are identified by segmenting the input $x$ and then evaluating the network on each segment. Algorithm 1 shows the algorithm to propose classes via input segmentation. Different approaches can be used to segment a given input, including sliding windows and network-based region proposals [37]. In our approach, we use the selective search image segmentation algorithm [45]. Selective search generates an exhaustive list of region proposals based on the patterns and edges found in natural scenes [45]. Then, we evaluate the model on each proposed segment and return the $k$ most confident predictions, where $k$ is a configuration parameter of SentiNet.

in :  f_m -- model;
      x -- input of f_m;
      k -- number of propositions
out : P -- set of proposed classes and confidences (y_p, conf_p)
segments = SelectiveSearch(x);
predictions = f_m(segments);
P = TopConf(predictions, k);
Algorithm 1 ClassProposal
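The class-proposal step can be sketched in a few lines. This is only an illustration under stated assumptions: `sliding_windows` is a simple stand-in for selective search, `toy_model` is a hypothetical classifier, and keeping one entry per class before taking the top `k` is a design choice of this sketch rather than something the text specifies.

```python
def sliding_windows(image, size, stride):
    """Yield square crops of a 2D list; a simple stand-in for selective search."""
    rows, cols = len(image), len(image[0])
    for r in range(0, rows - size + 1, stride):
        for c in range(0, cols - size + 1, stride):
            yield [row[c:c + size] for row in image[r:r + size]]

def class_proposal(model, image, k, size=2, stride=2):
    """Return up to k (class, confidence) pairs, most confident first."""
    best = {}
    for segment in sliding_windows(image, size, stride):
        label, conf = model(segment)
        # Keep only the most confident evaluation seen for each class.
        if conf > best.get(label, 0.0):
            best[label] = conf
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

def toy_model(segment):
    """Hypothetical classifier: bright segments are class 1, dark ones class 0."""
    pixels = [p for row in segment for p in row]
    mean = sum(pixels) / len(pixels)
    return (1, mean) if mean >= 0.5 else (0, 1.0 - mean)

image = [
    [0.9, 0.9, 0.3, 0.3],
    [0.9, 0.9, 0.3, 0.3],
    [0.2, 0.2, 0.6, 0.6],
    [0.2, 0.2, 0.6, 0.6],
]
proposals = class_proposal(toy_model, image, k=2)
```

In this toy image, the bright top-left segment proposes class 1 and the dark bottom-left segment proposes class 0, so `proposals` ranks class 1 ahead of class 0.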

Mask Generation. Over the past few years, several techniques have been proposed to explain and interpret the prediction of a model. One strategy attempts to “quantify” the relevance of individual pixels of the input, e.g., using saliency maps [46]. While effective in practice, focusing on individual pixels may result in a mask of non-contiguous pixels. Sparse masks may miss elements of the malicious object and are not suitable for the model testing phase (see Section III-B). Alternative approaches do not operate on individual pixels but attempt to recover the discriminative image regions used by the model to identify inputs of the same class. Unfortunately, many of these approaches, e.g., Class Activation Mapping (CAM) [49], require modifying and fine-tuning a base model. Such modifications may alter the behavior of the model, including the malicious behavior that SentiNet intends to detect and prevent from being exploited. (Footnote 1: We verified that CAM cannot be used with the trojaned network, as it removes the last layers (i.e., layers FC6-8 of the face recognition model of [30]), rendering the trojan trigger undetectable.)

A particularly suitable approach for our goal is Grad-CAM [8], a model-interpretation technique that identifies contiguous spatial regions of an input without requiring modifications to the original model. At a high level, Grad-CAM uses gradients computed in the final layers of a network to calculate the saliency of input regions. For class $c$, Grad-CAM calculates the gradients of the model’s output $y^c$ (the model’s logit score for class $c$) with respect to each of the feature maps $A^k$ of the model’s final pooling layer to obtain $\partial y^c / \partial A^k$. The mean gradient value of each filter map, or “neuron importance weight”, is denoted $\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A^k_{ij}}$, where $Z$ is the number of entries in the feature map. Finally, the feature maps are weighted by their neuron importance and aggregated to obtain the final Grad-CAM output: $L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\left(\sum_k \alpha_k^c A^k\right)$. Here, $\mathrm{ReLU}$ is the ReLU activation function [15], which retains only the positive gradient signals for class $c$. The output of Grad-CAM is a coarse heatmap of the positive importance of the image, usually at a lower resolution than the input image due to downsampling in the model’s convolutional and pooling layers. Finally, masks are produced by binarizing the heatmap with a threshold of 15% of the maximum intensity. We use this mask to segment salient regions for the next steps.
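The arithmetic behind the heatmap and the 15% binarization can be sketched as follows. This is a toy illustration, not the off-the-shelf implementation used in the evaluation: the feature maps and gradients are hand-written nested lists (a real system obtains them from the network), the function name `grad_cam_mask` is hypothetical, treating "15% of max" as an inclusive `>=` comparison is an assumption, and the upsampling to input resolution is omitted.

```python
def grad_cam_mask(feature_maps, gradients, threshold=0.15):
    """Compute a Grad-CAM heatmap and binarize it at a fraction of its maximum.

    feature_maps[k][i][j] holds feature map k; gradients[k][i][j] holds the
    gradient of the class score with respect to that entry.
    """
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    # Neuron importance weight: mean gradient over each feature map.
    alphas = [sum(sum(row) for row in grad) / (rows * cols) for grad in gradients]
    heatmap = [[0.0] * cols for _ in range(rows)]
    for alpha, fmap in zip(alphas, feature_maps):
        for i in range(rows):
            for j in range(cols):
                heatmap[i][j] += alpha * fmap[i][j]
    # ReLU keeps only positive evidence for the class.
    heatmap = [[max(0.0, v) for v in row] for row in heatmap]
    peak = max(max(row) for row in heatmap)
    if peak == 0.0:
        return heatmap, [[False] * cols for _ in range(rows)]
    mask = [[v >= threshold * peak for v in row] for row in heatmap]
    return heatmap, mask

A = [
    [[1.0, 0.0], [0.0, 2.0]],      # feature map 0
    [[0.0, 3.0], [1.0, 0.0]],      # feature map 1
]
dY = [
    [[0.5, 0.5], [0.5, 0.5]],      # mean gradient +0.5
    [[-1.0, -1.0], [-1.0, -1.0]],  # mean gradient -1.0
]
heatmap, mask = grad_cam_mask(A, dY)
```

Here the second feature map has a negative importance weight, so the cells it activates are zeroed by the ReLU and fall outside the mask, while the cells activated by the first map survive.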

Fig. 3: Top Row: Mask Generation using Grad-CAM. The left figure shows the Grad-CAM heatmap with respect to the targeted ’0’ class, and the right figure shows the extracted mask that covers areas outside the physical attack. Bottom Row: The left-most figure is the Grad-CAM heatmap with respect to the targeted ’0’ class, and the center figure is Grad-CAM with respect to a proposed class. Combining masks of class proposals increases the precision of the mask with respect to the physical attack.

Precise Mask Generation. Although Grad-CAM can successfully identify discriminative input regions corresponding to adversarial objects, it may also identify benign salient areas. An illustrative example is in Figure 3, where the Grad-CAM heatmap for a facial recognition network covers both a trojan trigger patch and the original face. To improve the accuracy of the mask, we query the model for additional predictions on selected regions of the input image. Then, for each prediction, we use Grad-CAM to extract a mask for the input area most relevant to the prediction. Finally, we combine these additional masks to refine the mask of the initial prediction $y$.

Once we derive a list of possible classes present in the picture, we carve out the regions of $x$ most relevant to each predicted class. For simplicity, in this section, we assume that each input can contain only one malicious object. We show how to generalize this approach to multiple malicious inputs in Section IV-C. Algorithm 2 shows the procedure to extract input regions from $x$.

We start by extracting a mask using Grad-CAM for the input $x$ and prediction $y$. We also extract a mask for each proposed class $y_p$. Performing Grad-CAM on the other proposed classes allows us to locate the important regions of the image besides the adversarial attack. Additionally, because the adversarial region is often negatively correlated with the non-targeted classes, the heatmap actively avoids highlighting the adversarial regions of the image. We can use these heatmaps to generate secondary masks that improve our original mask by subtracting the regions where the masks overlap. This results in masks that highlight only the localized attack and not the other salient regions in the image. In Figure 3, we can see this approach applied to generate a more precise mask that contains mostly the adversarial region.

in :  f_m -- model;
      x -- input for f_m;
      y, conf -- model prediction on x;
      P -- proposed classes
out : M -- masks for candidate regions with the malicious object
mask_y = MaskGradCAM(f_m, x, y);
M = { mask_y - MaskGradCAM(f_m, x, y_p) : y_p in P };
Algorithm 2 MaskGeneration
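The subtraction of overlapping regions can be illustrated on boolean grids. This is one possible reading of the refinement step, with a hypothetical helper `refine_mask` that removes from the primary mask every pixel covered by any secondary mask; the masks themselves are hand-written here instead of coming from Grad-CAM.

```python
def refine_mask(primary, secondary_masks):
    """Remove from the primary mask every pixel also covered by a secondary mask."""
    rows, cols = len(primary), len(primary[0])
    refined = [row[:] for row in primary]  # copy so the input is left untouched
    for other in secondary_masks:
        for i in range(rows):
            for j in range(cols):
                if other[i][j]:
                    refined[i][j] = False
    return refined

T, F = True, False
# Mask for the predicted class: covers a "patch" (top-left) and a "face" (right).
mask_y = [
    [T, T, F, F],
    [T, T, T, T],
    [F, F, T, T],
]
# Mask for a proposed class: covers only the "face".
mask_p = [
    [F, F, F, F],
    [F, F, T, T],
    [F, F, T, T],
]
refined = refine_mask(mask_y, [mask_p])
```

After refinement only the top-left block remains set, which mirrors the intended effect of isolating the localized attack from benign salient regions.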

III-B Attack Detection

The detection of an attack requires two steps. First, as described above, SentiNet extracts input regions that are likely to contain adversarial patches. Then, SentiNet tests each of these regions on a set of benign images, as detailed below, to discriminate adversarial regions from benign ones.

Testing. Once an input region is localized, SentiNet observes the effects the region has on the model to determine whether the region is adversarial or benign. To do so, SentiNet overlays the suspected region on a set of benign test images $X$, which are often shipped together with deployed models. These test images are fed back into the network, and the number of fooled examples is counted and used to classify adversarial images. Intuitively, the more mutated images that fool the model, the more likely it is that the suspected region is an adversarial attack.

When the recovered mask is small, this feedback technique is effective at distinguishing adversarial and benign inputs, as small benign objects cannot typically overwhelm a network’s prediction. However, one problem of such an approach is that a mask that covers a large fraction of the input image will likely cause misclassifications when overlaid onto other images, even for benign inputs. Consider, for example, a large mask of an input image $x$. When overlaid, the features inside the mask are likely to be more relevant than the features outside, increasing the chance of classifying the mutated test inputs as $y$. A way to address this issue is to suppress the features inside a mask with inert patterns, i.e., patterns with low confidence, so as to increase the response of the network to the features outside the mask. Masks that are too large will tend to block out too much of the test images, resulting in low confidences, which gives us another data point to use for classifying attacks. An actual adversarial input should have both a high number of fooled images and a high average inert-pattern confidence, while a benign input will trade off one value against the other.

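in :  f_m -- model;
      x -- input for f_m;
      y -- class of x;
      M -- proposed masks;
      X -- benign test images;
out : fooled, avgConf -- metrics used to decide true/false
R = x * M;
IP = InertPattern(R);
X_R = Overlay(X, R);
X_IP = Overlay(X, IP);
fooled = 0, avgConf = 0;
for x_R in X_R, x_IP in X_IP do
      y_R = f_m(x_R);
      if y_R = y then
            fooled += 1
      avgConf += conf of f_m(x_IP)
avgConf = avgConf / |X|;
return fooled, avgConf;
Algorithm 3 Testing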
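The testing step can be sketched end to end on toy data. Everything below is an illustrative assumption: `toy_model` is a hypothetical classifier in which a pixel value of 9.0 acts as a trigger, images are small nested lists, and the inert pattern is seeded uniform noise.

```python
import random

def overlay(base, source, mask):
    """Copy source pixels onto a copy of base wherever mask is True."""
    return [
        [s if m else b for b, s, m in zip(base_row, src_row, mask_row)]
        for base_row, src_row, mask_row in zip(base, source, mask)
    ]

def inert_pattern(rows, cols, rng):
    """Random-noise image used to suppress the features inside the mask."""
    return [[rng.random() for _ in range(cols)] for _ in range(rows)]

def evaluate_region(model, x, y, mask, test_images, rng):
    """Return (number of test images fooled into class y, mean inert-pattern confidence)."""
    noise = inert_pattern(len(x), len(x[0]), rng)
    fooled, conf_sum = 0, 0.0
    for image in test_images:
        label, _ = model(overlay(image, x, mask))      # region pasted on a benign image
        if label == y:
            fooled += 1
        _, conf = model(overlay(image, noise, mask))   # same area replaced by noise
        conf_sum += conf
    return fooled, conf_sum / len(test_images)

def toy_model(image):
    """Hypothetical classifier: a pixel equal to 9.0 is a trigger for class 1."""
    pixels = [p for row in image for p in row]
    if 9.0 in pixels:
        return 1, 0.99
    mean = sum(pixels) / len(pixels)
    return 0, max(0.0, min(1.0, 1.0 - abs(mean - 0.2)))

x = [[9.0, 0.2], [0.2, 0.2]]                      # input carrying the trigger
test_images = [[[0.1, 0.1], [0.1, 0.1]], [[0.3, 0.3], [0.3, 0.3]]]
trigger_mask = [[True, False], [False, False]]
benign_mask = [[False, True], [False, False]]

fooled_adv, conf_adv = evaluate_region(toy_model, x, 1, trigger_mask, test_images, random.Random(0))
fooled_ben, conf_ben = evaluate_region(toy_model, x, 1, benign_mask, test_images, random.Random(0))
```

The mask over the trigger fools every test image, while a same-sized mask over a benign pixel fools none, which is the separation the two metrics are meant to expose.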

Decision Boundary for Detection. Now, with these two metrics (number of images fooled and average inert-pattern confidence), we can determine under which conditions an input is adversarial. A naive approach is to use threshold-based rules, but it is hard to determine how to set the thresholds and which metric is more important. We examine this problem by plotting metrics from an example task in a 2D plot in Figure 4, where the red triangular dots represent metrics found with adversarial examples containing physical attacks and the blue circular dots are calculated from clean examples. We observe that the adversarial and clean points can be easily separated using a parabolic function, suggesting that a classifier approach such as linear regression or a support vector machine could be used. However, this would operate under the false assumption that we have prior examples of adversarial inputs. Ideally, we want to create a technique that would allow us to identify an unseen adversarial input as an attack based on the attack-agnostic metrics.

We notice a general pattern where adversarial examples are usually clustered near the top-right of the plot. Because adversarial examples are designed to cause misclassifications when applied to other images and not overly obstruct the other important regions of an image, it makes sense that both the number of images fooled and the average noisy confidence metrics of an adversarial example would be high. We can use our set of benign test images to generate statistics of the metrics of normal examples, which will then allow us to formulate when an input is abnormally adversarial.

We can use our collected metrics on clean examples to approximate a curve. By taking the points most likely to bound the statistics of our examples, we can approximate a curve such that points lying outside it can be classified as adversarial attacks. We collect our points by taking the points with the highest y-values within each x-interval, and then use a non-linear least squares function to approximate our curve. We then use the approximated curve to classify attacks by calculating the distance between the curve and a point, using the Constrained Optimization by Linear Approximation (COBYLA) method [44], and determining whether that distance is within a threshold estimated from the distances of clean examples lying outside the curve.

in :  S -- sampled behavior of f_m
out : f_curve -- approximated curve function;
      d -- acceptable distance from f_curve
f_curve = ApproximateCurve(OutPts(S));
avg_d = 0;
for (a, b) in S do
      if f_curve(a) < b then
            avg_d += COBYLA(f_curve, a, b);
d = avg_d / |S|;
return f_curve, d;
Algorithm 4 DecisionBoundary

Fig. 4: Example of boundary detection for a trojaned network using 400 data points and random noise as the inert pattern. On the left, the adversarial and benign metrics are plotted as red triangles and blue circles respectively; on the right, we plot the curve proposal from the sampled points, i.e., the benign test images $X$.
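The curve-fitting and distance test can be sketched with the standard library alone. Several simplifications are assumptions of this sketch and not the authors' procedure: the curve family is a quadratic fitted by ordinary least squares, the nearest distance is found by brute-force sampling of the curve in place of COBYLA, the threshold is the mean distance of the clean points that lie above the curve, and the axes (x for average confidence, y for fraction fooled, both scaled to [0, 1]) are a guess at the plot layout.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_quadratic(points):
    """Least-squares fit of y = c0 + c1*x + c2*x^2 via the normal equations."""
    powers = [sum(x ** p for x, _ in points) for p in range(5)]
    matrix = [[powers[i + j] for j in range(3)] for i in range(3)]
    rhs = [sum(y * x ** i for x, y in points) for i in range(3)]
    return solve(matrix, rhs)

def upper_envelope(points, bins=4):
    """Keep the highest-y point inside each equal-width x-interval over [0, 1]."""
    best = {}
    for x, y in points:
        b = min(int(x * bins), bins - 1)
        if b not in best or y > best[b][1]:
            best[b] = (x, y)
    return sorted(best.values())

def curve(coeffs, x):
    return coeffs[0] + coeffs[1] * x + coeffs[2] * x * x

def distance_to_curve(coeffs, point, steps=1000):
    """Brute-force nearest distance from a point to the curve over x in [0, 1]."""
    px, py = point
    return min(
        ((px - i / steps) ** 2 + (py - curve(coeffs, i / steps)) ** 2) ** 0.5
        for i in range(steps + 1)
    )

def decision_boundary(clean_points, bins=4):
    """Fit the boundary curve and estimate the acceptable distance from it."""
    coeffs = fit_quadratic(upper_envelope(clean_points, bins))
    outside = [p for p in clean_points if p[1] > curve(coeffs, p[0])]
    dists = [distance_to_curve(coeffs, p) for p in outside]
    threshold = sum(dists) / len(dists) if dists else 0.0
    return coeffs, threshold

def is_adversarial(coeffs, threshold, point):
    above = point[1] > curve(coeffs, point[0])
    return above and distance_to_curve(coeffs, point) > threshold

# Hypothetical clean metrics as (average confidence, fraction fooled) pairs.
clean = [(0.1, 0.55), (0.1, 0.30), (0.3, 0.50), (0.35, 0.20),
         (0.6, 0.35), (0.65, 0.10), (0.9, 0.12), (0.85, 0.05)]
coeffs, threshold = decision_boundary(clean)
```

With this boundary, a point in the top-right corner such as `(0.9, 0.95)` is flagged by `is_adversarial`, while a clean point well below the curve such as `(0.35, 0.2)` is not.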

IV Evaluation

SentiNet does not modify the original network through retraining or layer modification, and thus will not impact the performance of the network. Instead, it acts as a protection layer that determines whether the resulting output of the original network can be trusted. SentiNet needs to protect a network from adversarial inputs exploiting both inherent vulnerabilities of neural networks and implanted ones. In addition, SentiNet needs to be robust against an adaptive adversary that may attempt to bypass SentiNet. SentiNet is executed after the model prediction and requires several computationally intensive operations including the Grad-CAM visualizations, the selective search algorithm, and multiple evaluations of the network under protection. As a result, as SentiNet adds overhead between the classification and the actuator, SentiNet needs to be efficient.

In this section, we evaluate SentiNet covering these three dimensions. After presenting the experiment setup in Section IV-A, in Section IV-B, we evaluate the effectiveness of SentiNet in detecting three known attacks, i.e., adversarial patches, trojan triggers, and backdoors. Then, in Section IV-C, we evaluate the robustness of SentiNet against an adaptive attacker that attempts to bypass SentiNet. Finally, in Section IV-D, we look into the performance overhead of SentiNet in terms of wallclock and runtime analysis.

IV-A Experiment Settings

We evaluated SentiNet when protecting three publicly available networks shared by other researchers. Two of the selected networks are compromised and one is uncompromised. The compromised networks are a backdoored Faster-RCNN network for traffic sign detection by Gu et al. [20] and a VGG-16 trojaned network for facial recognition by Liu et al. [30]. The uncompromised network is a VGG-16 network trained on the ImageNet dataset by Simonyan et al. [41]. Additionally, to operate, SentiNet requires a benign test image set $X$ and an inert pattern to generate the decision boundary, as shown in Section III-B. We present the generation of each test set for the selected networks in Section IV-B. Unless otherwise specified, we use random noise for our inert pattern.

We evaluated SentiNet using a Tesla K-80 GPU on an Ubuntu 18.04 machine with Intel Xeon CPU E5-2697 and 72 cores. SentiNet uses Tensorflow 1.5 to generate the adversarial patches for the uncompromised network, BVLC Caffe for the trojaned network, and Faster-RCNN Caffe [14] for the poisoned network. To parallelize the class proposal, SentiNet relies on the ROI pooling layer as implemented by the Fast-RCNN Caffe version. Finally, SentiNet uses the off-the-shelf implementations of Grad-CAM [38] and selective search [45]. (Footnote 2: We will publish scripts, models, and datasets on GitHub after publication.)

Finally, we measure effectiveness and robustness in terms of accuracy and performance, collecting the TP/TN and FP/FN rates of each attack. We measure efficiency by profiling the execution of each step of SentiNet.

IV-B Known Attacks

The first part of our evaluation assesses the effectiveness of SentiNet in protecting our selected networks against three attacks, i.e., backdoors [20], trojan triggers [30], and adversarial patches [4]. For each attack, we measured the effectiveness of SentiNet with and without mask refinement. For the adversarial patches attack, we considered an additional variant where the attacker uses multiple patches at the same time. The summary of this evaluation is in Table I.

Network          | Vulnerability | Attack                         | Inert Pattern | TP     | TN     | FP     | FN
Faster-RCNN [20] | Poisoned      | Attack #1                      | Random        | 85.07% | 86.90% | 13.10% | 14.83%
Faster-RCNN [20] | Poisoned      | Attack #1, w/o mask reduction  | Random        | 85.61% | 76.55% | 23.45% | 14.39%
VGG-16 [30]      | Trojaned      | Attack #2                      | Random        | 99.18% | 99.75% | 0.25%  | 0.82%
VGG-16 [30]      | Trojaned      | Attack #2, w/o mask reduction  | Random        | 87.77% | 99.25% | 0.75%  | 12.23%
VGG-16 [30]      | Trojaned      | Attack #2                      | Checker       | 99.18% | 99.50% | 0.50%  | 0.82%
VGG-16 [41]      | -             | Attack #3                      | Random        | 98.52% | 95.25% | 4.75%  | 1.48%
VGG-16 [41]      | -             | Attack #3, w/o mask reduction  | Random        | 98.52% | 95.00% | 5.00%  | 1.48%
VGG-16 [41]      | -             | Attack #4                      | Random        | 99.19% | 95.25% | 4.75%  | 0.81%
TABLE I: Effectiveness of SentiNet against known attacks #1-4.
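
The average rates quoted in the contributions (94.26% true positive, 93.96% true negative) appear to correspond to the mean over Attacks #1-#3 with mask refinement and random noise. The short check below reproduces that arithmetic from the Table I values; which rows enter the average is an inference from the numbers, not something the text states.

```python
# (TP, TN) percentages for Attacks #1-#3 with mask refinement and random noise,
# copied from Table I.
rates = {
    "Attack #1 (poisoned)": (85.07, 86.90),
    "Attack #2 (trojaned)": (99.18, 99.75),
    "Attack #3 (adversarial patch)": (98.52, 95.25),
}
mean_tp = sum(tp for tp, _ in rates.values()) / len(rates)
mean_tn = sum(tn for _, tn in rates.values()) / len(rates)
print(f"mean TP = {mean_tp:.4f}%, mean TN = {mean_tn:.4f}%")
```

The means come out at roughly 94.2567% and 93.9667%, so the quoted 94.26% matches after rounding, and the quoted 93.96% matches if the true negative mean is truncated to two decimals rather than rounded.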
Fig. 5: The top row shows the backdoor, i.e., a flower, and the Grad-CAM output for the class “warning-sign” on an adversarial stop sign. The bottom row shows the decision boundary generated for the poisoned network. The blue circles denote benign inputs and the red triangles denote adversarial inputs.

Attack #1: Poisoned Networks—The first attack that we evaluate is a backdoor attack against a poisoned network. While this attack has been studied in the past, the availability of poisoned networks for classification tasks is quite limited. Accordingly, we resort to the poisoned Faster-RCNN object detection network shared by Gu et al. [20], a network that is trained to detect the position of stop signs on the input image. The authors poisoned the training set so that the network will incorrectly classify stop signs with a yellow flower—the backdoor—as a warning sign (see Figure 5).

As Faster-RCNN is an object detection network, we paid particular attention when connecting it to SentiNet. First, the prediction of Faster-RCNN consists of the bounding boxes, classes and confidences of the detected objects. Instead of processing these outputs, we bypass the bounding box prediction layer and connect SentiNet directly to the class likelihood output. In doing so, SentiNet has access to classes and confidence values only. Second, the Faster-RCNN network detects objects that are often a small part of the background scene, in contrast to the cropped images often used for classification tasks. To make our input compatible with the detection network as a classification task, we take the image we intend to classify and place it onto a larger image. As an example, with an image of dimension $w \times h$, we can create a larger blank image of dimension $W \times H$, where $W$ and $H$ are set from $w$ and $h$ using a carefully chosen value $n$. We then place our image at coordinates $(x_0, y_0)$ and input $(x_0, y_0, x_0 + w, y_0 + h)$ as our ROI proposal, i.e., the starting x, starting y, ending x, and ending y coordinates. This larger image is projected into a feature map using spatial pyramid techniques before we feed it through the network with our ground-truth bounding box $(x_0, y_0, x_0 + w, y_0 + h)$ as the ROI input. This technique allows us to leave the performance of the Faster-RCNN unchanged.
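
The canvas-embedding step can be sketched as follows. The exact sizing and placement rule is not recoverable from the text, so this sketch makes its own assumptions: the hypothetical helper `embed_on_canvas` scales both sides by the same integer `n`, centers the image, and reports the ROI with exclusive end coordinates.

```python
def embed_on_canvas(image, n, fill=0.0):
    """Center a 2D image on a blank canvas n times larger; return (canvas, roi)."""
    h, w = len(image), len(image[0])
    big_h, big_w = n * h, n * w
    y0, x0 = (big_h - h) // 2, (big_w - w) // 2
    canvas = [[fill] * big_w for _ in range(big_h)]
    for r in range(h):
        # Overwrite one row segment of the canvas with the image row.
        canvas[y0 + r][x0:x0 + w] = image[r]
    roi = (x0, y0, x0 + w, y0 + h)  # start x, start y, end x, end y
    return canvas, roi

image = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]
canvas, roi = embed_on_canvas(image, n=3)
```

For a 3x2 image and `n=3` the canvas is 9 wide and 6 tall, and the returned ROI marks exactly where the original image sits, so it can be passed to the detector as the ground-truth box.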

Once the classifier is prepared, we create the decision boundary as described in Section III-B. To prepare the data for this task, we collected 145 images of stop signs from the LISA dataset [33] by cropping out the images with their labelled bounding boxes, and placed the backdoor at approximately below the “stop” text of each sign to create an “attack” dataset. The resulting dataset contains 130 images that successfully fool our Faster-RCNN. We also prepare our test set by cropping out 100 images from each class with equal probability. With our prepared dataset, we run SentiNet on all 130 adversarial examples and on all 145 clean examples, collecting the final classifications for each example.

As shown in Figure 5, the clean data points trend towards lower values in comparison to the backdoored inputs. By plotting out enough data points from our clean dataset, we can construct the decision boundary to set the attack threshold. Our TP and TN rates are both 85%, as shown in Table I. As we will see later on in the section, these numbers are not as competitive as other experiments, which is likely due to both the coarser heatmaps generated by the smaller ROI pooling layer and the lower resolution of the images in the LISA dataset—stop signs are often cropped to sizes as small as . Nevertheless, SentiNet still produces detection accuracies comparable to other adversarial detection works (see, e.g., [5]).

Fig. 6: The top row shows the trigger used by Liu et al. [30] and the Grad-CAM output of “A.J. Buckley” on a face of the actress Mary Astor with the trigger. The bottom row shows the decision boundary. The blue circles denote benign inputs and the red triangles denote adversarial inputs.

Attack #2: Trojaned Networks—A trojaning attack generates a trigger from specified neurons in the network, and then fine-tunes the model on the trigger to respond to a specific output. The generated trigger can be created with any shape or size using both convolutional and fully-connected neurons. The trojaned model we use for our target network is the facial recognition model shared by Liu et al. [30] with a VGG-16 architecture [41] trained on the VGG-Face dataset [36]. In the evaluation, we used the square trigger that can be generated with neurons from the FC6 layer of the VGG-16 network. This trigger hijacks the prediction of the VGG-16 network to class 0, i.e., “A.J. Buckley”. We use a separate facial recognition dataset to generate our clean and adversarial datasets. We use 400 examples of the LFW dataset [24], placing the trigger on each of these images. Of the 400 adversarial images, 368 successfully direct the output of the model to class 0. Our test set consists of 100 images from the original VGG-16 dataset [41].

The plot in Figure 6 shows clear separation between normal samples (in blue) and the adversarial samples (in red). We can see that most of the blue examples fail to fool the model with the exception of several masks which produce low scores, while the adversarial examples fool a high number of images while preserving very high scores with the overlaid inert pattern. Thus, after plotting out enough points from our clean dataset, we can subsample the points with the highest confidence scores to create our boundary. This classification rule produces TP and TN rates above 99%, as shown in Table I.

In this task, we can also see the greater importance of our mask reduction technique. The trojan mechanism in particular seems to work with the combination of both the trigger and the original facial regions. This often results in Grad-CAM highlighting regions outside the trigger. In Table I, we can see the clear improvement Mask Reduction brings to the True Positive and True Negative rates of SentiNet.

Fig. 7: The top row shows the adversarial patch by Brown et al. [4] and the Grad-CAM output of “toaster” on a randomly chosen adversarial image. The bottom row shows the decision boundary. The blue circles denote benign inputs and the red triangles denote adversarial inputs.

Attack #3: Adversarial Patches—Adversarial patches are white-box attacks that use gradient perturbations to generate a highly salient patch. We use the patch provided by Brown et al. [4] as our physical attack, which fools Imagenet trained models to classify adversarial images as a “toaster”. In our case, we use a VGG-16 [41] Imagenet-pretrained [10] network as our target network . We choose the size of the patch to occupy around 25% of the input images. We randomly sample 400 images from the Imagenet test set [10] to implement our attack. We place the patch at random positions on the 400 images to generate our adversarial dataset, retrieving 338 images which successfully direct the output of our model to “toaster”. Our test set consists of 100 randomly selected images from the Imagenet training set.

In Figure 7, we observe that the adversarial (red triangle) points are skewed towards higher average-confidence and fooled-percentage values. The classification rule generated by this curve produces a TP rate of 98% and a TN rate of 95%, as seen in Table I. These results compare favorably to other literature focused on adversarial attack detection.

Fig. 8: The top row shows the Grad-CAM output of “toaster” with disjoint adversarial patches and the generated mask. The bottom row shows the decision boundary. The blue circles denote benign inputs and the red triangles denote adversarial inputs.

Attack #4: Multiple Adversarial Patches—As an extension of Attack #3, we perform experiments on images containing multiple adversarial patches, using the same target network and patch from Attack #3. The setup is largely the same as the previous setting, except that two slightly smaller patches are randomly placed on opposing sides of an image. This experiment is mainly performed to highlight SentiNet’s generalizability to detect attacks that are unpredictable in deployment and appearance. Being able to detect disjoint attacks is a great demonstration of the ability of our model to identify attacks without any prior knowledge of the adversary.

Grad-CAM can be used to visualize disjoint regions of an image that contribute towards the same class. As we can see in Figure 8, this applies to adversarial patches, with both patches successfully highlighted in the attack example. We do not have to modify our technique to make SentiNet work in this scenario. We use a random sample of 400 images from the Imagenet test set to create our dataset (we also use 100 images from the Imagenet training set as the test set). With two patches inserted, we are able to produce a higher attack success rate, with 369 detected examples. After running SentiNet on both our adversarial and clean datasets, we find our TP rate to be 99.19% and our TN rate to be 95.25% (see Table I). Additionally, we perform an experiment to measure how often SentiNet is able to successfully detect both patches by checking whether the generated masks are disjoint and cover substantial portions of both patches. We find that SentiNet discovers both patches 97.4% of the time, which we consider to be sufficiently accurate relative to the TP and TN values.
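A check of whether a generated mask covers both patches could look roughly like the following sketch. It is an illustration under stated assumptions rather than the evaluation code: the relative threshold of 0.15, the 50% coverage requirement, and the names `heatmap_to_mask`, `patch_coverage`, and `both_patches_found` are all hypothetical, and the sketch tests coverage only, not whether the mask components are disjoint.

```python
import numpy as np


def heatmap_to_mask(heatmap, rel_threshold=0.15):
    """Binarize a heatmap: keep pixels at or above rel_threshold * max.

    The 0.15 default is an assumed value for illustration.
    """
    peak = heatmap.max()
    if peak <= 0:
        return np.zeros(heatmap.shape, dtype=bool)
    return heatmap >= rel_threshold * peak


def patch_coverage(mask, box):
    """Fraction of the patch box (x0, y0, x1, y1) covered by the mask."""
    x0, y0, x1, y1 = box
    region = mask[y0:y1, x0:x1]
    return float(region.mean()) if region.size else 0.0


def both_patches_found(heatmap, box_a, box_b, min_coverage=0.5):
    """True if the thresholded heatmap covers enough of both patch boxes."""
    mask = heatmap_to_mask(heatmap)
    return (patch_coverage(mask, box_a) >= min_coverage
            and patch_coverage(mask, box_b) >= min_coverage)


# Usage: a toy heatmap with two hot regions in opposite corners.
heat = np.zeros((10, 10))
heat[0:3, 0:3] = 1.0
heat[7:10, 7:10] = 0.8
found = both_patches_found(heat, (0, 0, 3, 3), (7, 7, 10, 10))
```

Averaging `found` over all successful adversarial images would give a detection frequency of the kind reported above, under whatever coverage criterion is chosen.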

IV-C Adaptive Attacks

Our previous analysis demonstrates how SentiNet detects multiple categories of attacks without having prior knowledge of the adversary or having to modify the defense. A threat is still posed by “adaptive attacks”, which are targeted and bypass specific defenses. This section considers an attacker aware of the presence of SentiNet and its mechanisms, whereupon the attack is adapted to avoid detection by SentiNet. We probe our defense from the perspective of an actual attacker to measure the robustness of our technique. We attempt to compromise the three distinct components of our defense: the heatmap proposal (Section IV-C1), the class proposals (Section IV-C2), and the attack classification (Section IV-C3). We will analyze the robustness of each component from the perspective of a potential attacker.

IV-C1 Attacking Region Proposals

Our defense is reliant on successfully localizing the adversarial region in an image. In our current framework, this is done using the Grad-CAM algorithm, which generates a heatmap of the salient regions leading to a classification. If an attack can disrupt the Grad-CAM mechanism and avoid successful detection and localization, the subsequent components of the pipeline will fail. The Grad-CAM mechanism uses network back-propagation to measure region importance. This means our mechanism is differentiable, which theoretically means we can modify our heatmap outputs using targeted gradient perturbations. However, our experiments show that Grad-CAM is robust against adversarial attacks in our defense context, which strongly suggests that Grad-CAM is capturing the inherent saliency of a region and cannot be easily manipulated.
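For reference, the standard Grad-CAM computation can be written in a few lines once the activations of a convolutional layer and the gradients of the class score with respect to those activations are available. The sketch below uses plain NumPy and assumes these two arrays have already been extracted from the network by some framework; the function name `grad_cam` and the array layout (channels first) are illustrative choices.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Grad-CAM heatmap for one convolutional layer.

    activations: array of shape (K, H, W), the layer's feature maps.
    gradients:   array of shape (K, H, W), d(class score)/d(activations).
    Returns an (H, W) non-negative heatmap.
    """
    # Channel importance: global average of the gradients per feature map.
    weights = gradients.mean(axis=(1, 2))                # shape (K,)
    # Weighted sum of feature maps, then ReLU to keep positive evidence.
    cam = np.tensordot(weights, activations, axes=1)     # shape (H, W)
    return np.maximum(cam, 0.0)


# Usage: two 2 x 2 feature maps with constant gradients per channel.
acts = np.stack([np.ones((2, 2)), np.eye(2)])
grads = np.stack([np.full((2, 2), 0.5), np.full((2, 2), -1.0)])
heatmap = grad_cam(acts, grads)
```

Because every step is a differentiable operation on the activations and gradients, the heatmap itself is differentiable with respect to the input, which is the property the adaptive attacks in this section try to exploit.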

Attack #5: Perturbing Grad-CAM—We first show that perturbational noise can target specific Grad-CAM outputs. The Grad-CAM function is differentiable, and we can optimize an input on this function given a target class. We can use a standard Stochastic Gradient Descent Optimizer (SGD) on a VGG-16 network to minimize a loss function calculated as the total difference between the current Grad-CAM output and the target Grad-CAM output, and iteratively add noise until our loss converges.

Figure 9 shows an image of a dog overlaid by an adversarial patch, and the resulting Grad-CAM heatmap on target class “toaster”. We start from random noise, which does not have any salient regions for the “toaster” class, and optimize the input on our loss function. We demonstrate in Figure 9 that the heatmap output of the generated noise at convergence is visually identical to the original heatmap. This shows that Grad-CAM outputs can be precisely manipulated through gradient optimization. However, to mount such an attack, the attacker is required to add noise to the entire image, which may not be feasible.

Fig. 9: We calculate Grad-CAM for label “toaster” on each of the inputs. The first row shows the Grad-CAM output for an adversarial patch overlaid on an image of a dog. The second row demonstrates that we can reproduce the Grad-CAM output using gradient perturbations (Attack #5). The third row shows that producing a similar heatmap is still possible if the patch is located near the targeted heatmap (Attack #6). However, the fourth row shows that we are unable to affect the Grad-CAM output directly if we are not allowed to add noise on the targeted Grad-CAM location (Attack #6).

Attack #6: Heatmap Misdirection—A potential behavior an attacker might attempt to generate is heatmap region misdirection, where the heatmap proposes a region that does not cover the adversarial region to either increase the region captured or avoid detection altogether. We demonstrated earlier that this is trivially possible if the attacker is allowed to add perturbational noise to an entire image. However, in our setting, the attacker cannot add noise beyond the region of the localized attack, and therefore Grad-CAM perturbations must also be constrained to the adversarial region. Therefore, the threat we want to consider is that an attacker can add noise in one region of an image that increases the Grad-CAM output value in a disjoint region. We again consider the target heatmap of our adversarial patch in Figure 9. We first show in Figure 9 that if the noise region overlaps the Grad-CAM location we want to modify, we will be able to modify the heatmap successfully. We also show in Figure 9 that if the noise region is disjoint from the target Grad-CAM region, our Grad-CAM optimization fails to achieve either visual similarity or equivalent final convergence loss. These experiments show that localized noise can only affect the corresponding Grad-CAM region, which strongly suggests that a misdirection attack is not possible.
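The constraint that noise may only be added inside the adversarial region is commonly imposed by masking the gradient update. The sketch below illustrates only that mechanism. It uses a stand-in quadratic loss instead of the Grad-CAM difference loss, so it does not reproduce the experiment, and the names `masked_sgd`, `grad_fn`, and `region_mask` are hypothetical.

```python
import numpy as np


def masked_sgd(x, grad_fn, region_mask, lr=0.1, steps=100):
    """Gradient descent that only updates pixels where region_mask is 1.

    grad_fn(x) must return d(loss)/dx with the same shape as x.
    """
    x = x.astype(float).copy()
    for _ in range(steps):
        # Zero out the update everywhere outside the allowed region.
        x -= lr * grad_fn(x) * region_mask
    return x


# Usage with a stand-in loss sum((x - target)^2); an attacker would
# substitute the gradient of a Grad-CAM based loss here.
target = np.ones((4, 4))
mask = np.zeros((4, 4))
mask[:2, :2] = 1.0                      # the "patch" region
start = np.zeros((4, 4))
result = masked_sgd(start, lambda x: 2.0 * (x - target), mask)
# Pixels inside the patch move towards the target; the rest stay fixed.
```

With a Grad-CAM loss in place of the quadratic one, the same loop expresses the threat model of Attack #6: the optimizer is free to change the patch but cannot touch the region whose heatmap value it is trying to raise.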

Attack #7: Heatmap Minimization—If heatmap misdirection is not possible with localized noise, another option the attacker can consider is to minimize the corresponding Grad-CAM region to the greatest extent possible to avoid detection. We can modify our loss function as the value of the Grad-CAM region to optimize for minimal Grad-CAM output. We start from the adversarial patch image in Figure 10 and iteratively add perturbational noise to the region. In Figure 10, we show that as our loss converges, the Grad-CAM output is successfully minimized, avoiding detection. We can also see in Figure 10 that as more noise is added, the success rate of our attack is reduced. This pattern suggests that Grad-CAM is capturing some inherent saliency of the region and cannot be minimized without reducing the attack effectiveness.

Fig. 10: The top row shows a new adversarial patch that can minimize the Grad-CAM output of label “toaster” on the location of the patch. However, the bottom plot shows that as the Grad-CAM output sum decreases, the attack success drops correspondingly.

However, we would also expect the effectiveness of the attack to drop with the addition of any arbitrary noise, and this observation does not rule out the possibility of perturbational noise that can optimize both targeted misclassification and Grad-CAM minimization. Our next experiments demonstrate how an attacker might attempt to jointly generate a patch for both criteria by performing both optimization functions during every iteration. We use a range of from 4.0 to 1.0 with intervals of 0.025 and set the learning rates of the Grad-CAM minimization and the misclassification as and respectively, as seen in Figure 11. This adjusts the level of prioritization we set for each optimization task. After we generate 40 patches for each value, we plot the average of both the percentage of successful misclassifications on the Imagenet [10] test set from Attack #3 and the percentage of patches with overlap between the adversarial pixels and the Grad-CAM region in Figure 11. Our plot shows that there is an inverse relation between how well the patch fools test images and how well the patch is hidden, corroborating our previous findings. We further plot the patches that successfully fulfill both criteria in Figure 11, and find that the optimum value is where the ratios of successful misclassifications and hidden patches overlap, giving us patches that fulfill both criteria at a success rate of at most 10%. This strongly suggests that minimizing the Grad-CAM output directly weakens the attack effectiveness, and that it is difficult to jointly optimize for both objectives.

Fig. 11: In the top row, the left and right patches are generated to maximize Grad-CAM minimization and Attack success respectively. We can see in the bottom row plot that it is difficult to optimize for both objectives, and that patches that successfully fulfill both objectives can achieve at most 10% effectiveness in both bypassing Grad-CAM and attacking successfully.

We can conclude that the redirection task for Grad-CAM is infeasible for localized patches, and that minimizing Grad-CAM is incompatible with the misclassification objective. Therefore, Grad-CAM is reasonably resistant to adaptive attacks and is a robust choice for the region proposal task.

IV-C2 Class Proposal

Our class proposal module uses selective search [45] and a proposal network modified from the original network with an ROI pooling layer [14]. Selective search is a traditional image processing algorithm that uses a graph-based approach to segment an image based on color, shapes, textures, and size. There is no gradient component for an attacker to perturb, or a training procedure to poison, which severely limits an attacker’s mechanisms of attack compared to a network-generated proposal mechanism as seen in Faster-RCNN [37]. Our selective search algorithm is also designed to capture class proposals other than the adversarial class, and the attacker will be unable to affect the selective search results outside of the adversarial region. Furthermore, because our proposal network uses the original network weights, there is no way to cause different behaviors between the original and proposal networks. Finally, the attacker will have limited motivation to attack the class proposal process of our network, as a successful attack will damage the accuracy of the attack detection rather than break the entire process. We can conclude that our class proposal mechanism is robust due to the properties of the individual components that are collectively resistant to perturbational or poisoning attacks.

IV-C3 Attack Classification

We consider the decision procedure of our method by analyzing our attack classification robustness. Our classification procedure is not trained with gradient descent techniques, which removes the possibility of using gradient perturbations to fool the classification. Our thresholding is based on data points with two dimensions collected from a confidential dataset, the fooled-percentage and the average confidence. The average-confidence is calculated with a pattern .

Attack #8: Classification—If an adversary is able to manipulate the model to respond to an inert pattern with strong confidence, they can produce similar outputs between benign and adversarial inputs, bypassing our defense. We show we can keep the pattern secret by demonstrating how arbitrary patterns still produce similar levels of accuracy, using the standard random noise pattern and a new checker pattern as shown in Figure 12. In Table I, we can see that for Attack #2, the TP and TN rates of the random noise pattern and checker pattern are within . Also, the defense will always be able to find an inert pattern by using gradient descent to minimize response confidence for all classes. This component of SentiNet is secure as long as the pattern is kept secret.

Fig. 12: Inert Patterns: The default pattern we use for is random noise shown on the left. Another pattern we can potentially use is the checkered pattern on the right which the VGG-Face network also responds weakly to.
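The two kinds of inert pattern mentioned above, and the act of overlaying one on a masked region, can be illustrated as follows. This is a sketch only: the function names, the 8-pixel checker cell, the fixed random seed, and the 8-bit RGB layout are assumptions for illustration, and whether a given pattern is actually inert for a given network has to be verified separately.

```python
import numpy as np


def random_noise_pattern(height, width, channels=3, seed=0):
    """Uniform random 8-bit noise; the seed is an arbitrary choice."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, 256, size=(height, width, channels), dtype=np.uint8)


def checker_pattern(height, width, channels=3, cell=8):
    """Black and white checkerboard with square cells of `cell` pixels."""
    rows = np.arange(height)[:, None] // cell
    cols = np.arange(width)[None, :] // cell
    board = ((rows + cols) % 2).astype(np.uint8) * 255
    return np.repeat(board[:, :, None], channels, axis=2)


def overlay_pattern(image, mask, pattern):
    """Return a copy of `image` with masked pixels replaced by `pattern`.

    mask is a boolean (H, W) array; image and pattern are (H, W, C).
    """
    out = image.copy()
    out[mask] = pattern[mask]
    return out


# Usage: cover the top-right quadrant of a black image with the checker.
image = np.zeros((16, 16, 3), dtype=np.uint8)
mask = np.zeros((16, 16), dtype=bool)
mask[0:8, 8:16] = True
covered = overlay_pattern(image, mask, checker_pattern(16, 16))
```

Swapping `checker_pattern` for `random_noise_pattern` changes only the fill, which is what allows a defender to rotate patterns without touching the rest of the pipeline.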

Attack #9: Patterns Targeting Different Classes—An attacker can place multiple patches into the image targeting different classes. Our defense will capture one of the patches, leaving the other patch to pass through undetected. We can modify our defense to run repeatedly until the image is no longer classified as adversarial, at the cost of a linear increase in runtime for each additional patch.
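The iterative variant of the defense amounts to a simple loop. The sketch below is schematic: `detect` and `suppress` are hypothetical callables standing in for a full detection pass and for masking the detected region, and `max_rounds` is an assumed safety bound that the text does not specify.

```python
def iterative_defense(image, detect, suppress, max_rounds=5):
    """Repeat detection until no adversarial region is reported.

    detect(image)         -> a mask for one adversarial region, or None
    suppress(image, mask) -> the image with that region neutralized
    Returns the cleaned image and the list of masks found, one per round.
    """
    masks = []
    for _ in range(max_rounds):
        mask = detect(image)
        if mask is None:
            break
        masks.append(mask)
        image = suppress(image, mask)
    return image, masks


# Usage with a toy stand-in: the "image" is a set of patch labels and
# each detection round finds and removes exactly one of them.
toy_image = frozenset({"patch_a", "patch_b"})
toy_detect = lambda img: min(img) if img else None
toy_suppress = lambda img, m: img - {m}
cleaned, found = iterative_defense(toy_image, toy_detect, toy_suppress)
```

Each round costs one full detection pass, so the runtime grows linearly with the number of patches that are found.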

Fig. 13: Size analysis of Adversarial Patches. The legend denotes the ratio of the image to the input, and we can see that the drops as the size increases. On the right, we plot the attack success rate after increasing the transparency values, and we can see that with large () patches the input drops below the decision boundary while still retaining a 90% attack success rate.

Attack #10: Size Attack—If the attacker uses a large enough patch, the average confidence on will be lowered, which reduces the effectiveness of our defense. We can see in Figure 13 that for adversarial patches, the attack’s drops as the size of the patch increases. By increasing the transparency of the patch, we can drop the attack below the threshold while retaining very high attack success. However, we argue that this is an unavoidable aspect of localized attacks. Brown et al. [4] note that overlaying an image of an actual toaster will create the same behavior as an adversarial patch at a large enough size. This raises interesting questions about what actually constitutes an “attack”, although for now we can conclude our defense captures small patches that abnormally affect the classification results to an extent greater than expected for natural images.

IV-D Performance

Operation Runtime (s)
Selective Search 2.25
Forward Pass 2.5
Class Proposal 2.5
Grad-CAM 0.35
Sequential Total 23.3
Parallelized Total 7.6
TABLE II: Runtime Analysis of SentiNet. Wallclock times of each individual component and of the total sequential and parallelized time are shown.

We evaluate the overhead of our defense in terms of rounded wallclock time in Table II, using a VGG-16 [41] architecture as a case study. In general, a forward pass through the network takes 2.5 seconds, regardless of batch size, while the selective search and Grad-CAM each take 2.25 and 0.35 seconds respectively. If we use a sequential approach to implement SentiNet, we will perform a forward pass to get the initial prediction on input , a selective search on , class proposal on , three Grad-CAMs on the highest class proposals, and three iterations of two batched computations to get and to get a total runtime of 23.3 seconds. However, many steps can be computed in parallel given enough compute. In general, the original prediction and class proposals can be computed in parallel, as well as the three Grad-CAM computations and the final 6 batched computations. This cuts the total time down to 7.6 seconds, which is 3x as large as the original prediction time of 2.5 seconds.
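The totals in Table II can be recovered from the per-component times with a short calculation. The decomposition below is one reading of the paragraph above, stated as an assumption: each batched computation costs one forward pass (2.5 s regardless of batch size), selective search must finish before class proposal can start, and in the parallel case the three Grad-CAMs run together, as do the six batched computations.

```python
# Per-component wallclock times from Table II, in seconds.
FORWARD = 2.5
SELECTIVE_SEARCH = 2.25
CLASS_PROPOSAL = 2.5
GRAD_CAM = 0.35

NUM_PROPOSALS = 3          # Grad-CAM runs on the three highest class proposals
BATCHES_PER_PROPOSAL = 2   # two batched computations per proposal


def sequential_total():
    """Every step runs one after another."""
    return (FORWARD + SELECTIVE_SEARCH + CLASS_PROPOSAL
            + NUM_PROPOSALS * GRAD_CAM
            + NUM_PROPOSALS * BATCHES_PER_PROPOSAL * FORWARD)


def parallel_total():
    """Independent steps overlap; each stage costs its slowest branch."""
    stage_predict = max(FORWARD, SELECTIVE_SEARCH + CLASS_PROPOSAL)
    stage_grad_cam = GRAD_CAM     # three Grad-CAMs at once
    stage_batches = FORWARD       # six batched computations at once
    return stage_predict + stage_grad_cam + stage_batches


print(round(sequential_total(), 2))          # 23.3
print(round(parallel_total(), 2))            # 7.6
print(round(parallel_total() / FORWARD, 2))  # 3.04
```

Under these assumptions the sketch reproduces both reported totals, and the ratio of about 3.04 matches the stated 3x overhead relative to a single prediction.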

V Discussion

Our results show that our defense is able to record high accuracy metrics for detecting adversarial images in all three cases and is robust to strong adaptive adversaries. We now cover in more depth the strengths and limitations of our approach, highlighting some unusual aspects of our design.

V-A Strengths of our Defense

Proportional Defense—The fundamental strength of SentiNet is that it relies on the fact our model is compromised by an attack. Therefore, our detection framework is unaffected by the mechanism or deployment the attack uses, detecting attacks successfully as long as they fool the model. In fact, our framework is better at detecting attacks when the adversary is stronger. Very powerful physical attacks are able to consistently fool the model while minimizing the size needed or the obfuscation of important parts of the image in order to avoid detection. The properties that characterize a strong physical attack make it easier for SentiNet to detect, as such an attack would easily result in outlier behavior outside the threshold of an approximated curve. In real-world conditions, attacks have to be even more robust as they need to tolerate different lighting and viewpoint variations. Real-world deployments of neural networks could potentially represent the most potent deployment scenario for SentiNet.

Detection of Unsuccessful Attacks—SentiNet can also be further extended for additional functionality. With the adversarial input detection and class proposal, SentiNet can also analyze the second or subsequent proposed classes, raising the possibility of using SentiNet to detect unsuccessful attempted attacks. This is a useful attribute which could help deter attackers from probing and testing our defense with experimental attacks.

Run-time Adversarial Object Suppression—Furthermore, the masking functionality can be used to still preserve functionality by reporting the output label with the inert pattern. Unlike other detection frameworks which simply identify when an attack has taken place, the calculated mask of SentiNet can easily be used to remove the physical attack and salvage as much of the rest of the data as possible.

V-B Limitations of our Defense

Large Adversarial Objects—In general, adversaries can override an image prediction if they are allowed to modify the majority of pixels in an image. SentiNet makes the assumption that adversarial objects are small. We show in Section IV that large enough patches can easily bypass the decision boundary of our model. However, we note once more that Brown et al. [4] demonstrate how large images of toasters will hijack the prediction of classification models, which is expected behavior even for a human classifier. In general, if an attacker can modify a majority of an image, it is trivial to influence the output by obfuscating the input with the targeted class. Our focus with SentiNet is to capture adversarial attacks that are unreasonably salient, designed to be small and unnoticeable, and that would not fool humans. By this measure, SentiNet largely succeeds at detecting abnormal regions of images that deviate from patches of natural images.

Overhead—The runtime of our approach is not insignificant at 3x the original computation time. Each inference requires a constant overhead, raising the inference time from milliseconds to several seconds. This lag for real-time detection scenarios is perhaps acceptable for tasks like facial recognition, but could prove impractical in scenarios including autonomous driving. However, the computational cost is even more significant, as instead of performing one forward pass through our network, SentiNet performs hundreds of batched inferences. In non real-time situations, it is common for inference to be performed in batches. This would be incompatible with the current SentiNet setup, as for each input SentiNet will require 100 additional inputs during each fuzzing phase. Performing parallel analysis is theoretically possible, but will require 100x the amount of memory a regular inference scheme is currently using. It is worth noting that many other detection schemes introduce significant overhead as well, and a small benefit of the SentiNet approach is its plug-and-play compatibility, only requiring a pre-computed run on a small subset of clean samples.

VI Related Work

We now review works closely related to this paper. First, we explore the domain of attacks against neural networks. Then, we expand on prior works on physical attacks. Finally, we review proposed approaches to detect attacks against neural networks.

Neural Network Attacks—The literature on adversarial attacks on neural networks is vast and still growing. Szegedy et al. [43] first demonstrated how adversarial noise can be used to fool neural network classifiers by adding small gradient perturbations to an image that are imperceptible to humans. Numerous works have built on this approach; some notable works include the Fast Gradient Sign Method [18], DeepFool [35], and Universal Adversarial Perturbations [34]. This area of research can be characterized as a cat-and-mouse game in recent years, where defenses are created for new attacks that bypass previous defenses [5, 6]. Additionally, adversarial attacks certainly are not limited to gradient-perturbation based techniques; data poisoning can be used to cause unintended model behaviors [39], and compromised hardware can also be used to insert trojans during the network inference procedure [9]. Akhtar et al. [1] provide a useful survey about the current state of the adversarial deep learning field.

Physical Attacks—Multiple works have demonstrated physical attacks within a variety of classification settings and attacker capabilities. Adversarial patches [4] are possibly the most well-known physical attack. These attacks are generated by performing back-propagation on the target class to calculate gradient noise localized to a region of the image. A related attack, Localized and Visible Adversarial Noise [26], operates under a similar principle with smaller but less robust attacks. Robust Physical-World Attacks on Deep Learning Models [11] demonstrates how adversarial perturbations can be disguised as graffiti stickers to fool traffic sign classifiers. Similarly, [40] uses perturbations placed on glasses to fool facial classification models. Trojaned Neural Networks [30] perform back-propagation on specifically chosen neurons in the network rather than the target class. The generated triggers are used to “trojan” the model by performing slight fine-tuning to guide the trigger outputs towards a specified class, making sure that the triggers only cause misclassifications on trojaned models. BadNets [20] targets traffic sign detection models by inserting pre-chosen patterns into images with the target label, poisoning the data before the training process.

Adversarial Attack Detection—Detection techniques for adversarial attacks as a defensive measure have been proposed by many researchers. Safetynets [31] is designed to detect adversarial-noise based attacks and exploits the different activations adversarial perturbations produce to train an SVM classifier. Metzen et al. [22] use a similar approach by training a modified target classification network to detect adversarial perturbations. Feinman et al. [13] also train a classifier to detect adversarial perturbational inputs based on the neural network features, while Gong et al. [17] introduce a classifier trained to detect adversarially-perturbed images. Magnet [32] trains a classifier on manifolds of normal examples to detect adversarial perturbations without prior knowledge of the attack. There are also some works aimed at creating defenses that do not require training. Grosse et al. [19] use statistical techniques to distinguish adversarial-perturbation outputs, while Hendrycks et al. [23] use PCA to visualize differences in perturbed images. A survey by Yuan et al. [47] covers further detection defenses. All these works are only aimed at defending against adversarial perturbations, whereas SentiNet can defend a network against other types of attacks, i.e., data poisoning and trojaning attacks.

VII Conclusion

In this work, we introduce SentiNet, an attack-agnostic framework for detecting physical attacks on Convolutional Neural Networks. Our method is notable because it only relies on the malicious behavior of an adversarial attack to perform classifications, without requiring prior knowledge of the deployment or mechanisms of an attack. We demonstrate the effectiveness of SentiNet on three experiments with fundamentally different attack mechanisms: a data poisoning attack, a network trojaning attack, and a white-box adversarial attack. We also demonstrate the robustness of SentiNet against strong adaptive adversaries by individually testing each component of our defense. Our approach can be run in real-time in many scenarios, and is flexible and easy to deploy.

There are further improvements that would help improve the performance of SentiNet. Better visualization techniques (see, e.g., Grad-CAM++ [8]) would improve the heatmap quality to create better masks. Deep learning interpretability and visualization is a challenging problem and breakthroughs in this area can also allow us to further reason about whether an attack is taking place. Furthermore, our anomaly detection approach can be further extended to take advantage of the richer data provided by the neuron outputs during the inference process. Works have shown techniques to detect anomalous high dimensional data using one class neural networks (see, e.g., [16, 7]) which could enhance our current framework.

We hope SentiNet inspires further approaches towards creating attack-agnostic defenses. Tailoring a defense towards a specific attack means unknown attacks cannot be captured, and also makes the system highly vulnerable to strong adaptive adversaries. We believe a similar approach can be used to detect other adversarial attacks by leveraging the same core concepts of identifying an attack from a model’s weakness.


This work was partially supported by NSF, ONR, the Simons Foundation, a Google faculty fellowship, the Swiss National Science Foundation (SNSF project P1SKP2_178149), and the German Federal Ministry of Education and Research (BMBF) through funding for the CISPA-Stanford Center for Cybersecurity (FKZ: 13N1S0762).