Adversarial Examples in Deep Learning: Characterization and Divergence

06/29/2018
by   Wenqi Wei, et al.

The burgeoning success of deep learning has raised security and privacy concerns as more and more tasks involve sensitive data. Adversarial attacks in deep learning have emerged as one of the dominant security threats to a range of mission-critical deep learning systems and applications. This paper takes a holistic and principled approach to the statistical characterization of adversarial examples in deep learning. We provide a general formulation of adversarial examples and elaborate on the basic principles of adversarial attack algorithm design. We introduce an easy and hard categorization of adversarial attacks to analyze the effectiveness of adversarial examples in terms of attack success rate, degree of change in adversarial perturbation, average entropy of prediction qualities, and fraction of adversarial examples that lead to successful attacks. We conduct an extensive experimental study of adversarial behavior in easy and hard attacks under deep learning models with different hyperparameters and different deep learning frameworks. We show that the same adversarial attack behaves differently under different hyperparameters and across different frameworks because different features are learned under different model training processes. Our statistical characterization, backed by strong empirical evidence, offers insight into mitigation strategies and effective countermeasures against present and future adversarial attacks.


1. Introduction

Deep learning has achieved impressive success in a wide range of domains such as computer vision (krizhevsky2012imagenet, ) and natural language processing (collobert2008unified, ), outperforming other machine learning approaches. Many deep learning tasks, such as face recognition (taigman2014deepface, ), self-driving cars (kim2017end, ), speech recognition (hinton2012deep, ) and malware detection (dahl2013large, ), are security-critical (barreno2006can, ; papernot2016towards, ).

Recent studies have shown that deep learning models are vulnerable to adversarial input at the prediction phase (szegedy2013intriguing, ). Adversarial examples are input artifacts created from natural data by adding adversarial distortions. The purpose of adding such adversarial noise is to covertly fool a deep learning model into misclassifying the input. For instance, attackers could use adversarial examples to confuse a face recognition authentication camera or a voice recognition system and breach a financial or government entity with misplaced authorization (sharif2016accessorize, ). Similarly, a self-driving vehicle could take unexpected action if the vision camera recognizes a stop sign, crafted by adversarial perturbation, as a speed limit sign, or if the voice instruction receiver misinterprets a compromised stop instruction as a drive-through instruction (carlini2016hidden, ).

The threat of adversarial examples has inspired a sizable body of research on various attack algorithms (madry2017towards, ; szegedy2013intriguing, ; kurakin2016adversarial, ; papernot2016limitations, ; evtimov2017robust, ; goodfellow2014explaining, ; carlini2017towards, ; sharif2016accessorize, ; carlini2016hidden, ; kurakin2016physical, ; xie2017adversarial, ; moosavi2016deepfool, ; elsayed2018adversarial, ; nguyen2015deep, ; ahmed2017poster, ). Even with only black-box access to the prediction API of a deep learning service, such as those provided by Amazon, Google or IBM, one could launch adversarial attacks against the privately trained deep neural network (DNN) model. Due to transferability (tramer2017space, ; liu2016delving, ; papernot2016transferability, ; papernot2017practical, ), adversarial examples generated from one deep learning model can be transferred to fool other deep learning models. Because deep learning is complex, there are many hidden spots that are not yet understood, and such blind spots can be used as attack surfaces for generating adversarial examples. Furthermore, adversarial attacks can happen in both the training and prediction phases. Typical training phase attacks inject adversarial training data into the original training data to mis-train the network model (huang2011adversarial, ). Most existing adversarial attacks are mounted at the prediction phase, which is our focus in this paper.

To develop effective mitigation strategies against adversarial attacks, we argue that the important first step is to gain an in-depth understanding of the adverse effect and divergence of adversarial examples on deep learning systems. In this paper, we take a holistic and principled approach to characterize adversarial attacks as adversarial learning of the input data and as a constrained optimization problem. We examine the general formulations of adversarial examples and establish basic principles for adversarial noise injection. We categorize adversarial examples into easy and hard attacks based on statistical measures such as success rate, degree of change and prediction entropy, and analyze how easy and hard cases behave under different hyperparameters, i.e., training epochs, sizes of feature maps and DNN frameworks. Moreover, we visualize the construction of adversarial examples and characterize their spatial and statistical features. We present empirical evidence on the effectiveness of adversarial attacks through extensive experiments. Our principled and statistical characterization of adversarial noise injection, of the effectiveness of adversarial examples in easy and hard attack cases, and of the impact of the DNN on adversarial examples can inform the design of mitigation strategies and defense mechanisms against present and future adversarial attacks.

2. Adversarial Examples and Attacks

We first review the basic concept of DNN model and the threat model. Then we provide a general formulation of adversarial examples and attacks, describe the metrics for quantifying adversarial objectives and basic principle of adversarial perturbation.

2.1. DNN Model

Let $x$ be an input example in the training dataset. A DNN model is made of successive layers of neurons from the input data to the output prediction. Each layer represents a parametric function, which computes an activation function, e.g., ReLU, of a weighted representation of the input from the previous layer to generate a new representation. The parameter set of a layer consists of several weight vectors and bias vectors, indexed by the neurons in that layer. Let the class label space contain $m$ labels, and let the classification prediction result for input $x$ take the form of an $m$-dimensional probability vector, whose $j$-th entry is the probability generated by the DNN model toward class $j$ and whose entries sum to 1. The predicted label is the class with the largest probability in the prediction vector. For ease of presentation, we assume that if multiple class labels share the same maximal probability, only one predicted class label is chosen for a given example $x$. The common output layer is a softmax layer. In this paper the prediction vector is also called the logits, and we call the input of the softmax layer the prelogits.
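The mapping from prelogits to a prediction vector and a predicted label can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names and the three-class prelogit values are hypothetical, and ties are broken by taking the first maximal class, which is one way to satisfy the single-label assumption above.

```python
import math

def softmax(prelogits):
    """Turn prelogits into a probability vector that sums to 1."""
    shift = max(prelogits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in prelogits]
    total = sum(exps)
    return [e / total for e in exps]

def predicted_label(prediction):
    """Index of the largest probability; the first one wins on ties."""
    return max(range(len(prediction)), key=lambda j: prediction[j])

# Hypothetical prelogits for a model with m = 3 classes.
prelogits = [1.0, 3.0, 0.5]
prediction = softmax(prelogits)
label = predicted_label(prediction)
```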

The deployment of a DNN model consists of two phases: a model training phase and a model-based prediction phase. In the training phase, a set of known input-label pairs is used to train a DNN model, where each label is the known (ground truth) class label of its input. The DNN first uses its existing parameters to generate a classification from the input data in a forward pass, then computes a loss function that measures the distance between the prediction vector and the ground truth label. With the goal of minimizing the loss function, the DNN training algorithm iteratively updates the parameters at each layer using backpropagation with an optimizer, e.g., stochastic gradient descent (SGD). The trained DNN is then refined through testing. Two metrics are used to measure the difference between the predicted vector and the ground truth label of a test input. One is accuracy, the percentage of test inputs whose predicted class is identical to the ground truth label. The other is the loss function, which computes the distance between the predicted class vector and the label. The trained DNN model produced at the end of the training phase is used for classification prediction. In the prediction phase, the prediction API sends an input to the trained DNN model, which computes the prediction vector and the corresponding predicted label using the fixed parameters learned in the training phase.

2.2. Threat Model

Insider Threat: An insider threat refers to white-box attack and compromises from inside the organization that provides the DNN model based prediction service, such as poisoning attacks during training phase. Insider adversaries may know the DNN architecture, the intermediate results during the DNN computation, and are able to fully or partially manipulate the DNN model training process, e.g., injecting adversarial samples into the training dataset, manipulating the training outcome by controlling the inputs and outputs in some layers of DNN.

Outsider Threat: An outsider threat refers to black-box attack and compromises from external to a DNN model. Such attackers can only access the prediction API of the DNN as a service but do not have access to the DNN training and the trained DNN prediction model. However, attackers may have general and common background knowledge of DNN that is publicly available. We consider two types of outsider attacks: untargeted or targeted.

Untargeted Attack is a source class misclassification attack. It adds adversarial perturbation to a benign input so that the predicted class of the input changes to some other class in the label space, without a specific target class.

Targeted Attack is a source-target misclassification attack. It purposely crafts the benign input via adversarial perturbation so that the input is classified as a targeted class in the label space. As a result, the predicted class of the input is covertly changed from the original class (source) to the specific target class intended by the attacker.

Figure 1 illustrates the workflow of an adversarial example based outsider attack on the prediction API of a deep learning service provider, in which the adversary turns a benign input into a maliciously crafted sample. The workflow consists of 7 steps. (1) A benign input is sent to the prediction API, which (2) invokes the trained model to compute the prediction result. (3) When the API returns the prediction probability vector and the predicted class label, (4) the attacker may intercept the result and (5) launch an adversarial example based attack by generating an adversarial example (in one step or iteratively). (6) The adversary collects the prediction vector and the predicted class of the adversarial example. The iteration stops if the attack is successful. (7) The user receives the incorrect result.

Figure 1. Outsider Adversarial Attack Workflow

2.3. Formulation of Adversarial Examples

For adversaries, an ideal adversarial attack constructs a perturbed input with minimal distance to the benign input and fools the DNN model, so that the predicted label of the perturbed input differs from that of the benign input, while a human user would still visually consider the adversarial input similar to or the same as the benign input and thus believe that the two predicted labels should be identical. A distance metric measures how far the adversarial input is from the original input. Example distance metrics are the $L_0$, $L_2$, and $L_\infty$ norms. For image input, the $L_0$ norm describes the number of pixels that are changed; the $L_2$ norm is the Euclidean distance between the benign input and the adversarial input; and the $L_\infty$ norm denotes the maximum change to any pixel of the input. This distance can be seen as the perturbation noise injected into the benign input by the adversarial attack algorithm to construct an adversarial input. Below we provide a general formulation of an adversarial example based attack:

(1)
(2)

where the formulation carries a flag that distinguishes untargeted from targeted attacks: the flag is -1 when the attack is untargeted and 1 when the attack is targeted. The objective function is the quantity the attack seeks to optimize. A weighting parameter controls the relative importance of the perturbation and the objective function. The constraint requires that the human-perceived class of the adversarial input be the same as that of the original benign input.
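One way to write a formulation that matches this description is sketched here. It is a reading under stated assumptions, not the paper's exact Equations 1 and 2: the symbols $x$, $x_{adv}$, $T$, $\beta$, $G$, $l$ and $h$ are chosen for illustration, and placing the weight $\beta$ on the objective term rather than on the distance term is an assumption.

```latex
\begin{aligned}
\min_{x_{adv}} \quad & \mathrm{dist}(x, x_{adv}) + \beta \cdot T \cdot G(x_{adv}, l) \\
\text{s.t.} \quad & h(x_{adv}) = h(x),
\qquad T = \begin{cases} -1 & \text{untargeted attack} \\ \phantom{-}1 & \text{targeted attack} \end{cases}
\end{aligned}
```

Under this reading, $T=-1$ turns the minimization of the second term into a maximization of $G$, which pushes the prediction away from the label $l$, while $T=1$ minimizes $G$ and pulls the prediction toward $l$. The constraint $h(x_{adv}) = h(x)$ states that the human-perceived class is unchanged.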

For an untargeted attack, the label in the objective function denotes the predicted class of the benign input, so the objective measures the distance between the prediction vector of the adversarial input and that class. This distance is to be maximized so that the adversarial input is misclassified, achieving the goal of predicting any label other than the label of the original benign input. Accordingly, the perturbation depends only on the prediction vector of the adversarial example and the predicted class of the benign example.

For a targeted attack, the label denotes the target label, so the objective function measures the distance between the prediction vector of the adversarial input and the attack target label. The attack minimizes this distance so that the adversarial input, generated by maliciously injecting perturbation into the benign input, is classified as the target label with high confidence. In contrast to an untargeted attack, which only needs to lower the probability confidence on the original label, a targeted attack needs to raise the probability confidence on the target label while suppressing the probability confidence for all other labels so that the target label stands out. This makes targeted attacks much harder than untargeted ones. Thus, the adversarial noise is generated from a perturbation function tied to the target label.

Examples of the objective function in Equation 2 include distance-based losses, the cross-entropy loss, and their variants. They are computed against a one-hot prediction vector in which the designated class has probability 1 and every other class has probability 0.
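Two such objective functions can be sketched in a few lines. The choice of the Euclidean ($L_2$) distance as the distance-based loss is an assumption made for illustration, and all function names are hypothetical.

```python
import math

def one_hot(label, num_classes):
    """Vector with probability 1 on `label` and 0 elsewhere."""
    return [1.0 if j == label else 0.0 for j in range(num_classes)]

def l2_loss(prediction, label):
    """Euclidean distance between the prediction vector and the one-hot label."""
    target = one_hot(label, len(prediction))
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(prediction, target)))

def cross_entropy_loss(prediction, label, floor=1e-12):
    """Negative log-probability assigned to `label` (floored to avoid log(0))."""
    return -math.log(max(prediction[label], floor))

prediction = [0.1, 0.8, 0.1]
loss_to_predicted = cross_entropy_loss(prediction, 1)  # small: class 1 is likely
loss_to_other = cross_entropy_loss(prediction, 0)      # larger: class 0 is unlikely
```

An untargeted attack would try to raise the loss measured against the benign predicted class, while a targeted attack would try to lower the loss measured against the target class.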

Equation 2 optimizes the objective function and the amount of perturbation at the same time, while the weighting parameter controls the relative importance of these two terms. At one extreme of this parameter, the problem is centered on minimizing the amount of perturbation while maintaining human imperceptibility. The Jacobian-based attack (papernot2016limitations, ) is an example. At the other extreme, Equation 2 optimizes the objective function without any regulation on the amount of perturbation. It may suffer from over-crafting and lead to violation of human imperceptibility (sharif2016accessorize, ). Between the two extremes, Equation 2 minimizes the amount of perturbation and the objective function together so that neither dominates the optimization problem. Since large noise increases the probability of the attack being detected by a human, the first term of Equation 2 acts as a regularizer of the second. In fact, the objective function could have an additional regularizer. Example attacks in this intermediate regime are the Fast Gradient Sign Method (FGSM) (goodfellow2014explaining, ) and optimization-based attacks (szegedy2013intriguing, ; carlini2017towards, ; evtimov2017robust, ).

2.4. Adverse Effect of Adversarial Examples

We propose to use the following three evaluation metrics to analyze and compare the adverse effect of adversarial examples: Success Rate (SR), Degree of Change (DoC), and information entropy.

SR measures the percentage of adversarial examples that result in successful attacks among all adversarial input data generated by the adversarial example generator.

An adversarial example is considered successful when, under a limited amount of perturbation noise, the following conditions hold: (1) the human-perceived class of the adversarial example is the same as that of the benign input; and (2) for an untargeted attack, the predicted class of the adversarial example differs from the predicted class of the benign input, whereas for a targeted attack, the target class receives the maximum probability in the prediction vector of the adversarial example, so that the predicted class equals the target class. SR lets us statistically measure how hard it is to move the original class of an input into another class. The higher the SR, the more adverse effect the adversarial attack can cause. However, SR alone is not sufficient to characterize the adverse effect of adversarial examples. This motivates us to introduce per-class SR and other metrics. Per-class SR measures the percentage of adversarial inputs that are successfully misclassified among the total adversarial examples generated from one class.
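SR and per-class SR can be computed from predicted labels alone, as in the sketch below. It assumes that every record already satisfies the human-imperceptibility condition, which cannot be checked in code, and the record layout and function names are hypothetical.

```python
from collections import defaultdict

def attack_succeeded(benign_pred, adv_pred, target=None):
    """Untargeted: the label must change. Targeted: it must equal the target."""
    if target is None:
        return adv_pred != benign_pred
    return adv_pred == target

def success_rates(records, target=None):
    """records: (source_class, benign_pred, adv_pred) per adversarial example.

    Returns the overall SR and a dict of per-class SR, both as fractions.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for source_class, benign_pred, adv_pred in records:
        totals[source_class] += 1
        hits[source_class] += attack_succeeded(benign_pred, adv_pred, target)
    overall = sum(hits.values()) / sum(totals.values())
    per_class = {c: hits[c] / totals[c] for c in totals}
    return overall, per_class

# Hypothetical outcomes: three examples from digit 1, one from digit 3.
records = [(1, 1, 7), (1, 1, 1), (1, 1, 8), (3, 3, 3)]
overall_sr, per_class_sr = success_rates(records)
```

In this toy data the overall SR hides the fact that every successful attack came from digit 1, which is the kind of detail per-class SR exposes.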

DoC describes the distance between an adversarial example and the original benign input, which is the objective that adversarial attacks aim to minimize in Equation 2. Given the distance between the two inputs, the DoC is computed as

(3)

where the equation uses the norm of the input data, taken under any of the distance metrics above. For example, if the perturbation noise changes 30 pixels in a 28*28-pixel image, the DoC for this perturbed image is computed from those 30 changed pixels under the $L_0$ distance. If the dataset has 100 maliciously crafted images, we add up their DoC values and compute the average. From the perspective of adversaries, there are several advantages in keeping DoC as small as theoretically possible. First, less effort is required to launch the adversarial attack in a fast and efficient manner. Second, a minimal amount of change is one way to preserve the human-perceived class, making the attack imperceptible to humans and hard to detect. In contrast, randomly added perturbation can be inefficient for adversaries, who risk spending more effort adding noise while achieving low SR.
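The 30-pixel example can be worked through in code. This sketch assumes that the $L_0$ distance is normalized by the total number of pixels, which is one plausible reading of Equation 3 rather than a confirmed one, and the function names are hypothetical.

```python
def l0_distance(x, x_adv):
    """Number of features whose value differs between the two inputs."""
    return sum(1 for a, b in zip(x, x_adv) if a != b)

def doc_l0(x, x_adv):
    """Degree of change under L0, normalized by input size (an assumption)."""
    return l0_distance(x, x_adv) / len(x)

def average_doc(pairs):
    """Mean DoC over (benign, adversarial) pairs."""
    return sum(doc_l0(x, x_adv) for x, x_adv in pairs) / len(pairs)

# A flattened 28*28 image in which the perturbation changes 30 pixels.
image = [0.0] * (28 * 28)
adversarial = list(image)
for i in range(30):
    adversarial[i] = 1.0

doc = doc_l0(image, adversarial)  # 30 / 784, roughly 0.038
```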

The third metric is information entropy (shannon1948mathematical, ), defined as the average amount of information produced by a stochastic source of data. We compute the entropy on the distribution of probabilities in the prediction vector.

(4)

The more evenly the probability is distributed, the larger the entropy. Generally, entropy refers to the disorder or uncertainty of a distribution. Thus a prediction vector that assigns probability 0.8 to the second class has larger entropy than one that assigns it probability 1, because the latter is certain about the second class whereas the former is less certain. We also use the average information entropy, defined as follows: for a set of input images, we compute their individual information entropy, accumulate the entropy of all inputs, and divide the entropy sum by the number of inputs. The average entropy serves as a good alternative indicator to differentiate attacks with high and low SR.
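The entropy comparison can be checked numerically. The sketch uses base-2 logarithms, an assumption that is consistent with the entropy of 3.32 reported in Figure 2 for a ten-class model, since $\log_2 10 \approx 3.32$ is the maximum for ten classes. The two three-class vectors are hypothetical stand-ins for the example in the text.

```python
import math

def entropy(prediction):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in prediction if p > 0)

def average_entropy(predictions):
    """Mean entropy over a collection of prediction vectors."""
    return sum(entropy(p) for p in predictions) / len(predictions)

less_certain = [0.1, 0.8, 0.1]  # probability 0.8 on the second class
certain = [0.0, 1.0, 0.0]       # probability 1 on the second class

h_less_certain = entropy(less_certain)
h_certain = entropy(certain)
h_uniform_10 = entropy([0.1] * 10)  # maximum for ten classes
```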

2.5. Basic Principle of Perturbation Injection

The goal of adversarial perturbation is to make the DNN model misclassify an input by changing the objective function value, based on its gradients, in the adversarial direction. Given that the trained DNN has fixed parameters, the update is applied to the benign input and should be minimized, so that only a small amount of perturbation is added and human imperceptibility remains intact. Such an adversarial perturbation update can be done either in one step or iteratively in multiple steps:

(5)

The sign of the update differs between untargeted and targeted attacks. A step-size parameter controls how much to update at a time. The update is driven by a function that describes the relation between the prediction vector of the adversarial input and some attack objective class. One quantity in Equation 5 concerns the prediction vector or loss function of the original input, or of the adversarial input from the previous iteration, which is the attack spot for the adversarial perturbation update. In contrast, another focuses on the prediction vector of the current adversarial input, which indicates the adversarial attack objective to be optimized. The crafting rule is based on the partial gradient of this function and determines how the perturbation is added. It also ensures that the perturbation is clipped. Clipping is a value constraint that keeps the perturbation within the range of the feature value. For an image, if the perturbation increases the value of a pixel beyond 255, the pixel value is clipped to 255. The update term has the same size as the input in the one-step case and as the previous adversarial input in the multi-step case.
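A generic update of this kind, with clipping, can be sketched as below. The sign-based crafting rule, the `direction` argument (+1 to raise the objective, -1 to lower it) and all function names are assumptions for illustration. They are not the paper's Equation 5, whose exact symbols did not survive in this copy.

```python
def clip(value, low=0.0, high=255.0):
    """Keep a feature value inside its valid range."""
    return max(low, min(high, value))

def sign(g):
    """-1, 0 or 1 depending on the sign of g."""
    return (g > 0) - (g < 0)

def perturb_step(x, gradient, step, direction):
    """One update: move each feature along the gradient sign, then clip."""
    return [clip(v + direction * step * sign(g)) for v, g in zip(x, gradient)]

def iterative_attack(x, grad_fn, is_success, step, direction, max_steps):
    """Repeat the update until the attack succeeds or the step budget runs out."""
    x_adv = list(x)
    for _ in range(max_steps):
        if is_success(x_adv):
            break
        x_adv = perturb_step(x_adv, grad_fn(x_adv), step, direction)
    return x_adv

# One step on two pixels near the ends of the 0..255 range.
one_step = perturb_step([250.0, 5.0], [1.0, -1.0], step=10.0, direction=1)

# Toy multi-step run: a constant gradient and a success test on the first pixel.
multi_step = iterative_attack(
    [0.0], lambda x: [1.0], lambda x: x[0] >= 30.0,
    step=10.0, direction=1, max_steps=10,
)
```

The one-step call shows clipping at both ends of the pixel range, and the iterative call stops as soon as the success test passes instead of spending the full step budget.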

One-step vs. Multi-step Adversarial Examples. A one-step attack is fast, but excessive noise may be added to the benign input unnecessarily, because it is difficult to calibrate exactly how much noise is needed for a successful attack. A one-step attack puts more weight on the objective function and less on minimizing the amount of perturbation. The partial gradient in Equation 5 only indicates the direction of change and the relative importance of different places to perturb in the input data. Unlike a one-step attack, which applies a single update to the original benign input, a multi-step attack uses a feedback loop that iteratively modifies the input with more carefully injected perturbation until the input is successfully misclassified or the maximum perturbation threshold, which preserves human imperceptibility, is reached. Although an iterative attack is computationally more expensive, it is more strategic, with high SR and less perturbation. The multi-step attack strikes a good balance between minimizing the amount of perturbation and optimizing the objective function.

Loss function as the objective function. The loss function measures the distance between the prediction vector and the attack destination class. When the gradient of the loss function with respect to the input is zero, the loss is at the minimal (local minimal) value of a convex (non-convex) loss function. When the gradient is positive, adding values to the input increases the value of the loss function. When the gradient is negative, adding values to the input decreases the value of the loss function. Thus, manipulating the input can change the loss function.

Typical attacks that utilize the gradient of the loss function include one-step untargeted/targeted FGSM (goodfellow2014explaining, ), the untargeted/targeted Iterative Method (kurakin2016physical, ), and the attack in (sharif2016accessorize, ). To attack an image input by following (goodfellow2014explaining, ), we build a noise map based on the gradient of the loss function and a simple pixel-based crafting rule, in which the pixel value is set to 0 (dark) if the gradient of the loss function at that pixel position is below zero, to 127 if the gradient is zero, and to 255 (light) if the gradient is above zero. We perform untargeted attacks by controlling the amount of injected noise using $\epsilon$. Figure 2 shows three adversarial examples (on the right) generated by applying the same perturbation noise (middle) to the same original input on the left under different $\epsilon$ values. When $\epsilon$=0.05, the attack fails. As we increase $\epsilon$ to 0.1, the attack successfully misclassifies the input image of digit 1 as digit 7, and the misclassification is imperceptible: we still see the perturbed image on the right as a digit 1 image without much change. By further increasing $\epsilon$ to 0.2, the attack remains successful, but it misclassifies the same input image of digit 1 as digit 8 instead. In Figure 3, we use a different input image of digit 1, with the same $\epsilon$ and the same algorithm to add perturbation noise produced in the same way as in Figure 2. However, the attack is unsuccessful, and the prediction vector indicates that the predicted class for the maliciously crafted example is still digit 1. This experiment shows that the success of an attack is influenced both by $\epsilon$ and by the input instance. Furthermore, Figure 2 shows that both entropy and loss grow with larger gradient-based noise. Larger entropy indicates a more even distribution of the probabilities in the prediction vector. Larger loss indicates that the prediction vector of the adversarial input is more erroneous when compared with the predicted label of the benign input. This empirical evidence also indicates that the entropy of the prediction vector and the loss are good metrics for characterizing the effectiveness and divergence of attacks.

        | $\epsilon$=0 | $\epsilon$=0.05 | $\epsilon$=0.1 | $\epsilon$=0.2
entropy | 0            | 1.30            | 3.09           | 3.32
loss    | 0            | 1.23            | 2.13           | 2.31
Figure 2. An illustration of a loss function based attack ($\epsilon$=0.05, 0.1, 0.2). Pixel values are increased in the light areas of the perturbation, decreased in the dark areas, and unchanged in the gray areas.

Figure 3. An unsuccessful loss function based attack ($\epsilon$=0.2)
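The pixel-based crafting rule behind the noise maps in Figures 2 and 3 can be sketched as follows. The 0/127/255 mapping comes from the text above. Pixel values scaled to the range 0 to 1 and the function names are assumptions for illustration.

```python
def noise_map(gradient):
    """Visualize the gradient sign: 0 (dark), 127 (gray) or 255 (light)."""
    return [0 if g < 0 else 255 if g > 0 else 127 for g in gradient]

def apply_noise(image, noise, epsilon):
    """Add epsilon in light areas, subtract it in dark areas, skip gray areas.

    Pixels are assumed to lie in [0, 1] and are clipped back into that range.
    """
    offset = {0: -1.0, 127: 0.0, 255: 1.0}
    return [min(1.0, max(0.0, p + epsilon * offset[n]))
            for p, n in zip(image, noise)]

# Hypothetical gradient values for three pixels.
gradient = [-0.3, 0.0, 2.0]
noise = noise_map(gradient)
perturbed = apply_noise([0.5, 0.5, 0.95], noise, epsilon=0.1)
```

Only the sign of the gradient matters under this rule, so every perturbed pixel moves by the same amount regardless of how large its gradient is.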

Optimization of loss function with regularization is an extension of the loss function based attacks using adversarial examples. CW attack (carlini2017towards, ) and RP attack (evtimov2017robust, ) are examples of such attacks. The benefit of controlling the amount of change with a regularizer is to avoid drastic change. However, the optimization problem is often solved via techniques like Adam optimizer, which introduces considerable computation overhead.

Prediction vector based objective function. Similarly, for the prediction vector, adding noise by increasing input values makes the prediction vector value on a class go up if the partial gradient of the prediction vector on that class is greater than zero. Analogously, a negative partial gradient means that increasing the value of the input makes the prediction vector value on that class go down. When the partial gradient is zero, changing the input slightly does not change the prediction vector at all. The same attack principle applies to prelogits as well, since the prediction vector can be seen as normalized and difference-amplified prelogits. Example attacks that utilize the gradient of the prediction vector are the Jacobian-based attack (papernot2016limitations, ) and Deepfool (moosavi2016deepfool, ). In fact, the gradient of the loss function can be seen as an extension of the gradient of the prediction vector through the chain rule, since the loss function is computed from the prediction vector.

In summary, based on the inherent relation among the input data, the gradient of the loss function, and the prediction vector, we can establish basic attack principles for untargeted and targeted adversarial attacks. For an untargeted attack, the goal is to craft the image so as to lower the probability or confidence of the original class of the input until it is no longer the largest probability in the prediction vector. There are three ways to do so: (1) The adversary can increase the loss function, pushing the prediction away from the predicted class of the benign input. When the loss function is large enough, the prediction changes to some destination class other than the original one. (2) The adversary can decrease the value of the original class in the prelogits (feature vector) or logits (prediction vector) until this value is no longer the largest among all classes, at which point the DNN misclassifies the adversarial input. (3) By extending the loss function with a regularizer for the added noise, the adversary optimizes the objective function by increasing the loss between the prediction vector of the adversarial input and the original class while minimizing the impact of the perturbation.

Similarly, there are three methods for a targeted attack: (1) The adversary may decrease the loss function with perturbation, so that the crafting process perturbs the input toward the target class. (2) The adversary may increase the value of the target class in the prelogits or the prediction vector until it becomes the largest one, so that the DNN misclassifies the input into the target class. (3) The adversary extends the loss function with optimization and decreases the loss function while balancing the added noise.

Table 1 lists representative untargeted (U) and targeted (T) attacks with respect to noise origin, distance norm, human perception (HP), and one-step or multi-step attack algorithm.

Attack                    | type | noise origin          | distance norm | HP constraint | iteration
FGSM                      | U/T  | loss function         |               |               | one
Iterative Method          | U/T  | loss function         |               |               | multiple
Jacobian-based Attack     | U/T  | prelogit & prediction |               | max iteration | multiple
CW (carlini2017towards, ) | U/T  | optimization          |               | regularizer   | multiple
RP (evtimov2017robust, )  | U/T  | optimization          |               | regularizer   | multiple
Table 1. Example Adversarial Attacks and Algorithms

3. One-Step Adversarial Examples

3.1. One-Step Attack Generation

We characterize one-step generation of adversarial examples on their attack effectiveness and divergence using FGSM. All experiments are conducted on the MNIST dataset (lecun2010mnist, ) with TensorFlow (liu2018benchmarking, ). The adversarial update of FGSM is a special case of Equation 5 with the crafting rule defined by the sign of the loss gradient.

In FGSM, $\epsilon$ serves as the DoC in $L_\infty$ distance and controls the amount of injected noise. Note that the DoC under other distance metrics differs across instances, because the gradient of the loss function is determined by the input instance and the DNN model. The objective function, which is the loss function, is to be optimized to achieve the attack goal. The rule of perturbation noise injection depends on the sign of the gradient, which controls the direction of the perturbation according to the objective of the loss function based attack. For an untargeted attack, pixel values should be decreased where the gradient is below zero and increased where the gradient is above zero. Both cases follow the gradient sign and aim at increasing (maximizing) the loss function between the predicted vector and the predicted class label of the benign input, which causes misclassification of the adversarial input.
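A one-step FGSM update can be sketched on a toy model. In the usual statement of FGSM, the adversarial input is the benign input plus $\epsilon$ times the sign of the loss gradient with respect to the input. The two-class linear softmax classifier, its weights, the pixel range of 0 to 1 and all function names below are hypothetical. The paper's experiments use MNIST with TensorFlow, which this sketch does not reproduce.

```python
import math

def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, x):
    """Prediction vector of a linear softmax model (one weight row per class)."""
    return softmax([sum(w * v for w, v in zip(row, x)) for row in weights])

def predicted_label(prediction):
    return max(range(len(prediction)), key=lambda j: prediction[j])

def loss_gradient(weights, x, label):
    """Gradient of the cross-entropy loss w.r.t. the input: W^T (p - one_hot)."""
    p = predict(weights, x)
    err = [pj - (1.0 if j == label else 0.0) for j, pj in enumerate(p)]
    return [sum(err[j] * weights[j][i] for j in range(len(weights)))
            for i in range(len(x))]

def fgsm(weights, x, label, epsilon, targeted=False):
    """Untargeted: step up the loss on `label`. Targeted: step down it."""
    grad = loss_gradient(weights, x, label)
    direction = -1.0 if targeted else 1.0
    return [min(1.0, max(0.0, v + direction * epsilon * ((g > 0) - (g < 0))))
            for v, g in zip(x, grad)]

weights = [[2.0, -2.0], [-2.0, 2.0]]  # hypothetical two-class model
x = [0.6, 0.4]                        # benign input, predicted as class 0
benign_label = predicted_label(predict(weights, x))

adv_small = fgsm(weights, x, benign_label, epsilon=0.05)
adv_large = fgsm(weights, x, benign_label, epsilon=0.2)
adv_target = fgsm(weights, x, 1, epsilon=0.2, targeted=True)
```

On this toy input the smaller $\epsilon$ leaves the predicted class unchanged while the larger one flips it, which mirrors the dependence on $\epsilon$ described above.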

By slightly changing the crafting rule, targeted FGSM attacks can be established (kurakin2016physical, ). The difference is that the loss function for a targeted attack is defined between the prediction vector of a benign input and the target class of the attack. The direction of change is to decrease (minimize) the loss function so that the prediction moves toward the target class. We visualize the gradient of the loss function for the targeted FGSM attack in Figure 4. It shows the pixel positions whose value is to be increased, where the gradient is below zero (dark area), and decreased, where the gradient is above zero (light area). Compared to untargeted attacks (recall Figure 2 and Figure 3), Figure 4 shows that generating an adversarial example for a successful targeted attack is much harder. When the targeted classes are set to digits 0, 4, 5, 6, and 9 respectively, and we sweep $\epsilon$ over a range of values at a fixed interval, the attacks are unsuccessful no matter how $\epsilon$ is set. When the targeted classes are set to digits 2, 3, 7, and 8 respectively, the attack succeeds at different noise levels. The easiest target class, requiring the smallest noise level, is digit 8, followed by digit 7, then digits 2 and 3. Increasing $\epsilon$ further causes a clear violation of human imperceptibility.

Figure 4. Visualization of Loss Function-Based Noise Injection for targeted FGSM attack
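The per-target sweep over noise levels can be sketched as a search for the smallest value of $\epsilon$ at which an attack succeeds. The helper name, the callables it takes and the toy success threshold are hypothetical. The actual range and interval used in the paper are not recoverable from this copy.

```python
def smallest_successful_epsilon(attack, is_success, epsilons):
    """Try noise levels in increasing order; return the first that succeeds.

    attack:     maps a noise level to an adversarial input
    is_success: tells whether that adversarial input fools the model
    Returns None when no noise level in the sweep succeeds.
    """
    for epsilon in sorted(epsilons):
        if is_success(attack(epsilon)):
            return epsilon
    return None

# Toy stand-ins: the "adversarial input" is the noise level itself, and the
# attack counts as successful once the noise reaches a threshold.
sweep = [0.05, 0.1, 0.15, 0.2]
easy_target = smallest_successful_epsilon(lambda e: e, lambda v: v >= 0.12, sweep)
hard_target = smallest_successful_epsilon(lambda e: e, lambda v: v >= 1.0, sweep)
```

A `None` result corresponds to target classes for which no noise level in the sweep produces a successful attack.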

Takeaway Remarks. The effectiveness of a one-step attack relies heavily on $\epsilon$, which is the only parameter the adversary can fine-tune once the crafting rule is fixed. For an untargeted attack, a larger $\epsilon$ leads to a larger update, and thus may result in over-crafting and violate the minimal perturbation constraint. Also, large perturbation noise may dominate the original input, violating the human imperceptibility constraint. For a targeted attack, only the correct amount of noise leads to successful targeted misclassification. Over-crafting may push the prediction beyond the decision boundary of the target class, so the attack fails. On the other hand, a smaller $\epsilon$ leads to smaller perturbation and a higher chance of constructing adversarial examples that are not sufficiently perturbed to succeed in one step.

As we have demonstrated in Figure 2 and Figure 3, the noise level is an important control parameter for the effectiveness of the one-step FGSM attack. For the same input, a larger noise level tends to increase SR. However, a noise level that succeeds in attacking one input does not guarantee success in attacking another input of the same class. Our analysis and characterization calibrate the effectiveness and divergence of adversarial examples with three interesting findings: (1) Different input images of the same class (e.g., digit 1) may have different attack effectiveness even with the same level of noise under the same attack method. This reflects the inherent problem of using fixed crafting rules for all instances. The robustness against adversarial perturbation is different across input instances. (2) For any benign input, there is more than one way to generate successful adversarial examples (e.g., using different noise levels) with the same attack method. (3) Different levels of noise may lead to different attack effectiveness for the same input, since the attack is not successful at every noise level. Also, two different but successful attacks on the same benign input may lead to inconsistent misclassification results (recall Figure 2). Moreover, different images from the same class can be misclassified into different destination classes (shown in Table 2). It is critical to study such non-deterministic nature of adversarial examples for defense mechanism design.

3.2. Characterization of One-step Attack

We have shown that it is hard to balance the amount of noise added against the change of the loss function value in a one-step attack. In this section, we further characterize the effectiveness and divergence of one-step adversarial examples by introducing a binary classification of all successful attacks into easy and hard attack categories. This allows us to gain a deeper understanding of both the noise level and the crafting rule with respect to SR, DoC, entropy, and the fraction of misclassified adversarial examples.

We observe that the hardness of changing the predicted class of an input into another destination class varies for different destination classes under untargeted or targeted attacks. We distinguish successful attacks with higher SR as easy and efficient attacks, and successful attacks with lower SR as hard and inefficient ones. Table 2 shows a statistical characterization of easy and hard cases in one-step attacks with a noise level of 0.2. The diagonal shows the fraction of adversarial examples that fail the attacks. We consider two types of hardness in terms of different SRs: attack hardness based on source (S) classes and destination (D) classes. For the attack hardness on source classes, it is observed that the SR of digit 1 is as high as 0.995, and the SR of digit 2 is as low as 0.771, indicating that nearly all images of digit 1 but only about three quarters of the images of digit 2 succeed in the source misclassification attack. We consider those attacks whose source SRs are relatively higher as one kind of easy cases. However, within each source class, different fractions of its images are misclassified into different destination classes, and such distribution is highly skewed in some cases. For the source class of digit 1, the highest fraction misclassified into a particular destination class, i.e., digit 8, is 0.524 (595 of its 1135 images), and the rest of the destination classes have significantly smaller fractions, ranging from 0.006 to 0.123. This indicates that the destination class of untargeted attacks is not uniformly random. We regard those successful attacks whose destination classes have higher fractions as another kind of easy cases.

We first study the hardness in terms of source class. Although the per-class SR differs across source classes, Table 2 shows that the one-step attack is highly effective and all SRs are relatively high with a noise level of 0.2. To better understand the attack hardness of source classes, we gradually lower the noise level from 0.2 to 0.1 and 0.05. Figure 5 shows the comparison of SR under the different noise levels. We highlight two observations. First, the SRs of all digits drop sharply as we decrease the noise level. Once the noise level is small enough, all source class attacks become hard, with all SRs smaller than 0.35. Second, digits 1 and 9 consistently have higher SRs than the other digits under all three noise levels. They are source class easy attacks under FGSM. We view this type of hardness as the vulnerability of the source class.

S \ D 0 1 2 3 4 5 6 7 8 9 SR # images
0 0.058 0.018 0.067 0.004 0.015 0.691 0.037 0.027 0.082 0.001 0.942 980
1 0.068 0.005 0.123 0.092 0.015 0.078 0.006 0.068 0.524 0.021 0.995 1135
2 0.010 0.001 0.229 0.052 0.049 0.341 0.018 0.065 0.208 0.027 0.771 1032
3 0.032 0.052 0.192 0.178 0.022 0.272 0 0.031 0.093 0.128 0.822 1010
4 0.004 0.017 0.110 0.063 0.048 0.045 0.014 0.049 0.571 0.079 0.952 982
5 0.047 0.006 0.142 0.100 0.074 0.135 0.029 0.016 0.341 0.110 0.865 892
6 0.056 0.070 0.217 0.013 0.116 0.156 0.041 0.004 0.307 0.02 0.959 958
7 0.031 0.030 0.314 0.047 0.018 0.018 0.001 0.117 0.385 0.039 0.883 1028
8 0.020 0.033 0.201 0.031 0.094 0.380 0.015 0.040 0.158 0.028 0.842 974
9 0.005 0.016 0.336 0.013 0.046 0.194 0.003 0.210 0.168 0.009 0.991 1009
Table 2. Untargeted FGSM Attack (noise level 0.2): each cell gives the fraction of adversarial inputs from the source class (S) in its row that are misclassified into the destination class (D) in its column. Diagonal cells give the fraction of failed attacks.
Figure 5. SR of untargeted FGSM with different noise levels: the x-axis denotes the 10 classes and the y-axis denotes SR.

The average information entropy is another statistical indicator of the effectiveness of adversarial examples and of the distribution of the prediction vectors. Table 3 shows the average entropy of source classes under the untargeted FGSM attack with different noise levels. Clearly, the more vulnerable a source class is, or the more successful the source misclassification attack is, the higher the entropy is, showing a more even distribution of the probabilities in the prediction vector(s). Note that without attack, the entropy is very low. From both Figure 5 and Table 3, we can see that the most vulnerable source class is digit 1, followed by digits 9, 6, 4, 0, whose SRs are above 0.9. The next set of digits, with SR above 0.8, is 8, 7, 5, 3, with digit 2 having the lowest SR of 0.773, which indicates that 234 inputs out of 1032 failed under FGSM. We observe that FGSM attacks with high SRs result in more evenly distributed probabilities in the prediction vector and larger entropy.

Noise level \ S 0 1 2 3 4 5 6 7 8 9
0.05 1.46 1.96 0.54 1.47 1.3 1.06 1.24 0.93 2.17 2.26
0.1 2.72 2.82 1.37 2.64 2.54 2.16 2.3 2 3.02 3
0.2 2.97 3.27 2.44 3.06 3.12 2.89 2.96 2.83 3.11 3.12
Table 3. Entropy of FGSM under different noise levels, per source class (S)
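As a rough illustration of the entropy indicator, the sketch below computes the Shannon entropy of a prediction vector and averages it over a set of vectors. The base-2 logarithm is an assumption: it is chosen here only because the largest values in Table 3 approach log2(10), the maximum for ten classes. The function names are hypothetical.

```python
import numpy as np

def prediction_entropy(p):
    """Shannon entropy (in bits) of one prediction vector."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                      # 0 * log(0) is treated as 0
    return max(0.0, float(-(nz * np.log2(nz)).sum()))

def average_entropy(prediction_vectors):
    """Mean entropy over a collection of prediction vectors."""
    return float(np.mean([prediction_entropy(p) for p in prediction_vectors]))

one_hot = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]   # confident prediction
uniform = [0.1] * 10                        # maximally uncertain prediction

h_one_hot = prediction_entropy(one_hot)
h_uniform = prediction_entropy(uniform)
h_avg = average_entropy([one_hot, uniform])
```

A one-hot vector yields zero entropy and a uniform vector yields the maximum, so a higher average entropy corresponds to more evenly distributed probabilities in the prediction vectors.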

We next examine the second type of attack hardness with respect to destination (D) classes. Table 4 lists the top three easy and top three hard attacks. For source class digit 1, the top 3 easy destination classes are target digits 8, 2, and 3 with fraction of 0.524, 0.123 and 0.092 misclassified into digit 8 (1135*0.524=595 images), digit 2 (1135*0.123=139 images), and digit 3 (1135*0.092=104 images) respectively. This indicates that the fraction of misclassified adversarial examples is a good indicator for characterization of the effectiveness of attacks.

S Easy 1 Easy 2 Easy 3 Hard 1 Hard 2 Hard 3
0 5/0.691 8/0.082 2/0.067 9/0.001 3/0.004 4/0.015
1 8/0.524 2/0.123 3/0.092 6/0.006 4/0.015 9/0.021
2 5/0.341 8/0.208 3/0.052 1/0.001 0/0.01 6/0.016
3 5/0.272 2/0.192 9/0.128 6/0.0 4/0.025 7/0.031
4 8/0.571 2/0.11 9/0.079 0/0.004 6/0.014 1/0.017
5 8/0.341 2/0.142 3,9/0.11 1/0.006 7/0.016 6/0.029
6 8/0.307 2/0.217 5/0.156 7/0.004 3/0.013 0/0.055
7 8/0.385 2/0.314 3/0.047 6/0.001 4/0.018 5/0.018
8 5/0.38 2/0.201 4/0.094 6/0.015 9/0.028 0/0.033
9 2/0.336 7/0.21 5/0.194 6/0.003 0/0.005 3/0.013
Table 4. Top 3 Easy & Hard Attacks under untargeted FGSM: each cell indicates the destination class digit and the fraction of adversarial examples being misclassified into that destination class.
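The easy and hard destination classes in Table 4 can be derived from a row of Table 2. The snippet below is a small sketch using the row for source digit 1 copied from Table 2; the function name `rank_destinations` is hypothetical.

```python
def rank_destinations(row, source, k=3):
    """Rank destination classes of one source class by misclassification fraction.

    `row[d]` is the fraction of adversarial inputs classified as digit d;
    `row[source]` is the fraction of failed attacks (the diagonal of Table 2).
    """
    dests = [d for d in range(len(row)) if d != source]
    by_fraction = sorted(dests, key=lambda d: row[d], reverse=True)
    return {
        "sr": 1.0 - row[source],           # success rate of the source class
        "easy": by_fraction[:k],           # largest fractions first
        "hard": by_fraction[::-1][:k],     # smallest fractions first
    }

# Row of Table 2 for source digit 1.
digit1_row = [0.068, 0.005, 0.123, 0.092, 0.015, 0.078, 0.006, 0.068, 0.524, 0.021]
result = rank_destinations(digit1_row, source=1)
```

For this row the ranking gives digits 8, 2, 3 as the easy destinations and digits 6, 4, 9 as the hard ones, matching the digit 1 row of Table 4.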

4. Multi-Step Adversarial Examples

In one-step generation of adversarial examples, the noise level indicates to what extent an adversarial input is perturbed in one shot when the crafting rule is fixed. However, determining the right noise level is non-trivial. In particular, a smaller noise level may lead to low SR or failure in a one-step attack. One remedy for achieving high attack SR with a small noise level is to use a multi-step boosting method. The iteration process terminates when the misclassification goal is reached or when the perturbation violates the perturbation constraint. The multi-step approach can increase SR significantly and make attacks that are hard in one step easier. Table 5 shows the SR of the multi-step iterative attack under untargeted FGSM, comparing three settings of iterations. Since the loss function is computed at each iteration to fine-tune the attack direction, the noise injection does not simply repeat the previous iteration. With a few iterations, the attacker can significantly enhance SR. Hard cases in the one-step attack may require more iterations to achieve an SR comparable to that of one-step easy cases. Thus, the number of iterations can be a good indicator of the attack hardness.

iter \ S 0 1 2 3 4 5 6 7 8 9
1 0.172 0.325 0.026 0.063 0.184 0.089 0.213 0.11 0.137 0.261
3 0.796 0.921 0.751 0.789 0.897 0.793 0.903 0.843 0.775 0.960
5 0.988 0.997 0.924 0.959 0.971 0.935 0.982 0.976 0.937 0.998
Table 5. SR of Multi-step FGSM per source class (S) for 1, 3, and 5 iterations.
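The multi-step process can be sketched as a loop that recomputes the loss gradient at every iteration and stops as soon as the prediction changes. This is a minimal illustration under stated assumptions, not the paper's implementation: the toy linear softmax model `W`, the input `x`, the parameter names `eps` and `max_iters`, and the use of an iteration cap as the perturbation budget are all hypothetical choices for the example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def iterative_fgsm(W, x, label, eps, max_iters):
    """Untargeted multi-step FGSM on a linear softmax model.

    Returns the crafted input and the number of steps used, or None for the
    step count if the prediction is still `label` after `max_iters` steps.
    """
    x_adv = x.copy()
    for step in range(1, max_iters + 1):
        # Recompute the loss gradient at the current point.
        p = softmax(W @ x_adv)
        p[label] -= 1.0
        grad = W.T @ p
        x_adv = np.clip(x_adv + eps * np.sign(grad), 0.0, 1.0)
        if int(np.argmax(W @ x_adv)) != label:
            return x_adv, step
    return x_adv, None

# Hypothetical toy model: logits equal the input features.
W = np.eye(3)
x = np.array([0.62, 0.3, 0.1])   # predicted as class 0

_, one_step = iterative_fgsm(W, x, label=0, eps=0.05, max_iters=1)
x_adv, steps = iterative_fgsm(W, x, label=0, eps=0.05, max_iters=5)
```

With this small noise level the one-step attack fails, while the iterative attack succeeds after a few steps; the step count returned by the loop is the quantity suggested above as an indicator of attack hardness.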

For targeted FGSM attacks, however, the multi-step attack is less effective. The experiment conducted to attack input images of digit 1 fails to reach the target digit 0 for all 1135 images in MNIST. This is consistent with one-step targeted FGSM in Figure 4, where the one-step targeted attack towards digit 0 is not successful regardless of the noise level. Consequently, boosting a small noise level iteratively may not improve the SR of the attack when the attack under a large noise level is not successful. In addition to tuning the noise level, the crafting rule may also need to be refined iteratively to boost attack SR.

4.1. Multi-Step Attack Generation

To better characterize the behavior of multi-step adversarial attack on its adverse effect and divergence, we further analyze the generation and effectiveness of multi-step adversarial example using targeted Jacobian-based attack  (papernot2016limitations, ), which possesses a prediction vector-based objective function with two alternative crafting rules based on single pixel or a pair of pixels.

Single-pixel crafting rule. Given an input and its prediction vector, the attacker first computes the Jacobian matrix of the prediction vector with respect to the input. The Jacobian matrix on a label indicates the relation between input features (pixels for image data) and the prediction on that label. That is, increasing the value of one pixel would increase the prediction on a class if the Jacobian matrix on that class has a positive gradient on that pixel. In particular, the Jacobian on the target class gives the gradient of the target-class prediction with respect to each pixel. Since the prediction vector of the legitimate input is generated from the DNN model, the gradient value is determined by the training process, and the model is assumed to be differentiable. After computing the Jacobian matrix for the entire prediction vector, the adversary can compute the adversarial saliency map for the target class at each pixel.

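$$S(x, t)[i] = \begin{cases} 0, & \text{if } \dfrac{\partial F_t(x)}{\partial x_i} < 0 \ \text{or} \ \sum_{j \neq t} \dfrac{\partial F_j(x)}{\partial x_i} > 0 \\ \dfrac{\partial F_t(x)}{\partial x_i} \left| \sum_{j \neq t} \dfrac{\partial F_j(x)}{\partial x_i} \right|, & \text{otherwise} \end{cases} \qquad (6)$$

where $F_j(x)$ is the prediction on class $j$ for input $x$, $t$ is the target class, and $i$ indexes the pixels.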

This equation gives four concrete gradient-based crafting rules. If adding pixel value does not move the prediction towards the target class, i.e., (1) the gradient of the prelogit or prediction of the target class for the pixel is <0, or (2) the sum of the gradients of all classes other than the target class for the pixel is >0, then the value of the adversarial saliency map on that pixel is set to 0. However, if adding pixel value does move the prelogit and prediction towards the target class, i.e., (3) the gradient of the target class for the pixel is >0, and (4) the sum of the gradients of all classes other than the target class for the pixel is <0, then the value of the adversarial saliency map on that pixel is set to the product of the gradient in (3) and the magnitude of the sum in (4). The power of the adversarial saliency map is that it optimizes the objective function by considering both the gradient towards the target class and the gradients of all other classes. Once the adversarial saliency map is computed, the adversary can craft the image at the pixels that have the largest adversarial saliency values.
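The four rules above can be written as a few array operations. The following is a sketch assuming the Jacobian is available as a NumPy array of shape (classes, pixels); the function name and the small example Jacobian are hypothetical, and the absolute value on the other-class sum follows the saliency map definition of Papernot et al. (papernot2016limitations, ).

```python
import numpy as np

def adversarial_saliency_map(jacobian, target):
    """Single-pixel adversarial saliency map for one target class.

    jacobian[j, i] is the gradient of the prediction on class j
    with respect to pixel i.
    """
    target_grad = jacobian[target]
    other_grad = jacobian.sum(axis=0) - target_grad
    # Rules (1) and (2): these pixels do not help the target class.
    rejected = (target_grad < 0) | (other_grad > 0)
    # Rules (3) and (4): product of target gradient and |other-class sum|.
    return np.where(rejected, 0.0, target_grad * np.abs(other_grad))

# Hypothetical Jacobian for 3 classes and 4 pixels.
J = np.array([
    [0.5, -0.2, 0.3, 0.4],    # target class 0
    [-0.3, 0.1, 0.2, -0.1],
    [-0.1, 0.1, 0.1, -0.5],
])
saliency = adversarial_saliency_map(J, target=0)
best_pixel = int(np.argmax(saliency))
```

In this example pixel 1 is rejected by rule (1), pixel 2 by rule (2), and pixel 3 receives the largest saliency value, so it would be the pixel chosen for crafting.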

Pair-wise adversarial crafting rule.

The above adversarial saliency map, which considers individual pixels one at a time, is too strict, especially when very few pixels meet the heuristic search criteria in Equation 6. Papernot et al. (papernot2016limitations, ) introduce the pair-wise adversarial saliency map. The heuristic search over pairs of pixels is a greedy strategy that modifies one pair at a time. The motivation is the assumption that one pixel can compensate for a minor flaw of the other pixel. For a pair of pixels, we first compute the Jacobian matrices according to the prediction on each label. Then, pair-wise, we compute two quantities: the first represents to what extent changing these two pixels will change the prediction on the target class, and the second denotes the impact of changing the two pixels on the classes other than the target. Similar to the single-pixel adversarial saliency map, the pixel pair with the largest product magnitude of the two quantities, subject to the first being positive and the second being negative, is chosen to be crafted. The perturbation sets the pixel value to 255. A dynamic search domain is maintained to keep track of those pixels whose values already reach 255. The multi-step perturbation process iterates until the attack is successful or reaches the pre-defined maximum level of noise tolerance, such as 15% of pixels. For an image of 28*28 pixels, the adversarial example changes up to 28*28*0.15=118 pixels. If the adversarial input reaches this maximum noise level but is still not predicted as the target label, the attack fails.
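The pair-wise search can be sketched as follows. The names `alpha` and `beta` for the two pair-wise quantities follow the notation of Papernot et al. and are an assumption here, as are the function name and the toy gradient values; the pixel budget reuses the 15% cap on a 28*28 image mentioned above.

```python
from itertools import combinations

def best_pixel_pair(target_grad, other_grad):
    """Greedy choice of one pixel pair for a targeted Jacobian-based attack.

    target_grad[i]: gradient of the target-class prediction for pixel i.
    other_grad[i]:  summed gradient of all other classes for pixel i.
    """
    best, best_score = None, 0.0
    for p, q in combinations(range(len(target_grad)), 2):
        alpha = target_grad[p] + target_grad[q]   # effect on the target class
        beta = other_grad[p] + other_grad[q]      # effect on the other classes
        if alpha > 0 and beta < 0:
            score = -alpha * beta
            if score > best_score:
                best, best_score = (p, q), score
    return best, best_score

# Hypothetical gradients: every single pixel fails the single-pixel criteria.
target_grad = [0.5, -0.1, 0.2]
other_grad = [0.1, -0.4, 0.3]
pair, score = best_pixel_pair(target_grad, other_grad)

# Maximum number of pixels that may be changed for a 28*28 image at 15%.
max_pixels = round(28 * 28 * 0.15)
```

Here no single pixel satisfies both conditions of the single-pixel rule, yet the pair (0, 1) does, which illustrates how one pixel can compensate for the flaw of the other.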

Figure 6 visualizes the adversarial saliency map for targeted attacks with each of the classes other than the source digit 1 as the target in the first iteration. Moreover, the adversarial saliency map of the untargeted Jacobian-based attack is provided for reference. The construction is straightforward according to the general principle of perturbation injection. Since the values of the adversarial saliency map span a wide range, it is hard to demonstrate their numerical differences. For better visualization, we set all pixels whose values are non-zero in the saliency map towards the target class to 255. This means that pixels in the light area are potential candidates for crafting. The visualization shows clearly that the saliency map perturbation is different across the attack target classes, and the DoC values for successful targeted attacks are different. Targeted attacks on digits 0 and 6 fail with the 15% pixel-level perturbation threshold.

Figure 6. Visualization of Adversarial Saliency Map-based Noise Injection for targeted attacks. The Adversarial Saliency Map shown is from the first iteration. The noise of digit 1 is for the untargeted attack.

Figure 7. Prelogits over 10 steps (source digit 1, target digit 8)
digit \ step 0 1 2 3 4 5 6 7 8 9 10
1 788 490 329 268 228 171 160 119 94 -119 -148
4 -697 -430 -288 -156 -99 -66 -74 -58 -47 -2 -34
8 -507 -318 -339 -557 -617 -591 -564 -497 -443 -57 -11
Table 6. Prelogits trajectory of three representative classes (digits 1, 4, 8) over 10 iteration steps (source digit 1, target digit 8)

We use a successful 10-step Jacobian-based targeted attack from source class 1 to target class 8 as an example to characterize the multi-step Jacobian noise injection. Table 6 shows three representative classes over the 10-step targeted attack and Figure 7 shows the prelogit trajectory for all classes in the 10-step attack. At each step, an adversarial saliency map is computed and the pixel position with the largest value is chosen to be the crafting pixel. It is a nested multi-step ensemble process with every step correcting the perturbation trajectory a little. Though the perturbation is based on the gradient of the prediction vector, the prediction vector remains [0,1,0,0,0,0,0,0,0,0] for the first 8 steps, changes to [0,0,0,0,1,0,0,0,0,0] (predicted as digit 4) at the 9th step, and to [0, 0, 0, 0.0067, 0, 0, 0, 0, 0.9933, 0] at the 10th step, successfully reaching the digit 8 goal of this targeted attack. It is observed that the trajectory of prelogits is much smoother and more informative than the drastic change in the prediction vector in the iterative ensemble learning process. According to Figure 7, the perturbation enhances the prediction of the target class while the gap between the prelogits of different labels shrinks as the attack progresses step by step. The prelogit of target class 8 gradually becomes the largest, succeeding in misclassifying digit 1 as digit 8 at the 10th step.
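The trajectory in Table 6 can be checked directly. The sketch below copies the three prelogit rows from Table 6 and reports, for each step, which of the three listed digits has the largest prelogit and how wide the gap between them is. Only three of the ten classes are available here, so this is an approximation of the full prediction rather than a reproduction of it.

```python
import numpy as np

digits = [1, 4, 8]
# Prelogits from Table 6: one row per digit, one column per step 0..10.
prelogits = np.array([
    [788, 490, 329, 268, 228, 171, 160, 119, 94, -119, -148],
    [-697, -430, -288, -156, -99, -66, -74, -58, -47, -2, -34],
    [-507, -318, -339, -557, -617, -591, -564, -497, -443, -57, -11],
], dtype=float)

# Digit with the largest prelogit at each step.
leading = [digits[i] for i in prelogits.argmax(axis=0)]
# Spread between the largest and smallest prelogit at each step.
gap = prelogits.max(axis=0) - prelogits.min(axis=0)
```

Among these three classes, digit 1 leads through step 8, digit 4 leads at step 9, and digit 8 leads at step 10, which agrees with the prediction vectors described above; the gap also narrows from the first step to the last.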

Takeaway Remarks. (1) The pixel-level perturbation changes the prelogits, the prediction vector, and the probability of every single class in each step. This increases the hardness of targeted attack when the number of classes is large, as it is more difficult to choose the fraction of pixels as the maximum crafting cap, and very likely 15% of pixel perturbation may lead to low SR or failure of the attack. (2) While softmax layer amplifies the numerical difference in prelogits and normalizes them into the prediction vectors, performing prediction vector based attack is equivalent to performing attacks via prelogits. More interestingly, by noticing that the gap of prelogits converges for several successive prediction attempts, it may reveal the presence of a targeted attack. (3) The pixel with larger adversarial saliency map indicates that adding its value will move the prelogit or prediction toward the target class more. This crafting rule, together with a limitation on the maximum iteration ensures minimal amount of perturbation as well as human imperceptibility. However, it also exposes some problems of this attack. In addition to its computation inefficiency, the perturbation simply sets the chosen pixel values to 255 at each iteration, which may over craft the input at times so that the noise deviates too much from the amount needed and results in unsuccessful attacks. Also adding full 255 to a pixel each time may not be effective or human-imperceptible for colored images.

4.2. Effectiveness of Multi-step Attack

In this section we characterize the effectiveness and divergence of multi-step targeted attack by analyzing the easy and hard cases.

Success Rate (SR). We first categorize the easy and hard cases in multi-step targeted attacks using high and low SR, or the large and small fraction of adversarial examples misclassified. Table 7 shows the results. For source class digit 1 with 1135 images and attack target digits 0, 2, 6, and 8, the SR is 0.1%, 85.6%, 3%, and 97% respectively. Clearly, the SRs for misclassifying digit 1 to target digit 0 or 6 (hard cases) are much lower than the SRs for the target digits 8 or 2 (easy cases). Also, the effectiveness of targeted attacks is asymmetrical, e.g., the attack from source digit 7 to target digit 9 is much harder, with a low SR of 0.208, than the reverse attack from digit 9 to digit 7, with a high SR of 0.944. In Table 7, the last column and the last row are the average SR for each source class and each target class respectively. Figure 8 shows vulnerable and robust source classes, and Figure 9 shows hard and easy target classes using the SR sum. Within each SR bar, different colors indicate the contributions of each digit to the SR of the attack. It is easy to see that 1, 9, 6 are the top 3 vulnerable source classes while 2, 8, 3 are the top 3 easy target classes.

Figure 8. Source Vulnerability
Figure 9. Hardness of Target

Degree of Change (DoC). DoC is another good discriminator for easy and hard cases: a higher average DoC means that the attack is harder. Table 8 shows the DoC for targeted attacks on source digit 1 with each of the other classes as the target. Digits 2 and 3 are easy targets with high SR (Table 7) and relatively low DoC. It only takes 5% or 6.6% of change in pixels on average to misclassify an image of digit 1 as digit 2 or digit 3 respectively. However, for the hard attack from digit 1 to target digit 0, the DoC is 15% and the SR is 0.1%.

Average Entropy. We may also use average entropy to differentiate easy and hard attacks. Two situations have low entropy: benign input whose order of entropy is around , and unsuccessful adversarial example. However, there is no linear correlation between high SR and larger entropy for two reasons. (1) When injecting full value of 255 to a pixel at a time, the per-step perturbation may be too large, and such coarse-grained perturbation leads to the change of prediction vector from one-hot source class to one-hot target class directly, which will not increase entropy. (2) The resemblance of source and target images (digits) may play a role. For example, digits 5 and 6, digits 7 and 9 look alike, respectively. And attacks and have higher SRs, and are more successful. But their entropy values are smaller than attacks and .

S \ T 0 1 2 3 4 5 6 7 8 9 S: avg
0 - 0.027 0.970 0.039 0.205 0.147 0.049 0.307 0.352 0.170 0.252
1 0.001 - 0.856 0.838 0.415 0.502 0.030 0.686 0.970 0.510 0.534
2 0.001 0.006 - 0.285 0.007 0.003 0.009 0.136 0.237 0.004 0.076
3 0.001 0.027 0.483 - 0.005 0.136 0.003 0.125 0.114 0.110 0.112
4 0.000 0.188 0.633 0.155 - 0.145 0.013 0.768 0.386 0.173 0.273
5 0.013 0.246 0.077 0.592 0.033 - 0.037 0.217 0.478 0.105 0.120
6 0.040 0.176 0.815 0.223 0.618 0.382 - 0.183 0.630 0.116 0.354
7 0.003 0.034 0.636 0.562 0.027 0.129 0.000 - 0.320 0.208 0.213
8 0.003 0.086 0.858 0.575 0.071 0.317 0.016 0.107 - 0.015 0.228
9 0.010 0.084 0.613 0.761 0.387 0.003 0.000 0.944 0.825 - 0.403
T: avg 0.008 0.097 0.660 0.448 0.196 0.196 0.017 0.386 0.479 0.157
Table 7. SR of adversarial examples in Jacobian-based attack, by source class (S, rows) and target class (T, columns). Diagonal cells, where source equals target, do not apply and are marked "-".
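The ranking of vulnerable source classes and easy target classes can be recomputed from Table 7. The sketch below copies the off-diagonal SR values of each row, leaves the diagonal undefined, and ranks classes by their average SR. Variable names are hypothetical.

```python
import numpy as np

# Off-diagonal SR values of Table 7, one list per source digit 0..9.
rows = [
    [0.027, 0.970, 0.039, 0.205, 0.147, 0.049, 0.307, 0.352, 0.170],
    [0.001, 0.856, 0.838, 0.415, 0.502, 0.030, 0.686, 0.970, 0.510],
    [0.001, 0.006, 0.285, 0.007, 0.003, 0.009, 0.136, 0.237, 0.004],
    [0.001, 0.027, 0.483, 0.005, 0.136, 0.003, 0.125, 0.114, 0.110],
    [0.000, 0.188, 0.633, 0.155, 0.145, 0.013, 0.768, 0.386, 0.173],
    [0.013, 0.246, 0.077, 0.592, 0.033, 0.037, 0.217, 0.478, 0.105],
    [0.040, 0.176, 0.815, 0.223, 0.618, 0.382, 0.183, 0.630, 0.116],
    [0.003, 0.034, 0.636, 0.562, 0.027, 0.129, 0.000, 0.320, 0.208],
    [0.003, 0.086, 0.858, 0.575, 0.071, 0.317, 0.016, 0.107, 0.015],
    [0.010, 0.084, 0.613, 0.761, 0.387, 0.003, 0.000, 0.944, 0.825],
]

# sr[s, t] is the SR of the attack from source s to target t (NaN if s == t).
sr = np.full((10, 10), np.nan)
for s, values in enumerate(rows):
    sr[s, [t for t in range(10) if t != s]] = values

source_avg = np.nanmean(sr, axis=1)   # vulnerability of each source class
target_avg = np.nanmean(sr, axis=0)   # easiness of each target class
top_sources = [int(i) for i in np.argsort(-source_avg)[:3]]
top_targets = [int(i) for i in np.argsort(-target_avg)[:3]]
```

Ranking by average SR gives digits 1, 9, 6 as the most vulnerable sources and digits 2, 8, 3 as the easiest targets, and `sr[7, 9]` versus `sr[9, 7]` exposes the asymmetry between an attack and its reverse.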

Figure 10. Top 3 easy cases per target in Jacobian-based Attack
Target 0 2 3 4 5 6 7 8 9
DoC 0.150 0.050 0.066 0.101 0.102 0.148 0.066 0.029 0.093
Entropy 0.026 0.069 0.068 0.03 0.064 0.017 0.05 0.067 0.048
Table 8. DoC and entropy of 1135 images of digit 1.

Takeaway Remarks. (1) Designing a strong attack requires a trade-off between larger per-step perturbation and minimal degree of change. An attack is considered strong if it succeeds with high confidence or if the adversarial prediction vector is a one-hot vector. However, not all successful attacks are accompanied by high SR. Often the successful multi-step attacks may lower the confidence (the probability) of the prediction to a greater extent. Moreover, a targeted attack is hard to develop and pays a higher cost to produce successful adversarial examples. We measure the time for performing the untargeted FGSM attack and the time for performing the Jacobian-based targeted attack on an Intel Core i5-2300 CPU. For the same input, the latter takes about 7 times as long per instance as the former on average. (2) Even with the same attack algorithm, the top three easy cases may vary notably with respect to SR, DoC and entropy, and so do the hard cases. Figure 10 shows the top 3 easy Jacobian-based attacks. The number in the bracket is the SR. For images of digit 5, the top 1 easy attack is towards target digit 3 with an SR of 59%, and the top 3 easy attack is towards target digit 1 with an SR of only 25%, though the attack is successful with only 86 perturbed pixels. Also, the crafted image of digit 5 still looks like digit 5 visually, but the perturbation noise is visible too. This empirical evidence shows a strong and complex connection between benign input, adversarial input, loss function, and prediction vector, which inspires us to investigate the effectiveness of adversarial examples from another set of factors related to adversarial learning and DNN training in Section 5.

5. Attack Effect of DNN Frameworks

In this section, we characterize the effect of different settings of hyperparameters and different DNN frameworks on the attack effectiveness of adversarial examples. We choose the number of training epochs to study the impact of overfitting, various sizes of feature maps to compare the effectiveness under different DNN capacity and different DNN frameworks to evaluate their impact on the effectiveness of adversarial attacks.

5.1. Different Number of Training Epochs

The first set of experiments reports the presence of inconsistent easy and hard attacks under different training epochs.

(a) Vulnerability of Source
(b) Hardness of Target
Figure 11. Impact of training epochs: higher SR, more vulnerable and lower SR, harder attack.

We study easy and hard cases using the Jacobian-based targeted attack with a DNN model trained under three different settings of epochs: 1 epoch (underfitting), 10 epochs (TensorFlow default) and 30 epochs (overfitting). Figure 11 compares the vulnerability of source classes and the hardness of target classes. The height of each bar shows the sum of SRs for each of the source classes (left figure) or target classes (right figure). We highlight two interesting observations: (1) Statistically, digit 1 is the most vulnerable source class for all settings of epochs, and digit 8 is the easiest attack target for 1 epoch of training. For the DNN with 30 epochs of training, digits 1 and 6 are the most vulnerable source classes, and digits 2, 4, and 5 are the easiest targets. Both results indicate different behavior of easy and hard attacks compared to the 10-epoch results, where digits 1 and 9 are the most vulnerable source classes, and digits 2, 3, and 8 are the easiest targets. (2) Easy and hard attack cases vary under different training epochs because different training yields different trained network parameters, which describe the learned features. The different learned features are reflected in the gradient of the loss function and the prediction vectors, and subsequently impact the effectiveness of adversarial examples. Figure 12 visualizes the gradient of the loss function for DNN training under 1 epoch or 30 epochs for the FGSM attack, and Figure 4 is for 10 epochs, where successful attacks are marked by their noise level. This empirical evidence shows visible inconsistency across different training epochs regarding the success or failure of attacks, as well as regarding SR and DoC for successful attacks.

Figure 12. Loss Function-Based Noise at 1 and 30 epochs

5.2. Different Sizes of Feature Maps

We next study whether different sizes of feature maps have different impacts on the features learned by the DNN model, as the change of learned features will be reflected by the gradients of loss function and adversarial saliency maps, which will impact the behavior of easy and hard attacks. We reduce and double the original number of output features to generate feature map of half and double sizes for the first four DNN layers in TensorFlow. Figure 13 compares three sizes of feature maps: half, original and double on the vulnerability of source classes and the hardness of target classes. For half feature map case, digit 1 is the most vulnerable source class, whereas digits 2 and 3 are the easiest targets. For double feature map case, digit 1 and 9 are the most vulnerable source classes, whereas digits 2 and 8 are the easiest targets. Again, the hard and easy attacks vary for three sizes of feature maps with more easy cases for normal size feature maps. Figure 14 visualizes the different features learned under half and double feature maps using the gradient of loss function under targeted FGSM attack, and visualization for normal feature map was given in Figure 4. Similar inconsistency is observed across different sizes of feature maps, though the impact of different training epochs on the degree of inconsistency is much larger.

(a) Vulnerability of Source
(b) Hardness of Target
Figure 13. Impact of varying sizes of feature maps: higher SR, more vulnerable and lower SR, harder attack.
Figure 14. Loss function-based noise with different feature maps

5.3. Different DNN Frameworks

We next evaluate the impact of different DNN frameworks on the effectiveness and divergence of adversarial examples. Figure 15 reports the comparison results for TensorFlow, Caffe, Theano and Torch. Clearly, TensorFlow and Theano are consistently more vulnerable under FGSM, followed by Caffe and Torch. Figure 17 visualizes the gradient of loss function based adversarial perturbation using DNN models trained by Caffe, Theano and Torch respectively. It is clear that different frameworks lead to different features learned by their DNN models, which contributes to their different influence on the effectiveness of adversarial attacks with respect to easy and hard cases. Figure 17 also exposes some inherent problems in the FGSM attack method. The crafting rule in FGSM treats all pixels equally, which may be inefficient since the gradients reflected in the input data for each pixel are not the same. While the positive and negative signs of the sign function are useful, fixing the magnitude of the sign function to 1 is not effective in many cases. For example, in Torch, the identical perturbation noise smooths the numerical difference of the gradient of the loss function on different targets. Thus, the attack does not make full use of these gradients. This may contribute to the low SR of untargeted FGSM attacks in Torch.

Takeaway Remarks. First, adversarial attacks heavily rely on the gradient of loss function and prediction vector produced by the trained DNN model, and such gradient is determined by the parameters in the DNN function learned during the training process. Both the hyperparameters, which influence on how the training process will be conducted, and the learned parameters, which are the fixed components of the trained model, will impact the computation of gradient of loss function and prediction vector during the generation of adversarial examples, regardless of specific attack algorithms, and subsequently impact the effectiveness and divergence of adversarial attacks. Second, for the adversarial examples crafted using the same attack method, easy and hard attacks tend to vary under different hyperparameters and across different DNN frameworks, indicating that the effectiveness of adversarial attacks is inconsistent and unpredictable across different DNN frameworks. Such inconsistency also presents under diverse settings of hyperparameters used for DNN training within the same DNN framework.

Figure 15. SR of untargeted FGSM with different frameworks

Figure 16 shows the SR of untargeted FGSM attack on each source class for only Caffe and Torch. Within each SR bar, different colors indicate different contributions of destination classes to building the attack SR. It is easy to see that the top 3 vulnerable source classes in Caffe are very different from that in Torch.

(a) Caffe
(b) Torch
Figure 16. SR of untargeted FGSM with Caffe and Torch.
Figure 17. Loss function-based noise with frameworks

6. Attack Mitigation Strategies

We have characterized the effectiveness of adversarial examples in deep learning through general formulation, extensive empirical evidences, and systematic study of successful attacks and their divergence in terms of easy and hard cases. Motivated by the results of this study, we propose some attack mitigation strategies from two perspectives: prediction phase mitigation and model training phase mitigation.

Prediction Phase Mitigation. We have shown that (1) successful attacks (targeted or untargeted) often do not agree on the same noise level under the same crafting rule; (2) the same noise level that generates one successful adversarial example against a benign input of a given class may not work effectively for another benign input of the same class; and (3) the destination class of untargeted attacks is not uniformly random. Similarly, some target classes are much harder to attack under the same attack scheme. These observations inspire us to propose two prediction phase mitigation strategies: consensus based mitigation and time-out based mitigation, which are independent of the trained DNN model and the specific DNN framework used for training.

Consensus based Mitigation. For each prediction query with a given input, the prediction API of the DNN-as-a-service provider will generate multiple input queries such that the prediction result is only accepted by a client when the majority reaches a consensus. Given that adversarial attacks do not respond consistently to different input data of the same class, such data-diversity based consensus can be an economical and effective mitigation strategy. There are several ways to generate such query inputs. For images, one can leverage computer vision and computer graphics techniques to generate alternative views of the same image. Also, the consensus protocol can be decentralized to make it more resilient to a single point of failure (kosba2016hawk, ). Each client may accept a prediction result upon obtaining enough consensus votes from the network.
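A minimal sketch of the consensus idea follows. It assumes a simple majority rule and treats both the classifier and the generated input variants as opaque values; the function name, the variant labels, and the stand-in classifier are hypothetical and only serve to illustrate the voting step.

```python
from collections import Counter

def consensus_predict(predict, variants):
    """Accept a label only if a strict majority of the variants agree on it.

    predict:  callable mapping one input variant to a class label.
    variants: alternative views generated from the same query input.
    Returns the agreed label, or None if no majority is reached.
    """
    votes = Counter(predict(v) for v in variants)
    label, count = votes.most_common(1)[0]
    return label if count > len(variants) / 2 else None

# Hypothetical stand-in classifier: the adversarial perturbation only
# fools the model on the unmodified view of the input.
views = ["original", "shifted", "rotated", "blurred", "scaled"]
fooled_on_original = lambda view: 8 if view == "original" else 1
accepted = consensus_predict(fooled_on_original, views)

# No label reaches a strict majority, so the result is rejected.
rejected = consensus_predict(lambda label: label, [1, 8, 3, 1])
```

How the variants are produced and how many votes are required are design choices left open here; a stricter threshold trades more rejected benign queries for more resistance to perturbations that survive some transformations.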

Time-out based Mitigation. For each type of prediction query, a time-out threshold is pre-set by the DNN as a service provider. If a query input is compromised by an adversarial example from a multi-step attack, then the turnaround time for its prediction may exceed the normal histogram statistics for this type of prediction task and thus raise an alarm. Such a time-out threshold can be learned over time or through training. This mitigation strategy can be especially useful against hard attacks, which require more iteration rounds to succeed.
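One simple way such a threshold might be learned from observed turnaround times is shown below. The text does not specify the learning rule, so the mean-plus-k-standard-deviations rule, the function names `learn_timeout` and `is_suspicious`, and the sample latencies are all assumptions made for illustration.

```python
import statistics
from typing import Sequence


def learn_timeout(latencies: Sequence[float], k: float = 3.0) -> float:
    """Threshold = mean + k * sample standard deviation of benign turnaround times.

    This rule is an assumption of the sketch; a percentile of the latency
    histogram would be an equally plausible choice.
    """
    if len(latencies) < 2:
        raise ValueError("need at least two latency observations")
    return statistics.mean(latencies) + k * statistics.stdev(latencies)


def is_suspicious(elapsed: float, threshold: float) -> bool:
    """Raise an alarm when a query takes longer than the learned threshold."""
    return elapsed > threshold


# Hypothetical turnaround times (milliseconds) for one type of prediction task.
benign_ms = [10.0, 12.0, 11.0, 9.0, 13.0]
threshold_ms = learn_timeout(benign_ms)
```

The threshold would be kept per query type, since different prediction tasks have different normal latency profiles.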

Training Phase Mitigation. We have shown that the effectiveness of adversarial attacks is inconsistent and unpredictable, and that easy and hard attacks tend to vary under different hyperparameters and across different DNN frameworks. Also, adversarial attacks rely heavily on the gradient of the loss function and the prediction vector produced by the trained DNN model, and this gradient is determined by the parameters of the DNN function learned during the training process. Thus, we propose three training phase mitigation strategies as proactive countermeasures that can be exercised by the DNN as a service provider.

Data based Ensemble Training. For each of the prediction classes, adversarial training is employed in conjunction with a data-driven ensemble. This enables the training set to include sufficient representations of training data for each class, including those that can strengthen the resilience of prediction queries against adversarial examples in the prediction phase. For instance, by studying the hard cases and easy cases of each source class and each target class, we can generate training examples that make the easy attacks harder and make the hard attacks impossible to succeed.

Hyperparameter based ensemble training. By utilizing different settings of hyperparameters, such as different numbers of epochs, we can train alternative models and use these diverse models as a collection of candidate models in the prediction phase. There are several ways to implement the consensus for a hyperparameter based ensemble. For instance, for each prediction query with an input, a subset of hyperparameter-varied models is selected to produce prediction results, and consensus is collected accordingly. Round robin, random, weighted round robin, power of two choices (richa2001power, ) or generalized power of choice (park2011generalization, ) can be employed to implement the selection algorithms.
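As an illustration of one of the selection rules listed above, the following sketch picks a subset of models using the power of two choices: sample two candidates at random and keep the less loaded one. The per-model load counter, the model names, and the function names are hypothetical; the text does not prescribe how load is measured.

```python
import random
from typing import Dict, List


def pick_power_of_two(load: Dict[str, int], rng: random.Random) -> str:
    """Sample two distinct models uniformly and return the less loaded one."""
    first, second = rng.sample(sorted(load), 2)
    return first if load[first] <= load[second] else second


def select_models(load: Dict[str, int], m: int, rng: random.Random) -> List[str]:
    """Pick m distinct models for one query and record the added load."""
    if not 1 <= m <= len(load):
        raise ValueError("m must be between 1 and the number of models")
    remaining = dict(load)  # candidates not yet chosen for this query
    chosen: List[str] = []
    while len(chosen) < m:
        if len(remaining) == 1:
            name = next(iter(remaining))
        else:
            name = pick_power_of_two(remaining, rng)
        chosen.append(name)
        load[name] += 1  # this model now serves one more query
        del remaining[name]
    return chosen


# Hypothetical models trained with different numbers of epochs.
load = {"epochs_10": 0, "epochs_20": 0, "epochs_40": 0}
subset = select_models(load, m=2, rng=random.Random(0))
```

The selected subset would then vote on the query, so an attacker cannot know in advance which hyperparameter settings the adversarial example must defeat.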

DNN Framework based ensemble training. We propose to deploy two types of DNN framework based ensemble training. The first approach is to train a DNN ensemble model for prediction using a number of different deep learning frameworks (e.g., TensorFlow, Caffe, Torch) or using different hyperparameters, such as feature maps, within one framework such as TensorFlow. Recall from Section 5 that the same adversarial examples have inconsistent effects under different sizes of feature maps and under different DL frameworks from different DNN software providers, due to variations in the neural network structures and parallel computation libraries used in their implementations (liu2018benchmarking, ). In addition to the ensemble of final trained DNN models, the second approach for DNN framework based ensemble training is to allow multiple DNN models trained over the same training dataset to co-exist for serving the prediction queries.

Both approaches provide a number of advantages. First, different models respond to the same adversarial example very differently in terms of easy and hard attacks, as shown in Section 5; thus the prediction API can detect inconsistency, spot attack attempts, and mitigate risks proactively. Second, one can also integrate the two approaches for DNN framework based ensemble training to further strengthen attack resilience, by combining multi-framework or multi-configuration hyperparameter based ensemble training with a multi-view based prediction ensemble. Such an integrated approach can provide a larger pool of alternative trained models for both training and prediction-based consensus, which further strengthens the prediction query based consensus.
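The inconsistency check mentioned in the first advantage could look roughly like the following. It is a sketch under assumptions: the `Verdict` type, the `min_agreement` parameter, and the example framework names and labels are hypothetical, and each model is assumed to have already produced a label for the same query.

```python
from collections import Counter
from typing import Hashable, Mapping, NamedTuple


class Verdict(NamedTuple):
    label: Hashable    # most common label across models
    agreement: float   # fraction of models voting for that label
    alarm: bool        # True when agreement falls below the required level


def cross_model_verdict(
    predictions: Mapping[str, Hashable], min_agreement: float = 1.0
) -> Verdict:
    """Compare labels from co-existing models and flag disagreement."""
    if not predictions:
        raise ValueError("at least one model prediction is required")
    votes = Counter(predictions.values())
    label, count = votes.most_common(1)[0]
    agreement = count / len(predictions)
    return Verdict(label, agreement, agreement < min_agreement)


# Hypothetical labels returned for one query by models from three frameworks.
verdict = cross_model_verdict({"tensorflow": 7, "caffe": 7, "torch": 1})
```

With the default `min_agreement` of 1.0, any disagreement raises the alarm; a lower value trades detection sensitivity for fewer false alarms on inputs where benign models naturally differ.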

Finally, we would like to note that our proposed DNN framework ensemble approaches are different from the cross ML-models based ensemble strategy, which builds an ensemble learning model by integrating different machine learning models, such as SVMs and decision trees, with DNNs in order to train a prediction model on the same training dataset (papernot2016transferability, ). Recent studies of the transferability of adversarial attacks have shown that cross ML models based ensemble learning may not be effective under transferability of untargeted adversarial attacks (tramer2017space, ; papernot2017practical, ; papernot2016transferability, ; liu2016delving, ). As pointed out in (liu2016delving, ), the transferability only works under untargeted adversarial attacks, and targeted adversarial attacks do not transfer. Therefore, our proposed two types of multi-framework based ensemble strategies can be viewed as a step forward towards developing a unifying mitigation architecture for both targeted and untargeted adversarial attacks.

7. Related Work

Research on adversarial attacks in deep learning can be classified into two broad categories: attack algorithms and defense proposals (metzen2017detecting, ; carlini2017detection, ; goodfellow2014explaining, ; papernot2016distillation, ; gu2014towards, ; cao2017mitigating, ; goswami2018unravelling, ; zhao2018retrieval, ; rakin2018robust, ; meng2017magnet, ). Given the vulnerability of DNNs, there have been quite a few attempts to build systems that are robust against adversarial examples. Two classes of defense mechanisms have been proposed. The first type of defense is to detect adversarial examples so that malicious data can be removed before prediction (metzen2017detecting, ; carlini2017detection, ). (carlini2017detection, ) surveys ten recent proposals designed for adversarial example detection and shows that all ten detection methods can be defeated by constructing new loss functions, which makes adversarial example detection of little use. Besides, a defense mechanism that simply distinguishes between clean and adversarial data is not strong enough; it is better to also correctly classify the carefully injected adversarial examples. The second type of defense is to increase robustness by modifying the DNN model, aiming to increase the cost of crafting benign samples into misleading ones (goodfellow2014explaining, ; papernot2016distillation, ; gu2014towards, ; cao2017mitigating, ; goswami2018unravelling, ; zhao2018retrieval, ; rakin2018robust, ; zantedeschi2017efficient, ; ahmed2017poster, ). Representative defenses include adversarial training (goodfellow2014explaining, ), autoencoder-based defense (gu2014towards, ) and defensive distillation (papernot2016distillation, ). While adversarial training is computationally inefficient, the latter two mechanisms require major modifications to the DNN architecture. Region-based defense (cao2017mitigating, ) is a recently proposed defense mechanism. It generates a number of samples around the input data and regards the label of the majority of the samples as the predicted label of the input.
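The region-based idea of sampling around the input and taking the majority label can be sketched as follows. Uniform sampling from a hypercube of half-width `radius`, the parameter names, and the toy classifier are assumptions of this illustration, not details taken from the text.

```python
import random
from collections import Counter
from typing import Callable, Hashable, List, Sequence


def region_predict(
    predict: Callable[[List[float]], Hashable],
    x: Sequence[float],
    radius: float,
    n_samples: int,
    rng: random.Random,
) -> Hashable:
    """Label x by majority vote over points sampled around it.

    Assumption of this sketch: neighbours are drawn uniformly from the
    hypercube [x_i - radius, x_i + radius] in every coordinate.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    votes: Counter = Counter()
    for _ in range(n_samples):
        neighbour = [xi + rng.uniform(-radius, radius) for xi in x]
        votes[predict(neighbour)] += 1
    return votes.most_common(1)[0][0]


# Toy stand-in classifier: class 1 if the coordinates sum to a positive value.
toy_classifier = lambda point: int(sum(point) > 0)
label = region_predict(toy_classifier, [1.0, 1.0], 0.1, 25, random.Random(0))
```

For image inputs, the sampled points would additionally need to be clipped to the valid pixel range, which this sketch omits.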

For classification of adversarial attacks, (madry2017towards, ) shows the impact of network architecture on adversarial robustness, claiming that networks with a larger capacity than needed for correctly classifying natural examples could reliably withstand adversarial attacks. However, a larger network capacity increases computation overhead significantly. (ma2018dawn, ) uses local intrinsic dimensionality in layer-wise hidden representations of DNNs to study adversarial subspaces, while (lu2018limitation, ) points out its limitation. In spite of a growing number of proposed attacks and defenses, there is a lack of statistical and principled characterization of adversarial attacks, which is critical for systematic instrumentation of mitigation strategies and defense methods.

8. Conclusion

We have taken a holistic approach to study the effectiveness and divergence of adversarial examples and attacks in deep learning systems. We show that by providing a general formulation and establishing a basic principle for adversarial attack algorithm design, we are able to define statistical measures and categorize successful attacks into easy and hard cases. These developments enhance our ability to analyze both convergence and divergence of adversarial behavior with respect to easy and hard attacks, in terms of success rate, degree of change, entropy and fraction of successful adversarial attacks, as well as under different hyperparameters and different DNN frameworks. By leveraging the fact that adversarial attacks exhibit multi-level inconsistency and unpredictability, regardless of the specific attack algorithms and adversarial perturbation methods, we put forward both prediction phase mitigation strategies and training phase mitigation strategies against present and future adversarial attacks in deep learning.

Acknowledgements.
This research is partially supported by the National Science Foundation under Grants SaTC 1564097, NSF 1547102, and an IBM Faculty Award.

References