1 Introduction
Although deep neural networks demonstrate outstanding performance on many machine learning tasks, researchers have found that they are susceptible to attacks by adversarial examples ([2]; [3]). Adversarial examples, which are generated by adding crafted perturbations to legitimate input samples, are typically indistinguishable from those samples to human eyes. For classification tasks, these perturbations may cause the legitimate samples to be misclassified by the model at inference time. While there is no widely agreed conclusion, several studies have attempted to explain the underlying causes of the susceptibility of deep neural networks to adversarial examples. The vulnerability has been ascribed to the linearity of the model ([3]), low flexibility ([4]), or the flatness/curvedness of the decision boundaries ([5]), but a more general cause is still under investigation. The recent literature considers two types of threat models: black-box and white-box attacks. In black-box attacks, the attacker is assumed to have no access to the architecture and parameters of the model, whereas in white-box attacks, the attacker has complete access to such information. Several white-box attack methods have been proposed ([3], [6], [7], [8], [9]).
In response, several defenses have been proposed to mitigate the effect of adversarial attacks. These defenses were developed along three main directions: (1) expanding the training data to make the classifier learn the underlying function more robustly, e.g., by adversarial training, which augments the training data set with adversarial examples generated by certain attack methods ([2], [3], [10]); (2) modifying the training procedure to reduce the gradients of the model w.r.t. the input such that the classifier becomes more robust to input perturbations, e.g., via input gradient regularization ([11], [12]); and (3) using external models as network add-ons when classifying unseen examples (feature squeezing [13], MagNet [14], and DefenseGAN [15]).
Adversarial training, a simple but effective method to improve the robustness of a deep neural network against white-box adversarial attacks, uses the same white-box attack mechanism to generate adversarial examples for augmenting the training data set. However, if the attacker applies a different attack strategy, adversarial training does not work well due to gradient masking [16]. [1] have suggested the effectiveness of iterative multi-step adversarial attacks. In particular, it was suggested that projected gradient descent (PGD) may be considered the strongest first-order attack, so that adversarial training with PGD can boost the resistance against many other first-order attacks. However, in the literature a large number (e.g. 40) of back-propagation steps is typically used in the iterative attack method of PGD or its closely related variant, the iterative fast gradient sign method (IFGSM) [10], to find strong adversarial examples for each adversarial training step, incurring a prohibitively high computational cost, particularly for large DNNs or training datasets.
In this paper, we propose an efficient two-step adversarial defense technique, called e2SAD, to facilitate defense against multiple types of white-box and black-box attacks with a quality on a par with the expensive adversarial training using the well-known multi-step attack, the iterative fast gradient sign method (IFGSM) [10]. The first step of e2SAD is similar to basic adversarial training, where an adversarial example is generated by applying a simple one-step attack method such as the fast gradient sign method (FGSM). Then, in the second step, e2SAD attempts to generate a second adversarial example at which the vulnerability of the current model is maximally revealed, such that the resulting defense is at the same quality level as the much more expensive IFGSM-based adversarial training. Finally, the two adversarial examples are taken into consideration in the proposed loss function, according to which a more robust model is trained, resulting in strong defense against both one-step and multi-step iterative attacks with a training time much less than that of adversarial training using IFGSM. The main contributions of this paper are as follows:

We propose a computationally efficient method to generate two adversarial examples per input example while effectively revealing the vulnerability of the learned classifier in the neighborhood of each clean data point;

We show that by considering the generated adversarial examples as part of a well-designed final loss function, the resulting model is robust to both one-step and iterative white-box attacks;

We further demonstrate that by adopting other techniques in our two-step approach, such as soft labels and hyperparameter tuning, robust defense against black-box attacks can be achieved.
2 Background
We provide a brief overview of related existing attack and defense methods, some of which are also used as baselines for comparison with the proposed e2SAD approach.
Attack Models and Algorithms.
The goal of all attack models is to find a perturbation $\delta$ to be added to a clean input $x$, resulting in an adversarial example $x_{adv} = x + \delta$ which may potentially lead to misclassification by the classifier. Typically, the noise level of the perturbation is constrained to an $\ell_\infty$ ball, $\|\delta\|_\infty \le \epsilon$, to make sure that the perturbation is sufficiently small. Based on the amount of information the attacker knows, there are two threat levels as follows:

White box: the attacker has full information about the model, including its architecture and parameters, such that it is possible to craft adversarial examples that specifically target the model using techniques such as gradient-based attacks;

Black box: the attacker has no knowledge about the architecture and parameters of the model. Neither is the attacker able to query the model. Adversarial examples can be generated using a substitute model which is a whitebox to the attacker.
2.1 White Box Attacks
The Fast Gradient Sign Method (FGSM).
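Given a clean input $x$ and its corresponding true label $y$, FGSM perturbs $x$ by [3]:
$$x_{adv} = x + \epsilon \cdot \mathrm{sign}\left(\nabla_x L(x, y; \theta)\right) \qquad (1)$$
where $L(x, y; \theta)$ is the loss function of the model with parameters $\theta$ and $\epsilon$ is a constant value used to constrain the noise level of the perturbation.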
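The FGSM perturbation described above can be illustrated with a minimal sketch on a toy linear softmax classifier. The weights `W`, bias `b`, input `x`, and noise level `eps` are hypothetical values chosen for illustration and do not come from the paper; the input gradient uses the closed form that holds for a linear softmax model with cross-entropy loss.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(W, b, x):
    """Class probabilities of a toy linear softmax classifier (hypothetical model)."""
    logits = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    return softmax(logits)

def loss_grad_x(W, b, x, y):
    """Gradient of the cross-entropy loss w.r.t. the input: W^T (p - onehot(y))."""
    p = predict(W, b, x)
    d = [pi - (1.0 if c == y else 0.0) for c, pi in enumerate(p)]
    return [sum(W[c][j] * d[c] for c in range(len(W))) for j in range(len(x))]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(W, b, x, y, eps):
    """One signed-gradient step of size eps away from the clean input."""
    g = loss_grad_x(W, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]

# Hypothetical two-class model and clean input with true class 0.
W = [[2.0, -1.0], [-1.0, 2.0]]
b = [0.0, 0.0]
x, y = [0.6, 0.4], 0

x_adv = fgsm(W, b, x, y, eps=0.2)
p_clean = predict(W, b, x)
p_adv = predict(W, b, x_adv)
```

In this toy setting the single step moves the input across the decision boundary, so the predicted class changes from 0 to 1 even though each coordinate moved by only `eps`.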
Iterative Fast Gradient Sign Method (IFGSM) and PGD.
The IFGSM attack generates adversarial examples by iteratively applying the FGSM attack multiple, say $k$, times to a clean input $x$ with a small constant step size [10]
(2) 
In our implementation, we set
. Typically, each component of the input vector, e.g. a pixel, is normalized to be within [0, 1]. The function $\mathrm{clip}(\cdot)$ is an element-wise clipping function which clips each element of its input into the range [0, 1].
Projected gradient descent (PGD) is a closely related variant of IFGSM. Typically, PGD first randomly picks a point within a confined small ball around each clean input and then applies the multi-step IFGSM to generate adversarial examples for that clean input.
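The iterative procedure and its PGD variant can be sketched as follows on the same kind of toy linear softmax classifier. Everything specific here is an assumption made for illustration: the model, the per-step size `alpha = eps / k`, and the choice to keep the iterate both inside [0, 1] and within `eps` of the clean input.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(W, b, x):
    logits = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    return softmax(logits)

def loss_grad_x(W, b, x, y):
    """Input gradient of the cross-entropy loss for a linear softmax model."""
    p = predict(W, b, x)
    d = [pi - (1.0 if c == y else 0.0) for c, pi in enumerate(p)]
    return [sum(W[c][j] * d[c] for c in range(len(W))) for j in range(len(x))]

def sign(v):
    return (v > 0) - (v < 0)

def clip_to_ball(x_adv, x, eps):
    """Keep each element within eps of the clean input and inside [0, 1]."""
    return [min(max(xa, max(xi - eps, 0.0)), min(xi + eps, 1.0))
            for xa, xi in zip(x_adv, x)]

def ifgsm(W, b, x, y, eps, k, random_start=False, seed=0):
    alpha = eps / k  # assumed per-step size, not a value taken from the paper
    x_adv = list(x)
    if random_start:  # PGD-style: start from a random point in the ball
        rng = random.Random(seed)
        x_adv = clip_to_ball([xi + rng.uniform(-eps, eps) for xi in x], x, eps)
    for _ in range(k):
        g = loss_grad_x(W, b, x_adv, y)
        x_adv = [xa + alpha * sign(gi) for xa, gi in zip(x_adv, g)]
        x_adv = clip_to_ball(x_adv, x, eps)
    return x_adv

# Hypothetical two-class model and clean input with true class 0.
W = [[2.0, -1.0], [-1.0, 2.0]]
b = [0.0, 0.0]
x, y = [0.6, 0.4], 0

x_ifgsm = ifgsm(W, b, x, y, eps=0.2, k=4)
x_pgd = ifgsm(W, b, x, y, eps=0.2, k=4, random_start=True)
```

Each of the `k` iterations needs one gradient computation, which is the cost that grows linearly with the number of steps when such an attack is used inside adversarial training.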
2.2 Representative Existing Defense Methods
Adversarial Training.
This is a popular defense approach which augments the training dataset with adversarial examples ([3], [10]). In our implementation, we adopt the adversarial training objective proposed in [3] as the loss function
$$\tilde{L}(x, y; \theta) = \alpha L(x, y; \theta) + (1 - \alpha) L(x_{adv}, y; \theta) \qquad (3)$$
where $x_{adv}$ is the adversarial example generated from $x$ and $\alpha$ is a constant specifying the relative importance of the adversarial examples. In our later comparisons, we choose two methods, FGSM and IFGSM, for generating the adversarial examples.
Min-max.
There exist defense methods ([17]; [1]) which view the process of training a robust model as solving a min-max optimization problem
$$\min_{\theta} \; \mathbb{E}_{(x, y) \sim D} \left[ \max_{\|\delta\|_\infty \le \epsilon} L(x + \delta, y; \theta) \right] \qquad (4)$$
where $D$ is the underlying training data distribution, $L$ is the loss function, and $\theta$ denotes the parameters of the model. In [17] and [1], the maximization with respect to the perturbation $\delta$ is approximated by a specific attack method, for example, by PGD [1].
[18] proposes a new approach which, instead of targeting the saddle points like previous methods, can find the true optimal solution of the min-max optimization problem. [18] chooses FGSM to approximate the inner maximization step; for the outer minimization step, instead of plugging in the adversarial version of the clean data directly and solving the optimization problem, it changes the minimization objective function to
(5) 
where is a constant restricting the level of input perturbation, and is the number of the training examples in each minibatch.
3 Methods
3.1 Adversarial Training
Adversarial training, which augments the training dataset with adversarial examples during the training process, has been shown to increase the robustness of the model against white-box attacks when the attack method used to generate the augmented training set is the same as the method used by the attacker. However, if the attacker uses a different strategy for the white-box attack, adversarial training does not perform well. For example, adversarial training using one-step FGSM cannot improve the robustness of the model against multi-step attacks such as IFGSM and PGD. On the other hand, compared to adversarial training using IFGSM or PGD, adversarial training using one-step FGSM takes much less training time since it needs only one step of back propagation to generate an adversarial example during each training iteration. [1] suggest that PGD, one particular type of multi-step iterative adversarial attack, is the strongest universal first-order adversary. It is also suggested that a model trained by adversarial training with PGD is robust against both PGD and one-step FGSM, however at the expense of multiple steps of back propagation per clean training data point.
3.2 Proposed Efficient TwoStep Adversarial Defense (e2SAD)
Our objective is to develop a defense method with a cost similar to that of FGSM adversarial training while being robust to both FGSM and multi-step attacks such as IFGSM.
As shown in Figure 1, the proposed efficient two-step adversarial defense (e2SAD) approach takes only two steps of back propagation to find adversarial examples. First, for a given input we define the categorical distribution of the model as the vector of probabilities the model outputs, where each component of the vector represents the probability that the input belongs to the corresponding class. At the first step of e2SAD, a one-step attack method such as FGSM is applied to find the first adversarial example for each clean input. At the second step, within the neighborhood of the first adversarial example, the input point whose categorical distribution is most different from that of the first adversarial point is selected as the second adversarial example. For each clean input, these two generated adversarial examples are considered in the final loss function for training. The loss function consists of three terms: the loss of the original clean inputs, the loss of the adversarial examples generated at the first step, and the dissimilarity in categorical distribution of all pairs of the corresponding first and second adversarial examples. It is worth noting that the two-step e2SAD approach is structured in a particular way such that it may provide strong defense against both one-step and multi-step attacks, as detailed below.
3.2.1 Robustness against onestep adversarial attacks
The main objective of the first step of e2SAD is to find a highly vulnerable neighborhood immediately around each clean training data point such that the trained model can be made robust to one-step gradient-based attacks. To do so, we simply apply a one-step attack method such as FGSM to maximize the loss around each clean input to generate the first adversarial example
(6) 
where the step size is a chosen constant. We include the loss of this adversarial example in the final loss function (8), discussed in detail in the next subsection. Essentially, by doing so, the first term of (8) guides the training process to reduce the losses of both the clean input and its first adversarial example, acting as a mechanism for defending against one-step adversarial attacks.
3.2.2 Robustness against iterative adversarial attacks
As discussed earlier, compared to one-step adversarial attacks, iterative multi-step attacks can be much stronger as they search the neighborhood of a clean data point more exhaustively, which in turn makes adversarial training using iterative adversarial attacks a stronger defense. At the second step of e2SAD, our goal is to efficiently defend against multi-step attacks by using only one extra step of computation. As such, the key challenge here is to find a second adversarial example which is close to the first one and can effectively reveal the vulnerability of the model in a way similar to expensive multi-step attacks.
In a multi-step attack method such as IFGSM or PGD, each adversarial example in the iterative process is typically found by perturbing the preceding adversarial example to maximize its loss, where the loss, for example, may be described using the cross entropy based on either the hard or soft label. Despite this common practice, we argue that a more appropriate approach is to instead constrain the training process such that a level of similarity (or uniformity) in the prediction of the trained model is maintained in the neighborhood of each clean input. It is important to note that maintaining similarity of prediction and minimizing the loss may be correlated but are not necessarily identical objectives; the latter attempts to ensure that predictions made in some neighborhood of the input individually have low loss without specifically constraining these predictions to be similar to each other. Nevertheless, we believe that the objective of maintaining similarity of prediction is more relevant as far as adversarial defense is concerned, as it may lead to a well-regularized decision boundary around each clean input.
With the above understanding, at the second step of e2SAD, we attempt to find the second adversarial example whose categorical distribution is maximally different from that of the first adversarial example within the neighborhood of the latter. The dissimilarity in categorical distribution between these two points is measured by cross entropy (CE). To locate the second adversarial example, FGSM is used as a one-step optimization method to maximize the CE-based dissimilarity measure
(7) 
where is the step size, and the gradient of the CEbased dissimilarity is evaluated at .
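The second step can be sketched on a toy linear softmax classifier as follows. This is an illustration under stated assumptions, not the authors' implementation: the model, the step size `eps2`, and in particular the point at which the gradient is evaluated are hypothetical. The sketch evaluates the gradient at a small random offset from the first adversarial example, because the CE dissimilarity between a distribution and itself is at its minimum and its input gradient vanishes exactly at the first adversarial example.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(W, b, x):
    """Categorical distribution of a toy linear softmax classifier."""
    logits = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    return softmax(logits)

def sign(v):
    return (v > 0) - (v < 0)

def ce_dissimilarity(p_ref, q):
    """Cross entropy H(p_ref, q) = -sum_i p_ref[i] * log q[i]."""
    return -sum(pr * math.log(qi) for pr, qi in zip(p_ref, q))

def dissimilarity_grad_x(W, b, p_ref, x_eval):
    """Input gradient of H(p_ref, f(x')) at x_eval: W^T (f(x_eval) - p_ref)."""
    q = predict(W, b, x_eval)
    d = [qi - pr for qi, pr in zip(q, p_ref)]
    return [sum(W[c][j] * d[c] for c in range(len(W))) for j in range(len(x_eval))]

def second_adversarial(W, b, x_adv1, eps2, noise=1e-2, seed=0):
    rng = random.Random(seed)
    p_ref = predict(W, b, x_adv1)  # distribution at the first adversarial example
    # Assumed evaluation point: a small random offset from x_adv1.
    x_eval = [xi + rng.uniform(-noise, noise) for xi in x_adv1]
    g = dissimilarity_grad_x(W, b, p_ref, x_eval)
    return [xi + eps2 * sign(gi) for xi, gi in zip(x_adv1, g)]

# Hypothetical model and first adversarial example.
W = [[2.0, -1.0], [-1.0, 2.0]]
b = [0.0, 0.0]
x_adv1 = [0.4, 0.6]

x_adv2 = second_adversarial(W, b, x_adv1, eps2=0.1)
p1 = predict(W, b, x_adv1)
p2 = predict(W, b, x_adv2)
d_self = ce_dissimilarity(p1, p1)  # baseline: entropy of p1
d_pair = ce_dissimilarity(p1, p2)  # dissimilarity reached by the second step
```

Only one additional gradient computation is needed, so the pair of adversarial examples costs two back-propagation steps in total.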
The reason for using categorical distribution as the measure of dissimilarity to find the second adversarial point is as follows. First, note that the value of loss for a model prediction does not fully indicate whether the prediction is a misclassification or not. To see this, consider a simple classification task with three classes. Assume that the true class labels for two different inputs are both the first class, and the corresponding categorical distributions are and
, respectively. Let us further assume that one-hot encoding is conventionally used for the labels. In this case, the model misclassifies the first input while correctly classifying the second. However, this happens even though the loss of the first input is lower than that of the second input.
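A small numeric check makes this point concrete. The two categorical distributions below are hypothetical values picked for illustration and are not the ones used in the paper; the loss is the cross entropy with a one-hot label.

```python
import math

def one_hot_loss(p, true_class):
    """Cross-entropy loss with a one-hot label: -log of the true-class probability."""
    return -math.log(p[true_class])

def predicted_class(p):
    return p.index(max(p))

true_class = 0                  # both inputs belong to the first class
p_first = [0.45, 0.55, 0.00]    # hypothetical distribution of the first input
p_second = [0.40, 0.30, 0.30]   # hypothetical distribution of the second input

loss_first = one_hot_loss(p_first, true_class)
loss_second = one_hot_loss(p_second, true_class)
pred_first = predicted_class(p_first)
pred_second = predicted_class(p_second)
```

Here the first input is misclassified (class index 1 is predicted) while its loss is lower than that of the correctly classified second input, so ranking points by loss alone does not identify the misclassified one.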
Figure 2 shows how the choice of the optimization objective may influence the generation of the second adversarial example for an illustrative threeclass classification problem. The probabilities of three classes predicted by a trained model for a set of inputs are illustrated using the green, red, and purple curves, respectively. Accordingly, the onehot encoding loss as a function of the input is shown by the blue curve. The cross entropy of categorical distribution between each input and the first adversarial example (green cross) is shown by the orange curve. Note that both the clean input and first adversarial example belong to class 1 in this setup. Starting from the green cross, maximizing the loss using FGSM produces the red cross as the second adversarial example. In comparison, using the CE dissimilarity measure as the objective function leads to the yellow cross. While having the highest loss, the red cross is correctly classified by the model. On the other hand, misclassification happens at the yellow cross which is found based on the CE dissimilarity measure, suggesting its effectiveness in finding stronger adversarial points.
3.2.3 The final loss function
Based on the two adversarial examples generated at the two steps of e2SAD, we design the loss function used for training the final model as follows. For a mini-batch of clean examples and the corresponding mini-batch of the first set of adversarial examples generated at the first step of e2SAD, the total loss function is given by
(8) 
where is the parameters of the model, each is the second adversarial example which has the maximally different categorical distribution from the corresponding first adversarial example , indicates the categorical distribution output function of the model for input , and is the crossentropy dissimilarity measure.
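The three-term structure of the training loss can be sketched as below. The weighting scheme is an assumption made for illustration: `alpha` weights the clean loss against the first adversarial loss, `lam` weights the dissimilarity term, and the batch is averaged. These names and the exact normalization are hypothetical and may differ from the paper's definition. Hard labels are used here for brevity.

```python
import math

def ce_hard(p, y):
    """Cross-entropy loss of distribution p against a hard label y."""
    return -math.log(p[y])

def ce_dissimilarity(p_ref, q):
    """Cross entropy between two categorical distributions."""
    return -sum(pr * math.log(qi) for pr, qi in zip(p_ref, q))

def e2sad_loss(p_clean, p_adv1, p_adv2, labels, alpha, lam):
    """Mean over the batch of: clean loss, first adversarial loss, and
    dissimilarity between the first and second adversarial predictions."""
    total = 0.0
    for pc, p1, p2, y in zip(p_clean, p_adv1, p_adv2, labels):
        total += alpha * ce_hard(pc, y) + (1.0 - alpha) * ce_hard(p1, y)
        total += lam * ce_dissimilarity(p1, p2)
    return total / len(labels)

# Hypothetical model outputs for a batch of one example with true class 0.
p_clean = [[0.9, 0.1]]
p_adv1 = [[0.6, 0.4]]
p_adv2 = [[0.3, 0.7]]
labels = [0]

loss = e2sad_loss(p_clean, p_adv1, p_adv2, labels, alpha=0.5, lam=1.0)
```

Setting `lam` to zero reduces this sketch to ordinary adversarial training with a single adversarial example per input, which shows that the dissimilarity term is the only part that involves the second adversarial example.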
3.3 The training process
We adopt label smoothing [2] for the training process. Here, instead of using hard labels (onehot labels) for each crossentropy loss, we employ the socalled soft labels which assign the correct class a target probability of and divide the remaining probability mass uniformly among the incorrect classes. We have found that the use of label smoothing in e2SAD leads to better performance.
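Soft labels of this kind can be built as in the following sketch. The helper names are illustrative; the correct-class probability of 0.75 is the value reported for the experiments in Section 4, and the remaining mass is split uniformly over the other classes.

```python
import math

def soft_label(num_classes, true_class, p_correct):
    """Target distribution: p_correct on the true class, the rest shared uniformly."""
    rest = (1.0 - p_correct) / (num_classes - 1)
    return [p_correct if c == true_class else rest for c in range(num_classes)]

def cross_entropy(target, probs):
    """Cross entropy of predicted probabilities against a (soft) target."""
    return -sum(t * math.log(q) for t, q in zip(target, probs) if t > 0)

target = soft_label(num_classes=10, true_class=3, p_correct=0.75)

# Loss of a uniform prediction against the soft target.
uniform = [0.1] * 10
loss = cross_entropy(target, uniform)
```

With ten classes, each incorrect class receives 0.25 / 9 of the probability mass, so the target still sums to one.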
The overall training algorithm of the proposed e2SAD approach is summarized in Algorithm 1. Its two weight hyperparameters specify the weights for the losses of the clean and first set of adversarial inputs and those for the dissimilarity between each pair of the first and second adversarial inputs, respectively. While the first two terms in the final loss function target the defense against one-step adversarial attacks, the last term mainly plays the role of defending against multi-step attacks. In practice, the two weights should be properly chosen to balance between these two different defense needs.
We visually show the two-step e2SAD adversarial example generation process and the loss surfaces of four different models for a mini-batch of 128 clean images from the MNIST handwritten digits dataset [19] in Figure 3 and Figure 6 of the Appendix, respectively, to demonstrate the effectiveness of e2SAD.
4 Experimental Results
We compare the proposed e2SAD method with two widely adopted techniques in the literature: adversarial training using single-step FGSM [3] and adversarial training using multi-step IFGSM. We also report our experience with the minimax adversarial defense method proposed in [18]. We adopt the widely used MNIST handwritten digits dataset [19] and the Street View House Numbers (SVHN) dataset [20] as benchmarks.
4.1 Results on the MNIST Handwritten digits dataset
MNIST consists of 60,000 training images and 10,000 testing images, where each pixel value is normalized to be within [0, 1]. The adversarial attacks considered are:

Whitebox attacks with FGSM under different noise levels: .

Whitebox attacks with IFGSM under the fixed noise level of with different numbers of steps: .

Blackbox attacks from three substitute models: the naturally trained model (i.e. the one trained using only the clean inputs without any additional defense strategy), one trained with FGSM adversaries under the noise level of , and one trained with IFGSM adversaries under the total noise level of and step size of 0.01 . With respect to these substitute models, IFGSM with and is used to generate adversarial examples, which are then employed to attack each of the targeted models.
All CNNs we use consist of two convolutional layers with 32 and 64 filters respectively, each of which is followed by a max-pooling layer and a ReLU activation function, and a fully connected layer of 1024 neurons. The configuration of the CNNs is summarized in Table 4 in the Appendix.
4.1.1 Results on WhiteBox Attacks
For our proposed e2SAD method, we set the hyperparameters in the training Algorithm 1 as: , , , and . To increase the searching ability of the second step of e2SAD, we do not clamp the second adversarial point to be within a norm ball around the clean data point. All models are trained on MNIST for 30,000 iterations with the batch size of 256.
Comparison with adversarial training
The performances of different models under various white-box attacks are shown in Table 1. It can be seen that each model reaches an accuracy of over 99% on the clean dataset. The naturally trained baseline model shows no defense ability against either FGSM or IFGSM adversaries, while the other three models demonstrate different levels of defense. The model obtained via FGSM adversarial training maintains a very high accuracy under FGSM attacks with different noise levels. However, FGSM adversarial training can only defend against FGSM attacks and shows no defense ability against IFGSM attacks of any step number. IFGSM adversarial training performs well under IFGSM adversaries and also shows robustness against FGSM attacks. However, its defense ability drops fast when the noise level increases in the case of FGSM attacks. Specifically, the accuracy can drop by almost under the FGSM attacks when the noise level increases to . Note that it takes 30 steps to generate IFGSM adversarial examples in each training iteration, leading to the high cost of the considered IFGSM adversarial training.
Attack  Label  Natural  FGSM Adv. Train  IFGSM Adv. Train  e2SAD  

Clean Data  H  0.9942  0.9938  0.9921  0.9913  0.9932  
S  0.9952  0.9943  0.9941  0.9913  
FGSM  H  0.1741  0.9768  0.9658  0.9519  0.9641  
S  0.4256  0.9846  0.9919  0.9595  
H  0.1027  0.9136  0.9732  0.8152  0.9499  
S  0.2146  0.9374  0.9914  0.9012  
IFGSM  k=10  H  0.0001  0.0856  0.2108  0.9336  0.8687 
S  0.1019  0.1166  0.0101  0.9422  
k=30  H  0  0.082  0.1968  0.9325  0.8633  
S  0.093  0.0966  0.0065  0.9412 
Among all models considered, the proposed e2SAD method produces the highest accuracy for both the clean data and FGSM attacks at different noise levels. Under IFGSM attacks e2SAD significantly outperforms the FGSM adversarial training, demonstrating the effectiveness of the proposed twostep approach’s generalization capability with respect to defense against strong multistep attacks. Compared with the adversarial training using IFGSM, e2SAD offers stronger defense against FGSM attacks while maintaining a good robustness against IFGSM attacks. Note that these are achieved using only two steps of gradient calculation in each training iteration, presenting a significant reduction of computational cost compared with the IFGSM adversarial training, which performs 30 steps of gradient computation.
Label smoothing is adopted in e2SAD and it is shown to be effective in helping the trained model generalize well. In our experiments, we set the probability for the correct label to 0.75 and the one for all other incorrect labels to 0.25. Table 1 shows that label smoothing also improves the performance of the traditional adversarial training under some circumstances, but not significantly.
Comparison with the minimax adversarial defense
We also implemented the minimax adversarial defense method proposed in [18], with the minor modification that the model is trained using a mixture of clean and adversarial examples to achieve better performance. Our results show that the trained model is very robust against FGSM attacks but shows no defense against IFGSM attacks.
4.1.2 Results on BlackBox Attacks
In Table 2, we consider how adversarial examples generated by applying IFGSM to a substitute model may attack a different model. The rows of the table are the considered substitute models: “Natural model” is again the baseline model without any additional defense strategy; “FGSM ” is the model obtained via FGSM adversarial training with the setting ; “IFGSM ” is the model obtained via IFGSM adversarial training with the setting . The substitute models are trained using hard labels (“H”) and label smoothing (“S”), then attacked by IFGSM with the setting () for generating adversarial examples. The adversarial examples generated from the substitute models are used to attack the four models shown in the columns of the table: “Natural” is the baseline model; “FGSM Adv. Train” and “IFGSM Adv. Train” are models trained by the FGSM and IFGSM adversarial training using the settings specified in the table, respectively; “e2SAD” is the proposed model. The models under attack are trained using both hard labels and label smoothing, except for e2SAD, which is based on label smoothing only. Note that in Table 2, white-box attacks result when the substitute model and the one under attack are identical, and all other combinations correspond to black-box attacks.
Table 2 demonstrates that the proposed e2SAD approach delivers a wellbalanced defense against blackbox IFGSM attacks from all three substitute models with an accuracy of nearly or higher. There are several cases under which the natural training (baseline) and FGSM adversarial training have a poor performance. In all cases, e2SAD either noticeably outperforms both the natural training and FGSM adversarial training or produce a fairly close performance. Compared with the models trained with the 30steps IFGSM adversarial training, e2SAD is still very competitive particularly given the fact that only twosteps of gradient computation are performed at each training iteration.
Substitute Model  Label  Natural  FGSM Adv. Train  IFGSM Adv. Train  e2SAD  

Natural Model (H)  H  0  0.9083  0.9158  0.9668  0.868 
S  0.1376  0.7887  0.8963  0.9641  
FGSM (H)  H  0.9127  0.082  0.8945  0.967  0.9422 
S  0.9083  0.7581  0.8214  0.9671  
IFGSM (H)  H  0.9163  0.9319  0.9148  0.9324  0.8886 
S  0.885  0.768  0.8482  0.9574  
Natural Model (S)  H  0.9024  0.9742  0.9674  0.9838  0.9719 
S  0.0929  0.8485  0.8963  0.9847  
FGSM (S)  H  0.9777  0.9708  0.9543  0.9825  0.9698 
S  0.9745  0.0966  0.7658  0.9832  
IFGSM (S)  H  0.939  0.952  0.9464  0.9578  0.9525 
S  0.9435  0.9476  0.9481  0.9412 
4.2 Results on the Street View House Numbers (SVHN) Dataset
The Street View House Numbers (SVHN) dataset [20] consists of a training set of 73,257 digits and a testing set of 26,032 digits obtained from house numbers in Google Street View images, representing a significantly harder real-world dataset compared to MNIST. We process the SVHN dataset by removing the mean and normalizing the pixel values with the standard deviation of all pixels in each image so that the normalized pixel values are within [-1, 1].
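The per-image normalization described above can be sketched as follows. The function name and the flat pixel list are illustrative; the sketch only centers each image and divides by its own standard deviation, and any further rescaling into a fixed range is not shown.

```python
import math

def standardize_image(pixels):
    """Subtract the image mean and divide by the image standard deviation."""
    n = len(pixels)
    mean = sum(pixels) / n
    var = sum((p - mean) ** 2 for p in pixels) / n
    std = math.sqrt(var) or 1.0  # guard against a constant image
    return [(p - mean) / std for p in pixels]

# Hypothetical 4-pixel "image" with raw intensities in [0, 255].
image = [0, 64, 128, 255]
z = standardize_image(image)

z_mean = sum(z) / len(z)
z_var = sum(v * v for v in z) / len(z)
```

After this step every image has zero mean and unit variance regardless of its original brightness or contrast.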
We train three different models with the CNN configuration summarized in Table 5 in the Appendix and compare their performances under the scenario of white-box attacks. All models are trained for 20 epochs with the following setup:

FGSMbased adversarial training: {Batch size = 256, optimizer=AdamOptimizer with learning rate 0.001, , }

IFGSMbased adversarial training: {Batch size = 256, optimizer=AdamOptimizer with learning rate 0.001, , , attack steps=10}

e2SAD: {Batch size = 256, optimizer=AdamOptimizer with learning rate 0.001, , , , , label smoothing with correct class probability of 0.75}
The performances of the various models on this much harder SVHN dataset are summarized in Table 3. It turns out that e2SAD outperforms all other models in this case. More specifically, the baseline (natural) model shows no defense to any attack. e2SAD attains a significantly stronger robustness against the iterative IFGSM whitebox attacks compared with the FGSM adversarial training, which shows no defense to such attacks. Furthermore, compared with the expensive IFGSM adversarial training, e2SAD offers a much stronger defense against the onestep FGSM attacks. This fact may be attributed to the particular twostep structure of e2SAD, which is geared towards defending both onestep and multistep adversarial attacks.
Attack  Natural  FGSM Adv. Train  IFGSM Adv. Train  e2SAD  

Clean data  0.9006  0.9119  0.9001  0.9236  
FGSM  0.1962  0.7842  0.4289  0.7881  
IFGSM  , k=10  0.0628  0.0455  0.3228  0.3328 
, k=20  0.0602  0.0402  0.3165  0.4020  
, k=30  0.0593  0.0385  0.3146  0.3868 
5 Conclusion
We have aimed to improve the robustness of deep neural networks, particularly w.r.t. strong iterative multi-step attacks, by presenting an efficient two-step adversarial defense technique, e2SAD. This objective is achieved by finding a combination of two adversarial points that best reveals the vulnerability of the model around each clean input. In particular, we have demonstrated that, using a dissimilarity measure between the first and second adversarial examples, we are able to appropriately locate the second adversary such that including both types of adversaries in the final training loss function leads to improved robustness against multi-step adversarial attacks. We have demonstrated the effectiveness of e2SAD in terms of defense against white-box one-step FGSM and multi-step IFGSM attacks and black-box IFGSM attacks under various settings.
e2SAD provides a general mechanism for defending against both one-step and multi-step attacks and for balancing between these two defense needs, the latter of which can be achieved by properly tuning the corresponding weight hyperparameters in the training loss function. In future work, we will explore hyperparameter tuning and other new techniques to provide a more balanced or further improved defense quality for a wider range of white-box and black-box attacks.
References
 [1] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” arXiv preprint arXiv:1706.06083, 2017.
 [2] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” 2014.
 [3] I. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in International Conference on Learning Representations, 2015.
 [4] A. Fawzi, O. Fawzi, and P. Frossard, “Analysis of classifiers’ robustness to adversarial perturbations,” Machine Learning, vol. 107, no. 3, pp. 481–508, 2018.
 [5] S.M. MoosaviDezfooli, A. Fawzi, O. Fawzi, P. Frossard, and S. Soatto, “Analysis of universal adversarial perturbations,” arXiv preprint arXiv:1705.09554, 2017.

 [6] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, “The limitations of deep learning in adversarial settings,” in Security and Privacy (EuroS&P), 2016 IEEE European Symposium on, pp. 372–387, IEEE, 2016.
 [7] J. Su, D. V. Vargas, and S. Kouichi, “One pixel attack for fooling deep neural networks,” arXiv preprint arXiv:1710.08864, 2017.
 [8] N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” arXiv preprint arXiv:1608.04644, 2016.
 [9] S.M. MoosaviDezfooli, A. Fawzi, and P. Frossard, “Deepfool: a simple and accurate method to fool deep neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2574–2582, 2016.
 [10] A. Kurakin, I. Goodfellow, and S. Bengio, “Adversarial machine learning at scale,” arXiv preprint arXiv:1611.01236, 2016.
 [11] A. S. Ross and F. DoshiVelez, “Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients,” arXiv preprint arXiv:1711.09404, 2017.
 [12] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, “Distillation as a defense to adversarial perturbations against deep neural networks,” in 2016 IEEE Symposium on Security and Privacy (SP), pp. 582–597, IEEE, 2016.
 [13] J. Wang, J. Sun, P. Zhang, and X. Wang, “Detecting adversarial samples for deep neural networks through mutation testing,” arXiv preprint arXiv:1805.05010, 2018.
 [14] D. Meng and H. Chen, “Magnet: a twopronged defense against adversarial examples,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 135–147, ACM, 2017.
 [15] P. Samangouei, M. Kabkab, and R. Chellappa, “Defensegan: Protecting classifiers against adversarial attacks using generative models,” arXiv preprint arXiv:1805.06605, 2018.
 [16] N. Papernot, P. McDaniel, A. Sinha, and M. Wellman, “Towards the science of security and privacy in machine learning,” arXiv preprint arXiv:1611.03814, 2016.
 [17] R. Huang, B. Xu, D. Schuurmans, and C. Szepesvári, “Learning with a strong adversary,” arXiv preprint arXiv:1511.03034, 2015.
 [18] J. Hamm, “Machine vs machine: Minimaxoptimal defense against adversarial examples,” 2018.
 [19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradientbased learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [20] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in NIPS workshop on deep learning and unsupervised feature learning, vol. 2011, p. 5, 2011.
Appendix A
A.1 CNN model configurations
| Layer Type | Details |
|---|---|
| ReLU Convolutional | |
| Max Pooling | 2*2 |
| ReLU Convolutional | 64 filters (5*5, stride 1, padding 2) |
| Max Pooling | 2*2 |
| ReLU Fully Connect | 1024 units |
| Fully Connect | 10 units |
| Softmax | 10 units |

| Layer Type | Details |
|---|---|
| ReLU Convolutional | 32 filters (5*5, stride 1, padding 0) |
| ReLU Convolutional | 64 filters (5*5, stride 1, padding 0) |
| Max Pooling | 2*2 |
| ReLU Convolutional | 128 filters (3*3, stride 1, padding 0) |
| Max Pooling | 2*2 |
| ReLU Fully Connect | 512 units |
| Fully Connect | 10 units |
| Softmax | 10 units |
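As a rough consistency check on the second configuration above, the feature-map sizes can be traced layer by layer. The sketch below is illustrative only: the input image size is not stated in the table, so `input_size` is an assumed parameter, the 2*2 pooling is assumed to be non-overlapping (stride 2), and the helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a square convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2):
    """Spatial output size of max pooling, assuming stride equals the window."""
    return size // window


def trace_second_config(input_size):
    """Trace (channels, height/width) through the conv stack of the second table.

    input_size is an assumption: the table does not state the image size.
    """
    shapes = []
    size = conv_out(input_size, kernel=5)   # 32 filters, 5*5, stride 1, padding 0
    shapes.append((32, size))
    size = conv_out(size, kernel=5)         # 64 filters, 5*5, stride 1, padding 0
    shapes.append((64, size))
    size = pool_out(size)                   # max pooling 2*2
    shapes.append((64, size))
    size = conv_out(size, kernel=3)         # 128 filters, 3*3, stride 1, padding 0
    shapes.append((128, size))
    size = pool_out(size)                   # max pooling 2*2
    shapes.append((128, size))
    flat = 128 * size * size                # number of inputs to the 512-unit layer
    return shapes, flat


# Example with an assumed 32*32 input image.
shapes, flat = trace_second_config(32)
```

Changing `input_size` shows how strongly the width of the first fully connected layer's input depends on the image resolution when no padding is used.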
A.2 Visualization of e2SAD-generated adversarial examples
To demonstrate the two-step adversarial generation process of e2SAD, we consider a mini-batch of 128 clean images from the MNIST handwritten digits dataset [19]. We apply e2SAD to find the first and second adversarial examples for each clean image in the batch. To visualize the loss surface of the model around this mini-batch, as it may be explored by IFGSM attacks, in a two-dimensional subspace of the input space, we identify for each clean image a first search direction defined by the adversary that IFGSM finds for that image, and a second search direction orthogonal to the first. Around each clean image, we then generate a set of perturbed images along these two directions. The displacements along the two directions are the two lateral axes in Figure 3. Here the loss is the cross-entropy loss based on hard target labels. The mesh loss surface shows the loss of the model summed over the perturbed images for the entire mini-batch as a function of the two displacements. The blue dot at zero displacement is the loss of the mini-batch of clean images. The red line starting from this blue dot illustrates the two-step e2SAD adversarial search direction. The second and third blue dots on the red line show the losses summed over the first and second sets of adversarial examples, respectively, generated by e2SAD for this mini-batch of clean images. The locations of these two dots are projected onto the two coordinate axes for visualization. In this case, at the second step e2SAD is able to identify an effective set of adversarial examples with a loss further increased from the first set, suggesting its effectiveness in defending against both one-step and multi-step adversarial attacks.
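The two-direction grid described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' plotting code: the first direction is assumed to be the difference between the adversary and the clean image, the orthogonal direction is obtained by Gram-Schmidt on a random draw, images are flat lists of floats, and `toy_loss` is a hypothetical stand-in for the model's cross-entropy loss. The sketch handles a single image; the surface in the figure sums the loss over all images in the mini-batch.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthogonal_direction(d1, seed=0):
    """Return a vector orthogonal to d1 with the same norm (Gram-Schmidt on a random draw)."""
    rng = random.Random(seed)
    r = [rng.gauss(0.0, 1.0) for _ in d1]
    scale = dot(r, d1) / dot(d1, d1)
    d2 = [ri - scale * di for ri, di in zip(r, d1)]
    norm_ratio = math.sqrt(dot(d1, d1) / dot(d2, d2))
    return [norm_ratio * v for v in d2]


def loss_surface(x, x_adv, loss_fn, coords):
    """Evaluate loss_fn at x + a*d1 + b*d2 for every pair (a, b) drawn from coords."""
    # Assumed definition: direction from the clean image to its adversary.
    d1 = [xa - xi for xa, xi in zip(x_adv, x)]
    d2 = orthogonal_direction(d1)
    surface = {}
    for a in coords:
        for b in coords:
            point = [xi + a * u + b * v for xi, u, v in zip(x, d1, d2)]
            surface[(a, b)] = loss_fn(point)
    return surface, d1, d2


def toy_loss(p):
    """Hypothetical placeholder for the model loss."""
    return sum((pi - 0.5) ** 2 for pi in p)


x = [0.2, 0.5, 0.9, 0.1]        # clean image (flattened)
x_adv = [0.3, 0.4, 1.0, 0.0]    # adversary found by some attack
surface, d1, d2 = loss_surface(x, x_adv, toy_loss, coords=[-1.0, 0.0, 1.0])
```

In this parameterization the entry at `(0.0, 0.0)` is the loss of the clean image and the entry at `(1.0, 0.0)` is the loss at the adversary itself.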
A.3 Visualization of the loss surfaces of four different models
We visualize the loss surfaces of different models to shed light on their potential defense capabilities against both one-step FGSM attacks and multi-step IFGSM attacks in Figure (a) and Figure (b), respectively. Here, the baseline model is again trained only with clean data and no additional defense strategy; “FGSM Adv. Train” is the model trained by adversarial training with adversaries generated by FGSM; “IFGSM Adv. Train” is the model trained by adversarial training with adversaries generated by IFGSM; and e2SAD is the proposed approach. All models are trained on the MNIST dataset using a total of 30,000 mini-batches of 256 images each.
Figure (a) and Figure (b) illustrate the loss surface of each model in the input space as seen by FGSM and IFGSM attacks, respectively, when they generate adversarial examples. To make the visualization possible in a reduced two-dimensional input space, we take the approach adopted in Figure 3. For example, in the case of Figure (a), we identify for each clean image a first search direction defined by the adversary that the FGSM attack finds for that image, and a second search direction orthogonal to the first. Around each clean image, we then generate a set of perturbed images along these two directions. The displacements along the two directions are again the two lateral axes in Figure (a), as in Figure 3. The mesh loss surface of a model shows the loss summed over the perturbed images for the entire MNIST dataset as a function of the two displacements. Again, the value at zero displacement is the loss of all clean MNIST images. The same visualization approach is taken in Figure (b), with the difference that the two search directions are defined by the adversary found by the IFGSM attack for each clean image.
In both figures, the loss surface of the e2SAD model is the flattest, with the lowest average value within the large two-dimensional adversarial search space. This is consistent with the empirically observed effectiveness of e2SAD’s defense against both FGSM and IFGSM attacks.
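For reference, the one-step FGSM and multi-step IFGSM attacks compared in this section can be sketched on a toy model. The example below is an assumption-laden illustration: it uses a hypothetical logistic-regression classifier with an analytic input gradient instead of the CNNs above, the names `eps`, `step`, and `n_steps` are illustrative, and clipping to the valid pixel range is omitted for brevity.

```python
import math


def sign(v):
    return (v > 0) - (v < 0)


def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, x, y):
    """Binary cross-entropy loss for a hard label y in {0, 1}."""
    p = predict(w, b, x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def input_gradient(w, b, x, y):
    """Gradient of the loss with respect to the input x."""
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]


def fgsm(w, b, x, y, eps):
    """One-step attack: move every coordinate by eps along the gradient sign."""
    g = input_gradient(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]


def ifgsm(w, b, x, y, eps, step, n_steps):
    """Multi-step attack: repeated small sign steps, clipped to the eps-ball around x."""
    x_adv = list(x)
    for _ in range(n_steps):
        g = input_gradient(w, b, x_adv, y)
        x_adv = [xa + step * sign(gi) for xa, gi in zip(x_adv, g)]
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv


w, b = [1.0, -2.0, 0.5], 0.1
x, y = [0.2, 0.4, 0.6], 1
x_fgsm = fgsm(w, b, x, y, eps=0.1)
x_ifgsm = ifgsm(w, b, x, y, eps=0.1, step=0.025, n_steps=10)
```

For a model that is linear in its input, as here, the gradient sign does not change between steps, so both attacks reach the same corner of the perturbation ball; on a deep network the multi-step attack can follow the curvature of the loss surface, which is why the two attacks probe different surfaces in the figures.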