1 Introduction
Deep neural networks have achieved state-of-the-art performance on a wide variety of computer vision applications, such as image classification, object detection, tracking, and activity recognition [8]. In spite of their success in addressing these challenging tasks, they are vulnerable to active adversaries. Most notably, they are susceptible to adversarial examples (this also affects other machine learning classifiers, but we restrict our analysis to CNNs, which are most commonly used in computer vision tasks), in which adding small perturbations to an image, often imperceptible to a human observer, causes a misclassification [2, 18].
Recent research on adversarial examples has developed attacks that allow for evaluating the robustness of models, as well as defenses against these attacks. Attacks have been proposed to achieve different objectives, such as minimizing the amount of noise that induces misclassification [5, 18], or being fast enough to be incorporated into the training procedure [7, 19]. In particular, for obtaining adversarial examples with the lowest perturbation (measured by its ℓ2 norm), the state-of-the-art attack was proposed by Carlini and Wagner (C&W) [5]. While this attack generates adversarial examples with low ℓ2 noise, it also requires a high number of iterations, which makes it impractical for training a robust model to defend against such attacks. In contrast, one-step attacks are fast to generate, but using them for training does not increase model robustness in white-box scenarios, with full knowledge of the model under attack [19]. Developing an attack that finds adversarial examples with low noise in few iterations would enable adversarial training with such examples, which could potentially increase model robustness against white-box attacks.
Developing attacks that minimize the norm of the adversarial perturbations requires optimizing two objectives: 1) obtaining a low ℓ2 norm, while 2) inducing a misclassification. The current state-of-the-art method (C&W [5]) addresses this with a two-term loss function, where the weight C balancing the two competing objectives is found via an expensive line search, typically requiring a large number of iterations. This makes the evaluation of a system's robustness very slow, and therefore impractical in adversarial training scenarios.
In this paper, we propose an efficient gradient-based attack called Decoupled Direction and Norm (DDN; code available at https://github.com/jeromerony/fast_adversarial) that induces misclassification with a low ℓ2 norm. This attack optimizes the cross-entropy loss and, instead of penalizing the norm in each iteration, projects the perturbation onto an ℓ2 sphere centered at the original image. The change in norm is then based on whether the sample is adversarial or not. Using this approach to decouple the direction and norm of the adversarial noise leads to an attack that needs significantly fewer iterations, achieving performance comparable to the state-of-the-art while remaining amenable to adversarial training.
A comprehensive set of experiments was conducted on the MNIST, CIFAR-10 and ImageNet datasets. Our attack obtains results comparable to the state-of-the-art while requiring far fewer iterations (about 100 times fewer than C&W). For untargeted attacks on the ImageNet dataset, our attack achieves better performance than the C&W attack, taking less than 10 minutes to attack 1 000 images, versus over 35 hours for the C&W attack.
Results for adversarial training on the MNIST and CIFAR-10 datasets indicate that DDN can achieve state-of-the-art robustness compared to the Madry defense [13]. These models require that attacks use a higher average ℓ2 norm to induce misclassifications. They also obtain a higher accuracy when the norm of the attacks is bounded. On MNIST, if the attack norm is restricted to ε = 1.5, the model trained with the Madry defense achieves 67.3% accuracy, while our model achieves 87.2% accuracy. On CIFAR-10, for attacks restricted to a norm of ε = 0.5, the Madry model achieves 56.1% accuracy, compared to 67.6% for our model.
2 Related Work
In this section, we formalize the problem of adversarial examples, the threat model, and review the main attack and defense methods proposed in the literature.
2.1 Problem Formulation
Let x be a sample from the input space X, with true label y_true from a set of possible labels Y. Let D(·, ·) be a distance measure that compares two input samples (ideally capturing their perceptual similarity), and let F be a model (classifier) parameterized by θ. An example x̃ is called adversarial (for non-targeted attacks) against the classifier if arg max F(x̃) ≠ y_true and D(x, x̃) ≤ ε, for a given maximum perturbation ε. A targeted attack with a given desired class y_target further requires that arg max F(x̃) = y_target. We denote by L(x, y, θ) the cross-entropy between the prediction of the model for an input x and a label y. Fig. 1 illustrates a targeted attack on the ImageNet dataset, against an Inception v3 model [17].
In this paper, attacks are considered to be generated by a gradient-based optimization procedure, restricting our analysis to differentiable classifiers. These attacks can be formulated either to obtain a minimum distortion D(x, x̃), or to obtain the worst possible loss in a region D(x, x̃) ≤ ε. As an example, consider that the distance function is a norm (e.g., ℓ0, ℓ2 or ℓ∞), and the inputs are images (where each pixel's value is constrained between 0 and M). In a white-box scenario, the optimization procedure to obtain a non-targeted attack with minimum distortion can be formulated as:

(1)  min_δ ‖δ‖  subject to  arg max F(x + δ) ≠ y_true  and  x + δ ∈ [0, M]^n

with a similar formulation for targeted attacks, by changing the constraint on the prediction to be equal to the target class (arg max F(x + δ) = y_target).
If the objective is to obtain the worst possible loss for a given maximum noise of norm ε, the problem can be formulated as:

(2)  max_δ L(x + δ, y_true, θ)  subject to  ‖δ‖ ≤ ε  and  x + δ ∈ [0, M]^n

with a similar formulation for targeted attacks, by minimizing the cross-entropy with respect to the target class, L(x + δ, y_target, θ).
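Eq. 2 lends itself to projected gradient ascent: take a gradient step on the loss, then project the perturbation back onto the ε-ball and the valid pixel box. The sketch below is a minimal numpy illustration of this idea, not the paper's code; the quadratic `loss_grad` standing in for the network's gradient is an assumption for the example.

```python
import numpy as np

def project_l2_ball(delta, eps):
    """Scale delta back onto the l2 ball of radius eps if it lies outside."""
    norm = np.linalg.norm(delta)
    return delta * (eps / norm) if norm > eps else delta

def pgd_l2(x, loss_grad, eps, step=0.3, steps=20, box=(0.0, 1.0)):
    """Maximize a loss within an eps-ball around x (Eq. 2), staying in the box."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        g = loss_grad(x + delta)
        g = g / (np.linalg.norm(g) + 1e-12)        # normalized ascent direction
        delta = project_l2_ball(delta + step * g, eps)
        delta = np.clip(x + delta, *box) - x       # keep x + delta a valid image
    return x + delta

# Toy stand-in for the network loss: squared distance to a fixed point,
# whose gradient is analytic (hypothetical, for illustration only).
anchor = np.full(4, 0.9)
loss_grad = lambda z: 2.0 * (z - anchor)           # gradient of ||z - anchor||^2

x = np.full(4, 0.5)
x_adv = pgd_l2(x, loss_grad, eps=0.5)
```

By construction, the returned point never leaves the ε-ball around x, and stays in the valid pixel range.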
We focus on gradient-based attacks that optimize the ℓ2 norm of the distortion. While this distance does not perfectly capture perceptual similarity, it is widely used in computer vision to measure similarity between images (e.g., when comparing image compression algorithms, the Peak Signal-to-Noise Ratio, which is directly related to the ℓ2 measure, is used). A differentiable distance measure that captures perceptual similarity is still an open research problem.
2.2 Threat Model
In this paper, a white-box scenario is considered, also known as a Perfect Knowledge scenario [2]. In this scenario, we consider that an attacker has perfect knowledge of the system, including the neural network architecture and the learned weights θ. This threat model serves to evaluate system security under the worst-case scenario. Other scenarios can be conceived to evaluate attacks under different assumptions on the attacker's knowledge, for instance, no access to the trained model, or no access to the same training set, among others. These scenarios are referred to as black-box or Limited-Knowledge [2].
2.3 Attacks
Several attacks have been proposed in the literature, either focusing on obtaining adversarial examples with a small perturbation δ (Eq. 1) [5, 14, 18], or on obtaining adversarial examples in one (or few) steps for adversarial training [7, 12].
L-BFGS. Szegedy et al. [18] proposed an attack for minimally distorted examples (Eq. 1), by considering the following approximation:

(3)  min_δ  C ‖δ‖ + L(x + δ, y_target, θ)  subject to  x + δ ∈ [0, M]^n

where the box constraint x + δ ∈ [0, M]^n was addressed by using a box-constrained optimizer (L-BFGS: Limited-memory Broyden–Fletcher–Goldfarb–Shanno), and a line search was used to find an appropriate value of C.
FGSM. Goodfellow et al. [7] proposed the Fast Gradient Sign Method, a one-step method that can generate adversarial examples. The original formulation was developed for the ℓ∞ norm, but it has also been used to generate attacks that focus on the ℓ2 norm as follows:

(4)  x̃ = x + ε · ∇_x L(x, y_true, θ) / ‖∇_x L(x, y_true, θ)‖

where the constraint x̃ ∈ [0, M]^n was addressed by simply clipping the resulting adversarial example.
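As a sketch (ours, not a reference implementation), the ℓ2 variant of Eq. 4 is a single normalized gradient step followed by clipping:

```python
import numpy as np

def fgm_l2(x, grad, eps, box=(0.0, 1.0)):
    """One-step l2 variant of FGSM (Eq. 4): move eps along the normalized
    loss gradient, then clip to the valid pixel range."""
    g = grad / (np.linalg.norm(grad) + 1e-12)
    return np.clip(x + eps * g, *box)

x = np.array([0.2, 0.8])
grad = np.array([3.0, 4.0])          # hypothetical loss gradient at x
x_adv = fgm_l2(x, grad, eps=0.5)     # -> [0.5, 1.0] (second pixel clipped)
```

Because the step size is fixed at ε, this produces perturbations of norm exactly ε (before clipping), regardless of how close the decision boundary actually is.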
DeepFool. This method considers a linear approximation of the model, and iteratively refines an adversarial example by choosing the point that would cross the decision boundary under this approximation. This method was developed for untargeted attacks, and for any ℓp norm [14].
C&W. Similarly to the L-BFGS method, the C&W attack [5] minimizes two criteria at the same time – the perturbation that makes the sample adversarial (e.g., misclassified by the model), and the ℓ2 norm of the perturbation. Instead of using a box-constrained optimization method, they propose a change of variables using the tanh function, and instead of optimizing the cross-entropy of the adversarial example, they use a difference between logits. For a targeted attack aiming to obtain class y_target, with Z denoting the model output before the softmax activation (logits), the C&W method optimizes:

(5)  min_δ  ‖δ‖₂ + C · max( max_{j ≠ y_target} Z(x + δ)_j − Z(x + δ)_{y_target}, −κ )

where Z(·)_j denotes the logit corresponding to the j-th class. By increasing the confidence parameter κ, the adversarial sample will be misclassified with higher confidence. To use this attack in the untargeted setting, the classification term is modified to max( Z(x + δ)_{y_true} − max_{j ≠ y_true} Z(x + δ)_j, −κ ), where y_true is the original label. This method achieves state-of-the-art results in ℓ2 adversarial attacks, but requires a high number of iterations (this is further discussed in Section 3).
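The logit-difference term of Eq. 5 can be sketched directly (a numpy illustration under our notation, not the reference implementation):

```python
import numpy as np

def cw_term_targeted(logits, target, kappa=0.0):
    """Targeted C&W classification term: best non-target logit minus the
    target logit, floored at -kappa; it becomes non-positive once the attack
    succeeds with the requested confidence margin."""
    other = np.max(np.delete(logits, target))
    return max(other - logits[target], -kappa)

def cw_term_untargeted(logits, y_true, kappa=0.0):
    """Untargeted variant: true-class logit minus the best other logit."""
    other = np.max(np.delete(logits, y_true))
    return max(logits[y_true] - other, -kappa)

logits = np.array([1.0, 3.0, 2.0])
```

For these logits, `cw_term_targeted(logits, target=0)` is 2.0 (class 0 trails the best logit by 2), while `cw_term_untargeted(logits, y_true=1)` is 1.0 (class 1 still leads by 1).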
2.4 Defenses
Developing defenses against adversarial examples is an active area of research. To some extent, there is an arms race between developing defenses and attacks that break them. Goodfellow et al. proposed a method called adversarial training [7], in which the training data is augmented with FGSM samples. This was later shown not to be robust against iterative white-box attacks, nor against black-box single-step attacks [19]. Papernot et al. [15] proposed a distillation procedure to train robust networks, which was shown to be easily broken by iterative white-box attacks [5]. Other defenses involve obfuscated gradients [1], where models either incorporate non-differentiable steps (such that the gradient cannot be computed) [4, 9], or randomized elements (to induce incorrect estimations of the gradient) [6, 20]. These defenses were later shown to be ineffective when attacked with Backward Pass Differentiable Approximation (BPDA) [1], where the actual model is used for forward propagation, and the gradient in the backward pass is approximated. The Madry defense [13], which considers a worst-case optimization, is the only defense that has been shown to be somewhat robust (on the MNIST and CIFAR-10 datasets). Below we provide more detail on the general approach of adversarial training, and on the Madry defense.
Adversarial Training. This defense considers augmenting the training objective with adversarial examples [7], with the intention of improving robustness. Given a model with loss function L(x, y, θ), training is augmented as follows:
(6)  L̃(x, y, θ) = α L(x, y, θ) + (1 − α) L(x̃, y, θ)

where x̃ is an adversarial sample. In [7], FGSM is used to generate the adversarial example in a single step. Tramèr et al. [19] extended this method, showing that generating one-step attacks using the model under training introduces an issue: the model can converge to a degenerate solution where its gradients produce "easy" adversarial samples, causing the adversarial loss to have limited influence on the training objective. They proposed a method in which an ensemble of models is also used to generate the adversarial examples x̃. This method displays some robustness against black-box attacks using surrogate models, but does not increase robustness in white-box scenarios.
Madry Defense. Madry et al. [13] proposed a saddle-point optimization problem, in which we optimize for the worst case:

(7)  min_θ E_{(x, y) ~ D} [ max_{δ ∈ S} L(x + δ, y, θ) ]

where D is the training data distribution, and S indicates the feasible region for the attacker (e.g., S = {δ : ‖δ‖ ≤ ε}). They show that Eq. 7 can be optimized by stochastic gradient descent – during each training iteration, it first finds the adversarial example that maximizes the loss around the current training sample x (i.e., maximizing the loss over δ, which is equivalent to minimizing the probability of the correct class, as in Eq. 2), and then minimizes the loss over θ. Experiments by Athalye et al. [1] show that it was the only defense not broken under white-box attacks.
3 Decoupled Direction and Norm Attack
From the problem definition, we see that finding the worst adversary in a fixed region is an easier task. In Eq. 2, both constraints can be expressed in terms of δ, and the resulting equation can be optimized using projected gradient descent. Finding the closest adversarial example is harder: Eq. 1 has a constraint on the prediction of the model, which cannot be addressed by a simple projection. A common approach, used by Szegedy et al. [18] and in the C&W [5] attack, is to approximate the constrained problem in Eq. 1 by an unconstrained one, replacing the constraint with a penalty. This amounts to jointly optimizing both terms, the norm of δ and a classification term (see Eq. 3 and 5), with a sufficiently high parameter C. In the general context of constrained optimization, such a penalty-based approach is a well-known general principle [11]. While tackling an unconstrained problem is convenient, penalty methods have well-known difficulties in practice. The main difficulty is that the parameter C has to be chosen in an ad hoc way. For instance, if C is too small in Eq. 5, the example will not be adversarial; if it is too large, this term will dominate, resulting in an adversarial example with more noise. This can be particularly problematic when optimizing with a low number of steps (e.g., to enable use in adversarial training). Fig. 2 plots a histogram of the values of C that were obtained by running the C&W attack on the MNIST dataset. We can see that the optimum C varies significantly among different examples, spanning several orders of magnitude. We also see that the distribution of the best constant C changes depending on whether we attack a model with or without adversarial training (adversarially trained models often require higher C). Furthermore, penalty methods typically result in slow convergence [11].
Given the difficulty of finding an appropriate constant C for this optimization, we propose an algorithm that does not impose a penalty on the ℓ2 norm during the optimization. Instead, the norm is constrained by projecting the adversarial perturbation onto an ℓ2 sphere of radius ε around the original image x. Then, the norm ε is modified through a binary decision: if the sample is not adversarial at step k, the norm is increased for step k+1, otherwise it is decreased.
We also note that optimizing the cross-entropy may present two other difficulties. First, the cross-entropy is not bounded, which can make it dominate in the optimization of Eq. 3. Second, when attacking trained models, the predicted probability of the correct class for the original image is often very close to 1, which causes the cross-entropy to start very low and increase by several orders of magnitude during the search for an adversarial example. This affects the norm of the gradient, making it hard to find an appropriate learning rate. C&W address these issues by optimizing the difference between logits instead of the cross-entropy. In this work, the unboundedness issue does not affect the attack procedure, since the decision to update the norm is based on the model's prediction (not on the cross-entropy). To handle the issue of large changes in gradient norm, we normalize the gradient to have unit norm before taking a step in its direction.
The full procedure is described in Algorithm 1 and illustrated in Fig. 3. We start from the original image x, and iteratively refine the noise δ_k. At iteration k, if the current sample x̃_k = x + δ_k is still not adversarial, we consider a larger norm ε_{k+1} = ε_k (1 + γ). Otherwise, if the sample is adversarial, we consider a smaller ε_{k+1} = ε_k (1 − γ). In both cases, we take a step along the normalized gradient from the point x̃_k (step 5 of Algorithm 1; red arrow in Fig. 3), and project the result back onto an ε_{k+1}-sphere centered at x (the direction given by the dashed blue line in Fig. 3), obtaining x̃_{k+1}. Lastly, x̃_{k+1} is projected onto the feasible region of the input space X. In the case of images normalized to [0, 1], we simply clip the value of each pixel to be inside this range (step 13 of Algorithm 1). Besides this step, we can also quantize the image in each iteration, to ensure the attack is a valid image.
It is worth noting that, when reaching a point where the decision boundary is tangent to the sphere, the gradient has the same direction as δ, so the step is projected back onto the direction of δ. The norm then oscillates between the two sides of the decision boundary along this direction. Multiplying the norm by (1 + γ) and (1 − γ) in alternation results in a global decrease (over two steps) by a factor of (1 − γ²), leading to a finer search of the best norm.
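To make the procedure concrete, here is a compact numpy sketch of the DDN loop. It is a simplification of Algorithm 1: the toy `grad_fn`/`is_adv_fn` pair, and the omission of the quantization step, are our assumptions for illustration.

```python
import numpy as np

def ddn(x, grad_fn, is_adv_fn, steps=100, eps0=1.0, gamma=0.05, box=(0.0, 1.0)):
    """Decoupled Direction and Norm sketch: step along the normalized loss
    gradient, project the perturbation onto a sphere of radius eps around x,
    and grow eps by (1 + gamma) while not adversarial / shrink it by
    (1 - gamma) once adversarial. Returns the best adversarial point found."""
    delta = np.zeros_like(x)
    eps, best = eps0, None
    for k in range(steps):
        if is_adv_fn(x + delta):
            if best is None or np.linalg.norm(delta) < np.linalg.norm(best):
                best = delta.copy()                 # keep the smallest perturbation
            eps *= 1.0 - gamma                      # adversarial: tighten the norm
        else:
            eps *= 1.0 + gamma                      # not adversarial: relax it
        # cosine-annealed step size, from 1.0 down to 0.01 over the budget
        step = 0.01 + 0.5 * (1.0 - 0.01) * (1 + np.cos(np.pi * k / steps))
        g = grad_fn(x + delta)
        delta = delta + step * g / (np.linalg.norm(g) + 1e-12)
        delta *= eps / (np.linalg.norm(delta) + 1e-12)  # project onto the eps-sphere
        delta = np.clip(x + delta, *box) - x            # stay in the image box
    return x + best if best is not None else x + delta

# Toy classifier: a point is "adversarial" once its coordinates sum past 1
# (hypothetical stand-ins for the model's loss gradient and decision).
x = np.array([0.3, 0.3])
is_adv = lambda z: z.sum() > 1.0
grad = lambda z: np.ones_like(z)        # ascent direction of a linear loss
x_adv = ddn(x, grad, is_adv)
```

On this toy boundary the minimal adversarial norm is 0.4/√2 ≈ 0.28, and the oscillating schedule settles close to it without tuning any penalty constant.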
4 Attack Evaluation
Table 1. Results of untargeted attacks on MNIST, CIFAR-10 and ImageNet. "Budget" is the iteration budget (for C&W, search steps × iterations per step).

Attack     Budget     Success  Mean ℓ2  Median ℓ2  #Grads   Run-time (s)
MNIST
C&W        4×25       100.0    1.7382   1.7400     100      1.7
C&W        1×100      99.4     1.5917   1.6405     100      1.7
C&W        9×10 000   100.0    1.3961   1.4121     54 007   856.8
DeepFool   100        75.4     1.9685   2.2909     98       –
DDN        100        100.0    1.4563   1.4506     100      1.5
DDN        300        100.0    1.4357   1.4386     300      4.5
DDN        1 000      100.0    1.4240   1.4342     1 000    14.9
CIFAR-10
C&W        4×25       100.0    0.1924   0.1541     60       3.0
C&W        1×100      99.8     0.1728   0.1620     91       4.6
C&W        9×10 000   100.0    0.1543   0.1453     36 009   1 793.2
DeepFool   100        99.7     0.1796   0.1497     25       –
DDN        100        100.0    0.1503   0.1333     100      4.7
DDN        300        100.0    0.1487   0.1322     300      14.2
DDN        1 000      100.0    0.1480   0.1317     1 000    47.6
ImageNet
C&W        4×25       100.0    1.5812   1.3382     63       379.3
C&W        1×100      100.0    0.9858   0.9587     48       287.1
C&W        9×10 000   100.0    0.4692   0.3980     21 309   127 755.6
DeepFool   100        98.5     0.3800   0.2655     41       –
DDN        100        99.6     0.3831   0.3227     100      593.6
DDN        300        100.0    0.3749   0.3210     300      1 779.4
DDN        1 000      100.0    0.3617   0.3188     1 000    5 933.6
Table 2. Targeted attacks on MNIST: success rate and mean ℓ2, averaged over all target classes and for the least likely class.

Attack         Average case         Least likely
               Success  Mean ℓ2     Success  Mean ℓ2
C&W 4×25       96.11    2.8254      69.9     5.0090
C&W 1×100      86.89    2.0940      31.7     2.6062
C&W 9×10 000   100.00   1.9481      100.0    2.5370
DDN 100        100.00   1.9763      100.0    2.6008
DDN 300        100.00   1.9577      100.0    2.5503
DDN 1 000      100.00   1.9511      100.0    2.5348
Table 3. Targeted attacks on CIFAR-10: success rate and mean ℓ2, averaged over all target classes and for the least likely class.

Attack         Average case         Least likely
               Success  Mean ℓ2     Success  Mean ℓ2
C&W 4×25       99.78    0.3247      98.7     0.5060
C&W 1×100      99.32    0.3104      95.8     0.4159
C&W 9×10 000   100.00   0.2798      100.0    0.3905
DDN 100        100.00   0.2925      100.0    0.4170
DDN 300        100.00   0.2887      100.0    0.4090
DDN 1 000      100.00   0.2867      100.0    0.4050
Table 4. Targeted attacks on ImageNet: success rate and mean ℓ2, averaged over the sampled target classes and for the least likely class.

Attack             Average case         Least likely
                   Success  Mean ℓ2     Success  Mean ℓ2
C&W 4×25           99.13    4.2826      80.6     8.7336
C&W 1×100          96.74    1.7718      66.2     2.2997
C&W 9×10 000 [5]   100.00   0.96        100.0    2.22
DDN 100            99.98    1.0260      99.5     1.7074
DDN 300            100.00   0.9021      100.0    1.3634
DDN 1 000          100.00   0.8444      100.0    1.2240
Experiments were conducted on the MNIST, CIFAR-10 and ImageNet datasets, comparing the proposed attack to the state-of-the-art attacks in the literature: DeepFool [14] and the C&W ℓ2 attack [5]. We use the same model architectures, with identical hyperparameters for training, as in [5] for MNIST and CIFAR-10 (see the supplementary material for details). Our base classifiers obtain 99.44% and 85.51% accuracy on the test sets of MNIST and CIFAR-10, respectively. For the ImageNet experiments, we use a pre-trained Inception V3 [17], which achieves 22.51% top-1 error on the validation set. Inception V3 takes images of size 299×299 as input, which are cropped from images of size 342×342.
For the experiments with DeepFool [14], we used the implementation from Foolbox [16], with a budget of 100 iterations. For the experiments with C&W, we ported the attack (originally implemented in TensorFlow) to PyTorch, to evaluate the models in the framework in which they were trained. We use the same hyperparameters as [5]: 9 search steps on C, with 10 000 iterations for each search step (with early stopping) – we refer to this scenario as C&W 9×10 000 in the tables. As we are interested in obtaining attacks that require few iterations, we also report experiments in a scenario where the number of iterations is limited to 100. We consider a scenario of running 100 steps with a fixed C (1×100), and a scenario of running 4 search steps on C, of 25 iterations each (4×25). Since the hyperparameters proposed in [5] were tuned for a larger number of iterations and search steps, we performed a grid search for each dataset, using learning rates in the range [0.01, 0.05, 0.1, 0.5, 1], and C in the range [0.001, 0.01, 0.1, 1, 10, 100, 1 000]. We report the results for C&W with the hyperparameters that achieve the best median ℓ2. Selected parameters are listed in the supplementary material.
For the experiments using DDN, we ran attacks with budgets of 100, 300 and 1 000 iterations, in all cases using an initial norm ε₀ = 1 and γ = 0.05. The initial step size α = 1 was reduced with cosine annealing to 0.01 in the last iteration. The choice of γ is based on the encoding of images. For any correctly classified image, the smallest possible perturbation consists in changing one pixel by 1/255 (for images encoded in 8-bit values), corresponding to an ℓ2 norm of 1/255. Since we perform quantization, the values are rounded, meaning that the algorithm must be able to achieve a norm lower than 1.5/255 (any perturbation with a smaller norm quantizes to either 0 or a single-pixel change of 1/255). When using K steps, this imposes:
(8)  ε₀ (1 − γ)^K ≤ 1.5/255

Using ε₀ = 1 and K = 100 yields γ ≥ 1 − (1.5/255)^(1/100) ≈ 0.05. Therefore, if there exists an adversarial example with the smallest possible perturbation, the algorithm may find it in a fixed number of steps.
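As a quick sanity check of this constraint (our arithmetic, using the values stated above):

```python
# Smallest gamma for which the norm can shrink from eps0 = 1 down to the
# 1.5/255 quantization threshold within K = 100 steps of (1 - gamma) decay.
K, eps0, threshold = 100, 1.0, 1.5 / 255
gamma_min = 1.0 - (threshold / eps0) ** (1.0 / K)
print(round(gamma_min, 3))   # -> 0.05
```

So γ = 0.05 is essentially the smallest decay rate compatible with a 100-step budget.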
For the results with DDN, we consider quantized images (to 256 levels). The quantization step is included in each iteration (see step 13 of Algorithm 1). All results reported in the paper consider images in the [0, 1] range.
Two sets of experiments were conducted: untargeted attacks and targeted attacks. As in [5], we generated attacks on the first 1 000 images of the test set for MNIST and CIFAR-10, while for ImageNet we randomly chose 1 000 images from the validation set that are correctly classified. For the untargeted attacks, we report the success rate of the attack (percentage of samples for which an attack was found), the mean ℓ2 norm of the adversarial noise (over successful attacks), and the median ℓ2 norm over all attacks, considering unsuccessful attacks as worst-case adversarial (distance to a uniform gray image, as introduced by [3]). We also report the average number (for batch execution) of gradient computations and the total run-times (in seconds) on an NVIDIA GTX 1080 Ti with 11 GB of memory. We did not report run-times for the DeepFool attack, since the implementation from Foolbox generates adversarial examples one by one and is executed on the CPU, leading to unrepresentative run-times. Attacks on MNIST and CIFAR-10 were executed in a single batch of 1 000 samples, whereas attacks on ImageNet were executed in 20 batches of 50 samples.
For the targeted attacks, following the protocol from [5], we generate attacks against all possible classes on MNIST and CIFAR-10 (9 attacks per image), and against 100 randomly chosen classes for ImageNet (10% of the number of classes). Therefore, in each targeted attack experiment, we run 9 000 attacks on MNIST and CIFAR-10, and 100 000 attacks on ImageNet. Results are reported for two scenarios: 1) average over all attacks; 2) average performance when choosing the least likely class (i.e., the worst attack performance over all target classes, for each image). The reported ℓ2 norms are, as in the untargeted scenario, the means over successful attacks.
Table 1 reports the results of DDN compared to the C&W and DeepFool attacks on the MNIST, CIFAR-10 and ImageNet datasets. For the MNIST and CIFAR-10 datasets, results with DDN are comparable to the state-of-the-art. DDN obtains slightly worse norms on the MNIST dataset (compared to C&W 9×10 000); however, our attack gets within 5% of the norm found by C&W in only 100 iterations, compared to the 54 007 iterations required by the C&W attack. When the C&W attack is restricted to a maximum of 100 iterations, it always performs worse than DDN with 100 iterations. On the ImageNet dataset, our attack obtains better mean ℓ2 norms than both other attacks. The DDN attack needs 300 iterations to reach a 100% success rate. DeepFool obtains close results, but fails to reach a 100% success rate. It is also worth noting that DeepFool seems to perform worse against adversarially trained models (discussed in Section 6). The supplementary material reports curves of perturbation size against model accuracy for the three attacks.
Tables 2, 3 and 4 present the results of targeted attacks on the MNIST, CIFAR-10 and ImageNet datasets, respectively. For the MNIST and CIFAR-10 datasets, DDN yields performance similar to the C&W attack with 9×10 000 iterations, and always performs better than the C&W attack when it is restricted to 100 iterations (we reiterate that the hyperparameters of the C&W attack were tuned for each dataset, while the hyperparameters of DDN are fixed for all experiments). On the ImageNet dataset, DDN run with 100 iterations outperforms C&W. For all datasets, in the scenario restricted to 100 iterations, the C&W algorithm has a noticeable drop in success rate when finding adversarial examples for the least likely class.
5 Adversarial Training with DDN
Since the DDN attack can produce adversarial examples in relatively few iterations, it can be used for adversarial training. For this, we consider the following loss function:
(9)  L̃(x, y, θ) = L(x̃, y, θ),  with  x̃ = x + min(1, ε/‖δ‖₂) · δ

where δ is the perturbation produced by the DDN algorithm; x̃ is thus the adversarial example projected onto an ε-ball around x, such that the classifier is trained with adversarial examples with a maximum ℓ2 norm of ε. It is worth drawing a parallel between this approach and the Madry defense [13], where, in each iteration, the loss of the worst-case adversarial example (see Eq. 2) in an ε-ball around the original sample is used for optimization. In our proposed adversarial training procedure, we instead optimize the loss of the closest adversarial example (see Eq. 1). The intuition of this defense is to push the decision boundary away from x in each iteration. We do note that this method does not have the theoretical guarantees of the Madry defense. However, since in practice the Madry defense uses approximations (when searching for the global maximum of the loss around x), we argue that both methods deserve empirical comparison.
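The ε-ball projection used here can be sketched as follows (a numpy illustration; the function name is ours):

```python
import numpy as np

def project_onto_eps_ball(x, x_adv, eps):
    """Project an adversarial example onto the eps-ball around x, so that
    training never sees perturbations with l2 norm above eps (Eq. 9)."""
    delta = x_adv - x
    norm = np.linalg.norm(delta)
    if norm > eps:
        delta *= eps / norm
    return x + delta

x = np.zeros(3)
x_adv = np.array([3.0, 0.0, 0.0])                    # hypothetical DDN output
x_proj = project_onto_eps_ball(x, x_adv, eps=1.0)    # -> [1.0, 0.0, 0.0]
```

Examples already inside the ball pass through unchanged; only over-large perturbations are rescaled.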
6 Defense Evaluation
We trained models using the same architecture as [5] for MNIST, and a Wide ResNet (WRN) 28-10 [21] for CIFAR-10 (similar to [13], where a WRN 34-10 is used). As described in Section 5, we augment the training images with adversarial perturbations. For each training step, we run the DDN attack with a budget of 100 iterations, and limit the norm of the perturbation to a dataset-specific maximum ε. For MNIST, we train the model for 30 epochs, and then for 20 more epochs with a reduced learning rate. To reduce the training time on CIFAR-10, we first train the model on original images for 200 epochs, using the hyperparameters from [21]. Then, we continue training for 30 more epochs using Eq. 9, keeping the same final learning rate. Our robust MNIST model has a test accuracy of 99.01% on clean samples, while the Madry model has an accuracy of 98.53%. On CIFAR-10, our model reaches a test accuracy of 89.0%, while the model by Madry et al. obtains 87.3%.

Table 5. Evaluation of defenses on MNIST: attack success rate, mean and median ℓ2 norms of the perturbations, and model accuracy when the attack norm is bounded by ε = 1.5.

Defense      Attack         Success  Mean ℓ2  Median ℓ2  Acc. (ε ≤ 1.5)
Baseline     C&W 9×10 000   100.0    1.3961   1.4121     42.1
             DeepFool 100   75.4     1.9685   2.2909     81.8
             DDN 1 000      100.0    1.4240   1.4342     45.2
             All            100.0    1.3778   1.3946     40.8
Madry [13]   C&W 9×10 000   100.0    2.0813   2.1071     73.0
             DeepFool 100   91.6     4.9585   5.2946     93.1
             DDN 1 000      99.6     1.8436   1.8994     69.9
             All            100.0    1.6917   1.8307     67.3
Ours         C&W 9×10 000   100.0    2.5181   2.6146     88.0
             DeepFool 100   94.3     3.9449   4.1754     92.7
             DDN 1 000      100.0    2.4874   2.5781     87.6
             All            100.0    2.4497   2.5538     87.2
Table 6. Evaluation of defenses on CIFAR-10: attack success rate, mean and median ℓ2 norms of the perturbations, and model accuracy when the attack norm is bounded by ε = 0.5.

Defense      Attack         Success  Mean ℓ2  Median ℓ2  Acc. (ε ≤ 0.5)
Baseline     C&W 9×10 000   100.0    0.1343   0.1273     0.2
             DeepFool 100   99.3     0.5085   0.4241     38.3
             DDN 1 000      100.0    0.1430   0.1370     0.1
             All            100.0    0.1282   0.1222     0.1
Madry [13]   C&W 9×10 000   100.0    0.6912   0.6050     57.1
             DeepFool 100   95.6     1.4856   0.9576     64.7
             DDN 1 000      100.0    0.6732   0.5876     56.9
             All            100.0    0.6601   0.5804     56.1
Ours         C&W 9×10 000   100.0    0.8860   0.8254     67.9
             DeepFool 100   99.7     1.5298   1.1163     69.9
             DDN 1 000      100.0    0.8688   0.8177     68.0
             All            100.0    0.8597   0.8151     67.6
We evaluate the adversarial robustness of the models using three untargeted attacks: C&W 9×10 000, DeepFool 100 and DDN 1 000. For each sample, we consider the smallest adversarial perturbation produced by the three attacks, and report it in the "All" row. Tables 5 and 6 report the results of this evaluation, with a comparison to the defense of Madry et al. [13] (models taken from https://github.com/MadryLab) and to the baseline (without adversarial training) for CIFAR-10. For MNIST, the baseline corresponds to the model used in Section 4. We observe that, with an unbounded norm, the attacks can successfully generate adversarial examples almost 100% of the time. However, an increased ℓ2 norm is required to generate attacks against the model trained with DDN.
Fig. 4 shows the robustness of the MNIST and CIFAR-10 models for different attacks with increasing maximum norm. These figures can be interpreted as the expected accuracy of the systems in a scenario where the adversary is constrained to make perturbations with norm at most ε. For instance, on MNIST, if the attacker is limited to a maximum norm of ε = 1.5, the baseline performance decreases to 40.8%, Madry to 67.3%, and our defense to 87.2%. At a higher norm budget, the baseline performance decreases to 9.2%, Madry to 38.6%, and our defense to 74.8%. On CIFAR-10, if the attacker is limited to a maximum norm of ε = 0.5, the baseline performance decreases to 0.1%, Madry to 56.1%, and our defense to 67.6%. At a higher norm budget, the baseline performance decreases to 0%, Madry to 24.4%, and our defense to 39.9%. For both datasets, the model trained with DDN outperforms the model trained with the Madry defense for all values of ε.
Fig. 5 shows adversarial examples produced by the DDN 1 000 attack against the different models on MNIST and CIFAR-10. On MNIST, adversarial examples for the baseline are not meaningful (they still visually belong to the original class), whereas some adversarial examples obtained for the adversarially trained model (DDN) actually change class (bottom right: a 0 changes to a 6). For all models, there are still some adversarial examples that are very close to the original images (first column). On CIFAR-10, while the adversarially trained models require higher norms for the attacks, most adversarial examples still perceptually resemble the original images. In a few cases (bottom-right example for CIFAR-10), the perturbation could cause confusion: the example can appear as changing to class 1, i.e., a (cropped) automobile facing right.
7 Conclusion
We presented the Decoupled Direction and Norm attack, which obtains results comparable to the state-of-the-art for ℓ2-norm adversarial perturbations, but in far fewer iterations. Our attack allows for faster evaluation of the robustness of differentiable models, and enables a novel adversarial training, in which, at each iteration, we train with examples close to the decision boundary. Our experiments with MNIST and CIFAR-10 show state-of-the-art robustness against ℓ2-based attacks in a white-box scenario. Future work includes the evaluation of the transferability of attacks in black-box scenarios.
The methods presented in this paper were used in the NIPS 2018 Adversarial Vision Challenge, ranking first in untargeted attacks, and third in targeted attacks and robust models (both attacks and defenses in a black-box scenario). These results highlight the effectiveness of the defense mechanism, and suggest that attacks using adversarially trained surrogate models can be effective in black-box scenarios, which is a promising future direction.
Acknowledgements
We thank professors Marco Pedersoli and Christian Desrosiers for their insightful feedback on this paper. This work was supported by the Fonds de recherche du Québec – Nature et technologies (FRQNT) and CNPq grant #206318/2014-6.
References
 [1] A. Athalye, N. Carlini, and D. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 274–283, 2018.

 [2] B. Biggio and F. Roli. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition, 84:317–331, Dec. 2018.
 [3] W. Brendel, J. Rauber, A. Kurakin, N. Papernot, B. Veliqi, M. Salathé, S. P. Mohanty, and M. Bethge. Adversarial vision challenge. arXiv:1808.01976, 2018.
 [4] J. Buckman, A. Roy, C. Raffel, and I. Goodfellow. Thermometer Encoding: One Hot Way To Resist Adversarial Examples. In International Conference on Learning Representations, 2018.
 [5] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy (SP), pages 39–57, 2017.
 [6] G. S. Dhillon, K. Azizzadenesheli, Z. C. Lipton, J. Bernstein, J. Kossaifi, A. Khanna, and A. Anandkumar. Stochastic Activation Pruning for Robust Adversarial Defense. In International Conference on Learning Representations, 2018.
 [7] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and Harnessing Adversarial Examples. In International Conference on Learning Representations, 2015.

 [8] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai, and T. Chen. Recent advances in convolutional neural networks. Pattern Recognition, 77:354–377, May 2018.
 [9] C. Guo, M. Rana, M. Cissé, and L. van der Maaten. Countering Adversarial Images using Input Transformations. In International Conference on Learning Representations, 2018.
 [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 [11] P. A. Jensen and J. F. Bard. Operations Research Models and Methods. Wiley, 2003.
 [12] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world. In International Conference on Learning Representations (workshop track), 2017.

 [13] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. In International Conference on Learning Representations, 2018.
 [14] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. DeepFool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582, 2016.
 [15] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy (SP), pages 582–597, 2016.
 [16] J. Rauber, W. Brendel, and M. Bethge. Foolbox: A python toolbox to benchmark the robustness of machine learning models. arXiv:1707.04131, 2017.
 [17] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
 [18] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
 [19] F. Tramèr, A. Kurakin, N. Papernot, D. Boneh, and P. McDaniel. Ensemble Adversarial Training: Attacks and Defenses. In International Conference on Learning Representations, 2018.
 [20] C. Xie, J. Wang, Z. Zhang, Z. Ren, and A. Yuille. Mitigating Adversarial Effects Through Randomization. In International Conference on Learning Representations, 2018.
 [21] S. Zagoruyko and N. Komodakis. Wide residual networks. In Proceedings of the British Machine Vision Conference, pages 87.1–87.12, 2016.
Supplementary material
Appendix A Model architectures
Table 7 lists the architectures of the CNNs used in the attack evaluation; we used the same architectures as in [5] for a fair comparison against the C&W and DeepFool attacks. Table 8 lists the architecture used in the robust model (defense) trained on CIFAR-10. We used a Wide ResNet with 28 layers and a widening factor of 10 (WRN-28-10). The residual blocks used are the "basic block" [10, 21], with stride 1 for the first group and stride 2 for the second and third groups. This architecture is slightly different from the one used by Madry et al. [13], who use a modified version of the Wide ResNet with 5 residual blocks instead of 4 in each group, and without convolutions in the residual connections (when the shape of the output changes, e.g. with stride 2).

Layer Type | MNIST Model | CIFAR-10 Model
Convolution + ReLU | 3×3×32 | 3×3×64
Convolution + ReLU | 3×3×32 | 3×3×64
Max Pooling | 2×2 | 2×2
Convolution + ReLU | 3×3×64 | 3×3×128
Convolution + ReLU | 3×3×64 | 3×3×128
Max Pooling | 2×2 | 2×2
Fully Connected + ReLU | 200 | 256
Fully Connected + ReLU | 200 | 256
Fully Connected + Softmax | 10 | 10
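As a concrete sketch, the MNIST model of Table 7 can be written in PyTorch. The filter counts and 3×3 kernels follow the C&W architecture [5]; the flattened size of 1024 is an assumption that follows from valid-padding convolutions on 28×28 inputs, so treat the exact sizes as illustrative rather than definitive.

```python
import torch
from torch import nn

# MNIST model from Table 7 (sizes assumed to match Carlini & Wagner [5]).
mnist_model = nn.Sequential(
    nn.Conv2d(1, 32, 3), nn.ReLU(),    # 28x28 -> 26x26
    nn.Conv2d(32, 32, 3), nn.ReLU(),   # -> 24x24
    nn.MaxPool2d(2),                   # -> 12x12
    nn.Conv2d(32, 64, 3), nn.ReLU(),   # -> 10x10
    nn.Conv2d(64, 64, 3), nn.ReLU(),   # -> 8x8
    nn.MaxPool2d(2),                   # -> 4x4
    nn.Flatten(),                      # 64 * 4 * 4 = 1024 features
    nn.Linear(64 * 4 * 4, 200), nn.ReLU(),
    nn.Linear(200, 200), nn.ReLU(),
    nn.Linear(200, 10),                # softmax applied inside the loss
)

logits = mnist_model(torch.zeros(1, 1, 28, 28))
print(logits.shape)  # torch.Size([1, 10])
```

The CIFAR-10 variant only changes the input channels (3), the filter counts (64/128) and the fully-connected width (256).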
Appendix B Hyperparameters selected for the C&W attack
Layer Type | Size
Convolution | 3×3, 16
Residual Block | [3×3, 160] × 4
Residual Block | [3×3, 320] × 4, stride 2
Residual Block | [3×3, 640] × 4, stride 2
Batch Normalization + ReLU | –
Average Pooling | 8×8
Fully Connected + Softmax | 10
We considered a scenario of running the C&W attack with 100 steps and a fixed c (1×100), and a scenario of running 4 search steps on c, of 25 iterations each (4×25). Since the hyperparameters proposed in [5] were tuned for a larger number of iterations and search steps, we performed a grid search for each dataset, using learning rates in the range [0.01, 0.05, 0.1, 0.5, 1], and c in the range [0.001, 0.01, 0.1, 1, 10, 100, 1 000]. We selected the hyperparameters that resulted in targeted attacks with the lowest median L2 for each dataset. Table 9 lists the hyperparameters found through this search procedure.
Dataset | # Iterations | Parameters
MNIST | 1×100 |
MNIST | 4×25 |
CIFAR-10 | 1×100 |
CIFAR-10 | 4×25 |
ImageNet | 1×100 |
ImageNet | 4×25 |
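The selection procedure described above can be sketched as a plain grid search that keeps the (learning rate, c) pair with the lowest median L2. This is a minimal illustration, not the authors' code: `evaluate` stands in for running the full C&W attack and collecting perturbation norms, and `toy_evaluate` is a fabricated example whose minimum sits at (0.1, 1).

```python
import itertools
import statistics

# Candidate grids, as listed in the text.
LEARNING_RATES = [0.01, 0.05, 0.1, 0.5, 1]
C_VALUES = [0.001, 0.01, 0.1, 1, 10, 100, 1000]

def grid_search(evaluate):
    """Return the (learning_rate, c) pair with the lowest median L2.

    `evaluate(lr, c)` is a stand-in for running the C&W attack over the
    evaluation set and returning the L2 norms of the resulting
    adversarial perturbations.
    """
    return min(
        itertools.product(LEARNING_RATES, C_VALUES),
        key=lambda pair: statistics.median(evaluate(*pair)),
    )

# Toy evaluation: pretend lr=0.1, c=1 yields the smallest perturbations.
def toy_evaluate(lr, c):
    return [abs(lr - 0.1) + abs(c - 1) + 0.5]

print(grid_search(toy_evaluate))  # (0.1, 1)
```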
Appendix C Examples of adversarial images
Fig. 6 plots a grid of attacks (obtained with the C&W attack) against the first 10 examples in the MNIST dataset. The rows indicate the source classification (label), and the columns indicate the target class used to generate the attack (images on the diagonal are the original samples). We can see that in the adversarially trained model, the attacks need to introduce much larger changes to the samples in order to make them adversarial, and some of the adversarial samples visually resemble another class.
Fig. 7 shows randomly selected adversarial examples for the CIFAR-10 dataset, comparing the baseline model (WRN-28-10), the Madry defense and our proposed defense. For each image and model, we ran three attacks (DDN 1 000, C&W 9×10 000, DeepFool 100), and present the adversarial example with minimum perturbation among them. Fig. 8 shows cherry-picked adversarial examples on CIFAR-10 that visually resemble another class when attacking the proposed defense. We see that in the average (randomly selected) case, adversarial examples against the defenses still require little noise (perceptually) to induce misclassification. On the other hand, we notice that for adversarially trained models, some examples do require a much larger change to the image, making it effectively resemble another class.
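The per-image selection used for these figures reduces to keeping, among the successful attacks, the one with the smallest L2 perturbation. The sketch below is a hypothetical illustration of that rule; the dictionary layout and attack labels are ours, not the authors' data format.

```python
def best_adversarial(candidates):
    """Pick the successful adversarial example with the smallest L2
    perturbation among several attacks on the same image; return None
    if no attack succeeded."""
    successful = [c for c in candidates if c["success"]]
    return min(successful, key=lambda c: c["l2"]) if successful else None

# Toy per-image results for the three attacks used in Figs. 7 and 8.
results = [
    {"attack": "DDN 1 000",    "success": True,  "l2": 1.2},
    {"attack": "C&W 9x10 000", "success": True,  "l2": 1.1},
    {"attack": "DeepFool 100", "success": False, "l2": 0.9},  # failed: ignored
]
print(best_adversarial(results)["attack"])  # C&W 9x10 000
```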
Appendix D Attack performance curves
Fig. 9 reports curves of perturbation size against model accuracy for three attacks: C&W 9×10 000, DeepFool 100 and DDN 300.
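Such curves can be computed from the smallest perturbation found per sample: at a budget ε, a sample counts as correctly classified only if no adversarial example with L2 norm at most ε was found. The helper below is a minimal sketch under that convention (function name and data layout are illustrative, not from the paper's code).

```python
def robust_accuracy_curve(l2_norms, epsilons):
    """Accuracy under attack as a function of the L2 budget.

    `l2_norms[i]` is the smallest adversarial perturbation found for
    sample i (use 0.0 for samples the model misclassifies without any
    attack). A sample survives budget eps if its norm exceeds eps.
    """
    n = len(l2_norms)
    return [sum(norm > eps for norm in l2_norms) / n for eps in epsilons]

# Toy example: three samples with minimal perturbations 0.5, 1.5, 2.5.
print(robust_accuracy_curve([0.5, 1.5, 2.5], [0.0, 1.0, 2.0, 3.0]))
```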