1 Introduction
Manipulating just a few pixels in an input can easily derail the predictions of a deep neural network (DNN). This susceptibility threatens deployed machine learning models and highlights a gap between human and machine perception. The phenomenon has been intensely studied since its discovery in deep learning (Szegedy et al., 2014), but progress has been slow (Athalye et al., 2018a).
One core issue behind this lack of progress is the shortage of tools to reliably evaluate the robustness of machine learning models. Almost all published defenses against adversarial perturbations have later been found to be ineffective (Athalye et al., 2018a): the models only appeared robust because standard adversarial attacks failed to find the true minimal adversarial perturbations against them. State-of-the-art attacks like PGD (Madry et al., 2018) or C&W (Carlini and Wagner, 2016) may fail for a number of reasons, ranging from (1) suboptimal hyperparameters to (2) an insufficient number of optimization steps to (3) masking of the backpropagated gradients.
In this paper, we adopt ideas from the decision-based boundary attack (Brendel et al., 2018) and combine them with gradient-based estimates of the boundary. The resulting class of gradient-based attacks surpasses current state-of-the-art methods in terms of attack success, query efficiency and reliability. Like the decision-based boundary attack, but unlike existing gradient-based attacks, our attacks start from a point far away from the clean input and follow the boundary between the adversarial and non-adversarial region towards the clean input, see Figure 1 (middle). This approach has several advantages: first, we always stay close to the decision boundary of the model, the most likely region to feature reliable gradient information. Second, instead of minimizing some surrogate loss (e.g. a weighted combination of the cross-entropy and the distance loss), we can formulate a clean quadratic optimization problem whose solution uses the local plane of the boundary to estimate the optimal step towards the clean input under the given norm and the pixel bounds, see Figure 1 (right). Third, because we always stay close to the boundary, our method features only a single hyperparameter (the trust region) and no other trade-off parameters as in C&W or a fixed norm ball as in PGD. We tested our attacks against the current state-of-the-art in the $L_2$ and $L_\infty$ metrics in two conditions (targeted and untargeted) on six different models across three different data sets. To make all comparisons as fair as possible, we conducted a large-scale hyperparameter tuning for each attack. In all cases tested, we find that our attacks outperform the current state-of-the-art in terms of attack success, query efficiency and robustness to suboptimal hyperparameter settings. We hope that these improvements will facilitate progress towards robust machine learning models.

2 Related work
Gradient-based attacks are the most widely used tools to evaluate model robustness due to their efficiency and success rate relative to other classes of attacks with less model information (like decision-based, score-based or transfer-based attacks, see Brendel et al. (2018)). This class includes many of the best-known attacks such as L-BFGS (Szegedy et al., 2014), FGSM (Goodfellow et al., 2015), JSMA (Papernot et al., 2016), DeepFool (Moosavi-Dezfooli et al., 2016), PGD (Kurakin et al., 2016; Madry et al., 2018), C&W (Carlini and Wagner, 2016), EAD (Chen et al., 2017) and SparseFool (Modas et al., 2019). Nowadays, the two most important ones are PGD with a random starting point (Madry et al., 2018) and C&W (Carlini and Wagner, 2016). They are usually considered the state of the art for $L_\infty$ (PGD) and $L_2$ (C&W). The others are either much weaker (FGSM, DeepFool) or minimize other norms, e.g. $L_0$ (JSMA, SparseFool) or $L_1$ (EAD).
More recently, there have been improvements to PGD that aim to make it more effective and/or more query-efficient by changing its update rule to Adam (Uesato et al., 2018) or momentum (Dong et al., 2018). Initial comparisons to these attacks (not shown) do not suggest any changes in our conclusions with respect to our results on $L_\infty$, but we will add a full comparison in the next version of the manuscript.
3 Attack algorithm
Our attacks are inspired by the decision-based boundary attack (Brendel et al., 2018) but use gradients to estimate the local boundary between adversarial and non-adversarial inputs. We will refer to this boundary as the adversarial boundary for the rest of this manuscript. In a nutshell, the attack starts from an adversarial input $\tilde{x}_0$ (which may be far away from the clean sample) and then follows the adversarial boundary towards the clean input $x$, see Figure 1 (middle). To compute the optimal step in each iteration $k$, Figure 1 (right), we solve a quadratic trust region optimization problem. The goal of this optimization problem is to find a step $\delta_k$ such that (1) the updated perturbation $\tilde{x}_k = \tilde{x}_{k-1} + \delta_k$ has a smaller distance to the clean input $x$, (2) the size of the step is smaller than a given trust region radius $r$, (3) the updated perturbation stays within the box constraints of the valid input value range (e.g. $[0, 1]$ or $[0, 255]$) and (4) the updated perturbation is approximately placed on the adversarial boundary.
Optimization problem
In mathematical terms, this optimization problem can be phrased as

$$\min_{\delta_k} \; \big\lVert x - (\tilde{x}_{k-1} + \delta_k) \big\rVert_p \quad \text{s.t.} \quad b_k^\top \delta_k = c_k, \quad \lVert \delta_k \rVert_2 \le r, \quad 0 \le \tilde{x}_{k-1} + \delta_k \le 1, \tag{1}$$
where $\lVert\cdot\rVert_p$ denotes the $L_p$ norm and $b_k$ denotes the estimate of the normal vector of the local boundary (see Figure 1) around $\tilde{x}_{k-1}$ (see below for details). For $p = 2$, Eq. (1) is a quadratically-constrained quadratic program (QCQP), while for $p = \infty$ it is straightforward to write Equation (1) as a linear program with quadratic constraints (LPQC), see the supplementary material. Both problems can be solved with off-the-shelf solvers like ECOS
(Domahidi et al., 2013) or SCS (O'Donoghue et al., 2016), but the runtime of these solvers as well as their numerical instabilities in high dimensions prohibit their use in practice. We therefore derived efficient iterative algorithms to solve Eq. (1) for $L_2$ and $L_\infty$. The additional optimization step has little to no impact on the runtime of our attack compared to standard iterative gradient-based attacks like PGD. We report the details of the derivation and the resulting algorithms in the supplements.

For $L_2$, the algorithm to solve Equation (1) is essentially an active-set method: in each iteration, we first ignore the pixel bounds, solve the residual QCQP analytically, and then project the solution back onto the pixel bounds. In practice, the algorithm converges to the optimal solution after a few iterations.
For $L_\infty$, we note that the optimization problem in Eq. (1) can be reduced to the corresponding problem for a fixed $L_\infty$ norm of size $\epsilon$. We then perform a simple and fast binary search to minimize $\epsilon$.
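The per-step subproblem can also be prototyped with a generic solver before committing to the fast analytical algorithms described above. The sketch below solves the $L_2$ variant of Eq. (1) with SciPy's SLSQP; the function name and setup are illustrative, and this is not the paper's active-set implementation:

```python
import numpy as np
from scipy.optimize import minimize


def trust_region_step(x, x_adv, b, c, r):
    """One L2 step of Eq. (1): minimize ||x - (x_adv + delta)||_2^2
    subject to  b . delta = c        (stay on the linearized boundary),
                ||delta||_2 <= r     (trust region),
                0 <= x_adv + delta <= 1  (pixel bounds)."""
    fun = lambda d: np.sum((x - (x_adv + d)) ** 2)
    jac = lambda d: -2.0 * (x - (x_adv + d))
    constraints = [
        {"type": "eq", "fun": lambda d: b @ d - c, "jac": lambda d: b},
        {"type": "ineq", "fun": lambda d: r ** 2 - d @ d, "jac": lambda d: -2.0 * d},
    ]
    bounds = list(zip(-x_adv, 1.0 - x_adv))  # keep x_adv + delta in [0, 1]
    res = minimize(fun, np.zeros(x.size), jac=jac, method="SLSQP",
                   bounds=bounds, constraints=constraints)
    return res.x
```

As noted above, generic solvers become slow and numerically unstable in high dimensions, which is why the attack uses the dedicated iterative algorithms instead.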
Adversarial criterion
Our attacks move along the adversarial boundary to minimize the distance to the clean input. We assume that this boundary can be defined by a differentiable equality constraint $\mathrm{adv}(\tilde{x}) = 0$, i.e. the manifold that defines the boundary is given by the set of inputs $\{\tilde{x} \mid \mathrm{adv}(\tilde{x}) = 0\}$. No other assumptions about the adversarial boundary are made. Common choices for $\mathrm{adv}$ are targeted or untargeted adversarials, defined by perturbations that switch the model prediction from the ground-truth label $y$ to either a specified target label $t$ (targeted scenario) or any other label (untargeted scenario). More precisely, let $F_m(\tilde{x})$ be the class-conditional log-probabilities predicted by model $M$ on the input $\tilde{x}$. Then $\mathrm{adv}(\tilde{x}) = F_t(\tilde{x}) - \max_{m \ne t} F_m(\tilde{x})$ is the criterion for targeted adversarials and $\mathrm{adv}(\tilde{x}) = \max_{m \ne y} F_m(\tilde{x}) - F_y(\tilde{x})$ for untargeted adversarials.

The direction of the boundary $b_k$ in step $k$ at point $\tilde{x}_{k-1}$ is defined as the derivative of $\mathrm{adv}$,
$$b_k = \nabla_{\tilde{x}} \, \mathrm{adv}(\tilde{x}) \Big|_{\tilde{x} = \tilde{x}_{k-1}}. \tag{2}$$
Hence, any step $\delta_k$ for which $\mathrm{adv}(\tilde{x}_{k-1}) + b_k^\top \delta_k = 0$ will move the perturbation onto the adversarial boundary (if the linearity assumption holds exactly). In Eq. (1), we defined $c_k = -\mathrm{adv}(\tilde{x}_{k-1})$ for brevity. Finally, we note that in the targeted and untargeted scenarios, we compute gradients for the same loss that was found to be most effective in Carlini and Wagner (2016). In our case, this loss is naturally derived from a geometric perspective of the adversarial boundary.
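For concreteness, the adversarial criteria and the boundary normal can be sketched as follows. The function names are ours, and the finite-difference gradient is for illustration only; the attack backpropagates through the model instead:

```python
import numpy as np


def adv_targeted(logp, t):
    # positive once the target class t has the highest log-probability
    return logp[t] - np.max(np.delete(logp, t))


def adv_untargeted(logp, y):
    # positive once any class other than the ground truth y wins
    return np.max(np.delete(logp, y)) - logp[y]


def boundary_normal(f, x, eps=1e-5):
    """Finite-difference estimate of b_k = d adv / d x at the point x."""
    b = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        b[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return b
```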
Starting point
The algorithm always starts from a point $\tilde{x}_0$ that is typically far away from the clean input and lies in the adversarial region. There are several straightforward ways to find such starting points, e.g. by (1) sampling random noise inputs, (2) choosing a real sample that is part of the adversarial region (e.g. one classified as a given target class) or (3) using the output of another adversarial attack.
In all experiments presented in this paper, we choose the starting point as the closest sample (in terms of the given $L_p$ norm) to the clean input which was classified differently (in untargeted settings) or classified as the desired target class (in targeted settings) by the given model. After finding a suitable starting point, we perform a binary search with a maximum of 10 steps between the clean input and the starting point to find the adversarial boundary. From this point, we perform an iterative descent along the boundary towards the clean input. Algorithm 1 provides a compact summary of the attack procedure.
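The binary search onto the adversarial boundary can be sketched as follows, a simplified stand-in for the first stage of the attack procedure; the `is_adversarial` oracle, which wraps a model query, is an assumed interface:

```python
import numpy as np


def project_to_boundary(is_adversarial, x, x_start, steps=10):
    """Bisect on the line between the clean input x and an adversarial
    starting point x_start; return the adversarial point closest to the
    boundary found within `steps` bisections."""
    lo, hi = 0.0, 1.0  # fraction of the way from x_start towards x
    for _ in range(steps):
        mid = (lo + hi) / 2
        candidate = (1 - mid) * x_start + mid * x
        if is_adversarial(candidate):
            lo = mid   # still adversarial: move closer to x
        else:
            hi = mid   # crossed the boundary: back off
    return (1 - lo) * x_start + lo * x
```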
4 Methods
We extensively compare the proposed attacks against current state-of-the-art attacks in a range of different scenarios. This includes six different models (varying in model architecture, defense mechanism and data set), two different adversarial categories (targeted and untargeted) and two different metrics ($L_2$ and $L_\infty$). In addition, we perform a large-scale hyperparameter tuning for all attacks we compare against in order to be as fair as possible. The full analysis pipeline is built on top of Foolbox (Rauber et al., 2017) and will be published soon.
Attacks
We compare against the two attacks which are considered to be the current state-of-the-art in $L_\infty$ and $L_2$ according to the recently published guidelines (Carlini et al., 2019):


Projected Gradient Descent (PGD) (Madry et al., 2018): an iterative gradient attack that optimizes $L_\infty$ by minimizing a cross-entropy loss under a fixed $L_\infty$-norm constraint enforced in each step.

C&W (Carlini and Wagner, 2016): an iterative $L_2$ gradient attack that relies on the Adam optimizer, a tanh-nonlinearity to respect pixel bounds and a loss function that weighs a classification loss against the $L_2$ distance metric to be minimized.
Models
We test all attacks on all models regardless of whether the models have been specifically defended against the distance metric the attacks optimize. The sole goal is to evaluate all attacks on a maximally broad set of different models to ensure their wide applicability. For all models, we used the official implementations of the authors as available in the Foolbox model zoo (Rauber et al., 2017).


Kolter & Wong (Kolter and Wong, 2017): a provable defense that considers a convex outer approximation of the possible hidden activations within an $L_\infty$ ball to optimize a worst-case adversarial loss over this region. MNIST claims: 94.2% accuracy ($L_\infty$ perturbations).
Adversarial categories
We test all attacks in two common attack scenarios: untargeted and targeted attacks. In other words, perturbed inputs count as adversarial if they are classified differently from the ground-truth label (untargeted) or classified as a given target class (targeted).
Hyperparameter tuning
We ran PGD, C&W and our attacks on each model/attack combination and each sample with five repetitions and eight different hyperparameter settings. For each attack, we varied only the step size (for our attacks, the trust region radius) and left all other hyperparameters constant, testing eight values each for PGD, C&W and our attacks. For C&W, we set the number of steps to 200 and the number of binary search steps to 9. All other hyperparameters were left at their default values. (In the next update of the manuscript we will additionally optimize C&W over its initial trade-off constant. Our preliminary results on this, however, do not show any substantial differences to the results presented here.)
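The tuning protocol amounts to a small grid search per sample; a hedged sketch (the `run_attack` interface is an assumption for illustration, not the Foolbox API):

```python
import numpy as np


def best_over_grid(run_attack, x, step_sizes, repetitions=5):
    """Run the attack for every step size and repetition; keep the
    smallest adversarial perturbation found. run_attack(x, step_size,
    seed) returns an adversarial example or None on failure."""
    best, best_dist = None, np.inf
    for s in step_sizes:
        for rep in range(repetitions):
            x_adv = run_attack(x, s, seed=rep)
            if x_adv is None:
                continue
            dist = np.linalg.norm(x_adv - x)
            if dist < best_dist:
                best, best_dist = x_adv, dist
    return best, best_dist
```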
Evaluation
The success of an attack is typically quantified as the attack success rate within a given norm ball. In other words, the attack is allowed to perturb the clean input with a maximum norm of $\epsilon$, and one measures the classification accuracy of the model on the perturbed inputs. The smaller the classification accuracy, the better the attack performed. PGD (Madry et al., 2018), the current state-of-the-art attack on $L_\infty$, is highly adapted to this scenario and expects $\epsilon$ as an input.
This contrasts with attacks like C&W (Carlini and Wagner, 2016), which are designed to find minimal adversarial perturbations. In such scenarios, it is more natural to measure the success of an attack as the median over the adversarial perturbation sizes across all tested samples (Schott et al., 2019). The smaller the median perturbation, the better the attack.
Our attacks also seek minimal adversarials and thus lend themselves to both evaluation schemes. To make the comparison to the current state-of-the-art as fair as possible, we adopt the success rate criterion on $L_\infty$ and the median perturbation distance on $L_2$.
All results reported have been evaluated on 1000 validation samples, except for the results on ResNet-50, which were run on 100 samples due to time constraints (we will report the full results in the next version of the paper). For the evaluation, we chose $\epsilon$ for each model and each attack scenario such that the best attack reaches roughly 50% accuracy. This makes it easier to compare the performance of different attacks than thresholds at which model accuracy is close to zero or close to clean performance. We chose separate values of $\epsilon$ in the untargeted and targeted scenarios for Madry-MNIST, Kolter & Wong, Distillation, Madry-CIFAR, Logit Pairing and ResNet-50.
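The two evaluation criteria are easy to state in code. Given, for each sample, the smallest adversarial perturbation an attack found (with infinity where it failed), a minimal sketch (function names are ours):

```python
import numpy as np


def accuracy_under_attack(dists, eps):
    """Model accuracy within an eps-ball (the L-inf criterion): fraction
    of samples whose smallest adversarial perturbation exceeds eps.
    Lower is better for the attack."""
    return float(np.mean(np.asarray(dists) > eps))


def median_perturbation(dists):
    """Median adversarial perturbation size (the L2 criterion).
    Lower is better for the attack."""
    return float(np.median(np.asarray(dists)))
```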
5 Results
5.1 Attack success
In both targeted and untargeted attack scenarios, our attacks surpass the current state-of-the-art on every single model we tested, see Table 1 (untargeted) and Table 2 (targeted). While the gains are small on some models like Distillation or Madry-CIFAR, we reach quite substantial gains on others: on Madry-MNIST, our untargeted attack reaches a median perturbation size of 1.15 compared to 3.46 for C&W. In the targeted scenario, the difference is even more pronounced (1.70 vs. 5.15). On $L_\infty$, our attack further reduces the model accuracy by 0.1% to 9.1% relative to PGD. Adversarial examples produced by our attacks are visualized in Figure 2.
Table 1 (untargeted attacks; datasets: MNIST, CIFAR-10, ImageNet). Top two rows: model accuracy under $L_\infty$ attacks (lower is better). Bottom two rows: median $L_2$ perturbation size (lower is better).

        Madry-MNIST   K&W     Distillation   Madry-CIFAR   LP      ResNet-50
PGD     59.3%         74.5%   30.3%          49.9%         24.1%   53%
Ours    50.2%         68.4%   29.6%          49.8%         19.3%   41%
C&W     3.46          2.87    1.10           0.76          0.10    0.15
Ours    1.15          1.62    1.07           0.72          0.09    0.13
Table 2 (targeted attacks; datasets: MNIST, CIFAR-10, ImageNet). Top two rows: model accuracy under $L_\infty$ attacks (lower is better). Bottom two rows: median $L_2$ perturbation size (lower is better).

        Madry-MNIST   K&W     Distillation   Madry-CIFAR   LP      ResNet-50
PGD     65.5%         46.3%   52.8%          39.1%         1.5%    47%
Ours    57.7%         38.7%   50.1%          37.3%         0.6%    42%
C&W     5.15          4.21    2.10           1.21          0.54    0.46
Ours    1.70          2.32    2.06           1.16          0.52    0.4
5.2 Query efficiency
On $L_2$, our attack is drastically more query-efficient than C&W, see the query-distortion curves in Figure 3. Each curve shows the maximal attack success (either in terms of model accuracy or median perturbation size) as a function of the query budget. For each query (i.e. each point of the curve) and each model, we select the optimal hyperparameter. This ensures that we tease out how well each attack can perform in limited-query scenarios. We find that our attack generally requires only about 10 to 20 queries to get close to convergence, while C&W often needs several hundred iterations.
Similarly, our attack generally surpasses PGD in terms of attack success after around 10 queries. Our attack typically spends the first few queries finding a suitable point on the adversarial boundary, which gives PGD a slight advantage at the very beginning.
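A query-distortion curve of this kind can be computed from per-query perturbation logs; a minimal sketch (the array layout is an assumption, not the actual analysis pipeline):

```python
import numpy as np


def query_distortion_curve(runs):
    """runs: array of shape (n_hyperparams, n_queries) holding the
    perturbation size after each query for one sample. Returns, for each
    query budget, the smallest distortion achievable when the optimal
    hyperparameter is chosen for that budget."""
    runs = np.asarray(runs, dtype=float)
    best_so_far = np.minimum.accumulate(runs, axis=1)  # monotone per run
    return best_so_far.min(axis=0)  # optimal hyperparameter per budget
```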
5.3 Hyperparameter robustness
In Figure 4, we show the results of an ablation study. In the full case (8 params + 5 reps), we run all attacks with all eight hyperparameter values and with five repetitions for 1000 steps on each sample and model. We then choose the smallest adversarial input across all hyperparameter values and all repetitions. This is the baseline we compare all ablations against. The results are as follows:

- Like PGD and C&W, our attacks experience only a 4% performance drop if a single hyperparameter is used instead of eight.
- Our attacks experience around a 15%-19% drop in performance for a single hyperparameter and only one instead of five repetitions, similar to PGD and C&W.
- We can even choose the same trust region hyperparameter across all models with no further drop in performance. C&W, in comparison, experiences a further 16% drop in performance, meaning it is more sensitive to per-model hyperparameter tuning.
- Our attack is extremely insensitive to suboptimal hyperparameter tuning: changing the optimal trust region by two orders of magnitude up or down changes performance by less than 15%. In comparison, just one order of magnitude deteriorates C&W's performance by almost 50%. Larger deviations from the optimal learning rate disarm C&W completely. PGD is less sensitive than C&W but still experiences large drops if the learning rate gets too small.
6 Discussion & Conclusion
An important obstacle slowing down the search for robust machine learning models is the lack of reliable evaluation tools: out of roughly two hundred defenses proposed and evaluated in the literature, less than a handful are widely accepted as being effective. A more reliable evaluation of adversarial robustness has the potential to more clearly distinguish effective defenses from ineffective ones, thus providing more signal and thereby accelerating progress towards robust models.
In this paper, we introduced a novel class of gradient-based attacks that outperforms the current state-of-the-art in terms of attack success, query efficiency and reliability on $L_2$ and $L_\infty$. By moving along the adversarial boundary, our attacks stay in a region with fairly reliable gradient information. Other methods like C&W, which move through regions far away from the boundary, might get stuck due to obfuscated gradients, a common issue for robustness evaluation (Athalye et al., 2018b).
Further extensions to other metrics such as $L_0$ or $L_1$ are possible as long as the optimization problem in Eq. (1) can be solved efficiently; we are currently working on such extensions and will add them in the next iteration of the manuscript. Extensions to other adversarial criteria are straightforward as long as the boundary between the adversarial and the non-adversarial region can be described by a differentiable equality constraint. This makes the attack suitable for scenarios beyond targeted and untargeted classification tasks.
Taken together, our methods set a new standard for adversarial attacks that is useful for practitioners and researchers alike to find more robust machine learning models.
Acknowledgments
This work has been funded, in part, by the German Federal Ministry of Education and Research (BMBF) through the Bernstein Computational Neuroscience Program Tübingen (FKZ: 01GQ1002) as well as the German Research Foundation (DFG CRC 1233 on “Robust Vision”) and the BMBF competence center for machine learning (FKZ 01IS18039A). The authors thank the International Max Planck Research School for Intelligent Systems (IMPRSIS) for supporting J.R., M.K. and I.U.; J.R. acknowledges support by the Bosch Forschungsstiftung (Stifterverband, T113/30057/17); M.B. acknowledges support by the Centre for Integrative Neuroscience Tübingen (EXC 307); W.B. and M.B. were supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior / Interior Business Center (DoI/IBC) contract number D16PC00003.
References
 Athalye et al. (2018a) Anish Athalye, Nicholas Carlini, and David A. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. CoRR, abs/1802.00420, 2018a. URL http://arxiv.org/abs/1802.00420.
 Athalye et al. (2018b) Anish Athalye, Nicholas Carlini, and David A. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. CoRR, abs/1802.00420, 2018b. URL http://arxiv.org/abs/1802.00420.
 Brendel et al. (2018) W. Brendel, J. Rauber, and M. Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. In International Conference on Learning Representations, 2018. URL https://arxiv.org/abs/1712.04248.
 Carlini and Wagner (2016) Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. CoRR, abs/1608.04644, 2016. URL http://arxiv.org/abs/1608.04644.
 Carlini et al. (2019) Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian J. Goodfellow, Aleksander Madry, and Alexey Kurakin. On evaluating adversarial robustness. CoRR, abs/1902.06705, 2019. URL http://arxiv.org/abs/1902.06705.
 Chen et al. (2017) Pin-Yu Chen, Yash Sharma, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh. EAD: Elastic-net attacks to deep neural networks via adversarial examples. arXiv preprint arXiv:1709.04114, 2017.
 Domahidi et al. (2013) Alexander Domahidi, Eric Chun-Pu Chu, and Stephen P. Boyd. ECOS: An SOCP solver for embedded systems. 2013 European Control Conference (ECC), pages 3071-3076, 2013.

 Dong et al. (2018) Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. Boosting adversarial attacks with momentum. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
 Engstrom et al. (2018) Logan Engstrom, Andrew Ilyas, and Anish Athalye. Evaluating and understanding the robustness of adversarial logit pairing. CoRR, abs/1807.10272, 2018. URL http://arxiv.org/abs/1807.10272.
 Goodfellow et al. (2015) Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015. URL http://arxiv.org/abs/1412.6572.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770-778, 2016. doi: 10.1109/CVPR.2016.90. URL https://doi.org/10.1109/CVPR.2016.90.
 Kannan et al. (2018) Harini Kannan, Alexey Kurakin, and Ian J. Goodfellow. Adversarial logit pairing. CoRR, abs/1803.06373, 2018. URL http://arxiv.org/abs/1803.06373.
 Kolter and Wong (2017) J. Zico Kolter and Eric Wong. Provable defenses against adversarial examples via the convex outer adversarial polytope. CoRR, abs/1711.00851, 2017. URL http://arxiv.org/abs/1711.00851.
 Kurakin et al. (2016) Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
 Madry et al. (2018) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30  May 3, 2018, Conference Track Proceedings, 2018. URL https://openreview.net/forum?id=rJzIBfZAb.
 Modas et al. (2019) Apostolos Modas, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. SparseFool: a few pixels make a big difference. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
 Moosavi-Dezfooli et al. (2016) Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: A simple and accurate method to fool deep neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 O’Donoghue et al. (2016) B. O’Donoghue, E. Chu, N. Parikh, and S. Boyd. Conic optimization via operator splitting and homogeneous selfdual embedding. Journal of Optimization Theory and Applications, 169(3):1042–1068, June 2016. URL http://stanford.edu/~boyd/papers/scs.html.
 Papernot et al. (2015) Nicolas Papernot, Patrick D. McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. CoRR, abs/1511.04508, 2015. URL http://arxiv.org/abs/1511.04508.
 Papernot et al. (2016) Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In Security and Privacy (EuroS&P), 2016 IEEE European Symposium on, pages 372–387. IEEE, 2016.
 Rauber et al. (2017) Jonas Rauber, Wieland Brendel, and Matthias Bethge. Foolbox v0.8.0: A python toolbox to benchmark the robustness of machine learning models. CoRR, abs/1707.04131, 2017. URL http://arxiv.org/abs/1707.04131.
 Schott et al. (2019) Lukas Schott, Jonas Rauber, Matthias Bethge, and Wieland Brendel. Towards the first adversarially robust neural network model on MNIST. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=S1EHOsC9tX.
 Szegedy et al. (2014) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014. URL http://arxiv.org/abs/1312.6199.
 Uesato et al. (2018) Jonathan Uesato, Brendan O’Donoghue, Aaron van den Oord, and Pushmeet Kohli. Adversarial risk and the dangers of evaluating against weak attacks. arXiv preprint arXiv:1802.05666, 2018.
 Wang et al. (2018) Shiqi Wang, Yizheng Chen, Ahmed Abdou, and Suman Jana. Mixtrain: Scalable training of formally robust neural networks. arXiv preprint arXiv:1811.02625, 2018.
 Zheng et al. (2018) Tianhang Zheng, Changyou Chen, and Kui Ren. Distributionally adversarial attack. CoRR, abs/1808.05537, 2018. URL http://arxiv.org/abs/1808.05537.