1 Introduction
The issue of adversarial robustness and attacks (Szegedy et al., 2014; Goodfellow et al., 2015), i.e., generating small input perturbations that lead to mispredictions, is an important problem with a large body of recent work that affects all current deep learning models. Unfortunately, reliable evaluation of proposed defenses is an elusive and challenging task: many defenses seem to be effective initially, only to be circumvented later by new attacks designed specifically with that defense in mind (Carlini and Wagner, 2017b; Athalye et al., 2018; Tramer et al., 2020).
To address this challenge, two recent works approach the problem from different perspectives. Tramer et al. (2020) outline an approach for manually crafting adaptive attacks that exploit the weak points of each defense. Here, a domain expert starts with an existing attack, such as PGD (Madry et al., 2018) (denoted in Figure 1), and adapts it based on knowledge of the defense's inner workings. Common modifications include: (i) tuning attack parameters (e.g., the number of steps), (ii) replacing network components to simplify the attack (e.g., removing randomization or non-differentiable components), and (iii) replacing the loss function optimized by the attack. This approach was demonstrated to be effective in breaking all of the considered defenses. However, a downside is that it requires substantial manual effort and is limited by the domain knowledge of the expert; for instance, each of the defenses came with an adaptive attack which, in retrospect, turned out to be insufficient.
At the same time, Croce and Hein (2020b) proposed to assess adversarial robustness using an ensemble of four diverse attacks illustrated in Figure 1 (b): APGD with cross-entropy loss (Croce and Hein, 2020b), APGD with difference in logit ratio loss, FAB (Croce and Hein, 2020a), and Square Attack (SQR) (Andriushchenko et al., 2020). While these do not require manual effort and have been shown to provide a better robustness estimate for many defenses than the original evaluations, the approach is inherently limited by the fact that the attacks are fixed a priori, without any knowledge of the particular defense at hand. This is visualized in Figure 1 (b): even though the attacks are designed to be diverse, they cover only a small part of the entire space.
This work: discovery of adaptive attacks
We present a new method that automates the process of crafting adaptive attacks, combining the best of both prior approaches – the ability to evaluate defenses automatically while producing attacks tuned for the given defense. Our work is based on the key observation that we can identify common techniques used to build existing adaptive attacks and extract them as reusable building blocks in a common framework. Then, given a new model with an unseen defense, we can discover an effective attack by searching over suitable combinations of these building blocks.
To identify reusable techniques, we analyze existing adaptive attacks and organize their components into three groups:

Attack algorithm and parameters: a library of diverse attack techniques (e.g., APGD, FAB, C&W (Carlini and Wagner, 2017a), NES (Wierstra et al., 2008)), together with backbone-specific and generic parameters (e.g., input randomization, the number of steps, and if and how to use expectation over transformation (Athalye et al., 2018)).

Network transformations: techniques that transform the attacked model into an easier-to-attack surrogate (e.g., removing layers, or replacing the backward pass of non-differentiable components with a differentiable approximation).

Loss functions: different ways of defining the loss function optimized by the attack (e.g., cross-entropy, hinge loss, logit matching, etc.).
These components collectively formalize an attack search space induced by their different combinations. We also present an algorithm that effectively navigates the search space so as to discover a strong attack. In this way, domain experts are left with the creative task of designing completely new attacks and growing the framework by adding missing attack components, while the tool automates the many tedious and time-consuming trial-and-error steps that domain experts perform manually today.
We implemented our approach in a tool called Adaptive AutoAttack (A³) and evaluated it on diverse adversarial defenses. Our results demonstrate that A³ discovers adaptive attacks that outperform AutoAttack (Croce and Hein, 2020b), the current state-of-the-art tool for reliable evaluation of adversarial defenses: A³ finds attacks that are either stronger, producing 3.0%–50.8% additional adversarial examples (10 cases), or on average 2x and up to 5.5x faster while achieving similar adversarial robustness (13 cases). The source code of A³ and our scripts for reproducing the experiments are available online at: https://github.com/eth-sri/adaptive-auto-attack.
2 Automated Discovery of Adaptive Attacks
We use $D = \{(x_i, y_i)\}_{i=1}^{n}$ to denote a dataset, where $x_i$ is a natural input (e.g., an image) and $y_i$ is the corresponding label. An adversarial example is a perturbed input $x'$, such that: (i) it satisfies an attack criterion $\mathrm{crit}$, e.g., a classification model predicts a wrong label, and (ii) the distance $d(x, x')$ between the adversarial input $x'$ and the natural input $x$ is below a threshold $\epsilon$ under a distance metric $d$ (e.g., an $L_p$ norm). Formally, this can be written as:

$$\mathrm{crit}(x, x') \;\wedge\; d(x, x') \le \epsilon$$

For example, instantiating this with the $L_\infty$ norm and the misclassification criterion, we obtain the following formulation:

$$f(x') \ne y \;\wedge\; \|x' - x\|_\infty \le \epsilon$$

where $f(x)$ returns the prediction of the model $f$. Further, in case the model uses a defense to abstain from making predictions whenever an adversarial input is detected, the formulation is:

$$f(x') \ne y \;\wedge\; \mathrm{det}(x') = 0 \;\wedge\; \|x' - x\|_\infty \le \epsilon$$

where $\mathrm{det}$ is a detector; the model makes a prediction when $\mathrm{det}(x') = 0$ and otherwise rejects the input. A common way to implement the detector is to perform a statistical test with the goal of differentiating natural and adversarial samples (Grosse et al., 2017; Metzen et al., 2017; Li and Li, 2017).
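The combined criterion above can be sketched as a simple check; this is a toy illustration, and the `model`, `detector`, and `eps` names are ours, not part of any defense's API:

```python
# Check the attack criterion for a defense with an optional detector:
# (i) L-infinity budget, (ii) misprediction, (iii) the detector does not reject.
def is_adversarial(model, detector, x, x_adv, y, eps):
    within_budget = max(abs(a - b) for a, b in zip(x_adv, x)) <= eps
    mispredicted = model(x_adv) != y
    accepted = not detector(x_adv)  # a detector-based defense may reject x_adv
    return within_budget and mispredicted and accepted

# Toy example: a linear "model" over two features and no detector.
model = lambda x: int(sum(x) > 0)
detector = lambda x: False
print(is_adversarial(model, detector, [0.1, 0.2], [-0.2, -0.1], y=1, eps=0.5))
```

A perturbation that flips the prediction but exceeds the budget, or that the detector flags, does not count as adversarial under this criterion.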
Problem Statement
Given a model $f$ equipped with an unknown set of defenses and a dataset $D$, our goal is to find an adaptive adversarial attack $a^*$ that is best at generating adversarial samples according to the attack criterion $\mathrm{crit}$ and the attack capability $d(x, x') \le \epsilon$:

$$a^* = \arg\max_{a \in S} \; \mathbb{E}_{(x, y) \sim D} \big[\, \mathrm{crit}(x,\, a(x, f)) \,\big] \qquad (1)$$

Here, $S$ denotes the search space of all possible attacks, where the goal of each attack $a \in S$ is to generate an adversarial sample $a(x, f)$ for a given input $x$ and model $f$. For example, solving this optimization problem with respect to the misclassification criterion corresponds to maximizing the number of adversarial examples misclassified by the model.
In our work, we consider an implementation-knowledge adversary, who has full access to the model's implementation at inference time (e.g., the model's computational graph). We chose this threat model as it matches our problem setting: given an unseen model implementation, we want to automatically find an adaptive attack that exploits its weak points, without the need for a domain expert. We note that this threat model is weaker than a perfect-knowledge adversary (Biggio et al., 2013), which assumes a domain expert who also has knowledge of the training dataset (we only assume access to the dataset used to evaluate adversarial robustness, typically the test dataset, but not to the training and validation datasets) and the training algorithm, as this information is difficult, or even impossible, to recover from the model's implementation alone.
Key Challenges
To solve the optimization problem from Equation 1, we address two key challenges:

Defining a suitable attack search space $S$ that is expressive enough to cover a range of existing adaptive attacks.

Searching over the space $S$ efficiently, such that a strong attack is found within a reasonable time.
3 Adaptive Attacks Search Space
We define the adaptive attack search space by analyzing existing adaptive attacks and identifying common techniques used to break adversarial defenses. Formally, the adaptive attack search space is given by $S = A \times T$, where $A$ consists of sequences of backbone attacks along with their loss functions, selected from a space of loss functions $L$, and $T$ consists of network transformations. Semantically, given an input $x$ and a model $f$, the goal of an adaptive attack $(a, t) \in S$ is to return an adversarial example $x'$ by computing $x' = a(x, t(f))$. That is, it first transforms the model $f$ by applying the network transformation $t$, and then executes the attack $a$ on the surrogate model $f' = t(f)$. Note that the surrogate model $f'$ is used only to compute the candidate adversarial example, not to evaluate it. That is, we generate an adversarial example for $f'$, and then check whether it is also adversarial for $f$. Since $x'$ may be adversarial for $f'$, but not for $f$, the adaptive attack must maximize the transferability of the generated candidate adversarial samples.
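These semantics (craft on the surrogate, verify on the original) can be sketched in a few lines; this is a toy illustration under our own naming, not the tool's actual interface:

```python
# Sketch of the semantics x' = a(x, t(f)): the candidate is crafted on the
# surrogate t(f), but only counts if it transfers back to the original model f.
def run_adaptive_attack(attack, transform, model, x, y):
    surrogate = transform(model)        # t(f), e.g., with a defense removed
    x_adv = attack(surrogate, x, y)     # candidate computed on the surrogate
    return x_adv if model(x_adv) != y else None   # verify on the original f

# Toy instantiation: sign classifier, identity transformation, constant shift.
model = lambda x: int(x > 0)
transform = lambda f: f
attack = lambda f, x, y: x - 1.0
print(run_adaptive_attack(attack, transform, model, 0.5, 1))
```

The final check against `model` (rather than `surrogate`) is what forces the search toward transferable candidates.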
3.1 Attack Algorithm & Parameters ($A$)
The attack search space $A$ consists of sequences of adversarial attacks. We formalize the search space with the following grammar:
(Attack Search Space)

A ::= A ; A
    | randomize A
    | EOT A, n
    | repeat A, n
    | budget A, n
    | Attack params loss
where:

A1 ; A2: composes two attacks, which are executed independently, returning the first adversarial sample in the defined order. That is, given input x, the composed attack returns A1(x) if A1(x) is an adversarial example, and otherwise it returns A2(x).

randomize: enables the attack's randomized components (if any). The randomization corresponds to using a random seed and/or selecting a starting point within the allowed perturbation region, uniformly at random.

EOT A, n: uses expectation over transformation with n samples, a technique designed to compute gradients for models with randomized components (Athalye et al., 2018).

repeat A, n: repeats the attack n times. Note that repeat is useful only if randomization is enabled.

budget A, n: executes the attack A with a time budget of n seconds.

Attack params loss: a backbone attack Attack executed with parameters params and loss function loss. Our tool A³ supports FGSM (Goodfellow et al., 2015), PGD (Madry et al., 2018), DeepFool (Moosavi-Dezfooli et al., 2016), C&W (Carlini and Wagner, 2017a), NES (Wierstra et al., 2008), APGD (Croce and Hein, 2020b), FAB (Croce and Hein, 2020a), and SQR (Andriushchenko et al., 2020), where params corresponds to the standard parameters defined by these attacks, such as eta, beta, and n_iter for FAB. We provide the full list of parameters, including their ranges and priors, in the supplementary material. We define the loss functions in Section 3.3.
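The combinators above can be sketched as higher-order functions, assuming (as an illustrative convention of ours) that an attack is a function `attack(f, x, y)` returning a candidate adversarial example or `None` on failure:

```python
import time

def compose(a1, a2):
    """The ';' operator: return the first successful result, in order."""
    def run(f, x, y):
        cand = a1(f, x, y)
        return cand if cand is not None else a2(f, x, y)
    return run

def repeat(a, n):
    """Rerun a (randomized) attack up to n times until one attempt succeeds."""
    def run(f, x, y):
        for _ in range(n):
            cand = a(f, x, y)
            if cand is not None:
                return cand
        return None
    return run

def budget(a, seconds):
    """Keep retrying the attack until the time budget is exhausted."""
    def run(f, x, y):
        deadline = time.monotonic() + seconds
        cand = a(f, x, y)
        while cand is None and time.monotonic() < deadline:
            cand = a(f, x, y)
        return cand
    return run
```

Because every combinator again returns an attack of the same shape, arbitrary terms of the grammar can be built by nesting, e.g. `compose(repeat(a1, 5), budget(a2, 60))`.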
3.2 Network Transformations ($T$)
A common approach that aims to improve the robustness of neural networks against adversarial attacks is to incorporate an explicit defense mechanism into the neural architecture. These defenses often obfuscate gradients to render iterative optimization methods ineffective (Athalye et al., 2018). However, such defenses can be successfully circumvented by (i) choosing a suitable attack algorithm, such as score- and decision-based attacks (included in $A$), or (ii) changing the neural architecture (defined next).

At a high level, the network transformation search space $T$ takes as input a model $f$ and transforms it into another model $f'$, which is easier to attack. To achieve this, the network is expressed as a directed acyclic graph, where each vertex denotes an operator (e.g., convolution, residual block, etc.) and edges correspond to data dependencies. Note that the computational graph includes both the forward and backward versions of each operator, which can be changed independently of each other. In our work, we include two types of network transformations:

Layer removal, which removes an operator from the graph. To automate this process, an operator can be removed as long as its input and output dimensions are the same, regardless of its functionality.

Backward pass differentiable approximation (BPDA) (Athalye et al., 2018), which replaces the backward version of an operator with a differentiable approximation of the function. In our search space we include three different function approximations: (i) the identity function, (ii) a convolutional layer with kernel size 1, and (iii) a two-layer convolutional network with a ReLU activation in between. The weights in the latter two cases are learned by approximating the forward function on the test dataset.
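The identity variant of BPDA can be illustrated with a minimal sketch. This is a toy setup of our own (a scalar linear model applied to a quantized input), not the tool's implementation:

```python
# BPDA with the identity approximation: the forward pass uses the real,
# non-differentiable defense, while the backward pass pretends it was identity.
def quantize(x, levels=8):
    """Non-differentiable preprocessing: its true gradient is 0 almost everywhere."""
    return round(x * (levels - 1)) / (levels - 1)

def loss_and_bpda_grad(w, x, y):
    """Squared loss of z = w * quantize(x); the backward pass replaces
    d(quantize)/dx (which is 0 a.e.) with d(identity)/dx = 1."""
    z = w * quantize(x)                 # forward pass uses the real defense
    loss = (z - y) ** 2
    grad = 2.0 * (z - y) * w            # BPDA gradient estimate w.r.t. x
    return loss, grad

loss, grad = loss_and_bpda_grad(w=2.0, x=0.5, y=0.0)
print(grad)   # non-zero, so gradient descent on x can proceed
```

Without the substitution, the true gradient through `quantize` would be zero and a gradient-based attack would stall; the identity backward restores a usable, if approximate, descent direction.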
3.3 Loss Function ($L$)
Selecting the right objective function to optimize is an important design decision for creating strong adaptive attacks. Indeed, the recent work of Tramer et al. (2020) uses 9 different objective functions to break 13 defenses, showing the importance of this step. We formalize the space of possible loss functions using the following grammar:
(Loss Function Search Space)

L ::= targeted Loss, n Z
    | untargeted Loss Z
Z ::= logits | probs
Loss ::= CrossEntropy | HingeLoss | L1
       | DLR | LogitMatching
The grammar formalizes four different aspects:
Targeted vs. untargeted. The loss can be either untargeted, where the goal is to change the classification to any other label, or targeted, where the goal is to predict a concrete label. Even though the untargeted loss is less restrictive, it is not always easier to optimize in practice; as a result, the search space contains both. When using targeted Loss, n together with the misclassification criterion, the attack considers the top n classes with the highest probability as the targets.
Loss Formulation. Next is the concrete loss formulation, as summarized in Figure 2. These include loss functions used in existing adaptive attacks, as well as the recently proposed difference in logit ratio loss (Croce and Hein, 2020b).
Logits vs. probabilities. In our search space, loss functions can be instantiated both with logits and with probabilities. Note that some loss functions are specifically designed for one of the two options, such as C&W (Carlini and Wagner, 2017a) or DLR (Croce and Hein, 2020b), which consider only logits. While such knowledge can be used to reduce the search space, it is not necessary as long as the search algorithm is powerful enough to recognize that such a combination leads to poor results.
Loss replacement. Because the key idea behind many of the defenses is to find a property that helps to differentiate between adversarial and natural images, one can also define the optimization objective in the same way. These feature-level attacks (Sabour et al., 2016) avoid the need to directly optimize the complex objective defined by the adversarial defense and have been effective at circumventing such defenses. As an example, the logit matching loss (shown in Figure 2) minimizes the difference of logits between the adversarial sample and a natural sample of the target class (selected at random from the dataset). Instead of logits, the same idea can also be applied to other statistics, such as internal representations computed by a pretrained model or the KL-divergence between label probabilities.
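Two of the losses above can be sketched in a few lines; the signatures are illustrative choices of ours (plain lists of per-class logits), not the tool's API:

```python
# Untargeted hinge loss on logits: margin of the true class over the best
# other class; minimizing it drives the model toward misclassification.
def hinge_loss(logits, y):
    best_other = max(z for i, z in enumerate(logits) if i != y)
    return logits[y] - best_other

# Feature-level logit-matching loss: squared distance between the logits of
# the candidate x' and those of a natural sample of the target class.
def logit_matching_loss(logits_adv, logits_target):
    return sum((a - b) ** 2 for a, b in zip(logits_adv, logits_target))

print(hinge_loss([2.0, 1.0, 0.0], y=0))             # positive: still classified as y
print(logit_matching_loss([1.0, 0.0], [1.0, 0.0]))  # 0.0 once the logits match
```

Note how the logit-matching objective never inspects the defense itself; it only asks the candidate to look, at the logit level, like a natural sample of the target class.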
4 Search Algorithm
We now describe our search algorithm, which optimizes the problem statement from Equation 1. Since we do not have access to the underlying distribution, we approximate Equation 1 using the dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$ as follows:

$$a^* = \arg\max_{a \in S} \; \sum_{i=1}^{n} \Big( \mathbb{1}\big[\mathrm{crit}(x_i,\, a(x_i, f))\big] + \alpha \cdot \ell_{CE}\big(f(a(x_i, f)),\, y_i\big) \Big) \qquad (2)$$
where $a$ is an attack, $\ell_{CE}$ denotes the untargeted cross-entropy loss of the model $f$ on the input, and $\alpha$ is a hyperparameter. The intuition behind the $\ell_{CE}$ term is that it acts as a tie-breaker in case the criterion alone is not enough to differentiate between multiple attacks. While this is unlikely to happen when evaluating on large datasets, it is quite common when using only a small number of samples. Obtaining good estimates in such cases is especially important for achieving scalability, since performing the search directly on the full dataset would be prohibitively slow.

Search Algorithm
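The scoring objective of Equation 2 can be sketched as follows; the `evaluate` callback and `alpha` default are illustrative names of ours, not the tool's interface:

```python
# Score of Equation 2: attack success rate plus an alpha-weighted
# cross-entropy tie-breaker, averaged over the evaluation samples.
def score(attack, evaluate, samples, alpha=1e-3):
    """evaluate(x_adv, y) -> (success: bool, ce_loss: float) on the model."""
    successes, tie_break = 0, 0.0
    for x, y in samples:
        success, ce_loss = evaluate(attack(x, y), y)
        successes += int(success)
        tie_break += ce_loss
    n = len(samples)
    return successes / n + alpha * tie_break / n

# Two attacks with identical success rates: the tie-breaker prefers the one
# that pushes the model's cross-entropy loss higher.
samples = [([0.0], 0), ([1.0], 1)]
evaluate_weak = lambda x_adv, y: (False, 0.5)
evaluate_strong = lambda x_adv, y: (False, 2.0)
identity = lambda x, y: x
print(score(identity, evaluate_strong, samples) > score(identity, evaluate_weak, samples))
```

With a small $\alpha$, the tie-breaker only matters when the success counts coincide, which is exactly the small-sample regime described above.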
We present our search algorithm in Algorithm 1. We start by searching through the space of network transformations $T$ to find a suitable surrogate model (line 1). This is achieved by taking a default attack (in our implementation, APGD), evaluating all locations where BPDA can be used, and subsequently evaluating all layers that can be removed. Even though this step is exhaustive, it takes only a fraction of the runtime in our experiments, and no further optimization was necessary.
Next, we search through the space of attacks $A$. As this search space is enormous, we employ three techniques to improve scalability and attack quality. First, to generate a sequence of attacks, we perform a greedy search (lines 3–16). That is, in each step, we find an attack with the best score on the samples not circumvented by any of the previous attacks (line 4). Second, we use a parameter estimator model to select suitable parameters (line 8). In our work, we use the Tree of Parzen Estimators (Bergstra et al., 2011), but the concrete implementation can vary. Once the parameters are selected, they are evaluated (line 9), the result is stored in the trial history (line 10), and the estimator is updated (line 11). Third, because evaluating adversarial attacks can be expensive and the dataset is typically large, we employ the successive halving technique (Karnin et al., 2013; Jamieson and Talwalkar, 2016). Concretely, instead of evaluating all the trials on the full dataset, we start by evaluating them only on a subset of samples (line 5). Then, we improve the score estimates by iteratively increasing the dataset size (line 13), re-evaluating the scores (line 14), and retaining the quarter of the trials with the best scores (line 15). We repeat this process to find the single best attack, which is then added to the sequence of attacks (line 16).
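The successive halving step can be sketched as follows; the doubling schedule and quarter-retention are taken from the description above, while the function names are illustrative:

```python
# Successive halving: score all trials on a small subset of the data, then
# repeatedly grow the subset and keep only the best quarter of the trials.
def successive_halving(trials, evaluate, n0, full_n, keep=0.25):
    """evaluate(trial, n) -> score of the trial on n samples (higher is better)."""
    n = n0
    while len(trials) > 1 and n < full_n:
        ranked = sorted(trials, key=lambda t: evaluate(t, n), reverse=True)
        trials = ranked[:max(1, int(len(ranked) * keep))]
        n = min(full_n, n * 2)            # grow the evaluation subset
    return max(trials, key=lambda t: evaluate(t, full_n))

# Toy run: 16 candidate "attacks" whose true score equals their id.
best = successive_halving(list(range(16)), lambda t, n: t, n0=4, full_n=64)
print(best)
```

Most candidates are eliminated after only cheap, small-subset evaluations; only the survivors are ever scored on larger subsets, which is what makes the search affordable.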
5 Evaluation
We now evaluate A³ on 23 models with diverse defenses and compare the results to AutoAttack (Croce and Hein, 2020b) and to several existing handcrafted attacks. AutoAttack is a state-of-the-art tool designed for reliable evaluation of adversarial defenses that improved the originally reported results for many existing defenses by up to 10%. Our key result is that A³ finds stronger or similar attacks compared to AutoAttack for virtually all defenses:

In 10 cases, the attacks found by A³ are significantly stronger than AutoAttack's, resulting in 3.0% to 50.8% additional adversarial examples.

In the other 13 cases, A³'s attacks are typically 2x and up to 5.5x faster while achieving similar attack quality.
The A³ tool

The implementation of A³ is based on PyTorch (Paszke et al., 2019); the implementations of FGSM, PGD, NES, and DeepFool are based on FoolBox (Rauber et al., 2017) version 3.0.0, C&W is based on ART (Nicolae et al., 2018) version 1.3.0, and the attacks APGD, FAB, and SQR are from Croce and Hein (2020b). We use AutoAttack's rand version if a defense has a randomization component, and otherwise we use its standard version. To allow for a fair comparison, we extended AutoAttack with backward pass differentiable approximation (BPDA), so that it can run on defenses with non-differentiable components; without this, all gradient-based attacks would fail.

Unless stated otherwise, we instantiate Algorithm 1 by setting the attack sequence length, the number of trials, the initial dataset size, and a per-sample time budget that depends on the model size. We use TPE (Bergstra et al., 2011) for parameter estimation, which is implemented as part of the Hyperopt framework (Bergstra et al., 2013). All of the experiments are performed using a single RTX 2080 Ti GPU.
Table 1: Robust accuracy (1 − Rerr) and runtime of AutoAttack (AA; Croce and Hein (2020b)) and A³ (our work). Columns: robust accuracy of AA and A³ with their difference (%), runtime of AA and A³ (min), the resulting speedup, and A³'s search time (min).

CIFAR-10 (models A1–A10):  AA  A³  Diff.  AA  A³  Speedup  Search
A1  Stutz et al. (2020)  77.64  26.87  50.77  101  205  0.49  659 
A2  Madry et al. (2018)  44.78  44.69  0.09  25  20  1.25  88 
A3  Buckman et al. (2018)  2.29  1.96  0.33  9  7  1.29  116 
A4  Das et al. (2017) + Lee et al. (2018)  0.59  0.11  0.48  6  2  3.00  40 
A5  Metzen et al. (2017)  6.17  3.04  3.13  21  13  1.62  80 
A6  Guo et al. (2018)  22.30  12.14  10.16  19  17  1.12  99 
A7  Ensemble of A3, A4, A6  4.14  3.94  0.20  28  24  1.17  237 
A8  Papernot et al. (2015)  2.85  2.71  0.14  4  4  1.00  84 
A9  Xiao et al. (2020)  19.82  11.11  8.71  49  22  2.23  189 
A10  Xiao et al. (2020)  64.91  17.70  47.21  157  2,280  0.07  1,548 
CIFAR-10 (models B11–B23):
B11  Wu et al. (2020)  60.05  60.01  0.04  706  255  2.77  690 
B12  Wu et al. (2020)  56.16  56.18  0.02  801  145  5.52  677 
B13  Zhang and Wang (2019)  36.74  37.11  0.37  381  302  1.26  726 
B14  Grathwohl et al. (2020)  5.15  5.16  0.01  107  114  0.94  749 
B15  Xiao et al. (2020)  5.40  2.31  3.09  95  146  0.65  828 
B16  Wang et al. (2019)  50.84  50.81  0.03  734  372  1.97  755 
B17  Wang et al. (2020)  50.94  50.89  0.05  742  486  1.53  807 
B18  Sehwag et al. (2020)  57.19  57.16  0.03  671  429  1.56  691 
B19  B11 + Defense in A4  60.72  60.04  0.68  621  210  2.96  585 
B20  B14 + Defense in A4  15.27  5.24  10.03  261  79  3.30  746 
B21  B11 + Random Rotation  49.53  41.99  7.54  255  462  0.55  900 
B22  B14 + Random Rotation  22.29  13.45  8.84  114  374  0.30  1,023 
B23  Hu et al. (2019)  6.25  3.07  3.18  110  56  1.96  502 
† model available from the authors; ‡ model with non-differentiable components.
Evaluation Metric
Following Stutz et al. (2020), we use the robust test error (Rerr) metric to combine the evaluation of defenses with and without detectors. Rerr is defined as:

$$\mathrm{Rerr} = \frac{\sum_{i=1}^{n} \mathbb{1}\big[\, f(x'_i) \ne y_i \,\wedge\, \mathrm{det}(x'_i) = 0 \,\big]}{\sum_{i=1}^{n} \mathbb{1}\big[\, \mathrm{det}(x'_i) = 0 \,\big]} \qquad (3)$$

where $\mathrm{det}$ is a detector that accepts a sample $x$ if $\mathrm{det}(x) = 0$, and $\mathbb{1}[\cdot]$ evaluates to one if its condition holds and to zero otherwise. The numerator counts the number of samples that are both accepted and lead to a successful attack (including cases where the original prediction is incorrect), and the denominator counts the number of samples not rejected by the detector. A defense without a detector (i.e., $\mathrm{det}(x) = 0$ for all $x$) reduces Equation 3 to the standard Rerr. Finally, we define robust accuracy simply as $1 - \mathrm{Rerr}$.
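Equation 3 amounts to a ratio of two counts over the test set; a minimal sketch, over boolean per-sample results with names of our choosing:

```python
# Rerr with a detector: successful attacks among the accepted samples.
# mispredicted[i] -- did the attack on sample i end in a wrong prediction,
# rejected[i]     -- did the detector reject the attacked sample i.
def robust_test_error(mispredicted, rejected):
    accepted = [not r for r in rejected]
    wrong_and_accepted = sum(m and a for m, a in zip(mispredicted, accepted))
    return wrong_and_accepted / sum(accepted)

# Without a detector (nothing rejected) this reduces to the standard Rerr.
rerr = robust_test_error([True, True, False, False], [False, True, False, False])
print(rerr)   # one successful attack among three accepted samples
```

Note that rejected samples vanish from both the numerator and the denominator, so a defense cannot improve its Rerr merely by rejecting the samples it misclassifies anyway.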
Comparison to AutoAttack
Our main results, summarized in Table 1, show the robust accuracy (lower is better) and runtime of both AutoAttack (AA) and A³ over the 23 defenses. For example, for A9 our tool finds an attack that leads to lower robust accuracy (11.1% for A³ vs. 19.8% for AA) and is more than twice as fast (22 min for A³ vs. 49 min for AA). Overall, A³ significantly improves upon AA or provides similar but faster attacks.
We note that the attacks from AA are included in our search space (although without the knowledge of their best parameters and sequence), and so it is expected that A³ performs at least as well as AA, provided sufficient exploration time. The only case where the exploration time was not sufficient was B14, where our attack is slightly slower (114 min for A³ vs. 107 min for AA), yet still achieves essentially the same robust accuracy (5.16% for A³ vs. 5.15% for AA). Importantly, A³ often finds better attacks: for 10 defenses, A³ reduces the robust accuracy by 3% to 50% compared to AA. In what follows, we discuss the results in more detail and highlight important insights.
Defenses based on adversarial training. Defenses A2, B11, B12, B16, B17, and B18 are based on variations of adversarial training. We observe that, even though AA has been designed with these defenses in mind, A³ obtains very close results. Moreover, A³ improves upon AA by discovering attacks that achieve similar robustness while bringing 1.5x–5.5x speedups. Closer inspection reveals that AA includes two attacks, FAB and SQR, which are not only expensive but also ineffective on these defenses; A³ improves the runtime by excluding them from the generated adaptive attack.
Obfuscation defenses. Defenses A4, A9, A10, B15, B19, and B20 are based on gradient obfuscation. A³ discovers stronger attacks that reduce the robust accuracy of all of these defenses, by up to 47.21%. Here, removing the obfuscating components in A4, B19, and B20 provides better gradient estimates for the attacks. Further, the use of more suitable loss functions strengthens the discovered attacks and improves the evaluation results for A9 and B15.
Randomized defenses. For the randomized input defenses A9, B21, and B22, A³ discovers attacks that, compared to AA's rand version, further reduce robustness by 8.71%, 7.54%, and 8.84%, respectively. This is achieved by using stronger yet more costly parameter settings, attacks with different backbones (APGD, PGD), and 7 different loss functions (as listed in Appendix F).
Detector-based defenses. For A1, A5, and B23, which are defended with detectors, A³ improves over AA by reducing the robustness by 50.77%, 3.13%, and 3.18%, respectively. This is because none of the attacks discovered by A³ is included in AA: A³ found SQR and APGD for A1, untargeted FAB for A5 (FAB in AA is targeted), and PGD for B23.
Comparison to Handcrafted Adaptive Attacks
Given a new defense, the main strength of our approach is that it directly benefits from all existing techniques included in the search space. While the search space can be easily extended, it is also inherently incomplete. Here, we illustrate this point by comparing our approach to three handcrafted adaptive attacks not included in the search space.
As a first example, A1 (Stutz et al., 2020) proposes an adaptive attack, PGD-Conf with backtracking, that leads to a robust accuracy of 36.9%, which can be improved to 31.6% by combining PGD-Conf with black-box attacks. A³ finds APGD with hinge loss and Z = probs. This combination is interesting since the hinge loss, maximizing the difference between the top two predictions, in fact reflects the PGD-Conf objective function. Further, similarly to the manually crafted attack of A1, a different black-box attack included in our search space, SQR, is found to complement the strength of APGD. When using a sequence of three attacks, this combination leads to 46.36% robust accuracy. However, by increasing the number of attacks in the sequence, the robust accuracy drops further to 26.87%, which is a stronger result than the one reported in the original paper. In this case, our search space and search algorithm are powerful enough not only to replicate the main ideas of Stutz et al. (2020), but also to improve its evaluation when allowing for a larger attack budget. Note that this improvement is possible even without including the backtracking used by PGD-Conf as a building block in our search space. In comparison, the robust accuracy reported by AA is only 77.64%.
As a second example, B15 is known to be susceptible to NES, which achieves 0.16% robust accuracy (Tramer et al., 2020). In our experiment, we limit the time budget of the attack so that the expensive NES cannot be found. The result shows that the SQR attack in the search space is effective enough to achieve a 2.31% robustness evaluation.
As a third example, to break B23, Tramer et al. (2020) designed an adaptive attack that linearly interpolates between the original and the adversarial samples using PGD. This technique breaks the defense and achieves 0% robust accuracy. In comparison, we find PGD, which achieves 3.07% robust accuracy. In this case, the fact that PGD is a relatively weak attack is an advantage: it successfully bypasses the detector by not generating overconfident predictions.

Ablation Studies
Similar to existing handcrafted adaptive attacks, all three components included in the search space were important for generating strong adaptive attacks for a variety of defenses. Here we briefly discuss their importance while including the full experiment results in the supplementary material.
Attack & parameters. We demonstrate the importance of parameters by comparing PGD, C&W, DeepFool, and FGSM with default library parameters to the best configurations found when the available parameters are included in the search space. The attacks found by A³ are on average 5.5% stronger than the best of the four attacks on models A2–A10.
Loss formulation. To evaluate the effect of modeling different loss functions, we remove them from the search space and keep only the original loss function defined for each attack. Without the loss formulation, the search score drops by 3% on average for A2–A10.
Network transformations. For B20, the main reason for the 10% decrease in robust accuracy is the removal of the gradient-obfuscating Reverse Sigmoid defense. In contrast, for A7, B21, and B22, the randomized input processing steps are candidates for removal, but A³ keeps them, as removing these components yields worse results.
Further, in Table 2 we show the effect of the different BPDA instantiations included in our search space. For A3, since the non-differentiable layer is the non-linear thermometer encoding, it is better to approximate it with a function with a non-linear activation. For A4, B19, and B20, the defense is JPEG image compression, and the identity function is the best choice, since the learned networks can overfit when trained on limited data.
Table 2: Robust accuracy (%) for different BPDA instantiations.

BPDA Type  A3  A4  B19  B20

identity  18.5  9.6  70.5  84.0
1x1 convolution  8.9  10.3  70.8  84.9
2-layer conv + ReLU  3.7  14.9  74.1  86.2
6 Related Work
The most closely related work to ours is AutoAttack (Croce and Hein, 2020b), which improves the evaluation of adversarial defenses by proposing an ensemble of four fixed attacks. Further, key to its stronger attacks was a new algorithm, APGD, which improves upon PGD by halving the step size dynamically based on the loss at each step. In our work, we improve over AutoAttack in three key aspects: (i) we formalize a search space of adaptive attacks, rather than using a fixed ensemble, (ii) we design a search algorithm that discovers the best adaptive attacks automatically, significantly improving over the results of AutoAttack, and (iii) our search space is extensible and allows reusing building blocks from one attack in other attacks, effectively expressing new attack instantiations. For example, the idea of dynamically adapting the step size is not tied to APGD; it is a general concept applicable to any step-based algorithm.
Our work is also closely related to recent advances in AutoML, such as in the domain of neural architecture search (NAS) (Zoph and Le, 2017; Elsken et al., 2019). Similar to our work, the core challenge in NAS is an efficient search over a large space of parameters and configurations, and therefore many of its techniques can also be applied to our setting. These include BOHB (Falkner et al., 2018), ASHA (Li et al., 2018), using gradient information coupled with reinforcement learning (Zoph and Le, 2017), or a continuous search space formulation (Liu et al., 2019). Even though finding completely novel neural architectures is often beyond reach, NAS is still very useful and finds many state-of-the-art models. This is also true in our setting: while human experts will continue to play a key role in defining new types of adaptive attacks, as we show in our work, it is already possible to automate many of the intermediate steps.

7 Conclusion
We presented the first tool that aims to automatically find strong adaptive attacks specifically tailored to a given adversarial defense. Our key insight is that we can identify reusable techniques used in existing attacks and formalize them into a search space. Then, we can phrase the challenge of finding new attacks as an optimization problem of finding the strongest attack over this search space.
Our approach automates the tedious and timeconsuming trialanderror steps that domain experts perform manually today, allowing them to focus on the creative task of designing new attacks. By doing so, we also immediately provide a more reliable evaluation of new and existing defenses, many of which have been broken only after their proposal because the authors struggled to find an effective attack by manually exploring the vast space of techniques.
We implemented our approach in a tool called A³ and demonstrated that it outperforms the state-of-the-art tool AutoAttack (Croce and Hein, 2020b). Importantly, even though our current search space contains only a subset of existing techniques, our evaluation shows that A³ can partially rediscover, or even improve upon, some handcrafted adaptive attacks not yet included in our search space.
References
Square attack: a query-efficient black-box adversarial attack via random search. In Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm (Eds.), Cham, pp. 484–501.
Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10–15, 2018, J. G. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 274–283.
Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12–14 December 2011, Granada, Spain, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. C. N. Pereira, and K. Q. Weinberger (Eds.), pp. 2546–2554.
Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16–21 June 2013, JMLR Workshop and Conference Proceedings, Vol. 28, pp. 115–123.
Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 387–402.
Thermometer encoding: one hot way to resist adversarial examples. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 – May 3, 2018, Conference Track Proceedings.
Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57.
Adversarial examples are not easily detected: bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, AISec '17, New York, NY, USA, pp. 3–14.
Minimally distorted adversarial examples with a fast adaptive boundary attack. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13–18 July 2020, Virtual Event, Proceedings of Machine Learning Research, Vol. 119, pp. 2196–2205.
Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13–18 July 2020, Virtual Event, Proceedings of Machine Learning Research, Vol. 119, pp. 2206–2216.
Keeping the bad guys out: protecting and vaccinating deep learning with JPEG compression. arXiv preprint arXiv:1705.02900.
Neural architecture search: a survey. arXiv preprint arXiv:1808.05377.
BOHB: robust and efficient hyperparameter optimization at scale. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10–15, 2018, J. G. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 1436–1445.
Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.).
Your classifier is secretly an energy based model and you should treat it like one. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020.
On the (statistical) detection of adversarial examples. CoRR abs/1702.06280.
Countering adversarial images using input transformations. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 – May 3, 2018, Conference Track Proceedings.
A new defense against adversarial images: turning a weakness into a strength. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8–14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 1633–1644.
Non-stochastic best arm identification and hyperparameter optimization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 9–11, 2016, A. Gretton and C. C. Robert (Eds.), JMLR Workshop and Conference Proceedings, Vol. 51, pp. 240–248.
Almost optimal exploration in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16–21 June 2013, JMLR Workshop and Conference Proceedings, Vol. 28, pp. 1238–1246.
Defending against machine learning model stealing attacks using deceptive perturbations. arXiv preprint arXiv:1806.00054.
Massively parallel hyperparameter tuning. arXiv preprint arXiv:1810.05934.
Adversarial examples detection in deep networks with convolutional filter statistics. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22–29, 2017, pp. 5775–5783.
DARTS: differentiable architecture search. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019.
Towards deep learning models resistant to adversarial attacks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 – May 3, 2018, Conference Track Proceedings.
On detecting adversarial perturbations. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings.
DeepFool: a simple and accurate method to fool deep neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2574–2582.
Adversarial robustness toolbox v0.2.2. CoRR abs/1807.01069.
Distillation as a defense to adversarial perturbations against deep neural networks. CoRR abs/1511.04508.
PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8–14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 8024–8035.
Foolbox: a Python toolbox to benchmark the robustness of machine learning models. In Reliable Machine Learning in the Wild Workshop, 34th International Conference on Machine Learning.
Adversarial manipulation of deep representations. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2–4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.).
HYDRA: pruning adversarially robust neural networks. Advances in Neural Information Processing Systems (NeurIPS).
Confidence-calibrated adversarial training: generalizing to unseen attacks. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13–18 July 2020, Virtual Event, Proceedings of Machine Learning Research, Vol. 119, pp. 9155–9166.
Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.).
On adaptive attacks to adversarial example defenses. arXiv preprint arXiv:2002.08347.
ResNets ensemble via the Feynman-Kac formalism to improve natural and robust accuracies. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8–14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 1655–1665.
Improving adversarial robustness requires revisiting misclassified examples. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020.
Natural evolution strategies. In 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence), pp. 3381–3387.
Adversarial weight perturbation helps robust generalization. Advances in Neural Information Processing Systems 33.
Enhancing adversarial defense by k-winners-take-all. In International Conference on Learning Representations.
Defense against adversarial attacks using feature scattering-based adversarial training. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8–14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 1829–1839.
Neural architecture search with reinforcement learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings.
Appendix A Evaluation Metrics Details
We use the following criteria in the formulation:
For both criteria, we remove misclassified clean inputs as a preprocessing step, such that the evaluation is performed only on the subset of correctly classified samples.
Sequence of Attacks
The sequence of attacks defined in Section 3.1 is a way to calculate the per-example worst-case evaluation; the four-attack ensemble in AutoAttack is equivalent to a sequence of four attacks [APGD, APGD, FAB, SQR]. Algorithm 2 elaborates how a sequence of attacks is evaluated: the attacks are performed in the order in which they were defined, and the first sample that satisfies the criterion is returned.
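The per-example worst-case evaluation of a sequence of attacks can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 2: the callable interfaces for `attack` and `criterion` and all names are hypothetical.

```python
import numpy as np

def run_attack_sequence(model, attacks, x, y, criterion):
    """Per-example worst-case evaluation of a sequence of attacks.

    Attacks are tried in order; for each sample, the first adversarial
    candidate that satisfies the success criterion is kept, and only the
    surviving (still robust) samples are passed to the next attack.
    """
    x_adv = x.copy()
    remaining = np.arange(len(x))            # indices not yet fooled
    for attack in attacks:
        if len(remaining) == 0:
            break
        candidates = attack(model, x[remaining], y[remaining])
        success = criterion(model, candidates, y[remaining])
        x_adv[remaining[success]] = candidates[success]
        remaining = remaining[~success]      # keep attacking the survivors
    return x_adv, remaining                  # remaining = still-robust samples
```

Because later attacks only see the survivors of earlier ones, a cheap attack placed first can substantially reduce the cost of an expensive attack placed later.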
Robust Test Error (Rerr)
The Rerr defined in Equation 3 of Section 5 has an intractable maximization problem in the denominator, so Equation 4 is the empirical equation used to compute an upper bound on Rerr. This empirical evaluation is the same as the one in Stutz et al. (2020).
(4) 
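The empirical upper bound can be sketched as follows, under the standard formulation in which a sample counts as an error if either its clean or its adversarial version is misclassified; the paper's exact Equation 4 may differ in presentation, and the deterministic `model` interface here is an assumption.

```python
import numpy as np

def empirical_rerr(model, x, y, x_adv):
    """Empirical upper bound on robust test error: a sample counts as an
    error if it is misclassified on either the clean input or the
    adversarial input found by the attack."""
    clean_err = model(x) != y        # error already on the clean sample
    adv_err = model(x_adv) != y      # error induced by the attack
    return float(np.mean(clean_err | adv_err))
```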
Detectors
For a network with a detector, the criterion function is misclassification combined with the detector, and it is applied in line 3 of Algorithm 2. This formulation enables per-example worst-case evaluation for detector defenses.
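Such a criterion can be sketched as follows, assuming a hypothetical interface in which the detector returns True for inputs it flags: an adversarial example succeeds only if it is misclassified and not detected.

```python
import numpy as np

def detector_criterion(model, detector, x_adv, y):
    """Success criterion for detector defenses: an adversarial example
    counts only if it is misclassified AND the detector does not flag it."""
    misclassified = model(x_adv) != y
    undetected = ~detector(x_adv)
    return misclassified & undetected
```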
Randomized Defenses
If the network has a randomized component, the expectation in Equation 4 means drawing a random sample from the distribution. In the evaluation metrics, we report the mean over the adversarial samples evaluated 10 times.
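This repeated evaluation can be sketched as follows. The explicit random-generator argument is a hypothetical interface for illustration; an actual randomized defense draws its own randomness internally on each forward pass.

```python
import numpy as np

def randomized_robust_acc(model, x_adv, y, n_draws=10, seed=0):
    """For a randomized defense, each forward pass draws fresh randomness;
    every adversarial sample is evaluated n_draws times and the mean
    accuracy over the draws is reported (the paper uses 10 evaluations)."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_draws):
        preds = model(x_adv, rng)            # model consumes randomness
        accs.append(np.mean(preds == y))
    return float(np.mean(accs))
```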
Appendix B Search Space of A
b.1 Loss function space
Cross Entropy (CE), Hinge loss (Hinge), L1 loss (L1), Difference in Logit Ratio (DLR), and Logit Matching (LM) are the five loss functions used in our experiments. For Hinge, the confidence value is set to infinity so as to encourage stronger adversarial examples; making it a tunable loss parameter is left for future work.
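Minimal numpy sketches of these losses, under common formulations from the literature; the paper's exact definitions and sign conventions may differ, and the Hinge variant below is the unclamped margin corresponding to infinite confidence.

```python
import numpy as np

def ce_loss(probs, y):
    # cross-entropy on probabilities (CE supports only probs in the search space)
    return -np.log(probs[np.arange(len(y)), y] + 1e-12)

def hinge_loss(logits, y):
    # C&W-style margin loss with "infinite" confidence (no clamping):
    # gap between the best wrong logit and the true-class logit
    z_y = logits[np.arange(len(y)), y]
    z_other = logits.copy()
    z_other[np.arange(len(y)), y] = -np.inf
    return np.max(z_other, axis=1) - z_y

def dlr_loss(logits, y):
    # untargeted difference-in-logit-ratio loss (Croce & Hein, 2020b)
    z_sorted = np.sort(logits, axis=1)[:, ::-1]      # descending order
    z_y = logits[np.arange(len(y)), y]
    z_other = logits.copy()
    z_other[np.arange(len(y)), y] = -np.inf
    max_other = np.max(z_other, axis=1)
    return -(z_y - max_other) / (z_sorted[:, 0] - z_sorted[:, 2] + 1e-12)

def logit_matching_loss(logits, target_logits):
    # match the logits of a (correctly classified) reference sample
    return np.sum((logits - target_logits) ** 2, axis=1)
```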
Recall from Section 3.3 that the loss function search space is defined as:
(Loss Function Search Space)  

::=  Loss, n Z  |  Loss Z  |  Loss, n  Loss Z
Z ::=  logits  |  probs
To refer to different settings, we use the following notation:

U: for the untargeted loss,
T: for the targeted loss,
D: for the loss taken over the difference between the predictions of two classes,
L: for using logits, and
P: for using probs.
For example, we use DLRUL to denote the untargeted DLR loss on logits. The loss space used in the evaluation is shown in Table 3. Effectively, the search space includes all possible combinations, except that the cross-entropy loss supports only probabilities. Note that although some losses are designed for logits and some for targeted attacks, the search space still keeps the other possibilities as options (i.e., it is up to the search algorithm to learn which combinations are useful and which are not).
Loss  CE  Hinge  L1  DLR  LM
Targeted  ✓  ✓  ✓  ✓  ✓ 
Logit/Prob  P  ✓  ✓  ✓  ✓ 
Attack  Randomize  EOT  Repeat  Loss  Targeted  logit/prob 

FGSM  True  ✓  ✓  ✓  
PGD  True  ✓  ✓  ✓  
DeepFool  False  ✓  D  ✓  
APGD  True  ✓  ✓  ✓  
C&W  False    {U, T}  L  
FAB  True    {U, T}  L  
SQR  True  ✓  ✓  ✓  
NES  True  ✓  ✓  ✓ 
b.2 Attack Algorithm & Parameters Space
Recall the attack space defined in Section 3.1 as:
::=  ; 

, n  
, n n  
Attack params loss 
Randomize, EOT, and repeat are the generic parameters, and params are the attack-specific parameters. The type of every parameter is either integer or float, specified by an inclusive value range. Besides the value range, the parameter estimator model (TPE in our case) needs a prior, which is either uniform (the default) or log-uniform.
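The uniform versus log-uniform distinction can be sketched with a simple sampler; TPE itself builds density models over these priors, so this only illustrates how the two priors shape the sampled values (function and names are hypothetical).

```python
import numpy as np

def sample_param(rng, low, high, prior="uniform", integer=False):
    """Draw one parameter value from the inclusive range [low, high].

    prior='log' mimics the log-uniform prior used for parameters whose
    useful values span orders of magnitude (e.g. step counts, step sizes).
    """
    if prior == "log":
        value = np.exp(rng.uniform(np.log(low), np.log(high)))
    else:
        value = rng.uniform(low, high)
    return int(round(value)) if integer else float(value)
```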
The generic parameters and the supported losses for each attack algorithm are defined in Table 4. An algorithm returns a deterministic result if randomize is False; otherwise the results may differ across runs due to randomization. Randomness can come either from perturbing the initial input or from the attack algorithm itself. Input initialization is deterministic if the starting input is the original input or an input with a fixed disturbance, and randomized if the starting input is chosen uniformly at random within the adversarial capability. For example, the first iteration of FAB uses the original input, but subsequent restarts are randomized (if randomization is enabled). Attack algorithms like SQR, which is based on random search, have randomness in the algorithm itself; the deterministic version of such randomized algorithms is obtained by fixing the initial random seed.
The definition of randomize for FGSM, PGD, NES, APGD, FAB, DeepFool, and C&W is whether to start from the original input or from a point selected uniformly at random within the adversarial capability. For SQR, randomize means whether to fix the seed. We generally set repeat to True to allow repeating the attacks for stronger attack strength, but we set it to False for DeepFool and C&W, as they are minimization attacks designed to start from the original inputs.
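The randomized start described above can be sketched as follows, assuming an L-infinity adversarial capability and inputs in [0, 1] (a generic sketch, not the library implementation):

```python
import numpy as np

def init_input(x, eps, randomize, rng):
    """Starting point for an attack: either the original input, or a point
    drawn uniformly at random within the L-infinity ball of radius eps,
    clipped to the valid image range [0, 1]."""
    if not randomize:
        return x.copy()
    noise = rng.uniform(-eps, eps, size=x.shape)
    return np.clip(x + noise, 0.0, 1.0)
```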
The attack-specific parameters are specified in Table 5; the ranges are chosen to be representative by setting reasonable upper and lower bounds that include the default parameter values. Note that the DeepFool algorithm uses the D loss by design, taking the difference between the predictions of two classes. C&W uses the hinge loss, and FAB uses a loss similar to DeepFool's. For C&W and FAB, we take the library implementation of the loss (i.e., without our loss function space formulation).
Attack  Parameter  Range and prior 

PGD  step  
rel_stepsize  
C&W  confidence  
max_iter  
binary_search_steps  
learning_rate  
max_halving  
max_doubling  
NES  step  
rel_stepsize  
n_samples  
APGD  rho  
n_iter  
FAB  n_iter  
eta  
beta  
SQR  n_queries  
p_init 
Time limit (s)  Attack1  Loss1  Attack2  Loss2  Attack3  Loss3  

a1  2  SQR  DLRUL  SQR  DLRTL  APGD  HingeUP 
a2  0.5  APGD  HingeTP  APGD  L1DP  APGD  CETP 
a3  0.5  APGD  HingeUL  APGD  DLRTL  APGD  CEDP 
a4  0.5  APGD  CETP  APGD  DLRUL  APGD  L1TP 
a5  0.5  FAB  –FL  APGD  LMUP  DeepFool  DLRDL 
a6  0.5  APGD  HingeUP  APGD  HingeUP  PGD  DLRTP 
a7  0.5  APGD  L1DL  APGD  DLRUL  APGD  HingeTL 
a8  0.5  APGD  DLRTP  APGD  DLRUL  APGD  HingeTL 
a9  1  APGD  L1UP  APGD  CEUP  APGD  CEDP 
a10  30  NES  HingeUP         
b11  3  APGD  HingeTP  DeepFool  L1DL  PGD  CEDP 
b12  3  APGD  HingeUL  APGD  CEDP  APGD  HingeTP 
b13  3  FAB  –FL  APGD  L1TL  FAB  –FL 
b14  3  APGD  L1DP  APGD  CEFP  APGD  DLRTL 
b15  3  SQR  HingeUL  SQR  L1UL  SQR  CEUL 
b16  3  APGD  L1DP  C&W  HingeUL  PGD  HingeTL 
b17  3  APGD  HingeUL  APGD  DLRTL  APGD  DLRTL 
b18  3  APGD  HingeTL  APGD  CEUP  C&W  –UL 
b19  3  APGD  HingeUL  APGD  DLRTL  FGSM  CEUP 
b20  3  APGD  HingeUL  APGD  DLRTL  FGSM  DLRUP 
b21  3  PGD  DLRUP  FGSM  L1UP  FGSM  DLRUL 
b22  3  APGD  L1TL  PGD  L1UP  PGD  L1UP 
b23  2  PGD  L1TP  APGD  CETP  APGD  L1UL 

b.3 Search space conditioned on network properties
Properties of network defenses (e.g., randomized, detector, obfuscation) can be used to reduce the search space. In our work, EOT is fixed for deterministic networks, and repeat is fixed for randomized networks, following the practice of AutoAttack, which fixes repeat in its rand version. Logit Matching is enabled only when detectors are present, since this loss is designed to bypass detectors.
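This conditioning can be sketched as follows. The dict-based space and the concrete fixed values are assumptions for illustration, not the paper's exact settings.

```python
def condition_search_space(space, is_randomized, has_detector):
    """Prune the attack search space using known defense properties.

    Hypothetical sketch: 'space' maps parameter names to candidate values;
    the fixed values below (e.g. 1) are assumptions, not the paper's.
    """
    space = dict(space)
    if not is_randomized:
        space["eot"] = [1]        # EOT adds nothing without randomness
    else:
        space["repeat"] = [1]     # repeats add little against randomness
    if not has_detector:
        # Logit Matching is meant for bypassing detectors
        space["loss"] = [l for l in space["loss"] if l != "LM"]
    return space
```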
Appendix C Discovered Adaptive Attacks
Our 23 benchmarks presented in Table 2 are selected to contain diverse defenses. Table 7 shows the network transformation results, and Table 6 shows the attacks and losses found during the attack search.
Defenses Related to Network Transformations
The benchmark contains defenses that are related to network transformations. JPEG compression (JPEG) applies an image compression algorithm to the input, so that the network is non-differentiable and adversarial perturbations are reduced. Reverse sigmoid (RS) is a special layer added to the logit output of the model in order to obfuscate the gradient. Thermometer Encoding (TE) is a non-differentiable input encoding technique that shatters the linearity of the inputs. Random rotation (RR) belongs to the family of randomized defenses; it rotates the input image by a random angle each time. Table 7 shows where these defenses appear and which network processing strategies are applied.
Diversity of Attacks
In Table 6, the majority of the attack algorithms found are APGD, which shows that APGD is indeed a strong universal attack. The second or third attack can be a weak attack like FGSM; a major reason is that many attacks tie on the criterion evaluation, and the noise in the untargeted CE loss tie-breaker sometimes determines the choice of attack. The loss functions show variety, yet Hinge and DLR appear more often. This challenges the common practice of using CE as the default loss function.
Removal Policies  BPDA Policies  

a3    TEC 
a4  JPEG1 RS1  JPEGI 
a6  RR0   
a7  JPEG1 RS1 RR1  TEC, JPEGI 
b19  JPEG0 RS0  JPEGI 
b20  JPEG1 RS1  JPEGI 
b21  RR0   
b22  RR0   
Appendix D Time Complexity
This section gives a worst-case time analysis for Algorithm 1. Attack time is the total time spent in line 4 to remove all non-robust samples; this step counts as attack time because it is where the robustness of the network is evaluated. Search time is the time spent in the SHA iterations in lines 9 and 14, where the time-critical function is called. The time for the network transformation in line 1 is excluded from the analysis, as it incurs only a small runtime overhead in practice. We bound the time spent per attack and per sample by a fixed time constraint.
For the attack time, the worst case is when the attacks use the full time budget on all samples. This gives the bound shown in Equation 5.
(5) 
For the search time, we first derive the bound for a single attack search; the bound for searching a sequence of attacks is then a multiple of this value. In line 9, the maximum time is the cost of performing all candidate attacks on the search samples. In line 14, the cost of the first SHA iteration is of the same order, as all candidate attacks are run on the samples. By design, the cost of each subsequent SHA iteration is halved, so the total time for a single attack search is bounded by a constant multiple of the first iteration's cost. The resulting search time bound is shown in Equation 6.
(6) 
With the parameter values used in our evaluation, the total search time is bounded by the time bound of executing a sequence of attacks.
The empirical search time scales roughly linearly with the number of trials and sublinearly with the number of samples. These search parameters are used to control the trade-off between search time and search quality.
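The halving argument behind the search-time bound can be illustrated numerically. The cost model and the concrete numbers below are hypothetical; the point is only that the summed SHA rounds stay below twice the cost of the first round.

```python
def sha_search_time(n_attacks, n_samples, t_per_sample, n_rounds):
    """Worst-case search time of successive halving (SHA): the first round
    evaluates all candidate attacks on all search samples, and each later
    round costs half of the previous one (a geometric series), so the
    total is bounded by twice the first round's cost."""
    cost = n_attacks * n_samples * t_per_sample   # first-round cost
    total = 0.0
    for _ in range(n_rounds):
        total += cost
        cost /= 2.0
    return total
```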
Appendix E AttackScore Distribution during Search
Analyzing the attack-score distribution can be useful for understanding the search process. Figure 3 shows the distribution for network A2; in this experiment we fix the number of trials, the number of samples, and the time budget. Scores with negative values correspond to trials that timed out. From the results, we can see that:

Expensive attacks like NES time out because a small time budget is used. The parameter ranges can also affect the search; for instance, FGSM times out because the repeat parameter can be very large.

The sensitivity of the score to the parameters varies across attack algorithms. For example, PGD has a large variance in scores, whereas APGD is very stable by design.

The TPE algorithm samples attack algorithms with high scores more often, which enables it to choose better attack parameters during the SHA stage.

The top attacks have similar performance, which means the searched attack should have low variance in attack strength. In practice, the variance among the best attacks found is typically small.
Appendix F Ablation Study
Here we provide details on the ablation study in Section 5.
f.1 Attack Algorithm & Parameters
In the experiment setup, the search space includes four attacks (FGSM, PGD, DeepFool, C&W) with their generic and specific parameters shown in Table 4 and Table 5, respectively. The loss search space contains only the losses from the library implementation, and the network transformation space contains only BPDA. Robust accuracy (Racc) is used as the evaluation metric. The best Racc among FGSM, PGD, DeepFool, and C&W with the library's default parameters is calculated and compared with the Racc of the searched attack.
The results in Table 8 show an average robustness improvement of 5.5%, and up to 17.3%. The PGD evaluation can be much stronger after tuning, which reflects the fact that insufficient parameter tuning of PGD is a common cause of overestimated robustness in the literature.
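For illustration, a minimal L-infinity PGD exposing the two searchable parameters from Table 5 (step count and relative step size). This is a generic sketch, not the library implementation used in the paper; `grad_fn` stands in for the loss gradient of the attacked network.

```python
import numpy as np

def pgd(grad_fn, x, eps, steps, rel_stepsize, rng=None):
    """L-infinity PGD with the two tunable parameters from the search
    space: 'steps' and 'rel_stepsize' (step size as a fraction of eps)."""
    alpha = rel_stepsize * eps
    x_adv = x.copy()
    if rng is not None:                      # optional randomized start
        x_adv = np.clip(x + rng.uniform(-eps, eps, x.shape), 0.0, 1.0)
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))   # ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)          # project to eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)                  # valid image range
    return x_adv
```

Tuning only these two parameters already changes the effective search radius per step, which is why default settings can substantially under-report attack strength.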
Library Impl.  A  

Net  Racc  Attack  Racc  Attack  
A2  47.1  C&W  47.0  0.1  PGD 
A3  13.4  PGD  13.4  6.8  PGD 
A4  35.9  DeepFool  35.9  5.6  PGD 
A5  6.6  DeepFool  6.6  0.0  DeepFool 
A6  14.5  PGD  8.4  6.1  PGD 
A7  35.0  PGD  17.3  17.7  PGD 
A8  6.9  C&W  6.6  0.3  C&W 
A9  25.4  PGD  14.7  10.7  PGD 
A10  64.7  FGSM  62.4  2.3  PGD 
f.2 Loss
Figure 4 shows the comparison between TPE with our loss formulation and TPE with the default losses. The search space with default losses contains only the L1 and CE losses, restricted to the untargeted variant on logit output. The loss formulation improves the final score by 3.0%.
f.3 TPE algorithm vs Random
Figure 4 shows the comparison between TPE search and random search. TPE finds better scores by 1.3% on average and by up to 8.0% (A6), depending on the network.