Automated Discovery of Adaptive Attacks on Adversarial Defenses

by   Chengyuan Yao, et al.

Reliable evaluation of adversarial defenses is a challenging task, currently limited to an expert who manually crafts attacks that exploit the defense's inner workings, or to approaches based on an ensemble of fixed attacks, none of which may be effective for the specific defense at hand. Our key observation is that custom attacks are composed from a set of reusable building blocks, such as fine-tuning relevant attack parameters, network transformations, and custom loss functions. Based on this observation, we present an extensible framework that defines a search space over these reusable building blocks and automatically discovers an effective attack on a given model with an unknown defense by searching over suitable combinations of these blocks. We evaluated our framework on 23 adversarial defenses and showed it outperforms AutoAttack, the current state-of-the-art tool for reliable evaluation of adversarial defenses: our discovered attacks are either stronger, producing 3.0%-50.8% additional adversarial examples (10 cases), or are typically 2x faster while enjoying similar adversarial robustness (13 cases).




1 Introduction

The issue of adversarial robustness and attacks (Szegedy et al., 2014; Goodfellow et al., 2015), i.e., generating small input perturbations that lead to mispredictions, is an important problem with a large body of recent work that affects all current deep learning models. Unfortunately, reliable evaluation of proposed defenses is an elusive and challenging task: many defenses seem to initially be effective, only to be circumvented later by new attacks designed specifically with that defense in mind (Carlini and Wagner, 2017b; Athalye et al., 2018; Tramer et al., 2020).

To address this challenge, two recent works approach the problem from different perspectives. Tramer et al. (2020) outline an approach for manually crafting adaptive attacks that exploit the weak points of each defense. Here, a domain expert starts with an existing attack, such as PGD (Madry et al., 2018) (see Figure 1 (a)), and adapts it based on knowledge of the defense's inner workings. Common modifications include: (i) tuning attack parameters (e.g., number of steps), (ii) replacing network components to simplify the attack (e.g., removing randomization or non-differentiable components), and (iii) replacing the loss function optimized by the attack. This approach was demonstrated to be effective in breaking all of the considered defenses. However, a downside is that it requires substantial manual effort and is limited by the domain knowledge of the expert: for instance, each of the defenses came with an adaptive attack which was insufficient, in retrospect.

At the same time, Croce and Hein (2020b) proposed to assess adversarial robustness using an ensemble of four diverse attacks illustrated in Figure 1 (b): APGD with cross-entropy loss (Croce and Hein, 2020b), APGD with difference in logit ratio loss, FAB (Croce and Hein, 2020a), and Square Attack (SQR) (Andriushchenko et al., 2020). While these do not require manual effort and have been shown to provide a better robustness estimate for many defenses than the original evaluation, the approach is inherently limited by the fact that the attacks are fixed a priori without any knowledge of the particular defense at hand. This is visualized in Figure 1 (b) where, even though the attacks are designed to be diverse, they cover only a small part of the entire space.





Figure 1: High-level illustration and comparison of recent works and ours. (a) Handcrafted adaptive attacks (Tramer et al., 2020) rely on a human expert to manually adapt an existing attack to exploit the weak points of each defense. (b) AutoAttack (Croce and Hein, 2020b) evaluates defenses using an ensemble of diverse fixed attacks. (c) Adaptive attack search (our work) defines a search space of adaptive attacks and performs the search steps automatically.

This work: discovery of adaptive attacks

We present a new method that automates the process of crafting adaptive attacks, combining the best of both prior approaches – the ability to evaluate defenses automatically while producing attacks tuned for the given defense. Our work is based on the key observation that we can identify common techniques used to build existing adaptive attacks and extract them as reusable building blocks in a common framework. Then, given a new model with an unseen defense, we can discover an effective attack by searching over suitable combinations of these building blocks.

To identify reusable techniques, we analyze existing adaptive attacks and organize their components into three groups:

  • Attack algorithm and parameters: a library of diverse attack techniques (e.g., APGD, FAB, C&W (Carlini and Wagner, 2017a), NES (Wierstra et al., 2008)), together with backbone specific and generic parameters (e.g., input randomization, number of steps, if and how to use expectation over transformation (Athalye et al., 2018)).

  • Network transformations: producing an easier-to-attack surrogate model using techniques including variants of BPDA (Athalye et al., 2018) to break gradient obfuscation, and layer removal (Tramer et al., 2020) to eliminate obfuscation layers such as a redundant softmax operator.

  • Loss functions: different ways of defining the loss function that is optimized by the attack (e.g., cross-entropy, hinge loss, logit matching, etc.).

These components collectively formalize an attack search space induced by their different combinations. We also present an algorithm that effectively navigates the search space to discover an effective attack. In this way, domain experts are left with the creative task of designing completely new attacks and growing the framework by adding missing attack components, while the tool is responsible for automating many of the tedious and time-consuming trial-and-error steps that domain experts perform manually today.

We implemented our approach in a tool called Adaptive AutoAttack (A) and evaluated it on diverse adversarial defenses. Our results demonstrate that A discovers adaptive attacks that outperform AutoAttack (Croce and Hein, 2020b), the current state-of-the-art tool for reliable evaluation of adversarial defenses: A finds attacks that are either stronger, producing 3.0%-50.8% additional adversarial examples (10 cases), or on average 2x and up to 5.5x faster while enjoying similar adversarial robustness (13 cases). The source code of A and our scripts for reproducing the experiments are available online at:

2 Automated Discovery of Adaptive Attacks

We use $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ to denote a training dataset where $x$ is a natural input (e.g., an image) and $y$ is the corresponding label. An adversarial example is a perturbed input $x'$, such that: (i) it satisfies an attack criterion $c$, e.g., a $K$-class classification model $f$ predicts a wrong label, and (ii) the distance $d(x', x)$ between the adversarial input $x'$ and the natural input $x$ is below a threshold $\epsilon$ under a distance metric $d$ (e.g., an $\ell_p$ norm). Formally, this can be written as:

$$\underbrace{d(x', x) \le \epsilon}_{\text{capability}} \quad \text{such that} \quad \underbrace{c(f, x', x)}_{\text{goal (criterion)}} \qquad \text{(Adversarial Attack)}$$

For example, instantiating this with an $\ell_p$ norm and the misclassification criterion, we obtain the following formulation:

$$\|x' - x\|_p \le \epsilon \quad \text{such that} \quad \hat{f}(x') \ne \hat{f}(x) \qquad \text{(Misclassification Attack)}$$

where $\hat{f}$ returns the prediction of the model $f$. Further, in case the model uses a defense to abstain from making predictions whenever an adversarial input is detected, then the formulation is:

$$\|x' - x\|_p \le \epsilon \quad \text{such that} \quad \hat{f}(x') \ne \hat{f}(x) \;\wedge\; g(x') = 1 \qquad \text{(Misclassification Attack with Detector)}$$

where $g$ is a detector, and the model makes a prediction when $g(x') = 1$ and otherwise it rejects the input. A common way to implement the detector $g$ is to perform a statistical test with the goal of differentiating natural and adversarial samples (Grosse et al., 2017; Metzen et al., 2017; Li and Li, 2017).
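The three formulations above share one shape: a capability check on the perturbation and a goal check on the model's output. A minimal Python sketch of that check follows. The l-infinity distance, the function names, and the toy model are illustrative assumptions, not part of the paper's tool.

```python
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def linf_distance(x_adv: Vector, x: Vector) -> float:
    """Largest absolute coordinate difference (l-infinity norm, assumed here for illustration)."""
    return max(abs(a - b) for a, b in zip(x_adv, x))


def is_adversarial(
    predict: Callable[[Vector], int],
    x_adv: Vector,
    x: Vector,
    epsilon: float,
    detector: Optional[Callable[[Vector], bool]] = None,
) -> bool:
    """Capability: distance within epsilon. Goal: the prediction changes and the detector (if any) accepts."""
    within_budget = linf_distance(x_adv, x) <= epsilon
    misclassified = predict(x_adv) != predict(x)
    accepted = detector is None or bool(detector(x_adv))
    return bool(within_budget and misclassified and accepted)


# Hypothetical toy model: predicts class 1 when the coordinates sum to a positive value.
toy_predict = lambda v: int(sum(v) > 0)
x_nat = [0.0, -0.2]
x_adv = [0.25, -0.2]
print(is_adversarial(toy_predict, x_adv, x_nat, epsilon=0.5))
```

Passing a detector that rejects `x_adv` makes the same candidate fail the check, which mirrors the third formulation.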

Problem Statement

Given a model $f$ equipped with an unknown set of defenses and a dataset $\mathcal{D}$, our goal is to find an adaptive adversarial attack $a \in \mathcal{A}$ that is best at generating adversarial samples $x'$ according to the attack criterion $c$ and the attack capability $d(x', x) \le \epsilon$:

$$\max_{a \in \mathcal{A},\; d(x', x) \le \epsilon} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \; c(f, x', x) \qquad \text{where } x' = a(x, f) \qquad (1)$$

Here, $\mathcal{A}$ denotes the search space of all possible attacks, where the goal of each attack $a$ is to generate an adversarial sample $x' = a(x, f)$ for a given input $x$ and model $f$. For example, solving this optimization problem with respect to the misclassification criterion corresponds to maximizing the number of adversarial examples misclassified by the model.

In our work, we consider an implementation-knowledge adversary, who has full access to the model's implementation at inference time (e.g., the model's computational graph). We chose this threat model as it matches our problem setting: given an unseen model implementation, we want to automatically find an adaptive attack that exploits its weak points, but without the need of a domain expert. We note that this threat model is weaker than a perfect-knowledge adversary (Biggio et al., 2013), which assumes a domain expert that also has knowledge about the training dataset and algorithm, as this information is difficult, or even impossible, to recover from the model's implementation alone. (We only assume access to the dataset used to evaluate adversarial robustness, typically the test dataset, but not to the training and validation datasets.)

Key Challenges

To solve the optimization problem from Equation 1, we address two key challenges:

  • Defining a suitable attack search space that is expressive enough to cover a range of existing adaptive attacks.

  • Searching over the space efficiently such that a strong attack is found within a reasonable time.

We start by formalizing the attack space in Section 3 and then describe our search algorithm in Section 4.

3 Adaptive Attacks Search Space

We define the adaptive attack search space by analyzing existing adaptive attacks and identifying common techniques used to break adversarial defenses. Formally, the adaptive attack search space is given by $\mathcal{A} = \mathbb{S} \times \mathbb{T}$, where $\mathbb{S}$ consists of sequences of backbone attacks along with their loss functions, selected from a space of loss functions $\mathbb{L}$, and $\mathbb{T}$ consists of network transformations. Semantically, given an input $x$ and a model $f$, the goal of an adaptive attack $(s, t) \in \mathbb{S} \times \mathbb{T}$ is to return an adversarial example $x'$ by computing $x' = s(x, t(f))$. That is, it first transforms the model $f$ by applying the transformation $t$, and then executes the attack $s$ on the surrogate model $t(f)$. Note that the surrogate model is used only to compute the candidate adversarial example, not to evaluate it. That is, we generate an adversarial example $x'$ for $t(f)$, and then check whether it is also adversarial for $f$. Since $x'$ may be adversarial for $t(f)$, but not for $f$, the adaptive attack must maximize the transferability of the generated candidate adversarial samples.
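The distinction between attacking the surrogate and evaluating on the original model can be illustrated with a small, self-contained Python sketch. The toy integer-valued model, the quantization "defense", the hand-written surrogate, and the `margin` parameter are all assumptions made for this example.

```python
from typing import Callable, Tuple

Model = Callable[[int], int]  # maps an integer-valued input to a class label


def defended_model(x: int) -> int:
    """Toy defended model f: quantizes the input to a multiple of 10, then thresholds."""
    return int((x // 10) * 10 > 50)


def remove_quantization(model: Model) -> Model:
    """Hypothetical transformation t: a hand-written surrogate of the toy model without quantization."""
    return lambda x: int(x > 50)


def margin_attack(x: int, surrogate: Model, epsilon: int = 15, margin: int = 1) -> int:
    """Toy attack: increase x until the surrogate label flips, then push `margin - 1` steps further."""
    original = surrogate(x)
    for delta in range(1, epsilon + 1):
        if surrogate(x + delta) != original:
            return x + min(delta + margin - 1, epsilon)
    return x  # no adversarial candidate within the budget


def adaptive_attack(x: int, model: Model, transform, attack) -> Tuple[int, bool]:
    """Run the attack on the surrogate t(f), then check the candidate against the original model f."""
    surrogate = transform(model)
    candidate = attack(x, surrogate)
    return candidate, model(candidate) != model(x)


weak = adaptive_attack(48, defended_model, remove_quantization, margin_attack)
strong = adaptive_attack(
    48, defended_model, remove_quantization,
    lambda x, s: margin_attack(x, s, margin=10),
)
print(weak, strong)
```

In this toy setting the first candidate fools only the surrogate, while the one with a larger margin also transfers to the defended model, which is why the search has to score candidates on the original model.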

3.1 Attack Algorithm & Parameters ($\mathbb{S}$)

The attack search space consists of a sequence of adversarial attacks. We formalize the search space with the grammar:

(Attack Search Space)
S ::= S; S | randomize S | EOT S, n | repeat S, n | try S for n | Attack with params with loss ∈ L

  • S; S: composes two attacks, which are executed independently and return the first adversarial sample in the defined order. That is, given an input $x$, the attack $S_1; S_2$ returns $S_1(x)$ if $S_1(x)$ is an adversarial example, and otherwise it returns $S_2(x)$.

  • randomize S: enables the attack's randomized components (if any). The randomization corresponds to using a random seed and/or selecting a starting point within the perturbation budget $d(x', x) \le \epsilon$, uniformly at random.

  • EOT S, n: uses expectation over transformation, a technique designed to compute gradients for models with randomized components (Athalye et al., 2018).

  • repeat S, n: repeats the attack n times. Note that repeat is useful only if randomization is enabled.

  • try S for n: executes the attack with a time budget of n seconds.

  • Attack with params with loss ∈ L: is a backbone attack Attack executed with parameters params and loss function loss. Our tool A supports FGSM (Goodfellow et al., 2015), PGD (Madry et al., 2018), DeepFool (Moosavi-Dezfooli et al., 2016), C&W (Carlini and Wagner, 2017a), NES (Wierstra et al., 2008), APGD (Croce and Hein, 2020b), FAB (Croce and Hein, 2020a) and SQR (Andriushchenko et al., 2020), where params correspond to the standard parameters defined by these attacks, such as eta, beta and n_iter for FAB. We provide the full list of parameters, including their ranges and priors in the supplementary material. We define the loss functions in Section 3.3.
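The sequencing and repetition semantics described in this list can be sketched as higher-order functions in Python. This is only an illustration under stated assumptions: the helper names (`sequence`, `repeat`, `shift_attack`, `restart_attack`) are hypothetical, the model is a toy threshold on integers, and a deterministic cycle stands in for random restarts so the example is reproducible.

```python
from itertools import cycle
from typing import Callable

Model = Callable[[int], int]
Attack = Callable[[int, Model], int]  # returns a candidate adversarial input


def is_adv(model: Model, candidate: int, x: int) -> bool:
    """Misclassification criterion on the toy model."""
    return model(candidate) != model(x)


def sequence(first: Attack, second: Attack) -> Attack:
    """Return the first attack's candidate if it is adversarial, otherwise the second attack's."""
    def run(x: int, model: Model) -> int:
        candidate = first(x, model)
        return candidate if is_adv(model, candidate, x) else second(x, model)
    return run


def repeat(attack: Attack, n: int) -> Attack:
    """Run the attack up to n times and stop at the first adversarial candidate."""
    def run(x: int, model: Model) -> int:
        candidate = x
        for _ in range(n):
            candidate = attack(x, model)
            if is_adv(model, candidate, x):
                break
        return candidate
    return run


def shift_attack(step: int) -> Attack:
    """Toy fixed attack: shift the input by a constant step."""
    return lambda x, model: x + step


def restart_attack() -> Attack:
    """Toy stand-in for a randomized attack: each call uses the next step from a fixed cycle."""
    steps = cycle([1, 2, 5])
    return lambda x, model: x + next(steps)


toy_model: Model = lambda x: int(x > 0)
print(sequence(shift_attack(1), shift_attack(5))(-3, toy_model))
print(repeat(restart_attack(), 3)(-3, toy_model))
```

With a fixed (non-randomized) attack, `repeat` would return the same candidate every time, which matches the remark that repetition only helps when randomization is enabled.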

3.2 Network Transformations ($\mathbb{T}$)

A common approach that aims to improve the robustness of neural networks against adversarial attacks is to incorporate an explicit defense mechanism in the neural architecture. These defenses often obfuscate gradients to render iterative-optimization methods ineffective (Athalye et al., 2018). However, these defenses can be successfully circumvented by (i) choosing a suitable attack algorithm, such as score and decision-based attacks (included in $\mathbb{S}$), or (ii) by changing the neural architecture (defined next).

At a high level, the network transformation search space $\mathbb{T}$ takes as input a model $f$ and transforms it to another model $t(f)$, which is easier to attack. To achieve this, the network can be expressed as a directed acyclic graph, where each vertex denotes an operator (e.g., convolution, residual blocks, etc.), and edges correspond to data dependencies. Note that the computational graph includes both the forward and backward versions of each operation, which can be changed independently of each other. In our work, we include two types of network transformations:

  • Layer Removal, which removes an operator from the graph. To automate this process, the operator can be removed as long as its input and output dimensions are the same, regardless of its functionality.

  • Backward Pass Differentiable Approximation (BPDA) (Athalye et al., 2018), which replaces the backward version of an operator with a differentiable approximation of the function. In our search space we include three different function approximations: (i) an identity function, (ii) a convolution layer with kernel size 1, and (iii) a two-layer convolutional network with ReLU activation in between. The weights in the latter two cases are learned through approximating the forward function using the test dataset.
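To make the BPDA idea concrete, the following Python sketch contrasts the true gradient through a non-differentiable quantization step with the identity-based approximation from option (i). The scalar setting, the quantization function, and the quadratic loss are assumptions chosen to keep the example tiny; a real implementation would operate on tensors inside an autodiff framework.

```python
def quantize(x: float, levels: int = 4) -> float:
    """Toy non-differentiable defense: snap the input to a grid with spacing 1/levels."""
    return round(x * levels) / levels


def loss(z: float) -> float:
    """Toy objective that the attack wants to increase."""
    return z * z


def loss_grad(z: float) -> float:
    """Analytic derivative of the toy objective with respect to its input."""
    return 2.0 * z


def numeric_gradient(x: float, h: float = 1e-3) -> float:
    """Central finite difference through the real forward pass; zero almost everywhere."""
    return (loss(quantize(x + h)) - loss(quantize(x - h))) / (2 * h)


def bpda_gradient(x: float) -> float:
    """Forward pass uses the real quantization; backward pass treats it as the identity."""
    z = quantize(x)            # exact forward computation
    dz_dx = 1.0                # identity approximation of d quantize / dx
    return loss_grad(z) * dz_dx


x = 0.3
print(numeric_gradient(x), bpda_gradient(x))
```

The finite-difference gradient carries no signal because the quantizer is piecewise constant, while the approximated backward pass still points in a useful direction.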

3.3 Loss Function ($\mathbb{L}$)

Selecting the right objective function to optimize is an important design decision for creating strong adaptive attacks. Indeed, the recent work of Tramer et al. (2020) uses 9 different objective functions to break 13 defenses, showing the importance of this step. We formalize the space of possible loss functions using the following grammar:

(Loss Function Search Space)
L ::= targeted Loss, n with Z | untargeted Loss with Z | targeted Loss, n - untargeted Loss with Z
Z ::= logits | probs
Loss ::= CrossEntropy | HingeLoss | L1 | DLR | LogitMatching

The grammar formalizes four different aspects:

Figure 2: Loss functions used as part of the loss search space $\mathbb{L}$: CrossEntropy, HingeLoss (Carlini and Wagner, 2017a), L1, DLR (Croce and Hein, 2020b), and LogitMatching.

Targeted vs Untargeted. The loss can be either untargeted, where the goal is to change the classification to any other label, or targeted, where the goal is to predict a concrete target label. Even though the untargeted loss is less restrictive, it is not always easier to optimize in practice. As a result, the search space contains both. When using a targeted loss with parameter n together with the misclassification criterion, the attack will consider the top n classes with the highest probability as the targets.

Loss Formulation. Next is the concrete loss formulation, as summarized in Figure 2. These include loss functions used in existing adaptive attacks, as well as the recently proposed difference in logit ratio loss (Croce and Hein, 2020b).

Algorithm 1: A search algorithm that, given a model with an unknown defense, discovers an adaptive attack from the attack search space with the best score (i.e., an attack that leads to the lowest adversarial robustness).

def AdaptiveAttackSearch
   Input: dataset, model, attack search space, number of trials, initial dataset size, attack sequence length, criterion function, initial parameter estimator model, default attack
   Output: adaptive attack from the search space achieving the highest score on the dataset
 1   t ← network transformation with the best score under the default attack    (search for surrogate model using default attack)
 2   S ← no attack, which returns the input image    (initialize attack)
 3   for each position in the attack sequence do    (run iterations to get a sequence of attacks)
 4       remove non-robust samples from the dataset
 5       D_sample ← initial dataset with the initial number of samples
 6       H ← empty trial history
 7       for each trial do    (select candidate adaptive attacks)
 8           θ ← best unseen parameters according to the estimator model
 9           q ← score of the attack with parameters θ on D_sample
10           H ← H ∪ {(θ, q)}
11           update the estimator model with (θ, q)
12       while more than one attack remains in H do    (successive halving, SHA)
13           double the size of D_sample
14           re-evaluate the attacks in H on the larger D_sample
15           H ← the quarter of the attacks in H with the best score
16       S ← S; best attack in H
17   return S

Logits vs. Probabilities. In our search space, loss functions can be instantiated both with logits as well as with probabilities. Note that some loss functions are specifically designed for one of the two options, such as C&W (Carlini and Wagner, 2017a) or DLR (Croce and Hein, 2020b) that specifically consider only logits. While such knowledge can be used to reduce the search space, it is not necessary as long as the search algorithm is powerful enough to recognize that such a combination leads to poor results.

Loss Replacement. Because the key idea behind many of the defenses is to find a property that helps to differentiate between adversarial and natural images, one can also define the optimization objective in the same way. These feature-level attacks (Sabour et al., 2016) avoid the need to directly optimize the complex objective defined by the adversarial defense and have been effective at circumventing them. As an example, the logit matching loss (shown in Figure 2) minimizes the difference of logits between the adversarial sample and a natural sample of the target class (selected at random from the dataset). Instead of logits, the same idea can also be applied to other statistics, such as internal representations computed by a pre-trained model or the KL-divergence between label probabilities.
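As an illustration of two entries in the loss search space, the sketch below implements an untargeted hinge (margin) loss in the style of Carlini and Wagner (2017a) and the difference in logit ratio (DLR) loss as defined by Croce and Hein (2020b), both on plain Python lists of logits. The function names and the `kappa` default are assumptions of this sketch, and the exact definitions used by the tool may differ.

```python
import math
from typing import Sequence


def hinge_loss(logits: Sequence[float], label: int, kappa: float = math.inf) -> float:
    """Untargeted margin loss: best wrong-class logit minus true-class logit, floored at -kappa.
    Positive values mean the input is misclassified."""
    best_other = max(z for i, z in enumerate(logits) if i != label)
    return max(best_other - logits[label], -kappa)


def dlr_loss(logits: Sequence[float], label: int) -> float:
    """Difference in logit ratio: the margin, rescaled by the gap between the largest and
    third-largest logits. Needs at least three classes and a non-zero gap."""
    ordered = sorted(logits, reverse=True)
    best_other = max(z for i, z in enumerate(logits) if i != label)
    return -(logits[label] - best_other) / (ordered[0] - ordered[2])


correct = [3.0, 1.0, 0.5]   # true class 0 has the largest logit
fooled = [1.0, 3.0, 0.5]    # class 1 now beats the true class 0
print(hinge_loss(correct, 0), dlr_loss(correct, 0))
print(hinge_loss(fooled, 0), dlr_loss(fooled, 0))
```

Both losses are maximized by the attack. DLR is invariant to rescaling the logits, which is one reason it only makes sense with `Z = logits`.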

4 Search Algorithm

We now describe our search algorithm that optimizes the problem statement from Equation 1. Since we do not have access to the underlying distribution, we approximate Equation 1 using the dataset as follows:


where $a$ is an attack, $l_a$ denotes the untargeted cross-entropy loss of $a$ on the input, and $\lambda$ is a hyperparameter. The intuition behind the loss term weighted by $\lambda$ is that it acts as a tie-breaker in case the criterion alone is not enough to differentiate between multiple attacks. While this is unlikely to happen when evaluating on large datasets, it is quite common when using only a small number of samples. Obtaining good estimates in such cases is especially important for achieving scalability since performing the search directly on the full dataset would be prohibitively slow.
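The tie-breaking role of the loss term can be shown with a small Python sketch. The exact form of the score is not reproduced here; the version below is an assumption that adds a small multiple of the per-sample cross-entropy to the success indicator, with the convention that a larger loss means the attack is closer to succeeding. The name `attack_score` and the default value of `lam` are hypothetical.

```python
from typing import Sequence


def attack_score(successes: Sequence[bool], ce_losses: Sequence[float], lam: float = 0.01) -> float:
    """Average of (criterion satisfied) + lam * (cross-entropy on the adversarial input).
    The sign and weighting of the loss term are assumptions of this sketch."""
    n = len(successes)
    return sum(int(s) + lam * l for s, l in zip(successes, ce_losses)) / n


# Two attacks that both break 1 of 4 samples: the criterion alone cannot rank them.
score_a = attack_score([True, False, False, False], [2.0, 0.5, 0.5, 0.5])
score_b = attack_score([True, False, False, False], [2.0, 0.1, 0.1, 0.1])
# An attack that breaks 2 of 4 samples should win regardless of the tie-breaker.
score_c = attack_score([True, True, False, False], [0.1, 0.1, 0.1, 0.1])
print(score_a > score_b, score_c > score_a)
```

Keeping `lam` small ensures the loss term only reorders attacks with the same success count, which matters most on the small sample subsets used early in the search.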

Search Algorithm

We present our search algorithm in Algorithm 1. We start by searching through the space of network transformations to find a suitable surrogate model (line 1). This is achieved by taking the default attack (in our implementation, APGD), and then evaluating all locations where BPDA can be used, and subsequently evaluating all layers that can be removed. Even though this step is exhaustive, it takes only a fraction of the runtime in our experiments, and no further optimization was necessary.

Next, we search through the space of attacks $\mathbb{S}$. As this search space is enormous, we employ three techniques to improve scalability and attack quality. First, to generate a sequence of attacks, we perform a greedy search (lines 3-16). That is, in each step, we find an attack with the best score on the samples not circumvented by any of the previous attacks (line 4). Second, we use a parameter estimator model to select suitable parameters (line 8). In our work, we use Tree of Parzen Estimators (Bergstra et al., 2011), but the concrete implementation can vary. Once the parameters are selected, they are evaluated using the score function (line 9), the result is stored in the trial history (line 10), and the estimator is updated (line 11). Third, because evaluating the adversarial attacks can be expensive, and the dataset is typically large, we employ the successive halving technique (Karnin et al., 2013; Jamieson and Talwalkar, 2016). Concretely, instead of evaluating all the trials on the full dataset, we start by evaluating them only on a subset of samples (line 5). Then, we improve the score estimates by iteratively increasing the dataset size (line 13), re-evaluating the scores (line 14), and retaining a quarter of the trials with the best score (line 15). We repeat this process to find a single best attack from the trial history, which is then added to the sequence of attacks (line 16).
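The successive halving step can be sketched in a few lines of Python. This is a simplified illustration, not the tool's implementation: candidates are assumed to be already proposed (the parameter estimator is omitted), each candidate is a single "strength" number, and the scoring function is a toy that counts how many samples a candidate would break.

```python
from typing import Callable, List, Sequence


def successive_halving(
    candidates: Sequence[int],
    dataset: Sequence[int],
    evaluate: Callable[[int, Sequence[int]], float],
    initial_size: int,
    keep_fraction: float = 0.25,
) -> int:
    """Double the evaluation subset each round and keep the best quarter of the candidates."""
    size = min(initial_size, len(dataset))
    survivors: List[int] = list(candidates)
    while len(survivors) > 1 and size < len(dataset):
        size = min(2 * size, len(dataset))            # double the dataset size
        subset = dataset[:size]
        ranked = sorted(survivors, key=lambda c: evaluate(c, subset), reverse=True)
        survivors = ranked[: max(1, int(len(ranked) * keep_fraction))]
    subset = dataset[:size]
    return max(survivors, key=lambda c: evaluate(c, subset))


def toy_score(strength: int, samples: Sequence[int]) -> float:
    """Toy score: fraction of samples whose 'hardness' is below the attack strength."""
    return sum(1 for hardness in samples if hardness < strength) / len(samples)


# Hypothetical data: 64 samples with hardness 0..63 in a fixed pseudo-shuffled order.
toy_dataset = [(i * 37) % 64 for i in range(64)]
toy_candidates = [4, 16, 64, 32, 8, 48, 24, 56]
best = successive_halving(toy_candidates, toy_dataset, toy_score, initial_size=8)
print(best)
```

Only the surviving candidates are ever evaluated on the larger subsets, so most of the evaluation budget is spent on the few attacks that looked promising on small samples.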

5 Evaluation

We now evaluate A on 23 models with diverse defenses and compare the results to AutoAttack (Croce and Hein, 2020b) and to several existing handcrafted attacks. AutoAttack is a state-of-the-art tool designed for reliable evaluation of adversarial defenses that improved the originally reported results for many existing defenses by up to 10%. Our key result is that A finds attacks that are stronger than or similar to those of AutoAttack for virtually all defenses:

  • In 10 cases, the attacks found by A are significantly stronger than AutoAttack, resulting in 3.0% to 50.8% additional adversarial examples.

  • In the other 13 cases, A’s attacks are typically 2x and up to 5.5x faster while enjoying similar attack quality.

The A tool

The implementation of A is based on PyTorch (Paszke et al., 2019). The implementations of FGSM, PGD, NES, and DeepFool are based on FoolBox (Rauber et al., 2017) version 3.0.0, C&W is based on ART (Nicolae et al., 2018) version 1.3.0, and the attacks APGD, FAB, and SQR are from (Croce and Hein, 2020b). We use AutoAttack's rand version if a defense has a randomization component, and otherwise we use its standard version. To allow for a fair comparison, we extended AutoAttack with backward pass differentiable approximation (BPDA), so we can run it on defenses with non-differentiable components; without this, all gradient-based attacks would fail.

Unless stated otherwise, we instantiate Algorithm 1 by setting: the attack sequence length , the number of trials , the initial dataset size , and we use a time budget of to seconds per sample depending on the model size. We use TPE  (Bergstra et al., 2011) for parameter estimation, which is implemented as part of the Hyperopt framework (Bergstra et al., 2013). All of the experiments are performed using a single RTX 2080 Ti GPU.

| ID | Defense | Robust accuracy (1 - Rerr), AA | Robust accuracy (1 - Rerr), A | Δ | Runtime AA (min) | Runtime A (min) | Speed-up | Search A (min) |
| CIFAR-10 (first group) |
| A1 | Stutz et al. (2020) | 77.64 | 26.87 | -50.77 | 101 | 205 | 0.49 | 659 |
| A2 | Madry et al. (2018) | 44.78 | 44.69 | -0.09 | 25 | 20 | 1.25 | 88 |
| A3 | Buckman et al. (2018) | 2.29 | 1.96 | -0.33 | 9 | 7 | 1.29 | 116 |
| A4 | Das et al. (2017) + Lee et al. (2018) | 0.59 | 0.11 | -0.48 | 6 | 2 | 3.00 | 40 |
| A5 | Metzen et al. (2017) | 6.17 | 3.04 | -3.13 | 21 | 13 | 1.62 | 80 |
| A6 | Guo et al. (2018) | 22.30 | 12.14 | -10.16 | 19 | 17 | 1.12 | 99 |
| A7 | Ensemble of A3, A4, A6 | 4.14 | 3.94 | -0.20 | 28 | 24 | 1.17 | 237 |
| A8 | Papernot et al. (2015) | 2.85 | 2.71 | -0.14 | 4 | 4 | 1.00 | 84 |
| A9 | Xiao et al. (2020) | 19.82 | 11.11 | -8.71 | 49 | 22 | 2.23 | 189 |
| A10 | Xiao et al. (2020) | 64.91 | 17.70 | -47.21 | 157 | 2,280 | 0.07 | 1,548 |
| CIFAR-10 (second group) |
| B11 | Wu et al. (2020) | 60.05 | 60.01 | -0.04 | 706 | 255 | 2.77 | 690 |
| B12 | Wu et al. (2020) | 56.16 | 56.18 | 0.02 | 801 | 145 | 5.52 | 677 |
| B13 | Zhang and Wang (2019) | 36.74 | 37.11 | 0.37 | 381 | 302 | 1.26 | 726 |
| B14 | Grathwohl et al. (2020) | 5.15 | 5.16 | 0.01 | 107 | 114 | 0.94 | 749 |
| B15 | Xiao et al. (2020) | 5.40 | 2.31 | -3.09 | 95 | 146 | 0.65 | 828 |
| B16 | Wang et al. (2019) | 50.84 | 50.81 | -0.03 | 734 | 372 | 1.97 | 755 |
| B17 | Wang et al. (2020) | 50.94 | 50.89 | -0.05 | 742 | 486 | 1.53 | 807 |
| B18 | Sehwag et al. (2020) | 57.19 | 57.16 | -0.03 | 671 | 429 | 1.56 | 691 |
| B19 | B11 + Defense in A4 | 60.72 | 60.04 | -0.68 | 621 | 210 | 2.96 | 585 |
| B20 | B14 + Defense in A4 | 15.27 | 5.24 | -10.03 | 261 | 79 | 3.30 | 746 |
| B21 | B11 + Random Rotation | 49.53 | 41.99 | -7.54 | 255 | 462 | 0.55 | 900 |
| B22 | B14 + Random Rotation | 22.29 | 13.45 | -8.84 | 114 | 374 | 0.30 | 1,023 |
| B23 | Hu et al. (2019) | 6.25 | 3.07 | -3.18 | 110 | 56 | 1.96 | 502 |

model available from the authors, model with non-differentiable components.
Table 1: Comparison of AutoAttack (AA, Croce and Hein (2020b)) and our approach (A) on 23 defenses. Δ is the robust accuracy of A minus that of AA; Speed-up is the AA runtime divided by the A runtime. A1 uses an increased number of attacks. A10 uses a time budget of 30 seconds per sample and only a single attack. Additional details and description of each defense, discovered adaptive attacks, and network processing techniques are included in supplementary material.

Evaluation Metric

Following Stutz et al. (2020), we use the robust test error (Rerr) metric to combine the evaluation of defenses with and without detectors. Rerr is defined as:

$$\mathrm{Rerr} = \frac{\sum_{i=1}^{N} \max_{d(x', x_i) \le \epsilon,\; g(x') = 1} \mathbb{1}_{f(x') \ne y_i}}{\sum_{i=1}^{N} \max_{d(x', x_i) \le \epsilon} \mathbb{1}_{g(x') = 1}} \qquad (3)$$

where $g$ is a detector that accepts a sample $x'$ if $g(x') = 1$, and $\mathbb{1}_{f(x') \ne y_i}$ evaluates to one if $x'$ causes a misprediction and to zero otherwise. The numerator counts the number of samples that are both accepted and lead to a successful attack (including cases where the original is incorrect), and the denominator counts the number of samples not rejected by the detector. A defense without a detector (i.e., $g(x') = 1$) reduces Equation 3 to the standard Rerr. Finally, we define robust accuracy simply as 1 - Rerr.
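The counting behind Rerr can be sketched in Python as follows. The input representation is an assumption of this sketch: for each test sample we keep a list of candidate inputs tried by the attack (including the clean input), each recorded as an `(accepted, mispredicted)` pair, and the maximum over perturbations is approximated by "any candidate".

```python
from typing import List, Sequence, Tuple

Candidate = Tuple[bool, bool]  # (accepted by the detector, causes a misprediction)


def robust_test_error(per_sample: Sequence[Sequence[Candidate]]) -> float:
    """Numerator: samples with a candidate that is accepted AND mispredicted.
    Denominator: samples with at least one candidate accepted by the detector."""
    numerator = sum(any(acc and wrong for acc, wrong in cands) for cands in per_sample)
    denominator = sum(any(acc for acc, _ in cands) for cands in per_sample)
    return numerator / denominator if denominator else 0.0


results: List[List[Candidate]] = [
    [(True, False), (True, True)],    # attack succeeds and passes the detector
    [(True, False), (False, True)],   # misprediction is caught by the detector
    [(False, False), (False, True)],  # every candidate is rejected
    [(True, False)],                  # robust sample
]
rerr = robust_test_error(results)
print(rerr, 1 - rerr)
```

For a defense without a detector, every `accepted` flag is `True` and the value reduces to the plain fraction of samples with a successful attack.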

Comparison to AutoAttack

Our main results, summarized in Table 1, show the robust accuracy (lower is better) and runtime of both AutoAttack (AA) and A over the 23 defenses. For example, for A9 our tool finds an attack that leads to lower robust accuracy (11.1% for A vs. 19.8% for AA) and is more than twice as fast (22 min for A vs. 49 min for AA). Overall, A significantly improves upon AA or provides similar but faster attacks.

We note that the attacks from AA are included in our search space (although without the knowledge of their best parameters and sequence), and so it is expected that A performs at least as well as AA, provided sufficient exploration time. The only case where the exploration time was not sufficient was for B14 where our attack is slightly slower (114 min for A vs. 107 min for AA), yet still achieves the same robust accuracy (5.16% for A vs. 5.15% for AA). Importantly, A often finds better attacks: for 10 defenses, A reduces the robust accuracy by 3.0% to 50.8% compared to that of AA. In what follows, we discuss the results in more detail and highlight important insights.

Defenses based on Adversarial Training. Defenses A2, B11, B12, B16, B17 and B18 are based on variations of adversarial training. We observe that, even though AA has been designed with these defenses in mind, A obtains very close results. Moreover, A improves upon AA as it discovers attacks that achieve similar robustness while bringing speed-ups of 1.25x to 5.5x (Table 1). Closer inspection reveals that AA includes two attacks, FAB and SQR, which are not only expensive but also ineffective on these defenses. A improves the runtime by excluding them from the generated adaptive attack.

Obfuscation Defenses. Defenses A4, A9, A10, B15, B19, and B20 are based on gradient obfuscation. A discovers stronger attacks that reduce the robust accuracy for all defenses by up to 47.21%. Here, removing the obfuscated defenses in A4, B19, and B20 provides better gradient estimation for the attacks. Further, the use of more suitable loss functions strengthens the discovered attacks and improves the evaluation results for A9 and B15.

Randomized Defenses. For the randomized input defenses A9, B21, and B22, A discovers attacks that, compared to AA’s rand version, further reduce robustness by 8.71%, 7.54%, and 8.84%, respectively. This is achieved by using stronger yet more costly parameter settings, attacks with different backbones (APGD, PGD) and 7 different loss functions (as listed in Appendix F).

Detector based Defenses. For A1, A5, and B23 defended with detectors, A improves over AA by reducing the robustness by 50.77%, 3.13%, and 3.18%, respectively. This is because none of the attacks discovered by A are included in AA. Namely, A found SQR and APGD for A1, untargeted FAB for A5 (FAB in AA is targeted), and PGD for B23.

Comparison to Handcrafted Adaptive Attacks

Given a new defense, the main strength of our approach is that it directly benefits from all existing techniques included in the search space. While the search space can be easily extended, it is also inherently incomplete. Here, we illustrate this point by comparing our approach to three handcrafted adaptive attacks not included in the search space.

As a first example, A1 (Stutz et al., 2020) proposes an adaptive attack PGD-Conf with backtracking that leads to robust accuracy of 36.9%, which can be improved to 31.6% by combining PGD-Conf with blackbox attacks. A finds APGD with the hinge loss and Z = probs. This combination is interesting since the hinge loss, which maximizes the difference between the top two predictions, in fact reflects the PGD-Conf objective function. Further, similarly to the manually crafted attack by A1, a different blackbox attack included in our search space, SQR, is found to complement the strength of APGD. When using a sequence of three attacks, this combination leads to 46.36% robust accuracy. However, when the number of attacks in the sequence is increased, the robust accuracy drops further to 26.87%, which is a stronger result than the one reported in the original paper. In this case, our search space and the search algorithm are powerful enough to not only replicate the main ideas of Stutz et al. (2020) but also to improve its evaluation when allowing for a larger attack budget. Note that this improvement is possible even without including the backtracking used by PGD-Conf as a building block in our search space. In comparison, the robust accuracy reported by AA is only 77.64%.

As a second example, B15 is known to be susceptible to NES, which achieves 0.16% robust accuracy (Tramer et al., 2020). In our experiment, we limit the time budget for the attack so that the expensive NES cannot be found. The result shows that the SQR attack in the search space is effective enough to achieve 2.31% robust accuracy.

As a third example, to break B23, Tramer et al. (2020) designed an adaptive attack that linearly interpolates between the original and the adversarial samples using PGD. This technique breaks the defense and achieves 0% robust accuracy. In comparison, we find PGD, which achieves 3.07% robust accuracy. In this case, the fact that PGD is a relatively weak attack is an advantage: it successfully bypasses the detector by not generating overconfident predictions.

Ablation Studies

Similar to existing handcrafted adaptive attacks, all three components included in the search space were important for generating strong adaptive attacks for a variety of defenses. Here we briefly discuss their importance while including the full experiment results in the supplementary material.

Attack & Parameters. We demonstrate the importance of parameters by comparing PGD, C&W, DF, and FGSM with default library parameters to the best configuration found when available parameters are included in the search space. The attacks found by A are on average 5.5% stronger than the best attack among the four attacks on models A2-A10.

Loss Formulation. To evaluate the effect of modeling different loss functions, we remove them from the search space and keep only the original loss function defined for each attack. The search score drops by 3% on average for A2-A10 without the loss formulation.

Network Processing. In B20, the main reason for achieving 10% decrease in robust accuracy is the removal of the gradient obfuscated defense Reverse Sigmoid. In contrast, for A7, B21, B22, the randomized input processing steps are candidates for removal, but A keeps this step as removing these components yields worse results.

Further, in Table 2 we show the effect of the different BPDA instantiations included in our search space. For A3, since the non-differentiable layer is the non-linear thermometer encoding, it is better to approximate it with a function that has a non-linear activation. For A4, B19, and B20, the defense is JPEG image compression, and the identity network is the best choice, since the learned approximation networks can overfit when trained on limited data.

BPDA Type A3 A4 B19 B20
identity 18.5 9.6 70.5 84.0
1x1 convolution 8.9 10.3 70.8 84.9
2 layer conv + ReLU 3.7 14.9 74.1 86.2
Table 2: The robust accuracy (1 - Rerr) of networks with different BPDA policies evaluated by APGD with 50 iterations.
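The BPDA variants compared in Table 2 differ only in the differentiable stand-in used on the backward pass. The following is a minimal sketch of the identity variant. The quantization layer, the quadratic scoring head, and all function names are hypothetical and chosen purely for illustration; they are not taken from the benchmarks or from the tool.

```python
def quantize(x, levels=4):
    """Non-differentiable preprocessing: snap each value in [0, 1] to a grid."""
    return [round(v * (levels - 1)) / (levels - 1) for v in x]


def score(z, w):
    """Toy differentiable head (an assumption): weighted sum of squares."""
    return sum(wi * zi * zi for wi, zi in zip(w, z))


def bpda_identity_grad(x, w):
    """Approximate gradient of score(quantize(x), w) with respect to x."""
    z = quantize(x)                                    # forward: real defense layer
    grad_wrt_z = [2.0 * wi * zi for wi, zi in zip(w, z)]  # exact gradient of the head
    return grad_wrt_z                                  # backward: d quantize/dx ~ identity


def sign_step(x, grad, step):
    """One L-infinity ascent step, clipped to the valid range [0, 1]."""
    def sign(g):
        return (g > 0) - (g < 0)
    return [min(1.0, max(0.0, xi + step * sign(gi))) for xi, gi in zip(x, grad)]


x, w = [0.2, 0.8], [1.0, -2.0]
x_adv = sign_step(x, bpda_identity_grad(x, w), step=0.3)
print(score(quantize(x), w), score(quantize(x_adv), w))
```

The other rows of Table 2 would replace the identity on the backward pass with a small trained network, which this sketch does not attempt.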

6 Related Work

The most closely related work to ours is AutoAttack (Croce and Hein, 2020b), which improves the evaluation of adversarial defenses by proposing an ensemble of four fixed attacks. Further, the key to stronger attacks was a new algorithm, APGD, which improves upon PGD by halving the step size dynamically based on the loss at each step. In our work, we improve over AutoAttack in three key aspects: (i) we formalize a search space of adaptive attacks, rather than using a fixed ensemble, (ii) we design a search algorithm that discovers the best adaptive attacks automatically, significantly improving over the results of AutoAttack, and (iii) our search space is extensible and allows reusing building blocks from one attack in other attacks, effectively expressing new attack instantiations. For example, the idea of dynamically adapting the step size is not tied to APGD; it is a general concept applicable to any step-based algorithm.
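To illustrate why step-size adaptation is a reusable building block, here is a deliberately simplified sketch of projected sign-gradient ascent on a scalar variable, where the step is halved whenever it fails to improve the best loss found so far. This is not the APGD schedule itself, which uses additional conditions not described in this text; the function name and the halving rule are assumptions made for illustration.

```python
def ascent_with_step_halving(loss, grad, x0, lo, hi, step, n_iter=50):
    """Maximize loss over [lo, hi]; halve the step after each failed update."""
    best_x, best_loss = x0, loss(x0)
    x = x0
    for _ in range(n_iter):
        g = grad(x)
        direction = (g > 0) - (g < 0)               # sign of the gradient
        x = min(hi, max(lo, x + step * direction))  # step, then project
        current = loss(x)
        if current > best_loss:
            best_x, best_loss = x, current          # keep the improvement
        else:
            step /= 2.0                             # no progress: shrink the step
            x = best_x                              # and restart from the best point
    return best_x, best_loss


# Toy objective with its maximum at 0.3 (hypothetical, for illustration only).
best_x, best_loss = ascent_with_step_halving(
    loss=lambda v: -(v - 0.3) ** 2,
    grad=lambda v: -2.0 * (v - 0.3),
    x0=0.0, lo=-1.0, hi=1.0, step=0.5,
)
print(best_x, best_loss)
```

Because the rule only needs a loss value and an update direction, the same wrapper could in principle be placed around any step-based attack.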

Our work is also closely related to the recent advances in AutoML, such as in the domain of neural architecture search (NAS) (Zoph and Le, 2017; Elsken et al., 2019). Similar to our work, the core challenge in NAS is an efficient search over a large space of parameters and configurations, and therefore many techniques can also be applied to our setting. This includes BOHB (Falkner et al., 2018), ASHA (Li et al., 2018), using gradient information coupled with reinforcement learning (Zoph and Le, 2017), or a continuous search space formulation (Liu et al., 2019). Even though finding completely novel neural architectures is often beyond reach, NAS is still very useful and finds many state-of-the-art models. This is also true in our setting: while human experts will continue to play a key role in defining new types of adaptive attacks, as we show in our work, it is already possible to automate many of the intermediate steps.

7 Conclusion

We presented the first tool that aims to automatically find strong adaptive attacks specifically tailored to a given adversarial defense. Our key insight is that we can identify reusable techniques used in existing attacks and formalize them into a search space. Then, we can phrase the challenge of finding new attacks as an optimization problem of finding the strongest attack over this search space.

Our approach automates the tedious and time-consuming trial-and-error steps that domain experts perform manually today, allowing them to focus on the creative task of designing new attacks. By doing so, we also immediately provide a more reliable evaluation of new and existing defenses, many of which have been broken only after their proposal because the authors struggled to find an effective attack by manually exploring the vast space of techniques.

We implemented our approach in a tool called A and demonstrated that it outperforms the state-of-the-art tool AutoAttack (Croce and Hein, 2020b). Importantly, even though our current search space contains only a subset of existing techniques, our evaluation shows that A can partially re-discover or even improve upon some handcrafted adaptive attacks not yet included in our search space.


  • M. Andriushchenko, F. Croce, N. Flammarion, and M. Hein (2020) Square attack: a query-efficient black-box adversarial attack via random search. In Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm (Eds.), Cham, pp. 484–501. Cited by: §1, 6th item.
  • A. Athalye, N. Carlini, and D. A. Wagner (2018) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, J. G. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 274–283. Cited by: 1st item, 2nd item, §1, 3rd item, 2nd item, §3.2.
  • J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl (2011) Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12-14 December 2011, Granada, Spain, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. C. N. Pereira, and K. Q. Weinberger (Eds.), pp. 2546–2554. Cited by: §4, §5.
  • J. Bergstra, D. Yamins, and D. D. Cox (2013) Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, JMLR Workshop and Conference Proceedings, Vol. 28, pp. 115–123. Cited by: §5.
  • B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli (2013) Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases, pp. 387–402. Cited by: §2.
  • J. Buckman, A. Roy, C. Raffel, and I. J. Goodfellow (2018) Thermometer encoding: one hot way to resist adversarial examples. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, Cited by: Table 1.
  • N. Carlini and D. Wagner (2017a) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), Vol. , pp. 39–57. Cited by: 1st item, Figure 2, 6th item, §3.3.
  • N. Carlini and D. Wagner (2017b) Adversarial examples are not easily detected: bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, AISec ’17, New York, NY, USA, pp. 3–14. External Links: ISBN 9781450352024 Cited by: §1.
  • F. Croce and M. Hein (2020a) Minimally distorted adversarial examples with a fast adaptive boundary attack. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, Vol. 119, pp. 2196–2205. Cited by: §1, 6th item.
  • F. Croce and M. Hein (2020b) Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, Vol. 119, pp. 2206–2216. Cited by: Automated Discovery of Adaptive Attacks on Adversarial Defenses, Figure 1, §1, §1, Figure 2, 6th item, §3.3, §3.3, §5, Table 1, §5, §6, §7.
  • N. Das, M. Shanbhogue, S. Chen, F. Hohman, L. Chen, M. E. Kounavis, and D. H. Chau (2017) Keeping the bad guys out: protecting and vaccinating deep learning with jpeg compression. arXiv preprint arXiv:1705.02900. Cited by: Table 1.
  • T. Elsken, J. H. Metzen, and F. Hutter (2019) Neural architecture search: a survey. External Links: 1808.05377 Cited by: §6.
  • S. Falkner, A. Klein, and F. Hutter (2018) BOHB: robust and efficient hyperparameter optimization at scale. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, J. G. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 1436–1445. Cited by: §6.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), Cited by: §1, 6th item.
  • W. Grathwohl, K. Wang, J. Jacobsen, D. Duvenaud, M. Norouzi, and K. Swersky (2020) Your classifier is secretly an energy based model and you should treat it like one. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, Cited by: Table 1.
  • K. Grosse, P. Manoharan, N. Papernot, M. Backes, and P. D. McDaniel (2017) On the (statistical) detection of adversarial examples. CoRR abs/1702.06280. External Links: 1702.06280 Cited by: §2.
  • C. Guo, M. Rana, M. Cissé, and L. van der Maaten (2018) Countering adversarial images using input transformations. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, Cited by: Table 1.
  • S. Hu, T. Yu, C. Guo, W. Chao, and K. Q. Weinberger (2019) A new defense against adversarial images: turning a weakness into a strength. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 1633–1644. Cited by: Table 1.
  • K. G. Jamieson and A. Talwalkar (2016) Non-stochastic best arm identification and hyperparameter optimization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 9-11, 2016, A. Gretton and C. C. Robert (Eds.), JMLR Workshop and Conference Proceedings, Vol. 51, pp. 240–248. Cited by: §4.
  • Z. S. Karnin, T. Koren, and O. Somekh (2013) Almost optimal exploration in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, JMLR Workshop and Conference Proceedings, Vol. 28, pp. 1238–1246. Cited by: §4.
  • T. Lee, B. Edwards, I. Molloy, and D. Su (2018) Defending against machine learning model stealing attacks using deceptive perturbations. arXiv preprint arXiv:1806.00054. Cited by: Table 1.
  • L. Li, K. Jamieson, A. Rostamizadeh, E. Gonina, M. Hardt, B. Recht, and A. Talwalkar (2018) Massively parallel hyperparameter tuning. arXiv preprint arXiv:1810.05934. Cited by: §6.
  • X. Li and F. Li (2017) Adversarial examples detection in deep networks with convolutional filter statistics. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 5775–5783. Cited by: §2.
  • H. Liu, K. Simonyan, and Y. Yang (2019) DARTS: differentiable architecture search. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, Cited by: §6.
  • A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, Cited by: §1, 6th item, Table 1.
  • J. H. Metzen, T. Genewein, V. Fischer, and B. Bischoff (2017) On detecting adversarial perturbations. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, Cited by: §2, Table 1.
  • S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) DeepFool: a simple and accurate method to fool deep neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2574–2582. Cited by: 6th item.
  • M. Nicolae, M. Sinn, T. N. Minh, A. Rawat, M. Wistuba, V. Zantedeschi, I. M. Molloy, and B. Edwards (2018) Adversarial robustness toolbox v0.2.2. CoRR abs/1807.01069. Cited by: Table 8, §5.
  • N. Papernot, P. D. McDaniel, X. Wu, S. Jha, and A. Swami (2015) Distillation as a defense to adversarial perturbations against deep neural networks. CoRR abs/1511.04508. Cited by: Table 1.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 8024–8035. Cited by: §5.
  • J. Rauber, W. Brendel, and M. Bethge (2017) Foolbox: a python toolbox to benchmark the robustness of machine learning models. In Reliable Machine Learning in the Wild Workshop, 34th International Conference on Machine Learning, Cited by: Table 8, §5.
  • S. Sabour, Y. Cao, F. Faghri, and D. J. Fleet (2016) Adversarial manipulation of deep representations. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), Cited by: §3.3.
  • V. Sehwag, S. Wang, P. Mittal, and S. Jana (2020) Hydra: pruning adversarially robust neural networks. Advances in Neural Information Processing Systems (NeurIPS) 7. Cited by: Table 1.
  • D. Stutz, M. Hein, and B. Schiele (2020) Confidence-calibrated adversarial training: generalizing to unseen attacks. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, Vol. 119, pp. 9155–9166. Cited by: Appendix A, §5, §5, Table 1.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), Cited by: §1.
  • F. Tramer, N. Carlini, W. Brendel, and A. Madry (2020) On adaptive attacks to adversarial example defenses. arXiv preprint arXiv:2002.08347. Cited by: Figure 1, 2nd item, §1, §1, §3.3, §5, §5.
  • B. Wang, Z. Shi, and S. J. Osher (2019) ResNets ensemble via the feynman-kac formalism to improve natural and robust accuracies. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 1655–1665. Cited by: Table 1.
  • Y. Wang, D. Zou, J. Yi, J. Bailey, X. Ma, and Q. Gu (2020) Improving adversarial robustness requires revisiting misclassified examples. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, Cited by: Table 1.
  • D. Wierstra, T. Schaul, J. Peters, and J. Schmidhuber (2008) Natural evolution strategies. In 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence), pp. 3381–3387. Cited by: 1st item, 6th item.
  • D. Wu, S. Xia, and Y. Wang (2020) Adversarial weight perturbation helps robust generalization. Advances in Neural Information Processing Systems 33. Cited by: Table 1.
  • C. Xiao, P. Zhong, and C. Zheng (2020) Enhancing adversarial defense by k-winners-take-all. In International Conference on Learning Representations, Cited by: Table 1.
  • H. Zhang and J. Wang (2019) Defense against adversarial attacks using feature scattering-based adversarial training. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 1829–1839. Cited by: Table 1.
  • B. Zoph and Q. V. Le (2017) Neural architecture search with reinforcement learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, Cited by: §6.

Appendix A Evaluation Metrics Details

We use the following criteria in the formulation:

Misclassification Attack

Misclassification Attack with Detector

For both, we remove the misclassified clean input as a pre-processing step, such that the evaluation is performed only on the subset of correctly classified samples (i.e. ).

Sequence of Attacks

The sequence of attacks defined in Section 3.1 is a way to calculate the per-example worst-case evaluation, and the four-attack ensemble in AutoAttack is equivalent to a sequence of four attacks [APGD, APGD, FAB, SQR]. Algorithm 2 elaborates how the sequence of attacks is evaluated: the attacks are performed in the order in which they were defined, and the first sample that satisfies the criterion is returned.

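Algorithm 2 Sequence of attacks

def SeqAttack
   Input: model, data, sequence of attacks, network transformation, criterion function
   Output: adversarial sample for the data
1  for each attack in the sequence, in order, do
2      candidate = attack applied to the data on the transformed model
3      if the criterion function is satisfied by the candidate then
4          return candidate
5  return data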
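The sequence evaluation described above can be sketched in a few lines. This is a hedged illustration rather than the tool's implementation: the callable signatures, the choice to check the criterion on the original model, and the fallback of returning the unperturbed input when no attack succeeds are all assumptions.

```python
def seq_attack(model, x, y, attacks, transform, criterion):
    """Run attacks in order and return the first candidate meeting the criterion."""
    attacked_model = transform(model)        # e.g. layer removal or BPDA
    for attack in attacks:
        candidate = attack(attacked_model, x, y)
        if criterion(model, candidate, y):   # assumption: judged on the original model
            return candidate
    return x                                 # assumption: fall back to the clean input


# Toy setting (hypothetical): a threshold classifier on a scalar input.
model = lambda v: int(v > 0.5)
weak = lambda m, v, label: v + 0.1
strong = lambda m, v, label: v + 0.4
misclassified = lambda m, v, label: m(v) != label
identity = lambda m: m

adv = seq_attack(model, 0.2, 0, [weak, strong], identity, misclassified)
print(adv, model(adv))
```

Because the first successful candidate is returned per example, ordering cheap attacks first saves time without changing which examples count as non-robust.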

Robust Test Error (Rerr)

Rerr, defined in Equation 3 in Section 5, contains an intractable maximization problem in the denominator, so Equation 4 is the empirical formula used to give an upper-bound evaluation of Rerr. This empirical evaluation is the same as the evaluation in Stutz et al. (2020).



For a network with a detector , the criterion function is misclassification with the detectors, and it is applied in line 3 in Algorithm 2. This formulation enables per-example worst-case evaluation for detector defenses.

Randomized Defenses

If the model has a randomized component, the model output in Equation 4 is a random sample drawn from the corresponding distribution. In the evaluation metrics, we report the mean over the adversarial samples evaluated 10 times.
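The averaging over repeated evaluations can be sketched as follows. The predictor signature, the function name, and the toy noisy classifier are hypothetical; only the idea of reporting the mean over 10 evaluations comes from the text.

```python
import random


def mean_error_over_draws(predict, samples, labels, n_draws=10, seed=0):
    """Average misclassification rate of a randomized model over several draws."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_draws):
        # Each pass re-samples the model's internal randomness through rng.
        wrong = sum(predict(x, rng) != y for x, y in zip(samples, labels))
        rates.append(wrong / len(samples))
    return sum(rates) / n_draws


# Toy randomized defense (an assumption): noise is added before thresholding.
noisy_model = lambda v, rng: int(v + rng.uniform(-0.1, 0.1) > 0.5)
print(mean_error_over_draws(noisy_model, [0.9, 0.1, 0.5], [1, 0, 1]))
```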


Appendix B Search Space of

b.1 Loss function space

Cross Entropy (CE), HingeLoss (Hinge), L1, Difference in Logit Ratio (DLR), and Logit Matching (LM) are the five loss functions used in our experiments. For Hinge, the confidence value is set to infinity so as to encourage stronger adversarial examples; it could become a loss parameter in future work.
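For concreteness, three of these losses can be written down in a few lines. This is a sketch under stated assumptions: the sign convention (larger values mean a more successful attack) and the way the confidence caps the hinge loss are choices made here for illustration, and the DLR expression follows the definition given by Croce and Hein (2020b). L1 and Logit Matching are omitted.

```python
import math


def cross_entropy(probs, y):
    """Negative log-probability of the true class y."""
    return -math.log(probs[y])


def hinge(z, y, confidence=math.inf):
    """Margin of the best wrong class over the true class, capped at confidence."""
    best_other = max(v for i, v in enumerate(z) if i != y)
    return min(best_other - z[y], confidence)


def dlr(z, y):
    """Difference of logits ratio; needs at least three classes."""
    ordered = sorted(z, reverse=True)
    best_other = max(v for i, v in enumerate(z) if i != y)
    return -(z[y] - best_other) / (ordered[0] - ordered[2])


logits = [3.0, 1.0, 0.0]
print(hinge(logits, 0), dlr(logits, 0), cross_entropy([0.5, 0.5], 0))
```

With an infinite confidence the cap never binds, so the hinge loss keeps growing as the wrong class overtakes the true one.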

Recall from Section 3.3 that the loss function search space is defined as:

(Loss Function Search Space)
L ::= targeted Loss, n with Z
    | untargeted Loss with Z
    | targeted Loss, n - untargeted Loss with Z
Z ::= logits | probs

To refer to different settings, we use the following notation:

  • U: for the untargeted loss,

  • T: for the targeted loss,

  • D: for the difference between the targeted and the untargeted loss,

  • L: for using logits, and

  • P: for using probs

For example, we use DLR-U-L to denote the untargeted DLR loss with logits. The loss space used in the evaluation is shown in Table 3. Effectively, the search space includes all possible combinations, except that the cross-entropy loss supports only probabilities. Note that although DLR is designed for logits and LM is designed for targeted attacks, the search space still makes other possibilities an option (i.e., it is up to the search algorithm to learn which combinations are useful and which are not).
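The naming scheme lends itself to a direct enumeration. The sketch below lists the combinations implied by the stated rule, using the five loss abbreviations that appear in Table 6; the constant and function names are hypothetical, and restrictions described elsewhere (such as enabling LM only for detectors) are not modelled.

```python
from itertools import product

LOSSES = ["CE", "Hinge", "L1", "DLR", "LM"]
TARGETING = ["U", "T", "D"]
OUTPUTS = ["L", "P"]  # logits or probabilities


def loss_space():
    """All Loss-Targeting-Output names, with CE restricted to probabilities."""
    names = []
    for loss, mode, out in product(LOSSES, TARGETING, OUTPUTS):
        if loss == "CE" and out == "L":
            continue  # rule from the text: cross-entropy supports only probabilities
        names.append(f"{loss}-{mode}-{out}")
    return names


space = loss_space()
print(len(space), space[:4])
```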

Logit/Prob P
Table 3: Loss functions and their modifiers. ✓ means the loss supports the modifier. P means the loss always uses Probability.
Attack Randomize EOT Repeat Loss Targeted logit/prob
PGD True
DeepFool False D
C&W False - {U, T} L
FAB True - {U, T} L
SQR True
NES True
Table 4: Generic parameters and loss support for each attack in the search space. For the loss column, ”-” means the loss is from the library implementation, and ✓ means the attack supports all the loss functions defined in Table 3. In other columns ✓ means the attack supports all the values, and the attack supports only the indicated set of values otherwise.

b.2 Attack Algorithm & Parameters Space

Recall the attack space defined in Section 3.1 as:

S ::= S; S
    | randomize S
    | EOT S, n
    | repeat S, n
    | try S for n
    | Attack with params with loss

,  ,   are the generic parameters, and for params are attack specific parameters. The type of every parameter is either integer or float. An integer ranges from to inclusive is denoted as . A float range from to inclusive is denoted as . Besides value range, prior is needed for parameter estimator model (TPE in our case), which is either uniform (default) or log uniform (denoted with ). For example, means an integer value ranges from to with log uniform prior; means a float value ranges from to with uniform prior.
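The difference between the two priors can be made concrete with a small sampler. This only draws from the prior and is not the TPE estimator itself; the function name, the argument names, and the example ranges are hypothetical and do not correspond to the ranges in Table 5.

```python
import math
import random


def sample_param(low, high, kind="float", prior="uniform", rng=random):
    """Draw one value from an inclusive range under a uniform or log-uniform prior."""
    if prior == "log":
        # Log-uniform: uniform in log-space, so every order of magnitude is equally likely.
        value = math.exp(rng.uniform(math.log(low), math.log(high)))
    elif kind == "int":
        return rng.randint(low, high)  # uniform over the integers, both ends included
    else:
        value = rng.uniform(low, high)
    if kind == "int":
        value = int(round(value))
    return min(high, max(low, value))  # guard against rounding past the bounds


rng = random.Random(0)
print(sample_param(1, 1000, kind="int", prior="log", rng=rng))
print(sample_param(0.01, 0.5, kind="float", prior="uniform", rng=rng))
```

A log-uniform prior is the usual choice when a parameter such as an iteration count spans several orders of magnitude.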

Generic parameters and the supported loss for each attack algorithm are defined in Table 4. The algorithm returns a deterministic result if randomize is False; otherwise the results might differ due to randomization. Randomness can come either from perturbing the initial input or from the attack algorithm itself. Input perturbation is deterministic if the starting input is the original input or an input with a fixed disturbance, and it is randomized if the starting input is chosen uniformly at random within the adversarial capability. For example, the first iteration of FAB uses the original input, but the subsequent inputs are randomized (if randomization is enabled). Attack algorithms like SQR, which is based on random search, have randomness in the algorithm itself. The deterministic version of such randomized algorithms is obtained by fixing the initial random seed.

The definition of randomize for FGSM, PGD, NES, APGD, FAB, DeepFool, and C&W is whether to start from the original input or from a point selected uniformly at random within the adversarial capability. For SQR, it determines whether the seed is fixed. We generally set randomize to True to allow repeating the attacks for greater attack strength, yet we set it to False for DeepFool and C&W, as they are minimization attacks designed to start from the original input.
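The two starting-point options can be sketched as below. The L-infinity ball of radius eps, the valid input range [0, 1], and the function name are assumptions made for illustration; the text only states that the start is either the original input or a uniformly random point within the adversarial capability.

```python
import random


def initial_input(x, eps, randomize, rng=None, lo=0.0, hi=1.0):
    """Return the attack's starting point for a flat list of input values."""
    if not randomize:
        return list(x)  # deterministic: start from the original input
    rng = rng or random.Random()
    # Random start: uniform noise in [-eps, eps] per coordinate, then clipped
    # to the valid input range.
    return [min(hi, max(lo, xi + rng.uniform(-eps, eps))) for xi in x]


x = [0.0, 0.5, 1.0]
print(initial_input(x, eps=0.03, randomize=False))
print(initial_input(x, eps=0.03, randomize=True, rng=random.Random(0)))
```

Repeating an attack is only useful when the start is randomized, since a deterministic start would reproduce the same result on every repetition.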

The attack specific parameters are specified in Table 5, and the ranges are chosen to be representative by setting reasonable upper and lower bounds to include the default values of parameters. Note that DeepFool algorithm uses the loss D to take difference between the predictions of two classes by design (i.e., loss). C&W uses the hinge loss, and FAB uses loss similar to DeepFool. For C&W and FAB, we just take the library implementation of the loss (i.e. without our loss function space formulation).

Attack Parameter Range and prior
PGD step
C&W confidence
NES step
APGD rho
FAB n_iter
SQR n_queries
Table 5: List of attack specific parameters. The parameter names correspond to the names in the library implementation
Timelimit(s) Attack1 Loss1 Attack2 Loss2 Attack3 Loss3
a2 0.5 APGD Hinge-T-P APGD L1-D-P APGD CE-T-P
a5 0.5 FAB –F-L APGD LM-U-P DeepFool DLR-D-L
a6 0.5 APGD Hinge-U-P APGD Hinge-U-P PGD DLR-T-P
a7 0.5 APGD L1-D-L APGD DLR-U-L APGD Hinge-T-L
a10 30 NES Hinge-U-P - - - -
b11 3 APGD Hinge-T-P DeepFool L1-D-L PGD CE-D-P
b12 3 APGD Hinge-U-L APGD CE-D-P APGD Hinge-T-P
b13 3 FAB –F-L APGD L1-T-L FAB –F-L
b15 3 SQR Hinge-U-L SQR L1-U-L SQR CE-U-L
b16 3 APGD L1-D-P C&W Hinge-U-L PGD Hinge-T-L
b18 3 APGD Hinge-T-L APGD CE-U-P C&W –U-L
b22 3 APGD L1-T-L PGD L1-U-P PGD L1-U-P

Table 6: Timelimit for each network, and attacks and losses result. Due to the cost of A10, only one attack is searched and used. The Loss follows the format: Loss - Targeted - Logit/Prob. The meanings of the abbreviations are defined in Section B.

b.3 Search space conditioned on network property

Properties of network defenses (e.g. randomized, detector, obfuscation) can be used to reduce the search space. In our work, EOT is set to be for deterministic networks. Repeat is set to be for randomized networks, following the practise of AutoAttack setting repeat to in its rand version. Logit Matching is enabled only when detectors are present since the loss is considered as a loss to bypass detectors.

Appendix C Discovered Adaptive Attacks

Our 23 benchmarks presented in Table 2 are selected to contain diverse defenses. Table 7 shows the network transformation result, and Table 6 shows the searched attacks and losses during the attack search.

Network transformation Related Defenses

Some defenses in the benchmark are related to the network transformations. JPEG compression (JPEG) applies an image compression algorithm to the input, which makes the network non-differentiable and reduces adversarial perturbations. Reverse sigmoid (RS) is a special layer added to the logit output of the model in order to obfuscate the gradient. Thermometer encoding (TE) is a non-differentiable input encoding technique that shatters the linearity of the inputs. Random rotation (RR) belongs to the family of randomized defenses and rotates the input image by a random angle on each evaluation. Table 7 shows where the defenses appear and which network processing strategies are applied.

Diversity of Attacks

From Table 6, the majority of the attack algorithms found are APGD, which shows that it is indeed a strong universal attack. The second or third attack can be a weak attack like FGSM; a major reason is that many attacks tie on the criterion evaluation, and the noise in the untargeted CE loss tie-breaker sometimes determines the choice of attack. The loss functions show variety, yet Hinge and DLR appear more often. This challenges the common practice of using CE as the loss function by default.

Removal Policies BPDA Policies
a3 - TE-C
a6 RR-0 -
a7 JPEG-1 RS-1 RR-1 TE-C, JPEG-I
b19 JPEG-0 RS-0 JPEG-I
b20 JPEG-1 RS-1 JPEG-I
b21 RR-0 -
b22 RR-0 -
Table 7: List of network processing strategy used on relevant benchmarks. The format is defense-policy. The defenses are defined in Section C. For layer removal policies, 1 means to remove the layer, 0 means not to remove the layer. For BPDA policies, I means identity, and C means using the network with two convolutions having ReLU activation in between.

Figure 3: Scores of various attack parametrizations from our search space explored when evaluating A2 model using TPE algorithm.

Appendix D Time Complexity

This section gives the worst-case time analysis for Algorithm 1. Attack time is the total time spent in line 4 to remove all non-robust samples. This step counts as attack time because it is where the robustness of the network is evaluated. Search time is the time spent in the SHA iterations in lines 9 and 14, where the timing-critical function is called. The time analysis for the network transformation in line 1 is excluded, as it incurs only a small runtime overhead in practice. We use the time constraint per attack per sample denoted as .

For , the worst-case is when the attacks use the full time budget on all the samples (denoted as ). This gives the bound shown in Equation 5.


For , we first derive the bound for a single attack search, and then the bound for attacks search is times the value. In line 9, the maximum time to perform attacks on samples is . In line 14, the cost of the first iteration is as there are attacks and samples. By design, the cost of SHA iteration is halved for every subsequent iteration, which leads to the total time for a single attack search is . Therefore, the search time bound is shown in Equation 6.


In evaluation we use , which leads to . This means the total search time is bounded by the time bound of executing a sequence of attacks.

The empirical search time scales roughly linearly with and sub-linearly with . These search parameters are used to control the trade-off between search time and search quality.
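The halving argument in the search-time bound is a geometric series: if the first SHA iteration costs C and each subsequent iteration costs half of the previous one, the total stays below 2C. The sketch below checks this numerically. The schedule used here (keep a quarter of the attacks and double the samples each round) is an assumption chosen only because it halves the per-round cost; the schedule actually used by the tool is not restated in this appendix, and all names and numbers are hypothetical.

```python
def sha_round_costs(n_attacks, n_samples, time_per_attack_sample, keep=0.25, grow=2):
    """Worst-case cost of each SHA round under the assumed schedule."""
    costs = []
    while n_attacks >= 1:
        # Worst case: every attack uses its full time budget on every sample.
        costs.append(n_attacks * n_samples * time_per_attack_sample)
        n_attacks = int(n_attacks * keep)  # survivors of this round
        n_samples = n_samples * grow       # evaluate survivors on more samples
    return costs


costs = sha_round_costs(n_attacks=64, n_samples=100, time_per_attack_sample=1.0)
print(costs, sum(costs), 2 * costs[0])
```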

Appendix E Attack-Score Distribution during Search

The analysis of the attack-score distribution can be useful for understanding the search process. Figure 3 shows the distribution on network A2. In this experiment, the number of trials is and the number of samples is , the time budget is . Trials with negative scores are those that timed out. From the results, we can see that:

  • Expensive attacks like NES time out because a small time budget is used. The parameter range can also affect the search: FGSM times out because the repeat parameter can be very large.

  • The sensitivity of the score to the parameters varies across attack algorithms. For example, PGD has a large variance of scores, whereas APGD is very stable by design.

  • The TPE algorithm samples more attack algorithms with high scores, which enables it to choose better attack parameters during the SHA stage.

  • The top attacks have similar performance, which means the searched attack should have low variance in attack strength. In practice, the variance among the best searched attacks is typically small.

Appendix F Ablation Study

Here we provide details on the ablation study in Section 5.

f.1 Attack Algorithm & Parameters

In the experiment setup, the search space includes four attacks (FGSM, PGD, DeepFool, C&W) with their generic and specific parameters shown in Table 4 and Table 5, respectively. The loss search space contains only the loss from the library implementation, and the network transformation space contains only BPDA. Robust accuracy (Racc) is used as the evaluation metric. The best Racc among FGSM, PGD, DeepFool, and C&W with library default parameters is calculated and compared with the Racc of the searched attack.

The result in Table 8 shows that the average robustness improvement is 5.5%, and up to 17.7% (A7). The PGD evaluation can be much stronger after tuning, which reflects the fact that insufficient parameter tuning of PGD is a common cause of overestimated robustness in the literature.

Net Racc (Library Impl.) Attack (Library Impl.) Racc (A) Change Attack (A)
A2 47.1 C&W 47.0 -0.1 PGD
A3 13.4 PGD 13.4 -6.8 PGD
A4 35.9 DeepFool 35.9 -5.6 PGD
A5 6.6 DeepFool 6.6 0.0 DeepFool
A6 14.5 PGD 8.4 -6.1 PGD
A7 35.0 PGD 17.3 -17.7 PGD
A8 6.9 C&W 6.6 -0.3 C&W
A9 25.4 PGD 14.7 -10.7 PGD
A10 64.7 FGSM 62.4 -2.3 PGD
Table 8: Comparison with library default parameters and the searched best attack. The implementations of FGSM, PGD, and DeepFool are based on FoolBox (Rauber et al., 2017) version 3.0.0, C&W is based on ART (Nicolae et al., 2018) version 1.3.0.
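The summary statistics can be recomputed from the change column of Table 8. The snippet below is only an arithmetic check on the values as printed in the table; the variable names are hypothetical.

```python
# Change in robust accuracy (searched attack minus library default), from Table 8.
changes = {
    "A2": -0.1, "A3": -6.8, "A4": -5.6, "A5": 0.0, "A6": -6.1,
    "A7": -17.7, "A8": -0.3, "A9": -10.7, "A10": -2.3,
}

# A drop in robust accuracy is an improvement in attack strength.
mean_improvement = -sum(changes.values()) / len(changes)
largest_improvement = -min(changes.values())
strongest_case = min(changes, key=changes.get)

print(round(mean_improvement, 1), largest_improvement, strongest_case)
```

The mean rounds to 5.5 percentage points and the largest change is the 17.7 points recorded for A7.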

f.2 Loss

Figure 4 shows the comparison between TPE with the loss formulation and TPE with the default loss. The search space with the default loss contains only the L1 and CE losses, restricted to the untargeted loss and logit output. The result shows a 3.0% improvement in the final score with the loss formulation.

f.3 TPE algorithm vs Random

Figure 4 shows the comparison between TPE search and random search. TPE finds better scores by an average of 1.3% and up to 8.0% (A6) depending on the network.

Figure 4: The best score progression measured by the average of 5 independent runs of models A2 to A10.