HybridAttack
In a blackbox setting, the adversary only has API access to the target model and each query is expensive. Prior work on blackbox adversarial examples follows one of two main strategies: (1) transfer attacks use whitebox attacks on local models to find candidate adversarial examples that transfer to the target model, and (2) optimization-based attacks use queries to the target model and apply optimization techniques to search for adversarial examples. We propose hybrid attacks that combine both strategies, using candidate adversarial examples from local models as starting points for optimization-based attacks and using labels learned in optimization-based attacks to tune local models for finding transfer candidates. We empirically demonstrate on the MNIST, CIFAR10, and ImageNet datasets that our hybrid attack strategy reduces cost and improves success rates, and in combination with our seed prioritization strategy, enables batch attacks that can efficiently find adversarial examples with only a handful of queries.
Machine learning (ML) models are often prone to misclassifying inputs, known as adversarial examples (AEs), that are crafted by perturbing a normal input in a constrained, but purposeful way. Effective methods for finding adversarial examples have been found in whitebox settings, where an adversary has full access to the target model [39, 17, 8, 30, 24], as well as in blackbox settings, where only API access is available [10, 43, 35, 38, 21, 22]. In this work, we aim to improve our understanding of the expected cost of blackbox attacks in realistic settings. For most scenarios where the target model is only available through an API, the cost of attacks can be quantified by the number of model queries needed to find a desired number of adversarial examples. Blackbox attacks often require a large number of model queries, and each query takes time to execute, in addition to incurring a service charge and exposure risk to the attacker.
Previous blackbox attacks can be grouped into two categories: transfer attacks [36, 35] and optimization attacks [38, 10, 43, 21, 22]. Transfer attacks exploit the observation that adversarial examples often transfer between different models [29, 41, 17, 35, 27]. The attacker generates adversarial examples against local models using whitebox attacks, and hopes they transfer to the target model. Transfer attacks use one query to the target model for each attempted candidate transfer, but suffer from transfer loss as local adversarial examples may not successfully transfer to the target model. Transfer loss can be very high, especially for targeted attacks where the attacker’s goal requires finding examples where the model outputs a particular target class rather than just producing misclassifications.
Optimization attacks formulate the attack goal as a blackbox optimization problem and carry out the attack using a series of queries to the target model [10, 4, 43, 21, 22, 1, 18, 33, 28]. These attacks require many queries, but do not suffer from transfer loss as each seed is attacked interactively using the target model. Optimization-based attacks can have high attack success rates, even for targeted attacks, but often require many queries for each adversarial example found.
Contributions. Although improving query efficiency and attack success rates for blackbox attacks is an active area of research for both transfer-based and optimization-based attacks, prior works treat the two types of attacks independently and fail to explore possible connections between the two approaches. We investigate three straightforward possibilities for combining transfer and optimization-based attacks (Section 3), and find that only one is generally useful (Section 4): failed transfer candidates are useful starting points for optimization attacks. This can be used to substantially improve blackbox attacks in terms of both success rates and, most importantly, query cost. Compared to transfer attacks, hybrid attacks can significantly improve the attack success rate by adopting optimization attacks for the non-transfers, which increases per-sample query cost. Compared to optimization attacks, hybrid attacks significantly reduce query complexity when useful local models are available. For example, for both MNIST and CIFAR10, our hybrid attacks reduce the mean query cost of attacking normally-trained models by over 75% compared to state-of-the-art optimization attacks. For ImageNet, the transfer attack only has a 3.4% success rate while the hybrid attack approaches a 100% success rate.
To improve our understanding of resource-limited blackbox attacks, we simulate a batch attack scenario where the attacker has access to a large pool of seeds and is motivated to obtain many adversarial examples using limited resources. Alternatively, we can view the batch attacker’s goal as obtaining a fixed number of adversarial examples with the fewest queries. We demonstrate that the hybrid attack can be combined with a novel seed prioritization strategy to dramatically reduce the number of queries required in batch attacks (Section 5). For example, for ImageNet, when the attacker is interested in obtaining 10 adversarial examples from a pool of 100 candidate seeds, our seed prioritization strategy can be used to save over 70% of the queries compared to random ordering of the seeds.
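Table 1. Gradient blackbox attacks.
Attack | Gradient Estimation | Queries per Iteration | Whitebox Attack
ZOO [10] | [formula not recoverable] | [not recoverable] | CW [8]
Bhagoji et al. [4] | ZOO + random feature group or PCA | [not recoverable] | FGSM [17], PGD [30]
AutoZOOM [43] | [formula not recoverable] | [not recoverable] | CW [8]
NES [21] | [formula not recoverable] | [not recoverable] | PGD
Ilyas et al. [22] | NES + time/data dependent info | [not recoverable] | PGD
SignHunter [1] | Gradient sign w/ divide-and-conquer method | [not recoverable] | PGD
Cheng et al. [13] | [formula not recoverable] | [not recoverable] | PGD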
In this section, we overview the two main types of blackbox attacks which are combined in our hybrid attack strategies.
Transfer attacks take advantage of the observation that adversarial examples often transfer across models. The attacker runs standard whitebox attacks on local models to find adversarial examples that are expected to transfer to the target model. Most works assume the attacker has access to training data similar to the data used for the target model, or has access to pretrained models for a similar data distribution. For attackers with access to pretrained local models, no queries to the target model are needed to train the local models. Other works consider training a local model by querying the target model, sometimes referred to as substitute training [35, 27]. With naïve substitute training, many queries are needed to train a useful local model. Papernot et al. adopt a reservoir sampling approach to reduce the number of queries needed [35]. Li et al. use active learning to further reduce the query cost [27]. However, even with these improvements, many queries are still needed and substitute training has had limited effectiveness for complex target models.
Although adversarial examples sometimes transfer between models, transfer attacks typically have much lower success rates than optimization attacks, especially for targeted attacks. In our experiments on ImageNet, the highest transfer rate of targeted attacks observed from a single local model is 0.2%, while gradient-based attacks achieve nearly 100% success. Liu et al. improve transfer rates by using an ensemble of local models [29], but still only achieve low transfer rates (3.4% in our ImageNet experiments, see Table 3).
Another line of work aims to improve transferability by modifying the whitebox attacks on the local models. Dong et al. adopt the momentum method to boost the attack process, which leads to improved transferability [15]. Xie et al. improve the diversity of attack inputs by considering image transformations in the attack process to improve the transferability of existing whitebox attacks [45]. Dong et al. recently proposed a translation-invariant optimization method that further improves transferability [16]. We did not incorporate these methods in our experiments, but expect they would be compatible with our hybrid attacks.
Table 2. Gradient-free blackbox attacks.
Attack | Applicable Norm | Objective Function | Solution Method
SimBA [18] | [not recoverable] | [formula not recoverable] | Iterate: sample a vector from the candidate set, then try each of two candidate steps in turn [symbols not recoverable]
Li et al. [28] | [not recoverable] | [formula not recoverable] | Compute the locally optimal distribution parameters, then sample from the resulting distribution
Moon et al. [33] | [not recoverable] | [formula not recoverable] | Compute the locally optimal pixel set, then [remainder not recoverable]
Table notes: the symbols in the table denote the set of orthonormal candidate vectors; the cross-entropy loss of an image with the original label (untargeted attack) or target label (targeted attack); a parameterized distribution; and the ground set of all pixel locations. Marked variables are locally optimal solutions obtained by solving the corresponding optimization problems.
Optimization-based attacks work by defining an objective function and iteratively perturbing the input to optimize that objective function. We first consider optimization attacks where the query response includes full prediction scores, and divide them into those that involve estimating the gradient of the objective function using queries to the target model and those that do not depend on estimating gradients. Finally, we also briefly review restricted blackbox attacks, where attackers obtain even less information from each model query, in the extreme learning just the label prediction for the test input.
Gradient Attacks. Gradient-based blackbox attacks numerically estimate the gradient of the target model, and execute standard whitebox attacks using those estimated gradients. Table 1 compares several gradient blackbox attacks.
The first attack of this type was the ZOO (zeroth-order optimization) attack, introduced by Chen et al. [10]. It adopts the finite-difference method with dimension-wise estimation to approximate gradient values, and uses them to execute a Carlini-Wagner (CW) whitebox attack [8]. The attack runs for hundreds to thousands of iterations, and the number of queries it takes per CW optimization iteration grows with the dimensionality of the input. Hence, the query cost is extremely high for larger images (e.g., over 2M queries on average for ImageNet).
Following this work, several researchers have sought more query-efficient methods for estimating gradients for executing blackbox gradient attacks. Bhagoji et al. propose reducing the query cost of dimension-wise estimation by randomly grouping features or by estimating gradients along the principal components given by principal component analysis (PCA) [4]. Tu et al.’s AutoZOOM attack uses two-point estimation based on random vectors and reduces the query complexity per CW iteration relative to dimension-wise estimation without losing much accuracy on estimated gradients [43]. Ilyas et al.’s NES attack [21] uses a natural evolution strategy (which is in essence still random vector-based gradient estimation) [44] to estimate the gradients for use in projected gradient descent (PGD) attacks [30]. A later attack by Ilyas et al. incorporates time and data dependent information into the NES attack [22]. Al-Dujaili et al.’s SignHunter adopts a divide-and-conquer approach to estimate the sign of the gradient and is empirically shown to be superior to the attack of Ilyas et al. [22] in terms of query efficiency and attack success rate [1]. Cheng et al. recently proposed improving that attack by incorporating gradients from surrogate models as priors when estimating the gradients [13]. For our experiments (Section 4.2), we use AutoZOOM and NES as representative state-of-the-art blackbox attacks. [Footnote 1: We also tested the attack of Ilyas et al. [22] on ImageNet, but found it less competitive than the earlier attacks and therefore do not include the results in this paper. We have not evaluated SignHunter or the attack of Cheng et al. [13], but plan to include more results in future versions and have released an open-source framework to enable other attacks to be tested using our methods.]
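The two estimation styles discussed above, dimension-wise finite differences and averaging over random directions, can be sketched as follows. This is a minimal illustration in plain Python, not the exact estimator of any cited attack. The loss `toy_loss`, the step size `delta`, the sample count `n`, and the `QueryCounter` helper are all assumptions made for the example.

```python
import random

def coordinate_wise_gradient(f, x, delta=1e-4):
    """Symmetric finite differences, one coordinate at a time.

    This sketch spends two queries to f per input dimension.
    """
    grad = []
    for i in range(len(x)):
        plus, minus = list(x), list(x)
        plus[i] += delta
        minus[i] -= delta
        grad.append((f(plus) - f(minus)) / (2 * delta))
    return grad

def random_vector_gradient(f, x, n=50, delta=1e-3, rng=None):
    """Two-point estimate averaged over n random Gaussian directions.

    This sketch spends n + 1 queries to f, independent of the dimension.
    """
    rng = rng or random.Random(0)
    base = f(x)
    grad = [0.0] * len(x)
    for _ in range(n):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        shifted = [xi + delta * ui for xi, ui in zip(x, u)]
        slope = (f(shifted) - base) / delta  # finite-difference slope along u
        for i, ui in enumerate(u):
            grad[i] += slope * ui / n
    return grad

class QueryCounter:
    """Wraps a function and counts how many times it is called."""
    def __init__(self, fn):
        self.fn = fn
        self.count = 0

    def __call__(self, x):
        self.count += 1
        return self.fn(x)

def toy_loss(x):
    # Hypothetical stand-in for a loss computed from target-model scores.
    return sum((i + 1) * xi for i, xi in enumerate(x))
```

Wrapping `toy_loss` in `QueryCounter` makes the trade-off visible: the coordinate-wise estimate is accurate but its query count grows with the input size, while the random-direction estimate uses a fixed budget and returns a noisier gradient.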
Gradient-free Attacks. Researchers have also explored search-based blackbox attacks using heuristic methods that are not based on gradients, which we call gradient-free attacks. One line of work directly applies known heuristic blackbox optimization techniques, and is not competitive with the gradient-based blackbox attacks in terms of query efficiency. Alzantot et al. [2] develop a genetic programming strategy, where the fitness function is defined similarly to the CW loss [8], using the prediction scores from queries to the blackbox model. A similar genetic programming strategy was used to perform targeted blackbox attacks on audio systems [40]. Narodytska et al. [34] use a local neighbor search strategy, where each iteration perturbs the most significant pixel. Since the reported query efficiency of these methods is not competitive with results for gradient-based attacks, we did not consider these attacks in our experiments.
Several recent gradient-free blackbox attacks (summarized in Table 2) have been proposed that can significantly outperform the gradient-based attacks. Guo et al.’s SimBA [18] iteratively adds or subtracts a random vector sampled from a predefined set of orthonormal candidate vectors to generate adversarial examples efficiently. Li et al.’s attack [28] formulates the adversarial example search process as identifying a probability distribution from which random samples are likely to be adversarial. Moon et al. formulate the norm-bounded blackbox attack as a problem of selecting a set of pixels to receive one perturbation and applying another perturbation to the remaining pixels, such that optimizing the objective function defined for misclassification becomes a set maximization problem. Efficient submodular optimization algorithms are then used to solve the set maximization problem [33]. These attacks became available after we started our experiments, so they are not included in our experiments. However, our hybrid attack strategy is likely to work for these new attacks as it boosts the optimization attacks by providing better starting points, which we expect is beneficial for most attack algorithms.
Restricted Blackbox Attacks. All the previous attacks assume the adversary can obtain complete prediction scores from the blackbox model. Much less information might be revealed at each model query, however, such as just the top few confidence scores or, at worst, just the output label.
Ilyas et al. [21], in addition to their main results for the NES attack with full prediction scores, also consider scenarios where only the prediction scores of the top k classes or only the model prediction label are revealed. In the case of partial prediction scores, attackers start from an instance in the target class (or a class other than the original class) and gradually move towards the original image using the gradient estimated by NES. For the label-only setting, a surrogate loss function is defined to utilize the strategy for partial prediction scores. Brendel et al. [5] propose a label-only blackbox attack, which starts from an example in the target class and performs a random walk from that target example to the seed example. This random walk procedure often requires many queries. Following this work, several researchers have worked to reduce the high query cost of random walk strategies. Cheng et al. formulate a label-only attack as an optimization problem, reducing the query cost significantly compared to the random walk [12]. Chen et al. also formulate the label-only attack as an optimization problem and show this significantly improves query efficiency [9]. Brunner et al. [6] improve upon the random walk strategy by additionally considering domain knowledge of image frequency, region masks and gradients from surrogate models.
In our experiments, we assume attackers have access to full prediction scores, but we believe our methods are also likely to help in settings where attackers obtain less information from each query. This is because the hybrid attack boosts gradient attacks by providing better starting points and is independent of the specific attack methods or the types of query feedback from the blackbox model.
Our hybrid attacks combine the transfer and optimization methods for searching for adversarial examples. Here, we introduce the threat model of our attack, state the hypotheses underlying the attacks, and present the general hybrid attack algorithm. We evaluate the hypotheses and attacks in Section 4.
Threat Model. In the blackbox attack setting, the adversary does not have direct access to the target model or knowledge of its parameters, but can use API access to the target model to obtain prediction confidence scores for a limited number of submitted queries. We assume the adversary has access to pretrained local models for the same task as the target model. These could be directly available or produced from access to similar training data and knowledge of the model architecture of the target model. Having access to pretrained local models is a common assumption in research on transfer-based attacks. A few works on substitute training [35, 27] have used weaker assumptions such as only having access to a small amount of training data, but have only been effective so far for very small datasets.
Hypotheses. Our approach stems from three hypotheses about the nature of adversarial examples:
Hypothesis 1 (H1): Local adversarial examples are better starting points for optimization attacks than original seeds. Liu et al. observe that for the same classification tasks, different models tend to have similar decision boundaries [29]. Therefore, we hypothesize that, although candidate adversarial examples generated on local models may not fully transfer to the target model, these candidates are still closer to the targeted region than the original seed and hence, make better starting points for optimization attacks.
Hypothesis 2 (H2): Labels learned from optimization attacks can be used to tune local models. Papernot et al. observe that generating examples crossing decision boundaries of local models can produce useful examples for training local models closer to the target model [35]. Therefore, we hypothesize that query results generated through the optimization search queries may contain richer information regarding true target decision boundaries. These new labeled inputs that are the by-product of an optimization attack can then be used to fine-tune the local models to improve their transferability.
Hypothesis 3 (H3): Local models can help direct gradient search. Since different models tend to have similar decision boundaries for the same classification tasks, we hypothesize that gradient information obtained from local models may also help better calibrate the estimated gradient of gradient-based blackbox attacks on the target model.
We are not able to find any evidence to support the third hypothesis (H3), which is consistent with Liu et al.’s results [29]. They observed that, for ImageNet models, the gradients of local and target models are almost orthogonal to each other. We also tested this for MNIST and CIFAR10, conducting whitebox attacks on local models and storing the intermediate images and the corresponding gradients. We found that the local and target models have almost orthogonal gradients (cosine similarity close to zero) and therefore, a naïve combination of the gradients of the local and target models is not feasible. One possible explanation is the noisy nature of gradients of deep learning models, which causes the gradient to be highly sensitive to small variations [3]. Although the cosine similarity is low, two recent works have attempted to combine the local gradients and the estimated gradient of the blackbox model by a linear combination [13, 6]. However, Brunner et al. observe that straightforward incorporation of local gradients does not improve targeted attack efficiency much [6]. Cheng et al. successfully incorporated local gradients into untargeted blackbox attacks; however, they do not consider the more challenging targeted attack scenario, and it is still unclear whether local gradients can help in more challenging cases [13]. Hence, we do not investigate this further in this paper and leave it as an open question whether there are more sophisticated ways to exploit local model gradients.
Attack Method. Our hybrid attacks combine transfer and optimization attacks in two ways based on the first two hypotheses: we use a local ensemble to select better starting points for an optimization attack, and use the labeled inputs obtained in the optimization attack to tune the local models to improve transferability. Algorithm 1 provides a general description of the attack. The attack begins with a set of seed images, which are natural images that are correctly classified by the target model, and a set of local models. The attacker’s goal is to find a set of successful adversarial examples (satisfying some attacker goal, such as being classified in a target class with a perturbation below a given limit, starting from a natural image in the source class).
The attack proceeds by selecting the next seed to attack. Section 4 considers the case where the attacker only selects seeds randomly; Section 5 considers ways more sophisticated resource-constrained attackers may improve efficiency by prioritizing seeds. Next, the attack uses the local models to find a candidate adversarial example for that seed. When the local adversarial example is found, we first check its transferability, and if the candidate directly transfers, we proceed to attack the next seed. If the candidate fails to directly transfer, the blackbox optimization attack is then executed starting from that candidate. The original seed is also passed into the blackbox attack since the adversarial search space is defined in terms of the original seed, not the starting point found using the local models. This is because the space of permissible inputs is defined based on distance from the original seed, which is a natural image. Constraining with respect to the original seed is important because we need to make sure the perturbations from our method are still visually indistinguishable from the natural image. If the blackbox attack succeeds, it returns a successful adversarial example, which is added to the returned set. Regardless of success, the blackbox attack produces input-label pairs during the search process which can be used to tune the local models, as described in Section 4.6.
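The control flow just described can be sketched as a short loop. This is an illustrative outline under stated assumptions, not a reproduction of Algorithm 1: every callable passed in (`local_attack`, `target_query`, `is_adversarial`, `blackbox_attack`, `tune_local_models`) is a hypothetical placeholder, and the toy stand-ins below exist only so the sketch runs end to end.

```python
def hybrid_attack(seeds, local_attack, target_query, is_adversarial,
                  blackbox_attack, tune_local_models=None):
    """Try a transfer candidate first, then fall back to optimization.

    Hypothetical callables supplied by the caller:
      local_attack(seed)           -> candidate from the local models
      target_query(x)              -> target-model output (one query)
      is_adversarial(output, seed) -> True if the attacker's goal is met
      blackbox_attack(seed, start) -> (example or None, labeled pairs, queries used)
      tune_local_models(pairs)     -> optional hook for tuning local models
    """
    found, total_queries = [], 0
    for seed in seeds:
        candidate = local_attack(seed)
        output = target_query(candidate)   # one query to check direct transfer
        total_queries += 1
        if is_adversarial(output, seed):
            found.append(candidate)
            continue
        # The original seed is passed too: a real attack constrains the
        # search to stay close to the seed, not to the starting point.
        adv, pairs, used = blackbox_attack(seed, candidate)
        total_queries += used
        if adv is not None:
            found.append(adv)
        if tune_local_models is not None:
            tune_local_models(pairs)       # by-product labels, success or not
    return found, total_queries

# Toy stand-ins (purely illustrative): inputs are numbers, the "model"
# echoes its input, and the goal is met once the output reaches 10.
def toy_local_attack(seed):
    return seed + 2

def toy_target_query(x):
    return x

def toy_is_adversarial(output, seed):
    return output >= 10

def toy_blackbox_attack(seed, start, limit=20):
    x, pairs = start, []
    for used in range(1, limit + 1):
        x += 1
        output = toy_target_query(x)
        pairs.append((x, output))
        if toy_is_adversarial(output, seed):
            return x, pairs, used
    return None, pairs, limit

collected = []
found, total_queries = hybrid_attack(
    [9, 5], toy_local_attack, toy_target_query, toy_is_adversarial,
    toy_blackbox_attack, tune_local_models=collected.extend)
```

In the toy run, the first seed's candidate transfers directly at a cost of one query, while the second needs the fallback search. That mirrors the two sources of savings the hybrid attack relies on.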
In this section, we report on experiments to validate our hypotheses and evaluate the hybrid attack methods. Section 4.1 describes the experimental setup; Section 4.2 describes the attack configuration; Section 4.3 describes the attack goal; Section 4.4 reports on experiments to test the first hypothesis from Section 3 and measure the effectiveness of hybrid attacks; Section 4.5 improves the attack for targeting robust models; and Section 4.6 evaluates the second hypothesis, showing the impact of tuning the local models using the label by-products. For all of these, we focus on comparing the cost of the attack measured as the average number of queries needed per adversarial example found across a set of seeds. In Section 5, we revisit the overall attack costs in light of batch attacks that can prioritize which seeds to attack.
We evaluate our attacks on three popular image classification datasets and a variety of state-of-the-art models.
MNIST. MNIST [26] is a dataset of 70,000 greyscale images of handwritten digits (0–9), split into 60,000 training and 10,000 testing samples. For our normal (not adversarially trained) MNIST models, we use the pretrained MNIST models of Bhagoji et al. [4], which typically consist of convolutional layers and fully connected layers. We use their MNIST model A as the target model, and models B–D as local ensemble models. To consider the more challenging scenario of attacking a blackbox robust model, we use Madry’s robust MNIST model, which demonstrates strong robustness even against the best whitebox attacks (maintaining over 88% accuracy under bounded-perturbation attacks) [30].
CIFAR10. CIFAR10 [23] consists of 60,000 RGB images, with 50,000 training and 10,000 testing samples for object classification (10 classes in total). We train a standard DenseNet model and obtain a test accuracy of 93.1%, which is close to state-of-the-art performance. To test the effectiveness of our attack on robust models, we use Madry’s CIFAR10 Robust Model [30]. We use the terms normal CIFAR10 target model and standard DenseNet (StdDenseNet) model interchangeably. For our normal local models, we adopt three simple LeNet structures [25], varying the number of hidden layers and hidden units. [Footnote 2: We also tested with deep CNN models as our local ensembles. However, they provide only slightly better performance compared to simple CIFAR10 models, while the fine-tuning cost is much higher.] For simplicity, we name the three normal models NA, NB and NC, where NA has the fewest parameters and NC has the most parameters. To deal with the lower effectiveness of attacks on the robust CIFAR10 model (Section 4.4), we also adversarially train two deep CIFAR10 models (DenseNet, ResNet) similar to the Madry robust model as robust local models. The adversarially-trained DenseNet and ResNet models are named RDenseNet and RResNet.
ImageNet. ImageNet [14] is a dataset closer to real-world images with 1000 categories, commonly used for evaluating state-of-the-art deep learning models. We adopt the following pretrained ImageNet models for our experiments: ResNet50 [19], DenseNet [20], VGG16, and VGG19 [37] (all from https://keras.io/applications/). We take DenseNet as the target blackbox model and the remaining models as the local ensemble.
For the hybrid attack, since we have both the target model and local models, we have two main design choices: (1) which whitebox attacks to use for the local models, and (2) which optimization attacks to use for the target model.
Local Model Configurations. We choose an ensemble of local models in our hybrid attacks. This design choice is motivated by two facts. First, different models tend to have significantly different direct transfer rates to the same target model (see Figure 1) when evaluated individually. Therefore, taking an ensemble of several models helps avoid ending up with a single local model with a very low direct transfer rate. Second, consistent with the findings of Liu et al. [29] on attacking an ensemble of local models, for MNIST and CIFAR10, we find that the ensemble of normal local models yields the highest transfer rates when the target model is a normally trained model (note that this does not hold for a robust target model, as shown in Figure 1 and discussed further in Section 4.5). We validate the importance of a normal local ensemble against a normal target model by considering different combinations of local models and checking their corresponding transfer rates and average query cost. We adopt the same approach as proposed by Liu et al. [29] to attack multiple models simultaneously, where the attack loss is defined as the sum of the individual model losses. In terms of transfer rate, we observe that a single CIFAR10 or MNIST normal model can achieve up to 53% and 35% targeted transfer rate respectively, while an ensemble of local models can achieve over 63% and 60% transfer rate. In terms of the average query cost against normal target models, compared to a single model, an ensemble of local models on MNIST and CIFAR10 can save on average 53% and 45% of queries, respectively. Since the ensemble of normal local models provides the highest transfer rate against normal target models, to be consistent, we use that configuration in all our experiments attacking normal models. We perform whitebox PGD [30] attacks (100 iterative steps) on the ensemble loss. We choose the PGD attack as it gives a high transfer rate compared to the fast gradient sign method (FGSM) [17].
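The ensemble loss mentioned above, a sum of the individual model losses, can be written down in a few lines. The sketch below assumes each local model is a callable that returns a list of class probabilities and that the per-model loss is cross-entropy. Both are assumptions made for illustration, and the PGD steps that would use this loss are omitted.

```python
import math

def cross_entropy(probs, label):
    """Negative log-probability assigned to `label` (clamped for stability)."""
    return -math.log(max(probs[label], 1e-12))

def ensemble_loss(models, x, label):
    """Attack loss over an ensemble: the sum of the per-model losses.

    `models` is a list of hypothetical callables mapping an input to a
    list of class probabilities. For a targeted attack, `label` is the
    target class and the attacker lowers this value.
    """
    return sum(cross_entropy(model(x), label) for model in models)

# Two toy "models" that ignore their input and return fixed probabilities.
toy_models = [lambda x: [0.5, 0.5], lambda x: [0.25, 0.75]]
loss = ensemble_loss(toy_models, x=None, label=0)
```

Summing the losses means a candidate only scores well when it moves all local models towards the target class at once, which is the intuition behind using the ensemble rather than a single model.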
Optimization Attacks. We use two state-of-the-art gradient estimation based attacks in our experiments: NES, a natural evolution strategy based attack [21], and AutoZOOM, an autoencoder-based zeroth-order optimization attack [43] (see Section 2.2). These two methods are selected as both are shown to improve significantly upon ZOO [10] in terms of query efficiency and attack success rate. We also tested the attack of Ilyas et al. [22], an improved version of the NES attack that additionally incorporates time and data dependent information. However, we find that it is not competitive with the other two attacks in our attack scenario and therefore we do not include its results here. [Footnote 3: For example, for the targeted attack on ImageNet, this baseline attack only has an 88% success rate and an average query cost of 51,745, which are much worse than the NES and AutoZOOM attacks.] Both tested attacks follow an attack method which attempts queries for a given seed until either a successful adversarial example is found or the set maximum query limit is reached, in which case they terminate with a failure. For MNIST and CIFAR10, we set the query limit to 4000 queries for each seed. AutoZOOM sets a default maximum query limit of 2000; however, as we consider a harder attack scenario (selecting the least likely class as the target class), we double the maximum query limit. NES does not contain evaluation setups for MNIST and CIFAR10 and therefore, we choose to enforce the same maximum query limit as AutoZOOM. [Footnote 4: By running the original AutoZOOM attack with a 4000 query limit compared to their default setting of 2000, we found 17.2% and 25.4% more adversarial samples out of 1000 seeds for CIFAR10 and MNIST respectively.] For ImageNet, we set the maximum query limit to 10,000 following the default setting used in the NES paper [21].
Table 3. Attack results starting from the original seed (Base) and from the local-ensemble candidate (Ours). (T) denotes targeted attacks and (U) untargeted attacks.
Dataset | Target Model | Transfer Rate (%) | Gradient Attack | Success (%) Base | Success (%) Ours | Queries/Seed Base | Queries/Seed Ours | Queries/AE Base | Queries/AE Ours | Queries/Search Base | Queries/Search Ours
MNIST | Normal (T) | 61.6 | AutoZOOM | 90.9 | 98.8 | 1,495 | 294 | 1,645 | 298 | 3,320 | 789
MNIST | Normal (T) | 61.6 | NES | 76.6 | 88.7 | 2,548 | 899 | 3,326 | 1,014 | 8,210 | 3,316
MNIST | Robust (U) | 2.9 | AutoZOOM | 7.2 | 7.3 | 3,757 | 3,747 | 52,182 | 51,328 | 87,156 | 85,167
MNIST | Robust (U) | 2.9 | NES | 4.5 | 5.7 | 3,901 | 3,806 | 86,695 | 66,775 | 159,844 | 135,933
CIFAR10 | Normal (T) | 63.3 | AutoZOOM | 92.2 | 98.1 | 1,131 | 272 | 1,227 | 277 | 2,165 | 779
CIFAR10 | Normal (T) | 63.3 | NES | 99.1 | 99.9 | 1,088 | 347 | 1,098 | 347 | 1,628 | 944
CIFAR10 | Robust (U) | 9.5 | AutoZOOM | 64.4 | 65.2 | 1,700 | 1,649 | 2,640 | 2,529 | 3,091 | 2,961
CIFAR10 | Robust (U) | 9.5 | NES | 38.0 | 37.9 | 2,815 | 2,777 | 7,408 | 7,326 | 9,796 | 9,776
ImageNet | Normal (T) | 3.4 | AutoZOOM | 95.4 | 98.0 | 42,310 | 29,484 | 44,354 | 30,089 | 45,166 | 31,174
ImageNet | Normal (T) | 3.4 | NES | 100.0 | 100.0 | 18,797 | 14,430 | 18,797 | 14,430 | 19,030 | 14,939
For MNIST and CIFAR10, we randomly select 100 images from each of the 10 classes for 1000 total images, against which we perform all blackbox attacks. For ImageNet, we randomly sample 100 total images across all 1000 classes.
Target Class. We evaluate targeted attacks on the normal MNIST, CIFAR10, and ImageNet models. Targeted attacks are more challenging and are generally of more practical interest. For the MNIST and CIFAR10 datasets, all of the selected instances belong to one particular original class and we select as the target class the least likely class of the original class given a prediction model, which should be the most challenging class to target. We define the least likely class of a class as the class which is most frequently the class with the lowest predicted probability across all instances of the class. For ImageNet, we choose the least likely class of each image as the target class. For the robust models for MNIST and CIFAR10, we evaluate untargeted attacks as these models are designed to resist untargeted attacks [31, 32]. Untargeted attacks against these models are significantly more difficult than targeted attacks against the normal models.
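The definition of the least likely class of a class can be made concrete with a small helper. This is a sketch under the assumption that the prediction scores for all instances of one original class are available as plain lists; the function name and data layout are illustrative only.

```python
from collections import Counter

def least_likely_class(score_rows):
    """Return the class that most often has the lowest predicted score.

    `score_rows` holds one prediction-score list per instance of a single
    original class. Ties are broken by the lowest class index within a row
    and by first occurrence across rows.
    """
    votes = Counter(
        min(range(len(row)), key=row.__getitem__) for row in score_rows
    )
    return votes.most_common(1)[0][0]

# Three instances of one class, scored over three classes.
rows = [
    [0.70, 0.20, 0.10],
    [0.60, 0.10, 0.30],
    [0.80, 0.15, 0.05],
]
target = least_likely_class(rows)
```

Here class 2 has the lowest score for two of the three instances, so it is chosen as the target class for every seed of that original class.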
Attack Distance Metric and Magnitude. We measure the perturbation distance using $L_\infty$, which is the most widely used attacker strength metric in blackbox adversarial examples research. Since the AutoZOOM attack is designed for $L_2$ attacks, we transform it into an $L_\infty$ attack by clipping the attacked image into the $\epsilon$-ball ($L_\infty$ space) of the original seed in each optimization iteration. Note that the original AutoZOOM loss function combines two terms, one for misclassification (targeted or untargeted) and one for perturbation magnitude minimization. In our transformation to the $L_\infty$ norm, we optimize only the misclassification term and clip the result to the $\epsilon$-ball of the original seed. NES is naturally an $L_\infty$ attack. For MNIST, we choose $\epsilon$ following the setting in Bhagoji et al. [4]. For CIFAR10, we set $\epsilon$ following the same setting in an early version of the NES paper [21]. For ImageNet, we set $\epsilon$ as used by Ilyas et al. [21].
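The clipping step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the flat-list image representation, the [0, 1] pixel range, and the value of eps in the example are all assumptions.

```python
def clip_to_linf_ball(x_adv, x_seed, eps, lo=0.0, hi=1.0):
    """Project x_adv onto the L-infinity ball of radius eps around x_seed.

    Images are flat lists of pixel values. The valid pixel range [lo, hi]
    is an assumption of this sketch.
    """
    clipped = []
    for a, s in zip(x_adv, x_seed):
        a = max(s - eps, min(s + eps, a))  # stay within eps of the seed pixel
        a = max(lo, min(hi, a))            # stay within the valid pixel range
        clipped.append(a)
    return clipped


# Illustrative values only; eps here is not a setting from the paper.
seed = [0.5, 0.5, 0.5]
perturbed = [0.9, 0.1, 0.5]
projected = clip_to_linf_ball(perturbed, seed, eps=0.25)
```

Applying this projection after every optimization step keeps the iterate inside the allowed perturbation set, which is what lets an attack that otherwise minimizes distortion be used under a fixed budget.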
We test the hypothesis that local models produce useful candidates for blackbox attacks by measuring the mean cost to find an adversarial example starting from both the original seed and from a candidate found using the local ensemble. Since only 100 ImageNet instances are selected, we average over 5 runs to produce more stable ImageNet results. Table 3 summarizes our results.
In nearly all cases, the cost is reduced by starting from the candidates instead of the original seeds, where candidates are generated by attacking local ensemble models. We measure the cost by the mean number of queries to the target model per adversarial example found. This is computed by dividing the total number of model queries used over the full attack on 1,000 (MNIST, CIFAR10) or 100 (ImageNet) seeds by the number of successful adversarial examples found. The overall cost is reduced by as much as 82% (AutoZOOM attack on the normal MNIST model), and for both the AutoZOOM and NES attack methods we see the cost drop by at least one third for all of the attacks on normal models (the improvements for robust models are not significant, which we return to in Section 4.5). The cost drops for two reasons: some candidates transfer directly (which makes the query cost for that seed 1); others do not transfer directly but are useful starting points for the gradient attacks. To further distinguish the two factors, we include the mean query cost for adversarial examples found from nontransferring seeds as the last two columns in Table 3. This reduction is significant for all the attacks across the normal models, up to 76% (AutoZOOM attack on normal MNIST models).
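As a quick check of the arithmetic behind these figures, the sketch below recomputes the cost metric and the two quoted reductions from the Table 3 entries for the AutoZOOM attack on the normal MNIST model. The helper names are hypothetical, and because the table values are rounded the recomputed numbers are approximate.

```python
def queries_per_ae(total_queries, num_successes):
    """Mean number of target-model queries per adversarial example found."""
    if num_successes == 0:
        return float("inf")  # no adversarial example was found
    return total_queries / num_successes


def cost_reduction_pct(base, ours):
    """Relative reduction of `ours` with respect to `base`, in percent."""
    return 100.0 * (base - ours) / base


# AutoZOOM on normal MNIST: 1,495 queries/seed over 1,000 seeds with a
# 90.9% success rate gives roughly the reported 1,645 queries/AE.
baseline_cost = queries_per_ae(1495 * 1000, 909)

# Overall cost: 1,645 -> 298 queries/AE; search-only cost: 3,320 -> 789.
overall_reduction = cost_reduction_pct(1645, 298)
search_reduction = cost_reduction_pct(3320, 789)
```

The overall reduction comes out near 82% and the search-only reduction near 76%, matching the figures quoted in the text.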
Target Model  Transfer Rate (%) Normal3  Transfer Rate (%) Robust2
Normal  63.3  17.0
Robust  9.5  40.9
Attack  Target Model  Hybrid Success (%) Normal3  Hybrid Success (%) Robust2  Cost Reduction (%) Normal3  Cost Reduction (%) Robust2  Fraction Better (%) Normal3  Fraction Better (%) Robust2
AutoZOOM  Normal  98.1  95.6  77.3  68.0  98.7  87.4
AutoZOOM  Robust  65.2  68.7  4.2  21.0  74.4  95.6
NES  Normal  99.9  99.3  68.4  31.4  95.8  80.4
NES  Robust  37.9  45.8  1.1  33.7  85.3  96.8
The hybrid attack also offers success rates higher than the gradient attacks (and much higher success rates than transfer-only attacks), with query cost reduced because of the directly transferable examples and the boosting effect on gradient attacks from nontransferable examples. For the AutoZOOM and NES attacks on normally trained MNIST models, the attack failure rates drop dramatically (from 9.1% to 1.2% for AutoZOOM, and from 23.4% to 11.3% for NES), as does the mean query cost (from 1,645 to 298 for AutoZOOM, and from 3,326 to 1,014 for NES). Even excluding the direct transfers, the saving in queries is significant (from 3,320 to 789 for AutoZOOM, and from 8,210 to 3,316 for NES). The candidate starting points are nearly always better than the original seed. For the two attacks on MNIST, there were at most 22 seeds out of 1,000 where the original seed was a better starting point than the candidate; the worst result is for the AutoZOOM attack against the robust CIFAR10 model, where 256 out of 1,000 of the local candidates are worse starting points than the corresponding original seed.
The results in Table 3 show substantial improvements from hybrid attacks on normal models, but fail to provide improvements against the robust models. The improvements against robust models are less than 5% for both attacks on both targets, except for NES against MNIST where there is a 23% improvement. We speculate that this is due to differences in the vulnerability space between normal and robust models, which means that the candidate adversarial examples found against the normal models in the local ensemble do not provide useful starting points for attacks against a robust model. This is consistent with Tsipras et al.’s finding that robust models for image classification tasks capture key features of images while normal models capture relatively noisy features [42]. Because of the differences in extracted features, adversarial examples against robust models require perturbing key features (of the target domain) while adversarial examples can be found against normal models by perturbing irrelevant features. This would explain why we did not see improvements from the hybrid attack when targeting robust models. To validate our hypothesis on the different attack surfaces, we repeat the experiments on attacking the CIFAR10 robust model but replace the normal local models with robust local models, which are the adversarially trained DenseNet and ResNet models mentioned in Section 4.1. (Footnote 5: We did not repeat the experiments with robust MNIST local models because, without worrying about separately training robust local models, we can simply improve the attack performance significantly by tuning the local models during the hybrid attack process (see Table 6 in Section 4.6). The tuning process transforms the normal local models into more robust ones (details in Section 4.6).)
Table 4 compares the direct transfer rates for adversarial example candidates found using ensembles of normal and robust models against both types of target models. We see that using robust models in the local ensemble increases the direct transfer rate against the robust model from 9.5% to 40.9% (while reducing the transfer rate against the normal target model from 63.3% to 17.0%). We also find that the candidate adversarial examples found using robust local models provide better starting points for gradient blackbox attacks. For example, with the AutoZOOM attack, the mean cost reduction with respect to the baseline mean query cost (2,632) improves significantly (from 4.2% to 21.0%). We also observe a significant increase in the fraction better (the percentage of seeds for which starting from the local adversarial example is better than starting from the original seed) from 74.4% to 95.6%, and a slight increase in the overall success rate of the hybrid attack (from 65.2% to 68.7%). When an ensemble of robust local models is used to attack normal target models, however, the attack efficiency degrades significantly, supporting our hypothesis that robust and normal models have different attack surfaces.
Universal Local Ensemble. The results above validate our hypothesis that the different attack surfaces of robust and normal models cause the ineffectiveness against the robust CIFAR10 model in Table 3. Therefore, to achieve better performance, the attacker should choose the local models according to the target model type. However, in practice, attackers may not know whether the target model is robustly trained, so they cannot predetermine the best local models. We next explore whether a universal local model ensemble exists that works well for both normal and robust target models.
To look for the best local ensemble, we consider all 31 different combinations of the 5 local models (3 normal and 2 robust) and measure their corresponding direct transfer rates against both normal and robust target models. For clarity in presentation, we only present results for the five individual models, an ensemble of the two robust models (i.e., RDenseNet, RResNet), an ensemble of the three normal models (i.e., NA, NB, NC), and the ensemble of all five local models. These reported configurations include the ensembles that have the highest or lowest transfer rates to the target models, and the transfer rates of all other ensembles fall between the reported highest and lowest values. Individual models are selected to measure the impact of each model separately. The ensemble of robust models and the ensemble of normal models are selected to measure the impact of grouping models of the same type (i.e., robust or normal). The ensemble of all five models is selected to check whether a mixture of robust and normal models can work well against both the robust and normal target models. The results are reported in Figure 1.
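The count of 31 follows from enumerating every non-empty subset of the five local models (2^5 - 1). A minimal sketch of that enumeration, using the model names from the text and a hypothetical helper name, is:

```python
from itertools import combinations

# Local model names as used in the text: three normal and two robust models.
LOCAL_MODELS = ["NA", "NB", "NC", "RDenseNet", "RResNet"]


def all_ensembles(models):
    """Enumerate every non-empty subset of the given local models."""
    ensembles = []
    for size in range(1, len(models) + 1):
        ensembles.extend(combinations(models, size))
    return ensembles


ensembles = all_ensembles(LOCAL_MODELS)
num_ensembles = len(ensembles)  # 2**5 - 1 non-empty subsets
```

Each tuple in `ensembles` names one candidate local ensemble whose transfer rate would then be measured against the normal and robust target models.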
None of the ensemble combinations we tested had relatively high direct transfer rates against both normal and robust target models. Ensembles that perform well against robust targets perform poorly against normal targets (e.g., the ensemble of RResNet and RDenseNet has a 40.9% transfer rate to the robust target but only 17.0% to the normal target), and ensembles that perform well against normal targets perform poorly against robust targets (e.g., the ensemble of NA, NB and NC has a 63.3% transfer rate to the normal target but only 9.5% to the robust target). Some ensembles are mediocre against both (e.g., the ensemble of NA and NB, not included in Figure 1, has a 36.0% transfer rate to the normal target and 11.3% to the robust target).
One possible reason for the failure of ensembles to apply to both types of target is that when whitebox attacks are applied to the mixed ensembles, the attacks still “focus” on the normal models, as normal models are easier to attack (i.e., their loss can be decreased significantly). Biasing towards normal models makes the candidate adversarial example less likely to transfer to a robust target model. This conjecture is supported by the observation that although the mixtures of normal and robust models mostly fail against robust target models, they still have reasonable transfer rates to normal target models (e.g., the ensemble of 5 local models has a 63.5% transfer rate to the normal CIFAR10 target model but only a 9.5% transfer rate to the robust target model). It might be interesting to explore whether one can explicitly force the attack to focus more on the robust models when attacking a mixture of normal and robust models.
In practice, attackers can dynamically adapt their local ensemble based on observed results, trying different local ensembles against a particular target for the first set of attempts and measuring their transfer rate, and then selecting the one that worked best for future attacks. This simulation process adds overhead and complexity to the attack, but may still be worthwhile when the transfer success rates vary so much for different local ensembles.
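One way to picture this adaptive selection is an explore-then-commit sketch: try each candidate ensemble on a first set of seeds, record whether each candidate transferred, then keep the ensemble with the best observed rate. The function name and the trial outcomes below are made up for illustration; the text does not specify how many trial seeds to use or how ties are handled.

```python
def pick_ensemble(trial_outcomes):
    """Pick the local ensemble with the highest observed transfer rate.

    trial_outcomes maps an ensemble name to a list of booleans, one per
    trial seed, recording whether its candidate transferred to the target.
    """
    rates = {}
    for name, outcomes in trial_outcomes.items():
        if outcomes:  # ignore ensembles that were never tried
            rates[name] = sum(outcomes) / len(outcomes)
    best = max(rates, key=rates.get)
    return best, rates


# Hypothetical outcomes from ten trial seeds per ensemble.
trials = {
    "Normal3": [True] + [False] * 9,
    "Robust2": [True, True, False, True, False, False, True, False, False, False],
}
best, rates = pick_ensemble(trials)
```

With these made-up outcomes the attacker would commit to the Robust2 ensemble for the remaining seeds, at the price of the queries spent on the trial phase.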
For our subsequent experiments on CIFAR10 models, we use local ensembles of all normal (Normal3: NA, NB, NC) and all robust models (Robust2: RResNet, RDenseNet) as these two give higher transfer rate to normal target and robust target models respectively.
Model  Gradient Attack  Transfer Rate (%) Static  Transfer Rate (%) Tuned
MNIST Normal (T)  AutoZOOM  60.6  64.4
MNIST Normal (T)  NES  60.6  75.7
MNIST Robust (U)  AutoZOOM  3.2  4.0
MNIST Robust (U)  NES  3.2  4.2
CIFAR10 Normal (T)  AutoZOOM  63.3  9.3
CIFAR10 Normal (T)  NES  63.3  34.2
CIFAR10 Robust (U)  AutoZOOM  9.5  8.1
CIFAR10 Robust (U)  NES  9.5  10.8
Model  Gradient Attack  Queries/AE Static  Queries/AE Tuned  Success Rate (%) Static  Success Rate (%) Tuned  Transfer Rate (%) Static  Transfer Rate (%) Tuned
MNIST Normal (T)  AutoZOOM  298  189  98.8  99.3  60.6  75.1
MNIST Normal (T)  NES  1,014  724  88.7  91.9  60.6  76.7
MNIST Robust (U)  AutoZOOM  51,328  43,402  7.3  8.5  3.2  4.7
MNIST Robust (U)  NES  66,775  51,866  5.7  7.8  3.2  5.2
CIFAR10 Normal (T)  AutoZOOM  277  456  98.1  96.5  63.3  18.7
CIFAR10 Normal (T)  NES  347  436  99.9  99.4  63.3  40.7
CIFAR10 Robust (U)  AutoZOOM  2,529  2,586  65.2  64.7  9.5  8.1
CIFAR10 Robust (U)  NES  7,326  7,225  37.9  37.9  9.5  10.8
To test the hypothesis that the labels learned from optimization attacks can be used to tune local models, we measure the impact of tuning on the local models’ transfer rate.
During blackbox gradient attacks, two different types of input-label pairs are generated. One type is produced by adding small magnitudes of random noise to the current image to estimate target model gradients. The other type is generated by perturbing the current image in the direction of the estimated gradients. We only use the latter input-label pairs, as they contain richer information about the target model boundary since the perturbed image moves towards the decision boundary. These byproducts of the blackbox attack search can be used to retrain the local models (line 1 in Algorithm 1). The newly generated image and label pairs are added to the original training set to form the new training set, and the local models are fine-tuned on the new training set. As more images are attacked, the training set size can quickly explode. To avoid this, when the size of the new training set exceeds a certain threshold, we randomly sample a subset of the training data and conduct fine-tuning using the sampled training set. For MNIST and CIFAR10, we set the threshold to the standard training data size (60,000 for MNIST and 50,000 for CIFAR10). At the beginning of the hybrid attack, the training set consists of the original seeds available to the attacker with their ground-truth labels (i.e., the 1,000 seeds for MNIST and CIFAR10 described in Section 4.2).
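The size cap on the tuning set can be sketched as below. This is an illustration under stated assumptions rather than the authors' procedure: the function name is hypothetical, and drawing exactly `threshold` examples when the cap is exceeded is an assumption, since the sample size is not recoverable from the text.

```python
import random


def finetuning_subset(train_set, threshold, sample_size=None, seed=0):
    """Return the examples to fine-tune on for the current update.

    If the accumulated training set is within the threshold, use all of it.
    Otherwise draw a uniform random subset without replacement. Defaulting
    the subset size to the threshold itself is an assumption of this sketch.
    """
    if len(train_set) <= threshold:
        return list(train_set)
    k = threshold if sample_size is None else sample_size
    return random.Random(seed).sample(train_set, k)


# Toy accumulated set of (image_id, label) pairs: seeds plus attack byproducts.
pool = [(i, i % 10) for i in range(120)]
subset = finetuning_subset(pool, threshold=100)
```

Sampling without replacement keeps the cost of each fine-tuning round bounded while still letting older seeds and newer byproducts both appear in the tuning data.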
Algorithm 1 shows the local model being updated after every seed, but considering the computational cost required for tuning, we only update the model periodically. For MNIST, we update the model after every 50 seeds; for CIFAR10, we update after 100 seeds (we were not able to conduct the tuning experiments for the ImageNet models because of the high cost of each attack and of retraining). To check the transferability of the tuned local models, we independently sample 100 unseen images from each of the 10 classes, use the local model ensemble to find candidate adversarial examples, and test the candidate adversarial examples on the blackbox target model to measure the transfer rate.
We first test whether the local model can be fine-tuned by the label byproducts of baseline gradient attacks (Baseline attack + H2) by checking the transfer rate of local models before and after the fine-tuning process. We then test whether the attack efficiency of the hybrid attack can be boosted by fine-tuning local models during the attack process (Baseline attack + H1 + H2) by reporting the average query cost and attack success rate. The first experiment lets us check the applicability of H2 without worrying about possible interactions between H2 and other hypotheses. The second experiment evaluates how much attackers can benefit from fine-tuning the local models in combination with hybrid attacks.
We report the results of the first experiment in Table 5. For the MNIST model, we observe increases in the transfer rate of local models by fine-tuning using the byproducts of both attack methods: the transfer rate increases from 60.6% to 75.7% for NES, and from 60.6% to 64.1% for AutoZOOM. Even against the robust MNIST models, the transfer rate improves from the initial value of 3.2% to 4.0% (AutoZOOM) and 4.2% (NES). However, for the CIFAR10 dataset, we observe a significant decrease in transfer rate. For the normal CIFAR10 target model, the original transfer rate is as high as 63.3%, but with fine-tuning, the transfer rate decreases significantly (to 9.3% and 34.2% for AutoZOOM and NES respectively). A similar trend is also observed for the robust CIFAR10 target model. These results suggest that the examples used in the attacks are less useful as training examples for the CIFAR10 model than the original training set.
Our second experiment, reported in Table 6, combines the model tuning with the hybrid attack. For MNIST models, we observe that the transfer rate again increases significantly when fine-tuning the local models. For the MNIST normal models, the (targeted) transfer rate increases from the original 60.6% to 75.1% and 76.7% for AutoZOOM and NES, respectively. The improved transfer rate is also higher than the results reported in the first experiment. For the AutoZOOM attack, in the first experiment, the transfer rate can only be improved from 60.6% to 64%, while in the second experiment it is improved from 60.6% to 75.1%. Therefore, there might be some boosting effect from taking local AEs as starting points for gradient attacks. For the Madry robust model on MNIST, the low (untargeted) transfer rate improves by a relatively large amount, from the original 3.2% to 4.7% for AutoZOOM and 5.2% for NES (still a low transfer rate, but a 63% relative improvement over the original local model). The local models become more robust during the fine-tuning process. For example, with the NES attack, the local model attack success rate (attack success is defined as compromising all the local models) decreases significantly from the original 96.6% to 25.2%, which indicates the tuned local models are more resistant to the PGD attack. The improvements in transferability, obtained as a free byproduct of the gradient attack, also lead to substantial cost reductions for the attack on MNIST, as seen in Table 6. For example, for the AutoZOOM attack on the MNIST normal model, the mean query cost is reduced by 37%, from 298 to 189, and the attack success rate also increases slightly, from 98.8% for static local models to 99.3% for tuned local models. We observe similar patterns for the robust MNIST model, which demonstrates that Hypothesis 2 also holds on the MNIST dataset.
However, for CIFAR10, we still find no benefits from the tuning. Indeed, the transfer rate decreases, which both reduces the attack success rate and increases the mean query cost (Table 6). We do not have a clear understanding of the reasons the CIFAR10 tuning fails, but speculate it is related to the difficulty of training CIFAR10 models. The results returned from gradient-based attacks are highly similar to a particular seed and may not be diverse enough to train effective local models. This is consistent with Carlini et al.’s findings that MNIST models tend to learn well from outliers (e.g., unnatural images) whereas models for more realistic datasets like CIFAR10 tend to learn well from more prototypical (e.g., natural) examples [7]. Therefore, fine-tuning CIFAR10 models using label byproducts, which are more likely to be outliers, may diminish learning effectiveness. Potential solutions to this problem include tuning the local model with a mixture of normal seeds and attack byproducts. One may also consider keeping some fraction of the model ensemble fixed during the fine-tuning process so that when byproducts mislead the tuning process, these fixed models can mitigate the problem. We leave further exploration of this for future work.
Section 4 evaluates attacks assuming an attacker wants to attack every seed from some fixed set of initial seeds. In more realistic attack scenarios, each query to the model has some cost or risk to the attacker, and the attacker’s goal is to find as many adversarial examples as possible using a limited total number of queries. Carlini et al. show that defenders can identify purposeful queries for adversarial examples based on past queries, and therefore detection risk will increase significantly when many queries are made [11]. We call these attack scenarios batch attacks. To be efficient in these resource-limited settings, attackers should prioritize “easy-to-attack” seeds.
A seed prioritization strategy can easily be incorporated into the hybrid attack algorithm by defining the selectSeed function used in step 1 in Algorithm 1 to return the most promising seed, that is, the remaining seed with the lowest estimated attack cost.
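A minimal sketch of such a selection function is shown below. The Python names and the dictionary of cost estimates are illustrative assumptions; the paper's own selectSeed and EstimatedAttackCost may take different arguments.

```python
def select_seed(remaining_seeds, estimated_attack_cost):
    """Return the remaining seed with the lowest estimated attack cost.

    estimated_attack_cost is any callable mapping a seed to a number,
    for example local PGD steps in the first phase or a target-model
    loss in the second phase.
    """
    return min(remaining_seeds, key=estimated_attack_cost)


# Hypothetical cost estimates for four seeds.
estimates = {"seed_a": 120.0, "seed_b": 15.0, "seed_c": 900.0, "seed_d": 15.5}
remaining = list(estimates)

first = select_seed(remaining, estimates.get)
remaining.remove(first)
second = select_seed(remaining, estimates.get)
```

Calling the function repeatedly on the shrinking set of remaining seeds yields the attack order; swapping the estimator changes the prioritization strategy without touching the rest of the attack loop.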
To clearly present the hybrid attack strategy in the batch setting, we present a twophase strategy: in the first phase, local model information is utilized to find likelytotransfer seeds; in the second phase, target model information is used to select candidates for optimization attacks. This split reduces the generality of the attack, but simplifies our presentation and analysis. Since direct transfers have such low cost (that is, one query when they succeed) compared to the optimization attacks, constraining the attack to try all the transfer candidates first does not compromise efficiency. More advanced attacks might attempt multiple transfer attempts per seed, in which case the decision may be less clear when to switch to an optimization attack. We do not consider such attacks here.
Local Ensemble  Metric  First AE  Top 1%  Top 2%  Top 5% 
Normal3  Transfer (PGDSteps)  
Transfer (Random)  
Robust2  Transfer (PGDSteps)  
Transfer (Random) 
Since the first phase seeks to find direct transfers, it needs to execute without any information from the target model. The goal is to order the seeds by likelihood of finding a direct transfer before any query is done to the model. As before, we do assume the attacker has access to pretrained local models, so can use those models both to find candidates for transfer attacks and to prioritize the seeds.
Within the transfer attack phase, we use a prioritization strategy based on the number of PGDSteps of the local models to predict the transfer likelihood of each image. We explored using other metrics based on local model information such as local model attack loss and local prediction score gap (difference in the prediction confidence of highest and second highest class), but did not find significant differences in the prioritization performance compared to PGDStep. Hence, we only present results using PGDSteps here.
Prioritizing based on PGD Steps. We surmised that the easier it is to find an adversarial example against the local models for a seed, the more likely that seed has a large vulnerability region in the target model. One way to measure this difficulty is the number of PGD steps used to find a successful local adversarial example, and we prioritize seeds that require fewer PGD steps. To be more specific, we first group images by the number of local models their candidates successfully attack, and then prioritize images within each group by the number of PGD steps used to find the adversarial examples that compromise those local models. We prioritize adversarial examples that succeed against more of the local models, on the assumption that adversarial examples that succeed on more local models tend to have a higher chance of transferring to the “unknown” target model. This prioritization strategy combines two metrics: the number of successfully compromised local models and the number of PGD steps. We also independently tested the impact of each of the two metrics, and found that the PGD-step based metric performs better than the number of successfully attacked models, and that our current metric combining both is more stable than using the PGD steps alone.
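The two-level ordering described above amounts to a sort with a compound key: more compromised local models first, then fewer PGD steps. The sketch below assumes a simple tuple record per seed; the record layout, function name, and toy values are hypothetical.

```python
def prioritize_by_pgd_steps(records):
    """Order seeds for the transfer phase.

    Each record is (seed_id, num_local_models_fooled, pgd_steps). Seeds
    whose candidate fools more local models come first; within a group,
    seeds that needed fewer PGD steps come first.
    """
    ranked = sorted(records, key=lambda r: (-r[1], r[2]))
    return [seed_id for seed_id, _, _ in ranked]


# Toy records for four seeds attacked against three local models.
records = [
    ("a", 3, 12),
    ("b", 2, 4),
    ("c", 3, 5),
    ("d", 1, 1),
]
order = prioritize_by_pgd_steps(records)
```

Seed "d" needed only one PGD step but fooled a single local model, so it is tried last; the grouping takes precedence over the step count.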
Results. Our prioritization strategy in the first phase sorts images and each seed is queried once to obtain direct transfers. We compare with the baseline of random selection of seeds where the attacker queries each seed once in random order to show the advantage of prioritizing seeds based on PGDSteps.
Figure 2 shows the results of untargeted attacks on the Madry robust CIFAR10 model for both normal and robust local model ensembles. (Footnote 6: We do not specify the blackbox attacks (e.g., AutoZOOM untargeted attack) here because the attack in the first phase is executed against the local models and is independent of the chosen blackbox attacks (e.g., AutoZOOM and NES) against the target models in the second phase.) All results are averaged over five runs. In all cases, we observe that checking transferability in prioritized order in the first phase is significantly better than checking transferability in random order. More quantitative information is given in Table 7. For the untargeted attack on the robust CIFAR10 model with the three normal local models (NA, NB, NC), when the attacker is interested in obtaining adversarial examples for 1% of the total 1,000 seeds, checking transferability in prioritized order reduces the cost substantially: with prioritization, it takes 20.4 queries on average, compared to 100.8 with random order. We observed similar patterns for other datasets and models.
The transfer attack used in the first phase is query efficient, but has low success rate. Hence, when it does not find enough adversarial examples, the attack continues by attempting the optimization attacks on the remaining images. In this section, we show that the cost of the optimization attacks on these images varies substantially, and then evaluate prioritization strategies to identify lowcost seeds.
Query Cost Variance of Nontransfers.
Figure 3 shows the query distributions of nontransferable images for MNIST, CIFAR10 and ImageNet using the NES attack starting from local adversarial examples (similar patterns are observed for the AutoZOOM attack). For ImageNet, when images are sorted by query cost, the top 10% of the 97 images (excluding 3 direct transfers and 0 failed adversarial examples from the original 100 images) take on average only 1,522 queries, while the mean query cost of all 100 images is 14,828. So, an attacker interested in obtaining adversarial examples for only 10% of the total 100 seeds using this prioritization reduces their cost by 90% compared to targeting seeds randomly. For CIFAR10, the impact is even higher, reducing the mean query cost for obtaining adversarial examples for 10% of the seeds remaining after the transfer phase by approximately 95% (from 933 to 51) over the random ordering.
Prioritization Strategies. These results show the potential cost savings from prioritizing seeds in batch attacks, but to be able to exploit the variance we need a way to identify low-cost seeds in advance. We consider two different strategies for estimating the attack cost to implement the estimator for the EstimatedAttackCost function. The first uses the same local information as adopted in the first phase: low-cost seeds tend to have lower PGD steps in the local attacks. The drawback of prioritizing all seeds based only on local model information is that local models may not produce useful estimates of the cost of attacking the target model. Hence, our second prioritization strategy uses information obtained from the single query to the target model that is made for each seed in the first phase. This query returns a target model prediction score for each seed, which we use to prioritize the remaining seeds in the second phase. Specifically, we find that low-cost seeds tend to have lower loss function values, defined with respect to the target model.
The assumption that an input with a lower loss function value is closer to the attacker’s goal is the same assumption that forms the basis of the optimization attacks.
Taking a targeted attack as an example, we compute the loss similarly to the loss function used in AutoZOOM [43]. For a given input $x$ and target class $t$, the loss is calculated from $f(x)$, where $f(x)$ denotes the prediction score distribution of a seed. So, $f(x)_t$ is the model’s prediction of the probability that $x$ is in class $t$. Similarly, for an untargeted attack with original label $y$, the loss is defined with respect to $f(x)_y$. Here, the input $x$ is the candidate starting point for an optimization attack. Thus, for hybrid attacks that start from a local candidate adversarial example, $x'$, of the original seed $x$, the attack loss is computed with respect to $x'$ instead of $x$. For the baseline attack that starts from the original seed $x$, the loss is computed with respect to $x$.
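To make the target-loss idea concrete, the sketch below uses a hinge on log-probabilities, a form commonly used in optimization attacks of this family. The exact loss is not recoverable from the text, so the formula, the function names, and the probability floor are assumptions of this sketch rather than the paper's definition.

```python
import math


def _log_scores(probs, floor=1e-30):
    # The floor avoids log(0); its value is an arbitrary choice here.
    return [math.log(max(p, floor)) for p in probs]


def targeted_loss(probs, target):
    """Assumed hinge loss: zero once the target class has the top score."""
    logs = _log_scores(probs)
    best_other = max(v for i, v in enumerate(logs) if i != target)
    return max(best_other - logs[target], 0.0)


def untargeted_loss(probs, original):
    """Assumed hinge loss: zero once the original class is no longer top."""
    logs = _log_scores(probs)
    best_other = max(v for i, v in enumerate(logs) if i != original)
    return max(logs[original] - best_other, 0.0)


# Prediction scores returned by the single first-phase query (toy values).
scores = [0.1, 0.7, 0.2]
loss_to_class_0 = targeted_loss(scores, target=0)     # log(0.7 / 0.1)
loss_from_class_1 = untargeted_loss(scores, original=1)  # log(0.7 / 0.2)
```

Under this assumed form, seeds would be attacked in increasing order of loss, since a smaller value means the starting point is already closer to crossing the decision boundary.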
Results. We evaluate the prioritization for the second phase using the same evaluation setups as in Section 5.1. We compare the two prioritization strategies (based on local PGD steps and the target model loss) to random ordering of seeds to evaluate their effectiveness in identifying lowcost seeds. The baseline attacks (AutoZOOM and NES, starting from the original seeds) do not have a first phase transfer stage, so we defer the comparison results to next subsection, which shows performance of the combined twophase attack.
Local Ensemble  Metric  Additional 1%  Additional 2%  Additional 5%  Additional 10%
Figure 4 shows the results for untargeted AutoZOOM attacks on the robust CIFAR10 model using both the normal and robust local ensembles (results for the NES attack are not shown, but exhibit similar patterns). The target model information estimates the attack cost better than the local model information, while both prioritization strategies achieve much better performance than random order. Quantitative results are found in Table 8. For example, for the untargeted AutoZOOM attack on the robust CIFAR10 model with the Normal3 local ensemble, an attacker who wants to obtain ten new adversarial examples (in addition to the 101 direct transfers found in the first phase) can find them using on average 1,248 queries when using target model information in the second phase, compared to 3,465 queries when using only local ensemble information, and 26,336 when using random ordering.
Top 1%  Top 2%  Top 5%  Top 10%
Two-Phase Strategy
Random Scheduling
Two-Phase Strategy
Random Scheduling
Target Model  Prioritization Method  Top 1%  Top 2%  Top 5%  Top 10%  

Retroactive Optimal  
Target Loss based Strategy  
Random Scheduling  

Retroactive Optimal  
Target Loss based Strategy  
Random Scheduling 
To further validate effectiveness of the seed prioritized twophase strategy, we evaluate the full attack combining both phases. Based on our analysis in the previous subsections, we use the best prioritization strategies for each phase: PGDStep in the first phase and target loss value in the second phase. For the baseline attack, we simply adopt the target loss value to prioritize seeds. We evaluate the effectiveness in comparison with two degenerate strategies:
retroactive optimal (oracle) — this strategy is not realizable, but provides an upper bound for the seed prioritization. It assumes the attackers have prior knowledge of the true rank of each seed. That is, we assume an oracle selectSeed function that always returns the best remaining seed.
random — the attacker selects candidate seeds in a random order and conducts optimization attacks exhaustively (until either success or the query limit is reached) on each seed before trying the next one. This represents traditional blackbox attacks that just attack every seed.
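The comparison between these orderings can be pictured with a small simulation: given the per-seed outcome of an exhaustive attack, count the queries spent until a goal number of adversarial examples is reached under a given seed order. The outcome values, function names, and the fixed stand-in for a random order are illustrative assumptions.

```python
def queries_for_goal(outcomes, order, goal):
    """Total queries spent until `goal` adversarial examples are found.

    outcomes[i] is (queries_used, success) for attacking seed i until it
    succeeds or hits the query limit. Returns None if the goal is never met.
    """
    spent, found = 0, 0
    for idx in order:
        queries, success = outcomes[idx]
        spent += queries
        found += int(success)
        if found >= goal:
            return spent
    return None


def oracle_order(outcomes):
    """Retroactive optimal order: successes first, cheapest first."""
    return sorted(range(len(outcomes)),
                  key=lambda i: (not outcomes[i][1], outcomes[i][0]))


# Toy outcomes: two failures at a 4,000-query limit, one direct transfer.
outcomes = [(4000, False), (1, True), (350, True), (4000, False), (20, True)]
arbitrary = [0, 1, 2, 3, 4]  # stands in for one random order

oracle_cost = queries_for_goal(outcomes, oracle_order(outcomes), goal=2)
arbitrary_cost = queries_for_goal(outcomes, arbitrary, goal=2)
```

In this toy case the oracle reaches two adversarial examples in 21 queries while the arbitrary order spends 4,351, because it first exhausts the query limit on a seed that never succeeds.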
Due to space limitations, we only present results here for the untargeted AutoZOOM attacks on the robust CIFAR10 model with normal local models. This is a setting where the performance gain for the hybrid attack is least significant compared to other models (see Table 3), so it represents the most challenging scenario for our attack. We also show results for the targeted AutoZOOM attack on standard ImageNet models. In this setting, with our hybrid attack, the heuristic prioritization strategy performs significantly better than the baseline strategy of random ordering. In contrast, with the baseline attack, the heuristic strategy does not significantly outperform random ordering, so this setting provides an interesting contrast between the hybrid and baseline attacks in the batch attack setting (we show the performance of the baseline attack shortly). All other results of the two blackbox attacks on different datasets, different target models and different local models (only for the CIFAR10 dataset) show similar patterns, and complete results are released along with our open source code.
As shown in Figure 5, the seed-prioritized two-phase strategy approaches the performance of the (unrealizable) retroactive optimal strategy and significantly outperforms the random scheduling strategy. Table 9 shows the number of queries needed to successfully attack the top 1%, top 2%, top 5% and top 10% of the total candidate seeds (1,000 images for CIFAR10 and 100 images for ImageNet). For the robust CIFAR10 target model, we observe that in order to obtain 10 new adversarial examples (1%), our two-phase strategy costs on average only 20 queries (not far off the 10 required by retroactive optimal), while random ordering takes 20,054 queries. Similarly, for attacks on the ImageNet dataset, to obtain the first new adversarial example (1%), our two-phase strategy costs on average 28 queries while random scheduling takes on average 15,046 queries (here, the retroactive optimal strategy takes only a single query since it can always find the direct transfer).
To provide more insight into the impact of seed prioritization, we also present results for the baseline AutoZOOM attack on the robust CIFAR10 target model, following the same target selection as in the hybrid attack. We also show the baseline AutoZOOM attack on the standard ImageNet model, because in this case seed prioritization based on target loss value does not perform significantly better than random scheduling. The results are shown in Figure 6 and Table 10.
For attacks on the robust CIFAR10 model, the target loss strategy performs much better than the random scheduling strategy. For example, to obtain adversarial examples for 1% of the 1,000 seeds, the target loss prioritization strategy costs 1,070 queries on average, while the random strategy consumes 25,005 queries on average, a 96% query saving. We also note that the retroactive optimal strategy is very effective in this case and significantly outperforms the other strategies, taking only 34 queries.
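The savings figure is simple arithmetic on the reported averages, and the same calculation applies to the hybrid-attack numbers quoted earlier. The helper name `query_savings` is hypothetical; the query counts are the ones stated in the text.

```python
def query_savings(strategy_queries: float, random_queries: float) -> float:
    """Fraction of queries saved relative to random scheduling."""
    return 1.0 - strategy_queries / random_queries


# Baseline AutoZOOM, robust CIFAR10, top 1% of seeds (target loss vs. random).
print(f"{query_savings(1070, 25005):.1%}")  # 95.7%, reported as 96%

# Hybrid attack, two-phase strategy vs. random, top 1% of seeds.
print(f"{query_savings(20, 20054):.1%}")    # 99.9% (robust CIFAR10)
print(f"{query_savings(28, 15046):.1%}")    # 99.8% (ImageNet)
```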
Against the ImageNet model, however, the target loss based strategy offers little improvement over random scheduling when the attacker's query budget is large (Figure 6(b)). For the hybrid attack against the ImageNet model (Figure 5(b)), on the other hand, the performance gain of the two-phase strategy over random prioritization remains very significant. We believe the main reason is that the baseline attack starts from the original seeds, which are natural images, and ImageNet models tend to overfit to natural images. As a result, the target loss value computed on these images is less helpful in predicting their actual attack cost, which leads to poor prioritization performance. In contrast, the hybrid attack starts from local adversarial example candidates, which deviate from the natural distribution, so ImageNet models are less likely to overfit to them. The target loss computed on these seeds is therefore better correlated with the true attack cost, and prioritization performance improves significantly.
In this paper, we focus on improving our understanding of blackbox attacks on machine learning classifiers. We propose a hybrid attack strategy that combines recent transfer-based and optimization-based attacks. By evaluating on multiple datasets, we show that our hybrid attack strategy significantly improves on state-of-the-art results in terms of average query cost and attack success rate, and hence provides a more accurate estimate of the cost of blackbox adversaries. We further consider a more practical attack setting, where the attacker has limited resources and aims to find many adversarial examples with a fixed number of queries. We show that a simple seed prioritization strategy can dramatically improve the overall efficiency of hybrid attacks.
Implementations of our attacks and all of our models and datasets are available under an open source license at https://github.com/suyeecav/HybridAttack.
This work was supported by grants from the National Science Foundation (#1619098, #1804603, and #1850479), research awards from Baidu and Intel, and cloud computing grants from Amazon. We also thank the anonymous reviewers for their suggestions.