1 Introduction
The ubiquity of machine learning provides adversaries with both opportunities and incentives to develop strategic approaches to fool learning systems and achieve their malicious goals. Many attack strategies devised so far to generate adversarial examples to fool learning systems have been in the white-box setting, where adversaries are assumed to have access to the learning model (Szegedy et al. (2014); Goodfellow et al. (2015); Carlini & Wagner (2017); Moosavi-Dezfooli et al. (2015)). However, in many realistic settings, adversaries may only have black-box access to the model, i.e. they have no knowledge about the details of the learning system such as its parameters, but they may have query access to the model's predictions on input samples, including class probabilities. For example, we find this to be the case in some popular commercial AI offerings, such as those from IBM, Google and Clarifai. With access to query outputs such as class probabilities, the training loss of the target model can be found, but without access to the entire model, the adversary cannot access the gradients required to carry out white-box attacks.

Most existing black-box attacks on DNNs have focused on transferability-based attacks (Papernot et al. (2016); Moosavi-Dezfooli et al. (2016); Papernot et al. (2017)), where adversarial examples crafted for a local surrogate model can be used to attack the target model to which the adversary has no direct access. The exploration of other black-box attack strategies is thus somewhat lacking so far in the literature. In this paper, we design powerful new black-box attacks using limited query access to learning systems which achieve adversarial success rates close to that of white-box attacks. These black-box attacks help us understand the extent of the threat posed to deployed systems by adversarial samples. The code to reproduce our results can be found at https://github.com/sunblazeucb/blackboxattacks.
New black-box attacks. We propose novel Gradient Estimation attacks on DNNs, where the adversary is only assumed to have query access to the target model. These attacks do not need any access to a representative dataset or any knowledge of the target model architecture. In the Gradient Estimation attacks, the adversary adds perturbations proportional to the estimated gradient, instead of the true gradient as in white-box attacks (Goodfellow et al. (2015); Kurakin et al. (2016)). Since the direct Gradient Estimation attack requires a number of queries on the order of the dimension of the input, we explore strategies for reducing the number of queries to the target model. We also experimented with Simultaneous Perturbation Stochastic Approximation (SPSA) and Particle Swarm Optimization (PSO) as alternative methods to carry out query-based black-box attacks, but found Gradient Estimation to work the best.
Query-reduction strategies. We propose two strategies: random feature grouping and principal component analysis (PCA) based query reduction. In our experiments with the Gradient Estimation attacks on state-of-the-art models on the MNIST (784 dimensions) and CIFAR-10 (3072 dimensions) datasets, we find that they match white-box attack performance, achieving attack success rates up to 90% for single-step attacks in the untargeted case and up to 100% for iterative attacks in both targeted and untargeted cases. We achieve this performance with just 200 to 800 queries per sample for single-step attacks and around 8,000 queries for iterative attacks. This is much fewer than the closest related attack by Chen et al. (2017). While they achieve similar success rates to our attack, the running time of their attack is substantially longer for each adversarial sample (see Section 3.5). A further advantage of the Gradient Estimation attack is that it does not require the adversary to train a local model, which could be an expensive and complex process for real-world datasets, in addition to the fact that training such a local model may require even more queries based on the training data.
Attacking real-world systems. To demonstrate the effectiveness of our Gradient Estimation attacks in the real world, we also carry out a practical black-box attack using these methods against the Not Safe For Work (NSFW) classification and Content Moderation models developed by Clarifai, which we choose due to their socially relevant application. These models have begun to be deployed for real-world moderation (Liu, 2016), which makes such black-box attacks especially pernicious. We carry out these attacks with no knowledge of the training set. We demonstrate successful attacks (Figure 1) with just around 200 queries per image, taking around a minute per image. In Figure 1, the target model classifies the adversarial image as 'safe' with high confidence, in spite of the fact that the content requiring moderation is still clearly visible. We note here that due to the nature of the images we experiment with, we only show one example here, as the others may be offensive to readers. The full set of images can be found at https://sunblazeucb.github.io/blackboxattacks/.
Comparative evaluation of black-box attacks. We carry out a thorough empirical comparison of various black-box attacks (given in Table 7) on both the MNIST and CIFAR-10 datasets. We study attacks that require zero queries to the learning model, including the addition of perturbations that are either random or proportional to the difference of means of the original and targeted classes, as well as various transferability-based black-box attacks. We show that the proposed Gradient Estimation attacks outperform other black-box attacks in terms of attack success rate and achieve results comparable with white-box attacks.
In addition, we evaluate the effectiveness of these attacks on DNNs made more robust using adversarial training (Goodfellow et al., 2015; Szegedy et al., 2014) and its recent variants, including ensemble adversarial training (Tramèr et al., 2017a) and iterative adversarial training (Mądry et al., 2017). We find that although standard and ensemble adversarial training confer some robustness against single-step attacks, they are vulnerable to iterative Gradient Estimation attacks, with adversarial success rates in excess of 70% for both targeted and untargeted attacks. Here too, our methods outperform other black-box attacks and achieve performance comparable to white-box attacks. Iterative adversarial training, however, is quite robust against all the black-box attacks we test.
In summary, our contributions include:

- We propose new Gradient Estimation black-box attacks using queries to the target model, and we investigate two methods to make the number of queries needed independent of the input dimensionality.

- We conduct a thorough evaluation of 10 different black-box attack strategies on state-of-the-art classifiers on the MNIST and CIFAR-10 datasets, and we find that with a small number of queries, our proposed Gradient Estimation attacks outperform transferability-based attacks and achieve attack success rates matching those of white-box attacks.

- We carry out practical black-box attacks on Clarifai's Not Safe For Work (NSFW) classification and Content Moderation models through public APIs, and show that the generated adversarial examples can mislead these models with high confidence.

- Finally, we evaluate these black-box attacks against state-of-the-art adversarial training based defenses on DNNs, and we find that both standard and ensemble adversarial training are not robust against Gradient Estimation attacks. Further, even against iterative adversarial training, our methods outperform transferability-based attacks.
Related Work. Existing black-box attacks that do not use a local model were first proposed for convex-inducing two-class classifiers by Nelson et al. (2012). For malware data, Xu et al. (2016) use genetic algorithms to craft adversarial samples, while Dang et al. (2017) use hill-climbing algorithms. These methods are prohibitively expensive for non-categorical and high-dimensional data such as images. Papernot et al. (2017) proposed using queries to a target model to train a local surrogate model, which was then used to generate adversarial samples. This attack relies on transferability. To the best of our knowledge, the only previous literature on query-based black-box attacks in the deep learning setting is independent work by Narodytska & Kasiviswanathan (2016) and Chen et al. (2017).

Narodytska & Kasiviswanathan (2016) propose a greedy local search to generate adversarial samples by perturbing randomly chosen pixels and using those which have a large impact on the output probabilities. Their method uses 500 queries per iteration, and the greedy local search is run for around 150 iterations for each image, resulting in a total of 75,000 queries per image, which is much higher than any of our attacks. Further, we find that our methods achieve higher targeted and untargeted attack success rates on both MNIST and CIFAR-10 as compared to their method. Chen et al. (2017) propose a black-box attack method named ZOO, which also uses the method of finite differences to estimate the derivative of a function. However, while we propose attacks that compute an adversarial perturbation approximating FGSM and iterative FGS, ZOO approximates the Adam optimizer while trying to perform coordinate descent on the loss function proposed by Carlini & Wagner (2017). Neither of these works demonstrates the effectiveness of their attacks on real-world systems or on state-of-the-art defenses.

2 Background and Evaluation setup
In this section, we first introduce the notation used throughout the paper and then describe the evaluation setup and metrics used in the remainder of the paper. The full set of attacks evaluated is given in Table 7 in Appendix C, which also provides a taxonomy for black-box attacks.
2.1 Notation
A classifier f(·; θ): X → Y is a function mapping from the domain X to the set of classification outputs Y. (Y = {0, 1} in the case of binary classification, i.e. Y is the set of class labels.) The number of possible classification outputs is then |Y|. θ is the set of parameters associated with a classifier. Throughout, the target classifier is denoted as f(·; θ), but the dependence on θ is dropped if it is clear from the context. H denotes the constraint set which an adversarial sample must satisfy. ℓ_f(x, y) is used to represent the loss function for the classifier f with respect to inputs x ∈ X and their true labels y ∈ Y. The loss functions we use are the standard cross-entropy loss (denoted xent) and the logit loss from Carlini & Wagner (2017) (denoted logit). These are described in Section 3.1.

An adversary can generate an adversarial example x_adv from a benign sample x by adding an appropriate perturbation of small magnitude (Szegedy et al., 2014). Such an adversarial example will either cause the classifier to misclassify it into a targeted class (targeted attack), or into any class other than the ground truth class (untargeted attack).

Since the black-box attacks we analyze focus on neural networks in particular, we also define some notation specifically for neural networks. The outputs of the penultimate layer of a neural network f, representing the output of the network computed sequentially over all preceding layers, are known as the logits. We represent the logits as a vector φ(x) ∈ R^{|Y|}. The final layer of a neural network f used for classification is usually a softmax layer, represented as a vector of probabilities p(x) = [p_1(x), …, p_{|Y|}(x)], with Σ_{i=1}^{|Y|} p_i(x) = 1 and p_i(x) = e^{φ_i(x)} / Σ_{j=1}^{|Y|} e^{φ_j(x)}.

2.2 Evaluation setup
The empirical evaluation carried out in Sections 3 and 6 is on state-of-the-art neural networks on the MNIST (LeCun & Cortes, 1998) and CIFAR-10 (Krizhevsky & Hinton, 2009) datasets. The details of the datasets and the architecture and training procedure for all models are given below.
2.2.1 Datasets
MNIST. This is a dataset of images of handwritten digits (LeCun & Cortes, 1998). There are 60,000 training examples and 10,000 test examples. Each image belongs to a single class from 0 to 9. The images have a dimension of 28 × 28 pixels (784 in total) and are grayscale. Each pixel value lies in [0, 1]. The digits are size-normalized and centered. This dataset is commonly used as a 'sanity check' or first-level benchmark for state-of-the-art classifiers. We use this dataset since it has been extensively studied from the attack perspective in previous work.
CIFAR-10. This is a dataset of color images from 10 classes (Krizhevsky & Hinton, 2009). The images belong to 10 mutually exclusive classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck). There are 50,000 training examples and 10,000 test examples, with exactly 6,000 examples in each class. The images have a dimension of 32 × 32 pixels (1,024 per channel) and have 3 channels (Red, Green, and Blue). Each pixel value lies in [0, 255].
2.2.2 Model training details
In this section, we present the architectures and training details for both the normally and adversarially trained variants of the models on both the MNIST and CIFAR-10 datasets. The accuracy of each model on benign data is given in Table 4.
MNIST. Each pixel of the MNIST image data is scaled to [0, 1]. We trained four different models on the MNIST dataset, denoted Models A to D, which are used by Tramèr et al. (2017a) and represent a good variety of architectures. For the attacks constrained with the ℓ∞ distance, we vary the adversary's perturbation budget ε from 0 to 0.4, since at a perturbation budget of 0.5, any image can be made solid gray. The model details for the 4 models trained on the MNIST dataset are as follows:

- Model A (3,382,346 parameters): Conv(64, 5, 5) + Relu, Conv(64, 5, 5) + Relu, Dropout(0.25), FC(128) + Relu, Dropout(0.5), FC + Softmax

- Model B (710,218 parameters): Dropout(0.2), Conv(64, 8, 8) + Relu, Conv(128, 6, 6) + Relu, Conv(128, 5, 5) + Relu, Dropout(0.5), FC + Softmax

- Model C (4,795,082 parameters): Conv(128, 3, 3) + Relu, Conv(64, 3, 3) + Relu, Dropout(0.25), FC(128) + Relu, Dropout(0.5), FC + Softmax

- Model D (509,410 parameters): [FC(300) + Relu, Dropout(0.5)] ×4, FC + Softmax
Models A and C have both convolutional and fully connected layers, and they have the same order of magnitude of parameters. Model B, on the other hand, does not have fully connected layers and has an order of magnitude fewer parameters. Similarly, Model D has no convolutional layers and has fewer parameters than all the other models. Models A, B, and C all achieve greater than 99% classification accuracy on the test data. Model D achieves 97.2% classification accuracy, due to the lack of convolutional layers.
CIFAR-10. Each pixel of the CIFAR-10 image data lies in [0, 255]. We choose three model architectures for this dataset, which we denote as Resnet-32, Resnet-28-10, and Std.CNN. For the attacks constrained with the ℓ∞ distance, we vary the adversary's perturbation budget ε from 0 to 28.

As their names indicate, Resnet-32 and Resnet-28-10 are ResNet variants (He et al., 2016; Zagoruyko & Komodakis, 2016), while Std.CNN is a standard CNN (TensorFlow Authors, b). In particular, Resnet-32 is a standard 32-layer ResNet with no width expansion, and Resnet-28-10 is a wide ResNet with 28 layers and the width set to 10, based on the best-performing ResNet from Zagoruyko & Komodakis (TensorFlow Authors, a). The width indicates the multiplicative factor by which the number of filters in each residual layer is increased. Std.CNN is a standard CNN from TensorFlow (Abadi et al., 2015) (https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10), with two convolutional layers, each followed by a max-pooling and normalization layer, and two fully connected layers, each of which has weight decay.
Resnet-32 is trained for 125,000 steps, Resnet-28-10 is trained for 167,000 steps, and Std.CNN is trained for 100,000 steps on the benign training data. Models Resnet-32 and Resnet-28-10 are much more accurate than Std.CNN. All models were trained with a batch size of 128. The two ResNets achieve close to state-of-the-art accuracy (Benenson, 2016) on the CIFAR-10 test set, with Resnet-32 at 92.4% and Resnet-28-10 at 94.4%. Std.CNN, on the other hand, only achieves an accuracy of 81.4%, reflecting its simple architecture and the complexity of the task.
2.3 Metrics
Throughout the paper, we use standard metrics to characterize the effectiveness of various attack strategies. For MNIST, all metrics for single-step attacks are computed with respect to the test set consisting of 10,000 samples, while metrics for iterative attacks are computed with respect to the first 1,000 samples from the test set. For the CIFAR-10 data, we choose 1,000 random samples from the test set for single-step attacks and 100 random samples for iterative attacks. In our evaluations of targeted attacks, we choose the target class T for each sample uniformly at random from the set of classification outputs, except the true class of that sample.
Attack success rate. The main metric, the attack success rate, is the fraction of samples that meets the adversary's goal: f(x_adv) ≠ y for untargeted attacks and f(x_adv) = T for targeted attacks with target T (Szegedy et al., 2014; Tramèr et al., 2017a). Alternative evaluation metrics are discussed in Appendix A.

Average distortion. We also evaluate the average distortion for adversarial examples using the average ℓ2 distance between the benign samples and the adversarial ones, as suggested by Gu & Rigazio (2014): Δ(X_adv, X) = (1/N) Σ_{i=1}^{N} ‖x_adv^(i) − x^(i)‖_2, where N is the number of samples. This metric allows us to compare the average distortion for attacks which achieve similar attack success rates, and therefore infer which one is stealthier.
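As a concrete illustration, the average-distortion metric can be computed in a few lines of pure Python (a sketch, not the authors' code; the toy 3-dimensional "images" are made up for the example):

```python
import math

def avg_l2_distortion(benign, adversarial):
    """Average L2 distance between benign samples and their adversarial
    counterparts, averaged over all N sample pairs."""
    assert len(benign) == len(adversarial)
    total = 0.0
    for x, x_adv in zip(benign, adversarial):
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(x_adv, x)))
    return total / len(benign)

# Two toy 3-dimensional "images": distortions are 0.3 and 0.4, average 0.35
print(avg_l2_distortion([[0, 0, 0], [1, 1, 1]],
                        [[0.3, 0, 0], [1, 1, 0.6]]))
```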
Number of queries. Query-based black-box attacks make queries to the target model, and this metric may affect the cost of mounting the attack. This is an important consideration when attacking real-world systems which have costs associated with the number of queries made.
3 Query-based attacks: Gradient Estimation attack
Deployed learning systems often provide feedback for input samples provided by the user. Given query feedback, different adaptive, query-based algorithms can be applied by adversaries to understand the system and iteratively generate effective adversarial examples to attack it. Formal definitions of query-based attacks are in Appendix B. We initially explored a number of methods of using query feedback to carry out black-box attacks, including Particle Swarm Optimization (Kennedy, 2011) and Simultaneous Perturbation Stochastic Approximation (Spall, 1992). However, these methods were not effective at finding adversarial examples, for reasons detailed in Section 3.6, which also contains the results obtained.
Given the fact that many white-box attacks for generating adversarial examples are based on gradient information, we then tried directly estimating the gradient to carry out black-box attacks, and found it to be very effective in a range of conditions. In other words, the adversary can approximate white-box single-step and iterative FGSM attacks (Goodfellow et al., 2015; Kurakin et al., 2016) using estimates of the losses that are needed to carry out those attacks. We first propose a Gradient Estimation black-box attack based on the method of finite differences (Spall, 2005). The drawback of a naive implementation of the finite difference method, however, is that it requires 2d queries per input, where d is the dimension of the input. This leads us to explore methods such as random grouping of features and feature combination using components obtained from Principal Component Analysis (PCA) to reduce the number of queries.
Threat model and justification: We assume that the adversary can obtain the vector of output probabilities p(x) for any input x; the set of queries the adversary can make is then {p(x) : x ∈ X}. Note that an adversary with access to the softmax probabilities will be able to recover the logits up to an additive constant, by taking the logarithm of the softmax probabilities. For untargeted attacks, the adversary only needs access to the output probabilities for the two most likely classes.
A compelling reason for assuming this threat model for the adversary is that many existing cloud-based ML services allow users to query trained models (Watson Visual Recognition, Clarifai, Google Vision API). The results of these queries are confidence scores which can be used to carry out Gradient Estimation attacks. These trained models are often deployed by the clients of these ML-as-a-service (MLaaS) providers (Liu, 2016). Thus, an adversary can pose as a user of an MLaaS provider and create adversarial examples using our attack, which can then be used against any client of that provider.
Comparing against existing black-box attacks: In the results presented in this section, we compare our attacks against a number of existing black-box attacks in both the targeted and untargeted cases. Detailed descriptions of these attacks are in Section 6. In particular, we compare against the following attacks that make zero queries to the target model:

- Transferability attack (single local model) (Section 6.3.1) using Fast Gradient Sign (FGS) and Iterative FGS (IFGS) samples generated on a single source model for both loss functions. This is denoted as Transfer model FGS/IFGS-loss; e.g., Transfer Model A FGS-logit.

- Transferability attack (local model ensemble) (Section 6.3.2) using FGS and IFGS samples generated on an ensemble of source models for both loss functions (Transfer models FGS/IFGS-loss); e.g., Transfer Model B, Model C IFGS-logit.
We also compare against white-box attacks; descriptions of these are in Section 6.2 and results are in Appendix D.
3.1 Finite difference method for gradient estimation
In this section, we focus on the method of finite differences to carry out Gradient Estimation based attacks. Let the function whose gradient is being estimated be g(x). The input to the function is a d-dimensional vector x, whose elements are represented as x_i, where i ∈ {1, …, d}. The canonical basis vectors are represented as e_i, where e_i is 1 only in the i-th component and 0 everywhere else. Then, a two-sided estimation of the gradient of g with respect to x is given by

    FD_x(g(x), δ) = [ (g(x + δe_1) − g(x − δe_1)) / (2δ), …, (g(x + δe_d) − g(x − δe_d)) / (2δ) ]ᵀ.    (1)

δ is a free parameter that controls the accuracy of the estimation. A one-sided approximation can also be used, but it will be less accurate (Wright & Nocedal, 1999). If the gradient of the function g exists, then lim_{δ→0} FD_x(g(x), δ) = ∇_x g(x). The finite difference method is useful for a black-box adversary aiming to approximate a gradient based attack, since the gradient can be directly estimated with access to only the function values.
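The two-sided estimate in Eq. 1 is straightforward to implement. A minimal pure-Python sketch (the quadratic test function is only for illustration):

```python
def finite_diff_grad(g, x, delta=1e-4):
    """Two-sided finite-difference estimate of the gradient of g at x,
    as in Eq. 1. Requires 2*d evaluations (queries) of g for a
    d-dimensional input x."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += delta
        x_minus[i] -= delta
        grad.append((g(x_plus) - g(x_minus)) / (2 * delta))
    return grad

# Sanity check on g(x) = x0^2 + 3*x1, whose true gradient at (2, 5) is (4, 3)
g = lambda x: x[0] ** 2 + 3 * x[1]
print(finite_diff_grad(g, [2.0, 5.0]))
```

Note that each coordinate costs two queries, which is what motivates the query-reduction strategies of Section 3.2.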
3.1.1 Approximate FGS with finite differences
In the untargeted FGS method, the gradient is usually taken with respect to the cross-entropy loss between the true label of the input and the softmax probability vector. The cross-entropy loss of a network f at an input x is then ℓ_f(x, y) = −log p_y(x), where y is the index of the original class of the input. The gradient of ℓ_f(x, y) is

    ∇_x ℓ_f(x, y) = −∇_x p_y(x) / p_y(x).    (2)

An adversary with query access to the softmax probabilities then just has to estimate the gradient of p_y(x) and plug it into Eq. 2 to get the estimated gradient of the loss. The adversarial sample thus generated is

    x_adv = x + ε · sign( −FD_x(p_y(x), δ) / p_y(x) ).    (3)

This method of generating adversarial samples is denoted as FD-xent. Targeted black-box adversarial samples generated using the Gradient Estimation method are then

    x_adv = x − ε · sign( −FD_x(p_T(x), δ) / p_T(x) ),    (4)

where T is the target class. The targeted version of the method is denoted as FD-xent-T.
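Putting Eqs. 1–4 together, a single-step Gradient Estimation attack can be sketched as follows (pure Python, not the authors' implementation; the scalar loss_fn interface, the [0, 1] input range, and the toy linear loss in the example are illustrative assumptions):

```python
def fd_fgs(loss_fn, x, eps, delta=1e-4, targeted=False):
    """Single-step FGS with a finite-difference gradient estimate in place
    of the true gradient. loss_fn maps an input to a scalar loss that the
    adversary can compute from query access (e.g. cross-entropy derived
    from the returned softmax probabilities)."""
    grad_est = []
    for i in range(len(x)):  # 2*d queries in total
        xp, xm = list(x), list(x)
        xp[i] += delta
        xm[i] -= delta
        grad_est.append((loss_fn(xp) - loss_fn(xm)) / (2 * delta))
    # Ascend on the loss for untargeted attacks; descend on the loss of
    # the target class for targeted attacks.
    step = -eps if targeted else eps
    x_adv = [xi + step * (1.0 if gi > 0 else -1.0 if gi < 0 else 0.0)
             for xi, gi in zip(x, grad_est)]
    return [min(1.0, max(0.0, xi)) for xi in x_adv]  # stay in [0, 1]

# Toy linear "loss": the untargeted attack moves up the loss gradient
loss = lambda x: 2 * x[0] - x[1]
print(fd_fgs(loss, [0.5, 0.5], eps=0.1))
```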
3.1.2 Estimating the logit-based loss
We also use a loss function based on logits which was found to work well for white-box attacks by Carlini & Wagner (2017). The loss function is given by

    ℓ(x, y) = max( φ(x)_y − max{ φ(x)_i : i ≠ y }, −κ ),    (5)

where y represents the ground truth label for the benign sample x and φ(·) are the logits. κ is a confidence parameter that can be adjusted to control the strength of the adversarial perturbation. If the confidence parameter κ is set to 0, the logit loss is max(φ(x)_y − max{φ(x)_i : i ≠ y}, 0). For an input that is correctly classified, the first term is always greater than 0, and for an incorrectly classified input, an untargeted attack is not meaningful to carry out. Thus, the loss term reduces to φ(x)_y − max{φ(x)_i : i ≠ y} for relevant inputs.

An adversary can compute the logit values up to an additive constant by taking the logarithm of the softmax probabilities, which are assumed to be available in this threat model. Since the loss function is equal to the difference of logits, the additive constant is canceled out. Then, the finite differences method can be used to estimate the difference between the logit values for the original class y and the second most likely class y′, i.e., the one given by y′ = argmax_{i ≠ y} φ(x)_i. The untargeted adversarial sample generated for this loss in the white-box case is x_adv = x + ε · sign(∇_x(φ(x)_{y′} − φ(x)_y)). Similarly, in the case of a black-box adversary with query access to the softmax probabilities, the adversarial sample is

    x_adv = x + ε · sign( FD_x( log p_{y′}(x) − log p_y(x), δ ) ).    (6)

Similarly, a targeted adversarial sample is

    x_adv = x + ε · sign( FD_x( log p_T(x) − max{ log p_i(x) : i ≠ T }, δ ) ).    (7)

The untargeted attack method is denoted as FD-logit, while the targeted version is denoted as FD-logit-T.
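Recovering the logit loss from softmax probabilities only needs a logarithm, since the shared additive constant cancels in the difference. A small sketch (assuming κ = 0; the example probability vectors are made up):

```python
import math

def logit_loss_from_probs(probs, y):
    """Logit loss (kappa = 0) recovered from query access to softmax
    probabilities: log(p_i) equals the logit phi_i up to an additive
    constant shared by all classes, which cancels in the difference."""
    log_probs = [math.log(p) for p in probs]
    runner_up = max(lp for i, lp in enumerate(log_probs) if i != y)
    return max(log_probs[y] - runner_up, 0.0)

# Correctly classified input: positive loss (margin of class 0 over class 1)
print(logit_loss_from_probs([0.7, 0.2, 0.1], y=0))
# Misclassified input: the loss is clipped to 0
print(logit_loss_from_probs([0.2, 0.7, 0.1], y=0))
```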
Table 1: Untargeted attack success rates (%) on MNIST and CIFAR-10, with average distortion in parentheses. Baseline columns are the zero-query difference-of-means (D. of M.) and random-perturbation (Rand.) attacks; FD/IFD columns are Gradient Estimation attacks using finite differences; FGS/IFGS columns are transferability attacks from Model B (MNIST) or Resnet-28-10 (CIFAR-10). Transfer entries are blank for the source model itself.

| MNIST model | D. of M. | Rand. | FD-xent | FD-logit | IFD-xent | IFD-logit | FGS-xent | FGS-logit | IFGS-xent | IFGS-logit |
|---|---|---|---|---|---|---|---|---|---|---|
| A | 44.8 (5.6) | 8.5 (6.1) | 51.6 (3.3) | 92.9 (6.1) | 75.0 (3.6) | 100.0 (2.1) | 66.3 (6.2) | 80.8 (6.3) | 89.8 (4.75) | 88.5 (4.75) |
| B | 81.5 (5.6) | 7.8 (6.1) | 69.2 (4.5) | 98.9 (6.3) | 86.7 (3.9) | 100.0 (1.6) | – | – | – | – |
| C | 20.2 (5.6) | 4.1 (6.1) | 60.5 (3.8) | 86.1 (6.2) | 80.2 (4.5) | 100.0 (2.2) | 49.5 (6.2) | 57.0 (6.3) | 79.5 (4.75) | 78.7 (4.75) |
| D | 97.1 (5.6) | 38.5 (6.1) | 95.4 (5.8) | 100.0 (6.1) | 98.4 (5.4) | 100.0 (1.2) | 76.3 (6.2) | 87.6 (6.3) | 73.3 (4.75) | 71.4 (4.75) |

| CIFAR-10 model | D. of M. | Rand. | FD-xent | FD-logit | IFD-xent | IFD-logit | FGS-xent | FGS-logit | IFGS-xent | IFGS-logit |
|---|---|---|---|---|---|---|---|---|---|---|
| Resnet-32 | 9.3 (440.5) | 19.4 (439.4) | 49.1 (217.1) | 86.0 (410.3) | 62.0 (149.9) | 100.0 (65.7) | 74.5 (439.4) | 76.6 (439.4) | 99.0 (275.4) | 98.9 (275.6) |
| Resnet-28-10 | 6.7 (440.5) | 17.1 (439.4) | 50.1 (214.8) | 88.2 (421.6) | 46.0 (120.4) | 100.0 (74.9) | – | – | – | – |
| Std.CNN | 20.3 (440.5) | 22.2 (439.4) | 80.0 (341.3) | 98.9 (360.9) | 66.0 (202.5) | 100.0 (79.9) | 37.4 (439.4) | 37.7 (439.4) | 33.7 (275.4) | 33.6 (275.6) |
Table 2: Targeted attack success rates (%) on MNIST and CIFAR-10, with average distortion in parentheses. FD/IFD columns are Gradient Estimation attacks using finite differences; FGS/IFGS columns are transferability attacks from Model B (MNIST) or Resnet-28-10 (CIFAR-10). Transfer entries are blank for the source model itself.

| MNIST model | D. of M. | FD-xent | FD-logit | IFD-xent | IFD-logit | FGS-xent | FGS-logit | IFGS-xent | IFGS-logit |
|---|---|---|---|---|---|---|---|---|---|
| A | 15.0 (5.6) | 30.0 (6.0) | 29.9 (6.1) | 100.0 (4.2) | 99.7 (2.7) | 18.3 (6.3) | 18.1 (6.3) | 54.5 (4.6) | 46.5 (4.2) |
| B | 35.5 (5.6) | 29.5 (6.3) | 29.3 (6.3) | 99.9 (4.1) | 98.7 (2.4) | – | – | – | – |
| C | 5.84 (5.6) | 34.1 (6.1) | 33.8 (6.4) | 100.0 (4.3) | 99.8 (3.0) | 14.0 (6.3) | 13.8 (6.3) | 34.0 (4.6) | 26.1 (4.2) |
| D | 59.8 (5.6) | 61.4 (6.3) | 60.8 (6.3) | 100.0 (3.7) | 99.9 (1.9) | 16.8 (6.3) | 16.7 (6.3) | 36.4 (4.6) | 32.8 (4.1) |

| CIFAR-10 model | D. of M. | FD-xent | FD-logit | IFD-xent | IFD-logit | FGS-xent | FGS-logit | IFGS-xent | IFGS-logit |
|---|---|---|---|---|---|---|---|---|---|
| Resnet-32 | 1.2 (440.3) | 23.8 (439.5) | 23.0 (437.0) | 100.0 (110.9) | 100.0 (89.5) | 15.8 (439.4) | 15.5 (439.4) | 71.8 (222.5) | 80.3 (242.6) |
| Resnet-28-10 | 0.9 (440.3) | 29.2 (439.4) | 28.0 (436.1) | 100.0 (123.2) | 100.0 (98.3) | – | – | – | – |
| Std.CNN | 2.6 (440.3) | 44.5 (439.5) | 40.3 (434.9) | 99.0 (178.8) | 95.0 (126.8) | 5.6 (439.4) | 5.6 (439.4) | 5.1 (222.5) | 5.9 (242.6) |
3.1.3 Iterative attacks with estimated gradients
The iterative variant of the gradient based attack described in Section 6.2 is a powerful attack that often achieves much higher attack success rates in the white-box setting than the simple single-step gradient based attacks. Thus, it stands to reason that a version of the iterative attack with estimated gradients will also perform better than the single-step attacks described until now. An iterative attack with t + 1 iterations using the cross-entropy loss is:

    x_adv^{t+1} = Π_H( x_adv^t + α · sign( −FD_{x_adv^t}( p_y(x_adv^t), δ ) / p_y(x_adv^t) ) ),    (8)

where α is the step size, H is the constraint set for the adversarial sample, and Π_H denotes projection onto H. This attack is denoted as IFD-xent. If the logit loss is used instead, it is denoted as IFD-logit.
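The iterative attack can be sketched as follows (pure Python; the ℓ∞ projection onto the ε-ball around the original input and the [0, 1] clipping are assumptions matching this paper's setup, and loss_fn is the query-derived loss):

```python
def iterative_fd_attack(loss_fn, x, eps, alpha, steps, delta=1e-4):
    """Iterative Gradient Estimation attack: repeated small FGS-style steps
    of size alpha using finite-difference gradients, projecting back onto
    the L-infinity ball of radius eps around the original input after
    every step (the projection Pi_H in Eq. 8)."""
    x0 = list(x)
    x_adv = list(x)
    for _ in range(steps):
        grad = []
        for i in range(len(x_adv)):  # 2*d queries per iteration
            xp, xm = list(x_adv), list(x_adv)
            xp[i] += delta
            xm[i] -= delta
            grad.append((loss_fn(xp) - loss_fn(xm)) / (2 * delta))
        for i in range(len(x_adv)):
            step = alpha if grad[i] > 0 else -alpha if grad[i] < 0 else 0.0
            xi = x_adv[i] + step
            xi = min(x0[i] + eps, max(x0[i] - eps, xi))  # project onto eps-ball
            x_adv[i] = min(1.0, max(0.0, xi))            # keep a valid image
    return x_adv

# With loss = sum(x), each feature climbs until it hits the eps boundary
print(iterative_fd_attack(sum, [0.5, 0.5], eps=0.3, alpha=0.1, steps=5))
```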
3.1.4 Evaluation of Gradient Estimation using Finite Differences
In this section, we summarize the results obtained using Gradient Estimation attacks with Finite Differences and describe the parameter choices made.
FD-logit and IFD-logit match white-box attack adversarial success rates: The Gradient Estimation attack with Finite Differences (FD-logit) is the most successful untargeted single-step black-box attack for MNIST and CIFAR-10 models. It significantly outperforms transferability-based attacks (Table 1) and closely tracks white-box FGS with a logit loss (WB FGS-logit) on MNIST and CIFAR-10 (Figure 2). For adversarial samples generated iteratively, the Iterative Gradient Estimation attack with Finite Differences (IFD-logit) achieves a 100% adversarial success rate across all models on both datasets (Table 1). We used ε = 0.3 for the MNIST dataset and ε = 8 for the CIFAR-10 dataset. The average distortion for both FD-logit and IFD-logit closely matches that of their white-box counterparts, FGS-logit and IFGS-logit, as given in Table 8.
FD-T and IFD-T achieve the highest adversarial success rates in the targeted setting: For targeted black-box attacks, IFD-xent-T achieves 100% adversarial success rates on almost all models, as shown by the results in Table 2. While FD-xent-T only achieves about 30% adversarial success rates, this matches the performance of single-step white-box attacks such as FGS-xent-T and FGS-logit-T (Table 9). The average distortion for samples generated using gradient estimation methods is similar to that of white-box attacks.
Parameter choices: On both datasets, we use a larger value of the finite-difference parameter δ for FD-xent and IFD-xent than for FD-logit and IFD-logit; we find that a larger δ is needed for xent-loss-based attacks to work. The reason for this is that the probability values used in the xent loss are not as sensitive to changes as the logit values, and thus the gradient cannot be estimated, since the function value does not change at all when a single pixel is perturbed. For the Iterative Gradient Estimation attacks using Finite Differences, we use 40 iterations for the MNIST results and 10 iterations for CIFAR-10 throughout, with a fixed step size α for each dataset. The same parameters are used for the white-box Iterative FGS attack results given in Appendix D. This translates to 62,720 queries for MNIST (40 iterations) and 61,440 queries for CIFAR-10 (10 iterations) per sample. We find these choices work well, and they keep the running time of the Gradient Estimation attacks at a manageable level. However, we find that we can achieve similar adversarial success rates with far fewer queries using the query-reduction methods described in the next section.
3.2 Query reduction
The major drawback of the approximation-based black-box attacks is that the number of queries needed per adversarial sample is large. For an input with dimension d, the number of queries will be exactly 2d for a two-sided approximation. This may be too large when the input is high-dimensional. So we examine two techniques in order to reduce the number of queries the adversary has to make. Both techniques involve estimating the gradient for groups of features, instead of estimating it one feature at a time.

The justification for the use of feature grouping comes from the relation between gradients and directional derivatives (Hildebrand, 1962) for differentiable functions. The directional derivative of a function g along a direction v at x is defined as ∇_v g(x) = lim_{h→0} (g(x + hv) − g(x)) / h. It is a generalization of a partial derivative. For differentiable functions, ∇_v g(x) = ⟨∇_x g(x), v⟩, which implies that the directional derivative is just the projection of the gradient along the direction v. Thus, estimating the gradient by grouping features is equivalent to estimating an approximation of the gradient constructed by projecting it along appropriately chosen directions. The estimated gradient of any function g can be computed using the techniques below, and then plugged in to Equations 3 and 6 instead of the finite difference term to create an adversarial sample. Next, we introduce the techniques applied to group the features for estimation.
3.2.1 Query reduction based on random grouping
The simplest way to group features is to choose, without replacement, a random set of features. The gradient can then be estimated simultaneously for all features in the set. If the size of the set chosen is $k$, then the number of queries the adversary has to make is $2 \lceil \frac{d}{k} \rceil$. When $k = 1$, this reduces to the case where the partial derivative with respect to every feature is found, as in Section 3.1. In each iteration of Algorithm 1, there is a set of indices $S$ according to which $v$ is determined, with $v_i \neq 0$ if and only if $i \in S$. Thus, the directional derivative being estimated is $\sum_{i \in S} v_i \frac{\partial f(x)}{\partial x_i}$, which for equal weights is an average of partial derivatives. Thus, the quantity being estimated is not the gradient itself, but an index-wise averaged version of it.
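The random grouping scheme can be sketched as follows, again assuming a black-box loss callable `f`. The choice `v_i = 1/k` on each group (so each group's two queries estimate an average of partial derivatives) follows the description above; the helper name is hypothetical:

```python
import numpy as np

def grouped_gradient_estimate(f, x, k, delta=1e-2, rng=None):
    """Estimate an index-wise averaged gradient by querying f along random
    feature groups of size k: 2*ceil(d/k) queries instead of 2d."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.size
    perm = rng.permutation(d)            # features drawn without replacement
    grad = np.zeros(d)
    for start in range(0, d, k):
        idx = perm[start:start + k]
        v = np.zeros(d)
        v[idx] = 1.0 / len(idx)          # average of partials over the group
        dd = (f(x + delta * v) - f(x - delta * v)) / (2.0 * delta)
        grad[idx] = dd                   # the group shares one estimate
    return grad
```

With `k = 1` this recovers the full coordinate-wise estimate; larger `k` trades accuracy for queries.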
3.2.2 Query reduction using PCA components
A more principled way to reduce the number of queries the adversary has to make to estimate the gradient is to compute directional derivatives along the principal components, as determined by principal component analysis (PCA) (Shlens, 2014). This requires the adversary to have access to a set of data that is representative of the training data. PCA minimizes reconstruction error in terms of the $L_2$ norm; i.e., it provides a basis in which the Euclidean distance from the original sample to a sample reconstructed using a subset of the basis vectors is smallest.
Concretely, let the samples the adversary wants to misclassify be column vectors $x_i \in \mathbb{R}^d$ for $i \in \{1, \ldots, n\}$, and let $X$ be the $d \times n$ matrix of centered data samples (i.e., $X = [\tilde{x}_1 \cdots \tilde{x}_n]$, where $\tilde{x}_i = x_i - \bar{x}$ and $\bar{x} = \frac{1}{n} \sum_{j=1}^{n} x_j$). The principal components of $X$ are the normalized eigenvectors of its sample covariance matrix $C = \frac{1}{n} X X^{\top}$. Since $C$ is a positive semidefinite matrix, there is a decomposition $C = U \Lambda U^{\top}$, where $U$ is an orthogonal matrix, $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$, and $\lambda_1 \geq \cdots \geq \lambda_d \geq 0$. Thus, $U$ in Algorithm 2 is the matrix whose columns are unit eigenvectors of $C$, and the eigenvalue $\lambda_i$ is the variance of $X$ along the $i$th component. In Algorithm 2, $U_k$ is the matrix whose columns are the top-$k$ principal components $u_i$, $i \in [k]$. The quantity being estimated in Algorithm 2 is an approximation of the gradient in the PCA basis:
$$\nabla_x f(x) \approx \sum_{i=1}^{k} \left( \nabla_x f(x)^{\top} u_i \right) u_i,$$
where the sum on the right represents an approximation of the true gradient by its projections along the top-$k$ principal components. In Algorithm 2, the weights of the representation in the PCA basis are approximated using the approximate directional derivatives along the principal components.
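A minimal sketch of the PCA-based estimator, under the assumption that the adversary holds a representative data matrix `X` (samples as columns) and queries a black-box loss `f`; the function names are illustrative and Algorithm 2's details are simplified:

```python
import numpy as np

def pca_components(X, k):
    """Top-k principal components (unit eigenvectors of the sample
    covariance) of data matrix X with samples as columns (d x n)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    C = (Xc @ Xc.T) / X.shape[1]
    eigvals, eigvecs = np.linalg.eigh(C)   # ascending eigenvalue order
    return eigvecs[:, ::-1][:, :k]         # d x k, sorted by variance

def pca_gradient_estimate(f, x, U_k, delta=1e-2):
    """Approximate the gradient by its projection onto the top-k PCA
    directions, using 2k queries: sum_i (deriv along u_i) * u_i."""
    grad = np.zeros_like(x)
    for i in range(U_k.shape[1]):
        u = U_k[:, i]
        dd = (f(x + delta * u) - f(x - delta * u)) / (2.0 * delta)
        grad += dd * u
    return grad
```

When `k = d` the components form a full orthonormal basis, so for a linear loss the estimate recovers the exact gradient; smaller `k` trades fidelity for queries.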
3.3 Iterative attacks with query reduction
Performing an iterative attack with the gradient estimated using the finite-difference method (Equation 1) could be expensive for an adversary, needing $2dT$ queries to the target model for $T$ iterations with the two-sided finite-difference estimation of the gradient. To lower the number of queries needed, the adversary can use either of the query reduction techniques described above, reducing the number of queries to $2 \lceil \frac{d}{k} \rceil T$ with random grouping (group size $k$) or $2kT$ with PCA ($k$ principal components). These attacks using the cross-entropy loss are denoted as IGE-QR (RG, xent) for the random grouping technique and IGE-QR (PCA, xent) for the PCA-based technique.
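An iterative attack with a pluggable gradient estimator can be sketched as follows; `grad_est` stands in for any of the estimators above (random grouping or PCA), the step uses the sign of the estimated gradient, and the clipping step projects onto the $L_\infty$ ball. All names are illustrative:

```python
import numpy as np

def iterative_grad_est_attack(loss, x, eps, alpha, steps, grad_est):
    """Iterative FGS-style attack with an estimated gradient: ascend the
    loss, clipping to the eps-ball around x after each step."""
    x_adv = x.copy()
    for _ in range(steps):
        g = grad_est(loss, x_adv)                 # e.g. grouped or PCA estimator
        x_adv = x_adv + alpha * np.sign(g)        # signed gradient step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project onto L_inf ball
    return x_adv
```

The total query budget is then `steps` times the per-estimate cost of whichever estimator is plugged in.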
3.3.1 Evaluation of Gradient Estimation attacks with query reduction
In this section, we first summarize the results obtained using Gradient Estimation attacks with query reduction and then provide a more detailed analysis of the effect of dimension on attacks with query reduction.
Gradient estimation with query reduction maintains high attack success rates: For both datasets, the Gradient Estimation attack with PCA-based query reduction (GE-QR (PCA, logit)) is effective, with performance close to that of FD-logit on both MNIST (Figure (a)) and CIFAR-10 (Figure (b)). The Iterative Gradient Estimation attacks with both random grouping and PCA-based query reduction (IGE-QR (RG, logit) and IGE-QR (PCA, logit)) achieve close to 100% success rates for untargeted attacks and above 80% for targeted attacks on Model A on MNIST and Resnet-32 on CIFAR-10 (Figure 5). Figure 5 clearly shows the effectiveness of the Gradient Estimation attack across models, datasets, and adversarial goals. While random grouping is not as effective as the PCA-based method for single-step attacks, it is as effective for iterative attacks.
Effect of dimension on Gradient Estimation attacks: We consider the effectiveness of Gradient Estimation with random-grouping-based query reduction and the logit loss (GE-QR (RG, logit)) on Model A on MNIST in Figure (a), where $k$ is the number of indices chosen in each iteration of Algorithm 1. As $k$ increases and the number of groups decreases, we expect attack success to decrease, since gradients over larger groups of features are averaged. This is the effect we see in Figure (a), where the adversarial success rate drops from 93% to 63% as $k$ increases from 1 to 7. Grouping with $k = 7$ translates to 112 queries per MNIST image, down from 784. Thus, in order to achieve high adversarial success rates with the random grouping method, larger perturbation magnitudes are needed.
On the other hand, the PCA-based approach GE-QR (PCA, logit) is much more effective, as can be seen in Figure (b). Using 100 principal components to estimate the gradient for Model A on MNIST as in Algorithm 2, the adversarial success rate is 88.09%, compared to 92.9% without any query reduction. Similarly, using 400 principal components for Resnet-32 on CIFAR-10 (Figure (c)), an adversarial success rate of 66.9% can be achieved; at larger perturbation magnitudes, the adversarial success rate rises to 80.1%.
Remarks: While decreasing the number of queries does reduce attack success somewhat, the proposed query reduction methods maintain high success rates.
3.4 Adversarial samples
In Figure 4, we show some examples of successful untargeted adversarial samples against Model A on MNIST and Resnet-32 on CIFAR-10, generated under an $L_\infty$ constraint for each dataset. Clearly, the amount of perturbation added by iterative attacks is much smaller, barely being visible in the images.
3.5 Efficiency of gradient estimation attacks
In our evaluations, all models were run on a GPU with a batch size of 100. On Model A on MNIST data, the single-step attacks FD-xent and FD-logit take well under a second per sample, so these attacks can be carried out on the entire MNIST test set of 10,000 images in about 10 minutes. For iterative attacks with no query reduction, with 40 iterations per sample (α set to 0.01), IFD-xent, IFD-xent-T, IFD-logit, and IFD-logit-T each take a few seconds per sample. With query reduction, using IGE-QR (PCA, logit), the time taken is just 0.5 seconds per sample.
For Resnet-32 on the CIFAR-10 dataset, FD-xent, FD-xent-T, FD-logit, and FD-logit-T all take roughly 3 s per sample. The iterative variants of these attacks with 10 iterations (α set to 1.0) take roughly 30 s per sample. Using query reduction, IGE-QR (PCA, logit) with 10 iterations takes just 5 s per sample. The time required per sample increases with the complexity of the network, which is observed even for white-box attacks.
All the above numbers are for the case when queries are not made in parallel. Our attack algorithm allows for queries to be made in parallel as well. We find that a simple parallelization of the queries gives us a speedup. The limiting factor is the fact that the model is loaded on a single GPU, which implies that the current setup is not fully optimized to take advantage of the inherently parallel nature of our attack. With further optimization, greater speedups can be achieved.
Remarks: Overall, our attacks are very efficient and allow an adversary to generate a large number of adversarial samples in a short period of time.
Table 3: Comparative evaluation of query-based black-box attacks on the MNIST dataset (attack success rate in %, average distortion in parentheses).

| Query-based attack | Attack success | No. of queries | Time per sample (s) |
|---|---|---|---|
| Finite Diff. | 92.9 (6.1) | 1568 | |
| Gradient Estimation (RG-8) | 61.5 (6.0) | 196 | |
| Iter. Finite Diff. | 100.0 (2.1) | 62720 | 3.5 |
| Iter. Gradient Estimation (RG-8) | 98.4 (1.9) | 8000 | 0.43 |
| Particle Swarm Optimization | 84.1 (5.3) | 10000 | 21.2 |
| SPSA | 96.7 (3.9) | 8000 | 1.25 |
3.6 Other querybased attacks
We experimented with Particle Swarm Optimization (PSO), a commonly used evolutionary optimization strategy (using freely available code from http://pythonhosted.org/pyswarm/), to construct adversarial samples as was done by Sharif et al. (2016), but found it to be prohibitively slow for a large dataset, and it was unable to achieve high adversarial success rates even on the MNIST dataset. We also tried the Simultaneous Perturbation Stochastic Approximation (SPSA) method, which is similar to the method of Finite Differences but estimates the gradient of the loss along a random direction at each step, instead of along the canonical basis vectors. While each step of SPSA requires only 2 queries to the target model, a large number of steps are nevertheless required to generate adversarial samples; a single step of SPSA does not reliably produce them. The two main disadvantages of this method are that (i) in practice, the convergence of SPSA is much more sensitive to the choice of both δ (the gradient estimation step size) and α (the loss minimization step size), and (ii) even with the same number of queries as the Gradient Estimation attacks, the attack success rate is lower while the distortion is higher.
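For comparison, a single SPSA gradient estimate can be sketched as below: one random Rademacher direction, two queries. Since the components of `v` are ±1, multiplying the two-sided difference by `v` is equivalent to the usual component-wise division by `v`; the function name is illustrative:

```python
import numpy as np

def spsa_gradient(f, x, delta=1e-2, rng=None):
    """One SPSA estimate: perturb along a single random Rademacher
    direction v, costing just 2 queries to f."""
    rng = np.random.default_rng() if rng is None else rng
    v = rng.choice([-1.0, 1.0], size=x.shape)   # random +/-1 direction
    return (f(x + delta * v) - f(x - delta * v)) / (2.0 * delta) * v
```

The estimate is unbiased in expectation over `v`, but any single draw is noisy, which is why many SPSA steps are needed in practice.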
A comparative evaluation of all the querybased blackbox attacks we experimented with for the MNIST dataset is given in Table 3. The PSO based attack uses class probabilities to define the loss function, as it was found to work better than the logit loss in our experiments. The attack that achieves the best tradeoff between speed and attack success is IGEQR (RG, logit).
4 Attacking defenses
In this section, we evaluate blackbox attacks against different defenses based on adversarial training and its variants. We focus on adversarial training based defenses as they aim to directly improve the robustness of DNNs, and are among the most effective defenses demonstrated so far in the literature.
We find that Iterative Gradient Estimation attacks perform much better than any single-step black-box attack against defenses. Nevertheless, in Figure 6, we show that with the addition of an initial random perturbation to overcome "gradient masking" (Tramèr et al., 2017a), the Gradient Estimation attack with Finite Differences is the most effective single-step black-box attack on adversarially trained models on MNIST.
Table 4: Accuracy (%) on benign test data of models trained with the various adversarial training strategies.

| Dataset (Model) | Benign | Adv | Adv-Ens | Adv-Iter |
|---|---|---|---|---|
| MNIST (A) | 99.2 | 99.4 | 99.2 | 99.3 |
| CIFAR-10 (Resnet-32) | 92.4 | 92.1 | 91.7 | 79.1 |
4.1 Background and evaluation setup
4.1.1 Adversarial training
Szegedy et al. (2014) and Goodfellow et al. (2015) introduced the concept of adversarial training, where the standard loss function for a neural network is modified as follows:
$$\tilde{\ell}_f(x, y) = \lambda\, \ell_f(x, y) + (1 - \lambda)\, \ell_f(x_{\mathrm{adv}}, y), \tag{9}$$
where $y$ is the true label of the sample $x$. The underlying objective of this modification is to make neural networks more robust by penalizing them during training to account for adversarial samples. During training, the adversarial samples are computed with respect to the current state of the network using an appropriate method such as FGSM.
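A sketch of the modified loss of Eq. 9, with the FGSM sample built from the gradient of the current model's loss with respect to the input (passed in as `grad_x`, since this sketch has no autodiff); the weight `lam` and all names are illustrative:

```python
import numpy as np

def fgsm(grad_x, eps):
    """FGSM perturbation from the loss gradient w.r.t. the input."""
    return eps * np.sign(grad_x)

def adv_training_loss(loss, x, y, grad_x, eps=0.3, lam=0.5):
    """Adversarial-training loss (Eq. 9 sketch): a lam-weighted mix of the
    benign loss and the loss on an FGSM sample from the current model."""
    x_adv = x + fgsm(grad_x, eps)
    return lam * loss(x, y) + (1.0 - lam) * loss(x_adv, y)
```

In a real training loop, `grad_x` would be recomputed per minibatch so the adversarial samples track the current network state.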
Ensemble adversarial training. Tramèr et al. (2017a) proposed an extension of the adversarial training paradigm which is called ensemble adversarial training. As the name suggests, in ensemble adversarial training, the network is trained with adversarial samples from multiple networks.
Iterative adversarial training. A further modification of the adversarial training paradigm proposes training with adversarial samples generated using iterative methods such as the iterative FGSM attack described earlier (Mądry et al., 2017).
Attack success rates (%) on adversarially trained MNIST models, with average distortion in parentheses. D. of M. and Rand. are baselines; the FD/IFD columns use Gradient Estimation with Finite Differences (single-step and iterative); the FGS/IFGS columns are transfer attacks from Model B (single-step and iterative).

| Model | D. of M. | Rand. | FD-xent | FD-logit | IFD-xent | IFD-logit | FGS-xent | FGS-logit | IFGS-xent | IFGS-logit |
|---|---|---|---|---|---|---|---|---|---|---|
| Model A_{adv-0.3} | 6.5 (5.6) | 1.3 (6.1) | 10.3 (2.6) | 2.8 (5.9) | 36.4 (3.1) | 76.5 (3.1) | 14.6 (6.2) | 14.63 (6.3) | 16.5 (4.7) | 15.9 (4.7) |
| Model A_{adv-ens-0.3} | 2.0 (5.6) | 1.2 (6.1) | 6.1 (3.5) | 6.2 (6.3) | 24.2 (4.1) | 96.4 (2.7) | 3.1 (6.2) | 3.1 (6.3) | 4.8 (4.7) | 4.9 (4.7) |
| Model A_{adv-iter-0.3} | 3.0 (5.6) | 1.0 (6.1) | 9.2 (7.4) | 7.5 (7.2) | 14.5 (0.96) | 11.6 (3.5) | 11.5 (6.2) | 11.0 (6.3) | 8.7 (4.7) | 8.2 (4.7) |
Attack success rates (%) on adversarially trained CIFAR-10 models, with average distortion in parentheses. D. of M. and Rand. are baselines; the FD/IFD columns use Gradient Estimation with Finite Differences (single-step and iterative); the FGS/IFGS columns are transfer attacks from Resnet-28-10 (single-step and iterative).

| Model | D. of M. | Rand. | FD-xent | FD-logit | IFD-xent | IFD-logit | FGS-xent | FGS-logit | IFGS-xent | IFGS-logit |
|---|---|---|---|---|---|---|---|---|---|---|
| Resnet-32_{adv-8} | 9.6 (440.5) | 10.9 (439.4) | 2.4 (232.9) | 8.5 (401.9) | 69.0 (136.0) | 100.0 (73.8) | 13.1 (439.4) | 13.2 (439.4) | 30.2 (275.4) | 30.2 (275.6) |
| Resnet-32_{adv-ens-8} | 10.1 (440.5) | 10.4 (439.4) | 7.7 (360.2) | 12.2 (399.8) | 95.0 (190.4) | 100.0 (85.2) | 9.7 (439.4) | 9.6 (439.4) | 15.9 (275.4) | 15.5 (275.6) |
| Resnet-32_{adv-iter-8} | 22.86 (440.5) | 21.41 (439.4) | 45.5 (365.5) | 47.5 (331.1) | 55.0 (397.6) | 54.6 (196.3) | 23.2 (439.4) | 23.1 (439.4) | 22.3 (275.4) | 22.3 (275.6) |
4.2 Adversarially trained models
We train variants of Model A with the three adversarial training strategies described above, using adversarial samples based on an $L_\infty$ constraint of 0.3. Model A_{adv-0.3} is trained with FGS samples, while Model A_{adv-iter-0.3} is trained with iterative FGS samples. For the model with ensemble training, Model A_{adv-ens-0.3} is trained with pre-generated FGS samples for Models A, C, and D, as well as FGS samples generated for the model being trained; the source of the samples is chosen randomly for each minibatch during training. For all adversarially trained models, each training batch contains 128 samples, of which 64 are benign and 64 are adversarial (either FGSM or iterative FGSM). This implies that the benign and adversarial losses are weighted equally during training; i.e., $\lambda$ in Eq. 9 is set to 0.5. Networks using standard and ensemble adversarial training are trained for 12 epochs, while those using iterative adversarial training are trained for 64 epochs.
We train variants of Resnet-32 using adversarial samples with an $L_\infty$ constraint of 8. Resnet-32_{adv-8} is trained with FGS samples under the same constraint, and Resnet-32_{adv-ens-8} is trained with pre-generated FGS samples from Resnet-32 and Std.-CNN as well as FGS samples generated for the model being trained. Resnet-32_{adv-iter-8} is trained with iterative FGS samples. The adversarial variants of Resnet-32 are trained for 80,000 steps.
Table 4 shows the accuracy of these models with various defenses on benign test data.
4.3 Singlestep attacks on defenses
In Figure (a), we can see that both single-step black-box and white-box attacks have much lower adversarial success rates on Model A_{adv-0.3} than on Model A. The success rate of the Gradient Estimation attacks matches that of white-box attacks on these adversarially trained networks as well. To overcome this, we add an initial random perturbation to samples before using the Gradient Estimation attack with Finite Differences and the logit loss (FD-logit). This is then the most effective single-step black-box attack on Model A_{adv-0.3}, with an adversarial success rate of 32.2%, surpassing the Transferability attack (single local model) from Model B.
In Figure (b), we again see that the Gradient Estimation attacks using Finite Differences (FD-xent and FD-logit) match the white-box FGS attacks (FGS-xent and FGS-logit) against Resnet-32. As ε is increased, the attacks that perform best are Random Perturbations (Rand.), Difference-of-means (D. of M.), and the Transferability attack (single local model) from Resnet-28-10, with the latter performing slightly better than the baseline attacks. This is due to the "gradient masking" phenomenon and can be overcome by adding random perturbations, as for MNIST. An interesting effect is observed at some perturbation values, where the adversarial success rate is higher than at the ε used during adversarial training; the likely explanation is that the model has overfitted to adversarial samples at that ε. Our Gradient Estimation attack closely tracks the adversarial success rate of white-box attacks in this setting as well.
Increasing effectiveness of single-step attacks using an initial random perturbation: Since the Gradient Estimation attacks with Finite Differences (FD-xent and FD-logit) were not performing well due to the masking of gradients at the benign sample, we added an initial random perturbation to escape this low-gradient region, as in the RAND-FGSM attack (Tramèr et al., 2017a). Figure 7 shows the effect of adding an initial $L_\infty$-constrained perturbation of magnitude 0.05. With the addition of a random perturbation, FD-logit has a much improved adversarial success rate on Model A_{adv-0.3}, going up to 32.2% from 2.8% without the perturbation. It even outperforms the white-box FGS attack (FGS-logit) with the same random perturbation added. This effect is also observed for Model A_{adv-ens-0.3}, but Model A_{adv-iter-0.3} appears to be resistant to single-step gradient-based attacks. Thus, our attacks work well against DNNs with standard and ensemble adversarial training, and achieve performance close to that of white-box attacks.
4.4 Iterative attacks on different adversarial training defenses
While single-step black-box attacks are less effective at ε values lower than the one used for training, our experiments show that iterative black-box attacks continue to work well even against adversarially trained networks. For example, the Iterative Gradient Estimation attack using Finite Differences with the logit loss (IFD-logit) achieves an adversarial success rate of 96.4% against Model A_{adv-ens-0.3}, while the best transferability attack has a success rate of 4.9%. This is comparable to the white-box attack success rate of 93% from Table 10. However, Model A_{adv-iter-0.3} is quite robust even against iterative attacks, with the highest black-box attack success rate being 14.5%.
Further, in Figure 5, we can see that using just 4,000 queries per sample, the Iterative Gradient Estimation attack using PCA for query reduction (IGE-QR (PCA-400, logit)) achieves 100% (untargeted) and 74.5% (targeted) adversarial success rates against Model A_{adv-0.3}. Our methods far outperform the other black-box attacks, as shown in Table 10.
Iterative black-box attacks also perform well against adversarially trained models for CIFAR-10. IFD-logit achieves attack success rates of 100% against both Resnet-32_{adv-8} and Resnet-32_{adv-ens-8} (Table 5), reducing only slightly, to 97%, when IFD-QR (PCA-400, logit) is used. This matches the performance of white-box attacks as given in Table 10. IFD-QR (PCA-400, logit) also achieves a 72% success rate for targeted attacks, as shown in Figure 5.
The iteratively trained model has poor performance on both benign and adversarial samples. Resnet-32_{adv-iter-8} has an accuracy of only 79.1% on benign data, as shown in Table 4. The Iterative Gradient Estimation attack using Finite Differences with the cross-entropy loss (IFD-xent) achieves an untargeted attack success rate of 55% on this model, which is lower than on the other adversarially trained models but still significant. This is in line with the observation by Mądry et al. (2017) that iterative adversarial training needs models with large capacity to be effective. This highlights a limitation of this defense: it is not clear what model capacity is needed, and the models we use already have a large number of parameters.
Remarks. Both singlestep and iterative variants of the Gradient Estimation attacks outperform other blackbox attacks, achieving attack success rates close to those of whitebox attacks even on adversarially trained models.
5 Attacks on Clarifai: a realworld system
Since the only requirement for carrying out the Gradient Estimation based attacks is query-based access to the target model, a number of deployed public systems that provide classification as a service can be used to evaluate our methods. We choose Clarifai, as it has a number of models trained to classify image datasets for a variety of practical applications, and it provides black-box access to its models and returns confidence scores upon querying. In particular, Clarifai has models used for the detection of Not Safe For Work (NSFW) content, as well as for Content Moderation. These are important applications where the presence of adversarial samples presents a real danger: an attacker with query access to the model could generate an adversarial sample which will no longer be classified as inappropriate. For example, an adversary could upload violent images, adversarially modified, such that they are incorrectly marked as "safe" by the Content Moderation model.
We evaluate our attack using the Gradient Estimation method on the Clarifai NSFW and Content Moderation models. An important point to note here is that given the lack of an easily accessible opensource dataset for these tasks to train a local surrogate model, carrying out attacks based on transferability is a challenging task. On the other hand, our attack can directly be used for any image of the adversary’s choice. The Content Moderation model has five categories, ‘safe’, ‘suggestive’, ‘explicit’, ‘drug’ and ‘gore’, while the NSFW model has just two categories, ‘SFW’ and ‘NSFW.’ When we query the API with an image, it returns the confidence scores associated with each category, with the confidence scores summing to 1. We use the random grouping technique in order to reduce the number of queries and take the logarithm of the confidence scores in order to use the logit loss. A large number of successful attack images can be found at https://sunblazeucb.github.io/blackboxattacks/. Due to their possibly offensive nature, they are not included in the paper.
An example of an attack on the Content Moderation API is given in Figure 1. The original image on the left is clearly of some kind of drug on a table, with a spoon and a syringe, and is classified as a drug by the Content Moderation model with a confidence score of 0.99. The image on the right is an adversarial image generated with 192 queries to the Content Moderation API under an $L_\infty$ constraint on the perturbation. While the image can still clearly be classified by a human as being of drugs on a table, the Content Moderation model now classifies it as "safe" with a confidence score of 0.96.
Remarks. The proposed Gradient Estimation attacks can successfully generate adversarial examples that are misclassified by a realworld system hosted by Clarifai without prior knowledge of the training set or model.
6 Existing blackbox attacks
In this section, we describe existing methods for generating adversarial examples. In all of these attacks, the adversary's perturbation is constrained using the $L_\infty$ distance.
6.1 Baseline attacks
Now, we describe two baseline blackbox attacks which can be carried out without any knowledge of or query access to the target model.
6.1.1 Random perturbations
With no knowledge of the target model or the training set, the simplest manner in which an adversary may seek to carry out an attack is by adding a random perturbation to the input (Szegedy et al., 2014; Goodfellow et al., 2015; Fawzi et al., 2015). These perturbations can be generated by any distribution of the adversary's choice and constrained according to an appropriate norm. If we let $P$ be a distribution over $\mathbb{R}^d$ and $\eta$ be a random variable drawn according to $P$, then a noisy sample is just $x + \eta$. Since random noise is added, it is not possible to generate targeted adversarial samples in a principled manner. This attack is denoted as Rand. throughout.

6.1.2 Difference of means
A perturbation aligned with the difference of means of two classes is likely to be effective for an adversary hoping to cause misclassification for a broad range of classifiers (Tramèr et al., 2017b). While these perturbations are far from optimal for DNNs, they provide a useful baseline to compare against. Adversaries with at least partial access to the training or test sets can carry out this attack. An adversarial sample generated using this method, under an $L_\infty$ constraint, is $x_{\mathrm{adv}} = x + \epsilon \cdot \mathrm{sign}(\mu_T - \mu_O)$, where $\mu_T$ is the mean of the target class and $\mu_O$ is the mean of the original ground-truth class. For an untargeted attack, $T = \arg\min_{T' \neq O} d(\mu_{T'}, \mu_O)$, where $d(\cdot, \cdot)$ is an appropriately chosen distance function. In other words, the class whose mean is closest to the original class in terms of the Euclidean distance is chosen to be the target. This attack is denoted as D. of M. throughout.
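The difference-of-means baseline can be sketched as follows; `class_means` maps labels to per-class mean vectors, and the untargeted branch picks the nearest other class mean in Euclidean distance, as described above. The function name is illustrative:

```python
import numpy as np

def diff_of_means_attack(x, class_means, orig_label, eps, target=None):
    """D. of M. baseline: perturb along the difference between the
    target-class mean and the original-class mean, under an L_inf budget."""
    mu_o = class_means[orig_label]
    if target is None:  # untargeted: nearest other class mean in L2 distance
        dists = {c: np.linalg.norm(mu - mu_o)
                 for c, mu in class_means.items() if c != orig_label}
        target = min(dists, key=dists.get)
    return x + eps * np.sign(class_means[target] - mu_o)
```

Only the class means are needed, so partial access to the training or test data suffices.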
6.1.3 Effectiveness of baseline attacks
In the baseline attacks described above, the choice of distribution for the random perturbation attack and the choice of distance function for the difference-of-means attack are not fixed. Here, we describe the choices we make for both attacks. The random perturbation $\eta$ for each sample (for both MNIST and CIFAR-10) is chosen independently according to a multivariate normal distribution with mean $\mathbf{0}$, i.e., $\eta \sim \mathcal{N}(\mathbf{0}, I_d)$. Then, depending on the norm constraint, either a signed and scaled version of the random perturbation ($\epsilon \cdot \mathrm{sign}(\eta)$) or a scaled unit vector in the direction of the perturbation ($\epsilon \cdot \eta / \|\eta\|_2$) is added. For an untargeted attack utilizing perturbations aligned with the difference of means, for each sample, the mean of the class closest to the original class in the Euclidean distance is determined.

As expected, adversarial samples generated using Rand. do not achieve high adversarial success rates (Table 1), in spite of having similar or larger average distortion than the other black-box attacks for both the MNIST and CIFAR-10 models. However, the D. of M. method is quite effective at higher perturbation values for the MNIST dataset, as can be seen in Figure (a). Also, for Models B and D, the D. of M. attack is more effective than FD-xent. The D. of M. method is less effective in the targeted attack case, but for Model D it outperforms the transferability-based attack considerably; its success rate is comparable to the targeted transferability-based attack for Model A as well.
The relative effectiveness of the two baseline methods is reversed for the CIFAR-10 dataset, however, where Rand. outperforms D. of M. considerably as ε is increased. This indicates that the models trained on MNIST have normal vectors to decision boundaries that are more aligned with the difference-of-means vectors than the models trained on CIFAR-10.
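The Rand. baseline choices described above (zero-mean Gaussian noise, then either a signed-and-scaled perturbation for $L_\infty$ or a scaled unit vector for $L_2$) can be sketched as:

```python
import numpy as np

def random_perturbation(x, eps, norm="linf", rng=None):
    """Rand. baseline: add scaled Gaussian noise under an L_inf or L2 budget."""
    rng = np.random.default_rng() if rng is None else rng
    eta = rng.standard_normal(x.shape)            # eta ~ N(0, I_d)
    if norm == "linf":
        return x + eps * np.sign(eta)             # signed, scaled noise
    return x + eps * eta / np.linalg.norm(eta)    # scaled unit vector (L2)
```

Either variant saturates the stated norm budget exactly, which makes comparisons against the gradient-based attacks at equal distortion straightforward.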
6.2 Singlestep and Iterative Fast Gradient Methods
Now, we describe two white-box attack methods, used in transferability-based attacks, for which we constructed approximate, gradient-free versions in Section 3. These attacks are based on either single-step or iterative gradient-based maximization of appropriately defined loss functions of neural networks. All results for white-box attacks are contained in Appendix D. Since these methods require knowledge of the model's gradient, we assume the adversary has access to a local surrogate model $f^s$. Adversarial samples generated for $f^s$ can then be transferred to the target model $f$ to carry out a transferability-based attack (Papernot et al., 2016; Moosavi-Dezfooli et al., 2016). An ensemble of local models (Liu et al., 2017) may also be used. Transferability-based attacks are described in Section 6.3.
The single-step Fast Gradient method, first introduced by Goodfellow et al. (2015), utilizes a first-order approximation of the loss function in order to construct adversarial samples for the adversary's surrogate local model $f^s$. The samples are constructed by performing a single step of gradient ascent for untargeted attacks. Formally, the adversary generates samples with $L_\infty$ constraints (known as the Fast Gradient Sign (FGS) method) in the untargeted attack setting as
$$x_{\mathrm{adv}} = x + \epsilon \cdot \mathrm{sign}\big(\nabla_x \ell_{f^s}(x, y)\big), \tag{10}$$
where $\ell_{f^s}(x, y)$ is the loss function with respect to which the gradient is taken. The loss function typically used is the cross-entropy loss (Goodfellow et al., 2016). Adversarial samples generated using the targeted FGS attack are
$$x_{\mathrm{adv}} = x - \epsilon \cdot \mathrm{sign}\big(\nabla_x \ell_{f^s}(x, T)\big), \tag{11}$$
where $T$ is the target class.
Iterative Fast Gradient methods are simply multi-step variants of the Fast Gradient method described above (Kurakin et al., 2016), in which the signed gradient of the loss is added to the sample over $t$ iterations, starting from the benign sample, and the updated sample is projected to satisfy the constraint set $\mathcal{H}$ in every step:
$$x_{\mathrm{adv}}^{t+1} = \Pi_{\mathcal{H}}\Big(x_{\mathrm{adv}}^{t} + \alpha \cdot \mathrm{sign}\big(\nabla_{x_{\mathrm{adv}}^{t}} \ell_{f^s}(x_{\mathrm{adv}}^{t}, y)\big)\Big), \tag{12}$$
with $x_{\mathrm{adv}}^{0} = x$. Iterative Fast Gradient methods thus essentially carry out projected gradient descent (PGD) with the goal of maximizing the loss, as pointed out by Mądry et al. (2017). Targeted adversarial samples generated using iterative FGS are
$$x_{\mathrm{adv}}^{t+1} = \Pi_{\mathcal{H}}\Big(x_{\mathrm{adv}}^{t} - \alpha \cdot \mathrm{sign}\big(\nabla_{x_{\mathrm{adv}}^{t}} \ell_{f^s}(x_{\mathrm{adv}}^{t}, T)\big)\Big). \tag{13}$$
6.2.1 Beyond the crossentropy loss
Prior work by Carlini & Wagner (2017) investigates a variety of loss functions for white-box attacks based on the minimization of an appropriately defined loss function. In our experiments with neural networks, for untargeted attacks, we use a loss function based on logits which was found to work well for white-box attacks by Carlini & Wagner (2017). The loss function is given by
$$\ell(x, y) = \max\big(\phi(x)_y - \max\{\phi(x)_i : i \neq y\},\ -\kappa\big), \tag{14}$$
where $y$ represents the ground-truth label for the benign sample $x$, $\phi(\cdot)$ are the logits, and $\kappa$ is a confidence parameter that can be adjusted to control the strength of the adversarial sample. Targeted adversarial samples are generated using the following loss term:
$$\ell(x, T) = \max\big(\max\{\phi(x)_i : i \neq T\} - \phi(x)_T,\ -\kappa\big). \tag{15}$$
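A plausible reconstruction of the logit-loss computations in Equations 14 and 15, for a single example; `logits` stands for the vector $\phi(x)$ and `kappa` for the confidence parameter. The exact form is reconstructed following Carlini & Wagner (2017), so treat this as illustrative:

```python
import numpy as np

def logit_loss(logits, label, kappa=0.0):
    """Untargeted logit loss (Eq. 14 style): margin between the true-class
    logit and the best other logit, floored at -kappa."""
    others = np.delete(logits, label)
    return max(logits[label] - others.max(), -kappa)

def targeted_logit_loss(logits, target, kappa=0.0):
    """Targeted logit loss (Eq. 15 style): best non-target logit minus
    the target logit, floored at -kappa."""
    others = np.delete(logits, target)
    return max(others.max() - logits[target], -kappa)
```

Driving either quantity down to `-kappa` forces the desired (mis)classification with a logit margin of at least `kappa`.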
6.3 Transferability based attacks
Here we describe black-box attacks that assume the adversary has access to a representative set of training data in order to train a local model. One of the earliest observations regarding adversarial samples for neural networks was that they transfer; i.e., adversarial samples generated for one network are often adversarial for another network. This observation directly led to the proposal of a black-box attack in which an adversary generates samples for a local network and transfers them to the target model, referred to as a Transferability-based attack. Targeted transferability attacks are carried out using locally generated targeted white-box adversarial samples.
6.3.1 Single local model
These attacks use a surrogate local model $f^s$ to craft adversarial samples, which are then submitted to the target model $f$ in order to cause misclassification. Most existing black-box attacks are based on transferability from a single local model (Papernot et al., 2016; Moosavi-Dezfooli et al., 2016). The different attack strategies introduced in Section 6.2 can be used here to generate adversarial instances against $f^s$, so as to attack $f$.
6.3.2 Ensemble of local models
Since it is not clear which local model is best suited for generating adversarial samples that transfer well to the target model $f$, Liu et al. (2017) propose generating adversarial examples for an ensemble of local models. This method modifies each of the existing transferability attacks by substituting a weighted sum of the local models' loss functions in place of the loss from a single local model.
Concretely, let the ensemble of $k$ local models used to generate the local loss be $\{f^{s_1}, \ldots, f^{s_k}\}$. The ensemble loss is then computed as $\ell_{\mathrm{ens}}(x, y) = \sum_{i=1}^{k} \alpha_i\, \ell_{f^{s_i}}(x, y)$, where $\alpha_i$ is the weight given to each model in the ensemble. The FGS attack in the ensemble setting then becomes $x_{\mathrm{adv}} = x + \epsilon \cdot \mathrm{sign}(\nabla_x \ell_{\mathrm{ens}}(x, y))$. The Iterative FGS attack is modified similarly. Liu et al. (2017) show that the Transferability attack (local model ensemble) performs well even in the targeted attack case, while the Transferability attack (single local model) is usually only effective for untargeted attacks. The intuition is that while one model's gradient may not be adversarial for a target model, it is likely that at least one of the gradient directions from the ensemble represents a direction that is somewhat adversarial for the target model.
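The ensemble FGS step can be sketched as a signed step along the weighted sum of local-model gradients; in a real attack, the entries of `grads` would come from backpropagation through each local model, and the function name is illustrative:

```python
import numpy as np

def ensemble_fgs(x, grads, weights, eps):
    """Ensemble-transfer FGS: signed step on a weighted sum of the local
    models' loss gradients (after Liu et al., 2017)."""
    g = sum(a * g_i for a, g_i in zip(weights, grads))
    return x + eps * np.sign(g)
```

Disagreeing gradient components cancel in the weighted sum, so the step concentrates on directions that several local models agree are adversarial.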
Table 6: Transferability of adversarial samples to Model A from single local models and ensembles (adversarial success rate in %, average distortion in parentheses). FGS columns are single-step attacks; IFGS columns are iterative.

Untargeted transferability to Model A:

| Source | FGS-xent | FGS-logit | IFGS-xent | IFGS-logit |
|---|---|---|---|---|
| B | 66.3 (6.2) | 80.8 (6.3) | 89.8 (4.75) | 88.5 (4.75) |
| B, C | 68.1 (6.2) | 89.8 (6.3) | 95.0 (4.8) | 97.1 (4.9) |
| B, C, D | 56.0 (6.3) | 88.7 (6.4) | 73.5 (5.3) | 94.4 (5.3) |

Targeted transferability to Model A:

| Source | FGS-T (xent) | FGS-T (logit) | IFGS-T (xent) | IFGS-T (logit) |
|---|---|---|---|---|
| B | 18.3 (6.3) | 18.1 (6.3) | 54.5 (4.6) | 46.5 (4.2) |
| B, C | 23.0 (6.3) | 23.0 (6.3) | 76.7 (4.8) | 72.3 (4.5) |
| B, C, D | 25.2 (6.4) | 25.1 (6.4) | 74.6 (4.9) | 66.1 (4.7) |
6.3.3 Transferability attack results
For the transferability experiments, we choose to transfer from Model B for the MNIST dataset and from Resnet-28-10 for the CIFAR-10 dataset, as each of these models is similar to at least one of the other models for its respective dataset and different from another. They are also fairly representative instances of DNNs used in practice.
Adversarial samples generated using singlestep methods and transferred from Model B to the other models have higher success rates for untargeted attacks when they are generated using the logit loss as compared to the cross entropy loss as can be seen in Table 1. For iterative adversarial samples, however, the untargeted attack success rates are roughly the same for both loss functions. As has been observed before, the adversarial success rate for targeted attacks with transferability is much lower than the untargeted case, even when iteratively generated samples are used. On the MNIST dataset, the highest targeted transferability rate is 54.5% (Table 2) as compared to 89.8% in the untargeted case (Table 1).
One attempt to improve the transferability rate is to use an ensemble of local models, instead of a single one. The results for this on the MNIST data are presented in Table 6. In general, both untargeted and targeted transferability increase when an ensemble is used. However, the increase is not monotonic in the number of models used in the ensemble, and we can see that the transferability rate for IFGSxent samples falls sharply when Model D is added to the ensemble. This may be due to it having a very different architecture as compared to the models, and thus also having very different gradient directions. This highlights one of the pitfalls of transferability, where it is important to use a local surrogate model similar to the target model for achieving high attack success rates.
7 Conclusion
Overall, in this paper, we conduct a systematic analysis of new and existing black-box attacks on state-of-the-art classifiers and defenses. We propose Gradient Estimation attacks which achieve high attack success rates comparable to even white-box attacks, and which outperform other state-of-the-art black-box attacks. We apply random grouping and PCA-based methods to reduce the number of queries required to a small constant and demonstrate the effectiveness of the Gradient Estimation attack even in this setting. We also apply our black-box attacks against a real-world classifier and state-of-the-art defenses. All of our results show that Gradient Estimation attacks are extremely effective in a variety of settings, making the development of better defenses against black-box attacks an urgent task.
References
 Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
 Benenson (2016) Rodrigo Benenson. Classification datasets results. http://rodrigob.github.io/are_we_there_yet/build/#classificationdatasettype, 2016. Accessed: 2017-08-22.
 Carlini & Wagner (2017) Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, 2017.
 Chen et al. (2017) Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. arXiv preprint arXiv:1708.03999, 2017.
 (5) Clarifai. Clarifai - image & video recognition API. https://clarifai.com. Accessed: 2017-08-22.
 Dang et al. (2017) Hung Dang, Huang Yue, and Ee-Chien Chang. Evading classifiers by morphing in the dark. 2017.
 Fawzi et al. (2015) Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Analysis of classifiers’ robustness to adversarial perturbations. arXiv preprint arXiv:1502.02590, 2015.
 Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
 Goodfellow et al. (2015) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
 (10) Google Vision API. Vision API - image content analysis - Google Cloud Platform. https://cloud.google.com/vision/. Accessed: 2017-08-22.
 Gu & Rigazio (2014) Shixiang Gu and Luca Rigazio. Towards deep neural network architectures robust to adversarial examples. arXiv preprint arXiv:1412.5068, 2014.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
 Hildebrand (1962) Francis Begnaud Hildebrand. Advanced Calculus for Applications, volume 63. Prentice-Hall, Englewood Cliffs, NJ, 1962.
 Kennedy (2011) James Kennedy. Particle swarm optimization. In Encyclopedia of machine learning, pp. 760–766. Springer, 2011.
 Krizhevsky & Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
 Kurakin et al. (2016) Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
 LeCun & Cortes (1998) Yann LeCun and Corinna Cortes. The MNIST database of handwritten digits. 1998.
 Liu (2016) Amy Liu. Clarifai featured hack: Block unwanted nudity in blog comments with Disqus. https://goo.gl/TCCVrR, 2016. Accessed: 2017-08-22.
 Liu et al. (2017) Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. In ICLR, 2017.
 Moosavi-Dezfooli et al. (2015) Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: a simple and accurate method to fool deep neural networks. arXiv preprint arXiv:1511.04599, 2015.
 Moosavi-Dezfooli et al. (2016) Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. arXiv preprint arXiv:1610.08401, 2016.
 Mopuri et al. (2017) Konda Reddy Mopuri, Utsav Garg, and R Venkatesh Babu. Fast feature fool: A data independent approach to universal adversarial perturbations. arXiv preprint arXiv:1707.05572, 2017.
 Mądry et al. (2017) Aleksander Mądry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv:1706.06083 [cs, stat], June 2017.
 Narodytska & Kasiviswanathan (2016) Nina Narodytska and Shiva Prasad Kasiviswanathan. Simple black-box adversarial perturbations for deep networks. arXiv preprint arXiv:1612.06299, 2016.
 Nelson et al. (2012) Blaine Nelson, Benjamin IP Rubinstein, Ling Huang, Anthony D Joseph, Steven J Lee, Satish Rao, and JD Tygar. Query strategies for evading convex-inducing classifiers. The Journal of Machine Learning Research, 13(1):1293–1332, 2012.
 Papernot et al. (2016) Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277, 2016.
 Papernot et al. (2017) Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against deep learning systems using adversarial examples. In Proceedings of the 2017 ACM Asia Conference on Computer and Communications Security, 2017.
 Sharif et al. (2016) Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael K Reiter. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 1528–1540. ACM, 2016.
 Shlens (2014) Jonathon Shlens. A tutorial on principal component analysis. arXiv preprint arXiv:1404.1100, 2014.
 Spall (1992) James C Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE transactions on automatic control, 37(3):332–341, 1992.
 Spall (2005) James C Spall. Introduction to stochastic search and optimization: estimation, simulation, and control, volume 65. John Wiley & Sons, 2005.
 Szegedy et al. (2014) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
 TensorFlow Authors (a) TensorFlow Authors. TensorFlow Resnet models. https://github.com/tensorflow/models/tree/master/resnet, a. Accessed: 2017-08-22.
 TensorFlow Authors (b) TensorFlow Authors. TensorFlow CIFAR-10 tutorial model. https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10, b. Accessed: 2017-08-22.
 Tramèr et al. (2017a) Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017a.
 Tramèr et al. (2017b) Florian Tramèr, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. The space of transferable adversarial examples. arXiv preprint arXiv:1704.03453, 2017b.
 (37) Watson Visual Recognition. Watson Visual Recognition. https://www.ibm.com/watson/services/visualrecognition/. Accessed: 2017-10-27.
 Wright & Nocedal (1999) Stephen J Wright and Jorge Nocedal. Numerical Optimization. Springer, 1999.
 Xu et al. (2016) Weilin Xu, Yanjun Qi, and David Evans. Automatically evading classifiers. In Proceedings of the 2016 Network and Distributed System Security Symposium, 2016.
 Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
Appendix A Alternative adversarial success metric
Note that the adversarial success rate can also be computed by considering only the fraction of inputs that meet the adversary’s objective given that the original sample was correctly classified. That is, one would count the fraction of correctly classified inputs (i.e., those with $f(\mathbf{x}) = y$) for which $f(\mathbf{x}_{adv}) \neq y$ in the untargeted case, and $f(\mathbf{x}_{adv}) = T$ in the targeted case. In a sense, this fraction represents the samples which are truly adversarial, since they are misclassified solely due to the adversarial perturbation added and not due to the classifier’s failure to generalize well. In practice, both methods of measuring the adversarial success rate lead to similar results for classifiers with high accuracy on the test data.
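This alternative metric is straightforward to compute from model predictions. Below is a minimal sketch, assuming predictions and labels are available as numpy vectors; the function name is ours, not the paper's:

```python
import numpy as np

def adversarial_success(preds_clean, preds_adv, labels, target=None):
    """Success rate counted only over originally correctly classified inputs.

    preds_clean, preds_adv: predicted class per sample before/after perturbation.
    target: if given, count f(x_adv) == target (targeted case);
            otherwise count f(x_adv) != y (untargeted case).
    """
    correct = preds_clean == labels            # f(x) == y
    if target is None:
        success = preds_adv != labels          # untargeted objective
    else:
        success = preds_adv == target          # targeted objective
    return float(np.mean(success[correct])) if correct.any() else 0.0
```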
Appendix B Formal definitions for querybased attacks
Here, we provide a unified framework for attacks in which the adversary can make active queries to the model. Existing attacks making zero queries are a special case of this framework. Given an input instance $\mathbf{x}$, the adversary makes a sequence of queries based on the adversarial constraint set $\mathcal{H}$, and iteratively adds perturbations until the desired query results are obtained, at which point the corresponding adversarial example is generated.
We formally define targeted and untargeted black-box attacks based on this framework as follows.
Definition 1 (Untargeted black-box attack).
Given an input instance $\mathbf{x}$ and an iterative active query attack strategy $\mathcal{A}$, a query sequence can be generated as $\mathbf{x}_2^q = \mathcal{A}(\{\mathbf{x}_1^q, q_f(\mathbf{x}_1^q)\})$, …, $\mathbf{x}_n^q = \mathcal{A}(\{\mathbf{x}_i^q, q_f(\mathbf{x}_i^q)\}_{i=1}^{n-1})$, where $q_f(\mathbf{x}_i^q)$ denotes the $i$th corresponding query result on $\mathbf{x}_i^q$, and we set $\mathbf{x}_1^q = \mathbf{x}$. A black-box attack on $f$ is untargeted if the adversarial example $\mathbf{x}_{adv} = \mathbf{x}_n^q$ satisfies $f(\mathbf{x}_{adv}) \neq f(\mathbf{x})$, where $n$ is the number of queries made.
Definition 2 (Targeted black-box attack).
Given an input instance $\mathbf{x}$ and an iterative active query attack strategy $\mathcal{A}$, a query sequence can be generated as $\mathbf{x}_2^q = \mathcal{A}(\{\mathbf{x}_1^q, q_f(\mathbf{x}_1^q)\})$, …, $\mathbf{x}_n^q = \mathcal{A}(\{\mathbf{x}_i^q, q_f(\mathbf{x}_i^q)\}_{i=1}^{n-1})$, where $q_f(\mathbf{x}_i^q)$ denotes the $i$th corresponding query result on $\mathbf{x}_i^q$, and we set $\mathbf{x}_1^q = \mathbf{x}$. A black-box attack on $f$ is targeted if the adversarial example $\mathbf{x}_{adv} = \mathbf{x}_n^q$ satisfies $f(\mathbf{x}_{adv}) = T$, where $T$ and $n$ are the target class and the number of queries made, respectively.
The case where the adversary makes no queries to the target classifier is a special case we refer to as a zero-query attack. In the literature, a number of these zero-query attacks have been carried out with varying degrees of success (Papernot et al., 2016; Liu et al., 2017; Moosavi-Dezfooli et al., 2016; Mopuri et al., 2017).
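The definitions above can be summarized as a generic query loop. The sketch below is illustrative only: `query_model` and `strategy` are hypothetical callables standing in for the black-box query oracle and the attack strategy.

```python
def query_attack(x, query_model, strategy, n_queries):
    """Generic active-query attack loop.

    query_model: black-box oracle returning the query result on an input
    (e.g. class probabilities).
    strategy: attack strategy mapping the history of (input, result) pairs
    to the next candidate input. Both are hypothetical callables.
    """
    history = []
    x_q = x                                # first query point is x itself
    for _ in range(n_queries):
        result = query_model(x_q)          # query result on the current input
        history.append((x_q, result))
        x_q = strategy(history)            # next candidate from the history
    return x_q                             # candidate adversarial example
```

Whether the attack counts as untargeted or targeted depends only on the condition the final example satisfies: misclassification for untargeted attacks, classification into the target class for targeted ones.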
Appendix C Summary of attacks evaluated
Taxonomy of black-box attacks: To deepen our understanding of the effectiveness of black-box attacks, we propose a taxonomy of black-box attacks, organized by the number of queries to the target model used in the attack. The details are provided in Table 7.
Table 7: Taxonomy of the attacks evaluated. A '+' indicates that the attack is carried out in the given setting.

| Attack | Steps | Loss function | Abbreviation | Untargeted | Targeted |
|--------|-------|---------------|--------------|------------|----------|
| Black-box, zero-query | | | | | |
| Random Gaussian perturbation | - | - | Rand. | + | |
| Difference-of-means | - | - | D. of M. | + | + |
| Transfer from model (surrogate FGS) | Single-step | Cross-entropy | Transfer model FGS-xent | + | + |
| | Single-step | Logit-based | Transfer model FGS-logit | + | + |
| Transfer from model (surrogate IFGS) | Iterative | Cross-entropy | Transfer model IFGS-xent | + | + |
| | Iterative | Logit-based | Transfer model IFGS-logit | + | + |
| Black-box, query-based | | | | | |
| Finite-difference gradient estimation | Single-step | Cross-entropy | FD-xent | + | + |
| | Single-step | Logit-based | FD-logit | + | + |
| | Iterative | Cross-entropy | IFD-xent | + | + |
| | Iterative | Logit-based | IFD-logit | + | + |
| Query-reduced gradient estimation (random grouping) | Single-step | Logit-based | GE-QR (RG, logit) | + | + |
| | Iterative | Logit-based | IGE-QR (RG, logit) | + | + |
| Query-reduced gradient estimation (PCA) | Single-step | Logit-based | GE-QR (PCA, logit) | + | + |
| | Iterative | Logit-based | IGE-QR (PCA, logit) | + | + |
| White-box | | | | | |
| Fast gradient sign (FGS) | Single-step | Cross-entropy | WB FGS-xent | + | + |
| | Single-step | Logit-based | WB FGS-logit | + | + |
| Iterative FGS (IFGS) | Iterative | Cross-entropy | WB IFGS-xent | + | + |
| | Iterative | Logit-based | WB IFGS-logit | + | + |
We evaluate the following attacks, summarized in Table 7:

- Zero-query attacks
  - Baseline attacks: Random-Gaussian perturbations (Rand.) and Difference-of-Means aligned perturbations (D. of M.)
  - Transferability attack (single local model) using Fast Gradient Sign (FGS) and Iterative FGS (IFGS) samples generated on a single source model for both loss functions (Transfer model FGS/IFGS-loss); e.g., Transfer Model A FGS-logit
  - Transferability attack (local model ensemble) using FGS and IFGS samples generated on an ensemble of source models for both loss functions (Transfer models FGS/IFGS-loss); e.g., Transfer Model B, Model C IFGS-logit
- Query-based attacks
  - Finite-difference and Iterative Finite-difference attacks for the Gradient Estimation attack for both loss functions (FD/IFD-loss); e.g., FD-logit
  - Gradient Estimation and Iterative Gradient Estimation with Query Reduction attacks (GE/IGE-QR (Technique, loss)) using two query-reduction techniques, random grouping (RG) and principal component analysis (PCA); e.g., GE-QR (PCA, logit)
- White-box FGS and IFGS attacks for both loss functions (WB FGS/IFGS (loss))
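To make the query-reduced entries concrete, here is a minimal sketch of finite-difference gradient estimation with random-grouping query reduction, under our own simplifying assumptions (two-sided differences, one shared estimate per group); it is not the paper's exact implementation:

```python
import numpy as np

def fd_gradient_rg(loss_fn, x, delta=1e-2, n_groups=8, rng=None):
    """Estimate the loss gradient with 2 * n_groups black-box queries.

    loss_fn: black-box loss oracle, queried on perturbed copies of x.
    Features are split into n_groups random groups; a single two-sided
    finite difference is computed per group and assigned to all of its
    features, instead of 2d queries for a d-dimensional input.
    """
    rng = np.random.default_rng(rng)
    d = x.size
    grad = np.zeros(d)
    for group in np.array_split(rng.permutation(d), n_groups):
        e = np.zeros(d)
        e[group] = 1.0  # indicator vector of the random group
        # Two-sided finite difference along the group direction.
        grad[group] = (loss_fn(x + delta * e) - loss_fn(x - delta * e)) / (2 * delta)
    return grad
```

A single-step attack then perturbs the input by `eps * np.sign(grad)`, exactly as in white-box FGS but with the estimated rather than the true gradient.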
Appendix D White-box attack results
In this section, we present the white-box attack results for various cases in Tables 8–10. Where relevant, our results match previous work (Goodfellow et al., 2015; Kurakin et al., 2016).
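The iterative FGS (IFGS) attack used throughout these tables can be sketched as follows; this is a minimal numpy illustration of the method of Kurakin et al. (2016), with inputs assumed to lie in [0, 1]:

```python
import numpy as np

def ifgs(x, y, grad_fn, eps, alpha, n_iter):
    """Iterative FGS: repeated small FGS steps within an eps-ball.

    grad_fn: gradient of the loss w.r.t. the input (white-box access).
    alpha: per-step size; eps: total perturbation budget (L-infinity).
    """
    x_adv = x.copy()
    for _ in range(n_iter):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv, y))
        # Project back into the eps-ball around x and the valid input range.
        x_adv = np.clip(x_adv, x - eps, x + eps)
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv
```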
Table 8: Untargeted white-box attack success rates (%) on MNIST and CIFAR-10.

MNIST (white-box)

| Model | FGS (xent) | FGS (logit) | IFGS (xent) | IFGS (logit) |
|-------|------------|-------------|-------------|--------------|
| A | 69.2 (5.9) | 90.1 (5.9) | 99.5 (4.4) | 100.0 (2.1) |
| B | 84.7 (6.2) | 98.8 (6.3) | 100.0 (4.75) | 100.0 (1.6) |
| C | 67.9 (6.1) | 76.5 (6.6) | 100.0 (4.7) | 100.0 (2.2) |
| D | 98.3 (6.3) | 100.0 (6.5) | 100.0 (5.6) | 100.0 (1.2) |

CIFAR-10 (white-box)

| Model | FGS (xent) | FGS (logit) | IFGS (xent) | IFGS (logit) |
|-------|------------|-------------|-------------|--------------|
| Resnet-32 | 82.6 (439.7) | 86.8 (438.3) | 100.0 (247.2) | 100.0 (66.1) |
| Resnet-28-10 | 86.4 (439.6) | 87.4 (439.2) | 100.0 (278.7) | 100.0 (75.4) |
| Std.-CNN | 93.9 (439.6) | 98.5 (429.8) | 98.0 (314.3) | 100.0 (80.3) |

FGS columns are single-step attacks; IFGS columns are iterative.
Table 9: Targeted white-box attack success rates (%) on MNIST and CIFAR-10.

MNIST (white-box)

| Model | FGS (xent) | FGS (logit) | IFGS (xent) | IFGS (logit) |
|-------|------------|-------------|-------------|--------------|
| A | 30.1 (6.1) | 30.1 (6.1) | 100.0 (4.7) | 99.6 (2.7) |
| B | 29.6 (6.2) | 29.4 (6.3) | 100.0 (4.6) | 98.7 (2.4) |
| C | 33.2 (6.4) | 33.0 (6.4) | 100.0 (4.6) | 99.8 (3.0) |
| D | 61.5 (6.4) | 60.9 (6.3) | 100.0 (4.7) | 99.9 (2.0) |

CIFAR-10 (white-box)

| Model | FGS (xent) | FGS (logit) | IFGS (xent) | IFGS (logit) |
|-------|------------|-------------|-------------|--------------|
| Resnet-32 | 23.7 (439.6) | 23.5 (436.0) | 100.0 (200.4) | 100.0 (89.5) |
| Resnet-28-10 | 28.0 (439.6) | 27.6 (436.5) | 100.0 (215.7) | 100.0 (99.0) |
| Std.-CNN | 43.8 (439.5) | 40.2 (435.6) | 99.0 (262.4) | 95.0 (127.8) |

FGS columns are single-step attacks; IFGS columns are iterative.
Table 10: White-box attack success rates (%) on adversarially trained models.

MNIST (white-box)

| Model | FGS (xent) | FGS (logit) | IFGS (xent) | IFGS (logit) |
|-------|------------|-------------|-------------|--------------|
| Model A_{adv-0.3} | 2.8 (6.1) | 2.9 (6.0) | 79.1 (4.2) | 78.5 (3.1) |
| Model A_{adv-ens-0.3} | 6.2 (6.2) | 4.6 (6.3) | 93.0 (4.1) | 96.2 (2.7) |
| Model A_{adv-iter-0.3} | 7.0 (6.4) | 5.9 (7.5) | 10.8 (3.6) | 11.0 (3.6) |

CIFAR-10 (white-box)

| Model | FGS (xent) | FGS (logit) | IFGS (xent) | IFGS (logit) |
|-------|------------|-------------|-------------|--------------|
| Resnet-32_{adv-8} | 1.1 (439.7) | 1.5 (438.8) | 100.0 (200.6) | 100.0 (73.7) |
| Resnet-32_{adv-ens-8} | 7.2 (439.7) | 6.6 (437.9) | 100.0 (201.3) | 100.0 (85.3) |
| Resnet-32_{adv-iter-8} | 48.3 (438.2) | 50.4 (346.6) | 54.9 (398.7) | 57.3 (252.4) |

FGS columns are single-step attacks; IFGS columns are iterative.