1 Introduction
It has been observed recently that machine learning algorithms, especially deep neural networks, are vulnerable to adversarial examples [3, 4, 5, 6, 7, 8]. For example, in image classification problems, attack algorithms [9, 3, 10] can find adversarial examples for almost every image with very small, human-imperceptible perturbations. The problem of finding an adversarial example can be posed as an optimization problem: within a small neighbourhood around the original example, find a point that optimizes a cost function measuring the "successfulness" of an attack. Solving this objective with gradient-based optimizers leads to state-of-the-art attacks [9, 3, 10, 4, 11].
Most current attacks [3, 9, 4, 12] consider the "white-box" setting, where the machine learning model is fully exposed to the attacker. In this setting, the gradient of the above-mentioned attack objective function can be computed by backpropagation, so attacks can be carried out very easily. This white-box setting is clearly unrealistic when the model parameters are unknown to an attacker. Instead, several recent works consider the "score-based black-box" setting, where the machine learning model is unknown to the attacker, but it is possible to make queries to obtain the corresponding probability outputs of the model [10, 13]. However, in many cases real-world models will not provide probability outputs to users. Instead, only the final decision (e.g., the top-1 predicted class) can be observed. It is therefore interesting to ask whether machine learning models are vulnerable in this setting. Furthermore, existing gradient-based attacks cannot be applied to some non-continuous machine learning models that involve discrete decisions. For example, the robustness of decision-tree based models (random forests and gradient boosting decision trees (GBDT)) cannot be evaluated using gradient-based approaches, since the gradient of these functions does not exist.
In this paper, we develop an optimization-based framework for attacking machine learning models in a more realistic and general "hard-label black-box" setting. We assume that the model is not revealed and the attacker can only make queries to get the corresponding hard-label decision instead of the probability outputs (also known as soft labels). Attacking in this setting is very challenging, and almost all previous attacks fail for the following two reasons. First, the gradient cannot be computed directly by backpropagation, and finite-difference based approaches also fail because the hard-label output is insensitive to small input perturbations. Second, since only the hard-label decision is observed, the attack objective functions become discontinuous with discrete outputs, which makes them combinatorial in nature and hard to optimize (see Section 2.4 for more details).
In this paper, we make hard-label black-box attacks possible and query-efficient by reformulating the attack as a novel real-valued optimization problem, which is usually continuous and much easier to solve. Although the objective function of this reformulation cannot be written in an analytical form, we show how to use model queries to evaluate its function value and apply any zeroth order optimization algorithm to solve it. Furthermore, we prove that by carefully controlling the numerical accuracy of function evaluations, a Random Gradient-Free (RGF) method can converge to stationary points as long as the boundary is smooth. We note that this is the first attack with a guaranteed convergence rate in the hard-label black-box setting. In the experiments, we show that our algorithm can successfully attack hard-label black-box CNN models on MNIST, CIFAR, and ImageNet with far fewer queries than the state-of-the-art algorithm.
Moreover, since our algorithm does not depend on the gradient of the classifier, we can apply our approach to other non-differentiable classifiers besides neural networks. We show an interesting application in attacking Gradient Boosting Decision Trees, which cannot be attacked by any existing gradient-based method, even in the white-box setting. Our method can successfully find adversarial examples with imperceptible perturbations for a GBDT within 30,000 queries.
2 Background and Related work
We first introduce our problem setting and give a brief literature review to highlight the difficulty of attacking hard-label black-box models.
2.1 Problem Setting
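For simplicity, we consider attacking a $K$-way multiclass classification model in this paper. Given the classification model $f: \mathbb{R}^d \to \{1, \dots, K\}$ and an original example $x_0$, the goal is to generate an adversarial example $x$ such that
$$x \text{ is close to } x_0 \quad \text{and} \quad f(x) \neq f(x_0). \tag{1}$$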
2.2 White-box attacks
Most attack algorithms in the literature consider the white-box setting, where the classifier $f$ is exposed to the attacker. For neural networks, under this assumption, backpropagation can be conducted on the target model because both the network structure and the weights are known to the attacker. For classification models in neural networks, it is usually assumed that $f(x) = \arg\max_i [Z(x)]_i$, where $Z(x) \in \mathbb{R}^K$ is the final (logit) layer output, and $[Z(x)]_i$ is the prediction score for the $i$-th class. The objectives in (1) can then be naturally formulated as the following optimization problem:
$$\min_x \; \big\{ D(x, x_0) + c \cdot L(Z(x)) \big\}, \tag{2}$$
where $D(\cdot, \cdot)$ is some distance measurement (e.g., an $\ell_p$ norm in Euclidean space), $L(\cdot)$ is the loss function corresponding to the goal of the attack, and $c$ is a balancing parameter. For an untargeted attack, where the goal is to make the target classifier misclassify, the loss function can be defined as
$$L(Z(x)) = \max\Big\{ [Z(x)]_{y_0} - \max_{i \neq y_0} [Z(x)]_i,\; 0 \Big\}, \tag{3}$$
where $y_0$ is the original label predicted by the classifier. For a targeted attack, where the goal is to turn the prediction into a specific target class $t$, the loss function can be defined accordingly.
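As a concrete illustration of (3), the sketch below computes the untargeted margin loss directly from a logit vector $Z(x)$ given as a plain Python list. The targeted variant is only one common choice consistent with "defined accordingly" (an assumption, not taken from the text), and both function names are hypothetical.

```python
def untargeted_margin_loss(logits, y0):
    """Loss (3): max{[Z]_{y0} - max_{i != y0} [Z]_i, 0}.

    It is zero exactly when some other class scores at least as high as y0.
    """
    best_other = max(z for i, z in enumerate(logits) if i != y0)
    return max(logits[y0] - best_other, 0.0)


def targeted_margin_loss(logits, t):
    """Assumed targeted counterpart: max{max_{i != t} [Z]_i - [Z]_t, 0}."""
    best_other = max(z for i, z in enumerate(logits) if i != t)
    return max(best_other - logits[t], 0.0)


# Class 0 leads class 1 by a margin of 2.0, so the untargeted loss is 2.0.
print(untargeted_margin_loss([3.0, 1.0, 0.5], y0=0))
```

Because the loss is built from real-valued logits, it changes continuously with $x$, which is what makes gradient-based solvers applicable in the white-box case.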
Therefore, attacking a machine learning model can be posed as solving this optimization problem [9, 12], which is also known as the C&W attack or the EAD attack depending on the choice of the distance measurement. To solve (2), one can apply any gradient-based optimization algorithm such as SGD or Adam, since the gradient of $L(Z(x))$ can be computed via backpropagation.
The ability to compute gradients also enables many different attacks in the white-box setting. For example, (2) can also be turned into a constrained optimization problem, which can then be solved by projected gradient descent (PGD) [11]. FGSM [3] is the special case of one-step PGD with the $\ell_\infty$ norm distance. Other algorithms such as DeepFool [6] also solve similar optimization problems to construct adversarial examples.
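The relation between FGSM and PGD can be sketched as follows, assuming a white-box oracle `grad_fn` (hypothetical) that returns the gradient of the attack loss $L$ with respect to the input, and pixel values in $[0, 1]$. Each step descends $L$ (equivalently, it ascends the classifier's loss on the true label, as in FGSM) and then projects back onto the $\ell_\infty$ ball; the function names and the box constraint are illustrative assumptions.

```python
import math


def _sign(v):
    # sign(0) = 0, the usual convention for FGSM
    return 0.0 if v == 0 else math.copysign(1.0, v)


def project_linf(x, x0, eps, lo=0.0, hi=1.0):
    """Clamp x into the l_inf ball of radius eps around x0, then into [lo, hi]."""
    return [min(max(min(max(xi, ci - eps), ci + eps), lo), hi) for xi, ci in zip(x, x0)]


def pgd_linf(grad_fn, x0, eps, step, iters):
    """Projected signed-gradient descent on the attack loss L (white-box)."""
    x = list(x0)
    for _ in range(iters):
        g = grad_fn(x)
        x = [xi - step * _sign(gi) for xi, gi in zip(x, g)]
        x = project_linf(x, x0, eps)
    return x


def fgsm(grad_fn, x0, eps):
    # FGSM = a single PGD step whose step size equals the radius eps
    return pgd_linf(grad_fn, x0, eps, step=eps, iters=1)
```

With several smaller steps, PGD can follow a curved loss surface, while FGSM commits to one full-radius step.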
2.3 Previous work on black-box attacks
In real-world systems, the underlying machine learning model is usually not revealed, and thus white-box attacks cannot be applied. This motivates the study of attacking machine learning models in the black-box setting, where attackers do not have any information about the function $f$, and the only valid operation is to query the model and get the corresponding output $f(x)$. The first approach for black-box attacks is the transfer attack [14]: instead of attacking the original model $f$, attackers try to construct a substitute model $\tilde{f}$ to mimic $f$ and then attack $\tilde{f}$ using white-box attack methods. This approach has been well studied and analyzed in [15]. However, recent papers have shown that attacking the substitute model usually leads to much larger distortion and a low success rate [10]. Therefore, [10] instead considers the score-based black-box setting, where attackers can use $x$ to query the softmax layer output in addition to the final classification result. In this case, they can reconstruct the loss function (3) and evaluate the attack objective at any $x$. Thus a zeroth order optimization approach can be directly applied to minimize it. [16] further improves the query complexity of [10] by introducing two novel building blocks: (i) an adaptive random gradient estimation algorithm that balances query counts and distortion, and (ii) a well-trained autoencoder that achieves attack acceleration.
[13] also solves a score-based attack problem using an evolutionary algorithm and shows that their method could be applied to the hard-label black-box setting as well.
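In the score-based setting, gradients can be estimated from function values alone. The sketch below shows a symmetric finite-difference estimate along one coordinate, in the spirit of the zeroth order approach of [10]; `h` stands for any real-valued objective that can be evaluated through queries, and the step size and function names are illustrative assumptions.

```python
def coordinate_gradient_estimate(h, x, i, step=1e-4):
    """Estimate dh/dx_i with a symmetric difference quotient (two queries)."""
    x_plus = list(x)
    x_minus = list(x)
    x_plus[i] += step
    x_minus[i] -= step
    return (h(x_plus) - h(x_minus)) / (2.0 * step)


def full_gradient_estimate(h, x, step=1e-4):
    # 2 * len(x) queries in total, which is why query efficiency matters
    return [coordinate_gradient_estimate(h, x, i, step) for i in range(len(x))]


def hard_label(v):
    # a hard-label output: small perturbations almost never change it
    return 1 if v[0] > 0.5 else 0


print(full_gradient_estimate(lambda v: v[0] ** 2 + 3.0 * v[1], [1.0, 2.0]))
print(coordinate_gradient_estimate(hard_label, [0.0, 0.0], 0))  # 0.0: the label does not change
```

The last line illustrates why the same estimator breaks down with hard-label outputs: both queries return the same label, so the estimate carries no information.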
2.4 Difficulty of hard-label black-box attacks
Throughout this paper, the hard-label black-box setting refers to cases where real-world ML systems only provide limited prediction results for an input query. Specifically, only the final decision (the top-1 predicted label), instead of the probability outputs, is known to an attacker.
Attacking in this setting is very challenging. In Figure 1(a), we show the decision boundary of a simple 3-layer neural network. Note that the term $L(Z(x))$ is continuous, as in Figure 1(b), because the logit layer output is a real-valued function. However, in the hard-label black-box setting, only $f(\cdot)$ is available instead of $Z(\cdot)$. Since the output of $f(\cdot)$ can only be a one-hot vector, if we plug $f$ into the loss function, $L(f(x))$ (as shown in Figure 1(c)) will be discontinuous with discrete outputs. Optimizing this function would require combinatorial optimization or search algorithms, which is almost impossible given the high dimensionality of the problem. Therefore, almost no algorithm in the literature can successfully conduct hard-label black-box attacks. The only current approach [1] is based on a random walk on the boundary. Although this decision-based attack can find adversarial examples with distortion comparable to white-box attacks, it suffers from exponential search time, resulting in a large number of queries, and it lacks convergence guarantees. We show that our optimization-based algorithm can significantly reduce the number of queries compared with the decision-based attack, and has guaranteed convergence in the number of iterations (queries).
3 Algorithms
We now introduce a novel way to reformulate the hard-label black-box attack as another optimization problem, show how to evaluate its function value using hard-label queries, and then apply a zeroth order optimization algorithm to solve it.
3.1 A Boundary-based Reformulation
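For a given example $x_0$, true label $y_0$ and the hard-label black-box function $f: \mathbb{R}^d \to \{1, \dots, K\}$, we define our objective function depending on the type of attack:
Untargeted attack: $$g(\theta) = \min\Big\{ \lambda > 0 \;:\; f\Big(x_0 + \lambda \frac{\theta}{\|\theta\|}\Big) \neq y_0 \Big\} \tag{4}$$
Targeted attack (given target $t$): $$g(\theta) = \min\Big\{ \lambda > 0 \;:\; f\Big(x_0 + \lambda \frac{\theta}{\|\theta\|}\Big) = t \Big\} \tag{5}$$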
In this formulation, $\theta$ represents the search direction and $g(\theta)$ is the distance from $x_0$ to the nearest adversarial example along the direction $\theta$. The difference between (4) and (5) corresponds to the different definitions of "successfulness" in untargeted and targeted attacks: the former aims to turn the prediction into any incorrect label, and the latter aims to turn the prediction into the target label.
For untargeted attacks, $g(\theta)$ also corresponds to the distance to the decision boundary along the direction $\theta$.
In image problems the input domain of $x$ is bounded, so we add corresponding upper/lower bounds to the definitions (4) and (5).
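Instead of searching for an adversarial example, we search for the direction $\theta$ that minimizes the distortion $g(\theta)$, which leads to the following optimization problem:
$$\min_\theta \; g(\theta). \tag{6}$$
Finally, the adversarial example can be found by $x^* = x_0 + g(\theta^*) \frac{\theta^*}{\|\theta^*\|}$, where $\theta^*$ is the optimal solution of (6).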
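To make (4) and (6) concrete, consider a hypothetical linear binary classifier $f(x) = 1$ if $w^\top x + b > 0$ and $0$ otherwise, with $f(x_0) = 0$. Along the unit direction $\theta/\|\theta\|$, the boundary is reached at $\lambda = -(w^\top x_0 + b) / (w^\top \theta / \|\theta\|)$ when the denominator is positive, and never otherwise, so $g(\theta)$ has a closed form here. Minimizing it over $\theta$ recovers the point-to-hyperplane distance $|w^\top x_0 + b| / \|w\|$ at $\theta \propto w$. The closed form is only for illustration (the names are hypothetical); in the hard-label setting $g$ must be evaluated through queries.

```python
import math


def g_linear(theta, w, b, x0):
    """Closed-form g(theta) from (4) for the toy classifier 1[w.x + b > 0]."""
    norm = math.sqrt(sum(t * t for t in theta))
    w_dot_dir = sum(wi * ti for wi, ti in zip(w, theta)) / norm
    margin = sum(wi * xi for wi, xi in zip(w, x0)) + b  # negative: x0 is class 0
    if w_dot_dir <= 0:
        return math.inf  # this ray never crosses the boundary
    return -margin / w_dot_dir


def adversarial_point(theta, w, b, x0):
    # x* = x0 + g(theta) * theta / ||theta||, as in the text after (6)
    lam = g_linear(theta, w, b, x0)
    norm = math.sqrt(sum(t * t for t in theta))
    return [xi + lam * ti / norm for xi, ti in zip(x0, theta)]


w, b, x0 = [3.0, 4.0], -5.0, [0.0, 0.0]
print(g_linear([1.0, 0.0], w, b, x0))  # oblique direction: 5/3
print(g_linear([3.0, 4.0], w, b, x0))  # theta along w: the minimum, 5/||w|| = 1
```

Every direction gives a valid adversarial example at distance $g(\theta)$; solving (6) simply picks the direction with the smallest distortion.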
Note that unlike the C&W or PGD objective functions, which are discontinuous step functions in the hard-label setting (see Section 2), $g(\theta)$ maps the input direction $\theta$ to a real-valued output (the distance to the decision boundary), which is usually continuous: a small change of $\theta$ usually leads to a small change of $g(\theta)$, as can be seen from Figure 2.
Moreover, we give three examples of $f(x)$ defined on a two-dimensional input space, together with their corresponding $g(\theta)$. In the first example (panel (a)), we have a continuous classification function; in this case, as shown in panel (c), $g(\theta)$ is continuous. In the other two examples, we show decision boundaries generated by a GBDT and by a neural network classifier, which are not continuous. However, as shown in the corresponding panels (d), even if the classifier function is not continuous, $g(\theta)$ is still continuous. This makes it easy to apply a zeroth order method to solve (6).
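The contrast between a hard-label loss and $g(\theta)$ can be checked numerically on a hypothetical two-class toy classifier that outputs 1 when the first coordinate exceeds 0.5, with $x_0 = (0, 0)$ and $y_0 = 0$. Plugging the one-hot output of $f$ into loss (3) yields only the values 0 and 1, whereas $g$, evaluated by bisection on hard-label queries, changes only slightly between nearby directions (for this toy, $g = 0.5/\cos\phi$ for a unit direction at angle $\phi$). All names and tolerances are illustrative assumptions.

```python
import math


def f(x):
    # hypothetical hard-label classifier: a step function of the first coordinate
    return 1 if x[0] > 0.5 else 0


def hard_label_loss(x, y0=0, num_classes=2):
    # loss (3) applied to one-hot(f(x)) instead of the logits Z(x)
    onehot = [1.0 if k == f(x) else 0.0 for k in range(num_classes)]
    best_other = max(v for k, v in enumerate(onehot) if k != y0)
    return max(onehot[y0] - best_other, 0.0)


def g_bisect(phi, x0=(0.0, 0.0), y0=0, hi=10.0, tol=1e-10):
    """g for the unit direction at angle phi, via bisection on hard-label queries.

    Assumes f(x0) == y0 and that the point at distance hi is adversarial.
    """
    d = (math.cos(phi), math.sin(phi))
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f([x0[0] + mid * d[0], x0[1] + mid * d[1]]) != y0:
            hi = mid
        else:
            lo = mid
    return hi


# Along a straight line the hard-label loss only takes the values {0, 1} ...
print(sorted({hard_label_loss([i / 10, 0.0]) for i in range(11)}))
# ... while g changes by a small amount between nearby directions.
print(abs(g_bisect(0.30) - g_bisect(0.31)))
```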
Compute $g(\theta)$ up to certain accuracy. We are not able to evaluate the gradient of $g$, but we can evaluate the function value of $g$ using hard-label queries to the original function $f$. For simplicity, we focus on untargeted attacks here, but the same procedure can be applied to targeted attacks as well.
First, we discuss how to compute $g(\theta)$ directly without additional information. This is used in the initialization step of our algorithm. For a given normalized $\theta$, we do a fine-grained search and then a binary search. In the fine-grained search, we query the points $\{x_0 + \alpha\theta, x_0 + 2\alpha\theta, \dots\}$ one by one until we find $f(x_0 + i\alpha\theta) \neq y_0$. This means the boundary lies between $x_0 + (i-1)\alpha\theta$ and $x_0 + i\alpha\theta$. We then enter the second phase and conduct a binary search to find the solution within this region (the same as lines 11–17 in Algorithm 1). Note that the first stage is bounded if we choose $\theta$ as the direction of $x - x_0$ for some $x$ from another class. This procedure is used to find the initial $\theta_0$ and the corresponding $g(\theta_0)$ in our optimization algorithm. We omit the detailed algorithm for this part since it is similar to Algorithm 1.
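The two-phase initialization can be sketched as follows, assuming a hard-label oracle `f` that maps a list of floats to a label and satisfies `f(x0) == y0`. The grid spacing `alpha`, the cap `max_steps` and the tolerance `tol` are hypothetical parameters, not values from the text. The function returns the upper end of the final bracket, so the returned distance always corresponds to a point that was verified to be adversarial.

```python
import math


def unit(v):
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]


def initial_g(f, x0, y0, theta, alpha=0.01, max_steps=10000, tol=1e-6):
    """Fine-grained search followed by binary search for g(theta), untargeted case."""
    d = unit(theta)

    def is_adv(lam):
        return f([a + lam * b for a, b in zip(x0, d)]) != y0

    # Phase 1: query x0 + alpha*d, x0 + 2*alpha*d, ... until the label changes.
    for i in range(1, max_steps + 1):
        if is_adv(i * alpha):
            lo, hi = (i - 1) * alpha, i * alpha
            break
    else:
        return math.inf  # no boundary found within max_steps * alpha

    # Phase 2: binary search inside [lo, hi]; hi always stays adversarial.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if is_adv(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

Choosing `theta` along `x - x0` for some `x` of another class guarantees that the ray passes through a known adversarial point at distance `||x - x0||`, which is what bounds the first stage; this sketch simply caps the search with `max_steps` instead.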
Next, we discuss how to compute $g(\theta)$ when we know the solution is very close to a value $v$. This is used in all the function evaluations in our optimization algorithm, since the current solution is usually close to the previous solution, and when we estimate the gradient using (7), the queried direction is only a small perturbation of the previous one. In this case, we first increase or decrease $v$ in a local region to find an interval that contains the boundary (e.g., $f(x_0 + v_{\text{left}}\theta) = y_0$ and $f(x_0 + v_{\text{right}}\theta) \neq y_0$), then conduct a binary search to find the final value of $g(\theta)$. Our procedure for computing the value of $g(\theta)$ is presented in Algorithm 1.
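A sketch of this local refinement is shown below, under the same assumptions as a hard-label oracle `f` with `f(x0) == y0`. The fixed additive `step` used to expand the bracket and the other parameters are hypothetical; the text does not specify how far $v$ is moved in each expansion step.

```python
import math


def unit(v):
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]


def local_g(f, x0, y0, theta, v, step=0.01, tol=1e-6, max_expand=1000):
    """Refine g(theta) when it is known to be close to v.

    Assumes f(x0) == y0. Returns math.inf if no boundary is found while
    expanding outward.
    """
    d = unit(theta)

    def is_adv(lam):
        return f([a + lam * b for a, b in zip(x0, d)]) != y0

    if is_adv(v):
        # v is already adversarial: step towards x0 until the label is y0 again.
        hi, lo = v, max(v - step, 0.0)
        while lo > 0.0 and is_adv(lo):
            hi, lo = lo, max(lo - step, 0.0)
    else:
        # v is not adversarial yet: step away from x0 until the label flips.
        lo, hi = v, v + step
        expansions = 0
        while not is_adv(hi):
            expansions += 1
            if expansions > max_expand:
                return math.inf
            lo, hi = hi, hi + step
    # Binary search in [lo, hi], where lo is not adversarial and hi is.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if is_adv(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

Inside the optimizer, `v` would be the previous value of $g$, so only a few expansion queries are typically needed before the binary search starts.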
3.2 Zeroth Order Optimization
To solve the optimization problem (6), for which we can only evaluate the function value instead of the gradient, zeroth order optimization algorithms can be naturally applied. In fact, after the reformulation, the problem can potentially be solved by any zeroth order optimization algorithm, such as zeroth order gradient descent or coordinate descent (see [17] for a comprehensive survey).
Here we propose to solve (6) using the Random Gradient-Free (RGF) method proposed in [2, 18]. In practice we found that it outperforms zeroth order coordinate descent. In each iteration, the gradient is estimated by
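$$\hat{g} = \frac{g(\theta + \beta u) - g(\theta)}{\beta} \cdot u, \tag{7}$$
where $u$ is a random Gaussian vector, and $\beta > 0$ is a smoothing parameter (we use the same value of $\beta$ in all our experiments). The solution is then updated by $\theta \leftarrow \theta - \eta \hat{g}$ with a step size $\eta$. The procedure is summarized in Algorithm 2.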
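A bare-bones version of the update built on estimator (7) is sketched below. It uses one Gaussian direction per iteration and a fixed step size $\eta$; `beta`, `eta`, the iteration count, the function names and the smooth quadratic test objective are hypothetical choices for illustration. In the attack, `g` would be the query-based evaluation of $g(\theta)$.

```python
import random


def rgf_gradient(g, theta, beta, rng):
    """Estimator (7): (g(theta + beta*u) - g(theta)) / beta * u with u ~ N(0, I)."""
    u = [rng.gauss(0.0, 1.0) for _ in theta]
    shifted = [t + beta * ui for t, ui in zip(theta, u)]
    scale = (g(shifted) - g(theta)) / beta
    return [scale * ui for ui in u]


def rgf_minimize(g, theta, beta=1e-3, eta=0.01, iters=2000, seed=0):
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    for _ in range(iters):
        grad = rgf_gradient(g, theta, beta, rng)
        theta = [t - eta * gi for t, gi in zip(theta, grad)]  # theta <- theta - eta * g_hat
    return theta


def quadratic(theta):
    # smooth stand-in objective with minimizer (1, 1)
    return sum((t - 1.0) ** 2 for t in theta)


print(rgf_minimize(quadratic, [0.0, 0.0]))
```

Each iteration costs two function evaluations, which in the attack translates into two calls of the query-based $g$ procedure.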
There are several implementation details when we apply this algorithm. First, for high-dimensional problems, we found that the estimate in (7) is very noisy. Therefore, instead of using one vector, we sample $q$ vectors from a Gaussian distribution and average their estimators to get $\hat{g}$. We use the same $q$ in all the experiments. The convergence proofs can be naturally extended to this case. Second, instead of using a fixed step size (as suggested in theory), we use a backtracking line-search approach to find the step size at each step. This leads to additional queries, but makes the algorithm more stable and eliminates the need to hand-tune the step size.
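The two implementation details can be sketched together. `averaged_rgf_gradient` averages estimator (7) over `q` Gaussian directions and reuses one evaluation of $g(\theta)$, so each estimate costs $q + 1$ evaluations; `backtracking_step` halves a trial step size until the objective decreases. The text does not specify the exact line-search rule, so the simple "accept the first decrease" test used here, the function names and all parameter values are assumptions.

```python
import random


def averaged_rgf_gradient(g, theta, beta, q, rng):
    """Average of q estimators (7); returns (estimate, g(theta))."""
    g0 = g(theta)
    est = [0.0] * len(theta)
    for _ in range(q):
        u = [rng.gauss(0.0, 1.0) for _ in theta]
        s = (g([t + beta * ui for t, ui in zip(theta, u)]) - g0) / beta
        est = [e + s * ui / q for e, ui in zip(est, u)]
    return est, g0


def backtracking_step(g, theta, grad, g0, eta0=1.0, shrink=0.5, max_tries=20):
    """Try eta0, eta0*shrink, ... and accept the first step that lowers g."""
    eta = eta0
    for _ in range(max_tries):
        candidate = [t - eta * gi for t, gi in zip(theta, grad)]
        value = g(candidate)
        if value < g0:
            return candidate, value
        eta *= shrink
    return theta, g0  # no decrease found: keep the current point


def rgf_with_line_search(g, theta, beta=1e-3, q=10, iters=200, seed=0):
    rng = random.Random(seed)
    history = [g(theta)]
    for _ in range(iters):
        grad, g0 = averaged_rgf_gradient(g, theta, beta, q, rng)
        theta, value = backtracking_step(g, theta, grad, g0)
        history.append(value)
    return theta, history
```

Because a step is only accepted when it lowers $g$, the recorded objective values never increase, which is one way the line search makes the method more stable than a fixed step size.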
3.3 Theoretical Analysis
If $g(\theta)$ can be computed exactly, it has been proved in [2] that RGF in Algorithm 2 requires at most $O(d/\delta^2)$ iterations to converge to a point with $\|\nabla g(\theta)\|^2 \le \delta^2$. However, in our algorithm the function value cannot be computed exactly; instead, we compute it up to $\epsilon$-precision, and this precision can be controlled by the binary search threshold in Algorithm 1. We thus extend the proof in [2] to include the case of approximate function value evaluation, as described in the following theorem.
Theorem 1
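In Algorithm 2, suppose $g$ has a Lipschitz-continuous gradient with constant $L_1(g)$. If the error of function value evaluation is controlled by $\epsilon \sim O(\beta\delta^2)$ and $\beta \le O\big(\delta / (d L_1(g))\big)$, then in order to obtain $\frac{1}{N+1}\sum_{k=0}^{N} \mathbb{E}\big[\|\nabla g(\theta_k)\|^2\big] \le \delta^2$, the total number of iterations is at most $O(d/\delta^2)$.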
Detailed proofs can be found in the appendix. Note that the binary search procedure can obtain the desired function value precision $\epsilon$ in $O(\log(1/\epsilon))$ steps. Using the same idea as Theorem 1 and following the proof in [2], we could also obtain a complexity bound when $g$ is nonsmooth but Lipschitz continuous.
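The $O(\log(1/\epsilon))$ cost of the binary search is easy to check: each query halves the bracket, so shrinking an initial bracket of width $w$ to at most $\epsilon$ takes $\lceil \log_2(w/\epsilon) \rceil$ queries. A small sketch (the function name is hypothetical) compares the counted halvings with the formula:

```python
import math


def bisection_queries(width, eps):
    """Number of halvings needed to shrink a bracket of size width to <= eps."""
    queries = 0
    while width > eps:
        width /= 2.0
        queries += 1
    return queries


for eps in (1e-2, 1e-3, 1e-6):
    print(eps, bisection_queries(1.0, eps), math.ceil(math.log2(1.0 / eps)))
```

Tightening the precision by a factor of 1000 therefore costs only about 10 extra queries per function evaluation.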
4 Experimental results
We test the performance of our hard-label black-box attack algorithm on convolutional neural network (CNN) models and compare it with the decision-based attack [1]. Furthermore, we show that our method can be applied to attack Gradient Boosting Decision Trees (GBDT) and present some interesting findings.
4.1 Attack CNN image classification models
We use three standard datasets: MNIST [19], CIFAR-10 [20] and ImageNet-1000 [21]. To have a fair comparison with previous work, we adopt the same networks used in both [9] and [1]. In detail, both MNIST and CIFAR use the same network structure with four convolution layers, two max-pooling layers and two fully-connected layers. Using the parameters provided by [9], we achieve 99.5% accuracy on MNIST and 82.5% accuracy on CIFAR-10, which is similar to what was reported in [9]. For ImageNet-1000, we use the pretrained ResNet-50 network [22] provided by torchvision (https://github.com/pytorch/vision/tree/master/torchvision), which achieves 76.15% top-1 accuracy. All models are trained using PyTorch, and our source code is publicly available (https://github.com/LeMinhThong/blackboxattack).
We include the following algorithms in the comparison:
- Opt-based black-box attack (Opt-attack): our proposed algorithm.
- Decision-based black-box attack [1] (Decision-attack): the only previous work on attacking hard-label black-box models. We use the authors' implementation with the default parameters provided in Foolbox (https://github.com/bethgelab/foolbox).
- C&W white-box attack [9]: one of the current state-of-the-art attack algorithms in the white-box setting. We do a binary search on the parameter $c$ per image to achieve the best performance. Attacking in the white-box setting is a much easier problem, so we include the C&W attack only for reference, as an indication of the best performance we can possibly achieve.
For all the cases, we conduct adversarial attacks on randomly sampled images from the validation sets. Note that all three attacks have a 100% success rate, and we report the average $\ell_2$ distortion $\|x_i^* - x_{0,i}\|_2$, where $x_i^*$ is the adversarial example constructed by an attack algorithm and $x_{0,i}$ is the original $i$-th example. For black-box attack algorithms, we also report the average number of queries for comparison.
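For reference, the reported metric can be computed as below, assuming the distortion of each example is the $\ell_2$ norm of its perturbation, averaged over the attacked examples. Each example is represented as a flat list of pixel values, and the function name is hypothetical.

```python
import math


def average_l2_distortion(adversarial, original):
    """Mean of ||x_adv - x_orig||_2 over paired examples."""
    if len(adversarial) != len(original) or not adversarial:
        raise ValueError("need the same, non-zero number of examples")
    total = 0.0
    for x_adv, x_orig in zip(adversarial, original):
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(x_adv, x_orig)))
    return total / len(adversarial)
```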
4.1.1 Untargeted Attack
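Table 1: Untargeted attack results (average $\ell_2$ distortion and average number of queries).

| Method | MNIST Avg $\ell_2$ | MNIST # queries | CIFAR-10 Avg $\ell_2$ | CIFAR-10 # queries | ImageNet (ResNet-50) Avg $\ell_2$ | ImageNet (ResNet-50) # queries |
|---|---|---|---|---|---|---|
| Decision-attack (black-box) | 1.1222 | 60,293 | 0.1575 | 123,879 | 5.9791 | 123,407 |
| | 1.1087 | 143,357 | 0.1501 | 220,144 | 3.7725 | 260,797 |
| Opt-attack (black-box) | 1.188 | 22,940 | 0.2050 | 40,941 | 6.9796 | 71,100 |
| | 1.049 | 51,683 | 0.1625 | 77,327 | 4.7100 | 127,086 |
| | 1.011 | 126,486 | 0.1451 | 133,662 | 3.1120 | 237,342 |
| C&W (white-box) | 0.9921 | - | 0.1012 | - | 1.9365 | - |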
For untargeted attacks, the goal is to turn a correctly classified image into any other label. The results are presented in Table 1. Note that for both Opt-attack and Decision-attack, by changing the stopping conditions we can obtain the performance at different numbers of queries.
First, we compare the two black-box attack methods in Table 1. Our algorithm consistently achieves smaller distortion with fewer queries than Decision-attack. For example, on MNIST data, we are able to reduce the number of queries by 3-4 fold, and Decision-attack converges to worse solutions on all 3 datasets. Compared with the C&W attack, we found that black-box attacks attain slightly worse distortion on MNIST and CIFAR.
This is reasonable because the white-box attack has much more information than the black-box attack and is strictly easier. We note that the experiments in [1] conclude that C&W and Decision-attack have similar performance because they only run C&W with a single regularization parameter, without doing a binary search to obtain the optimal parameter. For ImageNet, since we constrain the number of queries, the distortion of black-box attacks is much worse than that of the C&W attack. The gap can be reduced by increasing the number of queries, as shown in Figure 4.
4.1.2 Targeted attack
The results for targeted attacks are presented in Table 2. Following the experiments in [1], for each randomly sampled image we choose the target label in the same way as [1]. On MNIST data, we found that our algorithm is more than 4 times faster (in terms of number of queries) than Decision-attack and converges to a better solution. On CIFAR data, our algorithm has similar efficiency to Decision-attack in the first 60,000 queries, but converges to a slightly worse solution. Also, we show an example quality comparison from the same starting point to the original sample in Figure 5.
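Table 2: Targeted attack results (average $\ell_2$ distortion and average number of queries).

| Method | MNIST Avg $\ell_2$ | MNIST # queries | CIFAR-10 Avg $\ell_2$ | CIFAR-10 # queries |
|---|---|---|---|---|
| Decision-attack (black-box) | 2.3158 | 30,103 | 0.2850 | 55,552 |
| | 2.0052 | 58,508 | 0.2213 | 140,572 |
| | 1.8668 | 192,018 | 0.2122 | 316,791 |
| Opt-attack (black-box) | 1.8522 | 46,248 | 0.2758 | 61,869 |
| | 1.7744 | 57,741 | 0.2369 | 141,437 |
| | 1.7114 | 73,293 | 0.2300 | 186,753 |
| C&W (white-box) | 1.4178 | - | 0.1901 | - |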
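Table 3: Untargeted attack on GBDT models (average $\ell_2$ distortion and average number of queries).

| Method | HIGGS Avg $\ell_2$ | HIGGS # queries | MNIST Avg $\ell_2$ | MNIST # queries |
|---|---|---|---|---|
| Ours | 0.3458 | 4,229 | 0.6113 | 5,125 |
| | 0.2179 | 11,139 | 0.5576 | 11,858 |
| | 0.1704 | 29,598 | 0.5505 | 32,230 |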
4.1.3 Attack Gradient Boosting Decision Tree (GBDT)
To evaluate our method's ability to attack models with discrete decision functions, we conduct our untargeted attack on gradient boosting decision trees (GBDT). In this experiment, we use two standard datasets: HIGGS [23] for binary classification and MNIST [19] for multiclass classification. We use the popular LightGBM framework (https://github.com/Microsoft/LightGBM) to train the GBDT models. Using the suggested parameters (https://github.com/Koziev/MNIST_Boosting), we achieve 0.8457 AUC on HIGGS and 98.09% accuracy on MNIST. The results of the untargeted attack on GBDT are shown in Table 3.
As shown in Table 3, using around 30K queries, we obtain a small distortion on both datasets, which uncovers the vulnerability of GBDT models for the first time. Tree-based methods are well known for their good interpretability, and because of that, they are widely used in industry. However, we show that even with good interpretability and a prediction accuracy similar to that of convolutional neural networks, GBDT models are vulnerable to our Opt-attack. This result raises questions about the robustness of tree-based models, which will be an interesting direction for future work.
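The same machinery applies unchanged to non-differentiable models. The self-contained sketch below runs a simplified version of the attack (query-based $g(\theta)$ via grid plus binary search, the averaged estimator (7), and a backtracking line search) against a hypothetical two-stump "ensemble" on a 2-D input. It is not LightGBM and not the authors' implementation; the model, the function names and all parameters are illustrative assumptions. Because the model is piecewise constant, gradient-based attacks have nothing to work with, yet $g(\theta)$ is still well defined.

```python
import math
import random


def stump_model(x):
    """Hypothetical two-stump 'ensemble': piecewise constant, so no useful gradient."""
    score = (1.0 if x[0] > 0.3 else -1.0) + (1.0 if x[1] > 0.4 else -1.0)
    return 1 if score > 0 else 0


def make_g(f, x0, y0, counter, alpha=0.01, tol=1e-6, max_steps=10000):
    """Query-based g(theta) for an untargeted attack: grid search, then bisection."""
    def is_adv(lam, d):
        counter[0] += 1  # every call is one hard-label query
        return f([a + lam * b for a, b in zip(x0, d)]) != y0

    def g(theta):
        n = math.sqrt(sum(t * t for t in theta))
        d = [t / n for t in theta]
        for i in range(1, max_steps + 1):
            if is_adv(i * alpha, d):
                lo, hi = (i - 1) * alpha, i * alpha
                break
        else:
            return math.inf
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if is_adv(mid, d):
                hi = mid
            else:
                lo = mid
        return hi

    return g


def opt_attack(f, x0, y0, theta0, iters=50, beta=1e-3, q=10, seed=0):
    counter = [0]
    g = make_g(f, x0, y0, counter)
    rng = random.Random(seed)
    theta = list(theta0)
    g_theta = g(theta)
    g_init = g_theta
    for _ in range(iters):
        # averaged random gradient-free estimate, as in (7)
        est = [0.0] * len(theta)
        for _ in range(q):
            u = [rng.gauss(0.0, 1.0) for _ in theta]
            g_u = g([t + beta * ui for t, ui in zip(theta, u)])
            if not math.isfinite(g_u):
                continue  # perturbed ray missed the boundary; skip this direction
            s = (g_u - g_theta) / beta
            est = [e + s * ui / q for e, ui in zip(est, u)]
        # backtracking line search: accept the first step size that lowers g
        eta = 1.0
        for _ in range(20):
            cand = [t - eta * e for t, e in zip(theta, est)]
            g_cand = g(cand)
            if g_cand < g_theta:
                theta, g_theta = cand, g_cand
                break
            eta *= 0.5
    n = math.sqrt(sum(t * t for t in theta))
    x_adv = [a + g_theta * (t / n) for a, t in zip(x0, theta)]
    return x_adv, g_init, g_theta, counter[0]


x_adv, g_start, g_end, queries = opt_attack(stump_model, [0.0, 0.0], 0, [1.0, 1.0])
print(stump_model(x_adv), round(g_start, 4), round(g_end, 4), queries)
```

For this toy model, the class-1 region is the quadrant where the first coordinate exceeds 0.3 and the second exceeds 0.4, so the smallest achievable distortion from the origin is 0.5 (the corner $(0.3, 0.4)$). Starting from the direction $(1, 1)$, where $g \approx 0.566$, the line search only accepts decreases, so the returned distortion moves towards that value while every returned point remains a verified adversarial example.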
5 Conclusion
In this paper, we propose a generic, optimization-based hard-label black-box attack algorithm, which can be applied to discrete and non-continuous models other than neural networks, such as gradient boosting decision trees. Our method is query-efficient and has a theoretical convergence guarantee on the attack performance. Moreover, our attack achieves smaller or similar distortion using 3-4 times fewer queries compared with the state-of-the-art algorithm.
References
[1] Wieland Brendel, Jonas Rauber, and Matthias Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. arXiv preprint arXiv:1712.04248, 2017.
[2] Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.
[3] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
[4] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[5] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations.
[6] Seyed Mohsen Moosavi Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: a simple and accurate method to fool deep neural networks. In Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), number EPFL-CONF-218057, 2016.
[7] Hongge Chen, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, and Cho-Jui Hsieh. Attacking visual language grounding with adversarial examples: A case study on neural image captioning. In ACL, 2018.
[8] Minhao Cheng, Jinfeng Yi, Huan Zhang, Pin-Yu Chen, and Cho-Jui Hsieh. Seq2Sick: Evaluating the robustness of sequence-to-sequence models with adversarial examples. CoRR, 2018.
[9] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on, pages 39–57. IEEE, 2017.
[10] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 15–26. ACM, 2017.
[11] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.
[12] Pin-Yu Chen, Yash Sharma, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh. EAD: elastic-net attacks to deep neural networks via adversarial examples. In AAAI, 2018.
[13] Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. Query-efficient black-box adversarial examples. arXiv preprint arXiv:1712.07113, 2017.
[14] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pages 506–519. ACM, 2017.
[15] Arjun Nitin Bhagoji, Warren He, Bo Li, and Dawn Song. Exploring the space of black-box attacks on deep neural networks. arXiv preprint arXiv:1712.09491, 2017.
[16] Chun-Chen Tu, Pai-Shun Ting, Pin-Yu Chen, Sijia Liu, Huan Zhang, Jinfeng Yi, Cho-Jui Hsieh, and Shin-Ming Cheng. AutoZOOM: Autoencoder-based zeroth order optimization method for attacking black-box neural networks. CoRR, abs/1805.11770, 2018.
[17] Andrew R. Conn, Katya Scheinberg, and Luis N. Vicente. Introduction to derivative-free optimization, volume 8. SIAM, 2009.
[18] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
[19] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[20] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
[21] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[23] Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5:4308, 2014.
[24] Yurii Nesterov. Random gradient-free minimization of convex functions. Technical report, 2011.
6 Appendix
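Because there is a stopping criterion in Algorithm 1, we cannot obtain the exact $g(\theta)$. Instead, we obtain an approximation $\tilde{g}(\theta)$ with $\epsilon$ error, i.e., $|\tilde{g}(\theta) - g(\theta)| \le \epsilon$. Also, we define the noisy gradient estimator to be (7) with $g$ replaced by $\tilde{g}$.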
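Following [24], we define the Gaussian smoothing approximation of $g$, i.e.,
$$g_\beta(\theta) = \frac{1}{\kappa}\int_{\mathbb{R}^d} g(\theta + \beta u)\, e^{-\frac{1}{2}\|u\|^2}\, du = \mathbb{E}_u\big[g(\theta + \beta u)\big], \qquad \kappa = \int_{\mathbb{R}^d} e^{-\frac{1}{2}\|u\|^2}\, du = (2\pi)^{d/2}. \tag{8}$$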
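The Gaussian smoothing $g_\beta(\theta) = \mathbb{E}_u[g(\theta + \beta u)]$ with $u \sim N(0, I_d)$ is an expectation, so it can be approximated by sampling. For $g(\theta) = \|\theta\|^2$ it has the closed form $g_\beta(\theta) = \|\theta\|^2 + \beta^2 d$ (because $\mathbb{E}[u] = 0$ and $\mathbb{E}\|u\|^2 = d$), which gives a convenient check. The sample count, seed and function names are arbitrary choices for illustration.

```python
import random


def smoothed_value(g, theta, beta, samples=100000, seed=0):
    """Monte Carlo estimate of g_beta(theta) = E_u[g(theta + beta*u)], u ~ N(0, I)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        total += g([t + beta * rng.gauss(0.0, 1.0) for t in theta])
    return total / samples


def squared_norm(v):
    return sum(x * x for x in v)


theta, beta = [1.0, 2.0], 0.5
print(smoothed_value(squared_norm, theta, beta))    # Monte Carlo estimate
print(squared_norm(theta) + beta ** 2 * len(theta))  # exact value: 5.5
```

The gap between $g_\beta$ and $g$ grows with $\beta$, which is one reason the smoothing parameter has to be small in the convergence analysis.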
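Also, we have the following bounds on the moments $M_p = \frac{1}{\kappa}\int_{\mathbb{R}^d} \|u\|^p e^{-\frac{1}{2}\|u\|^2}\, du$ from [24] Lemma 1. For $p \in [0, 2]$, we have
$$M_p \le d^{p/2}. \tag{9}$$
If $p \ge 2$, we have two-sided bounds
$$d^{p/2} \le M_p \le (p + d)^{p/2}. \tag{10}$$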
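The moment bounds from [24] Lemma 1 ($M_p \le d^{p/2}$ for $p \in [0, 2]$, and $d^{p/2} \le M_p \le (p + d)^{p/2}$ for $p \ge 2$) can be sanity-checked numerically using the standard closed form $M_p = \mathbb{E}\|u\|^p = 2^{p/2}\,\Gamma\big(\tfrac{d+p}{2}\big) / \Gamma\big(\tfrac{d}{2}\big)$ for $u \sim N(0, I_d)$, which holds because $\|u\|^2$ is chi-squared with $d$ degrees of freedom. The dimensions and exponents checked are arbitrary.

```python
import math


def gaussian_norm_moment(d, p):
    """E||u||^p for u ~ N(0, I_d), from the chi distribution with d degrees of freedom."""
    return 2.0 ** (p / 2.0) * math.gamma((d + p) / 2.0) / math.gamma(d / 2.0)


def check_moment_bounds(max_dim=50, tol=1e-9):
    for d in range(1, max_dim + 1):
        for p in (0.5, 1.0, 1.5, 2.0):  # bound for p in [0, 2]
            assert gaussian_norm_moment(d, p) <= d ** (p / 2.0) + tol
        for p in (2.0, 3.0, 4.0, 6.0):  # two-sided bounds for p >= 2
            m = gaussian_norm_moment(d, p)
            assert d ** (p / 2.0) - tol <= m <= (p + d) ** (p / 2.0) + tol
    return True


print(gaussian_norm_moment(5, 4))  # equals d * (d + 2) = 35 up to rounding
print(check_moment_bounds())
```

At $p = 2$ both bounds are tight, since $M_2 = \mathbb{E}\|u\|^2 = d$ exactly.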
6.1 Proof of Theorem 1
Suppose $g$ has a Lipschitz-continuous gradient with constant $L_1(g)$; then
(11) 
We could bound as follows. Since
(12)  
and ,
(13) 
Take expectation over u, and with Theorem 3 in [24], which is ,
(14)  
With , we could bound :
(15) 
We use the result that
(16) 
which is proved in [24] Lemma 4.
Therefore, since , we could get
(17)  
Therefore, since $g$ has a Lipschitz-continuous gradient:
(18) 
so that
(19) 
Since
(20)  
where $\mathbf{1}$ is an all-ones vector, taking the expectation over $u$, we obtain
(21)  
Choosing , we obtain
(22)  
Since , taking expectation over , where , we get
(23) 
where and .
Assuming , summing over $k$ and dividing by $N+1$, we get
(24)  
Clearly, .
Since , is of the same order as . In order to satisfy , we need to choose , and then N is bounded by .