It has been observed recently that machine learning algorithms, especially deep neural networks, are vulnerable to adversarial examples [3, 4, 5, 6, 7, 8]. For example, in image classification problems, attack algorithms [9, 3, 10] can find adversarial examples for almost every image with very small, human-imperceptible perturbations. Finding an adversarial example can be posed as an optimization problem: within a small neighbourhood around the original example, find a point that optimizes a cost function measuring the “successfulness” of an attack. Optimizing this objective with a gradient-based optimizer leads to state-of-the-art attacks [9, 3, 10, 4, 11].
Most of these attacks consider the “white-box” setting, where the machine learning model is fully exposed to the attacker. In this setting, the gradient of the above-mentioned attack objective function can be computed by back-propagation, so attacks can be carried out very easily. This white-box setting is clearly unrealistic when the model parameters are unknown to an attacker. Instead, several recent works consider the “score-based black-box” setting, where the machine learning model is unknown to the attacker, but it is possible to make queries to obtain the corresponding probability outputs of the model [10, 13]. However, in many cases real-world models will not provide probability outputs to users. Instead, only the final decision (e.g., top-1 predicted class) can be observed. It is therefore interesting to investigate whether machine learning models are vulnerable in this setting.
Furthermore, existing gradient-based attacks cannot be applied to some non-continuous machine learning models which involve discrete decisions. For example, the robustness of decision-tree based models (random forest and gradient boosting decision trees (GBDT)) cannot be evaluated using gradient-based approaches, since the gradient of these functions does not exist.
In this paper, we develop an optimization-based framework for attacking machine learning models in a more realistic and general “hard-label black-box” setting. We assume that the model is not revealed and the attacker can only make queries to get the corresponding hard-label decision instead of the probability outputs (also known as soft labels). Attacking in this setting is very challenging and almost all the previous attacks fail for the following two reasons. First, the gradient cannot be computed directly by back-propagation, and finite-difference approaches also fail because the hard-label output is insensitive to small input perturbations. Second, since only the hard-label decision is observed, the attack objective functions become discontinuous with discrete outputs, which makes the problem combinatorial in nature and hard to optimize (see Section 2.4 for more details).
In this paper, we make hard-label black-box attacks possible and query-efficient by reformulating the attack as a novel real-valued optimization problem, which is usually continuous and much easier to solve. Although the objective function of this reformulation cannot be written in an analytical form, we show how to use model queries to evaluate its function value and apply any zeroth order optimization algorithm to solve it. Furthermore, we prove that by carefully controlling the numerical accuracy of function evaluations, a Randomized Gradient-Free (RGF) method can converge to stationary points as long as the boundary is smooth. We note that this is the first attack with a guaranteed convergence rate in the hard-label black-box setting. In the experiments, we show our algorithm can be successfully used to attack hard-label black-box CNN models on MNIST, CIFAR, and ImageNet with far fewer queries than the state-of-the-art algorithm.
Moreover, since our algorithm does not depend on the gradient of the classifier, we can apply our approach to other non-differentiable classifiers besides neural networks. We show an interesting application in attacking Gradient Boosting Decision Tree, which cannot be attacked by any existing gradient-based method even in the white-box setting. Our method can successfully find adversarial examples with imperceptible perturbations for a GBDT within 30,000 queries.
2 Background and Related work
We will first introduce our problem setting and give a brief literature review to highlight the difficulty of attacking hard-label black-box models.
2.1 Problem Setting
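For simplicity, we consider attacking a $K$-way multi-class classification model in this paper. Given the classification model $f: \mathbb{R}^d \to \{1, \dots, K\}$ and an original example $x_0$, the goal is to generate an adversarial example $x$ such that
$$x \text{ is close to } x_0 \quad \text{and} \quad f(x) \neq f(x_0). \quad (1)$$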
2.2 White-box attacks
Most attack algorithms in the literature consider the white-box setting, where the classifier is exposed to the attacker. For neural networks, under this assumption, back-propagation can be conducted on the target model because both the network structure and the weights are known to the attacker. For classification models in neural networks, it is usually assumed that $f(x) = \arg\max_i [Z(x)]_i$, where $Z(x) \in \mathbb{R}^K$ is the final (logit) layer output, and $[Z(x)]_i$ is the prediction score for the $i$-th class. The objectives in (1) can then be naturally formulated as the following optimization problem:
$$\arg\min_x \; \{ \mathrm{Dis}(x, x_0) + c\,\mathcal{L}(Z(x)) \}, \quad (2)$$
where $\mathrm{Dis}(\cdot, \cdot)$ is some distance measurement (e.g., $\ell_2$ or $\ell_\infty$ norm in Euclidean space), $\mathcal{L}(\cdot)$ is the loss function corresponding to the goal of the attack, and $c$ is a balancing parameter. For untargeted attack, where the goal is to make the target classifier misclassify, the loss function can be defined as
$$\mathcal{L}(Z(x)) = \max\{ [Z(x)]_{y_0} - \max_{i \neq y_0} [Z(x)]_i, \; -\kappa \}, \quad (3)$$
where $y_0$ is the original label predicted by the classifier and $\kappa \ge 0$ is a confidence margin. For targeted attack, where the goal is to turn the prediction into a specific target class $t$, the loss function can be defined accordingly.
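As a minimal sketch of how an untargeted margin loss of this kind is evaluated from logits, assuming the standard C&W form $\max\{[Z(x)]_{y_0} - \max_{i \neq y_0}[Z(x)]_i, -\kappa\}$ (the function name and the toy logits are illustrative, not taken from the authors' code):

```python
def cw_untargeted_loss(logits, y0, kappa=0.0):
    """Margin between the original class score and the best other class.

    Assumed form: max(Z[y0] - max_{i != y0} Z[i], -kappa).
    A positive value means the input is still classified as y0.
    """
    best_other = max(z for i, z in enumerate(logits) if i != y0)
    return max(logits[y0] - best_other, -kappa)


# Still classified as class 0: positive margin.
print(cw_untargeted_loss([2.0, 0.5, -1.0], y0=0))
# Misclassified: the margin is negative and clipped at -kappa.
print(cw_untargeted_loss([0.25, 1.0, -1.0], y0=0, kappa=0.5))
```

Because the loss is a continuous function of the real-valued logits, it can be minimized with gradients in the white-box setting or with finite differences when scores can be queried.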
Therefore, attacking a machine learning model can be posed as solving this optimization problem [9, 12], which is known as the C&W attack or the EAD attack depending on the choice of the distance measurement. To solve (2), one can apply any gradient-based optimization algorithm such as SGD or Adam, since the gradient of $\mathcal{L}(Z(x))$ can be computed via back-propagation.
The ability to compute gradients also enables many other attacks in the white-box setting. For example, (2) can also be turned into a constrained optimization problem, which can then be solved by projected gradient descent (PGD) [11]. FGSM [3] is the special case of one-step PGD with the $\ell_\infty$ norm distance. Other algorithms such as DeepFool [6] also solve similar optimization problems to construct adversarial examples.
2.3 Previous work on black-box attack
In real-world systems, the underlying machine learning model is usually not revealed, so white-box attacks cannot be applied. This motivates the study of attacking machine learning models in the black-box setting, where attackers have no information about the function $f$ and the only valid operation is to query the model and observe the corresponding output $f(x)$. The first approach to black-box attack is the transfer attack [14]: instead of attacking the original model $f$, attackers construct a substitute model $\hat f$ to mimic $f$ and then attack $\hat f$ using white-box attack methods. This approach has been well studied and analyzed in [15]. However, recent papers have shown that attacking the substitute model usually leads to much larger distortion and a low success rate [10]. Therefore, [10] instead considers the score-based black-box setting, where attackers can use $x$ to query the softmax layer output in addition to the final classification result. In this case, they can reconstruct the loss function (3) and evaluate it as long as the objective function $\mathcal{L}(Z(x))$ exists for any $x$. Thus a zeroth order optimization approach can be directly applied to minimize $\mathcal{L}(Z(x))$. [16] further improves the query complexity of [10] by introducing two novel building blocks: (i) an adaptive random gradient estimation algorithm that balances query counts and distortion, and (ii) a well-trained autoencoder that achieves attack acceleration. [13] also solves a score-based attack problem using an evolutionary algorithm and shows that the method can be applied to the hard-label black-box setting as well.
2.4 Difficulty of hard-label black-box attacks
Throughout this paper, the hard-label black-box setting refers to cases where real-world ML systems only provide limited prediction results of an input query. Specifically, only the final decision (top-1 predicted label) instead of probability outputs is known to an attacker.
Attacking in this setting is very challenging. In Figure (a), we show a simple 3-layer neural network’s decision boundary. Note that the term $\mathcal{L}(Z(x))$ is continuous, as in Figure (b), because the logit layer output is a real-valued function. However, in the hard-label black-box setting, only $f(\cdot)$ is available instead of $Z(\cdot)$. Since $f(\cdot)$ can only be a one-hot vector, if we plug $f$ into the loss function, $\mathcal{L}(f(x))$ (as shown in Figure (c)) will be discontinuous with discrete outputs.
Optimizing this function requires combinatorial optimization or search algorithms, which is almost impossible given the high dimensionality of the problem. Therefore, almost no algorithm in the literature can successfully conduct a hard-label black-box attack. The only current approach [1] is based on a random walk on the boundary. Although this decision-based attack can find adversarial examples with distortion comparable to white-box attacks, it suffers from exponential search time, resulting in a large number of queries, and lacks convergence guarantees. We show that our optimization-based algorithm significantly reduces the number of queries compared with the decision-based attack, and has guaranteed convergence in the number of iterations (queries).
Now we will introduce a novel way to re-formulate hard-label black-box attack as another optimization problem, show how to evaluate the function value using hard-label queries, and then apply a zeroth order optimization algorithm to solve it.
3.1 A Boundary-based Re-formulation
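For a given example $x_0$, true label $y_0$ and the hard-label black-box function $f: \mathbb{R}^d \to \{1, \dots, K\}$, we define our objective function $g: \mathbb{R}^d \to \mathbb{R}$ depending on the type of attack:
|Untargeted attack:|$g(\theta) = \min \{ \lambda > 0 : f(x_0 + \lambda \frac{\theta}{\|\theta\|}) \neq y_0 \}$|(4)|
|Targeted attack (given target $t$):|$g(\theta) = \min \{ \lambda > 0 : f(x_0 + \lambda \frac{\theta}{\|\theta\|}) = t \}$|(5)|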
In this formulation, $\theta$ represents the search direction and $g(\theta)$ is the distance from $x_0$ to the nearest adversarial example along the direction $\theta$. The difference between (4) and (5) corresponds to the different definitions of “successfulness” in untargeted and targeted attack, where the former aims to turn the prediction into any incorrect label and the latter aims to turn the prediction into the target label.
For untargeted attack, $g(\theta)$ also corresponds to the distance to the decision boundary along the direction $\theta$.
In image problems the input domain of $f$ is bounded, so we add the corresponding upper/lower bounds in the definitions (4) and (5).
Instead of searching for an adversarial example, we search for the direction $\theta$ that minimizes the distortion $g(\theta)$, which leads to the following optimization problem:
$$\min_{\theta} \; g(\theta). \quad (6)$$
Finally, the adversarial example can be found by $x^* = x_0 + g(\theta^*) \frac{\theta^*}{\|\theta^*\|}$, where $\theta^*$ is the optimal solution of (6).
Note that unlike the C&W or PGD objective functions, which are discontinuous step functions in the hard-label setting (see Section 2), $g(\theta)$ maps an input direction to a real-valued output (the distance to the decision boundary), which is usually continuous: a small change of $\theta$ usually leads to a small change of $g(\theta)$, as can be seen from Figure 2.
Moreover, we give three examples of $f(x)$ defined in a two-dimensional input space and their corresponding $g(\theta)$. In Figure (a), we have a continuous classification function defined as follows
In this case, as shown in Figure (c), $g(\theta)$ is continuous. Moreover, in Figure (b) and Figure (a), we show decision boundaries generated by GBDT and a neural network classifier, which are not continuous. However, as shown in Figure (d) and Figure (d), even if the classifier function is not continuous, $g(\theta)$ is still continuous. This makes it easy to apply a zeroth order method to solve (6).
Compute $g(\theta)$ up to certain accuracy. We are not able to evaluate the gradient of $g$, but we can evaluate the function value of $g$ using hard-label queries to the original function $f$. For simplicity, we focus on untargeted attack here, but the same procedure can be applied to targeted attack as well.
First, we discuss how to compute $g(\theta)$ directly without additional information. This is used in the initialization step of our algorithm. For a given normalized $\theta$, we do a fine-grained search and then a binary search. In the fine-grained search, we query the points $\{x_0 + \alpha\theta, x_0 + 2\alpha\theta, \dots\}$ one by one until we find $f(x_0 + i\alpha\theta) \neq y_0$. This means the boundary lies between $x_0 + (i-1)\alpha\theta$ and $x_0 + i\alpha\theta$. We then enter the second phase and conduct a binary search to find the solution within this region (the same as lines 11–17 in Algorithm 1). Note that the first stage is bounded if we choose $\theta$ as the direction of $x - x_0$ for some $x$ from another class. This procedure is used to find the initial $\theta_0$ and the corresponding $g(\theta_0)$ in our optimization algorithm. We omit the detailed algorithm for this part since it is similar to Algorithm 1.
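The two-phase search described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the toy classifier `hard_label` (class 1 outside the unit circle), the helper `point_along`, and the parameters `step`, `max_steps` and `tol` are all hypothetical names chosen for the example.

```python
import math


def hard_label(x):
    """Toy hard-label classifier (assumption): class 1 outside the unit circle."""
    return 1 if math.hypot(x[0], x[1]) > 1.0 else 0


def point_along(x0, theta, lam):
    """Return x0 + lam * theta / ||theta||."""
    norm = math.sqrt(sum(t * t for t in theta))
    return [xi + lam * ti / norm for xi, ti in zip(x0, theta)]


def initial_g(f, x0, y0, theta, step=0.25, max_steps=1000, tol=1e-6):
    """Estimate the distance to the decision boundary along theta."""
    # Phase 1: fine-grained search with a fixed step until the label changes.
    hi = None
    for i in range(1, max_steps + 1):
        if f(point_along(x0, theta, i * step)) != y0:
            hi = i * step
            break
    if hi is None:
        return math.inf  # no adversarial point found along this direction
    lo = hi - step  # last queried distance that still returned y0 (or 0)

    # Phase 2: binary search inside [lo, hi] down to the tolerance.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(point_along(x0, theta, mid)) != y0:
            hi = mid
        else:
            lo = mid
    return hi


print(initial_g(hard_label, [0.0, 0.0], 0, [3.0, 4.0]))
```

For the toy classifier the boundary is at distance 1 from the origin in every direction, so the returned value should be within `tol` of 1. The query cost is the number of coarse steps plus roughly $\log_2(\text{step}/\text{tol})$ binary-search queries.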
Next, we discuss how to compute $g(\theta)$ when we know the solution is very close to a value $v$. This is used in all the function evaluations in our optimization algorithm, since the current solution is usually close to the previous solution, and when we estimate the gradient using (7), the queried direction is only a small perturbation of the previous one. In this case, we first increase or decrease $v$ in a local region to find an interval $[v_{\text{left}}, v_{\text{right}}]$ that contains the boundary (e.g., $f(x_0 + v_{\text{left}}\theta) = y_0$ and $f(x_0 + v_{\text{right}}\theta) \neq y_0$), then conduct a binary search to find the final value of $g$. Our procedure for computing the value of $g$ is presented in Algorithm 1.
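A minimal sketch of this local evaluation, in the spirit of Algorithm 1 but not a transcription of it: the toy classifier, the helper names, and the parameters `ratio`, `tol` and `max_expand` are assumptions made for the example.

```python
import math


def hard_label(x):
    """Toy hard-label classifier (assumption): class 1 outside the unit circle."""
    return 1 if math.hypot(x[0], x[1]) > 1.0 else 0


def point_along(x0, theta, lam):
    """Return x0 + lam * theta / ||theta||."""
    norm = math.sqrt(sum(t * t for t in theta))
    return [xi + lam * ti / norm for xi, ti in zip(x0, theta)]


def local_g(f, x0, y0, theta, v, ratio=0.01, tol=1e-6, max_expand=2000):
    """Evaluate g(theta) starting from a guess v assumed close to the answer."""

    def adversarial(lam):
        return f(point_along(x0, theta, lam)) != y0

    lo, hi = v, v
    if adversarial(v):
        # v is beyond the boundary: shrink until we step back inside.
        for _ in range(max_expand):
            lo = hi * (1.0 - ratio)
            if not adversarial(lo):
                break
            hi = lo
        else:
            return hi  # boundary is (numerically) at x0 itself
    else:
        # v is before the boundary: grow until the label changes.
        for _ in range(max_expand):
            hi = lo * (1.0 + ratio)
            if adversarial(hi):
                break
            lo = hi
        else:
            return math.inf  # no boundary found within the expansion budget

    # Binary search inside the bracketing interval [lo, hi].
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if adversarial(mid):
            hi = mid
        else:
            lo = mid
    return hi


print(local_g(hard_label, [0.0, 0.0], 0, [1.0, 0.0], v=0.9))
print(local_g(hard_label, [0.0, 0.0], 0, [1.0, 0.0], v=1.2))
```

When the guess is good, the bracketing interval is only a few percent of $v$ wide, so the binary search needs fewer queries than a search started from scratch.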
3.2 Zeroth Order Optimization
To solve the optimization problem (6), for which we can only evaluate the function value and not the gradient, zeroth order optimization algorithms can be naturally applied. In fact, after the reformulation, the problem can potentially be solved by any zeroth order optimization algorithm, such as zeroth order gradient descent or coordinate descent (see [17] for a comprehensive survey).
Here we propose to solve (6) using the Randomized Gradient-Free (RGF) method proposed in [2, 18]. In practice we found it outperforms zeroth-order coordinate descent. In each iteration, the gradient is estimated by
$$\hat{g} = \frac{g(\theta + \beta u) - g(\theta)}{\beta} \cdot u, \quad (7)$$
where $u$ is a random Gaussian vector, and $\beta > 0$ is a smoothing parameter (fixed to the same value in all our experiments). The solution is then updated by $\theta \leftarrow \theta - \eta \hat{g}$ with a step size $\eta$. The procedure is summarized in Algorithm 2.
There are several implementation details when we apply this algorithm. First, for high-dimensional problems, we found the estimation in (7) is very noisy. Therefore, instead of using one vector, we sample $q$ vectors from a Gaussian distribution and average their estimators to get $\hat{g}$. The same $q$ is used in all the experiments. The convergence proofs can be naturally extended to this case. Second, instead of using a fixed step size (as suggested by the theory), we use a backtracking line-search approach to find the step size at each step. This leads to additional queries, but makes the algorithm more stable and eliminates the need for hand-tuning the step size.
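The RGF update with averaged estimators can be sketched as below. This is a toy illustration, not the authors' code: the linear classifier (`W`, `B`), the simplified doubling-plus-binary-search evaluation of `g`, the fixed step size `eta` (used here instead of the backtracking line search), and the values chosen for `beta` and `q` are all assumptions.

```python
import math
import random

W, B = [1.0, 2.0], 1.0  # toy linear decision boundary w.x = b (assumption)


def hard_label(x):
    """Hard-label oracle: only the predicted class is returned."""
    return 1 if sum(w * xi for w, xi in zip(W, x)) > B else 0


def g(theta, x0=(0.0, 0.0), y0=0, tol=1e-7):
    """Distance from x0 to the boundary along theta, via hard-label queries."""
    norm = math.sqrt(sum(t * t for t in theta))

    def adversarial(lam):
        x = [xi + lam * ti / norm for xi, ti in zip(x0, theta)]
        return hard_label(x) != y0

    hi = 1.0
    while not adversarial(hi):  # find an upper bound by doubling
        hi *= 2.0
        if hi > 1e6:
            return math.inf
    lo = 0.0
    while hi - lo > tol:  # binary search down to the tolerance
        mid = (lo + hi) / 2.0
        if adversarial(mid):
            hi = mid
        else:
            lo = mid
    return hi


def rgf_step(theta, beta, eta, q, rng):
    """One RGF update: average q finite-difference estimates, then descend."""
    base = g(theta)
    if math.isinf(base):
        return theta
    grad = [0.0] * len(theta)
    used = 0
    for _ in range(q):
        u = [rng.gauss(0.0, 1.0) for _ in theta]
        shifted = g([t + beta * ui for t, ui in zip(theta, u)])
        if math.isinf(shifted):
            continue  # skip directions with no boundary in range
        coef = (shifted - base) / beta
        grad = [gi + coef * ui for gi, ui in zip(grad, u)]
        used += 1
    if used == 0:
        return theta
    return [t - eta * gi / used for t, gi in zip(theta, grad)]


def rgf(theta0, iters=200, beta=0.005, eta=0.2, q=20, seed=0):
    rng = random.Random(seed)
    theta = list(theta0)
    for _ in range(iters):
        theta = rgf_step(theta, beta, eta, q, rng)
    return theta


theta_star = rgf([1.0, 0.0])
print("initial distortion:", g([1.0, 0.0]))
print("final distortion:  ", g(theta_star))
```

For this toy boundary the smallest possible distortion is $b / \|w\| = 1/\sqrt{5}$, reached when $\theta$ is parallel to $w$, so the final value can be checked against that. Note that the binary-search tolerance is kept well below `beta`, since the error in each function value is divided by `beta` in the estimator.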
3.3 Theoretical Analysis
If can be computed exactly, it has been proved in  that RGF in Algorithm 2 requires at most iterations to converge to a point with . However, in our algorithm the function value cannot be computed exactly; instead, we compute it up to -precision, and this precision can be controlled by binary threshold in Algorithm 1. We thus extend the proof in  to include the case of approximate function value evaluation, as described in the following theorem.
In Algorithm 2, suppose g has Lipschitz-continuous gradient with constant . If the error of function value evaluation is controlled by and , then in order to obtain , the total number of iterations is at most .
Detailed proofs can be found in the appendix. Note that the binary search procedure could obtain the desired function value precision in steps. By using the same idea with Theorem 1 and following the proof in , we could also achieve complexity when is non-smooth but Lipschitz continuous.
4 Experimental results
We test the performance of our hard-label black-box attack algorithm on convolutional neural network (CNN) models and compare it with the decision-based attack [1]. Furthermore, we show our method can be applied to attack Gradient Boosting Decision Tree (GBDT) and present some interesting findings.
4.1 Attack CNN image classification models
In detail, both MNIST and CIFAR use the same network structure with four convolution layers, two max-pooling layers and two fully-connected layers. Using the parameters provided by [9], we could achieve 99.5% accuracy on MNIST and 82.5% accuracy on CIFAR-10, which is similar to what was reported in [9]. For ImageNet-1000, we use the pretrained ResNet-50 [22] network provided by torchvision (https://github.com/pytorch/vision/tree/master/torchvision), which achieves 76.15% top-1 accuracy. All models are trained using PyTorch, and our source code is publicly available (https://github.com/LeMinhThong/blackbox-attack).
We include the following algorithms into comparison:
Opt-based black-box attack (Opt-attack): our proposed algorithm.
C&W white-box attack [9]: one of the current state-of-the-art attack algorithms in the white-box setting. We do a binary search on the parameter $c$ per image to achieve the best performance. Attacking in the white-box setting is a much easier problem, so we include the C&W attack just for reference, to indicate the best performance we can possibly achieve.
For all the cases, we conduct adversarial attacks for randomly sampled images from the validation sets. Note that all three attacks have a 100% success rate, and we report the average $L_2$ distortion, defined by $\frac{1}{N}\sum_{i=1}^{N} \|x^{(i)} - x_0^{(i)}\|_2$, where $x^{(i)}$ is the adversarial example constructed by an attack algorithm and $x_0^{(i)}$ is the original $i$-th example. For black-box attack algorithms, we also report the average number of queries for comparison.
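As a small worked example of the reported metric, the following sketch computes the average $L_2$ distortion over a set of attacked examples. The function name and the toy inputs are illustrative assumptions; images are represented as flat lists of floats.

```python
import math


def avg_l2_distortion(adversarial, originals):
    """Mean Euclidean distance between each adversarial example and its original."""
    total = 0.0
    for x_adv, x_orig in zip(adversarial, originals):
        total += math.sqrt(sum((a - o) ** 2 for a, o in zip(x_adv, x_orig)))
    return total / len(originals)


# Two toy examples with distortions 5.0 and 1.0.
adv = [[3.0, 4.0], [0.0, 1.0]]
orig = [[0.0, 0.0], [0.0, 0.0]]
print(avg_l2_distortion(adv, orig))  # 3.0
```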
4.1.1 Untargeted Attack
Table 1: Untargeted attack results on MNIST, CIFAR-10, and ImageNet (columns: average distortion and number of queries per dataset).
For untargeted attack, the goal is to turn a correctly classified image into any other label. The results are presented in Table 1. Note that for both Opt-attack and Decision-attack, by changing stopping conditions we can get the performance with different number of queries.
First, we compare the two black-box attack methods in Table 1. Our algorithm consistently achieves smaller distortion with fewer queries than Decision-attack. For example, on MNIST data, we are able to reduce the number of queries by a factor of 3 to 4, and Decision-attack converges to worse solutions on all three datasets. Compared with the C&W attack, we found black-box attacks attain slightly worse distortion on MNIST and CIFAR.
This is reasonable because a white-box attack has much more information than a black-box attack and is strictly easier. We note that the experiments in [1] conclude that C&W and Decision-attack have similar performance because they only run C&W with a single regularization parameter, without doing a binary search to obtain the optimal parameter. For ImageNet, since we constrain the number of queries, the distortion of black-box attacks is much worse than that of the C&W attack. The gap can be reduced by increasing the number of queries, as shown in Figure 4.
4.1.2 Targeted attack
The results for targeted attack are presented in Table 2. Following the experiments in [1], for each randomly sampled image with label $y$ we set a target label $t$. On MNIST data, we found our algorithm is more than 4 times faster (in terms of number of queries) than Decision-attack and converges to a better solution. On CIFAR data, our algorithm has similar efficiency to Decision-attack in the first 60,000 queries, but converges to a slightly worse solution. Also, we show an example quality comparison from the same starting point to the original sample in Figure 5.
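Table 2: Targeted attack results (columns: average distortion and number of queries per dataset).
Table 3: Untargeted attack results on GBDT (columns: average distortion and number of queries per dataset).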
4.1.3 Attack Gradient Boosting Decision Tree (GBDT)
To evaluate our method’s ability to attack models with discrete decision functions, we conduct our untargeted attack on gradient boosting decision tree (GBDT). In this experiment, we use two standard datasets: HIGGS [23] for binary classification and MNIST [19] for multi-class classification. We use the popular LightGBM framework (https://github.com/Microsoft/LightGBM) to train the GBDT models. Using suggested parameters (https://github.com/Koziev/MNIST_Boosting), we could achieve 0.8457 AUC for HIGGS and 98.09% accuracy for MNIST. The results of untargeted attack on GBDT are in Table 3.
As shown in Table 3, using around 30K queries, we obtain a small distortion on both datasets, which uncovers the vulnerability of GBDT models for the first time. Tree-based methods are well known for their good interpretability, and because of that they are widely used in industry. However, we show that even with good interpretability and prediction accuracy similar to that of a convolutional neural network, GBDT models are vulnerable to our Opt-attack. This result raises a question about the robustness of tree-based models, which will be an interesting direction for future work.
In this paper, we propose a generic, optimization-based hard-label black-box attack algorithm, which can be applied to discrete and non-continuous models other than neural networks, such as the gradient boosting decision tree. Our method is query-efficient and has a theoretical convergence guarantee on the attack performance. Moreover, our attack achieves smaller or similar distortion using 3 to 4 times fewer queries than the state-of-the-art algorithm.
-  Wieland Brendel, Jonas Rauber, and Matthias Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. arXiv preprint arXiv:1712.04248, 2017.
-  Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.
-  Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
-  Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
-  Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations.
-  Seyed Mohsen Moosavi Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), number EPFL-CONF-218057, 2016.
-  Hongge Chen, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, and Cho-Jui Hsieh. Attacking visual language grounding with adversarial examples: A case study on neural image captioning. In ACL, 2018.
-  Minhao Cheng, Jinfeng Yi, Huan Zhang, Pin-Yu Chen, and Cho-Jui Hsieh. Seq2sick: Evaluating the robustness of sequence-to-sequence models with adversarial examples. CoRR, 2018.
-  Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on, pages 39–57. IEEE, 2017.
-  Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 15–26. ACM, 2017.
-  Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.
-  Pin-Yu Chen, Yash Sharma, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh. Ead: elastic-net attacks to deep neural networks via adversarial examples. In AAAI, 2018.
-  Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. Query-efficient black-box adversarial examples. arXiv preprint arXiv:1712.07113, 2017.
-  Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pages 506–519. ACM, 2017.
-  Arjun Nitin Bhagoji, Warren He, Bo Li, and Dawn Song. Exploring the space of black-box attacks on deep neural networks. arXiv preprint arXiv:1712.09491, 2017.
-  Chun-Chen Tu, Pai-Shun Ting, Pin-Yu Chen, Sijia Liu, Huan Zhang, Jinfeng Yi, Cho-Jui Hsieh, and Shin-Ming Cheng. Autozoom: Autoencoder-based zeroth order optimization method for attacking black-box neural networks. CoRR, abs/1805.11770, 2018.
-  Andrew R Conn, Katya Scheinberg, and Luis N Vicente. Introduction to derivative-free optimization, volume 8. Siam, 2009.
-  Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
-  Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
-  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature communications, 5:4308, 2014.
-  Yurii Nesterov. Random gradient-free minimization of convex functions. Technical report, 2011.
Because there is a stopping criterion in Algorithm 1, we couldn’t achieve the exact . Instead, we could get with error, i.e., . Also, we define to be the noisy gradient estimator.
Following , we define the Gaussian smoothing approximation over $g(\theta)$, i.e.,
Also, we have the upper bounds for the moments from  Lemma 1.
For , we have
If , we have two-sided bounds
6.1 Proof of Theorem 1
Suppose $g$ has a Lipschitz-continuous gradient with constant , then
We could bound as follows. Since
Take expectation over u, and with Theorem 3 in , which is ,
With , we could bound :
We use the result that
which is proved in  Lemma 4.
Therefore, since , we could get
Therefore, since has Lipschitz-continuous gradient:
where is an all-one vector, taking the expectation in , we obtain
Choosing , we obtain
Since , taking expectation over , where , we get
where and .
Assuming , summing over k and divided by N+1, we get
Since , is of the same order as . In order to satisfy , we need to choose , then $N$ is bounded by .