Query-Efficient Hard-label Black-box Attack: An Optimization-based Approach

07/12/2018 ∙ by Minhao Cheng, et al. ∙ University of California, Davis ∙ IBM ∙ JD.com, Inc.

We study the problem of attacking a machine learning model in the hard-label black-box setting, where no model information is revealed except that the attacker can make queries to probe the corresponding hard-label decisions. This is a very challenging problem since the direct extension of state-of-the-art white-box attacks (e.g., CW or PGD) to the hard-label black-box setting will require minimizing a non-continuous step function, which is combinatorial and cannot be solved by a gradient-based optimizer. The only current approach is based on random walk on the boundary, which requires lots of queries and lacks convergence guarantees. We propose a novel way to formulate the hard-label black-box attack as a real-valued optimization problem which is usually continuous and can be solved by any zeroth order optimization algorithm. For example, using the Randomized Gradient-Free method, we are able to bound the number of iterations needed for our algorithm to achieve stationary points. We demonstrate that our proposed method outperforms the previous random walk approach to attacking convolutional neural networks on MNIST, CIFAR, and ImageNet datasets. More interestingly, we show that the proposed algorithm can also be used to attack other discrete and non-continuous machine learning models, such as Gradient Boosting Decision Trees (GBDT).

1 Introduction

It has been observed recently that machine learning algorithms, especially deep neural networks, are vulnerable to adversarial examples [3, 4, 5, 6, 7, 8]. For example, in image classification problems, attack algorithms [9, 3, 10] can find adversarial examples for almost every image with very small, human-imperceptible perturbations. The problem of finding an adversarial example can be posed as solving an optimization problem—within a small neighbourhood around the original example, find a point that optimizes a cost function measuring the “successfulness” of an attack. Solving this objective with a gradient-based optimizer leads to state-of-the-art attacks [9, 3, 10, 4, 11].

Most current attacks [3, 9, 4, 12] consider the “white-box” setting, where the machine learning model is fully exposed to the attacker. In this setting, the gradient of the above-mentioned attack objective function can be computed by back-propagation, so attacks can be done very easily. This white-box setting is clearly unrealistic when the model parameters are unknown to the attacker. Instead, several recent works consider the “score-based black-box” setting, where the machine learning model is unknown to the attacker, but it is possible to make queries to obtain the corresponding probability outputs of the model [10, 13]. However, in many cases real-world models will not provide probability outputs to users. Instead, only the final decision (e.g., the top-1 predicted class) can be observed. It is therefore interesting to ask whether machine learning models are vulnerable in this setting.

Furthermore, existing gradient-based attacks cannot be applied to some non-continuous machine learning models which involve discrete decisions. For example, the robustness of decision-tree based models (random forest and gradient boosting decision trees (GBDT)) cannot be evaluated using gradient-based approaches, since the gradient of these functions does not exist.

In this paper, we develop an optimization-based framework for attacking machine learning models in a more realistic and general “hard-label black-box” setting. We assume that the model is not revealed and the attacker can only make queries to get the corresponding hard-label decision instead of the probability outputs (also known as soft labels). Attacking in this setting is very challenging and almost all previous attacks fail, for two reasons. First, the gradient cannot be computed directly by back-propagation, and finite-difference approaches also fail because the hard-label output is insensitive to small input perturbations. Second, since only the hard-label decision is observed, the attack objective function becomes discontinuous with discrete outputs, which is combinatorial in nature and hard to optimize (see Section 2.4 for more details).

In this paper, we make hard-label black-box attacks possible and query-efficient by reformulating the attack as a novel real-valued optimization problem, which is usually continuous and much easier to solve. Although the objective function of this reformulation cannot be written in an analytical form, we show how to use model queries to evaluate its function value and apply any zeroth order optimization algorithm to solve it. Furthermore, we prove that by carefully controlling the numerical accuracy of function evaluations, a Randomized Gradient-Free (RGF) method converges to stationary points as long as the boundary is smooth. We note that this is the first attack with a guaranteed convergence rate in the hard-label black-box setting. In our experiments, we show that our algorithm can successfully attack hard-label black-box CNN models on MNIST, CIFAR, and ImageNet with far fewer queries than the state-of-the-art algorithm.

Moreover, since our algorithm does not depend on the gradient of the classifier, we can apply our approach to non-differentiable classifiers beyond neural networks. We show an interesting application in attacking Gradient Boosting Decision Trees (GBDT), which cannot be attacked by any existing gradient-based method even in the white-box setting. Our method can successfully find adversarial examples with imperceptible perturbations for a GBDT within 30,000 queries.

2 Background and Related work

We will first introduce our problem setting and give a brief literature review to highlight the difficulty of attacking hard-label black-box models.

2.1 Problem Setting

For simplicity, we consider attacking a $K$-way multi-class classification model in this paper. Given the classification model $f:\mathbb{R}^d \rightarrow \{1,\dots,K\}$ and an original example $x_0$, the goal is to generate an adversarial example $x$ such that

$x$ is close to $x_0$ and $f(x) \neq f(x_0)$.  (1)

2.2 White-box attacks

Most attack algorithms in the literature consider the white-box setting, where the classifier $f$ is exposed to the attacker. For neural networks, under this assumption, back-propagation can be conducted on the target model because both network structure and weights are known by the attacker. For neural network classification models, it is usually assumed that $f(x) = \operatorname{argmax}_i \big(Z(x)_i\big)$, where $Z(x) \in \mathbb{R}^K$ is the final (logit) layer output and $Z(x)_i$ is the prediction score for the $i$-th class. The objective in (1) can then be naturally formulated as the following optimization problem:

$\min_{x} \; \text{Dis}(x, x_0) + c \, L(Z(x)),$  (2)

where $\text{Dis}(\cdot,\cdot)$ is some distance measurement (e.g., the $\ell_2$ or $\ell_1$ norm in Euclidean space), $L(\cdot)$ is the loss function corresponding to the goal of the attack, and $c$ is a balancing parameter. For untargeted attack, where the goal is to make the target classifier misclassify, the loss function can be defined as

$L(Z(x)) = \max\Big\{ Z(x)_{y_0} - \max_{i \neq y_0} Z(x)_i, \; 0 \Big\},$  (3)

where $y_0$ is the original label predicted by the classifier. For targeted attack, where the goal is to turn the prediction into a specific target class $t$, the loss function can be defined accordingly.

Therefore, attacking a machine learning model can be posed as solving this optimization problem [9, 12], which is also known as the C&W attack or the EAD attack depending on the choice of the distance measurement. To solve (2), one can apply any gradient-based optimization algorithm such as SGD or Adam, since the gradient of the objective can be computed via back-propagation.
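To make the white-box formulation concrete, the following is a minimal PyTorch sketch of objective (2) with the untargeted hinge loss (3), optimized with Adam. The function name, the hyperparameters (c, steps, lr) and the [0, 1] pixel range are illustrative assumptions rather than the authors' exact C&W implementation.

import torch

def cw_untargeted_attack(model, x0, y0, c=1.0, steps=200, lr=0.01):
    # Hedged sketch of objective (2) with loss (3): Dis(x, x0) + c * L(Z(x)),
    # optimized with Adam. Hyperparameters are illustrative, not the paper's.
    delta = torch.zeros_like(x0, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x = (x0 + delta).clamp(0, 1)            # keep pixels in a valid range
        logits = model(x.unsqueeze(0))[0]       # Z(x): logit-layer output
        z_true = logits[y0]
        z_other = logits[torch.arange(logits.numel()) != y0].max()
        attack_loss = torch.clamp(z_true - z_other, min=0)   # hinge loss (3)
        loss = torch.norm(delta) + c * attack_loss           # objective (2)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (x0 + delta).clamp(0, 1).detach()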

The ability to compute gradients also enables many other attacks in the white-box setting. For example, (2) can also be turned into a constrained optimization problem, which can then be solved by projected gradient descent (PGD) [11]. FGSM [3] is the special case of one-step PGD with the $\ell_\infty$ norm distance. Other algorithms such as DeepFool [6] also solve similar optimization problems to construct adversarial examples.

2.3 Previous work on black-box attack

In real-world systems, the underlying machine learning model will usually not be revealed and thus white-box attacks cannot be applied. This motivates the study of attacking machine learning models in the black-box setting, where attackers do not have any information about the function $f$, and the only valid operation is to make queries to the model and obtain the corresponding output. The first approach for black-box attack is the transfer attack [14]—instead of attacking the original model $f$, attackers try to construct a substitute model to mimic $f$ and then attack the substitute using white-box attack methods. This approach has been well studied and analyzed in [15]. However, recent papers have shown that attacking the substitute model usually leads to much larger distortion and a low success rate [10]. Therefore, [10] instead considers the score-based black-box setting, where attackers can make queries to obtain the softmax layer output in addition to the final classification result. In this case, they can reconstruct the loss function (3) and evaluate it for any input, so a zeroth order optimization approach can be directly applied to minimize it. [16] further improves the query complexity of [10] by introducing two novel building blocks: (i) an adaptive random gradient estimation algorithm that balances query counts and distortion, and (ii) a well-trained autoencoder that achieves attack acceleration. [13] also solves a score-based attack problem using an evolutionary algorithm and shows that their method could be applied to the hard-label black-box setting as well.

2.4 Difficulty of hard-label black-box attacks

Throughout this paper, the hard-label black-box setting refers to cases where real-world ML systems only provide limited prediction results of an input query. Specifically, only the final decision (top-1 predicted label) instead of probability outputs is known to an attacker.

Attacking in this setting is very challenging. In Figure 1(a), we show a simple 3-layer neural network’s decision boundary. Note that the C&W loss in (3) is continuous, as shown in Figure 1(b), because the logit-layer output $Z(x)$ is real-valued. However, in the hard-label black-box setting, only $f(x)$ is available instead of $Z(x)$. Since $f(x)$ can only be a one-hot vector, plugging it into the loss function makes the objective (shown in Figure 1(c)) discontinuous, with discrete outputs.

Figure 1: (a) Decision boundary of a neural network classifier. (b) The C&W attack loss, which is continuous and hence can be easily optimized. (c) The C&W loss in the hard-label setting, which is discrete and discontinuous. (d) Our proposed attack objective $g(\theta)$ for this problem, which is continuous and easier to optimize. See detailed discussions in Section 3.

Optimizing this function requires combinatorial optimization or search algorithms, which is almost impossible given the high dimensionality of the problem. Therefore, almost no algorithm in the literature can successfully conduct a hard-label black-box attack. The only current approach [1] is based on a random walk on the decision boundary. Although this decision-based attack can find adversarial examples with distortion comparable to white-box attacks, it suffers from an exponential search time, resulting in a large number of queries, and lacks convergence guarantees. We show that our optimization-based algorithm significantly reduces the number of queries compared with the decision-based attack, and has guaranteed convergence in the number of iterations (queries).

3 Algorithms

Now we will introduce a novel way to re-formulate hard-label black-box attack as another optimization problem, show how to evaluate the function value using hard-label queries, and then apply a zeroth order optimization algorithm to solve it.

3.1 A Boundary-based Re-formulation

For a given example $x_0$, true label $y_0$ and the hard-label black-box function $f:\mathbb{R}^d \rightarrow \{1,\dots,K\}$, we define our objective function $g:\mathbb{R}^d \rightarrow \mathbb{R}$ depending on the type of attack:

Untargeted attack: $g(\theta) = \min_{\lambda > 0} \lambda$ such that $f\big(x_0 + \lambda \tfrac{\theta}{\|\theta\|}\big) \neq y_0$  (4)
Targeted attack (given target $t$): $g(\theta) = \min_{\lambda > 0} \lambda$ such that $f\big(x_0 + \lambda \tfrac{\theta}{\|\theta\|}\big) = t$  (5)

In this formulation, $\theta$ represents the search direction and $g(\theta)$ is the distance from $x_0$ to the nearest adversarial example along the direction $\theta$. The difference between (4) and (5) corresponds to the different definitions of “successfulness” in untargeted and targeted attacks, where the former aims to turn the prediction into any incorrect label and the latter aims to turn the prediction into the target label. For untargeted attack, $g(\theta)$ also corresponds to the distance to the decision boundary along the direction $\theta$. In image problems the input domain of $f$ is bounded, so we add the corresponding upper/lower bounds in the definitions of (4) and (5).

Figure 2: Illustration of $g(\theta)$.

Instead of searching for an adversarial example directly, we search for the direction $\theta$ that minimizes the distortion $g(\theta)$, which leads to the following optimization problem:

$\min_{\theta} \; g(\theta).$  (6)

Finally, the adversarial example can be found as $x^* = x_0 + g(\theta^*)\tfrac{\theta^*}{\|\theta^*\|}$, where $\theta^*$ is the optimal solution of (6).

Note that unlike the C&W or PGD objective functions, which are discontinuous step functions in the hard-label setting (see Section 2), $g(\theta)$ maps an input direction to a real-valued output (the distance to the decision boundary), which is usually continuous—a small change of $\theta$ usually leads to a small change of $g(\theta)$, as can be seen in Figure 2.

Moreover, we give three examples of classifiers $f$ defined on a two-dimensional input space and their corresponding $g(\theta)$. In Figure 3(a), we consider a continuous classification function.

In this case, as shown in Figure 3(c), $g(\theta)$ is continuous. Moreover, in Figure 3(b) and Figure 1(a), we show decision boundaries generated by a GBDT and a neural network classifier, which are not continuous. However, as shown in Figure 3(d) and Figure 1(d), even when the classifier function is not continuous, $g(\theta)$ is still continuous. This makes it easy to apply a zeroth order method to solve (6).

Figure 3: Examples of decision boundaries of classification functions and the corresponding $g(\theta)$: (a) decision boundary of a continuous function; (b) decision boundary of a GBDT; (c) $g(\theta)$ of (a); (d) $g(\theta)$ of (b).

Compute $g(\theta)$ up to a certain accuracy. We are not able to evaluate the gradient of $g$, but we can evaluate its function value using hard-label queries to the original function $f$. For simplicity, we focus on untargeted attack here, but the same procedure can be applied to targeted attack as well.

First, we discuss how to compute $g(\theta)$ directly without any additional information. This is used in the initialization step of our algorithm. For a given normalized $\theta$, we do a fine-grained search followed by a binary search. In the fine-grained search, we query the points $x_0 + \eta\theta, x_0 + 2\eta\theta, \dots$ one by one with a small step $\eta$ until we find an $i$ with $f(x_0 + i\eta\theta) \neq y_0$; this means the boundary lies between $x_0 + (i-1)\eta\theta$ and $x_0 + i\eta\theta$. We then enter the second phase and conduct a binary search for the solution within this region (the same as lines 11–17 in Algorithm 1). Note that there is an upper bound on the first stage if we choose $\theta$ as the direction of $x' - x_0$ for some $x'$ from another class. This procedure is used to find the initial $\theta_0$ and the corresponding $g(\theta_0)$ in our optimization algorithm. We omit the detailed algorithm for this part since it is similar to Algorithm 1.
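As a concrete illustration (not part of the paper's pseudocode), the following Python sketch implements this initialization under the assumptions that f returns a hard label, theta is a NumPy array, and the coarse step size, tolerance, and iteration cap are arbitrary illustrative values.

import numpy as np

def g_initial(f, x0, y0, theta, step=0.2, tol=1e-3, max_steps=1000):
    """Coarse search followed by binary search for the initial g(theta).

    f returns a hard label; step, tol and max_steps are illustrative values,
    not the paper's exact constants.
    """
    theta = theta / np.linalg.norm(theta)
    lam = step
    for _ in range(max_steps):                 # coarse (fine-grained) search
        if f(x0 + lam * theta) != y0:
            break
        lam += step
    lo, hi = lam - step, lam                   # boundary lies in [lo, hi]
    while hi - lo > tol:                       # binary search refinement
        mid = (lo + hi) / 2.0
        if f(x0 + mid * theta) == y0:
            lo = mid
        else:
            hi = mid
    return hi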

Next, we discuss how to compute $g(\theta)$ when we know that the solution is very close to a value $v$. This is the case for all the function evaluations in our optimization algorithm, since the current solution is usually close to the previous one, and when we estimate the gradient using (7) the queried direction is only a small perturbation of the previous one. In this case, we first increase or decrease $v$ locally to find an interval that contains the boundary (i.e., $f$ changes its prediction somewhere inside the interval), and then conduct a binary search to find the final value of $g(\theta)$. Our procedure for computing $g(\theta)$ locally is presented in Algorithm 1.

1:Input: Hard-label model $f$, original image $x_0$, query direction $\theta$, previous value $v$, increase/decrease ratio $\alpha$, stopping tolerance $\epsilon$ (maximum tolerance of computed error)
2:$\theta \leftarrow \theta / \|\theta\|$
3:if  $f(x_0 + v\theta) = y_0$ then
4:     $v_{\text{left}} \leftarrow v$,  $v_{\text{right}} \leftarrow (1+\alpha)v$
5:     while  $f(x_0 + v_{\text{right}}\theta) = y_0$ do
6:          $v_{\text{right}} \leftarrow (1+\alpha)v_{\text{right}}$
7:else
8:     $v_{\text{right}} \leftarrow v$,  $v_{\text{left}} \leftarrow (1-\alpha)v$
9:     while  $f(x_0 + v_{\text{left}}\theta) \neq y_0$ do
10:          $v_{\text{left}} \leftarrow (1-\alpha)v_{\text{left}}$
11:## Binary Search within $[v_{\text{left}}, v_{\text{right}}]$
12:while  $v_{\text{right}} - v_{\text{left}} > \epsilon$ do
13:     $v_{\text{mid}} \leftarrow (v_{\text{right}} + v_{\text{left}})/2$
14:     if  $f(x_0 + v_{\text{mid}}\theta) = y_0$ then
15:         $v_{\text{left}} \leftarrow v_{\text{mid}}$
16:     else
17:         $v_{\text{right}} \leftarrow v_{\text{mid}}$
18:return $v_{\text{right}}$
Algorithm 1 Compute $g(\theta)$ locally
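For readers who prefer code over pseudocode, here is a hedged Python sketch of Algorithm 1; the default values of alpha and tol and the function signature are illustrative assumptions, not the authors' released implementation.

import numpy as np

def g_local(f, x0, y0, theta, v_prev, alpha=0.01, tol=1e-3):
    """Sketch of Algorithm 1: evaluate g(theta) near a previous value v_prev.

    alpha and tol correspond to the increase/decrease ratio and the stopping
    tolerance; the concrete defaults here are assumptions.
    """
    theta = theta / np.linalg.norm(theta)
    if f(x0 + v_prev * theta) == y0:           # boundary is further out
        v_lo, v_hi = v_prev, (1 + alpha) * v_prev
        while f(x0 + v_hi * theta) == y0:
            v_hi *= (1 + alpha)
    else:                                      # boundary is closer in
        v_hi, v_lo = v_prev, (1 - alpha) * v_prev
        while f(x0 + v_lo * theta) != y0:
            v_lo *= (1 - alpha)
    while v_hi - v_lo > tol:                   # binary search within [v_lo, v_hi]
        v_mid = (v_hi + v_lo) / 2.0
        if f(x0 + v_mid * theta) == y0:
            v_lo = v_mid
        else:
            v_hi = v_mid
    return v_hi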

3.2 Zeroth Order Optimization

To solve the optimization problem (6), for which we can only evaluate function values instead of gradients, zeroth order optimization algorithms can be naturally applied. In fact, after the reformulation the problem can potentially be solved by any zeroth order optimization algorithm, such as zeroth order gradient descent or coordinate descent (see [17] for a comprehensive survey).

Here we propose to solve (6) using the Randomized Gradient-Free (RGF) method proposed in [2, 18]. In practice we found that it outperforms zeroth-order coordinate descent. In each iteration, the gradient is estimated by

$\hat{\mathbf{g}} = \dfrac{g(\theta + \beta u) - g(\theta)}{\beta} \cdot u,$  (7)

where $u$ is a random Gaussian vector and $\beta > 0$ is a smoothing parameter (fixed across all our experiments). The solution is then updated by $\theta \leftarrow \theta - \eta \hat{\mathbf{g}}$ with a step size $\eta$. The procedure is summarized in Algorithm 2.

1:Input: Hard-label model $f$, original image $x_0$, initial $\theta_0$.
2:for  $t = 0, 1, 2, \dots, T$ do
3:     Randomly choose $u_t$ from a zero-mean Gaussian distribution
4:     Evaluate $g(\theta_t)$ and $g(\theta_t + \beta u_t)$ using Algorithm 1
5:     Compute  $\hat{\mathbf{g}} = \dfrac{g(\theta_t + \beta u_t) - g(\theta_t)}{\beta} \cdot u_t$
6:     Update  $\theta_{t+1} = \theta_t - \eta_t \hat{\mathbf{g}}$
7:return $x_0 + g(\theta_T)\,\theta_T / \|\theta_T\|$
Algorithm 2 RGF for hard-label black-box attack

There are several implementation details when we apply this algorithm. First, for high-dimensional problems, we found the estimate in (7) to be very noisy. Therefore, instead of using one vector, we sample $q$ vectors from the Gaussian distribution and average their estimators to obtain $\hat{\mathbf{g}}$; we use the same $q$ in all the experiments. The convergence proofs can be naturally extended to this case. Second, instead of using a fixed step size (as suggested in theory), we use a backtracking line-search approach to find the step size at each iteration. This leads to additional query counts, but makes the algorithm more stable and eliminates the need to hand-tune the step size.
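Putting these pieces together, the following Python sketch shows the RGF loop with q averaged estimators and a simple halving line search; the defaults for iters, q, beta and eta0 and the callable g are illustrative assumptions, not the authors' released implementation.

import numpy as np

def rgf_attack(g, theta0, iters=200, q=10, beta=0.01, eta0=0.2):
    """Sketch of Algorithm 2 with q averaged estimators (eq. (7)) and a
    simple halving line search. g(theta) is evaluated with hard-label
    queries (e.g. via g_local above); all defaults here are assumptions."""
    theta = theta0 / np.linalg.norm(theta0)
    g_val = g(theta)
    for _ in range(iters):
        grad = np.zeros_like(theta)
        for _ in range(q):                      # average q random estimators
            u = np.random.randn(*theta.shape)
            grad += (g(theta + beta * u) - g_val) / beta * u
        grad /= q
        eta = eta0                              # backtracking line search
        while eta > 1e-6:
            theta_new = theta - eta * grad
            theta_new /= np.linalg.norm(theta_new)
            g_new = g(theta_new)
            if g_new < g_val:
                theta, g_val = theta_new, g_new
                break
            eta /= 2.0
    return theta, g_val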

3.3 Theoretical Analysis

If $g(\theta)$ could be computed exactly, it has been proved in [2] that RGF in Algorithm 2 requires at most $O(\frac{d}{\epsilon^2})$ iterations to converge to a point with $\|\nabla g(\theta)\|^2 \leq \epsilon^2$. However, in our algorithm the function value of $g(\theta)$ cannot be computed exactly; instead, we compute it up to a certain precision, and this precision can be controlled by the binary-search tolerance in Algorithm 1. We thus extend the proof in [2] to the case of approximate function value evaluation, as described in the following theorem.

Theorem 1

In Algorithm 2, suppose $g$ has Lipschitz-continuous gradient with constant $L_1(g)$. If the error of the function value evaluation and the smoothing parameter $\beta$ are chosen sufficiently small (as functions of $\epsilon$, $d$ and $L_1(g)$), then in order to obtain $\frac{1}{N+1}\sum_{k=0}^{N}\mathbb{E}\big(\|\nabla g(\theta_k)\|^2\big) \leq \epsilon^2$, the total number of iterations is at most $O(\frac{d}{\epsilon^2})$.

Detailed proofs can be found in the appendix. Note that the binary search procedure obtains the desired function value precision within a number of steps that is only logarithmic in that precision. Using the same idea as in Theorem 1 and following the proof in [2], we can also derive a corresponding complexity bound when $g$ is non-smooth but Lipschitz continuous.
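To make the cost of each function evaluation concrete, the standard binary-search query count can be written out as below; the interval endpoints and tolerance follow the notation assumed in Algorithm 1 above.

\[
\#\text{queries}_{\text{binary search}} \;=\; \Big\lceil \log_2 \frac{v_{\text{right}} - v_{\text{left}}}{\epsilon} \Big\rceil .
\]

For example, shrinking an interval of length 1 down to a tolerance of $10^{-3}$ costs only $\lceil \log_2 10^{3} \rceil = 10$ extra hard-label queries per evaluation.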

4 Experimental results

We test the performance of our hard-label black-box attack algorithm on convolutional neural network (CNN) models and compare with decision-based attack [1]. Furthermore, we show our method can be applied to attack Gradient Boosting Decision Tree (GBDT) and present some interesting findings.

4.1 Attack CNN image classification models

We use three standard datasets: MNIST [19], CIFAR-10 [20] and ImageNet-1000 [21]. To have a fair comparison with previous work, we adopt the same networks used in both [9] and [1]. In detail, both MNIST and CIFAR use the same network structure with four convolution layers, two max-pooling layers and two fully-connected layers. Using the parameters provided by [9], we could achieve 99.5% accuracy on MNIST and 82.5% accuracy on CIFAR-10, which is similar to what was reported in [9]. For ImageNet-1000, we use the pretrained ResNet-50 network [22] provided by torchvision (https://github.com/pytorch/vision/tree/master/torchvision), which achieves 76.15% top-1 accuracy. All models are trained using PyTorch and our source code is publicly available (https://github.com/LeMinhThong/blackbox-attack).
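For reference, loading the same pretrained torchvision model and exposing it as a hard-label oracle can be sketched as below; the normalization constants are the standard ImageNet ones and the wrapper function is our own illustrative addition.

import torch
import torchvision.models as models
import torchvision.transforms as T

# Pretrained ResNet-50 from torchvision, wrapped as a hard-label oracle:
# only the top-1 class (no scores) is exposed to the attacker.
model = models.resnet50(pretrained=True).eval()
normalize = T.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])

def hard_label(x):
    """Top-1 predicted class for a 3xHxW image tensor with values in [0, 1]."""
    with torch.no_grad():
        logits = model(normalize(x).unsqueeze(0))
    return logits.argmax(dim=1).item()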

We include the following algorithms in the comparison:

  • Opt-based black-box attack (Opt-attack): our proposed algorithm.

  • Decision-based black-box attack [1] (Decision-attack): the only previous work on attacking hard-label black-box models. We use the authors’ implementation with the default parameters provided in Foolbox (https://github.com/bethgelab/foolbox).

  • C&W white-box attack [9]: one of the current state-of-the-art attack algorithms in the white-box setting. We do a binary search on the regularization parameter $c$ per image to achieve the best performance. Attacking in the white-box setting is a much easier problem, so we include the C&W attack only as a reference indicating the best performance we could possibly achieve.

For all the cases, we conduct adversarial attacks on randomly sampled images from the validation sets. Note that all three attacks have a 100% success rate, and we report the average $\ell_2$ distortion, defined by $\frac{1}{n}\sum_{i=1}^{n}\|x_i - x_i^0\|_2$, where $x_i$ is the adversarial example constructed by an attack algorithm and $x_i^0$ is the original $i$-th example. For black-box attack algorithms, we also report the average number of queries for comparison.
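The reported metric is straightforward to compute; a minimal NumPy sketch (with an illustrative helper name) is:

import numpy as np

def avg_l2_distortion(adv_examples, originals):
    # Average L2 distortion over the attacked examples (illustrative helper).
    return np.mean([np.linalg.norm((a - o).ravel())
                    for a, o in zip(adv_examples, originals)])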

4.1.1 Untargeted Attack

                              MNIST                 CIFAR10               ImageNet (ResNet-50)
                              Avg L2    # queries    Avg L2    # queries    Avg L2    # queries
Decision-attack (black-box)   1.1222    60,293       0.1575    123,879      5.9791    123,407
                              1.1087    143,357      0.1501    220,144      3.7725    260,797
Opt-attack (black-box)        1.188     22,940       0.2050    40,941       6.9796    71,100
                              1.049     51,683       0.1625    77,327       4.7100    127,086
                              1.011     126,486      0.1451    133,662      3.1120    237,342
C&W (white-box)               0.9921    -            0.1012    -            1.9365    -
Table 1: Results of untargeted attack.
Figure 4: Log distortion comparison of Decision-attack (solid curves) vs Opt-attack (dotted curves) over number of queries for 6 different images.

For untargeted attack, the goal is to turn a correctly classified image into any other label. The results are presented in Table 1. Note that for both Opt-attack and Decision-attack, by changing the stopping conditions we can obtain results at different numbers of queries.

First, we compare the two black-box attack methods in Table 1. Our algorithm consistently achieves smaller distortion with fewer queries than Decision-attack. For example, on the MNIST data we are able to reduce the number of queries by a factor of 3–4, and Decision-attack converges to worse solutions on all three datasets. Compared with the C&W attack, we found that black-box attacks attain slightly worse distortion on MNIST and CIFAR.

(a) Examples of targeted Opt-attack
(b) Examples of targeted Decision-attack
Figure 5: Example quality comparison between targeted Opt-attack and Decision-attack. Opt-attack achieves a better result with fewer queries.

This is reasonable because a white-box attack has much more information than a black-box attack and is strictly easier. We note that the experiments in [1] conclude that C&W and Decision-attack have similar performance because they only run C&W with a single regularization parameter, without doing a binary search for the optimal parameter. For ImageNet, since we constrain the number of queries, the distortion of black-box attacks is much worse than that of the C&W attack. The gap can be reduced by increasing the number of queries, as shown in Figure 4.

4.1.2 Targeted attack

The results for targeted attack are presented in Table 2. Following the experiments in [1], for each randomly sampled image with label $y_0$ we set the target label $t$ according to the same rule as in [1]. On the MNIST data, we found that our algorithm is more than 4 times faster (in terms of number of queries) than Decision-attack and converges to a better solution. On the CIFAR data, our algorithm has similar efficiency to Decision-attack for the first 60,000 queries, but converges to a slightly worse solution. We also show an example quality comparison from the same starting point to the original sample in Figure 5.

                              MNIST                 CIFAR10
                              Avg L2    # queries    Avg L2    # queries
Decision-attack (black-box)   2.3158    30,103       0.2850    55,552
                              2.0052    58,508       0.2213    140,572
                              1.8668    192,018      0.2122    316,791
Opt-attack (black-box)        1.8522    46,248       0.2758    61,869
                              1.7744    57,741       0.2369    141,437
                              1.7114    73,293       0.2300    186,753
C&W (white-box)               1.4178    -            0.1901    -
Table 2: Results of targeted attack.
       HIGGS                 MNIST
       Avg L2    # queries    Avg L2    # queries
Ours   0.3458    4,229        0.6113    5,125
       0.2179    11,139       0.5576    11,858
       0.1704    29,598       0.5505    32,230
Table 3: Results of untargeted attack on gradient boosting decision tree.

4.1.3 Attack Gradient Boosting Decision Tree (GBDT)

To evaluate our method’s ability to attack models with discrete decision functions, we conduct our untargeted attack on gradient boosting decision trees (GBDT). In this experiment, we use two standard datasets: HIGGS [23] for binary classification and MNIST [19] for multi-class classification. We use the popular LightGBM framework (https://github.com/Microsoft/LightGBM) to train the GBDT models. Using suggested parameters (https://github.com/Koziev/MNIST_Boosting), we could achieve 0.8457 AUC for HIGGS and 98.09% accuracy for MNIST. The results of untargeted attack on GBDT are shown in Table 3.
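A minimal sketch of how such a hard-label GBDT oracle can be set up with LightGBM is given below; the stand-in dataset, parameter values and helper names are our own assumptions and not the paper's exact HIGGS/MNIST configuration.

import lightgbm as lgb
from sklearn.datasets import load_digits           # small stand-in dataset
from sklearn.model_selection import train_test_split

# Illustrative GBDT training with LightGBM; dataset and parameters are
# simple stand-ins, not the paper's HIGGS/MNIST configuration.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

gbdt = lgb.LGBMClassifier(n_estimators=200, num_leaves=31)
gbdt.fit(X_tr, y_tr)
print("test accuracy:", gbdt.score(X_te, y_te))

def hard_label(x):
    """Hard-label interface used by the attack: only the predicted class."""
    return int(gbdt.predict(x.reshape(1, -1))[0])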

As shown in Table 3, using around 30K queries we obtain a small distortion on both datasets, which uncovers for the first time the vulnerability of GBDT models. Tree-based methods are well known for their good interpretability, and because of that they are widely used in industry. However, we show that even with good interpretability and prediction accuracy comparable to convolutional neural networks, GBDT models are vulnerable to our Opt-attack. This result raises questions about the robustness of tree-based models, which will be an interesting direction for future work.

5 Conclusion

In this paper, we propose a generic optimization-based hard-label black-box attack algorithm, which can be applied to discrete and non-continuous models other than neural networks, such as gradient boosting decision trees. Our method is query-efficient and has a theoretical convergence guarantee on the attack performance. Moreover, our attack achieves smaller or similar distortion using 3–4 times fewer queries compared with the state-of-the-art algorithm.

References

6 Appendix

Because there is a stopping criterion in Algorithm 1, we cannot obtain the exact value of $g(\theta)$; instead, we obtain it up to some bounded error. We also define the corresponding noisy gradient estimator.

Following [24], we define the Gaussian smoothing approximation of $g$, i.e.,

$g_\beta(\theta) = \mathbb{E}_{u \sim N(0, I)}\big[g(\theta + \beta u)\big].$  (8)

We also use the following bounds on the moments $\mathbb{E}_u\big(\|u\|^p\big)$ from Lemma 1 of [24].

For $p \in [0, 2]$, we have

$\mathbb{E}_u\big(\|u\|^p\big) \;\leq\; d^{p/2}.$  (9)

If $p \geq 2$, we have the two-sided bounds

$d^{p/2} \;\leq\; \mathbb{E}_u\big(\|u\|^p\big) \;\leq\; (d + p)^{p/2}.$  (10)

6.1 Proof of Theorem 1

Suppose $g$ has a Lipschitz-continuous gradient with constant $L_1(g)$; then

(11)

We can bound this quantity as follows. Since

(12)

and ,

(13)

Taking the expectation over $u$ and applying Theorem 3 in [24],

(14)

With this, we can bound:

(15)

We use the following result, which is proved in Lemma 4 of [24]:

(16)

Therefore, we obtain

(17)

Therefore, by the Lipschitz continuity of the gradient:

(18)

so that

(19)

Since

(20)

where $\mathbf{1}$ denotes the all-one vector; taking the expectation, we obtain

(21)

Choosing , we obtain

(22)

Since , taking expectation over , where , we get

(23)

where and .

Assuming the above, summing over $k$ and dividing by $N+1$, we get

(24)

Clearly, .

Since these quantities are of the same order, in order to satisfy the desired accuracy we need to choose the smoothing and evaluation-error parameters accordingly, and then $N$ is bounded by $O(\frac{d}{\epsilon^2})$.