Deep Neural Networks (DNNs) are vulnerable to adversarial examples—malicious inputs designed to fool the network's prediction; see (Biggio and Roli, 2018) for a comprehensive, recent overview of adversarial examples. Research on generating these malicious inputs started in the white-box setting, where access to the gradients of the models was assumed. Since the gradient points in the direction of steepest ascent, an input can be perturbed along that gradient to maximize the network's loss, thereby fooling its prediction. The assumption of access to the underlying gradient does not, however, reflect real-world scenarios. Attacks under a more realistic, more restrictive black-box threat model, which does not assume access to gradients, have since been studied, as summarized in Section 1.2.
Central to the approach of generating adversarial examples in a black-box threat model is estimating the gradients of the model being attacked. In estimating these gradients (their magnitudes and signs), the community at large has formulated the task as a problem in continuous optimization, seeking to reduce the query complexity from the standard Θ(n), where n is the number of input features/covariates. In this paper, we take a different view and focus on estimating just the sign of the gradient by reformulating the problem as minimizing the Hamming distance to the gradient sign. Given access to a Hamming distance oracle, this view guarantees a query complexity of O(n / log₂(n + 1)): an order of magnitude smaller than the full gradient estimation's query complexity for most practically occurring input dimensions n. Our key objective is to answer the following:
Is it possible to estimate only the sign of the gradient with such query efficiency and generate adversarial examples as effective as those generated by full gradient estimation approaches?
We propose a novel formulation which attempts to achieve this by exploiting some properties of the directional derivative of the loss function of the model under attack, and through rigorous empirical evaluation we show our approach outperforms state of the art full gradient estimation techniques. We also identify several key areas of research which we believe will help the community towards query-efficient adversarial attacks and gradient-free optimization.
1.2 Related Work
We organize the related work in two themes, namely Adversarial Example Generation and Sign-Based Optimization.
Adversarial Example Generation. This literature can be organized as generating examples in either a white-box or a black-box setting. Nelson et al. (2012) provide a theoretical framework to analyze adversarial querying in a white-box setting. Following the works of Biggio et al. (2013) and Goodfellow et al. (2015), who introduced the fast gradient sign method (FGSM), several methods to produce adversarial examples have been proposed for various learning tasks and threat perturbation constraints (Carlini and Wagner, 2017; Moosavi-Dezfooli et al., 2016; Hayes and Danezis, 2017; Al-Dujaili et al., 2018; Huang et al., 2018; Kurakin et al., 2017; Shamir et al., 2019). These methods assume a white-box setup and are not the focus of this work. An approach that has received the community's attention involves learning adversarial examples for one model (with access to its gradient information) to transfer them against another (Liu et al., 2016; Papernot et al., 2017). As an alternative to the transferability phenomenon, Xiao et al. (2018) use a Generative Adversarial Network (GAN) to generate adversarial examples based on small norm-bounded perturbations. Both approaches involve learning on a different model, which is expensive and does not lend itself to comparison in our setup, where we directly query the model of interest. Among works that generate examples in a black-box setting through iterative optimization schemes, Narodytska and Kasiviswanathan (2017) showed how a naïve policy of perturbing random segments of an image achieved adversarial example generation; they do not use any gradient information. Bhagoji et al. (2017)
reduce the dimensions of the feature space using Principal Component Analysis (PCA) and random feature grouping before estimating gradients, which enables them to bound the number of queries made. Chen et al. (2017) introduced a principled approach to solving this problem using gradient-based optimization. They employ finite differences, a zeroth-order optimization tool, to estimate the gradient and then use it to design a gradient-based attack on models. While this approach successfully generates adversarial examples, it is expensive in the number of queries made to the model. Ilyas et al. (2018) substitute traditional finite-difference methods with Natural Evolutionary Strategies (NES) to obtain an estimate of the gradient. Tu et al. (2018) provide an adaptive random gradient estimation algorithm that balances query counts and distortion, and introduce a trained auto-encoder to achieve attack acceleration. Ilyas et al. (2019) extend this line of work by proposing the idea of gradient priors. Our work contrasts with the general approach used by these works: we investigate whether just estimating the sign of the gradient suffices to efficiently generate examples.
Sign-Based Optimization. In the context of general-purpose continuous optimization methods, sign-based stochastic gradient descent was studied in both zeroth- and first-order setups. In the latter, Bernstein et al. (2018) analyzed signSGD, a sign-based Stochastic Gradient Descent, and showed that it enjoys faster empirical convergence than SGD, in addition to the cost reduction of communicating gradients across multiple workers. Liu et al. (2019) extended signSGD to the zeroth-order setup with the ZO-signSGD algorithm, which requires O(√n) times more iterations than signSGD, leading to a convergence rate of O(√n / √T), where n is the number of optimization variables and T is the number of iterations.
Adversarial Examples Meet Sign-Based Optimization. In the context of adversarial example generation, the effectiveness of the sign of the gradient coordinates was noted in both white- and black-box settings. In the former, the Fast Gradient Sign Method (FGSM)—which is algorithmically similar to signSGD—was proposed to generate white-box adversarial examples (Goodfellow et al., 2015). Ilyas et al. (2019) examined a noisy version of FGSM to address the question of how accurate a gradient estimate must be to execute a successful attack on a neural net. In Figure 1, we reproduce their experiment on an IMAGENET-based model—Plot (c)—and extend it to the MNIST and CIFAR10 datasets—Plots (a) and (b). Observe that estimating the sign of only the top gradient coordinates (in terms of their magnitudes) is enough to achieve a high misclassification rate. Furthermore, ZO-signSGD (Liu et al., 2019) was shown to outperform magnitude-based gradient estimation at generating adversarial examples against a black-box neural network on the MNIST dataset.
1.3 Our Contributions
Motivated by i) the practical effectiveness of the gradient sign information; and ii) the fact that the gradient sign can be recovered with a lower query complexity than is needed to retrieve both its sign and magnitude (as we will show herein), we view the black-box adversarial attack problem as one of estimating the gradient's sign bits. This shift from continuous to binary black-box optimization leads to the following contributions at the intersection of adversarial machine learning and black-box (zeroth-order) optimization:
We present three properties of the directional derivative of the loss function of the model under attack in the direction of sign vectors, and propose methods to estimate the gradient sign bits by exploiting these properties. Namely,
Property 3.1 shows that the directional derivative in the direction of a sign vector q is an affine transformation of the Hamming distance between q and the gradient sign vector. This suggests that if we can recover the Hamming distance from the directional derivative, then the gradient sign bits can be recovered with a query complexity of O(n / log₂(n + 1)) using any off-the-shelf efficient Hamming search strategy.
Property 3.2 shows that the directional derivative is locally smooth around the gradient sign. This lets us employ the optimism-in-the-face-of-uncertainty principle in estimating the gradient sign. Through the use of hierarchical bandits, we show that knowledge of this smoothness is not required and provide a finite-time upper bound on the quality of the estimate, at the expense of searching over the 2ⁿ possible sign vectors.
Property 3.3 shows that the directional derivative is separable with respect to the coordinates of a sign vector q. Based on this property, we devise a divide-and-conquer algorithm, which we refer to as SignHunter, that reduces the search complexity from 2ⁿ to O(n). When given a budget of O(n) queries, SignHunter is guaranteed to perform at least as well as FGSM (Goodfellow et al., 2015), which has access to the model's gradient.
Through rigorous empirical evaluation, Property 3.3 (and hence SignHunter) is found to be the most effective in crafting black-box adversarial examples. In particular,
To exploit Property 3.1, we propose an estimate of the Hamming distance (to the gradient sign) oracle from finite differences of the model's loss value queries, and provide an empirical motivation and evaluation of the same. We find that efficient Hamming search strategies from the literature (e.g., Maurer (2009)) are not robust to the approximation error of the proposed Hamming distance estimate, and hence no guarantees can be made about the estimated sign vector.
Despite being theoretically founded, the approach exploiting Property 3.2 is slow and does not scale to most practically occurring input dimensions n.
Through experiments on MNIST, CIFAR10, and IMAGENET under both ℓ∞ and ℓ2 threat constraints, SignHunter yields black-box attacks that are more query-efficient and less failure-prone than the state-of-the-art attacks combined. On two public black-box attack challenges, our approach achieves the highest evasion rate, surpassing techniques based on transferability, ensembling, and generative adversarial networks.
Finally, we release a software framework² to systematically benchmark adversarial black-box attacks on DNNs for the MNIST, CIFAR10, and IMAGENET datasets in terms of their success rate, query count, and other related metrics. This was motivated by the problem we faced in comparing approaches from the literature, where different researchers evaluated their approaches on different datasets, metrics, and setups—e.g., some compared on only one of the datasets, while others considered the other two. (²This builds on other open-source frameworks such as the MNIST and CIFAR10 challenges (Madry et al., 2017).)
The rest of the paper is structured as follows. First, a formal background is presented in Section 2. Section 3 describes our approach for black-box adversarial attacks by examining three properties of the loss’s directional derivative of the model under attack. Experiments are discussed in Section 4. Using two public black-box attack challenges, we evaluate the approach against one of the defenses developed to mitigate adversarial examples in Section 5. Finally, open questions and conclusions are outlined in Sections 6 and 7.
2 Formal Background
Let n denote the dimension of a neural network's input. Denote a hidden n-dimensional binary code by x*. That is, x* ∈ {-1, +1}ⁿ. The response of the Hamming (distance) oracle O to the i-th query q⁽ⁱ⁾ ∈ {-1, +1}ⁿ is denoted by h_i and equals the Hamming distance

h_i = ‖q⁽ⁱ⁾ − x*‖_H ,   (1)

where the Hamming norm ‖·‖_H is defined as the number of non-zero entries of a vector. We also refer to O as the noiseless Hamming oracle, in contrast to the noisy Hamming oracle Õ, which returns noisy versions of O's responses, as we will see shortly. I_n is the n × n identity matrix. The query ratio r is defined as r = k/n, where k is the number of queries to O required to retrieve x*. Furthermore, denote the directional derivative of some function f at a point x in the direction of a vector v by D_v f(x) = vᵀ∇f(x), which often can be approximated by the finite difference method. That is, for δ > 0, we have

D_v f(x) ≈ (f(x + δv) − f(x)) / δ .   (2)
Let Π_S denote the projection operator onto the set S, and B_p(x, ε) the ℓ_p ball of radius ε around x. Next, we provide lower and upper bounds on the query ratio r.
2.2 Bounds on the Query Ratio
Lower Bound on r.
Each oracle response h_i takes one of n + 1 values and thus conveys at most log₂(n + 1) bits of information, while uniquely determining one of the 2ⁿ possible codes requires n bits. Hence,

r ≥ 1 / log₂(n + 1)

for any sequence of queries that determines every n-dimensional binary code uniquely. See (Vaishampayan, 2012, Page 4).
Exact Solution with r = 1.
In the following theorem, we show that no more than n queries are required to retrieve the hidden n-dimensional binary code x*.
A hidden n-dimensional binary code x* ∈ {-1, +1}ⁿ can be retrieved exactly with no more than n queries to the noiseless Hamming oracle O.
The key element of this proof is that the Hamming distance between two n-dimensional binary codes q and x* can be written as

h(q, x*) = (n − qᵀ x*) / 2 .

Let Q be an n × n matrix whose i-th row is the i-th query code q⁽ⁱ⁾. Likewise, let h_i be the corresponding i-th query response, and let c = [n − 2h_1, …, n − 2h_n]ᵀ be the concatenating vector. In matrix form, we have

Q x* = c ,

where Q is invertible if we construct n linearly independent queries {q⁽ⁱ⁾}, so x* = Q⁻¹ c.
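To make the retrieval argument concrete, here is a minimal Python sketch (our illustration, not the paper's code). For simplicity it uses an even simpler construction than the linear system: an all-ones base query plus n single-bit flips of it, costing n + 1 queries instead of n, but still recovering x* exactly:

```python
# Illustrative sketch: recovering a hidden sign vector x* in {-1,+1}^n
# from a noiseless Hamming oracle with n + 1 queries. Flipping bit i of
# the all-ones query raises the distance by 1 iff x*_i = +1.

def make_oracle(x_star):
    """Noiseless Hamming oracle: counts coordinates where q differs from x*."""
    def oracle(q):
        return sum(1 for qi, xi in zip(q, x_star) if qi != xi)
    return oracle

def recover(oracle, n):
    base = [1] * n
    h0 = oracle(base)                     # 1 base query
    x_hat = []
    for i in range(n):                    # n single-bit-flip queries
        q = base[:]
        q[i] = -1
        x_hat.append(+1 if oracle(q) - h0 == 1 else -1)
    return x_hat

x_star = [1, -1, -1, 1, 1, -1, 1, -1]
assert recover(make_oracle(x_star), len(x_star)) == x_star
```

The general theorem replaces this ad hoc query set with any n linearly independent codes and a single linear solve.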
2.3 Gradient Estimation Problem: a Hamming Distance View
At the heart of black-box adversarial attacks is generating a perturbation vector to slightly modify the original input x so as to fool the network's prediction of its true label y. Put differently, an adversarial example x' maximizes the network's loss f(x', y) but still remains ε-close to the original input x. Although the loss function f can be non-concave, gradient-based techniques are often very successful in crafting an adversarial example (Madry et al., 2017)—that is, setting the perturbation vector as a step in the direction of ∇_x f(x, y). Subsequently, the bulk of black-box attack methods sought to estimate the gradient by querying an oracle that returns, for a given input/label pair (x, y), the value of the network's loss f(x, y). Using only such value queries, the basic approach relies on the finite difference method to approximate the directional derivative ((2)) of the function f at the input/label pair (x, y) in the direction of a vector v, which corresponds to vᵀ∇_x f(x, y). With n linearly independent vectors {v⁽ⁱ⁾}, one can construct a linear system of equations to recover the full gradient. Clearly, this approach's query complexity is Θ(n), which can be prohibitively expensive for large n (e.g., n = 299 × 299 × 3 = 268,203 for the IMAGENET dataset). Moreover, the queries are not adaptive: no use is made of past queries' responses to construct the next query and recover the full gradient with fewer queries. Recent works tried to mitigate this issue by exploiting data- and/or time-dependent priors (Tu et al., 2018; Ilyas et al., 2018, 2019).
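The Θ(n) baseline described above can be sketched in a few lines; the quadratic loss here is a hypothetical stand-in for the network's loss oracle:

```python
# Sketch of the full-gradient baseline: coordinate-wise finite differences
# cost n + 1 value queries for an n-dimensional input (one base query plus
# one query per standard basis direction).

def loss(x):                       # toy black-box value oracle f(x)
    return sum((xi - i) ** 2 for i, xi in enumerate(x))

def grad_fd(f, x, delta=1e-5):
    """Estimate the gradient with n finite differences (plus one base query)."""
    f0 = f(x)
    g = []
    for i in range(len(x)):
        e = x[:]
        e[i] += delta              # direction v = i-th standard basis vector
        g.append((f(e) - f0) / delta)
    return g

x = [0.0, 0.0, 0.0]
g = grad_fd(loss, x)               # true gradient at x is [0, -2, -4]
assert all(abs(gi - ti) < 1e-3 for gi, ti in zip(g, [0.0, -2.0, -4.0]))
```

Each attack iteration repeats this estimate, which is exactly the query cost the sign-only view seeks to avoid.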
The lower bound of Theorem 2.2 on the query complexity of a Hamming oracle to find a hidden vector suggests the following: instead of estimating the full gradient (sign and magnitude), and apart from exploiting any data- or time-dependent priors, why not focus on estimating its sign? After all, simply leveraging (noisy) sign information of the gradient yields successful attacks; see Figure 1. Therefore, our interest in this paper is the gradient sign estimation problem, which we formally define next, breaking away from the general trend of the continuous optimization view—manifested in the focus on the full gradient estimation problem—in constructing black-box adversarial attacks.
(Gradient Sign Estimation Problem) For an input/label pair (x, y) and a loss function f, let g = ∇_x f(x, y) be the gradient of f at (x, y) and sign(g) ∈ {-1, +1}ⁿ be the sign bit vector of g.³ Then the goal of the gradient sign estimation problem is to find a binary⁴ vector q minimizing the Hamming norm

min_{q ∈ {-1,+1}ⁿ} ‖q − sign(g)‖_H ,   (4)

or equivalently maximizing the directional derivative

max_{q ∈ {-1,+1}ⁿ} D_q f(x, y) = qᵀ g ,   (5)

from a limited number of (possibly adaptive) loss value queries f(x', y). (³Without loss of generality, we encode the sign bit vector in {-1, +1}ⁿ rather than {0, 1}ⁿ; this is a common representation in the sign-related literature. Note that the standard sign function has the range {-1, 0, +1}. Here, we use the non-standard definition (Zhao, 2018) whose range is {-1, +1}. This is justified by the observation that DNNs' gradients are not sparse (Ilyas et al., 2019, Appendix B.1). ⁴Throughout the paper, we use the terms binary vectors and sign vectors/bits interchangeably.)
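Writing the directional derivative coordinate-wise makes the equivalence of the two objectives at the optimum explicit (a one-line expansion using the definitions above):

```latex
D_q f(x,y) \;=\; q^\top g \;=\; \sum_{i=1}^{n} q_i\, g_i
\;=\; \sum_{i=1}^{n} |g_i| \;-\; 2 \sum_{i:\, q_i \neq \operatorname{sign}(g_i)} |g_i| ,
```

so each bit of q that disagrees with sign(g) subtracts 2|g_i| from the directional derivative. Both objectives are therefore optimized exactly at q = sign(g), although they rank suboptimal vectors differently unless the magnitudes |g_i| are all equal.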
In the next section, we set out to tackle the problem above by leveraging three properties of the loss's directional derivative, which, in the black-box setup, is approximated by finite differences of loss value queries.
Recall that our definition of the Hamming distance ((1)) is over binary vectors, and it is in these terms that the gradient sign estimation problem is stated. In contrast, Shamir et al. (2019) consider the Hamming distance in defining the threat perturbation constraint: under a Hamming threat constraint of k, only k data features (pixels) are allowed to be changed, but each of them can change a lot.
3 A Framework for Estimating Sign of the Gradient from Loss Oracles
Our interest in this section is to estimate the gradient sign bits of the loss function of the model under attack at an input/label pair (x, y) from a limited number of loss value queries f(x', y). To this end, we examine the basic concept of directional derivatives, which has been employed in recent black-box adversarial attacks. In particular, we present three approaches to estimate the gradient sign bits based on three properties of the directional derivative of the loss in the direction of a sign vector q.
3.1 Approach 1: Loss Oracle as a Noisy Hamming Oracle
The directional derivative D_q f(x, y) of the loss function f at (x, y) in the direction of a binary code q can be written as

D_q f(x, y) = qᵀ g = Σ_{i ∈ A} |g_i| − Σ_{i ∈ Ā} |g_i| ,   (6)

where g = ∇_x f(x, y), A = {i | q_i = sign(g_i)}, and Ā = {1, …, n} \ A. Note that |A| + |Ā| = n. The quantities ḡ_A and ḡ_Ā are the means of {|g_i|}_{i ∈ A} and {|g_i|}_{i ∈ Ā}, respectively. Observe that the Hamming distance between q and the gradient sign satisfies ‖q − sign(g)‖_H = |Ā|. In other words, the directional derivative has the following property. Property 3.1. The directional derivative D_q f(x, y) of the loss function f at an input/label pair (x, y) in the direction of a binary code q can be written as an affine transformation of the Hamming distance between q and sign(g). Formally, we have

D_q f(x, y) = n ḡ_A − (ḡ_A + ḡ_Ā) ‖q − sign(g)‖_H .   (7)
If we can recover the Hamming distance from the directional derivative based on (7), efficient Hamming search strategies—e.g., (Maurer, 2009)—can then be used to recover the gradient sign bits with a query complexity of the order stated in Theorem 2.2. However, not all terms of (7) are known to us. While n is the number of data features (known a priori) and D_q f(x, y) is available through a finite difference oracle, ḡ_A and ḡ_Ā are not known. Here, we propose to approximate these values by their Monte Carlo estimates: averages of the magnitudes of sampled gradient components. Our assumption is that the magnitudes of the gradient coordinates are not very different from each other, and hence a Monte Carlo estimate is good enough (with small variance). Our experiments on MNIST, CIFAR10, and IMAGENET confirm the same—see Figure 15 in the supplement.
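The affine relationship itself can be checked numerically; the gradient values below are arbitrary toy numbers, not from any model:

```python
# Toy numerical check of Property 3.1 (our sketch): the directional
# derivative satisfies D_q = gA*n - (gA + gAbar)*h, where h is the
# Hamming distance between q and sign(g).
g = [0.5, -1.5, 2.0, -0.25, 1.0]                 # stand-in gradient
n = len(g)
sgn = [1 if gi > 0 else -1 for gi in g]

def directional(q):
    """Directional derivative q·g."""
    return sum(qi * gi for qi, gi in zip(q, g))

q = [1, 1, -1, -1, 1]                            # an arbitrary sign query
A = [i for i in range(n) if q[i] == sgn[i]]      # matched coordinates
Abar = [i for i in range(n) if q[i] != sgn[i]]   # mismatched coordinates
h = len(Abar)                                    # Hamming distance to sign(g)
gA = sum(abs(g[i]) for i in A) / len(A)
gAbar = sum(abs(g[i]) for i in Abar) / len(Abar)

assert abs(directional(q) - (gA * n - (gA + gAbar) * h)) < 1e-12
```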
To use the i-th gradient component g_i as a sample for our estimation, one can construct two binary codes q and q' such that only their i-th bits differ, i.e., q'_i = −q_i. Thus, we have

q_i g_i = (D_q f(x, y) − D_{q'} f(x, y)) / 2 ,   (9)

which yields |g_i| and, since q_i is known, sign(g_i). The Monte Carlo estimates ĝ_A and ĝ_Ā of ḡ_A and ḡ_Ā are then averages over the sampled coordinates.⁵ (⁵It is possible that one of the sampled sets for ĝ_A and ĝ_Ā will be empty (e.g., when we only have one sample). In this case, we make the approximation ĝ_A = ĝ_Ā.) As a result, the Hamming distance between q and the gradient sign can be approximated by the following quantity, which we refer to as the noisy Hamming oracle Õ:

Õ(q) = (n ĝ_A − D_q f(x, y)) / (ĝ_A + ĝ_Ā) .   (12)
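Putting the pieces together, here is a toy sketch (our naming, with exact directional derivatives standing in for finite differences) of such a noisy Hamming oracle with a single pooled magnitude estimate:

```python
# Sketch of a noisy Hamming oracle: assume a single pooled magnitude
# estimate g_bar for all coordinates, so h ≈ (n*g_bar - D_q) / (2*g_bar).
g = [1.0, -1.2, 0.9, -1.1, 1.05, 0.95]        # toy gradient, mildly spread
sgn = [1 if gi > 0 else -1 for gi in g]
n = len(g)

def D(q):                                     # directional derivative q·g
    return sum(qi * gi for qi, gi in zip(q, g))

# Sample |g_i| for a few coordinates via pairs of codes differing in bit i:
# D(q) - D(q') = 2*q_i*g_i, so |g_i| = |D(q) - D(q')| / 2.
samples = []
for i in (0, 2, 4):
    q = [1] * n
    qp = q[:]
    qp[i] = -1
    samples.append(abs(D(q) - D(qp)) / 2)
g_bar = sum(samples) / len(samples)

def noisy_hamming(q):
    return round((n * g_bar - D(q)) / (2 * g_bar))

q = [1, 1, -1, -1, 1, 1]                      # true Hamming distance to sgn: 2
assert abs(noisy_hamming(q) - 2) <= 1
```

With near-uniform magnitudes the estimate is close; the spread of the magnitudes is exactly what governs the approximation error studied next.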
We empirically evaluated the quality of Õ's responses on a toy problem in which we controlled the magnitude spread/concentration of the gradient coordinates, with v being the number of unique values (magnitudes) of the gradient coordinates. As detailed in Figure 3, the error can be a substantial fraction of the Hamming distance's range [0, n]. The negative impact of this on the Hamming search strategy by Maurer (2009) was verified empirically in Figure 4. We considered the simplest case, where the strategy was given access to the noisy Hamming oracle Õ in a setup similar to the one outlined in Figure 3 with an 80-bit hidden code. To account for the randomness in constructing Õ, we ran independent runs and plot the average Hamming distance (with confidence bounds) over queries. In Figure 4 (a), which corresponds to exact estimation (the noiseless oracle O), the strategy spends 20 queries to construct its query codes and terminates one query afterwards with the true binary code, achieving a query ratio of 21/80. On the other hand, when we set v = 2 in Figure 4 (b), the strategy returns a solution that is 4 Hamming distance away from the true binary code. This is not bad for an 80-bit code; however, this is a tightly controlled setup in which the gradient magnitudes take just one of two values. To be studied further is the bias/variance decomposition of the returned solution and the corresponding query ratio. We leave this investigation for future work.
3.2 Approach 2: Optimism in the Face of Uncertainty
In the previous approach, we considered the approximated Hamming distance ((12)) as a surrogate for the formal optimization objective ((4)) of the gradient sign estimation problem, and found that current Hamming search strategies are not robust to the approximation error. In this approach, we instead consider maximizing the directional derivative ((5)) as our formal objective. That is, we treat the problem as binary black-box optimization over the 2ⁿ hypercube vertices, which correspond to all possible sign vectors—a search space significantly larger than the n-dimensional one of the continuous optimization view. Nevertheless, the rationale here is that we do not need to solve (5) to optimality (recall Figure 1); we rather need fast convergence to a suboptimal but adversarially helpful sign vector q. In addition, the continuous optimization view often employs an iterative scheme of many steps within the perturbation ball B_p(x, ε), calling the gradient estimation routine in every step and multiplying the query cost accordingly. In our setup, we use the best obtained solution for (5) so far, in a similar fashion to the noisy FGSM of Figure 1. In other words, our gradient sign estimation routine runs at the top level of our adversarial example generation procedure instead of being called as a subroutine. In this and the next approach, we address the following question: how do we solve (5)?
Optimistic methods—i.e., methods that implement the optimism-in-the-face-of-uncertainty principle—have demonstrated theoretical as well as empirical success when applied to black-box optimization problems (Munos, 2011; Al-Dujaili and Suresh, 2017, 2018). This principle finds its foundations in the machine learning field addressing the exploration vs. exploitation dilemma, known as the multi-armed bandit problem. Within the context of function optimization, optimistic approaches formulate the complex problem of optimizing an arbitrary black-box function (e.g., (5)) over the search space X (the sign vectors {-1, +1}ⁿ in this paper) as a hierarchy of simple bandit problems (Kocsis and Szepesvári, 2006) in the form of a space-partitioning tree search. At each step, the algorithm optimistically expands a leaf node (partitions the corresponding subspace) from the set of leaf nodes that may contain the global optimum. A node at depth h corresponds to a cell X_{h,i}, with the cells at any given depth partitioning X. To each node, a representative point x_{h,i} ∈ X_{h,i} is assigned, and the value of the node is set to the objective's value at that point. See Figure 6 for an example of a space-partitioning tree of the sign vectors, which will be used in our second approach to estimate the gradient sign vector.
Under some assumptions on the optimization objective and the hierarchical partitioning of the search space, optimistic methods enjoy a finite-time bound on their regret, defined as

R_t = max_{q ∈ X} D_q f(x, y) − D_{q(t)} f(x, y) ,

where q(t) is the best solution found by the optimistic method after t steps. The challenge is to align the search space such that these assumptions hold. In the following, we show that they can be satisfied for our optimization objective ((5)). In particular, when the objective is the directional derivative function q ↦ D_q f(x, y), and the hypercube's vertices are aligned on a 1-dimensional line according to the Gray code ordering, we can construct an optimistic algorithm with a finite-time bound on its regret. To demonstrate this, we adopt the Simultaneous Optimistic Optimization (SOO) framework by Munos (2011) and the assumptions therein.
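The Gray code alignment itself is standard; a short sketch (our illustration, not the paper's implementation) of the reflected Gray code ordering of the sign vectors, whose consecutive elements differ in exactly one coordinate—the property that makes the 1-dimensional alignment amenable to the smoothness assumptions below:

```python
# Reflected Gray code alignment of {-1,+1}^n on a line: consecutive
# vertices differ in exactly one coordinate, i.e., they are at Hamming
# distance 1 from each other along the ordering.

def gray_order(n):
    """Return the 2^n sign vectors in reflected Gray code order."""
    out = []
    for k in range(2 ** n):
        bits = k ^ (k >> 1)                  # integer Gray code of k
        out.append([1 if (bits >> i) & 1 else -1 for i in range(n)])
    return out

codes = gray_order(3)
assert len(codes) == 8 and len(set(map(tuple, codes))) == 8   # all vertices
for a, b in zip(codes, codes[1:]):           # neighbors differ in one bit
    assert sum(x != y for x, y in zip(a, b)) == 1
```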
For completeness, we reproduce Munos (2011)'s basic definitions and assumptions in our notation, and show how the gradient sign estimation problem ((5)) satisfies them based on the second property of the directional derivative, as follows.
[Semi-metric] We assume that ℓ : X × X → ℝ⁺ is such that for all q, q' ∈ X, we have ℓ(q, q') = ℓ(q', q) and ℓ(q, q') = 0 if and only if q = q'.
(Near-optimality dimension) The near-optimality dimension is the smallest d ≥ 0 such that there exists C > 0 such that for any ε > 0, the maximal number of disjoint ℓ-balls of radius νε with center in the set of ε-optimal points is less than C ε^{−d}.
[Local smoothness of D_q f] For any input/label pair (x, y), there exists at least one global optimizer q* of (5) (i.e., D_{q*} f(x, y) = max_{q ∈ X} D_q f(x, y)), and for all q ∈ X,

D_{q*} f(x, y) − D_q f(x, y) ≤ ℓ(q, q*) .
Refer to Figure 5 for a pictorial proof of Property 3.2. [Bounded diameters] There exists a decreasing sequence δ(h) > 0 such that for any depth h ≥ 0 and any cell X_{h,i} of depth h, we have sup_{q ∈ X_{h,i}} ℓ(x_{h,i}, q) ≤ δ(h). To see how this assumption is met, refer to Figure 6. [Well-shaped cells] There exists ν > 0 such that for any depth h ≥ 0, any cell X_{h,i} contains an ℓ-ball of radius νδ(h) centered in X_{h,i}. To see how this assumption is met, refer to Figure 6 as well. With the above assumptions satisfied, we propose Gray-code Optimistic Optimization (GOO), an instantiation of (Munos, 2011, Algorithm 2) tailored to our optimization problem ((5)) over a 1-dimensional alignment of the hypercube vertices using the Gray code ordering. The pseudocode is outlined in Algorithm 1. The following theorem bounds GOO's regret.
(Regret Convergence of GOO) Let h(t) be the smallest integer h such that

1 + C Σ_{l=0}^{h} δ(l)^{−d} ≥ t .   (14)

Then the regret of GOO (Algorithm 1) is bounded as

R_t ≤ δ(h(t)) .
We have shown that our objective function ((5)) and the hierarchical partitioning of the sign vectors following the Gray code ordering conform to Property 3.2 and the bounded-diameters and well-shaped-cells assumptions. The additive term in (14) accommodates the evaluation of the root node before growing the space-partitioning tree—see Figure 6. The rest follows from the proof of (Munos, 2011, Theorem 2).
Despite being theoretically founded, GOO is slow in practice. This is expected, since it is a global search technique that considers all the vertices of the n-dimensional hypercube. Recall that we are looking for an adversarially helpful solution that need not be optimal. To this end, we consider the separability property of the directional derivative—a more useful property than its local smoothness—as described in our third approach next.
3.3 Approach 3: Divide & Conquer
Based on the definition of the directional derivative ((2)), we state the following property. [Separability of D_q f] The directional derivative D_q f(x, y) of the loss function f at an input/label pair (x, y) in the direction of a binary code q is separable across the coordinates of q. That is,

D_q f(x, y) = Σ_{i=1}^{n} q_i ∂f(x, y)/∂x_i .
Instead of searching over all 2ⁿ sign vectors (Section 3.2), we employ the above property in a divide-and-conquer search, which we refer to as SignHunter. As outlined in Algorithm 2, the technique starts with a random guess of the sign vector q. It then flips the sign of all the coordinates to get a new sign vector q', and reverts the flips if the loss oracle returns a value (or, equivalently, a directional derivative) less than the best obtained so far. SignHunter applies the same rule to the first half of the coordinates, the second half, the first quadrant, the second quadrant, and so on. For a search space of dimension n, SignHunter needs at most 2n sign flips to complete its search. If the query budget is not exhausted by then, one can update x with the recovered signs and restart the procedure at the updated point with a new starting code (Algorithm 2). In the next theorem, we show that SignHunter is guaranteed to perform at least as well as the Fast Gradient Sign Method (FGSM) after O(n) oracle queries.
(Optimality of SignHunter) Given O(n) queries, SignHunter is at least as effective as FGSM (Goodfellow et al., 2015) in crafting adversarial examples.
The i-th coordinate of the gradient sign vector can be recovered as outlined in (9), which takes two queries. From the definition of SignHunter, this is carried out for all n coordinates after O(n) queries. That is, the gradient sign vector is fully recovered after O(n) queries, and therefore one can employ the FGSM attack to craft adversarial examples. Note that this is under the assumption that our finite difference approximation of the directional derivative ((2)) is good enough (or at least rank-preserving).
Theorem 3.3 provides an upper bound on the number of queries required for SignHunter to recover the gradient sign bits and perform as well as FGSM. In practice (as will be shown in our experiments), SignHunter crafts adversarial examples with a fraction of this upper bound. Note that one could also recover the gradient sign vector with n + 1 queries by starting with an arbitrary sign vector and flipping its bits sequentially. Nevertheless, SignHunter incorporates its queries in a framework of majority voting to recover as many sign bits as possible with as few queries as possible. Consider the case where all the gradient coordinates have the same magnitude—the case of Figure 3—and suppose we start with a random sign vector whose Hamming distance to the optimal sign vector is n/2, agreeing with it in the first half of the coordinates. In this case, SignHunter needs just four queries to recover the entire sign vector, whereas sequential bit flipping would require n + 1 queries.
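A simplified rendering of the divide-and-conquer flipping scheme described above (our sketch, not the reference implementation; the oracle here is the exact directional derivative of a toy gradient rather than a finite difference of loss queries):

```python
# Simplified sketch of the divide-and-conquer sign search: flip chunks of
# coordinates (whole vector, halves, quarters, ..., single coordinates)
# and revert any flip that does not improve the directional derivative.

def sign_search(D, n):
    q = [1] * n                       # initial guess
    best = D(q)
    size = n                          # current chunk size
    while size >= 1:
        for start in range(0, n, size):
            end = min(start + size, n)
            for i in range(start, end):
                q[i] = -q[i]          # flip this chunk
            val = D(q)
            if val < best:            # no improvement: revert the flip
                for i in range(start, end):
                    q[i] = -q[i]
            else:
                best = val
        size //= 2
    return q

g = [0.3, -2.0, 1.5, -0.1, 0.7, -0.4, 0.9, -1.1]    # toy gradient
D = lambda q: sum(qi * gi for qi, gi in zip(q, g))  # exact oracle
assert sign_search(D, len(g)) == [1 if gi > 0 else -1 for gi in g]
```

The final size-1 sweep accepts a single-coordinate flip exactly when q_i g_i < 0, so (with an exact oracle and no zero gradients) the full sweep always ends at q = sign(g); the earlier coarse flips are what let a lucky starting code finish much sooner.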
(Figure 7 panels: Hamming Distance Trace; Directional Derivative Trace.)
Moreover, SignHunter is amenable to parallel hardware architectures and thus can carry out attacks in batches more efficiently, compared to the previously presented approaches. We tested both GOO and SignHunter (along with the Hamming-oracle-based search strategies) on a set of toy problems and found that SignHunter performs significantly better than GOO, while the Hamming-oracle-based strategies were sensitive to the approximation error—see Figure 7. For these reasons, in our experiments on the real datasets MNIST, CIFAR10, and IMAGENET, we opted for SignHunter as our algorithm of choice to estimate the gradient sign in crafting black-box adversarial attacks, as outlined in Algorithm 3.
4 Experiments
In this section, we evaluate SignHunter and compare it with established algorithms from the literature—ZO-signSGD (Liu et al., 2019), NES (Ilyas et al., 2018), and BanditsTD (Ilyas et al., 2019)—in terms of their effectiveness in crafting untargeted black-box adversarial examples. Both ℓ∞ and ℓ2 threat models are considered on the MNIST, CIFAR10, and IMAGENET datasets.
4.1 Experiments Setup
Our experiment setup is similar to that of (Ilyas et al., 2019). Each attacker is given a fixed budget of oracle queries per attack attempt and is evaluated on images from the test sets of MNIST, CIFAR10, and IMAGENET. We did not find a standard practice for setting the perturbation bound ε; arbitrary bounds were used in several papers. We set the perturbation bounds as follows.
We show results based on standard models. For MNIST and CIFAR10, the naturally trained models from (Madry et al., 2017)'s MNIST⁶ and CIFAR10⁷ challenges are used. For IMAGENET, the Inception-V3 model from TensorFlow is used.⁸ The loss oracle represents the cross-entropy loss of the respective model. The general setup of the experiments is summarized in Table 3 in Appendix B. (⁶https://github.com/MadryLab/mnist_challenge ⁷https://github.com/MadryLab/cifar10_challenge ⁸https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/slim/python/slim/nets/inception_v3_test.py)
4.2 Hyperparameters Setup
To ensure a fair comparison among the considered algorithms, we did our best to tune their hyperparameters. Initially, the hyperparameters were set to the values reported by the corresponding authors, for which we observed suboptimal performance. This can be attributed to using a different software framework (e.g., TensorFlow vs. PyTorch), different models, or a different input transformation (e.g., some models take pixel values in [0, 1] while others are built for [0, 255]). We therefore made use of a synthetic concave loss function to tune the algorithms' parameters for each dataset/perturbation-constraint combination. The performance curves on the synthetic loss function using the tuned hyperparameter values were consistent with the reported results from the literature—for instance, in which algorithms converge faster early on and which outperform the rest towards the end of the query budget. That said, we invite the community to provide their best-tuned attacks. Note that SignHunter does not have any hyperparameters to tune: its finite difference probe δ is set to the perturbation bound ε, because this perturbation is used both for computing the finite difference and for crafting the adversarial examples—see Line 3 in Algorithm 2. This parameter-free setup of SignHunter offers a robust edge over the state-of-the-art black-box attacks, which often require expert knowledge to carefully tune their parameters, as discussed above. More details on the hyperparameter setup can be found in Appendix B.
Our evaluation figures show the trade-off between the success (evasion) rate and the mean number of queries (over successful attacks) needed to generate an adversarial example for the MNIST, CIFAR10, and IMAGENET classifiers under the ℓ∞ and ℓ2 perturbation constraints, respectively. In other words, these figures indicate the average number of queries required for a desired success rate. A tabulated summary of these plots can be found in Appendix D, namely Tables 8, 9, and 10. Furthermore, we plot the classifier loss and the gradient estimation quality (in terms of Hamming distance and cosine similarity), averaged over all images, as a function of the number of queries used, as shown in Figures 16, 17, and 18 in Appendix D. Based on the results, we observe the following:
For any given success rate, our algorithm dominates the previous state-of-the-art approaches in all settings except one, where one of the baselines shows better query efficiency once the desired success rate is high enough. (To be accurate, all the algorithms are comparable in that setup at lower success rates.)
Our algorithm is remarkably efficient in the ℓ∞ setup (e.g., achieving a near-perfect evasion rate using, on average, only a handful of queries per image against the MNIST classifier). Its performance degrades, though it still outperforms the rest most of the time, in the ℓ2 setup. This is expected: our algorithm perturbs all coordinates by the same magnitude, and the ℓ2 perturbation bound for all the datasets in our experiments is set well below the value at which such a full-sign perturbation exhausts the budget, as shown in Table 3. For our algorithm, an ℓ2 budget spread uniformly over all n coordinates is equivalent to a small per-coordinate perturbation. The employed ℓ2 bounds therefore give the state-of-the-art, continuous-optimization-based approaches more perturbation options: such approaches can spend the whole budget on one pixel of an MNIST image, split it over two pixels at a smaller magnitude each, or over ten pixels at a yet smaller magnitude each. The binary optimization view of our algorithm, on the other hand, limits it to always perturbing all pixels by the same magnitude. Despite its fewer degrees of freedom, our algorithm maintains its effectiveness in the ℓ2 setup. The ℓ2 plots can also be viewed as a sensitivity assessment of our algorithm as the per-coordinate perturbation gets smaller for each dataset.
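To make the ℓ2 budget argument concrete, the sketch below computes the per-coordinate magnitude forced on a full-sign perturbation; the budget value of 3.0 and the MNIST image size are illustrative assumptions, not the exact values from Table 3.

```python
import math

def per_coordinate_magnitude(eps_2: float, n: int) -> float:
    """Magnitude each coordinate gets when an l2 budget eps_2 is
    spread uniformly over all n coordinates (full-sign perturbation)."""
    return eps_2 / math.sqrt(n)

def l2_norm_of_k_pixel_attack(k: int, magnitude: float) -> float:
    """l2 norm when only k coordinates are perturbed by +/-magnitude."""
    return magnitude * math.sqrt(k)

# MNIST-sized input (28 x 28 = 784 pixels), hypothetical l2 budget of 3.0.
n, eps_2 = 784, 3.0
m = per_coordinate_magnitude(eps_2, n)  # roughly 0.107 per pixel

# A continuous attack may instead spend the whole budget on a single pixel:
assert abs(l2_norm_of_k_pixel_attack(1, eps_2) - eps_2) < 1e-12
# ...whereas the sign-based attack is pinned to the uniform allocation:
assert abs(l2_norm_of_k_pixel_attack(n, m) - eps_2) < 1e-9
```

The uniform allocation shrinks as 1/√n, which is why the gap between the two attack families widens with the input dimension.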
Incorporating our sign estimation in an iterative framework that keeps perturbing the data point until the query budget is exhausted (Algorithm 3) supports the observation in white-box settings that the iterative FGSM, or Projected Gradient Descent (PGD), is stronger than the single-step FGSM (Madry et al., 2017; Al-Dujaili et al., 2018). This is evident from the upticks in our algorithm's performance (Figure 16: classifier's loss, average cosine distance, and average Hamming similarity plots), which occur after every iteration's batch of queries.
Plots of the average Hamming similarity capture the quality of the gradient sign estimation in terms of (4), while plots of the average cosine similarity capture it in terms of (5). Both our algorithm and the baselines consistently optimize both objectives. In general, our algorithm enjoys faster convergence, especially on the Hamming metric, because it estimates only the signs rather than the full gradient; this is highlighted in the ℓ∞ setup. Note that once an attack is successful, the gradient sign estimate at that point is used for the rest of the plot. This explains why, in the ℓ∞ settings, our algorithm's plot does not improve compared to its ℓ2 counterpart: most of the attacks succeed within the very first few queries made to the oracle.
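The two estimation-quality metrics can be computed as follows; this is a minimal sketch with function names of our choosing, matching the definitions above.

```python
import numpy as np

def hamming_similarity(s_est: np.ndarray, g: np.ndarray) -> float:
    """Fraction of coordinates whose estimated sign matches sign(g)."""
    return float(np.mean(np.sign(s_est) == np.sign(g)))

def cosine_similarity(s_est: np.ndarray, g: np.ndarray) -> float:
    """Cosine of the angle between the estimate and the true gradient."""
    return float(s_est @ g / (np.linalg.norm(s_est) * np.linalg.norm(g)))

g = np.array([0.5, -2.0, 0.1, -0.3])   # toy "true" gradient
s = np.array([1.0, -1.0, -1.0, -1.0])  # sign estimate with one bit wrong
assert hamming_similarity(s, g) == 0.75
```

Note that a sign estimate can score well on the Hamming metric while its cosine similarity remains modest, since the latter also weighs the gradient magnitudes.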
5 Public Black-Box Attack Challenges
To complement our results in Section 4, we evaluated our approach against adversarial training, an effective way to improve the robustness of DNNs (Madry et al., 2017; Al-Dujaili et al., 2018). In particular, we attacked the secret models used in two public challenges as follows.
5.1 Public MNIST Black-Box Attack Challenge
In line with the challenge setup (https://github.com/MadryLab/mnist_challenge), we attacked the test images under the challenge's ℓ∞ perturbation bound. Although the secret model has been released, we treated it as a black box, as in our experiments in Section 4. The challenge does not specify a maximum query budget, so we chose one comparable to the number of iterations given to a PGD attack in the white-box setup of the challenge: 100 steps with 50 random restarts. As shown in Table 1, our attack resulted in the lowest model accuracy, outperforming all other state-of-the-art attack strategies submitted to the challenge, with a modest average number of queries per successful attack. Note that the model accuracy under the most powerful white-box attack, by Zheng et al. (2018), is not shown in the table.
| Black-Box Attack | Model Accuracy |
| --- | --- |
| Xiao et al. (2018) | |
| PGD against three independently and adversarially trained copies of the network | |
| On the CW loss for model B from (Tramèr et al., 2017) | |
| On the CW loss for the naturally trained public network | |
| PGD on the cross-entropy loss for the naturally trained public network | |
| Attack using Gaussian filter for selected pixels on the adversarially trained public network | |
| On the cross-entropy loss for the adversarially trained public network | |
| PGD on the cross-entropy loss for the adversarially trained public network | |
| Black-Box Attack | Model Accuracy |
| --- | --- |
| PGD on the cross-entropy loss for the adversarially trained public network | |
| PGD on the CW loss for the adversarially trained public network | |
| On the CW loss for the adversarially trained public network | |
| On the CW loss for the naturally trained public network | |
5.2 Public CIFAR10 Black-Box Attack Challenge
In line with the challenge setup (https://github.com/MadryLab/cifar10_challenge), we attacked the test images under the challenge's ℓ∞ perturbation bound. Although the secret model has been released, we treated it as a black box, as in our experiments in Section 4. We set the query budget as in Section 5.1. As shown in Table 2, our attack resulted in the lowest model accuracy, outperforming all other state-of-the-art attack strategies submitted to the challenge, with a modest average number of queries per successful attack. Note that the model accuracy under the most powerful white-box attack, by Zheng et al. (2018), is not shown in the table.
6 Open Questions
There are many interesting questions left open by our research:
Priors. The current version of our algorithm does not exploit any data- or time-dependent priors. With such priors, algorithms such as the bandits approach of Ilyas et al. (2019) operate on a search space of dimensionality lower than that of the input. In domain-specific examples such as images, can Binary Partition Trees (BPT) (Al-Dujaili et al., 2015) be incorporated into our algorithm to obtain a data-dependent grouping of gradient coordinates instead of the current equal-size grouping?
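The equal-size grouping idea can be illustrated with a simplified, hypothetical sketch: start from the all-ones sign vector, then greedily flip contiguous, equal-size chunks of sign bits, keeping a flip only when a (noiseless, separable) oracle improves. This is an illustration of the grouping concept, not the paper's exact algorithm.

```python
import numpy as np

def flip_groups_search(loss_fn, n, num_rounds=3):
    """Greedy group-wise sign search: at each level, split the n
    coordinates into equal-size contiguous chunks and keep a chunk flip
    only if it increases the oracle value returned by loss_fn."""
    s = np.ones(n)
    best = loss_fn(s)
    chunks = 1
    for _ in range(num_rounds):
        size = int(np.ceil(n / chunks))
        for i in range(chunks):
            lo, hi = i * size, min((i + 1) * size, n)
            s[lo:hi] *= -1          # tentatively flip this chunk
            val = loss_fn(s)
            if val > best:
                best = val          # keep the flip
            else:
                s[lo:hi] *= -1      # revert
        chunks = min(2 * chunks, n)  # refine: halve the chunk size
    return s, best

# Toy separable oracle: inner product with a hidden gradient direction.
g = np.array([0.3, -1.2, 0.8, -0.1, -0.5, 0.9, -0.7, 0.2])
s, _ = flip_groups_search(lambda v: float(v @ g), len(g), num_rounds=4)
assert np.array_equal(s, np.sign(g))  # full sign recovery on this toy oracle
```

With a separable oracle, the final single-coordinate pass fixes every remaining bit, so the hidden sign vector is recovered; a data-dependent (e.g., BPT-based) grouping would replace the contiguous chunks with image-adaptive regions.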
Sign Estimation for Continuous Optimization/Reinforcement Learning. In (Salimans et al., 2017; Chrabaszcz et al., 2018), it was shown that a class of black-box continuous optimization algorithms (including a very basic canonical ES algorithm) rival the performance of standard reinforcement learning techniques. Our algorithm, on the other hand, is tailored towards recovering the gradient sign bits and crafting adversarial examples with the best gradient sign estimate obtained so far. Can we incorporate it in an iterative framework for general continuous optimization? Figure 11 shows a small, preliminary experiment comparing two baselines to a simple iterative framework employing our sign estimation. In the regime of high dimension and few iterations, our approach can be remarkably faster. However, with more iterations, the algorithm fails to improve further and starts to oscillate. The reason is that it always produces ±1 updates (a non-standard sign convention with no zero entries), whereas the other algorithms' updates can be zero. Can we get the best of both worlds using clever initializations and adaptive step-size updates?
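The oscillation issue can be reproduced with a toy sign-descent loop; the quadratic objective, step size, and iteration count below are illustrative assumptions rather than the experiment in Figure 11.

```python
import numpy as np

def sign_descent(grad_fn, x0, step, iters):
    """Minimise f by stepping along -sign(grad): every coordinate moves
    by exactly +/-step, so updates are never zero, which causes the
    oscillation discussed above once the iterate is near the optimum."""
    x = x0.copy()
    for _ in range(iters):
        x -= step * np.sign(grad_fn(x))
    return x

# Quadratic bowl f(x) = ||x||^2 / 2 with gradient x; start far from 0.
x = sign_descent(lambda x: x, x0=np.full(1000, 10.0), step=0.05, iters=250)
# Every coordinate reaches the optimum quickly, then bounces within
# one step of it instead of converging exactly:
assert np.all(np.abs(x) <= 0.05 + 1e-9)
```

An adaptive step size (or a convention that allows zero updates) would let the iterate settle instead of bouncing, which is precisely the open question posed above.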
Perturbation Vertices. We define perturbation vertices as the extreme points of the ℓ∞ perturbation region: points of the form x′ = x + εs with s ∈ {−1, +1}^n, i.e., x′_i = x_i + ε when s_i = +1 and x′_i = x_i − ε when s_i = −1 (see Figure 10). Using its first batch of queries, our algorithm probes extreme points of the perturbation region as potential adversarial examples, while iterative continuous optimization methods probe points in a Gaussian sphere around the current point, as shown in Figure 10. Does looking up extreme points (vertices) of the perturbation region suffice to craft adversarial examples? If so, how can we search through them efficiently? Our algorithm searches through only a tiny fraction of the 2^n vertices, and it finds adversarial examples among that tiny fraction: in the MNIST ℓ∞ setup of Section 4, looking up a handful of vertices per image was enough for full evasion. Note that after its initial batch of queries, our algorithm may no longer visit vertices exactly, as subsequent probes drift away from them, as shown in Figure 10; we ignored this effect in our experiments, where it is negligible in the ℓ∞ setup. Would our algorithm be more effective if the probes were made strictly at the perturbation vertices? This question shows up clearly in the public MNIST challenge, where the loss value at the potential adversarial examples dips after the initial batch of queries (see the top-left plot of Figure 19). We believe the reason is that these potential adversarial examples are not extreme points, as illustrated in Figure 10: they are like the red ball 2 rather than the red ball 1.
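Constructing a perturbation vertex is straightforward; the helper below is a hypothetical illustration of the definition, and the contrast with a Gaussian probe shows why continuous methods rarely land exactly on a vertex.

```python
import numpy as np

def vertex(x: np.ndarray, eps: float, s: np.ndarray) -> np.ndarray:
    """Perturbation vertex: move every coordinate of x to the boundary
    of the l-infinity ball of radius eps, in the direction given by s."""
    assert set(np.unique(s)) <= {-1.0, 1.0}  # s must be a sign vector
    return x + eps * s

x = np.array([0.2, 0.5, 0.9])
v = vertex(x, eps=0.3, s=np.array([1.0, -1.0, 1.0]))
# Every coordinate of a vertex sits exactly eps away from x:
assert np.allclose(np.abs(v - x), 0.3)

# A Gaussian probe around x, by contrast, almost surely has coordinates
# at varying (not extreme) distances from x:
probe = x + 0.3 * np.random.default_rng(0).normal(size=3)
```

There are 2^n such vertices in total, so any practical search can only ever visit a vanishing fraction of them.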
Adversarial Training. Compared to other attacks, our approach showed more effectiveness against adversarially trained models. Standard adversarial training relies on inner maximizers (attacks) that employ iterative continuous optimization methods such as PGD, in contrast to our attack, which stems from a binary optimization view. What are the implications?
Other Domains. Much of the work on understanding and countering adversarial examples has occurred in the image classification domain. The binary view of our approach lends itself naturally to other domains where binary features are used (e.g., malware detection (Al-Dujaili et al., 2018; Luca et al., 2019)). How effective is our approach in these domains?
(a) ℓ∞ perturbation. (b) ℓ2 perturbation. The shaded region indicates the standard deviation of results over random trials. We used a fixed step size in line with (Liu et al., 2019) and a small finite-difference perturbation. The starting point for all the algorithms was set to the all-ones vector.
In this paper, we studied the problem of generating adversarial examples for neural networks under a black-box threat model. Motivated by (i) the significant empirical effectiveness of gradient sign information and (ii) the low query complexity of recovering a sign vector using a noiseless Hamming distance oracle, we proposed the gradient sign estimation problem as the core challenge in crafting adversarial examples, and we formulated it as a binary black-box optimization problem: minimizing the Hamming distance to the gradient sign or, equivalently, maximizing the directional derivative.
Approximating the directional derivative by finite differences of loss-value queries, we examined three properties of the directional derivative of the model's loss in the direction of sign vectors. Based on the first property, the loss oracle can be used as a noisy Hamming distance oracle; we found that existing Hamming search strategies (e.g., Maurer (2009)) are not suitable for such noisy oracles. The second property lets us employ the optimism-in-the-face-of-uncertainty principle in the form of hierarchical bandits. This resulted in an optimistic optimization algorithm for binary black-box optimization problems with a finite-time analysis of its regret; however, its query complexity is worse than that of the continuous optimization setup. The third property, separability, helped us devise a divide-and-conquer algorithm that is guaranteed to perform at least as well as FGSM after a number of queries linear in the input dimension; in practice, it needs only a fraction of that number of queries to craft adversarial examples. To verify its effectiveness on real-world datasets, our algorithm was compared against the state-of-the-art black-box attacks on neural network models for the MNIST, CIFAR10, and IMAGENET datasets. It yields black-box attacks that are more query-efficient and less failure-prone than the state-of-the-art attacks combined. Moreover, it achieves the highest evasion rate on two public black-box attack challenges, surpassing other attacks that are based on transferability and generative adversarial networks. Our future work will investigate the research questions raised above.
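The finite-difference approximation of the directional derivative underlying this formulation can be sketched as follows; the toy linear loss and the two-query scheme are assumptions for illustration.

```python
import numpy as np

def directional_derivative(loss, x, v, delta=1e-4):
    """Two-query finite-difference estimate of the directional derivative
    of `loss` at x in the (normalised) direction of the sign vector v."""
    u = v / np.linalg.norm(v)
    return (loss(x + delta * u) - loss(x)) / delta

# Toy loss with known gradient g: over sign vectors v, the directional
# derivative is maximised exactly when v matches sign(g).
g = np.array([1.0, -2.0, 0.5])
loss = lambda x: float(x @ g)
d_good = directional_derivative(loss, np.zeros(3), np.sign(g))
d_bad = directional_derivative(loss, np.zeros(3), -np.sign(g))
assert d_good > d_bad
```

For a linear loss, the finite difference is exact; for a real network's loss it is a noisy estimate, which is what makes the resulting Hamming oracle noisy.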
This work was supported by the MIT-IBM Watson AI Lab. The authors would like to thank Shashank Srikant for his timely help.
- Al-Dujaili and Suresh (2017) Abdullah Al-Dujaili and Sundaram Suresh. Embedded bandits for large-scale black-box optimization. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
- Al-Dujaili and Suresh (2018) Abdullah Al-Dujaili and Sundaram Suresh. Multi-objective simultaneous optimistic optimization. Information Sciences, 424:159–174, 2018.
- Al-Dujaili et al. (2015) Abdullah Al-Dujaili, François Merciol, and Sébastien Lefèvre. Graphbpt: An efficient hierarchical data structure for image representation and probabilistic inference. In International Symposium on Mathematical Morphology and Its Applications to Signal and Image Processing, pages 301–312. Springer, 2015.
- Al-Dujaili et al. (2018) Abdullah Al-Dujaili, Alex Huang, Erik Hemberg, and Una-May O’Reilly. Adversarial deep learning for robust detection of binary encoded malware. In 2018 IEEE Security and Privacy Workshops (SPW), pages 76–82. IEEE, 2018.
- Bernstein et al. (2018) Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signSGD: Compressed optimisation for non-convex problems. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 560–569, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/bernstein18a.html.
- Bhagoji et al. (2017) Arjun Nitin Bhagoji, Warren He, Bo Li, and Dawn Song. Exploring the space of black-box attacks on deep neural networks. arXiv preprint arXiv:1712.09491, 2017.
- Biggio and Roli (2018) Battista Biggio and Fabio Roli. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition, 84:317–331, 2018.
- Biggio et al. (2013) Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases, pages 387–402. Springer, 2013.
- Carlini and Wagner (2017) Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.
- Chen et al. (2017) Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 15–26. ACM, 2017.
- Chrabaszcz et al. (2018) Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. Back to basics: Benchmarking canonical evolution strategies for playing atari. arXiv preprint arXiv:1802.08842, 2018.
- Cohen et al. (2019) Jeremy Cohen, Elan Rosenfeld, and J. Zico Kolter. Certified adversarial robustness via randomized smoothing. arXiv:1902.02918v1, 2019.
- Goodfellow et al. (2015) Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015. URL http://arxiv.org/abs/1412.6572.
- Hayes and Danezis (2017) Jamie Hayes and George Danezis. Machine learning as an adversarial service: Learning black-box adversarial examples. CoRR, abs/1708.05207, 2017.
- Huang et al. (2018) Alex Huang, Abdullah Al-Dujaili, Erik Hemberg, and Una-May O’Reilly. On visual hallmarks of robustness to adversarial malware. arXiv preprint arXiv:1805.03553, 2018.
- Ilyas et al. (2018) Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. Black-box adversarial attacks with limited queries and information. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2137–2146, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/ilyas18a.html.
- Ilyas et al. (2019) Andrew Ilyas, Logan Engstrom, and Aleksander Madry. Prior convictions: Black-box adversarial attacks with bandits and priors. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BkMiWhR5K7.
- Kocsis and Szepesvári (2006) Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In European conference on machine learning, pages 282–293. Springer, 2006.
- Kurakin et al. (2017) Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial machine learning at scale. 2017. URL https://arxiv.org/abs/1611.01236.
- Liu et al. (2019) Sijia Liu, Pin-Yu Chen, Xiangyi Chen, and Mingyi Hong. signSGD via zeroth-order oracle. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BJe-DsC5Fm.
- Liu et al. (2016) Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770, 2016.
- Luca et al. (2019) Demetrio Luca, Battista Biggio, Lagorio Giovanni, Fabio Roli, and Armando Alessandro. Explaining vulnerabilities of deep learning to adversarial malware binaries. In ITASEC19, 2019.
- Madry et al. (2017) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
- Maurer (2009) Peter M. Maurer. A search strategy using a Hamming-distance oracle. 2009.
- Moosavi-Dezfooli et al. (2016) Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582, 2016.
- Munos (2011) Rémi Munos. Optimistic optimization of a deterministic function without the knowledge of its smoothness. In Advances in neural information processing systems, pages 783–791, 2011.
- Narodytska and Kasiviswanathan (2017) Nina Narodytska and Shiva Prasad Kasiviswanathan. Simple black-box adversarial attacks on deep neural networks. In CVPR Workshops, volume 2, 2017.
- Nelson et al. (2012) Blaine Nelson, Benjamin IP Rubinstein, Ling Huang, Anthony D Joseph, Steven J Lee, Satish Rao, and JD Tygar. Query strategies for evading convex-inducing classifiers. Journal of Machine Learning Research, 13(May):1293–1332, 2012.
- Papernot et al. (2017) Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pages 506–519. ACM, 2017.
- Salimans et al. (2017) Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
- Shamir et al. (2019) Adi Shamir, Itay Safran, Eyal Ronen, and Orr Dunkelman. A simple explanation for the existence of adversarial examples with small hamming distance. arXiv preprint arXiv:1901.10861, 2019.
- Tramèr et al. (2017) Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017.
- Tu et al. (2018) Chun-Chen Tu, Paishun Ting, Pin-Yu Chen, Sijia Liu, Huan Zhang, Jinfeng Yi, Cho-Jui Hsieh, and Shin-Ming Cheng. Autozoom: Autoencoder-based zeroth order optimization method for attacking black-box neural networks. arXiv preprint arXiv:1805.11770, 2018.
- Vaishampayan (2012) Vinay Anant Vaishampayan. Query matrices for retrieving binary vectors based on the hamming distance oracle. arXiv preprint arXiv:1202.2794, 2012.
- Xiao et al. (2018) Chaowei Xiao, Bo Li, Jun-Yan Zhu, Warren He, Mingyan Liu, and Dawn Song. Generating adversarial examples with adversarial networks, 2018. URL https://openreview.net/forum?id=HknbyQbC-.
- Zhao (2018) Yun-Bin Zhao. Sparse optimization theory and methods. CRC Press, an imprint of Taylor and Francis, Boca Raton, FL, 2018. ISBN 978-1138080942.
- Zheng et al. (2018) Tianhang Zheng, Changyou Chen, and Kui Ren. Distributionally adversarial attack. arXiv preprint arXiv:1808.05537, 2018.
Appendix A. Noisy FGSM
This section shows the performance of the noisy FGSM on the standard models (described in Section 4) for the MNIST, CIFAR10, and IMAGENET datasets. Figure 12 considers the ℓ∞ threat perturbation constraint, while Figure 13 reports the performance in the ℓ2 setup. Similar to Ilyas et al. (2019), for each k in the experiment, the top k percent of the signs of the coordinates, chosen either randomly (random-k) or by the corresponding magnitude (top-k), are set correctly, and the rest are set to ±1 at random. The misclassification rate shown considers only images that were correctly classified without adversarial perturbation; in accordance with the models' accuracy, the number of such images out of the sampled images differs slightly across MNIST, CIFAR10, and IMAGENET. These figures also serve as a validation of Theorem 3.3 when compared to our algorithm's performance shown in Appendix C.
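The top-k/random-k corruption scheme can be sketched as follows; the function name and interface are ours, mirroring the description above.

```python
import numpy as np

def noisy_sign(g, k_frac, rng, mode="top"):
    """Return a sign vector agreeing with sign(g) on a k_frac fraction of
    the coordinates (chosen by |g| magnitude for `top`, or uniformly at
    random for `random`), and set to +/-1 uniformly at random elsewhere."""
    n = g.size
    k = int(round(k_frac * n))
    idx = (np.argsort(-np.abs(g))[:k] if mode == "top"
           else rng.choice(n, size=k, replace=False))
    s = rng.choice([-1.0, 1.0], size=n)  # random signs everywhere...
    s[idx] = np.sign(g[idx])             # ...except the k correct ones
    return s

rng = np.random.default_rng(0)
g = rng.normal(size=1000)
s = noisy_sign(g, 0.8, rng, mode="top")
# At least the 80% largest-magnitude coordinates are correct:
assert np.mean(s == np.sign(g)) >= 0.8
```

Feeding such vectors to an FGSM-style step then measures how much sign accuracy is actually needed for evasion.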
Appendix B. Experiments Setup
This section outlines the experiments setup. Figure 14 shows the performance of the considered algorithms on a synthetic concave loss function after tuning their hyperparameters. A possible explanation of our algorithm's strong performance there is that the synthetic loss function is well behaved in terms of its gradient given an image: most of the gradient coordinates share the same sign, since pixels tend to have similar values and the optimal value for all pixels is the same. Thus, our algorithm recovers the true gradient sign with as few queries as possible (recall the example in Section 3.3). Moreover, given the structure of the synthetic loss function, the optimal loss value is always at the boundary of the perturbation region, which is where our algorithm samples its perturbations. Tables 4, 5, 6, and 7 outline the algorithms' hyperparameters, while Table 3 describes the general setup of the experiments.
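A minimal sketch of such a synthetic concave loss follows; the quadratic form and the optimal pixel value of 0.5 are assumptions for illustration, not the exact function used for tuning.

```python
import numpy as np

X_OPT = 0.5  # assumed optimal pixel value (illustrative)

def synthetic_loss(x: np.ndarray) -> float:
    """Concave, separable surrogate loss: maximised when every pixel
    equals X_OPT, with its optimum on the perturbation boundary for any
    image that starts away from X_OPT."""
    return float(-np.sum((x - X_OPT) ** 2))

def synthetic_grad(x: np.ndarray) -> np.ndarray:
    return -2.0 * (x - X_OPT)

x = np.full(784, 0.1)       # MNIST-sized, uniformly dark image
g = synthetic_grad(x)
# Every gradient coordinate shares the same sign, so a sign-recovery
# attack needs very few queries on this benchmark:
assert np.all(np.sign(g) == 1.0)
```

This also explains the caveat above: the benchmark flatters sign-based search, so conclusions drawn from it should be checked against real model losses.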