There are No Bit Parts for Sign Bits in Black-Box Attacks

02/19/2019 ∙ by Abdullah Al-Dujaili, et al. ∙ MIT

Machine learning models are vulnerable to adversarial examples. In this paper, we are concerned with black-box adversarial attacks, where only loss-oracle access to a model is available. At the heart of black-box adversarial attacks is the gradient estimation problem, whose standard query complexity is O(n), where n is the number of data features. Recent work has developed query-efficient gradient estimation schemes by exploiting data- and/or time-dependent priors. Practically, sign-based optimization has been shown to be effective in both training deep nets and attacking them in a white-box setting. Therefore, instead of a gradient estimation view of black-box adversarial attacks, we view the black-box adversarial attack problem as estimating the gradient's sign bits. This shifts the view from continuous to binary black-box optimization and theoretically guarantees a lower query complexity of Ω(n / log_2(n+1)) when given access to a Hamming loss oracle. We present three algorithms to estimate the gradient sign bits given a limited number of queries to the loss oracle. Using one of our proposed algorithms to craft black-box adversarial examples, we demonstrate evasion-rate experiments on standard models trained on the MNIST, CIFAR10, and IMAGENET datasets that set new state-of-the-art results for query-efficient black-box attacks. Averaged over all the datasets and metrics, our attack fails 3.8× less often and spends in total 2.5× fewer queries than the current state-of-the-art attacks combined, given a budget of 10,000 queries per attack attempt. On a public MNIST black-box attack challenge, our attack achieves the highest evasion rate, surpassing all of the submitted attacks. Notably, our attack is hyperparameter-free (no hyperparameter tuning) and does not employ any data-/time-dependent prior, the latter fact suggesting that the number of queries can be reduced further.


1 Introduction

1.1 Problem

Deep Neural Networks (DNNs) are vulnerable to adversarial examples, which are malicious inputs designed to fool the network’s prediction—see (Biggio and Roli, 2018) for a comprehensive, recent overview of adversarial examples. Research on generating these malicious inputs started in the white-box setting, where access to the gradients of the models was assumed. Since the gradient points in the direction of steepest ascent, a malicious input can be perturbed along that gradient to maximize the network’s loss, thereby fooling its prediction. The assumption of access to the underlying gradient does not, however, reflect real-world scenarios. Attacks under a more realistic and restrictive black-box threat model, which does not assume access to gradients, have since been studied, as summarized in Section 1.2.

Central to the approach of generating adversarial examples in a black-box threat model is estimating the gradients of the model being attacked. In estimating these gradients (their magnitudes and signs), the community at large has formulated it as a problem in continuous optimization, seeking to reduce the query complexity from the standard O(n), where n is the number of input features/covariates. In this paper, we take a different view and focus on estimating just the sign of the gradient by reformulating the problem as minimizing the Hamming distance to the gradient sign. Given access to a Hamming distance oracle, this view guarantees a query complexity of Ω(n / log_2(n+1)): an order of magnitude smaller than the full gradient estimation’s query complexity for most practically occurring input dimensions n. Our key objective is to answer the following:

Is it possible to estimate only the sign of the gradient with such query efficiency and generate adversarial examples as effective as those generated by full gradient estimation approaches?

We propose a novel formulation that attempts to achieve this by exploiting properties of the directional derivative of the loss function of the model under attack, and through rigorous empirical evaluation we show that our approach outperforms state-of-the-art full-gradient estimation techniques. We also identify several key areas of research which we believe will help the community towards query-efficient adversarial attacks and gradient-free optimization.

1.2 Related Work

We organize the related work in two themes, namely Adversarial Example Generation and Sign-Based Optimization.

Adversarial Example Generation. This literature can be organized as generating examples in either a white-box or a black-box setting. Nelson et al. (2012) provide a theoretical framework to analyze adversarial querying in a white-box setting. Following the works of Biggio et al. (2013) and Goodfellow et al. (2015), who introduced the fast gradient sign method (FGSM), several methods to produce adversarial examples have been proposed for various learning tasks and threat perturbation constraints (Carlini and Wagner, 2017; Moosavi-Dezfooli et al., 2016; Hayes and Danezis, 2017; Al-Dujaili et al., 2018; Huang et al., 2018; Kurakin et al., 2017; Shamir et al., 2019). These methods assume a white-box setup and are not the focus of this work. An approach that has received the community’s attention involves learning adversarial examples on one model (with access to its gradient information) and transferring them against another (Liu et al., 2016; Papernot et al., 2017). As an alternative to the transferability phenomenon, Xiao et al. (2018) use a Generative Adversarial Network (GAN) to generate adversarial examples based on small norm-bounded perturbations. Both approaches involve learning on a different model, which is expensive, and do not lend themselves to comparison in our setup, where we directly query the model of interest. Among works which generate examples in a black-box setting through iterative optimization schemes, Narodytska and Kasiviswanathan (2017) showed how a naïve policy of perturbing random segments of an image achieved adversarial example generation; they do not use any gradient information. Bhagoji et al. (2017) reduce the dimensions of the feature space using Principal Component Analysis (PCA) and random feature grouping before estimating gradients, which enables them to bound the number of queries made.

Chen et al. (2017) introduced a principled approach to solving this problem using gradient-based optimization. They employ finite differences, a zeroth-order optimization tool, to estimate the gradient and then use it to design a gradient-based attack on models. While this approach successfully generates adversarial examples, it is expensive in the number of queries made to the model. Ilyas et al. (2018) substitute traditional finite-difference methods with Natural Evolutionary Strategies (NES) to obtain an estimate of the gradient. Tu et al. (2018) provide an adaptive random gradient estimation algorithm that balances query counts and distortion, and introduce a trained auto-encoder to achieve attack acceleration. Ilyas et al. (2019) extend this line of work by proposing the idea of gradient priors. Our work contrasts with the general approach of these works: we investigate whether just estimating the sign of the gradient suffices to efficiently generate examples.

Sign-Based Optimization.

In the context of general-purpose continuous optimization methods, sign-based stochastic gradient descent was studied in both zeroth- and first-order setups. In the latter, Bernstein et al. (2018) analyzed signSGD, a sign-based Stochastic Gradient Descent, and showed that it enjoys faster empirical convergence than SGD, in addition to the cost reduction of communicating gradients across multiple workers. Liu et al. (2019) extended signSGD to the zeroth-order setup with the ZO-SignSGD algorithm, which requires a factor of √n more iterations than signSGD, where n is the number of optimization variables and the convergence rate is expressed in terms of the number of iterations T.

Adversarial Examples Meet Sign-Based Optimization. In the context of adversarial example generation, the effectiveness of the sign of the gradient coordinates was noted in both white- and black-box settings. In the former, the Fast Gradient Sign Method (FGSM)—which is algorithmically similar to signSGD—was proposed to generate white-box adversarial examples (Goodfellow et al., 2015). Ilyas et al. (2019) examined a noisy version of FGSM to address the question of how accurate a gradient estimate must be to execute a successful attack on a neural net. In Figure 1, we reproduce their experiment on an IMAGENET-based model—Plot (c)—and extend it to the MNIST and CIFAR10 datasets—Plots (a) and (b). Observe that estimating the sign of a fraction of the top gradient coordinates (in terms of their magnitudes) is enough to achieve a high misclassification rate. Furthermore, ZO-SignSGD (Liu et al., 2019) was shown to perform better than NES at generating adversarial examples against a black-box neural network on the MNIST dataset.

Figure 1: Misclassification rate of three neural nets (on (a) MNIST, (b) CIFAR10, and (c) IMAGENET, respectively) on the noisy FGSM’s adversarial examples as a function of the number of correctly estimated coordinates of sign(∇_x L(x, y)), computed on random images from the corresponding evaluation dataset. Across all the models, estimating the sign of a fraction of the top gradient coordinates (in terms of their magnitudes) is enough to achieve a high misclassification rate. More details can be found in Appendix A.

1.3 Our Contributions

Motivated by i) the practical effectiveness of gradient sign information; and ii) the fact that the gradient sign can be recovered with a lower query complexity than that needed to retrieve both its sign and magnitude (as we will show herein), we view the black-box adversarial attack problem as estimating the gradient’s sign bits. This shift from continuous to binary black-box optimization leads to the following contributions at the intersection of adversarial machine learning and black-box (zeroth-order) optimization:

  • We present three properties of the directional derivative of the loss function of the model under attack in the direction of binary sign vectors, and propose methods to estimate the gradient sign bits exploiting these properties. Namely,

    1. Property 3.1 shows that the directional derivative in the direction of a sign vector q is an affine transformation of the Hamming distance between q and the gradient sign vector. This suggests that if we can recover the Hamming distance from the directional derivative, then the gradient sign bits can be recovered with on the order of n / log_2(n+1) queries using any off-the-shelf efficient Hamming search strategy.

    2. Property 3.2 shows that the directional derivative is locally smooth around the gradient sign. This lets us employ the optimism in the face of uncertainty principle in estimating the gradient sign. Through the use of hierarchical bandits, we show that knowledge of this smoothness is not required and provide a finite-time upper bound on the quality of the estimation, at the expense of searching over the 2^n possible sign vectors.

    3. Property 3.3 shows that the directional derivative is separable with respect to the coordinates of a sign vector q. Based on this property, we devise a divide-and-conquer algorithm, which we refer to as SignHunter, that reduces the search complexity from 2^n to O(n). When given a budget of O(n) queries, SignHunter is guaranteed to perform at least as well as FGSM (Goodfellow et al., 2015), which has access to the model’s gradient.

  • Through rigorous empirical evaluation, Property 3.3 (and hence SignHunter) is found to be the most effective in crafting black-box adversarial examples. In particular,

    1. To exploit Property 3.1, we propose an estimation of the Hamming distance (to the gradient sign) oracle from finite differences of the model’s loss value queries, and provide an empirical motivation and evaluation of the same. We find that efficient Hamming search strategies from the literature (e.g., Maurer (2009)) are not robust to the approximation error of the proposed Hamming distance estimation, and hence no guarantees can be made about the estimated sign vector.

    2. Despite being theoretically founded, the approach exploiting Property 3.2 is slow and does not scale to most practically occurring input dimensions n.

    3. Through experiments on MNIST, CIFAR10, and IMAGENET for both ℓ∞ and ℓ2 threat constraints, SignHunter yields black-box attacks that are more query efficient and less failure-prone than the state-of-the-art attacks combined. On two public black-box attack challenges, our approach achieves the highest evasion rate, surpassing techniques based on transferability, ensembling, and generative adversarial networks.

  • Finally, we release a software framework (building on other open-source frameworks such as the MNIST and CIFAR challenges (Madry et al., 2017)) to systematically benchmark adversarial black-box attacks on DNNs for the MNIST, CIFAR10, and IMAGENET datasets in terms of their success rate, query count, and other related metrics. This was motivated by the difficulty we faced in comparing approaches from the literature, where different researchers evaluated their approaches on different datasets, metrics, and setups—e.g., some compared on only one dataset while others considered two or all three.

The rest of the paper is structured as follows. First, a formal background is presented in Section 2. Section 3 describes our approach for black-box adversarial attacks by examining three properties of the loss’s directional derivative of the model under attack. Experiments are discussed in Section 4. Using two public black-box attack challenges, we evaluate the approach against one of the defenses developed to mitigate adversarial examples in Section 5. Finally, open questions and conclusions are outlined in Sections 6 and 7.

2 Formal Background

2.1 Notation.

Let n denote the dimension of a neural network’s input. Denote a hidden n-dimensional binary code by x* ∈ {−1, +1}^n. The response of the Hamming (distance) oracle H to the ith query q_i ∈ {−1, +1}^n is denoted by h_i and equals the Hamming distance

(1)   h_i ≜ d_H(q_i, x*) = || (q_i − x*) / 2 ||_H ,

where the Hamming norm ||v||_H is defined as the number of non-zero entries of the vector v. We also refer to H as the noiseless Hamming oracle, in contrast to the noisy Hamming oracle Ĥ, which returns noisy versions of H’s responses, as we will see shortly. I_n denotes the n × n identity matrix. The query ratio ρ is defined as ρ ≜ m/n, where m is the number of queries to H required to retrieve x*. Furthermore, denote the directional derivative of some function f at a point x in the direction of a vector v by D_v f(x) ≜ v^T ∇f(x), which can often be approximated by the finite difference method. That is, for δ > 0, we have

(2)   D_v f(x) ≈ ( f(x + δ v) − f(x) ) / δ .

Let Π_S be the projection operator onto the set S, and B_p(x, ε) be the ℓ_p ball of radius ε around x. Next, we provide lower and upper bounds on the query ratio ρ.
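As a quick illustration of (2), the following minimal NumPy sketch (hypothetical names, not the paper's code) approximates the directional derivative with a forward finite difference:

    import numpy as np

    def directional_derivative(loss, x, v, delta=1e-3):
        # Forward finite-difference approximation of D_v f(x) = v^T grad f(x), as in (2).
        return (loss(x + delta * v) - loss(x)) / delta

    # Toy usage on a quadratic loss whose gradient at x equals x itself.
    f = lambda z: 0.5 * np.dot(z, z)
    x = np.array([1.0, -2.0, 3.0])
    v = np.sign(x)                             # probe in the direction of the gradient sign
    print(directional_derivative(f, x, v))     # close to |1| + |-2| + |3| = 6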

2.2 Bounds on the Query Ratio

Lower Bound on ρ.

Using a packing argument, Vaishampayan (2012) proved the following lower bound on the query ratio ρ. (Vaishampayan, 2012, Theorem 1) For the noiseless Hamming oracle H, the query ratio must satisfy

ρ ≥ 1 / log_2(n + 1)

for any sequence of queries that determines every n-dimensional binary code uniquely. See (Vaishampayan, 2012, Page 4).

Exact Solution with ρ = 1.

In the following theorem, we show that no more than n queries are required to retrieve the hidden n-dimensional binary code x*.

A hidden n-dimensional binary code x* ∈ {−1, +1}^n can be retrieved exactly with no more than n queries to the noiseless Hamming oracle H.

The key element of this proof is that the Hamming distance between two n-dimensional binary codes can be written as

(3)   d_H(q, x*) = ( n − q^T x* ) / 2 .

Let Q be an m × n matrix whose ith row is the ith query code q_i. Likewise, let h_i be the corresponding ith query response, and h = [h_1, …, h_m]^T the concatenating vector. In matrix form, we have

Q x* = n · 1_m − 2 h ,

where Q is invertible if we construct m = n linearly independent queries q_1, …, q_n.
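This construction is easy to check numerically. The toy sketch below (hypothetical names, not the paper's code) builds n linearly independent ±1 queries, turns each Hamming response into an inner product via (3), and solves the linear system for the hidden code:

    import numpy as np

    def recover_code(hamming_oracle, n):
        # Recover a hidden {-1,+1}^n code from n noiseless Hamming-distance queries.
        # Linearly independent queries: row i is all +1 except a -1 at position i.
        Q = np.ones((n, n)) - 2.0 * np.eye(n)
        h = np.array([hamming_oracle(q) for q in Q])
        # Eq. (3): d_H(q, x*) = (n - q^T x*) / 2  =>  Q x* = n - 2 h
        x = np.linalg.solve(Q, n - 2.0 * h)
        return np.sign(x)

    rng = np.random.default_rng(0)
    n = 16
    x_star = rng.choice([-1.0, 1.0], size=n)
    oracle = lambda q: np.sum(q != x_star)     # simulated noiseless Hamming oracle
    assert np.array_equal(recover_code(oracle, n), x_star)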

Figure 2: Expected query ratios with the noiseless Hamming oracle H.

In Figure 2, we plot the bounds above along with two search strategies that use the Hamming oracle: i) the strategy of Maurer (2009); and ii) search by elimination, which, after the response h_i to query q_i, iteratively eliminates all binary codes whose Hamming distance to q_i differs from h_i. Note that search by elimination is a naive technique that does not scale with n.

2.3 Gradient Estimation Problem: a Hamming Distance View

At the heart of black-box adversarial attacks is generating a perturbation vector to slightly modify the original input x so as to fool the network's prediction of its true label y. Put differently, an adversarial example maximizes the network's loss L(x, y) but still remains ε_p-close to the original input x. Although the loss function can be non-concave, gradient-based techniques are often very successful in crafting an adversarial example (Madry et al., 2017); that is, the perturbation is taken as a step in the direction of ∇_x L(x, y). Subsequently, the bulk of black-box attack methods sought to estimate the gradient by querying an oracle that returns, for a given input/label pair (x, y), the value of the network's loss L(x, y). Using only such value queries, the basic approach relies on the finite difference method to approximate the directional derivative ((2)) of the loss at the input/label pair (x, y) in the direction of a vector v, which corresponds to v^T ∇_x L(x, y). With n linearly independent vectors, one can construct a linear system of equations to recover the full gradient. Clearly, this approach's query complexity is O(n), which can be prohibitively expensive for large n (e.g., for the IMAGENET dataset). Moreover, these queries are not adaptive; one may instead make use of past queries' responses to construct new queries and recover the full gradient with fewer queries. Recent works tried to mitigate this issue by exploiting data- and/or time-dependent priors (Tu et al., 2018; Ilyas et al., 2018, 2019).
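For contrast with the sign-only view developed next, here is a minimal sketch (hypothetical names, not the paper's code) of this baseline coordinate-wise full-gradient estimator; it spends one loss query per feature on top of a base query, i.e., O(n) queries per gradient estimate:

    import numpy as np

    def estimate_full_gradient(loss, x, delta=1e-3):
        # Coordinate-wise finite differences: n + 1 loss queries for a flat n-dimensional input x.
        base = loss(x)
        grad = np.zeros_like(x)
        for i in range(x.size):
            e_i = np.zeros_like(x)
            e_i[i] = 1.0
            grad[i] = (loss(x + delta * e_i) - base) / delta   # one query per coordinate
        return grad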

The lower bound of Theorem 2.2 on the query complexity of a Hamming oracle for finding a hidden vector suggests the following: instead of estimating the full gradient (sign and magnitude), and apart from exploiting any data- or time-dependent priors, why not focus on estimating its sign? After all, simply leveraging (noisy) sign information of the gradient yields successful attacks; see Figure 1. Therefore, our interest in this paper is the gradient sign estimation problem, which we formally define next, breaking away from the general trend of the continuous optimization view in constructing black-box adversarial attacks, manifested by the focus on the full gradient estimation problem.

(Gradient Sign Estimation Problem) For an input/label pair (x, y) and a loss function L, let g ≜ ∇_x L(x, y) be the gradient of L at (x, y) and q* ≜ sign(g) ∈ {−1, +1}^n be the sign bit vector of g. (Without loss of generality, we encode the sign bit vector in {−1, +1}^n; this is a common representation in the sign-related literature. Note that the standard sign function has the range {−1, 0, +1}; here, we use the non-standard definition (Zhao, 2018) whose range is {−1, +1}. This follows from the observation that DNNs’ gradients are not sparse (Ilyas et al., 2019, Appendix B.1).) Then the goal of the gradient sign estimation problem is to find a binary vector q ∈ {−1, +1}^n (throughout the paper, we use the terms binary vectors and sign vectors/bits interchangeably) minimizing the Hamming norm

(4)   min_{q ∈ {−1,+1}^n}  || q − q* ||_H ,

or equivalently maximizing the directional derivative

(5)   max_{q ∈ {−1,+1}^n}  q^T ∇_x L(x, y) ,

from a limited number of (possibly adaptive) loss value queries L(x', y).

In the next section, we set out to tackle the problem above by leveraging three properties of the loss’s directional derivative, which, in the black-box setup, is approximated by finite differences of loss value queries L(x', y).

Recall that our definition of the Hamming distance here is over binary sign vectors ((1)), as part of the formal statement of the gradient sign estimation problem. In contrast, Shamir et al. (2019) use the Hamming distance to define the threat perturbation constraint: under a Hamming (ℓ_0) constraint of k, only k data features (pixels) are allowed to be changed, and each one of them can change by a large amount.

3 A Framework for Estimating Sign of the Gradient from Loss Oracles

Our interest in this section is to estimate the gradient sign bits of the loss function L of the model under attack at an input/label pair (x, y) from a limited number of loss value queries L(x', y). To this end, we examine the basic concept of directional derivatives that has been employed in recent black-box adversarial attacks. Particularly, we present three approaches to estimate the gradient sign bits based on three properties of the directional derivative of the loss in the direction of a sign vector q ∈ {−1, +1}^n.

3.1 Approach 1: Loss Oracle as a Noisy Hamming Oracle

The directional derivative D_q L(x, y) = q^T ∇_x L(x, y) of the loss function L at (x, y) in the direction of a binary code q can be written as

(6)   D_q L(x, y) = Σ_{i ∈ A} |g_i| − Σ_{i ∈ D} |g_i| ,

where g ≜ ∇_x L(x, y), A ≜ {i : q_i = sign(g_i)}, and D ≜ {i : q_i ≠ sign(g_i)}. Note that |A| + |D| = n. Denote by ḡ_A and ḡ_D the means of {|g_i|}_{i ∈ A} and {|g_i|}_{i ∈ D}, respectively. Observe that |D| is exactly the Hamming distance between q and the gradient sign q* = sign(g). In other words, the directional derivative has the following property.

Property 3.1. The directional derivative of the loss function L at an input/label pair (x, y) in the direction of a binary code q can be written as an affine transformation of the Hamming distance between q and q*. Formally, we have

(7)   D_q L(x, y) = ḡ_A ( n − d_H(q, q*) ) − ḡ_D d_H(q, q*) .

If we can recover the Hamming distance from the directional derivative based on (7), efficient Hamming search strategies—e.g., (Maurer, 2009)—can then be used to recover the gradient sign bits with the query complexity stated in Theorem 2.2. However, not all terms of (7) are known to us. While n is the number of data features (known a priori) and D_q L(x, y) is available through a finite difference oracle, ḡ_A and ḡ_D are not known. Here, we propose to approximate these values by their Monte Carlo estimates: averages of the magnitudes of sampled gradient components. Our assumption is that the magnitudes of the gradient coordinates are not very different from each other, and hence a Monte Carlo estimate is good enough (with small variance). Our experiments on MNIST, CIFAR10, and IMAGENET confirm this—see Figure 15 in the supplement.

To use the ith gradient component as a sample for our estimation, one can construct two binary codes q and q' such that only their ith bit is different, i.e., q_i = +1 and q'_i = −1 with q_j = q'_j for all j ≠ i. Thus, we have

(8)   |g_i| = | D_q L(x, y) − D_{q'} L(x, y) | / 2 ,
(9)   sign(g_i) = sign( D_q L(x, y) − D_{q'} L(x, y) ) .

Let S be the set of indices of gradient components we have recovered so far—magnitude and sign—through (8) and (9). Then,

(10)   ĝ_A = (1 / |S_A|) Σ_{i ∈ S_A} |g_i| ,
(11)   ĝ_D = (1 / |S_D|) Σ_{i ∈ S_D} |g_i| ,

where S_A ≜ {i ∈ S : q_i = sign(g_i)} and S_D ≜ {i ∈ S : q_i ≠ sign(g_i)}. (It is possible that one of S_A and S_D will be empty, e.g., when we only have one sample; in this case, we set the missing estimate equal to the available one.) As a result, the Hamming distance between q and the gradient sign can be approximated with the following quantity, which we refer to as the noisy Hamming oracle Ĥ:

(12)   d_H(q, q*)  ≈  Ĥ(q) ≜ ( n ĝ_A − D_q L(x, y) ) / ( ĝ_A + ĝ_D ) .
Figure 3: The error distribution of the noisy Hamming oracle Ĥ (right side of (12)) compared to the noiseless counterpart (left side of (12)) as a function of the number of unique values (magnitudes) of the gradient coordinates. The synthetic loss has a linear form whose gradient coordinates are randomly assigned values from evenly spaced numbers in a fixed range, and the estimate is computed from a fixed-size set of sampled gradient coordinates. When the gradient coordinates share a single magnitude, the estimation is exact (zero error) for all binary code queries; as the number of unique magnitudes grows, the error grows and can become a significant fraction of the Hamming distance's range [0, n].

We empirically evaluated the quality of Ĥ’s responses on a toy problem where we controlled the magnitude spread/concentration of the gradient coordinates through the number of unique values (magnitudes) they take. As detailed in Figure 3, the error can be large—a big mismatch, especially if we recall that the Hamming distance’s range is [0, n]. The negative impact of this on the Hamming search strategy by Maurer (2009) was verified empirically in Figure 4. We considered the simplest case where the strategy was given access to the noisy Hamming oracle in a setup similar to the one outlined in Figure 3, with n = 80. To account for the randomness in constructing Ĥ, we ran independent runs and plot the average Hamming distance (with confidence bounds) over queries. In Figure 4 (a), which corresponds to exact estimation, the strategy terminates with the true binary code, achieving a query ratio of 21/80. On the other hand, with a noisy oracle built from gradient coordinates taking one of two distinct magnitudes (Figure 4 (b)), the strategy returns a solution that is four bits away (in Hamming distance) from the true binary code. This is not bad for an 80-bit code; however, this is a tightly controlled setup where the gradient magnitudes take just one of two values. To be studied further is the bias/variance decomposition of the returned solution and the corresponding query ratio. We leave this investigation for future work.

Figure 4: Performance of Maurer (2009)'s Hamming search strategy on the noiseless and noisy Hamming oracles. The setup is similar to that of Figure 3, with n = 80.
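To make this construction concrete, the sketch below (hypothetical names; the true gradient appears only to simulate the directional-derivative oracle) recovers a few coordinates via (8)–(9), forms the Monte Carlo estimates (10)–(11), and returns the noisy Hamming estimate of (12):

    import numpy as np

    def noisy_hamming_oracle(dir_deriv, q, n, sample_idx):
        # Approximate d_H(q, sign(g)) from directional-derivative queries, Eqs. (8)-(12).
        # dir_deriv(v) returns (an approximation of) v^T g for a sign vector v;
        # sample_idx are the coordinates used for the Monte Carlo magnitude estimates.
        mags, signs = {}, {}
        for i in sample_idx:
            q_plus, q_minus = q.copy(), q.copy()
            q_plus[i], q_minus[i] = 1.0, -1.0            # codes differing only in bit i
            diff = dir_deriv(q_plus) - dir_deriv(q_minus)
            mags[i] = abs(diff) / 2.0                    # Eq. (8): |g_i|
            signs[i] = np.sign(diff) if diff != 0 else 1.0   # Eq. (9): sign(g_i)
        agree = [mags[i] for i in sample_idx if signs[i] == q[i]]
        disagree = [mags[i] for i in sample_idx if signs[i] != q[i]]
        g_a = np.mean(agree) if agree else np.mean(disagree)    # Eq. (10), with the
        g_d = np.mean(disagree) if disagree else g_a            # fallback of the footnote
        # Eq. (12): invert the affine relation of Property 3.1.
        return (n * g_a - dir_deriv(q)) / (g_a + g_d)

    # Simulated example: g is known only to emulate the loss-difference oracle.
    rng = np.random.default_rng(1)
    n = 32
    g = rng.normal(size=n)
    dir_deriv = lambda v: float(v @ g)
    q = rng.choice([-1.0, 1.0], size=n)
    true_d = int(np.sum(np.sign(g) != q))
    est_d = noisy_hamming_oracle(dir_deriv, q, n, sample_idx=range(8))
    print(true_d, round(est_d, 2))   # the estimate is noisy when magnitudes are spread out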

3.2 Approach 2: Optimism in the Face of Uncertainty

In the previous approach, we considered the approximated Hamming distance ((12)) as a surrogate for the formal optimization objective ((4)) of the gradient sign estimation problem, and we found that current Hamming search strategies are not robust to the approximation error. In this approach, we consider maximizing the directional derivative ((5)) as our formal objective of the gradient sign estimation problem. Formally, we treat the problem as a binary black-box optimization over the 2^n hypercube vertices, which correspond to all possible sign vectors. This search space is significantly larger than the n-dimensional space of the continuous optimization view. Nevertheless, the rationale here is that we do not need to solve (5) to optimality (recall Figure 1); we rather need fast convergence to a suboptimal but adversarially helpful sign vector. In addition, the continuous optimization view often employs an iterative scheme of steps within the perturbation ball B_p(x, ε_p), calling the gradient estimation routine in every step, which multiplies the overall query cost. In our setup, we use the best obtained solution for (5) so far, in a fashion similar to the noisy FGSM of Figure 1. In other words, our gradient sign estimation routine runs at the top level of our adversarial example generation procedure instead of being called as a subroutine. In this and the next approach, we address the following question: how do we solve (5)?

Optimistic methods, i.e., methods that implement the optimism in the face of uncertainty principle, have demonstrated theoretical as well as empirical success when applied to black-box optimization problems (Munos, 2011; Al-Dujaili and Suresh, 2017, 2018). Such a principle finds its foundations in the machine learning field addressing the exploration vs. exploitation dilemma, known as the multi-armed bandit problem. Within the context of function optimization, optimistic approaches formulate the complex problem of optimizing an arbitrary black-box function (e.g., (5)) over the search space ({−1, +1}^n in this paper) as a hierarchy of simple bandit problems (Kocsis and Szepesvári, 2006) in the form of a space-partitioning tree search. At step t, the algorithm optimistically expands a leaf node (partitions the corresponding subspace) from the set of leaf nodes that may contain the global optimum. The hth node at depth d, denoted by (d, h), corresponds to a cell X_{d,h} of the search space, and the cells at any depth partition the space. To each node (d, h), a representative point x_{d,h} ∈ X_{d,h} is assigned, and the value of the node is set to the objective's value at that point. See Figure 6 for an example of a space-partitioning tree of the binary hypercube, which will be used in our second approach to estimate the gradient sign vector.

Under some assumptions on the optimization objective and the hierarchical partitioning of the search space, optimistic methods enjoy a finite-time bound on their regret, defined as

(13)   R_t = max_{q ∈ {−1,+1}^n} D_q L(x, y) − D_{q(t)} L(x, y) ,

where q(t) is the best solution found by the optimistic method after t steps. The challenge is how to align the search space such that these assumptions hold. In the following, we show that these assumptions can be satisfied for our optimization objective ((5)). In particular, when the objective is the directional derivative q ↦ D_q L(x, y) and the hypercube's vertices are aligned on a 1-dimensional line according to the Gray code ordering, we can construct an optimistic algorithm with a finite-time bound on its regret. To demonstrate this, we adopt the Simultaneous Optimistic Optimization framework by Munos (2011) and the assumptions therein.

Figure 5: The directional derivative of a function at a point in the direction of a binary vector is locally smooth around the gradient sign vector when the binary codes are ordered over one coordinate as a sequence of Gray codes. Plots (a)–(d) show this local smoothness property—with two semi-metrics—for the directional derivative of simple synthetic functions. The semi-metrics have the form ℓ(a, b) = λ |rank(a) − rank(b)|^α, where rank(·) refers to the rank of an n-bit binary code in the Gray ordering of n-bit binary codes, λ > 0, and α > 0. With this property at hand, we employ the optimism in the face of uncertainty principle in GOO to maximize the directional derivative over {−1, +1}^n.

For completeness, we reproduce Munos (2011)’s basic definitions and assumptions with respect to our notation, and at the same time show how the gradient sign estimation problem ((5)) satisfies them based on the second property of the directional derivative, as follows.

Definition (Semi-metric). We assume that ℓ : X × X → R_+ is such that for all q, q' ∈ X, we have ℓ(q, q') = ℓ(q', q) ≥ 0 and ℓ(q, q') = 0 if and only if q = q'.

Definition (Near-optimality dimension). The near-optimality dimension is the smallest d ≥ 0 such that there exists C > 0 such that for any ε > 0, the maximal number of disjoint ℓ-balls of radius νε with centers in the set of ε-optimal points is less than C ε^{−d}.

Property 3.2 (Local smoothness of D_q L). For any input/label pair (x, y), there exists at least one global optimizer q* ∈ {−1, +1}^n of D_q L(x, y), and for all q ∈ {−1, +1}^n,

D_q L(x, y) ≥ D_{q*} L(x, y) − ℓ(q*, q) .

Refer to Figure 5 for a pictorial illustration of Property 3.2.

Assumption 2 (Bounded diameters). There exists a decreasing sequence δ(d) > 0 such that for any depth d ≥ 0 and for any cell X_{d,h} of depth d, we have sup_{q ∈ X_{d,h}} ℓ(x_{d,h}, q) ≤ δ(d). To see how Assumption 2 is met, refer to Figure 6.

Assumption 3 (Well-shaped cells). There exists ν > 0 such that for any depth d ≥ 0, any cell X_{d,h} contains an ℓ-ball of radius νδ(d) centered in X_{d,h}. To see how Assumption 3 is met, refer to Figure 6.

With the above assumptions satisfied, we propose Gray-code Optimistic Optimization (GOO), an instantiation of (Munos, 2011, Algorithm 2) tailored to our optimization problem ((5)) over a 1-dimensional alignment of {−1, +1}^n using the Gray code ordering. The pseudocode is outlined in Algorithm 1. The following theorem bounds GOO’s regret.

(Regret Convergence of GOO) Let h_max(t) be the smallest integer satisfying the condition in (14), which relates t to the number of node expansions per depth.

(14)

Then the regret of GOO (Algorithm 1) after t steps is bounded in terms of the cell diameter δ(h_max(t)).

We have shown that our objective function ((5)) and the hierarchical partitioning of {−1, +1}^n following the Gray code ordering conform to Property 3.2 and Assumptions 2 and 3. The additive term of one in (14) accounts for evaluating the anomaly leaf node's code before growing the space-partitioning tree—see Figure 6. The rest follows from the proof of (Munos, 2011, Theorem 2).

Input :  D_·L(x, y): the black-box function to be maximized over the binary hypercube {−1, +1}^n
Initialization :  Set t = 1 and the tree T_1 to the root node (0, 0). Align {−1, +1}^n over the integers [0, 2^n − 1] using the Gray code ordering.
1 while true do
2       v_max ← −∞
3       for d ← 0 to depth(T_t) do
4             Among all leaves (d, h) of depth d, select the one with the largest value, (d, h*)
5             if the value of (d, h*) ≥ v_max then
6                   Expand this node: add to T_t its two children
7                   v_max ← the value of (d, h*)
8                   t ← t + 1
9                   if query budget is exhausted then
10                        return the best found solution (the representative code of the leaf with the largest value)
Algorithm 1 Gray-code Optimistic Optimization (GOO)
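The following simplified Python sketch illustrates the idea (hypothetical names; the per-sweep expansion rule is a simplification of Algorithm 1 and of (Munos, 2011), not the paper's implementation). Cells are contiguous index ranges on the 1-D Gray-code alignment, and each cell's representative is its midpoint code:

    import numpy as np
    from collections import defaultdict

    def gray_code(rank, n):
        # Map a rank in [0, 2^n) to the corresponding n-bit Gray code as a {-1,+1} vector.
        g = rank ^ (rank >> 1)                                  # binary-reflected Gray code
        bits = [(g >> (n - 1 - i)) & 1 for i in range(n)]
        return np.array([1.0 if b else -1.0 for b in bits])

    def goo(objective, n, budget):
        # Optimistic (SOO-style) maximization of `objective` over {-1,+1}^n.
        def make(lo, hi, depth):
            rep = (lo + hi) // 2
            return {"lo": lo, "hi": hi, "depth": depth, "rep": rep,
                    "val": objective(gray_code(rep, n))}

        leaves = [make(0, 2 ** n, 0)]
        queries, best = 1, leaves[0]
        while queries < budget:
            v_max, expanded = -np.inf, False
            by_depth = defaultdict(list)
            for leaf in leaves:
                by_depth[leaf["depth"]].append(leaf)
            for d in sorted(by_depth):
                cand = max(by_depth[d], key=lambda c: c["val"])
                if cand["hi"] - cand["lo"] <= 1 or cand["val"] < v_max:
                    continue
                v_max, expanded = cand["val"], True
                mid = (cand["lo"] + cand["hi"]) // 2
                leaves.remove(cand)
                kids = [make(cand["lo"], mid, d + 1), make(mid, cand["hi"], d + 1)]
                leaves += kids
                queries += 2
                best = max([best] + kids, key=lambda c: c["val"])
                if queries >= budget:
                    break
            if not expanded:
                break
        return gray_code(best["rep"], n)

    # Toy usage: search for the sign vector maximizing the inner product with a fixed g.
    g = np.random.default_rng(0).normal(size=6)
    print(goo(lambda q: float(q @ g), n=6, budget=40))
    print(np.sign(g))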

Despite being theoretically founded, GOO is slow in practice. This is expected since it is a global search technique that considers all 2^n vertices of the n-dimensional hypercube {−1, +1}^n. Recall that we are looking for an adversarially helpful solution that is not necessarily optimal. To this end, we consider the separability property of the directional derivative, a more useful property than its local smoothness, as described in our third approach next.

Figure 6: Illustration of the proposed Gray-ordering based (fully expanded) partitioning tree of the search space {−1, +1}^n used in Gray-code Optimistic Optimization (GOO). The plots are two different views of the same tree: Plot (a) displays the node names (d, h), while Plot (b) shows their representative binary codes (for brevity, −1s are written as 0s). The red oval and rectangle highlight the tree’s root and its corresponding binary code, respectively. To maintain a valid binary partition tree, one can ignore the anomaly leaf node; its code can be evaluated prior to building the tree. For nodes at a given depth d, observe that 1) their cells partition the search space; 2) each representative code is centered among the other members of its cell in the Gray code ordering; and 3) each cell constitutes a contiguous block of codes along the 1-dimensional alignment shown below the tree. Thus, it suffices to define a semi-metric based on the corresponding indices of the codes along this alignment. For a given depth d, the index difference between any code in a cell and the cell’s representative is bounded by a quantity decreasing in d, which establishes Assumption 2; Assumption 3 follows naturally from the fact that nodes at a given depth partition the search space equally.

3.3 Approach 3: Divide & Conquer

Based on the definition of the directional derivative ((2)), we state the following property.

Property 3.3 (Separability of D_q L). The directional derivative of the loss function L at an input/label pair (x, y) in the direction of a binary code q is separable across the coordinates of q. That is,

(15)   D_q L(x, y) = Σ_{i=1}^{n} q_i ∂L(x, y) / ∂x_i .

Instead of considering the full 2^n search space (Section 3.2), we employ the above property in a divide-and-conquer search which we refer to as SignHunter. As outlined in Algorithm 2, the technique starts with a random guess of the sign vector. It then flips the sign of all the coordinates to get a new sign vector, and reverts the flips if the loss oracle returns a value (or equivalently a directional derivative estimate) less than the best obtained so far. SignHunter applies the same rule to the first half of the coordinates, the second half, the first quadrant, the second quadrant, and so on. For a search space of dimension n, SignHunter needs O(n) sign flips to complete its search. If the query budget is not exhausted by then, one can update the data point with the recovered signs and restart the procedure at the updated point with a new starting code. In the next theorem, we show that SignHunter is guaranteed to perform at least as well as the Fast Gradient Sign Method after O(n) oracle queries.

(Optimality of SignHunter) Given O(n) queries, SignHunter is at least as effective as FGSM (Goodfellow et al., 2015) in crafting adversarial examples.

The ith coordinate of the gradient sign vector can be recovered as outlined in (9), which takes two queries. From the definition of SignHunter, this is carried out for all n coordinates by the end of its single-coordinate flips, i.e., after O(n) queries. That is, the gradient sign vector is fully recovered after O(n) queries, and therefore one can employ the FGSM attack to craft adversarial examples. Note that this is under the assumption that our finite difference approximation of the directional derivative ((2)) is good enough (or at least rank-preserving).

Theorem 3.3 provides an upper bound on the number of queries required for SignHunter to recover the gradient sign bits and perform as well as FGSM. In practice (as will be shown in our experiments), SignHunter crafts adversarial examples with a small fraction of this upper bound. Note that one could also recover the gradient sign vector with n + 1 queries by starting with an arbitrary sign vector and flipping its bits sequentially. Nevertheless, SignHunter incorporates its queries in a framework of majority voting to recover as many sign bits as possible with as few queries as possible. Consider the case where all the gradient coordinates have the same magnitude—the single-magnitude case in Figure 3—and suppose we start with a random sign vector that agrees with the optimal sign vector in exactly the first half of its coordinates (a Hamming distance of n/2). In this case, SignHunter needs just four queries to recover the entire sign vector, whereas sequential bit flipping would require n + 1 queries.

Input :  g(·): the black-box function (an approximation of D_·L(x, y)) to be maximized over the binary hypercube {−1, +1}^n
1 def init():
2       s ← a random vector in {−1, +1}^n
3       level ← 0
4       chunk_id ← 0
5       best ← g(s)
6       done ← false
7 def is_done():
8        return done
9 def step():
10       chunk_len ← ⌈ n / 2^level ⌉
11       flip the bits of s indexed from chunk_id · chunk_len till (chunk_id + 1) · chunk_len
12       if g(s) ≥ best:
13             best ← g(s)
14       else:
15             flip back the bits of s indexed from chunk_id · chunk_len till (chunk_id + 1) · chunk_len
16       increment chunk_id
17       if chunk_id · chunk_len ≥ n:
18             chunk_id ← 0
19             increment level
20             if chunk_len = 1:
21                   done ← true
22 def get_current_sign_estimate():
       return s
Algorithm 2 SignHunter
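A minimal Python sketch of Algorithm 2 is given below (hypothetical names; the objective is any callable approximating the directional derivative, such as a finite difference of the loss):

    import numpy as np

    class SignHunter:
        # Divide-and-conquer estimation of the gradient sign vector (sketch of Algorithm 2).
        def __init__(self, objective, n):
            self.g = objective
            self.n = n
            self.s = np.random.choice([-1.0, 1.0], size=n)
            self.best = self.g(self.s)
            self.level = 0          # current depth of the divide-and-conquer
            self.chunk_id = 0       # which chunk to flip next at this depth
            self.done = False

        def step(self):
            chunk_len = int(np.ceil(self.n / 2 ** self.level))
            lo = self.chunk_id * chunk_len
            hi = min(lo + chunk_len, self.n)
            self.s[lo:hi] *= -1                      # flip the chunk
            val = self.g(self.s)
            if val >= self.best:
                self.best = val                      # keep the flip
            else:
                self.s[lo:hi] *= -1                  # revert the flip
            self.chunk_id += 1
            if self.chunk_id * chunk_len >= self.n:  # finished this depth
                self.chunk_id = 0
                if chunk_len == 1:                   # single-coordinate flips done
                    self.done = True
                else:
                    self.level += 1

        def current_estimate(self):
            return self.s.copy()

    # Toy usage: recover the sign of a fixed gradient-like vector.
    g_vec = np.array([0.5, -0.5, 0.5, -0.5])
    h = SignHunter(lambda s: float(s @ g_vec), n=4)
    while not h.done:
        h.step()
    print(h.current_estimate())   # [ 1. -1.  1. -1.]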
Figure 7: Noiseless vs. noisy Hamming oracle. The trace of the Hamming distance (first column; the lower the trace the better) and directional derivative (second column; the higher the trace the better) values of the queries made by the Hamming search strategies, when given access to a noiseless/ideal (first row) versus a noisy (second row) Hamming oracle—the latter obtained through the directional derivative approximation discussed in Section 3.1—on a synthetic function. We expect the traces to go up and down as they explore the search space; the end of an algorithm’s trace represents the Hamming distance (first column) or directional derivative (second column) at the algorithm’s solution. For comparison, we also plot GOO’s and SignHunter’s traces; their performance is the same in both the noiseless and noisy cases, as both algorithms operate directly on the directional derivative approximation rather than on the Hamming oracle. In the noiseless case, the Hamming-based strategies find the optimal vector: their traces end at a zero Hamming distance in Plot (a), which corresponds to the maximum directional derivative in Plot (b). With a noisy Hamming oracle, these strategies break, as shown in Plots (c) and (d): they take more queries and return sub-optimal solutions—e.g., Maurer (2009)’s strategy returns, on average, a solution three bits away in Hamming distance. On the other hand, GOO and SignHunter achieve a zero Hamming distance in both cases at the expense of being less query efficient. While theoretically founded, GOO is slow as it employs a global search over the space. Despite its local search, SignHunter converges to the optimal solution in accordance with Theorem 3.3. The solid curves indicate the averaged trace surrounded by confidence bounds over independent runs. For convenience, we plot symmetric bounds, whereas in Plots (a) and (c) they should in fact be asymmetric, as the Hamming distance is non-negative.

Moreover, SignHunter is amenable to parallel hardware architectures and thus can carry out attacks in batches more efficiently than the previously presented approaches. We tested both GOO and SignHunter (along with the Hamming search strategies) on a set of toy problems and found that SignHunter performs significantly better than GOO, while the Hamming search strategies were sensitive to the approximation error—see Figure 7. For these reasons, in our experiments on the real datasets MNIST, CIFAR10, and IMAGENET, we opted for SignHunter as our algorithm of choice to estimate the gradient sign in crafting black-box adversarial attacks, as outlined in Algorithm 3.

Input :
x_init : input to be perturbed,
y_init : x_init's true label,
B_p(·, ε_p) : perturbation ball of radius ε_p,
L : loss function of the neural net under attack
1 x_o ← x_init
2 x_adv ← x_init (adversarial input to be constructed)
3 Define the function g as the finite difference g(q) ≜ ( L(Π_{B_p(x_init, ε_p)}(x_o + δ q), y_init) − L(x_o, y_init) ) / δ
4 SignHunter.init(g)
5 C(·) returns the model's top class
6 while C(x_adv) = y_init do
7       SignHunter.step()
8       q ← SignHunter.get_current_sign_estimate()
9       x_adv ← Π_{B_p(x_init, ε_p)}( x_o + δ q )
10      if SignHunter.is_done() then
11            x_o ← x_adv
12            Redefine the function g around the new x_o as in Line 3
13            SignHunter.init(g)
14 return x_adv
Algorithm 3 Black-Box Adversarial Example Generation with SignHunter
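Putting the pieces together, here is a hedged Python sketch of this attack loop for an ℓ∞ constraint, reusing the SignHunter sketch above; model_loss and predict stand for the assumed black-box oracles, the probe is set to the perturbation bound ε as in our parameter-free setup, and prediction queries as well as the base-loss query are not counted (a simplification; clipping to the valid pixel range is also omitted):

    import numpy as np

    def signhunter_attack(model_loss, predict, x_init, y_init, eps, query_budget=10_000):
        n = x_init.size
        x_o = x_init.copy()
        queries = 0

        def make_objective(x_anchor):
            base = model_loss(x_anchor, y_init)
            def g(q):
                nonlocal queries
                queries += 1
                x_probe = np.clip(x_anchor + eps * q.reshape(x_init.shape),
                                  x_init - eps, x_init + eps)   # project onto the l_inf ball
                return (model_loss(x_probe, y_init) - base) / eps
            return g

        hunter = SignHunter(make_objective(x_o), n)
        x_adv = x_o.copy()
        while predict(x_adv) == y_init and queries < query_budget:
            hunter.step()
            q = hunter.current_estimate()
            x_adv = np.clip(x_o + eps * q.reshape(x_init.shape),
                            x_init - eps, x_init + eps)
            if hunter.done:                       # restart around the best point found so far
                x_o = x_adv.copy()
                hunter = SignHunter(make_objective(x_o), n)
        return x_adv, queries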

4 Experiments

In this section, we evaluate SignHunter and compare it with established algorithms from the literature—ZO-SignSGD (Liu et al., 2019), NES (Ilyas et al., 2018), and BanditsTD (Ilyas et al., 2019)—in terms of their effectiveness in crafting untargeted black-box adversarial examples. Both ℓ∞ and ℓ2 threat models are considered on the MNIST, CIFAR10, and IMAGENET datasets.

4.1 Experiments Setup

Our experiment setup is similar to (Ilyas et al., 2019). Each attacker is given a budget of 10,000 oracle queries per attack attempt and is evaluated on images sampled from the test sets of MNIST, CIFAR10, and IMAGENET. We did not find a standard practice for setting the perturbation bound ε_p; arbitrary bounds were used in several papers. We set the perturbation bounds as follows.

  • For the ℓ∞ threat model, we use (Madry et al., 2017)’s bound for MNIST and (Ilyas et al., 2019)’s bounds for both CIFAR10 and IMAGENET.

  • For the ℓ2 threat model, (Ilyas et al., 2019)’s bound is used for IMAGENET. MNIST’s bound is set based on the sufficient distortions observed in (Liu et al., 2019), which are smaller than the one used in (Madry et al., 2017). We use the bound observed in (Cohen et al., 2019) for CIFAR10.

We show results based on standard models. For MNIST and CIFAR10, the naturally trained models from (Madry et al., 2017)’s challenges are used (https://github.com/MadryLab/mnist_challenge and https://github.com/MadryLab/cifar10_challenge). For IMAGENET, the Inception-V3 model from TensorFlow is used (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/slim/python/slim/nets/inception_v3_test.py). The loss oracle represents the cross-entropy loss of the respective model. The general setup of the experiments is summarized in Table 3 in Appendix B.

4.2 Hyperparameters Setup

To ensure a fair comparison among the considered algorithms, we did our best in tuning their hyperparameters. Initially, the hyperparameters were set to the values reported by the corresponding authors, for which we observed suboptimal performance. This can be attributed to using a different software framework (e.g., TensorFlow vs. PyTorch), different models, or a different transformation of the models’ inputs (e.g., some models take pixel values in [0, 1] while others are built for [0, 255]). We made use of a synthetic concave loss function to tune the algorithms’ parameters for each dataset/perturbation-constraint combination. The performance curves on the synthetic loss function using the tuned hyperparameter values were consistent with the reported results from the literature; for instance, we noted the fast convergence of ZO-SignSGD compared to NES, and BanditsTD outperformed the rest of the algorithms towards the end of the query budget. That said, we invite the community to provide their best tuned attacks. Note that SignHunter does not have any hyperparameters to tune. Its finite difference probe is set to the perturbation bound ε_p, because this perturbation is used both for computing the finite difference and for crafting the adversarial examples—see Line 3 of Algorithm 3. This parameter-free setup offers SignHunter a robust edge over the state-of-the-art black-box attacks, which often require expert knowledge to carefully tune their parameters as discussed above. More details on the hyperparameters setup can be found in Appendix B.

4.3 Results

Figures 8 and 9 show the trade-off between the success (evasion) rate and the mean number of queries (of the successful attacks) needed to generate an adversarial example for the MNIST, CIFAR10, and IMAGENET classifiers under the ℓ∞ and ℓ2 perturbation constraints, respectively. In other words, these figures indicate the average number of queries required for a desired success rate. A tabulated summary of these plots can be found in Appendix D, namely Tables 8, 9, and 10. Furthermore, we plot the classifier loss and the gradient estimation quality (in terms of Hamming distance and Cosine similarity), averaged over all the images, as a function of the number of queries used, as shown in Figures 16, 17, and 18 in Appendix D. Based on the results, we observe the following:

Figure 8: Performance of black-box attacks under the ℓ∞ perturbation constraint on (a) MNIST, (b) CIFAR10, and (c) IMAGENET. The plots show the average number of queries used per successful image for each attack when reaching a specified success rate.
Figure 9: Performance of black-box attacks under the ℓ2 perturbation constraint on (a) MNIST, (b) CIFAR10, and (c) IMAGENET. The plots show the average number of queries used per successful image for each attack when reaching a specified success rate.
  • For any given success rate, SignHunter dominates the previous state-of-the-art approaches in all settings except the IMAGENET ℓ2 setup (to be accurate, all the algorithms are comparable in that setup at low success rates), where BanditsTD shows a better query efficiency once the desired success rate is high enough.

  • SignHunter is remarkably efficient in the ℓ∞ setup (e.g., achieving a near-perfect evasion rate using, on average, only a small number of queries per image against the MNIST classifier). Its performance degrades—yet still outperforms the rest most of the time—in the ℓ2 setup. This is expected, since SignHunter perturbs all the coordinates with the same magnitude, and the ℓ2 perturbation bounds for all the datasets in our experiments (Table 3) translate into a smaller per-coordinate budget than the corresponding ℓ∞ bounds. The employed ℓ2 bounds give the state-of-the-art, continuous-optimization-based approaches more perturbation options: for instance, they can concentrate a large perturbation on a single MNIST pixel, spread a smaller one over two pixels, or an even smaller one over ten pixels. The binary optimization view of SignHunter, on the other hand, limits it to always perturb all pixels by the same magnitude. Despite its fewer degrees of freedom, SignHunter maintains its effectiveness in the ℓ2 setup. The ℓ2 plots can also be viewed as a sensitivity assessment of SignHunter as the per-coordinate perturbation gets smaller for each dataset.

  • The plots verify Theorem 3.3 when compared with the performance of the noisy FGSM (Figures 12 and 13 in Appendix A) in both the ℓ∞ and ℓ2 setups for MNIST and CIFAR10—for IMAGENET, the O(n) queries of Theorem 3.3 exceed our budget of 10,000 queries per attempt. For example, SignHunter achieves a failure rate no worse than that of the noisy FGSM on CIFAR10 in the ℓ2 setup (Figure 13 (b)) while staying within the corresponding query bound.

  • Incorporating SignHunter in an iterative framework of perturbing the data point till the query budget is exhausted (the restart step of Algorithm 3) supports the observation in white-box settings that iterative FGSM—or Projected Gradient Descent (PGD)—is stronger than FGSM (Madry et al., 2017; Al-Dujaili et al., 2018). This is evident from the upticks in SignHunter’s performance on MNIST (Figure 16: classifier’s loss, average Cosine similarity, and average Hamming similarity plots), which occur after every iteration, i.e., every time the sign search is restarted around the updated point.

  • Plots of the average Hamming similarity capture the quality of the gradient sign estimation in terms of (4), while plots of the average Cosine similarity capture it in terms of (5). Both SignHunter and BanditsTD consistently optimize both objectives. In general, SignHunter enjoys faster convergence, especially on the Hamming metric, because it estimates only the signs, in contrast to BanditsTD’s full gradient estimation. This is highlighted in the IMAGENET setup. Note that once an attack is successful, the gradient sign estimate at that point is used for the rest of the plot. This explains why, in the ℓ∞ settings, SignHunter’s plots do not improve as much as their ℓ2 counterparts: most of the ℓ∞ attacks succeed within the very first few queries made to the oracle.

Overall, SignHunter fails less often than the state-of-the-art approaches combined and, summed over all the images (successful and unsuccessful attacks), spends fewer queries. The number of queries spent per image is computed as

(1 - fail_rate) * avg_#_queries + fail_rate * 10,000

based on Tables 8, 9, and 10.
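As an illustration of this accounting with made-up numbers (not results from the paper):

    def total_queries_per_image(fail_rate, avg_queries_success, budget=10_000):
        # Successful attacks use their average query count; failed attacks burn the full budget.
        return (1 - fail_rate) * avg_queries_success + fail_rate * budget

    # Hypothetical attack: 2% failure rate, 500 queries per successful attack on average.
    print(total_queries_per_image(0.02, 500))   # 0.98*500 + 0.02*10000 = 690.0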

5 Public Black-Box Attack Challenges

To complement our results in Section 4, we evaluated SignHunter against adversarial training, an effective way to improve the robustness of DNNs (Madry et al., 2017; Al-Dujaili et al., 2018). In particular, we attacked the secret models used in two public challenges, as follows.

5.1 Public MNIST Black-Box Attack Challenge

In line with the challenge setup (https://github.com/MadryLab/mnist_challenge), we attacked the challenge's test images with an ℓ∞ perturbation bound of 0.3. Although the secret model is released, we treated it as a black box, similar to our experiments in Section 4. There was no specification of a maximum query budget, so we set it to 5,000 queries, which is similar to the number of iterations given to a PGD attack in the white-box setup of the challenge: 100 steps with 50 random restarts. As shown in Table 1, SignHunter's attacks resulted in the lowest model accuracy, outperforming all other state-of-the-art attack strategies submitted to the challenge. Note that the most powerful white-box attack by Zheng et al. (2018) resulted in a still lower model accuracy—not shown in the table.

Black-Box Attack Model Accuracy
SignHunter (Algorithm 3)
Xiao et al. (2018)
PGD against three independently and adversarially trained copies of the network
on the CW loss for model B from (Tramèr et al., 2017)
on the CW loss for the naturally trained public network
PGD on the cross-entropy loss for the naturally trained public network
Attack using Gaussian Filter for selected pixels on the adversarially trained public network
on the cross-entropy loss for the adversarially trained public network
PGD on the cross-entropy loss for the adversarially trained public network
Table 1: Black-box leaderboard for the public MNIST black-box attack challenge. Adapted from https://github.com/MadryLab/mnist_challenge. Retrieved on February 22, 2019.
Black-Box Attack Model Accuracy
SignHunter (Algorithm 3)
PGD on the cross-entropy loss for the adversarially trained public network
PGD on the CW loss for the adversarially trained public network
on the CW loss for the adversarially trained public network
on the CW loss for the naturally trained public network
Table 2: Black-box leaderboard for the public CIFAR10 black-box attack challenge. Adapted from https://github.com/MadryLab/cifar10_challenge. Retrieved on February 22, 2019.

5.2 Public CIFAR10 Black-Box Attack Challenge

In line with the challenge setup (https://github.com/MadryLab/cifar10_challenge), we attacked the challenge's test images with an ℓ∞ perturbation bound of 8 (on the 0–255 pixel scale). Although the secret model is released, we treated it as a black box, similar to our experiments in Section 4. We set the query budget to 5,000 queries, as in Section 5.1. As shown in Table 2, SignHunter's attacks resulted in the lowest model accuracy, outperforming all other state-of-the-art attack strategies submitted to the challenge. Note that the most powerful white-box attack by Zheng et al. (2018) resulted in a still lower model accuracy—not shown in the table.

For both challenges, we recorded the same metrics used in Section 4 as shown in Figure 19 in the supplement.

6 Open Questions

There are many interesting questions left open by our research:

  1. Priors. The current version of SignHunter does not exploit any data- or time-dependent priors. With these priors, algorithms such as BanditsTD operate on a search space of dimensionality substantially smaller than SignHunter's for IMAGENET. In domain-specific examples such as images, can Binary Partition Trees (BPT) (Al-Dujaili et al., 2015) be incorporated in SignHunter to obtain a data-dependent grouping of gradient coordinates instead of the current equal-size grouping?

  2. SignHunter for Continuous Optimization/Reinforcement Learning. In (Salimans et al., 2017; Chrabaszcz et al., 2018), it was shown that a class of black-box continuous optimization algorithms (as well as a very basic canonical ES algorithm) rival the performance of standard reinforcement learning techniques. On the other hand, SignHunter is tailored towards recovering the gradient sign bits and creating adversarial examples, similar to FGSM, using the best gradient sign estimate obtained so far. Can we incorporate SignHunter in an iterative framework for continuous optimization? Figure 11 shows a small, preliminary experiment comparing sign-based gradient descent baselines to a simple iterative framework employing SignHunter. In the regime of high dimension/few iterations, SignHunter can be remarkably faster. However, with more iterations, the algorithm fails to improve further and starts to oscillate. The reason is that SignHunter always provides ±1 updates (the non-standard sign convention), whereas the other algorithms' updates can be zero. Can we get the best of both worlds using clever initializations and adaptive step size updates?

  3. Perturbation Vertices. (We define perturbation vertices as extreme points of the perturbation region B_p(x, ε_p); for the ℓ∞ ball, these are points whose ith coordinate is either x_i − ε_∞ or x_i + ε_∞. See Figure 10.) During its first sweep, SignHunter probes extreme points of the perturbation region as potential adversarial examples, while iterative continuous optimization methods such as NES and BanditsTD probe points in a Gaussian sphere around the current point, as shown in Figure 10. Does looking up extreme points (vertices) of the perturbation region suffice to craft adversarial examples? If that is the case, how can we efficiently search through them? SignHunter searches through only O(n) vertices out of 2^n, yet it finds adversarial examples among this tiny fraction of vertices: in the ℓ∞ setup of Section 4, a handful of vertices per image was enough to achieve a high evasion rate. Note that after its first full sweep, SignHunter may probe points that are not vertices, as shown in Figure 10; we ignored this effect in our experiments (in fact, it is negligible in the ℓ∞ setup). Would SignHunter be more effective if the probes were made strictly at the perturbation vertices? This question shows up clearly in the public MNIST challenge, where the loss value at the potential adversarial examples dips after a number of queries (see the top left plot of Figure 19). We believe the reason is that these potential adversarial examples are not extreme points, as illustrated in Figure 10: they are like the red ball 2 rather than the red ball 1.

  4. Adversarial Training. Compared to other attacks, our approach showed more effectiveness against adversarially trained models. Standard adversarial training relies on inner maximizers (attacks) that employ iterative continuous optimization methods such as PGD, in contrast to our attack, which stems from a binary optimization view. What are the implications?

  5. Other Domains. Much of the work done to understand and counter adversarial examples has occurred in the image classification domain. The binary view of our approach lends itself naturally to other domains where binary features are used (e.g., malware detection (Al-Dujaili et al., 2018; Luca et al., 2019)). How effective is our approach in these domains?

(a) ℓ∞ perturbation (b) ℓ2 perturbation
Figure 10: Illustration of adversarial examples crafted by SignHunter in comparison to attacks based on the continuous optimization view, e.g., NES and BanditsTD, in both (a) ℓ∞ and (b) ℓ2 settings. Note that when SignHunter is given a query budget larger than one full sweep, which is the case here, the crafted adversarial examples are not necessarily at the perturbation vertices: the red ball 2 in both plots is not a perturbation vertex. We can modify SignHunter to strictly look up perturbation vertices, e.g., by doubling the finite difference step; we leave this for future work as outlined in Section 6.
Figure 11: SignHunter for continuous optimization. In this basic experiment, we run sign-based gradient descent baselines and an iterative SignHunter to minimize a simple synthetic function over several dimensions. The solid line represents the loss averaged over 30 independent trials with a randomly drawn objective in each trial, and the shaded region indicates the standard deviation of results over the random trials. We used a fixed step size in line with (Liu et al., 2019) and a fixed finite difference perturbation. The starting point for all the algorithms was set to the all-one vector.

7 Conclusion

In this paper, we studied the problem of generating adversarial examples for neural nets under a black-box threat model. Motivated by i) the significant empirical effectiveness of gradient sign information; and ii) the low query complexity of recovering a sign vector using a noiseless Hamming distance oracle, we proposed the gradient sign estimation problem as the core challenge in crafting adversarial examples, and we formulated it as a binary black-box optimization problem: minimizing the Hamming distance to the gradient sign or, equivalently, maximizing the directional derivative.

We examined three properties of the directional derivative of the model’s loss in the direction of sign vectors, approximated in the black-box setup by finite differences of loss value queries. Based on the first property, the loss oracle can be used as a noisy Hamming distance oracle; we found that current Hamming search strategies (e.g., Maurer (2009)) are not suitable for such oracles. The second property lets us employ the optimism in the face of uncertainty principle in the form of hierarchical bandits. This resulted in GOO, an optimistic optimization algorithm for binary black-box optimization problems with a finite-time analysis of its regret; however, its query complexity is worse than that of the continuous optimization setup. The third property, separability, helped us devise SignHunter, a divide-and-conquer algorithm that is guaranteed to perform at least as well as FGSM after O(n) queries. In practice, SignHunter needs only a fraction of this number of queries to craft adversarial examples. To verify its effectiveness on real-world datasets, SignHunter was compared against state-of-the-art black-box attacks on neural network models for the MNIST, CIFAR10, and IMAGENET datasets. SignHunter yields black-box attacks that are more query efficient and less failure-prone than the state-of-the-art attacks combined. Moreover, SignHunter achieves the highest evasion rate on two public black-box attack challenges, surpassing other attacks that are based on transferability and generative adversarial networks. Our future work will investigate the open questions outlined in Section 6.

This work was supported by the MIT-IBM Watson AI Lab. The authors would like to thank Shashank Srikant for his timely help.


References

Appendix A. Noisy FGSM

This section shows the performance of the noisy FGSM on standard models (described in Section 4) for the MNIST, CIFAR10, and IMAGENET datasets. In Figure 12, we consider the ℓ∞ threat perturbation constraint; Figure 13 reports the performance for the ℓ2 setup. Similar to Ilyas et al. (2019), for each k in the experiment, the top k percent of the signs of the gradient coordinates—chosen either randomly (random-k) or by the corresponding magnitude (top-k)—are set correctly, and the rest are set to +1 or −1 at random. The misclassification rate shown considers only images that were correctly classified (with no adversarial perturbation); in accordance with the models' accuracies, the number of such images differs across MNIST, CIFAR10, and IMAGENET out of the sampled images. These figures also serve as a validation of Theorem 3.3 when compared to SignHunter's performance shown in Appendix C.

Figure 12: Misclassification rate of three neural nets (on (a) MNIST, (b) CIFAR10, and (c) IMAGENET, respectively) on the noisy FGSM’s adversarial examples as a function of the number of correctly estimated coordinates of sign(∇_x L(x, y)), computed on random images from the corresponding evaluation dataset, with the maximum allowed ℓ∞ perturbation set as in Section 4. Across all the models, estimating the sign of a fraction of the top gradient coordinates (in terms of their magnitudes) is enough to achieve a high misclassification rate. Note that Plot (c) is similar to Ilyas et al. (2019)’s Figure 1, but it is produced with TensorFlow rather than PyTorch.
Figure 13: Misclassification rate of three neural nets (on (a) MNIST, (b) CIFAR10, and (c) IMAGENET, respectively) on the noisy FGSM’s adversarial examples as a function of the number of correctly estimated coordinates of sign(∇_x L(x, y)), computed on random images from the corresponding evaluation dataset, with the maximum allowed ℓ2 perturbation set as in Section 4. Compared to Figure 12, the performance drops significantly.

Appendix B. Experiments Setup

This section outlines the experiments setup. Figure 14 shows the performance of the considered algorithms on a synthetic concave loss function after tuning their hyperparameters. A possible explanation of SignHunter’s superb performance is that the synthetic loss function is well-behaved in terms of its gradient given an image: most of the gradient coordinates share the same sign, since pixels tend to have similar values and the optimal value for all the pixels is the same. Thus, SignHunter recovers the true gradient sign with as few queries as possible (recall the example in Section 3.3). Moreover, given the structure of the synthetic loss function, the optimal loss value is always at the boundary of the perturbation region, which is where SignHunter samples its perturbations. Tables 4, 5, 6, and 7 outline the algorithms’ hyperparameters, while Table 3 describes the general setup for the experiments.

Figure 14: Tuning testbed for the attacks. A synthetic concave loss function was used to tune the performance of the attacks over a random sample of 25 images for each dataset and perturbation constraint. The plots show the average performance of the tuned attacks on the synthetic loss function under a fixed query limit per image. Note that, overall, SignHunter outperforms the other attacks. Also, we observe the same behavior reported by Liu et al. (2019) on the fast convergence of ZO-SignSGD compared to NES. We did not tune SignHunter; it does not have any tunable parameters.