
Back in Black: A Comparative Evaluation of Recent State-Of-The-Art Black-Box Attacks

by Kaleel Mahmood, et al.
University of Connecticut

The field of adversarial machine learning has experienced a near exponential growth in the amount of papers being produced since 2018. This massive information output has yet to be properly processed and categorized. In this paper, we seek to help alleviate this problem by systematizing the recent advances in adversarial machine learning black-box attacks since 2019. Our survey summarizes and categorizes 20 recent black-box attacks. We also present a new analysis for understanding the attack success rate with respect to the adversarial model used in each paper. Overall, our paper surveys a wide body of literature to highlight recent attack developments and organizes them into four attack categories: score based attacks, decision based attacks, transfer attacks and non-traditional attacks. Further, we provide a new mathematical framework to show exactly how attack results can fairly be compared.



1. Introduction

Figure 1. Timeline of recent black-box attack developments. The transfer based attacks are shown in red. The original transfer attack (Local Substitute Model) was proposed in (Papernot et al., 2017). The score based attacks are shown in blue. One of the first widely adopted score based attacks (ZOO) was proposed in (Chen et al., 2017). The decision based attacks are shown in green. One of the first decision based attacks (Boundary Attack) was proposed in (Brendel et al., 2017).

One of the first works to popularize Convolutional Neural Networks (CNNs) (LeCun et al., 1989) for image recognition was published in 1998. Since then, CNNs have been widely employed for tasks like image segmentation (He et al., 2017), object detection (Redmon et al., 2016) and image classification (Kolesnikov et al., 2020). Although CNNs are the de facto choice for machine learning tasks in the imaging domain, they have been shown to be vulnerable to adversarial examples (Goodfellow et al., 2014). In this paper, we discuss adversarial examples in the context of images. Specifically, an adversarial example is an input image that humans correctly recognize, but that has a small amount of noise added such that the classifier (i.e. a CNN) misclassifies the image with high confidence.

Attacks that create adversarial examples can be divided into two basic types, white-box and black-box attacks. White-box attacks require knowing the structure of the classifier as well as the associated trained model parameters (Goodfellow et al., 2014). In contrast to this, black-box attacks do not require directly knowing the model and trained parameters. Black-box attacks rely on alternative information like query access to the classifier (Chen et al., 2017), knowing the training dataset (Papernot et al., 2017), or transferring adversarial examples from one trained classifier to another (Zhou et al., 2020).

In this paper, we survey recent advances in black-box adversarial machine learning attacks. We select this scope for two main reasons. First, we choose the black-box adversary because it represents a realistic threat model where the classifier under attack is not directly visible. It has been noted that a black-box attacker represents a more practical adversary (Chen et al., 2020) and one which corresponds to real world scenarios (Papernot et al., 2017). The second reason we focus on black-box attacks is the large body of recently published literature. As shown in Figure 1, many new black-box attack papers have been proposed in recent years. These attacks are not included in current surveys or systematization of knowledge papers. Hence, there is a need to categorize and survey these works, which is precisely the goal of this paper. To the best of our knowledge, the last major survey (Bhambri et al., 2019) on adversarial black-box attacks was done in 2020. A graphical overview of the coverage of some of the new attacks we provide (versus the old attacks previously covered) is shown in Figure 2. The complete list of important attack papers we survey is shown graphically in Figure 1 and also listed in Table 1.

While each new attack paper contributes to the literature, many do not compare against other state-of-the-art techniques, or adequately explain how they fit within the scope of the field. In this survey, we summarize 20 recent black-box attacks, categorize them into four basic groups and create a mathematical framework under which results from different papers can be compared.

1.1. Advances in Adversarial Machine Learning

In this subsection we briefly discuss the history and development of the field of adversarial machine learning. Such a perspective helps illuminate how the field went from a white-box attack like FGSM (Goodfellow et al., 2014) in 2014, which required complete knowledge of the classifier and trained parameters, to a black-box attack in 2021 like SurFree (Maho et al., 2021), which can create an adversarial example with only query access to the classifier using 500 queries or fewer.

The inception point of adversarial machine learning can be traced back to several source papers. However, identifying the very first adversarial machine learning paper is difficult, as the answer depends on how the term "adversarial machine learning" itself is defined. If one defines adversarial machine learning as exclusive to CNNs, then the vulnerability of CNNs to adversarial examples was first demonstrated in 2013 in (Szegedy et al., 2013). However, others (Biggio and Roli, 2018) claim adversarial machine learning can be traced back as early as 2004. In (Biggio and Roli, 2018), the authors argue that evading the linear classifiers used as email spam detectors was one of the first examples of adversarial machine learning.

Regardless of the ambiguous starting point, adversarial examples remain a serious open problem that occurs across multiple machine learning domains, including image recognition (Goodfellow et al., 2014) and natural language processing (Hsieh et al., 2019). Adversarial machine learning is also not limited to neural networks. Adversarial examples have been shown to be problematic for decision trees, k-nearest neighbor classifiers and support vector machines (Papernot et al., 2016).

The field of adversarial machine learning with respect to computer vision and imaging related tasks first developed with respect to white-box adversaries. One of the first and most fundamental attacks proposed was the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2014). In the FGSM attack, the adversary uses the neural network model architecture, loss function and trained weights of the classifier, and performs a single forward and backward pass (backpropagation) on the network to obtain an adversarial example from a clean example. Subsequent work included methods like the Projected Gradient Descent (PGD) (Madry et al., 2018) attack, which used multiple forward and backward passes to better fine tune the adversarial noise. Other attacks were developed to better determine the adversarial noise by forming an optimization problem with respect to certain norms, such as in the Carlini Wagner (Carlini and Wagner, 2017) attack, or the Elastic Net attack (Chen et al., 2018). Even more recent attacks (Croce and Hein, 2020) have focused on breaking adversarial defenses and overcoming false claims of security which are caused by a phenomenon known as gradient masking (Athalye et al., 2018).
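To make the single forward and backward pass concrete, the following sketch applies an FGSM-style step to a logistic-regression classifier standing in for a CNN. This is a minimal illustration under stated assumptions, not the original implementation; the function name and parameters (w, b, eps) are ours.

```python
import numpy as np

def fgsm_attack(x, y, w, b, eps):
    """One FGSM-style step: perturb x by eps in the direction of the sign
    of the gradient of the loss with respect to the input."""
    z = np.dot(w, x) + b                      # forward pass
    p = 1.0 / (1.0 + np.exp(-z))              # sigmoid "classifier"
    grad_x = (p - y) * w                      # backward pass: dL/dx for BCE
    return x + eps * np.sign(grad_x)          # single gradient-sign step
```

A PGD-style attack simply repeats this step several times, projecting back onto the eps-ball after each update.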

All of the aforementioned attacks are considered white-box attacks. That is, the adversary requires knowledge of the network architecture and trained weights in order to conduct the attack. Creating a less capable adversary (i.e., one that did not know the trained model parameters) was a motivating factor in developing black-box attacks. In the next subsection, we discuss black-box attacks and the categorization system we develop in this paper.

Figure 2. Graph of different black-box attacks with the respective date they were proposed (e-print made available). The query number refers to the number of queries used in the attack on an ImageNet classifier. The orange points are attacks covered in previous survey work (Bhambri et al., 2019). The blue points are attacks covered in this work. We further denote whether the attack is targeted or untargeted by putting a U or T next to the text label in the graph. Square and circular points distinguish attacks performed with respect to the two norms considered (ℓ∞ and ℓ2).

1.2. Black-box Attack Categorization

We can divide black-box attacks according to the general adversarial model that is assumed for the attack. The four categories we use are transfer attacks, score based attacks, decision based attacks and non-traditional attacks. We next describe what defines the different categorizations and also mention the primary original attack paper in each category.

Transfer Attacks: One of the first black-box attacks was the local substitute model attack (Papernot et al., 2017). In this attack, the adversary was allowed access to part of the original training data used to train the classifier, as well as query access to the classifier. The idea behind this attack was that the adversary would query the classifier to label the training data. After this was accomplished, the attacker would train their own independent classifier, which is often referred to as the synthetic model (Mahmood et al., 2019). Once the synthetic model was trained, the adversary could run any number of white-box attacks on the synthetic model to create adversarial examples. These examples were then submitted to the unseen classifier in the hope that the adversarial examples would transfer over. Here transferability is defined in the sense that adversarial examples that are misclassified by the synthetic model will also be misclassified by the unseen classifier.

Recent advances in transfer based attacks include not needing the original training data, as in the DaST attack (Zhou et al., 2020), and using methods that generate adversarial examples with higher transferability (Adaptive (Mahmood et al., 2019) and PO-TI (Li et al., 2020)).
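As a sketch of the substitute-model pipeline (query the black box for labels, then fit a local model), the code below uses a linear "oracle" and logistic regression in place of real CNNs; the class and function names are illustrative, not from any of the surveyed papers.

```python
import numpy as np

class BlackBoxOracle:
    """Stands in for the unseen classifier: the attacker sees hard labels only."""
    def __init__(self, w):
        self.w = w
    def query(self, X):
        return (X @ self.w > 0).astype(float)

def train_substitute(oracle, X_sub, lr=0.5, epochs=300):
    """Train the synthetic (substitute) model on oracle-labeled data."""
    y = oracle.query(X_sub)                   # step 1: label data by querying
    w = np.zeros(X_sub.shape[1])
    for _ in range(epochs):                   # step 2: fit a local linear model
        p = 1.0 / (1.0 + np.exp(-(X_sub @ w)))
        w -= lr * X_sub.T @ (p - y) / len(y)
    return w                                  # white-box attacks now target w
```

Adversarial examples crafted against the returned weights are then submitted to the oracle in the hope that they transfer.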

Score Based Attacks: The zeroth order optimization based black-box attack (ZOO) (Chen et al., 2017) was one of the first accepted works to rely on a query based approach to creating adversarial examples. Unlike transfer attacks, which require a synthetic model, score based attacks repeatedly query the unseen classifier to try to craft the appropriate adversarial noise. As the name implies, for score based attacks to work, they require the output from the classifier to be the score vector (either probabilities or, in some cases, the pre-softmax logit output).

Score based attacks represent an improvement over transfer attacks in the sense that no knowledge of the dataset is needed since no synthetic model training is required. In very broad terms, the recent developments in score based attacks mainly focus on reducing the number of queries required to conduct the attack and/or reducing the magnitude of the noise required to generate a successful adversarial example. New score based attacks include qMeta (Du et al., 2020), P-RGF (Cheng et al., 2019a), ZO-ADMM (Zhao et al., 2019), TREMBA (Huang and Zhang, 2019), Square attack (Andriushchenko et al., 2020), ZO-NGD (Zhao et al., 2020) and PPBA (Li et al., 2020b).

Decision Based Attacks: We consider any attack that does not rely on a synthetic model and does not require the score vector output to be a decision based attack. Compared to either transfer based or score based attacks, decision based attacks represent an even more restricted adversarial model, as only the hard label output from the unseen classifier is required. The first prominent decision based attack paper was the Boundary Attack (Brendel et al., 2017). Since then, numerous decision based attacks have been proposed that reduce the number of queries needed to successfully attack the unseen classifier, or reduce the noise required in the adversarial examples. The new decision based attacks we cover in this paper include qFool (Liu et al., 2019b), HSJA (Chen et al., 2020), GeoDA (Rahmati et al., 2020), QEBA (Li et al., 2020a), RayS (Chen and Gu, 2020), SurFree (Maho et al., 2021) and NonLinear-BA (Li et al., 2021).

Non-traditional Attacks: The last category of attacks that we cover in this paper are called non-traditional black-box attacks. Here, we use this category to group the attacks that do not use standard black-box adversarial models. Transfer based attacks, score based attacks, and decision based attacks typically focus on designing the attack with respect to the ℓ2 and/or the ℓ∞ norm. Specifically, these attacks either directly or indirectly seek to satisfy the following condition: ‖x_adv − x‖_p ≤ ε, where x is the original clean example, x_adv is the adversarial example, ε is the maximum allowed perturbation and p ∈ {2, ∞}. However, there are attacks that work outside of this traditional scheme.
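In practice this constraint is enforced by projecting a candidate adversarial example back onto the ε-ball around x. A minimal sketch for the two traditional norms (the function name is ours):

```python
import numpy as np

def project(x_adv, x, eps, p):
    """Project x_adv onto the eps-ball around x for p in {2, inf}."""
    delta = x_adv - x
    if p == np.inf:
        delta = np.clip(delta, -eps, eps)       # coordinate-wise clamp
    else:  # p == 2
        norm = np.linalg.norm(delta)
        if norm > eps:
            delta *= eps / norm                 # radial rescale
    return x + delta
```

Iterative attacks alternate an update step with this projection so every query stays inside the threat model.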

CornerSearch (Croce and Hein, 2019) proposes a black-box attack based on finding an adversarial example with respect to the ℓ0 norm. Abandoning norm based constraints completely, Patch Attack (Yang et al., 2020) replaces a certain area of the image with an adversarial patch. Likewise, ColorFool (Shamsabadi et al., 2020) disregards norms and instead recolors the image to make it adversarial. While the non-traditional category is not strictly defined, it gives us a concise grouping that highlights the advances being made outside of the ℓ2 and ℓ∞ based black-box attacks.

Score based Attacks
Attack Name Date Author
qMeta 6-Jun-19 Du et al. (Du et al., 2020)
P-RGF 17-Jun-19 Cheng et al. (Cheng et al., 2019a)
ZO-ADMM 26-Jul-19 Zhao et al. (Zhao et al., 2019)
TREMBA 17-Nov-19 Huang et al. (Huang and Zhang, 2019)
Square 29-Nov-19 Andriushchenko et al. (Andriushchenko et al., 2020)
ZO-NGD 18-Feb-20 Zhao et al. (Zhao et al., 2020)
PPBA 8-May-20 Li et al. (Li et al., 2020b)
Decision based Attacks
Attack Name Date Author
qFool 26-Mar-19 Liu et al. (Liu et al., 2019b)
HSJA 3-Apr-19 Chen et al. (Chen et al., 2020)
GeoDA 13-Mar-20 Rahmati et al. (Rahmati et al., 2020)
QEBA 28-May-20 Li et al. (Li et al., 2020a)
RayS 23-Jun-20 Chen et al. (Chen and Gu, 2020)
SurFree 25-Nov-20 Maho et al. (Maho et al., 2021)
NonLinear-BA 25-Feb-21 Li et al. (Li et al., 2021)
Transfer based Attacks
Attack Name Date Author
Adaptive 3-Oct-19 Mahmood et al. (Mahmood et al., 2019)
DaST 28-Mar-20 Zhou et al. (Zhou et al., 2020)
PO-TI 13-Jun-20 Li et al. (Li et al., 2020)
Non-traditional Attacks
Attack Name Date Author
CornerSearch 11-Sep-19 Croce et al. (Croce and Hein, 2019)
ColorFool 25-Nov-19 Shamsabadi et al. (Shamsabadi et al., 2020)
Patch 12-Apr-20 Yang et al. (Yang et al., 2020)
Table 1. Attacks covered in this survey, their corresponding attack categorization, publication date (when the first e-print was released) and author.

1.3. Paper Organization and Major Contributions

In this paper we survey state-of-the-art black-box attacks that have recently been published. We provide three major contributions in this regard:

  1. In-Depth Survey:

    We summarize and distill the knowledge from 20 recent significant black-box adversarial machine learning papers. For every paper, we include an explanation of the mathematics necessary to conduct the attack and describe the corresponding adversarial model. We also provide an experimental section that brings together the results from all 20 papers, reported on three datasets (MNIST, CIFAR-10 and ImageNet).

  2. Attack Categorization: We organize the attacks into four different categories based on the underlying adversarial model used in each attack. We present this organization so the reader can clearly see where advances are being made under each of the four adversarial threat models. Our breakdown concisely helps new researchers interpret the rapidly evolving field of black-box adversarial machine learning.

  3. Attack Analysis Framework: We analyze how the attack success rate is computed based on different adversarial models and their corresponding constraints. Based on this analysis, we develop an intuitive way to define the threat model used to compute the attack success rate. Using this framework, it can clearly be seen when attack results reported in different papers can be compared, and when such evaluations are invalid.

The rest of our paper is organized as follows: in Section 2, we summarize score based attacks. In Section 3, we cover the papers that propose new decision based attacks. In Section 4, we discuss transfer attacks. The last type of attack, non-traditional attacks are described in Section 5. After covering all the new attacks, we turn our attention to analyzing the attack success rate in Section 6. Based on this analysis, we compile the experimental results for all the attacks in Section 7, and give the corresponding threat model developed from our new adversarial model framework. Finally, we offer concluding remarks in Section 8.

2. Score based Attacks

In this section we summarize recent advances in adversarial machine learning with respect to attacks that are score based or logit based. The adversarial model for these attacks allows the attacker to query the defense with an input x and receive the corresponding probability outputs p(x) ∈ [0, 1]^k, where k is the number of classes. We also include logit based black-box attacks in this section. The logits are the pre-softmax outputs from the model, z(x) ∈ R^k.
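The relationship between the two output types is a single transform: a logit-based adversary sees z, while a score-based adversary sees softmax(z). A small sketch:

```python
import numpy as np

def softmax(z):
    """Convert pre-softmax logits z into the probability vector p(x)."""
    e = np.exp(z - z.max())      # subtract max for numerical stability
    return e / e.sum()
```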

We cover 7 recently proposed score type attacks. These attacks include the Square attack (Andriushchenko et al., 2020), the Zeroth-Order Natural Gradient Descent attack (ZO-NGD) (Zhao et al., 2020), the Projection and Probability-driven Black-box Attack (PPBA) (Li et al., 2020b), the Zeroth-order Optimization Alternating Direction Method of Multipliers (ZO-ADMM) attack (Zhao et al., 2019), the prior-guided random gradient-free (P-RGF) attack (Cheng et al., 2019a), the TRansferable EMbedding based Black-box Attack (TREMBA) (Huang and Zhang, 2019) and the qMeta attack (Du et al., 2020).

2.1. Square Attack

The Square attack is a score based, black-box adversarial attack proposed in (Andriushchenko et al., 2020) that focuses primarily on being query efficient while maintaining a high attack success rate. The novelty of the attack comes in the usage of square shaped image perturbations which have a particularly strong impact on the predicted outputs of CNNs. This works in tandem with the implementation of the randomized search optimization protocol. The protocol is independent of model gradients and greedily adds squares to the current image perturbation if they lead to an increase in the target model’s error. The attack solves the following optimization problem:


min_{x_adv} f_y(x_adv) − max_{k≠y} f_k(x_adv)  subject to ‖x_adv − x‖_p ≤ ε,

where f is the classifier function with a score f_k for each of the K classes, x_adv is the adversarial input, x is the clean input, y is the ground truth label, and ε is the maximum perturbation.


The attack algorithm begins by first applying random noise to the clean image. Then an image perturbation, δ, is generated according to a perturbation generating algorithm defined by the attacker. If δ increases the target model's error, it is applied to the current x_adv. This step is done iteratively until the targeted model outputs the desired label or until the maximum number of iterations is reached.

The distributions used for the iterative and initial image perturbations are chosen by the attacker. In (Andriushchenko et al., 2020), different initial and iterative perturbation algorithms are proposed for the ℓ∞ and ℓ2 norm attacks.

For the ℓ∞ norm, the perturbation is initialized by applying one pixel wide vertical stripes to the clean image. The color of each stripe is sampled uniformly from {−ε, ε}^c, where c is the number of color channels. The distribution used in the iterative step generates a square of a given size at a random location such that the magnitude of the perturbation in each color channel is chosen randomly from {−2ε, 2ε}. The resulting, clipped adversarial image will then differ from the clean image by either −ε or ε at each modified point.

The ℓ2 norm attack is initialized by generating a grid-like tiling of squares on the clean image. The perturbation is then rescaled to have ℓ2 norm ε and is clipped to the valid pixel range. The iterative perturbation is motivated by the realization that classifiers are particularly susceptible to large, localized perturbations rather than smaller, more sparse ones. Thus the iterative attack places two squares of opposite sign either vertically or horizontally in line with each other, where each square has a large magnitude at its center that swiftly drops off but never reaches zero. After each iteration of the attack the current perturbation δ is clipped such that ‖δ‖_2 ≤ ε and x_adv remains in the valid input domain [0, 1]^d, where d is the dimensionality of the clean image.
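The randomized search can be sketched for the ℓ∞ case as follows. This is a simplified sketch of the greedy loop (stripe initialization, random squares, accept on improvement), not the authors' code; the loss callable and all parameter names are ours, and lower loss is taken to mean a stronger attack.

```python
import numpy as np

def square_attack_linf(loss, x, eps, n_iters=1000, s=3, rng=None):
    """Greedy random search with square-shaped perturbations (l_inf ball).
    loss(image) should decrease as the attack succeeds."""
    rng = rng or np.random.default_rng(0)
    h, w, c = x.shape
    # init: one-pixel-wide vertical stripes with values in {-eps, +eps}
    delta = eps * rng.choice([-1.0, 1.0], size=(1, w, c))
    delta = np.broadcast_to(delta, x.shape).copy()
    best = loss(np.clip(x + delta, 0.0, 1.0))
    for _ in range(n_iters):
        cand = delta.copy()
        r = rng.integers(0, h - s + 1)
        col = rng.integers(0, w - s + 1)
        # propose a square whose channels are drawn from {-2eps, +2eps}
        cand[r:r+s, col:col+s, :] = rng.choice([-2 * eps, 2 * eps], size=c)
        cand = np.clip(cand, -eps, eps)         # stay inside the l_inf ball
        val = loss(np.clip(x + cand, 0.0, 1.0))
        if val < best:                          # greedy: keep improving squares
            best, delta = val, cand
    return np.clip(x + delta, 0.0, 1.0)
```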

The attack is tested on contemporary models like ResNet-50, Inception v3, and VGG-16-BN trained on ImageNet. It achieves a lower attack failure rate while requiring significantly fewer queries than attacks like Bandits, Parsimonious, DFO-MCA, and SignHunter. The Square attack is also compared to the white-box Projected Gradient Descent (PGD) attack on the MNIST and CIFAR-10 datasets, where it performs similarly to PGD in terms of attack success rate despite operating within a more difficult threat model.

2.2. Zeroth-Order Natural Gradient Descent Attack

The Zeroth-Order Natural Gradient Descent (ZO-NGD) attack is a score based, black-box attack proposed in (Zhao et al., 2020) as a query efficient attack utilizing a novel attack optimization technique. In particular, the attack approximates a Fisher information matrix over the distribution of inputs and subsequent outputs of the classifier. The attack solves the following optimization problem:


min_δ L(x + δ)  subject to ‖δ‖_∞ ≤ ε,

where x is the clean image, δ is an image perturbation, ε is the maximum allowed image perturbation, t is the clean image's ground truth label, f_t(x) is the classifier's predicted score for class t given input x, and L is the attack's loss. The attack is an iterative algorithm that initializes the image perturbation, δ, as a matrix of all zeros. At each step the algorithm first approximates the gradient of the loss function, ∇L, according to the following equation:


∇̂L(δ) ≈ (1/(μq)) Σ_{i=1}^{q} [L(δ + μ u_i) − L(δ)] u_i,

where each u_i is a random perturbation chosen i.i.d. from the unit sphere, μ is a smoothing parameter, and q is a hyperparameter for the number of queries used in the approximation. Next, the attack approximates the gradient of the log-likelihood function. This is necessary for calculating the Fisher information matrix and subsequently the perturbation update.
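The finite-difference estimator can be sketched directly. This is a minimal version under the usual assumptions (directions on the unit sphere, a dimension factor d to keep the estimate roughly unbiased for smooth losses); the function and parameter names are ours.

```python
import numpy as np

def zo_gradient(loss, delta, mu=1e-3, q=20, rng=None):
    """Zeroth-order gradient estimate of loss at delta: average finite
    differences along q random unit-sphere directions u_i."""
    rng = rng or np.random.default_rng(0)
    d = delta.size
    grad = np.zeros_like(delta, dtype=float)
    base = loss(delta)                         # one query for the base value
    for _ in range(q):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)                 # u_i on the unit sphere
        grad += d * (loss(delta + mu * u) - base) / mu * u
    return grad / q                            # q additional queries in total
```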


Here the notation is consistent with the notation seen in Equation 5, and this quantity can be calculated using the same queries that were used in Equation 5. The Fisher information matrix is approximated and δ is updated according to the following equations:


Where c is a constant and λ is the attack learning rate. Π is the projection function, which projects its input onto the feasible set. It is also worth recognizing that δ is represented as a matrix, since images like x are also represented as matrices. This makes the addition seen in Equation 8 valid. The iterative process can be continued for a predetermined number of iterations or until the perturbation yields a satisfactory result. The Fisher information matrix is a powerful tool; however, its size can make it impractical for use on datasets with larger inputs, thus an approximation may be necessary.

The attack is tested on the MNIST, CIFAR-10, and ImageNet datasets, where it achieves a similar attack success rate to the ZOO, Bandits, and NES-PGD attacks while requiring fewer queries to be successful. The attack is also shown to have an extremely high attack success rate within 1200 queries on all three aforementioned datasets.

2.3. Projection and Probability Driven Attack

The Projection and Probability-driven Black-box Attack (PPBA) proposed in (Li et al., 2020b) is a score based, black-box attack that achieves high attack success rates while being query efficient. It achieves this by shrinking the solution space of possible adversarial inputs to those which contain low-frequency perturbations. This is motivated by the observation that contemporary neural networks are particularly susceptible to low frequency perturbations. The attack solves an optimization problem that minimizes p_y(x + δ), the model's predicted probability that the perturbed input is of the ground truth class y, where x is the clean image and δ is the adversarial perturbation. The attack utilizes a sensing matrix, A, which is composed of a Discrete Cosine Transform matrix and a measurement matrix, along with a corresponding low-dimensional measurement vector, z, from which the perturbation δ is recovered through A. The exact design of the measurement matrix varies according to practice (Abolghasemi et al., 2010) (Ravelomanantsoa et al., 2015).

One point to note is that the Discrete Cosine Transform matrix is orthonormal, which keeps the mapping between z and δ easy to invert. Once A is calculated, the attack utilizes a query efficient version of the random walk algorithm. In particular, the attack stores a confusion matrix for each dimension of Δz, the change in z at each iteration. For a step of −ρ, 0 or +ρ (where ρ is a predefined step size), each entry tracks the number of times the loss function descended and the number of times it increased or remained the same when that step was taken. The algorithm then uses these counts to determine its sampling probability for Δz: each dimension's step is sampled in proportion to how often that step has been effective so far. The attack algorithm begins by first calculating A and then initializing all counts to 1. The iterative part of the algorithm then begins: at each step the algorithm generates a new Δz according to the probability distribution described in Equation 11. If the new step lowers the loss, the perturbation is updated and clipped, where the clip function forces the adversarial example to remain within the clean image's input space. If at any point the perturbation generated causes the model to output an incorrect class label, the attack terminates and returns the penultimate perturbation.
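The query-efficient random walk can be sketched as below: per-dimension success counts bias the sampling of each step of Δz toward moves that have lowered the loss before. This is our simplified rendering of the idea; the step size, count bookkeeping and names are assumptions.

```python
import numpy as np

def ppba_random_walk(loss, z0, step=0.1, n_iters=400, rng=None):
    """Random walk over z where each dimension's step in {-step, 0, +step}
    is sampled in proportion to how often that step has been effective."""
    rng = rng or np.random.default_rng(0)
    d = z0.size
    moves = np.array([-step, 0.0, step])
    eff = np.ones((d, 3))                    # 'effective' counts (init to 1)
    tot = np.ones((d, 3))                    # total counts (init to 1)
    z, best = z0.copy(), loss(z0)
    for _ in range(n_iters):
        prob = eff / tot
        prob /= prob.sum(axis=1, keepdims=True)
        idx = np.array([rng.choice(3, p=prob[i]) for i in range(d)])
        dz = moves[idx]
        val = loss(z + dz)
        ok = val < best                      # was this proposal effective?
        eff[np.arange(d), idx] += ok
        tot[np.arange(d), idx] += 1.0
        if ok:
            z, best = z + dz, val
    return z
```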

PPBA is tested on the ImageNet dataset with the classifiers ResNet50, Inception v3 and VGG-16. PPBA achieves high attack success rates while maintaining a low query count. It is also tested on the Google Cloud Vision API, where it achieves a high attack success rate in this more realistic setting.

2.4. Alternating Direction Method of Multipliers Based Black-Box Attacks

A new black-box attack framework is proposed in (Zhao et al., 2019) based on the distributed convex optimization technique known as the Alternating Direction Method of Multipliers (ADMM). The advantage of using the ADMM technique is that it can be directly combined with the zeroth-order optimization attack (ZOO-ADMM) or Bayesian optimization (BO-ADMM) to create a query-efficient, gradient free black-box attack. The attack can be run with score based or decision based output from the defense.

The main concept presented in (Zhao et al., 2019) is the conversion of the black-box attack optimization problem from a traditional constrained optimization problem into an unconstrained objective function that can be iteratively solved using ADMM. The original formulation of the black-box attack optimization problem can be written as:

min_δ f(x + δ, t) + λ D(δ)  subject to ‖δ‖_∞ ≤ ε,

where f is the loss function of the classifier, δ is the perturbation added to the original input x, t is the target class that the adversarial example should be misclassified as, and D is a distortion function to limit the difference between the adversarial example and x. In Equation 12, λ controls the weight given to the distortion function and ε specifies the maximum tolerated perturbation.

Instead of directly solving Equation 12, the constraints can be moved into the objective function and an auxiliary variable z can be introduced in order to write the optimization problem in an ADMM style form:

min_{δ, z} f(x + z, t) + λ D(δ) + I(δ)  subject to δ = z,

where I(δ) is 0 if ‖δ‖_∞ ≤ ε and ∞ otherwise. The augmented Lagrangian of Equation 13 is written as:

L(δ, z, u) = f(x + z, t) + λ D(δ) + I(δ) + uᵀ(δ − z) + (ρ/2) ‖δ − z‖_2²,

where u is the Lagrangian multiplier and ρ is a penalty parameter. Equation 14 can be iteratively solved using ADMM in the k-th step through the following update equations:


δ^{k+1} = argmin_δ λ D(δ) + I(δ) + u^{k}ᵀ(δ − z^k) + (ρ/2) ‖δ − z^k‖_2²   (15)

z^{k+1} = argmin_z f(x + z, t) + u^{k}ᵀ(δ^{k+1} − z) + (ρ/2) ‖δ^{k+1} − z‖_2²   (16)

u^{k+1} = u^k + ρ (δ^{k+1} − z^{k+1})   (17)

While Equation 15 has a closed form solution, minimizing Equation 16 requires a gradient descent technique like stochastic gradient descent, as well as access to the gradient of f. In the black-box setting this gradient is not available to the adversary and hence must be estimated using a special approach. If the gradient is estimated using the random gradient estimation technique, then the attack is referred to as ZOO-ADMM. Similarly, if the gradient is estimated using Bayesian optimization, the attack is denoted as BO-ADMM.
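A sketch of the splitting in code: one variable is handled by a closed-form projection onto the ℓ∞ ball, the other takes a zeroth-order gradient step on the attack loss, and a dual variable ties them together. This is our own simplified rendering of the general ADMM recipe (the distortion term is dropped for brevity), not the paper's algorithm; all parameter names are assumptions.

```python
import numpy as np

def zo_admm(attack_loss, x, eps, rho=1.0, lr=0.1, n_iters=60,
            mu=1e-3, q=10, rng=None):
    """ADMM splitting sketch: delta absorbs the box constraint in closed form
    (projection onto the l_inf ball), z takes a zeroth-order gradient step on
    attack_loss, and u is the scaled dual variable tying them together."""
    rng = rng or np.random.default_rng(0)
    d = x.size
    delta = np.zeros(d); z = np.zeros(d); u = np.zeros(d)
    for _ in range(n_iters):
        # delta-update (closed form): project z - u onto {|delta|_inf <= eps}
        delta = np.clip(z - u, -eps, eps)
        # z-update: zeroth-order gradient step on
        #   attack_loss(x + z) + (rho/2)||delta - z + u||^2
        g = np.zeros(d)
        base = attack_loss(x + z)
        for _ in range(q):
            v = rng.standard_normal(d); v /= np.linalg.norm(v)
            g += d * (attack_loss(x + z + mu * v) - base) / mu * v
        g = g / q - rho * (delta - z + u)
        z -= lr * g
        # dual update
        u += delta - z
    return np.clip(delta, -eps, eps)
```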

The new attack framework is experimentally verified on the CIFAR-10 and MNIST datasets. The results in (Zhao et al., 2019) show ZOO-ADMM outperforms both BO-ADMM and the original Boundary Attack presented in (Brendel et al., 2018). This performance improvement comes in the form of smaller distortions for the ℓp threat models considered, and in terms of fewer queries used for the ZOO-ADMM attack.

2.5. Improving Black-box Adversarial Attacks with Transfer-based Prior

Initial adversarial machine learning black-box attacks were developed based on one of two basic principles. In query based black-box attacks (Brendel et al., 2018), the gradient is directly estimated through querying. In transfer based attacks, the gradient is computed based on a trained model's gradient that is available to the attacker (Papernot et al., 2017). In (Cheng et al., 2019a), the authors propose combining the query and transfer based approaches to create a more query efficient attack, which they call the prior-guided random gradient-free method (P-RGF).

The P-RGF attack is developed around accurately and efficiently estimating the gradient of the target model f. The original random gradient-free method (Nesterov and Spokoiny, 2017) estimates the gradient as follows:

ĝ = (1/q) Σ_{i=1}^{q} [f(x + σ u_i, y) − f(x, y)] / σ · u_i,

where q is the number of queries used in the estimate, σ is a parameter to control the sampling variance, x is the input with corresponding label y, and the u_i are random vectors sampled from a distribution P. It is important to note that by selecting P carefully (according to priors) we can create a better estimate of the gradient. In P-RGF this choice is made by biasing the sampling using a transfer gradient v. The transfer gradient comes from a surrogate model that has been independently trained on the same data as the model whose gradient is currently being estimated. In the attack it is assumed that we have white-box access to the surrogate model such that v is known.

The overall derivation of the rest of the attack from (Cheng et al., 2019a) goes as follows: first we discuss the appropriate loss function for the gradient estimate ĝ. We then discuss how to pick the sampling distribution such that this loss is minimized. To determine how closely ĝ (the estimated gradient) follows ∇f(x) (the true model gradient) the following loss function is used (Cheng et al., 2019a):

L(ĝ) = E[ ‖∇f(x) − b ĝ‖_2² ],

where b is a scaling factor included to compensate for the change in magnitude caused by the estimation, and the expectation is taken over the randomness of the estimation algorithm. For notational convenience we write ∇f(x) as g in the remainder of this subsection. It can be proven that if f is differentiable at x then the loss function given in Equation 19 can be expressed as:


where and . Through careful choice of C, can be minimized to accurately estimate the gradient, thereby making the attack query efficient. C can be decomposed in terms of the transfer gradient as:


where and

are the eigenvalues and orthonormal eigenvectors of

C. To exploit the gradient information of the transfer model, is then randomly generated in terms of to satisfy Equation 21:


where controls the magnitude of the transfer gradient and

is a random variable sampled uniformly from the unit hypersphere.

The overall P-RGF method for estimating the gradient is as follows. First $\alpha$, the cosine similarity between the transfer gradient $v$ and the model gradient $\nabla f(x)$, is estimated through a specialized query based algorithm (Cheng et al., 2019a). Next, $\lambda$ is computed as a function of $\alpha$ and the input dimension $D$; we omit the equation and explanation in our summary for brevity. After computing $\lambda$, the gradient estimate is built iteratively over $q$ iterations in a two step process. In the first step of the iteration, $u_i$ is generated using Equation 22. In the second step the finite-difference term $\frac{f(x + \sigma u_i, y) - f(x, y)}{\sigma} u_i$ is calculated, where $i$ denotes the iteration. After the $q$ iterations are complete, the final gradient estimate $\hat{g}$ is given as the average of these terms.
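The biased sampling step that distinguishes P-RGF from plain RGF can be sketched as below, assuming a unit-norm transfer gradient `v`; the helper name and normalization conventions are our own reading of Equation 22:

```python
import numpy as np

def biased_direction(v, lam, rng):
    """Sample a query direction biased toward the unit-norm transfer
    gradient v: u = sqrt(lam)*v + sqrt(1-lam)*w, with w a unit vector
    orthogonal to v, so that u is unit-norm and u.v = sqrt(lam)."""
    xi = rng.standard_normal(v.shape)
    w = xi - (v @ xi) * v              # remove the component along v
    w /= np.linalg.norm(w)
    return np.sqrt(lam) * v + np.sqrt(1.0 - lam) * w
```

Setting `lam = 0` recovers unbiased sampling orthogonal to $v$, while `lam = 1` queries only along the transfer gradient.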

The P-RGF attack is tested on ImageNet. The surrogate model used to obtain the transfer gradient is ResNet-152. Attacks are carried out on different ImageNet CNNs, including Inception v3, VGG-16, and ResNet-50. The P-RGF attack outperforms competing techniques in terms of having a higher attack success rate and a lower number of queries for most networks.

2.6. Black-Box Adversarial Attack with Transferable Model-based Embedding

The TRansferable EMbedding based Black-box Attack (TREMBA) (Huang and Zhang, 2019) is an attack that uniquely combines transfer and query based black-box attacks. In conventional query based black-box attacks, the adversarial image is created by iteratively fine-tuning noise that is directly added to the pixels of the original image. In TREMBA, instead of directly altering the noise, the embedding space of a pre-trained model is searched; the chosen embedding is then translated into noise for the adversarial image. The advantage of this approach is that by using the pre-trained model's embedding as a search space, the number of queries needed for the attack can be reduced and the attack efficiency increased.

The attack generates the perturbation for an input $x$ using a generator network $\mathcal{G}$, which is comprised of two components, an encoder $\mathcal{E}$ and a decoder $\mathcal{D}$. The encoder maps $x$ to a latent space vector $z$, i.e., $z = \mathcal{E}(x)$. The decoder takes $z$ as input, and its output is used to compute the perturbation, which is defined as $\delta = \epsilon \tanh(\mathcal{D}(z))$. The tanh function is used to normalize the output of the decoder between $-1$ and $1$ such that the final adversarial perturbation is bounded, i.e., $\lVert \delta \rVert_\infty \le \epsilon$.
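A minimal sketch of this perturbation mapping (the function names are ours; `decoder_output` stands in for the decoder's raw output, and pixel values are assumed to lie in [0, 1]):

```python
import numpy as np

def tremba_perturbation(decoder_output, eps):
    """Squash the raw decoder output into (-1, 1) with tanh, then scale
    by eps so the perturbation satisfies ||delta||_inf <= eps."""
    return eps * np.tanh(decoder_output)

def adversarial_example(x, decoder_output, eps):
    """Add the bounded perturbation and clip back to the valid pixel range."""
    return np.clip(x + tremba_perturbation(decoder_output, eps), 0.0, 1.0)
```

The tanh squashing means the bound holds by construction, with no explicit projection step needed during the search.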

To begin the untargeted version of the attack, the generator network $\mathcal{G}$ is first trained. For an individual sample $x_i$, we denote the probability score associated with the correct class label $y_i$ during training as:

$$P_{y_i} = F_s\big(x_i + \epsilon \tanh(\mathcal{G}(x_i))\big)_{y_i},$$

where $\epsilon$ is the maximum allowed perturbation, $\mathcal{G}(x_i)$ is the output from the generator, and the subscript selects the $y_i$-th component of the output vector of the source model $F_s$. In this attack formulation the adversary is assumed to have white-box access to a pre-trained source model $F_s$, which is different from the target model under attack. The incorrect class label with the maximum probability during training is:


Using Equation 23 and Equation 24 the loss function for training the generator for an untargeted attack is given as:


where the $x_i$ are individual training samples in the training dataset and $\kappa$ is a transferability parameter (a higher $\kappa$ makes the adversarial examples more transferable to other models (Carlini and Wagner, 2017)).

Once $\mathcal{G}$ is trained, the perturbation can be calculated as a function of a point $z$ in the embedding space. The embedding is iteratively updated:

$$z_{t+1} = z_t - \eta \cdot \frac{1}{b\sigma} \sum_{k=1}^{b} L(z_t + \sigma \delta_k)\, \delta_k,$$

where $t$ is the iteration number, $\eta$ is the learning rate, $b$ is the sample size, each $\delta_k$ is a sample from the Gaussian distribution $\mathcal{N}(0, I)$, and the summation term is the gradient of the loss $L$ with respect to $z_t$, estimated using the Natural Evolution Strategy (NES) (Ilyas et al., 2018).
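The NES-based embedding update can be sketched as below, with a toy differentiable loss standing in for query feedback; the antithetic-sampling form and all names are our own illustration:

```python
import numpy as np

def nes_gradient(loss_fn, z, sigma=0.1, b=40, rng=None):
    """NES gradient estimate with antithetic Gaussian samples: for b/2
    pairs +/- d, accumulate (L(z + sigma*d) - L(z - sigma*d)) * d and
    normalize by sigma * b."""
    rng = np.random.default_rng() if rng is None else rng
    g = np.zeros_like(z)
    for _ in range(b // 2):
        d = rng.standard_normal(z.shape)
        g += (loss_fn(z + sigma * d) - loss_fn(z - sigma * d)) * d
    return g / (sigma * b)

def nes_step(loss_fn, z, eta=0.1, **kw):
    """One iterative embedding update: z_{t+1} = z_t - eta * grad estimate."""
    return z - eta * nes_gradient(loss_fn, z, **kw)
```

Because only loss *values* are needed, each update costs `b` queries to the target model rather than any gradient access.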

Experimentally, TREMBA is tested on both the MNIST and ImageNet datasets, as well as on the Google Cloud Vision API. In general, TREMBA achieves a higher attack success rate and uses fewer queries on MNIST and ImageNet than the other attack methods compared in the work, which include P-RGF, NES, and AutoZOOM.

2.7. Query-Efficient Meta Attack

In the query-efficient meta attack (Du et al., 2020), high query-efficiency is achieved through the use of meta-learning over previously observed attack patterns. This prior information is then leveraged to infer new attack patterns with a reduced number of queries. First, a meta attacker is trained to extract information from the gradients of various models for a given input, with the goal of inferring the gradient of a new target model using few queries. That is, an image $x$ is input to a set of existing models, and a max-margin logit classification loss is used to calculate losses as follows:


where $y$ is the true label, $j$ indexes the other classes, and the loss compares the probability score a model assigns to the true class against the maximum probability score it assigns to any other class.

After one step of back-propagation is performed, training groups for the universal meta attacker are assembled, consisting of input images and their corresponding model gradients. In each training iteration, samples are drawn from a task $\mathcal{T}_i$. For a meta attacker model with parameters $\theta$, the task-adapted parameters are computed with a single gradient step, $\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(\theta)$, where $\mathcal{L}_{\mathcal{T}_i}$ is the loss corresponding to task $\mathcal{T}_i$.

The meta attacker parameters $\theta$ are then optimized by incorporating the adapted losses across all tasks according to:


The training loss of this meta attacker employs mean-squared error, as given below:


where the set refers to the samples selected from each task for training.

The high-level objective of such a meta attacker model is to produce a helpful gradient map for attacking that is adaptable to the gradient distribution of the target model. To accomplish this efficiently, only a subset of the total gradient map coordinates is used to fine-tune the meta attacker every $m$ iterations (Du et al., 2020). In this manner, the meta attacker is trained to reproduce the gradient distribution of various input images and learns to predict the gradient from only a few samples through this selective fine-tuning. It is important to note that query efficiency is further reinforced by performing the typically query-intensive zeroth-order gradient estimation only every $m$ iterations.

Empirical results on MNIST, CIFAR-10, and tiny-ImageNet attain comparable attack success rates to other untargeted black-box attacks. However, the attack significantly outperforms prior attacks in terms of the number of queries required in the targeted setting (Du et al., 2020).

3. Decision based Attacks

In this section, we discuss recent developments in adversarial machine learning with respect to attacks that are decision based. The adversarial model for these attacks allows the attacker to query the defense with input and receive the defense’s final predicted output. In contrast to score based attacks, the attacker does not receive any probabilistic or logit outputs from the defense.

We cover 7 recently proposed decision based attacks. These attacks include the Geometric decision-based attack  (Rahmati et al., 2020), Hop Skip Jump Attack  (Chen et al., 2020), RayS Attack  (Chen and Gu, 2020), Nonlinear Black-Box Attack  (Li et al., 2021), Query-Efficient Boundary-Based Black-box Attack  (Li et al., 2020a), SurFree attack  (Maho et al., 2021), and the qFool attack  (Liu et al., 2019b).

3.1. Geometric Decision-based Attacks

Geometric decision-based attacks (GeoDA) are a subset of decision based black box attacks proposed in (Rahmati et al., 2020) that can achieve high attack success rates while requiring a small number of queries. The attack exploits the low mean curvature that the decision boundary of most contemporary classifiers exhibits in the proximity of a data point. In particular, the attack uses a hyperplane to approximate the decision boundary in the vicinity of a data point in order to effectively find the local normal vector of the decision boundary. The normal vector can then be used to modify the clean image in such a way that the model outputs an incorrect class label. Thus the attack solves the following optimization problem:


where $w$ is a normal vector to the decision boundary and $x_B$ is a point on the decision boundary close to the clean image $x$. $x_B$ can be found by adding random noise $\eta$ to $x$ until the classifier's predicted label changes, and then performing a binary search in the direction of $\eta$ to get as close to the decision boundary as possible:


where $\hat{k}(\cdot)$ returns the top-1 label of the target classifier. The normal vector to the decision boundary is found in the following way: image perturbations $\eta_i$ are randomly drawn from a multivariate normal distribution (Liu et al., 2019a). The model is then queried for the top-1 label of each $x_B + \eta_i$, where $x_B$ is a boundary point close to the clean image $x$. Each perturbation is then classified as follows:


From here the normal vector to the decision boundary can then be estimated as:


Finally the image can be modified using the following update:


Here $\odot$ refers to the point-wise product. This process is done iteratively: at each iteration, the previous iteration's boundary point is used to estimate a new normal vector, which is then used to perturb the original image $x$ and find the current iteration's boundary point, as seen above.
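The normal-vector estimation step can be sketched as follows, with a hypothetical hard-label `top1` oracle; perturbations that flip the label vote $+\eta_i$ and the rest vote $-\eta_i$:

```python
import numpy as np

def estimate_normal(top1, x_b, n_queries=500, sigma=0.5, rng=None):
    """GeoDA-style estimate of the local boundary normal at a boundary
    point x_b: query the top-1 label at x_b + eta_i; label-flipping
    perturbations vote +eta_i, the rest -eta_i; the normalized sum
    approximates the normal of the locally flat boundary."""
    rng = np.random.default_rng() if rng is None else rng
    base_label = top1(x_b)
    n = np.zeros_like(x_b)
    for _ in range(n_queries):
        eta = sigma * rng.standard_normal(x_b.shape)
        n += eta if top1(x_b + eta) != base_label else -eta
    return n / np.linalg.norm(n)
```

On a toy linear classifier (whose decision boundary is exactly a hyperplane), this estimator recovers the true normal direction with a few hundred queries.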

The attack is experimentally tested on the ImageNet dataset. The experiments show GeoDA outperforms the Hop Skip Jump Attack, the Boundary Attack, and qFool, producing smaller image perturbations and requiring fewer iterations, and thus fewer queries, to complete.

3.2. Hop Skip Jump Attack

The Hop Skip Jump Attack (HSJA) is a decision based, black-box attack proposed in (Chen et al., 2020) that achieves both a high attack success rate and a low number of queries. The attack improves on the previously developed Boundary Attack (Brendel et al., 2017) by employing gradient estimation techniques at the edge of a model's decision boundary in order to create adversarial inputs more efficiently. Like many other adversarial attacks, HSJA attempts to change the predicted class label of a given input $x$ while minimizing the perturbation applied to the input. Thus the following optimization problem is proposed:


Here $F_c(\cdot)$ is the predicted probability of class $c$, $x'$ is the adversarial input, $x$ is the clean input, and $d(\cdot, \cdot)$ is a distance metric. This optimization formulation allows HSJA to approximate the gradient of Equation 40 and thus solve the optimization problem more accurately and efficiently.

The attack algorithm starts by adding random noise to the clean image $x$ until the model's predicted class label changes to the desired label. Once a suitable random perturbation is found, the iterative process is initiated and the perturbed image is stored as the iterate $x_t$ for step number $t$. From here a binary search is performed to find the decision boundary between $x_t$ and $x$. At the decision boundary the following operation is used to approximate the gradient of the decision boundary:


where $\delta$ is a small, positive parameter and each $u_b$ is randomly drawn i.i.d. from the uniform distribution over the $d$-dimensional sphere. An additional baseline term, the mean of the queried indicator values, is subtracted to mitigate the bias induced into the estimation by a nonzero $\delta$. Once the gradient of the decision boundary is found, an update direction is computed using the following formulation:


Once this update direction is found, a step size must be determined. The step size is initialized proportionally to the distance between $x_t$ and $x$ and is halved until the updated point is still adversarial. Then $x_t$ is updated by stepping along the update direction, and a binary search projects the result back onto the decision boundary. This process is continued for a predetermined number of iterations.
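The boundary-gradient estimation step of HSJA can be sketched as below, where `phi` is a hypothetical query oracle returning $+1$ for adversarial inputs and $-1$ otherwise; the baseline subtraction mirrors the bias-mitigation term described above:

```python
import numpy as np

def hsja_gradient_direction(phi, x, delta=1e-2, B=200, rng=None):
    """Monte Carlo direction estimate at a boundary point x. phi(x') is
    +1 if x' is adversarial and -1 otherwise; subtracting the mean of
    the phi values acts as a baseline reducing bias from nonzero delta."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal((B,) + x.shape)
    u /= np.linalg.norm(u.reshape(B, -1), axis=1).reshape((B,) + (1,) * x.ndim)
    vals = np.array([phi(x + delta * ub) for ub in u])
    g = ((vals - vals.mean()).reshape((B,) + (1,) * x.ndim) * u).mean(axis=0)
    return g / np.linalg.norm(g)
```

On a linear toy boundary, the estimated direction aligns with the true normal, which is exactly the ascent direction HSJA needs.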

In (Chen et al., 2020) HSJA is tested on the MNIST, CIFAR-10, CIFAR-100, and ImageNet datasets. HSJA outperforms the Boundary Attack and Opt Attack in terms of median perturbation magnitude and attack success rate. HSJA is also tested against multiple defenses on the MNIST dataset, where it performs better than Boundary Attack and Opt Attack when all attacks are given an equal number of queries.

3.3. RayS Attack

The RayS attack is a query efficient, decision based, black-box attack proposed in (Chen and Gu, 2020) as an alternative to zeroth-order gradient attacks. The attack employs an efficient search algorithm to find the nearest decision boundary that requires fewer queries than other contemporary decision based attacks while maintaining a high attack success rate. Specifically, the attack formulation turns the continuous problem of finding the closest decision boundary into a discrete optimization problem:


where $x$ is the clean sample (assumed without loss of generality to be a vector), $y$ is its ground-truth label, $f(\cdot)$ is the classifier's prediction function, $d \in \{-1, 1\}^n$ is a direction vector determining the direction of the perturbation in the input space, $r$ is a scalar projected onto $d$ determining the magnitude of the perturbation, and $n$ is the dimensionality of the input. This converts the continuous problem of finding the direction to the closest decision boundary into a discrete optimization problem over $d$, which has $2^n$ possible options.

The attack algorithm finds a direction $d$ and a radius $r$ in the input space as its final output; they can then be converted into a perturbation by projecting $r$ onto $d$. The attack begins by choosing some initial direction vector and radius. The iterative process comes in multiple stages $s$, where at stage $s$ the direction vector is cut into $2^s$ equal, uniformly placed blocks. The algorithm then iterates through these blocks, swapping the sign of each value in the current block to form a candidate direction. If the candidate requires a larger perturbation than the current radius $r$ to change the classifier's predicted label, it is skipped. Otherwise the algorithm performs a binary search in the direction of the candidate to find the smallest radius for which the predicted label still changes. Finally $d$ is updated to the candidate and $r$ to the smallest radius found in the binary search.
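A simplified sketch of the RayS search over sign vectors, using a hypothetical hard-label `predict` oracle; the stage handling and initial radius are our own simplifications of the published algorithm:

```python
import numpy as np

def _binary_search_radius(predict, x, y, d, r_hi, tol=1e-3):
    """Smallest r in (0, r_hi] with predict(x + r * d/||d||) != y,
    or None if even r_hi fails to change the label."""
    d_hat = d / np.linalg.norm(d)
    if predict(x + r_hi * d_hat) == y:
        return None                       # direction too weak at r_hi
    lo, hi = 0.0, r_hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if predict(x + mid * d_hat) == y:
            lo = mid
        else:
            hi = mid
    return hi

def rays_search(predict, x, y, stages=3, r0=10.0):
    """Iterate over sign-flip blocks; keep a flipped block only if it
    admits a strictly smaller adversarial radius."""
    d, r = np.ones_like(x), r0
    for s in range(stages):
        for blk in np.array_split(np.arange(x.size), 2 ** s):
            cand = d.copy()
            cand[blk] *= -1               # flip the signs of this block
            r_new = _binary_search_radius(predict, x, y, cand, r)
            if r_new is not None and r_new < r:
                d, r = cand, r_new
    return d / np.linalg.norm(d), r
```

Each candidate costs one query to reject and a short binary search to accept, which is where the query savings over gradient estimation come from.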

The RayS attack is experimentally tested in (Chen and Gu, 2020) on the MNIST, CIFAR-10, and ImageNet datasets. It outperforms other black-box attacks like HSJA and SignOPT in terms of both the average number of queries and the attack success rate on the MNIST and CIFAR-10 datasets. On ImageNet, HSJA achieves a lower average number of queries than RayS, but attains a significantly lower attack success rate. The RayS attack is also compared to white-box attacks like Projected Gradient Descent (PGD), which it outperforms in terms of attack success rate on the MNIST and CIFAR-10 datasets.

3.4. Nonlinear Projection Based Gradient Estimation for Query Efficient Blackbox Attacks

The Nonlinear Black-box Attack (NonLinear-BA) is a query efficient, nonlinear gradient projection-based boundary blackbox attack (Li et al., 2021). This attack overcomes the gradient inaccessibility of the black-box setting by utilizing vector projection for gradient estimation. Autoencoders (AEs), variational autoencoders (VAEs), and generative adversarial networks (GANs) are used to perform efficient projection-based gradient estimation. (Li et al., 2021) shows that NonLinear-BA can outperform the corresponding linear projections of HSJA and QEBA, as NonLinear-BA provides a higher lower bound on the cosine similarity between the estimated and true gradients of the target model.
There are three components of NonLinear-BA. The first is gradient estimation at the target model's decision boundary. While high-dimensional gradient estimation is computationally expensive, requiring numerous queries (Li et al., 2021), projecting the gradient onto a lower-dimensional support greatly improves the estimation efficiency of NonLinear-BA. This desired low dimensionality is achieved through the latent space representations of generative models, e.g., AE, VAE, and GAN.

The gradient projection function $f$ is defined as a mapping from the lower-dimensional representative space to the original, high-dimensional space. Unit latent vectors are randomly sampled in the low-dimensional space and mapped through $f$ to generate the perturbation vectors.

Thus, the gradient estimator is as follows:


where the left-hand side is the estimated gradient, $x_t$ is the boundary image at iteration $t$, the difference function indicates whether the image has been successfully perturbed from the original label to the malicious label, the sign function takes the sign of this difference function, and a small perturbation size parameter controls the gradient estimation error.

The second component of NonLinear-BA is moving the boundary-image along the estimated gradient direction:


where the step size is chosen by searching with additional queries.

Finally, in order to enable the gradient estimation in the next iteration and move closer to the target image, the adversarial image is mapped back to the decision boundary through binary search. This search is aided by queries which seek to find a fitting interpolation weight:


where the target image is the original image whose correct label the attack seeks to have assigned to the crafted perturbed image.

NonLinear-BA is evaluated on offline models trained on the ImageNet, CelebA, CIFAR-10, and MNIST datasets, as well as on commercial online APIs. The nonlinear projection-based gradient estimation black-box attacks achieve better performance compared with the state-of-the-art baselines. The authors in (Li et al., 2021) find that when the gradient patterns are more complex, the NonLinear-BA-GAN variant fails to keep reducing the MSE after a relatively small number of queries and converges to a poor local optimum.

3.5. QEBA: Query-Efficient Boundary-Based Blackbox Attack

Black-box attacks can be query-free or query-based. Query-free attacks are transferability based; query access is not required, as this type of attack assumes the attacker has access to the training data such that a substitute model may be constructed. Query-based attacks can be further categorized into score-based or boundary-based attacks. In a score-based attack, the attacker can access the class probabilities of the model. In a boundary-based attack, only the final model prediction label, rather than the set of prediction confidence scores, is made accessible to the attacker. Both score-based and boundary-based attacks require a substantial number of queries.

One challenge in reducing the number of queries needed for a boundary-based attack is that it is difficult to explore the decision boundary of high-dimensional data without making many queries. The Query-Efficient Boundary-based Blackbox Attack (QEBA) seeks to reduce the queries needed by generating queries through adding perturbations to an image (Li et al., 2020a). Thus, probing the decision boundary is reduced to searching a smaller, representative subspace for each generated query. Three representative subspaces are studied by (Li et al., 2020a): the spatial transformed subspace, the low frequency subspace, and the intrinsic component subspace. An optimality analysis of gradient estimation query efficiency in these subspaces is given in (Li et al., 2020a).

QEBA performs an iterative algorithm comprised of three steps: first, estimate the gradient at the decision boundary based on the given representative subspace; second, move along the estimated gradient; and third, project back to the decision boundary, moving toward the target adversarial image. These steps follow the same mathematical details as given in Equations 45 to 47 in Section 3.4. Representative subspace optimizations from the spatial, frequency, and intrinsic component perspectives are then explored; these subspace-based gradient estimations are shown to be optimal as compared to estimation over the original space (Li et al., 2020a).

Results for the attack are provided for models trained on the ImageNet and CelebA datasets. The results show the MSE versus the number of queries, indicating that the three proposed query efficient methods outperform HSJA significantly and that QEBA substantially reduces the required number of queries. In addition, the attack yields high quality adversarial examples against both offline models (e.g., those trained on ImageNet) and online real-world APIs such as Face++ and Azure.

3.6. SurFree: a Fast Surrogate-free Blackbox Attack

Many black-box attacks rely on substitution, i.e., a surrogate model is used in place of the target model, the aim being that adversarial examples crafted against this surrogate model will transfer effectively to the target classifier. However, obtaining a gradient estimate accurate enough to build the substitute model requires a substantial number of queries.

By contrast, SurFree is a geometry-based black-box attack that does not query for a gradient estimate (Maho et al., 2021). Instead, SurFree assumes that the boundary is a hyperplane and exploits the resulting geometric properties as follows. Consider a pre-trained classifier: a given input image $x$ produces the top-1 label given by the class with the highest predicted probability. The goal of an untargeted attack is to find an adversarial image that is similar to a correctly classified image $x$ but receives a different label. Thus, an outside region of misclassified inputs is defined, and the desired, optimal adversarial image is the point of that region closest to $x$:


A key assumption of SurFree is that for a point in the outside region, a point lying on the decision boundary can be found between it and $x$. Further, it is assumed that the boundary is an affine hyperplane passing through this boundary point with normal vector $N$. Considering a random basis of search directions, the inner product between $N$ and the current direction can be iteratively increased by:

where the chosen vector spans the plane containing the current iterate and is collinear with $N$, pointing to the projection of the boundary point along the hyperplane.

Additionally, restricting perturbations to a low dimensional subspace improves the estimation of the projected gradient. The low dimensional subspace is carefully chosen to incorporate meaningful prior information about the visual content of the image. This further aids in keeping the query budget low.

It is experimentally shown that SurFree bests state-of-the-art techniques under limited query budgets (e.g., one thousand queries) while attaining competitive results in unlimited query scenarios (Maho et al., 2021). The geometric details of approximating a hyperplane around a boundary point are left to (Maho et al., 2021).

The authors present attack results on the MNIST and ImageNet datasets using the criteria of the number of queries and the resulting distortion of the attacked image. SurFree drops to lower distortions significantly faster than the other compared attacks (QEBA and GeoDA), most notably between 1 and 750 queries.

3.7. A Geometry-Inspired Decision-Based Attack

qFool is a decision-based attack that requires few queries for both non-targeted and targeted attacks  (Liu et al., 2019b). qFool relies on exploiting the locally flat decision boundary around adversarial examples. In the non-targeted attack case, the gradient direction of the decision boundary is estimated based upon the top-1 label result of each query. An adversarial example is then sought in the estimated direction from the original image. In the targeted attack case, gradient estimations are made iteratively from multiple boundary points from a starting target image. Query efficiency is further improved by seeking perturbations in low-dimensional subspace.

Prior literature (Fawzi et al., 2016) has shown that the decision boundary has only a small curvature in the vicinity of adversarial examples. This observation is exploited by (Liu et al., 2019b) to compute an adversarial perturbation. It conceptually follows that the direction of the smallest adversarial perturbation for an input sample is the gradient direction of the decision boundary at that sample. Due to the black-box nature of the attack, this gradient cannot be computed directly; however, because the boundary is relatively flat, the classifier gradient at a boundary point will be nearly identical to the gradient at other neighboring points along the boundary. Therefore, the direction of the smallest perturbation can be suitably approximated by the gradient estimated at a neighboring boundary point, and an adversarial example is sought from the input along this estimated direction.

The three components of the untargeted qFool attack are an initial point, gradient estimation, and a directional search. To begin, the original image is perturbed by small, random Gaussian noise to produce a starting point on the boundary:


Noise continues to be added until the image is misclassified. Next, the top-1 label of the classifier is used to estimate the gradient of the boundary at the starting point:


where the perturbation vectors are randomly generated with the same norm, and the sign of each contribution is determined by the top-1 label produced by querying the classifier.

For the final step of qFool, the gradient direction at the original image can be approximated by the gradient direction at the boundary starting point. The adversarial example can thus be found by perturbing the original image in this direction until the decision boundary is reached. Using binary search, this costs only a few queries to the classifier.
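The binary search onto the boundary, common to qFool and other decision-based attacks, can be sketched as follows (hypothetical hard-label `top1` oracle; the interpolation convention is our own):

```python
import numpy as np

def boundary_binary_search(top1, x0, x_adv, orig_label, tol=1e-4):
    """Binary search on the segment between the clean image x0 and a
    misclassified image x_adv, returning a point just on the adversarial
    side of the decision boundary using only hard-label queries."""
    lo, hi = 0.0, 1.0                      # interpolation weights
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if top1((1 - mid) * x0 + mid * x_adv) == orig_label:
            lo = mid                       # still clean side: move outward
        else:
            hi = mid                       # adversarial side: tighten
    return (1 - hi) * x0 + hi * x_adv
```

Each halving costs one query, so reaching a tolerance of `tol` takes only about $\log_2(1/\mathrm{tol})$ queries.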

For a targeted attack, the objective becomes perturbing the input image to be classified as a particular target class $t$. Thus, the starting point of this attack is selected to be an arbitrary image that belongs to the target class $t$. Due to the potentially large distance between this image and the input, the assumption of a flat decision boundary between the initial and targeted adversarial regions no longer holds. Instead, a linear interpolation between the two images is utilized to find a starting point:


The gradient direction estimation at this starting point follows the same method as outlined for untargeted attacks.

The qFool attack is experimentally demonstrated on the ImageNet dataset by attacking VGG-19, ResNet50 and Inception v3. The results show that qFool is able to achieve a smaller distortion in terms of MSE, as compared to the Boundary Attack when both attacks use the same number of queries. However, the overall attack success rate for qFool is not reported. The authors also test qFool on the Google Cloud Vision API.

4. Transfer Attacks

In this section, we explore recent advances in adversarial machine learning with respect to transfer attacks. The adversarial model for these attacks allows the attacker to query the target defense and/or access some of the target defense's training dataset. The attacker uses this information to create a synthetic model, which is then attacked using a white-box attack. The adversarial inputs generated from the white-box attack on the synthetic model are then transferred to the targeted defense.

We cover 3 recently proposed transfer attacks. These attacks include the Adaptive Black-Box Transfer attack (Mahmood et al., 2019), DaST attack (Zhou et al., 2020) and the Transferable Targeted attack (Li et al., 2020).

4.1. The Adaptive Black-box Attack

A new transfer based black-box attack is developed in (Mahmood et al., 2019) as an extension of the original Papernot attack proposed in (Papernot et al., 2017). Under this threat model the adversary has access to the training dataset and query access to the classifier under attack. In the original Papernot formulation of the attack, the attacker labels the training data using the target classifier to create a new training dataset. The adversary then trains a synthetic model on this dataset while iteratively augmenting it using a synthetic data generation technique. In the final step of the attack, a white-box attack generation method is used in conjunction with the trained synthetic model in order to create adversarial examples:


where the inputs are clean testing examples and the generation method is a white-box attack, e.g., FGSM (Goodfellow et al., 2014).

The enhanced version of the Papernot attack is called the mixed (Mahmood et al., 2019) or adaptive black-box attack (Mahmood et al., 2020). Whereas the original Papernot attack uses only a small fraction of the training data, the adaptive version increases the strength of the adversary by using anywhere from a portion up to all of the original training data. Beyond this, the attack generation method is varied to account for newer white-box attack generation methods that have better transferability. In general, the most effective version of the attack replaces the white-box generation method with the Momentum Iterative Method (MIM) (Dong et al., 2018). The MIM attack computes an accumulated gradient (Dong et al., 2018):

$$g_{t+1} = \mu \cdot g_t + \frac{\nabla_x L(x_t^{adv}, y)}{\lVert \nabla_x L(x_t^{adv}, y) \rVert_1},$$
where $L$ is the loss function, $\mu$ is the decay factor, and $x_t^{adv}$ is the adversarial sample at attack iteration $t$. For an $l_\infty$ bounded attack, the adversarial example at iteration $t+1$ is:

$$x_{t+1}^{adv} = x_t^{adv} + \frac{\epsilon}{T} \cdot \mathrm{sign}(g_{t+1}),$$

where $T$ represents the total number of iterations in the attack and $\epsilon$ represents the maximum allowed perturbation.
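A minimal sketch of the MIM update loop, with a toy `grad_fn` standing in for white-box gradients of the synthetic model; all names are our own:

```python
import numpy as np

def mim_attack(grad_fn, x, eps, T=10, mu=1.0):
    """MIM sketch: accumulate L1-normalized gradients with momentum mu,
    step by sign(g) with step size eps/T, clip to the eps L_inf ball."""
    alpha = eps / T
    g = np.zeros_like(x)
    x_adv = x.copy()
    for _ in range(T):
        grad = grad_fn(x_adv)              # white-box gradient of the loss
        g = mu * g + grad / np.sum(np.abs(grad))
        x_adv = np.clip(x_adv + alpha * np.sign(g), x - eps, x + eps)
    return x_adv
```

With a constant gradient the iterates march straight to the corner of the $\epsilon$-ball, which illustrates why the momentum term stabilizes the update direction across iterations.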

In (Mahmood et al., 2019), the attack is tested using the CIFAR-10 and Fashion-MNIST datasets. The adaptive black-box attack is shown to be effective against vanilla (undefended) networks, as well as a variety of adversarial machine learning defenses.

4.2. DaST: Data-free Substitute Training for Adversarial Attacks

As described for the SurFree attack in Section 3.6, substitute models can be difficult or unrealistic to obtain, particularly if a substantial amount of real data labeled by the target model is needed. DaST is a data-free substitute training method that utilizes generative adversarial networks (GANs) to train substitute models without the use of real data (Zhou et al., 2020). To address the potentially uneven distribution of GAN-produced samples, a multi-branch architecture and a label-control loss are employed for the GAN model.
To describe the necessary context for DaST, let $X$ denote samples input to the target model, let $Y$ and $Y^*$ denote the true labels and target labels of the samples $X$, respectively, and let $T$ denote the target model. Then, the objective of a targeted attack becomes:


where $x$ is a sample, $\epsilon$ is the upper bound of the perturbation, and the adversarial examples are those that lead the target model to misclassify a sample with a selected wrong label.

To further provide context for DaST, a white-box attack under these settings would have full access to the gradients of the target model and could leverage this information to generate adversarial examples. In a black-box substitute attack, a substitute model stands in for the target model, and the adversarial examples generated against the substitute are then transferred to attack the target. Thus, in the setting of a data-free black-box substitute attack, DaST utilizes a GAN to synthesize a training set for the substitute model that is as similar as possible to the training set of the target model.
To this end, the substitute training set crafted by the GAN aims to be evenly distributed across all label categories, which are produced by querying the target model. To accomplish this for $N$ categories, the generative network in (Zhou et al., 2020) is designed to contain $N$ upsampling deconvolutional components, which share a post-processing convolutional network. The generative model randomly samples a noise vector $z$ from the input space as well as a label variable $n$. $z$ then enters the $n$-th upsampling deconvolutional network and the shared convolutional network to produce the generated sample. The label-control loss is given as:


where CE is the cross-entropy.

To approximate the gradient information of the target model and to train a label-controllable generative model, the following objective function is used:


As training proceeds, the outputs of the substitute model approach the outputs of the target model for the same inputs. Thus, the substitute model replaces the target model in Equation 57:


The loss of G is then updated as: