NATTACK: Learning the Distributions of Adversarial Examples for an Improved Black-Box Attack on Deep Neural Networks

05/01/2019 ∙ by Yandong Li, et al.

Powerful adversarial attack methods are vital for understanding how to construct robust deep neural networks (DNNs) and for thoroughly testing defense techniques. In this paper, we propose a black-box adversarial attack algorithm that can defeat both vanilla DNNs and those generated by various recently developed defense techniques. Instead of searching for an "optimal" adversarial example for a benign input to a targeted DNN, our algorithm finds a probability density distribution over a small region centered around the input, such that a sample drawn from this distribution is likely an adversarial example, without needing access to the DNN's internal layers or weights. Our approach is universal, as a single algorithm can successfully attack different neural networks. It is also strong: in tests against 2 vanilla DNNs and 13 defended ones, it outperforms state-of-the-art black-box or white-box attack methods in most cases. Additionally, our results reveal that adversarial training remains one of the best defense techniques, and that adversarial examples are not as transferable across defended DNNs as they are across vanilla DNNs.




1 Introduction

This paper is concerned with the robustness of deep neural networks (DNNs). We aim to provide a strong adversarial attack method that can universally defeat a variety of DNNs and associated defense techniques. Our experiments mainly focus on attacking recently developed defense methods, following Athalye et al. (2018). Unlike their work, however, we do not need to tailor our algorithm into different forms to tackle different defenses. Hence, it may generalize better to new defense methods in the future. Progress on powerful adversarial attack algorithms will significantly facilitate research toward more robust DNNs that are deployed in uncertain or even adversarial environments.

Szegedy et al. (2013) found that DNNs are vulnerable to adversarial examples, whose changes from the benign ones are imperceptible and yet can mislead DNNs into making wrong predictions. A rich line of work furthering their finding reveals more worrisome results. Notably, adversarial examples are transferable, meaning that one can design adversarial examples for one DNN and then use them to fool others (Papernot et al., 2016a; Szegedy et al., 2013; Tramèr et al., 2017b). Moreover, an adversarial perturbation can be universal in the sense that a single perturbation pattern may convert many images to adversarial ones (Moosavi-Dezfooli et al., 2017).

The adversarial examples raise a serious security issue as DNNs become increasingly popular (Silver et al., 2016; Krizhevsky et al., 2012; Hinton et al., 2012; Li et al., 2018; Gan et al., 2017). Unfortunately, the cause of the adversarial examples remains unclear. Goodfellow et al. (2014b) conjectured that DNNs behave linearly in the high dimensional input space, amplifying small perturbations when their signs follow the DNNs’ intrinsic linear weights. Fawzi et al. (2018) experimentally studied the topology and geometry of adversarial examples and Xu et al. (2019) provide the image-level interpretability of adversarial examples. Ma et al. (2018) characterized the subspace of adversarial examples. Nonetheless, defense methods (Papernot et al., 2015; Tramèr et al., 2017a; Rozsa et al., 2016; Madry et al., 2018) motivated by them were broken in a short amount of time (He et al., 2017; Athalye et al., 2018; Xu et al., 2017; Sharma & Chen, 2017), indicating that better defense techniques are yet to be developed, and there may be unknown alternative factors that play a role in the DNNs’ sensitivity.

Powerful adversarial attack methods are key to a better understanding of adversarial examples and to the thorough testing of defense techniques.

In this paper, we propose a black-box adversarial attack algorithm that can generate adversarial examples to defeat both vanilla DNNs and those recently defended by various techniques. Given an arbitrary input to a DNN, our algorithm finds a probability density over a small region centered around the input such that a sample drawn from this density is likely an adversarial example, without needing access to the DNN’s internal layers or weights. Thus, our method falls into the realm of black-box adversarial attacks (Papernot et al., 2017; Brendel et al., 2017; Chen et al., 2017; Ilyas et al., 2018).

Our approach is strong; tested against two vanilla DNNs and 13 defended ones, it outperforms state-of-the-art black-box or white-box attack methods in most cases, and it is on par with them in the remaining cases. It is also universal, as it attacks various DNNs with a single algorithm. We hope it can effectively benchmark new defense methods in the future; our code is available. Additionally, our study reveals that adversarial training remains one of the best defenses (Madry et al., 2018), and that adversarial examples are not as transferable across defended DNNs as they are across vanilla ones. The latter somewhat weakens the practical significance of white-box methods, which otherwise could fool a black-box DNN by attacking a substitute.

Our optimization criterion is motivated by the natural evolution strategy (NES) (Wierstra et al., 2008). NES has previously been employed by Ilyas et al. (2018) to estimate the gradients in the projected gradient search for adversarial examples. However, their algorithm performs worse than the one we propose (cf. Table 1). This is probably because, in their approach, the gradients have to be estimated relatively accurately for the projected gradient method to be effective. However, some neural networks are not smooth, so the NES estimate of the gradient is not reliable enough.

In this paper, we opt for a different methodology, using a constrained NES formulation as the objective function instead of using NES to estimate gradients as in Ilyas et al. (2018). The main idea is to smooth the loss function by a probability density distribution defined over the $\ell_p$-ball centered around a benign input to the neural network. All adversarial examples of this input belong to this ball. (It is straightforward to extend our method to other constraints bounding the offsets between inputs and adversarial examples.) Under this formulation, if we can find a distribution under which the loss is small, then a sample drawn from the distribution is likely adversarial. Notably, this formulation no longer depends on estimating the gradient, so it is not impeded by the non-smoothness of DNNs.

We adopt parametric distributions in this work. The initialization of the distribution parameters plays a key role in the run time of our algorithm. In order to swiftly find a good initial distribution to start from, we train a regression neural network that takes as input the input to the target DNN and outputs the parameters of a probability density, which serves as the initialization of our main algorithm.

Our approach has several advantages over existing ones. First, we can define the distribution in a low-dimensional parameter space, while the adversarial examples are often high-dimensional. Second, instead of searching for an “optimal” adversarial example, we can draw a virtually infinite number of adversarial examples from the distribution. Finally, the distribution may speed up adversarial training for improving DNNs’ robustness, because it is more efficient to sample many adversarial examples from a distribution than to find them using gradient-based optimization.

2 Approach

Consider a DNN classifier $C(x) = \arg\max_i F(x)_i$, where $x \in [0,1]^{\dim(x)}$ is an input to the neural network $F(\cdot)$. We assume softmax is employed for the output layer of the network and let $F(\cdot)_i$ denote the $i$-th dimension of the softmax output. When this DNN correctly classifies the input, i.e., $C(x) = y$, where $y$ is the groundtruth label of the input $x$, our objective is to find an adversarial example $x_{adv}$ for $x$ such that they are imperceptibly close and yet the DNN classifier labels them distinctly; in other words, $C(x_{adv}) \neq y$. We exclude the inputs for which the DNN classifier predicts wrong labels in this work, following the convention of previous work (Carlini & Wagner, 2017).

We bound the $\ell_p$ distance between an input and its adversarial counterparts: $x_{adv} \in S_p(x) := \{x' : \|x - x'\|_p \le \tau_p\}$, $p = 2$ or $\infty$. We omit from $S_p(x)$ the argument $(x)$ and the subscript $p$ when it does not cause ambiguity. Let $\mathrm{proj}_S(x')$ denote the projection of $x' \in \mathbb{R}^{\dim(x)}$ onto $S$.

We first review the NES based black-box adversarial attack method (Ilyas et al., 2018). We show that its performance is impeded by unstable estimation of the gradients of certain DNNs, followed by our approach which does not depend at all on the gradients of the DNNs.

2.1 A Black-box Adversarial Attack by NES

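Ilyas et al. (2018) proposed to search for an optimal adversarial example in the following sense,

$$x_{adv} \leftarrow \arg\min_{x' \in S} f(x'), \qquad (1)$$

given a benign input $x$ and its label $y$ correctly predicted by the neural network $F(\cdot)$, where $S$ is a small region containing $x$ defined above, and $f(x')$ is a loss function of the network's output that is small when $x'$ is misclassified. In (Ilyas et al., 2018), this loss is minimized by the projected gradient method,

$$x_{t+1} \leftarrow \mathrm{proj}_S\big(x_t - \eta\,\mathrm{sign}(\nabla f(x_t))\big), \qquad (2)$$

where $\eta$ is a learning rate and $\mathrm{sign}(\cdot)$ is a sign function. The main challenge here is how to estimate the gradient $\nabla f(x_t)$ with derivative-free methods, as the network’s internal architecture and weights are unknown in the black-box adversarial attack. One technique for doing so is by NES (Wierstra et al., 2008):

$$\nabla f(x_t) \approx \nabla_{x_t}\, \mathbb{E}_{\mathcal{N}(z \mid x_t, \sigma^2)}\, f(z) \qquad (3)$$

$$= \mathbb{E}_{\mathcal{N}(z \mid x_t, \sigma^2)}\, f(z)\, \nabla_{x_t} \log \mathcal{N}(z \mid x_t, \sigma^2), \qquad (4)$$

where $\mathcal{N}(z \mid x_t, \sigma^2)$ is an isotropic normal distribution with mean $x_t$ and variance $\sigma^2$. Therefore, the stochastic gradient descent (SGD) version of eq. (2) becomes:

$$x_{t+1} \leftarrow \mathrm{proj}_S\Big(x_t - \eta\,\mathrm{sign}\Big(\frac{1}{b}\sum_{i=1}^{b} f(z_i)\, \nabla_{x_t} \log \mathcal{N}(z_i \mid x_t, \sigma^2)\Big)\Big),$$

where $b$ is the size of a mini-batch and $z_i$ is sampled from the normal distribution. The performance of this approach hinges on the quality of the estimated gradient. Our experiments show that its performance varies when attacking different DNNs, probably because non-smooth DNNs lead to unstable NES estimation of the gradients (cf. eq. (3)).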
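The NES gradient estimate and the signed, projected update described above can be sketched in plain Python. This is only an illustrative sketch, not the implementation of Ilyas et al. (2018): the function names `nes_gradient` and `signed_projected_step`, the use of an $\ell_\infty$ ball of radius `tau`, and the plain-list vectors are assumptions made here to keep the example dependency-free.

```python
import random

def nes_gradient(f, x, sigma, b, rng):
    """Monte Carlo NES estimate of the gradient of f at x.

    Averages f(z_i) * (z_i - x) / sigma^2 over b samples z_i = x + sigma * eps_i,
    which equals f(z_i) * eps_i / sigma.
    """
    grad = [0.0] * len(x)
    for _ in range(b):
        eps = [rng.gauss(0.0, 1.0) for _ in x]
        z = [xi + sigma * e for xi, e in zip(x, eps)]
        fz = f(z)  # the only access to the model: one black-box query
        for j, e in enumerate(eps):
            grad[j] += fz * e / (sigma * b)
    return grad

def signed_projected_step(x_t, x0, grad, eta, tau):
    """One signed gradient step followed by projection onto the
    l_inf ball of radius tau around the benign input x0 (assumed constraint)."""
    out = []
    for xt, x_orig, g in zip(x_t, x0, grad):
        sign = (g > 0) - (g < 0)
        candidate = xt - eta * sign
        out.append(min(max(candidate, x_orig - tau), x_orig + tau))
    return out
```

Because only the sign of the estimate is used, a noisy estimate can flip individual coordinates of the step, which is one way to read the instability discussed above.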

2.2 NATTACK

We propose a different formulation, albeit still motivated by NES. Given an input $x$ and a small region that contains $x$ (i.e., $S$ defined earlier), the key idea is to consider a smoothed objective as our optimization criterion:

$$\min_{\theta}\; J(\theta) := \int f(x')\, \pi_S(x' \mid \theta)\, dx', \qquad (5)$$

where $\pi_S(\cdot \mid \theta)$ is a probability density with support defined on $S$. Compared with problem (1), this formulation assumes that we can find a distribution over $S$ such that the loss $f(\cdot)$ is small in expectation. Hence, a sample drawn from this distribution is likely adversarial. Furthermore, with an appropriate $\pi_S(\cdot \mid \theta)$, the objective $J(\theta)$ is a smooth function of $\theta$, and the optimization process of this formulation does not depend on any estimation of the gradient $\nabla f(x_t)$. Therefore, it is not impeded by the non-smoothness of neural networks. Finally, the distribution over $S$ can be parameterized in a much lower dimensional space ($\dim(\theta) \ll \dim(x)$), giving rise to more efficient algorithms than eq. (2), which works directly in the high-dimensional input space.

2.2.1 The distribution on $S$

In order to define a distribution $\pi_S(x' \mid \theta)$ on $S$, we take the following transformation of variable approach:

$$x' = \mathrm{proj}_S(g(z)), \quad z \sim \mathcal{N}(z \mid \mu, \sigma^2),$$

where $\mathcal{N}(z \mid \mu, \sigma^2)$ is an isotropic normal distribution whose mean $\mu$ and variance $\sigma^2$ are to be learned, and the function $g(\cdot)$ maps a normal instance to the space of the neural network input. We leave it to future work to explore other types of distributions.

In this work, we implement the transformation of the normal variable by the following steps:

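  1. draw $z \sim \mathcal{N}(\mu, \sigma^2)$, compute $g(z)$ as $g(z) = \tfrac{1}{2}(\tanh(g_0(z)) + 1)$,

  2. clip $\delta' = \mathrm{clip}_p(g(z) - x)$, $p = 2$ or $\infty$, and

  3. return $\mathrm{proj}_S(g(z))$ as $x' = x + \delta'$.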

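Step 1 draws a “seed” $z$ and then maps it by $g_0(z)$ to the space of the same dimension as the input $x$. In our experiments, we let $z$ lie in the space of the CIFAR10 images (Krizhevsky & Hinton, 2009) (i.e., $\mathbb{R}^{32 \times 32 \times 3}$), so the function $g_0(\cdot)$ is an identity mapping for the experiments on CIFAR10 and a bilinear interpolation for the ImageNet images (Deng et al., 2009). We further transform $g_0(z)$ to the same range as the input by $g(z) = \tfrac{1}{2}(\tanh(g_0(z)) + 1) \in [0,1]^{\dim(x)}$ and then compute the offset $\delta = g(z) - x$ between the transformed vector and the input. Steps 2 and 3 detail how to project $g(z)$ onto the set $S$, where the clip functions are respectively

$$\mathrm{clip}_2(\delta) = \begin{cases} \delta\, \tau_2 / \|\delta\|_2 & \text{if } \|\delta\|_2 > \tau_2, \\ \delta & \text{otherwise,} \end{cases} \qquad \mathrm{clip}_\infty(\delta) = \min(\max(\delta, -\tau_\infty), \tau_\infty) \text{ (element-wise)},$$

with the thresholds $\tau_2$ and $\tau_\infty$ given by users.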
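Steps 1 to 3 above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: $g_0$ is taken to be the identity (the CIFAR10 case), the $\ell_\infty$ clip is assumed to be the two-sided element-wise clamp, and the helper names (`to_unit_range`, `clip_l2`, `clip_linf`, `project`) are hypothetical.

```python
import math

def to_unit_range(z):
    """Step 1 with g0 = identity: g(z) = (tanh(z) + 1) / 2, which lies in [0, 1]."""
    return [0.5 * (math.tanh(v) + 1.0) for v in z]

def clip_l2(delta, tau):
    """Rescale delta onto the l2 ball of radius tau if it lies outside."""
    norm = math.sqrt(sum(d * d for d in delta))
    if norm > tau:
        return [d * tau / norm for d in delta]
    return list(delta)

def clip_linf(delta, tau):
    """Clamp every coordinate of delta to [-tau, tau] (assumed two-sided clip)."""
    return [min(max(d, -tau), tau) for d in delta]

def project(z, x, tau, p="inf"):
    """Steps 1-3: map the seed z into the ball of radius tau around x."""
    gz = to_unit_range(z)
    delta = [gi - xi for gi, xi in zip(gz, x)]                           # offset g(z) - x
    clipped = clip_linf(delta, tau) if p == "inf" else clip_l2(delta, tau)  # step 2
    return [xi + d for xi, d in zip(x, clipped)]                         # step 3
```

Both clips only shrink the offset toward zero, so the returned point lies between $x$ and $g(z)$ coordinate-wise and therefore stays inside the valid input range whenever $x$ does.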

Thus far, we have fully specified our problem formulation (eq. (5)). Before discussing how to solve this problem, we recall that the set $S$ is the $\ell_p$-ball centered at $x$: $S = S_p(x)$. Since problem (5) is formulated for a particular input $x$ to the targeted DNN, the input also determines the distribution $\pi_S(x' \mid \theta)$ via the dependency of $S$ on $x$. In other words, we will learn personalized distributions for different inputs.

2.2.2 Optimization

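Let $\mathrm{proj}_S(g(z))$ be steps 1–3 in the above variable transformation procedure. We can rewrite the objective function in problem (5) as

$$J(\theta) = \mathbb{E}_{\mathcal{N}(z \mid \mu, \sigma^2)}\, f\big(\mathrm{proj}_S(g(z))\big),$$

where $\theta = (\mu, \sigma^2)$ are the unknowns. We use grid search to find a proper bandwidth $\sigma$ for the normal distribution and NES to find its mean $\mu$:

$$\mu_{t+1} \leftarrow \mu_t - \eta\, \nabla_{\mu} J(\theta)\big|_{\mu_t},$$

whose SGD version over a mini-batch of size $b$ is

$$\mu_{t+1} \leftarrow \mu_t - \frac{\eta}{b} \sum_{i=1}^{b} f\big(\mathrm{proj}_S(g(z_i))\big)\, \nabla_{\mu} \log \mathcal{N}(z_i \mid \mu_t, \sigma^2).$$

In practice, we sample $\epsilon_i$ from a standard normal distribution and then use a linear transformation $z_i = \mu + \epsilon_i \sigma$ to make it follow the distribution $\mathcal{N}(z \mid \mu, \sigma^2)$. With this notation, we can simplify $\nabla_{\mu} \log \mathcal{N}(z_i \mid \mu, \sigma^2) = \sigma^{-1} \epsilon_i$.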
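For readers who want the intermediate step, the simplification follows from differentiating the log-density of an isotropic Gaussian. The short derivation below is a sketch; the symbols $\eta$ (learning rate), $b$ (mini-batch size), $f$ (loss), $g$, and $\mathrm{proj}_S$ are assumed to carry the meanings used in this section.

```latex
\begin{aligned}
\log \mathcal{N}(z \mid \mu, \sigma^2 I)
  &= -\frac{\lVert z - \mu \rVert_2^2}{2\sigma^2} + \text{const}, \\
\nabla_{\mu} \log \mathcal{N}(z \mid \mu, \sigma^2 I)
  &= \frac{z - \mu}{\sigma^2}
   = \frac{\sigma\,\epsilon}{\sigma^2}
   = \sigma^{-1}\epsilon
  \qquad \text{for } z = \mu + \sigma\,\epsilon, \\
\mu_{t+1}
  &\leftarrow \mu_t - \frac{\eta}{b\,\sigma} \sum_{i=1}^{b}
     f\big(\mathrm{proj}_S(g(\mu_t + \sigma\,\epsilon_i))\big)\,\epsilon_i .
\end{aligned}
```

The last line is the mini-batch update with the gradient of the log-density substituted in; it needs only loss values at sampled points, never a gradient of the network.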

Algorithm 1 summarizes the full algorithm, called NATTACK, for optimizing our smoothed formulation in eq. (5). In line 6 of Algorithm 1, the z-score operation subtracts from each loss quantity $f_i$ the mean of the losses $f_1, \dots, f_b$ and divides the result by the standard deviation of all the loss quantities. We find that it stabilizes NATTACK; the algorithm converges well with a constant learning rate $\eta$. Otherwise, one would have to schedule more sophisticated learning rates as reported in (Ilyas et al., 2018). Regarding the loss function in line 5, we employ the C&W loss (Carlini & Wagner, 2017) in the experiments: $f(x') := \max\big(0, \log F(x')_y - \max_{c \neq y} \log F(x')_c\big)$.
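The two ingredients just described, the z-score of the sampled losses and a C&W-style loss, are simple enough to sketch directly. The exact loss form used here, $\max(0, \log F(x')_y - \max_{c \neq y} \log F(x')_c)$, is an assumption of this sketch, as are the function names and the small constants added to avoid division by zero and $\log 0$.

```python
import math

def cw_loss(probs, y):
    """Assumed C&W-style loss on softmax outputs: positive while class y is
    still the top prediction, and zero once another class overtakes it."""
    log_true = math.log(probs[y] + 1e-30)
    log_other = max(math.log(p + 1e-30) for c, p in enumerate(probs) if c != y)
    return max(0.0, log_true - log_other)

def z_score(losses, eps=1e-7):
    """Subtract the mean of the losses and divide by their standard deviation."""
    mean = sum(losses) / len(losses)
    std = math.sqrt(sum((v - mean) ** 2 for v in losses) / len(losses))
    return [(v - mean) / (std + eps) for v in losses]
```

Standardizing the losses makes the size of each update independent of the loss scale, which is consistent with the observation that a constant learning rate then suffices.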

In order to generate an adversarial example for an input $x$ to the neural network classifier $C(\cdot)$, we use the NATTACK algorithm to find a probability density distribution over $S_p(x)$ and then sample from this distribution until arriving at an adversarial instance $x'$ such that $C(x') \neq y$.

Note that our method differs from that of Ilyas et al. (2018) in that we allow an arbitrary data transformation $g(\cdot)$, which is more flexible than directly seeking the adversarial example in the input space, and we absorb the computation of $\mathrm{proj}_S(\cdot)$ into the function evaluation before the update of $\mu$ (line 7 of Algorithm 1). In contrast, the projection of Ilyas et al. (2018) comes after the computation of the estimated gradient (which is similar to line 7 in Algorithm 1) because it is an estimation of the projected gradient. The difference in the computational order of projection is conceptually important because, in our case, the projection is treated as part of the function evaluation, which is more stable than treating it as an estimation of the projected gradient. Practically, this also makes a major difference, which can be seen from our experimental comparisons of the two approaches.

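Input: DNN $F(\cdot)$, input $x$ and its label $y$, initial mean $\mu_0$, standard deviation $\sigma$, learning rate $\eta$, sample size $b$, and the maximum number of iterations $T$
Output: $\mu_T$, mean of the normal distribution

1:  for $t = 0, 1, \dots, T-1$ do
2:     Sample $\epsilon_1, \dots, \epsilon_b \sim \mathcal{N}(0, I)$
3:     Compute $g_i = g(\mu_t + \epsilon_i \sigma)$ by Step 1, $\forall i \in \{1, \dots, b\}$
4:     Obtain $\mathrm{proj}_S(g_i)$ by steps 2–3, $\forall i$
5:     Compute losses $f_i := f(\mathrm{proj}_S(g_i))$, $\forall i$
6:     Z-score $\hat{f}_i = (f_i - \mathrm{mean}(f)) / \mathrm{std}(f)$, $\forall i$
7:     Set $\mu_{t+1} \leftarrow \mu_t - \frac{\eta}{b\sigma} \sum_{i=1}^{b} \hat{f}_i\, \epsilon_i$
8:  end for
Algorithm 1: Black-box adversarial NATTACK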
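To make the loop concrete, here is a small self-contained sketch that follows the eight lines of Algorithm 1 and then samples from the learned distribution. It is a toy illustration, not the authors' implementation: the two-class classifier `toy_predict_proba`, the C&W-style loss form, the identity $g_0$, the $\ell_\infty$ constraint, and every hyperparameter value are assumptions chosen only so that the example runs quickly.

```python
import math
import random

def g(z):
    """Step 1 with g0 = identity: squash the seed z into [0, 1]."""
    return [0.5 * (math.tanh(v) + 1.0) for v in z]

def proj_linf(gz, x, tau):
    """Steps 2-3 for p = inf: clip the offset g(z) - x, then add it to x."""
    return [xi + min(max(gi - xi, -tau), tau) for gi, xi in zip(gz, x)]

def cw_loss(probs, y):
    """Assumed C&W-style loss; zero once y is no longer the top class."""
    other = max(p for c, p in enumerate(probs) if c != y)
    return max(0.0, math.log(probs[y] + 1e-30) - math.log(other + 1e-30))

def nattack(predict_proba, x, y, mu0, sigma, eta, b, T, tau, rng):
    """Learn the mean of the search distribution from black-box queries only."""
    mu = list(mu0)
    for _ in range(T):                                               # line 1
        eps = [[rng.gauss(0.0, 1.0) for _ in mu] for _ in range(b)]  # line 2
        losses = []
        for e in eps:
            z = [m + sigma * ei for m, ei in zip(mu, e)]
            x_adv = proj_linf(g(z), x, tau)                          # lines 3-4
            losses.append(cw_loss(predict_proba(x_adv), y))          # line 5
        mean = sum(losses) / b
        std = math.sqrt(sum((v - mean) ** 2 for v in losses) / b)
        scores = [(v - mean) / (std + 1e-7) for v in losses]         # line 6
        for j in range(len(mu)):                                     # line 7
            step = sum(s * e[j] for s, e in zip(scores, eps))
            mu[j] -= eta / (b * sigma) * step
    return mu

def toy_predict_proba(x):
    """Hypothetical black box: predicts class 0 when mean(x) > 0.58."""
    logit = 20.0 * (sum(x) / len(x) - 0.58)
    p0 = 1.0 / (1.0 + math.exp(-2.0 * logit))
    return [p0, 1.0 - p0]

rng = random.Random(0)
x, y, tau, sigma = [0.6] * 4, 0, 0.05, 0.1
mu = nattack(toy_predict_proba, x, y, mu0=[0.3] * 4, sigma=sigma,
             eta=0.02, b=50, T=100, tau=tau, rng=rng)

# Draw from the learned distribution and count how many samples are adversarial.
samples = [proj_linf(g([m + sigma * rng.gauss(0.0, 1.0) for m in mu]), x, tau)
           for _ in range(200)]
success_rate = sum(toy_predict_proba(s)[y] < 0.5 for s in samples) / len(samples)
```

Note that the projection happens inside the loss evaluation, before the update of the mean, which is the ordering emphasized in the text. Once every sampled loss is zero, the z-scores vanish and the mean stops moving.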

2.3 Initializing NATTACK by Regression

The initialization of the mean $\mu_0$ in Algorithm 1 plays a key role in terms of run time. When a good initialization is given, we often successfully find adversarial examples in fewer than 100 iterations. Hence, we propose to boost the NATTACK algorithm by using a regression neural network. It takes a benign example $x$ as the input and outputs $\mu_0$ to initialize NATTACK. In order to train this regressor, we generate many (input, adversarial example) pairs by running NATTACK on the training sets of benchmark datasets. The regression network’s weights are then set by minimizing the loss between the network’s output and its regression target; in other words, we regress for the offset between the adversarial example and the input in the space of the distribution parameters. The supplementary materials present more details about this regression network.

3 Experiments

We use the proposed NATTACK to attack 13 defense methods for DNNs published in 2018 or 2019 and two representative vanilla DNNs. For each defense method, we run experiments using the same protocol as reported in the original paper, including the datasets and the $\ell_p$ distance (along with the threshold) used to bound the differences between adversarial examples and inputs; this experiment protocol favors the defense method. In particular, CIFAR10 (Krizhevsky & Hinton, 2009) is employed in the attack on nine defense methods and ImageNet (Deng et al., 2009) is used for the remaining four. We examine all the test images of CIFAR10 and randomly choose 1,000 images from the test set of ImageNet. Twelve of the defenses concern the $\ell_\infty$ distance between the adversarial examples and the benign ones, and one works with the $\ell_2$ distance. We threshold the $\ell_\infty$ distance in the normalized $[0,1]^{\dim(x)}$ input space. The $\ell_2$ distance is normalized by the number of pixels.

In addition to the main comparison results, we also investigate the defense methods’ robustness versus the varying strengths of NATTACK (cf. Section 3.2). Specifically, we plot the attack success rate versus the attack iteration. The curves provide a complementary metric to the overall attack success rate, uncovering the dynamic traits of the competition between a defense and an attack.

Finally, we examine the adversarial examples’ transferability between some of the defended neural networks (cf. Section 3.3). Results show that, unlike the finding that many adversarial examples are transferable across different vanilla neural networks, a majority of the adversarial examples that fool one defended DNN cannot defeat the others. In some sense, this weakens the practical significance of white-box attack methods, which are often thought applicable to unknown DNN classifiers by attacking a substitute neural network instead (Papernot et al., 2017).

3.1 Attacking 13 Most Recent Defense Techniques

We consider 13 recently developed defenses: adversarial training (Adv-Train) (Madry et al., 2018), adversarial training of Bayesian DNNs (Adv-BNN) (Liu et al., 2019), Thermometer encoding (Therm) (Buckman et al., 2018), Therm-Adv (Athalye et al., 2018; Madry et al., 2018), Adv-GAN (Wang & Yu, 2019), local intrinsic dimensionality (LID) (Ma et al., 2018), stochastic activation pruning (SAP) (Dhillon et al., 2018), random self-ensemble (RSE) (Liu et al., 2018), cascade adversarial training (Cas-Adv) (Na et al., 2018), randomization (Xie et al., 2018), input transformation (Input-Trans) (Guo et al., 2018), pixel deflection (Prakash et al., 2018), and guided denoiser (Liao et al., 2018). We describe them in detail in the supplementary materials. Additionally, we also include Wide ResNet-32 (WResNet-32) (Zagoruyko & Komodakis, 2016) and Inception V3 (Szegedy et al., 2016), two vanilla neural networks for CIFAR10 and ImageNet, respectively.

Implementation Details. In our experiments, the defended DNNs of SAP, LID, Randomization, Input-Trans, Therm, and Therm-Adv come from (Athalye et al., 2018), the defended models of Guided denoiser and Pixel deflection are based on (Athalye & Carlini, 2018), and the models defended by RSE, Cas-Adv, Adv-Train, and Adv-GAN are respectively from the original papers. For Adv-BNN, we attack an ensemble of ten BNN models. In all our experiments, we use the same maximum number of optimization iterations $T$, sample size $b$, variance $\sigma^2$ of the isotropic Gaussian, and learning rate $\eta$. NATTACK is able to defeat most of the defenses under this setting and about 90% of the inputs in the other cases. We then fine-tune the learning rate and sample size for the hard leftovers.

Defense Technique | Dataset | Classification Accuracy % | Threshold & Distance | Attack Success Rate % (BPDA, ZOO, QL, D-based, NATTACK)
Adv-Train (Madry et al., 2018) | CIFAR10 | 87.3 | 0.031 ($\ell_\infty$) | 46.9 16.9 40.3 47.9
Adv-BNN (Liu et al., 2019) | CIFAR10 | 79.7 | 0.035 ($\ell_\infty$) | 48.3 75.3
Therm-Adv (Athalye et al., 2018) | CIFAR10 | 88.5 | 0.031 ($\ell_\infty$) | 76.1 0.0 42.3 91.2
Cas-Adv (Na et al., 2018) | CIFAR10 | 75.6 | 0.015 ($\ell_\infty$) | 85.0* 96.1 68.4 97.7
Adv-GAN (Wang & Yu, 2019) | CIFAR10 | 90.9 | 0.031 ($\ell_\infty$) | 48.9 76.4 53.7 98.3
LID (Ma et al., 2018) | CIFAR10 | 66.9 | 0.031 ($\ell_\infty$) | 95.0 92.9 95.7 100.0
Therm (Buckman et al., 2018) | CIFAR10 | 92.8 | 0.031 ($\ell_\infty$) | 100.0 0.0 96.5 100.0
SAP (Dhillon et al., 2018) | CIFAR10 | 93.3 | 0.031 ($\ell_\infty$) | 100.0 5.9 96.2 100.0
RSE (Liu et al., 2018) | CIFAR10 | 91.4 | 0.031 ($\ell_\infty$) | 100.0
Vanilla WResNet-32 (Zagoruyko & Komodakis, 2016) | CIFAR10 | 95.0 | 0.031 ($\ell_\infty$) | 100.0 99.3 96.8 100.0
Guided denoiser (Liao et al., 2018) | ImageNet | 79.1 | 0.031 ($\ell_\infty$) | 100.0 95.5
Randomization (Xie et al., 2018) | ImageNet | 77.8 | 0.031 ($\ell_\infty$) | 100.0 6.7 45.9 96.5
Input-Trans (Guo et al., 2018) | ImageNet | 77.6 | 0.05 ($\ell_2$) | 100.0 38.3 66.5 66.0 100.0
Pixel deflection (Prakash et al., 2018) | ImageNet | 69.1 | 0.015 ($\ell_\infty$) | 97.0 8.5 100.0
Vanilla Inception V3 (Szegedy et al., 2016) | ImageNet | 78.0 | 0.031 ($\ell_\infty$) | 100.0 62.1 100.0 100.0
Table 1: Adversarial attack on 13 recently published defense methods. (* denotes the number reported in (Athalye et al., 2018). We obtain all the other numbers by running the code released by the authors or our own implementation made with the help of the authors. For D-based and Adv-Train, we report the results on only 100 and 1000 images, respectively, because they incur expensive computation costs.)

3.1.1 Attack success rates

We report in Table 1 the main comparison results evaluated by the attack success rate (higher is better). Our NATTACK achieves 100% success on six out of the 13 defenses and more than 90% on five of the rest. As a single black-box adversarial algorithm, NATTACK is better than or on par with the set of powerful white-box attack methods of various forms (Athalye et al., 2018), especially on the defended DNNs. It also significantly outperforms three state-of-the-art black-box attack methods: ZOO (Chen et al., 2017), which adopts zeroth-order gradients to find adversarial examples; QL (Ilyas et al., 2018), a query-limited attack based on an evolution strategy; and a decision-based (D-based) attack method (Brendel et al., 2017), which mainly generates $\ell_2$-bounded adversarial examples.

Notably, Adv-Train is still among the best defense methods, as is its extension to Bayesian DNNs (i.e., Adv-BNN). However, along with Cas-Adv and Therm-Adv, which are also equipped with adversarial training, their strengths come at a price: they give worse classification performance than the others on clean inputs (cf. the third column of Table 1). Moreover, Adv-Train incurs an extremely high computation cost. Kurakin et al. (2016) found that, when the image resolutions are high, it is difficult to run adversarial training at the ImageNet scale. Since our NATTACK enables efficient generation of adversarial examples once we learn the distribution, we can potentially scale up adversarial training with NATTACK and will explore this in future work.

We have tuned the main free parameters of the competing methods (e.g., batch size and bandwidth in QL). ZOO runs extremely slowly with high-resolution images, so we instead use the hierarchical trick the authors described (Chen et al., 2017) for the experiments on ImageNet. In particular, we run ZOO starting from a low-resolution attack space, lift the resolution after 2,000 iterations and again after 10,000 iterations, and finally up-sample the result to the same size as the DNN input with bilinear interpolation.

3.1.2 Ablation study and run-time comparison

NATTACK vs. QL. We have discussed the conceptual differences between NATTACK and QL (Ilyas et al., 2018) in Section 2 (e.g., NATTACK formulates a smooth optimization criterion and offers a probability density on the $\ell_p$-ball of an input). Moreover, the comparison results in Table 1 verify the advantage of NATTACK over QL in terms of the overall attack strengths. Additionally, we here conduct an ablation study to investigate two major algorithmic differences between them: NATTACK absorbs the projection ($\mathrm{proj}_S$) into the objective function and allows an arbitrary change of variable transformation $g(\cdot)$. Our study concerns Therm-Adv and SAP, two defended DNNs on which QL respectively reaches 42.3% and 96.2% attack success rates. After we instead absorb the projection in QL into the objective, the results are improved to 54.7% and 97.7%, respectively. If we further apply $g(\cdot)$, the change of variable procedure (cf. Steps 1–3), the success rates become 83.3% and 98.9%, respectively. Finally, with the z-score operation (line 6 of Algorithm 1), the results are boosted to 90.9%/100%, approaching NATTACK’s 91.2%/100%. Therefore, we say that NATTACK boosts QL’s performance, thanks to both the smoothed objective and the transformation $g(\cdot)$.

NATTACK vs. the White-Box BPDA Attack. While BPDA achieves high attack success rates by using different variants to handle the diverse defense techniques, NATTACK gives rise to better or comparable results with a single universal algorithm. Additionally, we compare them in terms of run time in the supplementary materials; the main observations are the following. On CIFAR10, BPDA and NATTACK can both find an adversarial example in about 30s. To defeat an ImageNet image, it takes NATTACK about 71s without the regression network and 48s when it is equipped with the regression net; in contrast, BPDA only needs 4s. It is surprising to see that BPDA is almost 7 times faster at attacking a DNN for ImageNet than a DNN for CIFAR10. This is probably because the gradients of the former are not “obfuscated” as well as those of the latter, due to the higher resolution of the ImageNet input.

3.2 Attack Success Rate vs. Attack Iteration

Figure 1: (a) Success rate versus run steps of NATTACK. (b) Comparison results with QL measured by the log of the average number of queries per successful image. The solid lines denote NATTACK and the dashed lines illustrate QL.

The NATTACK algorithm has the following appealing property. In expectation, the loss (eq. (5)) decreases at every iteration, and hence a sample drawn from the distribution is adversarial with a higher chance. Though there could be oscillations, we find that the attack strengths do grow monotonically with respect to the evolution iterations in our experiments. Hence, we propose a new curve, shown in Figure 1a, featuring the attack success rate versus the number of evolution iterations (i.e., the strength of the attack). For the experiment here, the Gaussian mean $\mu_0$ is initialized in the same way for any input to maintain about the same starting points for all the curves.
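One way such a curve could be computed from per-image logs is sketched below. The bookkeeping is an assumption made for illustration: each image is summarized by the first iteration at which a drawn sample was adversarial (or `None` if the attack never succeeded), and the function name `success_curve` is hypothetical.

```python
def success_curve(first_success_iter, max_iter):
    """Cumulative attack success rate after each iteration 0..max_iter.

    first_success_iter: for every attacked image, the first iteration at which
    an adversarial sample was found, or None if none was found.
    """
    n = len(first_success_iter)
    counts = [0] * (max_iter + 1)
    for t in first_success_iter:
        if t is not None and t <= max_iter:
            counts[t] += 1
    curve, total = [], 0
    for t in range(max_iter + 1):
        total += counts[t]
        curve.append(total / n)
    return curve
```

Under this bookkeeping the curve is non-decreasing by construction, and its final value equals the overall attack success rate within the iteration budget.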

Figure 1a plots eight defense methods on CIFAR10 along with a vanilla DNN. It is clear that Adv-Train, Adv-BNN, Therm-Adv, and Cas-Adv, which all employ the adversarial training strategy, are more difficult to attack than the others. More interesting are the other five DNNs. Although NATTACK completely defeats them all by the end, the curve of the vanilla DNN is the steepest while the SAP curve rises much more slowly. If there are constraints on the computation time or the number of queries to the DNN classifiers, SAP is advantageous over the vanilla DNN, RSE, Therm, and LID.

Note that the ranking of the defenses in Table 1 (evaluated by the success rate) is different from the ordering on the left half of Figure 1a, indicating that the attack success rate and the curve complement each other. The curve reveals more characteristics of the defense methods, especially when there are constraints on the computation time or the number of queries to the DNN classifier.

Figure 1b shows that NATTACK (solid lines) is more query efficient than the QL attack (Ilyas et al., 2018) (dashed lines) on 6 defenses under most attack success rates, and the difference is even amplified for higher success rates. For SAP, NATTACK performs better when the desired attack success rate is sufficiently high.

3.3 Transferability

We also study the transferability of adversarial examples across different defended DNNs. This study differs from the earlier ones on vanilla DNNs (Szegedy et al., 2013; Liu et al., 2016). We investigate both the white-box attack BPDA and our black-box NATTACK.

Following the experiment setup in (Kurakin et al., 2016), we randomly select 1000 images for each targeted DNN such that they are classified correctly, and yet their adversarial images are classified incorrectly. We then use the adversarial examples of the 1000 images to attack the other DNNs. In addition to the defended DNNs, we also include two vanilla DNNs for reference: Vanilla-1 and Vanilla-2. Vanilla-1 is a light-weight DNN classifier built by Carlini & Wagner (2017) with 80% accuracy on CIFAR10. Vanilla-2 is the Wide-ResNet-28 (Zagoruyko & Komodakis, 2016), which gives rise to 92.3% classification accuracy on CIFAR10. For a fair comparison, we change the threshold to 0.031 for Cas-Adv. We exclude RSE and Cas-Adv from BPDA’s confusion table because it is not obvious how to attack RSE using BPDA and the released BPDA code lacks the piece for attacking Cas-Adv.

Figure 2: Transferabilities of BPDA (Athalye et al., 2018) (left) and NATTACK (right). Each entry shows the attack success rate of attacking the column-wise defense by the adversarial examples that are originally generated for the row-wise DNN.

The confusion tables of BPDA and NATTACK are shown in Figure 2, where each entry indicates the success rate of using the adversarial examples originally targeting the row-wise defense model to attack the column-wise defense. Both confusion tables are asymmetric; it is easier to transfer from defended models to the vanilla DNNs than vice versa. Besides, the overall transferability is lower than that across DNNs without any defenses (Liu et al., 2016). We highlight some additional observations below.
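The structure of such a confusion table can be sketched as follows. This is an illustrative helper rather than the evaluation code behind Figure 2: the name `transfer_matrix`, the dictionary-based inputs, and the toy one-dimensional threshold classifiers in the usage lines are all assumptions.

```python
def transfer_matrix(adv_examples, classifiers):
    """Success rate of attacking each column model with examples crafted for each row model.

    adv_examples[name]: list of (x_adv, true_label) pairs generated against `name`.
    classifiers[name]:  callable mapping an input to a predicted label.
    """
    matrix = {}
    for src in classifiers:
        row = {}
        for dst in classifiers:
            fooled = sum(classifiers[dst](x) != y for x, y in adv_examples[src])
            row[dst] = fooled / len(adv_examples[src])
        matrix[src] = row
    return matrix

# Toy usage with two hypothetical threshold classifiers on scalar inputs.
toy_classifiers = {"A": lambda x: int(x > 0.5), "B": lambda x: int(x > 0.3)}
toy_adv = {"A": [(0.4, 1), (0.2, 1)], "B": [(0.2, 1), (0.6, 0)]}
toy_table = transfer_matrix(toy_adv, toy_classifiers)
```

In this toy case the examples crafted against `A` fool `B` only half of the time, while those crafted against `B` always fool `A`, which illustrates how a transfer table can be asymmetric.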

Firstly, the transferability of our black-box NATTACK is not as good as that of the white-box BPDA attack. This is probably because BPDA is able to explore the intrinsically common part of the DNN classifiers: it has the privilege of accessing the true or estimated gradients that reflect the DNNs’ architectures and weights.

Secondly, both the network architecture and the defense method can influence the transferability. Vanilla-2 is the underlying classifier of SAP, Therm-Adv, and Therm. The adversarial examples originally attacking Vanilla-2 do transfer better to SAP and Therm than to the others, probably because they share the same DNN architecture, but the examples achieve a very low success rate on Therm-Adv due to the defense technique.

Finally, the transfer success rates are low whether transferring from Therm-Adv to the other defenses or vice versa, while Adv-Train and Adv-BNN lead to fairly good transfer attacks on the other defenses and yet are themselves robust against the adversarial examples of the other defended DNNs. The unique result of Therm-Adv is probably attributable to its use of two defense techniques, i.e., Thermometer encoding and adversarial training.

4 Related Work

There is a vast literature of adversarial attacks on and defenses for DNNs. We focus on the most related works in this section rather than a thorough survey.

White-Box Attacks. The adversary has full access to the target DNN in the white-box attack. Szegedy et al. (2013) first find that DNNs are fragile to the adversarial examples by using box-constrained L-BFGS. Goodfellow et al. (2014a) propose a fast gradient sign (FGS) method, which is featured by efficiency and high performance for generating the bounded adversarial examples. Papernot et al. (2016b) and Moosavi-Dezfooli et al. (2016) instead formulate the problems with the and metrics, respectively. (Carlini & Wagner, 2017) have proposed a powerful iterative optimization based attack. Similarly, a projected gradient descent has been shown strong in attacking DNNs (Madry et al., 2018). Most the white-box attacks rely on the gradients of the DNNs. When the gradients are “obfuscated” (e.g., by randomization), (Athalye et al., 2018) derive various methods to approximate the gradients, while we use a single algorithm to attack a variety of defended DNNs.

Black-Box Attacks. As the name suggests, some parts of the DNNs are treated as black boxes in the black-box attack. Thanks to the transferability of adversarial examples (Szegedy et al., 2013), Papernot et al. (2017) train a substitute DNN to imitate the target black-box DNN, produce adversarial examples for the substitute model, and then use them to attack the target DNN. Chen et al. (2017) instead use zeroth-order optimization to find adversarial examples. Ilyas et al. (2018) use the evolution strategy (Salimans et al., 2017) to approximate the gradients. Brendel et al. (2017) introduce a decision-based attack that reads the hard labels predicted by a DNN rather than the soft probabilistic output. Similarly, Cheng et al. (2019) provide a formulation that exploits the hard labels. Most of the existing black-box methods are tested against vanilla DNNs. In this work, we test them on defended ones along with our NATTACK.
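The idea of approximating gradients from queries alone, mentioned above for evolution strategies, can be illustrated with a minimal sketch. It is a generic antithetic-sampling estimator under assumed settings, not the exact procedure of any cited work; the names `nes_gradient` and `toy_loss` and the values of `sigma` and `n_pairs` are hypothetical.

```python
import random


def nes_gradient(loss, x, sigma=0.1, n_pairs=50, rng=None):
    """Estimate the gradient of `loss` at `x` using only loss evaluations."""
    rng = rng or random.Random(0)
    grad = [0.0] * len(x)
    for _ in range(n_pairs):
        # Draw a Gaussian direction and query the black box at x +/- sigma * u.
        u = [rng.gauss(0.0, 1.0) for _ in x]
        plus = loss([xi + sigma * ui for xi, ui in zip(x, u)])
        minus = loss([xi - sigma * ui for xi, ui in zip(x, u)])
        # Mirrored (antithetic) queries reduce the variance of the estimate.
        scale = (plus - minus) / (2.0 * sigma * n_pairs)
        for i, ui in enumerate(u):
            grad[i] += scale * ui
    return grad


def toy_loss(x):
    # Hypothetical stand-in for a black-box loss; its true gradient is 2 * (x - 1).
    return sum((xi - 1.0) ** 2 for xi in x)


estimate = nes_gradient(toy_loss, [0.0, 2.0, 1.0], n_pairs=2000)
```

On this toy loss the estimate should land near the true gradient `[-2, 2, 0]`. Each pair of queries costs two forward passes and no access to the network's weights or internal layers.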

5 Conclusion and Future Work

In this paper, we present a black-box adversarial attack method that learns a probability density on the $\ell_p$-ball of a clean input to the targeted neural network. One of the major advantages of our approach is that it allows an arbitrary transformation of variable, converting the adversarial attack to a space of much lower dimension than the input space. Experiments show that our algorithm defeats 13 defended DNNs, performing better than or on par with state-of-the-art white-box attack methods. Additionally, our experiments on the transferability of adversarial examples across the defended DNNs show results that differ from those reported in the literature: unlike the high transferability across vanilla DNNs, it is difficult to transfer the attacks across the defended DNNs.

Some existing works try to characterize adversarial examples by their geometric properties. In contrast to this macro view, we model the adversarial population of each single input from a micro view by a probability density. There is still a lot to explore along this avenue. What is a good family of distributions for modeling the adversarial examples? How can adversarial training be conducted by efficiently sampling from the distribution? These questions are worth further investigation in future work.


This work was supported in part by NSF-1836881, NSF-1741431, and ONR-N00014-18-1-2121.


Appendix A More Details of the 13 Defense Methods

  • Thermometer encoding (Therm). To break the hypothesized linear behavior of DNNs (Goodfellow et al., 2014a), Buckman et al. (2018) proposed to transform the input with a non-differentiable, non-linear thermometer encoding, followed by a slight change to the input layer of conventional DNNs.

  • Adv-Train & Therm-Adv. Madry et al. (2018) proposed a defense using adversarial training (Adv-Train). Specifically, the training procedure alternates between seeking an “optimal” adversarial example for each input by projected gradient descent (PGD) and minimizing the classification loss under the PGD attack. Furthermore, Athalye et al. (2018) find that the adversarially robust training (Madry et al., 2018) can significantly improve the defense strength of Therm (Therm-Adv). Unlike in Adv-Train, the adversarial examples used in training are produced by logit-space projected gradient ascent.

    Defense                            Dataset    BPDA (Athalye et al., 2018)   NATTACK   NATTACK-R
    SAP (Dhillon et al., 2018)         CIFAR-10   33.3s                         29.4s     -
    Randomization (Xie et al., 2018)   ImageNet   3.51s                         70.77s    48.22s

    Table 2: Average run time to find an adversarial example (NATTACK-R stands for NATTACK initialized with the regression net).

  • Cascade adversarial training (Cas-adv). Na et al. (2018) reduced the computation cost of the adversarial training (Goodfellow et al., 2014b; Kurakin et al., 2016) in a cascade manner. A model is trained from the clean data and one-step adversarial examples first. The second model is trained from the original data, one-step adversarial examples, as well as iterative adversarial examples generated against the first model. Additionally, a regularization is introduced to the unified embeddings of the clean and adversarial examples.

  • Adversarially trained Bayesian neural network (ADV-BNN). Liu et al. (2019) proposed to model the randomness added to DNNs in a Bayesian framework in order to defend against adversarial attacks. In addition, they incorporated adversarial training, which has been shown to be effective in previous works, into the framework.

  • Adversarial training with adversarial examples generated from GAN (ADV-GAN). Wang & Yu (2019) proposed to model the adversarial perturbation with a generative network, and they learned it jointly with the defensive DNN as a discriminator.

  • Stochastic activation pruning (SAP). Dhillon et al. (2018) randomly dropped some neurons of each layer, retaining each neuron with a probability proportional to the absolute value of its activation.

  • Randomization. Xie et al. (2018) added a randomization layer between the inputs and a DNN classifier. This layer consists of resizing an image to a random resolution, zero-padding, and randomly selecting one of the many resulting images as the actual input to the classifier.

  • Input transformation (Input-Trans). Following a similar idea, Guo et al. (2018) explored several combinations of input transformations coupled with adversarial training, such as image cropping and rescaling, bit-depth reduction, and JPEG compression.

  • Pixel deflection. Prakash et al. (2018) randomly sample a pixel from an image and then replace it with another pixel randomly sampled from the former’s neighborhood. Discrete wavelet transform is also employed to filter out adversarial perturbations to the input.

  • Guided denoiser. Liao et al. (2018) use a denoising network architecture to estimate the additive adversarial perturbation to an input.

  • Random self-ensemble (RSE). Liu et al. (2018) combine the ideas of randomness and ensemble using the same underlying neural network. Given an input, it generates an ensemble of predictions by adding distinct noises to the network multiple times.
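To make one of the randomized input-level defenses in the list above concrete, here is a minimal sketch of the pixel deflection step as described in that item. It covers only the random replacement of pixels and omits the wavelet-based filtering; the function name `pixel_deflection` and the parameters `n_deflections` and `window` are hypothetical, and the image is a plain single-channel list of rows.

```python
import random


def pixel_deflection(image, n_deflections=100, window=10, rng=None):
    """Return a copy of `image` (a list of rows) with randomly deflected pixels."""
    rng = rng or random.Random(0)
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # leave the caller's image untouched
    for _ in range(n_deflections):
        # Pick the pixel to overwrite uniformly at random.
        r, c = rng.randrange(height), rng.randrange(width)
        # Pick its replacement from a square neighborhood, clamped to the image.
        nr = min(max(r + rng.randint(-window, window), 0), height - 1)
        nc = min(max(c + rng.randint(-window, window), 0), width - 1)
        out[r][c] = out[nr][nc]
    return out


image = [[r * 8 + c for c in range(8)] for r in range(8)]
deflected = pixel_deflection(image, n_deflections=10, window=2)
```

Because the replaced pixels differ on every call unless the random generator is fixed, a defense of this kind makes the classifier's output stochastic from the attacker's point of view.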

Appendix B Architecture of the Regression Network

We construct our regression neural network by using the fully convolutional network (FCN) architecture (Shelhamer et al., 2016). In particular, we adapt the FCN model pretrained on PASCAL VOC segmentation challenge (Everingham et al., 2010) to our work by changing its last two layers, such that the network outputs an adversarial perturbation of the size . We train this network by a mean square loss.

Appendix C Run Time Comparison

Compared with the white-box attack BPDA (Athalye et al., 2018), NATTACK may take longer, since BPDA, guided by the approximate gradients, can quickly find a locally optimal solution. However, NATTACK can be executed in parallel within each episode. We leave the implementation of a parallel version of our algorithm to future work and compare its single-threaded version with BPDA below.

We attack 100 samples on one machine with four TITAN Xp graphics cards and calculate the average run time for reaching an adversarial example. As shown in Table 2, NATTACK succeeds even faster than the white-box BPDA on CIFAR-10, yet runs slower on ImageNet. The main reason is that when the images are as small as those in CIFAR-10 (3×32×32), the search space is moderate. However, the run time can be lengthy for high-resolution images like those in ImageNet (3×299×299), especially for some hard cases (we can find adversarial examples for nearly 90% of the test images, but a hard case can take about 60 minutes).
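The gap between the two search spaces can be made concrete by counting input dimensions from the image sizes quoted above. This is only a rough proxy, since the attack may search in a lower-dimensional space than the input; the variable names are illustrative.

```python
from math import prod

cifar_shape = (3, 32, 32)       # channels, height, width as quoted in the text
imagenet_shape = (3, 299, 299)

cifar_dim = prod(cifar_shape)        # 3 * 32 * 32 = 3072
imagenet_dim = prod(imagenet_shape)  # 3 * 299 * 299 = 268203
ratio = imagenet_dim / cifar_dim     # roughly 87 times more input dimensions
```

An ImageNet input therefore has roughly 87 times as many dimensions as a CIFAR-10 input, which is in line with the longer run times reported for ImageNet.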

We use a regression net to approximate a good initialization of and we name NATTACK initialized with the regression net NATTACK-R. We run NATTACK and NATTACK-R on ImageNet with the mini-batch size . The success rate is 82% for NATTACK with random initialization and 91.9% for NATTACK-R, verifying the efficacy of the regression net. The run times shown in Table 2 are calculated on the images with successful attacks. The results demonstrate that NATTACK-R reduces the attack time by about 22.5s per image (70.77s versus 48.22s) compared with the random initialization.