Adversarial attacks are a well-recognized threat to existing applications based on deep neural networks. Such an attack injects a small amount of noise into a sample (e.g., image, speech, language) yet degrades the model performance drastically [3, 11, 15]. According to the information that an adversary has about the target network, existing attacks fall into two categories: white-box attacks, which know all the parameters of the target network, and black-box attacks, which only have access to the output of the target network. Because it is sometimes difficult or even impossible to obtain full access to a network, the black-box setting is the more practical one and has attracted increasing attention.
A black-box attack has very limited or no information about the target network and is thus more challenging to perform. In the -bounded setting, a black-box attack is usually evaluated on two aspects: number of queries and success rate. In addition, recent work [10] shows that visual distortion in the adversarial examples is also an important criterion in practice. Even under a small bound, perturbing pixels in the image without considering the visual impact could make the distorted image visually unpleasant. As shown in Fig. 1, an attack [9] under a small noise level () causes relatively large visual distortion, and the perturbed image is more distinguishable from the original one. Therefore, under the assumption that the visual distortion caused by the noise is related to the spatial distribution of the perturbed pixels in a bounded attack, we take a different view from previous work and focus on explicitly learning a noise distribution based on its corresponding visual distortion.
In this paper, we propose a novel black-box attack that can directly minimize the induced visual distortion by learning the noise distribution of the adversarial example, assuming only loss-oracle access to the black-box network. The quantified visual distortion, which measures the perceptual distance between the adversarial example and the original image, is introduced in our loss where the gradient of the corresponding non-differentiable loss function is approximated by sampling noise from the learned noise distribution. The proposed attack can achieve a trade-off between visual distortion and query efficiency by introducing the weighted perceptual distance metric in addition to the original loss. Theoretically, we prove the convergence of our model under the assumption that the loss function is convex. The experiments demonstrate the effectiveness of our attack on ImageNet. Our attack results in much lower distortion than the other attacks and achieves success rate on ResNet50 and VGG16bn. In addition, it is shown that our attack is valid even when it’s only allowed to perturb pixels that are out of the target object in a given image.
2 Related Work
Although adversarial attacks pose a serious threat to existing networks, performing attacks makes it possible to evaluate the robustness of a network and helps improve that robustness by augmenting the training data with adversarial examples. Recent research on adversarial attacks has made progress in developing stronger and more computationally efficient adversaries. Since our method is a black-box attack, we briefly introduce recent attack techniques in the black-box setting.
A black-box attack treats the target network as a black box and only assumes access to its output scores. Existing black-box methods roughly fall into three categories. 1) Methods that estimate the gradient of the black box. Some methods estimate the gradient by sampling around a certain point, which formulates the task as a continuous optimization problem. Tu et al. [25] searched for perturbations in the latent space of an auto-encoder. Ilyas et al. [9] exploited prior information about the gradient. Al-Dujaili and O'Reilly [1] reduced query complexity by estimating just the sign of the gradient. NAttack [12] is similar to our method in that it also explicitly defines a noise distribution. However, the distribution in NAttack is assumed to be an isotropic normal distribution without considering visual distortion, whilst our method does not assume a specific form for the distribution and learns a noise distribution that causes less visual distortion. Other approaches developed a substitute model [15, 4, 16] to approximate the behavior of the black box. By exploiting the transferability of adversarial attacks, a white-box attack applied to the substitute model can be transferred to the black box. These approaches assume only label-oracle access to the black box, but training the substitute model requires either access to the black-box training dataset or the collection of a new dataset. 2) Methods based on discrete optimization. In [14, 1], an image is divided into regular grids, and the attack is performed and refined on each grid. Meunier et al. [13] adopted the tiling trick, adding the same noise to each small square tile in the image. 3) Methods that leverage evolutionary strategies or random search [13, 2]. In Square Attack [2], the noise value is updated using a square-shaped random search at each query. Meunier et al. [13] developed a set of attacks based on evolutionary algorithms using both continuous and discrete optimization.
Previous methods did not consider the visual impact of the induced noise, so their adversarial examples could suffer from significant visual distortion. This motivates us to account for visual quality degradation in the attack model. Under the assumption that the visual distortion caused by the noise is related to the spatial distribution of the perturbed pixels in a bounded attack, we explicitly define a noise distribution, which is learned to minimize the visual distortion.
3.1 Learning Noise Distribution Based on Visual Distortion
An attack model is an adversary that constructs adversarial examples against certain networks. Let be the target network that accepts an input and produces an output . is a vector and represents its entry, denoting the score of the class. is the predicted class. Given a valid input and the corresponding predicted class , an adversarial example is similar to yet results in an incorrect prediction . In an additive attack, an adversarial example is a perturbed input with additive noise such that , where is bounded by an ball. Although there are several choices of (), we discuss in this paper since our method defines a sample space with a fixed range for each pixel independently. For other values, please refer to Section 4.5 for further discussion. The problem of generating an adversarial example is equivalent to producing noise that causes a wrong prediction for the perturbed input. Thus a successful attack is to find such that (1) and (2) . Since constraint (1) is highly non-linear, the problem is usually rephrased in a different form :
where is the loss function, which is defined as . The attack is successful when . Note that such a loss does not take the visual impact into consideration, so the adversarial example could suffer from significant visual distortion. To constrain the visual distortion caused by the difference between and , we incorporate a perceptual distance metric into the loss function with a predefined hyperparameter:
where smaller indicates less visual distortion. can be any form of metric that measures the perceptual distance between and , such as well established  or LPIPS . manages the trade-off between a successful attack and the visual distortion caused by the attack. The effects of will be further discussed in Section 4.1.
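A minimal sketch of the combined objective described above: a classification loss plus a weighted perceptual distance. The margin form of the classification loss, the names `margin_loss`, `total_loss` and `lam`, and the mean-absolute-difference stand-in for the perceptual metric are all assumptions made for illustration; in the paper the metric would be an SSIM- or LPIPS-based distance.

```python
def margin_loss(scores, true_class):
    """Assumed margin-style loss: positive while the true class still has the top score."""
    best_other = max(s for k, s in enumerate(scores) if k != true_class)
    return scores[true_class] - best_other


def mean_abs_distance(x, x_adv):
    """Placeholder perceptual distance in [0, 1] for images scaled to [0, 1]."""
    return sum(abs(a - b) for a, b in zip(x, x_adv)) / len(x)


def total_loss(scores, true_class, x, x_adv, lam, distance=mean_abs_distance):
    """Classification loss plus lam times the perceptual distance."""
    return margin_loss(scores, true_class) + lam * distance(x, x_adv)


# Hypothetical scores for a three-class network and a two-pixel "image".
loss = total_loss([0.7, 0.2, 0.1], 0, [0.5, 0.5], [0.6, 0.4], lam=10.0)
```

A larger `lam` makes the distance term dominate, which is the trade-off between attack success and visual distortion that the weight is meant to control.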
Minimizing the above loss function faces the challenge that is not differentiable, since the black-box adversary does not have access to the gradients of and the predefined might be calculated in a non-differentiable way. To address this problem, we explicitly assume a noise distribution of and approximate the gradient of by sampling from the distribution. Suppose that follows a distribution parameterized by , i.e., . For the pixel in an image, we make its noise distribution , where is the component of . The noise value of the pixel is sampled by following . By sampling noise from the distribution, can be learned to minimize the expectation of the above loss such that the attack is successful (i.e., alters the predicted label) and the produced adversarial example is less distorted (i.e., small ). The expectation is minimized by sampling from for each pixel:
To ensure the constraint is satisfied, we define the sample space of noise for the pixel to be a set of discrete values in the range of and : , where is the sampling frequency and is the sampling interval. The noise value of the pixel is sampled from this sample space by following .
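The discrete sample space and per-pixel sampling can be sketched as follows. This is an illustration under stated assumptions: the space is taken to contain the `2 * n + 1` evenly spaced values from `-eps` to `+eps` with interval `eps / n`, the per-pixel distribution is taken to be a softmax over that pixel's parameters, and the names `noise_sample_space` and `sample_pixel_noise` as well as the values `eps=0.05` and `n=2` are hypothetical.

```python
import math
import random


def noise_sample_space(eps, n):
    """Discrete noise values from -eps to +eps with interval eps / n (assumed form)."""
    return [k * eps / n for k in range(-n, n + 1)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_pixel_noise(theta_pixel, space, rng):
    """Draw one noise value for a pixel from softmax(theta_pixel)."""
    probs = softmax(theta_pixel)
    index = rng.choices(range(len(space)), weights=probs, k=1)[0]
    return index, space[index]


rng = random.Random(0)
space = noise_sample_space(eps=0.05, n=2)
theta_pixel = [0.0] * len(space)  # all-zero parameters give a uniform distribution
index, value = sample_pixel_noise(theta_pixel, space, rng)
```

Because every candidate value lies inside the bound by construction, any sampled noise satisfies the norm constraint without a separate projection step.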
Given and the width and height of an image, respectively, since each pixel has its own noise distribution of length , the length of for the entire image is . Note that we do not consider the difference of color channels. Thus, the same noise value is sampled for each color channel of a pixel. To estimate , we adopt policy gradient  to make the above expectation differentiable with respect to . Using REINFORCE, we have the differentiable loss function :
where is introduced as a baseline in the expectation with a specific meaning: 1) when , the sampled returns a low , and its probability increases through gradient descent; 2) when , and remains unchanged; 3) when , the sampled returns a high , and its probability decreases through gradient descent. To sum up, is forced to improve over . At the iteration , we choose such that improves over the obtained minimal loss.
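The three baseline cases above can be checked with a small REINFORCE update for a single pixel. This is a sketch rather than the authors' implementation: it assumes a softmax distribution over the pixel's noise values, uses the standard identity that the gradient of the log-probability of the sampled index with respect to the parameters is the one-hot vector minus the probabilities, and the names `reinforce_step`, `baseline` and `lr` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(theta_pixel, sampled_index, loss, baseline, lr):
    """One gradient-descent step on (loss - baseline) * log p(sampled_index)."""
    probs = softmax(theta_pixel)
    advantage = loss - baseline
    new_theta = []
    for k, (t, p) in enumerate(zip(theta_pixel, probs)):
        grad_log_prob = (1.0 if k == sampled_index else 0.0) - p
        new_theta.append(t - lr * advantage * grad_log_prob)
    return new_theta


theta = [0.0, 0.0, 0.0]
lower = reinforce_step(theta, 1, loss=0.2, baseline=0.5, lr=1.0)   # loss below baseline
same = reinforce_step(theta, 1, loss=0.5, baseline=0.5, lr=1.0)    # loss equals baseline
higher = reinforce_step(theta, 1, loss=0.8, baseline=0.5, lr=1.0)  # loss above baseline
```

With these inputs the probability of the sampled value rises above one third in the first case, is unchanged in the second, and falls below one third in the third.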
The above expectation is estimated using a single Monte Carlo sample at each iteration, so the sampling of noise is critical. Simply sampling at the iteration on the entire image might cause large variance in the norm of the noise, i.e., . Therefore, to ensure a small variance, with , only a small proportion of the noise is randomly resampled from iteration while the rest remains unchanged. Let be the proportion of the resampled noise at each iteration; the updated at an iteration is
where denotes randomly sampling proportion of the noise from . As shown in Fig. 2, at the iteration , proportion of noise is resampled by following the corresponding distribution . Then, the feedback from the black-box and the perceptual distance metric decide the update of the distribution . The iteration stops when the attack is successful, i.e., .
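The partial resampling step can be sketched as below. For simplicity this illustration picks the resampled pixels uniformly at random over a flattened image, whereas the experiments use a square-shaped region; the function name `resample_fraction` and its arguments are hypothetical.

```python
import random


def resample_fraction(indices, probs_per_pixel, fraction, rng):
    """Resample the noise index of a random `fraction` of pixels; keep the rest."""
    num_pixels = len(indices)
    num_resampled = max(1, round(fraction * num_pixels))
    chosen = rng.sample(range(num_pixels), num_resampled)
    updated = list(indices)
    for j in chosen:
        weights = probs_per_pixel[j]
        updated[j] = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return updated, chosen


rng = random.Random(0)
indices = [1] * 100                      # current noise index of each pixel
probs = [[1.0, 0.0, 0.0]] * 100          # degenerate distribution: always index 0
updated, chosen = resample_fraction(indices, probs, fraction=0.1, rng=rng)
```

Only the chosen pixels change between consecutive iterations, which keeps the change in the noise small from one query to the next.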
3.2 Proof of Convergence
Ruan et al. [17] show that feed-forward DNNs (deep neural networks) are Lipschitz continuous with a Lipschitz constant . Therefore, we have
Let and , where , we have
At an iteration , since only a small proportion of the noise is randomly resampled from iteration , it can be assumed that
where is a constant. Note that the learning stops when the attack is successful, i.e., . Therefore, until the learning stops. Suppose that the perceptual distance metric is normalized to . Substituting the inequalities (8) and (9) into our definition of in Eq. (2) yields the following inequality:
Note that is bounded by . Given width , height , channel of the image, and the resampled proportion of the noise from iteration , we have
Thus, the inequality (10) becomes
Ideally, accurately quantifies the difference of the perturbed image even when only one noise value for just a single pixel at the iteration is sampled differently from that at . Let represent the perturbed image with the noise value of the pixel being sampled. Note that is a vector of length , denoting that there are noise values that could be sampled for each pixel. Similarly, denotes the probability of the noise value of the pixel being sampled. By sampling every noise value for the pixel, we define and to be a vector:
Although the above equations are only meaningful under the ideal situation where can quantify the difference of just one perturbed pixel, we use these equations for a theoretical proof of convergence. In the ideal situation, instead of using a single Monte Carlo sampling to estimate as in Eq. (5), the component of can be calculated exactly as
where is the component of . According to Eq. (12) when the number of the resampled pixels =1, we have
Note that for that share the same , is equal to . Thus, substituting the inequality (18) into Eq. (17) yields
In practice, we adopt a single Monte Carlo sample instead of sampling every noise value for every pixel, so should be replaced by in the above inequality. The inequality (17) thus becomes:
Since the standard softmax function is Lipschitz continuous with a Lipschitz constant of 1 [5], we have
Finally, the inequality for becomes
The above inequality proves that is -smooth with the Lipschitz constant being . Assuming that is convex, according to the convergence theorem for gradient descent , it follows that
where is the optimal solution. When is large enough, approximates up to a small enough epsilon and the learning converges.
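The convergence theorem invoked here is the standard result for gradient descent on a convex function whose gradient is Lipschitz continuous: with a fixed step size no larger than the reciprocal of the Lipschitz constant, the gap to the optimum after k steps is at most the squared initial distance to the optimum divided by twice the step size times k. The sketch below checks that bound numerically on a simple convex quadratic; the test function and all names are assumptions chosen for illustration and are unrelated to the attack loss itself.

```python
def gradient_descent_gaps(curvatures, x0, steps):
    """Run gradient descent on f(x) = 0.5 * sum(a_i * x_i**2), whose minimum value is 0."""
    smoothness = max(curvatures)          # Lipschitz constant of the gradient
    step = 1.0 / smoothness
    x = list(x0)
    dist0_sq = sum(v * v for v in x0)     # squared distance from x0 to the minimizer 0
    rows = []
    for k in range(1, steps + 1):
        x = [v - step * a * v for a, v in zip(curvatures, x)]
        gap = 0.5 * sum(a * v * v for a, v in zip(curvatures, x))
        bound = dist0_sq / (2.0 * step * k)
        rows.append((k, gap, bound))
    return rows


rows = gradient_descent_gaps(curvatures=[1.0, 4.0], x0=[1.0, 1.0], steps=50)
```

Each row holds the iteration count, the observed gap to the optimum, and the theoretical bound, so the 1/k decay of the bound can be compared with the actual progress.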
dataset. We use three pretrained classification networks from PyTorch as the black-box networks: InceptionV3, ResNet50 [7] and VGG16bn [19]. The attack is performed on images that were correctly classified by the pretrained network. We randomly select images in the validation set for testing, and all images are normalized to . We quantify our success in terms of the perceptual distance ( and LPIPS) since we address the visual distortion caused by the attack. Of these two metrics, measures the degradation of structural information in the adversarial examples. A smaller indicates a closer perceptual distance. LPIPS [26] evaluates the perceptual similarity of two images as the normalized distance between their deep features. A smaller LPIPS value denotes less visual distortion. In addition to and LPIPS, the success rate and average number of queries are also reported, as in most frameworks. The average number of queries refers to the average number of requests to the output of the black-box network.
We initialize the noise distribution to be a uniform distribution and the noise to be . The learning rate is and is set to be . In addition, we specify the shape of the resampled noise at each iteration to be a square [13, 14, 2], and adopt the tiling trick [9, 13] with tile size. The upper bound of our attack is set to be as in previous work.
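The tiling trick mentioned above assigns one noise value to each square tile and repeats it over every pixel of that tile, which shrinks the number of values that have to be learned. A minimal sketch, with the hypothetical helper name `expand_tiles` and a nested-list image representation chosen only for illustration:

```python
def expand_tiles(tile_noise, tile_size):
    """Repeat each tile value over a tile_size x tile_size block of pixels."""
    image_noise = []
    for tile_row in tile_noise:
        pixel_row = [v for v in tile_row for _ in range(tile_size)]
        image_noise.extend(list(pixel_row) for _ in range(tile_size))
    return image_noise


# A 2 x 2 grid of tile values expanded to a 4 x 4 noise map.
noise_map = expand_tiles([[1, 2], [3, 4]], tile_size=2)
```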
4.1 Ablation Studies
In the ablation studies, the maximum number of queries is set to be . The results are averaged on test images. In the following, we discuss the trade-off between visual distortion and query efficiency, the effects of using different perceptual distance metrics in the loss function and the results on different sampling frequencies.
Trade-off between visual distortion and query efficiency.
Under the same ball, a query-efficient way to produce an adversarial example is to perturb most pixels with the maximum noise values [14, 2]. However, such an attack introduces large visual distortion, which could make the distorted image visually unpleasant. To constrain the visual distortion, the perturbed pixels should be those that cause a smaller visual difference while still yielding a valid attack, and finding them takes extra queries. This creates a trade-off between visual distortion and query efficiency. Different from previous work, this trade-off can be controlled by in our loss function. As shown in Table 1, when and , the adversary does not consider visual distortion at all, and perturbs each pixel that is helpful for misclassification until the attack is successful. Thus, it causes the largest perceptual distance ( and ) with the fewest queries (). As increases to , both and LPIPS decrease at the cost of more queries and a lower success rate. The maximum in Table 1 is since further increasing it causes the success rate to be lower than . Fig. 3 gives several visualized examples for different , where adversarial examples with larger suffer from less visual distortion.
Ablation studies on the perceptual distance metric.
The perceptual distance metric in the loss function is predefined to measure the visual distortion between the adversarial example and the original image. We adopt and LPIPS in turn as the perceptual distance metric to optimize, and report their results in Table 1. When , optimizing shows a better score on ( vs. ) whilst optimizing LPIPS performs better on LPIPS ( vs. ). However, when increases to and , optimizing gives better scores on both and LPIPS. Therefore, we set the perceptual distance metric to be in the following experiments.
Sampling frequency decides the size of the sample space of . Setting higher frequency means there are more noise values to explore through sampling. In Table 1, increasing the sampling frequency from to reduces the perceptual distance to some extent at the cost of lower success rate. On the other hand, further increasing to does not essentially reduce the distortion yet lowers the success rate. To ensure a high success rate of attack, we set the sampling frequency in the following experiments. Note that the maximum sampling frequency is because the sampling interval in RGB color space (i.e., ) would be less than if . See Fig. 4 for a few adversarial examples.
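The limit on the sampling frequency follows from simple arithmetic: once the sampling interval, expressed in 8-bit RGB levels, drops below one level, neighbouring noise values can no longer be told apart after quantization. The sketch below assumes the interval equals the bound divided by the sampling frequency and that images use 255 levels; the value `eps=0.05` is chosen only for illustration, and the function names are hypothetical.

```python
import math


def rgb_interval(eps, n, levels=255):
    """Sampling interval in 8-bit RGB levels for bound eps and sampling frequency n."""
    return eps * levels / n


def max_sampling_frequency(eps, levels=255):
    """Largest n for which the interval is still at least one RGB level."""
    return math.floor(eps * levels)


n_max = max_sampling_frequency(eps=0.05)
interval_at_max = rgb_interval(0.05, n_max)
interval_above_max = rgb_interval(0.05, n_max + 1)
```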
4.2 Out-of-Object Attack
are based on CNN (Convolutional Neural Network), which gradually aggregates contextual information in deeper layers. Therefore, it is possible to fool the classifier by attacking just the “context”, i.e., the background outside the target object. Attacking just the out-of-object pixels constrains the number and the position of pixels that can be perturbed, which might further reduce the visual distortion caused by the noise. To locate the object in a given image, we exploit the object bounding box provided by ImageNet. An out-of-object mask is then created according to the bounding box such that the model is only allowed to attack pixels that are outside the object, as shown in Fig. 5. In Table 2, we report results of InceptionV3, ResNet50 and VGG16bn with the maximum queries. The attack is performed on images whose masks are at least large of the image area. The results show that attacking just the out-of-object pixels can also cause misclassification of the object with over success rate. Compared with the image attack, the out-of-object attack is more difficult for the adversary in that it requires more queries () yet has a lower success rate (). On the other hand, the out-of-object attack indeed reduces the visual distortion of the adversarial examples on the three networks.
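The out-of-object mask can be sketched as follows. The bounding-box layout `(top, left, bottom, right)` with exclusive bottom and right edges, the nested-list image representation, and the function names are assumptions made for illustration; real ImageNet annotations would first need to be converted to this layout.

```python
def out_of_object_mask(height, width, box):
    """Return a 0/1 mask that is 1 outside the box given as (top, left, bottom, right)."""
    top, left, bottom, right = box
    return [[0 if top <= r < bottom and left <= c < right else 1
             for c in range(width)] for r in range(height)]


def mask_area_ratio(mask):
    """Fraction of the image area that the attack is allowed to perturb."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


def apply_mask(noise, mask):
    """Zero the noise inside the object so only background pixels are perturbed."""
    return [[n * m for n, m in zip(noise_row, mask_row)]
            for noise_row, mask_row in zip(noise, mask)]


mask = out_of_object_mask(4, 4, box=(1, 1, 3, 3))
masked_noise = apply_mask([[0.05] * 4 for _ in range(4)], mask)
ratio = mask_area_ratio(mask)
```

The area ratio is the quantity that would be thresholded when selecting images whose background is large enough to attack.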
4.3 Attack Effectiveness on Defended Network
|Network||Clean Accuracy||After Attack||LPIPS||Avg. Queries|
|Square Attack ||99.7%||100%||100%||0.280||0.279||0.299||0.265||0.243||0.247||237||62||30|
In the above experiments, we show that our black-box model can attack the undefended network with a high success rate. To evaluate the strength of the proposed attack in a defended setting, we further attack the InceptionV3 network that adopts ensemble adversarial training (i.e., v). Following , we set and randomly select images from the ImageNet validation set for testing. The maximum number of queries is . The performance of the attacked network is reported in Table 3, where clean accuracy is the classification accuracy before the attack. Note that v is slightly different from InceptionV3 in Table 1 in that the pretrained model of v comes from TensorFlow, which is the same platform as the pretrained model of v. Compared with the undefended network, attacking the defended one causes larger visual distortion. However, the proposed attack can still reduce the classification accuracy from to , which demonstrates its effectiveness against a defended network.
4.4 Comparison with Other Attacks
Different from previous work, which focuses on query efficiency, our model aims to improve the visual similarity between the adversarial example and the original image. Therefore, the proposed method might require more queries to construct a less distorted adversarial example. To show that such costs are affordable, we compare our attack to recently proposed query-efficient black-box attacks: SignHunter, NAttack , Bandits  and Square Attack . Since these attacks do not consider visual distortion, for a fair comparison, we add to their objective functions with as in our method; the resulting variants are denoted by -SSIM in Table 4.
|Distance Metric||Sampling Frequency||Success Rate||LPIPS||Avg. Queries|
The results of the above methods are reproduced using the official code provided by the authors. In NAttack, we set the sample size to be since the original large sample size in the paper is computationally expensive. The maximum number of queries is as in previous work. In our model, considering the trade-off between visual distortion and query efficiency, we set and the perceptual distance metric to be . In Table 4, the proposed attack reduces and LPIPS approximately by half while maintaining a high success rate () within a limited number of iterations. Except for SignHunter, introducing in the objective function helps reduce visual distortion in the other attacks. However, our method still outperforms these attacks since the perceptual distance metric is directly minimized in our method. In addition, the number of queries of our attack is comparable to that of Bandits. Note that the success rate decreases sharply in Bandits-SSIM compared with Bandits. This is because the Bandits attack uses the estimated gradient of the black-box classifier as its prior, whereas simply adding to the loss makes that gradient estimate inaccurate. The visualized adversarial examples from different attacks are given in Fig. 6, which shows that our model produces less distorted adversarial examples. More examples can be found in Fig. 7.
We noticed that SignHunter produces adversarial examples with horizontally striped noise and Square Attack generates adversarial examples with vertically striped noise. Striped noise is helpful in improving query efficiency since the classification network is quite sensitive to such noise . However, from the perspective of visual distortion, such noise greatly degrades the image quality. The adversarial examples of Bandits are relatively friendly to human perception, but the perturbation affects most pixels in the image, which causes a visually “noisy” effect, especially on a single-color background. The noise produced by NAttack appears as regular color patches all over the image due to the large tile size used in the method.
4.5 Other Attacks
Although our method in this paper is based on attack, other () distance can be regarded as the perceptual distance metric in the loss function, which is minimized with a trade-off parameter . We did not discuss it in the experiments because these distance metrics are less accurate in measuring the perceptual distance between images than specifically designed metrics, such as the well-established and LPIPS. In Table 5, the results of other () attacks are shown, where the distance is normalized to as the perceptual distance metric in the loss function. Specifically, , where is the distance between the original image and the perturbed image . As in the paper, we set and the maximum number of queries to be . We find that the raw and scores are orders of magnitude larger than the other metrics, and thus the normalized scores of the and distances are reported in Table 5. Note that when the sampling frequency , distance is equivalent to distance in that
where is the number of perturbed pixels. and are the width, height and number of channels of a given image, respectively. Table 5 shows that optimizing distance gives better performance on both the perceptual distance metrics and the distance metrics.
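The stated equivalence can be illustrated numerically under explicit assumptions: with a sampling frequency of 1, each pixel's noise is taken to be one of `-eps`, `0` or `+eps`, and the same value is shared by all color channels. Every perturbed pixel then contributes the same amount to the squared Euclidean norm, so that norm is proportional to the count of perturbed pixels. The function name and the sample values are hypothetical.

```python
def l0_and_l2_squared(pixel_noise, channels):
    """Count perturbed pixels and compute the squared L2 norm over all channels."""
    l0 = sum(1 for v in pixel_noise if v != 0.0)
    l2_squared = channels * sum(v * v for v in pixel_noise)
    return l0, l2_squared


eps = 0.05
pixel_noise = [eps, 0.0, -eps, eps, 0.0, 0.0]  # one value per pixel, shared by channels
l0, l2_squared = l0_and_l2_squared(pixel_noise, channels=3)
```

Here `l2_squared` equals `channels * eps**2 * l0`, so minimizing one distance also minimizes the other in this special case.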
We introduce a novel black-box attack based on the induced visual distortion in the adversarial example. The quantified visual distortion, which measures the perceptual distance between the adversarial example and the original image, is introduced in our loss where the gradient of the corresponding non-differentiable loss function is approximated by sampling from a learned noise distribution. The proposed attack can achieve a trade-off between visual distortion and query efficiency by introducing the weighted perceptual distance metric in addition to the original loss. The experiments demonstrate the effectiveness of our attack on ImageNet as our model achieves much lower distortion when compared to existing attacks. In addition, it is shown that our attack is valid even when it’s only allowed to perturb pixels that are out of the target object in a given image.
[1] (2020) Sign bits are all you need for black-box attacks. In Proc. International Conference on Learning Representations.
[2] (2019) Square attack: a query-efficient black-box adversarial attack via random search. arXiv preprint arXiv:1912.00049.
[3] (2017) Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy.
[4] (2019) Improving black-box adversarial attacks with a transfer-based prior. In Proc. International Conference on Neural Information Processing Systems.
[5] (2017) arXiv preprint arXiv:1704.00805.
[6] (2015) Explaining and harnessing adversarial examples. In Proc. International Conference on Learning Representations.
[7] (2016) Deep residual learning for image recognition.
[8] (2018) Squeeze-and-excitation networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition.
[9] (2019) Prior convictions: black-box adversarial attacks with bandits and priors. In Proc. International Conference on Learning Representations.
[10] (2019) Quantifying perceptual distortion of adversarial examples. arXiv preprint arXiv:1902.08265.
[11] (2017) Adversarial machine learning at scale. In Proc. International Conference on Learning Representations.
[12] (2019) NATTACK: learning the distributions of adversarial examples for an improved black-box attack on deep neural networks. In Proc. International Conference on Machine Learning.
[13] (2019) Yet another but more efficient black-box adversarial attack: tiling and evolution strategies. arXiv preprint arXiv:1910.02244.
[14] Parsimonious black-box adversarial attacks via efficient combinatorial optimization. In Proc. International Conference on Machine Learning.
[15] (2017) Practical black-box attacks against machine learning. In Proc. ACM on Asia Conference on Computer and Communications Security.
[16] (2016) Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277.
[17] Reachability analysis of deep neural networks with provable guarantees. In Proc. International Joint Conference on Artificial Intelligence.
[18] (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
[19] (2015) Very deep convolutional networks for large-scale image recognition. In Proc. International Conference on Learning Representations.
[20] (1998) Reinforcement learning: an introduction. MIT Press, Cambridge.
[21] (2016) Rethinking the inception architecture for computer vision. In Proc. IEEE Conference on Computer Vision and Pattern Recognition.
[22] (2014) Intriguing properties of neural networks. In Proc. International Conference on Learning Representations.
[23] (2013) Gradient descent: convergence analysis. https://www.stat.cmu.edu/~ryantibs/convexopt-F13/scribes/lec6.pdf
[24] (2018) Ensemble adversarial training: attacks and defenses. In Proc. International Conference on Learning Representations.
[25] AutoZOOM: autoencoder-based zeroth order optimization method for attacking black-box neural networks. In Proc. AAAI Conference on Artificial Intelligence.
[26] (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Proc. IEEE Conference on Computer Vision and Pattern Recognition.
[27] (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612.